
Commit

Update README.md
ahatamiz authored Oct 25, 2023
1 parent 54975f3 commit 2d111d0
Showing 1 changed file with 3 additions and 3 deletions.
README.md: 6 changes (3 additions, 3 deletions)
@@ -6,7 +6,7 @@
[Jan Kautz](https://jankautz.com/),
[Arash Vahdat](https://research.nvidia.com/person/arash-vahdat).

- Diffusion models, with their powerful expressivity and high sample quality, have enabled many new applications and use cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel window-based time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images. DiffiT achieves state-of-the-art (SOTA) performance in terms of FID score on a variety of class-conditional and unconditional synthesis tasks in both latent- and image-space experiments.
+ Diffusion models, with their powerful expressivity and high sample quality, have enabled many new applications and use cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images. DiffiT achieves state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of **1.73** on the **ImageNet-256** dataset. The code and pretrained model will be publicly available.

![teaser](./assets/teaser.png)
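
To make the time-dependent self-attention described in the abstract concrete, below is a minimal PyTorch sketch of the idea: the denoising-step embedding contributes to the query/key/value projections, so the attention pattern can adapt as denoising progresses. The class name, projection scheme, and tensor shapes here are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    """Sketch of time-dependent self-attention (hypothetical, simplified):
    q, k, v mix spatial tokens with a per-sample time embedding."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv_x = nn.Linear(dim, dim * 3, bias=False)  # spatial projection
        self.qkv_t = nn.Linear(dim, dim * 3, bias=False)  # time projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) spatial tokens; t_emb: (B, C) denoising-step embedding
        B, N, C = x.shape
        # The time projection is broadcast over tokens, so the resulting
        # queries/keys/values (and hence the attention map) depend on t.
        qkv = self.qkv_x(x) + self.qkv_t(t_emb).unsqueeze(1)
        qkv = qkv.reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, `TimeDependentSelfAttention(dim=128)(torch.randn(2, 64, 128), torch.randn(2, 128))` returns a `(2, 64, 128)` tensor; in a DiffiT-style design, one such layer would sit inside each transformer block of the U-shaped encoder and decoder.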

@@ -16,10 +16,10 @@

| Model| Dataset | Resolution | FID-50K | Inception Score |
|---------|----------|-----------|---------|--------|
- |DiffiT | ImageNet | 256x256 | 1.73 | 276.49|
+ |**DiffiT** | ImageNet | 256x256 | **1.73** | **276.49**|

## Performance on ImageNet-512

| Model| Dataset | Resolution | FID-50K | Inception Score |
|---------|----------|-----------|---------|--------|
- |DiffiT | ImageNet | 512x512 | 2.67 | 252.12|
+ |**DiffiT** | ImageNet | 512x512 | **2.67** | **252.12**|
