---
layout: post
title: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction"
categories: []
year: 2024
type: paper
---

i don't typically cover the cv domain, but this paper won 'best paper' at neurips and i heard it's a really good read for people unfamiliar with the field. you may also have heard of the controversy surrounding it: the paper is a collaboration between peking university and bytedance, and bytedance is suing the main author because he apparently sabotaged other projects and slowed down colleagues in an attempt to hog more compute for his own project. this project. which then won best paper and outperformed previous autoregressive image modelling sotas by 10x. crazy story. anyway, let's get into it.

in image generation, diffusion models have been unmatched in their performance for the past few years. despite the undeniable success of autoregressive models in the language domain, their ability in computer vision seems to have stagnated, far outmatched by diffusion models. why? in language, the success of AR models has been driven by their scalability and generalizability, so why has this not transferred to the image domain? just like text, an immense amount of human-curated visual data is readily available on the internet. we've got the data, just like for llms, but performance hasn't reflected this. so what's different?

well, it's simple really: text has a natural causal order, images don't. the traditional autoregressive approach has been to turn image generation into a next-image-token prediction task, shoehorning it into something that matches autoregressive language models.



images are patchified, discretized into image tokens, then ordered in a 1D sequence from left-to-right, top-down, in what's known as a raster-scan order. this imposes an inductive prior that was originally designed for text. it makes sense for text, which has a natural one-dimensional order, but this isn't true for images. to counteract this bias, people use positional embeddings, specifically rope embeddings, to inform the neural net of the true 2D layout of the image, but this has clearly yet to prove successful. whether the blame lies with the raster scan specifically is hard to say, but the success of this paper surely makes it seem that way, because this paper, VAR, is all about addressing the raster-scan order.
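
to make the flattening concrete, here's a minimal toy sketch (my own illustration, not from the paper) of how a grid of image tokens gets serialized in raster-scan order:

```python
import numpy as np

# a hypothetical 4x4 grid of discrete image tokens (codebook indices)
token_grid = np.arange(16).reshape(4, 4)

# raster-scan order: left-to-right within a row, rows top-down,
# i.e. plain row-major flattening of the 2D grid
raster_sequence = token_grid.flatten()

print(raster_sequence)
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
# the 1D neighbours 3 and 4 sit on opposite borders of the image,
# which is exactly the spatial locality the raster scan throws away
```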

autoregressive modeling requires defining an order on the data. VAR reconsiders what this *order* means for images by reformulating the objective from next-image-token prediction into next-scale (or next-resolution) prediction, where entire images are autoregressively generated from coarse to fine scales. humans typically perceive images in a hierarchical manner, suggesting that there exists a multi-scale, coarse-to-fine ordering in images, which provides a much better inductive prior. this intuition, which dates back to findings on how human vision and visual processing work, is akin to how CNNs process images, using a receptive field to progressively aggregate information across layers. CNNs have been shown to process different levels of detail in an image throughout the layers of the network.
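
in other words, the likelihood factorizes over whole token maps rather than over individual tokens; with $r_k$ denoting the entire $h_k \times w_k$ token map at scale $k$, the autoregressive objective becomes

$$
p(r_1, r_2, \ldots, r_K) = \prod_{k=1}^{K} p(r_k \mid r_1, r_2, \ldots, r_{k-1})
$$

where all tokens inside $r_k$ are generated in parallel, conditioned only on the coarser scales.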

there are two stages to training a VAR model. the first is to train a multi-scale VQ autoencoder that transforms an image into $K$ token maps $R = (r_1, r_2, ..., r_K)$. the second is to train a transformer that takes $([s], r_1, r_2, ..., r_{K-1})$ as input and predicts $(r_1, r_2, ..., r_K)$, i.e. teacher forcing shifted by one scale. i won't say anything more about the actual VAR transformer, there's nothing there that you haven't seen before, a gpt-2 style transformer. what's interesting here is $r_k$, and to understand $r_k$ we'll take a look at the tokenizer, the multi-scale VQVAE.
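
to make the second stage a bit more concrete, here's a toy sketch of the training-loop structure as i read it from the description above; the transformer is a dummy that returns random logits, and all names, shapes and the start-token handling are my own, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 512                               # codebook / vocabulary size
scales = [(1, 1), (2, 2), (4, 4)]     # toy scales; real models go much finer
R = [rng.integers(0, V, size=s) for s in scales]   # token maps r_1 .. r_K from the vqvae

def var_transformer(prefix_maps, out_shape):
    """stand-in for the gpt-2 style transformer: given the start token plus all
    coarser token maps, emit logits for every token of the next scale in parallel."""
    h_k, w_k = out_shape
    return rng.normal(size=(h_k, w_k, V))          # dummy logits

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -np.take_along_axis(logp, targets[..., None], axis=-1).mean()

# teacher-forced training: one loss term per scale, conditioned on all coarser scales
loss = 0.0
for k, (h_k, w_k) in enumerate(scales):
    prefix = R[:k]                                  # empty prefix plays the role of [s]
    logits = var_transformer(prefix, (h_k, w_k))    # predict all h_k * w_k tokens of r_k at once
    loss += cross_entropy(logits, R[k])
```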

**vqvae**. before understanding the multi-scale VQVAE, which to be clear is a novel architecture introduced in this paper, one needs to understand a vanilla VQVAE. i'll run through this briefly, [click here](https://mlberkeley.substack.com/p/vq-vae) if you want a more thorough explanation. VQVAEs are used in autoregressive image modelling to tokenize an image into discrete tokens. like the name suggests, the architecture is of classical autoencoder style, but the latent representation, or embedding space, comes from a *discrete vocabulary*, known as a *codebook*.



the encoder processes the image with a cnn to produce a continuous latent representation $z_e(x)$, which serves as a mapping into the embedding space. the quantizer, $q(\cdot)$, discretizes this representation by mapping each position of $z_e(x)$ to the nearest embedding vector $e_k$ in a learnable codebook of embedding vectors (right-hand side of the figure). this results in a quantized representation $z_q(x)$, effectively enforcing a discrete and structured latent space. the decoder takes this quantized representation and passes it through another cnn to reconstruct the input, generating $p(x|z_q)$. the model is trained to minimize a compound perceptual and discriminative loss between the original image and $p(x|z_q)$.
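
a minimal numpy sketch of the quantization step (my own illustration; it ignores the straight-through gradient trick and the codebook/commitment loss terms that make this trainable):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 512, 32        # codebook size and embedding dimension (arbitrary toy values)
h, w = 16, 16         # spatial resolution of the encoder's feature map

codebook = rng.normal(size=(V, d))   # the learnable embeddings e_1 .. e_V
z_e = rng.normal(size=(h, w, d))     # continuous encoder output z_e(x)

# for every spatial position, pick the index of the nearest codebook vector
dists = ((z_e[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(-1)            # discrete token map of shape (h, w)

# the quantized representation z_q(x) is just those embeddings looked back up
z_q = codebook[tokens]               # shape (h, w, d), fed to the decoder cnn
```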

**multi-scale vqvae**. identically to the vqvae, the encoder produces a continuous feature map $z_e(x)$ using a cnn. however, instead of producing a single token map at the same resolution as $z_e(x)$, the multi-scale vqvae iteratively produces $K$ token maps at different scales, each containing $h_k \times w_k$ tokens:

Loop through each scale $k$ (from the coarsest to the finest resolution), as sketched in code after this list:

1. Downsample the feature map $z_e(x)$ to $(h_k, w_k)$ using an interpolation function
2. Quantize the downsampled feature map using codebook $Z$ to obtain discrete token map $r_k$
3. Save token map to list $R$
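
a sketch of this encoding loop under the same toy setup as above (the helper names and the nearest-neighbour resize are mine; the paper's actual tokenizer has more moving parts):

```python
import numpy as np

def quantize(z, codebook):
    """map each spatial position of z, shape (h, w, d), to its nearest codebook index."""
    dists = ((z[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)           # token map of shape (h, w)

def resize(z, h_k, w_k):
    """crude nearest-neighbour stand-in for the paper's interpolation function."""
    h, w, _ = z.shape
    rows = np.arange(h_k) * h // h_k
    cols = np.arange(w_k) * w // w_k
    return z[rows][:, cols]

def encode_multiscale(z_e, codebook, scales):
    """scales: list of (h_k, w_k), coarsest first, e.g. [(1, 1), (2, 2), ..., (16, 16)]."""
    R = []
    for h_k, w_k in scales:
        z_k = resize(z_e, h_k, w_k)   # 1. downsample z_e(x) to (h_k, w_k)
        r_k = quantize(z_k, codebook) # 2. quantize with codebook Z to get r_k
        R.append(r_k)                 # 3. save the token map to R
    return R

# e.g. R = encode_multiscale(z_e, codebook, [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)])
```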

So at each scale $k$ we get a token map of size $h_k \times w_k$ whose entries point to discrete vectors in our codebook. After the vqvae has been fully trained, this is used as input to the VAR transformer. But we need to train the vqvae first, so how do we decode this representation? After we've collected the multi-scale token maps $R$, we attempt to reconstruct the original image from our embedding space:

Loop through $R$ (from the coarsest to the finest resolution), as sketched in code after this list:

1. Retrieve $r_k$ (discrete tokens of shape $(h_k, w_k)$) from $R$
2. Lookup embeddings $z_k$ from codebook using $r_k$
3. Upsample $z_k$ to the full feature-map resolution $(h_K, w_K)$
4. Add the upsampled embeddings $z_k$ to $z_q$
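
and a matching sketch of the reconstruction side, accumulating the upsampled embeddings into $z_q$ (same caveats as above; the upsampling here is a crude nearest-neighbour stand-in):

```python
import numpy as np

def upsample(z, h_K, w_K):
    """nearest-neighbour stand-in for upsampling a (h_k, w_k, d) map to (h_K, w_K, d)."""
    h, w, _ = z.shape
    rows = np.arange(h_K) * h // h_K
    cols = np.arange(w_K) * w // w_K
    return z[rows][:, cols]

def decode_multiscale(R, codebook, h_K, w_K):
    """rebuild the quantized feature map z_q from the multi-scale token maps R."""
    d = codebook.shape[1]
    z_q = np.zeros((h_K, w_K, d))
    for r_k in R:                      # 1. iterate from coarsest to finest
        z_k = codebook[r_k]            # 2. look up embeddings for r_k -> (h_k, w_k, d)
        z_k = upsample(z_k, h_K, w_K)  # 3. upsample to the finest resolution
        z_q = z_q + z_k                # 4. accumulate into z_q
    return z_q                         # passed to the decoder cnn to produce p(x|z_q)
```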

The final step is then again to use the decoder on the quantized representation $z_q$ to reconstruct the image $p(x|z_q)$.

that's it. training this multi-scale vqvae provides a way to generate the multi-scale token maps that are then used to train the VAR transformer. this method completely preserves the spatial locality of the image, as each scale encodes the entire image without any flattening. Tokens in $r_k$ are fully correlated. I can see why this paper won the best paper award, it's such a clean solution to an inherent problem of previous AR image modelling, it's intuitive, it aligns with the natural coarse-to-fine progression of human visual perception, and the results are stunning. congrats to the authors.