i don't typically cover the cv domain but this paper won 'best paper' at neurips and i heard it's a really good paper for people unfamiliar with the field. you may also have heard of the controversy surrounding this paper: it's a collaboration between peking university and bytedance, and bytedance is suing the main author because he was apparently sabotaging other projects and slowing down colleagues, all in an attempt to hog more compute for his project. this project. which now won best paper and outperformed previous autoregressive image modelling sotas by 10x. crazy story. anyway, let's get into it.

in image generation, diffusion models are unmatched in their performance; they've been the staple architecture behind models like StableDiffusion and SORA. meanwhile, despite the undeniable success of autoregressive (ar) models in the language domain, their performance in computer vision has stagnated, falling far behind diffusion models. but why? in language, ar models have been celebrated for their scalability and generalizability, so why hasn't this translated to the image domain? just like text, humans have curated an immense amount of visual data that is readily available online. we have the data, much like we do for llms, yet performance hasn't reflected this. so what's different?

well, it's simple really. text has a natural causal order, images don't. traditional autoregressive approaches attempt to turn images into a next-image-token prediction task, forcing them into a structure originally designed for language.

![](/images/arimage.png)

in this approach, images are patchified, discretized into tokens, and arranged into a 1d sequence—typically following a raster-scan order (left-to-right, top-down). this introduces an inductive prior originally designed for text. for text, this makes sense because it inherently follows a 1d order. for images, however, this assumption is unnatural. to address this mismatch, researchers often rely on positional embeddings (like rope embeddings) to encode spatial relationships into the neural network. despite these efforts, this workaround has yet to achieve significant success. whether the raster-scan order itself is the main limitation remains debatable, but the results of this paper suggest it might be. that’s because this paper, var, directly tackles the shortcomings of raster-scan ordering.
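
to make the raster-scan setup concrete, here's a toy sketch (hypothetical sizes, not the paper's actual tokenizer) of what flattening a grid of image tokens into a 1d sequence looks like, and why the resulting conditioning is awkward for images:

```python
import torch

# a 4x4 grid of discrete image tokens from some vq tokenizer (toy example)
token_grid = torch.arange(16).reshape(4, 4)
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11],
#  [12, 13, 14, 15]]

# raster-scan order: flatten left-to-right, top-down into a 1d sequence
sequence = token_grid.flatten()   # 0, 1, 2, ..., 15

# a causal ar model over this sequence conditions token 5 on tokens 0-4 (its left
# neighbour and the row above), but never on the spatially adjacent tokens below (9)
# or to the right (6) -- the 1d inductive bias borrowed from text
```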

autoregressive modeling inherently requires defining an order for the data. var redefines what *order* means for images by shifting the objective from predicting the next image token to predicting the next resolution (or scale). instead of processing images token by token, the model generates entire images autoregressively from coarse to fine scales. humans naturally perceive images hierarchically, which suggests that a multi-scale, coarse-to-fine ordering offers a much better inductive prior. this idea, rooted in studies of human vision, mirrors how CNNs process images—aggregating information progressively through receptive fields. CNNs are known to capture different levels of detail across their layers, making this coarse-to-fine approach both intuitive and effective.
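
roughly, the factorization changes from one token per autoregressive step to one token *map* per step. a small bookkeeping sketch, with an illustrative coarse-to-fine scale schedule (the exact numbers below are an assumption for a 16x16 final map, not necessarily the paper's configuration):

```python
# next-token ar:  p(x) = prod_i p(x_i | x_1, ..., x_{i-1}),   one token per step
# next-scale ar:  p(R) = prod_k p(r_k | r_1, ..., r_{k-1}),   one token map per step

scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]   # assumed side lengths of the K = 10 maps
tokens_per_map = [s * s for s in scales]     # [1, 4, 9, ..., 256]

# raster-scan generation of just the final 16x16 map would take 256 sequential steps;
# next-scale generation emits every token of a map in parallel, so it takes K = 10
# steps while producing sum(tokens_per_map) = 680 tokens in total
print(len(scales), sum(tokens_per_map))      # 10 680
```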

there are two stages to training a VAR model: the first is to train a multi-scale VQ autoencoder that transforms an image into $K$ token maps $R = (r_1, r_2, ..., r_K)$; the second is to train a transformer on $([s], r_1, r_2, ..., r_{K-1})$ predicting $(r_1, r_2, ..., r_K)$. i won't go into details about the VAR transformer itself; it's a standard gpt-2-style transformer and likely nothing you haven't seen before. what's interesting here is $r_k$, and to understand $r_k$ we'll take a look at the tokenizer, the multi-scale VQVAE.
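
here's a minimal sketch of what the second stage's teacher forcing looks like in terms of shapes, using made-up sizes ($K = 3$ scales, a 4096-entry codebook); the point is only what each step conditions on and predicts, not the actual architecture:

```python
import torch

# made-up sizes: K = 3 scales with side lengths 1, 2, 4 and a 4096-entry codebook
scales, vocab = [1, 2, 4], 4096

# stage 1 (assumed already trained): the multi-scale vqvae maps an image to K token maps
R = [torch.randint(vocab, (1, s * s)) for s in scales]   # r_1..r_K, each flattened to s*s tokens

# stage 2: teacher-forced next-scale prediction. at step k the transformer conditions on
# ([s], r_1, ..., r_{k-1}) and predicts every token of r_k in parallel
start = torch.zeros(1, 1, dtype=torch.long)              # stand-in for the [s] start token
for k, r_k in enumerate(R, start=1):
    context = torch.cat([start] + R[:k - 1], dim=1)      # ([s], r_1, ..., r_{k-1})
    print(f"step {k}: condition on {context.shape[1]:>2} tokens -> predict {r_k.shape[1]:>2} tokens")
# step 1: condition on  1 tokens -> predict  1 tokens
# step 2: condition on  2 tokens -> predict  4 tokens
# step 3: condition on  6 tokens -> predict 16 tokens
```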

**vqvae**. before understanding a multi-scale VQVAE, which to be clear is a novel architecture introduced in this paper, one needs to understand a vanilla VQVAE. i'll run through this briefly, [click here](https://mlberkeley.substack.com/p/vq-vae) if you want a more thorough explanation. VQVAEs are used in autoregressive image modelling to tokenize an image into discrete tokens. as the name suggests, the architecture is of classical autoencoder style, but the latent space representation, or embedding space, comes from a *discrete vocabulary*, known as a *codebook*.
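
as a rough sketch of the quantization step, assuming a hypothetical 512-entry codebook and a 16x16 latent grid (the straight-through gradient trick used during training is omitted):

```python
import torch

codebook = torch.randn(512, 32)                  # 512 codebook vectors of dimension 32
z_e = torch.randn(1, 16, 16, 32)                 # continuous encoder output on a 16x16 grid

# quantize: for every spatial position, pick the index of the nearest codebook vector
flat = z_e.reshape(-1, 32)                       # (256, 32)
dist = torch.cdist(flat, codebook)               # (256, 512) pairwise distances
tokens = dist.argmin(dim=1).reshape(1, 16, 16)   # the discrete image tokens

# the decoder reconstructs the image from the quantized embeddings, not the
# continuous encoder output
z_q = codebook[tokens]                           # (1, 16, 16, 32)
```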
