
Questions about method details #1

Open

hutaiHang opened this issue May 30, 2023 · 2 comments

@hutaiHang

Thanks for your great work!!!

I have some questions regarding the method details mentioned in the paper and would appreciate some clarification:

  1. As per my understanding, does "CLIP Enc" in the diagram refer to the embedding layer of the CLIP Text Encoder?

[screenshot of the method overview figure]

  2. During training, as I understand it, an initial vector of size 10*768 is mapped to text embeddings through attention and feed-forward mechanisms, and the mapping network is optimized with a denoising reconstruction loss. For each time step, the conditional embedding in the conditional branch is 1*768 (unless padded). Is that right? (A rough sketch of this understanding is included after this list.)

  3. During inference,

    • if it is a text reference, I assume that the corresponding text embedding in each time step is replaced with the optimized embedding. For example, "a teddy walking in times square" would be replaced with "a teddy* walking in times square," where "teddy*" is the optimized embedding obtained during training.

    • In the case of an image reference, based on the editing purpose (style/content/layout), the input at specific time steps in the conditional branch comes from different images denoted as p*.

Is my understanding accurate?
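
To make point 2 concrete, here is a minimal PyTorch sketch of my understanding: a learnable 10*768 bank passed through attention and a feed-forward layer to produce a single 1*768 conditional embedding. All module names, shapes, and the reduction to 1*768 are my assumptions, not the repository's actual code.

```python
# A minimal sketch of my understanding, not the repository's implementation.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a learnable 10x768 bank to a single 1x768 conditional embedding."""
    def __init__(self, num_vectors=10, dim=768, num_heads=8):
        super().__init__()
        # the "initial vector" of size 10*768 mentioned above
        self.bank = nn.Parameter(torch.randn(num_vectors, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self):
        x = self.bank.unsqueeze(0)          # (1, 10, 768)
        x, _ = self.attn(x, x, x)           # attention over the bank
        x = self.norm(x + self.ff(x))       # feed-forward with residual
        return x.mean(dim=1, keepdim=True)  # (1, 1, 768) conditional embedding

# Training would then optimize only this network with the usual denoising
# reconstruction loss, e.g.
#   noise_pred = unet(noisy_latents, t, encoder_hidden_states=mapper())
#   loss = F.mse_loss(noise_pred, noise)
```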

@zyxElsa (Owner) commented Jun 16, 2023

Hi!

  1. Yes. The CLIP Enc denotes the CLIP Text Encoder.
  2. Yes. For each time step, the conditional embedding in the conditional branch is 1*768.
  3. Only an explicit * will be replaced with the learned token embedding, i.e. "a teddy walking in times square" will not be changed (see the sketch below).
    To implement the attribute editing of an imageA shown in the paper, the token embedding * of imageA is needed. Use the code in the #txt2img part to generate new images.
    To edit an existing imageB with imageA, you can learn the token embedding * of imageA and use the #img2img code, or jointly use the token embeddings of *A and *B (to be released) with the #txt2img code.
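
As a rough illustration of point 3, the sketch below shows how an explicit placeholder token's embedding could be swapped for the learned 1*768 vector before the prompt is encoded. It uses Hugging Face's CLIP classes and a dedicated placeholder string "<ref>" rather than a literal "*" (to avoid colliding with an existing vocabulary entry); these choices and the random stand-in for the learned embedding are illustrative assumptions, not the repository's #txt2img code.

```python
# Rough sketch: replace a placeholder token's embedding with the learned vector.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# register the placeholder and make room for it in the embedding table
placeholder = "<ref>"  # stand-in for the thread's "*"
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)

# in practice this would be the optimized embedding saved during training
learned_embedding = torch.randn(768)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[placeholder_id] = learned_embedding

# only the explicit placeholder is affected; plain words like "teddy" are untouched
prompt = "a teddy <ref> walking in times square"
ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt").input_ids
cond = text_encoder(ids).last_hidden_state  # (1, 77, 768) conditioning for the U-Net
```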

@Occulte commented Nov 23, 2023

Hi!

Thank you for your wonderful work. I am very curious about how to modify image B with image A, like the teaser that turns cake into gems. I didn't find the #img2img code in the repository. I assume that you used two different hypernets in the teaser to learn image A and image B respectively, and generated new images with the token embeddings of *A and *B. I would like to know what kind of prompt is used behind the teaser; the sketch below shows the generic img2img workflow I have in mind.

Thank you again for your great work.
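
The sketch below is a generic diffusers-based guess at the #img2img workflow described above: start from image B and denoise it with a prompt that contains image A's learned token. The model id, prompt text, strength value, and the "<refA>" placeholder are all assumptions for illustration, not the repository's actual #img2img code.

```python
# Generic img2img sketch (not the repository's #img2img code): edit imageB
# using a prompt that carries imageA's learned token embedding.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# assume imageA's learned embedding was already injected into pipe.tokenizer /
# pipe.text_encoder under the token "<refA>", as in the earlier sketch

init_image = Image.open("imageB.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a cake in the style of <refA>",  # hypothetical prompt
    image=init_image,
    strength=0.6,          # how far the edit may drift from imageB
    guidance_scale=7.5,
).images[0]
result.save("imageB_edited.png")
```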
