
Questions about method details #1

Open

hutaiHang opened this issue May 30, 2023 · 2 comments

@hutaiHang

Thanks for your great work!!!

I have some questions regarding the method details mentioned in the paper and would appreciate some clarification:

  1. As per my understanding, does "CLIP Enc" in the diagram refer to the embedding layer of the CLIP Text Encoder?

[screenshot of the method overview figure]

  2. During training, as I understand it, an initial vector of size 10*768 is mapped to text embeddings through attention and feed-forward mechanisms, and the mapping network is optimized with a denoising reconstruction loss. For each time step, the conditional embedding in the conditional branch is 1*768 (unless padded). Is that right? (A rough sketch of this understanding is included after this list.)

  3. During inference,

    • if it is a text reference, I assume that the corresponding text embedding in each time step is replaced with the optimized embedding. For example, "a teddy walking in times square" would be replaced with "a teddy* walking in times square," where "teddy*" is the optimized embedding obtained during training.

    • In the case of an image reference, based on the editing purpose (style/content/layout), the input at specific time steps in the conditional branch comes from different images denoted as p*.

Is my understanding accurate?
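
To make point 2 concrete, here is a minimal PyTorch sketch of my understanding: a learnable 10*768 bank passed through attention and a feed-forward layer to produce a single 1*768 conditional embedding. All module names, shapes, and the reduction to 1*768 are my assumptions, not the repository's actual code.

```python
# A minimal sketch of my understanding, not the repository's implementation.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a learnable 10x768 bank to a single 1x768 conditional embedding."""
    def __init__(self, num_vectors=10, dim=768, num_heads=8):
        super().__init__()
        # the "initial vector" of size 10*768 mentioned above
        self.bank = nn.Parameter(torch.randn(num_vectors, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self):
        x = self.bank.unsqueeze(0)          # (1, 10, 768)
        x, _ = self.attn(x, x, x)           # attention over the bank
        x = self.norm(x + self.ff(x))       # feed-forward with residual
        return x.mean(dim=1, keepdim=True)  # (1, 1, 768) conditional embedding

# Training would then optimize only this network with the usual denoising
# reconstruction loss, e.g.
#   noise_pred = unet(noisy_latents, t, encoder_hidden_states=mapper())
#   loss = F.mse_loss(noise_pred, noise)
```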

@zyxElsa (Owner) commented Jun 16, 2023

Hi!

  1. Yes. The CLIP Enc denotes the CLIP Text Encoder.
  2. Yes. For each time step, the conditional embedding in the conditional branch is 1*768.
  3. Only an explicit * will be replaced with the learned token embedding, i.e. "a teddy walking in times square" will not be changed (see the sketch below).
    To implement the attribute editing of an imageA shown in the paper, the token embedding * of imageA is needed. Use the code in the #txt2img part to generate new images.
    To edit an existing imageB with imageA, you can learn the token embedding * of imageA and use the #img2img code, or jointly use the token embeddings of *A and *B (to be released) with the #txt2img code.
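
As a rough illustration of point 3, the sketch below shows how an explicit placeholder token's embedding could be swapped for the learned 1*768 vector before the prompt is encoded. It uses Hugging Face's CLIP classes and a dedicated placeholder string "<ref>" rather than a literal "*" (to avoid colliding with an existing vocabulary entry); these choices and the random stand-in for the learned embedding are illustrative assumptions, not the repository's #txt2img code.

```python
# Rough sketch: replace a placeholder token's embedding with the learned vector.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# register the placeholder and make room for it in the embedding table
placeholder = "<ref>"  # stand-in for the thread's "*"
tokenizer.add_tokens([placeholder])
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)

# in practice this would be the optimized embedding saved during training
learned_embedding = torch.randn(768)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[placeholder_id] = learned_embedding

# only the explicit placeholder is affected; plain words like "teddy" are untouched
prompt = "a teddy <ref> walking in times square"
ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt").input_ids
cond = text_encoder(ids).last_hidden_state  # (1, 77, 768) conditioning for the U-Net
```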

@Occulte commented Nov 23, 2023

Hi!

Thank you for your wonderful work. I am very curious about how to modify image B with image A, like the teaser that turns cake into gems. I didn't find the #img2img code in the repository. I assume that you used two different hypernets in the teaser to learn image A and image B respectively, and generated new images with the token embeddings of *A and *B. I would like to know what kind of prompt is used behind the teaser; the sketch below shows the generic img2img workflow I have in mind.

Thank you again for your great work.
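
The sketch below is a generic diffusers-based guess at the #img2img workflow described above: start from image B and denoise it with a prompt that contains image A's learned token. The model id, prompt text, strength value, and the "<refA>" placeholder are all assumptions for illustration, not the repository's actual #img2img code.

```python
# Generic img2img sketch (not the repository's #img2img code): edit imageB
# using a prompt that carries imageA's learned token embedding.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# assume imageA's learned embedding was already injected into pipe.tokenizer /
# pipe.text_encoder under the token "<refA>", as in the earlier sketch

init_image = Image.open("imageB.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a cake in the style of <refA>",  # hypothetical prompt
    image=init_image,
    strength=0.6,          # how far the edit may drift from imageB
    guidance_scale=7.5,
).images[0]
result.save("imageB_edited.png")
```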
