Questions about method details #1
Hi! Thank you for your wonderful work. I am very curious about how to modify image B through image A, as in the teaser where the cake is turned into gems. I couldn't find the img2img code in the repository. I assume that in the teaser you used two different hypernets to learn image A and image B respectively, and then generated new images using the token embeddings of *A and *B. I would like to know what kind of prompt is used behind the teaser. Thank you again for your great work.
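For example, I imagine something along these lines, purely my own sketch using a generic transformers-style CLIP text encoder; the `<A*>`/`<B*>` placeholder tokens, the dummy embeddings, and the prompt are my assumptions, not necessarily your actual implementation:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder vectors for the two learned concepts; in practice these would
# come from the two hypernets / optimisations for image A and image B.
emb_A = torch.randn(768)  # e.g. the cake concept (*A), dummy value here
emb_B = torch.randn(768)  # e.g. the gems concept (*B), dummy value here

# Register placeholder tokens and inject the learned vectors into the
# text encoder's input embedding table.
tokenizer.add_tokens(["<A*>", "<B*>"])
text_encoder.resize_token_embeddings(len(tokenizer))
with torch.no_grad():
    embeds = text_encoder.get_input_embeddings().weight
    embeds[tokenizer.convert_tokens_to_ids("<A*>")] = emb_A
    embeds[tokenizer.convert_tokens_to_ids("<B*>")] = emb_B

# A composed prompt referring to both concepts would then be encoded and
# fed to the diffusion model as usual.
prompt = "a photo of <A*> made of <B*>"
ids = tokenizer(prompt, padding="max_length", return_tensors="pt").input_ids
cond = text_encoder(ids).last_hidden_state  # (1, 77, 768) conditioning
```

Is that roughly the idea, or does the teaser edit work differently?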
Thanks for your great work!!!
I have some questions regarding the method details mentioned in the paper and would appreciate some clarification:
During training, I understand that an initial vector of size 10×768 is mapped to text embeddings through attention and feed-forward mechanisms, and the mapping network is optimized using the denoising reconstruction loss. For each time step, the conditional embedding in the conditional branch is 1×768 (unless padded). Is that right?
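To make sure I am reading the architecture correctly, this is roughly how I picture the mapping network. It is only a minimal sketch of my understanding; the layer sizes, head count, and learnable initial vector are my assumptions:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Sketch: maps a learnable 10x768 initial vector to conditioning
    embeddings through self-attention and a feed-forward block."""
    def __init__(self, num_tokens: int = 10, dim: int = 768, heads: int = 8):
        super().__init__()
        self.init_vec = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self) -> torch.Tensor:
        x = self.init_vec.unsqueeze(0)   # (1, 10, 768)
        h, _ = self.attn(x, x, x)        # self-attention over the 10 tokens
        x = self.norm1(x + h)
        x = self.norm2(x + self.ff(x))
        return x                         # (1, 10, 768) text-space embeddings

# Training sketch: the mapper's output conditions a frozen diffusion UNet and
# is optimised with the usual denoising (noise-prediction MSE) loss, e.g.
#   noise_pred = unet(noisy_latents, t, encoder_hidden_states=mapper())
#   loss = F.mse_loss(noise_pred, noise)
```

If this is right, I am still unsure how the 1×768 per-time-step embedding is obtained from the 10×768 output (one token selected per time step, or padding?).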
During inference,
if it is a text reference, I assume that the corresponding text embedding at each time step is replaced with the optimized embedding. For example, "a teddy walking in Times Square" would become "a teddy* walking in Times Square," where "teddy*" is the optimized embedding obtained during training.
In the case of an image reference, depending on the editing purpose (style/content/layout), the input at specific time steps in the conditional branch comes from different images, denoted p* (a rough sketch of the kind of time-step schedule I mean follows below).
Is my understanding accurate?
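For the image-reference case, this is the kind of time-step schedule I have in mind. It is purely a guess on my part; the split points, the three embedding roles, and the helper name `select_condition` are my own assumptions:

```python
import torch

def select_condition(t: int, num_steps: int,
                     layout_emb: torch.Tensor,
                     content_emb: torch.Tensor,
                     style_emb: torch.Tensor) -> torch.Tensor:
    """Sketch: pick which reference embedding (p*) conditions the model
    at a given denoising step, depending on the editing purpose."""
    frac = t / num_steps
    if frac > 0.7:    # high-noise steps: coarse layout from one reference
        return layout_emb
    elif frac > 0.3:  # middle steps: content from another reference
        return content_emb
    else:             # low-noise steps: fine style / appearance
        return style_emb

# e.g. inside the sampling loop:
#   cond = select_condition(t, num_steps, p_layout, p_content, p_style)
#   noise_pred = unet(latents, t, encoder_hidden_states=cond)
```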