Combining OMG with InstantID for multi-concept generation #3

Open
yunbinmo opened this issue Mar 20, 2024 · 5 comments

@yunbinmo
Hi, thanks for the amazing work! I would like to ask: what exactly is the approach for using OMG with InstantID for multi-concept generation? I understand that the inference code is available, but I don't quite understand what it is doing.

As far as I know, the InstantID architecture only takes in one input reference image. It would be good to get a high-level view of how you combine OMG with InstantID for multi-concept generation when there is more than one reference image. Thank you!

@yzhang2016
Collaborator

InstantID can take multiple images as reference. The embeddings of all reference images are averaged and used as the input to the IdentityNet (a ControlNet).
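
For illustration, here is a minimal sketch of that averaging step, assuming insightface's FaceAnalysis API (InstantID uses the antelopev2 model pack); the function name and flow are illustrative, not taken from the OMG codebase:

```python
import numpy as np
from insightface.app import FaceAnalysis

# Assumes the antelopev2 model pack is available locally, as in InstantID.
app = FaceAnalysis(name="antelopev2", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def average_id_embedding(images):
    """Extract one face embedding per reference image and average them."""
    embeds = []
    for img in images:  # each img is a BGR numpy array, e.g. from cv2.imread
        faces = app.get(img)
        if not faces:
            continue  # skip reference images where no face is detected
        embeds.append(faces[0].normed_embedding)
    # The fused embedding is what gets fed to the IdentityNet.
    return np.mean(np.stack(embeds), axis=0)
```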

@yzhang2016
Collaborator

The key idea is the two-stage generation and the noise blending.

Why two-stage?

  • For a finetuned SDXL or SDXL + LoRA weights, concept-composition ability and text-image alignment are worse than with the original SDXL. Hence, we use the original SDXL to generate the content and layout in the first stage, and we collect intermediate information such as attention maps and noises for noise blending in the second stage (see the sketch below).
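
An illustrative outline of stage 1, using diffusers' stock SDXL pipeline; the attention/noise hooks are only indicated in comments, as they belong to OMG's own implementation and are not shown here:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Stage 1: the original (non-personalized) SDXL generates content and layout.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "two people standing side by side in a park, photo"
stage1 = pipe(prompt, num_inference_steps=30)
layout_image = stage1.images[0]

# In OMG, hooks on the UNet would additionally cache cross-attention maps
# and per-step noise predictions here; stage 2 re-runs denoising with the
# personalized model(s) and fuses their noise predictions regionally.
```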

How to combine with InstantID or ID LoRAs

  • Our OMG framework can be combined with other single-ID personalization methods, not only InstantID and LoRAs. Noise blending takes the regional personalized noises of multiple IDs and composites them into a fused one (see the sketch after this list). The regions can be detected using the image generated in the first stage.
  • Blending in the latent space is much better than blending in the pixel space: it avoids illumination disharmony and artifacts around the edges.
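
A minimal sketch of that blending step, assuming the per-ID noise predictions and binary region masks (e.g. face regions detected on the stage-1 image, downsampled to latent resolution) are already available; shapes and names are illustrative:

```python
import torch

def blend_noises(base_noise, id_noises, masks):
    """Composite regional personalized noises into one fused prediction.

    base_noise: (B, C, H, W) noise prediction from the original SDXL pass
    id_noises:  list of (B, C, H, W) predictions, one per personalized model
    masks:      list of (B, 1, H, W) binary masks at latent resolution,
                one region per ID
    """
    fused = base_noise.clone()
    for noise, mask in zip(id_noises, masks):
        # Inside each ID's region, take that ID's personalized prediction;
        # elsewhere keep the current fused prediction. Doing this on latent
        # noise (rather than pixels) avoids seams at region borders.
        fused = mask * noise + (1 - mask) * fused
    return fused
```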

@yunbinmo
Author

I see! Thanks for the reply!

But I have one more question: if the embeddings of multiple images are averaged as the input to the IdentityNet, would we expect some mixture of facial features from different IDs?

In other words, would using an average of 3 ID image embeddings generally look worse than using an average of 2 ID image embeddings?

@tanghengjian

My guess:
1. Whether 2 or 3 IDs are used may not be within OMG's scope.
2. Averaging the 3 ID face embeddings extracted by insightface may already be covered by InstantID?

@tanghengjian

Hi @yzhang2016, we tested this: the face area generated in the first stage limits the generation of the user's face in the second stage; in other words, we found lower face similarity in some cases.
Could we make the stage-1 face adaptively similar to the user's face, e.g. by utilizing an adapter FaceID (such as IP-Adapter FaceID)?
