
Shared-Context Distillation - No Normalization Loss Function #23

Open
Etzelkut opened this issue Mar 11, 2025 · 1 comment
Comments

@Etzelkut

Hello!
Thank you for your work! It was a very interesting read.

I have a question regarding Table 1 and Table 2.
According to Table 2, using only Shared-Context Distillation already leads to significant improvements. As I understand it, this setting is applied without normalization.

However, since no normalization is used in this setting, does the total loss function (9) still require L_lg? I assume L_lg would be analogous to L_sc but applied to different random patches and added on top of the total loss. Please correct me if I'm misunderstanding.
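
For concreteness, this is the form I am assuming for equation (9); the notation and the exact decomposition are my guess from the text, so please correct it if the paper defines it differently:

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sc}} + \mathcal{L}_{\text{lg}},
$$

where $\mathcal{L}_{\text{sc}}$ is the shared-context distillation loss on one set of random patches and $\mathcal{L}_{\text{lg}}$ would be the analogous term computed on a different set of random patches.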

Best regards!

@Etzelkut
Author

I suppose this misunderstanding comes from Figure 3, where Shared-Context Distillation includes L_lg.
