
Added HIT-TMG_KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper #2478


Open · wants to merge 9 commits into main

Conversation

@ayush1298 (Contributor) commented Apr 2, 2025

fixes #1445 #2482
Added 3 models:

  1. HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper
  2. HIT-TMG/KaLM-embedding-multilingual-mini-v1
  3. HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5 with instruct wrapper

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
      (see the loading sketch after this checklist)
  • I have tested that the implementation works on a representative set of tasks.
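
For reference, the loading check above can be sketched roughly as follows (model name taken from this PR, revision handling omitted), assuming the standard mteb helpers:

```python
import mteb

model_name = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"

# Resolve the registered metadata and load the model through its registered loader.
meta = mteb.get_model_meta(model_name)
model = mteb.get_model(model_name)

print(meta.name, meta.revision)
```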

@ayush1298 (Contributor Author):

@Samoed I will not be able to run the models on all tasks and add the results to the results repo. Could you do that, if possible?
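
For context, running one of these models on a few tasks and writing result files (which would then be added to the results repo) looks roughly like this, assuming the standard mteb evaluation API:

```python
import mteb

model = mteb.get_model("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1")

# A small, representative subset of tasks; the full run covers many more.
tasks = mteb.get_tasks(tasks=["EmotionClassification", "CEDRClassification", "Ocnli"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")  # writes one JSON result file per task
```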


@ayush1298 (Contributor Author) commented Apr 3, 2025

Detailed Analysis of Results Comparison:
M1: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1
M2: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5
M3: HIT-TMG/KaLM-embedding-multilingual-mini-v1

Task Type - "Classification": Task - "EmotionClassification" Significant Differences in Results
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Accuracy 0.604 0.85565 0.6017 0.869 0.5118 0.53945
F1 0.5475 0.81123 0.5469 0.82434 0.4573 0.46749
F1 Weighted 0.6202 0.85983 0.6184 0.87211 0.5321 0.55445
Main Score 0.604 0.85565 0.6017 0.869 0.5118 0.53945
Task Type - "MultilabelClassification": Task - "CEDRClassification"
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Accuracy 0.3972 0.4330 0.3908 0.4376 0.4015 0.4216
F1 0.2724 0.4111 0.2719 0.4247 0.3118 0.3909
LRAP 0.6429 0.7206 0.6409 0.7363 0.6595 0.7107
Main Score 0.3972 0.4330 0.3908 0.4376 0.4015 0.4216
Task Type - "Clustering": Task - "GeoreviewClusteringP2P"
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Main Score 0.6329 0.6028 0.6324 0.6076 0.6211 0.6340
V-Measure 0.6329 0.6028 0.6324 0.6076 0.6211 0.6340
V-Measure Std 0.0088 0.0103 0.0091 0.0066 0.0098 0.0044
Task Type - "PairClassification": Task - "Ocnli"
Metric M1-Old M1-New M2-Old M2-New M3-Old M3-New
Cosine Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Cosine Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Cosine AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Cosine F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Cosine F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Cosine Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Cosine Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Dot Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Dot Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Dot AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Dot F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Dot F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Dot Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Dot Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Euclidean Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Euclidean Accuracy Threshold 0.5595 0.5316 0.5379 0.5379 0.8349 0.8349
Euclidean AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Euclidean F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Euclidean F1 Threshold 0.5682 0.5775 0.5682 0.5682 0.9074 0.9074
Euclidean Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Euclidean Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Manhattan Accuracy 0.6605 0.6611 0.6589 0.6589 0.6692 0.6692
Manhattan Accuracy Threshold 12.1546 12.0459 12.5711 12.5711 19.5184 19.5184
Manhattan AP 0.6951 0.6940 0.6893 0.6893 0.6902 0.6902
Manhattan F1 0.7056 0.7068 0.7002 0.7002 0.7090 0.7090
Manhattan F1 Threshold 13.4479 13.3175 13.2169 13.2169 22.0142 22.0142
Manhattan Precision 0.6178 0.6184 0.6172 0.6172 0.5891 0.5891
Manhattan Recall 0.8226 0.8247 0.8089 0.8089 0.8902 0.8902
Max AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Max F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Max Precision 0.6393 0.6184 0.6172 0.6172 0.6234 0.6234
Max Recall 0.8226 0.8532 0.8268 0.8268 0.8902 0.8902
Similarity Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Similarity Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Similarity AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Similarity F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Similarity F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Similarity Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Similarity Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Main Score 0.6983 0.6665 0.6921 0.6622 0.6949 0.6703

@Samoed (Member) commented Apr 3, 2025

Hm. It looks like the prompts they report in the paper differ from the ones we're using. Can you update your implementation with their prompts? You could change the model to use the sentence transformer wrapper, but that is a hack, and it's not clear how to integrate their results properly. At least, can you try changing the prompt for 2-3 tasks directly to test whether our implementation matches?

@ayush1298 (Contributor Author) commented Apr 3, 2025

> Hm. It looks like the prompts they report in the paper differ from the ones we're using. Can you update your implementation with their prompts? You could change the model to use the sentence transformer wrapper, but that is a hack, and it's not clear how to integrate their results properly. At least, can you try changing the prompt for 2-3 tasks directly to test whether our implementation matches?

I think only the Classification and MultilabelClassification results show some differences. For the retrieval, reranking, and STS tasks (whose results I was going to share shortly), there are no differences.

Update:
I looked at their paper; they give different task instructions, each one specific to a task. Should we support task-specific instructions in MTEB?

@Samoed (Member) commented Apr 3, 2025

> I looked at their paper; they give different task instructions, each one specific to a task. Should we support task-specific instructions in MTEB?

I think you can create an issue to discuss it. After that, we will decide what to do with this model.

@Samoed (Member) commented Apr 7, 2025

@ayush1298 You can change get_instruction

```python
def get_instruction(task_name: str, prompt_type: PromptType | None) -> str:
```

similarly to get_prompt_name

```python
def get_prompt_name(
```
@ayush1298 (Contributor Author) commented Apr 8, 2025

@Samoed I have modified get_instruction similarly to get_prompt_name, but I don't know exactly how to incorporate this into model_meta for each model.

One more thing: I think what I missed is that the prompts given at the end of the paper all have the same format:

```python
HIT_TMG_INSTRUCTION = "Instruct: {instruction}\nQuery: "
```

They just give these as examples, with a task-specific instruction and query for each task.
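
For illustration, applying that single template to one of the task-specific instructions from this PR (the wrapper then appends the actual query text after the filled template):

```python
HIT_TMG_INSTRUCTION = "Instruct: {instruction}\nQuery: "

# Task-specific instruction taken from the GeoreviewClusteringP2P entry in this PR.
instruction = "Identify the topic or theme of the Russian reviews."
prompt = HIT_TMG_INSTRUCTION.format(instruction=instruction)
assert prompt == "Instruct: Identify the topic or theme of the Russian reviews.\nQuery: "
```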

"""Get the instruction/prompt to be used for encoding sentences."""
if prompts_dict and task_name in prompts_dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if a task wants to use different instructions for queries and passages?

@ayush1298 (Contributor Author):

What should be done for that?

@Samoed (Member) commented Apr 16, 2025:

I think the code can be changed like this:

```python
task = mteb.get_task(task_name=task_name)
prompt = task.metadata.prompt
if prompts_dict and task_name in prompts_dict:
    prompt = prompts_dict[task_name]

if isinstance(prompt, dict) and prompt_type:
    ...
if prompt:
    return prompt
...
```
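
To make the elided dict branch concrete: if a task-level prompt is a dict keyed by prompt type, it can carry different instructions for queries and passages, which addresses the question above. A sketch with illustrative values (the import path is an assumption):

```python
from mteb.encoder_interface import PromptType  # assumed location of PromptType

# Illustrative task-level prompt with different instructions per prompt type.
prompt = {
    "query": "Instruct: Given a web search query, retrieve relevant passages.\nQuery: ",
    "passage": "",
}

prompt_type = PromptType.query
if isinstance(prompt, dict) and prompt_type:
    prompt = prompt.get(prompt_type.value)  # picks the query-side instruction
```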

"EightTagsClustering": "Instruct: Identify of headlines from social media posts in Polish into 8 categories: film, history, food, medicine, motorization, work, sport and technology \n Query: {query}",
"GeoreviewClusteringP2P": "Instruct: Identify the topic or theme of the Russian reviews. \n Query: {query}",
"RuSciBenchGRNTIClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: {query}",
"RuSciBenchOECDClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: {query}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't add {query}, because we append the text to the instruction.
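
At a minimum, the entries above would then drop the trailing {query} placeholder and keep only the instruction text, e.g.:

```python
"GeoreviewClusteringP2P": "Instruct: Identify the topic or theme of the Russian reviews. \n Query: ",
"RuSciBenchGRNTIClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: ",
```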

Successfully merging this pull request may close these issues:

  • "KeyError: 'document' not found and no similar keys were found.