
Added HIT-TMG_KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper #2478


Open · wants to merge 9 commits into main

Conversation

@ayush1298 (Contributor) commented Apr 2, 2025

fixes #1445 #2482
Added 3 models:

  1. HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper
  2. HIT-TMG/KaLM-embedding-multilingual-mini-v1
  3. HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5 with instruct wrapper

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
      (see the loading sketch after this checklist)
  • I have tested that the implementation works on a representative set of tasks.
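
For reference, the loading check above can be sketched roughly as follows (model name taken from this PR, revision handling omitted), assuming the standard mteb helpers:

```python
import mteb

model_name = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"

# Resolve the registered metadata and load the model through its registered loader.
meta = mteb.get_model_meta(model_name)
model = mteb.get_model(model_name)

print(meta.name, meta.revision)
```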

@ayush1298 (Contributor Author):

@Samoed I will not be able to run the models on all tasks and add the results to the results repo. Could you do that, if possible?
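
For context, running one of these models on a few tasks and writing result files (which would then be added to the results repo) looks roughly like this, assuming the standard mteb evaluation API:

```python
import mteb

model = mteb.get_model("HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1")

# A small, representative subset of tasks; the full run covers many more.
tasks = mteb.get_tasks(tasks=["EmotionClassification", "CEDRClassification", "Ocnli"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")  # writes one JSON result file per task
```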


@ayush1298 (Contributor Author) commented Apr 3, 2025

Detailed Analysis of Results Comparison:
M1: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1
M2: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5
M3: HIT-TMG/KaLM-embedding-multilingual-mini-v1

Task Type - "Classification": Task - "EmotionClassification" Significant Differences in Results
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Accuracy 0.604 0.85565 0.6017 0.869 0.5118 0.53945
F1 0.5475 0.81123 0.5469 0.82434 0.4573 0.46749
F1 Weighted 0.6202 0.85983 0.6184 0.87211 0.5321 0.55445
Main Score 0.604 0.85565 0.6017 0.869 0.5118 0.53945
Task Type - "MultilabelClassification": Task - "CEDRClassification"
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Accuracy 0.3972 0.4330 0.3908 0.4376 0.4015 0.4216
F1 0.2724 0.4111 0.2719 0.4247 0.3118 0.3909
LRAP 0.6429 0.7206 0.6409 0.7363 0.6595 0.7107
Main Score 0.3972 0.4330 0.3908 0.4376 0.4015 0.4216
Task Type - "Clustering": Task - "GeoreviewClusteringP2P"
Metric M1-New M1-Old M2-New M2-Old M3-New M3-Old
Main Score 0.6329 0.6028 0.6324 0.6076 0.6211 0.6340
V-Measure 0.6329 0.6028 0.6324 0.6076 0.6211 0.6340
V-Measure Std 0.0088 0.0103 0.0091 0.0066 0.0098 0.0044
Task Type - "PairClassification": Task - "Ocnli"
Metric M1-Old M1-New M2-Old M2-New M3-Old M3-New
Cosine Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Cosine Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Cosine AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Cosine F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Cosine F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Cosine Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Cosine Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Dot Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Dot Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Dot AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Dot F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Dot F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Dot Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Dot Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Euclidean Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Euclidean Accuracy Threshold 0.5595 0.5316 0.5379 0.5379 0.8349 0.8349
Euclidean AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Euclidean F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Euclidean F1 Threshold 0.5682 0.5775 0.5682 0.5682 0.9074 0.9074
Euclidean Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Euclidean Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Manhattan Accuracy 0.6605 0.6611 0.6589 0.6589 0.6692 0.6692
Manhattan Accuracy Threshold 12.1546 12.0459 12.5711 12.5711 19.5184 19.5184
Manhattan AP 0.6951 0.6940 0.6893 0.6893 0.6902 0.6902
Manhattan F1 0.7056 0.7068 0.7002 0.7002 0.7090 0.7090
Manhattan F1 Threshold 13.4479 13.3175 13.2169 13.2169 22.0142 22.0142
Manhattan Precision 0.6178 0.6184 0.6172 0.6172 0.5891 0.5891
Manhattan Recall 0.8226 0.8247 0.8089 0.8089 0.8902 0.8902
Max AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Max F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Max Precision 0.6393 0.6184 0.6172 0.6172 0.6234 0.6234
Max Recall 0.8226 0.8532 0.8268 0.8268 0.8902 0.8902
Similarity Accuracy 0.6687 0.6665 0.6622 0.6622 0.6703 0.6703
Similarity Accuracy Threshold 0.8435 0.8587 0.8553 0.8553 0.6514 0.6514
Similarity AP 0.6983 0.6968 0.6921 0.6921 0.6949 0.6949
Similarity F1 0.7134 0.7075 0.7045 0.7045 0.7128 0.7128
Similarity F1 Threshold 0.8386 0.8333 0.8385 0.8385 0.5884 0.5884
Similarity Precision 0.6393 0.6043 0.6136 0.6136 0.6234 0.6234
Similarity Recall 0.8068 0.8532 0.8268 0.8268 0.8321 0.8321
Main Score 0.6983 0.6665 0.6921 0.6622 0.6949 0.6703

@Samoed (Member) commented Apr 3, 2025

Hm. It looks like the prompts they report in the paper differ from the ones we're using. Can you update your implementation with their prompts? You could change the model to use the sentence transformer wrapper, but that is a hack, and it's not clear how to integrate their results properly. At least, can you try changing the prompt for 2-3 tasks directly to test whether our implementation matches?

@ayush1298 (Contributor Author) commented Apr 3, 2025

> Hm. It looks like the prompts they report in the paper differ from the ones we're using. Can you update your implementation with their prompts? You could change the model to use the sentence transformer wrapper, but that is a hack, and it's not clear how to integrate their results properly. At least, can you try changing the prompt for 2-3 tasks directly to test whether our implementation matches?

I think only the Classification and MultilabelClassification results show some differences. For the retrieval, reranking, and STS tasks (whose results I was going to share shortly), there are no differences.

Update:
I looked at their paper; they give different task instructions, each one specific to a task. Should we support task-specific instructions in MTEB?

@Samoed (Member) commented Apr 3, 2025

> I looked at their paper; they give different task instructions, each one specific to a task. Should we support task-specific instructions in MTEB?

I think you can create an issue to discuss it. After that, we will decide what to do with this model.

@Samoed (Member) commented Apr 7, 2025

@ayush1298 You can change get_instruction

```python
def get_instruction(task_name: str, prompt_type: PromptType | None) -> str:
```

similarly to get_prompt_name

```python
def get_prompt_name(
```
@ayush1298 (Contributor Author) commented Apr 8, 2025

@Samoed I have modified get_instruction similarly to get_prompt_name, but I don't know exactly how to incorporate this into model_meta for each model.

One more thing: I think what I missed is that the prompts given at the end of the paper all have the same format:

```python
HIT_TMG_INSTRUCTION = "Instruct: {instruction}\nQuery: "
```

They just give these as examples, with a task-specific instruction and query for each task.
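
For illustration, applying that single template to one of the task-specific instructions from this PR (the wrapper then appends the actual query text after the filled template):

```python
HIT_TMG_INSTRUCTION = "Instruct: {instruction}\nQuery: "

# Task-specific instruction taken from the GeoreviewClusteringP2P entry in this PR.
instruction = "Identify the topic or theme of the Russian reviews."
prompt = HIT_TMG_INSTRUCTION.format(instruction=instruction)
assert prompt == "Instruct: Identify the topic or theme of the Russian reviews.\nQuery: "
```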

"""Get the instruction/prompt to be used for encoding sentences."""
if prompts_dict and task_name in prompts_dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if a task wants to use different instructions for queries and passages?

@ayush1298 (Contributor Author):

What should be done for that?

@Samoed (Member) commented Apr 16, 2025:

I think the code can be changed like this:

```python
task = mteb.get_task(task_name=task_name)
prompt = task.metadata.prompt
if prompts_dict and task_name in prompts_dict:
    prompt = prompts_dict[task_name]

if isinstance(prompt, dict) and prompt_type:
    ...
if prompt:
    return prompt
...
```
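
To make the elided dict branch concrete: if a task-level prompt is a dict keyed by prompt type, it can carry different instructions for queries and passages, which addresses the question above. A sketch with illustrative values (the import path is an assumption):

```python
from mteb.encoder_interface import PromptType  # assumed location of PromptType

# Illustrative task-level prompt with different instructions per prompt type.
prompt = {
    "query": "Instruct: Given a web search query, retrieve relevant passages.\nQuery: ",
    "passage": "",
}

prompt_type = PromptType.query
if isinstance(prompt, dict) and prompt_type:
    prompt = prompt.get(prompt_type.value)  # picks the query-side instruction
```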

"EightTagsClustering": "Instruct: Identify of headlines from social media posts in Polish into 8 categories: film, history, food, medicine, motorization, work, sport and technology \n Query: {query}",
"GeoreviewClusteringP2P": "Instruct: Identify the topic or theme of the Russian reviews. \n Query: {query}",
"RuSciBenchGRNTIClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: {query}",
"RuSciBenchOECDClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: {query}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't add {query}, because we append the text to the instruction.
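
At a minimum, the entries above would then drop the trailing {query} placeholder and keep only the instruction text, e.g.:

```python
"GeoreviewClusteringP2P": "Instruct: Identify the topic or theme of the Russian reviews. \n Query: ",
"RuSciBenchGRNTIClusteringP2P": "Instruct: Identify the topic or theme of the Russian articles. \n Query: ",
```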

Successfully merging this pull request may close these issues:

  • "KeyError: 'document' not found and no similar keys were found.