
How to create prefix_tree_opt.pkl for new dataset? #3

Open
zhiweihu1103 opened this issue Apr 15, 2024 · 23 comments

Comments

@zhiweihu1103

Hi, folks. Nice work. Could you tell me how to create prefix_tree_opt.pkl for a new dataset? I want to apply this approach to another task.

@Senbao-Shi
Collaborator

Senbao-Shi commented Apr 16, 2024

Hi, thank you for your question.

We use the initializer function of the Trie class to build the prefix tree. Make sure every title begins with 'bos' and ends with 'eos'.

import pickle

from trie import Trie

# tittle_ids: list of token-id sequences, one per entity title
tree = Trie(tittle_ids)
prefix_tree_dict = tree.trie_dict

# prefix_tree_file: path of the output pickle file
with open(prefix_tree_file, 'wb') as f:
    pickle.dump(prefix_tree_dict, f)

@zhiweihu1103
Author

Thanks for your quick reply. If I have a list of new entity names, would you mind telling me how to get the tittle_ids for Trie(tittle_ids)?

@Senbao-Shi
Collaborator

Just tokenize all the new entity names and get the input_ids as follows:

tittle_ids = [tokenizer(t)['input_ids'] for t in tittle]

@zhiweihu1103
Author

Which model's tokenizer do I need? If the model is OPT, I use the OPT tokenizer; if Llama, the Llama tokenizer. Right?

@Senbao-Shi
Collaborator

yep

@zhiweihu1103
Author

Thx, let me try.

@zhiweihu1103
Author

Hi, a further question. If I have the entity 'Joe Biden', the input to the Trie should be 'eos Joe Biden eos', right? I need to add eos and spaces to separate it from 'Joe Biden'.

@Senbao-Shi
Collaborator

Yes, you can use this method, and you can test it with a small amount of data to see if it performs well.
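
A minimal sketch of one way to do this wrapping with explicit token ids rather than the literal strings 'bos'/'eos'; the leading space and the use of bos_token_id/eos_token_id here are assumptions for illustration, not necessarily the exact preprocessing used in this repo:

from transformers import AutoTokenizer

from trie import Trie

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)

def title_to_ids(title):
    # tokenize the bare title (no special tokens), with a leading space so the
    # first word is tokenized the same way it would be mid-sentence, then wrap
    # the sequence with the model's bos and eos ids
    body = tokenizer(' ' + title, add_special_tokens=False)['input_ids']
    return [tokenizer.bos_token_id] + body + [tokenizer.eos_token_id]

tittle_ids = [title_to_ids(t) for t in ['Joe Biden', 'Barack Obama']]
tree = Trie(tittle_ids)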

@zhiweihu1103
Author

A further question: why do you pad on the right for training but on the left for dev/test?
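
For context, the usual reason with decoder-only models (general practice, not a confirmed detail of this repo): during generation new tokens are appended after the last position, so left padding keeps every sequence's real tokens flush against the generation point, while during teacher-forced training right padding is fine because pad positions are masked out of the loss. A small sketch of the two settings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)

# training: right padding; pad positions are ignored via the attention/label masks
tokenizer.padding_side = 'right'
train_batch = tokenizer(['short text', 'a somewhat longer text'],
                        padding=True, return_tensors='pt')

# dev/test generation: left padding, so each sequence ends at the last position
tokenizer.padding_side = 'left'
eval_batch = tokenizer(['short text', 'a somewhat longer text'],
                       padding=True, return_tensors='pt')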

@zhiweihu1103
Author

Would you mind providing the entity_list you used to create the tree.pkl file? I need to check whether my code is correct.

@zhiweihu1103
Author

zhiweihu1103 commented Apr 17, 2024

Let me give more details on how I generated tree_opt.pkl; I hope you can help.
First, here is my setup:

  • I use opt-1.3b as the tokenizer.
  • I use kb_entity.json from the WikiDiverse dataset as the entity_names (the kb_entity.json comes from the MIMIC paper); I uploaded a copy here.

I use the following code to generate the tree_opt.pkl file:

import pickle
import json

from trie import Trie
from transformers import AutoTokenizer

def create_trie_pkl(data_path, tokenizer_path, output_path):
    # collect all entity names from the KB file
    entity_name_list = []
    with open(data_path, 'r') as file:
        data = json.load(file)
    for single_data in data:
        entity_name_list.append(single_data['entity_name'])

    # tokenize each entity name and build the prefix tree from the id sequences
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    tittle_ids = [tokenizer(t)['input_ids'] for t in entity_name_list]
    tree = Trie(tittle_ids)
    prefix_tree_dict = tree.trie_dict
    with open(output_path, 'wb') as f:
        pickle.dump(prefix_tree_dict, f)

if __name__ == '__main__':
    data_path = './kb_entity.json'
    tokenizer_path = './opt-1.3b'
    output_path = './tree_opt.pkl'
    create_trie_pkl(data_path, tokenizer_path, output_path)

However, the generated tree_opt.pkl file is very small (I also uploaded it here). Your prefix_tree_opt.pkl is 209 MB, but the tree_opt.pkl I generated is only 2.9 MB. I don't understand how your prefix_tree_opt.pkl was generated. In particular, what exactly does the content of the entity_name_list you use look like? I don't know what went wrong in between. Looking forward to your reply.
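
One rough way to sanity-check such a size gap (a sketch of my own; it assumes both pickles hold the nested trie_dict of token ids produced by the snippet above) is to count nodes and stored sequences in each file:

import pickle

def count_nodes(node):
    # trie_dict is assumed to be a nested dict: {token_id: subtree, ...}
    return 1 + sum(count_nodes(child) for child in node.values())

def count_sequences(node):
    # an empty subtree marks the end of one stored token sequence
    if not node:
        return 1
    return sum(count_sequences(child) for child in node.values())

for path in ['./tree_opt.pkl', './prefix_tree_opt.pkl']:
    with open(path, 'rb') as f:
        trie_dict = pickle.load(f)
    print(path, 'nodes:', count_nodes(trie_dict), 'sequences:', count_sequences(trie_dict))

If the sequence counts match but the node counts differ a lot, the gap likely comes from tokenization (special tokens, spacing) rather than from a different entity list.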

@zhiweihu1103
Author

If it's convenient, can you leave a WeChat ID? Thank you for your help.

@zhiweihu1103
Author

Sorry, I have solved this problem, I will close this issue.

@Senbao-Shi
Collaborator

Sorry for the late response. We have provided guidelines on how to build and use a prefix tree for constrained decoding.
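
For anyone arriving here later, this is roughly how a pickled prefix tree is commonly plugged into constrained decoding with Hugging Face generate(); walking trie_dict directly, the prompt text, and the fallback to eos are illustrative assumptions rather than this repo's confirmed API:

import pickle

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')

with open('prefix_tree_opt.pkl', 'rb') as f:
    trie_dict = pickle.load(f)

def allowed_next_tokens(generated_ids):
    # walk the nested dict {token_id: subtree, ...}; an unknown or exhausted
    # prefix falls back to eos so generation can terminate
    node = trie_dict
    for tok in generated_ids:
        if tok not in node:
            return [tokenizer.eos_token_id]
        node = node[tok]
    return list(node.keys()) or [tokenizer.eos_token_id]

prompt = tokenizer('The mentioned entity is', return_tensors='pt')
prompt_len = prompt['input_ids'].shape[1]

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # only the tokens generated after the prompt are matched against the trie
    return allowed_next_tokens(input_ids.tolist()[prompt_len:])

output = model.generate(**prompt, max_new_tokens=16,
                        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))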

@zhiweihu1103
Author

Great, thanks for your hard work.

@zhiweihu1103
Author

zhiweihu1103 commented Apr 20, 2024

Hi, folks. I want to know why you append target_embed to item_feat at line 51 of model.py. Is there any label leakage? Our goal is to predict the target, yet the input text contains the target.

https://github.com/Senbao-Shi/GEMEL/blob/2560d08f866f53134932d0298621122b746a9316/model.py#L51

@zhiweihu1103
Author

Hi, would you mind providing the entity_name files of WikiDiverse and WikiMEL that you used? I created the tree.pkl file with the code you provided, but its size is very different from the one you released. When I test the model with my tree.pkl, the result in the w/o In-context Learning setting is only 65.95 on the WikiDiverse dataset; after replacing it with the tree.pkl you provided, the performance is 77.85. I need to know whether my entity_name lists differ from the ones you used. Thanks.

@zhiweihu1103
Author

I hope this finds you well. Sorry to bother you again; I hope you can share the entity_name_list you used to generate the tree.pkl files for the WikiDiverse and WikiMEL datasets. Thanks again.

zhiweihu1103 reopened this Apr 22, 2024
@zhiweihu1103
Author

Hi, friend, any update on the entity_list?

@KarimAsh11

Hello,
Did you find a solution to this issue? Which entity list is used for the paper? The one from GENRE?
Thanks!

@zhiweihu1103
Author

No, the author did not reply to me, even though I sent a separate email asking about it. The prefix_tree_opt.pkl I generated myself from the benchmark dataset's entity list is completely different from the one provided by the author.

@KarimAsh11

Ok thank you. Let's wait and hope for a reply I guess.

@zhiweihu1103
Author

No, half a year has passed and there is still no reply, so don't count on it.
