
How to create prefix_tree_opt.pkl for new dataset? #3

Open
zhiweihu1103 opened this issue Apr 15, 2024 · 23 comments

Comments

@zhiweihu1103

Hi, folks. Nice work. Could you tell me how to create prefix_tree_opt.pkl for a new dataset? I want to apply this approach to another task.

@Senbao-Shi
Collaborator

Senbao-Shi commented Apr 16, 2024

Hi, thank you for your question.

We use the initializer function of the Trie class to build the prefix tree. Make sure every title begins with 'bos' and ends with 'eos'.

import pickle

from trie import Trie

# tittle_ids: list of token-id sequences, one per entity title
tree = Trie(tittle_ids)
prefix_tree_dict = tree.trie_dict

# prefix_tree_file: path of the output pickle file
with open(prefix_tree_file, 'wb') as f:
    pickle.dump(prefix_tree_dict, f)

@zhiweihu1103
Author

Thanks for your quick reply. If I have a list of new entity names, would you mind telling me how to get the tittle_ids for Trie(tittle_ids)?

@Senbao-Shi
Collaborator

Just tokenize all the new entity names and get the input_ids as follows:

tittle_ids = [tokenizer(t)['input_ids'] for t in tittle]

@zhiweihu1103
Author

Which model's tokenizer do I need? If the model is OPT, I use the OPT tokenizer; if Llama, the Llama tokenizer. Right?

@Senbao-Shi
Collaborator

yep

@zhiweihu1103
Author

Thx, let me try.

@zhiweihu1103
Author

Hi, a further question. If I have the entity 'Joe Biden', the input to the Trie should be 'eos Joe Biden eos', right? I need to add eos and spaces to separate it from 'Joe Biden'.

@Senbao-Shi
Collaborator

Yes, you can use this method, and you can test it with a small amount of data to see if it performs well.
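
A minimal sketch of one way to do this wrapping with explicit token ids rather than the literal strings 'bos'/'eos'; the leading space and the use of bos_token_id/eos_token_id here are assumptions for illustration, not necessarily the exact preprocessing used in this repo:

from transformers import AutoTokenizer

from trie import Trie

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)

def title_to_ids(title):
    # tokenize the bare title (no special tokens), with a leading space so the
    # first word is tokenized the same way it would be mid-sentence, then wrap
    # the sequence with the model's bos and eos ids
    body = tokenizer(' ' + title, add_special_tokens=False)['input_ids']
    return [tokenizer.bos_token_id] + body + [tokenizer.eos_token_id]

tittle_ids = [title_to_ids(t) for t in ['Joe Biden', 'Barack Obama']]
tree = Trie(tittle_ids)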

@zhiweihu1103
Author

A further question: why do you pad on the right for training but on the left for dev/test?
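
For context, the usual reason with decoder-only models (general practice, not a confirmed detail of this repo): during generation new tokens are appended after the last position, so left padding keeps every sequence's real tokens flush against the generation point, while during teacher-forced training right padding is fine because pad positions are masked out of the loss. A small sketch of the two settings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)

# training: right padding; pad positions are ignored via the attention/label masks
tokenizer.padding_side = 'right'
train_batch = tokenizer(['short text', 'a somewhat longer text'],
                        padding=True, return_tensors='pt')

# dev/test generation: left padding, so each sequence ends at the last position
tokenizer.padding_side = 'left'
eval_batch = tokenizer(['short text', 'a somewhat longer text'],
                       padding=True, return_tensors='pt')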

@zhiweihu1103
Author

Would you mind providing the entity_list you used to create the tree.pkl file? I need to check whether my code is correct.

@zhiweihu1103
Author

zhiweihu1103 commented Apr 17, 2024

Let me give more details on how I generated tree_opt.pkl; I hope you can help.
First, here is my setup:

  • I use opt-1.3b as the tokenizer.
  • I use kb_entity.json from the WikiDiverse dataset as the entity_names (the kb_entity.json comes from the MIMIC paper); I uploaded a copy here.

I use the following code to generate the tree_opt.pkl file:

import pickle
import json

from trie import Trie
from transformers import AutoTokenizer

def create_trie_pkl(data_path, tokenizer_path, output_path):
    # collect all entity names from the KB file
    entity_name_list = []
    with open(data_path, 'r') as file:
        data = json.load(file)
    for single_data in data:
        entity_name_list.append(single_data['entity_name'])

    # tokenize each entity name and build the prefix tree from the id sequences
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    tittle_ids = [tokenizer(t)['input_ids'] for t in entity_name_list]
    tree = Trie(tittle_ids)
    prefix_tree_dict = tree.trie_dict
    with open(output_path, 'wb') as f:
        pickle.dump(prefix_tree_dict, f)

if __name__ == '__main__':
    data_path = './kb_entity.json'
    tokenizer_path = './opt-1.3b'
    output_path = './tree_opt.pkl'
    create_trie_pkl(data_path, tokenizer_path, output_path)

However, the generated tree_opt.pkl file is very small (I also uploaded it here). Your prefix_tree_opt.pkl is 209 MB, but the tree_opt.pkl I generated is only 2.9 MB. I don't understand how your prefix_tree_opt.pkl was generated. In particular, what exactly does the content of the entity_name_list you use look like? I don't know what went wrong in between. Looking forward to your reply.
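
One rough way to sanity-check such a size gap (a sketch of my own; it assumes both pickles hold the nested trie_dict of token ids produced by the snippet above) is to count nodes and stored sequences in each file:

import pickle

def count_nodes(node):
    # trie_dict is assumed to be a nested dict: {token_id: subtree, ...}
    return 1 + sum(count_nodes(child) for child in node.values())

def count_sequences(node):
    # an empty subtree marks the end of one stored token sequence
    if not node:
        return 1
    return sum(count_sequences(child) for child in node.values())

for path in ['./tree_opt.pkl', './prefix_tree_opt.pkl']:
    with open(path, 'rb') as f:
        trie_dict = pickle.load(f)
    print(path, 'nodes:', count_nodes(trie_dict), 'sequences:', count_sequences(trie_dict))

If the sequence counts match but the node counts differ a lot, the gap likely comes from tokenization (special tokens, spacing) rather than from a different entity list.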

@zhiweihu1103
Author

If it's convenient, can you leave a WeChat ID? Thank you for your help.

@zhiweihu1103
Author

Sorry, I have solved this problem, I will close this issue.

@Senbao-Shi
Collaborator

Sorry for the late response. We have provided guidelines on how to build and use a prefix tree for constrained decoding.
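
For anyone arriving here later, this is roughly how a pickled prefix tree is commonly plugged into constrained decoding with Hugging Face generate(); walking trie_dict directly, the prompt text, and the fallback to eos are illustrative assumptions rather than this repo's confirmed API:

import pickle

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')

with open('prefix_tree_opt.pkl', 'rb') as f:
    trie_dict = pickle.load(f)

def allowed_next_tokens(generated_ids):
    # walk the nested dict {token_id: subtree, ...}; an unknown or exhausted
    # prefix falls back to eos so generation can terminate
    node = trie_dict
    for tok in generated_ids:
        if tok not in node:
            return [tokenizer.eos_token_id]
        node = node[tok]
    return list(node.keys()) or [tokenizer.eos_token_id]

prompt = tokenizer('The mentioned entity is', return_tensors='pt')
prompt_len = prompt['input_ids'].shape[1]

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # only the tokens generated after the prompt are matched against the trie
    return allowed_next_tokens(input_ids.tolist()[prompt_len:])

output = model.generate(**prompt, max_new_tokens=16,
                        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))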

@zhiweihu1103
Author

Great, thanks for your hard work.

@zhiweihu1103
Author

zhiweihu1103 commented Apr 20, 2024

Hi, folks. I want to know why you append target_embed to item_feat at line 51 of model.py. Is there any label leakage? Our goal is to predict the target, yet the input text contains the target.

https://github.com/Senbao-Shi/GEMEL/blob/2560d08f866f53134932d0298621122b746a9316/model.py#L51

@zhiweihu1103
Author

Hi, would you mind providing the entity_name files of WikiDiverse and WikiMEL that you used? I created the tree.pkl file with the code you provided, but its size is very different from the one you released. When I test the model with my tree.pkl, the result in the w/o In-context Learning setting is only 65.95 on the WikiDiverse dataset; after replacing it with the tree.pkl you provided, the performance is 77.85. I need to know whether my entity_name lists differ from the ones you used. Thanks.

@zhiweihu1103
Author

I hope this finds you well. Sorry to bother you again; I hope you can share the entity_name_list you used to generate the tree.pkl files for the WikiDiverse and WikiMEL datasets. Thanks again.

zhiweihu1103 reopened this Apr 22, 2024
@zhiweihu1103
Author

Hi, friend, any update on the entity_list?

@KarimAsh11

Hello,
Did you find a solution to this issue? Which entity list is used for the paper? The one from GENRE?
Thanks!

@zhiweihu1103
Author

No, the author did not reply to me, even though I sent a separate email asking about it. The prefix_tree_opt.pkl I generated myself from the benchmark dataset's entity list is completely different from the one provided by the author.

@KarimAsh11

Ok thank you. Let's wait and hope for a reply I guess.

@zhiweihu1103
Author

No, half a year has passed and there is still no reply, so don't count on it.
