How to create prefix_tree_opt.pkl for new dataset? #3
Hi, thank you for your question. We use the initializer function of the Trie class to build the prefix tree. Make sure every title begins with 'bos' and ends with 'eos'.
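For readers: a minimal sketch of what that initialization could look like, assuming the Trie class from this repo's trie.py; the token ids shown are placeholders, and the real bos/eos ids depend on the tokenizer you use.

from trie import Trie  # the Trie class shipped in this repo's trie.py

# Each entry is one entity title as a list of token ids, framed by the
# tokenizer's bos and eos ids (2 is a placeholder; OPT uses </s> = 2 for both).
title_ids = [
    [2, 25083, 13593, 2],   # hypothetical ids for one entity title
    [2, 33683, 29938, 2],   # hypothetical ids for another
]
tree = Trie(title_ids)
prefix_tree_dict = tree.trie_dict  # the dict that gets pickled later in this thread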
Thanks for your quick reply. If I have a list of new entity names, would you mind telling me how to get the tittle_ids for Trie(tittle_ids)?
Just tokenize all the new entity names and get the input_ids as follows:
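(The snippet referenced here did not survive the page export. Based on the code shared later in this thread, it is roughly the following; the paths and example names are placeholders:)

from transformers import AutoTokenizer
from trie import Trie

tokenizer = AutoTokenizer.from_pretrained('./opt-1.3b', use_fast=False)
entity_names = ['Joe Biden', 'Barack Obama']  # your new entity names

# tokenizer(...) returns a dict; 'input_ids' is the token-id sequence.
# Depending on the repo's convention you may also need to make sure each
# sequence starts with the bos id and ends with the eos id (see the reply above).
tittle_ids = [tokenizer(name)['input_ids'] for name in entity_names]
tree = Trie(tittle_ids)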
Which model do I need as the tokenizer? If OPT, I need the OPT tokenizer; if Llama, the Llama tokenizer. Right?
yep
Thx, let me try.
Hi, a further question. If I have an entity 'Joe Biden', the input to the Trie should be 'eos Joe Biden eos', right? I need to add eos and spaces to separate it from 'Joe Biden'.
Yes, you can use this method, and you can test it with a small amount of data to see if it performs well.
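One way to sanity-check that choice (a sketch, not the repo's confirmed preprocessing): inspect the ids the tokenizer actually produces and append the eos id directly, instead of gluing literal 'bos'/'eos' strings onto the title.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('./opt-1.3b', use_fast=False)

ids = tokenizer('Joe Biden')['input_ids']
print(ids)                                    # OPT's tokenizer already prepends its bos id (</s> = 2)
print(tokenizer.convert_ids_to_tokens(ids))   # see exactly how the title was split

# Appending the eos id avoids relying on how a literal 'eos' string gets tokenized.
ids = ids + [tokenizer.eos_token_id]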
A further question: why do you pad on the right for training but on the left for dev and test?
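(The thread never answers this, but the usual reason with decoder-only models is that generation continues from the last position of the batch, so prompts are left-padded at inference, while right padding plus loss masking is fine for teacher-forced training. In Transformers that is just a tokenizer setting; the paths and names below are placeholders:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('./opt-1.3b', use_fast=False)

# Training: right padding; the padded positions are masked out of the loss.
tokenizer.padding_side = 'right'
train_batch = tokenizer(['Joe Biden', 'Barack Obama'], padding=True, return_tensors='pt')

# Dev/test: left padding, so every prompt ends at the last position and
# generate() appends new tokens right after the real text.
tokenizer.padding_side = 'left'
eval_batch = tokenizer(['Joe Biden', 'Barack Obama'], padding=True, return_tensors='pt')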
Would you mind providing the entity_list you used to create the tree.pkl file? I need to test whether my code is correct.
I may need to give more details on how I generated tree_opt.pkl, and I hope I can get your help.
I use the following code to generate the tree_opt.pkl file:

import pickle
import json
from trie import Trie
from transformers import AutoTokenizer

def create_trie_pkl(data_path, tokenizer_path, output_path):
    # collect the entity names from the JSON knowledge base
    entity_name_list = []
    with open(data_path, 'r') as file:
        data = json.load(file)
        for single_data in data:
            entity_name_list.append(single_data['entity_name'])
    # tokenize every entity name and build the prefix tree from the ids
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    tittle_ids = [tokenizer(t)['input_ids'] for t in entity_name_list]
    tree = Trie(tittle_ids)
    prefix_tree_dict = tree.trie_dict
    with open(output_path, 'wb') as f:
        pickle.dump(prefix_tree_dict, f)

if __name__ == '__main__':
    data_path = './kb_entity.json'
    tokenizer_path = './opt-1.3b'
    output_path = './tree_opt.pkl'
    create_trie_pkl(data_path, tokenizer_path, output_path)

However, the generated tree_opt.pkl file is very small. I have also uploaded it here. Your prefix_tree_opt.pkl is 209M, but the tree_opt.pkl I generated is only 2.9M. I don't understand how your prefix_tree_opt.pkl was generated. In particular, what exactly does the content of the entity_name_list you used look like? I don't know what went wrong in between. Looking forward to your reply.
If it's convenient, can you leave a WeChat ID? Thank you for your help.
Sorry, I have solved this problem; I will close this issue.
Sorry for the late response. We have provided guidelines on how to build and use a prefix tree for constrained decoding.
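(For anyone who lands here later: one common way to plug such a pickled trie into Hugging Face generation is via prefix_allowed_tokens_fn. This is only a sketch, not the repo's confirmed usage; it assumes the pickled object is a nested {token_id: {token_id: ...}} dict like the trie_dict built earlier in this thread, and all paths and prompts are placeholders.)

import pickle
from transformers import AutoTokenizer, AutoModelForCausalLM

with open('./tree_opt.pkl', 'rb') as f:
    trie_dict = pickle.load(f)  # assumed layout: nested {token_id: {token_id: ...}}

tokenizer = AutoTokenizer.from_pretrained('./opt-1.3b', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('./opt-1.3b')

inputs = tokenizer('The entity mentioned here is', return_tensors='pt')
prompt_len = inputs['input_ids'].shape[1]

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # Walk the trie along the tokens generated so far; only the children of
    # the current node are allowed as the next token. Note: if every title in
    # the trie starts with the bos id, the first constrained step will only allow that id.
    node = trie_dict
    for token_id in input_ids[prompt_len:].tolist():
        node = node.get(token_id, {})
    allowed = list(node.keys())
    return allowed if allowed else [tokenizer.eos_token_id]

outputs = model.generate(**inputs, max_new_tokens=16,
                         prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))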
Great, thanks for your hard work.
Hi, folks. I want to know why you append target_embed to item_feat in line 51 of model.py. Is there any label leakage? Our goal is to predict the target, but the text you feed in already contains the target.
Hi, would you mind providing the entity_name files of WikiDiverse and WikiMEL that you used? I used the code you provided to create the tree.pkl file and found that its size is very different from the one you provided. When I test the model with my tree.pkl, the result in the w/o In-context Learning setting is only 65.95 on the WikiDiverse dataset; after replacing it with the tree.pkl you provided, the performance is 77.85. I need to know whether my entity_name lists differ from the ones you used. Thx.
I hope this finds you well. Sorry to bother you again; I hope you can share the entity_name_list you used to generate the tree.pkl files for the WikiDiverse and WikiMEL datasets. Thanks again.
Hi, friend, any update on the entity_list?
Hello,
No, the author did not reply to me, even though I sent a separate email to ask about it. The prefix_tree_opt.pkl I generated myself from the benchmark dataset's entity list is completely different from the one provided by the author.
Ok, thank you. Let's wait and hope for a reply, I guess.
No, half a year has passed and there is still no hope, so don't count on it.
Hi, folks. Nice work. I need to know how to create prefix_tree_opt.pkl for a new dataset; I want to apply this approach to another task.