
Missing a field in Entity Linking datasets #23

Open
dalek-who opened this issue Dec 20, 2023 · 2 comments

Comments

@dalek-who

Here is the EL (Entity Linking) data example provided in the README:

'23235546-1', # table id
'Ivan Lendl career statistics', # page title
'Singles: 19 finals (8 titles, 11 runner-ups)', # section title
'', # caption
['outcome', 'year', ...], # headers
[[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon'], ...], # cells, [index, entity mention (cell text)]
[['Björn Borg', 'Swedish tennis player', []], ['Björn Borg', 'Swedish swimmer', ['Swimmer']], ...], # candidate entities; this is the merged set over all cells. [entity name, entity description, entity types]
[0, 12, ...] # labels, this is the index of the gold entity in the candidate entities
[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

However, the final field:

[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell

is only provided in the test split; in the train and dev splits it is missing. How can this field be generated?
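For context, one way such a per-cell field could be reconstructed is by matching each cell's mention text against the names in the merged candidate list. This is only a hypothetical sketch (the function name `per_cell_candidates` and the surface-form matching are assumptions, not the repo's actual preprocessing):

```python
def per_cell_candidates(cells, candidate_entities):
    """Hypothetical sketch: for each cell, collect the indices of merged
    candidates whose entity name matches the cell's mention text. Assumes
    candidates were retrieved by surface-form match, which may differ from
    how the dataset was actually built."""
    result = []
    for _index, mention in cells:
        matches = [i for i, (name, _desc, _types) in enumerate(candidate_entities)
                   if name == mention]
        result.append(matches)
    return result

# Toy data shaped like the README example above.
cells = [[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon']]
candidates = [['Björn Borg', 'Swedish tennis player', []],
              ['Björn Borg', 'Swedish swimmer', ['Swimmer']],
              ['Wimbledon', 'tennis tournament', []]]
print(per_cell_candidates(cells, candidates))  # [[0, 1], [2]]
```

The real preprocessing may instead use a candidate-generation step (e.g. an alias table or retrieval model), in which case exact name matching would be too strict.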

@belerico

belerico commented Feb 8, 2024

I'm trying to understand the same here...

cc @xiang-deng @huan-sunrise

@xiang-deng
Contributor

Hi, as you can see in

table_id, pgTitle, secTitle, caption, headers, entities, candidate_entities, labels,_ = input_data

The field is not used for training. If I recall correctly, when tuning the model I compute the loss against all candidates for the table rather than per cell, as that is more efficient.

The field is used at test time to compute the final metric: if the model predicts an entity that is not in the candidate set associated with the specific cell, we ignore that prediction. That is why we only provide it for the test set. The logic is in evaluate_task.ipynb and data_processing.ipynb.
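The evaluation behavior described here could be sketched as follows. This is an illustrative reading of the comment, not the notebook code; in particular, the choice to mask out (rather than count as wrong) predictions outside a cell's candidate set is an assumption, and `cell_accuracy` is a hypothetical name:

```python
def cell_accuracy(predictions, labels, cell_candidates):
    """Hypothetical sketch of the described metric: a prediction is counted
    correct only if it equals the gold candidate index; predictions that fall
    outside a cell's own candidate set are ignored rather than scored."""
    correct = total = 0
    for pred, gold, cands in zip(predictions, labels, cell_candidates):
        if pred not in cands:
            continue  # prediction outside this cell's candidates: ignored
        total += 1
        correct += int(pred == gold)
    return correct / total if total else 0.0
```

For the authoritative logic, see the two notebooks named above.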

Let me know if you have other questions.
