- The ones with _sentences contain the sentences and the conference information
- The ones without it are in IOB2 format
Information is provided only for the datasets without _sentences; each _sentences file contains the same sentences as its counterpart.
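For readers unfamiliar with IOB2, the sketch below shows how a tagged sentence can be turned back into dataset mentions. In IOB2 every mention starts with a B- tag and continues with I- tags, while O marks tokens outside any mention. The tag name DATASET, the function name and the example sentence are assumptions made for illustration; the label set actually used in the files may differ.

```python
def extract_mentions(tokens, tags):
    """Collect the token spans marked by IOB2 tags (B- starts, I- continues)."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close the previous one if it is still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            # An "O" tag (or a stray I- tag) ends any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hypothetical example sentence; "DATASET" is an assumed tag name.
tokens = ["We", "evaluate", "on", "Penn", "Treebank", "and", "ImageNet", "."]
tags = ["O", "O", "O", "B-DATASET", "I-DATASET", "O", "B-DATASET", "O"]
print(extract_mentions(tokens, tags))  # ['Penn Treebank', 'ImageNet']
```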
Dataset.csv - The whole dataset
Zero-shot.csv - Contains 200 sentences with datasets that do not occur in the train or test set
Train_set.csv - The 80% part of an 80/20 split of the sentences that remain after removing the zero-shot set from the whole dataset
Test_set.csv - The 20% part of the same 80/20 split (so also without the zero-shot set)
Test_set_real_ratio - A test set where only 1% of the sentences contain dataset names
Test_set_easy - 444 sentences from the test set that only contain dataset names with three or fewer words
Test_set_hard - 444 sentences from the test set that only contain dataset names with four or more words
SSC - A dataset containing only weakly supervised examples
SSC_positives - Only the positive weakly supervised examples
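Because the IOB2 files store one token per row, an 80/20 split has to be made on whole sentences rather than on rows. The sketch below is one way such a split could be reproduced; it is not necessarily how the provided files were generated. The 'Sentence #' column appears in the code further down, while the "Word" and "Tag" columns, the function name and the seed are assumptions.

```python
import random

import pandas as pd


def split_by_sentence(df, test_fraction=0.2, seed=42, sentence_col="Sentence #"):
    """Split an IOB2 token table into train/test parts without cutting sentences."""
    sentence_ids = df[sentence_col].unique().tolist()
    random.Random(seed).shuffle(sentence_ids)
    n_test = round(len(sentence_ids) * test_fraction)
    in_test = df[sentence_col].isin(sentence_ids[:n_test])
    return df[~in_test], df[in_test]


# Toy data: 10 sentences of 2 tokens each ("Word" and "Tag" are assumed columns).
toy = pd.DataFrame({
    "Sentence #": [f"Sentence: {i}" for i in range(10) for _ in range(2)],
    "Word": ["token"] * 20,
    "Tag": ["O"] * 20,
})
train, test = split_by_sentence(toy)
print(train["Sentence #"].nunique(), test["Sentence #"].nunique())  # 8 2
```

Splitting on sentence identifiers keeps all tokens of a sentence on the same side of the split.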
- The files with _cross-validation contain the code for cross-validation
- The other files contain the regular methods.
The code shows the standard setup: training on the train set and testing on the test set.
Code for the more specialised cases is provided below.
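import pandas as pd

# Sentence-level file (with conference information) and its IOB2 counterpart
DATAset = pd.read_csv('Dataset_sentences.csv')
BIOset = pd.read_csv('Dataset.csv')
# IDs of all sentences from the held-out conference
ids = DATAset[DATAset.conference == 'VISION'].id.to_list()
sentences = ['Sentence: ' + str(sentence_id) for sentence_id in ids]
# Train on every other conference, test on the held-out one
data = BIOset[~BIOset['Sentence #'].isin(sentences)]
test = BIOset[BIOset['Sentence #'].isin(sentences)]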
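The same conference-based split can be wrapped in a small function so that any conference can be held out. This is a minimal sketch on toy data: the column names id, conference and 'Sentence #' come from the snippet above, whereas the function name, the "Word" and "Tag" columns and the conference name "NLP" are hypothetical.

```python
import pandas as pd


def split_by_conference(sentence_df, iob_df, conference):
    """Return (train, test) IOB2 rows, holding out all sentences of one conference."""
    ids = sentence_df.loc[sentence_df["conference"] == conference, "id"]
    # The IOB2 files label sentences as 'Sentence: <id>'.
    held_out = [f"Sentence: {i}" for i in ids]
    mask = iob_df["Sentence #"].isin(held_out)
    return iob_df[~mask], iob_df[mask]


# Toy stand-ins for Dataset_sentences.csv and Dataset.csv.
sentence_df = pd.DataFrame({
    "id": [1, 2, 3],
    "conference": ["VISION", "NLP", "VISION"],  # "NLP" is a made-up name
})
iob_df = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2", "Sentence: 3"],
    "Word": ["ImageNet", "results", "We", "CIFAR-10"],
    "Tag": ["B-DATASET", "O", "O", "B-DATASET"],
})
train, test = split_by_conference(sentence_df, iob_df, "VISION")
print(len(train), len(test))  # 1 3
```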
This is done via slicing: slices of the stratified 20-fold split are used, and each slice is added to the previous ones.
skf = StratifiedKFold(20, shuffle=True, random_state=42)
The right data is selected using slicing, for example: np.array(X_tr)[:2168]
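The idea of growing the training set slice by slice can be sketched as follows. Each fold of a stratified 20-fold split is treated as one 5% slice, and the slices are accumulated so that the subsets cover 5%, 10%, ..., 100% of the data. The function name, the toy data and the choice to stratify on a binary "contains a dataset mention" label are assumptions, not the repository's exact code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cumulative_stratified_subsets(X, y, n_splits=20, seed=42):
    """Return index arrays of growing size, built by accumulating stratified folds."""
    skf = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
    subsets, collected = [], np.array([], dtype=int)
    for _, fold_idx in skf.split(X, y):
        # The held-out part of each fold is one stratified slice of the data.
        collected = np.concatenate([collected, fold_idx])
        subsets.append(collected.copy())
    return subsets


# Toy data: 100 sentences, 20 of which contain a dataset mention (label 1).
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)
subsets = cumulative_stratified_subsets(X, y)
print([len(s) for s in subsets][:3], len(subsets[-1]))
```

Training on X[subsets[k]] for increasing k then gives a learning curve in which every subset keeps roughly the same label ratio as the full set.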
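TRAINset = pd.read_csv('Train_set_sentences.csv')
# 'Geen' is Dutch for 'none'; it presumably marks sentences without a dataset mention
ids = TRAINset[~TRAINset.labels.str.contains('Geen')].id.to_list()
sentences = ['Sentence: ' + str(sentence_id) for sentence_id in ids]
# Sentences with (ds) and without (nds) dataset mentions
ds = data[data['Sentence #'].isin(sentences)]
nds = data[~data['Sentence #'].isin(sentences)]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([ds, nds])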
This is done by appending, for example:
data = pd.read_csv('Train_set.csv')
data2 = pd.read_csv('SSC.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([data, data2])