Question about train/validation set. #9
Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I had been meaning to add the train_others.json examples, but I was really busy with other things.
If anyone is interested in adding it, I would be very grateful.
What should I do to add train_others.json to train_spider.json?
I guess start by making the dataset reader accept a list, and pass it ["train_others.json", "train_spider.json"].
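As a rough illustration of that suggestion (this is not the actual SmBop reader; the function name read_examples and its signature are assumptions), accepting either a single path or a list of paths could look like this:

```python
import json

def read_examples(file_paths):
    """Yield examples from a single Spider-format JSON file or a list of them."""
    if isinstance(file_paths, str):  # keep backward compatibility with a lone path
        file_paths = [file_paths]
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            # Spider-format files are JSON arrays of example dicts
            yield from json.load(f)

# e.g. read_examples(["train_others.json", "train_spider.json"])
```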
Hi, I have tried it by simply modifying default.jsonnet and the spider.py file:
Great. Let me know how it does on the dev set.
I have tried a different approach: I changed the number of examples in the config file from 7000 to 8659 and concatenated train_spider.json and train_others.json with roughly:

```python
import json

with open('/home/ubuntu/golem/data/spider/train_others.json', encoding="latin1") as json_file:
    train_others = json.load(json_file)
out_file = open("/home/ubuntu/golem/data/spider/train_spider.json", "w")
```

However, I am getting an error. Any idea why, @OhadRubin? I will also try @Young1993's code in the meantime.
Hey, try modifying the following line:
@OhadRubin that was a good fix; I managed to preprocess some more instances, but now I am getting an uncaught exception. Any idea how to fix it, or should I just skip this instance?
Just skip the instance.
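In code, "skip the instance" could look like the sketch below; process_instance is a placeholder for whatever per-example preprocessing call raises, not a real SmBop function:

```python
def process_instance(example):
    """Placeholder for the real preprocessing step that may raise."""
    return example

examples = [{"query": "SELECT 1"}]  # stand-in data for illustration
kept = []
for ex in examples:
    try:
        kept.append(process_instance(ex))
    except Exception as err:
        # Drop the offending example instead of crashing the whole run
        print(f"skipping instance: {err}")
```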
Never mind, I managed to get it running after re-cloning the repository. Will report the results soon.
@OhadRubin I created a file merge_data.py and then modified the config, and it works.
Marking this thread; expecting the final results.
I have finished training and evaluating SmBop with the 8659 instances instead of just the 7000. Although I only achieved 72.1 in both Execution Accuracy and Exact Set Match Accuracy, I obtained 68.6 in Execution Accuracy with the test-suite databases from https://github.com/taoyds/test-suite-sql-eval (which adds 25-60 extra DBs per query for more reliable validation). This 68.6 is better than the 67.6 one gets from the pre-trained model provided on this page, even though the ESM and plain Execution Accuracy values are worse.

Test-suite accuracy gives the best estimate of model performance, because the DB content is often insufficient to distinguish different queries. For example, if the employees table has just one row, SELECT max(age) FROM employees and SELECT min(age) FROM employees return the same result.

I've also run other models, and so far SmBop is the best: another model also reached 68.6, but its ESM was lower.

EDIT [23-5-2021]: Never mind, I re-ran the pre-trained model and got 76.2% Execution Accuracy with fuzzing on the development set (my script was extracting the query from an older model...). So yes, SmBop, especially the pre-trained model, blows the others out of the water.

EDIT2 [28-5-2021]: The explanation for the execution accuracy being higher with fuzzing than the normal execution accuracy is that for fuzzing I used "plug value", which means I did not use the terminals predicted by the model but the gold ones, so only the query structure was evaluated.
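To make the one-row example above concrete, here is a self-contained SQLite snippet (table and value invented for illustration) showing two different queries that are indistinguishable by execution result on sparse data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (age INTEGER)")
conn.execute("INSERT INTO employees VALUES (41)")

# With a single row, max and min return the same result,
# so plain execution accuracy cannot tell the queries apart.
print(conn.execute("SELECT max(age) FROM employees").fetchall())  # [(41,)]
print(conn.execute("SELECT min(age) FROM employees").fetchall())  # [(41,)]
```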
Hi @Muradean, I tried to train it on my own dataset, so I manually changed dev.json and train_spider.json and changed the cache files. I am able to preprocess my data, but while training it is not using my whole dataset, even though during validation it does use my whole validation set. What am I missing?
Hmm... that seems weird, @anumi1999. I just merged the datasets (in this case train_spider.json and train_others.json) like this:

```python
import json
import random

# Load both Spider training files
with open('train_spider.json', encoding="utf-8") as json_file:
    train_spider = json.load(json_file)
with open('train_others.json', encoding="utf-8") as json_file:
    train_others = json.load(json_file)

# Concatenate, shuffle, and write the merged set back out
total_train = train_spider + train_others
random.shuffle(total_train)

out_file = open("train_spider.json", "w")
json.dump(total_train, out_file)
```

And that's it... This way I did not have to change any config file. You can create a copy of train_spider.json and give it another name in order to avoid losing it.
@Muradean Actually my data has fewer examples than the number given for Spider, so that's why I changed the number of examples in the config. I made the new cache because it was otherwise using the old data and not my data.
Hi @Muradean, can you please share your newly trained model and the code changes you made for training the model on the concatenated datasets (train_spider + train_others)? It would be a great help.
Hi! I ran into the same issue trying to train on the complete dataset, and I noticed you also get an exception in the smbop/utils/ra_preproc.py codegen_agg function with queries like "SELECT DISTINCT COUNT" and "SELECT DISTINCT MAX". There are no such queries in train_spider.json or val.json, but there are some in train_others.json. If I solve the issue or find any more, I'll let you know.
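For anyone who wants to locate those examples, a quick hypothetical scan of a Spider-format file might look like the following (the "query" and "db_id" fields are part of the public Spider data format; the exact set of problematic queries is an assumption based on the comment above):

```python
import json
import re

# Aggregates applied directly to DISTINCT, e.g. "SELECT DISTINCT COUNT(...)"
pattern = re.compile(r"\bDISTINCT\s+(COUNT|MAX|MIN|SUM|AVG)\b", re.IGNORECASE)

with open("train_others.json", encoding="utf-8") as f:
    for example in json.load(f):
        if pattern.search(example["query"]):
            print(example["db_id"], "|", example["query"])
```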
Hi, this is just one question.
The Spider train set has 8659 instances, although it comes divided into train_spider.json, which has 7000, and train_others.json, which has the remaining instances and is used in most models as a validation set.
I would like a clarification on whether SmBop is trained on the 7000 instances or the full 8659 to achieve state-of-the-art performance.
I've been checking the config file from a high-level perspective, and I am not sure...
Thanks for your work and attention.