
Question about train/validation set. #9

Open · PedroEstevesPT opened this issue May 10, 2021 · 20 comments
Labels: good first issue (Good for newcomers)

@PedroEstevesPT (Contributor):

Hi, I just have one question.

The Spider train set has 8659 instances, although it comes divided into train_spider.json, which has 7000, and train_others.json, which has the remaining 1659 and which most models use as a validation set.
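(A quick way to verify these counts locally, assuming the two files sit in the working directory:)

```python
import json

# Count the examples in each Spider training split.
for name in ("train_spider.json", "train_others.json"):
    with open(name, encoding="utf-8") as f:
        print(name, len(json.load(f)))  # 7000 and 1659, respectively
```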

I would like clarification on whether SmBop is trained on the 7000 instances or the full 8659 to achieve state-of-the-art performance.

I've been checking the config file from a high-level perspective, and I am not sure...

[screenshot of the config file]

Thanks for your work and attention

@OhadRubin (Owner):

Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.

@OhadRubin (Owner):

If anyone is interested in adding it, I would be very very grateful.

@Young1993:

> Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.

What should I do to add train_others.json to train_spider.json?

@OhadRubin (Owner):

> Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.
>
> What should I do to add train_others.json to train_spider.json?

I guess start by making the dataset reader accept a list, and pass it ["train_others.json", "train_spider.json"].
It might be as simple as that. If you have issues, let me know.

@Young1993 commented May 12, 2021:

Hi, I have tried it by simply modifying default.jsonnet, from

```jsonnet
"train_data_path": dataset_path + "train_spider.json",
```

to

```jsonnet
"train_data_path": [dataset_path + "train_spider.json", dataset_path + "train_others.json"],
```

and the spider.py file:

```python
def _read(self, file_path):  # file_path: str or list of str
    if isinstance(file_path, str) and file_path.endswith(".json"):
        yield from self._read_examples_file(file_path, "str")
    elif isinstance(file_path, list) and file_path[0].endswith(".json"):
        yield from self._read_examples_file(file_path, "list")
    else:
        raise ConfigurationError(f"Don't know how to read filetype of {file_path}")

def _read_examples_file(self, file_path, path_type):
    # cache_dir = os.path.join("cache", file_path.split("/")[-1])

    cnt = 0
    cache_buffer = []
    sent_set = set()
    # First replay everything already in the cache.
    for total_cnt, ins in self.cache:
        if cnt >= self._max_instances:
            break
        if ins is not None:
            yield ins
            cnt += 1
        sent_set.add(total_cnt)

    if path_type == "str":
        with open(file_path, "r") as data_file:
            json_obj = json.load(data_file)
            for total_cnt, ex in enumerate(json_obj):
                if cnt >= self._max_instances:
                    break
                if len(cache_buffer) > 50:
                    self.cache.write(cache_buffer)
                    cache_buffer = []
                if total_cnt in sent_set:
                    continue
                else:
                    ins = self.create_instance(ex)
                    cache_buffer.append([total_cnt, ins])
                if ins is not None:
                    yield ins
                    cnt += 1
    else:
        # Same loop as above, run once per file in the list.
        for tmp_file_path in file_path:
            with open(tmp_file_path, "r") as data_file:
                json_obj = json.load(data_file)
                for total_cnt, ex in enumerate(json_obj):
                    if cnt >= self._max_instances:
                        break
                    if len(cache_buffer) > 50:
                        self.cache.write(cache_buffer)
                        cache_buffer = []
                    if total_cnt in sent_set:
                        continue
                    else:
                        ins = self.create_instance(ex)
                        cache_buffer.append([total_cnt, ins])
                    if ins is not None:
                        yield ins
                        cnt += 1
    self.cache.write(cache_buffer)
```

@OhadRubin (Owner):

Great, let me know how it does on the dev set.
There's a bit of code duplication here, but we can solve this later.
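(A minimal sketch of how the duplication could eventually be collapsed: normalize file_path to a list up front, so a single loop serves both configurations. ConfigurationError is the same class already imported in spider.py.)

```python
def _read(self, file_path):
    # Sketch: wrap a single path in a one-element list so one code path
    # handles both cases; _read_examples_file would then always take a list.
    paths = [file_path] if isinstance(file_path, str) else list(file_path)
    if not all(p.endswith(".json") for p in paths):
        raise ConfigurationError(f"Don't know how to read filetype of {file_path}")
    yield from self._read_examples_file(paths)
```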

@PedroEstevesPT (Contributor · Author) commented May 14, 2021:

I have tried a different approach, which consisted of just changing the number of examples in the config file from 7000 to 8659 and concatenating train_spider.json and train_others.json via:

```python
import json

with open('/home/ubuntu/golem/data/spider/train_spider.json', encoding="latin1") as json_file:
    train_spiders = json.load(json_file)
print(len(train_spiders))

with open('/home/ubuntu/golem/data/spider/train_others.json', encoding="latin1") as json_file:
    train_others = json.load(json_file)
print(len(train_others))

# Overwrite train_spider.json with the concatenation of both splits.
total_train = train_spiders + train_others
with open("/home/ubuntu/golem/data/spider/train_spider.json", "w") as out_file:
    json.dump(total_train, out_file)
```

However, I am getting this error:
[screenshot of the error]

Any idea why, @OhadRubin? I will also try @Young1993's code in the meantime.

@OhadRubin (Owner):

Hey, try to modify the following line:

```python
sys.setrecursionlimit(2000)
```

Set it to around 4000, maybe.

@PedroEstevesPT (Contributor · Author):

@OhadRubin that was a good fix; I managed to preprocess some more instances, but now I am getting an uncaught exception:

[screenshot of the exception]

Any idea how to fix it, or should I just skip this instance?

@OhadRubin (Owner):

Just skip the instance.
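(A minimal sketch of what skipping looks like inside the reader loop in smbop/dataset_readers/spider.py, assuming the loop structure shown above; the broad except is deliberate, since the goal is just to drop unparseable examples:)

```python
try:
    ins = self.create_instance(ex)
except Exception:
    # A few train_others.json queries crash preprocessing; drop them.
    ins = None
cache_buffer.append([total_cnt, ins])
if ins is not None:
    yield ins
    cnt += 1
```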

@PedroEstevesPT (Contributor · Author):

I skipped the instance by applying a try/except in smbop/dataset_readers/spider.py and managed to pre-process the augmented data successfully.

[screenshot]

However, when training the model, a new error appears while validating:

[screenshot of the exception]

Do you have any clue what might be going on?
Thanks again

@PedroEstevesPT (Contributor · Author) commented May 16, 2021:

Nevermind, I managed to get the thing running after re-cloning the repository. Will report the results soon.

@Young1993:

> Great, let me know how it does on the dev set.
> There's a bit of code duplication here, but we can solve this later.

@OhadRubin
Hi, I just think that manually merging train_spider.json and train_others.json may be better?

Create a file merge_data.py:

```python
import json

def read_json(file):
    with open(file) as f:
        tmp = json.load(f)
        return tmp

spider = read_json('./dataset/train_spider.json')
other = read_json('./dataset/train_others.json')
train = spider + other

with open('./dataset/train.json', 'w') as f:
    json.dump(train, f, indent=4)
```

Then just modify the config, and it's OK.
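(i.e., something along these lines in default.jsonnet, pointing the training path at the merged file:)

```jsonnet
"train_data_path": dataset_path + "train.json",
```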

@huybery commented May 18, 2021:

Marking this; expecting the final results.

@PedroEstevesPT (Contributor · Author) commented May 18, 2021:

I have finished training and evaluating SmBop with the 8659 instances instead of just the 7000. Despite only achieving 72.1 in both Execution Accuracy and Exact Set Match Accuracy, I obtained 68.6 in Execution Accuracy with the test suite from https://github.com/taoyds/test-suite-sql-eval (which includes 25-60 extra DBs for each query for more reliable validation). This 68.6 is better than the 67.6 one gets from the pre-trained model provided on this page, despite the values for ESM and Execution Accuracy being worse.

[screenshot of the results]

Test Suite Accuracy gives the best estimate of model performance, because the DB content is often insufficient to distinguish different queries.

Ex: if the table has just one row with the age of an employee, SELECT max(age) FROM employees and SELECT min(age) FROM employees would return the same result.

I've also run other models and so far SmBop is the best: although I already got this 68.6 for another model, its ESM value was lower.

EDIT [23-5-2021]: Nevermind, I re-ran the pre-trained model and got 76.2% Execution Accuracy with fuzzing on the development set (my script was extracting the query from an older model...). So yeah, SmBop, especially the pre-trained model, blows the others out of the water.

EDIT2 [28-5-2021]: The explanation for the exec accuracy being higher with fuzzing than the normal exec accuracy is that for fuzzing I used "plug value", which means I did not take into account the terminals predicted by the model and used the gold ones instead, evaluating only the query structure.
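(For reference, a test-suite evaluation along these lines; the flags follow the test-suite-sql-eval README, and the file paths here are placeholders:)

```bash
python evaluation.py --gold gold.txt --pred pred.txt \
    --db database/ --table tables.json --etype exec --plug_value
```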

@anumi1999 commented Jul 5, 2021:

Hi @Muradean, I tried to train this on my dataset, so I manually changed dev.json and train_spider.json and changed the cache files. I am able to preprocess my data; however, while training it's not taking my whole dataset. While validating, it's taking my whole validation set, though. What am I missing?

@PedroEstevesPT (Contributor · Author) commented Jul 5, 2021:

Hmm... that seems weird, @anumi1999. I just merged the datasets (in this case train_spider.json and train_others.json) like this:

```python
import json
import random

with open('train_spider.json', encoding="utf-8") as json_file:
    train_spider = json.load(json_file)

with open('train_others.json', encoding="utf-8") as json_file:
    train_others = json.load(json_file)

total_train = train_spider + train_others
random.shuffle(total_train)

# Overwrite train_spider.json with the shuffled, merged data.
with open("train_spider.json", "w") as out_file:
    json.dump(total_train, out_file, indent=1)
```

And that's it... This way I did not have to change any config file. You can create a copy of train_spider.json under another name first, to avoid losing it, as in the snippet below.
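(For instance, a backup along these lines; the backup filename is just an example:)

```python
import shutil

# Keep the original 7000-example split around before overwriting it.
shutil.copy("train_spider.json", "train_spider_original.json")
```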

@anumi1999 commented Jul 5, 2021:

@Muradean Actually, my data has fewer examples than the number given in Spider, so that's why I changed the number of examples in the config. I made the new cache because it was using the old data and not my data.

@alan-ai-learner:

Hi @Muradean, can you please share your newly trained model and the code changes you made for training the model on the concatenated datasets (train_spider + train_others)? It would be a great help.
Thanks a lot!

@mit0110 commented Sep 24, 2021:

Hi! I ran into the same issue trying to train on the complete dataset, and I noticed you also get an exception in the smbop/utils/ra_preproc.py:codegen_agg function with queries like "SELECT DISTINCT COUNT" and "SELECT DISTINCT MAX". There are no such queries in the train_spider.json or val.json datasets, but there are some in train_others.json. If I solve the issue or find any more, I'll let you know.
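(A quick way to locate the offending queries in the data files, assuming they sit in the working directory; the regex just looks for DISTINCT followed by an aggregate:)

```python
import json
import re

# Find queries of the form "SELECT DISTINCT <agg>(...)", which crash codegen_agg.
pattern = re.compile(r"DISTINCT\s+(COUNT|MAX|MIN|SUM|AVG)\s*\(", re.IGNORECASE)
for name in ("train_spider.json", "train_others.json"):
    with open(name, encoding="utf-8") as f:
        for ex in json.load(f):
            if pattern.search(ex["query"]):
                print(name, "|", ex["query"])
```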
