
Question about train/validation set. #9

Open · PedroEstevesPT opened this issue May 10, 2021 · 20 comments
Labels: good first issue (Good for newcomers)

@PedroEstevesPT (Contributor):

Hi, I just have one question.

The Spider train set has 8659 instances, although it comes divided into train_spider.json, which has 7000, and train_others.json, which has the remaining 1659 and which most models use as a validation set.
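(A quick way to verify these counts locally, assuming the two files sit in the working directory:)

```python
import json

# Count the examples in each Spider training split.
for name in ("train_spider.json", "train_others.json"):
    with open(name, encoding="utf-8") as f:
        print(name, len(json.load(f)))  # 7000 and 1659, respectively
```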

I would like clarification on whether SmBop is trained on the 7000 instances or the full 8659 to achieve state-of-the-art performance.

I've been checking the config file from a high-level perspective, and I am not sure...

[screenshot of the config file]

Thanks for your work and attention

@OhadRubin (Owner):

Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.

@OhadRubin (Owner):

If anyone is interested in adding it, I would be very very grateful.

@Young1993:

> Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.

What should I do to add train_others.json to train_spider.json?

@OhadRubin (Owner):

> Yes, we are training just on the 7000 examples of train_spider.json to get to SOTA. I was meaning to add the train_others.json examples, but I've been really busy with other things.
>
> What should I do to add train_others.json to train_spider.json?

I guess start by making the dataset reader accept a list, and pass it ["train_others.json", "train_spider.json"].
It might be as simple as that. If you have issues, let me know.

@Young1993 commented May 12, 2021:

Hi, I have tried it by simply modifying default.jsonnet, from

```jsonnet
"train_data_path": dataset_path + "train_spider.json",
```

to

```jsonnet
"train_data_path": [dataset_path + "train_spider.json", dataset_path + "train_others.json"],
```

and the spider.py file:

```python
def _read(self, file_path):  # file_path: str or list of str
    if isinstance(file_path, str) and file_path.endswith(".json"):
        yield from self._read_examples_file(file_path, "str")
    elif isinstance(file_path, list) and file_path[0].endswith(".json"):
        yield from self._read_examples_file(file_path, "list")
    else:
        raise ConfigurationError(f"Don't know how to read filetype of {file_path}")

def _read_examples_file(self, file_path, path_type):
    # cache_dir = os.path.join("cache", file_path.split("/")[-1])

    cnt = 0
    cache_buffer = []
    sent_set = set()
    # First replay everything already in the cache.
    for total_cnt, ins in self.cache:
        if cnt >= self._max_instances:
            break
        if ins is not None:
            yield ins
            cnt += 1
        sent_set.add(total_cnt)

    if path_type == "str":
        with open(file_path, "r") as data_file:
            json_obj = json.load(data_file)
            for total_cnt, ex in enumerate(json_obj):
                if cnt >= self._max_instances:
                    break
                if len(cache_buffer) > 50:
                    self.cache.write(cache_buffer)
                    cache_buffer = []
                if total_cnt in sent_set:
                    continue
                else:
                    ins = self.create_instance(ex)
                    cache_buffer.append([total_cnt, ins])
                if ins is not None:
                    yield ins
                    cnt += 1
    else:
        # Same loop as above, run once per file in the list.
        for tmp_file_path in file_path:
            with open(tmp_file_path, "r") as data_file:
                json_obj = json.load(data_file)
                for total_cnt, ex in enumerate(json_obj):
                    if cnt >= self._max_instances:
                        break
                    if len(cache_buffer) > 50:
                        self.cache.write(cache_buffer)
                        cache_buffer = []
                    if total_cnt in sent_set:
                        continue
                    else:
                        ins = self.create_instance(ex)
                        cache_buffer.append([total_cnt, ins])
                    if ins is not None:
                        yield ins
                        cnt += 1
    self.cache.write(cache_buffer)
```

@OhadRubin (Owner):

Great, let me know how it does on the dev set.
There's a bit of code duplication here, but we can solve this later.
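(A minimal sketch of how the duplication could eventually be collapsed: normalize file_path to a list up front, so a single loop serves both configurations. ConfigurationError is the same class already imported in spider.py.)

```python
def _read(self, file_path):
    # Sketch: wrap a single path in a one-element list so one code path
    # handles both cases; _read_examples_file would then always take a list.
    paths = [file_path] if isinstance(file_path, str) else list(file_path)
    if not all(p.endswith(".json") for p in paths):
        raise ConfigurationError(f"Don't know how to read filetype of {file_path}")
    yield from self._read_examples_file(paths)
```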

@PedroEstevesPT (Contributor · Author) commented May 14, 2021:

I have tried a different approach, which consisted of just changing the number of examples in the config file from 7000 to 8659 and concatenating train_spider.json and train_others.json via:

```python
import json

with open('/home/ubuntu/golem/data/spider/train_spider.json', encoding="latin1") as json_file:
    train_spiders = json.load(json_file)
print(len(train_spiders))

with open('/home/ubuntu/golem/data/spider/train_others.json', encoding="latin1") as json_file:
    train_others = json.load(json_file)
print(len(train_others))

# Overwrite train_spider.json with the concatenation of both splits.
total_train = train_spiders + train_others
with open("/home/ubuntu/golem/data/spider/train_spider.json", "w") as out_file:
    json.dump(total_train, out_file)
```

However, I am getting this error:
[screenshot of the error]

Any idea why, @OhadRubin? I will also try @Young1993's code in the meantime.

@OhadRubin (Owner):

Hey, try to modify the following line:

```python
sys.setrecursionlimit(2000)
```

Set it to around 4000, maybe.

@PedroEstevesPT (Contributor · Author):

@OhadRubin that was a good fix; I managed to preprocess some more instances, but now I am getting an uncaught exception:

[screenshot of the exception]

Any idea how to fix it, or should I just skip this instance?

@OhadRubin (Owner):

Just skip the instance.
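(A minimal sketch of what skipping looks like inside the reader loop in smbop/dataset_readers/spider.py, assuming the loop structure shown above; the broad except is deliberate, since the goal is just to drop unparseable examples:)

```python
try:
    ins = self.create_instance(ex)
except Exception:
    # A few train_others.json queries crash preprocessing; drop them.
    ins = None
cache_buffer.append([total_cnt, ins])
if ins is not None:
    yield ins
    cnt += 1
```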

@PedroEstevesPT (Contributor · Author):

I skipped the instance by applying a try/except in smbop/dataset_readers/spider.py and managed to pre-process the augmented data successfully.

[screenshot]

However, when training the model, a new error appears while validating:

[screenshot of the exception]

Do you have any clue what might be going on?
Thanks again

@PedroEstevesPT (Contributor · Author) commented May 16, 2021:

Nevermind, I managed to get the thing running after re-cloning the repository. Will report the results soon.

@Young1993:

> Great, let me know how it does on the dev set.
> There's a bit of code duplication here, but we can solve this later.

@OhadRubin
Hi, I just think that manually merging train_spider.json and train_others.json may be better?

Create a file merge_data.py:

```python
import json

def read_json(file):
    with open(file) as f:
        tmp = json.load(f)
        return tmp

spider = read_json('./dataset/train_spider.json')
other = read_json('./dataset/train_others.json')
train = spider + other

with open('./dataset/train.json', 'w') as f:
    json.dump(train, f, indent=4)
```

Then just modify the config, and it's OK.
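(i.e., something along these lines in default.jsonnet, pointing the training path at the merged file:)

```jsonnet
"train_data_path": dataset_path + "train.json",
```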

@huybery commented May 18, 2021:

Marking this; expecting the final results.

@PedroEstevesPT (Contributor · Author) commented May 18, 2021:

I have finished training and evaluating SmBop with the 8659 instances instead of just the 7000. Despite only achieving 72.1 in both Execution Accuracy and Exact Set Match Accuracy, I obtained 68.6 in Execution Accuracy with the test suite from https://github.com/taoyds/test-suite-sql-eval (which includes 25-60 extra DBs for each query for more reliable validation). This 68.6 is better than the 67.6 one gets from the pre-trained model provided on this page, despite the values for ESM and Execution Accuracy being worse.

[screenshot of the results]

Test Suite Accuracy gives the best estimate of model performance, because the DB content is often insufficient to distinguish different queries.

Ex: if the table has just one row with the age of an employee, SELECT max(age) FROM employees and SELECT min(age) FROM employees would return the same result.

I've also run other models and so far SmBop is the best: although I already got this 68.6 for another model, its ESM value was lower.

EDIT [23-5-2021]: Nevermind, I re-ran the pre-trained model and got 76.2% Execution Accuracy with fuzzing on the development set (my script was extracting the query from an older model...). So yeah, SmBop, especially the pre-trained model, blows the others out of the water.

EDIT2 [28-5-2021]: The explanation for the exec accuracy being higher with fuzzing than the normal exec accuracy is that for fuzzing I used "plug value", which means I did not take into account the terminals predicted by the model and used the gold ones instead, evaluating only the query structure.
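(For reference, a test-suite evaluation along these lines; the flags follow the test-suite-sql-eval README, and the file paths here are placeholders:)

```bash
python evaluation.py --gold gold.txt --pred pred.txt \
    --db database/ --table tables.json --etype exec --plug_value
```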

@anumi1999 commented Jul 5, 2021:

Hi @Muradean, I tried to train this on my dataset, so I manually changed dev.json and train_spider.json and changed the cache files. I am able to preprocess my data; however, while training it's not taking my whole dataset. While validating, it's taking my whole validation set, though. What am I missing?

@PedroEstevesPT (Contributor · Author) commented Jul 5, 2021:

Hmm... that seems weird, @anumi1999. I just merged the datasets (in this case train_spider.json and train_others.json) like this:

```python
import json
import random

with open('train_spider.json', encoding="utf-8") as json_file:
    train_spider = json.load(json_file)

with open('train_others.json', encoding="utf-8") as json_file:
    train_others = json.load(json_file)

total_train = train_spider + train_others
random.shuffle(total_train)

# Overwrite train_spider.json with the shuffled, merged data.
with open("train_spider.json", "w") as out_file:
    json.dump(total_train, out_file, indent=1)
```

And that's it... This way I did not have to change any config file. You can create a copy of train_spider.json under another name first, to avoid losing it, as in the snippet below.
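(For instance, a backup along these lines; the backup filename is just an example:)

```python
import shutil

# Keep the original 7000-example split around before overwriting it.
shutil.copy("train_spider.json", "train_spider_original.json")
```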

@anumi1999 commented Jul 5, 2021:

@Muradean Actually, my data has fewer examples than the number given in Spider, so that's why I changed the number of examples in the config. I made the new cache because it was using the old data and not my data.

@alan-ai-learner:

Hi @Muradean, can you please share your newly trained model and the code changes you made for training the model on the concatenated datasets (train_spider + train_others)? It would be a great help.
Thanks a lot!

@mit0110 commented Sep 24, 2021:

Hi! I ran into the same issue trying to train on the complete dataset, and I noticed you also get an exception in the smbop/utils/ra_preproc.py:codegen_agg function with queries like "SELECT DISTINCT COUNT" and "SELECT DISTINCT MAX". There are no such queries in the train_spider.json or val.json datasets, but there are some in train_others.json. If I solve the issue or find any more, I'll let you know.
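(A quick way to locate the offending queries in the data files, assuming they sit in the working directory; the regex just looks for DISTINCT followed by an aggregate:)

```python
import json
import re

# Find queries of the form "SELECT DISTINCT <agg>(...)", which crash codegen_agg.
pattern = re.compile(r"DISTINCT\s+(COUNT|MAX|MIN|SUM|AVG)\s*\(", re.IGNORECASE)
for name in ("train_spider.json", "train_others.json"):
    with open(name, encoding="utf-8") as f:
        for ex in json.load(f):
            if pattern.search(ex["query"]):
                print(name, "|", ex["query"])
```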
