load_dataset for CSV files not working #743
Thank you!

---
I think another good example is the following. Displayed error:

---
Hi, it seems I also can't read a csv file. I was trying with a dummy csv with only three rows.
I was using the HuggingFace image in Paperspace Gradient (datasets==1.1.3). The following code doesn't work:

It outputs the following:

However, loading from a pandas dataframe is working.

---
This is because `load_dataset` without a `split` argument returns a dictionary of datasets, with one entry per split. You can try:

```python
from datasets import load_dataset

dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")
print(dataset["train"][0])
```

Or if you want to directly get the train split:

```python
from datasets import load_dataset

dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",", split="train")
print(dataset[0])
```

---
Good point. Design question for us, though: should `load_dataset` return a single dataset object rather than a dictionary when only one file is given and no split is specified?

---
In this case the user expects to get only one dataset object instead of the dictionary of datasets, since only one csv file was specified without any split specification. For the other datasets, on the other hand, the user doesn't know the splits in advance, so I would keep the dictionary by default. What do you think?

---

Thanks for your quick response! I'm fine with specifying the split as @lhoestq suggested. My only concern is that when I'm loading from a python dict or pandas, the library returns a dataset instead of a dictionary of datasets when no split is specified. I know that they use a different function.
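For reference, a minimal sketch of that difference (assuming a toy dict and the `test_data.csv` from above; `Dataset.from_dict` and `Dataset.from_pandas` are the in-memory constructors in question):

```python
import pandas as pd
from datasets import Dataset, load_dataset

# In-memory constructors return a single Dataset object...
ds = Dataset.from_dict({"text": ["a", "b", "c"], "label": [0, 1, 0]})
print(type(ds).__name__)  # Dataset

df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
print(type(Dataset.from_pandas(df)).__name__)  # Dataset

# ...while load_dataset with no `split` returns a dict-like object keyed by split.
dsd = load_dataset('csv', data_files=['test_data.csv'])
print(list(dsd.keys()))  # ['train']
```

---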
I was running the above line, but got this error:

The data is the Amazon product data. I loaded the Video_Games_5.json.gz data into pandas and saved it as a csv file, and then loaded the csv file using the above code. I thought,

Thank you!

---
Hi! Indeed, since only one csv file is given and no split specification, you can load it directly as the train split:

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split="train")
```

And then to get both a train and test split you can do:

```python
dataset = dataset.train_test_split()
print(dataset.keys())
# ['train', 'test']
```

Also note that a csv dataset may have several available splits if it is defined this way:

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    "train": './amazon_data/Video_Games_5_train.csv',
    "test": './amazon_data/Video_Games_5_test.csv'
})
print(dataset.keys())
# ['train', 'test']
```
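A quick usage note on `train_test_split` (the split size and seed below are illustrative): it accepts a `test_size` fraction and a `seed` for reproducible shuffling.

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split="train")

# An 80/20 split with a fixed seed for reproducibility.
splits = dataset.train_test_split(test_size=0.2, seed=42)
print(len(splits["train"]), len(splits["test"]))
```

---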
Yes, maybe this would be good. I think having to select 'train' from the resulting object when the user gave no split information is confusing and unintuitive behavior.

---
I'm also facing the same issue when trying to load from a csv file locally:

```python
from nlp import load_dataset

dataset = load_dataset('csv', data_files='sample_data.csv')
```

Error when executed from Google Colab:

```
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-34-79a8d4f65ed6> in <module>()
      1 from nlp import load_dataset
----> 2 dataset = load_dataset('csv', data_files='sample_data.csv')

/usr/local/lib/python3.7/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    547     # Download and prepare data
    548     builder_instance.download_and_prepare(
--> 549         download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
    550     )
    551

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)
    461             if not downloaded_from_gcs:
    462                 self._download_and_prepare(
--> 463                     dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    464                 )
    465             # Sync info

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    535             try:
    536                 # Prepare split will record examples associated to the split
--> 537                 self._prepare_split(split_generator, **prepare_split_kwargs)
    538             except OSError:
    539                 raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _prepare_split(self, split_generator)
    863
    864         generator = self._generate_tables(**split_generator.gen_kwargs)
--> 865         for key, table in utils.tqdm(generator, unit=" tables", leave=False):
    866             writer.write_table(table)
    867         num_examples, num_bytes = writer.finalize()

/usr/local/lib/python3.7/dist-packages/tqdm/notebook.py in __iter__(self, *args, **kwargs)
    213     def __iter__(self, *args, **kwargs):
    214         try:
--> 215             for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
    216                 # return super(tqdm...) will not catch exception
    217                 yield obj

/usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
   1102                     fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1103
-> 1104         for obj in iterable:
   1105             yield obj
   1106             # Update and possibly print the progressbar.

/usr/local/lib/python3.7/dist-packages/nlp/datasets/csv/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b/csv.py in _generate_tables(self, files)
     78                 read_options=self.config.pa_read_options,
     79                 parse_options=self.config.pa_parse_options,
---> 80                 convert_options=self.config.convert_options,
     81             )
     82             yield i, pa_table

/usr/local/lib/python3.7/dist-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Expected 1 columns, got 8
```

Version:

---
Hi @kauvinlucas, you can use the latest version of `datasets` instead to do this:

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files='sample_data.csv')
```
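As a side note, and only an assumption about this particular file (the parser reported 8 columns where 1 was expected): column-count parse errors like this often come from a delimiter or header mismatch, which passing `delimiter` explicitly can rule out:

```python
from datasets import load_dataset

# Hypothetical: set the delimiter to whatever the file actually uses (',', ';', '\t', ...).
dataset = load_dataset('csv', data_files='sample_data.csv', delimiter=',', split='train')
print(dataset.column_names)
```

---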
Hi, running:

```python
from datasets import load_dataset

dataset = load_dataset('csv', data_files='sample.csv')
```

gives:

Versions:

---
Oh, I figured it out. According to issue #42387 from pandas, this new version does not accept None for both parameters (which was being done by the repo I'm testing). Downgrading to Pandas==1.0.4 and Python==3.8 worked.

---
Hi, versions:

The entire error message is below:

---
Hi! It looks like the error stacktrace doesn't match your code snippet. What error do you get when running this?
Can you check that both tsv files are in the same folder as the current working directory of your shell?

---
Hi @lhoestq, below is the entire error message after I moved both tsv files to the same directory. It's the same as what I got before.

---
Hi! Can you try running this snippet to check that the files can actually be found?

```python
import os
from datasets import load_dataset

data_files = {"train": "train.tsv", "test": "test.tsv"}
assert all(os.path.isfile(data_file) for data_file in data_files.values()), "Couldn't find files"

datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
print("success !")
```

This way all the code from `load_dataset` only runs once we know the files can be found.
Hi @lhoestq, below is what I got from the terminal after I copied and ran your code. I think the files themselves are fine, since there is no assertion error.

---
Hi, could this be a permission error? I think it fails to close the arrow file that contains the data from your CSVs in the cache. By default, datasets are cached in `~/.cache/huggingface/datasets`.
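If the default cache location isn't writable (for example on a locked-down remote machine), `load_dataset` accepts a `cache_dir` argument; a minimal sketch, with an illustrative path:

```python
from datasets import load_dataset

# Hypothetical path: any directory the current user can write to.
datasets = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "test": "test.tsv"},
    delimiter="\t",
    cache_dir="/home/me/datasets_cache",
)
```

---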
Thank you!! @lhoestq For some reason I don't have the default path for datasets to cache, maybe because I work on a remote system. The issue was solved after I passed the `cache_dir` argument.

---
This is the exact solution I had been looking for the whole afternoon. Thanks a lot!

---
Similar to #622, I've noticed there is a problem when trying to load a CSV file with datasets:

```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")
```

Displayed error:

```
...
ArrowInvalid: CSV parse error: Expected 2 columns, got 1
```

I should mention that when I tried to read data from
https://github.com/lhoestq/transformers/tree/custom-dataset-in-rag-retriever/examples/rag/test_data/my_knowledge_dataset.csv
it worked without a problem. I've read that there might be some problems with the \r character, so I've removed them from the custom dataset, but the problem still remains. I've added a colab reproducing the bug, but unfortunately I cannot provide the dataset:
https://colab.research.google.com/drive/1Qzu7sC-frZVeniiWOwzoCe_UHZsrlxu8?usp=sharing

Is there any workaround for it?
Thank you
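For what it's worth, a hedged sketch of the workaround implied earlier in this thread (loading through pandas, which sidesteps the Arrow CSV reader; column names taken from the snippet above):

```python
import pandas as pd
from datasets import Dataset

# pandas is more forgiving about odd line endings; convert the DataFrame afterwards.
df = pd.read_csv("./sample_data.csv", sep="\t", names=["title", "text"])
dataset = Dataset.from_pandas(df)
print(dataset[0])
```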