The following command will list the 10 most popular datasets on HuggingFace whose names contain the substring emotion.
gft_summary --data H:__contains__emotion
The following commands will produce a long list of the datasets that are available on (H) HuggingFace and (P) PaddleNLP.
gft_summary --data H:__contains__ --topn 10000000 --fast_mode
gft_summary --data P:__contains__ --topn 10000000 --fast_mode
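The __contains__ syntax above can be read as a substring filter over dataset names, with --topn capping the number of results. A minimal sketch of that semantics in Python (the function name filter_datasets and the sample names are illustrative, not part of gft, and the sketch ignores the popularity ordering that gft presumably applies):

```python
def filter_datasets(names, substring, topn=10):
    """Return up to topn dataset names containing the substring.
    An empty substring matches every name, which is why
    __contains__ with no argument lists all datasets on the hub."""
    return [name for name in names if substring in name][:topn]

names = ["emotion", "go_emotions", "squad", "glue", "tweet_eval"]
print(filter_datasets(names, "emotion"))  # matches 'emotion' and 'go_emotions'
print(filter_datasets(names, ""))         # empty substring matches all five
```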
In addition to these datasets, one can also create models on the local disk with gft_fit (or any other method).
Many gft functions take a --data argument. The value of this argument is typically data_provider:data_key, where data_provider is a hub (H for HuggingFace, P for PaddleNLP) and data_key is the name of a dataset supported by that hub.
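The data_provider:data_key convention can be sketched as a small parser (the function name parse_data_arg is illustrative and gft's actual parsing may differ; treating a value with no colon as a local path is an assumption based on the remark about datasets on the local disk):

```python
def parse_data_arg(arg):
    """Split a --data value like 'H:glue' into (provider, key).
    A value with no ':' is assumed to name data on the local disk."""
    provider, sep, key = arg.partition(":")
    if not sep:
        return ("local", arg)
    return (provider, key)

print(parse_data_arg("H:glue"))        # ('H', 'glue')
print(parse_data_arg("my_data.csv"))   # ('local', 'my_data.csv')
```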
HuggingFace provides a convenient GUI for looking at datasets. Here are a few popular datasets:
Datasets typically contain splits such as train, val, and test, though some datasets have more (or fewer) splits. Splits go by various names; val, for example, is often called valid, validation, or dev.
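Because the same split goes by different names across datasets, code that consumes splits often normalizes the aliases up front. A minimal sketch (the alias table is illustrative, not gft's):

```python
# Map the common aliases onto canonical split names.
SPLIT_ALIASES = {
    "train": "train",
    "val": "validation", "valid": "validation",
    "validation": "validation", "dev": "validation",
    "test": "test",
}

def normalize_split(name):
    """Return the canonical split name; unknown names pass through unchanged."""
    return SPLIT_ALIASES.get(name.lower(), name)

print(normalize_split("dev"))    # validation
print(normalize_split("train"))  # train
```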
Each split typically contains a number of rows and columns. The squad dataset, for example, has 5 columns: id, title, context, question, answers. There are typically hundreds or thousands of rows. Rows are typically assumed to be iid, though this assumption is often more idealistic than realistic.
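The rows-and-columns view can be illustrated with a toy split shaped like squad (the rows below are invented for illustration; real splits have hundreds or thousands of rows):

```python
# A split as a list of rows, each row a dict keyed by column name.
COLUMNS = ["id", "title", "context", "question", "answers"]

train = [
    {"id": "1", "title": "Example",
     "context": "Paris is the capital of France.",
     "question": "What is the capital of France?",
     "answers": ["Paris"]},
    {"id": "2", "title": "Example",
     "context": "Water boils at 100 C at sea level.",
     "question": "At what temperature does water boil?",
     "answers": ["100 C"]},
]

print(len(train))               # number of rows: 2
print(sorted(train[0].keys()))  # the 5 column names, sorted
```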
There tend to be more papers on models than on datasets, and relatively little interest in what a dataset is testing. Is the dataset realistic? What is the motivation? Who is the intended audience? Lexicographers care about sampling and balance. Many of these datasets can be gamed: there are often (spurious) features in the training split that make it possible for a system to do well at predicting labels on the test split without mastering the task. This kind of leakage between training and test splits is referred to as spurious correlations. See here for a recent paper on this topic.
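The gaming problem can be made concrete with a toy example: suppose that, in both the train and test splits, every positive example happens to be longer than every negative one. A classifier keyed on length alone then scores perfectly without looking at the content at all (all data and function names below are invented for illustration):

```python
def fit_length_threshold(rows):
    """Pick a length cutoff separating the two labels in the training split."""
    pos = [len(text) for text, label in rows if label == 1]
    neg = [len(text) for text, label in rows if label == 0]
    return (min(pos) + max(neg)) / 2

def predict(text, threshold):
    """Classify on length alone -- a spurious feature, not the task itself."""
    return 1 if len(text) > threshold else 0

# Label 1 = positive sentiment; positives are spuriously longer than negatives.
train = [("a truly wonderful film", 1), ("an instant classic, loved it", 1),
         ("bad", 0), ("dull stuff", 0)]
test = [("one of the best movies this year", 1), ("awful", 0)]

threshold = fit_length_threshold(train)
accuracy = sum(predict(x, threshold) == y for x, y in test) / len(test)
print(accuracy)  # 1.0 -- a perfect score, with no grasp of sentiment at all
```

The spurious feature (length) leaks from train to test, so the test score badly overstates how well the system has mastered the task.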
Psychology has a considerable literature on validity and reliability. This discussion presupposes the existence of a hypothesis: datasets are collected in order to test that hypothesis. There tends to be less emphasis on hypothesis testing in machine learning.