Iterating over values of a column in the IterableDataset #7381

Open
TopCoder2K opened this issue Jan 28, 2025 · 2 comments
Labels
enhancement New feature or request

TopCoder2K commented Jan 28, 2025

Feature request

I would like to be able to iterate (and re-iterate if needed) over a single column of an IterableDataset instance. The following example shows the proposed API:

from datasets import IterableDataset

def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}

ds = IterableDataset.from_generator(gen)
texts = ds["text"]  # proposed API: a re-iterable view over the "text" column

for v in texts:
    print(v)  # Prints "Good" and "Bad"

for v in texts:
    print(v)  # Prints "Good" and "Bad" again

Motivation

In real-world problems, huge NNs like Transformers are not always the best option, so there is a need to experiment with different methods. While 🤗Datasets is perfectly adapted to 🤗Transformers, it can be inconvenient to use with other libraries. Retrieving a particular column is one such case (e.g., gensim's FastText requires only lists of strings, not dictionaries).
While there are ways to achieve the desired functionality, they are not good (forum). It would be great to have a built-in solution.
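To illustrate why re-iteration matters here, the sketch below shows the most obvious workaround, a generator expression over the rows, and how it is exhausted after a single pass. The `Rows` class is a hypothetical stand-in for an IterableDataset (assumed here to restart its stream on every `iter()` call); it is not part of 🤗Datasets.

```python
def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


class Rows:
    """Hypothetical stand-in for an IterableDataset: each iter() restarts the stream."""

    def __iter__(self):
        return gen()


rows = Rows()

# A generator expression extracts the column, but it can be consumed only once.
texts = (row["text"] for row in rows)

first_pass = list(texts)   # consumes the generator
second_pass = list(texts)  # nothing left to yield

print(first_pass)
print(second_pass)
```

A consumer that makes several passes over its input (for example, one pass per training epoch) would silently see no data on the second pass, which is why a re-iterable column object is requested rather than a plain generator.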

Your contribution

Theoretically, I can submit a PR, but I have very little knowledge of the internal structure of 🤗Datasets, so I may need some help.
Moreover, I can only work on weekends, since I have a full-time job. However, the feature does not seem to be in high demand, so there is no need to implement it as fast as possible.

TopCoder2K added the enhancement (New feature or request) label Jan 28, 2025
lhoestq (Member) commented Feb 3, 2025

I'd be in favor of that! I've seen many people implement their own iterables that wrap a dataset just to iterate over a single column, so this would make things more practical.
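A minimal sketch of the kind of user-side wrapper described above might look like the following. `ColumnView` and `RestartableRows` are hypothetical names invented for illustration, not part of 🤗Datasets; `RestartableRows` stands in for an IterableDataset and assumes the wrapped object restarts its stream each time it is iterated.

```python
from typing import Any, Callable, Iterable, Iterator, Mapping


class RestartableRows:
    """Hypothetical stand-in for an IterableDataset built from a generator function."""

    def __init__(self, generator_fn: Callable[[], Iterator[Mapping[str, Any]]]) -> None:
        self.generator_fn = generator_fn

    def __iter__(self) -> Iterator[Mapping[str, Any]]:
        # Calling the generator function again restarts the stream.
        return self.generator_fn()


class ColumnView:
    """Re-iterable view over one column of a re-iterable collection of dict rows."""

    def __init__(self, rows: Iterable[Mapping[str, Any]], column: str) -> None:
        self.rows = rows
        self.column = column

    def __iter__(self) -> Iterator[Any]:
        # A fresh pass over the underlying rows starts on every iteration.
        for row in self.rows:
            yield row[self.column]


def gen():
    yield {"text": "Good", "label": 0}
    yield {"text": "Bad", "label": 1}


ds = RestartableRows(gen)
texts = ColumnView(ds, "text")

for v in texts:
    print(v)  # first pass

for v in texts:
    print(v)  # second pass yields the same values again
```

Because `__iter__` is a generator method, every `for` loop gets a new iterator, so the view can be handed to code that makes several passes over its input. The wrapper stays lazy: no column values are materialized in memory.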

Kinda related: #5847

TopCoder2K (Author) commented

(For anyone's information, I'm going on vacation for the next 3 weeks, so the work is postponed. If anyone can implement this feature within the next 4 weeks, go ahead :) )
