Language Modeling Datasets and Sampler #9514
Conversation
96a08a1 to a440c1f
python/mxnet/gluon/data/sampler.py
Outdated
Parameters
----------
length : int
    Length of the sequence.
interval : int
Sampling interval
Added docstring.
python/mxnet/gluon/data/sampler.py
Outdated
""" | ||
def __init__(self, length, interval): | ||
self._length = length | ||
self._interval = interval |
add range check?
python/mxnet/gluon/data/text.py
Outdated
    Path to temp folder for storing data.
segment : str, default 'train'
    Dataset segment. Options are 'train', 'validation', 'test'.
indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
I wouldn't expose this to users.
- Indexer is not the standard term for this.
- This is a contrib API and subject to change. Gluon Dataset should use a separate vocabulary API.
Thanks.
- I think it's safe to say that indexer is a clear enough term to reflect what it does.
- If I understand correctly, I believe the indexer class is intended to serve the same purpose as what you call 'vocabulary'
There is a reason this needs to be exposed. Suppose we have a training dataset whose vocabulary is {a, b, c} plus unknown tokens, and a test dataset whose vocabulary is {a, b, d}. As standard practice, the token 'd' in the test dataset should be indexed as unknown. This means the indexing of the test dataset depends on the index from the training dataset.
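A pure-Python sketch of that indexing contract (illustrative only; the names are hypothetical, not the TokenIndexer API):

```python
# Build the index from the training vocabulary only; reserve <unk> for
# anything unseen at training time.
train_tokens = ['a', 'b', 'c']
vocab = {'<unk>': 0}
for tok in train_tokens:
    vocab.setdefault(tok, len(vocab))

# Index the test set through the training vocabulary: 'd' was never
# seen during training, so it falls back to the <unk> index.
test_tokens = ['a', 'b', 'd']
indexed = [vocab.get(tok, vocab['<unk>']) for tok in test_tokens]
```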
- I don't think so. Do you have a reference of it being used somewhere?
- It is, but it is a contrib API. If you want to use it directly then gluon.data.text needs to be in gluon.contrib too.
We do need to expose something like this. But it can't be TokenIndexer.
Do you have something else in mind?
Also, what should I provide here in place of TokenIndexer? If you could help me understand the reasoning for "it can't be TokenIndexer", I can help propose alternatives too.
I think it's probably a safer bet to move the dataset to contrib first.
python/mxnet/gluon/data/text.py
Outdated
License: Creative Commons Attribution-ShareAlike

Each sample is a vector of length equal to the specified sequence length.
At the end of each sentence, an end-of-sentence token '<eos>' is added.
If seq_len doesn't respect sentence boundaries, why should samples end with eos?
Even though sentence boundaries are not considered when providing sample chunks, it's still necessary for the language model to be able to predict where sentences end. In that sense, these concepts are orthogonal.
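A sketch of how these two concerns compose (hypothetical example data, not the dataset's actual implementation): '<eos>' marks sentence ends inside the token stream, while chunking cuts the stream without regard to those boundaries.

```python
# Append '<eos>' to every sentence, then flatten into one token stream.
sentences = [['the', 'cat', 'sat'], ['it', 'ran']]
stream = [tok for sent in sentences for tok in sent + ['<eos>']]

# Cut the stream into fixed-length samples, ignoring sentence boundaries;
# a sample may start or end mid-sentence, but '<eos>' still appears
# wherever a sentence ended, so the model can learn to predict it.
seq_len = 4
samples = [stream[i:i + seq_len]
           for i in range(0, len(stream) - seq_len + 1, seq_len)]
```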
python/mxnet/gluon/data/text.py
Outdated
    The indexer to use for indexing the text dataset. If None, a default indexer is created.
seq_len : int, default 35
    The sequence length of each sample, regardless of the sentence boundary.
transform : function, default None
Dataset now has a transform API. Use that instead of adding a transform callback to every dataset.
Sure, I can take a look at that. What about vision? Are the existing transform options dropped? Where can I find relevant discussion?
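For illustration, the lazy-transform pattern can be sketched in pure Python (hypothetical classes that mimic the idea, not gluon's actual implementation):

```python
class SimpleDataset:
    """A minimal indexable dataset."""
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]

    def transform(self, fn):
        # Return a view that applies fn lazily on access, instead of
        # baking a transform callback into every dataset's constructor.
        return _LazyTransformDataset(self, fn)


class _LazyTransformDataset(SimpleDataset):
    def __init__(self, dataset, fn):
        self._dataset = dataset
        self._fn = fn

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        return self._fn(self._dataset[idx])
```

For example, `SimpleDataset([1, 2, 3]).transform(lambda x: x * 2)` yields doubled samples when indexed, without the base dataset knowing anything about the transform.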
python/mxnet/gluon/data/sampler.py
Outdated
--------
>>> sampler = gluon.data.IntervalSampler(13, interval=3)
>>> list(sampler)
[0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
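A standalone sketch reproducing the roll-over behavior shown in the docstring example (a simplified stand-in, not the PR's actual class):

```python
class IntervalSampler:
    def __init__(self, length, interval):
        assert 0 < interval < length, 'interval must be in (0, length)'
        self._length = length
        self._interval = interval

    def __iter__(self):
        # Walk the indices in strides of `interval`, then roll over to
        # the next start offset, so every index is yielded exactly once.
        for start in range(self._interval):
            for i in range(start, self._length, self._interval):
                yield i

    def __len__(self):
        return self._length
```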
why should it roll over at the end?
Is there a reason you think it shouldn't? I think this sampler should exhaust every sample in a dataset. If for some reason it needs to drop samples, for mini-batching for example, then a wrapper sampler should take care of that.
The name interval sampler suggests it should behave like [begin:end:step]
I see the confusion now. Should I add an option to specify whether to roll over?
This doesn't seem very generic anyway. I would put it in examples
@piiswrong This sampler is needed for any long-form text processing that requires passing hidden state from sample to sample. I'd expect repeated use of this, which is why I chose to put it here. Do you prefer to update its interface to handle roll-over, or to move this to contrib, or do you still prefer that it be dropped?
@piiswrong ping
This Dataset.indexer design doesn't work for the use case where you want to combine (or take the intersection of) the vocabularies of two datasets (like train and val).
Indeed, I wasn't considering such a case because it isn't good practice to index using anything other than the training set. That said, providing an option to update the input indexer should be sufficient to cover this case. Would that be OK?
Then you would have a problem when you want only the top 2000 tokens. Since this is not a very common use case, I think the current version is fine for contrib.
I am more concerned with the name.
@zhreshold thanks. @astonzhang shall we consider a different name for indexer, like the aforementioned "vocabulary"?
I would like to propose the following change for class names: Otherwise, having both Vocabulary and Glossary is likely confusing. Having both VocabularyEmbedding and TokenEmbedding is also likely confusing. Is the proposed change OK?
VocabularyEmbedding and PretrainedEmbedding don't sound like they would inherit from each other, and the concerns are unclear from the names alone. I probably won't remember which is which after a couple of weeks. Let's consider other names for those two.
How about:
CompositeEmbedding sounds more like what Glossary does.
I updated this PR based on the latest change in the text API naming. Also, I made the vocabulary a property of the dataset for exchanging the index. Feel free to comment; I'd like to get this merged once the 1.1 release is cut.
To address the concern of merging datasets based on frequencies, I made the frequencies (word counts) a property of the dataset too. This way, the user has control over how the vocabulary is built. Currently the tokenization is naive, and the next step should be a proper tokenizer class. Once that's available, the datasets should expose an option for specifying tokenizers.
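A sketch of what exposing counts enables (hypothetical values; `collections.Counter` stands in for the frequencies property): the user can merge counts across datasets and cap vocabulary size before building the index.

```python
from collections import Counter

# Token counts exposed by each dataset (illustrative values).
train_counts = Counter({'a': 5, 'b': 3, 'c': 1})
val_counts = Counter({'a': 2, 'd': 4})

# Merge counts and keep only the most frequent tokens, a policy the
# dataset itself no longer has to decide.
merged = train_counts + val_counts
top_tokens = [tok for tok, _ in merged.most_common(2)]
```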
@zhreshold @piiswrong pinging for another pass of review.
LGTM now
Thanks. I will wait another day before merging, in case @piiswrong has additional feedback.
Synced offline with @piiswrong: the current design is OK to check in to the contrib package.
* refactor dataset
* add interval sampler
* wikitext-2/-103
* update word language model
* address comments
* move interval sampler to contrib
* update
* add frequencies property
Description
Add the language modeling datasets wikitext-2 and wikitext-103. Add an interval sampler suitable for batched language model training. Update the word-language-model example.
Checklist
Essentials
- Passed code style checking (make lint)
Changes