Language Modeling Datasets and Sampler #9514
Conversation
96a08a1 to a440c1f
python/mxnet/gluon/data/sampler.py
Outdated
Parameters
----------
length : int
    Length of the sequence.
interval : int
Sampling interval
Added docstring.
python/mxnet/gluon/data/sampler.py
Outdated
""" | ||
def __init__(self, length, interval): | ||
self._length = length | ||
self._interval = interval |
add range check?
python/mxnet/gluon/data/text.py
Outdated
    Path to temp folder for storing data.
segment : str, default 'train'
    Dataset segment. Options are 'train', 'validation', 'test'.
indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
I wouldn't expose this to users.
- Indexer is not the standard term for this.
- This is a contrib API and subject to change. Gluon Dataset should use a separate vocabulary API.
Thanks.
- I think it's safe to say that indexer is a clear enough term to reflect what it does.
- If I understand correctly, I believe the indexer class is intended to serve the same purpose as what you call 'vocabulary'
There is a reason this needs to be exposed. Suppose we have a training dataset whose vocabulary is {a, b, c} plus unknown tokens, and a test dataset whose vocabulary is {a, b, d}. As standard practice, the token 'd' in the test dataset should be indexed as unknown. This means the indexing of the test dataset depends on the index from the training dataset.
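A pure-Python sketch of that indexing contract (illustrative only; the names are hypothetical, not the TokenIndexer API):

```python
# Build the index from the training vocabulary only; reserve <unk> for
# anything unseen at training time.
train_tokens = ['a', 'b', 'c']
vocab = {'<unk>': 0}
for tok in train_tokens:
    vocab.setdefault(tok, len(vocab))

# Index the test set through the training vocabulary: 'd' was never
# seen during training, so it falls back to the <unk> index.
test_tokens = ['a', 'b', 'd']
indexed = [vocab.get(tok, vocab['<unk>']) for tok in test_tokens]
```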
- I don't think so. Do you have a reference of it being used somewhere?
- It is, but it is a contrib API. If you want to use it directly then gluon.data.text needs to be in gluon.contrib too.
We do need to expose something like this. But it can't be TokenIndexer.
Do you have something else in mind?
Also, what should I provide here in place of TokenIndexer? If you could help me understand the reasoning for "it can't be TokenIndexer", I can help propose alternatives too.
I think it's probably a safer bet to move the dataset to contrib first.
python/mxnet/gluon/data/text.py
Outdated
License: Creative Commons Attribution-ShareAlike

Each sample is a vector of length equal to the specified sequence length.
At the end of each sentence, an end-of-sentence token '<eos>' is added.
If seq_len doesn't respect sentence boundaries, why should samples end with eos?
Even though sentence boundaries are not considered when providing sample chunks, it's still necessary for the language model to be able to predict where sentences end. In that sense, these concepts are orthogonal.
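A sketch of how these two concerns compose (hypothetical example data, not the dataset's actual implementation): '<eos>' marks sentence ends inside the token stream, while chunking cuts the stream without regard to those boundaries.

```python
# Append '<eos>' to every sentence, then flatten into one token stream.
sentences = [['the', 'cat', 'sat'], ['it', 'ran']]
stream = [tok for sent in sentences for tok in sent + ['<eos>']]

# Cut the stream into fixed-length samples, ignoring sentence boundaries;
# a sample may start or end mid-sentence, but '<eos>' still appears
# wherever a sentence ended, so the model can learn to predict it.
seq_len = 4
samples = [stream[i:i + seq_len]
           for i in range(0, len(stream) - seq_len + 1, seq_len)]
```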
python/mxnet/gluon/data/text.py
Outdated
    The indexer to use for indexing the text dataset. If None, a default indexer is created.
seq_len : int, default 35
    The sequence length of each sample, regardless of the sentence boundary.
transform : function, default None
Dataset now has a transform API. Use that instead of adding a transform callback to every dataset.
Sure, I can take a look at that. What about vision? Are the existing transform options dropped? Where can I find relevant discussion?
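For illustration, the lazy-transform pattern can be sketched in pure Python (hypothetical classes that mimic the idea, not gluon's actual implementation):

```python
class SimpleDataset:
    """A minimal indexable dataset."""
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]

    def transform(self, fn):
        # Return a view that applies fn lazily on access, instead of
        # baking a transform callback into every dataset's constructor.
        return _LazyTransformDataset(self, fn)


class _LazyTransformDataset(SimpleDataset):
    def __init__(self, dataset, fn):
        self._dataset = dataset
        self._fn = fn

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        return self._fn(self._dataset[idx])
```

For example, `SimpleDataset([1, 2, 3]).transform(lambda x: x * 2)` yields doubled samples when indexed, without the base dataset knowing anything about the transform.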
python/mxnet/gluon/data/sampler.py
Outdated
--------
>>> sampler = gluon.data.IntervalSampler(13, interval=3)
>>> list(sampler)
[0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
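A standalone sketch reproducing the roll-over behavior shown in the docstring example (a simplified stand-in, not the PR's actual class):

```python
class IntervalSampler:
    def __init__(self, length, interval):
        assert 0 < interval < length, 'interval must be in (0, length)'
        self._length = length
        self._interval = interval

    def __iter__(self):
        # Walk the indices in strides of `interval`, then roll over to
        # the next start offset, so every index is yielded exactly once.
        for start in range(self._interval):
            for i in range(start, self._length, self._interval):
                yield i

    def __len__(self):
        return self._length
```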
why should it roll over at the end?
Is there a reason you think it shouldn't? I think this sampler should exhaust every sample in a dataset. If for some reason it needs to drop samples, for mini-batching for example, then a wrapper sampler should take care of that.
The name interval sampler suggests it should behave like [begin:end:step]
I see the confusion now. Should I add an option to specify whether to roll over?
This doesn't seem very generic anyway. I would put it in examples
@piiswrong This sampler is needed for any long-form text processing that requires passing hidden state from sample to sample. I'd expect repeated use of this, which is why I chose to put it here. Do you prefer to update its interface to handle roll-over, or to move this to contrib, or do you still prefer that it be dropped?
@piiswrong ping
This Dataset.indexer design doesn't work for the use case where you want to combine (or take the intersection of) the vocabularies of two datasets (like train and val).
Indeed, I wasn't considering such a case because it isn't good practice to index using anything other than the training set. That said, providing an option to update the input indexer should be sufficient to cover this case. Would that be OK?
Then you would have a problem when you want only the top 2000 tokens. Since this is not a very common use case, I think the current version is fine for contrib.
I am more concerned with the name.
@zhreshold thanks. @astonzhang shall we consider a different name for indexer, like the aforementioned "vocabulary"?
I would like to propose the following change for class names: Otherwise, having both Vocabulary and Glossary is likely confusing. Having both VocabularyEmbedding and TokenEmbedding is also likely confusing. Is the proposed change OK?
VocabularyEmbedding and PretrainedEmbedding don't sound like they would inherit from each other, and the concerns are unclear from the names alone. I probably won't remember which is which after a couple of weeks. Let's consider other names for those two.
How about:
CompositeEmbedding sounds more like what Glossary does.
I updated this PR based on the latest change in the text API naming. Also, I made the vocabulary a property of the dataset for exchanging the index. Feel free to comment; I'd like to get this merged once the 1.1 release is cut.
To address the concern of merging datasets based on frequencies, I made the frequencies (word counts) a property of the dataset too. This way, the user has control over how the vocabulary is built. Currently the tokenization is naive, and the next step should be a proper tokenizer class. Once that's available, the datasets should expose an option for specifying tokenizers.
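A sketch of what exposing counts enables (hypothetical values; `collections.Counter` stands in for the frequencies property): the user can merge counts across datasets and cap vocabulary size before building the index.

```python
from collections import Counter

# Token counts exposed by each dataset (illustrative values).
train_counts = Counter({'a': 5, 'b': 3, 'c': 1})
val_counts = Counter({'a': 2, 'd': 4})

# Merge counts and keep only the most frequent tokens, a policy the
# dataset itself no longer has to decide.
merged = train_counts + val_counts
top_tokens = [tok for tok, _ in merged.most_common(2)]
```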
@zhreshold @piiswrong pinging for another pass of review.
LGTM now
Thanks. I will wait another day before merging, in case @piiswrong has additional feedback.
Synced offline with @piiswrong: the current design is OK to check in to the contrib package.
* refactor dataset
* add interval sampler
* wikitext-2/-103
* update word language model
* address comments
* move interval sampler to contrib
* update
* add frequencies property
Description
Add the language modeling datasets wikitext-2 and wikitext-103. Add an interval sampler suitable for batched language model training. Update the word-language-model example.
Checklist
Essentials
- Passed code style checking (make lint)
Changes