Change the way NDArrayIter handle the last batch #12285

stu1130 · 2018-08-21T23:18:03Z

Description

Update some behaviors of NDArrayIter.

Change the functionality of class NDArrayIter last_batch_handle = roll_over.
Change the timing of shuffling. Previously, it shuffles only during the initialization, which didn't meet training needs.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

shuffle when calling the reset.
when last_batch_handle is roll_over, the last batch should be rolled over to next epoch
add the unit test for three last_batch_handle params

Comments

N/A

…ordingly

anirudhacharya · 2018-08-21T23:32:55Z

python/mxnet/io.py

@@ -635,6 +633,9 @@ class NDArrayIter(DataIter):
        How to handle the last batch. This parameter can be 'pad', 'discard' or
        'roll_over'. 'roll_over' is intended for training and can cause problems
        if used for prediction.
+        If 'pad', the last batch will be padded with data starting from the begining
+        If 'discard', the last batch will be discarded
+        If 'roll_over', the remaining elements will be rolled over to the next iteration


'roll_over' option is described in line 634. Can this be merged with that rather than describe 'roll_over' twice.

chinakook · 2018-08-22T05:28:42Z

What's the behavior of 'roll_over'? I think the doc is not very clear.

stu1130 · 2018-08-22T16:32:26Z

roll_over means the remaining elements of the last batch would be rolled over to next iteration if the last batch is less than the batch size

anirudhacharya · 2018-08-22T08:19:10Z

python/mxnet/io.py

-        'roll_over'. 'roll_over' is intended for training and can cause problems
-        if used for prediction.
+        'roll_over'.
+        If 'pad', the last batch will be padded with data starting from the begining


how is pad and roll_over different, it is not clear in the documentation? In both it would seem you are taking data from the first batch of off the next epoch and adding it to the current last batch

Let say data look like this [1,2,3,4,5,6,7,8,9,10] with batch_size 3
pad would be like [1,2,3],...[7,8,9],[10,1,2], while roll_over would be [1,2,3],...[7,8,9] and second iteration would be [10,1,2], [3,4,5], [6,7,8] after calling reset().
I've updated example starting from line 610

Yeah, It's so clear with an example.

anirudhacharya · 2018-08-22T08:34:32Z

python/mxnet/io.py

        else:
            self.cursor = -self.batch_size

    def iter_next(self):
+        """Increments the coursor and check current cursor if exceed num of data."""


This doc string does not make sense and has mistakes. What is num of data?

anirudhacharya · 2018-08-22T08:40:14Z

tests/python/unittest/test_io.py

-        else:
-            assert(labelcount[i] == 100)
+def _test_last_batch_handle(data, labels):
+    idx = 0


why is this being initialized here? It is not needed as you do for idx in range(... later.

anirudhacharya · 2018-08-22T08:41:49Z

tests/python/unittest/test_io.py


+def test_NDArrayIter():
+    data, labels = _init_NDArrayIter_data()


please add doc string for this method. same for other methods in this module.

We don't need to add doc string for the test

it would be good to have a couple of comments describing what use-cases are being tested?

anirudhacharya · 2018-08-22T08:42:10Z

tests/python/unittest/test_io.py

@@ -87,82 +87,68 @@ def test_Cifar10Rec():
    for i in range(10):
        assert(labelcount[i] == 5000)

-
-def test_NDArrayIter():
+def _init_NDArrayIter_data():


doc string?

anirudhacharya · 2018-08-22T08:42:43Z

tests/python/unittest/test_io.py


+def test_NDArrayIter():
+    data, labels = _init_NDArrayIter_data()
+    _test_last_batch_handle(data, labels)

 def test_NDArrayIter_h5py():


doc string?

anirudhacharya · 2018-08-22T08:58:27Z

tests/python/unittest/test_io.py

+        batch_count = 0
+        for _ in dataiter:
+            batch_count += 1
+        assert batch_count == batch_count_list[idx]


can we have a test where you verify that the data has indeed been shuffled

Now I can't come up with a good solution to test if shuffle work. Shuffle testing will make unit test nondeterministic. if you have any idea, I would love to implement that

Effectively testing shuffling will be like testing a random number generator, which is a very involved problem by itself. We do not have to do that here. What I suggest is to test if we have the same set of elements pre and post shuffling and ensure that they are not in the same order.

I thought there is a tiny chance that the data remain the same after shuffling?

if there are n elements being shuffled, the chance that the list remains the same after shuffling is 1/n!(assuming unique elements). For example if there are 10 elements in the shuffled list, the probability of the list being unaltered post shuffling is 2.75573192e-7

Thanks for bringing up this issue. As @sandeep-krishnamurthy suggested, I would check if the data points are moved to the right positions based on index array. Within shuffle's implementation, the index array would shuffle first and then we get the data by their shuffled index.

anirudhacharya · 2018-08-22T16:55:53Z

tests/python/unittest/test_io.py

-            assert((batch.data[0].asnumpy()[:, 0, 0] == label).all())
-            for i in range(label.shape[0]):
-                labelcount[int(label[i])] += 1
+    with h5py.File('ndarraytest.h5') as f:


NDArrayIter is supposed to return iterators for mx.nd.NDArray, numpy.ndarray, h5py.Dataset mx.nd.sparse.CSRNDArray or scipy.sparse.csr_matrix. Do we have a test for scipy.sparse.csr_matrix like for h5py

Good Catch. will implement that

…to change_NDArrayIter_last_batch_handle

zhreshold · 2018-08-31T20:29:20Z

python/mxnet/io.py

-from .ndarray.random import shuffle as random_shuffle
+from .ndarray import concat
+
+from .io_utils import init_data, has_instance, getdata_by_idx


not convinced to create a separate utils file, and this import will actually pollute the namespace. The original _private_function is good IMO.

The reason I created another file is that the io.py file exceeds 1000 lines.

okay, make sense to me. Here's my suggestion in order to make things clear:

make a new directory named io under python/mxnet

. ├── __init__.py ├── io.py └── io_utils.py

In __init__.py, reimport everything in io.py, so it won't break the original path.

* 1. move the shuffle to the reset 2. modify the roll_over behavior accordingly * refactor the concat part * refactor the code * implement unit test for last_batch_handle * refactor the getdata part * add docstring and refine the code according to linter * 1. add test case for NDArrayIter_h5py 2. refactor the implementation * update contributions doc * fix wording * update doc for roll_over * 1. add test for second iteration of roll_over 2. add shuffle test case * fix some wording and refine the variables naming * move utility function to new file * move utility function to io_utils.py * change shuffle function name to avoid redefining name * make io as a module * rename the utility functions * disable wildcard-import

iamthebot · 2018-09-12T00:57:54Z

This is causing a regression which is breaking most of our test cases using simple iteration over NDArrayIter.

.com/airbnb/bighead/python/bighead/ml_frameworks/mxnet/gluon.py", line 434, in transform
    for batch in data_iter:
  File "/anaconda3/envs/py36/lib/python3.6/site-packages/mxnet/io/io.py", line 228, in __next__
    return self.next()
  File "/anaconda3/envs/py36/lib/python3.6/site-packages/mxnet/io/io.py", line 680, in next
    label = self.getlabel()
  File "/anaconda3/envs/py36/lib/python3.6/site-packages/mxnet/io/io.py", line 750, in getlabel
    return self._batchify(self.label)
  File "/anaconda3/envs/py36/lib/python3.6/site-packages/mxnet/io/io.py", line 732, in _batchify
    first_data = self._getdata(data_source, start=self.cursor)
  File "/anaconda3/envs/py36/lib/python3.6/site-packages/mxnet/io/io.py", line 694, in _getdata
    end = end if end is not None else data_source[0][1].shape[0]
IndexError: list index out of range

This reverts commit 597a637.

* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (#12362)" This reverts commit ceabcaa. * Revert "[MXNET-580] Add SN-GAN example (#12419)" This reverts commit 46a5cee. * Revert "Remove regression checks for website links (#12507)" This reverts commit 619bc3e. * Revert "Revert "Fix flaky test: test_mkldnn.test_activation #12377 (#12418)" (#12516)" This reverts commit 7ea0533. * Revert "further bump up tolerance for sparse dot (#12527)" This reverts commit 90599e1. * Revert "Fix broken URLs (#12508)" This reverts commit 3d83c89. * Revert "Temporarily disable flaky tests (#12520)" This reverts commit 35ca13c. * Revert "Add support for more req patterns for bilinear sampler backward (#12386)" This reverts commit 4ee866f. * Revert "Change the way NDArrayIter handle the last batch (#12285)" This reverts commit 597a637.

* 1. move the shuffle to the reset 2. modify the roll_over behavior accordingly * refactor the concat part * refactor the code * implement unit test for last_batch_handle * refactor the getdata part * add docstring and refine the code according to linter * 1. add test case for NDArrayIter_h5py 2. refactor the implementation * update contributions doc * fix wording * update doc for roll_over * 1. add test for second iteration of roll_over 2. add shuffle test case * fix some wording and refine the variables naming * move utility function to new file * move utility function to io_utils.py * change shuffle function name to avoid redefining name * make io as a module * rename the utility functions * disable wildcard-import

* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (apache#12362)" This reverts commit ceabcaa. * Revert "[MXNET-580] Add SN-GAN example (apache#12419)" This reverts commit 46a5cee. * Revert "Remove regression checks for website links (apache#12507)" This reverts commit 619bc3e. * Revert "Revert "Fix flaky test: test_mkldnn.test_activation apache#12377 (apache#12418)" (apache#12516)" This reverts commit 7ea0533. * Revert "further bump up tolerance for sparse dot (apache#12527)" This reverts commit 90599e1. * Revert "Fix broken URLs (apache#12508)" This reverts commit 3d83c89. * Revert "Temporarily disable flaky tests (apache#12520)" This reverts commit 35ca13c. * Revert "Add support for more req patterns for bilinear sampler backward (apache#12386)" This reverts commit 4ee866f. * Revert "Change the way NDArrayIter handle the last batch (apache#12285)" This reverts commit 597a637.

stu1130 added 8 commits August 20, 2018 15:26

1. move the shuffle to the reset 2. modify the roll_over behavior acc…

ce4ae1c

…ordingly

refactor the concat part

26a4c58

refactor the code

21d399b

implement unit test for last_batch_handle

c14b4ef

refactor the getdata part

e8fb568

add docstring and refine the code according to linter

33290f6

1. add test case for NDArrayIter_h5py 2. refactor the implementation

3eb93c5

update contributions doc

4ace292

stu1130 requested a review from szha as a code owner August 21, 2018 23:18

stu1130 changed the title ~~Change nd array iter last batch handle~~ [WIP] Change the way that NDArrayIter handle the last batch Aug 21, 2018

stu1130 changed the title ~~[WIP] Change the way that NDArrayIter handle the last batch~~ [WIP] Change the way NDArrayIter handle the last batch Aug 21, 2018

anirudhacharya reviewed Aug 21, 2018

View reviewed changes

fix wording

e939b06

update doc for roll_over

21ccec6

anirudhacharya reviewed Aug 22, 2018

View reviewed changes

stu1130 added 6 commits August 22, 2018 15:42

1. add test for second iteration of roll_over 2. add shuffle test case

353afe7

fix some wording and refine the variables naming

012a419

move utility function to new file

68c10f2

move utility function to io_utils.py

12059d4

change shuffle function name to avoid redefining name

c3f5b73

Merge branch 'master' of https://github.com/apache/incubator-mxnet in…

9505ed3

…to change_NDArrayIter_last_batch_handle

stu1130 changed the title ~~[WIP] Change the way NDArrayIter handle the last batch~~ Change the way NDArrayIter handle the last batch Aug 23, 2018

sandeep-krishnamurthy added Data-loading pr-awaiting-review PR is waiting for code review labels Aug 24, 2018

zhreshold reviewed Aug 31, 2018

View reviewed changes

stu1130 added 3 commits September 4, 2018 14:49

make io as a module

fa1b23d

rename the utility functions

4966e73

disable wildcard-import

2fb14f3

zhreshold approved these changes Sep 6, 2018

View reviewed changes

sandeep-krishnamurthy merged commit 597a637 into apache:master Sep 11, 2018

iamthebot mentioned this pull request Sep 12, 2018

#12285 Breaks NDArrayIter For 3D Arrays #12526

Closed

zhreshold added a commit that referenced this pull request Sep 12, 2018

Revert "Change the way NDArrayIter handle the last batch (#12285)"

222ce83

This reverts commit 597a637.

zhreshold mentioned this pull request Sep 12, 2018

Revert "Change the way NDArrayIter handle the last batch" #12537

Merged

stu1130 mentioned this pull request Sep 13, 2018

Change the way NDArrayIter handle the last batch #12545

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change the way NDArrayIter handle the last batch #12285

Change the way NDArrayIter handle the last batch #12285

stu1130 commented Aug 21, 2018 •

edited

Loading

anirudhacharya Aug 21, 2018

chinakook commented Aug 22, 2018

stu1130 commented Aug 22, 2018

anirudhacharya Aug 22, 2018

stu1130 Aug 22, 2018 •

edited

Loading

chinakook Aug 23, 2018 •

edited

Loading

anirudhacharya Aug 22, 2018

anirudhacharya Aug 22, 2018

anirudhacharya Aug 22, 2018

stu1130 Aug 22, 2018

anirudhacharya Aug 22, 2018

anirudhacharya Aug 22, 2018

anirudhacharya Aug 22, 2018

anirudhacharya Aug 22, 2018

stu1130 Aug 22, 2018 •

edited

Loading

anirudhacharya Aug 22, 2018 •

edited

Loading

stu1130 Aug 22, 2018

anirudhacharya Aug 22, 2018

stu1130 Aug 22, 2018

anirudhacharya Aug 22, 2018 •

edited

Loading

stu1130 Aug 22, 2018

zhreshold Aug 31, 2018

stu1130 Sep 4, 2018

zhreshold Sep 4, 2018

iamthebot commented Sep 12, 2018


		def test_NDArrayIter():
		data, labels = _init_NDArrayIter_data()

Change the way NDArrayIter handle the last batch #12285

Change the way NDArrayIter handle the last batch #12285

Conversation

stu1130 commented Aug 21, 2018 • edited Loading

Description

Checklist

Essentials

Changes

Comments

Choose a reason for hiding this comment

chinakook commented Aug 22, 2018

stu1130 commented Aug 22, 2018

Choose a reason for hiding this comment

stu1130 Aug 22, 2018 • edited Loading

Choose a reason for hiding this comment

chinakook Aug 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stu1130 Aug 22, 2018 • edited Loading

Choose a reason for hiding this comment

anirudhacharya Aug 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

anirudhacharya Aug 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iamthebot commented Sep 12, 2018

stu1130 commented Aug 21, 2018 •

edited

Loading

stu1130 Aug 22, 2018 •

edited

Loading

chinakook Aug 23, 2018 •

edited

Loading

stu1130 Aug 22, 2018 •

edited

Loading

anirudhacharya Aug 22, 2018 •

edited

Loading

anirudhacharya Aug 22, 2018 •

edited

Loading