-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Change the way NDArrayIter handle the last batch #12285
Change the way NDArrayIter handle the last batch #12285
Conversation
python/mxnet/io.py
Outdated
@@ -635,6 +633,9 @@ class NDArrayIter(DataIter): | |||
How to handle the last batch. This parameter can be 'pad', 'discard' or | |||
'roll_over'. 'roll_over' is intended for training and can cause problems | |||
if used for prediction. | |||
If 'pad', the last batch will be padded with data starting from the begining | |||
If 'discard', the last batch will be discarded | |||
If 'roll_over', the remaining elements will be rolled over to the next iteration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'roll_over' option is described in line 634. Can this be merged with that rather than describe 'roll_over' twice.
What's the behavior of 'roll_over'? I think the doc is not very clear. |
roll_over means the remaining elements of the last batch would be rolled over to next iteration if the last batch is less than the batch size |
'roll_over'. 'roll_over' is intended for training and can cause problems | ||
if used for prediction. | ||
'roll_over'. | ||
If 'pad', the last batch will be padded with data starting from the begining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is pad and roll_over different, it is not clear in the documentation? In both it would seem you are taking data from the first batch of off the next epoch and adding it to the current last batch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let say data look like this [1,2,3,4,5,6,7,8,9,10] with batch_size 3
pad
would be like [1,2,3],...[7,8,9],[10,1,2], while roll_over
would be [1,2,3],...[7,8,9] and second iteration would be [10,1,2], [3,4,5], [6,7,8] after calling reset().
I've updated example starting from line 610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, It's so clear with an example.
python/mxnet/io.py
Outdated
else: | ||
self.cursor = -self.batch_size | ||
|
||
def iter_next(self): | ||
"""Increments the coursor and check current cursor if exceed num of data.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc string does not make sense and has mistakes. What is num of data?
tests/python/unittest/test_io.py
Outdated
else: | ||
assert(labelcount[i] == 100) | ||
def _test_last_batch_handle(data, labels): | ||
idx = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this being initialized here? It is not needed as you do for idx in range(...
later.
tests/python/unittest/test_io.py
Outdated
|
||
def test_NDArrayIter(): | ||
data, labels = _init_NDArrayIter_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add doc string for this method. same for other methods in this module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to add doc string for the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to have a couple of comments describing what use-cases are being tested?
tests/python/unittest/test_io.py
Outdated
@@ -87,82 +87,68 @@ def test_Cifar10Rec(): | |||
for i in range(10): | |||
assert(labelcount[i] == 5000) | |||
|
|||
|
|||
def test_NDArrayIter(): | |||
def _init_NDArrayIter_data(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string?
tests/python/unittest/test_io.py
Outdated
|
||
def test_NDArrayIter(): | ||
data, labels = _init_NDArrayIter_data() | ||
_test_last_batch_handle(data, labels) | ||
|
||
def test_NDArrayIter_h5py(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc string?
tests/python/unittest/test_io.py
Outdated
batch_count = 0 | ||
for _ in dataiter: | ||
batch_count += 1 | ||
assert batch_count == batch_count_list[idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a test where you verify that the data has indeed been shuffled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I can't come up with a good solution to test if shuffle work. Shuffle testing will make unit test nondeterministic. if you have any idea, I would love to implement that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Effectively testing shuffling will be like testing a random number generator, which is a very involved problem by itself. We do not have to do that here. What I suggest is to test if we have the same set of elements pre and post shuffling and ensure that they are not in the same order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought there is a tiny chance that the data remain the same after shuffling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are n elements being shuffled, the chance that the list remains the same after shuffling is 1/n!
(assuming unique elements). For example if there are 10 elements in the shuffled list, the probability of the list being unaltered post shuffling is 2.75573192e-7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing up this issue. As @sandeep-krishnamurthy suggested, I would check if the data points are moved to the right positions based on index array. Within shuffle's implementation, the index array would shuffle first and then we get the data by their shuffled index.
assert((batch.data[0].asnumpy()[:, 0, 0] == label).all()) | ||
for i in range(label.shape[0]): | ||
labelcount[int(label[i])] += 1 | ||
with h5py.File('ndarraytest.h5') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NDArrayIter
is supposed to return iterators for mx.nd.NDArray
, numpy.ndarray
, h5py.Dataset
mx.nd.sparse.CSRNDArray
or scipy.sparse.csr_matrix
. Do we have a test for scipy.sparse.csr_matrix
like for h5py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Catch. will implement that
…to change_NDArrayIter_last_batch_handle
python/mxnet/io.py
Outdated
from .ndarray.random import shuffle as random_shuffle | ||
from .ndarray import concat | ||
|
||
from .io_utils import init_data, has_instance, getdata_by_idx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not convinced to create a separate utils file, and this import will actually pollute the namespace. The original _private_function is good IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I created another file is that the io.py file exceeds 1000 lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, make sense to me. Here's my suggestion in order to make things clear:
make a new directory named io
under python/mxnet
.
├── __init__.py
├── io.py
└── io_utils.py
In __init__.py
, reimport everything in io.py, so it won't break the original path.
* 1. move the shuffle to the reset 2. modify the roll_over behavior accordingly * refactor the concat part * refactor the code * implement unit test for last_batch_handle * refactor the getdata part * add docstring and refine the code according to linter * 1. add test case for NDArrayIter_h5py 2. refactor the implementation * update contributions doc * fix wording * update doc for roll_over * 1. add test for second iteration of roll_over 2. add shuffle test case * fix some wording and refine the variables naming * move utility function to new file * move utility function to io_utils.py * change shuffle function name to avoid redefining name * make io as a module * rename the utility functions * disable wildcard-import
This is causing a regression which is breaking most of our test cases using simple iteration over
|
This reverts commit 597a637.
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (#12362)" This reverts commit ceabcaa. * Revert "[MXNET-580] Add SN-GAN example (#12419)" This reverts commit 46a5cee. * Revert "Remove regression checks for website links (#12507)" This reverts commit 619bc3e. * Revert "Revert "Fix flaky test: test_mkldnn.test_activation #12377 (#12418)" (#12516)" This reverts commit 7ea0533. * Revert "further bump up tolerance for sparse dot (#12527)" This reverts commit 90599e1. * Revert "Fix broken URLs (#12508)" This reverts commit 3d83c89. * Revert "Temporarily disable flaky tests (#12520)" This reverts commit 35ca13c. * Revert "Add support for more req patterns for bilinear sampler backward (#12386)" This reverts commit 4ee866f. * Revert "Change the way NDArrayIter handle the last batch (#12285)" This reverts commit 597a637.
* 1. move the shuffle to the reset 2. modify the roll_over behavior accordingly * refactor the concat part * refactor the code * implement unit test for last_batch_handle * refactor the getdata part * add docstring and refine the code according to linter * 1. add test case for NDArrayIter_h5py 2. refactor the implementation * update contributions doc * fix wording * update doc for roll_over * 1. add test for second iteration of roll_over 2. add shuffle test case * fix some wording and refine the variables naming * move utility function to new file * move utility function to io_utils.py * change shuffle function name to avoid redefining name * make io as a module * rename the utility functions * disable wildcard-import
* Revert "Removing the re-size for validation data, which breaking the validation accuracy of CIFAR training (apache#12362)" This reverts commit ceabcaa. * Revert "[MXNET-580] Add SN-GAN example (apache#12419)" This reverts commit 46a5cee. * Revert "Remove regression checks for website links (apache#12507)" This reverts commit 619bc3e. * Revert "Revert "Fix flaky test: test_mkldnn.test_activation apache#12377 (apache#12418)" (apache#12516)" This reverts commit 7ea0533. * Revert "further bump up tolerance for sparse dot (apache#12527)" This reverts commit 90599e1. * Revert "Fix broken URLs (apache#12508)" This reverts commit 3d83c89. * Revert "Temporarily disable flaky tests (apache#12520)" This reverts commit 35ca13c. * Revert "Add support for more req patterns for bilinear sampler backward (apache#12386)" This reverts commit 4ee866f. * Revert "Change the way NDArrayIter handle the last batch (apache#12285)" This reverts commit 597a637.
Description
Update some behaviors of NDArrayIter.
last_batch_handle = roll_over
.Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
last_batch_handle
paramsComments
N/A