This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon 2.0 Dataloader should support BERT training using GluonNLP #18672

Open
rondogency opened this issue Jul 8, 2020 · 3 comments

Comments

@rondogency (Contributor)

Description

Currently we cannot use the 2.0 DataLoader to train BERT. The reason is that the 2.0 DataLoader is not flexible enough to support the data schema used by GluonNLP BERT: when a nested list of variable-length numpy arrays is passed in, construction of the dataset fails with NDArray conversion errors.

Here is a minimal reproducible example, using a data schema similar to the one the BERT pre-training script uses:

import mxnet as mx
import numpy as np
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))
l1 = [a, b]  # similar to one feature of all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # errors out before this prints

References

#17841

@sxjscience (Member)

I can reproduce this failure:

import mxnet as mx
import numpy as np
mx.npx.set_np()
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))
l1 = [a, b]  # similar to one feature of all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # errors out before this prints

Error message:

~/miniconda3/lib/python3.7/site-packages/mxnet/gluon/data/dataset.py in __mx_handle__(self)
    383                     datasets.append(data.__mx_handle__())
    384                 else:
--> 385                     datasets.append(NDArrayDataset(arr=default_array(data)))
    386             self.handle = GroupDataset(datasets=datasets)
    387         return self.handle

~/miniconda3/lib/python3.7/site-packages/mxnet/util.py in default_array(source_array, ctx, dtype)
    936     from . import np as _mx_np
    937     if is_np_array():
--> 938         return _mx_np.array(source_array, ctx=ctx, dtype=dtype)
    939     else:
    940         return _mx_nd.array(source_array, ctx=ctx, dtype=dtype)

~/miniconda3/lib/python3.7/site-packages/mxnet/numpy/multiarray.py in array(object, dtype, ctx)
   2407             # printing out the error raised by official NumPy's array function
   2408             # for transparency on users' side
-> 2409             raise TypeError('{}'.format(str(e)))
   2410     ret = empty(object.shape, dtype=dtype, ctx=ctx)
   2411     if len(object.shape) == 0:

TypeError: setting an array element with a sequence.
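For context, this looks like the same error official NumPy raises when asked to pack ragged rows into one rectangular array with a fixed dtype; per the traceback above, MXNet's default_array simply re-raises NumPy's message as a TypeError. A minimal illustration of that underlying behaviour (an assumption about the root cause, not code taken from MXNet or the BERT script):

import numpy as np

# Two rows of different lengths cannot be packed into one rectangular
# float array; NumPy rejects this with "setting an array element with a
# sequence." (typically a ValueError, which MXNet re-raises as the
# TypeError shown above).
rows = [np.zeros(128), np.zeros(19)]
np.array(rows, dtype='float32')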

@zhreshold (Member)

In 2.0, if try_nopython is set to False, the behavior is the same as in 1.0.
If try_nopython is True, the dataset has to be converted to an ndarray, and the nested arrays with different types and shapes are what cause the problem. If anyone can help figure out the correct layout for converting the complex BERT-style dataset, I can help look into the fix; a possible direction is sketched below.
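One possible direction, sketched under the assumption that padding every feature to a common length is acceptable (the names and the zero-padding below are illustrative, not the BERT script's actual preprocessing): make each feature column a regular, uniformly shaped array before constructing the ArrayDataset, so the nopython path has a rectangular layout to convert. Otherwise, leave try_nopython off and stay on the 1.x Python path.

import mxnet as mx
import numpy as np
mx.npx.set_np()

max_len = 128
a = np.zeros((128,), dtype='float32')
b = np.zeros((19,), dtype='float32')
b_padded = np.pad(b, (0, max_len - b.shape[0]), mode='constant')  # pad shorter sequence to max_len

# Each feature column is now a regular (num_sequences, max_len) array
# instead of a ragged list of variable-length arrays.
l1 = np.stack([a, b_padded])
l2 = np.stack([a, b_padded])
ds = mx.gluon.data.ArrayDataset(l1, l2)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # with uniform shapes this line should be reached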
