This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon 2.0 Dataloader should support BERT training using GluonNLP #18672

Open
rondogency opened this issue Jul 8, 2020 · 3 comments

Comments

@rondogency (Contributor)

Description

Currently we cannot use the 2.0 DataLoader to train BERT. The reason is that the 2.0 DataLoader is not flexible enough to support the data schema used by GluonNLP BERT: when a nested list of variable-length numpy arrays is passed in, construction of the dataset fails with NDArray conversion errors.

Here is a minimal reproducible example, using a data schema similar to the one the BERT pre-training script uses:

import mxnet as mx
import numpy as np
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))
l1 = [a, b]  # similar to one feature of all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # errors out before this prints

References

#17841

@sxjscience (Member)

I can reproduce this failure:

import mxnet as mx
import numpy as np
mx.npx.set_np()
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))
l1 = [a, b]  # similar to one feature of all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # errors out before this prints

Error message:

~/miniconda3/lib/python3.7/site-packages/mxnet/gluon/data/dataset.py in __mx_handle__(self)
    383                     datasets.append(data.__mx_handle__())
    384                 else:
--> 385                     datasets.append(NDArrayDataset(arr=default_array(data)))
    386             self.handle = GroupDataset(datasets=datasets)
    387         return self.handle

~/miniconda3/lib/python3.7/site-packages/mxnet/util.py in default_array(source_array, ctx, dtype)
    936     from . import np as _mx_np
    937     if is_np_array():
--> 938         return _mx_np.array(source_array, ctx=ctx, dtype=dtype)
    939     else:
    940         return _mx_nd.array(source_array, ctx=ctx, dtype=dtype)

~/miniconda3/lib/python3.7/site-packages/mxnet/numpy/multiarray.py in array(object, dtype, ctx)
   2407             # printing out the error raised by official NumPy's array function
   2408             # for transparency on users' side
-> 2409             raise TypeError('{}'.format(str(e)))
   2410     ret = empty(object.shape, dtype=dtype, ctx=ctx)
   2411     if len(object.shape) == 0:

TypeError: setting an array element with a sequence.
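For context, this looks like the same error official NumPy raises when asked to pack ragged rows into one rectangular array with a fixed dtype; per the traceback above, MXNet's default_array simply re-raises NumPy's message as a TypeError. A minimal illustration of that underlying behaviour (an assumption about the root cause, not code taken from MXNet or the BERT script):

import numpy as np

# Two rows of different lengths cannot be packed into one rectangular
# float array; NumPy rejects this with "setting an array element with a
# sequence." (typically a ValueError, which MXNet re-raises as the
# TypeError shown above).
rows = [np.zeros(128), np.zeros(19)]
np.array(rows, dtype='float32')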

@zhreshold (Member)

In 2.0, if try_nopython is set to False, the behavior is the same as in 1.0.
If try_nopython is True, the dataset has to be converted to an ndarray, and the nested arrays with different types and shapes are what cause the problem. If anyone can help figure out the correct layout for converting the complex BERT-style dataset, I can help look into the fix; a possible direction is sketched below.
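One possible direction, sketched under the assumption that padding every feature to a common length is acceptable (the names and the zero-padding below are illustrative, not the BERT script's actual preprocessing): make each feature column a regular, uniformly shaped array before constructing the ArrayDataset, so the nopython path has a rectangular layout to convert. Otherwise, leave try_nopython off and stay on the 1.x Python path.

import mxnet as mx
import numpy as np
mx.npx.set_np()

max_len = 128
a = np.zeros((128,), dtype='float32')
b = np.zeros((19,), dtype='float32')
b_padded = np.pad(b, (0, max_len - b.shape[0]), mode='constant')  # pad shorter sequence to max_len

# Each feature column is now a regular (num_sequences, max_len) array
# instead of a ragged list of variable-length arrays.
l1 = np.stack([a, b_padded])
l2 = np.stack([a, b_padded])
ds = mx.gluon.data.ArrayDataset(l1, l2)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # with uniform shapes this line should be reached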
