This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Currently the 2.0 DataLoader cannot be used to train BERT. It is not flexible enough to support the data schema used by GluonNLP BERT: when a nested list of variable-length NumPy arrays is passed in, construction of the dataset fails with an NDArray conversion error.
Here is a minimal reproducible example, which uses a data schema similar to the one in the BERT pre-training script:
import mxnet as mx
import numpy as np
mx.npx.set_np()  # enable NumPy-compatible array semantics
a = np.ndarray(shape=(128,))  # similar to one feature of one sequence
b = np.ndarray(shape=(19,))  # same feature for a shorter sequence
l1 = [a, b]  # similar to one feature of all sequences
l2 = [a, b]
c = [l1, l2]  # similar to a training instance that will be sampled against
ds = mx.gluon.data.ArrayDataset(*c)
dt = mx.gluon.data.DataLoader(dataset=ds, batch_size=1, num_workers=1, try_nopython=True)
print('ok')  # errors out before this prints
Error message:
~/miniconda3/lib/python3.7/site-packages/mxnet/gluon/data/dataset.py in __mx_handle__(self)
    383                 datasets.append(data.__mx_handle__())
    384             else:
--> 385                 datasets.append(NDArrayDataset(arr=default_array(data)))
    386         self.handle = GroupDataset(datasets=datasets)
    387         return self.handle

~/miniconda3/lib/python3.7/site-packages/mxnet/util.py in default_array(source_array, ctx, dtype)
    936     from . import np as _mx_np
    937     if is_np_array():
--> 938         return _mx_np.array(source_array, ctx=ctx, dtype=dtype)
    939     else:
    940         return _mx_nd.array(source_array, ctx=ctx, dtype=dtype)

~/miniconda3/lib/python3.7/site-packages/mxnet/numpy/multiarray.py in array(object, dtype, ctx)
   2407                 # printing out the error raised by official NumPy's array function
   2408                 # for transparency on users' side
-> 2409                 raise TypeError('{}'.format(str(e)))
   2410     ret = empty(object.shape, dtype=dtype, ctx=ctx)
   2411     if len(object.shape) == 0:

TypeError: setting an array element with a sequence.
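The comment in multiarray.py indicates that MXNet is re-raising an error that came from official NumPy's array function. As a rough sketch of the underlying cause, the same message can be reproduced with NumPy alone by asking for a dense numeric array built from sequences of different lengths. The helper name `try_dense_conversion` and the explicit `float32` dtype are assumptions made for illustration; they are not taken from the MXNet code path.

```python
import numpy as np

a = np.zeros(shape=(128,), dtype=np.float32)  # one feature of a long sequence
b = np.zeros(shape=(19,), dtype=np.float32)   # same feature, shorter sequence


def try_dense_conversion(arrays):
    """Hypothetical helper: return None if NumPy can build one rectangular
    array from `arrays`, otherwise return the error message it raised."""
    try:
        np.array(arrays, dtype=np.float32)
        return None
    except (ValueError, TypeError) as e:
        return str(e)


ragged_error = try_dense_conversion([a, b])   # lengths 128 and 19 differ
uniform_error = try_dense_conversion([a, a])  # equal lengths stack cleanly

print(ragged_error)
print(uniform_error)
```

Equal-length arrays convert without complaint, which suggests the failure is tied to the ragged shapes rather than to the nesting itself.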
In 2.0, if try_nopython is set to False, the behavior is the same as in 1.0.
If try_nopython is True, the dataset has to be converted to an ndarray, and the nested arrays with different types and shapes are causing the problem. If anyone can help figure out the correct layout for converting the complex BERT-style dataset, I can help look into the fix.
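One possible layout, offered only as a sketch and not as the agreed fix, is to pad each variable-length feature to a common length and carry the true lengths in a separate array, so that every feature becomes a single rectangular array. The helper `pad_feature` and the padding value `0` are assumptions for illustration. Whether the 2.0 DataLoader with try_nopython=True accepts this layout has not been verified here.

```python
import numpy as np


def pad_feature(arrays, pad_value=0):
    """Hypothetical helper: pad 1-D arrays to the longest length.

    Returns a (num_sequences, max_len) array plus the original lengths,
    so downstream code can mask out the padding.
    """
    lengths = np.array([len(x) for x in arrays], dtype=np.int32)
    padded = np.full((len(arrays), int(lengths.max())), pad_value,
                     dtype=arrays[0].dtype)
    for i, x in enumerate(arrays):
        padded[i, :len(x)] = x  # copy real values; the rest stays pad_value
    return padded, lengths


a = np.ones(shape=(128,), dtype=np.float32)  # one feature of one sequence
b = np.ones(shape=(19,), dtype=np.float32)
l1 = [a, b]                                  # one feature of all sequences

padded, valid_lengths = pad_feature(l1)
print(padded.shape, valid_lengths)
```

The trade-off is extra memory for the padding, and the model or batchify step would need to use the valid lengths to ignore padded positions.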
References
#17841