This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

TypeError: can't pickle Tokenizer objects when num_workers > 0 and lazy = true #4399

Closed
JohnGiorgi opened this issue Jun 25, 2020 · 5 comments

JohnGiorgi (Contributor) commented Jun 25, 2020

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

I get a TypeError: can't pickle Tokenizer objects when trying to train a model that uses a PretrainedTransformerTokenizer with "dataset_reader.lazy": true and "data_loader.num_workers" > 0. This appears to happen on every version of AllenNLP after 1.0.0rc3 (specifically this commit), including the current master branch. The 1.0.0rc3 release and earlier releases do not have this issue.

The notes in #4344 seem to suggest that this has been fixed, but I can still trigger the error with a minimal example (see below).

Python traceback:

Traceback (most recent call last):
  File "/home/johnmg/t2t/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/scratch/johnmg/allennlp/allennlp/__main__.py", line 24, in run
    main(prog="allennlp")
  File "/scratch/johnmg/allennlp/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 112, in train_model_from_args
    dry_run=args.dry_run,
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 171, in train_model_from_file
    dry_run=dry_run,
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 295, in train_model
    nprocs=num_procs,
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 418, in _train_worker
    params=params, serialization_dir=serialization_dir, local_rank=process_rank,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
    **extras,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 647, in from_partial_objects
    data_loader_ = data_loader.construct(dataset=datasets["train"])
  File "/scratch/johnmg/allennlp/allennlp/common/lazy.py", line 46, in construct
    return self._constructor(**kwargs)
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 446, in constructor
    return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras)
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
    **extras,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 151, in from_partial_objects
    batches_per_epoch=batches_per_epoch,
  File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 90, in __init__
    self._data_generator = super().__iter__()
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    w.start()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects

Related issues or possible duplicates

Environment

OS:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Python version: 3.7.4

Output of pip freeze:

absl-py==0.7.1
aiohttp==3.6.2
alabaster==0.7.12
-e git+https://github.com/allenai/allennlp.git@b6fd6978b507ce6118023e23f3e4dbfa334d39b5#egg=allennlp
apex==0.1
appdirs==1.4.3
aspy.yaml==1.3.0
astor==0.8.1
async-timeout==3.0.1
atomicwrites==1.3.0
attrs==19.3.0
Babel==2.7.0
backcall==0.1.0
beautifulsoup4==4.8.2
black==19.10b0
bleach==3.1.0
blis==0.2.4
boto==2.49.0
boto3==1.10.9
botocore==1.13.9
cachetools==3.1.1
cc-net==0.1.0
certifi==2019.9.11
cffi==1.13.2
cfgv==2.0.1
chardet==3.0.4
click==7.1.1
codecov==2.0.15
conllu==2.3.2
coverage==4.5.4
cryptography==2.8
cycler==0.10.0
cymem==2.0.2
-e git+https://github.com/JohnGiorgi/t2t.git@5cc03ed58253e12bd1060f1fea2b89bae3acdb84#egg=declutr
decorator==4.4.1
dill==0.3.1.1
docutils==0.15.2
editdistance==0.5.2
en-core-web-sm==2.1.0
entrypoints==0.3
fastapi==0.58.0
fasttext==0.9.1
filelock==3.0.12
fire==0.2.1
flake8==3.7.9
flaky==3.6.1
Flask==1.1.1
Flask-Cors==3.0.8
ftfy==5.5.1
func-argparse==1.1.1
future==0.17.1
gast==0.2.2
gensim==3.8.1
getpy==0.9.9
gevent==1.4.0
google-auth==1.11.0
google-auth-oauthlib==0.4.1
google-pasta==0.1.8
greenlet==0.4.15
grpcio==1.25.0
h11==0.9.0
h5py==2.9.0
htmlmin==0.1.12
httptools==0.1.1
hypothesis==5.16.0
identify==1.4.10
idna==2.8
imagesize==1.1.0
importlib-metadata==0.23
ipython==7.10.1
ipython-genutils==0.2.0
isort==4.3.21
itsdangerous==1.1.0
jedi==0.15.1
jeepney==0.4.2
Jinja2==2.10.3
jmespath==0.9.4
joblib==0.14.0
jsmin==2.2.2
jsonnet==0.10.0
jsonpickle==1.2
jsonschema==3.0.2
kenlm==0.0.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
keyring==21.1.0
kiwisolver==1.1.0
livereload==2.6.1
lxml==4.4.1
Markdown==3.1.1
markdown-include==0.5.1
MarkupSafe==1.1.1
mathy-pydoc==0.6.7
matplotlib==3.0.3
maturin==0.8.1
mccabe==0.6.1
mkdocs==1.0.4
mkdocs-material==4.6.3
mkdocs-minify-plugin==0.2.1
more-itertools==7.2.0
multidict==4.5.2
murmurhash==0.28.0
mypy==0.770
mypy-extensions==0.4.3
nltk==3.4
nodeenv==1.3.4
numpy==1.16.3
numpydoc==0.8.0
oauthlib==3.1.0
opt-einsum==2.3.2
overrides==3.1.0
packaging==19.2
pandas==0.25.3
parsimonious==0.8.0
parso==0.5.1
pathspec==0.7.0
pep562==1.0
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.1
Pillow-SIMD==7.0.0.post3
pkginfo==1.5.0.1
plac==0.9.6
pluggy==0.13.0
pre-commit==2.2.0
preshed==2.0.1
prompt-toolkit==3.0.2
protobuf==3.10.0
ptyprocess==0.6.0
py==1.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.4.3
pycodestyle==2.5.0
pycparser==2.19
pydantic==1.5.1
pydoc-markdown==2.0.5
pyflakes==2.1.1
Pygments==2.4.2
pymdown-extensions==6.3
pyparsing==2.4.3
pyrsistent==0.15.3
pytest==5.2.2
pytest-cov==2.8.1
python-dateutil==2.8.0
-e git+https://github.com/KevinMusgrave/pytorch-metric-learning.git@48de2dd9c4d78873d675f19187c5205075a6a9de#egg=pytorch_metric_learning
pytz==2019.3
PyYAML==5.1.2
-e git+https://github.com/JohnGiorgi/QuickThought.git@397b8b18f3cc50a3471fe26f9725401fb2297816#egg=quickthought
readme-renderer==24.0
regex==2018.1.10
requests==2.22.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
responses==0.10.6
rsa==4.0
ruamel.yaml==0.16.5
ruamel.yaml.clib==0.2.0
s3transfer==0.2.1
sacremoses==0.0.35
scikit-learn==0.21.2
scipy==1.4.1
SecretStorage==3.1.2
semantic-version==2.8.4
sentence-splitter==1.4
sentence-transformers==0.2.6.1
sentencepiece==0.1.82
setuptools-rust==0.10.6
singledispatch==3.4.0.3
six==1.12.0
smart-open==1.8.4
snowballstemmer==2.0.0
sortedcontainers==2.2.2
soupsieve==2.0
spacy==2.1.4
Sphinx==2.2.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
sqlparse==0.3.0
srsly==0.0.5
starlette==0.13.4
tensorboard==1.15.0
tensorboardX==1.9
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0
tensorflow-hub==0.8.0
termcolor==1.1.0
Theano==1.0.1
thinc==7.0.4
tokenizers==0.7.0
toml==0.10.0
torch==1.5.0
torchvision==0.6.0+cu101
tornado==6.0.3
tqdm==4.37.0
traitlets==4.3.3
transformers==2.11.0
twine==3.1.1
typed-ast==1.4.1
typer==0.2.1
typing-extensions==3.7.4.1
Unidecode==1.1.1
urllib3==1.25.6
uvicorn==0.11.5
uvloop==0.14.0
virtualenv==16.7.9
wasabi==0.4.0
wcwidth==0.1.7
webencodings==0.5.1
websockets==8.1
Werkzeug==0.16.0
word2number==1.1
wrapt==1.11.2
yarl==1.4.2
zipp==0.6.0

Steps to reproduce

  1. Install a version of AllenNLP and AllenNLP-Models newer than 1.0.0rc3.
  2. Train a model which uses a PretrainedTransformerTokenizer with "dataset_reader.lazy": true and "data_loader.num_workers" > 0. E.g. I used this config with some overrides (see below).
Example source:

allennlp train mnli_roberta.jsonnet \
    --serialization-dir ./debug \
    --overrides "{'dataset_reader.lazy': true, 'data_loader.batch_sampler': null, 'data_loader.num_workers': 1}" \
    -f
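
For reference, the failing combination boils down to a config along these lines. This is a hypothetical minimal sketch, not the actual mnli_roberta.jsonnet; the reader type and model name are placeholders, and any reader that constructs a PretrainedTransformerTokenizer should trigger the same error:

// Hypothetical minimal Jsonnet config illustrating the failing combination
{
  "dataset_reader": {
    "type": "snli",
    "tokenizer": {"type": "pretrained_transformer", "model_name": "roberta-base"},
    "lazy": true
  },
  "data_loader": {"num_workers": 1, "batch_size": 32},
  // model, trainer, and data paths omitted for brevity
}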

JohnGiorgi added the bug label Jun 25, 2020
epwalsh (Member) commented Jun 25, 2020

Hi @JohnGiorgi, can you share your config? Are you using the num_workers option with your data loader?

epwalsh self-assigned this Jun 25, 2020
JohnGiorgi changed the title from "TypeError: can't pickle Tokenizer objects when distributed training with a lazy dataset reader." to "TypeError: can't pickle Tokenizer objects when num_workers > 1 and lazy = true" Jun 25, 2020
JohnGiorgi changed the title from "TypeError: can't pickle Tokenizer objects when num_workers > 1 and lazy = true" to "TypeError: can't pickle Tokenizer objects when num_workers > 0 and lazy = true" Jun 25, 2020
JohnGiorgi (Contributor, Author) commented Jun 25, 2020

Hi @epwalsh, yes, it looks like num_workers > 0 was the culprit here. I just noticed that the logger prints:

UserWarning: Using multi-process data loading without setting DatasetReader.manual_multi_process_sharding to True.
Did you forget to set this?
If you're not handling the multi-process sharding logic within your _read() method, there is probably no benefit to using more than one worker.

so maybe my issue is unnecessary and I should leave num_workers at its default? (I confirmed the error does not happen when num_workers is unset).

In any case, I have updated my original issue with a minimal example that triggers the error.

epwalsh (Member) commented Jun 25, 2020

Gotcha. Yeah, like the warning says, there is probably no benefit to using num_workers > 0 unless you implement some custom sharding logic within _read() to handle that (a hypothetical sketch follows).
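
For concreteness, here is what that custom logic could look like inside a reader's _read(). This is illustrative only, not the actual fix for this issue; it assumes the standard torch.utils.data.get_worker_info() API and AllenNLP 1.0's DatasetReader constructor flags:

import torch.utils.data
from allennlp.data import DatasetReader

class ShardedLineReader(DatasetReader):  # hypothetical reader
    def __init__(self) -> None:
        super().__init__(lazy=True, manual_multi_process_sharding=True)

    def _read(self, file_path: str):
        worker_info = torch.utils.data.get_worker_info()  # None in the main process
        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                # In a worker, keep only every num_workers-th line so that
                # each worker yields a disjoint shard of the data.
                if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                    continue
                # text_to_instance omitted; it would tokenize and build fields
                yield self.text_to_instance(line.strip())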

But even then, you'll probably still see this exception, which arises because each TextField within each of your data Instances includes a PretrainedTransformerIndexer, which itself wraps a HuggingFace Tokenizer object.

Now when the main process loading data needs to gather the Instances from the data loading workers, it uses pickle to communicate. But since HuggingFace Tokenizers currently can't be pickled, this error is raised.
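
You can reproduce the underlying limitation directly. A minimal sketch, assuming transformers==2.11.0 / tokenizers==0.7.0 as in the environment above (newer versions of these libraries may make fast tokenizers picklable):

import pickle
from transformers import AutoTokenizer

# The "fast" tokenizer wraps a Rust-backed tokenizers.Tokenizer object,
# which the pickle module cannot serialize.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
pickle.dumps(tokenizer)  # TypeError: can't pickle Tokenizer objects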

epwalsh (Member) commented Jun 25, 2020

That said, we are planning on making some changes to our data loading story soon. One of the proposed changes is to make Instances / Fields pure data objects - i.e. with no references to tokenizers, token indexers, or anything else - which would solve this particular issue without requiring the HuggingFace tokenizers to be pickle-able.
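
Something like the following (purely hypothetical, not AllenNLP's actual or planned API) illustrates the idea: if an Instance carries only data, pickling it between worker processes never touches a tokenizer:

from dataclasses import dataclass
from typing import List

@dataclass
class PureDataTextField:  # hypothetical "pure data" field
    tokens: List[str]     # just strings and ids, nothing else ...
    token_ids: List[int]  # ... so pickle has no Tokenizer to serialize

# Today's TextField instead holds a dict of TokenIndexers, and
# PretrainedTransformerIndexer wraps the unpicklable HuggingFace Tokenizer.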

JohnGiorgi (Contributor, Author) commented:

@epwalsh Gotcha, thanks for the detailed response.

For now, I will leave num_workers unset (I think I only set it to 1 in the first place because it gave me a small reduction in training time, but I don't remember exactly).

I will look out for the proposed changes to the Instance/Field objects :)
