This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

TypeError: can't pickle Tokenizer objects when num_workers > 0 and lazy = true #4399

Closed
JohnGiorgi opened this issue Jun 25, 2020 · 5 comments

JohnGiorgi (Contributor) commented Jun 25, 2020

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

I get a TypeError: can't pickle Tokenizer objects when trying to train a model that uses a PretrainedTransformerTokenizer with "dataset_reader.lazy": true and "data_loader.num_workers" > 0. This appears to happen on every version of AllenNLP after 1.0.0rc3 (specifically this commit), including the current master branch. The 1.0.0rc3 release and earlier releases do not have this issue.

The notes in #4344 seem to suggest that this has been fixed, but I can still trigger the error with a minimal example (see below).

Python traceback:

Traceback (most recent call last):
  File "/home/johnmg/t2t/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/scratch/johnmg/allennlp/allennlp/__main__.py", line 24, in run
    main(prog="allennlp")
  File "/scratch/johnmg/allennlp/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 112, in train_model_from_args
    dry_run=args.dry_run,
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 171, in train_model_from_file
    dry_run=dry_run,
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 295, in train_model
    nprocs=num_procs,
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 418, in _train_worker
    params=params, serialization_dir=serialization_dir, local_rank=process_rank,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
    **extras,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/scratch/johnmg/allennlp/allennlp/commands/train.py", line 647, in from_partial_objects
    data_loader_ = data_loader.construct(dataset=datasets["train"])
  File "/scratch/johnmg/allennlp/allennlp/common/lazy.py", line 46, in construct
    return self._constructor(**kwargs)
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 446, in constructor
    return value_cls.from_params(params=deepcopy(popped_params), **constructor_extras)
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 580, in from_params
    **extras,
  File "/scratch/johnmg/allennlp/allennlp/common/from_params.py", line 611, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 151, in from_partial_objects
    batches_per_epoch=batches_per_epoch,
  File "/scratch/johnmg/allennlp/allennlp/data/dataloader.py", line 90, in __init__
    self._data_generator = super().__iter__()
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/johnmg/t2t/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
    w.start()
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Tokenizer objects

Related issues or possible duplicates

Environment

OS:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Python version: 3.7.4

Output of pip freeze:

absl-py==0.7.1
aiohttp==3.6.2
alabaster==0.7.12
-e git+https://github.com/allenai/allennlp.git@b6fd6978b507ce6118023e23f3e4dbfa334d39b5#egg=allennlp
apex==0.1
appdirs==1.4.3
aspy.yaml==1.3.0
astor==0.8.1
async-timeout==3.0.1
atomicwrites==1.3.0
attrs==19.3.0
Babel==2.7.0
backcall==0.1.0
beautifulsoup4==4.8.2
black==19.10b0
bleach==3.1.0
blis==0.2.4
boto==2.49.0
boto3==1.10.9
botocore==1.13.9
cachetools==3.1.1
cc-net==0.1.0
certifi==2019.9.11
cffi==1.13.2
cfgv==2.0.1
chardet==3.0.4
click==7.1.1
codecov==2.0.15
conllu==2.3.2
coverage==4.5.4
cryptography==2.8
cycler==0.10.0
cymem==2.0.2
-e git+https://github.com/JohnGiorgi/t2t.git@5cc03ed58253e12bd1060f1fea2b89bae3acdb84#egg=declutr
decorator==4.4.1
dill==0.3.1.1
docutils==0.15.2
editdistance==0.5.2
en-core-web-sm==2.1.0
entrypoints==0.3
fastapi==0.58.0
fasttext==0.9.1
filelock==3.0.12
fire==0.2.1
flake8==3.7.9
flaky==3.6.1
Flask==1.1.1
Flask-Cors==3.0.8
ftfy==5.5.1
func-argparse==1.1.1
future==0.17.1
gast==0.2.2
gensim==3.8.1
getpy==0.9.9
gevent==1.4.0
google-auth==1.11.0
google-auth-oauthlib==0.4.1
google-pasta==0.1.8
greenlet==0.4.15
grpcio==1.25.0
h11==0.9.0
h5py==2.9.0
htmlmin==0.1.12
httptools==0.1.1
hypothesis==5.16.0
identify==1.4.10
idna==2.8
imagesize==1.1.0
importlib-metadata==0.23
ipython==7.10.1
ipython-genutils==0.2.0
isort==4.3.21
itsdangerous==1.1.0
jedi==0.15.1
jeepney==0.4.2
Jinja2==2.10.3
jmespath==0.9.4
joblib==0.14.0
jsmin==2.2.2
jsonnet==0.10.0
jsonpickle==1.2
jsonschema==3.0.2
kenlm==0.0.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
keyring==21.1.0
kiwisolver==1.1.0
livereload==2.6.1
lxml==4.4.1
Markdown==3.1.1
markdown-include==0.5.1
MarkupSafe==1.1.1
mathy-pydoc==0.6.7
matplotlib==3.0.3
maturin==0.8.1
mccabe==0.6.1
mkdocs==1.0.4
mkdocs-material==4.6.3
mkdocs-minify-plugin==0.2.1
more-itertools==7.2.0
multidict==4.5.2
murmurhash==0.28.0
mypy==0.770
mypy-extensions==0.4.3
nltk==3.4
nodeenv==1.3.4
numpy==1.16.3
numpydoc==0.8.0
oauthlib==3.1.0
opt-einsum==2.3.2
overrides==3.1.0
packaging==19.2
pandas==0.25.3
parsimonious==0.8.0
parso==0.5.1
pathspec==0.7.0
pep562==1.0
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.1
Pillow-SIMD==7.0.0.post3
pkginfo==1.5.0.1
plac==0.9.6
pluggy==0.13.0
pre-commit==2.2.0
preshed==2.0.1
prompt-toolkit==3.0.2
protobuf==3.10.0
ptyprocess==0.6.0
py==1.8.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.4.3
pycodestyle==2.5.0
pycparser==2.19
pydantic==1.5.1
pydoc-markdown==2.0.5
pyflakes==2.1.1
Pygments==2.4.2
pymdown-extensions==6.3
pyparsing==2.4.3
pyrsistent==0.15.3
pytest==5.2.2
pytest-cov==2.8.1
python-dateutil==2.8.0
-e git+https://github.com/KevinMusgrave/pytorch-metric-learning.git@48de2dd9c4d78873d675f19187c5205075a6a9de#egg=pytorch_metric_learning
pytz==2019.3
PyYAML==5.1.2
-e git+https://github.com/JohnGiorgi/QuickThought.git@397b8b18f3cc50a3471fe26f9725401fb2297816#egg=quickthought
readme-renderer==24.0
regex==2018.1.10
requests==2.22.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
responses==0.10.6
rsa==4.0
ruamel.yaml==0.16.5
ruamel.yaml.clib==0.2.0
s3transfer==0.2.1
sacremoses==0.0.35
scikit-learn==0.21.2
scipy==1.4.1
SecretStorage==3.1.2
semantic-version==2.8.4
sentence-splitter==1.4
sentence-transformers==0.2.6.1
sentencepiece==0.1.82
setuptools-rust==0.10.6
singledispatch==3.4.0.3
six==1.12.0
smart-open==1.8.4
snowballstemmer==2.0.0
sortedcontainers==2.2.2
soupsieve==2.0
spacy==2.1.4
Sphinx==2.2.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
sqlparse==0.3.0
srsly==0.0.5
starlette==0.13.4
tensorboard==1.15.0
tensorboardX==1.9
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0
tensorflow-hub==0.8.0
termcolor==1.1.0
Theano==1.0.1
thinc==7.0.4
tokenizers==0.7.0
toml==0.10.0
torch==1.5.0
torchvision==0.6.0+cu101
tornado==6.0.3
tqdm==4.37.0
traitlets==4.3.3
transformers==2.11.0
twine==3.1.1
typed-ast==1.4.1
typer==0.2.1
typing-extensions==3.7.4.1
Unidecode==1.1.1
urllib3==1.25.6
uvicorn==0.11.5
uvloop==0.14.0
virtualenv==16.7.9
wasabi==0.4.0
wcwidth==0.1.7
webencodings==0.5.1
websockets==8.1
Werkzeug==0.16.0
word2number==1.1
wrapt==1.11.2
yarl==1.4.2
zipp==0.6.0

Steps to reproduce

  1. Install a version of AllenNLP and AllenNLP-Models newer than 1.0.0rc3.
  2. Train a model which uses a PretrainedTransformerTokenizer with "dataset_reader.lazy": true and "data_loader.num_workers" > 0. E.g. I used this config with some overrides (see below).
Example source:

allennlp train mnli_roberta.jsonnet \
    --serialization-dir ./debug \
    --overrides "{'dataset_reader.lazy': true, 'data_loader.batch_sampler': null, 'data_loader.num_workers': 1}" \
    -f
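
For reference, the failing combination boils down to a config along these lines. This is a hypothetical minimal sketch, not the actual mnli_roberta.jsonnet; the reader type and model name are placeholders, and any reader that constructs a PretrainedTransformerTokenizer should trigger the same error:

// Hypothetical minimal Jsonnet config illustrating the failing combination
{
  "dataset_reader": {
    "type": "snli",
    "tokenizer": {"type": "pretrained_transformer", "model_name": "roberta-base"},
    "lazy": true
  },
  "data_loader": {"num_workers": 1, "batch_size": 32},
  // model, trainer, and data paths omitted for brevity
}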

JohnGiorgi added the bug label Jun 25, 2020
epwalsh (Member) commented Jun 25, 2020

Hi @JohnGiorgi, can you share your config? Are you using the num_workers option with your data loader?

epwalsh self-assigned this Jun 25, 2020
JohnGiorgi changed the title from "TypeError: can't pickle Tokenizer objects when distributed training with a lazy dataset reader." to "TypeError: can't pickle Tokenizer objects when num_workers > 1 and lazy = true" Jun 25, 2020
JohnGiorgi changed the title from "TypeError: can't pickle Tokenizer objects when num_workers > 1 and lazy = true" to "TypeError: can't pickle Tokenizer objects when num_workers > 0 and lazy = true" Jun 25, 2020
JohnGiorgi (Contributor, Author) commented Jun 25, 2020

Hi @epwalsh, yes, it looks like num_workers > 0 was the culprit here. I just noticed that the logger prints:

UserWarning: Using multi-process data loading without setting DatasetReader.manual_multi_process_sharding to True.
Did you forget to set this?
If you're not handling the multi-process sharding logic within your _read() method, there is probably no benefit to using more than one worker.

so maybe my issue is unnecessary and I should leave num_workers at its default? (I confirmed the error does not happen when num_workers is unset).

In any case, I have updated my original issue with a minimal example that triggers the error.

epwalsh (Member) commented Jun 25, 2020

Gotcha. Yeah, like the warning says, there is probably no benefit to using num_workers > 0 unless you implement some custom sharding logic within _read() to handle that (a hypothetical sketch follows).
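
For concreteness, here is what that custom logic could look like inside a reader's _read(). This is illustrative only, not the actual fix for this issue; it assumes the standard torch.utils.data.get_worker_info() API and AllenNLP 1.0's DatasetReader constructor flags:

import torch.utils.data
from allennlp.data import DatasetReader

class ShardedLineReader(DatasetReader):  # hypothetical reader
    def __init__(self) -> None:
        super().__init__(lazy=True, manual_multi_process_sharding=True)

    def _read(self, file_path: str):
        worker_info = torch.utils.data.get_worker_info()  # None in the main process
        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                # In a worker, keep only every num_workers-th line so that
                # each worker yields a disjoint shard of the data.
                if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                    continue
                # text_to_instance omitted; it would tokenize and build fields
                yield self.text_to_instance(line.strip())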

But even then, you'll probably still see this exception, which arises because each TextField within each of your data Instances includes a PretrainedTransformerIndexer, which itself wraps a HuggingFace Tokenizer object.

Now when the main process loading data needs to gather the Instances from the data loading workers, it uses pickle to communicate. But since HuggingFace Tokenizers currently can't be pickled, this error is raised.
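
You can reproduce the underlying limitation directly. A minimal sketch, assuming transformers==2.11.0 / tokenizers==0.7.0 as in the environment above (newer versions of these libraries may make fast tokenizers picklable):

import pickle
from transformers import AutoTokenizer

# The "fast" tokenizer wraps a Rust-backed tokenizers.Tokenizer object,
# which the pickle module cannot serialize.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
pickle.dumps(tokenizer)  # TypeError: can't pickle Tokenizer objects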

epwalsh (Member) commented Jun 25, 2020

That said, we are planning on making some changes to our data loading story soon. One of the proposed changes is to make Instances / Fields pure data objects - i.e. with no references to tokenizers, token indexers, or anything else - which would solve this particular issue without requiring the HuggingFace tokenizers to be pickle-able.
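
Something like the following (purely hypothetical, not AllenNLP's actual or planned API) illustrates the idea: if an Instance carries only data, pickling it between worker processes never touches a tokenizer:

from dataclasses import dataclass
from typing import List

@dataclass
class PureDataTextField:  # hypothetical "pure data" field
    tokens: List[str]     # just strings and ids, nothing else ...
    token_ids: List[int]  # ... so pickle has no Tokenizer to serialize

# Today's TextField instead holds a dict of TokenIndexers, and
# PretrainedTransformerIndexer wraps the unpicklable HuggingFace Tokenizer.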

JohnGiorgi (Contributor, Author) commented:

@epwalsh Gotcha, thanks for the detailed response.

For now, I will leave num_workers unset (I think I only set it to 1 in the first place because it gave me a small reduction in training time, but I don't remember exactly).

I will look out for the proposed changes to the Instance/Field objects :)
