Problem deserializing Tokenizer on Windows (spaCy 2.0.3) #1634

AurelienMassiot · 2017-11-23T13:35:47Z

Hi,
When I train a model with spaCy 2.0.3 in environment 1, everything works well: I can save it, load it, and use it.
However, when I try loading it in environment 2, I get the following error:

>>> spacy.load('my_model')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\spacy\__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 116, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 158, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Anaconda3\lib\site-packages\spacy\language.py", line 626, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 521, in from_disk
    reader(path / key)
  File "C:\Anaconda3\lib\site-packages\spacy\language.py", line 614, in <lambda>
    ('tokenizer', lambda p: self.tokenizer.from_disk(p, vocab=False)),
  File "tokenizer.pyx", line 364, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 399, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Anaconda3\lib\site-packages\spacy\util.py", line 500, in from_bytes
    msg = msgpack.loads(bytes_data, encoding='utf8')
  File "C:\Anaconda3\lib\site-packages\msgpack_numpy.py", line 187, in unpackb
    return _unpacker.unpackb(packed, encoding=encoding, **kwargs)
  File "msgpack/_unpacker.pyx", line 139, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:2068)
TypeError: unhashable type: 'list'

Environment 1: it works

* spaCy version      2.0.3
* Platform           Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core
* Python version     3.6.3
* Models             en

Environment 2: it doesn't work

* spaCy version      2.0.3
* Platform           Windows-2012Server-6.2.9200-SPO
* Python version     3.6.1
* Models             en

The 'en' models are installed on both, and the spaCy versions are the same. Could it be because of Windows? Or do you have any idea why I get this error?

Thanks a lot!


ines · 2017-11-23T13:46:52Z

Thanks for the report! It looks like something is going wrong when deserializing the tokenizer:

File "tokenizer.pyx", line 399, in spacy.tokenizer.Tokenizer.from_bytes

In any case, it looks like there might be a problem with the serialization of the tokenizer on Windows. Will look into this! To help us debug: Are you using any custom tokenization rules?

AurelienMassiot · 2017-11-23T14:01:31Z

Thanks for your quick answer.
As far as I know, I am not using any custom tokenization rules. The only things I do to train and save the model are:

Define the training data, for example:

import random
from pathlib import Path

import spacy

train_data = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

Load the model:

nlp = spacy.load("en")

Train a NER with a function pretty similar to the example from spaCy:

def train_ner(nlp, train_data, output_dir, nb_iterations=50, dropout=0.5):
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in train_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(nb_iterations):
            random.shuffle(train_data)
            losses = {}
            for text, annotations in train_data:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=dropout,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)

    # Save model
    if not Path(output_dir).exists():
        Path(output_dir).mkdir()
    nlp.to_disk(Path(output_dir))
    print("model saved to: {}".format(output_dir))

ines · 2017-11-23T14:20:32Z

Thanks – definitely looks like a serialization bug then.

The tests for this are currently incomplete, because the output of msgpack for the tokenizer turned out to be inconsistent, which made it hard to test the way we're testing the other components (e.g. by asserting that the msgpack before and after output are equal). But we should definitely adjust the tests to at least make sure the serialization roundtrip works, so we can test the Windows behaviour properly on Appveyor.
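A roundtrip check of the kind described above might look roughly like the sketch below. It assumes a hypothetical `ToyTokenizer` class in place of the real tokenizer and uses `json` as a stand-in for msgpack. It does not reproduce spaCy's test suite. Instead of comparing serialized bytes, which may be inconsistent, it compares the state after loading.

```python
import json


class ToyTokenizer:
    """Hypothetical stand-in for a serializable component (not spaCy's Tokenizer)."""

    def __init__(self, rules=None):
        self.rules = dict(rules or {})

    def to_bytes(self):
        # json is used here only as a stand-in for msgpack.
        return json.dumps(self.rules).encode("utf8")

    def from_bytes(self, data):
        self.rules = json.loads(data.decode("utf8"))
        return self


def roundtrip_ok(component):
    """Serialize, load into a fresh object, and compare state rather than bytes."""
    restored = ToyTokenizer().from_bytes(component.to_bytes())
    return restored.rules == component.rules


original = ToyTokenizer({"don't": ["do", "n't"], "U.S.": ["U.S."]})
print(roundtrip_ok(original))  # True
```

Comparing the restored state keeps the test meaningful even when two serializations of the same object are not byte-for-byte identical.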

eranhirs · 2018-03-28T10:29:25Z

I built a model a week ago and successfully loaded it on my Windows 10 machine with spaCy 2.0.7.

I'm not sure what changed, since I haven't run any pip installs in quite a while, but suddenly I get the same error when calling spacy.load the same way as before.

alexvy86 · 2018-05-30T21:00:58Z

Just to add another data point, we're seeing the same issue with spaCy 2.0.11: a custom model trained on one machine causes a TypeError: unhashable type: 'list' error when loaded on another. Retraining the model on the second machine makes everything work, so it sounds like a machine-specific "something" (?) might be getting used during serialization/deserialization. It reminded me of cookie encryption/decryption issues when a web server farm isn't configured to use the same encryption/decryption key.

alexvy86 · 2018-05-31T03:22:54Z

Strangely enough, a third computer was able to use the same model... I'm trying to figure out how machines 1 and 3 match and how machine 2 differs. I'll update the thread if I come up with something.

ghost · 2018-07-05T10:41:03Z

Has anyone found a solution other than adding a new data point or retraining the model on the computer?

alexvy86 · 2018-07-05T20:21:56Z

Not me, but coming back to this thread I just thought of something... in my case I'm putting the models in source control (git), so maybe the auto-handling of LF/CRLF characters is messing up the files? The machines where the models failed for us aren't mine so I can't check what their settings look like, but I'll ask the people who own them to check and try with different settings (basically, check-out as-is, commit as-is).

alexvy86 · 2018-07-05T21:14:07Z

Yep, in my case that was the problem! I fixed it by adding a .gitattributes file to the root of my repo, with something like this:

path/to/a/folder/with/a/spacy/model/** -text

That "unsets" the text attribute, telling git that it should not do CRLF conversion on any files under that path. Once that file is committed to the repo, the easiest solution is to clone the repository again. I also managed to fix the files by running rm .git/index followed by git reset --hard origin/<my-branch> (having the local version of <my-branch> checked out).

One last thing to consider is that the files might have been changed by git at commit time, in which case the model might need to be retrained and committed again after adding the .gitattributes file, so it doesn't get modified.
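To illustrate why a line-ending conversion breaks binary model files, here is a small sketch. The `simulate_autocrlf_checkout` helper is hypothetical and only mimics an LF to CRLF replacement. It does not call git, and git's own text detection decides which real files get converted.

```python
import struct


def simulate_autocrlf_checkout(data: bytes) -> bytes:
    """Mimic an LF -> CRLF conversion applied to a file wrongly treated as text."""
    return data.replace(b"\n", b"\r\n")


# A 4-byte little-endian integer whose first byte happens to be 0x0A ("\n").
original = struct.pack("<I", 10)
converted = simulate_autocrlf_checkout(original)

value_before = struct.unpack("<I", original)[0]
value_after = struct.unpack("<I", converted[:4])[0]

print(len(original), len(converted))  # 4 5
print(value_before, value_after)      # 10 2573
```

A single inserted byte shifts everything after it, so a binary decoder reading the converted file sees different values and lengths than the ones that were written.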


sachin-s-h · 2018-12-03T11:24:27Z

Hey, I faced the same issue too, and this is what fixed it for me. Follow the steps below to resolve the issue on Windows:

1. If you have cloned your repository, delete the clone.
2. Run this command in git: git config --global core.autocrlf false
3. Clone the repository again and re-run the code.

honnibal · 2018-12-06T14:45:37Z

tl;dr: run pip install "msgpack<0.6.0" and you should get everything fixed. Alternatively, update spaCy with pip install "spacy>=2.0.18"

The issue here is that the msgpack library has changed behaviour around this flag, use_list, and spaCy previously wasn't pinned to a precise enough version of the library to prevent breaking changes. This means that if you install older versions of spaCy, they cease to work, because you're getting a newly released version of msgpack that breaks our code.
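As a plain-Python illustration of the error message itself: a decoder that turns arrays into lists cannot use them as dictionary keys, while tuples work. The `build_mapping` function below is hypothetical and is not msgpack's implementation. Its `use_list` parameter only mirrors the name of the flag mentioned above.

```python
def build_mapping(pairs, use_list):
    """Toy decoder step: turn decoded (key, value) pairs into a dict.

    `use_list` is a stand-in for the msgpack option of the same name;
    this is an assumption-laden sketch, not msgpack's actual code.
    """
    convert = list if use_list else tuple
    result = {}
    for key, value in pairs:
        if isinstance(key, (list, tuple)):
            key = convert(key)  # arrays become lists or tuples
        result[key] = value     # fails if key is a list
    return result


pairs = [(("a", "b"), 1), ("c", 2)]

# Tuples are hashable, so array-like keys are fine.
as_tuples = build_mapping(pairs, use_list=False)

# Lists are not hashable, so the same data raises a TypeError.
try:
    build_mapping(pairs, use_list=True)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(as_tuples)
print(error_message)
```

The printed message mentions an unhashable 'list', matching the TypeError in the traceback at the top of this thread.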

To stop this happening we're now switching our dependencies to our own fork of msgpack and other serialisation utilities, which we're shipping in a library called srsly. We have this ready to release on spacy-nightly.

lock · 2019-01-05T14:55:24Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines changed the title ~~Problem loading a model in spaCy 2.0.3~~ Problem dererializing Tokenizer on Windows (spaCy 2.0.3) Nov 23, 2017

ines changed the title ~~Problem dererializing Tokenizer on Windows (spaCy 2.0.3)~~ Problem deserializing Tokenizer on Windows (spaCy 2.0.3) Nov 23, 2017

ines added the windows Issues related to Windows label Nov 23, 2017

ines added bug Bugs and behaviour differing from documentation tests New, missing or incorrect tests labels Nov 23, 2017

ines added the feat / serialize Feature: Serialization, saving and loading label Mar 27, 2018

alexvy86 added a commit to alexvy86/spaCy that referenced this issue Jul 6, 2018

Guidance to handle binary files in git in Windows

9c3a84e

Adds guidance on what to do if users encounter the error described in [1634](explosion#1634), which probably only happens in Windows environments.

alexvy86 added a commit to alexvy86/spaCy that referenced this issue Jul 6, 2018

Guidance to handle binary files in git in Windows

c3dab59

Adds guidance on what to do if users encounter the error described in [1634](explosion#1634), which probably only happens in Windows environments.

alexvy86 added a commit to alexvy86/spaCy that referenced this issue Jul 6, 2018

Guidance to handle binary files in git in Windows

05a65d9

Adds guidance on what to do if users encounter the error described in [1634](explosion#1634), which probably only happens in Windows environments.

alexvy86 mentioned this issue Jul 6, 2018

Guidance to handle binary files in git in Windows #2526

Merged

3 tasks

ines pushed a commit that referenced this issue Jul 9, 2018

Guidance to handle binary files in git in Windows (#2526)

bd35bf7

Adds guidance on what to do if users encounter the error described in [1634](#1634), which probably only happens in Windows environments.

theoldhat mentioned this issue Oct 17, 2018

Error Loading Model into spaCy #2861

Closed

ines closed this as completed Dec 6, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 5, 2019
