
GH-438: added byte pair embeddings #473

Merged
alanakbik merged 1 commit into master from GH-438-bp-embeddings on Feb 9, 2019

Conversation

alanakbik
Collaborator

Add byte pair embeddings to Flair.

@gccome

gccome commented Feb 8, 2019

Thanks for adding this! Can we use BPEmb to do sequence tagging? If so, how do we get the tag for each token that is split into multiple subwords?

@alanakbik
Collaborator Author

Yes, you can use it for sequence tagging. In the current implementation, the word embedding is constructed as a concatenation of the embeddings of the first and last subword. Alternatively, we could try other approaches, such as pooling all subword embeddings or using only the last subword.
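As a rough sketch (not Flair's internal code), here is what the first/last-subword construction looks like when using the bpemb package directly; the pooling alternative is shown as a comment:

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=10000, dim=50)

def word_vector(word: str) -> np.ndarray:
    subwords = bpemb_en.embed(word)                      # shape: (num_subwords, 50)
    # concatenate first and last subword embedding -> 100-dim word vector
    return np.concatenate([subwords[0], subwords[-1]])
    # alternative: mean-pool over all subwords instead
    # return subwords.mean(axis=0)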

To use the embeddings when training a sequence labeler, do this:

from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import BytePairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

# 2. what tag do we want to predict?
tag_type = 'upos'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embeddings = BytePairEmbeddings(language='en')

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
print(tagger)

# 6. train model
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ud_bpe')

@stefan-it
Member

@alanakbik I'm currently running my UD Basque experiment with BPE embeddings and will report some results here (it's also a nice playground for testing different numbers of merge operations and their impact on the downstream task). For example, there's a nice paper on machine translation that compares different merge operations for a given amount of parallel sentences :)
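(As a rough sketch of what such a comparison could look like, assuming the bpemb package and that the chosen vocabulary sizes are among the ones it publishes:)

from bpemb import BPEmb

# compare Basque segmentations for different numbers of BPE merge operations
for vs in [3000, 10000, 25000, 100000]:
    bpemb_eu = BPEmb(lang="eu", vs=vs, dim=100)
    print(vs, bpemb_eu.encode("Donostiako"))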

@alanakbik
Collaborator Author

@stefan-it great, thanks - I'd be very interested to hear which parameters work best in your experience. We could then use them as default params in the BytePairEmbeddings class.

alanakbik merged commit fcac0e1 into master on Feb 9, 2019
@gccome

gccome commented Feb 9, 2019

@alanakbik Thanks Alan! This is great!

kashif deleted the GH-438-bp-embeddings branch on February 12, 2019 at 15:05
@alanakbik
Collaborator Author

@bheinzerling we noticed that this serialization workaround has some problems, see #504

@bheinzerling

bheinzerling commented Feb 15, 2019

Ah, that's annoying. I cannot find a way to properly serialize SWIG objects (something like this doesn't work either).

As a further workaround, the SentencePiece model is now downloaded to a temp directory during deserialization if it isn't already in the cache:

https://github.com/bheinzerling/bpemb/blob/54259b99a7b61100ecfb30c2d72bf590b4726297/bpemb/bpemb.py#L438

I'll push this change to pip later.
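(A minimal sketch of the workaround being described, with illustrative names; the actual change is the bpemb commit linked above:)

import tempfile
from pathlib import Path

import sentencepiece as spm

def download_model_to(directory: Path) -> Path:
    # hypothetical placeholder for bpemb's real download logic
    raise NotImplementedError("fetch the .model file from the bpemb release here")

class BPEmbLike:
    # illustrative stand-in for bpemb's BPEmb class, not the real code
    def __init__(self, model_file: str):
        self.model_file = Path(model_file)
        self.spm = spm.SentencePieceProcessor()
        self.spm.Load(str(self.model_file))

    def __getstate__(self):
        # the SentencePieceProcessor is a SWIG object and cannot be pickled,
        # so it is dropped from the pickled state
        state = self.__dict__.copy()
        state.pop("spm", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not self.model_file.exists():
            # re-download the model into a temp directory if it is not cached
            self.model_file = download_model_to(Path(tempfile.mkdtemp()))
        self.spm = spm.SentencePieceProcessor()
        self.spm.Load(str(self.model_file))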

@alanakbik
Collaborator Author

Ah great - I think you would also need to update the cache_dir variable when deserializing, otherwise it will use the cache_dir of the machine the model was serialized on and potentially try to download the model into a nonexistent cache_dir.

https://github.com/zalandoresearch/flair/blob/4d4a3a4d93fda6fc20c6bd30950891c02a2bd167/flair/embeddings.py#L268
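(Continuing the sketch above, the suggested fix would amount to resetting cache_dir right after unpickling, before any re-download is attempted; the helper and attribute names here are illustrative, not Flair's actual code:)

from pathlib import Path

def reset_cache_dir(embedding) -> None:
    # hypothetical helper: point the unpickled embedding's cache_dir at the
    # current machine's cache instead of the path serialized elsewhere
    embedding.cache_dir = Path.home() / '.flair' / 'embeddings'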

@bheinzerling

You're right, thanks! Hopefully this is the last fix for serialization now.
