
GH-438: added byte pair embeddings #473

Merged
alanakbik merged 1 commit into master from GH-438-bp-embeddings on Feb 9, 2019

Conversation

alanakbik
Collaborator

Add byte pair embeddings to Flair.

@gccome

gccome commented Feb 8, 2019

Thanks for adding this! Can we use BPEmb to do sequence tagging? If so, how do we get the tag for each token that is split into multiple subwords?

@alanakbik
Collaborator Author

Yes, you can use it for sequence tagging. In the current implementation, the word embedding is constructed as a concatenation of the embeddings of the first and last subword. Alternatively, we could try other approaches, such as pooling all subword embeddings or using only the last subword.
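As a rough sketch (not Flair's internal code), here is what the first/last-subword construction looks like when using the bpemb package directly; the pooling alternative is shown as a comment:

import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=10000, dim=50)

def word_vector(word: str) -> np.ndarray:
    subwords = bpemb_en.embed(word)                      # shape: (num_subwords, 50)
    # concatenate first and last subword embedding -> 100-dim word vector
    return np.concatenate([subwords[0], subwords[-1]])
    # alternative: mean-pool over all subwords instead
    # return subwords.mean(axis=0)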

To use the embeddings when training a sequence labeler, do this:

from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import BytePairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

# 2. what tag do we want to predict?
tag_type = 'upos'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embeddings = BytePairEmbeddings(language='en')

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
print(tagger)

# 6. train model
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ud_bpe')

@stefan-it
Member

@alanakbik I'm currently running my UD Basque experiment with BPE embeddings and will report some results here (it's also a nice playground for testing different numbers of merge operations and their impact on the downstream task). For example, there's a nice paper on machine translation that compares different merge operations for a given amount of parallel sentences :)
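(As a rough sketch of what such a comparison could look like, assuming the bpemb package and that the chosen vocabulary sizes are among the ones it publishes:)

from bpemb import BPEmb

# compare Basque segmentations for different numbers of BPE merge operations
for vs in [3000, 10000, 25000, 100000]:
    bpemb_eu = BPEmb(lang="eu", vs=vs, dim=100)
    print(vs, bpemb_eu.encode("Donostiako"))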

@alanakbik
Collaborator Author

@stefan-it great, thanks - I'd be very interested to hear which parameters work best in your experience. We could then use them as default params in the BytePairEmbeddings class.

alanakbik merged commit fcac0e1 into master on Feb 9, 2019
@gccome

gccome commented Feb 9, 2019

@alanakbik Thanks Alan! This is great!

kashif deleted the GH-438-bp-embeddings branch on February 12, 2019 at 15:05
@alanakbik
Collaborator Author

@bheinzerling we noticed that this serialization workaround has some problems, see #504

@bheinzerling

bheinzerling commented Feb 15, 2019

Ah, that's annoying. I cannot find a way to properly serialize SWIG objects (something like this doesn't work either).

As a further workaround, the SentencePiece model is now downloaded to a temp directory during deserialization if it isn't already in the cache:

https://github.com/bheinzerling/bpemb/blob/54259b99a7b61100ecfb30c2d72bf590b4726297/bpemb/bpemb.py#L438

I'll push this change to pip later.
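(A minimal sketch of the workaround being described, with illustrative names; the actual change is the bpemb commit linked above:)

import tempfile
from pathlib import Path

import sentencepiece as spm

def download_model_to(directory: Path) -> Path:
    # hypothetical placeholder for bpemb's real download logic
    raise NotImplementedError("fetch the .model file from the bpemb release here")

class BPEmbLike:
    # illustrative stand-in for bpemb's BPEmb class, not the real code
    def __init__(self, model_file: str):
        self.model_file = Path(model_file)
        self.spm = spm.SentencePieceProcessor()
        self.spm.Load(str(self.model_file))

    def __getstate__(self):
        # the SentencePieceProcessor is a SWIG object and cannot be pickled,
        # so it is dropped from the pickled state
        state = self.__dict__.copy()
        state.pop("spm", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not self.model_file.exists():
            # re-download the model into a temp directory if it is not cached
            self.model_file = download_model_to(Path(tempfile.mkdtemp()))
        self.spm = spm.SentencePieceProcessor()
        self.spm.Load(str(self.model_file))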

@alanakbik
Collaborator Author

Ah great - I think you would also need to update the cache_dir variable when deserializing, otherwise it will use the cache_dir of the machine the model was serialized on and potentially try to download the model into a nonexistent cache_dir.

https://github.com/zalandoresearch/flair/blob/4d4a3a4d93fda6fc20c6bd30950891c02a2bd167/flair/embeddings.py#L268
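(Continuing the sketch above, the suggested fix would amount to resetting cache_dir right after unpickling, before any re-download is attempted; the helper and attribute names here are illustrative, not Flair's actual code:)

from pathlib import Path

def reset_cache_dir(embedding) -> None:
    # hypothetical helper: point the unpickled embedding's cache_dir at the
    # current machine's cache instead of the path serialized elsewhere
    embedding.cache_dir = Path.home() / '.flair' / 'embeddings'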

@bheinzerling

You're right, thanks! Hopefully this is the last fix for serialization now.
