GH-438: added byte pair embeddings #473
Conversation
Thanks for adding this! Can we use BPEmb to do sequence tagging? If so, how do we get the tag for each token that is split into multiple subwords?
Yes, you can use it for sequence tagging. In the current implementation, the word embedding is constructed as a concatenation of the first and last subword embeddings. Alternatively, we could try other strategies, such as pooling all subword embeddings or using only the last subword. To use the embeddings when training a sequence labeler, do this:

```python
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import BytePairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)

# 2. what tag do we want to predict?
tag_type = 'upos'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# 4. initialize embeddings
embeddings = BytePairEmbeddings(language='en')

# 5. initialize sequence tagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
print(tagger)

# 6. train model
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ud_bpe')
```
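The first-and-last subword pooling described above can be sketched in plain NumPy. This is a minimal illustration, not the actual Flair internals: `first_last_concat` and the 50-dimensional shapes are made up for the example.

```python
import numpy as np

def first_last_concat(subword_vectors: np.ndarray) -> np.ndarray:
    """Build one word embedding by concatenating the first and last
    subword vectors (shape (n_subwords, dim) -> shape (2 * dim,))."""
    return np.concatenate([subword_vectors[0], subword_vectors[-1]])

# a word split into 3 subwords, each with a 50-dim embedding
subwords = np.random.rand(3, 50)
word_vec = first_last_concat(subwords)
print(word_vec.shape)  # (100,)
```

A word that is not split at all simply has its single subword vector duplicated, so every token ends up with an embedding of the same fixed size.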
@alanakbik I'm currently running my UD Basque experiment with BPE embeddings and will report some results here (it's also a nice playground for testing different merge operations and their impact on the downstream task). For example, here's a nice paper for machine translation that uses different merge operations for a specific amount of parallel sentences :)
@stefan-it great, thanks - I'd be very interested to hear which parameters work best in your experience. We could then use them as default params in the BytePairEmbeddings class.
@alanakbik Thanks Alan! This is great! |
@bheinzerling we noted that this workaround for the serialization has some problems, see #504 |
Ah, that's annoying. I cannot find a way to properly serialize SWIG objects (something like this doesn't work either). As a further workaround, the sentence piece model is now downloaded to a temp directory during deserialization if it isn't already in the cache. I'll push this change to pip later.
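The general pattern behind such a workaround can be sketched as follows. This is only an illustration, not the actual Flair code: `ModelWrapper` and `_load_handle` are hypothetical stand-ins for the embedding class and its sentencepiece loader. The idea is to drop the non-picklable SWIG handle in `__getstate__` and re-create it from the cached model file in `__setstate__`.

```python
import pickle

class ModelWrapper:
    """Sketch: hold a non-picklable handle (e.g. a SWIG sentencepiece
    object) and re-create it from disk after unpickling."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self.handle = self._load_handle(model_path)

    @staticmethod
    def _load_handle(path: str):
        # hypothetical stand-in for loading the SWIG/sentencepiece model
        return {"loaded_from": path}

    def __getstate__(self):
        state = self.__dict__.copy()
        state["handle"] = None  # the SWIG object cannot be pickled
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # re-load from the cached (or freshly downloaded) model file
        self.handle = self._load_handle(self.model_path)

w = ModelWrapper("/tmp/en.model")
w2 = pickle.loads(pickle.dumps(w))
print(w2.handle)  # {'loaded_from': '/tmp/en.model'}
```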
Ah great - I think you would also need to update the |
You're right, thanks! Hopefully this is the last fix needed for serialization.
Add byte pair embeddings to Flair.