
Multi-label classification not working? #678

Closed
gwohlgen opened this issue Apr 20, 2019 · 23 comments
Labels
bug Something isn't working wontfix This will not be worked on

Comments

@gwohlgen

gwohlgen commented Apr 20, 2019

No description provided.

@gwohlgen gwohlgen added the bug Something isn't working label Apr 20, 2019
@gwohlgen
Author

gwohlgen commented Apr 20, 2019

Hi,
I am trying to make multi-label classification work with the dataset used in this fasttext tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html.

The problem is that no matter which embeddings and hyperparameters I use, training quickly goes toward 0.000 F1 / acc:
2019-04-20 23:31:15,587 EPOCH 4 done: loss 0.0009 - lr 0.1000 - bad epochs 0
2019-04-20 23:31:30,078 DEV : loss 0.00073111 - f-score 0.0000 - acc 0.0000
2019-04-20 23:31:44,417 TEST : loss 0.00073590 - f-score 0.0000 - acc 0.0000

Maybe the problem is that the dataset has a high number of labels, some with a frequency as low as 1?

Full code and logs here: https://github.com/gwohlgen/misc/blob/master/classifier__multi-label.ipynb

I split it into train/dev/test, e.g.:

$ head cooking.train
__label__sauce __label__cheese how much does potato starch affect a cheese sauce recipe ? 
__label__food-safety __label__acidity dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove how do i cover up the white spots on my cast iron stove ? 
__label__restaurant michelin three star restaurant; but if the chef is not there
__label__knife-skills __label__dicing without knife skills ,  how can i quickly and accurately dice vegetables ? 
__label__storage-method __label__equipment __label__bread what ' s the purpose of a bread box ? 
.....

Looks fine.

Then created corpus etc:

from flair.data_fetcher import NLPTaskDataFetcher
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings, CharacterEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from pathlib import Path

data_path = '/home/wohlg/itmo/misc/cooking_classification/preprocessed'
corpus = NLPTaskDataFetcher.load_classification_corpus(Path(data_path), 
                                                       test_file='cooking.test', 
                                                       dev_file='cooking.valid', 
                                                       train_file='cooking.train')

word_embeddings = [WordEmbeddings('glove'), 
                   FlairEmbeddings('news-forward-fast'), 
                   FlairEmbeddings('news-backward-fast')]

document_embeddings = DocumentLSTMEmbeddings(word_embeddings, 
                                             hidden_size=512, 
                                             reproject_words=True, 
                                             reproject_words_dimension=256)

Still looks good:

print(corpus.obtain_statistics())
TaggedCorpus: 12404 train + 1500 dev + 1500 test sentences
{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 12404,
        "number_of_documents_per_class": {
            "sauce": 332,
            "cheese": 235,
            "food-safety": 967,
            "acidity": 33,
            "cast-iron": 111,
....
[all other stats, also for test and dev]

Finally training:

classifier = TextClassifier(document_embeddings, 
                            label_dictionary=corpus.make_label_dictionary(), 
                            multi_label=True)

trainer = ModelTrainer(classifier, corpus)

trainer.train('/tmp', max_epochs=20)

In training, the loss improves, but acc / F1 quickly goes to 0.000.
And finally, predicting with the learned model doesn't work; it just returns an empty set of labels [],
so my guess is that flair for some reason learns to predict an empty label set -- but why?

Did anyone else try to train on the fasttext tutorial dataset? With success?


@gwohlgen
Author

In order to make sure that flair is not overwhelmed by many low-frequency classes I made a simplified dataset for multi-label classification with only the 30 most frequent classes, and re-did the experiments, see here: https://github.com/gwohlgen/misc/blob/master/classifier__multi-label-simple.ipynb

But the same problem persists. :(

@stefan-it
Member

@gwohlgen Have you used the latest master of flair (recently, there was a softmax bug fix there) 🤔

@gwohlgen
Author

gwohlgen commented Apr 21, 2019

@stefan-it Hello Stefan, I used the latest pip version (0.4.1). Does the softmax bug still exist in that version?

@gwohlgen
Author

@stefan-it Just cloned the latest version from GitHub, but the problem persists.
Did anyone try multi-label classification with flair? Is there a working example somewhere? That would help a lot in finding the problem.

@alanakbik
Collaborator

Hello @gwohlgen - thanks for reporting this and thanks in particular for sharing all details to reproduce the experiment. I unfortunately get the same results so something does not seem to be working.

We use multi-label classification on a set of internal problems - to double-check I've just rerun the training on one of our multi-label datasets with the current master branch and everything seems to be working. So somehow it does not work on the cooking dataset whereas it works on ours. I'll take a closer look and let you know if I find anything. Please also let us know should you find out anything else.

@abishekk92

I ran into the same issue while using flair for a multi-label classification task; the empty labels seem to be due to the confidence value check. It would be good if somebody has a fix; otherwise I can attempt a patch.
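To make the suspected failure mode concrete, here is a minimal pure-Python sketch of that kind of confidence check (the 0.5 threshold and the function names are assumptions for illustration, not flair's actual code): labels whose sigmoid score stays below the threshold are dropped, so a model whose scores hover under 0.5 for every class returns an empty label list.

```python
import math

def predict_labels(logits, labels, threshold=0.5):
    """Keep only the labels whose sigmoid confidence clears the threshold."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [lab for lab, s in zip(labels, scores) if s > threshold]

# A model biased toward "no label" keeps every score just under 0.5,
# so the predicted label set comes back empty.
print(predict_labels([-0.2, -1.3, -0.05], ["sauce", "cheese", "bread"]))  # -> []
```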

@gwohlgen
Author

@alanakbik @abishekk92 I would highly appreciate any attempt to solve the problem :)

@alanakbik
Collaborator

@abishekk92 @gwohlgen the confidence value check does not seem to be the problem in this case, although we want to change the check for the next version. But even when removing the check a model trained on the cooking dataset does not predict anything well. We are still looking into why this is the case. It could be a bug, or even general inapplicability of this type of model to this type of task.

@prabhatM

Hi,
I have struggled with the same problems for the last 2 months. Today, I realized I am not the only one. I was losing confidence in myself!!!

@prabhatM

I was feeling really bad because I had painstakingly developed a big multi-label dataset for our domain, and it was a real letdown when, after days of training, I started getting an F1 of 0.

I am glad you guys have started looking at the issue proactively.

@gwohlgen
Author

Hi @prabhatM .. yes, I also hope it will be fixed soon; I am curious to see how well flair works on multi-label classification ..

@alanakbik
Collaborator

Just a quick update: we are still looking into this and some other classification-related issues (see #709). Unfortunately we haven't found the error yet, but fixed a bunch of smaller things and implemented more baselines (PRs coming soon). Hopefully we find out what the problem is soon.

@collinpu

I am having the same problem!

@collinpu

I may have a hacky fix. I changed the loss function from BCELoss to BCEWithLogitsLoss and used a large positive pos_weight vector to bias the model away from predicting all nulls. My intuition is that there is a huge class imbalance between the labels seen in each sample and the labels not seen in each sample, with the latter being much larger, so the model may have been converging to a local minimum that just always predicted no labels. At least this is the case in my data; not sure if this will help everyone.
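To make that intuition concrete, here is a pure-Python sketch of the single-element BCEWithLogitsLoss formula from the PyTorch docs: pos_weight multiplies only the positive term, so a missed true label is penalized more heavily than a spurious negative (the example values are made up).

```python
import math

def bce_with_logits(logit, target, pos_weight=1.0):
    """Per-element BCEWithLogitsLoss: -[p * y * log(s) + (1 - y) * log(1 - s)]."""
    s = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the raw logit
    return -(pos_weight * target * math.log(s) + (1 - target) * math.log(1 - s))

# Missing a true label (target=1 but a low score) hurts 5x more with
# pos_weight=5, which pushes the model away from the "predict nothing" minimum.
plain    = bce_with_logits(-2.0, 1.0)                  # ~2.13
weighted = bce_with_logits(-2.0, 1.0, pos_weight=5.0)  # ~10.63
print(plain, weighted)
```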

@alanakbik
Collaborator

Hello @collinpu, that's interesting - could you provide more details? How/where did you provide the pos_weight vector? Perhaps we could try this for these problems.

@collinpu

You initialize BCEWithLogitsLoss with the pos_weight vector you want to use. See https://pytorch.org/docs/stable/nn.html.

Something to note: by biasing the model in this way, you need to be careful not to make the pos_weight values too large, or they will over-bias the model and cause it to over-predict the existence of labels. You'll see this if your recall is very high but your precision is low.
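One common heuristic for picking those values (an assumption on my part, not something confirmed in this thread) is to set each class's pos_weight to negatives/positives in the training set, capped so that very rare classes don't over-bias the model. A sketch, using class counts from the corpus statistics posted above (12404 training docs):

```python
def pos_weights_from_counts(label_counts, total_docs, cap=50.0):
    """Per-class pos_weight = negatives / positives, capped to limit over-biasing."""
    return {lab: min((total_docs - n) / n, cap) for lab, n in label_counts.items()}

# "acidity" appears only 33 times, so its raw ratio (~375) gets capped at 50;
# "food-safety" is common enough to keep its raw ratio (~11.8).
print(pos_weights_from_counts({"food-safety": 967, "acidity": 33}, 12404))
```

If precision collapses with these weights, lower the cap and retrain.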

@bealjm823

Hi everyone. I was wondering if anyone is still having the issues described above with the multi-label data. I'm having the same issues as described by @gwohlgen. I see there was a merge by @alanakbik toward classification improvements. Do we need to make the change suggested by @collinpu manually? Thanks beforehand for any guidance.

@tombburnell

I'm getting an empty list of labels [] too when using multiple tags - particularly with lots of tags and a short body.
With a longer body I do get results, but typically the results are not good and all score just above 0.5.

@paragkr007

Hello everyone,

I am also facing the same issue and getting a score of 0.0 for multi_label=True classification.
Hoping that it will be fixed soon.

MICRO_AVG: acc 0.0 - f1-score 0.0
MACRO_AVG: acc 0.0 - f1-score 0.0

Thanks.

@stale

stale bot commented Apr 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 29, 2020
@stale stale bot closed this as completed May 6, 2020
@alanrios2001

I'm having this problem when training with torch's Adam optimizer; using MADGRAD, the f1-score works just fine...

@None-Such

@alanrios2001 - Would be most grateful if you could post a code snippet illustrating how to set up training with the MADGRAD optimizer instead of torch's Adam.

Best Regards,
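The thread never got a reply, but the swap might look something like the untested sketch below. It assumes the `madgrad` pip package (pip install madgrad) and that your flair version's ModelTrainer.train accepts a torch-style optimizer class via an `optimizer` argument; check the signature of your installed version, since the trainer API has changed across releases.

```python
def train_with_madgrad(classifier, corpus, out_dir="/tmp/cooking-madgrad"):
    """Hypothetical helper: train a flair TextClassifier with MADGRAD instead of Adam."""
    from madgrad import MADGRAD          # pip install madgrad
    from flair.trainers import ModelTrainer

    trainer = ModelTrainer(classifier, corpus)
    # Pass the optimizer class; flair instantiates it with the model parameters.
    trainer.train(out_dir,
                  optimizer=MADGRAD,
                  learning_rate=1e-2,
                  max_epochs=20)
```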
