Simplify boilerplate for monoT5 and monoBERT #83

yuxuan-ji · 2020-09-10T19:50:52Z

closes #80

usage is like so:

from pygaggle.rerank.pretrained import monoBERT, monoT5
reranker = monoBERT()
or
reranker = monoBERT('model-name', 'tokenizer-name')
or
reranker = monoBERT(model, tokenizer)

tested outputs are the same as https://github.com/castorini/pygaggle#a-simple-reranking-example

I was a bit confused on the goal of these two functions, see my comment here: #80 (comment)

rodrigonogueira4

Great, thanks for implementing this!

rodrigonogueira4 · 2020-09-10T19:54:25Z

Can you also change the README accordingly?

lintool · 2020-09-10T20:02:58Z

hey @rodrigonogueira4 - do you prefer this impl or the alternative of folding the boilerplate code into constructors for existing models? e.g., https://github.com/castorini/pygaggle/blob/master/pygaggle/rerank/transformer.py#L25

rodrigonogueira4 · 2020-09-10T20:16:05Z

hey @rodrigonogueira4 - do you prefer this impl or the alternative of folding the boilerplate code into constructors for existing models? e.g., https://github.com/castorini/pygaggle/blob/master/pygaggle/rerank/transformer.py#L25

Yes, it is actually better to rename T5Reranker to monoT5 and add the option to initialize it without arguments.

yuxuan-ji · 2020-09-10T21:17:11Z

@rodrigonogueira4 @lintool which do you prefer?

folding into constructors:

class T5Reranker(Reranker):
    def __init__(self,
                 model: T5ForConditionalGeneration = None,
                 tokenizer: QueryDocumentBatchTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco').to(device).eval()
        self.model = model

        if not tokenizer:
            tokenizer = T5BatchTokenizer(AutoTokenizer.from_pretrained('t5-base'), batch_size=8)
        self.tokenizer = tokenizer

        self.device = next(self.model.parameters(), None).device


class SequenceClassificationTransformerReranker(Reranker):
    def __init__(self,
                 model: PreTrainedModel,
                 tokenizer: PreTrainedTokenizer):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco').to(device).eval()
        self.model = model

        if not tokenizer:
            tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
        self.tokenizer = tokenizer

        self.device = next(model.parameters()).device

making subclasses:

class MonoT5(T5Reranker):
    def __init__(self,
                 model: T5ForConditionalGeneration = None,
                 tokenizer: QueryDocumentBatchTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco').to(device).eval()
        if not tokenizer:
            tokenizer = T5BatchTokenizer(AutoTokenizer.from_pretrained('t5-base'), batch_size=8)
        super().__init__(model, tokenizer)

class MonoBERT(SequenceClassificationTransformerReranker):
    def __init__(self,
                 model: T5ForConditionalGeneration = None,
                 tokenizer: QueryDocumentBatchTokenizer = None):
        if not model:
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            model = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco').to(device).eval()
        if not tokenizer:
            tokenizer = AutoTokenizer.from_pretrained('bert-large-uncased')
        super().__init__(model, tokenizer)

Option 2 is in case T5Reranker & SequenceClassificationTransformerReranker have some lower level meaning we'd like to keep, but I don't mind option 1 if MonoBERT & MonoT5 mean the same thing

lintool · 2020-09-10T21:48:57Z

My vote is for option 1, but renaming the classes monoT5 and monoBERT. I know this is against Python conventions, but is started with lowercase "m" really that bad? Makes it consistent with our papers, etc.

@ronakice should chime in also...

Also, once we build these abstractions we should propagate to the replications also, e.g.,:
https://github.com/castorini/pygaggle/blob/master/pygaggle/run/evaluate_passage_ranker.py

ronakice · 2020-09-10T22:05:35Z

I agree with @lintool, lowercase "m" as it is consistent with our previous work!

lintool · 2020-09-10T22:06:57Z

I agree with @lintool, lowercase "m" as it is consistent with our previous work!

But re: Option 1 vs. Option 2? I.e., is there something special about our current abstractions that we should keep?

yuxuan-ji · 2020-09-10T22:34:26Z

the main things I can think of with using lowercase are

pretty much every linter will complain unless it's added as an exception
it's not exactly clear from seeing from pygaggle.rerank.transformers import monoBERT, monoT5 that those are classes

I'd avoid it unless the consistency w/ paper is particularly important and worth ignoring Python's conventions

As for the current abstractions, I was mainly concerned about SequenceClassificationTransformerReranker; the way it's written seems like it was intended to work with any model architecture supported by AutoModelForSequenceClassification i.e. you could use a distilbert, roberta, etc. model. If so then I think calling it monoBERT wouldn't exactly fit its usage

ronakice · 2020-09-12T10:54:16Z

Hey @yuxuan-ji sorry I got a bit carried away with some other work.

Firstly, yes I think option 1 is better. You make a fair point about linters complaining about monoBERT/monoT5 as well as the lack of clarity to the general dev. So I concede, I think it is better to go with MonoBERT and MonoT5. @lintool what do you think?

As to the usage of MonoBERT in the general form, I think it is mostly fine since all these other Sequence Classification transformers are kinda BERT derived and use the same "principles", unlike monoT5 which signified a bit of a shift. Also, we currently only support MonoBERT so until we find a better name :)/add support for other models, we shouldn't worry about it.

lintool · 2020-09-13T02:13:54Z

Okay, we've converged. Option 1, MonoBERT and MonoT5 as class names.

FWIW - huggingface made the model names fugly, favoring conformance to conventions - see Bert and MBart.

@yuxuan-ji please execute.

ronakice

LGTM! Merging! Thanks @yuxuan-ji for swiftly finishing this :)

Simplify boilerplate for monoT5 and monoBERT

b60b42f

rodrigonogueira4 approved these changes Sep 10, 2020

View reviewed changes

Fold into constructors

26de4a6

Capitalize class names

5a6a0fb

yuxuan-ji force-pushed the simplify-boilerplate branch from 8a636c7 to 5a6a0fb Compare September 13, 2020 03:52

ronakice approved these changes Sep 13, 2020

View reviewed changes

ronakice merged commit 41513a9 into castorini:master Sep 13, 2020

lintool mentioned this pull request Sep 13, 2020

MonoBERT and MonoT5 defaults #86

Closed

yuxuan-ji deleted the simplify-boilerplate branch September 15, 2020 18:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify boilerplate for monoT5 and monoBERT #83

Simplify boilerplate for monoT5 and monoBERT #83

yuxuan-ji commented Sep 10, 2020 •

edited

Loading

rodrigonogueira4 left a comment

rodrigonogueira4 commented Sep 10, 2020

lintool commented Sep 10, 2020

rodrigonogueira4 commented Sep 10, 2020

yuxuan-ji commented Sep 10, 2020 •

edited

Loading

lintool commented Sep 10, 2020

ronakice commented Sep 10, 2020

lintool commented Sep 10, 2020

yuxuan-ji commented Sep 10, 2020 •

edited

Loading

ronakice commented Sep 12, 2020

lintool commented Sep 13, 2020

ronakice left a comment

Simplify boilerplate for monoT5 and monoBERT #83

Simplify boilerplate for monoT5 and monoBERT #83

Conversation

yuxuan-ji commented Sep 10, 2020 • edited Loading

rodrigonogueira4 left a comment

Choose a reason for hiding this comment

rodrigonogueira4 commented Sep 10, 2020

lintool commented Sep 10, 2020

rodrigonogueira4 commented Sep 10, 2020

yuxuan-ji commented Sep 10, 2020 • edited Loading

lintool commented Sep 10, 2020

ronakice commented Sep 10, 2020

lintool commented Sep 10, 2020

yuxuan-ji commented Sep 10, 2020 • edited Loading

ronakice commented Sep 12, 2020

lintool commented Sep 13, 2020

ronakice left a comment

Choose a reason for hiding this comment

yuxuan-ji commented Sep 10, 2020 •

edited

Loading

yuxuan-ji commented Sep 10, 2020 •

edited

Loading

yuxuan-ji commented Sep 10, 2020 •

edited

Loading