Support wrapping sequences across samples for LM tasks #701

Open
mattdangerw opened this issue Jan 30, 2023 · 3 comments
Labels
team-created Issues created by Keras Hub team as part of development roadmap. type:feature New feature or request

Comments

@mattdangerw
Member

Both RoBERTa and GPT-2 pretraining leverage wrapped, densely packed sequences for unsupervised language model learning. Essentially, the training samples will look something like this (I'm omitting masking and labeling for clarity)...

The  qu   #ick  br   #own  fox    jump   #ed
over the  lazy  dog  .     </s>   The    lazy
dog  sle  #pt   un   #der  the    pale   moon

Essentially, every sample always fills the full sequence length, and end-of-text markers need not line up with sample boundaries at all. This has the advantage of being both simple and efficient: all weights are trained continuously throughout the unsupervised task.

We should consider whether we want to support this at the task level, and if so, how, since this type of preprocessing is inexpressible with our current preprocessing layer design.

@mattdangerw mattdangerw added the type:feature New feature or request label Jan 30, 2023
@mattdangerw mattdangerw changed the title Consider supporting wrapping sequences across sample boundaries for LM tasks Support wrapping sequences across sample boundaries for LM tasks Jan 30, 2023
@mattdangerw mattdangerw changed the title Support wrapping sequences across sample boundaries for LM tasks Support wrapping sequences across samples for LM tasks Jan 30, 2023
@mattdangerw
Member Author

A few notes and musings on this design problem, which is quite an interesting one.

  • This is not strictly a pretraining issue. As @chenmoneygithub has pointed out for GPT, this type of windowing is useful for fine-tuning a generative model as well.
  • There is no way to write a preprocessing layer that supports this, because it is not an operation that can be expressed as a dataset.map(). It can be roughly expressed as ds.map(tokenizer).rebatch(sequence_length).batch(batch_size), but fundamentally it is an operation over an entire tf.data.Dataset stream, not over individual samples. It could potentially be supported at the task level, but never with a single preprocessing layer (a rough sketch follows after this list).
  • We could choose not to support this out of the box! We already have a way to express it with our raw tokenizers and a task model with preprocessor=None. We could decide that this is sufficient, given proper code examples.
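
To make the second bullet concrete, here is a minimal sketch of the tokenize-and-repack pipeline, assuming KerasNLP's GPT2Tokenizer preset and substituting unbatch().batch() for rebatch(). All names and numbers are illustrative, not a proposed API.

import tensorflow as tf
import keras_nlp  # assumption: KerasNLP (Keras Hub) with GPT-2 presets installed

sequence_length = 8
batch_size = 2

# Hypothetical in-memory corpus; a real recipe would stream documents from disk.
raw_ds = tf.data.Dataset.from_tensor_slices([
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog slept under the pale moon.",
])

tokenizer = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en")

packed_ds = (
    raw_ds
    # Tokenize each document into a variable-length 1D tensor of token ids.
    # A real recipe would also append an end-of-text token id per document,
    # like the </s> marker in the example above.
    .map(tokenizer, num_parallel_calls=tf.data.AUTOTUNE)
    # Flatten all documents into one continuous stream of token ids.
    .unbatch()
    # Re-chunk the stream into fixed-length sequences, ignoring document boundaries.
    .batch(sequence_length, drop_remainder=True)
    # Group the fixed-length sequences into training batches.
    .batch(batch_size, drop_remainder=True)
)

The key point is that the repacking happens across dataset elements, which is exactly why it cannot live inside a single preprocessing layer.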

A few open questions we should investigate.

  • How should we expect an input dataset to annotate where end-of-document markers are? RoBERTa pretraining expects a new document to be marked by an empty line. Is this sufficient?
  • Is fully dynamic preprocessing (e.g. raw text -> sample happening on the fly on the CPU) ever a reasonable call for pretraining? RoBERTa applies masking dynamically, but still tokenizes and shards files in a separate job. BERT does all of its pretraining preprocessing in a separate job. This is something we should investigate experimentally, but if preprocessing inside the training process is always a slowdown, that should inform our design.

@jbischof
Contributor

My default strategy (not having looked into this myself) is that we should replicate prior art unless we can improve upon it. If the BERT/RoBERTa repos offer a separate script for featurizing raw text data, we can

  1. Have our preprocessors expect the output of these scripts
  2. Offer a version of these scripts outside the repo in the long run

This is part of an overarching "simple preprocessing" proposal I'm thinking about: make our task models fairly dumb and assume that any complex preprocessing, which will inevitably depend on the raw data format, is already handled.

@mattdangerw
Member Author

The issue is going to be the uniformity of our task API. Right now all of our task models operate on raw strings. If we let BERT do what upstream BERT does, the input format for a BERT task will be tokenized, windowed and masked TFRecords (this is how our example is structured). If we let RoBERTa do what upstream RoBERTa does, the input format will be tokenized and sharded files, not yet windowed or masked. (And it's still unclear to me whether we can do everything RoBERTa does dynamically and efficiently with tf.data.)

We have to worry about the consistency of our task API. The obvious escape hatch (to me) is to show "pretraining recipes" with preprocessor=None. Then we could ship complete RoBERTa and BERT examples that have a slightly different breakdown of what preprocessing goes into which script (a rough sketch of the preprocessor=None flow is below).
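
To illustrate the escape hatch, here is a rough sketch reusing packed_ds from the pipeline sketched in my earlier comment. The feature/label construction and the GPT2CausalLM name reflect my assumptions about what such a recipe could look like, not a settled API.

import tensorflow as tf
import keras_nlp  # assumption: a CausalLM-style task that accepts preprocessor=None

def to_lm_features(token_ids):
    # Next-token prediction: features are tokens[:, :-1], labels are tokens[:, 1:].
    # Densely packed batches have no padding, so the mask is all ones.
    features = {
        "token_ids": token_ids[:, :-1],
        "padding_mask": tf.ones_like(token_ids[:, :-1], dtype="bool"),
    }
    labels = token_ids[:, 1:]
    return features, labels

lm_ds = packed_ds.map(to_lm_features, num_parallel_calls=tf.data.AUTOTUNE)

# With preprocessor=None the task consumes token ids directly, so all of the
# packing stays in the user's tf.data pipeline.
causal_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=None
)
causal_lm.fit(lm_ds, epochs=1)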

To me, a bad outcome would be an API in which:

  • our classification task models all expect strings
  • our LM task models expect a varying level of preprocessing depending on the model in question
  • our token classification tasks all expect tokenized input (with label re-mapping happening before the model)

This would be really confusing and a significant point of friction. We are going to have to be somewhat editorial with these models if we want our tasks to have a consistent UX.

@sachinprasadhs sachinprasadhs added the team-created Issues created by Keras Hub team as part of development roadmap. label Nov 8, 2024