[1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial #1352
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit 7176fe0 https://deploy-preview-1352--pytorch-tutorials-preview.netlify.app
text = torch.cat(text)
return text, offsets, label
train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))
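The `collate` logic in the hunk above flattens a batch of variable-length texts into one tensor plus an offsets tensor. A minimal pure-Python sketch of that logic (the `(label, token-id list)` pair shape is an assumption matching AG_NEWS; `torch.cat`/`torch.tensor` are replaced with plain lists here):

```python
def collate_batch(batch):
    # batch: list of (label, token_id_list) pairs (assumed shape, per AG_NEWS)
    labels, flat_text, offsets = [], [], [0]
    for label, token_ids in batch:
        labels.append(label)
        flat_text.extend(token_ids)                  # stand-in for torch.cat
        offsets.append(offsets[-1] + len(token_ids))
    # drop the final cumulative length: offsets keeps one start index per text
    return flat_text, offsets[:-1], labels

print(collate_batch([(1, [10, 11, 12]), (2, [13])]))
# → ([10, 11, 12, 13], [0, 3], [1, 2])
```

The offsets list is exactly what `nn.EmbeddingBag` later consumes to know where each text starts inside the flattened tensor.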
Here we're materializing the dataset again, but this already happened earlier in the context of DataLoader. We can just assign list(train_iter)
to a variable to avoid this. We should probably also add the number of labels to our dataset documentation, which would be much more efficient to use than this. I'll add this as a task.
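A sketch of this suggestion (assumption: the iterator yields `(label, text)` pairs, as AG_NEWS does; the sample data is invented). Materializing the iterator once lets both the DataLoader and the label count reuse the same list:

```python
# Stand-in for AG_NEWS(split='train'); real iterators are exhausted after one pass.
train_iter = iter([(1, "text a"), (2, "text b"), (1, "text c")])
train_list = list(train_iter)   # materialize once, reuse everywhere
num_class = len(set(label for label, _ in train_list))
print(num_class)  # → 2
```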
@@ -2,7 +2,7 @@
 Text classification with the torchtext library
 ==================================

-In this tutorial, we will show how to use the new torchtext library to build the dataset for the text classification analysis. In the nightly release of the torchtext library, we provide a few prototype building blocks for data processing. Users will have the flexibility to
+In this tutorial, we will show how to use the new torchtext library to build the dataset for the text classification analysis. Users will have the flexibility to
We don't need to say "new" torchtext library anymore, because the datasets are now part of the top folder.
Fixed.
# computes the mean value of a “bag” of embeddings. The text entries here
# have different lengths. ``nn.EmbeddingBag`` requires no padding here
# since the text lengths are saved in offsets.
# The model is composed of the `nn.EmbeddingBag <https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag>`__ layer plus a linear layer for the classification purpose. ``nn.EmbeddingBag`` computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since the text lengths are saved in offsets.
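A pure-Python illustration of what `nn.EmbeddingBag` does in its default `mode='mean'` (toy embedding table and index data, not the real torch API): each "bag" is a slice of the flattened index list, delimited by the offsets, reduced to the mean of its embedding rows.

```python
# Toy 2-dimensional embedding table: token id -> vector.
table = {0: [1.0, 1.0], 1: [3.0, 3.0], 2: [5.0, 5.0]}
flat = [0, 1, 2]      # two concatenated texts: [0, 1] and [2]
offsets = [0, 2]      # start index of each text inside `flat`

bags = []
for i, start in enumerate(offsets):
    end = offsets[i + 1] if i + 1 < len(offsets) else len(flat)
    vecs = [table[idx] for idx in flat[start:end]]
    # mode='mean' (the default): average each dimension across the bag
    bags.append([sum(dim) / len(vecs) for dim in zip(*vecs)])

print(bags)  # → [[2.0, 2.0], [5.0, 5.0]]
```

Because the bag boundaries come from `offsets`, no padding of the shorter text is needed.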
I think EmbeddingBag provides a 'mode' option, where 'mean' is just the default. So perhaps it's better to be explicit about that instead of stating that EmbeddingBag takes the mean to combine embeddings.
Updated the text and explicitly mentioned the default mode of 'mean'.
'| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                            total_acc/total_count))
total_acc, total_count = 0, 0
start_time = time.time()
unused variable?
Here we reset the start_time variable so that the next logging interval is timed from this point.
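The pattern in question is windowed logging: accuracy and elapsed time are accumulated per interval, printed, and then reset. A minimal sketch (the interval size and loop are hypothetical; the real loop iterates over a DataLoader and accumulates prediction accuracy):

```python
import time

log_interval = 500
total_acc, total_count = 0, 0
start_time = time.time()

for idx in range(1500):
    total_acc += 1       # stand-in for (predicted == label).sum().item()
    total_count += 1     # stand-in for label.size(0)
    if idx % log_interval == 0 and idx > 0:
        elapsed = time.time() - start_time
        # ...print this window's accuracy and elapsed time here...
        total_acc, total_count = 0, 0
        start_time = time.time()   # reset: time the *next* window, not the whole run
```

Without the reset, `elapsed` would measure time since the start of the epoch rather than since the last log line.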
#

from torch.utils.data import DataLoader
import time
perhaps it can be imported in the next code snippet instead, as it might not be used here?
It's used in L195?
* Update build.sh
* Update audio tutorial for release pytorch 1.8 / torchaudio 0.8 (#1379)
  * [wip] replace audio tutorial
  * Update
  * Update
  * Update
  * fixup
  * Update requirements.txt
  * update
  * Update
  Co-authored-by: Brian Johnson <[email protected]>
* [1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial (#1352)
  * switch to the new dataset API
  * checkpoint
  * checkpoint
  * checkpoint
  * update docs
  * checkpoint
  * switch to legacy vocab
  * update to follow the master API
  * checkpoint
  * checkpoint
  * address reviewer's comments
  Co-authored-by: Guanheng Zhang <[email protected]>
  Co-authored-by: Brian Johnson <[email protected]>
* [1.8 release] Switch to LM dataset in torchtext 0.9.0 release (#1349)
  * switch to raw text dataset in torchtext 0.9.0 release
  * follow the new API in torchtext master
  Co-authored-by: Guanheng Zhang <[email protected]>
  Co-authored-by: Brian Johnson <[email protected]>
* [WIP][FX] CPU Performance Profiling with FX (#1319)
  Co-authored-by: Brian Johnson <[email protected]>
* [FX] Added fuser tutorial (#1356)
  * Added fuser tutorial
  * updated index.rst
  * fixed conclusion
  * responded to some comments
  * responded to comments
  * respond
  Co-authored-by: Brian Johnson <[email protected]>
* Update numeric_suite_tutorial.py
* Tutorial combining DDP with Pipeline Parallelism to Train Transformer models (#1347)
  Summary: Tutorial which places a pipe on GPUs 0 and 1 and another Pipe on GPUs 2 and 3. Both pipe replicas are replicated via DDP. One process drives GPUs 0 and 1 and another drives GPUs 2 and 3.
  * Polish out some of the docs.
  * Add thumbnail and address some comments.
  Co-authored-by: pritam <[email protected]>
* More updates to numeric_suite
* Even more updates
* Update numeric_suite_tutorial.py (Hopefully that's the last one)
* Update numeric_suite_tutorial.py (Last one)
* Update build.sh

Co-authored-by: moto <[email protected]>
Co-authored-by: Guanheng George Zhang <[email protected]>
Co-authored-by: Guanheng Zhang <[email protected]>
Co-authored-by: James Reed <[email protected]>
Co-authored-by: Horace He <[email protected]>
Co-authored-by: Pritam Damania <[email protected]>
Co-authored-by: pritam <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
In the torchtext 0.9.0 release, we will include the raw text datasets as a beta release. This PR updates the text classification tutorial to use the new torchtext library.
This PR should be tested against the pytorch 1.8.0 RC and the torchtext 0.9.0 RC.
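The raw text datasets yield plain `(label, text)` pairs, so the tutorial has to build its own vocabulary and text/label pipelines. A pure-Python sketch of that step (the real tutorial uses torchtext's tokenizer and Vocab; the sample lines here are invented stand-ins for AG_NEWS rows):

```python
from collections import Counter

# Invented stand-in for the raw AG_NEWS training iterator.
raw_train = [(3, "wall st bears claw back"),
             (4, "carlyle looks toward wall st")]

# Count tokens, then assign each a stable integer id.
counter = Counter()
for _, line in raw_train:
    counter.update(line.split())          # real code: tokenizer(line)
vocab = {tok: i for i, tok in enumerate(sorted(counter))}

text_pipeline = lambda line: [vocab[t] for t in line.split()]
label_pipeline = lambda label: label - 1  # AG_NEWS labels are 1..4

print(text_pipeline("wall st"), label_pipeline(3))
```

These pipelines are what the collate function applies to each raw pair before batching.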