[1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial #1352
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit 7176fe0 https://deploy-preview-1352--pytorch-tutorials-preview.netlify.app
text = torch.cat(text)
return text, offsets, label
train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))
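The `collate` logic in the hunk above flattens a batch of variable-length texts into one tensor plus an offsets tensor. A minimal pure-Python sketch of that logic (the `(label, token-id list)` pair shape is an assumption matching AG_NEWS; `torch.cat`/`torch.tensor` are replaced with plain lists here):

```python
def collate_batch(batch):
    # batch: list of (label, token_id_list) pairs (assumed shape, per AG_NEWS)
    labels, flat_text, offsets = [], [], [0]
    for label, token_ids in batch:
        labels.append(label)
        flat_text.extend(token_ids)                  # stand-in for torch.cat
        offsets.append(offsets[-1] + len(token_ids))
    # drop the final cumulative length: offsets keeps one start index per text
    return flat_text, offsets[:-1], labels

print(collate_batch([(1, [10, 11, 12]), (2, [13])]))
# → ([10, 11, 12, 13], [0, 3], [1, 2])
```

The offsets list is exactly what `nn.EmbeddingBag` later consumes to know where each text starts inside the flattened tensor.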
Here we're materializing the dataset again, but this already happened earlier in the context of DataLoader. We can just assign list(train_iter)
to a variable to avoid this. We should probably also add the number of labels to our dataset documentation, which would be much more efficient to use than this. I'll add this as a task.
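A sketch of this suggestion (assumption: the iterator yields `(label, text)` pairs, as AG_NEWS does; the sample data is invented). Materializing the iterator once lets both the DataLoader and the label count reuse the same list:

```python
# Stand-in for AG_NEWS(split='train'); real iterators are exhausted after one pass.
train_iter = iter([(1, "text a"), (2, "text b"), (1, "text c")])
train_list = list(train_iter)   # materialize once, reuse everywhere
num_class = len(set(label for label, _ in train_list))
print(num_class)  # → 2
```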
@@ -2,7 +2,7 @@
 Text classification with the torchtext library
 ==================================

-In this tutorial, we will show how to use the new torchtext library to build the dataset for the text classification analysis. In the nightly release of the torchtext library, we provide a few prototype building blocks for data processing. Users will have the flexibility to
+In this tutorial, we will show how to use the new torchtext library to build the dataset for the text classification analysis. Users will have the flexibility to
We don't need to say "new" torchtext library anymore, because the datasets are now part of the top folder.
Fixed.
# computes the mean value of a “bag” of embeddings. The text entries here
# have different lengths. ``nn.EmbeddingBag`` requires no padding here
# since the text lengths are saved in offsets.
# The model is composed of the `nn.EmbeddingBag <https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag>`__ layer plus a linear layer for the classification purpose. ``nn.EmbeddingBag`` computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since the text lengths are saved in offsets.
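A pure-Python illustration of what `nn.EmbeddingBag` does in its default `mode='mean'` (toy embedding table and index data, not the real torch API): each "bag" is a slice of the flattened index list, delimited by the offsets, reduced to the mean of its embedding rows.

```python
# Toy 2-dimensional embedding table: token id -> vector.
table = {0: [1.0, 1.0], 1: [3.0, 3.0], 2: [5.0, 5.0]}
flat = [0, 1, 2]      # two concatenated texts: [0, 1] and [2]
offsets = [0, 2]      # start index of each text inside `flat`

bags = []
for i, start in enumerate(offsets):
    end = offsets[i + 1] if i + 1 < len(offsets) else len(flat)
    vecs = [table[idx] for idx in flat[start:end]]
    # mode='mean' (the default): average each dimension across the bag
    bags.append([sum(dim) / len(vecs) for dim in zip(*vecs)])

print(bags)  # → [[2.0, 2.0], [5.0, 5.0]]
```

Because the bag boundaries come from `offsets`, no padding of the shorter text is needed.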
I think EmbeddingBag provides a 'mode' option, where 'mean' is just the default. So perhaps it's better to be explicit about that instead of stating that EmbeddingBag takes the mean to combine embeddings.
Updated the text and explicitly mentioned the default mode of 'mean'.
'| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                            total_acc/total_count))
total_acc, total_count = 0, 0
start_time = time.time()
unused variable?
Here we reset the start_time variable so that the next logging interval is timed from this point.
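The pattern in question is windowed logging: accuracy and elapsed time are accumulated per interval, printed, and then reset. A minimal sketch (the interval size and loop are hypothetical; the real loop iterates over a DataLoader and accumulates prediction accuracy):

```python
import time

log_interval = 500
total_acc, total_count = 0, 0
start_time = time.time()

for idx in range(1500):
    total_acc += 1       # stand-in for (predicted == label).sum().item()
    total_count += 1     # stand-in for label.size(0)
    if idx % log_interval == 0 and idx > 0:
        elapsed = time.time() - start_time
        # ...print this window's accuracy and elapsed time here...
        total_acc, total_count = 0, 0
        start_time = time.time()   # reset: time the *next* window, not the whole run
```

Without the reset, `elapsed` would measure time since the start of the epoch rather than since the last log line.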
#

from torch.utils.data import DataLoader
import time
perhaps it can be imported in the next code snippet instead, as it might not be used here?
It's used in L195?
* Update build.sh
* Update audio tutorial for release pytorch 1.8 / torchaudio 0.8 (#1379)
  * [wip] replace audio tutorial
  * Update
  * Update
  * Update
  * fixup
  * Update requirements.txt
  * update
  * Update
  Co-authored-by: Brian Johnson <[email protected]>
* [1.8 release] Switch to the new datasets in torchtext 0.9.0 release - text classification tutorial (#1352)
  * switch to the new dataset API
  * checkpoint
  * checkpoint
  * checkpoint
  * update docs
  * checkpoint
  * switch to legacy vocab
  * update to follow the master API
  * checkpoint
  * checkpoint
  * address reviewer's comments
  Co-authored-by: Guanheng Zhang <[email protected]>
  Co-authored-by: Brian Johnson <[email protected]>
* [1.8 release] Switch to LM dataset in torchtext 0.9.0 release (#1349)
  * switch to raw text dataset in torchtext 0.9.0 release
  * follow the new API in torchtext master
  Co-authored-by: Guanheng Zhang <[email protected]>
  Co-authored-by: Brian Johnson <[email protected]>
* [WIP][FX] CPU Performance Profiling with FX (#1319)
  Co-authored-by: Brian Johnson <[email protected]>
* [FX] Added fuser tutorial (#1356)
  * Added fuser tutorial
  * updated index.rst
  * fixed conclusion
  * responded to some comments
  * responded to comments
  * respond
  Co-authored-by: Brian Johnson <[email protected]>
* Update numeric_suite_tutorial.py
* Tutorial combining DDP with Pipeline Parallelism to Train Transformer models (#1347)
  Summary: Tutorial which places a pipe on GPUs 0 and 1 and another Pipe on GPUs 2 and 3. Both pipe replicas are replicated via DDP. One process drives GPUs 0 and 1 and another drives GPUs 2 and 3.
  * Polish out some of the docs.
  * Add thumbnail and address some comments.
  Co-authored-by: pritam <[email protected]>
* More updates to numeric_suite
* Even more updates
* Update numeric_suite_tutorial.py (Hopefully that's the last one)
* Update numeric_suite_tutorial.py (Last one)
* Update build.sh

Co-authored-by: moto <[email protected]>
Co-authored-by: Guanheng George Zhang <[email protected]>
Co-authored-by: Guanheng Zhang <[email protected]>
Co-authored-by: James Reed <[email protected]>
Co-authored-by: Horace He <[email protected]>
Co-authored-by: Pritam Damania <[email protected]>
Co-authored-by: pritam <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
In the torchtext 0.9.0 release, we will include the raw text datasets as a beta release. This PR updates the text classification tutorial to use the new torchtext library.
This PR should be tested against the pytorch 1.8.0 RC and the torchtext 0.9.0 RC.
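The raw text datasets yield plain `(label, text)` pairs, so the tutorial has to build its own vocabulary and text/label pipelines. A pure-Python sketch of that step (the real tutorial uses torchtext's tokenizer and Vocab; the sample lines here are invented stand-ins for AG_NEWS rows):

```python
from collections import Counter

# Invented stand-in for the raw AG_NEWS training iterator.
raw_train = [(3, "wall st bears claw back"),
             (4, "carlyle looks toward wall st")]

# Count tokens, then assign each a stable integer id.
counter = Counter()
for _, line in raw_train:
    counter.update(line.split())          # real code: tokenizer(line)
vocab = {tok: i for i, tok in enumerate(sorted(counter))}

text_pipeline = lambda line: [vocab[t] for t in line.split()]
label_pipeline = lambda label: label - 1  # AG_NEWS labels are 1..4

print(text_pipeline("wall st"), label_pipeline(3))
```

These pipelines are what the collate function applies to each raw pair before batching.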