
Omniglot Dataset #323

Merged
merged 13 commits into from
Jan 28, 2018

Conversation

@activatedgeek (Contributor) commented Nov 6, 2017

This adds a loader for the Omniglot dataset.

One of the use cases of this dataset is One-Shot Learning, where we sample a pair of images from the dataset and train a neural network to learn a similarity metric between them.

P.S.: It is amazing how simple it is to write data loaders!
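For context, pair sampling for one-shot similarity training can be sketched roughly like this (a minimal standalone sketch; `flat_images` and `sample_pair` are illustrative names, not part of this PR):

```python
import random

# Toy stand-in for a flattened list of (image, character_class) entries,
# mirroring what an Omniglot-style loader exposes. Images are placeholders.
flat_images = [("img_%d_%d" % (c, i), c) for c in range(5) for i in range(4)]

def sample_pair(entries, is_match, rng=random):
    """Draw a pair of entries that do (is_match=1) or don't (is_match=0)
    share a character class, by rejection sampling."""
    first = rng.choice(entries)
    while True:
        second = rng.choice(entries)
        if (first[1] == second[1]) == bool(is_match):
            return first, second

(a_img, a_cls), (b_img, b_cls) = sample_pair(flat_images, is_match=1)
assert a_cls == b_cls
(a_img, a_cls), (b_img, b_cls) = sample_pair(flat_images, is_match=0)
assert a_cls != b_cls
```

A Siamese-style network would then be trained on `(first_image, second_image, is_match)` triples produced this way.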

@activatedgeek (Contributor, Author)

I am particularly unsure how to implement the __len__ function for the randomized pair dataset. Should I generate a preset number of pairs and just return them on every call to __getitem__?

@activatedgeek changed the title from "[WIP] Omniglot Dataset" to "Omniglot Dataset" on Nov 6, 2017
@activatedgeek (Contributor, Author)

I now precompute the random set of pairs, with an added pair_count argument to the new OmniglotRandomPair class constructor. I think this solves the random pair generation problem and makes the pairs deterministic once the object is instantiated.
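The precomputation idea can be sketched like this (a standalone illustration, not the PR's actual OmniglotRandomPair code; the class name and arguments here are hypothetical):

```python
import random

class PrecomputedPairs(object):
    """Sketch: precompute pair_count index pairs at construction time,
    so __len__ and __getitem__ are well-defined and deterministic."""
    def __init__(self, dataset_len, pair_count=10000, random_seed=0):
        rng = random.Random(random_seed)
        self.pairs = [(rng.randrange(dataset_len), rng.randrange(dataset_len))
                      for _ in range(pair_count)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        # A real dataset would load the two images for these indices here.
        return self.pairs[index]

pairs = PrecomputedPairs(dataset_len=100, pair_count=8, random_seed=42)
assert len(pairs) == 8
# Re-instantiating with the same seed reproduces the same pairs.
assert pairs[0] == PrecomputedPairs(100, 8, 42)[0]
```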

@alykhantejani (Contributor) left a comment

Thanks for the PR - I've left some inline comments

(The inline review comments on this revision were marked as off-topic and hidden.)

@alykhantejani (Contributor) left a comment

Thanks for the updates, I've left some more comments inline, mostly around the OmniglotRandomPair dataset


@activatedgeek (Contributor, Author)

Hey @alykhantejani, do you mind checking some updates to the randomized pair generation?

@activatedgeek (Contributor, Author) commented Nov 13, 2017

@alykhantejani Writing my comments on the approach here, because my earlier response to your review seems to have disappeared (weirdly).

> If all you want are random pairs of images and whether they are the same character or not, then why not just sample one (img, idx) pair from self._flattened_character_images and then keep sampling another until they either match or not, depending on what is_match is.

I believe that way it becomes much harder to arrive at a matching pair when is_match=1. The way I maintain the internal data lets me quickly arrive at a pair for both is_match = {0, 1}; once the array is flattened, the information about the character class is lost.

> There's no need to precompute pairs, just make __len__ return the length of self._flattened_character_images, and then when this is used with the pytorch DataLoader, it will stop sampling after __len__ samples.

The problem with that approach is that it doesn't stay true to what I wanted to achieve with the class. The total number of combinatorially possible pairs is huge, which is why I introduced pair_count, defaulting to 10000, and set that as the length. It gives the user enough flexibility to sample training pairs as well as set up a One-Shot task. If I limit it to the length of the flattened character list, it gives the wrong impression and might confuse the user. I have also added random_seed to allow reproducibility.
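The per-class bookkeeping described above can be illustrated in isolation (illustrative names, assuming a list of (image, class) entries; this is not the PR's actual code):

```python
import random

def build_class_index(entries):
    """Group images by character class so matched and unmatched pairs
    can be drawn directly, without rejection sampling."""
    by_class = {}
    for image, cls in entries:
        by_class.setdefault(cls, []).append(image)
    return by_class

def draw_pair(by_class, is_match, rng=random):
    classes = list(by_class)
    if is_match:
        # Same class for both elements of the pair.
        c = rng.choice(classes)
        return (rng.choice(by_class[c]), c), (rng.choice(by_class[c]), c)
    # Two distinct classes guarantee a non-matching pair.
    c1, c2 = rng.sample(classes, 2)
    return (rng.choice(by_class[c1]), c1), (rng.choice(by_class[c2]), c2)

entries = [("img_%d_%d" % (c, i), c) for c in range(3) for i in range(2)]
index = build_class_index(entries)
first, second = draw_pair(index, is_match=1)
assert first[1] == second[1]
first, second = draw_pair(index, is_match=0)
assert first[1] != second[1]
```

Once the class structure is flattened away, only rejection sampling remains, which is why the unflattened structure is kept around.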

@alykhantejani (Contributor)

Hey @activatedgeek, sorry for the late response.

Pinging @fmassa in case he has any opinions/thoughts on the random pair dataset.

@fmassa (Member) commented Nov 16, 2017

Hi @activatedgeek, sorry for the delay in replying.

So, my first thought about the Pair dataset is the following: we should try not to add sampling logic to the dataset itself, which should only hold information on how to load individual elements, and let the sampling be performed by the Sampler/DataLoader.
It might look non-trivial in this case, but here is one possibility:

We introduce a new Dataset class, in a similar spirit to ConcatDataset, that draws n samples of the data and returns whatever you want.
Here is a draft implementation (untested):

class MultiDataset(object):
    def __init__(self, dataset, num_outputs=1, transforms=None):
        self.dataset = dataset
        self.num_outputs = num_outputs
        self.transforms = transforms

    def __getitem__(self, idx):
        # here comes the logic to convert a 1d index into
        # self.num_outputs indices, each in range(len(self.dataset))
        individual_idx = []
        for i in range(self.num_outputs):
            individual_idx.append(idx % len(self.dataset))
            idx = idx // len(self.dataset)

        result = []
        for i in reversed(individual_idx):
            result.append(self.dataset[i])

        if self.transforms is not None:
            result = self.transforms(result)

        return result

    def __len__(self):
        return len(self.dataset) ** self.num_outputs

This way, you generate on-the-fly an arbitrarily large dataset that can accommodate pairs/triplets/etc. of elements of the same dataset. Plus, the logic of how to combine the different targets becomes something the user controls (via the transforms argument of MultiDataset; each individual dataset still has its own transforms as well).
And if you want a biased sampler (say, to get a roughly equal number of matches/non-matches), this could be handled by the sampler, maybe?

This is just a rough idea, but let me know what you think.
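The index decomposition in the draft can be checked in isolation: a single 1-d index in [0, len(dataset) ** num_outputs) is unrolled into num_outputs per-dataset indices, mixed-radix style (a standalone sketch; the `unroll` helper name is illustrative):

```python
def unroll(idx, base, num_outputs):
    """Convert a 1-d index in [0, base ** num_outputs) into
    num_outputs indices, each in [0, base)."""
    out = []
    for _ in range(num_outputs):
        out.append(idx % base)
        idx //= base
    return list(reversed(out))

# With a dataset of length 4 and pairs (num_outputs=2), the 16 possible
# flat indices enumerate all ordered pairs of element indices.
assert unroll(0, base=4, num_outputs=2) == [0, 0]
assert unroll(7, base=4, num_outputs=2) == [1, 3]
assert unroll(15, base=4, num_outputs=2) == [3, 3]
```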

@activatedgeek (Contributor, Author) commented Nov 17, 2017

@fmassa That is a great idea. I was in fact wondering the same thing, because I recently ran into a similar requirement for the ImageNet/Mini-ImageNet datasets as well. It didn't feel right to keep creating custom rules for each case.

So here is what I will do - the Omniglot data loader will return a tuple (image, character_class_id) and I will wrap up this PR.

I will take up a MultiDataset like utility wrapper in another PR.

Does that sound good?

@fmassa (Member) commented Nov 17, 2017

Sounds good! Also, given how generic this dataset is, it might be worth considering sending it to pytorch/tnt, but we can discuss that later.

@activatedgeek (Contributor, Author)

@fmassa @alykhantejani Can you please verify if everything is in order now?

@activatedgeek (Contributor, Author) commented Nov 26, 2017

Any updates here?

cc @fmassa @alykhantejani

@activatedgeek (Contributor, Author)

Hi, I am wondering what the hold-up is. Is there anything I can do to help? This PR has been open for quite a while, and I would really appreciate it if we could make some progress here (or close it if it doesn't align well).

cc @alykhantejani @fmassa

@activatedgeek (Contributor, Author)

@alykhantejani @fmassa Any possibility of this getting merged?

@fmassa (Member) commented Dec 11, 2017

@activatedgeek sorry for the delay, I'll have a look at it today.

@fmassa (Member) commented Dec 11, 2017

This looks very good, thanks! Once the merge conflicts are addressed, I think this can be merged.

@activatedgeek (Contributor, Author)

Hey @alykhantejani @fmassa, can we merge this? I can continue the discussion in #338 after this.

@activatedgeek (Contributor, Author)

Pinging @fmassa and @alykhantejani here again. Can we please merge this?

@fmassa merged commit dac9efa into pytorch:master on Jan 28, 2018
@fmassa (Member) commented Jan 28, 2018

Awesome! Thanks a lot @activatedgeek !

@activatedgeek deleted the omniglot-dataset branch on January 28, 2018 at 23:49
@activatedgeek (Contributor, Author) commented Jan 28, 2018

Hey @fmassa thanks! Do you mind taking a look at #338 as well?

@fmassa mentioned this pull request on Jan 30, 2018
3 participants