
Omniglot Dataset #323

Merged
merged 13 commits into from
Jan 28, 2018

Conversation

@activatedgeek (Contributor) commented Nov 6, 2017

This adds a loader for the Omniglot dataset.

One of the use cases of this dataset is One-Shot Learning, where we sample a pair of images from the dataset and train a neural network to learn a similarity metric between them.

P.S.: It is amazing how simple it is to write data loaders!
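For context, pair sampling for one-shot similarity training can be sketched roughly like this (a minimal standalone sketch; `flat_images` and `sample_pair` are illustrative names, not part of this PR):

```python
import random

# Toy stand-in for a flattened list of (image, character_class) entries,
# mirroring what an Omniglot-style loader exposes. Images are placeholders.
flat_images = [("img_%d_%d" % (c, i), c) for c in range(5) for i in range(4)]

def sample_pair(entries, is_match, rng=random):
    """Draw a pair of entries that do (is_match=1) or don't (is_match=0)
    share a character class, by rejection sampling."""
    first = rng.choice(entries)
    while True:
        second = rng.choice(entries)
        if (first[1] == second[1]) == bool(is_match):
            return first, second

(a_img, a_cls), (b_img, b_cls) = sample_pair(flat_images, is_match=1)
assert a_cls == b_cls
(a_img, a_cls), (b_img, b_cls) = sample_pair(flat_images, is_match=0)
assert a_cls != b_cls
```

A Siamese-style network would then be trained on `(first_image, second_image, is_match)` triples produced this way.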

@activatedgeek (Contributor, Author)

I am particularly unsure how to implement the __len__ function for the randomized pair dataset. Should I generate a preset number of pairs and just return them on every call to __getitem__?

@activatedgeek changed the title from "[WIP] Omniglot Dataset" to "Omniglot Dataset" on Nov 6, 2017
@activatedgeek (Contributor, Author)

I now precompute the random set of pairs, with an added pair_count argument to the new OmniglotRandomPair class constructor. I think this solves the random pair generation problem and makes the pairs deterministic once the object is instantiated.
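The precomputation idea can be sketched like this (a standalone illustration, not the PR's actual OmniglotRandomPair code; the class name and arguments here are hypothetical):

```python
import random

class PrecomputedPairs(object):
    """Sketch: precompute pair_count index pairs at construction time,
    so __len__ and __getitem__ are well-defined and deterministic."""
    def __init__(self, dataset_len, pair_count=10000, random_seed=0):
        rng = random.Random(random_seed)
        self.pairs = [(rng.randrange(dataset_len), rng.randrange(dataset_len))
                      for _ in range(pair_count)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        # A real dataset would load the two images for these indices here.
        return self.pairs[index]

pairs = PrecomputedPairs(dataset_len=100, pair_count=8, random_seed=42)
assert len(pairs) == 8
# Re-instantiating with the same seed reproduces the same pairs.
assert pairs[0] == PrecomputedPairs(100, 8, 42)[0]
```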

@alykhantejani (Contributor) left a comment

Thanks for the PR - I've left some inline comments

(The inline review comments on this revision were marked as off-topic and hidden.)

@alykhantejani (Contributor) left a comment

Thanks for the updates, I've left some more comments inline, mostly around the OmniglotRandomPair dataset


@activatedgeek (Contributor, Author)

Hey @alykhantejani, do you mind checking some updates to the randomized pair generation?

@activatedgeek (Contributor, Author) commented Nov 13, 2017

@alykhantejani Writing my comments on the approach here, because my earlier response to your review seems to have disappeared (weirdly).

> If all you want are random pairs of images and whether they are the same character or not, then why not just sample one (img, idx) pair from self._flattened_character_images and then keep sampling another until they either match or not, depending on what is_match is.

I believe that way it becomes much harder to arrive at a matching pair when is_match=1. The way I maintain the internal data lets me quickly arrive at a pair for both is_match = {0, 1}; once the array is flattened, the information about the character class is lost.

> There's no need to precompute pairs, just make __len__ return the length of self._flattened_character_images, and then when this is used with the pytorch DataLoader, it will stop sampling after __len__ samples.

The problem with that approach is that it doesn't stay true to what I wanted to achieve with the class. The total number of combinatorially possible pairs is huge, which is why I introduced pair_count, defaulting to 10000, and set that as the length. It gives the user enough flexibility to sample training pairs as well as set up a One-Shot task. If I limit it to the length of the flattened character list, it gives the wrong impression and might confuse the user. I have also added random_seed to allow reproducibility.
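The per-class bookkeeping described above can be illustrated in isolation (illustrative names, assuming a list of (image, class) entries; this is not the PR's actual code):

```python
import random

def build_class_index(entries):
    """Group images by character class so matched and unmatched pairs
    can be drawn directly, without rejection sampling."""
    by_class = {}
    for image, cls in entries:
        by_class.setdefault(cls, []).append(image)
    return by_class

def draw_pair(by_class, is_match, rng=random):
    classes = list(by_class)
    if is_match:
        # Same class for both elements of the pair.
        c = rng.choice(classes)
        return (rng.choice(by_class[c]), c), (rng.choice(by_class[c]), c)
    # Two distinct classes guarantee a non-matching pair.
    c1, c2 = rng.sample(classes, 2)
    return (rng.choice(by_class[c1]), c1), (rng.choice(by_class[c2]), c2)

entries = [("img_%d_%d" % (c, i), c) for c in range(3) for i in range(2)]
index = build_class_index(entries)
first, second = draw_pair(index, is_match=1)
assert first[1] == second[1]
first, second = draw_pair(index, is_match=0)
assert first[1] != second[1]
```

Once the class structure is flattened away, only rejection sampling remains, which is why the unflattened structure is kept around.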

@alykhantejani (Contributor)

Hey @activatedgeek, sorry for the late response.

Pinging @fmassa in case he has any opinions/thoughts on the random pair dataset.

@fmassa (Member) commented Nov 16, 2017

Hi @activatedgeek, sorry for the delay in replying.

So, my first thought about the Pair dataset is the following: we should try not to add sampling logic to the dataset itself, which should only hold information on how to load individual elements, and let the sampling be performed by the Sampler/DataLoader.
It might look non-trivial in this case, but here is one possibility:

We introduce a new Dataset class, in a similar spirit to ConcatDataset, that draws n samples of the data and returns whatever you want.
Here is a draft implementation (untested):

class MultiDataset(object):
    def __init__(self, dataset, num_outputs=1, transforms=None):
        self.dataset = dataset
        self.num_outputs = num_outputs
        self.transforms = transforms

    def __getitem__(self, idx):
        # here comes the logic to convert a 1d index into
        # self.num_outputs indices, each in range(len(self.dataset))
        individual_idx = []
        for i in range(self.num_outputs):
            individual_idx.append(idx % len(self.dataset))
            idx = idx // len(self.dataset)

        result = []
        for i in reversed(individual_idx):
            result.append(self.dataset[i])

        if self.transforms is not None:
            result = self.transforms(result)

        return result

    def __len__(self):
        return len(self.dataset) ** self.num_outputs

This way, you generate on-the-fly an arbitrarily large dataset that can accommodate pairs/triplets/etc. of elements of the same dataset. Plus, the logic of how to combine the different targets becomes something the user controls (via the transforms argument of MultiDataset; each individual dataset still has its own transforms as well).
And if you want a biased sampler (say, to get a roughly equal number of matches/non-matches), this could be handled by the sampler, maybe?

This is just a rough idea, but let me know what you think.
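The index decomposition in the draft can be checked in isolation: a single 1-d index in [0, len(dataset) ** num_outputs) is unrolled into num_outputs per-dataset indices, mixed-radix style (a standalone sketch; the `unroll` helper name is illustrative):

```python
def unroll(idx, base, num_outputs):
    """Convert a 1-d index in [0, base ** num_outputs) into
    num_outputs indices, each in [0, base)."""
    out = []
    for _ in range(num_outputs):
        out.append(idx % base)
        idx //= base
    return list(reversed(out))

# With a dataset of length 4 and pairs (num_outputs=2), the 16 possible
# flat indices enumerate all ordered pairs of element indices.
assert unroll(0, base=4, num_outputs=2) == [0, 0]
assert unroll(7, base=4, num_outputs=2) == [1, 3]
assert unroll(15, base=4, num_outputs=2) == [3, 3]
```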

@activatedgeek (Contributor, Author) commented Nov 17, 2017

@fmassa That is a great idea. I was in fact wondering the same thing, because I recently ran into a similar requirement for the ImageNet/Mini-ImageNet datasets as well. It didn't feel right to keep creating custom rules for each case.

So here is what I will do - the Omniglot data loader will return a tuple (image, character_class_id) and I will wrap up this PR.

I will take up a MultiDataset like utility wrapper in another PR.

Does that sound good?

@fmassa (Member) commented Nov 17, 2017

Sounds good! Also, given how generic this dataset is, it might be worth considering sending it to pytorch/tnt, but we can discuss that later.

@activatedgeek (Contributor, Author)

@fmassa @alykhantejani Can you please verify if everything is in order now?

@activatedgeek (Contributor, Author) commented Nov 26, 2017

Any updates here?

cc @fmassa @alykhantejani

@activatedgeek (Contributor, Author)

Hi, I am wondering what the hold-up is. Is there anything I can do to help? This PR has been open for quite a while, and I would really appreciate it if we could make some progress here (or close it if it doesn't align well).

cc @alykhantejani @fmassa

@activatedgeek (Contributor, Author)

@alykhantejani @fmassa Any possibility of this getting merged?

@fmassa (Member) commented Dec 11, 2017

@activatedgeek sorry for the delay, I'll have a look at it today.

@fmassa (Member) commented Dec 11, 2017

This looks very good, thanks! Once the merge conflicts are addressed, I think this can be merged.

@activatedgeek (Contributor, Author)

Hey @alykhantejani @fmassa, can we merge this? I can continue the discussion in #338 after this.

@activatedgeek (Contributor, Author)

Pinging @fmassa and @alykhantejani here again. Can we please merge this?

@fmassa merged commit dac9efa into pytorch:master on Jan 28, 2018
@fmassa (Member) commented Jan 28, 2018

Awesome! Thanks a lot @activatedgeek !

@activatedgeek deleted the omniglot-dataset branch on January 28, 2018 at 23:49
@activatedgeek (Contributor, Author) commented Jan 28, 2018

Hey @fmassa thanks! Do you mind taking a look at #338 as well?

@fmassa mentioned this pull request on Jan 30, 2018
3 participants