Add Poincare model (#1696)

* Initial classes and loading data for poincare model
* Initial implementation of training using autograd
* faster negative sampling, bugfix in vector updates
* allows poincare dist function to be differentiable by autograd
* batched gradient descent initial implementation
* minor changes to batch poincare distance computation
* Adds calculation of gradients for poincare model
* Correct implementation of clipping of updated vectors
* Fixes error in gradient computation
* Better messages while training
* Renames PoincareDistance to PoincareExample for clarity
* Compares computed gradients to autograd gradients every few iterations
* Avoids doing some numpy computations twice
* Avoids creating copies of numpy vectors
* Only calls nan_to_num when gamma has at least one value equal to 1
* Simply sets nan gradients to zero instead of nan_to_num
* Adds batch-wise implementation of training and gradient computations
* Minor correction in clipping
* Fixes typo in clip_vectors
* Prints average loss every few iterations instead of current loss
* Adds weighted negative sampling
* Ensures positive edges are not returned by negative sampling
* Poincare model stores node indices in relations instead of node keys
* Minor renaming; uses node indices for batch training instead of node keys
* Changes shapes of vectors passed to PoincareBatch
* Minor bugfixes related to batch size
* Corrects implementation of negative sampling for batch training
* Adds option to check gradients in batchwise training
* Checks gradients only every few iterations
* Handles multiple occurrence of same node across and within batches
* Removes unused section of code
* Implements slightly different clipping method
* Fixes bugs with wrong reshape in batchwise training
* Example-wise training takes into account multiple occurrences of same node in an example too
* Batchwise training prints average loss over many iterations instead of current batch
* Fixes bug in updating vector for batchwise training
* Faster implementation of negative sampling
* Negative sampling for a node follows different paths depending on fraction of positive relations
* Uses a buffer for negative samples to reduce calls to np.random.choice
* Cleans up poincare.py, removes unused code
* Adds shapes to PoincareBatch, more documentation
* Adds more documentation to PoincareModel
* Stores indices for nodes in a batch in PoincareBatch for better encapsulation
* More documentation for poincare module
* Implements burn-in for poincare model
* Slightly better logging for poincare model
* Uses np.random.random and np.searchsorted for random sampling rather than np.random.choice
* Removes duplicates in negative samples
* Moves helper classes in poincare after PoincareModel
* Change in PoincareModel API to allow initializing from an iterable, separate class for streaming from file
* Adds failing test for handling encoding in PoincareData
* Fixes encoding handling in PoincareData
* Adds docstrings to PoincareData, PoincareData streams tuples now
* More unittests for PoincareModel
* Changes handle_duplicates to staticmethod, adds test
* Adds batch size and print_every parameters to train method
* Renames print_check to should_print
* Adds separate parameter for checking gradients
* Minor fixes for coding style
* Removes default values from docstrings, redundant
* Adds example to PoincareModel init docstring
* Extracts buffer for negatives out into a separate class
* More detailed logging, fix to check_gradients
* Minor fixes to documentation in poincare.py
* Adds tests for gradients checking
* Raise AssertionError if gradients check fails
* Adds failing tests for saving/loading PoincareModel instances
* Fixes bug with saving/loading PoincareModel to disk
* Adds test and fix for raising error on invalid input data
* Adds test and fix for no duplicates and positives in negative sample
* Bugfix with NegativesBuffer having less than items left
* Uses larger data for poincare tests, adds data files
* Bugfix with incorrect use of random state
* Minor fixes in documentation style
* Renames PoincareData to PoincareRelations
* Change in the order of conditions checked before resampling
* Imports datapath from test.utils instead of defining own
* Adds working examples and a more detailed description in docstring
* Renames term_relations to node_relations
* Removes unused imports
* Moves iter parameter to train instead of __init__, renames to epochs
* Fixes term_relations in tests
* Adds option to disable gradient check, disabled by default
* Extracts gradient checking code into a separate method
* Conditionally import autograd only if gradient checking is enabled
* Marks private methods in poincare module with leading underscore
* Adds init_range as an API parameter to PoincareModel
* Marks private properties with a leading underscore
* Fixes bug with burn-in happening on subsequent calls to train
* Adds test for training multiple times
* Adds autograd to test dependencies
* Renames wv to kv in PoincareModel
* add numpy==1.12 as test dependency
* add missing quote
* try to run tests without autograd
* fix PEP8 in poincare.py
* fix PEP8 in test_poincare
* PoincareRelations handles python2 correctly
* Bugfix with int division for python2
* Imports mock module for tests correctly in python2
* Cleaner implementation of __iter__ for PoincareRelations
* Adds rst file and updates apiref.rst for poincare module
* Adds clarifying comment to PoincareRelations.__iter__
* Updates rst file for poincare
* Renames hypernym pair to relations everywhere
* Simpler way of detecting duplicates
* Minor documentation updates in poincare.py
* Skips gradients test if autograd not installed, adds test for bytes input data
* Fix flake8 (noqa + remove unused var)
* Fix missing mock dependency for win
* Fix links in docstrings
* Changes error message for negative sampling failing
* Adds option to specify dtype for PoincareModel and corresponding unittest
* Extends test for dtype to check after training, updates docstring
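
For orientation, the distance computation and the clipping of updated vectors named in the commit message can be sketched in plain Python. This is only an illustration of the standard Poincaré ball distance, not the code in gensim/models/poincare.py (whose body is not shown in this diff). The names `poincare_distance`, `clip_vector` and the `EPSILON` margin are assumptions made for this example.

```python
import math

EPSILON = 1e-5  # assumed margin that keeps vectors strictly inside the unit ball


def poincare_distance(u, v):
    """Distance between two points that lie inside the open unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    # gamma is always >= 1, so acosh is well defined.
    gamma = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(gamma)


def clip_vector(vector, epsilon=EPSILON):
    """Rescale a vector whose norm exceeds 1 - epsilon back inside the ball."""
    norm = math.sqrt(sum(x * x for x in vector))
    max_norm = 1.0 - epsilon
    if norm <= max_norm:
        return list(vector)
    return [x * max_norm / norm for x in vector]
```

The denominator goes to zero as either point approaches the boundary of the ball, which is presumably why the commit clips updated vectors and special-cases NaN gradients.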
piskvorky · Nov 15, 2017 · 0ae0f96
1 parent b183c67
commit 0ae0f96
Showing 10 changed files with 1,181 additions and 1 deletion.
diff --git a/appveyor.yml b/appveyor.yml
@@ -79,7 +79,7 @@ test_script:
   # installed library.
   - "mkdir empty_folder"
   - "cd empty_folder"
-  - "pip install pyemd testfixtures sklearn Morfessor==2.0.2a4"
+  - "pip install pyemd mock testfixtures sklearn Morfessor==2.0.2a4"
   - "pip freeze"
   - "python -c \"import nose; nose.main()\" -s -v gensim"
   # Move back to the project folder

diff --git a/docs/src/apiref.rst b/docs/src/apiref.rst
@@ -44,6 +44,7 @@ Modules:
     models/doc2vec
     models/fasttext
     models/phrases
+    models/poincare
     models/coherencemodel
     models/basemodel
     models/callbacks

diff --git a/docs/src/models/poincare.rst b/docs/src/models/poincare.rst
@@ -0,0 +1,10 @@
+:mod:`models.poincare` -- Train and use Poincare embeddings
+=============================================================
+
+.. automodule:: gensim.models.poincare
+    :synopsis: Train and use Poincare embeddings
+    :members:
+    :inherited-members:
+    :special-members: __iter__, __getitem__, __contains__
+    :undoc-members:
+    :show-inheritance:
diff --git a/gensim/models/poincare.py b/gensim/models/poincare.py
diff --git a/gensim/test/test_data/poincare_cp852.tsv b/gensim/test/test_data/poincare_cp852.tsv
@@ -0,0 +1,2 @@
+tímto	budeš
+budem	byli
diff --git a/gensim/test/test_data/poincare_hypernyms.tsv b/gensim/test/test_data/poincare_hypernyms.tsv
@@ -0,0 +1,5 @@
+kangaroo.n.01	marsupial.n.01
+kangaroo.n.01	metatherian.n.01
+kangaroo.n.01	mammal.n.01
+gib.n.02	cat.n.01
+striped_skunk.n.01	mammal.n.01
diff --git a/gensim/test/test_data/poincare_hypernyms_large.tsv b/gensim/test/test_data/poincare_hypernyms_large.tsv
@@ -0,0 +1,95 @@
+kangaroo.n.01	marsupial.n.01
+kangaroo.n.01	metatherian.n.01
+kangaroo.n.01	mammal.n.01
+gib.n.02	cat.n.01
+striped_skunk.n.01	mammal.n.01
+domestic_goat.n.01	even-toed_ungulate.n.01
+rock_squirrel.n.01	ground_squirrel.n.02
+vizsla.n.01	dog.n.01
+dandie_dinmont.n.01	mammal.n.01
+broodmare.n.01	horse.n.01
+spotted_skunk.n.01	spotted_skunk.n.01
+hispid_pocket_mouse.n.01	hispid_pocket_mouse.n.01
+lesser_kudu.n.01	placental.n.01
+water_shrew.n.01	insectivore.n.01
+silky_anteater.n.01	placental.n.01
+giant_kangaroo.n.01	metatherian.n.01
+bronco.n.01	bronco.n.01
+pekinese.n.01	pekinese.n.01
+seattle_slew.n.01	thoroughbred.n.02
+kinkajou.n.01	kinkajou.n.01
+boxer.n.04	mammal.n.01
+rabbit.n.01	placental.n.01
+longhorn.n.01	bovid.n.01
+blue_fox.n.01	fox.n.01
+woolly_monkey.n.01	new_world_monkey.n.01
+jungle_cat.n.01	jungle_cat.n.01
+vole.n.01	mammal.n.01
+western_big-eared_bat.n.01	long-eared_bat.n.01
+leopard.n.02	leopard.n.02
+hackney.n.02	hackney.n.02
+shetland_sheepdog.n.01	placental.n.01
+coati.n.01	carnivore.n.01
+wild_boar.n.01	mammal.n.01
+post_horse.n.01	placental.n.01
+porker.n.01	porker.n.01
+mouflon.n.01	mouflon.n.01
+australian_sea_lion.n.01	seal.n.09
+coondog.n.01	placental.n.01
+schipperke.n.01	mammal.n.01
+black_rat.n.01	rodent.n.01
+waterbuck.n.01	placental.n.01
+hack.n.06	odd-toed_ungulate.n.01
+central_chimpanzee.n.01	anthropoid_ape.n.01
+harrier.n.02	harrier.n.02
+lesser_panda.n.01	mammal.n.01
+wether.n.01	ruminant.n.01
+collie.n.01	shepherd_dog.n.01
+prancer.n.01	horse.n.01
+doberman.n.01	placental.n.01
+pygmy_marmoset.n.01	monkey.n.01
+phalanger.n.01	metatherian.n.01
+black-and-tan_coonhound.n.01	black-and-tan_coonhound.n.01
+woolly_monkey.n.01	primate.n.02
+ferret_badger.n.01	badger.n.02
+mountain_chinchilla.n.01	placental.n.01
+english_foxhound.n.01	english_foxhound.n.01
+leveret.n.01	leporid.n.01
+shetland_sheepdog.n.01	canine.n.02
+beagle.n.01	beagle.n.01
+tibetan_mastiff.n.01	tibetan_mastiff.n.01
+bouvier_des_flandres.n.01	canine.n.02
+wheel_horse.n.01	placental.n.01
+pocket_rat.n.01	rat.n.01
+malinois.n.01	working_dog.n.01
+white_elephant.n.02	white_elephant.n.02
+camel.n.01	camel.n.01
+mexican_pocket_mouse.n.01	rat.n.01
+vaquita.n.01	toothed_whale.n.01
+manchester_terrier.n.01	hunting_dog.n.01
+chacma.n.01	monkey.n.01
+binturong.n.01	viverrine.n.01
+mastiff_bat.n.01	mammal.n.01
+goat.n.01	mammal.n.01
+pembroke.n.01	canine.n.02
+steenbok.n.01	ungulate.n.01
+tarsius_syrichta.n.01	mammal.n.01
+maltese.n.03	domestic_cat.n.01
+pacific_bottlenose_dolphin.n.01	toothed_whale.n.01
+tamandua.n.01	tamandua.n.01
+murine.n.01	rodent.n.01
+coyote.n.01	canine.n.02
+king_charles_spaniel.n.01	placental.n.01
+basset.n.01	canine.n.02
+pygmy_mouse.n.01	pygmy_mouse.n.01
+toy_spaniel.n.01	carnivore.n.01
+cactus_mouse.n.01	mouse.n.01
+hart.n.03	ruminant.n.01
+broodmare.n.01	equine.n.01
+sussex_spaniel.n.01	sporting_dog.n.01
+omaha.n.04	odd-toed_ungulate.n.01
+alaska_fur_seal.n.01	placental.n.01
+cattalo.n.01	bovine.n.01
+soft-coated_wheaten_terrier.n.01	mammal.n.01
+harness_horse.n.01	horse.n.01
+banteng.n.01	even-toed_ungulate.n.01
diff --git a/gensim/test/test_data/poincare_utf8.tsv b/gensim/test/test_data/poincare_utf8.tsv
@@ -0,0 +1,2 @@
+tímto	budeš
+budem	byli
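
The two small fixtures above hold the same pair of relations in different byte encodings. The sketch below shows, using only the standard library, why a reader has to be told the encoding to get identical text back. `read_relations` and `write_sample` are hypothetical helpers written for this example; they are not the `PoincareRelations` class, whose implementation is not shown in this diff.

```python
import io
import os
import tempfile


def read_relations(path, encoding="utf8", delimiter="\t"):
    """Yield tuples of text, one per non-empty line of a delimited file."""
    with io.open(path, "r", encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line:
                yield tuple(line.split(delimiter))


def write_sample(encoding):
    """Write the two-line sample in the given encoding and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with io.open(fd, "w", encoding=encoding) as handle:
        handle.write(u"tímto\tbudeš\nbudem\tbyli\n")
    return path


cp852_path = write_sample("cp852")
utf8_path = write_sample("utf8")
cp852_relations = list(read_relations(cp852_path, encoding="cp852"))
utf8_relations = list(read_relations(utf8_path))
os.remove(cp852_path)
os.remove(utf8_path)
```

Both lists hold the same tuples once each file is decoded with its own encoding, which is the property `test_encoding_handling` checks further down.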
diff --git a/gensim/test/test_poincare.py b/gensim/test/test_poincare.py
@@ -0,0 +1,216 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Author: Jayant Jain <[email protected]>
+# Copyright (C) 2017 Radim Rehurek <[email protected]>
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+
+"""
+Automated tests for checking the poincare module from the models package.
+"""
+
+import logging
+import os
+import tempfile
+import unittest
+try:
+    from mock import Mock
+except ImportError:
+    from unittest.mock import Mock
+
+import numpy as np
+try:
+    import autograd  # noqa:F401
+    autograd_installed = True
+except ImportError:
+    autograd_installed = False
+
+from gensim.models.poincare import PoincareRelations, PoincareModel
+from gensim.test.utils import datapath
+
+
+logger = logging.getLogger(__name__)
+
+
+def testfile():
+    # temporary data will be stored to this file
+    return os.path.join(tempfile.gettempdir(), 'gensim_word2vec.tst')
+
+
+class TestPoincareData(unittest.TestCase):
+    def test_encoding_handling(self):
+        """Tests whether utf8 and non-utf8 data is loaded correctly."""
+        non_utf8_file = datapath('poincare_cp852.tsv')
+        relations = [relation for relation in PoincareRelations(non_utf8_file, encoding='cp852')]
+        self.assertEqual(len(relations), 2)
+        self.assertEqual(relations[0], (u'tímto', u'budeš'))
+
+        utf8_file = datapath('poincare_utf8.tsv')
+        relations = [relation for relation in PoincareRelations(utf8_file)]
+        self.assertEqual(len(relations), 2)
+        self.assertEqual(relations[0], (u'tímto', u'budeš'))
+
+
+class TestPoincareModel(unittest.TestCase):
+    def setUp(self):
+        self.data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
+        self.data_large = PoincareRelations(datapath('poincare_hypernyms_large.tsv'))
+
+    def models_equal(self, model_1, model_2):
+        self.assertEqual(len(model_1.kv.vocab), len(model_2.kv.vocab))
+        self.assertEqual(set(model_1.kv.vocab.keys()), set(model_2.kv.vocab.keys()))
+        self.assertTrue(np.allclose(model_1.kv.syn0, model_2.kv.syn0))
+
+    def test_data_counts(self):
+        """Tests whether data has been loaded correctly and completely."""
+        model = PoincareModel(self.data)
+        self.assertEqual(len(model.all_relations), 5)
+        self.assertEqual(len(model.node_relations[model.kv.vocab['kangaroo.n.01'].index]), 3)
+        self.assertEqual(len(model.kv.vocab), 7)
+        self.assertTrue(model.kv.vocab['mammal.n.01'].index not in model.node_relations)
+
+    def test_data_counts_with_bytes(self):
+        """Tests whether input bytes data is loaded correctly and completely."""
+        model = PoincareModel([(b'\x80\x01c', b'\x50\x71a'), (b'node.1', b'node.2')])
+        self.assertEqual(len(model.all_relations), 2)
+        self.assertEqual(len(model.node_relations[model.kv.vocab[b'\x80\x01c'].index]), 1)
+        self.assertEqual(len(model.kv.vocab), 4)
+        self.assertTrue(model.kv.vocab[b'\x50\x71a'].index not in model.node_relations)
+
+    def test_persistence(self):
+        """Tests whether the model is saved and loaded correctly."""
+        model = PoincareModel(self.data, burn_in=0, negative=3)
+        model.train(epochs=1)
+        model.save(testfile())
+        loaded = PoincareModel.load(testfile())
+        self.models_equal(model, loaded)
+
+    def test_persistence_separate_file(self):
+        """Tests whether the model is saved and loaded correctly when the arrays are stored separately."""
+        model = PoincareModel(self.data, burn_in=0, negative=3)
+        model.train(epochs=1)
+        model.save(testfile(), sep_limit=1)
+        loaded = PoincareModel.load(testfile())
+        self.models_equal(model, loaded)
+
+    def test_invalid_data_raises_error(self):
+        """Tests that error is raised on invalid input data."""
+        with self.assertRaises(ValueError):
+            PoincareModel([("a", "b", "c")])
+        with self.assertRaises(ValueError):
+            PoincareModel(["a", "b", "c"])
+        with self.assertRaises(ValueError):
+            PoincareModel("ab")
+
+    def test_vector_shape(self):
+        """Tests whether vectors are initialized with the correct size."""
+        model = PoincareModel(self.data, size=20)
+        self.assertEqual(model.kv.syn0.shape, (7, 20))
+
+    def test_vector_dtype(self):
+        """Tests whether vectors have the correct dtype before and after training."""
+        model = PoincareModel(self.data_large, dtype=np.float32, burn_in=0, negative=3)
+        self.assertEqual(model.kv.syn0.dtype, np.float32)
+        model.train(epochs=1)
+        self.assertEqual(model.kv.syn0.dtype, np.float32)
+
+    def test_training(self):
+        """Tests that vectors are different before and after training."""
+        model = PoincareModel(self.data_large, burn_in=0, negative=3)
+        old_vectors = np.copy(model.kv.syn0)
+        model.train(epochs=2)
+        self.assertFalse(np.allclose(old_vectors, model.kv.syn0))
+
+    def test_training_multiple(self):
+        """Tests that calling train multiple times results in different vectors."""
+        model = PoincareModel(self.data_large, burn_in=0, negative=3)
+        model.train(epochs=2)
+        old_vectors = np.copy(model.kv.syn0)
+
+        model.train(epochs=1)
+        self.assertFalse(np.allclose(old_vectors, model.kv.syn0))
+
+        old_vectors = np.copy(model.kv.syn0)
+        model.train(epochs=0)
+        self.assertTrue(np.allclose(old_vectors, model.kv.syn0))
+
+    @unittest.skipIf(not autograd_installed, 'autograd needs to be installed for this test')
+    def test_gradients_check(self):
+        """Tests that the model is trained successfully with gradients check enabled."""
+        model = PoincareModel(self.data, negative=3)
+        try:
+            model.train(epochs=1, batch_size=1, check_gradients_every=1)
+        except Exception as e:
+            self.fail('Exception %s raised unexpectedly while training with gradient checking' % repr(e))
+
+    @unittest.skipIf(not autograd_installed, 'autograd needs to be installed for this test')
+    def test_wrong_gradients_raises_assertion(self):
+        """Tests that discrepancy in gradients raises an error."""
+        model = PoincareModel(self.data, negative=3)
+        model._loss_grad = Mock(return_value=np.zeros((2 + model.negative, model.size)))
+        with self.assertRaises(AssertionError):
+            model.train(epochs=1, batch_size=1, check_gradients_every=1)
+
+    def test_reproducible(self):
+        """Tests that vectors are same for two independent models trained with the same seed."""
+        model_1 = PoincareModel(self.data_large, seed=1, negative=3, burn_in=1)
+        model_1.train(epochs=2)
+
+        model_2 = PoincareModel(self.data_large, seed=1, negative=3, burn_in=1)
+        model_2.train(epochs=2)
+        self.assertTrue(np.allclose(model_1.kv.syn0, model_2.kv.syn0))
+
+    def test_burn_in(self):
+        """Tests that vectors are different after burn-in."""
+        model = PoincareModel(self.data, burn_in=1, negative=3)
+        original_vectors = np.copy(model.kv.syn0)
+        model.train(epochs=0)
+        self.assertFalse(np.allclose(model.kv.syn0, original_vectors))
+
+    def test_burn_in_only_done_once(self):
+        """Tests that burn-in does not happen when train is called a second time."""
+        model = PoincareModel(self.data, negative=3, burn_in=1)
+        model.train(epochs=0)
+        original_vectors = np.copy(model.kv.syn0)
+        model.train(epochs=0)
+        self.assertTrue(np.allclose(model.kv.syn0, original_vectors))
+
+    def test_negatives(self):
+        """Tests that correct number of negatives are sampled."""
+        model = PoincareModel(self.data, negative=5)
+        self.assertEqual(len(model._get_candidate_negatives()), 5)
+
+    def test_error_if_negative_more_than_population(self):
+        """Tests that an error is raised if the number of negatives to sample exceeds the remaining nodes."""
+        model = PoincareModel(self.data, negative=5)
+        with self.assertRaises(ValueError):
+            model.train(epochs=1)
+
+    def test_no_duplicates_and_positives_in_negative_sample(self):
+        """Tests that no duplicates or positively related nodes are present in negative samples."""
+        model = PoincareModel(self.data_large, negative=3)
+        positive_nodes = model.node_relations[0]  # Positive nodes for node 0
+        num_samples = 100  # Repeat experiment multiple times
+        for i in range(num_samples):
+            negatives = model._sample_negatives(0)
+            self.assertFalse(positive_nodes & set(negatives))
+            self.assertEqual(len(negatives), len(set(negatives)))
+
+    def test_handle_duplicates(self):
+        """Tests that updates for a repeated node index are summed into its last occurrence."""
+        vector_updates = np.array([[0.5, 0.5], [0.1, 0.2], [0.3, -0.2]])
+        node_indices = [0, 1, 0]
+        PoincareModel._handle_duplicates(vector_updates, node_indices)
+        vector_updates_expected = np.array([[0.0, 0.0], [0.1, 0.2], [0.8, 0.3]])
+        self.assertTrue((vector_updates == vector_updates_expected).all())
+
+    @classmethod
+    def tearDownClass(cls):
+        try:
+            os.unlink(testfile())
+        except OSError:
+            pass
+
+
+if __name__ == '__main__':
+    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
+    unittest.main()
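
The behaviour that `test_handle_duplicates` expects can be sketched in plain Python: when a node index occurs more than once, its update rows are summed into the last occurrence and the earlier rows are zeroed. This is an illustration inferred from the test's input and expected output only. `merge_duplicate_updates` is a hypothetical name, and unlike `PoincareModel._handle_duplicates` in the test it returns a new list instead of modifying a NumPy array in place.

```python
from collections import defaultdict


def merge_duplicate_updates(vector_updates, node_indices):
    """Sum rows that share a node index into the last such row; zero the others."""
    positions = defaultdict(list)
    for row, node_index in enumerate(node_indices):
        positions[node_index].append(row)

    merged = [list(update) for update in vector_updates]
    for rows in positions.values():
        if len(rows) < 2:
            continue  # node appears once, nothing to merge
        last = rows[-1]
        for row in rows[:-1]:
            merged[last] = [a + b for a, b in zip(merged[last], merged[row])]
            merged[row] = [0.0] * len(merged[row])
    return merged


merged = merge_duplicate_updates(
    [[0.5, 0.5], [0.1, 0.2], [0.3, -0.2]],
    [0, 1, 0],
)
```

A plausible motivation, stated here as an assumption: NumPy indexed assignment with repeated indices keeps only one of the writes, so merging first ensures every contribution to a node is applied.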
diff --git a/setup.py b/setup.py
@@ -233,6 +233,7 @@ def finalize_options(self):
     'annoy',
     'tensorflow <= 1.3.0',
     'keras >= 2.0.4',
+    'mock==2.0.0',
 ]
 
 setup(