Add Poincare model (#1696)
* Initial classes and loading data for poincare model

* Initial implementation of training using autograd

* faster negative sampling, bugfix in vector updates

* allows poincare dist function to be differentiable by autograd

* batched gradient descent initial implementation

* minor changes to batch poincare distance computation

* Adds calculation of gradients for poincare model

* Correct implementation of clipping of updated vectors

* Fixes error in gradient computation

* Better messages while training

* Renames PoincareDistance to PoincareExample for clarity

* Compares computed gradients to autograd gradients every few iterations

* Avoids doing some numpy computations twice

* Avoids creating copies of numpy vectors

* Only calls nan_to_num when gamma has at least one value equal to 1

* Simply sets nan gradients to zero instead of nan_to_num

* Adds batch-wise implementation of training and gradient computations

* Minor correction in clipping

* Fixes typo in clip_vectors

* Prints average loss every few iterations instead of current loss

* Adds weighted negative sampling

* Ensures positive edges are not returned by negative sampling

* Poincare model stores node indices in relations instead of node keys

* Minor renaming; uses node indices for batch training instead of node keys

* Changes shapes of vectors passed to PoincareBatch

* Minor bugfixes related to batch size

* Corrects implementation of negative sampling for batch training

* Adds option to check gradients in batchwise training

* Checks gradients only every few iterations

* Handles multiple occurrence of same node across and within batches

* Removes unused section of code

* Implements slightly different clipping method

* Fixes bugs with wrong reshape in batchwise training

* Example-wise training takes into account multiple occurrences of same node in an example too

* Batchwise training prints average loss over many iterations instead of current batch

* Fixes bug in updating vector for batchwise training

* Faster implementation of negative sampling

* Negative sampling for a node follows different paths depending on fraction of positive relations

* Uses a buffer for negative samples to reduce calls to np.random.choice

* Cleans up poincare.py, removes unused code

* Adds shapes to PoincareBatch, more documentation

* Adds more documentation to PoincareModel

* Stores indices for nodes in a batch in PoincareBatch for better encapsulation

* More documentation for poincare module

* Implements burn-in for poincare model

* Slightly better logging for poincare model

* Uses np.random.random and np.searchsorted for random sampling rather than np.random.choice

* Removes duplicates in negative samples

* Moves helper classes in poincare after PoincareModel

* Change in PoincareModel API to allow initializing from an iterable, separate class for streaming from file

* Adds failing test for handling encoding in PoincareData

* Fixes encoding handling in PoincareData

* Adds docstrings to PoincareData, PoincareData streams tuples now

* More unittests for PoincareModel

* Changes handle_duplicates to staticmethod, adds test

* Adds batch size and print_every parameters to train method

* Renames print_check to should_print

* Adds separate parameter for checking gradients

* Minor fixes for coding style

* Removes default values from docstrings, redundant

* Adds example to PoincareModel init docstring

* Extracts buffer for negatives out into a separate class

* More detailed logging, fix to check_gradients

* Minor fixes to documentation in poincare.py

* Adds tests for gradients checking

* Raise AssertionError if gradients check fails

* Adds failing tests for saving/loading PoincareModel instances

* Fixes bug with saving/loading PoincareModel to disk

* Adds test and fix for raising error on invalid input data

* Adds test and fix for no duplicates and positives in negative sample

* Bugfix with NegativesBuffer having less than  items left

* Uses larger data for poincare tests, adds data files

* Bugfix with incorrect use of random state

* Minor fixes in documentation style

* Renames PoincareData to PoincareRelations

* Change in the order of conditions checked before resampling

* Imports datapath from test.utils instead of defining own

* Adds working examples and a more detailed description in docstring

* Renames term_relations to node_relations

* Removes unused imports

* Moves iter parameter to train instead of __init__, renames to epochs

* Fixes term_relations in tests

* Adds option to disable gradient check, disabled by default

* Extracts gradient checking code into a separate method

* Conditionally import autograd only if gradient checking is enabled

* Marks private methods in poincare module with leading underscore

* Adds init_range as an API parameter to PoincareModel

* Marks private properties with a leading underscore

* Fixes bug with burn-in happening on subsequent calls to train

* Adds test for training multiple times

* Adds autograd to test dependencies

* Renames wv to kv in PoincareModel

* add numpy==1.12 as test dependency

* add missing quote

* try to run tests without autograd

* fix PEP8 in poincare.py

* fix PEP8 in test_poincare

* PoincareRelations handles python2 correctly

* Bugfix with int division for python2

* Imports mock module for tests correctly in python2

* Cleaner implementation of __iter__ for PoincareRelations

* Adds rst file and updates apiref.rst for poincare module

* Adds clarifying comment to PoincareRelations.__iter__

* Updates rst file for poincare

* Renames hypernym pair to relations everywhere

* Simpler way of detecting duplicates

* Minor documentation updates in poincare.py

* Skips gradients test if autograd not installed, adds test for bytes input data

* Fix flake8 (noqa + remove unused var)

* Fix missing mock dependency for win

* Fix links in docstrings

* Changes error message for negative sampling failing

* Adds option to specify dtype for PoincareModel and corresponding unittest

* Extends test for dtype to check after training, updates docstring
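
Several of the bullets above concern the Poincare distance function and the clipping of updated vectors. As a rough illustration only (this is not the code from `poincare.py`), the standard Poincare-ball distance and a simple norm-based clip can be sketched in plain Python. The function names `poincare_distance` and `clip_to_ball` and the `epsilon` margin are assumptions made for this example.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Standard formula: arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    gamma = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(gamma)


def clip_to_ball(vector, epsilon=1e-5):
    """Rescale a vector so its norm stays below 1 (epsilon is an assumed margin)."""
    norm = math.sqrt(sum(x * x for x in vector))
    max_norm = 1.0 - epsilon
    if norm <= max_norm:
        return list(vector)
    return [x * max_norm / norm for x in vector]


origin = [0.0, 0.0]
near_edge = clip_to_ball([3.0, 4.0])         # norm 5 is pulled back inside the ball
print(poincare_distance(origin, [0.5, 0.0]))
print(poincare_distance(origin, near_edge))  # much larger: distances grow near the boundary
```

Clipping matters because the distance is undefined for points on or outside the unit sphere, so an update that pushes a vector past the boundary has to be pulled back before the next distance computation.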
jayantj authored and menshikh-iv committed Nov 15, 2017
1 parent b183c67 commit 0ae0f96
Showing 10 changed files with 1,181 additions and 1 deletion.
2 changes: 1 addition & 1 deletion appveyor.yml
@@ -79,7 +79,7 @@ test_script:
# installed library.
- "mkdir empty_folder"
- "cd empty_folder"
- - "pip install pyemd testfixtures sklearn Morfessor==2.0.2a4"
+ - "pip install pyemd mock testfixtures sklearn Morfessor==2.0.2a4"
- "pip freeze"
- "python -c \"import nose; nose.main()\" -s -v gensim"
# Move back to the project folder
1 change: 1 addition & 0 deletions docs/src/apiref.rst
@@ -44,6 +44,7 @@ Modules:
models/doc2vec
models/fasttext
models/phrases
models/poincare
models/coherencemodel
models/basemodel
models/callbacks
10 changes: 10 additions & 0 deletions docs/src/models/poincare.rst
@@ -0,0 +1,10 @@
:mod:`models.poincare` -- Train and use Poincare embeddings
=============================================================

.. automodule:: gensim.models.poincare
    :synopsis: Train and use Poincare embeddings
    :members:
    :inherited-members:
    :special-members: __iter__, __getitem__, __contains__
    :undoc-members:
    :show-inheritance:
848 changes: 848 additions & 0 deletions gensim/models/poincare.py

Large diffs are not rendered by default.
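
The commit message mentions weighted negative sampling that uses `np.random.random` with `np.searchsorted`, and that excludes positive relations and duplicates. The following is a minimal sketch of that general technique (inverse-CDF sampling with rejection), written with the standard library's `random` and `bisect` as stand-ins for the NumPy calls. It is not the implementation from `poincare.py`; the function name, its parameters, and the sample weights are hypothetical.

```python
import bisect
import itertools
import random


def sample_negatives(node, counts, positives, num_negatives, rng):
    """Draw distinct negative node ids for `node`, weighted by `counts`.

    counts: per-node weights, indexed by node id (assumed non-negative)
    positives: set of node ids that are positively related to `node`
    """
    excluded = set(positives) | {node}
    candidates = [i for i, w in enumerate(counts) if w > 0 and i not in excluded]
    if len(candidates) < num_negatives:
        raise ValueError("not enough candidate nodes to sample negatives from")

    cumulative = list(itertools.accumulate(counts))
    total = cumulative[-1]
    negatives = []
    seen = set(excluded)
    while len(negatives) < num_negatives:
        # Inverse-CDF draw: find the first cumulative weight above a uniform sample.
        index = bisect.bisect_right(cumulative, rng.random() * total)
        if index not in seen:  # reject positives, the node itself, and repeats
            seen.add(index)
            negatives.append(index)
    return negatives


rng = random.Random(1)
print(sample_negatives(0, [5, 1, 1, 3, 2, 4], {1, 3}, 3, rng))
```

With these inputs only nodes 2, 4 and 5 are eligible, so the result is some ordering of those three ids. Rejection sampling is cheap when a node has few positive relations; the commit message notes that the real code follows a different path when that fraction is large.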

2 changes: 2 additions & 0 deletions gensim/test/test_data/poincare_cp852.tsv
@@ -0,0 +1,2 @@
t�mto bude�
budem byli
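
The replacement characters in the file above are most likely a rendering effect: the file is stored in CP852, and its bytes are not valid UTF-8. A small standard-library sketch, using the pair that `test_encoding_handling` expects and assuming a tab separator, shows the difference between decoding with the wrong and the right codec.

```python
relation = (u"tímto", u"budeš")
line = u"\t".join(relation)

cp852_bytes = line.encode("cp852")
utf8_bytes = line.encode("utf-8")

# The same text is stored as different bytes under the two encodings.
print(cp852_bytes != utf8_bytes)

# Reading CP852 bytes as UTF-8 yields U+FFFD replacement characters.
garbled = cp852_bytes.decode("utf-8", errors="replace")
print(garbled)

# Decoding with the matching codec recovers the original pair.
decoded = tuple(cp852_bytes.decode("cp852").split(u"\t"))
print(decoded == relation)
```

This is why the test passes `encoding='cp852'` when reading `poincare_cp852.tsv` and relies on the default when reading the UTF-8 file.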
5 changes: 5 additions & 0 deletions gensim/test/test_data/poincare_hypernyms.tsv
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
kangaroo.n.01 marsupial.n.01
kangaroo.n.01 metatherian.n.01
kangaroo.n.01 mammal.n.01
gib.n.02 cat.n.01
striped_skunk.n.01 mammal.n.01
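
`test_data_counts` expects this file to produce 5 relations, 7 distinct nodes, and 3 relations for `kangaroo.n.01`. The sketch below reproduces those counts with the standard library only. It assumes the file is tab-separated with two columns per line and that relations are grouped under the index of the first node in each pair; `read_relations`, `vocab` and `node_relations` here are illustrative names, not the gensim objects.

```python
import csv
import io
from collections import defaultdict

TSV_TEXT = (
    "kangaroo.n.01\tmarsupial.n.01\n"
    "kangaroo.n.01\tmetatherian.n.01\n"
    "kangaroo.n.01\tmammal.n.01\n"
    "gib.n.02\tcat.n.01\n"
    "striped_skunk.n.01\tmammal.n.01\n"
)


def read_relations(stream):
    """Yield (node, related_node) tuples from a tab-separated stream."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != 2:
            raise ValueError("expected exactly two columns, got %r" % (row,))
        yield tuple(row)


relations = list(read_relations(io.StringIO(TSV_TEXT)))
vocab = {}                         # node key -> integer index
node_relations = defaultdict(set)  # node index -> indices of related nodes
for first, second in relations:
    for node in (first, second):
        vocab.setdefault(node, len(vocab))
    node_relations[vocab[first]].add(vocab[second])

print(len(relations), len(vocab), len(node_relations[vocab["kangaroo.n.01"]]))
# prints: 5 7 3
```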
95 changes: 95 additions & 0 deletions gensim/test/test_data/poincare_hypernyms_large.tsv
@@ -0,0 +1,95 @@
kangaroo.n.01 marsupial.n.01
kangaroo.n.01 metatherian.n.01
kangaroo.n.01 mammal.n.01
gib.n.02 cat.n.01
striped_skunk.n.01 mammal.n.01
domestic_goat.n.01 even-toed_ungulate.n.01
rock_squirrel.n.01 ground_squirrel.n.02
vizsla.n.01 dog.n.01
dandie_dinmont.n.01 mammal.n.01
broodmare.n.01 horse.n.01
spotted_skunk.n.01 spotted_skunk.n.01
hispid_pocket_mouse.n.01 hispid_pocket_mouse.n.01
lesser_kudu.n.01 placental.n.01
water_shrew.n.01 insectivore.n.01
silky_anteater.n.01 placental.n.01
giant_kangaroo.n.01 metatherian.n.01
bronco.n.01 bronco.n.01
pekinese.n.01 pekinese.n.01
seattle_slew.n.01 thoroughbred.n.02
kinkajou.n.01 kinkajou.n.01
boxer.n.04 mammal.n.01
rabbit.n.01 placental.n.01
longhorn.n.01 bovid.n.01
blue_fox.n.01 fox.n.01
woolly_monkey.n.01 new_world_monkey.n.01
jungle_cat.n.01 jungle_cat.n.01
vole.n.01 mammal.n.01
western_big-eared_bat.n.01 long-eared_bat.n.01
leopard.n.02 leopard.n.02
hackney.n.02 hackney.n.02
shetland_sheepdog.n.01 placental.n.01
coati.n.01 carnivore.n.01
wild_boar.n.01 mammal.n.01
post_horse.n.01 placental.n.01
porker.n.01 porker.n.01
mouflon.n.01 mouflon.n.01
australian_sea_lion.n.01 seal.n.09
coondog.n.01 placental.n.01
schipperke.n.01 mammal.n.01
black_rat.n.01 rodent.n.01
waterbuck.n.01 placental.n.01
hack.n.06 odd-toed_ungulate.n.01
central_chimpanzee.n.01 anthropoid_ape.n.01
harrier.n.02 harrier.n.02
lesser_panda.n.01 mammal.n.01
wether.n.01 ruminant.n.01
collie.n.01 shepherd_dog.n.01
prancer.n.01 horse.n.01
doberman.n.01 placental.n.01
pygmy_marmoset.n.01 monkey.n.01
phalanger.n.01 metatherian.n.01
black-and-tan_coonhound.n.01 black-and-tan_coonhound.n.01
woolly_monkey.n.01 primate.n.02
ferret_badger.n.01 badger.n.02
mountain_chinchilla.n.01 placental.n.01
english_foxhound.n.01 english_foxhound.n.01
leveret.n.01 leporid.n.01
shetland_sheepdog.n.01 canine.n.02
beagle.n.01 beagle.n.01
tibetan_mastiff.n.01 tibetan_mastiff.n.01
bouvier_des_flandres.n.01 canine.n.02
wheel_horse.n.01 placental.n.01
pocket_rat.n.01 rat.n.01
malinois.n.01 working_dog.n.01
white_elephant.n.02 white_elephant.n.02
camel.n.01 camel.n.01
mexican_pocket_mouse.n.01 rat.n.01
vaquita.n.01 toothed_whale.n.01
manchester_terrier.n.01 hunting_dog.n.01
chacma.n.01 monkey.n.01
binturong.n.01 viverrine.n.01
mastiff_bat.n.01 mammal.n.01
goat.n.01 mammal.n.01
pembroke.n.01 canine.n.02
steenbok.n.01 ungulate.n.01
tarsius_syrichta.n.01 mammal.n.01
maltese.n.03 domestic_cat.n.01
pacific_bottlenose_dolphin.n.01 toothed_whale.n.01
tamandua.n.01 tamandua.n.01
murine.n.01 rodent.n.01
coyote.n.01 canine.n.02
king_charles_spaniel.n.01 placental.n.01
basset.n.01 canine.n.02
pygmy_mouse.n.01 pygmy_mouse.n.01
toy_spaniel.n.01 carnivore.n.01
cactus_mouse.n.01 mouse.n.01
hart.n.03 ruminant.n.01
broodmare.n.01 equine.n.01
sussex_spaniel.n.01 sporting_dog.n.01
omaha.n.04 odd-toed_ungulate.n.01
alaska_fur_seal.n.01 placental.n.01
cattalo.n.01 bovine.n.01
soft-coated_wheaten_terrier.n.01 mammal.n.01
harness_horse.n.01 horse.n.01
banteng.n.01 even-toed_ungulate.n.01
2 changes: 2 additions & 0 deletions gensim/test/test_data/poincare_utf8.tsv
@@ -0,0 +1,2 @@
tímto budeš
budem byli
216 changes: 216 additions & 0 deletions gensim/test/test_poincare.py
@@ -0,0 +1,216 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Author: Jayant Jain <[email protected]>
# Copyright (C) 2017 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for checking the poincare module from the models package.
"""

import logging
import os
import tempfile
import unittest
try:
    from mock import Mock
except ImportError:
    from unittest.mock import Mock

import numpy as np
try:
    import autograd  # noqa:F401
    autograd_installed = True
except ImportError:
    autograd_installed = False

from gensim.models.poincare import PoincareRelations, PoincareModel
from gensim.test.utils import datapath


logger = logging.getLogger(__name__)


def testfile():
    # temporary data will be stored to this file
    return os.path.join(tempfile.gettempdir(), 'gensim_word2vec.tst')


class TestPoincareData(unittest.TestCase):
    def test_encoding_handling(self):
        """Tests whether utf8 and non-utf8 data is loaded correctly."""
        non_utf8_file = datapath('poincare_cp852.tsv')
        relations = [relation for relation in PoincareRelations(non_utf8_file, encoding='cp852')]
        self.assertEqual(len(relations), 2)
        self.assertEqual(relations[0], (u'tímto', u'budeš'))

        utf8_file = datapath('poincare_utf8.tsv')
        relations = [relation for relation in PoincareRelations(utf8_file)]
        self.assertEqual(len(relations), 2)
        self.assertEqual(relations[0], (u'tímto', u'budeš'))


class TestPoincareModel(unittest.TestCase):
    def setUp(self):
        self.data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
        self.data_large = PoincareRelations(datapath('poincare_hypernyms_large.tsv'))

    def models_equal(self, model_1, model_2):
        self.assertEqual(len(model_1.kv.vocab), len(model_2.kv.vocab))
        self.assertEqual(set(model_1.kv.vocab.keys()), set(model_2.kv.vocab.keys()))
        self.assertTrue(np.allclose(model_1.kv.syn0, model_2.kv.syn0))

    def test_data_counts(self):
        """Tests whether data has been loaded correctly and completely."""
        model = PoincareModel(self.data)
        self.assertEqual(len(model.all_relations), 5)
        self.assertEqual(len(model.node_relations[model.kv.vocab['kangaroo.n.01'].index]), 3)
        self.assertEqual(len(model.kv.vocab), 7)
        self.assertTrue('mammal.n.01' not in model.node_relations)

    def test_data_counts_with_bytes(self):
        """Tests whether input bytes data is loaded correctly and completely."""
        model = PoincareModel([(b'\x80\x01c', b'\x50\x71a'), (b'node.1', b'node.2')])
        self.assertEqual(len(model.all_relations), 2)
        self.assertEqual(len(model.node_relations[model.kv.vocab[b'\x80\x01c'].index]), 1)
        self.assertEqual(len(model.kv.vocab), 4)
        self.assertTrue(b'\x50\x71a' not in model.node_relations)

    def test_persistence(self):
        """Tests whether the model is saved and loaded correctly."""
        model = PoincareModel(self.data, burn_in=0, negative=3)
        model.train(epochs=1)
        model.save(testfile())
        loaded = PoincareModel.load(testfile())
        self.models_equal(model, loaded)

    def test_persistence_separate_file(self):
        """Tests whether the model is saved and loaded correctly when the arrays are stored separately."""
        model = PoincareModel(self.data, burn_in=0, negative=3)
        model.train(epochs=1)
        model.save(testfile(), sep_limit=1)
        loaded = PoincareModel.load(testfile())
        self.models_equal(model, loaded)

    def test_invalid_data_raises_error(self):
        """Tests that error is raised on invalid input data."""
        with self.assertRaises(ValueError):
            PoincareModel([("a", "b", "c")])
        with self.assertRaises(ValueError):
            PoincareModel(["a", "b", "c"])
        with self.assertRaises(ValueError):
            PoincareModel("ab")

    def test_vector_shape(self):
        """Tests whether vectors are initialized with the correct size."""
        model = PoincareModel(self.data, size=20)
        self.assertEqual(model.kv.syn0.shape, (7, 20))

    def test_vector_dtype(self):
        """Tests whether vectors have the correct dtype before and after training."""
        model = PoincareModel(self.data_large, dtype=np.float32, burn_in=0, negative=3)
        self.assertEqual(model.kv.syn0.dtype, np.float32)
        model.train(epochs=1)
        self.assertEqual(model.kv.syn0.dtype, np.float32)

    def test_training(self):
        """Tests that vectors are different before and after training."""
        model = PoincareModel(self.data_large, burn_in=0, negative=3)
        old_vectors = np.copy(model.kv.syn0)
        model.train(epochs=2)
        self.assertFalse(np.allclose(old_vectors, model.kv.syn0))

    def test_training_multiple(self):
        """Tests that calling train multiple times results in different vectors."""
        model = PoincareModel(self.data_large, burn_in=0, negative=3)
        model.train(epochs=2)
        old_vectors = np.copy(model.kv.syn0)

        model.train(epochs=1)
        self.assertFalse(np.allclose(old_vectors, model.kv.syn0))

        old_vectors = np.copy(model.kv.syn0)
        model.train(epochs=0)
        self.assertTrue(np.allclose(old_vectors, model.kv.syn0))

    def test_gradients_check(self):
        """Tests that the model is trained successfully with gradients check enabled."""
        model = PoincareModel(self.data, negative=3)
        try:
            model.train(epochs=1, batch_size=1, check_gradients_every=1)
        except Exception as e:
            self.fail('Exception %s raised unexpectedly while training with gradient checking' % repr(e))

    @unittest.skipIf(not autograd_installed, 'autograd needs to be installed for this test')
    def test_wrong_gradients_raises_assertion(self):
        """Tests that discrepancy in gradients raises an error."""
        model = PoincareModel(self.data, negative=3)
        model._loss_grad = Mock(return_value=np.zeros((2 + model.negative, model.size)))
        with self.assertRaises(AssertionError):
            model.train(epochs=1, batch_size=1, check_gradients_every=1)

    def test_reproducible(self):
        """Tests that vectors are same for two independent models trained with the same seed."""
        model_1 = PoincareModel(self.data_large, seed=1, negative=3, burn_in=1)
        model_1.train(epochs=2)

        model_2 = PoincareModel(self.data_large, seed=1, negative=3, burn_in=1)
        model_2.train(epochs=2)
        self.assertTrue(np.allclose(model_1.kv.syn0, model_2.kv.syn0))

    def test_burn_in(self):
        """Tests that vectors are different after burn-in."""
        model = PoincareModel(self.data, burn_in=1, negative=3)
        original_vectors = np.copy(model.kv.syn0)
        model.train(epochs=0)
        self.assertFalse(np.allclose(model.kv.syn0, original_vectors))

    def test_burn_in_only_done_once(self):
        """Tests that burn-in does not happen when train is called a second time."""
        model = PoincareModel(self.data, negative=3, burn_in=1)
        model.train(epochs=0)
        original_vectors = np.copy(model.kv.syn0)
        model.train(epochs=0)
        self.assertTrue(np.allclose(model.kv.syn0, original_vectors))

    def test_negatives(self):
        """Tests that correct number of negatives are sampled."""
        model = PoincareModel(self.data, negative=5)
        self.assertEqual(len(model._get_candidate_negatives()), 5)

    def test_error_if_negative_more_than_population(self):
        """Tests error is raised if number of negatives to sample is more than remaining nodes."""
        model = PoincareModel(self.data, negative=5)
        with self.assertRaises(ValueError):
            model.train(epochs=1)

    def test_no_duplicates_and_positives_in_negative_sample(self):
        """Tests that no duplicates or positively related nodes are present in negative samples."""
        model = PoincareModel(self.data_large, negative=3)
        positive_nodes = model.node_relations[0]  # Positive nodes for node 0
        num_samples = 100  # Repeat experiment multiple times
        for i in range(num_samples):
            negatives = model._sample_negatives(0)
            self.assertFalse(positive_nodes & set(negatives))
            self.assertEqual(len(negatives), len(set(negatives)))

    def test_handle_duplicates(self):
        """Tests that updates for duplicate node indices are summed into a single row."""
        vector_updates = np.array([[0.5, 0.5], [0.1, 0.2], [0.3, -0.2]])
        node_indices = [0, 1, 0]
        PoincareModel._handle_duplicates(vector_updates, node_indices)
        vector_updates_expected = np.array([[0.0, 0.0], [0.1, 0.2], [0.8, 0.3]])
        self.assertTrue((vector_updates == vector_updates_expected).all())

    @classmethod
    def tearDownClass(cls):
        try:
            os.unlink(testfile())
        except OSError:
            pass


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
    unittest.main()
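
The behavior that `test_handle_duplicates` checks can be read off its expected values: when a node index appears more than once, the update rows for it are summed into the last occurrence and the earlier rows are zeroed. A plain-Python sketch of that behavior follows. It is inferred from the test alone and is not the `_handle_duplicates` implementation; the function name is hypothetical.

```python
def accumulate_duplicate_updates(updates, node_indices):
    """Sum rows that share a node index into the last such row, zeroing the others (in place)."""
    last_position = {}
    for position, node in enumerate(node_indices):
        last_position[node] = position  # later occurrences overwrite earlier ones

    for position, node in enumerate(node_indices):
        target = last_position[node]
        if position == target:
            continue
        for k, value in enumerate(updates[position]):
            updates[target][k] += value
            updates[position][k] = 0.0
    return updates


updates = [[0.5, 0.5], [0.1, 0.2], [0.3, -0.2]]
accumulate_duplicate_updates(updates, [0, 1, 0])
print(updates)  # rows 0 and 2 both belong to node 0, so row 2 now holds their sum
```

A plausible reason for combining updates this way is that a fancy-indexed assignment applies only one row per repeated index, so merging the rows first keeps every contribution to a node that occurs several times in a batch.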
1 change: 1 addition & 0 deletions setup.py
@@ -233,6 +233,7 @@ def finalize_options(self):
'annoy',
'tensorflow <= 1.3.0',
'keras >= 2.0.4',
'mock==2.0.0',
]

setup(