C implementation of Luke Vilnis and Andrew McCallum's "Word Representations via Gaussian Embedding" (ICLR 2015), in which each word is represented as a multivariate Gaussian distribution.
GCC is required for the installation. The code is compiled by running 'make'.
Embeddings can be learned by executing './learn -train FILE [OPTIONS]', where FILE is the training corpus.
Example: './learn -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -binary 0 -iter 3'
The 40 closest embeddings to a query word can be displayed with './distance FILE', where FILE contains word projections in the binary format. The same tool also lists the top 100 nearest words, sorted by descending variance.
Word embeddings in binary format can be converted to a readable (text) format with './binary2text FILE', where FILE contains word projections in the binary format. It is also possible to drop the header and/or the covariance matrix, and to write the means and the covariance matrices to separate files.
An example of how to visualize the embeddings is shown in visualize.m. It requires the embeddings generated by binary2text with the option '-sep-mat 1', and the LS-SVMlab toolbox must be imported.