This repository contains a Keras implementation of a text autoencoder.
I downloaded a toy dataset from https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews on which we will train the autoencoder. You can replace this dataset with any other one; just put it in the data/ directory.
- Make sure you put your embedding file in the embeddings/ directory. In this example we use GloVe vectors of size 300.
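If you want to inspect the embedding file before running the pipeline, a minimal sketch like the following (not one of the repository's scripts; the function name load_glove is illustrative) reads the GloVe vectors into a {word: vector} dictionary:

```python
import numpy as np

def load_glove(path, dim=300):
    """Read a GloVe text file into a {word: numpy vector} dict."""
    embeddings = {}
    with open(path) as f:
        for line in f:
            values = line.rstrip().split(' ')
            # Some GloVe 840B tokens contain spaces, so take the last
            # `dim` fields as the vector and join the rest as the word.
            word = ' '.join(values[:-dim])
            embeddings[word] = np.asarray(values[-dim:], dtype='float32')
    return embeddings

glove = load_glove('embeddings/glove.840B.300d.txt', dim=300)
```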
- Data preprocessing: whatever the format of your input, you should preprocess it. The prepare_dataset.py script uses the Keras preprocessing utilities to tokenize and pad the input sequences, and finally embeds them. Use the following command:
python prepare_dataset.py ~/udc_coherence/dual_encoder_udc/embeddings/glove.840B.300d.txt data/params data/reviews.csv
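For reference, here is a hedged sketch of this preprocessing step, assuming the reviews are available as a list of strings and glove is the dictionary from the previous sketch (the exact logic of prepare_dataset.py may differ):

```python
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def preprocess(texts, glove, seq_length=100, dim=300):
    """Tokenize, pad and embed a list of raw texts."""
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)        # words -> integer ids
    padded = pad_sequences(sequences, maxlen=seq_length)   # fixed-length sequences

    # Lookup table from token id to GloVe vector (zeros for padding/unknown words).
    lookup = np.zeros((len(tokenizer.word_index) + 1, dim), dtype='float32')
    for word, idx in tokenizer.word_index.items():
        if word in glove:
            lookup[idx] = glove[word]

    embedded = lookup[padded]    # shape: (n_samples, seq_length, dim)
    return embedded, tokenizer
```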
- Training: the train_autoencoder.py script assumes that your input data is already preprocessed and embedded (see step 2). It produces a model that is stored in the models/ directory.
python train_autoencoder.py --seq_length 100 --n_epochs 100 --optimizer adam --input_data data/reviews.csv.pkl --model_fname models/autoencoder.h5
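The snippet below is a hedged sketch of what such a sequence autoencoder can look like in Keras (an LSTM encoder that compresses the embedded sequence into a fixed-size vector, and an LSTM decoder that reconstructs it); the actual layers and hyperparameters are defined in train_autoencoder.py, and LATENT_DIM here is an assumed value:

```python
from keras.models import Model
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense

SEQ_LENGTH, EMBED_DIM, LATENT_DIM = 100, 300, 128

inputs = Input(shape=(SEQ_LENGTH, EMBED_DIM))
encoded = LSTM(LATENT_DIM)(inputs)                         # sequence -> latent vector
decoded = RepeatVector(SEQ_LENGTH)(encoded)                # latent vector -> sequence
decoded = LSTM(LATENT_DIM, return_sequences=True)(decoded)
outputs = TimeDistributed(Dense(EMBED_DIM))(decoded)       # reconstruct the embedding at each step

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(embedded, embedded, epochs=100)
# autoencoder.save('models/autoencoder.h5')
```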
- Evaluation: run the evaluate_autoencoder.py script to measure the autoencoder's capacity to produce output similar to the input sequences.
python evaluate_autoencoder.py --input_data data/reviews.csv.pkl --model_fname models/autoencoder.h5 --output_data data/encoded_reviews.pkl
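One simple way to quantify this, sketched below under the assumption that the model reconstructs the embedded sequences, is the mean cosine similarity between inputs and reconstructions (the exact metric used by evaluate_autoencoder.py may differ):

```python
import numpy as np
from keras.models import load_model

def mean_cosine_similarity(model_path, embedded):
    """Average cosine similarity between inputs and their reconstructions."""
    model = load_model(model_path)
    reconstructed = model.predict(embedded)
    a = embedded.reshape(len(embedded), -1)
    b = reconstructed.reshape(len(reconstructed), -1)
    sims = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return float(np.mean(sims))

# Example: print(mean_cosine_similarity('models/autoencoder.h5', embedded))
```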
Requirements:
- Python 2.7
- Keras 2.0.8
- Theano 1.0.0
- Numpy 1.14.0