In this project we implement tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection. Semantic similarity detection has applications ranging from question answering to plagiarism detection. The idea is to add topic information to a pre-trained BERT model.
BERT uses only the encoder part of the Transformer and has the following features:
- Can be fine-tuned with just one additional output layer
- Applications such as question answering and natural language inference
- Adds [CLS] at the beginning of the input
- Adds [SEP] between sentences
- Position, segment, and token embeddings
- Two pre-training tasks:
- Masked language modeling: mask some percentage of the input tokens and predict them
- Next sentence prediction: predict whether the second sentence actually follows the first
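As an illustration of the input format, here is a minimal sketch using the Hugging Face transformers library (the sentence pair and the bert-base-uncased checkpoint are illustrative assumptions). It shows the [CLS] and [SEP] special tokens and the segment (token type) ids:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and combine a hypothetical sentence pair into one sequence.
encoding = tokenizer("How old are you?", "What is your age?")

# [CLS] is prepended; [SEP] separates and terminates the sentences.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]',
#  'what', 'is', 'your', 'age', '?', '[SEP]']

# Segment embeddings come from the token type ids: 0 for sentence 1,
# 1 for sentence 2. Position embeddings are added inside the model.
print(encoding["token_type_ids"])
```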
Two methods of topic modeling are as follows:
- GSDMM
- Similar to LDA, but assumes only one topic per document; specifically aimed at detecting topics in short documents
- LDA
- A Bayesian unsupervised learning method
- Generates topics based on word frequencies
- Allows a mixture of topics in each document
- Starts by randomly assigning a topic to each word of each document
- Counts the frequency of each topic in a document, c(Tj, Di)
- Counts the frequency with which each word is assigned to a topic, c(wq, Tj)
- Removes a word's current topic assignment and updates both counts
- Multiplies c(Tj, Di) and c(wq, Tj) for each topic j
- Assigns the topic with max{c(Tj, Di) * c(wq, Tj)} to word wq
- Repeats for all words in each pass (a minimal sketch follows this list)
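The following is a minimal sketch of the count-update pass described above (the function names and array layout are my assumptions). Note that standard collapsed Gibbs sampling for LDA samples the new topic proportionally to the smoothed count product rather than taking the hard argmax used in this description:

```python
import numpy as np

def init_counts(docs, n_topics, vocab_size, rng):
    """Randomly assign a topic to each word and build the count tables."""
    topic_of = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    c_dt = np.zeros((len(docs), n_topics), dtype=int)   # c(Tj, Di)
    c_wt = np.zeros((vocab_size, n_topics), dtype=int)  # c(wq, Tj)
    for i, doc in enumerate(docs):
        for q, w in enumerate(doc):
            c_dt[i, topic_of[i][q]] += 1
            c_wt[w, topic_of[i][q]] += 1
    return topic_of, c_dt, c_wt

def lda_pass(docs, topic_of, c_dt, c_wt):
    """One pass over all words, following the steps listed above."""
    for i, doc in enumerate(docs):
        for q, w in enumerate(doc):
            t = topic_of[i][q]
            # Remove the word's current topic and update both counts.
            c_dt[i, t] -= 1
            c_wt[w, t] -= 1
            # Multiply c(Tj, Di) and c(wq, Tj) for each topic j and
            # assign the topic with the maximal product to word wq.
            t = int(np.argmax(c_dt[i, :] * c_wt[w, :]))
            topic_of[i][q] = t
            c_dt[i, t] += 1
            c_wt[w, t] += 1
```

In practice one would call init_counts once and then repeat lda_pass until the topic assignments stop changing.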
Three datasets are used:
- The Microsoft Research Paraphrase dataset (MSRP):
- sentence pairs from news websites
- Label 1: same meaning, 0: otherwise
- The SemEval CQA dataset:
- an initial post as well as 10 possibly relevant posts
- a ranking task
- The Quora duplicate questions dataset:
- pairs of questions
- Label 1: paraphrases, 0: otherwise
The BERT part has:
- Input: the two sentences, tokenized and combined into one sequence
- Output: only the [CLS] part of the encoding (the C vector), as sketched below
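A minimal sketch of this step with the Hugging Face transformers library (the checkpoint name and sentence pair are assumptions):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize and combine the two sentences into a single input sequence.
inputs = tokenizer("How old are you?", "What is your age?",
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep only the [CLS] position: the C vector (768-dim for BERT base).
c_vector = outputs.last_hidden_state[:, 0, :]
```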
The topic model part has:
- Input: tokenized sentence 1 and sentence 2
- Two ways of computing topics: document topics and word topics
- Uses the LDA and GSDMM topic modeling methods (a document-topic sketch follows)
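As a sketch of the document-topic case, the snippet below uses gensim's LdaModel (the toy corpus and num_topics value are illustrative assumptions, not the paper's setup):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical tokenized sentences standing in for the training corpus.
sent1 = ["how", "old", "are", "you"]
sent2 = ["what", "is", "your", "age"]
texts = [sent1, sent2]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)

# Document-topic vector for sentence 1: one probability per topic.
topics1 = lda.get_document_topics(dictionary.doc2bow(sent1),
                                  minimum_probability=0.0)
```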
The top layer concatenates the topic vectors with the C vector and passes the result through a two-layer MLP and a softmax, as sketched below.
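A minimal sketch of this top layer in PyTorch (the layer sizes and class name are assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TopLayer(nn.Module):
    """Concatenate the C vector with the two topic vectors, then apply
    a two-layer MLP followed by a softmax over the class labels."""
    def __init__(self, cls_dim=768, topic_dim=80, hidden_dim=256, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cls_dim + 2 * topic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, c_vector, topics1, topics2):
        x = torch.cat([c_vector, topics1, topics2], dim=-1)
        return torch.softmax(self.mlp(x), dim=-1)
```

During training one would usually feed the pre-softmax logits to a cross-entropy loss; the explicit softmax is kept here to mirror the description above.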
The following figure compares our results with those of the paper.