In this project we implement tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection. Semantic similarity detection has applications ranging from question answering to plagiarism detection. The idea is to add topic information to a pre-trained BERT model.
BERT uses only the encoder part of the Transformer and has the following features:
- Can be fine-tuned with just one additional output layer
- Applications such as question answering and natural language inference
- Adds [CLS] at the beginning of the input
- Adds [SEP] between sentences
- Position, segment, and token embeddings
- Two pre-training tasks:
- Masked language modeling: mask some percentage of the input tokens and predict them
- Next sentence prediction: predict whether the second sentence actually follows the first
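As an illustration of the input format, here is a minimal sketch using the Hugging Face transformers library (the sentence pair and the bert-base-uncased checkpoint are illustrative assumptions). It shows the [CLS] and [SEP] special tokens and the segment (token type) ids:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and combine a hypothetical sentence pair into one sequence.
encoding = tokenizer("How old are you?", "What is your age?")

# [CLS] is prepended; [SEP] separates and terminates the sentences.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'how', 'old', 'are', 'you', '?', '[SEP]',
#  'what', 'is', 'your', 'age', '?', '[SEP]']

# Segment embeddings come from the token type ids: 0 for sentence 1,
# 1 for sentence 2. Position embeddings are added inside the model.
print(encoding["token_type_ids"])
```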
Two methods of topic modeling are as follows:
- GSDMM
- Similar to LDA, but assumes only one topic per document; specifically aimed at detecting topics in short documents
- LDA
- A Bayesian unsupervised learning method
- Generates topics based on word frequencies
- Allows a mixture of topics in each document
- Starts by randomly assigning a topic to each word of each document
- Counts the frequency of each topic in a document, c(Tj, Di)
- Counts the frequency with which each word is assigned to a topic, c(wq, Tj)
- Removes a word's current topic assignment and updates both counts
- Multiplies c(Tj, Di) and c(wq, Tj) for each topic j
- Assigns the topic with max{c(Tj, Di) * c(wq, Tj)} to word wq
- Repeats for all words in each pass (a minimal sketch follows this list)
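The following is a minimal sketch of the count-update pass described above (the function names and array layout are my assumptions). Note that standard collapsed Gibbs sampling for LDA samples the new topic proportionally to the smoothed count product rather than taking the hard argmax used in this description:

```python
import numpy as np

def init_counts(docs, n_topics, vocab_size, rng):
    """Randomly assign a topic to each word and build the count tables."""
    topic_of = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    c_dt = np.zeros((len(docs), n_topics), dtype=int)   # c(Tj, Di)
    c_wt = np.zeros((vocab_size, n_topics), dtype=int)  # c(wq, Tj)
    for i, doc in enumerate(docs):
        for q, w in enumerate(doc):
            c_dt[i, topic_of[i][q]] += 1
            c_wt[w, topic_of[i][q]] += 1
    return topic_of, c_dt, c_wt

def lda_pass(docs, topic_of, c_dt, c_wt):
    """One pass over all words, following the steps listed above."""
    for i, doc in enumerate(docs):
        for q, w in enumerate(doc):
            t = topic_of[i][q]
            # Remove the word's current topic and update both counts.
            c_dt[i, t] -= 1
            c_wt[w, t] -= 1
            # Multiply c(Tj, Di) and c(wq, Tj) for each topic j and
            # assign the topic with the maximal product to word wq.
            t = int(np.argmax(c_dt[i, :] * c_wt[w, :]))
            topic_of[i][q] = t
            c_dt[i, t] += 1
            c_wt[w, t] += 1
```

In practice one would call init_counts once and then repeat lda_pass until the topic assignments stop changing.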
Three datasets are used:
- The Microsoft Research Paraphrase dataset (MSRP):
- sentence pairs from news websites
- Label 1: same meaning, 0: otherwise
- The SemEval CQA dataset:
- an initial post as well as 10 possibly relevant posts
- a ranking task
- The Quora duplicate questions dataset:
- pairs of questions
- Label 1: paraphrases, 0: otherwise
The BERT part has:
- Input: the two sentences, tokenized and combined into one sequence
- Output: only the [CLS] part of the encoding (the C vector), as sketched below
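A minimal sketch of this step with the Hugging Face transformers library (the checkpoint name and sentence pair are assumptions):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize and combine the two sentences into a single input sequence.
inputs = tokenizer("How old are you?", "What is your age?",
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Keep only the [CLS] position: the C vector (768-dim for BERT base).
c_vector = outputs.last_hidden_state[:, 0, :]
```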
The topic model part has:
- Input: tokenized sentence 1 and sentence 2
- Two ways of computing topics: document topics and word topics
- Uses the LDA and GSDMM topic modeling methods (a document-topic sketch follows)
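As a sketch of the document-topic case, the snippet below uses gensim's LdaModel (the toy corpus and num_topics value are illustrative assumptions, not the paper's setup):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical tokenized sentences standing in for the training corpus.
sent1 = ["how", "old", "are", "you"]
sent2 = ["what", "is", "your", "age"]
texts = [sent1, sent2]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)

# Document-topic vector for sentence 1: one probability per topic.
topics1 = lda.get_document_topics(dictionary.doc2bow(sent1),
                                  minimum_probability=0.0)
```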
The top layer concatenates the topic vectors with the C vector and passes the result through a two-layer MLP and a softmax, as sketched below.
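A minimal sketch of this top layer in PyTorch (the layer sizes and class name are assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class TopLayer(nn.Module):
    """Concatenate the C vector with the two topic vectors, then apply
    a two-layer MLP followed by a softmax over the class labels."""
    def __init__(self, cls_dim=768, topic_dim=80, hidden_dim=256, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cls_dim + 2 * topic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, c_vector, topics1, topics2):
        x = torch.cat([c_vector, topics1, topics2], dim=-1)
        return torch.softmax(self.mlp(x), dim=-1)
```

During training one would usually feed the pre-softmax logits to a cross-entropy loss; the explicit softmax is kept here to mirror the description above.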
The following figure compares our results with those of the paper.