Here we will derive vectors based on equations from statistics topics. You should have already installed the required Python modules from requirements.txt.
I first started with these steps in separate files, but I found it easier to provide one cohesive analysis file, now statisticsAnalysis.py. The goal of this step is to build a word2vec model with character vectors derived from a huge corpus of equations.
Below is a summary of the files included. For more detail, please read the following section.
The following scripts were used to generate the data, and run the analysis:
The following are relevant to preprocessing:
- wikipedia_statistics_articles.txt is the final list of Wikipedia pages parsed. You would only be interested in this if you want to recreate the data and need to scrape Wikipedia.
The following are raw data extracted from the Wikipedia articles. The json/pkl combinations are the same data saved in different formats:
- wikipedia_statistics_articles.json contains complete metadata for each article, including summary text, links, categories, etc. This is where you can find text from each page to associate equations with.
- wikipedia_statistics_articles.pkl
- wikipedia_statistics_equations.json contains the raw equations
- wikipedia_statistics_equations.pkl
The following are akin to training data, equations (the "sentences") with labels:
- equation_statistics_sentences.txt contains the equations from the articles, one equation per line. We call them "sentences" because word2vec treats them as such.
- equation_statistics_labels.txt contains one label per equation sentence, naming the article (page) from which the equation came.
The following folders are relevant to word2vec, including vectors and models, along with my sample analysis notebook:
- vectors contains the exported vector data, in both tsv and pickle formats.
- models contains the word2vec models.
- tsne_wikipedia_statistics_analysis.ipynb is the sample analysis notebook.
You are free to use this data as you please! The approach I took is the following:
- Create a word2vec model using some set of equations from Wikipedia
- Use the character (LaTeX token) embeddings to map a set of topics (via their equations) to the model space
- Compare the similarity of the topics
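To make the mapping and comparison steps concrete, here is a minimal sketch assuming a trained gensim Word2Vec model; the model path and the averaging helper are illustrative, not functions from this repository:

```python
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("models/wikipedia_statistics.word2vec")  # path is an assumption

def topic_vector(equation_sentences, model):
    """Average the embeddings of all tokens in a topic's equation sentences."""
    tokens = [tok for eq in equation_sentences for tok in eq.split() if tok in model.wv]
    return np.mean([model.wv[tok] for tok in tokens], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# similarity = cosine(topic_vector(eqs_topic_a, model), topic_vector(eqs_topic_b, model))
```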
For a first shot, I did all of the above with just the statistics articles. While I think this builds a good model of the character relationships (over 66K equations present!), there isn't enough specificity across the kinds of math equations in the articles, so you get a clustering (t-SNE) that looks like this:
This likely means that the articles each have a varied set of equation types (which is logical), with a few articles representing a specific kind of math equation (the cleaner clusters around the outside). What that tells us is that while the model might be good as a base, we need to generate topic equation embeddings from a more meaningful set. That second portion of work is in the math directory. Here we will continue to describe generating the overall model.
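As a rough illustration, a t-SNE view like the one described could be produced as follows; the input file name and plotting choices are hypothetical, not the notebook's exact settings:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# topic_vectors is assumed to be an (n_topics, dim) array of embeddings
topic_vectors = np.load("vectors/topic_vectors.npy")  # hypothetical file name
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(topic_vectors)

plt.figure(figsize=(8, 8))
plt.scatter(embedding[:, 0], embedding[:, 1], s=5)
plt.title("t-SNE of equation-derived topic embeddings")
plt.savefig("tsne_wikipedia_statistics.png")
```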
We first need a crapton of Wikipedia pages to parse equations from. This means getting a list of links from the statistics topics page, and then (manually) disambiguating the terms when appropriate. These manual steps are now preserved in the code to be reproducible.
At the end of this step, we have the final list of articles (Wikipedia pages) we will use for the model, saved in wikipedia_statistics_articles.txt.
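A minimal sketch of the link-gathering step, assuming the wikipedia Python package; the seed page name is an assumption, and the manual disambiguation is not shown:

```python
import wikipedia  # assumed scraping library

# The seed page name here is an assumption; the actual seed page may differ.
seed = wikipedia.page("List of statistics articles", auto_suggest=False)
candidates = sorted(set(seed.links))

with open("wikipedia_statistics_articles.txt", "w") as fh:
    fh.write("\n".join(candidates))
```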
From our list of articles above, we now want to populate a lookup dictionary with metadata about each article. This step pulls the entire page from Wikipedia and saves fields such as links, images, url, etc. This is the starting data structure we need to keep in case we need to go back and redo any portion of the analysis.
At the end of this procedure we will have wikipedia_statistics_articles.json and a matching wikipedia_statistics_articles.pkl, each storing the same data structure.
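Roughly how the metadata lookup could be built and saved; the field list mirrors the description above, and the wikipedia package usage is an assumption about the implementation:

```python
import json
import pickle
import wikipedia  # assumed scraping library

articles = {}
with open("wikipedia_statistics_articles.txt") as fh:
    names = fh.read().splitlines()

for name in names:
    try:
        page = wikipedia.page(name, auto_suggest=False)
    except (wikipedia.DisambiguationError, wikipedia.PageError):
        continue
    articles[name] = {"url": page.url,
                      "summary": page.summary,
                      "links": page.links,
                      "images": page.images,
                      "categories": page.categories}

with open("wikipedia_statistics_articles.json", "w") as fh:
    json.dump(articles, fh)
with open("wikipedia_statistics_articles.pkl", "wb") as fh:
    pickle.dump(articles, fh)
```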
Within our articles, each equation is represented as an attribute of an image. Wikipedia does this so that browsers that don't support MathJax (or similar) can fall back to showing the image itself. We can take advantage of this by finding the images with a particular class and extracting the raw LaTeX from them. We thus:
- use BeautifulSoup to parse the raw html of each article (subpage)
- find equations in images based on their class
- save the equations, and the image URL, to an equations data structure organized by topic page
Here is an example of an entry in the list of equations:
{'png': 'https://wikimedia.org/api/rest_v1/media/math/render/svg/b7c3ba47cc5436c389f86a3f617a191d0dbe4877',
'tex': '2^{n\\mathrm {H} (k/n)}'},
At the end of this step, we have a data structure indexed by article name, where each index maps to the list of equations extracted from that article. We save it as both wikipedia_statistics_equations.json and wikipedia_statistics_equations.pkl.
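A minimal sketch of this extraction, assuming Wikipedia's usual markup where math images carry the class mwe-math-fallback-image-inline and the raw LaTeX in their alt attribute; the helper name is illustrative:

```python
import requests
from bs4 import BeautifulSoup

def extract_equations(url):
    """Return the {'png': ..., 'tex': ...} entries found on one article page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    equations = []
    # Math images carry the raw LaTeX in their alt attribute; the class name
    # below is an assumption about the current Wikipedia markup.
    for img in soup.find_all("img", class_="mwe-math-fallback-image-inline"):
        equations.append({"png": img.get("src"), "tex": img.get("alt")})
    return equations

# equations = {name: extract_equations(meta["url"]) for name, meta in articles.items()}
```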
The first step here was to extract "sentences" of the equations, meaning a text file of equation "sentences," where each sentence is a set of characters (or LaTeX symbols) delimited by white space. This was first done by calling the helpers function extract_tokens, but ultimately by the same function integrated into the class TrainEquations, now a part of wordfish.
Before we run TrainEquations, we have saved a single file with every extracted equation sentence, equation_statistics_sentences.txt, and one with the labels, equation_statistics_labels.txt.
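For illustration only (the real tokenization is done by extract_tokens / TrainEquations in wordfish, whose exact behavior may differ), turning a raw LaTeX string into such a "sentence" looks roughly like this:

```python
import re

def equation_to_sentence(tex):
    """Split LaTeX into single characters or whole LaTeX commands, joined by spaces."""
    tokens = re.findall(r"\\[A-Za-z]+|\S", tex)
    return " ".join(tokens)

print(equation_to_sentence("2^{n\\mathrm {H} (k/n)}"))
# 2 ^ { n \mathrm { H } ( k / n ) }
```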
With these sentences, I could then use the TrainEquations method from wordfish to break apart the equation sentences by single character or LaTeX symbol (e.g., \begin) and then build the Word2Vec model from it.
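A minimal sketch of that model-building step with gensim (4.x API); the hyperparameters and model path are illustrative, not the values used here:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of the sentences file is one whitespace-delimited equation "sentence"
sentences = LineSentence("equation_statistics_sentences.txt")
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)
model.save("models/wikipedia_statistics.word2vec")  # path is an assumption
```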
After this step we have:
- word2vec models that describe either the entire Wikipedia corpus or a subset of domains (pages)
- vectors that represent the embeddings for the tokens, or individual characters
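For reference, exporting the token embeddings to tsv and pickle (roughly what the vectors folder holds) could look like the sketch below; the output file names are assumptions:

```python
import pickle

# model is the trained gensim Word2Vec model from the previous step
with open("vectors/wikipedia_statistics_vectors.tsv", "w") as fh:
    for token in model.wv.index_to_key:
        fh.write(token + "\t" + "\t".join(str(x) for x in model.wv[token]) + "\n")

with open("vectors/wikipedia_statistics_vectors.pkl", "wb") as fh:
    pickle.dump({tok: model.wv[tok] for tok in model.wv.index_to_key}, fh)
```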