Commit 37abb5f: adding math and original files

vsoch committed Jan 12, 2019 · 1 parent 9e4735c
Showing 46 changed files with 10,294 additions and 2,769 deletions.
28 changes: 22 additions & 6 deletions README.md
@@ -1,17 +1,33 @@
# Equation Mapping

This is a dataset that uses word2vec to derive embeddings describing equations
extracted from statistics and math articles on Wikipedia. We do this for groups of links
that generally fall into two categories:

- [statistics](statistics)
- [mathematics](math)

For the first, we use a list of statistics articles. For the second, we make a
best effort to parse pages of math topics. You are free to use the vectors
for your own analysis and efforts! Here are some interesting questions:

1. Can you build a model to predict an equation from one or more terms?
2. Can you predict terms from equations?

Please reference the README.md in each folder for further details.

## 1. Install Requirements

If you intend to recreate the data, you first need to install the requirements
for both, including a few libraries I created as a graduate student:
[wordfish](https://vsoch.github.io/2016/2016-wordfish/) and
[repofish](https://pypi.org/project/repofish/).
wordfish is a small library that uses gensim to run word2vec, and repofish uses it
to parse various internet resources for words.

```bash
pip install -r requirements.txt
```

Then continue with the instructions in the subfolder of your choice. The steps are generally the same,
but the second (math) pipeline was developed after statistics.
107 changes: 107 additions & 0 deletions math/README.md
@@ -0,0 +1,107 @@
# Equation Mapping: Math

Here we will derive vectors based on equations from the math topics listed below. While
I ultimately just want to map these equations to the [statistics](../statistics)
space, I will also generate the same character embeddings here to be consistent.

# Step 1: Get Articles from Wikipedia

I am using the following list of (domain, topic) pairs from Wikipedia:

```python
# Organized by (AbstractTopic, PageTopic)
pages = [
    ( "Algebra", "Linear algebra"),
    ( "Algebra", "Multilinear_algebra"),
    ( "Algebra", "Abstract algebra"),
    ( "Algebra", "Elementary_algebra"),
    ( "Arithmetic", "Number theory"),
    ( "Calculus", "Mathematical analysis"),
    ( "Calculus", "Differential equations"),
    ( "Calculus", "Dynamical systems theory"),
    ( "Calculus", "Numerical analysis"),
    ( "Calculus", "Mathematical optimization"),
    ( "Calculus", "Functional analysis"),
    ( "Geometry", "Discrete geometry"),
    ( "Geometry", "Algebraic geometry"),
    ( "Geometry", "Analytic geometry"),
    ( "Geometry", "Differential geometry"),
    ( "Geometry", "Finite geometry"),
    ( "Geometry", "Topology"),
    ( "Geometry", "Trigonometry"),
    ( "Foundations of Mathematics", "Philosophy of mathematics"),
    ( "Foundations of Mathematics", "Mathematical logic"),
    ( "Foundations of Mathematics", "Set theory"),
    ( "Foundations of Mathematics", "Category theory"),
    ( "Applied Mathematics", "Mathematical physics"),
    ( "Applied Mathematics", "Probability theory"),
    ( "Applied Mathematics", "Mathematical statistics"),
    ( "Applied Mathematics", "Statistics"),
    ( "Applied Mathematics", "Game theory"),
    ( "Applied Mathematics", "Information theory"),
    ( "Applied Mathematics", "Computer science"),
    ( "Applied Mathematics", "Theory of computation"),
    ( "Applied Mathematics", "Control theory"),
    ( "Others", "Order theory"),
    ( "Others", "Graph theory")]
```
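Each pair is unpacked as (domain, topic) by the extraction loop; here is a minimal
sketch of the iteration (the full script is [mathDomainAnalysis.py](mathDomainAnalysis.py)):

```python
from wikipedia import WikipediaPage

for domain, topic in pages:
    page = WikipediaPage(topic)  # fetch the live article for this topic
    print("Matching %s to %s" % (topic, page.title))
```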

I found these categories and groupings [here](https://en.wikipedia.org/wiki/Areas_of_mathematics#External_links).
The general idea is that the first token is an abstract topic (e.g., Linear algebra and Abstract algebra are both
kinds of Algebra) and the second is the page name to parse. For this parsing I will also try
to minimize dependencies (e.g., removing repofish).

Before starting here you should have already installed the required Python
modules in [requirements.txt](../requirements.txt). Since I've already done this for
[statistics](../statistics) articles, I can clean up the code a bit and make a more succinct
pipeline. The entire set of steps is in [mathDomainAnalysis.py](mathDomainAnalysis.py).

After this step, we have generated [wikipedia_math_articles.json](wikipedia_math_articles.json),
a dictionary indexed by topic (the `method` in the script) that holds the metadata for each page above.
The biggest difference between this extraction and the statistics one is that I didn't
save any intermediate files.
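For reference, this is the shape of one entry; the fields mirror what the extraction
script below collects, with values abbreviated here:

```python
{'categories': [...],                  # wikipedia category labels
 'title': 'Trigonometry',              # resolved page title
 'method': 'Trigonometry',             # the topic we queried
 'url': 'https://en.wikipedia.org/wiki/Trigonometry',
 'summary': '...',
 'images': [...],                      # image urls, including equation pngs
 'links': [...],                       # can be used to calculate relatedness
 'references': [...]}
```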


# Step 2: Extract Equations

The above step gave us a list of math articles, and what we want to do now is:
- retrieve each page as a WikipediaPage
- parse the html with BeautifulSoup
- scan the images on the page and keep the ones that are equations
- save the list of equations, along with the domain and topic of the page
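The core of the filter is small: equation images on Wikipedia carry `tex` or
`math` in their class attribute, the alt text holds the tex source, and the src
points to the rendered png. A condensed sketch (the full loop is in
[mathDomainAnalysis.py](mathDomainAnalysis.py)):

```python
import re
from bs4 import BeautifulSoup
from wikipedia import WikipediaPage

soup = BeautifulSoup(WikipediaPage("Trigonometry").html(), "lxml")
equation_list = []
for image in soup.findAll("img"):
    classes = image.get("class")
    if classes and any(re.search("tex|math", c) for c in classes):
        equation_list.append({"png": image.get("src"), "tex": image.get("alt")})
```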

At the end of this step, we have a dictionary of equations, indexed by method
name, where each entry includes the tex, png, domain, and topic:

```python
{'domain': 'Geometry',
'png': 'https://wikimedia.org/api/rest_v1/media/math/render/svg/a1da4e06eb6f25cd7f7fc1a7784a11a82ae53f9f',
'tex': '\\frac{a-b}{a+b}=\\frac{\\tan\\left[\\tfrac{1}{2}(A-B)\\right]}{\\tan\\left[\\tfrac{1}{2}(A+B)\\right]}',
'topic': 'Trigonometry'}
```

And we save this dictionary to both [wikipedia_math_equations.json](wikipedia_math_equations.json)
and [wikipedia_math_equations.pkl](wikipedia_math_equations.pkl).


# Step 3: Word2Vec Model

At this point, we would actually want to give the equations to the statistics model,
and then generate embeddings for our math equations based on the statistics
character embeddings. I haven't done that yet; instead, I've
created an example showing how to do this for the math equations.
With our equations loaded, we *could* now use the wordfish `TrainEquations` class to
take in the list of equations and generate a model. That comes down to this:

```python
# NOTE: the TrainEquations import path is assumed; check your wordfish install
from wordfish.analysis import TrainEquations
from gensim.models import Word2Vec

sentences = TrainEquations(text_list=equations_list,
                           remove_stop_words=False,
                           remove_non_english_chars=False)

model = Word2Vec(sentences, size=300, workers=8, min_count=1)
```
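For completeness, here is one way to build `equations_list` from the saved
dictionary; flattening to plain tex strings is my assumption about what
`TrainEquations` expects as `text_list`:

```python
import pickle

with open("wikipedia_math_equations.pkl", "rb") as handle:
    equations = pickle.load(handle)

# flatten {topic: [entries]} into a list of tex strings
equations_list = [entry["tex"] for topic in equations for entry in equations[topic]]
```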

We don't want to mess with removing non-English characters or stop words (which
uses nltk to filter, etc.) because we aren't working with a standard English corpus!
At the end of this step, we have a word2vec model, and we can save it under
[models](models).
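Saving and reloading uses gensim's standard persistence; the filename below
matches the binary added under [models](models) in this commit:

```python
from gensim.models import Word2Vec

model.save("models/wikipdeia_math_equations.word2vec")   # filename as committed
model = Word2Vec.load("models/wikipdeia_math_equations.word2vec")
```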
113 changes: 113 additions & 0 deletions math/mathDomainAnalysis.py
@@ -0,0 +1,113 @@
#!/usr/bin/env python

from bs4 import BeautifulSoup
from wordfish.utils import ( get_attribute, save_pretty_json )
from wikipedia import WikipediaPage
import pickle
import json
import re
import os

# Now let's derive equations from core mathematics ideas (not statistics articles)
# Organized by (AbstractTopic, PageTopic)

results = dict()

pages = [
    ( "Algebra", "Linear algebra"),
    ( "Algebra", "Multilinear_algebra"),
    ( "Algebra", "Abstract algebra"),
    ( "Algebra", "Elementary_algebra"),
    ( "Arithmetic", "Number theory"),
    ( "Calculus", "Mathematical analysis"),
    ( "Calculus", "Differential equations"),
    ( "Calculus", "Dynamical systems theory"),
    ( "Calculus", "Numerical analysis"),
    ( "Calculus", "Mathematical optimization"),
    ( "Calculus", "Functional analysis"),
    ( "Geometry", "Discrete geometry"),
    ( "Geometry", "Algebraic geometry"),
    ( "Geometry", "Analytic geometry"),
    ( "Geometry", "Differential geometry"),
    ( "Geometry", "Finite geometry"),
    ( "Geometry", "Topology"),
    ( "Geometry", "Trigonometry"),
    ( "Foundations of Mathematics", "Philosophy of mathematics"),
    ( "Foundations of Mathematics", "Mathematical logic"),
    ( "Foundations of Mathematics", "Set theory"),
    ( "Foundations of Mathematics", "Category theory"),
    ( "Applied Mathematics", "Mathematical physics"),
    ( "Applied Mathematics", "Probability theory"),
    ( "Applied Mathematics", "Mathematical statistics"),
    ( "Applied Mathematics", "Statistics"),
    ( "Applied Mathematics", "Game theory"),
    ( "Applied Mathematics", "Information theory"),
    ( "Applied Mathematics", "Computer science"),
    ( "Applied Mathematics", "Theory of computation"),
    ( "Applied Mathematics", "Control theory"),
    ( "Others", "Order theory"),
    ( "Others", "Graph theory")]


# Step 1. Get pages (and raw equations) from wikipedia

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in results:

        result = WikipediaPage(method)

        # Show a visual check!
        print("Matching %s to %s" % (method, result.title))
        entry = {'categories': result.categories,
                 'title': result.title,
                 'method': method,
                 'url': result.url,
                 'summary': result.summary,
                 'images': result.images}

        # We can use links to calculate relatedness
        entry['links'] = get_attribute(result, 'links')
        entry['references'] = get_attribute(result, 'references')

        results[method] = entry


save_pretty_json(results, "wikipedia_math_articles.json")

## STEP 2: EQUATIONS ###########################################################

equations = dict()

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in equations:
        print("Extracting equations from %s" % (method))
        result = WikipediaPage(method)
        html = result.html()
        soup = BeautifulSoup(html, 'lxml')

        equation_list = []

        # Equations are represented as images, they map to annotations
        images = soup.findAll('img')
        for image in images:
            image_class = image.get("class")
            if image_class is not None:
                if any(re.search("tex|math", x) for x in image_class):
                    png = image.get("src")
                    tex = image.get("alt")
                    entry = {"png": png,
                             "tex": tex,
                             "domain": domain,  # inefficient to store many times,
                             "topic": method}   # but more conservative
                    equation_list.append(entry)

        if len(equation_list) > 0:
            equations[method] = equation_list
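
# (Sketch, not in the script as committed: the README says the equations are
# saved to both json and pkl, so persist them here using the helpers imported above.)
save_pretty_json(equations, "wikipedia_math_equations.json")
with open("wikipedia_math_equations.pkl", "wb") as handle:
    pickle.dump(equations, handle)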

# The next step is to load these equations, and map them to the space
# of characters generated from the statistics model. See the "analysis"
# subfolder for these next steps.
Binary file added math/models/wikipdeia_math_equations.word2vec
