Showing 46 changed files with 10,294 additions and 2,769 deletions.
# Equation Mapping

This is a dataset that uses word2vec to derive embeddings for equations
extracted from statistics and mathematics articles on Wikipedia. We do this for
groups of links that generally fall into two categories:

- [statistics](statistics)
- [mathematics](math)

For the first, we use a list of statistics articles. For the second, we make a
best effort to parse pages of math topics. You are free to use the vectors
for your own analysis and efforts! Here are some interesting questions:

1. Can you build a model to predict an equation from one or more terms?
2. Can you predict terms from equations?

Please reference the README.md in each folder for further details.

## 1. Install Requirements

If you intend to recreate the data, you first need to install the
requirements, including a few libraries I created as a graduate
student: [wordfish](https://vsoch.github.io/2016/2016-wordfish/) and
[repofish](https://pypi.org/project/repofish/).
wordfish is a small library that uses gensim to run word2vec, and repofish uses it
to parse various internet resources for words.

```bash
pip install -r requirements.txt
```

Then continue with the instructions in the subfolder of your choice. The steps are generally the same,
but the second (math) was developed after statistics.
# Equation Mapping: Math

Here we will derive vectors based on equations from a set of math topics. While
I ultimately just want to map these equations to the [statistics](../statistics)
space, I will also generate the same character embeddings here to be consistent.

# Step 1. Get equations from Wikipedia

I am using the following list of (domain, topic) pairs from Wikipedia:

```python
# Organized by (AbstractTopic, PageTopic)
pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Multilinear_algebra"),
    ("Algebra", "Abstract algebra"),
    ("Algebra", "Elementary_algebra"),
    ("Arithmetic", "Number theory"),
    ("Calculus", "Mathematical analysis"),
    ("Calculus", "Differential equations"),
    ("Calculus", "Dynamical systems theory"),
    ("Calculus", "Numerical analysis"),
    ("Calculus", "Mathematical optimization"),
    ("Calculus", "Functional analysis"),
    ("Geometry", "Discrete geometry"),
    ("Geometry", "Algebraic geometry"),
    ("Geometry", "Analytic geometry"),
    ("Geometry", "Differential geometry"),
    ("Geometry", "Finite geometry"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
    ("Foundations of Mathematics", "Philosophy of mathematics"),
    ("Foundations of Mathematics", "Mathematical logic"),
    ("Foundations of Mathematics", "Set theory"),
    ("Foundations of Mathematics", "Category theory"),
    ("Applied Mathematics", "Mathematical physics"),
    ("Applied Mathematics", "Probability theory"),
    ("Applied Mathematics", "Mathematical statistics"),
    ("Applied Mathematics", "Statistics"),
    ("Applied Mathematics", "Game theory"),
    ("Applied Mathematics", "Information theory"),
    ("Applied Mathematics", "Computer science"),
    ("Applied Mathematics", "Theory of computation"),
    ("Applied Mathematics", "Control theory"),
    ("Others", "Order theory"),
    ("Others", "Graph theory")]
```

I found these categories and groupings [here](https://en.wikipedia.org/wiki/Areas_of_mathematics#External_links).
The general idea is that the first token is an abstract topic (e.g., linear algebra and abstract algebra are both
kinds of algebra) and the second is the page name to parse. For this parsing I will also try
to minimize dependencies (e.g., removing repofish).
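The grouping logic implied by these pairs is simple enough to sanity-check in a few lines. A minimal sketch, using a handful of the pairs above:

```python
from collections import defaultdict

# A few of the (AbstractTopic, PageTopic) pairs from the list above
pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Abstract algebra"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
]

# Group page topics under their abstract domain
by_domain = defaultdict(list)
for domain, topic in pages:
    by_domain[domain].append(topic)

print(dict(by_domain))
```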
Before starting, you should have already installed the required Python
modules in [requirements.txt](../requirements.txt). Since I've already done this for the
[statistics](../statistics) articles, I can also clean up the code a bit and build a more succinct
pipeline. The entire set of steps is in [mathDomainAnalysis.py](mathDomainAnalysis.py).

After this step, we have generated [wikipedia_math_articles.json](wikipedia_math_articles.json),
a dictionary keyed by method (page) name, holding the metadata for each page above.
The biggest difference between this extraction and the statistics one is that I didn't
save any intermediate files.
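The structure described above can be sketched with a toy entry standing in for the real file (field names follow the extraction script; the values here are placeholders):

```python
import json

# A toy entry mirroring the structure of wikipedia_math_articles.json:
# keys are the page (method) names, values hold the page metadata.
articles = {
    "Topology": {
        "title": "Topology",
        "method": "Topology",
        "url": "https://en.wikipedia.org/wiki/Topology",
        "summary": "placeholder summary",
        "categories": [],
        "images": [],
    }
}

# Round-trip through JSON, as the pipeline does when saving and loading
blob = json.dumps(articles)
loaded = json.loads(blob)
print(loaded["Topology"]["url"])
```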
# Step 2. Extract Equations

The step above gave us a list of math articles, and what we want to do now is:

- retrieve each page as a WikipediaPage
- parse the html with BeautifulSoup
- look through the images on the page (some of which are equations) and find the equations
- save the list of equations, along with the domain and topic of the page
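The image-filtering step in the list above can be sketched without hitting Wikipedia. The class-matching rule (any class containing "tex" or "math") follows the extraction script in this commit; the image triples here are toy stand-ins for BeautifulSoup `img` tags:

```python
import re

# Toy image tags as (class-list, src, alt) triples; on a real page these
# come from soup.findAll('img')
images = [
    (["mwe-math-fallback-image-inline", "tex"], "eq1.png", r"\frac{a}{b}"),
    (["thumbimage"], "photo.jpg", "A photo"),
]

equations = []
for classes, src, alt in images:
    # Equation images carry a CSS class matching "tex" or "math"
    if any(re.search("tex|math", c) for c in classes):
        equations.append({"png": src, "tex": alt})

print(len(equations))
```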
At the end of this step, we have a dictionary of equations, indexed by method
name, where each entry includes the tex, png, domain, and topic:

```python
{'domain': 'Geometry',
 'png': 'https://wikimedia.org/api/rest_v1/media/math/render/svg/a1da4e06eb6f25cd7f7fc1a7784a11a82ae53f9f',
 'tex': '\\frac{a-b}{a+b}=\\frac{\\tan\\left[\\tfrac{1}{2}(A-B)\\right]}{\\tan\\left[\\tfrac{1}{2}(A+B)\\right]}',
 'topic': 'Trigonometry'}
```

We save this dictionary to both [wikipedia_math_equations.json](wikipedia_math_equations.json)
and [wikipedia_math_equations.pkl](wikipedia_math_equations.pkl).
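Once loaded, the saved dictionary is easy to summarize, for example counting equations per topic. A minimal sketch with toy entries shaped like the real file:

```python
from collections import Counter

# Toy entries mirroring the structure of wikipedia_math_equations.json:
# topic name -> list of equation entries
equations = {
    "Trigonometry": [
        {"domain": "Geometry", "topic": "Trigonometry",
         "png": "eq1.png", "tex": r"\sin^2 A + \cos^2 A = 1"},
        {"domain": "Geometry", "topic": "Trigonometry",
         "png": "eq2.png", "tex": r"\tan A = \frac{\sin A}{\cos A}"},
    ]
}

# Count how many equations were extracted per topic
counts = Counter({topic: len(entries) for topic, entries in equations.items()})
print(counts["Trigonometry"])
```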
# Step 3. Word2Vec Model

At this point, we would ultimately want to give the equations to the statistics model
and generate embeddings for our math equations based on the statistics
character embeddings. I haven't done that yet; instead, I've
created an example showing how to do this for the math equations alone.
With our equations loaded, we *could* use the wordfish `TrainEquations` class to
take in the list of equations and generate a model. That comes down to this:

```python
sentences = TrainEquations(text_list=equations_list,
                           remove_stop_words=False,
                           remove_non_english_chars=False)

model = Word2Vec(sentences, size=300, workers=8, min_count=1)
```

We don't want to remove non-English characters or stop words (which
uses nltk to filter) because we aren't working with a standard English corpus!
At the end of this step, we have a word2vec model, and we can save it under
[models](models).
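The "character embeddings" above treat each equation as a sequence of characters rather than English words. A minimal sketch of that idea; the exact tokenization here is my assumption, not necessarily wordfish's implementation:

```python
# Tokenize a TeX equation into single characters, the unit used for
# character embeddings (assumed tokenization, not wordfish's exact code)
tex = r"\frac{a-b}{a+b}"

# Each non-whitespace character becomes a token in the training sentence
tokens = [c for c in tex if not c.isspace()]
print(tokens[:6])
```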
```python
#!/usr/bin/env python

from bs4 import BeautifulSoup
from wordfish.utils import (get_attribute, save_pretty_json)
from wikipedia import WikipediaPage
import pickle
import json
import re
import os

# Now let's derive equations from core mathematics ideas (not statistics
# articles), organized by (AbstractTopic, PageTopic)

results = dict()

pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Multilinear_algebra"),
    ("Algebra", "Abstract algebra"),
    ("Algebra", "Elementary_algebra"),
    ("Arithmetic", "Number theory"),
    ("Calculus", "Mathematical analysis"),
    ("Calculus", "Differential equations"),
    ("Calculus", "Dynamical systems theory"),
    ("Calculus", "Numerical analysis"),
    ("Calculus", "Mathematical optimization"),
    ("Calculus", "Functional analysis"),
    ("Geometry", "Discrete geometry"),
    ("Geometry", "Algebraic geometry"),
    ("Geometry", "Analytic geometry"),
    ("Geometry", "Differential geometry"),
    ("Geometry", "Finite geometry"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
    ("Foundations of Mathematics", "Philosophy of mathematics"),
    ("Foundations of Mathematics", "Mathematical logic"),
    ("Foundations of Mathematics", "Set theory"),
    ("Foundations of Mathematics", "Category theory"),
    ("Applied Mathematics", "Mathematical physics"),
    ("Applied Mathematics", "Probability theory"),
    ("Applied Mathematics", "Mathematical statistics"),
    ("Applied Mathematics", "Statistics"),
    ("Applied Mathematics", "Game theory"),
    ("Applied Mathematics", "Information theory"),
    ("Applied Mathematics", "Computer science"),
    ("Applied Mathematics", "Theory of computation"),
    ("Applied Mathematics", "Control theory"),
    ("Others", "Order theory"),
    ("Others", "Graph theory")]


# Step 1. Get pages (and raw equations) from wikipedia

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in results:

        result = WikipediaPage(method)

        # Show a visual check!
        print("Matching %s to %s" % (method, result.title))
        entry = {'categories': result.categories,
                 'title': result.title,
                 'method': method,
                 'url': result.url,
                 'summary': result.summary,
                 'images': result.images}

        # We can use links to calculate relatedness
        entry['links'] = get_attribute(result, 'links')
        entry['references'] = get_attribute(result, 'references')

        results[method] = entry


save_pretty_json(results, "wikipedia_math_articles.json")

## STEP 2: EQUATIONS ###########################################################

equations = dict()

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in equations:
        print("Extracting equations from %s" % (method))
        result = WikipediaPage(method)
        html = result.html()
        soup = BeautifulSoup(html, 'lxml')

        equation_list = []

        # Equations are represented as images; they map to annotations
        images = soup.findAll('img')
        for image in images:
            image_class = image.get("class")
            if image_class is not None:
                if any(re.search("tex|math", x) for x in image_class):
                    png = image.get("src")
                    tex = image.get("alt")
                    entry = {"png": png,
                             "tex": tex,
                             "domain": domain,  # inefficient to store many times,
                             "topic": method}   # but more conservative
                    equation_list.append(entry)

        if len(equation_list) > 0:
            equations[method] = equation_list

# The next step is to load these equations, and map them to the space
# of characters generated from the statistics model. See the "analysis"
# subfolder for these next steps.
```
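The README states that the equations dictionary is saved to both JSON and pickle. A minimal sketch of that save step, using a toy dictionary and the file names given in the README:

```python
import json
import pickle

# Toy equations dictionary, shaped like the one built above
equations = {"Trigonometry": [{"png": "eq.png", "tex": r"\frac{a}{b}",
                               "domain": "Geometry", "topic": "Trigonometry"}]}

# Save to JSON (human readable) and pickle (exact Python objects)
with open("wikipedia_math_equations.json", "w") as fh:
    json.dump(equations, fh, indent=4)
with open("wikipedia_math_equations.pkl", "wb") as fh:
    pickle.dump(equations, fh)

# Reload to verify the round trip
with open("wikipedia_math_equations.pkl", "rb") as fh:
    reloaded = pickle.load(fh)
print(reloaded == equations)
```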