Showing 46 changed files with 10,294 additions and 2,769 deletions.
# Equation Mapping

This is a dataset that uses word2vec to derive embeddings for equations
extracted from statistics and mathematics articles on Wikipedia. We do this for
groups of links that generally fall into two categories:

- [statistics](statistics)
- [mathematics](math)

For the first, we use a list of statistics articles. For the second, we make a
best effort to parse pages of math topics. You are free to use the vectors
for your own analysis and efforts! Here are some interesting questions:

1. Can you build a model to predict an equation from one or more terms?
2. Can you predict terms from equations?

Please reference the README.md in each folder for further details.

## 1. Install Requirements

If you intend to recreate the data, you first need to install the
requirements, including a few libraries I created as a graduate
student: [wordfish](https://vsoch.github.io/2016/2016-wordfish/) and
[repofish](https://pypi.org/project/repofish/).
wordfish is a small library that uses gensim to run word2vec, and repofish uses it
to parse various internet resources for words.

```bash
pip install -r requirements.txt
```

Then continue with the instructions in the subfolder of your choice. The steps are generally the same,
but the second (math) was developed after statistics.
# Equation Mapping: Math

Here we will derive vectors based on equations from a set of math topics. While
I ultimately just want to map these equations to the [statistics](../statistics)
space, I will also generate the same character embeddings here to be consistent.

# Step 1. Get equations from Wikipedia

I am using the following list of (domain, topic) pairs from Wikipedia:

```python
# Organized by (AbstractTopic, PageTopic)
pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Multilinear_algebra"),
    ("Algebra", "Abstract algebra"),
    ("Algebra", "Elementary_algebra"),
    ("Arithmetic", "Number theory"),
    ("Calculus", "Mathematical analysis"),
    ("Calculus", "Differential equations"),
    ("Calculus", "Dynamical systems theory"),
    ("Calculus", "Numerical analysis"),
    ("Calculus", "Mathematical optimization"),
    ("Calculus", "Functional analysis"),
    ("Geometry", "Discrete geometry"),
    ("Geometry", "Algebraic geometry"),
    ("Geometry", "Analytic geometry"),
    ("Geometry", "Differential geometry"),
    ("Geometry", "Finite geometry"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
    ("Foundations of Mathematics", "Philosophy of mathematics"),
    ("Foundations of Mathematics", "Mathematical logic"),
    ("Foundations of Mathematics", "Set theory"),
    ("Foundations of Mathematics", "Category theory"),
    ("Applied Mathematics", "Mathematical physics"),
    ("Applied Mathematics", "Probability theory"),
    ("Applied Mathematics", "Mathematical statistics"),
    ("Applied Mathematics", "Statistics"),
    ("Applied Mathematics", "Game theory"),
    ("Applied Mathematics", "Information theory"),
    ("Applied Mathematics", "Computer science"),
    ("Applied Mathematics", "Theory of computation"),
    ("Applied Mathematics", "Control theory"),
    ("Others", "Order theory"),
    ("Others", "Graph theory")]
```

I found these categories and groupings [here](https://en.wikipedia.org/wiki/Areas_of_mathematics#External_links).
The general idea is that the first token is an abstract topic (e.g., linear algebra and abstract algebra are both
kinds of algebra) and the second is the page name to parse. For this parsing I will also try
to minimize dependencies (e.g., removing repofish).
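The grouping logic implied by these pairs is simple enough to sanity-check in a few lines. A minimal sketch, using a handful of the pairs above:

```python
from collections import defaultdict

# A few of the (AbstractTopic, PageTopic) pairs from the list above
pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Abstract algebra"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
]

# Group page topics under their abstract domain
by_domain = defaultdict(list)
for domain, topic in pages:
    by_domain[domain].append(topic)

print(dict(by_domain))
```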
Before starting, you should have already installed the required Python
modules in [requirements.txt](../requirements.txt). Since I've already done this for the
[statistics](../statistics) articles, I can also clean up the code a bit and build a more succinct
pipeline. The entire set of steps is in [mathDomainAnalysis.py](mathDomainAnalysis.py).

After this step, we have generated [wikipedia_math_articles.json](wikipedia_math_articles.json),
a dictionary keyed by method (page) name, holding the metadata for each page above.
The biggest difference between this extraction and the statistics one is that I didn't
save any intermediate files.
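The structure described above can be sketched with a toy entry standing in for the real file (field names follow the extraction script; the values here are placeholders):

```python
import json

# A toy entry mirroring the structure of wikipedia_math_articles.json:
# keys are the page (method) names, values hold the page metadata.
articles = {
    "Topology": {
        "title": "Topology",
        "method": "Topology",
        "url": "https://en.wikipedia.org/wiki/Topology",
        "summary": "placeholder summary",
        "categories": [],
        "images": [],
    }
}

# Round-trip through JSON, as the pipeline does when saving and loading
blob = json.dumps(articles)
loaded = json.loads(blob)
print(loaded["Topology"]["url"])
```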
# Step 2. Extract Equations

The step above gave us a list of math articles, and what we want to do now is:

- retrieve each page as a WikipediaPage
- parse the html with BeautifulSoup
- look through the images on the page (some of which are equations) and find the equations
- save the list of equations, along with the domain and topic of the page
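The image-filtering step in the list above can be sketched without hitting Wikipedia. The class-matching rule (any class containing "tex" or "math") follows the extraction script in this commit; the image triples here are toy stand-ins for BeautifulSoup `img` tags:

```python
import re

# Toy image tags as (class-list, src, alt) triples; on a real page these
# come from soup.findAll('img')
images = [
    (["mwe-math-fallback-image-inline", "tex"], "eq1.png", r"\frac{a}{b}"),
    (["thumbimage"], "photo.jpg", "A photo"),
]

equations = []
for classes, src, alt in images:
    # Equation images carry a CSS class matching "tex" or "math"
    if any(re.search("tex|math", c) for c in classes):
        equations.append({"png": src, "tex": alt})

print(len(equations))
```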
At the end of this step, we have a dictionary of equations, indexed by method
name, where each entry includes the tex, png, domain, and topic:

```python
{'domain': 'Geometry',
 'png': 'https://wikimedia.org/api/rest_v1/media/math/render/svg/a1da4e06eb6f25cd7f7fc1a7784a11a82ae53f9f',
 'tex': '\\frac{a-b}{a+b}=\\frac{\\tan\\left[\\tfrac{1}{2}(A-B)\\right]}{\\tan\\left[\\tfrac{1}{2}(A+B)\\right]}',
 'topic': 'Trigonometry'}
```

We save this dictionary to both [wikipedia_math_equations.json](wikipedia_math_equations.json)
and [wikipedia_math_equations.pkl](wikipedia_math_equations.pkl).
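Once loaded, the saved dictionary is easy to summarize, for example counting equations per topic. A minimal sketch with toy entries shaped like the real file:

```python
from collections import Counter

# Toy entries mirroring the structure of wikipedia_math_equations.json:
# topic name -> list of equation entries
equations = {
    "Trigonometry": [
        {"domain": "Geometry", "topic": "Trigonometry",
         "png": "eq1.png", "tex": r"\sin^2 A + \cos^2 A = 1"},
        {"domain": "Geometry", "topic": "Trigonometry",
         "png": "eq2.png", "tex": r"\tan A = \frac{\sin A}{\cos A}"},
    ]
}

# Count how many equations were extracted per topic
counts = Counter({topic: len(entries) for topic, entries in equations.items()})
print(counts["Trigonometry"])
```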
# Step 3. Word2Vec Model

At this point, we would ultimately want to give the equations to the statistics model
and generate embeddings for our math equations based on the statistics
character embeddings. I haven't done that yet; instead, I've
created an example showing how to do this for the math equations alone.
With our equations loaded, we *could* use the wordfish `TrainEquations` class to
take in the list of equations and generate a model. That comes down to this:

```python
sentences = TrainEquations(text_list=equations_list,
                           remove_stop_words=False,
                           remove_non_english_chars=False)

model = Word2Vec(sentences, size=300, workers=8, min_count=1)
```

We don't want to remove non-English characters or stop words (which
uses nltk to filter) because we aren't working with a standard English corpus!
At the end of this step, we have a word2vec model, and we can save it under
[models](models).
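The "character embeddings" above treat each equation as a sequence of characters rather than English words. A minimal sketch of that idea; the exact tokenization here is my assumption, not necessarily wordfish's implementation:

```python
# Tokenize a TeX equation into single characters, the unit used for
# character embeddings (assumed tokenization, not wordfish's exact code)
tex = r"\frac{a-b}{a+b}"

# Each non-whitespace character becomes a token in the training sentence
tokens = [c for c in tex if not c.isspace()]
print(tokens[:6])
```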
```python
#!/usr/bin/env python

from bs4 import BeautifulSoup
from wordfish.utils import (get_attribute, save_pretty_json)
from wikipedia import WikipediaPage
import pickle
import json
import re
import os

# Now let's derive equations from core mathematics ideas (not statistics
# articles), organized by (AbstractTopic, PageTopic)

results = dict()

pages = [
    ("Algebra", "Linear algebra"),
    ("Algebra", "Multilinear_algebra"),
    ("Algebra", "Abstract algebra"),
    ("Algebra", "Elementary_algebra"),
    ("Arithmetic", "Number theory"),
    ("Calculus", "Mathematical analysis"),
    ("Calculus", "Differential equations"),
    ("Calculus", "Dynamical systems theory"),
    ("Calculus", "Numerical analysis"),
    ("Calculus", "Mathematical optimization"),
    ("Calculus", "Functional analysis"),
    ("Geometry", "Discrete geometry"),
    ("Geometry", "Algebraic geometry"),
    ("Geometry", "Analytic geometry"),
    ("Geometry", "Differential geometry"),
    ("Geometry", "Finite geometry"),
    ("Geometry", "Topology"),
    ("Geometry", "Trigonometry"),
    ("Foundations of Mathematics", "Philosophy of mathematics"),
    ("Foundations of Mathematics", "Mathematical logic"),
    ("Foundations of Mathematics", "Set theory"),
    ("Foundations of Mathematics", "Category theory"),
    ("Applied Mathematics", "Mathematical physics"),
    ("Applied Mathematics", "Probability theory"),
    ("Applied Mathematics", "Mathematical statistics"),
    ("Applied Mathematics", "Statistics"),
    ("Applied Mathematics", "Game theory"),
    ("Applied Mathematics", "Information theory"),
    ("Applied Mathematics", "Computer science"),
    ("Applied Mathematics", "Theory of computation"),
    ("Applied Mathematics", "Control theory"),
    ("Others", "Order theory"),
    ("Others", "Graph theory")]


# Step 1. Get pages (and raw equations) from wikipedia

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in results:

        result = WikipediaPage(method)

        # Show a visual check!
        print("Matching %s to %s" % (method, result.title))
        entry = {'categories': result.categories,
                 'title': result.title,
                 'method': method,
                 'url': result.url,
                 'summary': result.summary,
                 'images': result.images}

        # We can use links to calculate relatedness
        entry['links'] = get_attribute(result, 'links')
        entry['references'] = get_attribute(result, 'references')

        results[method] = entry


save_pretty_json(results, "wikipedia_math_articles.json")

## STEP 2: EQUATIONS ###########################################################

equations = dict()

for pair in pages:
    domain = pair[0]
    method = pair[1]
    if method not in equations:
        print("Extracting equations from %s" % (method))
        result = WikipediaPage(method)
        html = result.html()
        soup = BeautifulSoup(html, 'lxml')

        equation_list = []

        # Equations are represented as images; they map to annotations
        images = soup.findAll('img')
        for image in images:
            image_class = image.get("class")
            if image_class is not None:
                if any(re.search("tex|math", x) for x in image_class):
                    png = image.get("src")
                    tex = image.get("alt")
                    entry = {"png": png,
                             "tex": tex,
                             "domain": domain,  # inefficient to store many times,
                             "topic": method}   # but more conservative
                    equation_list.append(entry)

        if len(equation_list) > 0:
            equations[method] = equation_list

# The next step is to load these equations, and map them to the space
# of characters generated from the statistics model. See the "analysis"
# subfolder for these next steps.
```
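The README states that the equations dictionary is saved to both JSON and pickle. A minimal sketch of that save step, using a toy dictionary and the file names given in the README:

```python
import json
import pickle

# Toy equations dictionary, shaped like the one built above
equations = {"Trigonometry": [{"png": "eq.png", "tex": r"\frac{a}{b}",
                               "domain": "Geometry", "topic": "Trigonometry"}]}

# Save to JSON (human readable) and pickle (exact Python objects)
with open("wikipedia_math_equations.json", "w") as fh:
    json.dump(equations, fh, indent=4)
with open("wikipedia_math_equations.pkl", "wb") as fh:
    pickle.dump(equations, fh)

# Reload to verify the round trip
with open("wikipedia_math_equations.pkl", "rb") as fh:
    reloaded = pickle.load(fh)
print(reloaded == equations)
```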