
Commit

Merge pull request #50 from Lambda-School-Labs/docs/README
README completion
schase15 authored Oct 23, 2020
2 parents 76542c6 + ed5bdd2 commit 0f3b943
Showing 11 changed files with 214 additions and 14 deletions.
202 changes: 199 additions & 3 deletions README.md

Large diffs are not rendered by default.

Binary file added assets/OCR_Models.png
Binary file added assets/arch_layout.png
Binary file added assets/endpoints.png
Binary file added assets/histogram.png
Binary file added assets/line_graph.png
Binary file added assets/product_video.mp4
Binary file added assets/security_diagram.png
5 changes: 3 additions & 2 deletions data/README.md
@@ -1,8 +1,9 @@
# Story Squad Data

- Children's story submissions cannot be made public due to [COPPA](https://www.ecfr.gov/cgi-bin/text-idx?SID=4939e77c77a1a1a08c1cbf905fc4b409&node=16%3A1.0.1.3.36&rgn=div5) guidelines, and our team decided to extend that restriction to the transcriptions as well as a precaution.
- The `squad_score_metrics` csv file in this folder contains the Squad Score v1.1 metrics from all 167 provided stories in our training data set, and was generated from the `squad_score_mvp` notebook.
- The `squad_score_metrics` csv file in this folder contains the Squad Score v1.1 metrics from all 167 provided stories in our training data set, and was generated from the `squad_score_mvp` [notebook](../notebooks/squad_score_mvp.ipynb).
- features: story_id, story_length, avg_word_len, quotes_num, unique_words_num, adj_num, squad_score
- The `rankings` csv file contains the hand-rankings of 25 stories in the dataset, which is the only piece of labeled data provided by the stakeholder.
- features: ranking, story_id
- Anyone with access to the Story Squad data can download any of the notebooks in this repository to generate any additional needed csv files.
- Anyone with access to the Story Squad data can download any of the notebooks in this repository to generate any additional needed csv files. The [README](../notebooks) in the notebooks folder will list any csv a notebook creates.
- Note: for anyone with access to Story Squad data, be advised that the human transcriptions of the stories corresponding to the following Story IDs are missing pages, and are therefore inaccurate and should be removed from any comparisons of human vs computer transcriptions: 3213, 3215, 3240, 5104, 5109, 5262
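
The two CSVs described above can be combined by anyone with data access. A minimal sketch, assuming pandas and the file names listed in this folder (exact paths may differ in practice):

```python
import pandas as pd

# Squad Score v1.1 metrics for the 167 training stories,
# generated by the squad_score_mvp notebook.
metrics = pd.read_csv("data/squad_score_metrics.csv")
# Columns: story_id, story_length, avg_word_len, quotes_num,
#          unique_words_num, adj_num, squad_score

# Hand rankings of 25 stories -- the only labeled data from the stakeholder.
rankings = pd.read_csv("data/rankings.csv")
# Columns: ranking, story_id

# Join the hand rankings onto the computed metrics to compare Squad Score
# against the stakeholder's ordering.
labeled = rankings.merge(metrics, on="story_id", how="left")
print(labeled.sort_values("ranking").head())
```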
12 changes: 6 additions & 6 deletions notebooks/README.md
@@ -1,8 +1,8 @@
#### Overview of `notebooks` content:
- `clustering`: This notebook explores three different clustering methods to create groupings of users for the gamification portion of Story Squad. The currently implemented version creates groups based on the ranking of the squad scores. The other methods explored were `KMeans Clustering` and `Nearest Neighbors`. These have not been implemented in our application due to time constraints.
- `clustering`: This notebook explores three different clustering methods to create groupings of users for the gamification portion of Story Squad. The currently implemented version ([clustering_mvp.py](../project/app/utils/clustering/clustering_mvp.py)) creates groups based on the ranking of the squad scores. The other methods explored were `KMeans Clustering` and `Nearest Neighbors`. These have not been implemented in our application due to time constraints.
- `count_spelling_errors`: This notebook explores various spell check libraries to determine whether spell check could correct transcription errors, act as a metric for student writing, and/or increase the reliability of other metrics. For the time being, we did not see enough improvement and consistency to implement this feature.
- `score_visual`: This notebook explores different visualizations to display on the parent's dashboard. Several versions were mocked up and presented to the stakeholder. `histogram.py` and `line_graph.py` are the resulting final visuals per the feedback provided by the stakeholders. Each of these `.py` files are implemented in our application at the visualization endpoint.
- `squad_score_mvp`: This notebook contains data exploration of training data, generation of MinMaxScaler, and Squad Score formula composition for complexity metric. Also produces `squad_score_metrics.csv` which contains a row for each training data transcription. Features include `story_id`, all features used in the most recent Squad Score formula, and `squad_score`.
- `submission_endpoint_interactions`: This Notebook demonstrates the functionality for `submission.py` endpoints and outlines the file structure that is required from the endpoints `UploadFile` type.
- `transcribed_stories`: This notebook connects to the Google Cloud Vision API and transcribes the given 167 stories. Produces the `transcribed_stories.csv` which includes the Submission ID and the Transcribed Text. The `transcribe` method is used to create `transcription.py` which is used in the app.
- `transcription_confidence`: This notebook explores Google Cloud Vision API's method to return confidence levels of its transcription. Produces the `error_confidence.csv` which includes story_id, error (calculated between the api transcription and provided human transcription) and confidence for each submission. The `image_confidence` method is modified to create the `confidence_flag.py` which is used in the app.
- `score_visual`: This notebook explores different visualizations to display on the parent's dashboard. Several versions were mocked up and presented to the stakeholder. [`histogram.py`](../project/app/utils/visualizations/histogram.py) and [`line_graph.py`](../project/app/utils/visualizations/line_graph.py) are the resulting final visuals per the feedback provided by the stakeholders. Each of these `.py` files is implemented in our application at the visualization endpoint.
- `squad_score_mvp`: This notebook contains data exploration of the training data, generation of the MinMaxScaler, and composition of the Squad Score formula for the complexity metric. Also produces [`squad_score_metrics.csv`](../data/squad_score_metrics.csv), which contains a row for each training data transcription. Features include `story_id`, all features used in the most recent Squad Score formula ([`squad_score.py`](../project/app/utils/complexity/squad_score.py)), and `squad_score`.
- `submission_endpoint_interactions`: This notebook demonstrates the functionality of the [`submission.py`](../project/app/api/submission.py) endpoints and outlines the file structure required by the endpoint's `UploadFile` type.
- `transcribed_stories`: This notebook connects to the Google Cloud Vision API and transcribes the given 167 stories. Produces the [`transcribed_stories.csv`](../data) which includes the Submission ID and the Transcribed Text. The `transcribe` method is used to create [`transcription.py`](../project/app/utils/img_processing/transcription.py) which is used in the application.
- `transcription_confidence`: This notebook explores the Google Cloud Vision API's method for returning confidence levels for its transcriptions. Produces the [`error_confidence_metrics.csv`](../data), which includes story_id, error (calculated between the API transcription and the provided human transcription), and confidence for each submission. The `image_confidence` method is modified to create [`confidence_flag.py`](../project/app/utils/img_processing/confidence_flag.py), which is used in the application.
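
To illustrate the error column in `error_confidence_metrics.csv`, one way to score an API transcription against its human transcription is a simple similarity-based error rate. A sketch using `difflib`; the notebook's actual error metric may differ:

```python
import difflib


def transcription_error(api_text: str, human_text: str) -> float:
    """Rough 0-1 error estimate between the API transcription and the
    human transcription (1 - similarity ratio). Illustrative stand-in for
    whatever error metric the notebook actually computes."""
    similarity = difflib.SequenceMatcher(
        None, api_text.lower(), human_text.lower()
    ).ratio()
    return 1.0 - similarity
```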
9 changes: 6 additions & 3 deletions project/app/utils/README.md
@@ -2,16 +2,19 @@

#### `clustering` subfolder:
- `clustering_mvp.py`: Contains two functions: 1) `cluster`, which takes in a dictionary of one cohort's submissions, orders them by complexity score, and clusters submission IDs into groups of 4 by score, duplicating 1-3 submission IDs as needed to ensure all clusters contain 4 IDs. Returns a list of lists. 2) `batch_cluster`, which takes a dictionary of nested dictionaries of all cohorts' submission scores for a week and runs them all through the `cluster` function. Returns a JSON object containing the results for each cohort.
- Work for this `.py` file can be found in this [notebook](../../../notebooks/clustering.ipynb).
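
A minimal sketch of the grouping behavior described above; the function shapes match the description, but the exact ordering and padding details are assumptions, not the code in `clustering_mvp.py`:

```python
import json
from typing import Dict, List


def cluster(submissions: Dict[str, float]) -> List[List[str]]:
    """Group one cohort's submission IDs into clusters of 4 by Squad Score."""
    # Order submission IDs from highest to lowest complexity score.
    ordered = sorted(submissions, key=submissions.get, reverse=True)

    # If the cohort size is not a multiple of 4, duplicate trailing IDs
    # (1-3 of them for typical cohort sizes) until every cluster has 4 IDs.
    while len(ordered) % 4:
        ordered.append(ordered[-1])

    # Slice the ordered list into consecutive groups of 4.
    return [ordered[i:i + 4] for i in range(0, len(ordered), 4)]


def batch_cluster(cohorts: Dict[str, Dict[str, float]]) -> str:
    """Run every cohort's weekly submissions through `cluster` and return
    the groupings as a JSON object keyed by cohort."""
    return json.dumps({cid: cluster(subs) for cid, subs in cohorts.items()})
```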

#### `complexity` subfolder:
- `squad_score.py`: Contains two functions: `metrics`, which generates a single-row DataFrame of complexity metrics from a transcription string, and `squad_score`, which takes a transcription string, runs it through `metrics`, then generates a complexity metric integer, or "Squad Score."
- Work for this `.py` file can be found in this [notebook](../../../notebooks/squad_score_mvp.ipynb).
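
A sketch of the shape of these two functions, assuming pandas; the feature set shown is abbreviated and the real formula's MinMaxScaler and weights live in `squad_score.py`, so treat the scoring step as a placeholder:

```python
import pandas as pd


def metrics(transcription: str) -> pd.DataFrame:
    """Return a single-row DataFrame of complexity metrics for one
    transcription (abbreviated: the production version also counts
    adjectives and uses the project's exact tokenization)."""
    words = transcription.split()
    return pd.DataFrame([{
        "story_length": len(transcription),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "quotes_num": transcription.count('"') // 2,
        "unique_words_num": len({w.lower() for w in words}),
    }])


def squad_score(transcription: str) -> int:
    """Run a transcription through `metrics` and collapse the features into
    a single integer score. Placeholder weighting -- the real version scales
    features with the MinMaxScaler fit on the 167 training stories."""
    features = metrics(transcription).iloc[0]
    return int(round(features.sum()))
```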

#### `img_processing` subfolder:
- `transcription.py`: Utilizes the Google Cloud Vision API and its `document_text_detection` method to transcribe text from a given image.
- `safe_search.py`: Utilizes the Google Cloud Vision API and its `safe_search` method to perform moderation of user-uploaded illustrations.
- `google_api.py`: Utilizes methods from `transcription.py` and `safe_search.py` to provide the DS API with an object-oriented interface to the Google API and to prepare the Google credentials for parsing by the Google API.
- `confidence_flag.py`: Utilizes the Google Cloud Vision API to calculate a confidence level for each page transcription. Will return a flag if the confidence level is below 0.85.
- `confidence_flag.py`: Utilizes the Google Cloud Vision API to calculate a confidence level for each page transcription. Will return a flag if the confidence level is below 0.85. Work for this `.py` file can be found in this [notebook](../../../notebooks/transcription_confidence.ipynb).
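
A minimal sketch of how a page image typically flows through the Vision API for transcription and confidence flagging, assuming the `google-cloud-vision` client (v2+) and application-default credentials; the function names and the block-confidence averaging are illustrative, not the exact code in these files:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()


def transcribe(image_bytes: bytes) -> str:
    """Transcribe handwritten text from one page image."""
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def confidence_flag(image_bytes: bytes, threshold: float = 0.85) -> bool:
    """Flag a page whose average block confidence falls below the threshold."""
    response = client.document_text_detection(
        image=vision.Image(content=image_bytes)
    )
    confidences = [
        block.confidence
        for page in response.full_text_annotation.pages
        for block in page.blocks
    ]
    avg = sum(confidences) / max(len(confidences), 1)
    return avg < threshold
```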

#### `visualization` subfolder:
- `histogram.py`: Creates a Plotly histogram to show the distribution of `squad_scores` of a specified grade level for the current week. Additionally plots a vertical line with the most recent `squad_score` for the specified user to compare against their grade level. Accompanying exploration work can be found in the `score_visual` notebook.
- `line_graph.py`: Creates a Plotly line graph to show the history of a specified user's `squad scores`. Accompanying exploration work can be found in the `score_visual` notebook.
- `histogram.py`: Creates a Plotly histogram to show the distribution of `squad_scores` of a specified grade level for the current week. Additionally plots a vertical line with the most recent `squad_score` for the specified user to compare against their grade level.
- `line_graph.py`: Creates a Plotly line graph to show the history of a specified user's `squad_scores`.
- Accompanying exploration work for both visuals can be found in the `score_visual` [notebook](../../../notebooks/score_visual.ipynb).
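
A sketch of how these two visuals can be produced with Plotly Express; the column names, titles, and JSON return format are assumptions for illustration, not the exact implementations:

```python
import pandas as pd
import plotly.express as px


def histogram(grade_scores: pd.DataFrame, recent_score: float) -> str:
    """Distribution of this week's squad_scores for one grade level, with a
    vertical line marking the specified user's most recent score."""
    fig = px.histogram(grade_scores, x="squad_score",
                       title="This week's Squad Scores for your grade")
    fig.add_vline(x=recent_score, line_dash="dash")
    return fig.to_json()


def line_graph(user_scores: pd.DataFrame) -> str:
    """History of one user's squad_scores over time."""
    fig = px.line(user_scores, x="week", y="squad_score",
                  title="Squad Score history")
    return fig.to_json()
```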
