This repository has been archived by the owner on May 17, 2023. It is now read-only.

Merge pull request #1 from raybuhr/master
added some course info and resources
Chris Walker committed Sep 28, 2015
2 parents cd43f37 + c4f6cc5 commit f9dc2d9
Showing 4 changed files with 201 additions and 0 deletions.
49 changes: 49 additions & 0 deletions w201_rdada/README.md
@@ -0,0 +1,49 @@
## What is Data Science?

RDADA aims to help incoming students from all sorts of backgrounds get on the same page about what it means to study and practice data science. To understand what to learn, you need to know how we got to this situation, what types of roles exist and what the future might have in store.

RDADA covers a broad range of topics related to communicating and interpreting information. Examples include dealing with high-stakes situations, explaining technical information to different audiences, and considering cognitive biases and concepts from behavioral economics.

Due to the diversity of MIDS students, no single path to success in RDADA truly exists. Instead, read as much as you can about the field, both technical and 'soft' articles. You'll eventually find that a big part of a successful data science project involves basic considerations of human thinking, like how to actually motivate people to act on your recommendation or how to convince someone that the available data isn't sufficient for a particular analysis.

Here are some suggested readings, both from class and otherwise. Links are broken down into roughly three topics: mathy stuff, computery stuff and humany stuff.

## Resources

### Math / Stats / Analysis:
- Freakonometrics, a blog about econometrics projects http://freakonometrics.hypotheses.org/
- UCI Machine Learning Repository, a lot of datasets to practice on http://archive.ics.uci.edu/ml/
- Statistics Done Wrong, what better way to learn! https://www.nostarch.com/statsdonewrong
- Machine Learning is Fun! the world's easiest intro to ML, https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471
- Python Getting Started with Data Analysis. If you currently use Excel, SAS or R for data analysis, but want to know how to do it in Python, then start here. http://alstatr.blogspot.ca/2015/02/python-getting-started-with-data.html
- Exploratory Data Analysis with R, a free book on how to get started. https://leanpub.com/exdata


### Computer Science / Programming / Cloud Computing:
- Automate the Boring Stuff, a free web book (real book available) that is an awesome way to learn to program in Python because you actually use it to get things done. https://automatetheboringstuff.com/
- Data Science Toolbox, a virtual environment hosted locally (your computer) or in the cloud (AWS) that comes with tons of data science resources already installed. If you want to stay in Windows, this is an awesome option to outsource your tasks to a Unix environment without needing to install Linux or buy a Mac. If you have a Mac or already run Linux, this is an awesome option to have a separate place to install packages that won't interfere with the existing dependencies your computer already uses. http://datasciencetoolbox.org/
- Getting started with Amazon Web Services http://docs.aws.amazon.com/gettingstarted/latest/awsgsg-intro/gsg-aws-intro.html
- Getting Started with Google Cloud Platform https://cloud.google.com/free-trial/?utm_source=twitter&utm_medium=cpc&utm_campaign=2015-q3-cloud-na-gcp-directbuy-freetrial
- Getting Started with Git https://git-scm.com/book/en/v1/Getting-Started
- Working with R, RStudio and Github http://stat545-ubc.github.io/git00_index.html
- Command Line Crash Course, http://cli.learncodethehardway.org/book/


### Communication / Visualization / Cool Stuff Data Scientists Do / Ideas for Learning Data Science / Data Science Career Advice:
- What is data science? a really big picture take, https://datajobs.com/what-is-data-science
- datascience@berkeley videos from prior immersions http://datadialogs.berkeley.edu/video/
- Open Source Data Science Masters, because you don't have to pay for everything. http://datasciencemasters.org/
- The Art of Data Science, a guide to what works and what doesn't based on experience (FREE) https://leanpub.com/artofdatascience
- The Signal and the Noise, a popular book that is easy to read and a good intro to what to expect when doing predictive analytics http://www.amazon.ca/The-Signal-Noise-Predictions-Fail-but/dp/0143125087
- Thinking, Fast and Slow, a Nobel Prize winner's lifelong work on understanding cognitive biases. http://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555
- Nizkor Project, a curation of fallacies in human thinking. http://www.nizkor.org/features/fallacies/
- Data + Design, intro to preparing and visualizing data. https://infoactive.co/data-design/titlepage01.html
- Dataquest, interactive tutorials to learn data science https://www.dataquest.io/
- Datacamp, interactive tutorials to learn R https://www.datacamp.com/
- Datatau, the hacker news equivalent for data science http://www.datatau.com/
- Datascience Weekly Newsletter http://www.datascienceweekly.org/newsletters
- Partially Derivative, a data science podcast http://www.partiallyderivative.com/
- Naked Statistics, an easy to read and digest version of what statistics is all about. Not really meant for learning statistics, but great for explaining statistics to other people. http://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/039334777X/ref=sr_1_1?ie=UTF8&qid=1435289294&sr=8-1&keywords=naked+statistics
- Overview of popular data science tools and salary associations, http://www.techrepublic.com/blog/big-data-analytics/data-scientists-can-find-big-money-in-open-source/#.


42 changes: 42 additions & 0 deletions w203_exploring_and_analyzing_data/README.md
@@ -0,0 +1,42 @@
## Exploring and Analyzing Data

Also known as *Introduction to Statistics with R*

This course guides someone with a basic understanding of statistics through the same introduction to statistics you took in undergrad, but this time using R (unless you also used R in undergrad, in which case it is exactly the same).

You will learn:
- The Central Limit Theorem
- Normal Distribution
- Hypothesis Testing
- Parametric vs. Non-parametric tests
- Linear Regression
- Logistic Regression
- Some basic R syntax
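
As a rough illustration of the first item in the list, the Central Limit Theorem can be seen in a quick simulation. The sketch below uses Python's standard library rather than R only to stay dependency-free. The function name `sample_means`, the sample sizes and the seed are arbitrary choices made for this example, not course material.

```python
import random
import statistics


def sample_means(n_samples, sample_size, rng):
    """Draw n_samples samples from a skewed (exponential) distribution
    and return the mean of each sample."""
    return [
        statistics.mean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(201)  # fixed seed so the run is repeatable
means = sample_means(n_samples=2000, sample_size=50, rng=rng)

# The exponential(1) population has mean 1 and standard deviation 1, so the
# CLT predicts sample means centered near 1 with spread near 1/sqrt(50).
print("mean of sample means:", round(statistics.mean(means), 3))
print("std of sample means: ", round(statistics.stdev(means), 3))
print("CLT prediction:      ", round(1 / 50 ** 0.5, 3))
```

Even though each individual draw is heavily skewed, the sample means cluster symmetrically around the population mean, which is the behavior the theorem describes.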

## Resources

**R**
- RStudio, the most popular IDE for R. https://www.rstudio.com/products/rstudio/
- R course material, primer for working with R. http://stcorp.nl/R_course/
- Working with RStudio and Github. http://stat545-ubc.github.io/git00_index.html
- RStudio Cheatsheets, PDFs of how to do stuff so you don't have to remember. https://www.rstudio.com/resources/cheatsheets/
- Cookbook for R, quick recipes of how to do common tasks. http://www.cookbook-r.com/
- Datacamp Intro to R, interactive way to learn R basics. https://www.datacamp.com/courses/introduction-to-r
- Intro to Dplyr, the easiest way to get data into the form you want. https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
- 10 R Packages I wish I knew earlier... (swap plyr for its newer version, dplyr) http://blog.yhathq.com/posts/10-R-packages-I-wish-I-knew-about-earlier.html
- Tufte in R, a guide to producing effective and aesthetically pleasing charts. http://motioninsocial.com/tufte/
- Exploratory Data Analysis with R, a free e-book to help you become effective using R. https://leanpub.com/exdata
- R Programming for Data Science, same author as above and next step in becoming better at R. https://leanpub.com/rprogramming
- Writing an R Package from Scratch, in case you write a series of awesome functions that let you quickly and easily solve a problem. http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/

**Statistics**
- Quick reference to variable types, http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm
- Berkeley Glossary of Statistical Terms, http://www.stat.berkeley.edu/~stark/SticiGui/Text/gloss.htm
- Which Test to Use? A table of which statistical test to use split by Parametric and Non-Parametric, http://changingminds.org/explanations/research/analysis/parametric_non-parametric.htm
- Example of Logistic Regression, http://www.ats.ucla.edu/stat/r/dae/logit.htm
- Simply Statistics, blog from the Biostatistics professors at Johns Hopkins University. http://simplystatistics.org/

**Sometimes you need to hear statistics explained in a different way.**
- Udacity Intro to Statistics, https://www.udacity.com/course/intro-to-statistics--st101
- Udacity Inferential Statistics, https://www.udacity.com/course/intro-to-inferential-statistics--ud201
- Udacity Intro to Descriptive Statistics, https://www.udacity.com/course/intro-to-descriptive-statistics--ud827
58 changes: 58 additions & 0 deletions w205_storing_and_retrieving_data/README.md
@@ -0,0 +1,58 @@
## Storing and Retrieving Data

_No one really knows what this course is about because it has been changed multiple times since creation._

Data comes in many forms, usually broken into two categories: structured, like a spreadsheet or a relational database, and unstructured, like comments on the web or health records between different hospitals.

Big data comes in multiple flavors, usually described as a combination of V's:
- **__Velocity__**, a.k.a. the firehose; sensor data and the Twitter stream are good examples.
- **__Volume__** changes each year relative to users and new technology, but generally means any amount of data that takes a lot of planning and design to figure out how to store, move or process.
- **__Variety__** means the type of data or range of possible formats changes frequently or makes it hard to join to other data sets.
- **__Veracity__** includes concerns about whether the data was collected properly or can be readily accessed.

As a data scientist, you'll need to know how to get data, put it someplace, clean it up and hook it up with other data. This course should prepare you to do those tasks in an efficient, programmatic manner. The more you learn to automate data preparation and manipulation tasks, the more time you get to spend on the analysis and presentation pieces.
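
To make the "get it, store it, clean it, join it" loop concrete, here is a minimal sketch using only Python's standard library (`csv` and `sqlite3`). The data, table names and the `clean_rows` helper are all made up for illustration, and an in-memory SQLite database stands in for whatever storage you would really use.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: messy whitespace/case and one row missing a value.
RAW_CSV = """user_id,city,visits
1, Berkeley ,3
2,oakland,
3,BERKELEY,7
"""

# Hypothetical lookup table to join against.
CITY_REGIONS = [("berkeley", "East Bay"), ("oakland", "East Bay")]


def clean_rows(raw_text):
    """Parse the CSV, normalize city names and drop rows without visits."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        visits = row["visits"].strip()
        if not visits:
            continue  # skip incomplete records
        yield int(row["user_id"]), row["city"].strip().lower(), int(visits)


conn = sqlite3.connect(":memory:")  # "put it someplace"
conn.execute("CREATE TABLE visits (user_id INTEGER, city TEXT, visits INTEGER)")
conn.execute("CREATE TABLE regions (city TEXT, region TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", clean_rows(RAW_CSV))
conn.executemany("INSERT INTO regions VALUES (?, ?)", CITY_REGIONS)

# "Hook it up with other data": join the cleaned rows to the lookup table.
totals = conn.execute(
    """
    SELECT r.region, SUM(v.visits)
    FROM visits AS v JOIN regions AS r ON r.city = v.city
    GROUP BY r.region
    """
).fetchall()
print(totals)
```

Because every step is a function or a query, the whole thing can be rerun when new data arrives, which is the point of automating the preparation work.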

## Resources:

**Basics like Terminal and Python**
- Command Line Crash Course, be prepared... http://cli.learncodethehardway.org/book/
- Think Python: How to Think Like a Computer Scientist, free e-book and great intro to python. http://www.greenteapress.com/thinkpython/html/index.html
- Data Science Toolbox, a virtual environment for data science that you can have up and running quickly. **__Use this if you are on Windows.__** http://datasciencetoolbox.org/

**Relational Databases**
- PostgreSQL vs Microsoft SQL Server, a crazy person's rant that explains a lot about the features you want in a relational database. http://www.pg-versus-ms.com/
- Use the Index Luke, a free e-book on why indexes help SQL and how to create them. http://use-the-index-luke.com/sql/preface

**Data Transformation in Python**
- Learn Pandas, the python library for dataframes and data analysis. https://bitbucket.org/hrojas/learn-pandas
- Online JSON Viewer, so you know the format is valid. http://jsonviewer.stack.hu/
- Regular Expressions Quickstart. Regular expressions, usually called regex, are a generally good way to do pattern matching; Python supports them through the standard library module `re`. http://www.regular-expressions.info/quickstart.html
- Readable regex in python, http://tonysyu.github.io/readable-regular-expressions-in-python.html#.VgZakSBVhBe
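
A small sketch of two ideas from the list above: a readable regex written with `re.VERBOSE`, and a local check that a string is valid JSON. The log format, the `LOG_LINE` pattern and both helper names are invented for this example.

```python
import json
import re

# Verbose mode lets the pattern carry its own comments.
LOG_LINE = re.compile(
    r"""
    (?P<date>\d{4}-\d{2}-\d{2})   # ISO-style date, e.g. 2015-09-28
    \s+
    (?P<level>[A-Z]+)             # log level in capitals
    \s+
    (?P<message>.*)               # the rest of the line
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return the named groups as a dict, or None when the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_valid_json(text):
    """True when `text` parses as JSON; a local stand-in for an online viewer."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


record = parse_line("2015-09-28 INFO merged pull request")
print(record)
print(is_valid_json(json.dumps(record)))
print(is_valid_json("{'single': 'quotes'}"))
```

Named groups plus verbose mode cost a few extra lines but make the pattern much easier to revisit later. The last call shows a common gotcha: JSON requires double quotes, so a Python-style dict literal is not valid JSON.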

**MapReduce**
- MapReduce, slides from Google research authors. http://research.google.com/archive/mapreduce-osdi04-slides/index.html
- Hadoop, an open-source Apache project for distributed data processing. https://hadoop.apache.org/
- Spark, an open-source Apache project for fast, large-scale data processing. http://spark.apache.org/
- mrjob and S3 tutorial. mrjob is a Python library for writing MapReduce functions. S3 is Amazon's cloud object storage. https://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/labs/lab5/
- Word Count MapReduce example using local and EMR, http://www.lichun.cc/blog/2012/06/wordcount-mapreduce-example-using-hive-on-local-and-emr/
- Guide to Setting Up a cluster of virtual machines running Hadoop and Spark. https://github.com/dnafrance/vagrant-hadoop-spark-cluster
- Hortonworks Sandbox, tutorials for getting started with Hadoop. http://hortonworks.com/products/hortonworks-sandbox/#tutorial_gallery
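
Before diving into Hadoop or mrjob, it can help to see the MapReduce idea in plain Python. The sketch below is a single-process word count and only mimics the map, shuffle and reduce steps; the function names are illustrative and are not the API of Hadoop, Spark or mrjob.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word.strip(".,!?"), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: collapse all counts for one word into a single total."""
    return word, sum(counts)


lines = ["Data comes in many forms.", "Big data comes in multiple flavors."]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(word_counts["data"], word_counts["comes"], word_counts["big"])
```

In a real cluster the mapper and reducer run on different machines and the framework handles the shuffle, but the functions you write keep roughly this shape.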

**Natural Language Processing**
- NLTK, python Natural Language Toolkit, a library for processing textual data. http://www.nltk.org/
- TextBlob, a python library that makes processing textual data even easier than NLTK. https://textblob.readthedocs.org/en/dev/
- Whoosh! An embedded search engine written in python. http://sowingseasons.com/blog/introduction-to-whoosh.html
- Audiogrep, a python library that transcribes audio files and helps you search them. https://github.com/antiboredom/audiogrep

**Web Scraping**
- BeautifulSoup, a python library to get data out of webpages. http://www.crummy.com/software/BeautifulSoup/
- lxml, python library for working with XML and HTML. https://github.com/lxml/lxml/
- Scrapy, the basics. A python library that lets you "scrape" webpages. https://seanmckaybeck.com/scrapy-the-basics.html
- Grab, an alternative to Scrapy. http://docs.grablib.org/en/latest/
- Mechanize, a python library to build programmatic web browsing bots. https://pypi.python.org/pypi/mechanize/
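
For a feel of what these scraping libraries do under the hood, here is a minimal sketch that pulls links out of an HTML snippet using only the standard library's `html.parser`. It is a stand-in for illustration, not how BeautifulSoup or Scrapy are used; the `LinkCollector` class and the sample page are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


PAGE = """
<ul>
  <li><a href="http://www.nltk.org/">NLTK</a></li>
  <li><a href="http://neo4j.com/">Neo4j <b>graph</b> database</a></li>
  <li>No link here</li>
</ul>
"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```

The dedicated libraries in the list handle malformed markup, selectors and crawling for you, which is why they are usually the better choice for real pages.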

**NoSQL databases**
- Comparison of NoSQL Options, with features and best use cases. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
- MongoDB, an open-source BSON (binary JSON) data storage application. https://docs.mongodb.org/manual/?_ga=1.138812147.1950857371.1443257647
- Cassandra, an Apache open-source project offering great performance. http://cassandra.apache.org/
- Neo4j, graph database, useful for mapping relationships like in a network. http://neo4j.com/
52 changes: 52 additions & 0 deletions w209_data_visualization_and_communication/README.md
@@ -0,0 +1,52 @@
## Data Visualization and Communication

This course aims to teach the fundamentals of communicating data and analysis to people of different backgrounds.

In academia, you present to people who want to criticize your work to make sure it stands up to the standards of scientific research and sound logic. In professional environments, you present to people who want to use your work to make more money by increasing revenue or decreasing costs. At data science conferences or meetups, you present to people who want to learn how to work smarter and find out cool things other people do. Each of these groups loves data visualization, but they all prefer different takes.

Academics prefer statistical plots that provide a lot of information without a lot of wasted ink. Business people love pie charts because they don't take up a lot of space and can have pretty colors. Data science fellows love interactive charts that they can explore to learn more about the underlying relationships without having to endlessly peer at numbers.

Understanding your audience and catering to them can hurt, especially when they want to see pie charts. Sometimes it makes sense to present information in a different way than expected, but sometimes that causes misunderstanding due to the learning curve involved with change. The best and only way to truly learn great data visualization skills? Practice. Create charts from a lot of different data sources and create different charts for the same data source. Practicing helps you build intuition about what looks good, what can be quickly digested and what takes more time to create than the value it provides.


## Resources:

- Data + Design: Visualization. A good place to start on what to do and what not to do. https://infoactive.co/data-design/part04.html
- Tufte in R. Examples of various statistical plots done in R with minimal chart junk and maximum Tufte style. http://motioninsocial.com/tufte/
- Cookbook for R: Graphs. Tutorials on how to do all the basic charts with ggplot2. http://www.cookbook-r.com/Graphs/
- Matplotlib: plotting with Python. http://matplotlib.org/
- Bokeh: interactive plotting with Python. http://bokeh.pydata.org/en/latest/
- Pyxley: python powered dashboards. http://multithreaded.stitchfix.com/blog/2015/07/16/pyxley/
- Shiny: interactive plotting with R. http://shiny.rstudio.com/
- D3 Tips and Tricks, a free e-book. https://leanpub.com/D3-Tips-and-Tricks

## From Week 5: Vendor Tutorials

**Tableau Desktop**
- http://www.tableausoftware.com/products/desktop
- http://www.tableausoftware.com/learn/training

**Adobe Illustrator**
- http://www.adobe.com/products/creativecloud/students.edu.html
- https://helpx.adobe.com/creative-cloud/learn/tutorials/illustrator.html

**R and ggplot2**
- http://www.r-project.org
- http://ggplot2.org
- http://wiki.stdout.org/rcookbook/Graphs/

**d3**
- http://d3js.org
- http://chimera.labs.oreilly.com/books/1230000000345/index.html
- https://github.com/mbostock/d3/wiki/Tutorials
- Refreshers on HTML, the DOM, CSS, SVG and JavaScript: http://chimera.labs.oreilly.com/books/1230000000345/ch03.html

**Highcharts**
- http://www.highcharts.com
- http://www.highcharts.com/docs

**VisIt**
- https://www.llnl.gov/visit
- https://wci.llnl.gov/codes/visit/manuals.html
