Audience: Data analysts, developers, data engineers, data scientists, and anyone who works with data to gain insight. Prerequisites: None. Some familiarity with mongodb, mongodb shell and Atlas are helpful Data Set: This workshop uses the publicly available AirBnB data set. This data set has been pre-loaded into an instance of MongoDB hosted in Altas, our fully managed MongoDB-as-a-Service database. Attendees of the live workshop will run the exercises against this hosted instance. If you missed the workshop, you may register an Atlas account here https://www.mongodb.com/cloud and run these examples anytime.
the MongoDB Shell (the command line interface for MongoDB) Download the shell here: Windows: mongodb-shell-win32-x86_64-2008plus-ssl-4.0.2.zip Mac OS X: mongodb-shell-osx-ssl-x86_64-4.0.2.tgz
Use the mongo command line shell to connect to the Altas Instance typing this command in your terminal:
mongo "mongodb+srv://dataanalyticsworkshop-sshrq.mongodb.net/airbnb" \
--username workshop_user \
--password dotlocal2018
Step 1 List the collections in the database From the sell prompt, execute the following command
show collections
Step 2 examine an individual document in the austinListingsandReviews collection at the prompt execute:
db.austinListingsAndReviews.findOne()
The MongoDB Connector for BI translates queries and data between SQL and MongoDB, allowing you to connect data visualiztion tools such as Tableau, PowerBI, MS Excel, or the MySQL Command Line to data managed in MongoDB. We are going to use a fully managed version of the BI Connnector, hosted in Atlas. The BI Connector can be accessed with the following information
hostname: dataanalyticsworkshop-biconnector-sshrq.mongodb.net port: 27015 user: workshop_user?source=admin password: dotlocal2018
To Connect the BI Connector from the MySQL client Step 1 Download and install MySQL
Step 2 Install MongoDB Authentication Plugin The Authentication Plugin eases authentication between the BI Connector and SQL clients. Download and install the plugin per our instructions
Step 3 Connect view MySQL Client from your command line prompt, execute
mysql -h dataanalyticsworkshop-biconnector-sshrq.mongodb.net \
-P 27015 \
-u workshop_user -p \
--default-auth=mongosql_auth airbnb
authenticate with the password "dotlocal2018"
Step 4 Explore the tabular representation of the airbnb data at the MySQL prompt execute:
mysql> show tables;
mysql> describe austinListingsAndReviews;
mysql> describe austinListingsAndReviews_amenities;
Step 5 Execute a JOIN to list the set of amenities per listing
mysql> SELECT austinListingsAndReviews._id AS id, austinListingsAndReviews_amenities.amenities AS amenity
FROM austinListingsAndReviews
JOIN austinListingsAndReviews_amenities
ON austinListingsAndReviews._id = austinListingsAndReviews_amenities._id
LIMIT 10;
Step 1 Download MongoDB ODBC driver Select and download the appropriate version for your OS
Step 2 Create a DSN Follow OS specfic instructions. Be sure to use SERVER dataanalyticsworkshop-biconnector-sshrq.mongodb.net
Step 3 Connect from Excel
- Start Excel, select the "data" Tab, and clik on "New Database Query"
- Select the DSN you created in step 2 from the pop-up window and click "ok"
- enter username "workshop_user" and password "dotlocal2018" More detailed and Illustrated details on connecting from Excel are available in this tutorial
Step 4 Enter Query or Join In the popup's "SQL Query" section, input query
SELECT *
FROM austinListingsAndReviews
JOIN austinListingsAndReviews_amenities
ON austinListingsAndReviews._id = austinListingsAndReviews_amenities._id;
Step 5 Load data into sheet Click "return data" to populate the spread sheet
Goal: Create pivot table with average price/neighborhood step 1 Create a pivot table Set Rows: neighborhood Values: price, average
Goal: Expand on above pivot table to include price per Airbed Step 1 Create pivot table: price, beds Step 2 Add calculated field price per bed
charts server: http://charts.local.mongodb.parts/
username: workshop_user
password: dotlocal2018
Add a Data Source Sign in to Charts: http://charts.local.mongodb.parts Go to Data Sources tab Click New Data Source Use the URI: Choose a data source for a city of your choice Keep the data source private (don’t share with others) Publish the data source Edit the Alias to have a friendly name, e.g. “AirBnB New York” Create a Dashboard Go to the Dashboards tab Click New Dashboard Enter a name and description, e.g. “Tom’s AirBnB dashboard” Add some charts For each of the following, click Add Chart from the dashboard view. After the chart has been saved, resize and rearrange on the dashboard as desired.
Goal: Show the localities with the most properties, split out by property type. Data Source: Chart Type: Bar / Stacked Encodings: X axis: _id, Count aggregation Y axis: address.government_area Sort By: Aggregated Value, Descending Limit: 30 Series: property_type Chart Title: Properties by Location Result:
Goal: Show the areas with the most expensive properties, and correlate with the review scores. Data Source: Chart Type: Bar / Colored Encodings: X axis: price, Mean aggregation Y axis: address.government_area Sort By: Aggregated Value, Descending Limit: 30 Color: review_scores.review_scores_location Chart Title: Price by Location, coloured by Location Review score Result:
Goal: Show the number of properties with various review scores Data Source: Chart Type: Column / Grouped Encodings: X axis: review_scores.review_scores_rating Binning: On, bin size 5 Y axis: _id, Count aggregation Filter (to hide properties with no review score): { 'review_scores.review_scores_rating': { $gte: 0}} Chart Title: Review Score Histogram Result:
Goal: Show the types of properties that contain pools Data Source: Chart Type: Circular / Donut Encodings: Label: Property Type Arc: _id, Count aggregation Filter (to show only properties with pools): { amenities: 'Pool'} Chart Title: Properties with Pools Result:
Goal: Show the number of reviews of any property, month by month Data Source: Chart Type: Area / Discrete Encodings: X axis: reviews.date Array Reduction: Unwind Array Binning: On, bin by Month Y axis: _id, Count aggregation Chart Title: Reviews over time Result:
Goal: Show the number of properties with and without pools by area Data Source: Chart Type: Column / Stacked Encodings: X axis: address.government_area Y axis: _id, Count aggregation Series: amenities Array Reduction: Existence of value: “Pool” Chart Title: Pool or No Pool Result:
Now that you’ve got an understanding of what Charts can do, try creating your own charts that provides some new insight from this data set.
If you want to make something similar to one of your existing charts, click the [...] menu on the chart and then choose Duplicate Chart. Share your dashboard and data source
When you’ve finished creating your dashboard, share it with the class! Go to the Data Sources tab Find your data source and click the ACCESS button Choose to share with Everyone, with Reader role Save the data source Go to the Dashboards tab Find your dashboard and click the ACCESS button Choose to share with Everyone, with Viewer role.
Once others have shared their dashboards with you, take a look at what they have created!
The following examples are executed in the mongo shell, but you are welcome to use MongoDB Compass the GUI for MongoDB. Compass Allows you Visually explore your data and construct ad hoc queries with its Aggregation Builder.
Download Compass Windows: mongodb-compass-community-1.15.4-win32-x64.exe Mac OS X: mongodb-compass-community-1.15.4-darwin-x64.dmg
connect to MongoDB Atlas with this URI: mongodb+srv://workshop_user:@dataanalyticsworkshop-sshrq.mongodb.net/admin
- Connect to your mongodb Atlas cluster using mongodb shell:
mongo "mongodb+srv://dataanalyticsworkshop-sshrq.mongodb.net/test" --username dotlocaluser --password dotlocaluser
- Define which database we are going to use
use airbnb
- Explore collections available to use for analysis. We will use austinListingsAndReviews collection.
show collections
- Find a single document
db.austinListingsAndReviews.findOne()
- Find how many listings have a real bed and are greater than $100 but less than $200 per night by filling in the blanks. _Hint:
db.___.find( {
price: { ___: ___, $lt: ___ } } ,
{ ___ : "Real Bed"
} ).count()
We will use the following operators:
$match
$group
Take a look at this example aggregation which computes minimum price per night by number of beds
db.austinListingsAndReviews.aggregate( [
{
$group: {
_id: "$beds",
avg_by_beds: { $min: "$price" }
}
},
{ $sort: { avg_by_beds: 1 } }
] )
- Construct a very similar query to find average price per night by room type by filling in the blanks.
db.austinListingsAndReviews.aggregate( [
{
$group: {
_id: "___",
avg_by_roomtype: { $avg: "___" }
}
}}
] )
- Let’s practice some more. This time write the query from scratch to find the average price per night by suburb. Examine the document structure first. Hint: suburb is a field within a subdocument address; use dot notation to reference it.
- Now let’s combine the skills we’ve learned. Compute the count of properties and average price per night by suburbs and number of bedrooms by filling in the blanks.
db.austinListingsAndReviews.aggregate([
{"$group" :
{
_id:{suburb:"$___", bedrooms:"$___"},
___:{$sum:1},
avg:{$___:"$price"}}
}
])
- Use the query in #3 and add a $match stage to only show those properties that have a pool.
We will use the following operators:
$objectToArray
$geonear
$graphlookup
And the following aggregation pipeline stages:
$match → $project: → $unwind: "$scores" → $group
- Let’s take a look at an example first. Let’s assume we want to determine average review scores for all properties in Austin based on some amenities. Remember that review_scores is a sub document. Luckily we have $objectToArray to transform our object into an array which we can then $unwind and $group.
db.austinListingsAndReviews.aggregate([
{ $match : {amenities: { $in: ["Pool", "Air conditioning", "Wifi"] } } },
{ $project: { scores: { $objectToArray: "$review_scores" } } },
{ $unwind: "$scores" },
{ $group: { _id: "$scores.k", average_score: { $avg: "$scores.v" } } }
])
Compare two neighbourhoods - South Lamar and West Campus - and decide based on average review where you would rather stay. Assume you also want a place that has pool, wifi, and air-conditioning, under $200 Hint: You will need to look up how to use $and operator for this
- Let’s find all airbnb listings around our .local Austin venue, Alamo Drafthouse. The coordinates are 30.2988474,-97.7064386. We can do this by using $geoNear operator, as follows:
db.austinListingsAndReviews.aggregate([
{
$geoNear: {
near: { type: "Point", coordinates: [ -97.7064386 , 30.2988474 ] },
distanceField: "dist.calculated",
maxDistance: 500,
includeLocs: "dist.location",
num: 5,
spherical: true
}
}
]).pretty()
Now practice writing this query for yourself. Use $geoNear to find airbnbs near location of your choice and add the same search criteria we used in #1 above - a place that has pool, wifi, and air-conditioning, under $200 You can find $geoNear documentation and additional examples here.
- Take a look at this example. Let’s say we want to build a recommendation engine based on users that reviewed the same properties. We can use $graphlookup to run a recursive query that says: “for a given reviewer, e.g. 7538 let’s find all listings reviewed by users who also reviewed the same listings as 7538”
db.austinListingsAndReviews.aggregate( [
{ $match: { "reviews.reviewer_id": "7538" } } ,{ $unwind: "$reviews" },{
$graphLookup: {
from: "austinListingsAndReviews",
startWith: "$reviews.reviewer_id",
connectFromField: "reviews.reviewer_id",
connectToField: "reviews.reviewer_id",
as: "reviewed_listings",
maxDepth: 1,
depthField: "depth",
restrictSearchWithMatch: { "reviews.reviewer_id": { $ne: "7538" } }
}},
{ $project: {
_id: 0,
listing: "$_id",
url: "$listing_url" }}]).pretty()
Now build your own $graphLookup query. Some host in Austin have multiple listings, for example, host_id:100208662. Construct a $graphlookup query that finds all the listings by that host. You can find $graphLookup documentation and additional examples here.
In this section we'll use the machine learning ALS library in Spark to generate a set of personalized movie recommendations for a given user, based on the publically available MovieLens dataset
Ensure you have downloaded the data and imported it into MongoDB with mongorestore. You can find instructions on using mongorestore in the MongoDB Documentation. You can also run this full example with Execute the following Databricks Notebook
Step 1 Download Spark From the commandline download and decompress Spark v2.3
http://apache.claz.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
tar -xf spark-2.3.2-bin-hadoop2.7.tgz
Step 2 Start the Spark interactive shell From the commandline execute
./spark-2.3.1-bin-hadoop2.7/bin/spark-shell \
--conf \
spark.mongodb.input.uri=mongodb+srv://demo:[email protected]/airbnb.austinListingsAndReviews \
--conf \
spark.mongodb.output.uri=mongodb+srv://demo:[email protected]/airbnb.clusters \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
Step 3 Prepare to Read data in from MongoDB At the Spark prompt execute the follwing code
val pipeline = """[ { $project: {
"_id": 0,
"accommodates": "$accommodates",
"price": "$price",
"coordinates": "$address.location.coordinates"
}
}]"""
// Read from the collection defined in the command line parameter
val ds = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("pipeline", pipeline).load()
// The Spark Connector will use MongoDB's $sample command to infer the Dataset's schema from a sample set of docs.
// However, you can avoid the need to sample if you know what the schema is
// We use the following case class to explicitly declare the schema without sampling
case class Listing( accommodates: Int, price: BigDecimal, coordinates: Array[Double] )
// This second case class is used to flatten out the coordinate array in to lat / long pairs
case class FlatListing( accommodates: Int, price: BigDecimal, latitude: Double, longitude: Double )
import spark.implicits._
val listings = ds.as[Listing].map( v => ( FlatListing ( v.accommodates, v.price, v.coordinates(1), v.coordinates(0) ) ) )
// let's cache the dataset as we are going to use it in a multi-pass algorithm
listings.cache()
// Let's see what the data looks like so far
listings.show()
Step 4 Reformat data so that the KMeans class can process it At the Spark prompt execute the follwing code
// We need to reformat the dataset so that it can be processed by MLLib's KMeans class
import org.apache.spark.ml.feature._
val assembler = new VectorAssembler().setInputCols( listings.columns ).setOutputCol("features")
val vectors = assembler.transform(listings)
//Here's what it looks like now
vectors.show()
Step 5 Train a KMeans model with the prepared data vector At the Spark prompt execute the follwing code
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
// Cluster the data into 4 classes of listing using KMeans.
// This is where the machine learning magic happens!!!
val kmeans = new KMeans().setK(4).setSeed(1L).setMaxIter(5)
val model = kmeans.fit( vectors )
// let see what our centroids look like
model.clusterCenters.foreach( println )
Step 6 Assign each airbnb listing in the area to a cluster At the Spark prompt execute the follwing code
// let's now assign each listing to a cluster
val classifiedListings = model.transform( vectors ).drop( "features" )
// and see what that looks like
classifiedListings.show()
Step 7 Write the cluster assignments back to MongoDB At the Spark prompt execute the follwing code // finally, let's write the listing clusters to MongoDB classifiedListings.write.format("com.mongodb.spark.sql.DefaultSource").mode("overwrite").save()