# -*- coding: utf-8 -*-
"""Fire_MachineLearning.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1PswZ0yRsQqPIpbQ5iCACisPZUYX59H5l
# Charlottesville Fire Department Project: Machine Learning Predictions
Authors: Jackson Barkstrom, Habib Karaky, Josh Schuck, Garrett Vercoe. We joined together the data we used here in the "Cleaning and Merging" code. The data was originally worked on by many, including us, during Civic Innovation Day (special shoutouts to Stephen and Katharine).
Note: We assume a basic understanding of the data we're working with, but only a little understanding of machine learning. Our code walks through different machine learning models and their possible utility--the big question, of course, is which model we should use to predict fires. A decision tree is a simple model that we probably don't want to use in the end, but we figured it would be useful for teaching purposes.
For now, we settled on the Random Forest Regressor because it performed the best out of all of the regression models we somewhat understood and tried--easily beating out models such as bagging and simple decision trees. The random forest is an extremely powerful model, and it produced useful results both as a classifier and as a regressor. We decided we had to use a regression model because it allows us to make our own risk categories--it returned decimal risk values between 0 and 1 and we could split these up as we desired. We can easily say "put anything below .2 in the low risk category, put everything greater than .6 in the high risk category, and put everything else in the medium risk category," and thus generate low, medium, and high risk categories. However, our model really only works for finding the highest-risk homes: the high risk category is the only one that's really significant.
One of the biggest issues we ran into was predicting the fire risk of a building *after it already caught on fire.* Fire risk would go down since the owners of the building would take more safety precautions in the future, no? However, our model didn't look at multiple fires: it only looked at 0 or 1, whether the house had a fire at least once since 2003 or not. We could change this. We recognize that there's some serious danger here of predicting the past (what we're doing) not being the same as predicting the future.
Despite this shortcoming, our models generalize to most houses well. We found pretty clearly that the square footage and the age of buildings are the largest predictors of fire risk, and that makes sense. Really big old houses are way more likely to catch on fire than small new houses. Our models clearly have flaws and cannot predict everything: they could use a lot more data, such as whether or not the homes have smoke alarms installed, and they don't work well for distinguishing low risk from medium risk. But they work well for high risk. Our models can predict the homes with the highest risks of fire and the fire department can respond accordingly. Our models aren't the best they could be (we recommend improvement), but they could already be used--right now--to inspect the highest risk homes and decrease fire risk in the Charlottesville community. If our model designates a house as "high risk," the house probably needs attention.
Link to cleaning and merging notebook: https://colab.research.google.com/drive/1EPmKwBAJ560MV5pDJYD1e_iQbE0NezB0#scrollTo=phy9ec8A488x
"""
import pandas as pd
import numpy as np
"""## Import Data"""
# Import our joined together data from the "Cleaning and Merging" code
residential = pd.read_csv("https://raw.githubusercontent.com/garrettvercoe/CharlottesvilleFireModel/master/Updated_Residential_Results_cleaned.csv")
commercial = pd.read_csv("https://raw.githubusercontent.com/garrettvercoe/CharlottesvilleFireModel/master/Updated_Commercial_Results_cleaned.csv")
# Examine residential data (feel free not to run this)
residential.head()
# Examine commercial data (feel free not to run this)
commercial.head()
"""## Cleaning"""
# Drop variables we are no longer using for Machine Learning
# Drops latitude, longitude, address, if there was a fire 2003-2016,
# and if there was a fire 2016-. Our 'Fire_final' column still shows
# if there was a fire 2003-2018, and this is what we will train our
# models on. Feel free to modify to your liking.
residential_cleaned = residential.drop(['lat','lon','fire_late','fire_early','Address', 'Type'], axis=1)
commercial_cleaned = commercial.drop(['lat','lon','fire_late','fire_early','Address', 'address', 'Type'], axis=1)
# Clean the data to be ready for an algorithm
# We replace NaN with 0 so missing values effectively become their own category, in case there is a pattern to them
# We make numerical categories with dataframe["colname"].astype("category").cat.codes
residential_cleaned = residential_cleaned.replace(np.nan, 0, regex=True)
commercial_cleaned = commercial_cleaned.replace(np.nan, 0, regex=True)
residential_cleaned["use_type"] = residential_cleaned["use_type"].astype("category").cat.codes
residential_cleaned["use_code"] = residential_cleaned["use_code"].astype("category").cat.codes
residential_cleaned["grade"] = residential_cleaned["grade"].astype("category").cat.codes
residential_cleaned["ext_walls"] = residential_cleaned["ext_walls"].astype("category").cat.codes
residential_cleaned["roof"] = residential_cleaned["roof"].astype("category").cat.codes
residential_cleaned["flooring"] = residential_cleaned["flooring"].astype("category").cat.codes
residential_cleaned["bsmt_type"] = residential_cleaned["bsmt_type"].astype("category").cat.codes
residential_cleaned["heating"] = residential_cleaned["heating"].astype("category").cat.codes
commercial_cleaned["use_type"] = commercial_cleaned["use_type"].astype("category").cat.codes
commercial_cleaned["use_code"] = commercial_cleaned["use_code"].astype("category").cat.codes
# Examine the data
residential_cleaned.head()
# Examine the data
commercial_cleaned.head()
"""## Train Test Split
Note: running the split after the first cell below does not get rid of any variables, which is what we used because it produces the most accurate model. However, there is a chance that this model will be overfitted. We made the second cell give the model only four variables, just to show how important those four variables are in predicting fires, and to show that our model can produce powerful insights on only four variables.
We used 50% of the residential data and 50% of the commercial data to train, but this can easily be changed by editing test_size. That way, if our model tests well on the other 50% of the data, it has high validity.
This first one uses all of the variables. We used it in our final model. Although overfitting is dangerous, there are few variables relative to the size of the dataset.
"""
# train test split (so that we can validate our model)
from sklearn.model_selection import train_test_split
residential_cleaned_split = residential_cleaned
commercial_cleaned_split = commercial_cleaned
"""This second one gets rid of variables. Don't run it if you want to look at all the variables.
For our simplest models we used only four variables and still found good results in finding the highest-risk homes. For residential data we used 1) square footage, 2) year built, 3) whether it has a basement, and 4) total number of rooms. For commercial data we used 1) square footage, 2) year built, 3) use code, and 4) number of stories. We found these were the most significant. The final regression model will definitely need more than just these four variables.
"""
# get rid of all but the most useful variables
residential_cleaned_split = residential_cleaned[["sq_footage_finished_living", "year_built", "basement", "total_rooms", "Fire_final"]]
commercial_cleaned_split = commercial_cleaned[["gross_area", "year_built", "use_code", "number_of_stories", "Fire_final"]]
"""Basic train test split with 50% train 50% split, using data from one of the above three cells"""
# train test split
from sklearn.model_selection import train_test_split
# train/test split the shortened residential data set
residential_train, residential_test = train_test_split(residential_cleaned_split, test_size = 0.5)
# split into x and y
residential_test_x = residential_test.drop('Fire_final', axis=1)
residential_test_y = residential_test['Fire_final']
residential_train_x = residential_train.drop('Fire_final', axis=1)
residential_train_y = residential_train["Fire_final"]
# train/test split the shortened commercial data set
commercial_train, commercial_test = train_test_split(commercial_cleaned_split, test_size = 0.5)
# split into x and y
commercial_test_x = commercial_test.drop('Fire_final', axis=1)
commercial_test_y = commercial_test['Fire_final']
commercial_train_x = commercial_train.drop('Fire_final', axis=1)
commercial_train_y = commercial_train["Fire_final"]
# Examine datatypes (just to check what we're dealing with in our models)
# Everything should say float or int
print(residential_train_x.dtypes)
print(residential_train_y.dtypes)
# Again, examine datatypes
# Everything should say float or int
print(commercial_train_x.dtypes)
print(commercial_train_y.dtypes)
"""## Decision Tree Classifier
First we run our Decision Tree Classifier on the residential data. This is a basic machine learning model that is explained below in the code. It's worth noting that in our case a classifier predicts either 0 or 1, while a regressor will predict a value between 0 and 1. A classifier predicts fires (yes or no, high risk or not), while a regressor predicts more nuanced levels of fire risk (0.2, 0.4, 0.0, 0.9, etc). This decision tree did OK: using only four variables it predicted a little under half of the fires, and ~40% of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire.
"""
# Our decision tree has "branches" that are based on each variable. For example, our most important variable (with the highest gain)
# is sq_footage_finished_living, so our model might decide to make the left side of the first branch >15k sq feet and the right side
# <15k sq feet. Essentially, every variable divides the data, with the most important coming first. Google for more explanation. At
# the smallest divisions in the tree (called "leaf nodes") we will have either a 0 (no probable fire) or 1 (probable fire).
# Say we're at an arbitrary leaf, and we have 10 addresses from our training data that fall from the top of the tree into this
# category. If we know 6 of them had fires, every element in this leaf would be predicted as "1" for a fire (60% accuracy on training).
from sklearn.tree import DecisionTreeClassifier
# model
tree = DecisionTreeClassifier(criterion = "entropy")
# train (residential first)
tree.fit(residential_train_x, residential_train_y)
# predict
tree_predictions = tree.predict(residential_test_x)
# This prints the information gain for each feature (very valuable)
print(pd.DataFrame({'Information Gain': tree.feature_importances_}, index = residential_train_x.columns).sort_values('Information Gain', ascending = False))
print("")
confusion = pd.crosstab(residential_test_y, tree_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(tree, residential_test_x, residential_test_y, cv=10)
print("Scores are: ", scores)
"""Next, we run our Decision Tree Classifier on the commercial data. This did even better: using only four variables it predicted well over half of the fires, and ~50% of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire."""
# train (commercial second)
tree.fit(commercial_train_x, commercial_train_y)
# predict
tree_predictions = tree.predict(commercial_test_x)
# This prints the information gain for each feature (very valuable)
print(pd.DataFrame({'Information Gain': tree.feature_importances_}, index = commercial_train_x.columns).sort_values('Information Gain', ascending = False))
print("")
confusion = pd.crosstab(commercial_test_y, tree_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(tree, commercial_test_x, commercial_test_y, cv=10)
print("Scores are: ", scores)
"""## Random Forest Classifier
First we run our Random Forest Classifier on the residential data. A random forest is like a decision tree, but a lot more complex (it's literally just a combination of multiple decision trees) and generally a lot more accurate. It's also explained below in the code. This did very well: using only four variables it predicted about 4/10 of the fires, and *~75-80%* of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire.
"""
# The Random Forest method introduces more randomness and diversity by applying the bagging method to the feature space. Bagging, or Bootstrap Aggregating,
# consists of randomly sampling subsets of the training data, fitting a model to these smaller data sets, and aggregating the predictions. That is, instead of
# searching greedily for the best predictors to create branches, it randomly samples elements of the predictor space, thus adding more diversity and reducing the
# variance of the trees at the cost of equal or higher bias.
#
# In plain English, we model decision tree classifiers off of subsets of our data that don't include all the variables. We might have a subset that's just square
# footage, number of rooms, and number of exterior walls, for example. We have a LOT of different subsets we can take, and each one gets a tree. Then we decide
# how much weight each tree should get (a tree that uses square footage and year built will be more important than a tree only using roof and the flooring data,
# since square footage and year built are really important factors). Then, by combining weights of all of these little trees, we get our model (a forest!).
# Again, Google is your friend.
from sklearn.ensemble import RandomForestClassifier
# model
forest = RandomForestClassifier()
# train (residential first)
forest.fit(residential_train_x, residential_train_y)
# predict
forest_predictions = forest.predict(residential_test_x)
# show feature importances (very valuable, and basically the same meaning to us as information gain in a decision tree)
# shows how important each variable is
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = residential_train_x.columns).sort_values('Importance', ascending = False))
print("")
confusion = pd.crosstab(residential_test_y, forest_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(forest, residential_test_x, residential_test_y, cv=10)
print("Scores are: ", scores)
"""Next, we run our Random Forest Classifier on the commercial data. This also did very well: using only four variables it predicted a well over half of the fires (better than residential), and ~60% (worse than residential) of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire."""
# train (commercial second)
forest.fit(commercial_train_x, commercial_train_y)
# predict
forest_predictions = forest.predict(commercial_test_x)
# show feature importances (very valuable, and basically the same meaning to us as information gain in a decision tree)
# shows how important each variable is
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = commercial_train_x.columns).sort_values('Importance', ascending = False))
print("")
confusion = pd.crosstab(commercial_test_y, forest_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(forest, commercial_test_x, commercial_test_y, cv=10)
print("Scores are: ", scores)
"""## Random Forest Regressor
First we run our Random Forest Regressor on the residential data. It's again worth noting that a regressor returns values between 0 and 1 (e.g. 0.226) as opposed to just 0's and 1's. Otherwise, this model is pretty much the same as the Random Forest Classifier. This model worked best for us--better than models such as a decision tree regressor or a bagging regressor. For now this will be our final model (because it is a regressor it allows us to classify houses according to risk, which is much more in line with the task at hand). Using only four variables it was able to do very well at predicting a high risk category (see below), but more variables were useful in distinguishing between medium risk and low risk. See the code and the output for more information.
Modify the function risk_function to modify how we convert into risk categories. For more explanation of our model see the cell below.
"""
# This works just like the Random Forest Classifier, only it's a Regressor. To explain, I took the description of the Random Forest Classifier and replaced the
# word "Classifier" with "Regressor" (see below). A decision tree regressor (what this is made of) is a decision tree that has decimal numbers between 0 and 1
# at the leaf nodes instead of just 0 or 1. Say we're at an arbitrary leaf, and we have 10 addresses from our training data that fall into this category.
# If we know 6 of them had fires, every element in this leaf would probably be predicted as a decently high decimal value (since we have 6/10 fires).
#
# The Random Forest method introduces more randomness and diversity by applying the bagging method to the feature space. Bagging, or Bootstrap Aggregating,
# consists of randomly sampling subsets of the training data, fitting a model to these smaller data sets, and aggregating the predictions. That is, instead of
# searching greedily for the best predictors to create branches, it randomly samples elements of the predictor space, thus adding more diversity and reducing the
# variance of the trees at the cost of equal or higher bias.
#
# In plain English, we model decision tree regressors off of subsets of our data that don't include all the variables. We might have a subset that's just square
# footage, number of rooms, and number of exterior walls, for example. We have a LOT of different subsets we can take, and each one gets a tree. Then we decide
# how much weight each tree should get (a tree that uses square footage and year built will be more important than a tree only using roof and the flooring data,
# since square footage and year built are really important factors). Then, by combining weights of all of these little trees, we get our model (a forest!).
# Again, Google is your friend.
from sklearn.ensemble import RandomForestRegressor
# model
forest = RandomForestRegressor()
# train (residential first)
forest.fit(residential_train_x, residential_train_y)
# predict
forest_predictions = forest.predict(residential_test_x)
# This is our risk function, which converts the outputs of this regression into risk categories
# low = 1, medium = 2, high = 3 (so if the regressor outputted 0.1, we get 1, if it outputted 0.4, we get 2, and 0.7 would return 3)
def risk_function(risk):
    if risk < 0.25:
        return 1
    elif risk < 0.6:
        return 2
    else:
        return 3
risk_predictions = pd.Series(forest_predictions).apply(risk_function)
# show feature importances
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = residential_train_x.columns).sort_values('Importance', ascending = False))
# Since we are using a regressor, a confusion matrix is useless. We're going to have to test the model ourselves.
# This test predicts a fire for every single high risk house, then compares it to the actual data with a confusion matrix
def high_risk_test(risk):
    if risk == 3:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(high_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("")
print("Percentage of homes in the high risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def medium_risk_test(risk):
    if risk == 2:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(medium_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the medium risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def low_risk_test(risk):
    if risk == 1:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(low_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the low risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
"""Next, we run the Random Forest Regressor on the commercial data. Even on just four variables our model for the commercial data worked extremely well--as you can see below, there are clear divisions between low medium and high risk."""
# model
forest = RandomForestRegressor()
# train (commercial second)
forest.fit(commercial_train_x, commercial_train_y)
# predict
forest_predictions = forest.predict(commercial_test_x)
# This is our risk function, which converts the outputs of this regression into risk categories
# low = 1, medium = 2, high = 3 (so if the regressor outputted 0.1, we get 1, if it outputted 0.4, we get 2, and 0.7 would return 3)
def risk_function(risk):
    if risk < 0.25:
        return 1
    elif risk < 0.6:
        return 2
    else:
        return 3
risk_predictions = pd.Series(forest_predictions).apply(risk_function)
# show feature importances
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = commercial_train_x.columns).sort_values('Importance', ascending = False))
# Since we are using a regressor, a confusion matrix is useless. We're going to have to test the model ourselves.
# This test predicts a fire for every single high risk house, then compares it to the actual data with a confusion matrix
def high_risk_test(risk):
    if risk == 3:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(high_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("")
print("Percentage of homes in the high risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def medium_risk_test(risk):
    if risk == 2:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(medium_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the medium risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def low_risk_test(risk):
    if risk == 1:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(low_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the low risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
"""## Outputting Our Results
If you're just trying to look at our process, stop reading now. Hopefully this helped!
We will output our results using the Random Forest Regressor (the code below outputs predictions for the commercial data, since that is what the regressor was last trained on), but we could use any regression model on either commercial or residential data with modification. Obviously, one would want to modify the train test split, play with cross validation, and optimize the training of a model as much as possible before outputting results.
First, we calculate our risk values (based on our model that we've trained) and append the calculated risk value to our dataframe. Then, we change it to reflect risk categories (1,2, and 3 for low, medium, and high) which can be changed to include more categories if necessary. Then, we output the data for later use.
"""
commercial_cleaned_split.head()
# The word "forest" in the first and second lines can be changed to match whatever regression model we are using (in this case it's the random forest regressor),
# and we used the word forest to designate our model. If we want to output predictions for the commercial data, replace the word "residential" with the word
# "commercial" in the code below.
forest_predictions = pd.Series(forest.predict(commercial_cleaned_split.drop(["Fire_final"], axis=1)))
commercial["Detailed Risk"] = forest_predictions
risk_predictions = forest_predictions.apply(risk_function)
commercial["Risk Level"] = risk_predictions
predictions = commercial[["lat", "lon", "Address", "Fire_final", "Detailed Risk", "Risk Level"]]
# This is how to output/download a csv from collaboratory
# Forces a download when ran
from IPython.display import Javascript
js_download = """
var csv = '%s';
var filename = 'predictions.csv';
var blob = new Blob([csv], { type: 'text/csv;charset=utf-8;' });
if (navigator.msSaveBlob) { // IE 10+
navigator.msSaveBlob(blob, filename);
} else {
var link = document.createElement("a");
if (link.download !== undefined) { // feature detection
// Browsers that support HTML5 download attribute
var url = URL.createObjectURL(blob);
link.setAttribute("href", url);
link.setAttribute("download", filename);
link.style.visibility = 'hidden';
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
}
}
""" % predictions.to_csv(index=False).replace('\n','\\n').replace("'","\'")
Javascript(js_download)
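# (Added alternative, not part of the original notebook.) If this notebook is running in Google Colab, the same CSV can
# also be downloaded without the embedded JavaScript by writing it to disk and using the Colab files helper. Uncomment
# to use this approach instead of the cell above:
# from google.colab import files
# predictions.to_csv('predictions.csv', index=False)
# files.download('predictions.csv')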