# -*- coding: utf-8 -*-
"""Fire_MachineLearning.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1PswZ0yRsQqPIpbQ5iCACisPZUYX59H5l
# Charlottesville Fire Department Project: Machine Learning Predictions
Authors: Jackson Barkstrom, Habib Karaky, Josh Schuck, Garrett Vercoe. We joined together the data we used here in the "Cleaning and Merging" code. The data was originally worked on by many, including us, during Civic Innovation Day (special shoutouts to Stephen and Katharine).
Note: We assume a basic understanding of the data we're working with, but only a little understanding of machine learning. Our code walks through different machine learning models and their possible utility--the big question, of course, is which model we should use to predict fires. A decision tree is a simple model that we probably don't want to use in the end, but we figured it would be useful for teaching purposes.
For now, we settled on the Random Forest Regressor because it performed the best out of all of the regression models we somewhat understood and tried--easily beating out models such as bagging and simple decision trees. The random forest is an extremely powerful model, and it produced useful results both as a classifier and as a regressor. We decided we had to use a regression model because it allows us to make our own risk categories--it returned decimal risk values between 0 and 1 and we could split these up as we desired. We can easily say "put anything below .2 in the low risk category, put everything greater than .6 in the high risk category, and put everything else in the medium risk category," and thus generate low, medium, and high risk categories. However, our model really only works for finding the highest-risk homes: the high risk category is the only one that's really significant.
One of the biggest issues we ran into was predicting the fire risk of a building *after it already caught on fire.* Fire risk would go down since the owners of the building would take more safety precautions in the future, no? However, our model didn't look at multiple fires: it only looked at 0 or 1, whether the house had a fire at least once since 2003 or not. We could change this. We recognize that there's some serious danger here of predicting the past (what we're doing) not being the same as predicting the future.
Despite this shortcoming, our models generalize to most houses well. We found pretty clearly that the square footage and the age of buildings are the largest predictors of fire risk, and that makes sense. Really big old houses are way more likely to catch on fire than small new houses. Our models clearly have flaws and cannot predict everything: they could use a lot more data, such as whether or not the homes have smoke alarms installed, and they don't work well for distinguishing low risk from medium risk. But they work well for high risk. Our models can predict the homes with the highest risks of fire and the fire department can respond accordingly. Our models aren't the best they could be (we recommend improvement), but they could already be used--right now--to inspect the highest risk homes and decrease fire risk in the Charlottesville community. If our model designates a house as "high risk," the house probably needs attention.
Link to cleaning and merging notebook: https://colab.research.google.com/drive/1EPmKwBAJ560MV5pDJYD1e_iQbE0NezB0#scrollTo=phy9ec8A488x
"""
import pandas as pd
import numpy as np
"""## Import Data"""
# Import our joined together data from the "Cleaning and Merging" code
residential = pd.read_csv("https://raw.githubusercontent.com/garrettvercoe/CharlottesvilleFireModel/master/Updated_Residential_Results_cleaned.csv")
commercial = pd.read_csv("https://raw.githubusercontent.com/garrettvercoe/CharlottesvilleFireModel/master/Updated_Commercial_Results_cleaned.csv")
# Examine residential data (feel free not to run this)
residential.head()
# Examine commercial data (feel free not to run this)
commercial.head()
"""## Cleaning"""
# Drop variables we are no longer using for Machine Learning
# Drops latitude, longitude, address, if there was a fire 2003-2016,
# and if there was a fire 2016-. Our 'Fire_final' column still shows
# if there was a fire 2003-2018, and this is what we will train our
# models on. Feel free to modify to your liking.
residential_cleaned = residential.drop(['lat','lon','fire_late','fire_early','Address', 'Type'], axis=1)
commercial_cleaned = commercial.drop(['lat','lon','fire_late','fire_early','Address', 'address', 'Type'], axis=1)
# Clean the data to be ready for an algorithm
# We replace NaN with 0 so missing values effectively become their own category, in case there is a pattern to them
# We make numerical categories with dataframe["colname"].astype("category").cat.codes
residential_cleaned = residential_cleaned.replace(np.nan, 0, regex=True)
commercial_cleaned = commercial_cleaned.replace(np.nan, 0, regex=True)
residential_cleaned["use_type"] = residential_cleaned["use_type"].astype("category").cat.codes
residential_cleaned["use_code"] = residential_cleaned["use_code"].astype("category").cat.codes
residential_cleaned["grade"] = residential_cleaned["grade"].astype("category").cat.codes
residential_cleaned["ext_walls"] = residential_cleaned["ext_walls"].astype("category").cat.codes
residential_cleaned["roof"] = residential_cleaned["roof"].astype("category").cat.codes
residential_cleaned["flooring"] = residential_cleaned["flooring"].astype("category").cat.codes
residential_cleaned["bsmt_type"] = residential_cleaned["bsmt_type"].astype("category").cat.codes
residential_cleaned["heating"] = residential_cleaned["heating"].astype("category").cat.codes
commercial_cleaned["use_type"] = commercial_cleaned["use_type"].astype("category").cat.codes
commercial_cleaned["use_code"] = commercial_cleaned["use_code"].astype("category").cat.codes
# Examine the data
residential_cleaned.head()
# Examine the data
commercial_cleaned.head()
"""## Train Test Split
Note: running the split after the first cell below does not get rid of any variables, which is what we used because it produces the most accurate model. However, there is a chance that this model will be overfitted. We made the second cell give the model only four variables, just to show how important those four variables are in predicting fires, and to show that our model can produce powerful insights on only four variables.
We used 50% of the residential data and 50% of the commercial data to train, but this can easily be changed by editing test_size. That way, if our model tests well on the other 50% of the data, it has high validity.
This first one uses all of the variables. We used it in our final model. Although overfitting is dangerous, there are few variables relative to the size of the dataset.
"""
# train test split (so that we can validate our model)
from sklearn.model_selection import train_test_split
residential_cleaned_split = residential_cleaned
commercial_cleaned_split = commercial_cleaned
"""This second one gets rid of variables. Don't run it if you want to look at all the variables.
For our simplest models we used only four variables and still found good results in finding the highest-risk homes. For residential data we used 1) square footage, 2) year built, 3) whether it has a basement, and 4) total number of rooms. For commercial data we used 1) square footage, 2) year built, 3) use code, and 4) number of stories. We found these were the most significant. The final regression model will definitely need more than just these four variables.
"""
# get rid of all but the most useful variables
residential_cleaned_split = residential_cleaned[["sq_footage_finished_living", "year_built", "basement", "total_rooms", "Fire_final"]]
commercial_cleaned_split = commercial_cleaned[["gross_area", "year_built", "use_code", "number_of_stories", "Fire_final"]]
"""Basic train test split with 50% train 50% split, using data from one of the above three cells"""
# train test split
from sklearn.model_selection import train_test_split
# train/test split the shortened residential data set
residential_train, residential_test = train_test_split(residential_cleaned_split, test_size = 0.5)
# split into x and y
residential_test_x = residential_test.drop('Fire_final', axis=1)
residential_test_y = residential_test['Fire_final']
residential_train_x = residential_train.drop('Fire_final', axis=1)
residential_train_y = residential_train["Fire_final"]
# train/test split the shortened commercial data set
commercial_train, commercial_test = train_test_split(commercial_cleaned_split, test_size = 0.5)
# split into x and y
commercial_test_x = commercial_test.drop('Fire_final', axis=1)
commercial_test_y = commercial_test['Fire_final']
commercial_train_x = commercial_train.drop('Fire_final', axis=1)
commercial_train_y = commercial_train["Fire_final"]
# Examine datatypes (just to check what we're dealing with in our models)
# Everything should say float or int
print(residential_train_x.dtypes)
print(residential_train_y.dtypes)
# Again, examine datatypes
# Everything should say float or int
print(commercial_train_x.dtypes)
print(commercial_train_y.dtypes)
"""## Decision Tree Classifier
First we run our Decision Tree Classifier on the residential data. This is a basic machine learning model that is explained below in the code. It's worth noting that in our case a classifier predicts either 0 or 1, while a regressor will predict a value between 0 and 1. A classifier predicts fires (yes or no, high risk or not), while a regressor predicts more nuanced levels of fire risk (0.2, 0.4, 0.0, 0.9, etc). This decision tree did OK: using only four variables it predicted a little under half of the fires, and ~40% of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire.
"""
# Our decision tree has "branches" that are based on each variable. For example, our most important variable (with the highest gain)
# is sq_footage_finished_living, so our model might decide to make the left side of the first branch >15k sq feet and the right side
# <15k sq feet. Essentially, every variable divides the data, with the most important coming first. Google for more explanation. At
# the smallest divisions in the tree (called "leaf nodes") we will have either a 0 (no probable fire) or 1 (probable fire).
# Say we're at an arbitrary leaf, and we have 10 addresses from our training data that fall from the top of the tree into this
# category. If we know 6 of them had fires, every element in this leaf would be predicted as "1" for a fire (60% accuracy on training).
from sklearn.tree import DecisionTreeClassifier
# model
tree = DecisionTreeClassifier(criterion = "entropy")
# train (residential first)
tree.fit(residential_train_x, residential_train_y)
# predict
tree_predictions = tree.predict(residential_test_x)
# This prints the information gain for each feature (very valuable)
print(pd.DataFrame({'Information Gain': tree.feature_importances_}, index = residential_train_x.columns).sort_values('Information Gain', ascending = False))
print("")
confusion = pd.crosstab(residential_test_y, tree_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(tree, residential_test_x, residential_test_y, cv=10)
print("Scores are: ", scores)
"""Next, we run our Decision Tree Classifier on the commercial data. This did even better: using only four variables it predicted well over half of the fires, and ~50% of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire."""
# train (commercial second)
tree.fit(commercial_train_x, commercial_train_y)
# predict
tree_predictions = tree.predict(commercial_test_x)
# This prints the information gain for each feature (very valuable)
print(pd.DataFrame({'Information Gain': tree.feature_importances_}, index = commercial_train_x.columns).sort_values('Information Gain', ascending = False))
print("")
confusion = pd.crosstab(commercial_test_y, tree_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(tree, commercial_test_x, commercial_test_y, cv=10)
print("Scores are: ", scores)
"""## Random Forest Classifier
First we run our Random Forest Classifier on the residential data. A random forest is like a decision tree, but a lot more complex (it's literally just a combination of multiple decision trees) and generally a lot more accurate. It's also explained below in the code. This did very well: using only four variables it predicted about 4/10 of the fires, and *~75-80%* of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire.
"""
# The Random Forest method introduces more randomness and diversity by applying the bagging method to the feature space. Bagging, or Bootstrap Aggregating,
# consists of randomly sampling subsets of the training data, fitting a model to these smaller data sets, and aggregating the predictions. That is, instead of
# searching greedily for the best predictors to create branches, it randomly samples elements of the predictor space, thus adding more diversity and reducing the
# variance of the trees at the cost of equal or higher bias.
#
# In plain English, we model decision tree classifiers off of subsets of our data that don't include all the variables. We might have a subset that's just square
# footage, number of rooms, and number of exterior walls, for example. We have a LOT of different subsets we can take, and each one gets a tree. Then we decide
# how much weight each tree should get (a tree that uses square footage and year built will be more important than a tree only using roof and the flooring data,
# since square footage and year built are really important factors). Then, by combining weights of all of these little trees, we get our model (a forest!).
# Again, Google is your friend.
from sklearn.ensemble import RandomForestClassifier
# model
forest = RandomForestClassifier()
# train (residential first)
forest.fit(residential_train_x, residential_train_y)
# predict
forest_predictions = forest.predict(residential_test_x)
# show feature importances (very valuable, and basically the same meaning to us as information gain in a decision tree)
# shows how important each variable is
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = residential_train_x.columns).sort_values('Importance', ascending = False))
print("")
confusion = pd.crosstab(residential_test_y, forest_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(forest, residential_test_x, residential_test_y, cv=10)
print("Scores are: ", scores)
"""Next, we run our Random Forest Classifier on the commercial data. This also did very well: using only four variables it predicted a well over half of the fires (better than residential), and ~60% (worse than residential) of the time when it said there was a very high risk of fire, there was a fire. Note that 1.0 in the confusion matrix corresponds to fire, and 0 to no fire."""
# train (commercial second)
forest.fit(commercial_train_x, commercial_train_y)
# predict
forest_predictions = forest.predict(commercial_test_x)
# show feature importances (very valuable, and basically the same meaning to us as information gain in a decision tree)
# shows how important each variable is
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = commercial_train_x.columns).sort_values('Importance', ascending = False))
print("")
confusion = pd.crosstab(commercial_test_y, forest_predictions, rownames=['Actual'], colnames = ['Predicted:'], margins = True)
# Note that the success rate isn't a good metric, because obviously our model would be very accurate if it just predicted no fires.
# If we could have 60% accuracy but predict every fire, that's a lot better than 95% accuracy and predicting half of the fires!
print("Total success rate: " + str((confusion.iloc[0,0] + confusion.iloc[1,1]) / confusion.iloc[2,2]))
# This is a confusion matrix, with 1 corresponding to fire and 0 corresponding to no fire
print(confusion)
print("")
print("When this model said there was high fire risk, there was a fire " + str(confusion.iloc[1,1]/confusion.iloc[2,1]) + " percent of the time.")
# This is a test of cross validation... for our model to be consistent these scores need to be close every time.
# Obviously, this tests on success rate, which is a pretty useless metric for what we're predicting, but it shows
# the presence or absence of consistency.
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
scores = cross_val_score(forest, commercial_test_x, commercial_test_y, cv=10)
print("Scores are: ", scores)
"""## Random Forest Regressor
First we run our Random Forest Regressor on the residential data. It's again worth noting that a regressor returns values between 0 and 1 (e.g. 0.226) as opposed to just 0's and 1's. Otherwise, this model is pretty much the same as the Random Forest Classifier. This model worked best for us--better than models such as a decision tree regressor or a bagging regressor. For now this will be our final model (because it is a regressor it allows us to classify houses according to risk, which is much more in line with the task at hand). Using only four variables it was able to do very well at predicting a high risk category (see below), but more variables were useful in distinguishing between medium risk and low risk. See the code and the output for more information.
Modify the function risk_function to modify how we convert into risk categories. For more explanation of our model see the cell below.
"""
# This works just like the Random Forest Classifier, only it's a Regressor. To explain, I took the description of the Random Forest Classifier and replaced the
# word "Classifier" with "Regressor" (see below). A decision tree regressor (what this is made of) is a decision tree that has decimal numbers between 0 and 1
# at the leaf nodes instead of just 0 or 1. Say we're at an arbitrary leaf, and we have 10 addresses from our training data that fall into this category.
# If we know 6 of them had fires, every element in this leaf would probably be predicted as a decently high decimal value (since we have 6/10 fires).
#
# The Random Forest method introduces more randomness and diversity by applying the bagging method to the feature space. Bagging, or Bootstrap Aggregating,
# consists of randomly sampling subsets of the training data, fitting a model to these smaller data sets, and aggregating the predictions. That is, instead of
# searching greedily for the best predictors to create branches, it randomly samples elements of the predictor space, thus adding more diversity and reducing the
# variance of the trees at the cost of equal or higher bias.
#
# In plain English, we model decision tree regressors off of subsets of our data that don't include all the variables. We might have a subset that's just square
# footage, number of rooms, and number of exterior walls, for example. We have a LOT of different subsets we can take, and each one gets a tree. Then we decide
# how much weight each tree should get (a tree that uses square footage and year built will be more important than a tree only using roof and the flooring data,
# since square footage and year built are really important factors). Then, by combining weights of all of these little trees, we get our model (a forest!).
# Again, Google is your friend.
from sklearn.ensemble import RandomForestRegressor
# model
forest = RandomForestRegressor()
# train (residential first)
forest.fit(residential_train_x, residential_train_y)
# predict
forest_predictions = forest.predict(residential_test_x)
# This is our risk function, which converts the outputs of this regression into risk categories
# low = 1, medium = 2, high = 3 (so if the regressor outputted 0.1, we get 1, if it outputted 0.4, we get 2, and 0.7 would return 3)
def risk_function(risk):
    if risk < 0.25:
        return 1
    elif risk < 0.6:
        return 2
    else:
        return 3
risk_predictions = pd.Series(forest_predictions).apply(risk_function)
# show feature importances
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = residential_train_x.columns).sort_values('Importance', ascending = False))
# Since we are using a regressor, a confusion matrix is useless. We're going to have to test the model ourselves.
# This test predicts a fire for every single high risk house, then compares it to the actual data with a confusion matrix
def high_risk_test(risk):
    if risk == 3:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(high_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("")
print("Percentage of homes in the high risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def medium_risk_test(risk):
    if risk == 2:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(medium_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the medium risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def low_risk_test(risk):
    if risk == 1:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(low_risk_test)
confusion = pd.crosstab(residential_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the low risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
"""Next, we run the Random Forest Regressor on the commercial data. Even on just four variables our model for the commercial data worked extremely well--as you can see below, there are clear divisions between low medium and high risk."""
# model
forest = RandomForestRegressor()
# train (commercial second)
forest.fit(commercial_train_x, commercial_train_y)
# predict
forest_predictions = forest.predict(commercial_test_x)
# This is our risk function, which converts the outputs of this regression into risk categories
# low = 1, medium = 2, high = 3 (so if the regressor outputted 0.1, we get 1, if it outputted 0.4, we get 2, and 0.7 would return 3)
def risk_function(risk):
    if risk < 0.25:
        return 1
    elif risk < 0.6:
        return 2
    else:
        return 3
risk_predictions = pd.Series(forest_predictions).apply(risk_function)
# show feature importances
print(pd.DataFrame({'Importance': forest.feature_importances_}, index = commercial_train_x.columns).sort_values('Importance', ascending = False))
# Since we are using a regressor, a confusion matrix is useless. We're going to have to test the model ourselves.
# This test predicts a fire for every single high risk house, then compares it to the actual data with a confusion matrix
def high_risk_test(risk):
    if risk == 3:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(high_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("")
print("Percentage of homes in the high risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def medium_risk_test(risk):
    if risk == 2:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(medium_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the medium risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
def low_risk_test(risk):
    if risk == 1:
        return 1
    else:
        return 0
fire_predictions = risk_predictions.apply(low_risk_test)
confusion = pd.crosstab(commercial_test_y, np.array(fire_predictions), rownames=['Actual'], colnames = ['Predicted:'], margins = True)
print("Precentage of homes in the low risk category with fires: " + str(confusion.iloc[1,1]/confusion.iloc[2,1]))
"""## Outputting Our Results
If you're just trying to look at our process, stop reading now. Hopefully this helped!
We will output our results using the Random Forest Regressor (the code below outputs predictions for the commercial data, since that is what the regressor was last trained on), but we could use any regression model on either commercial or residential data with modification. Obviously, one would want to modify the train test split, play with cross validation, and optimize the training of a model as much as possible before outputting results.
First, we calculate our risk values (based on our model that we've trained) and append the calculated risk value to our dataframe. Then, we change it to reflect risk categories (1,2, and 3 for low, medium, and high) which can be changed to include more categories if necessary. Then, we output the data for later use.
"""
commercial_cleaned_split.head()
# The word "forest" in the first and second lines can be changed to match whatever regression model we are using (in this case it's the random forest regressor),
# and we used the word forest to designate our model. If we want to output predictions for the commercial data, replace the word "residential" with the word
# "commercial" in the code below.
forest_predictions = pd.Series(forest.predict(commercial_cleaned_split.drop(["Fire_final"], axis=1)))
commercial["Detailed Risk"] = forest_predictions
risk_predictions = forest_predictions.apply(risk_function)
commercial["Risk Level"] = risk_predictions
predictions = commercial[["lat", "lon", "Address", "Fire_final", "Detailed Risk", "Risk Level"]]
# This is how to output/download a csv from collaboratory
# Forces a download when ran
from IPython.display import Javascript
js_download = """
var csv = '%s';
var filename = 'predictions.csv';
var blob = new Blob([csv], { type: 'text/csv;charset=utf-8;' });
if (navigator.msSaveBlob) { // IE 10+
navigator.msSaveBlob(blob, filename);
} else {
var link = document.createElement("a");
if (link.download !== undefined) { // feature detection
// Browsers that support HTML5 download attribute
var url = URL.createObjectURL(blob);
link.setAttribute("href", url);
link.setAttribute("download", filename);
link.style.visibility = 'hidden';
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
}
}
""" % predictions.to_csv(index=False).replace('\n','\\n').replace("'","\'")
Javascript(js_download)
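# (Added alternative, not part of the original notebook.) If this notebook is running in Google Colab, the same CSV can
# also be downloaded without the embedded JavaScript by writing it to disk and using the Colab files helper. Uncomment
# to use this approach instead of the cell above:
# from google.colab import files
# predictions.to_csv('predictions.csv', index=False)
# files.download('predictions.csv')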