---
title: "Baltimore Life Expectancy"
author: "Kayode Sosina"
date: "September 12, 2016"
references:
- URL: http://www.jstor.org/stable/1400528
  author:
  - family: Dutilleul
    given: Pierre
  - family: Stockwell
    given: Jason
  - family: Frigon
    given: Dominic
  - family: Legendre
    given: Pierre
  container-title: Journal of Agricultural, Biological, and Environmental Statistics
  id: mantel
  issued:
    month: 6
    year: 2000
  page: 131-150
  publisher: International Biometric Society
  title: 'The Mantel Test versus Pearson''s Correlation Analysis: Assessment of the
    Differences for Biological and Environmental Studies'
  type: article-journal
  volume: 5
- URL: http://www.jstor.org/stable/2332142
  author:
  - family: Moran
    given: Patrick Alfred Pierce
  container-title: Biometrika
  id: moran
  issued:
    month: 6
    year: 1950
  page: 17-23
  publisher: Oxford University Press on behalf of Biometrika Trust
  title: Notes on Continuous Stochastic Phenomena
  type: article-journal
  volume: 37
- author:
  - family: Mantel
    given: Nathan
  container-title: American Association for Cancer Research.
  id: mantel1
  issued:
    month: 9
    year: 1966
  title: The Detection of Disease Clustering and a Generalized Regression Approach
  type: article-journal
- author:
  - family: Fotheringham
    given: A. Stewart
  - family: Brunsdon
    given: Chris
  - family: Charlton
    given: Martin
  publisher: Wiley
  id: fother
  issued:
    year: 2002
  title: 'Geographically Weighted Regression: The Analysis of Spatially Varying Relationships'
  type: book
- URL: http://www.who.int/healthinfo/indicators/hsi_indicators_SDG_TechnicalMeeting_December2015_BackgroundPaper.pdf
  id: who
  issued:
    year: 2014
  title: An overarching health indicator for the Post-2015 Development Agenda
  type: article
- URL: http://hdr.undp.org/en/content/human-development-index-hdi
  id: hdi
  title: Human Development Index (HDI)
- URL: http://dx.doi.org/10.18637/jss.v063.i17
  author:
  - family: Gollini
    given: Isabella
  - family: Lu
    given: Binbin
  - family: Charlton
    given: Martin
  - family: Brunsdon
    given: Christopher
  - family: Harris
    given: Paul
  container-title: Journal of Statistical Software
  id: gwmodel
  issued:
    month: 2
    year: 2015
  title: 'GWmodel: An R Package for Exploring Spatial Heterogeneity Using Geographically
    Weighted Models'
  type: article-journal
  volume: 63
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


Housekeeping
-------------------

We need to create the folders that will hold the R code, data, text and plots.

```{r directories, include=FALSE}
dir_create <- function(){
  dir_names <- c("R code", "Data", "Text", "Plots")
  sapply(dir_names, function(x) {
    if(!file.exists(x)){
      dir.create(x)
    }
  })
}


dir_create()
```


```{r housekeeping, include=FALSE, message=FALSE, eval=T}

match.package <- function(){
  
  #list of packages that will be used
  list.of.packages <- c("ggplot2", "Rcpp", "lubridate", "downloader", 
                        "readr", "readxl", "maptools", "RColorBrewer",
                        "ggmap", "rgeos", "broom", "rgdal", "grDevices",
                        "animation", "ade4", "sp", "ape", "geosphere", "dplyr",
                        "plyr", "pryr", "tidyr", "gstat", "spdep", "spgwr",
                        "GWmodel", "ModelMap", "acs", "tigris", "gridExtra",
                        "devtools", "cvTools") 
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages) > 0) {
    install.packages(new.packages,repos = "http://cran.us.r-project.org")
    # stop(paste("Please install the following packages", paste0(new.packages,collapse = " ")))
    # print(paste("The following packages are missing", new.packages))
    # x <- readline("Would you like to install them now?[y/n] >")
    # if (any(x %in% c("y", "n")) & x == "y")
    # {
    #   install.packages(new.packages)
    # }
    # else if (!any(x %in% c("y", "n")))
    #   print("Please enter y or n")
    # else
    #   stop(paste("Please install the following packages", paste0(new.packages,collapse = " ")))
  }
  
}

match.package()

```

Data
------------------

We have data from the [Baltimore City website](https://data.baltimorecity.gov), the [Baltimore Neighborhood Indicators Alliance (BNIA-JFI)](http://bniajfi.org), the [Maryland Department of Planning](http://www.mdp.state.md.us/), and the [Census Bureau](http://www.census.gov). The data consist of life expectancy estimates for each neighbourhood, along with crime, economic development and education information, all over a 5-year period (2010-2014). I also have street-level and [block group](https://www.census.gov/geo/reference/gtc/gtc_bg.html)-level data. 

```{r data_download, include=FALSE, message=FALSE}

setwd(file.path(".", "Data"))  
packages <- c("ggplot2","lubridate", "downloader", 
              "readr", "readxl", "maptools", "RColorBrewer", "ggmap", "devtools")
sapply(packages, library, character.only = T, quietly = T)

Real_Property_Taxes <- "https://data.baltimorecity.gov/api/views/27w9-urtv/rows.csv?accessType=DOWNLOAD"
Parks <- "https://data.baltimorecity.gov/api/views/3r8a-uawz/rows.csv?accessType=DOWNLOAD"
Religious_Buildings <- "https://data.baltimorecity.gov/api/views/kbdc-bpw3/rows.csv?accessType=DOWNLOAD"
Libraries <- "https://data.baltimorecity.gov/api/views/tgtv-wr5u/rows.csv?accessType=DOWNLOAD"
Liquor_Licenses <- "https://data.baltimorecity.gov/api/views/xv8d-bwgi/rows.csv?accessType=DOWNLOAD"
Customer_Service_Requests_311 <- "https://data.baltimorecity.gov/api/views/9agw-sxsr/rows.csv?accessType=DOWNLOAD"
Assisted_Living_Facilities <- "https://data.baltimorecity.gov/api/views/q2vm-e9dp/rows.csv?accessType=DOWNLOAD"
Adult_Day_Care_Facilities <- "https://data.baltimorecity.gov/api/views/yc75-xbrv/rows.csv?accessType=DOWNLOAD"
Nursing_Homes <- "https://data.baltimorecity.gov/api/views/53js-3bkd/rows.csv?accessType=DOWNLOAD"
Census_Profile_by_Neighborhood_Statistical_Areas_2010 <- "https://data.baltimorecity.gov/api/views/5iam-bd6p/rows.csv?accessType=DOWNLOAD"
Census_Demographics_2010 <- "https://data.baltimorecity.gov/api/views/cix3-h4cy/rows.csv?accessType=DOWNLOAD"
Neighborhood_Action_Sense_of_Community_2010 <- "https://data.baltimorecity.gov/api/views/ipje-efsv/rows.csv?accessType=DOWNLOAD"
Real_Property <- "http://gisdata.baltimorecity.gov/datasets/b41551f53345445fa05b554cd77b3732_0.csv"
CSA_to_NSA_2010 <- "http://bniajfi.org/wp-content/uploads/2014/04/CSA-to-NSA-2010.xlsx"
Census_Blocks_and_NSAs_2010 <- "http://bniajfi.org/wp-content/uploads/2014/04/Census-Blocks-and-NSAs-2010.xlsx"
BPD_Part_1_Victim_Based_Crime_Data <- "https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD"
Vacant_Buildings <- "https://data.baltimorecity.gov/api/views/qqcv-ihn5/rows.csv?accessType=DOWNLOAD"

#####     All 2010 to 2014 data     #####

# Noticed that demo data for 2010-2014 has only partial info
Census_Demographics_2010_to_2014 <- "https://data.baltimorecity.gov/api/views/t7sb-aegk/rows.csv?accessType=DOWNLOAD"
Census_Demographics_2010_complete <- "https://data.baltimorecity.gov/api/views/cix3-h4cy/rows.csv?accessType=DOWNLOAD"
Census_Demographics_2010_and_2012 <- "https://data.baltimorecity.gov/api/views/yp84-wh4q/rows.csv?accessType=DOWNLOAD"
Census_Demographics_2010_and_2013 <- "https://data.baltimorecity.gov/api/views/7pnq-8ebe/rows.csv?accessType=DOWNLOAD"


Children_and_Family_Health_Well_Being_2010_to_2014 <- "https://data.baltimorecity.gov/api/views/rtbq-mnni/rows.csv?accessType=DOWNLOAD"
Housing_and_Community_Development_2010_to_2014 <- "https://data.baltimorecity.gov/api/views/mvvs-32jm/rows.csv?accessType=DOWNLOAD"
Crime_Safety_2010_to_2014 <- "https://data.baltimorecity.gov/api/views/qmw9-b8ep/rows.csv?accessType=DOWNLOAD"
Workforce_and_Economic_Development_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Workforce-2010-2014.xlsx"
Arts_and_Culture_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Arts-2011-2014.xlsx"
Education_and_Youth_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Education-2010-2014.xlsx"
Sustainability_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Sustainability-2010-2014.xlsx"

#####      CODEBOOK     #####
BNIA_Vital_Signs_Codebook <- "https://data.baltimorecity.gov/api/views/ryvy-9zw6/rows.csv?accessType=DOWNLOAD"
if(!file.exists(file.path("..", "Text","codebook.csv"))){
  switch(Sys.info()[['sysname']],
         Windows= {download(url = BNIA_Vital_Signs_Codebook, destfile = file.path("..", "Text","codebook.csv"),
                            # method = "wget",
                            mode="wb")},
         Linux  = {download.file(url = BNIA_Vital_Signs_Codebook, destfile = file.path("..", "Text","codebook.csv"),
                                 method = "wget",
                                 mode="wb")},
         Darwin = {download.file(url = BNIA_Vital_Signs_Codebook, destfile = file.path("..", "Text","codebook.csv"),
                                 method = "wget",
                                 mode="wb")})
}


#####     Download the files and save the download dates    #####
####      1 Neighbourhood data      ####
# data_path <- file.path(".", "Data")
data_names <- c("census.csv", "child_and_fam_wellbeing.csv", "housing.csv",
                "crime.csv", "workforce.xlsx", "culture.xlsx", "edu_and_youth.xlsx", 
                "sustain.xlsx", "census10.csv", "census12.csv", "census13.csv")
data_urls <- c(Census_Demographics_2010_to_2014, Children_and_Family_Health_Well_Being_2010_to_2014,
               Housing_and_Community_Development_2010_to_2014, Crime_Safety_2010_to_2014,
               Workforce_and_Economic_Development_2010_to_2014, Arts_and_Culture_2010_to_2014,
               Education_and_Youth_2010_to_2014, Sustainability_2010_to_2014, Census_Demographics_2010_complete,
               Census_Demographics_2010_and_2012, Census_Demographics_2010_and_2013)


mapply(function(x,y) {
  if(!file.exists("raw_data")){
    dir.create("raw_data")
  }
  if(!file.exists(file.path(".", "raw_data",y))){
    
    switch(Sys.info()[['sysname']],
           Windows= {download(url = x, destfile = file.path(".", "raw_data",y),
                                   # method = "wget",
                                   mode="wb")},
           Linux  = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                                   method = "wget",
                                   mode="wb")},
           Darwin = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                                   method = "wget",
                                   mode="wb")})
    
    
    date_downloaded <- now()
    write.table(date_downloaded, file.path(".", "raw_data", "date_downloaded.txt"))
  }
},data_urls,data_names)

#####     2 Data that could ID Blocks within a Neighbourhood     #####
data_names <- c("csa_nsa.xlsx", "blocks_nsa.xlsx","property.csv", "parks.csv", "religious.csv",
                "libraries.csv", "cust_311.csv", "real_property.csv","street_crime.csv", "vacants.csv") 

#cust_311 has zip, address and neighbourhood

data_urls <- c(CSA_to_NSA_2010,Census_Blocks_and_NSAs_2010,
               Real_Property_Taxes, Parks, Religious_Buildings, 
               Libraries, Customer_Service_Requests_311,Real_Property,
               BPD_Part_1_Victim_Based_Crime_Data, Vacant_Buildings)


mapply(function(x,y) {
  if(!file.exists("raw_data")){
    dir.create("raw_data")
  }
  v <- list.files(file.path(".", "raw_data"), pattern = "*.csv")
  lv <- length(v)
  if(!file.exists(file.path("raw_data",y)) & lv < 15){
    switch(Sys.info()[['sysname']],
           Windows= {download(url = x, destfile = file.path(".", "raw_data",y),
                  # method = "wget",
                  mode="wb")},
           Linux  = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                  method = "wget",
                  mode="wb")},
           Darwin = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                  method = "wget",
                  mode="wb")})
    date_downloaded <- now()
    write.table(date_downloaded, file.path(".", "raw_data", "date_downloaded.txt"))
    if(file.info(file.path("raw_data",y))$size*1e-6 > 30 )
      system(paste("gzip", file.path(".", "raw_data",y)))
  }
},data_urls,data_names)


####      Shape_files     ####
zip_Census_Demographics_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Census.zip"
zip_Children_and_Family_Health_Well_Being_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Health.zip"
zip_Housing_and_Community_Development_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Housing.zip"
zip_Crime_Safety_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Crime.zip"

zip_Workforce_and_Economic_Development_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Workforce.zip"
zip_Arts_and_Culture_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Arts.zip"
zip_Education_and_Youth_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Education.zip"
zip_Sustainability_2010_to_2014 <- "http://bniajfi.org/wp-content/uploads/2016/04/VS-14-Sustainability.zip"
zip_census_block_2010 <- "http://www.mdp.state.md.us/msdc/census/cen2010/maps/tiger10/blk2010.zip" # 2010 shapefile showing block info
zip_census_tract_2010 <- "http://planning.maryland.gov/msdc/census/cen2010/maps/tiger10/ct2010.zip" # 2010 shapefile showing tract info
zip_neighbour <- "https://data.baltimorecity.gov/download/ysi8-7icr/application%2Fzip"

data_names <- c("census.zip", "child_and_fam_wellbeing.zip", "housing.zip",
                "crime.zip", "workforce.zip", "culture.zip", "edu_and_youth.zip", 
                "sustain.zip", "census_tract.zip", 
                "census_blk.zip",
                "neighbour.zip")
data_urls <- c(zip_Census_Demographics_2010_to_2014, zip_Children_and_Family_Health_Well_Being_2010_to_2014,
               zip_Housing_and_Community_Development_2010_to_2014, zip_Crime_Safety_2010_to_2014,
               zip_Workforce_and_Economic_Development_2010_to_2014, zip_Arts_and_Culture_2010_to_2014,
               zip_Education_and_Youth_2010_to_2014, zip_Sustainability_2010_to_2014, zip_census_tract_2010, 
               zip_census_block_2010,
               zip_neighbour)


mapply(function(x,y) {
  if(!file.exists("raw_data")){
    dir.create("raw_data")
  }
  if(!file.exists(file.path("raw_data",y))){
    switch(Sys.info()[['sysname']],
           Windows= {download(url = x, destfile = file.path(".", "raw_data",y),
                  # method = "wget",
                  mode="wb")},
           Linux  = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                  method = "wget",
                  mode="wb")},
           Darwin = {download.file(url = x, destfile = file.path(".", "raw_data",y),
                  method = "wget",
                  mode="wb")})
    date_downloaded <- now()
    write.table(date_downloaded, file.path("raw_data", "date_downloaded.txt"))
  }
},data_urls,data_names)

###   Extract the zip files   ###

##All except neighbour and block data
sapply(data_names[1:9], function(x){
  if(!file.exists("wip")){
    dir.create("wip")
  }
  no_files_expected <- 7*length(data_names)
  
  if (length(list.files(file.path(".","wip"))) < no_files_expected){
    unzip(file.path(".","raw_data", x), 
          # files = grep("*.shp|*.dbf|*.shx",unzip(file.path(".","raw_data", x), list = T)[,1], value = T),
          exdir = file.path(".","wip"), junkpaths = T)
  }
  
})

##rename them
sapply(list.files(file.path(".","wip")), 
                  function(x){
                    file.rename(from =file.path(".","wip", x), 
                                to = file.path(".","wip", tolower(gsub("^.*?_", "", x, ignore.case = T)) ) )
                  }
)

##    Create a sub dir under WIP for each dataset and move the file into it
shape_dirs <- sapply(list.files(file.path(".","wip")), function(x){
  strsplit(x, "[.]")[[1]][1]
}
)


sapply(shape_dirs, function(x){
  if(!file.exists( file.path(".","wip",x)) ){
    dir.create(file.path(".","wip",x))
    files_to_copy <- list.files(file.path(".","wip"), pattern= paste0(x,"[.]*"))[-1]
    file.copy( paste0(file.path(".","wip", files_to_copy)) ,
               file.path(file.path(".","wip", x)) )
    file.remove(paste0(file.path(".","wip", files_to_copy)))
  }
  else if (file.exists( file.path(".","wip",x))){
    files_to_del <- list.files(file.path(".","wip"), pattern= paste0(x,"[.]*"))[-1]
    file.remove(paste0(file.path(".","wip", files_to_del)))
  }
}
)

#Block data
# ifiles <- unzip(file.path(".","raw_data", "census_blk.zip"), list = T)
# ifiles.name <- substr(ifiles[,1][1], 1, nchar(ifiles[,1][1]) - 4)
# if(!file.exists( file.path(".","wip", ifiles.name ) ) & length(list.files(file.path("..","Plots"))) == 0){
#   dir.create(file.path(".","wip", ifiles.name ))
#   unzip(file.path(".","raw_data", "census_blk.zip"), 
#         files = grep("*[.]", ifiles[,1], value = T), exdir = file.path(".","wip",ifiles.name), junkpaths = T)
# }


#Neighbourhood data
ifiles <- unzip(file.path(".","raw_data", "neighbour.zip"), list = T)
ifiles.name <- substr(ifiles[,1][1], 1, 5)
if(!file.exists( file.path(".","wip", ifiles.name ) )){
  dir.create(file.path(".","wip", ifiles.name ))
  unzip(file.path(".","raw_data", "neighbour.zip"), 
        files = grep("*[.]", ifiles[,1], value = T), exdir = file.path(".","wip",ifiles.name), junkpaths = T)
  file.to.del <- list.files(file.path(".","wip", ifiles.name ), pattern = ".shp|.atx", full.names = T)[-2]
  # system(paste("rm", file.to.del[1], file.to.del[2]))
}


rm(list = ls())

setwd(file.path(".."))


```

The data fall into three general categories.

  1. Street level 
  2. [Block group](https://www.census.gov/geo/reference/gtc/gtc_bg.html) level 
  3. [Community statistical area level (CSA)](http://bniajfi.org/faqs/)

### Data cleaning and interpolation

The following table lists some of the variables used in model fitting, the level at which each was originally available, and the assumptions we made to bring it to the street block level.

| Variable | Description | Level | Cleaning steps |
|:----------:|:-----------:|:-----------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| propfemhh | Proportion of households headed by a female with related children under 18 years | Block group | Since the data was at a block group level and we were interested in getting street block level data, we used [Kriging](https://en.wikipedia.org/wiki/Kriging) to [interpolate](https://en.wikipedia.org/wiki/Interpolation) data at new locations (street block locations) using the information from the block group level. The locations for the street blocks were ascertained as the median longitude and latitude of all streets that made up the street block. One of the assumptions made here was that the distribution of the variable (propfemhh) was smooth in the sense that street blocks with such households will tend to be similar. |
| propkids_withinsurance | The proportion of individuals less than 18 years who have health insurance for a given block group | Block group | Here we used the block group value as the value for each street block in that block group. The assumption here was that block groups would tend to be quite homogeneous with regard to this variable. |
| racdiv | Racial diversity as calculated per block group | Block group | This variable was not given but was estimated from the block group data on race. Its estimation proceeds as follows: calculate each group's share of the population (as a proportion), square each share, sum the squares, and subtract the sum from 1.00. Eight groups were used for the index: White, not Hispanic; Black or African American; American Indian and Alaska Native (AIAN); Asian; Native Hawaiian and Other Pacific Islander (NHOPI); two or more races, not Hispanic; some other race, not Hispanic; Hispanic or Latino. This method is based on that used by the Census Bureau. More information can be found [here](http://www.census.gov/population/cen2000/atlas/censr01-104.pdf). We decided not to interpolate these values for the street blocks but instead used the values from the block groups that they belonged to. This was done due to the unique structure of neighborhoods in Baltimore city. |
| propbelow | Proportion of individuals within a block group who live below the poverty line | Block group | To get the data at the street block level we interpolated values from the block group level. The assumption used here is that the further into a particular neighborhood you go, the more representative each block is of the aggregate level data for this variable. |
| mhhi | Median household income | Block group | We interpolated the values for the street blocks from the block group level data using Kriging. Again this is based on the assumption that the further into a particular neighborhood you go, the more representative each block is of the aggregate level data for this variable. |
| totalincidents  | Number of crime incidents per street | Street | We aggregated this to get the number of crimes committed per street block. |
| prop.vacant | Proportion of vacant homes | Street  | We divided the number of vacant homes per street block by the total number of homes in that street block.  |

The remaining variables used in the final model are: Percentage of Students Suspended or Expelled During School Year (susp); Liquor Outlet density per 1,000 Residents (liquor); and Percent of Residences Heated by Electricity (elheat). These three variables were observed at the CSA level. I did not interpolate them to the street block level, as I felt that the assumptions inherent in the process would be untenable.
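
The diversity index described for `racdiv` in the table can be sketched in a few lines of base R. This is only an illustration of the arithmetic: the helper name `diversity_index` and the counts are made up and do not come from the project data.

```r
# Hypothetical helper: diversity index = 1 - sum of squared group shares
diversity_index <- function(counts) {
  shares <- counts / sum(counts)  # share of the population in each group
  1 - sum(shares^2)               # subtract the sum of squares from 1
}

# Made-up counts for the eight groups in one block group (illustrative only)
counts <- c(white_nh = 400, black = 400, aian = 0, asian = 100,
            nhopi = 0, two_or_more_nh = 0, other_nh = 0, hispanic = 100)

diversity_index(counts)  # shares 0.4, 0.4, 0.1, 0.1 give 1 - 0.34 = 0.66
```

A block group made up of a single group scores 0, and the index rises as the population is spread more evenly across groups.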

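The two street-level steps in the table (locating a street block at the median longitude and latitude of its streets, and computing `prop.vacant` as vacant homes over total homes) might look roughly like this in base R. The data frame and its column names are invented for the sketch and are not the project's actual objects.

```r
# Made-up street-level records (illustrative only; not project data)
streets <- data.frame(
  block  = c("B1", "B1", "B1", "B2", "B2"),
  lon    = c(-76.61, -76.62, -76.60, -76.59, -76.57),
  lat    = c(39.29, 39.30, 39.31, 39.28, 39.30),
  homes  = c(10, 20, 10, 5, 15),
  vacant = c(1, 4, 3, 0, 5)
)

# Block location: median longitude and latitude of the streets in the block
block_loc <- aggregate(cbind(lon, lat) ~ block, data = streets, FUN = median)

# Vacancy: vacant homes divided by total homes within each block
block_tot <- aggregate(cbind(homes, vacant) ~ block, data = streets, FUN = sum)
block_tot$prop.vacant <- block_tot$vacant / block_tot$homes

# One row per street block with its location and vacancy proportion
block_summary <- merge(block_loc, block_tot[, c("block", "prop.vacant")],
                       by = "block")
block_summary
```

The median is used for the location, as in the table, so a single badly geocoded street does not pull the block's position far from the rest.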

```{r analyses_data, echo = F, message=FALSE, warning=F}
##########################
## Warning: this takes a while to run
##########################

#Create all data files to be used during analysis
setwd(file.path(".","Data"))


# stop(print(substr(getwd(), 37, nchar(getwd()))))

#First dataset is Census. 1. 2010-2014 has missing info for 2010, 2012, and 2013
census10.14 <- readr::read_csv(file.path("raw_data", "census.csv"))
census10 <- readr::read_csv(file.path("raw_data", "census10.csv"))
census10.12 <- readr::read_csv(file.path("raw_data", "census12.csv"))
census10.13 <- readr::read_csv(file.path("raw_data", "census13.csv"))

#First rename columns to match
names(census10)[c(1:2,5:6, 8:10,12,13:14, 16:17)] <- c(names(census10.14)[c(1:2,5:7,9,8,10,13:16)])
names(census10)[15] <- "age44_10"
names(census10.13)[1] <- names(census10.14)[1]

#Second merging datasets
library(dplyr);library(lubridate)

#Some values of hhs10 and mhhi13 in census10.13 are prefixed with * (and contain commas), so strip those characters
census10.13 %>% dplyr::arrange_(.dots=names(census10.13)[1:20]) %>%
  dplyr::mutate(hhs10 = as.numeric(gsub("\\*|\\* |,", "", hhs10)),
                mhhi13 = as.numeric(gsub("\\*|\\* |,", "", mhhi13))) -> census10.13

#Values of mhhi12 in census10.12 are prefixed with $, so strip it
census10.12  %>% dplyr::arrange_(.dots=names(census10.12)[1:20]) %>%
  dplyr::mutate(mhhi12 = gsub("\\$", "", mhhi12), mhhi12 = as.numeric(mhhi12)) -> census10.12


#Check which variables are the same across the datasets
census10 %>% arrange_(.dots=names(census10)[1:20]) %>%
  select_(.dots = names(census10)[2:20]) %>%
  as.data.frame() %>% apply(2, summary) -> c1
census10.12 %>%
  select_(.dots = names(census10.12)[2:20]) %>%
  as.data.frame() %>% apply(2, summary) -> c2
census10.13 %>%
  select_(.dots = names(census10.13)[2:20]) %>%
  mutate_each(funs(as.numeric)) %>% apply(2, summary)  -> c3
census10.14 %>% arrange_(.dots=names(census10.14)[1:20]) %>%
  select_(.dots = names(census10.14)[2:20][-16]) %>% apply(2, summary) -> c4

#IDs which variables are the same, note that I excluded hhs10 from both c2,and c3 since it is NA in c4
# identical(c1,c2);identical(c2,c3);identical(c2[,-16],c4);identical(c3[,-16],c4)

# sapply(1:19, function(x){
#   identical(c1[,x],c2[,x])
# })


#So we can join c2(census10.12) and c3(census10.13) on the first 20 variables c.f  names(census10.12)[names(census10.12) %in% names(census10.13)]

census10.13$racdiv10 <- as.numeric(census10.12$racdiv10)

census12_14 <- inner_join(census10.12,census10.13) %>% 
  inner_join(census10.14, by = names(census10.14)[2:20][-16]) %>% #join on 19 variables after we exclude hhs10
  dplyr::select(-CSA2010.y,-hhs10.y, CSA2010 = CSA2010.x)
rm(census10.12, census10.13, census10.14, c1,c2,c3,c4)


#### ID which neighborhoods fall into a CSA

##Load data from real properties that contains block info

data <- readr::read_csv(file.path("raw_data", "property.csv.gz"))

#get count of houses per block
data %>% dplyr::group_by(Neighborhood,Block) %>%
  dplyr::summarise(count_house = n()) -> houses

dat1 <- subset(data, select = names(data)[c(2:3,6,8,9,11,13:15)])
subset(data, select = c("Block", "Neighborhood", "Location")) -> data

data <- na.omit(data)

new_loc <-  sapply(data$Location, function(x) {
  y <- substr(x, start = 2, stop = nchar(x) - 1)
  strsplit(y, ", ")[[1]]
}
)
new_loc <- t(new_loc)
data <- data.frame(data[,c(1,2)], lon = as.numeric(new_loc[,2]), lat = as.numeric(new_loc[,1]))
dat1$lon <- NA;dat1$lat <- NA
dat1[!is.na(dat1$Neighborhood),] <- data.frame(dat1[!is.na(dat1$Neighborhood),][,1:9],
                                               lon = as.numeric(new_loc[,2]), 
                                               lat = as.numeric(new_loc[,1]))


#Step 1
### Check if Lng and Lat fall inside polygons from ESRI Shape file for Child and wellbeing (this has the outcome)
dat.le <-  rgdal::readOGR(file.path("wip", "health"), "health", verbose = F)
csa <- as.character(dat.le$CSA2010)
dat.le <- sp::spTransform(dat.le, sp::CRS("+proj=longlat +datum=WGS84")) #SpatialPolygonsDataFrame


# Promote the data frame to a SpatialPointsDataFrame by assigning its coordinates
sp::coordinates(data) <- ~lon + lat #SpatialPointsDataFrame
real_prop <- data

# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
sp::proj4string(data) <- sp::proj4string(dat.le)

sp::over(dat.le, data, returnList = T) -> neighbhd_csa #gives which Neighborhood belongs to what CSA

names(neighbhd_csa) <- csa

#Gives a dataframe with cols CSA, Blocks, and Neighborhood so I can match block level info to CSA
neighbhd_csa <- plyr::ldply (neighbhd_csa, data.frame) 
clnames <- names(neighbhd_csa)
clnames[1] <- "CSA"
names(neighbhd_csa) <- clnames

#Check if neighborhoods are the same, they should be! (note that the lengths are diff since dat1 has NA, but no matter)
identical(sort(unique(dat1$Neighborhood)), sort(unique(neighbhd_csa$Neighborhood)))

#Careful file size for join without summary is ~2GB!!!
inner_join(neighbhd_csa, dat1) %>%
  dplyr::mutate(CityTax = gsub("\\$", "", CityTax), CityTax = as.numeric(CityTax),
                StateTax = gsub("\\$", "", StateTax), StateTax = as.numeric(StateTax),
                AmountDue = gsub("\\$", "", AmountDue), AmountDue = as.numeric(AmountDue)) %>%
  dplyr::group_by(CSA, Neighborhood, Block) %>%
  dplyr::summarise(CityTax.med = median(CityTax, na.rm = T), StateTax.med = median(StateTax, na.rm = T), 
                   AmountDue.med = median(AmountDue, na.rm = T), lon.med = median(lon, na.rm = T), lat.med = median(lat, na.rm = T)) -> csa.prop

# pryr::object_size(csa.prop)


#Outcome data


health <- readr::read_csv(file.path(".", "raw_data", "child_and_fam_wellbeing.csv"))
clnames <- names(health)
clnames[1] <- "CSA"
names(health) <- clnames


gsub("_[[:digit:]]*","",names(health))[-1] -> variables
variables[variables == "mort1"] <- "mort01"
variables <- sapply(variables, function(x){
  substr(x, start = 1, stop = nchar(x) - 2)
})
unname(variables) -> variables
unique(variables) -> var.names

setdiff(names(health), grep(paste0("mort"), names(health), value = T)) ->rm.mort

#Change from short to long

health.long <- lapply(var.names[var.names!= "mort"], function(x){
  #get the columns
  columns <- grep(paste0(x),rm.mort, value = T)
  
  #get the time in years
  time <- sapply(gsub("[^_[:digit:]]","",columns), function(x){
    substr(x, start = nchar(x)-1, stop = nchar(x))
  }
  )
  unname(time) -> time
  
  #Select the columns
  subset(health, select = c("CSA",columns)) -> dat.h
  
  n <- dim(dat.h)[1]
  
  # print(length(time))
  dat.h <- tidyr::gather(dat.h, variable, value, -CSA)
  dat.h$time <- rep(as.numeric(time), each = n)
  dat.h
  # data.frame(tidyr::gather(dat.h, variable, value, -CSA), time = time)
  # rbind(cbind(tidyr::gather(dat.h, variable, value, -CSA), time))
  
})

plyr::ldply(health.long, data.frame) -> health.long


#get the columns
columns <-  sapply(strsplit(grep(paste0("mort"),names(health), value = T),"_"),function(x) x[1])
unique(columns) -> columns

health.long2 <- lapply(columns, function(x){
  
  #get the time in years
  time <- sapply(strsplit(grep(paste0("^", x, "_"),
                               names(health), value = T),"_"),function(x) x[2])
  time <- as.numeric(time)
  
  #Select the columns
  subset(health, select = c("CSA",
                            grep(paste0("^", x, "_"), names(health), value = T))
  ) -> dat.h
  
  n <- dim(dat.h)[1]
  
  # print(length(time))
  dat.h <- tidyr::gather(dat.h, variable, value, -CSA)
  dat.h$time <- rep(as.numeric(time), each = n)
  dat.h
  # data.frame(tidyr::gather(dat.h, variable, value, -CSA), time = time)
  # rbind(cbind(tidyr::gather(dat.h, variable, value, -CSA), time))
  
})

plyr::ldply(health.long2, data.frame) -> health.long2

health.long <- rbind(health.long, health.long2)

rm(health.long2, columns, rm.mort, var.names, variables, new_loc)

#Merge
health.sub <- subset(health, select = c("CSA", "LifeExp11", 
                                        "LifeExp12", "LifeExp13",
                                        "LifeExp14"))

#This will have the same number of rows as csa.prop since block neighborhood combinations are unique
inner_join(csa.prop, health.sub) -> csa.prop.health  


rm(csa, dat.le,data, neighbhd_csa, dat1) #new_loc was already removed above


#Crime data


ifiles <- unzip(file.path("raw_data", "census_blk.zip"), list = T)
ifiles.name <- substr(ifiles[,1][1], 1, nchar(ifiles[,1][1]) - 4)

if(!file.exists( file.path("wip", ifiles.name ) )){
  dir.create(file.path("wip", ifiles.name ))
  unzip(file.path("raw_data", "census_blk.zip"), 
        files = grep("*[.]", ifiles[,1], value = T), exdir = file.path("wip",ifiles.name), junkpaths = T)
} 

##Load data from crime data that contains street,Neighborhood and Police District info

data <- readr::read_csv(file.path("raw_data", "street_crime.csv.gz"))
names(data)[c(4,11)] <- c("Street",names(data)[4])
dat1 <- data #subset(data, select = -Location)
subset(data, select = c("Street", "Neighborhood", "Location", "CrimeDate")) -> data

data <- data[!is.na(data$Location),]

new_loc <-  sapply(data$Location, function(x) {
  y <- substr(x, start = 2, stop = nchar(x) - 1)
  strsplit(y, ", ")[[1]]
}
)
new_loc <- t(new_loc)

#side note. Records have been geocoded to the hundredth block and not the precise point that the crime took place.
data <- data.frame(data[,c(1,2, 4)], lon = as.numeric(new_loc[,2]), lat = as.numeric(new_loc[,1]))
dat1$lon <- NA;dat1$lat <- NA
dat1[!is.na(dat1$Location),][,-11] <- data.frame(dat1[!is.na(dat1$Location),][,-c(11,13,14)],
                                                 lon = as.numeric(new_loc[,2]), 
                                                 lat = as.numeric(new_loc[,1]))
dat1 <- subset(dat1, select = -Location)

#Step 1
### Check which census block polygon (2010 block shapefile, blk2010) each crime's longitude/latitude falls inside
dat.le <-  rgdal::readOGR(file.path("wip", "blk2010"), "blk2010", verbose = F)
block <- as.character(dat.le$BLOCK)
dat.le <- sp::spTransform(dat.le, sp::CRS("+proj=longlat +datum=WGS84")) #SpatialPolygonsDataFrame

#

# Declare the coordinate columns, which promotes the data frame to a spatial object
sp::coordinates(data) <- ~lon + lat #SpatialPointsDataFrame

# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
sp::proj4string(data) <- sp::proj4string(dat.le)

sp::over(dat.le, data, returnList = T) -> block_crime #gives which Neighborhood belongs to what crime block


unlink(file.path("wip", "blk2010"), force = T, recursive = T)


names(block_crime) <- block

#Gives a dataframe with cols Block, Neighborhood and street 
block_crime <- plyr::ldply (block_crime, data.frame) 
clnames <- names(block_crime)
clnames[1] <- "Blocks"
names(block_crime) <- clnames
block_crime$Neighborhood <- toupper(block_crime$Neighborhood)

block_crime <- block_crime %>% 
  dplyr::mutate(year = year(as.Date(CrimeDate, "%m/%d/%Y")))
block_crime <- subset(plyr::arrange(block_crime, Neighborhood, Blocks, year, Street, CrimeDate), 
                      select = c(Neighborhood, Blocks, year, Street, CrimeDate)) 

block_crime <- unique(block_crime)


#Step 2
##Load data from real properties that contains block info
data <- readr::read_csv(file.path("raw_data", "property.csv.gz"))
dat2 <- subset(data, select = names(data)[c(2:3,6,8,9,11,13:15)])
subset(data, select = c("Block", "Neighborhood", "Location")) -> data

data <- na.omit(data)

new_loc <-  sapply(data$Location, function(x) {
  y <- substr(x, start = 2, stop = nchar(x) - 1)
  strsplit(y, ", ")[[1]]
}
)
new_loc <- t(new_loc)
data <- data.frame(data[,c(1,2)], lon = as.numeric(new_loc[,2]), lat = as.numeric(new_loc[,1]))

dat2$lon <- NA;dat2$lat <- NA
dat2[!is.na(dat2$Neighborhood),] <- data.frame(dat2[!is.na(dat2$Neighborhood),][,1:9],
                                               lon = as.numeric(new_loc[,2]), 
                                               lat = as.numeric(new_loc[,1]))

# Declare the coordinate columns, which promotes the data frame to a spatial object
sp::coordinates(data) <- ~lon + lat #SpatialPointsDataFrame

# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
sp::proj4string(data) <- sp::proj4string(dat.le)

sp::over(dat.le, data, returnList = T) -> block_prop  #gives which Neighborhood-block belongs to what real property Neighbhd-block

names(block_prop) <- block

#Gives a dataframe with cols Block, Neighborhood and street 
block_prop <- plyr::ldply (block_prop, data.frame) 
clnames <- names(block_prop)
clnames[1] <- "Blocks"
names(block_prop) <- clnames
block_prop$Neighborhood <- toupper(block_prop$Neighborhood)

block_prop <- subset(plyr::arrange(block_prop, Neighborhood, Blocks, Block), select = c(Neighborhood, Blocks, Block)) 
block_prop <- unique(block_prop)


# Get which block numeration corresponds to street block in real_prop taxes

dplyr::inner_join(block_crime,block_prop) -> block_crime_pop

dat1$Neighborhood <- toupper(dat1$Neighborhood)

## Need to summarize otherwise the vanilla form is ~7.9 GB

block_crime_pop %>% 
  inner_join(dat1[,-c(12,13)]) %>% #Removing lon and lat since it is approximate for crime data
  inner_join(dat2[,c(1,4:7, 10:11)]) %>%
  dplyr::mutate(CityTax = gsub("\\$", "", CityTax), CityTax = as.numeric(CityTax),
                StateTax = gsub("\\$", "", StateTax), StateTax = as.numeric(StateTax),
                AmountDue = gsub("\\$", "", AmountDue), AmountDue = as.numeric(AmountDue), 
                TotalIncidents = as.numeric(`Total Incidents`)) %>%
  dplyr::group_by(Neighborhood, Block, year) %>% 
  dplyr::summarise(TotalIncidents = sum(TotalIncidents, na.rm = T),
                   mean(CityTax, na.rm = T),
                   mean(StateTax, na.rm = T),
                   mean(AmountDue, na.rm = T),
                   median(lon, na.rm = T),
                   median(lat, na.rm = T)) -> block_crime_pop

#Step 3
##Load data on vacant buildings (address, neighbourhood, location and notice date)
data <- readr::read_csv(file.path("raw_data", "vacants.csv"))
dat3 <- subset(data, select = names(data)[c(2,4:6)])
subset(data, select = c("BuildingAddress", "Neighborhood", "Location", "NoticeDate")) -> data

data <- na.omit(data)

new_loc <-  sapply(data$Location, function(x) {
  y <- substr(x, start = 2, stop = nchar(x) - 1)
  strsplit(y, ", ")[[1]]
}
)
new_loc <- t(new_loc)
data <- data.frame(data[,c(1,2, 4)], lon = as.numeric(new_loc[,2]), lat = as.numeric(new_loc[,1]))

dat3$lon <- NA;dat3$lat <- NA
data.frame(dat3[!is.na(dat3$BuildingAddress) & !is.na(dat3$Neighborhood) ,][,1:4],
           lon = as.numeric(new_loc[,2]), 
           lat = as.numeric(new_loc[,1])) -> dat3[!is.na(dat3$BuildingAddress) & !is.na(dat3$Neighborhood),]

#Convert Noticedate to date obj and then extract the year
dat3 %>% 
  dplyr::mutate(year =  year(as.Date(NoticeDate, "%m/%d/%Y")),Neighborhood = toupper(Neighborhood)) -> dat3

# Declare the coordinate columns, which promotes the data frame to a spatial object
sp::coordinates(data) <- ~lon + lat #SpatialPointsDataFrame

# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
sp::proj4string(data) <- sp::proj4string(dat.le)

sp::over(dat.le, data, returnList = T) -> block_vac  #gives which Neighborhood-block belongs to what vacant property Neighbhd-block


names(block_vac) <- block

#Gives a dataframe with cols Block, Neighborhood and street 
block_vac <- plyr::ldply (block_vac, data.frame) 
clnames <- names(block_vac)
clnames[1] <- "Blocks"
names(block_vac) <- clnames
block_vac$Neighborhood <- toupper(block_vac$Neighborhood)

block_vac <- subset(plyr::arrange(block_vac, Neighborhood, Blocks, BuildingAddress, NoticeDate), 
                    select = c(Neighborhood, Blocks, BuildingAddress, NoticeDate)) 
block_vac <- unique(block_vac)

# Get which block numeration corresponds to street block in real_prop taxes

#Join to the real_property taxes dataset by Neighbourhood and block (numerical) to get which street and Neighbourhood correspond
#to what Neighbourhood and block in real_property taxes

dplyr::inner_join(block_vac,block_prop) -> block_vac_pop 

block_vac_pop %>% 
  dplyr::inner_join(dat3, by  = c("Neighborhood", "BuildingAddress", "NoticeDate")) %>% #Join to vacant_building dataset by street, nbhd, and date
  dplyr::group_by(Neighborhood, Block = Block.x, Year = year) %>% #Group by Neighbourhood, block and year to obtain summary stats
  dplyr::summarise(Count_vacant = length(BuildingAddress)) -> block_vac_pop 

rm(dat.le,dat1, dat2, dat3, block, 
   new_loc, ifiles, ifiles.name,clnames, data)


##For block_crime_pop and block_vac_pop, I want the "year" variable to cover every year from 2010 to 2014 (the range used in the code below)
block_crime_pop -> block_crime_pop1
block_crime_pop1 %>% 
  dplyr::select(Neighborhood, Block) %>%
  unique() -> block_crime_pop1

block_crime_pop1[rep(seq_len(nrow(block_crime_pop1)), each=5),] -> block_crime_pop1

block_crime_pop1$year <- rep(c(2010:2014), n_distinct(block_crime_pop1))

names(block_crime_pop) <- c(names(block_crime_pop)[1:4], "CityTax.avg",
                            "StateTax.avg", "AmountDue.avg",
                            "lon.med", "lat.med")

block_crime_pop1 %>% 
  left_join(block_crime_pop) -> crime_pop 

crime_pop_impute <- crime_pop %>%
  subset(select = -Block) %>%
  dplyr::group_by(Neighborhood, year) %>%
  summarise_all(mean, na.rm = T)
# Impute neighbourhood average for the particular year
attach(crime_pop)
crime_pop[is.na(TotalIncidents), ][,c(1,3)] %>%
  inner_join(crime_pop_impute) -> crime_pop[is.na(TotalIncidents), ][,-2]
crime_pop[is.na(CityTax.avg), ][,c(1,3)] %>%
  inner_join(crime_pop_impute) -> crime_pop[is.na(CityTax.avg), ][,-2]
crime_pop[is.na(StateTax.avg), ][,c(1,3)] %>%
  inner_join(crime_pop_impute) -> crime_pop[is.na(StateTax.avg), ][,-2]
crime_pop[is.na(AmountDue.avg), ][,c(1,3)] %>%
  inner_join(crime_pop_impute) -> crime_pop[is.na(AmountDue.avg), ][,-2]

#After imputation some years still have missing info; set those remaining values to 0
crime_pop %>%
  dplyr::mutate(TotalIncidents = ifelse(is.na(TotalIncidents), 0, TotalIncidents),
                CityTax.avg = ifelse(is.na(CityTax.avg), 0, CityTax.avg),
                StateTax.avg = ifelse(is.na(StateTax.avg), 0, StateTax.avg),
                AmountDue.avg = ifelse(is.na(AmountDue.avg), 0, AmountDue.avg) ) -> crime_pop

detach(crime_pop)

block_vac_pop -> block_vac_pop1
block_vac_pop1 %>% 
  dplyr::select(Neighborhood, Block) %>%
  unique() -> block_vac_pop1

block_vac_pop1[rep(seq_len(nrow(block_vac_pop1)), each=5),] -> block_vac_pop1

block_vac_pop1$Year <- rep(c(2010:2014), n_distinct(block_vac_pop1))

names(block_vac_pop) <- c(names(block_vac_pop)[1:3], "Count_vacant")

block_vac_pop1 %>% 
  left_join(block_vac_pop)  -> vac_pop
vac_pop$Count_vacant <- ifelse(is.na(vac_pop$Count_vacant), 0, vac_pop$Count_vacant)

rm(block_crime_pop1, block_vac_pop1,block_crime_pop, block_vac_pop)
rm(block_crime, block_prop, block_vac, crime_pop_impute, csa.prop)


##Create analyses data
names(csa.prop.health) <- tolower(names(csa.prop.health))
names(vac_pop) <- tolower(names(vac_pop))
names(crime_pop) <- tolower(names(crime_pop)) #Note that lat and lon here refer to that gotten from real_prop data
names(houses) <- tolower(names(houses))

#Co-ordinates
csa.prop.health %>% 
  dplyr::group_by(csa) %>%
  dplyr::summarise(lon.med.avg = mean(lon.med), lat.med.avg = mean(lat.med)) -> coord.lon_lat

## 2014


# crime_pop[crime_pop$year == 2014,]

csa.prop.health[,c(1:3, 12)] %>% 
  left_join(subset(crime_pop, year == 2014, select = -c(lon.med, lat.med))) %>%
  left_join(subset(vac_pop, year == 2014)) %>%
  dplyr::mutate(count_vacant = as.numeric(ifelse(is.na(count_vacant), 0, count_vacant)))-> health_prop_le_crime_vac_block

health_prop_le_crime_vac_block %>%
  inner_join(houses) %>%
  dplyr::mutate(prop.vacant = count_vacant/count_house) -> health_prop_le_crime_vac_block

#For any variables that are NA use the neighbourhood average of the csa they belong to for imputation
health_prop_le_crime_vac_block %>%
  dplyr::select(-year, -block, -lifeexp14) %>%
  dplyr::group_by(csa, neighborhood) %>%
  dplyr::summarise( totalincidents = mean(totalincidents, na.rm = T),
                    citytax.avg = mean(citytax.avg, na.rm = T),
                    statetax.avg = mean(statetax.avg, na.rm = T),
                    amountdue.avg = mean(amountdue.avg, na.rm = T),
                    prop.vacant = mean(prop.vacant, na.rm = T)) -> impute

health_prop_le_crime_vac_block[is.na(health_prop_le_crime_vac_block$totalincidents), c(1:4)] %>%
  inner_join(impute) -> health_prop_le_crime_vac_block[is.na(health_prop_le_crime_vac_block$totalincidents),][,-c(5, 10:11)]
rm(impute)

#For any variables that are still NA use the CSA average that they belong to for imputation
health_prop_le_crime_vac_block %>%
  dplyr::select(-year, -block, -lifeexp14) %>%
  dplyr::group_by(csa) %>%
  dplyr::summarise( totalincidents = mean(totalincidents, na.rm = T),
                    citytax.avg = mean(citytax.avg, na.rm = T),
                    statetax.avg = mean(statetax.avg, na.rm = T),
                    amountdue.avg = mean(amountdue.avg, na.rm = T),
                    prop.vacant = mean(prop.vacant, na.rm = T)) -> impute

health_prop_le_crime_vac_block[is.na(health_prop_le_crime_vac_block$totalincidents), c(1:4)] %>%
  inner_join(impute) -> health_prop_le_crime_vac_block[is.na(health_prop_le_crime_vac_block$totalincidents),][,-c(5, 10:11)]
rm(impute)


#Aggregate by CSA 

health_prop_le_crime_vac_block %>%
  dplyr::select(-year, -block, -neighborhood) %>%
  filter(!is.nan(totalincidents)) %>%
  # mutate(count_vacant = as.numeric(count_vacant)) %>%
  dplyr::group_by(csa) %>%
  summarise_all(mean) -> csa.data

census12_14 %>% dplyr::select(csa = CSA2010, tpop = tpop10, racdiv10,mhhi13, femhhs10) %>% 
  inner_join(csa.data) %>%
  subset(select = (c(csa, lifeexp14, tpop, 
                     racdiv10, mhhi13, femhhs10, 
                     totalincidents, citytax.avg, 
                     statetax.avg, amountdue.avg, 
                     prop.vacant ))) -> csa.data

housing <- readr::read_csv(file.path("raw_data", "housing.csv"))
edu <- readxl::read_excel(file.path("raw_data", "edu_and_youth.xlsx"))
welfare <- readr::read_csv(file.path("raw_data", "child_and_fam_wellbeing.csv"))
sustain <- readxl::read_excel(file.path("raw_data", "sustain.xlsx"))
crime <- readr::read_csv(file.path("raw_data", "crime.csv"))


names(housing)[1] <- "csa";names(edu)[1] <- "csa";names(welfare)[1] <- "csa";names(sustain)[1] <- "csa"
names(crime)[1] <- "csa"


edu[-1,] -> edu
edu <- tbl_df(data.frame(edu[,1],apply(edu[,-1],2,as.numeric)))
sustain[-1,] -> sustain
sustain <- tbl_df(data.frame(sustain[,1],apply(sustain[,-1],2,as.numeric)))


csa.data %>%
  inner_join(subset(housing, select = c(csa, shomes14,cashsa14, fore14))) %>%
  inner_join(subset(crime, select = c(csa, shoot11, gunhom13, narc12))) %>%
  inner_join(subset(edu, select = c(csa, abse14,absmd14, abshs14, susp13 ))) %>%
  inner_join(subset(welfare, select = c(csa, birthwt14, liquor14))) %>%
  inner_join(subset(sustain, select = c(csa, heatgas14, elheat14,wlksc11))) -> csa.data


csa.data.anal <- data.frame(csa.data[,1:2], csa.data[,3:26], coord.lon_lat[,c(2:3)])
rm(csa.data)


#Block level

csa.data.anal[,c(1:2,4,6 ,17,19,21,23:26)] %>% 
  inner_join(health_prop_le_crime_vac_block[,c(1:4,6,12)]) %>%
  inner_join(csa.prop.health[,c(1:3, 7:8)]) %>% unique() -> block_data.anal


block_data.anal <- data.frame(block_data.anal[,c(1,12:13)], block_data.anal[,c(3:11, 14:15)], block_data.anal[,16:17])


##American Community Survey 

switch(Sys.info()[['sysname']],
       Windows= {
         library(acs);library(dplyr)
         api.key.install(key = "74dc730cf3f4a5d715eadf2db90cd6ac80d8c8cc")
         my.geo <- geo.make(state="MD", county = "Baltimore city", tract ="*", block.group ="*",check = T)
         #Block Group data
         
         ##Related CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND AGE (get prop fem headed with kid <18)
         
         cdbk <- acs.lookup(2014, 5, table.number="B11004")
         hh <- acs.fetch(2014, span  = 5, 
                         geography=my.geo, 
                         table.number = "B11004", case.sensitive = F)
         hh <- data.frame(hh@geography[,4:5],hh@estimate, row.names = 1:dim(hh@estimate)[1])
         
         hh %>%
           dplyr::mutate(propfemhh = B11004_015/B11004_001) %>% 
           dplyr::select(tract, blockgroup, propfemhh) -> hh
         
         #OWN CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND AGE (get prop fem headed with kid <18)
         child.code <- acs.lookup(2014, 5, table.number="B09002")
         child.ft <- acs.fetch(2014, span  = 5, 
                               geography=my.geo, 
                               table.number = "B09002", case.sensitive = F)
         child.ft <- data.frame(child.ft@geography[,4:5],child.ft@estimate, row.names = 1:dim(child.ft@estimate)[1])
         
         
         #POVERTY STATUS IN THE PAST 12 MONTHS BY DISABILITY STATUS BY EMPLOYMENT STATUS FOR THE POPULATION 20 TO 64 YEARS (get prop below pov line)
         poverty.code <- acs.lookup(2014, 5, table.number="B23024")
         pov.ft <- acs.fetch(2014, span  = 5, 
                             geography=geo.make(state=24, county = 510, tract ="*", block.group ="*",check = T), 
                             table.number = "B23024", case.sensitive = F)
         pov.ft <- data.frame(pov.ft@geography[,4:5],pov.ft@estimate, row.names = 1:dim(pov.ft@estimate)[1])
         
         pov.ft %>%
           dplyr::mutate(propbelow = B23024_002/B23024_001) %>% 
           dplyr::select(tract, blockgroup, propbelow) -> pov.ft
         
         
         #MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 2014 INFLATION-ADJUSTED DOLLARS)
         medinc.code <- acs.lookup(2014, 5, table.number="B19013")
         medinc.ft <- acs.fetch(2014, span  = 5, 
                                geography=geo.make(state="MD", county = "Baltimore city", tract ="*", block.group ="*"), 
                                table.number = "B19013", case.sensitive = F)
         medinc.ft <- data.frame(medinc.ft@geography[,4:5],medinc.ft@estimate, row.names = 1:dim(medinc.ft@estimate)[1])
         
         
         #TYPES OF HEALTH INSURANCE COVERAGE BY AGE
         insur.code <- acs.lookup(2014, 5, table.number="B27010")
         insur.ft <- acs.fetch(2014, span  = 5, 
                               geography=geo.make(state="MD", county = "Baltimore city", tract ="*", block.group ="*"), 
                               table.number = "B27010", case.sensitive = F)
         insur.ft <- data.frame(insur.ft@geography[,4:5],insur.ft@estimate, row.names = 1:dim(insur.ft@estimate)[1])
         
         insur.ft %>%
           dplyr::mutate(propkids_withinsurance = 1- (B27010_017/B27010_002)) %>% 
           dplyr::select(tract, blockgroup, propkids_withinsurance) %>% 
           dplyr::mutate(propkids_withinsurance = ifelse(is.nan(propkids_withinsurance),1, propkids_withinsurance)) -> insur.ft
         
         #RACE "to calculate racial diversity"
         
         # Working with percents expressed as ratios (e.g., 63 percent = 0.63), the index is calculated in three steps:  A. Square the percent for each
         # group, B. Sum the squares, and C. Subtract the sum from 1.00. Eight groups were used for the index: 1. White, not Hispanic;
         # 2. Black or African American; 3. American Indian and Alaska Native (AIAN); 4. Asian; 5. Native Hawaiian and Other Pacific
         # Islander (NHOPI); 6. Two or more races, not Hispanic; 7. Some other race, not Hispanic; and 8. Hispanic or Latino
         
         race.code <- acs.lookup(2014, 5, table.number="B02001")
         race.ft <- acs.fetch(2014, span  = 5, 
                              geography=geo.make(state="MD", county = "Baltimore city", tract ="*", block.group ="*"), 
                              table.number = "B02001", case.sensitive = F)
         race.ft <- data.frame(race.ft@geography[,4:5],race.ft@estimate, row.names = 1:dim(race.ft@estimate)[1])
         
         race.ft %>%
           dplyr::mutate(B02001_002 = B02001_002/B02001_001, B02001_003 = B02001_003/B02001_001, B02001_004 = B02001_004/B02001_001,
                         B02001_005 = B02001_005/B02001_001, B02001_006 = B02001_006/B02001_001, B02001_007 = B02001_007/B02001_001,
                         B02001_010 = B02001_010/B02001_001, 
                         racdiv = 1-(B02001_002^2 + B02001_003^2 + B02001_004^2 + B02001_005^2 + B02001_006^2 + B02001_007^2 + B02001_010^2)) %>% 
           dplyr::select(tract, blockgroup, racdiv) -> race.ft
         
         library(tigris)
         bmore.city <- block_groups("MD", "Baltimore city")
         
         # BG <- as.character(dat.le$CSA2010)
         bmore.city_sp <- sp::spTransform(bmore.city, sp::CRS("+proj=longlat +datum=WGS84")) #SpatialPolygonsDataFrame
         
         bmore.city <- bmore.city_sp
         
         
         rm(bmore.city_sp)
         # Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
         sp::proj4string(real_prop) <- sp::proj4string(bmore.city)
         
         sp::over(bmore.city, real_prop, returnList = T) -> BG_neighbhd_csa 
         
         BG_neighbhd_csa <- plyr::ldply(BG_neighbhd_csa, data.frame)
         
         as.data.frame(real_prop) %>%
           dplyr::group_by(Neighborhood, Block) %>%
           dplyr::summarise(lon = median(lon), lat = median(lat)) -> areal.prop
         
         BG_neighbhd_csa %>%
           dplyr::inner_join(dplyr::mutate(bmore.city@data, .id = row.names(bmore.city@data))) %>%
           dplyr::select(.id, Neighborhood, Block, tract = TRACTCE, blockgroup = BLKGRPCE) %>%
           dplyr::mutate(tract = as.numeric(tract)) %>%
           dplyr::left_join(na.omit(hh)) %>% 
           dplyr::left_join(na.omit(pov.ft)) %>%
           dplyr::left_join(na.omit(medinc.ft)) %>%
           dplyr::left_join(na.omit(insur.ft)) %>% 
           dplyr::left_join(na.omit(race.ft)) %>%
           dplyr::inner_join(areal.prop) %>%
           unique() %>% na.omit() -> bg_smooth
         
         bg_smooth -> BG_neighbhd_csa
         
         bg_smooth %>%
           dplyr::group_by(tract, blockgroup) %>%
           dplyr::summarise(propfemhh = mean(propfemhh), propbelow = mean(propbelow), 
                            B19013_001 = mean(B19013_001), propkids_withinsurance = mean(propkids_withinsurance), 
                            racdiv = mean(racdiv),lon = median(lon), lat = median(lat)) %>% data.frame -> bg_smooth
         saveRDS(bg_smooth,"bg_smooth.rds")
         saveRDS(BG_neighbhd_csa, "BG_neighbhd_csa.rds")},
       
       # Linux and macOS (Darwin) share one branch: download the prebuilt objects
       Linux  = ,
       Darwin = {link <- "https://github.com/ksosina/BLE/raw/master/Data/BG_neighbhd_csa.rds"
       download.file(link, "BG_neighbhd_csa.rds", 
                     method = "wget", mode = "wb")
       BG_neighbhd_csa<- readRDS("BG_neighbhd_csa.rds")
       
       link <- "https://github.com/ksosina/BLE/raw/master/Data/bg_smooth.rds"
       download.file(link, "bg_smooth.rds", 
                     method = "wget", mode = "wb")
       bg_smooth<- readRDS("bg_smooth.rds")})


# BG_neighbhd_csa and bg_smooth are built (Windows) or downloaded (Linux, macOS)
# inside the switch() above, so the ACS and tigris steps are not repeated here;
# repeating them would fail on systems where hh, pov.ft, etc. were never created.
if (exists("areal.prop")) rm(areal.prop)


# Kriging the Block group predictors

# BG_neighbhd_csa <- readRDS("BG_neighbhd_csa.rds")
# bg_smooth <- readRDS("bg_smooth.rds")

#proportion below Poverty
sp::coordinates(bg_smooth) <- ~lon + lat
semivariog <- gstat::variogram(propbelow~1, locations=bg_smooth, data=bg_smooth)
# plot(semivariog)
#a spherical model is tried first, with starting values read off the empirical variogram
model.variog<-gstat::vgm(psill=0.012, model="Sph", nugget=0.018, range=0.074)
fit.variog<-gstat::fit.variogram(semivariog, model.variog)
# plot(semivariog, fit.variog)

## prediction locations: instead of a regular grid (commented out below), predict at the
## neighbourhood-block coordinates held in BG_neighbhd_csa
# grd <- expand.grid(x=seq(from=range(BG_neighbhd_csa$lon)[1], to=range(BG_neighbhd_csa$lon)[2], by=1e-04),
#                    y=seq(from=range(BG_neighbhd_csa$lat)[1], to=range(BG_neighbhd_csa$lat)[2], by=1e-04))

grd <- data.frame(x=BG_neighbhd_csa$lon,
                  y=BG_neighbhd_csa$lat)

## convert the prediction locations to a SpatialPoints object
sp::coordinates(grd) <- ~ x+y
# gridded(grd) <- TRUE

krig<-gstat::krige(formula=propbelow~1, locations=bg_smooth, newdata=grd, model=model.variog, nmax = 5)

krig.output <- as.data.frame(krig)
names(krig.output)[1:3]<-c("lon","lat","propbelow.pred")

BG_neighbhd_csa %>%
  dplyr::left_join(krig.output[,-4]) %>%
  dplyr::mutate(propbelow.pred = ifelse(is.na(propbelow.pred),propbelow, propbelow.pred))-> BG_neighbhd_csa

#Median Income
semivariog <- gstat::variogram(B19013_001~1, locations=bg_smooth, data=bg_smooth)
# plot(semivariog)
#the data looks like it might be an exponential shape, so we will try that first with the values estimated from the empirical variogram
model.variog<-gstat::vgm(psill=751835064, model="Exp", nugget=257444639, range=0.074)
fit.variog<-gstat::fit.variogram(semivariog, model.variog)
# plot(semivariog, fit.variog)

krig<-gstat::krige(formula=B19013_001~1, locations=bg_smooth, newdata=grd, model=model.variog, nmax = 5)

krig.output <- as.data.frame(krig)
names(krig.output)[1:3]<-c("lon","lat","B19013_001.pred")

BG_neighbhd_csa %>%
  dplyr::left_join(krig.output[,-4]) %>%
  dplyr::mutate(mhhi = ifelse(is.na(B19013_001.pred),B19013_001, B19013_001.pred)) -> BG_neighbhd_csa


rm(krig.output, krig, fit.variog, model.variog, semivariog, grd, bg_smooth)

detach("package:dplyr", unload=TRUE)
detach("package:lubridate", unload=TRUE)
# detach("package:acs", unload=TRUE)

save.image("analyses_data.RData")
setwd(file.path(".."))
```
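
The racial diversity index built in the chunk above follows three steps: square each group's share of the population, sum the squares, and subtract the sum from 1. The block below is a minimal sketch of that calculation in base R. The helper name `diversity_index()` and the counts are made up for illustration and are not part of the analysis code.

```r
# Hypothetical helper (an assumption, not taken from the analysis):
# diversity index from a vector of group counts.
diversity_index <- function(counts) {
  stopifnot(is.numeric(counts), all(counts >= 0))
  total <- sum(counts)
  if (total == 0) return(NA_real_)  # no population, so the index is undefined
  shares <- counts / total          # percents expressed as ratios
  1 - sum(shares^2)                 # A. square, B. sum, C. subtract from 1
}

# Made-up counts, for illustration only
one_group  <- diversity_index(c(100, 0, 0, 0))    # everyone in a single group
two_groups <- diversity_index(c(50, 50))          # two equally sized groups
four_even  <- diversity_index(c(25, 25, 25, 25))  # four equally sized groups
```

The index is 0 when a single group makes up the whole population and rises towards 1 as the population spreads evenly over more groups.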

## Descriptives

The goal of this analysis is to predict life expectancy at the street block level, but the block information in the dataset was not clearly defined. I therefore made a couple of plots to check which blocks were census blocks and which were street blocks.

Furthermore, since some of the data files have information on neighbourhood blocks, I plotted the neighbourhood information as delineated by the block-level data from the [Baltimore city website](https://data.baltimorecity.gov) and then overlaid the neighbourhood boundaries from the [Maryland Department of Planning](http://www.mdp.state.md.us/). Using information from the [Baltimore gisdata website](http://gisdata.baltimorecity.gov), I was also able to find out how "block" is actually defined. All of this points to the possibility of using the blocks in our dataset as street blocks.
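
The block-level records store each point as a single text field, which the preparation code treats as `(latitude, longitude)`. The block below is a minimal sketch of splitting such a field into numeric columns, assuming that format. The helper name `parse_location()` and the sample values are illustrative and are not taken from the data.

```r
# Hypothetical helper (an assumption, not part of the analysis code):
# split "(lat, lon)" strings into numeric latitude and longitude columns.
parse_location <- function(location) {
  inner <- substr(location, 2, nchar(location) - 1)  # drop the surrounding parentheses
  parts <- strsplit(inner, ", ", fixed = TRUE)       # one character vector per record
  ok <- lengths(parts) == 2                          # missing or malformed records stay NA
  lat <- rep(NA_real_, length(location))
  lon <- rep(NA_real_, length(location))
  lat[ok] <- as.numeric(vapply(parts[ok], `[`, character(1), 1))
  lon[ok] <- as.numeric(vapply(parts[ok], `[`, character(1), 2))
  data.frame(lat = lat, lon = lon)
}

# Illustrative values only
pts <- parse_location(c("(39.2904, -76.6122)", NA))
```

Returning `NA` for records that cannot be parsed keeps the output aligned row by row with the input, so the coordinates can be bound back onto the original data frame without first dropping rows.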

```{r Plots, echo=FALSE, message=FALSE, cache.lazy=TRUE, eval=TRUE}
#Plotting 

setwd(file.path(".", "Data"))  

n <- length(list.files(file.path("..", "Plots")))
if (n == 0){
  packages <- c("ggplot2","lubridate", "downloader", 
                "readr", "readxl", "maptools", "RColorBrewer", 
                "ggmap", "devtools", "rgeos", "broom", "rgdal", "tigris")
  sapply(packages, library, character.only = T, quietly = T)
  
  
  data <- read_csv(file.path("raw_data", "property.csv.gz"))
  subset(data, select = c("Block", "Neighborhood", "Location")) -> data
  data <- na.omit(data)
  new_loc <-  sapply(data$Location, function(x) {
    y <- substr(x, start = 2, stop = nchar(x) - 1)
    strsplit(y, ", ")[[1]]
  }
  )
  new_loc <- t(new_loc)
  data <- data.frame(data[,c(1,2)], lat = as.numeric(new_loc[,1]), lon = as.numeric(new_loc[,2]))
  
  p <- ggplot(data=data,aes(x = lon, y = lat))
  
  
  #Load in the data file (could this be done from the downloaded zip file directly?)
  
  
  #Fit of neighbourhood info
  gor <- readOGR(file.path("wip", "nhood"), "nhood_2010", verbose = F)
  gor <- spTransform(gor, CRS("+proj=longlat +datum=WGS84"))
  gor <- tidy(gor)
  
  # p <- ggplot(data = gor,aes(x=long, y=lat, group=group))
  p + 
    geom_point(data=data,aes(x = lon, y = lat, colour = as.factor(Neighborhood)))  +
    geom_polygon(data=gor, aes(x=long, y=lat, group=group), color="black", alpha=0) +
    coord_quickmap() +
    labs(title = "The fit of neighbourhood info on block data in Baltimore City",
         x = "Longitude",
         y = "Latitude") +
    theme(axis.text = element_text(size = 18),
          axis.title = element_text(size = 20),
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15),
          strip.text = element_text(size = 15),
          title = element_text(size = 18),
          legend.position = "none",
          panel.background = element_blank(),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          axis.line = element_line(colour = "black"))
  
  ggsave(filename = file.path("..", "Plots", 
                              "block_block.png"), width = 45, height = 45, units = "cm")
  
  
  ##    The fit of Block info on Neighbourhood data in Baltimore City
  
  
  # gor <- readOGR(file.path("wip", "blk2010"), "blk2010", verbose = F)
  gor <- blocks("MD", county = "Baltimore city")
  gor <- spTransform(gor, CRS("+proj=longlat +datum=WGS84"))
  gor <- tidy(gor)
  
  p <- ggplot(data = data, aes(x=lon, y=lat))
  p + 
    geom_jitter(aes(colour =  factor(Block)),size = 1.5) +
    geom_polygon(data=gor, aes(x=long, y=lat, group=group, fill = group), color="black", alpha=0) +
    coord_map() +
    labs(title = "Census blocks vs Street Blocks",
         x = "Longitude",
         y = "Latitude") +
    theme(axis.text = element_text(size = 18),
          axis.title = element_text(size = 20),
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15),
          strip.text = element_text(size = 15),
          title = element_text(size = 18),
          legend.position = "none",
          panel.background = element_blank(),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          axis.line = element_line(colour = "black"))
  
  ggsave(filename = file.path("..", "Plots", 
                              "block_n.png"),width = 45, height = 45, units = "cm")
  
  # p + 
  #   geom_point(data=data,aes(x = lon, y = lat, colour = "blue" ))  +
  #   # geom_text(data = dat1, aes(label = as.factor(dat1$Neighborhood)), 
  #   #           colour="Black",size=2,hjust="center", 
  #   #           vjust="center") +
  #   geom_polygon(data=gor, aes(x=long, y=lat, group=group), color="black", alpha=0) +
  #   # geom_map(map=gor, data=gor, aes(map_id=id, x=long, y=lat, group=group), color="red", alpha=0) +
  #   coord_quickmap() +
  #   labs(title = "The fit of Census blocks on street block data in Baltimore City",
  #        x = "Longitude",
  #        y = "Latitude") +
  #   theme(axis.text = element_text(size = 18),
  #         axis.title = element_text(size = 20),
  #         legend.text = element_text(size = 15),
  #         legend.title = element_text(size = 15),
  #         strip.text = element_text(size = 15),
  #         title = element_text(size = 18),
  #         legend.position = "none",
  #         panel.background = element_blank(),
  #         panel.grid.major = element_blank(), 
  #         panel.grid.minor = element_blank(),
  #         axis.line = element_line(colour = "black"))
  # 
  # ggsave(filename = file.path(".", "Plots", 
  #                             "block_b.png"), 
  #        width = 45, height = 45, units = "cm")
  
  
  # unlink(file.path("wip", "blk2010"), force = T, recursive = T)
}

```

<!-- ![Neighbourhoods as defined by blocks using just block level data](../Plots/n_block_block.png) ![The fit of neighbourhood info on block data in Baltimore City](../Plots/block_block.png) ![Blocks in Baltimore City](../Plots/block_n.png)   -->

![](./Plots/block_n.png) 

Here the coloured points are the street blocks (the colours vary by neighbourhood), while each grid cell represents a census block. For more plots examining the fits, click [here](./Plots/) or go to the plots folder in your working directory.

```{r proof, echo=FALSE, message=FALSE,warning=FALSE, eval=FALSE}
## ImageMagick must be installed, otherwise the PDF-to-PNG conversion below will not work
real.prop <- read_csv(file.path(".","Data","raw_data", "real_property.csv.gz"))
random_block <- sample(real.prop$BLOCKPLAT,1)
if(!file.exists(file.path(".", "Plots", "random_block.pdf")))
{
  download.file(random_block, destfile = file.path(".", "Plots", "random_block.pdf"),
           # method = "curl",
           mode="wb")
}
# print(getwd())
# file.show(file.path("..", "Plots", "random_block.pdf"))
if(!file.exists(file.path(".", "Plots", "random_block.png")))
{
  animation::im.convert(file.path(".", "Plots", "random_block.pdf"), output = file.path(".", "Plots", "random_block.png"))
}

```

All of this indicates a good fit. I also used GIS data from the [Baltimore City website](http://gisdata.baltimorecity.gov/) and found that each block was defined as a street block. An example of a city block pulled from the dataset can be found [here](./Plots/random_block.pdf) or in the plots folder.

## Analysis
### Checking for Spatial correlation

Since we have spatial data, I ran both the Mantel test [c.f. @mantel1] and Moran's I [c.f. @moran] to examine whether spatial autocorrelation exists in this dataset. Please note that while both tests measure spatial autocorrelation, they refer to quite different concepts.

Mantel's test [@mantel1; @mantel] gives the correlation between different variables due to their spatial location; that is, it judges whether closeness in one set of variables is related to closeness in another set of variables. Relating this to our dataset, we can use it to see whether samples that are close in terms of geographic location are also close in terms of life expectancy, i.e. test whether the distance matrix based on life expectancy values is correlated with the distance matrix based on spatial location for the CSAs.
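
To make the idea concrete, here is a minimal sketch of a permutation Mantel test written in base R on synthetic data. It is not the code used for the analysis in this report: the coordinates, the outcome `y`, and the helper `mantel_r` are all made up for illustration, and the p-value is one-sided (testing for a positive association).

```r
# Minimal permutation Mantel test on synthetic data (base R only).
set.seed(1)
n <- 30
coords <- cbind(runif(n), runif(n))      # hypothetical locations
y <- coords[, 1] + rnorm(n, sd = 0.2)    # hypothetical outcome with a spatial trend

d_space <- as.matrix(dist(coords))       # pairwise spatial distances
d_y <- as.matrix(dist(y))                # pairwise differences in the outcome

# Correlation between the lower triangles of two distance matrices
mantel_r <- function(a, b) {
  lt <- lower.tri(a)
  cor(a[lt], b[lt])
}

r_obs <- mantel_r(d_space, d_y)

# Null distribution: relabel the units, i.e. permute rows and columns together
n_perm <- 999
r_perm <- replicate(n_perm, {
  idx <- sample(n)
  mantel_r(d_space, d_y[idx, idx])
})

# One-sided permutation p-value
p_value <- (sum(r_perm >= r_obs) + 1) / (n_perm + 1)
```

Because the null distribution is built from random relabellings, `r_obs` is fixed for a given dataset while `p_value` changes from run to run unless a seed is set.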

```{r Mantel, echo=FALSE, cache=F, message=FALSE, warning=F}
setwd(file.path(".", "Data"))


#### ID which neighborhoods fall into a CSA

##Load data from real properties that contains block info
data <- readr::read_csv(file.path("raw_data", "property.csv.gz"))
subset(data, select = c("Block", "Neighborhood", "Location")) -> data

data <- na.omit(data)

new_loc <-  sapply(data$Location, function(x) {
  y <- substr(x, start = 2, stop = nchar(x) - 1)
  strsplit(y, ", ")[[1]]
}
)
new_loc <- t(new_loc)
data <- data.frame(data[,c(1,2)], lon = as.numeric(new_loc[,2]), lat = as.numeric(new_loc[,1]))
dat1 <- data

# dat <- data.frame(Neighborhood = sort(unique(data$Neighborhood)),
#                   lon = data$lon,
#                   lat =data$lat, stringsAsFactors=FALSE)

#Step 1
### Check if Lng and Lat fall inside polygons from ESRI Shape file for Child and wellbeing (this has the outcome)
dat.le <-  rgdal::readOGR(file.path("wip", "health"), "health", verbose = F)
csa <- as.character(dat.le$CSA2010)
dat.le <- sp::spTransform(dat.le, sp::CRS("+proj=longlat +datum=WGS84")) #SpatialPolygonsDataFrame


# Assignment modified according
sp::coordinates(data) <- ~lon + lat #SpatialPointsDataFrame

# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
sp::proj4string(data) <- sp::proj4string(dat.le)

sp::over(dat.le, data, returnList = T) -> neighbhd_csa #gives which Neighborhood belongs to what CSA

names(neighbhd_csa) <- csa

#Gives a dataframe with cols CSA, Blocks, and Neighborhood so I can match block level info to CSA
neighbhd_csa <- plyr::ldply (neighbhd_csa, data.frame) 
clnames <- names(neighbhd_csa)
clnames[1] <- "CSA"
names(neighbhd_csa) <- clnames

#Got the CSA to NBHD from BNIA site to compare
csa_nsa <- readxl::read_excel(file.path(".", "raw_data", "csa_nsa.xlsx"))

#Caution CSA and neighbourhoods are not in 1-1
with(neighbhd_csa, tapply(CSA, Neighborhood, function(x) length(unique(x)))) -> test.dat

#Get neighbhds where count of CSA = 2 and check
unique(neighbhd_csa[neighbhd_csa$Neighborhood == names(test.dat[test.dat == 2][1]),][,c(1,3)]) -> not.unique
# not.unique


#Step 2
# Now that all that is done, I start merging (on block and neighbhd) to get CSA level, neighborhood level and block level data in one dataset
health <- readr::read_csv(file.path(".", "raw_data", "child_and_fam_wellbeing.csv"))
clnames <- names(health)
clnames[1] <- "CSA"
names(health) <- clnames

# Get variable names
library(dplyr)

gsub("_[[:digit:]]*","",names(health))[-1] -> variables
variables[variables == "mort1"] <- "mort01"
variables <- sapply(variables, function(x){
  substr(x, start = 1, stop = nchar(x) - 2)
})
unname(variables) -> variables
unique(variables) -> var.names


setdiff(names(health), grep(paste0("mort"), names(health), value = T)) ->rm.mort

#Change from short to long

health.long <- lapply(var.names[var.names!= "mort"], function(x){
  #get the columns
  columns <- grep(paste0(x),rm.mort, value = T)
  
  #get the time in years
  time <- sapply(gsub("[^_[:digit:]]","",columns), function(x){
    substr(x, start = nchar(x)-1, stop = nchar(x))
  }
  )
  unname(time) -> time
  
  #Select the columns
  subset(health, select = c("CSA",columns)) -> dat.h
  
  n <- dim(dat.h)[1]
  
  # print(length(time))
  dat.h <- tidyr::gather(dat.h, variable, value, -CSA)
  dat.h$time <- rep(as.numeric(time), each = n)
  dat.h
  # data.frame(tidyr::gather(dat.h, variable, value, -CSA), time = time)
  # rbind(cbind(tidyr::gather(dat.h, variable, value, -CSA), time))
  
})

plyr::ldply(health.long, data.frame) -> health.long


#get the columns
columns <-  sapply(strsplit(grep(paste0("mort"),names(health), value = T),"_"),function(x) x[1])
unique(columns) -> columns

health.long2 <- lapply(columns, function(x){
  
  #get the time in years
  time <- sapply(strsplit(grep(paste0("^", x, "_"),
                               names(health), value = T),"_"),function(x) x[2])
  time <- as.numeric(time)
  
  #Select the columns
  subset(health, select = c("CSA",
                            grep(paste0("^", x, "_"), names(health), value = T))
  ) -> dat.h
  
  n <- dim(dat.h)[1]
  
  # print(length(time))
  dat.h <- tidyr::gather(dat.h, variable, value, -CSA)
  dat.h$time <- rep(as.numeric(time), each = n)
  dat.h
  # data.frame(tidyr::gather(dat.h, variable, value, -CSA), time = time)
  # rbind(cbind(tidyr::gather(dat.h, variable, value, -CSA), time))
  
})

plyr::ldply(health.long2, data.frame) -> health.long2

health.long <- rbind(health.long, health.long2)

rm(health.long2, columns, rm.mort, var.names, variables, new_loc)

#Merge
health.sub <- subset(health, select = c("CSA", "LifeExp11", 
                                        "LifeExp12", "LifeExp13",
                                        "LifeExp14"))


inner_join(health.sub, neighbhd_csa) %>% #This will have the same number of rows as neighbhd_csa since block neighborhood combinations are unique
  inner_join(dat1) -> merged.h_n         #This will have more rows since dat1 each block in property.csv has multiple streets

# Mantel's Test
#we need CSA to be unique so get the median longitude and latitude per CSA

merged.h_n %>% group_by(CSA) %>%
  summarise(lon.med = median(lon), lat.med = median(lat)) -> mtdata

#note that the number of rows for mtdata is the same as the number of CSA's in our data. Furthermore the number of CSA from BNIA is the same (Baltimore city is not a CSA!!! And so should not be used in the calculations!)

mtdata %>% inner_join(health.sub) -> mtdata

#Testing
# csa.dists <- dist(cbind(mtdata$lon.med, mtdata$lat.med), method = "euclidean")
csa.dists <- geosphere::distm(cbind(mtdata$lon.med, mtdata$lat.med), fun = geosphere::distVincentyEllipsoid)
csa.dists <- as.dist(csa.dists)
save.image("corr.RData")
le14.dists <- dist(mtdata$LifeExp14, method = "euclidean")


# plot(ade4::mantel.rtest(csa.dists, le11.dists, nrepet = 9999))
result <- ade4::mantel.randtest(csa.dists, le14.dists, nrepet = 9999)
print(result)
plot(result, main = "Mantel's test")


detach("package:dplyr", unload=TRUE)


setwd(file.path(".."))
```

Based on these results, we can reject the null hypothesis that these two matrices, spatial distance and life expectancy distance (2014), are unrelated at alpha = 0.05. The observed correlation, r = 0.13, suggests that the matrix entries are positively associated, so pairs of CSAs that are close to each other generally show smaller differences in life expectancy than pairs that are far apart. Note that since this test is based on random permutations, the same code will always arrive at the same observed correlation but rarely the same p-value. Furthermore, I ran this test for all four years in the dataset and the conclusions are consistent. If you are interested in the correlation values for those years, [here is the code](https://github.com/ksosina/BLE/blob/master/R%20code/check_sp_corr.R).

Moran's I [@moran] is useful when one wants to know the correlation of a variable with itself through space, i.e. the extent to which the occurrence of an event in an areal unit makes the occurrence of an event in a neighbouring areal unit more or less likely. For example, if life expectancy is low in the north, does that mean we are likely to see low life expectancy elsewhere in the same region? The null hypothesis is that there is no spatial autocorrelation.
```{r Moran, echo=FALSE, cache=F, message=FALSE, warning=F}
setwd(file.path(".", "Data"))


attach("corr.RData")
csa.dists <- as.matrix(csa.dists)
csa.dists.inv <- 1/csa.dists # not solve(csa.dists)
diag(csa.dists.inv) <- 0

#2014
result <- ape::Moran.I(mtdata$LifeExp14, csa.dists.inv)
print(result)

detach()
unlink("corr.RData")

setwd(file.path(".."))
```
Based on these results, we can reject the null hypothesis that there is zero spatial autocorrelation present in life expectancy at the 5\% level of significance.
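
As a rough illustration of what the statistic measures, the sketch below computes Moran's I by hand on synthetic data with inverse-distance weights. The coordinates, the outcome `x`, and the helper `moran_i` are assumptions made for this example only; packaged implementations may standardise the weights differently, so their values need not match this sketch exactly.

```r
# Moran's I by hand on synthetic data (base R only).
set.seed(2016)
n <- 40
coords <- cbind(runif(n), runif(n))      # hypothetical locations
x <- 70 + 10 * coords[, 2] + rnorm(n)    # hypothetical outcome with a north-south trend

# Inverse-distance weights with a zero diagonal
w <- 1 / as.matrix(dist(coords))
diag(w) <- 0

# I = (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

i_obs <- moran_i(x, w)
i_null <- -1 / (n - 1)                   # expected value under no autocorrelation

# Permutation reference distribution: shuffle the values over the locations
i_perm <- replicate(999, moran_i(sample(x), w))
p_value <- (sum(i_perm >= i_obs) + 1) / (999 + 1)
```

Values of `i_obs` well above `i_null` point to positive spatial autocorrelation, which is the pattern the test above is looking for.
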
<!-- For more tests using data from 2011 to 2014 please check [here]((https://github.com/ksosina/BLE/blob/master/R%20code/check_sp_corr.R)). -->

### Regression Model for spatial data


#### Geographically Weighted Regression (GWR)
* The structure of the model does not remain constant over the study area (there are local variations in the parameter estimates)
* To account for this potential spatial heterogeneity we use the GWR model [@fother]
* GWR permits the parameter estimates to vary locally.

##### GWR
This model uses a weighted least squares (WLS) approach to account for spatial heterogeneity. At location $(u_i,v_i)$ the model is $$Y_i = X_i\beta_i +\epsilon_i$$ where $X_i$ is the $i$th row of the design matrix $X$ and $\beta_i$ is estimated using the WLS approach. Thus $$\hat{\beta}_i = (X^TW(u_i,v_i)X)^{-1}X^TW(u_i,v_i)Y$$ where $W(u_i,v_i)$ is the spatial weight matrix, which is based on the distance between observations. Using the approach of [@fother], $W(u_i,v_i)$ is an $n \times n$ diagonal matrix denoting the spatial weighting of each observation point for model calibration at location $(u_i,v_i)$. The spatial weights are specified through three choices: 1. the type of distance function used, e.g. the great-circle distance; 2. the kernel function, that is, how to relate the distances to weights; and 3. the bandwidth, i.e. how many neighborhoods to use. If we use the Gaussian kernel, the $j$th diagonal element of $W(u_i,v_i)$ is $$w_{ij} = \exp\left(-\dfrac{1}{2}\left(\dfrac{d_{ij}}{b}\right)^2\right)$$ where $d_{ij}$ is the distance between location $i$ and observation $j$ and $b$ is the bandwidth. The model used in this paper is based on the Gaussian kernel.
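
The local WLS estimate can be written out in a few lines of base R. The following is a minimal sketch on synthetic data, assuming Euclidean distances and an arbitrary fixed bandwidth `b`; the names `coords`, `gwr_beta` and `betas` are hypothetical, and this is not the fitting routine used for the results in this report.

```r
# Local weighted least squares with a Gaussian kernel (base R only).
set.seed(42)
n <- 50
coords <- cbind(u = runif(n), v = runif(n))   # hypothetical locations (u_i, v_i)
x <- rnorm(n)
# Synthetic outcome whose slope drifts with the u coordinate
y <- 1 + (2 + 3 * coords[, "u"]) * x + rnorm(n, sd = 0.1)

X <- cbind(1, x)   # design matrix with an intercept column
b <- 0.2           # bandwidth, chosen arbitrarily for this sketch

# Estimate beta_i = (X' W_i X)^{-1} X' W_i y at observation i
gwr_beta <- function(i) {
  d <- sqrt(rowSums(sweep(coords, 2, coords[i, ])^2))  # distances d_ij
  w <- exp(-0.5 * (d / b)^2)                           # Gaussian kernel weights w_ij
  W <- diag(w)
  solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
}

# One row of local coefficients (intercept, slope) per location
betas <- t(sapply(seq_len(n), gwr_beta))
```

Each row of `betas` is the same fit that `lm(y ~ x, weights = w)` would give with the kernel weights for that location, which is why the coefficients are allowed to vary over the study area.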

#### Model selection
##### Data
First I divided my data into training and testing datasets. All the model selection procedures were then performed on the training dataset. To obtain candidate models for the GWR method, I used stepwise model selection under four scenarios using ordinary least squares regression:

|                                           | Criteria = AIC | Criteria = BIC |
|-------------------------|:--------------:|:--------------:|
| Force variables in model selection  = Yes |       st1      |       st2      |
| Force variables in model selection = No   |       st3      |       st4      |

Here the forced variables are the aggregated CSA-level variables for which we have street block level information: propfemhh, totalincidents, and prop.vacant (as defined above). From these four models, I then selected the one that minimised the prediction error, estimated by repeated 5-fold cross-validation on the training data, i.e. the model with the best prediction performance.
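
The four scenarios in the table can be sketched with base R's `step()` on the built-in `mtcars` data. This is only an illustration under stated assumptions: `wt` and `hp` are arbitrary stand-ins for the forced variables, the helper `cv_rmse` is hypothetical, and it uses a plain root mean squared error from a single 5-fold split rather than the trimmed, repeated cross-validation used in the analysis.

```r
# Stepwise selection under four scenarios, compared by cross-validation (base R only).
full <- lm(mpg ~ ., data = mtcars)
forced <- mpg ~ wt + hp          # stand-in for the forced variables
n <- nrow(mtcars)
scope <- list(lower = forced, upper = formula(full))

fits <- list(
  st1 = step(full, scope = scope, trace = 0),              # AIC, forced
  st2 = step(full, scope = scope, trace = 0, k = log(n)),  # BIC, forced
  st3 = step(full, trace = 0),                             # AIC, not forced
  st4 = step(full, trace = 0, k = log(n))                  # BIC, not forced
)

# K-fold cross-validated root mean squared prediction error for a fitted lm
cv_rmse <- function(fit, data, k = 5, seed = 1234) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula(fit))[1]
  err <- numeric(nrow(data))
  for (j in seq_len(k)) {
    refit <- lm(formula(fit), data = data[fold != j, ])
    err[fold == j] <- data[[response]][fold == j] -
      predict(refit, newdata = data[fold == j, ])
  }
  sqrt(mean(err^2))
}

scores <- sapply(fits, cv_rmse, data = mtcars)
best <- names(which.min(scores))   # candidate with the smallest CV error
```

Setting `k = log(n)` in `step()` switches the penalty from AIC to BIC, and the `lower` scope is what keeps the forced variables in the model.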

```{r modsel, echo = F, message=FALSE, warning=F}
setwd(file.path(".", "Data"))


## Load spatial packages

library(sp)           ## Data management
library(dplyr)        ## Data management
library(spdep)        ## Spatial autocorrelation
library(gstat)        ## Geostatistics
library(spgwr)        ## GWR
library(ggplot2)      ## Plotting
library(broom)        ## Data management
library(GWmodel)      ## Predict GWR
library(cvTools)      ## Compare fits using cross validation


attach("analyses_data.RData")

#Data formatting
names(csa.data.anal) <- gsub("[[:digit:]]", "", names(csa.data.anal))
names(block_data.anal) <- gsub("[[:digit:]]", "", names(block_data.anal))

BG_neighbhd_csa -> BG_neighbhd

names(BG_neighbhd_csa) <- tolower(names(BG_neighbhd_csa))
BG_neighbhd_csa[,c(2,3,6,9:13,15)] -> my_data

my_data %>%
  inner_join(block_data.anal[,-c(4,5,15:16)]) -> my_block

#Center and scale all the predictors
my_block_raw <- my_block
my_block <- data.frame(my_block[,c(10,1:2)],apply(my_block[,-c(1:2, 6:7,10)], 2, scale), my_block[,6:7])


BG_neighbhd_csa[,c(2,3,6,9:13,15)] -> my_data

my_data %>%
  inner_join(block_data.anal[,c(1:3)]) %>%
  dplyr::select(-neighborhood, -block) %>% 
  group_by(csa) %>%
  summarise_all(mean) -> my_data


my_data %>% 
  inner_join(csa.data.anal[,-c(4:6, 27:28)])-> my_csa

# Strip the percent signs and convert these columns to numeric
my_csa$birthwt <- as.numeric(gsub("%", "", my_csa$birthwt))
my_csa$cashsa <- as.numeric(gsub("%", "", my_csa$cashsa))
my_csa$fore <- as.numeric(gsub("%", "", my_csa$fore))

#Center and scale all the predictors
my_csa <- data.frame(my_csa[,c(1,9)],apply(my_csa[,-c(1, 9,5:6)], 2, scale), my_csa[,5:6])

rm(my_data)


#Randomly Divide Data into Training and Test Sets
get.test_train <- function(p = 0.3, data, seed = 1234){
  set.seed(seed = seed)
  n <- NROW(data)
  test <- sample(1:n, p*n)
  train <- !c(1:n) %in% test
  list(train_data = data[train,], test_data = data[test,])
}
get.test_train(0.30, data = my_csa, seed = 1234) -> dat

########
## GWR (WARNING: This may take a while to run)
########

#### Distance Conversion         ####
#####################################


## Function: Convert km to degrees
km2d <- function(km){
  out <- (km/1.852)/60
  return(out)
}


col.lm <- lm(lifeexp~.,data=dat$train_data[,-c(1,29:30)])
# col.lm <- lm(lifeexp~.,data=my_csa[,-c(1,29:30)])


# Stepwise using AIC
st1 <- step(col.lm, 
            scope = list(lower = lifeexp ~ propfemhh + totalincidents + prop.vacant, upper = col.lm),
            trace = 0)
st3 <- step(col.lm,trace = 0)
# Stepwise using BIC
n <- NROW(dat$train_data)
# n <- NROW(my_csa)
st2 <- step(col.lm, 
            scope = list(lower = lifeexp ~ propfemhh + totalincidents + prop.vacant, upper = col.lm),
            trace = 0, k = log(n))

st4 <- step(col.lm, trace = 0, k = log(n))

# set up folds for cross-validation
set.seed(1234)
folds <- cvFolds(nrow(dat$train_data), K = 5, R = 10)
# folds <- cvFolds(nrow(my_csa), K = 5, R = 10)

cvFitLm1 <- cvLm(st1, cost = rtmspe, 
                 folds = folds, trim = 0.1)
cvFitLm2 <- cvLm(st2, cost = rtmspe, 
                 folds = folds, trim = 0.1)
cvFitLm3 <- cvLm(st3, cost = rtmspe, 
                 folds = folds, trim = 0.1)
cvFitLm4 <- cvLm(st4, cost = rtmspe, 
                 folds = folds, trim = 0.1)

print(cvSelect(st1 = cvFitLm1, st2 = cvFitLm2, st3 = cvFitLm3, st4 = cvFitLm4))
print(summary(st4))

save.image("analyses_data.RData")
detach()
setwd("..")
```


#### Methods for Downscaling
* Delta method:
Here, we first find the model that best fits the aggregated data. Then, for each block:

    1. We predict what the life expectancy would be after we remove one of the blocks from the aggregated data, call this $$ T_{-b} = E(Y)_{-b} $$ 
    2. Then we find the delta in predicted life expectancy at the CSA level due to the removed block as $$ T_{\delta_b} = T_{full} - T_{-b} $$ Call this delta the change in the mean life expectancy at the CSA level due to that block.
    3. Add the delta to the observed life expectancy at that CSA. Call this the predicted life expectancy due to that block.
    
Note that this inherently assumes that the observed life expectancy at a CSA is the true underlying life expectancy and that all the blocks in the neighborhoods that belong to that CSA vary about it.

* Transfer function: Find which aggregated predictors provide the best fit, then use a "transfer function" to map the aggregated variables to the block level and use the resulting values as predictors to get block level estimates. For this scenario, I centered and scaled both the street block level data and the aggregated CSA level data, i.e. I subtracted the mean of each variable (taken over the whole dataset) from itself and divided the centered variable by its standard deviation. My aim was to get estimates that were invariant to the fact that the CSA level data was an aggregated form of the street block level data, so my transfer function here is just the identity function on the mean-centered and scaled variables. Then, using the GWR model above, I predicted [@gwmodel] the life expectancy for a given street block.
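
The three steps of the delta method can be sketched as follows. This is a toy version under stated assumptions: the data are synthetic, a plain `lm` stands in for the GWR model, there is a single predictor `x`, and all object names (`blocks`, `csa`, `block_le`) are hypothetical.

```r
# Leave-one-block-out delta method on synthetic data (base R only).
set.seed(7)
blocks <- data.frame(
  csa = rep(c("A", "B", "C", "D"), each = 5),   # hypothetical CSA labels
  x   = rnorm(20),                              # hypothetical block-level predictor
  stringsAsFactors = FALSE
)

# Aggregate blocks to the CSA level and simulate an observed CSA outcome
csa_x <- tapply(blocks$x, blocks$csa, mean)
csa <- data.frame(csa = names(csa_x), x = as.numeric(csa_x),
                  stringsAsFactors = FALSE)
csa$lifeexp <- 75 - 2 * csa$x + rnorm(4, sd = 0.5)

# Model fitted on the aggregated data, and its prediction with all blocks (T_full)
fit <- lm(lifeexp ~ x, data = csa)
t_full <- predict(fit, newdata = csa)
names(t_full) <- csa$csa

blocks$block_le <- NA_real_
for (i in seq_len(nrow(blocks))) {
  # Step 1: re-aggregate the CSA without block i and predict (T_{-b})
  keep <- blocks$csa == blocks$csa[i]
  keep[i] <- FALSE
  t_minus <- predict(fit, newdata = data.frame(x = mean(blocks$x[keep])))

  # Step 2: delta attributable to block i (T_full - T_{-b})
  delta <- t_full[[blocks$csa[i]]] - t_minus

  # Step 3: add the delta to the observed CSA life expectancy
  observed <- csa$lifeexp[csa$csa == blocks$csa[i]]
  blocks$block_le[i] <- observed + delta
}
```

In this toy setting, with a linear model and mean aggregation, the block-level values average back to the observed CSA value, which reflects the assumption that blocks vary about the observed CSA life expectancy.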

```{r pred, echo=FALSE, message=FALSE, warning=FALSE,tidy=TRUE, include=TRUE, eval = T}
setwd(file.path(".", "Data"))

## Load spatial packages

library(sp)           ## Data management
library(dplyr)        ## Data management
library(spdep)        ## Spatial autocorrelation
library(gstat)        ## Geostatistics
library(spgwr)        ## GWR
library(ggplot2)      ## Plotting
library(broom)        ## Data management
library(GWmodel)      ## Predict GWR
library(cvTools)      ## Compare fits using cross validation


attach("analyses_data.RData")


########
## GWR (WARNING: This may take a while to run)
########

#### Distance Conversion         ####
#####################################


## Function: Convert km to degrees
km2d <- function(km){
  out <- (km/1.852)/60
  return(out)
}


## Prediction

#Convert to spdataframe
coordinates(csa.data.anal) <- ~lon.med.avg + lat.med.avg

coordinates(dat$train_data) <- ~lon + lat


bw <- bw.gwr(scale(lifeexp,scale = F) ~ propfemhh + propbelow.pred + susp + liquor + elheat,
             data = dat$train_data, approach="CV",kernel="gaussian",
             adaptive=F, p=2, theta=0, longlat=T)
coordinates(my_block) <- ~lon + lat
pred.b <- gwr.predict(scale(lifeexp,scale = F) ~ propfemhh + propbelow.pred + susp + liquor + elheat,
                      data = dat$train_data, 
                      predictdata = my_block,
                      bw = bw, kernel = "gaussian",adaptive = F, longlat = T)


m <- mean(dat$train_data$lifeexp)
data.frame(my_block, pred = pred.b$SDF$prediction + m) %>%
  group_by(csa) %>%
  summarise(lifeexp.pred = mean(pred)) -> csa_pred


csa_pred %>%
  inner_join(dat$test_data[,1:2]) -> test_train.csa

csa_pred$id <- as.character(0:54)
csa_pred %>%
  inner_join(data.frame(my_block, pred = pred.b$SDF$prediction + m, predvar = pred.b$SDF$prediction_var)) %>%
  inner_join(subset(my_csa, select = c(csa, lifeexp))) %>%
  select(id,csa,neighborhood, block,lifeexp, pred,predvar, lon,lat) -> block_pred

#Plotting the variance and the prediction

gor <- rgdal::readOGR(file.path("wip", "census"), "census", verbose = F)
gor <- sp::spTransform(gor, sp::CRS("+proj=longlat +datum=WGS84"))
gor <- broom::tidy(gor)


# block_pred <- block_pred[block_pred$csa %in% dat$test_data$csa,]
data <- subset(block_pred, select = -c(lon,lat))

gor %>% dplyr::inner_join(data, by = "id") -> gor


p <- ggplot(data=gor, aes(x=long, y=lat))

# set.seed(1234)
# cols <- sample(colours(distinct = TRUE), nrow(gor))

p1 <- p + geom_point(data = block_pred, aes(x = lon, y = lat, colour = pred)) +
  geom_polygon(aes(group = group),color="black", alpha=0) +
  coord_quickmap() +
  labs(title = "Predicted life expectancy\n for each street block in Baltimore City",
       x = "Longitude",
       y = "Latitude") +
  scale_color_gradient2(name = "Predicted\nvalues", midpoint = 74, mid = "yellow", low = "blue", high = "red") +
  # scale_color_continuous() +
  # facet_grid(. ~ind, labeller = labeller(ind = type)) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 20),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        strip.text = element_text(size = 15),
        title = element_text(size = 18),
        # legend.position = "none",
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        legend.background = element_rect(fill ="transparent"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.position = c(0.15, 0.2))


p3 <- p + geom_point(data = block_pred, aes(x = lon, y = lat, colour = predvar)) +
  geom_polygon(aes(group = group),color="black", alpha=0) +
  coord_quickmap() +
  labs(title = "Variability of the predicted life expectancy\n for each street block in Baltimore City",
       x = "Longitude",
       y = "Latitude") +
  scale_color_gradient2(name = "Variance",midpoint = 7, mid = "yellow", low = "blue", high = "red") +
  # facet_grid(. ~ind, labeller = labeller(ind = type)) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 20),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        strip.text = element_text(size = 15),
        title = element_text(size = 18),
        # legend.position = "none",
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        legend.background = element_rect(fill ="transparent"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.position = c(0.15, 0.2))

# gridExtra::grid.arrange(p1,p3,ncol=2)


ggsave(file.path("..", "Plots","fitblock.png"), gridExtra::arrangeGrob(p1, p3,ncol=2),width = 45, height = 45, units = "cm")

csa_pred %>% inner_join(subset(my_csa, select = c(csa,lifeexp))) %>% 
  select(-id) -> line.fits

stack(line.fits, -csa) -> line.fits

line.fits$id <- rep(csa_pred$id, 2)

l <- c("Observed", "Predicted")
p4 <- ggplot(data = line.fits,aes(x=id,y = values, group = ind, color = ind)) + geom_line() +
  labs(title = "Fit of the model to the data in whole dataset", x = "CSA", colour = "Life expectancy") +
  scale_color_discrete(labels = l) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 20),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        strip.text = element_text(size = 15),
        title = element_text(size = 18),
        # legend.position = "none",
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        legend.background = element_rect(fill ="transparent"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.position = c(0.3, 0.8))

csa_pred %>% inner_join(subset(dat$test_data, select = c(csa,lifeexp))) %>% 
  select(-id) -> line.fits

stack(line.fits, -csa) -> line.fits

line.fits$id <- rep(csa_pred$id[csa_pred$csa %in% dat$test_data$csa], 2)


p5 <- ggplot(data = line.fits,aes(x=id,y = values, group = ind, color = ind)) + geom_line() +
  labs(title = "Fit of the model to the data in the testing dataset", x = "CSA", colour = "Life expectancy") +
  scale_color_discrete(labels = l) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 20),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        strip.text = element_text(size = 15),
        title = element_text(size = 18),
        # legend.position = "none",
        panel.background = element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        legend.background = element_rect(fill ="transparent"),
        legend.key = element_rect(fill = "transparent", color = "transparent"),
        legend.position = c(0.3, 0.8))

ggsave(file.path("..", "Plots","fitlines.png"), gridExtra::arrangeGrob(p5, p4,ncol=2),width = 50, height = 45, units = "cm")


# Delta method: leave-one-block-out predictions


#Get blocks and neighborhoods in the whole dataset
my_block_cv <- my_block_raw
n <- NROW(my_block_cv)

arrange(my_block_cv,csa,neighborhood,block)-> my_block_cv
my_block_cv$pred <- NA
my_block_cv$predvar <- NA

##########################
## Warning: this takes a while to run
##########################

for (i in 1:n){
  #Remove all values relating to block i along with the NA column for pred
  dat_cv <- my_block_cv[-i,-c(20:21)]
  
  #Aggregate to csa without block i
  dat_cv %>%
    dplyr::select(-neighborhood, -block) %>%
    group_by(csa) %>%
    summarise_all(mean) -> my_dat
  
  #Center and scale the aggregated variables
  my_dat <- data.frame(my_dat[,c(1)],apply(my_dat[,-c(1,5:6)], 2, scale), my_dat[,5:6])
  
  #Convert to spatialpointdataframe
  coordinates(my_dat) <- ~lon + lat
  
  #Get the prediction at the CSA relating to block i
  pred <- gwr.predict(scale(lifeexp,scale = F) ~ propfemhh + propbelow.pred + susp + liquor + elheat, 
                      data = dat$train_data,
                      predictdata = my_dat,
                      bw = bw, kernel = "gaussian",adaptive = F, longlat = T)
  
  #Attach the CSA level predictions (made without block i) to every block, then keep the values for block i
  my_block_cv[,-c(20:21)] %>%
    inner_join(data.frame(csa = my_dat$csa,pred = pred$SDF$prediction, predvar = pred$SDF$prediction_var)) -> output
  my_block_cv$pred[i] <- output$pred[i]
  my_block_cv$predvar[i] <- output$predvar[i]
}

#Get blocks and neighborhoods in the csa test dataset
my_block_cv <- my_block_cv[my_block_cv$csa %in% dat$test_data$csa, ]
coordinates(dat$test_data) <- ~lon + lat

pred.csa <- gwr.predict(scale(lifeexp,scale = F) ~ propfemhh + propbelow.pred + susp + liquor + elheat, 
                        data = dat$train_data, predictdata = dat$test_data,
                        bw = bw, kernel = "gaussian",adaptive = F, longlat = T)

data.frame(csa = dat$test_data@data[,1], pred.csa = pred.csa$SDF$prediction, lifeexp = dat$test_data@data[,2]) %>%
  inner_join(subset(my_block_cv, select = c(csa,neighborhood,block, pred))) %>%
  dplyr::select(csa,neighborhood,block, lifeexp, pred, pred.csa) %>%
  mutate(delta = pred.csa - pred, block.le = lifeexp + delta) %>%
  unique() -> block.pred


#Aggregate the results
block.pred %>%
  dplyr::select(-neighborhood, -block) %>%
  dplyr::group_by(csa) %>%
  dplyr::summarise_all(mean) -> csa_block.pred

m <- mean(dat$train_data@data$lifeexp)
csa_block.pred[,c(1,2,6,4)] %>%
  dplyr::mutate(pred.csa = pred.csa+m ) -> csa_block.pred

#Plotting
map <- ggmap::get_map(location = "Baltimore City", zoom = 12, maptype = "roadmap" )
p <- ggmap::ggmap(map)
gor <- rgdal::readOGR(file.path("wip", "census"), "census", verbose = F)
gor <- sp::spTransform(gor, sp::CRS("+proj=longlat +datum=WGS84"))
gor <- broom::tidy(gor)

data <- as.data.frame(dat$test_data)

data %>% 
  dplyr::select(csa, lifeexp, lon, lat) %>%
  dplyr::inner_join(csa_block.pred) %>%
  dplyr::inner_join(test_train.csa)  %>%
  dplyr::mutate(pred.csa = lifeexp.pred)  %>%
  dplyr::select(csa, lifeexp, block.le,pred.csa,lon, lat) -> data


test <- subset(as.data.frame(my_csa), select = c(csa, lifeexp))
test$pred.csa <- NA
test$block.le <- NA
test$lifeexp <- NA
test$lifeexp[test$csa %in% data$csa] <- data$lifeexp
test$pred.csa[test$csa %in% data$csa] <- data$pred.csa
test$block.le[test$csa %in% data$csa] <- data$block.le
stack(test, select = -csa) -> test
# dplyr::bind_rows(test[,1:2],test[,c(1,3)], test[,c(1,4)]) -> test

test$id <- rep(as.character(0:54), 3)
gor %>% dplyr::inner_join(test, by = "id") -> gor

type <- c(
  block.le = "Aggregated CSA estimates\n life expectancy from\n leave one outs",
  pred.csa = "Aggregated CSA estimates\n life expectancy from\n transfer method",
  lifeexp = "Observed life expectancy"
)

p <- p + 
  geom_polygon(data=gor, aes(x=long, y=lat, group = group,fill = values), color="red", alpha=0.6) +
  # geom_text(data=data,
  #           aes(label = as.factor(csa)), 
  #           colour="Black",size=2,hjust="center", 
  #           vjust="center") +
  coord_quickmap() +
  labs(title = "Predicted life expectancy compared to observed\n life expectancy per CSA in the testing dataset using two different methods ",
       x = "Longitude",
       y = "Latitude") +
  # theme(legend.position = "none") +
  scale_fill_gradient2(name = "Values", midpoint = 75, mid = "yellow", low = "brown", high = "red") +
  facet_grid(. ~ind, labeller = labeller(ind = type)) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 20),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        strip.text = element_text(size = 15),
        title = element_text(size = 18))

p <- p + theme(panel.spacing = unit(0.2, "in"))  # panel.margin is deprecated in newer ggplot2
p <- p +  coord_fixed(ratio= 16.4/6.5)
ggsave(filename = file.path("..", "Plots", 
                            "pred.png"), plot = p,
       width = 45, height = 45, units = "cm")
  

# print(NROW(block_pred))
# ggsave(file.path("..","Plots","pred.png"), width = 45, height = 45, units = "cm")

detach()
unlink("analyses_data.RData")
setwd("..")
```
<!-- ![](./Plots/pred.png)  -->
<!-- The plot above shows what the observed life expectancy was in the testing dataset compared to the two methods mentioned above. The grey polygons represent CSA's that are in the training dataset, while the colored regions represent CSA's that are in the testing dataset. -->
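
The plotting chunk above reshapes the per-CSA results from wide to long with `stack()` before faceting on the resulting `ind` column. The following is a minimal sketch of that reshaping step on made-up data. The toy CSA names and values are illustrative assumptions, not results from this analysis.

```r
# Toy stand-in for the CSA-level results; all values are made up.
toy <- data.frame(
  csa      = c("A", "B", "C"),
  lifeexp  = c(70.1, NA, 76.4),  # observed (NA = CSA not in the testing set)
  pred.csa = c(71.0, NA, 75.2),  # transfer-method estimate
  block.le = c(69.5, NA, 77.0)   # leave-one-out estimate
)

# stack() collapses the three numeric columns into a single `values` column
# plus an `ind` factor naming the source column, which is the shape that
# facet_grid(. ~ ind) expects.
long <- stack(toy, select = -csa)

# stack() drops the csa column, so rebuild the identifier by repeating it
# once per stacked column (columns are stacked in their original order).
long$csa <- rep(toy$csa, times = 3)
```

Because the identifier is rebuilt purely by position, this only works if no rows are reordered or dropped between the wide and long forms.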
 

## Datasets

| Name                                    | Information                                                                                                                                                                                                             | Type      | Data Source                                                                                          | Geographic Scale   | Date        |
|-----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|------------------------------------------------------------------------------------------------------|--------------------|-------------|
| Real Property Taxes                     | Contains information on which streets belong to which block and in what neighborhood along with their longitude and latitude. Also has information on police district.                                                  | Dataset   | [Baltimore city website](https://data.baltimorecity.gov)                                             | Street level       | 2016        |
| Real Property                           | Contains the City of Baltimore parcel boundaries, with ownership, address, valuation and other property information. Furthermore, it also contains street block definitions.                                            | Dataset   | [Baltimore gisdata website](http://gisdata.baltimorecity.gov)                                        | Street level       | 2016        |
| Census Block                            | GIS shapefile which has information on census block designation for 2010                                                                                                                                                | Shapefile | [Maryland Department of Planning](http://planning.maryland.gov/msdc/S5_Map_GIS.Shtml)                | Block level        | 2010        |
| Neighborhood                            | Polygon feature representing the boundaries of Baltimore City's neighborhoods as of the year 2010                                                                                                                       | Shapefile | [Baltimore city website](https://data.baltimorecity.gov)                                             | Neighborhood level | 2010        |
| Census Demographics for 2010 to 2014    | Contains neighborhood level demographics data                                                                                                                                                                           | Dataset   | [Baltimore Neighborhood Indicators Alliance BNIA-JF](http://bniajfi.org/vital_signs/data_downloads/) | Neighborhood level | 2010 - 2014 |
| Children and Family Health & Well-Being | Has information on life expectancy for 2010 to 2014                                                                                                                                                                     | Dataset   | [Baltimore Neighborhood Indicators Alliance BNIA-JF](http://bniajfi.org/vital_signs/data_downloads/) | Neighborhood level | 2010 - 2014 |
| BNIA Vital Signs Codebook               | Contains information on short variable names and their corresponding full names, along with their sources for each dataset                                                                                              | Dataset   | [Baltimore city website](https://data.baltimorecity.gov)                                             | Neighborhood level | 2016        |
| Housing and Community Development       | Has information on the state of households in Baltimore City, namely: Number of Homes Sold, Percentage of Residential Properties that are Vacant and Abandoned, Percent Residential Properties that do Not Receive Mail, etc. | Dataset   | [Baltimore Neighborhood Indicators Alliance BNIA-JF](http://bniajfi.org/vital_signs/data_downloads/) | Neighborhood level | 2010 - 2014 |
| BNIA Data linking CSA to Neighborhoods  | Has information on which neighborhoods belong to which CSA. Note that a neighborhood may belong to more than one CSA | Dataset   | [Baltimore Neighborhood Indicators Alliance BNIA-JF](http://bniajfi.org/mapping-resources/) | CSA and Neighborhood level | 2010 |
| Census Bureau                           | Has information at the block group level. This includes information on family types, poverty status, and median household income | Dataset   | [American FactFinder](https://www.census.gov/acs/www/data/data-tables-and-tools/american-factfinder/) | Census tract and block group level | 2014 |
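
Because a neighborhood may belong to more than one CSA, joining neighborhood-level data to the linking table can repeat a neighborhood's rows. The sketch below illustrates this on a hypothetical linking table. The neighborhood names, CSA names, and household counts are invented for illustration and do not come from the BNIA files.

```r
# Hypothetical linking table: neighborhood N2 straddles two CSAs.
link <- data.frame(
  neighborhood = c("N1", "N2", "N2", "N3"),
  csa          = c("CSA-A", "CSA-A", "CSA-B", "CSA-B")
)

# Hypothetical neighborhood-level measure.
nbhd <- data.frame(
  neighborhood = c("N1", "N2", "N3"),
  households   = c(100, 200, 300)
)

# Joining on neighborhood repeats N2 once per CSA it belongs to.
joined <- merge(nbhd, link, by = "neighborhood")

# Naive CSA totals therefore count N2's households in both CSAs.
csa_totals <- aggregate(households ~ csa, data = joined, FUN = sum)

# Flag neighborhoods listed under more than one CSA before aggregating.
shared <- names(which(table(link$neighborhood) > 1))
```

Here the naive CSA totals sum to more than the neighborhood total, so flagged neighborhoods need an explicit rule (for example, assigning each to a single CSA or splitting by weight) before aggregation.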


```{r version info}
devtools::session_info()
```

# References