Rich Pauloo
5/21/2018
The Global Landslide Catalog (GLC) was created to identify landslides triggered mainly by rainfall. The paper describing the dataset can be found here. We will start with an exploratory data analysis to get a broad overview of the data. Following this analysis, we will develop a research question and set out to answer it.
Load some packages for analysis
library(maps)
library(tidyverse) # conflicts with `maps` function `map`
Read in data
d <- readRDS("landslide.rds") # spatial points data frame
maps::map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90), mar=c(0,0,0,0), main = "Landslides in the World")
points(d, cex = .2, col = "red")
df <- d@data # get just the data
How large is our data and what does it contain? It looks like we have 3327 observations of landslides and 34 columns of attributes (potential predictor variables), excluding latitude and longitude, which are stored as coordinates in the spatial object.
dim(df)
## [1] 3327 34
colnames(df)
## [1] "objectid" "date" "time"
## [4] "country" "trigger" "hazardtype"
## [7] "hazardintensity" "injuries" "fatalities"
## [10] "population" "aspect" "bio01"
## [13] "bio02" "bio03" "bio04"
## [16] "bio05" "bio06" "bio07"
## [19] "bio08" "bio09" "bio10"
## [22] "bio11" "bio12" "bio13"
## [25] "bio14" "bio15" "bio16"
## [28] "bio17" "bio18" "bio19"
## [31] "elevation" "hillshade" "landcover"
## [34] "slope"
What dates does our data span? Our data spans nearly 20 years, from 1997 to 2015.
substr(df$date,1,4) %>% # extract years
sort() %>% # sort years
.[c(1,nrow(df))] # subset first and last year
## [1] "1997" "2015"
In the column names, we see that there are 19 bioclimatic variables (bio01 through bio19) from WorldClim: bio01 through bio11 are temperature indices, and bio12 through bio19 are precipitation indices.
There are a few issues with our categorical data, for instance, duplicate entries in the trigger, hazardtype, and hazardintensity variables.
unique(df$trigger)[c(10,15)]
## [1] "Continuous_rain" "Continuous_Rain"
unique(df$trigger)[c(9,16)]
## [1] "Snowfall_snowmelt" "Snowfall"
unique(df$trigger)[c(2,6)]
## [1] "unknown" "Unknown"
unique(df$trigger)[c(1,3)]
## [1] "Downpour" "Rain"
unique(df$trigger)[c(14,19)]
## [1] "Other" "other"
unique(df$hazardtype)[c(1,12)]
## [1] "Mudslide" "mudslide"
unique(df$hazardtype)[c(5,7)]
## [1] "Rock_Fall" "Rockfall"
unique(df$hazardtype)[c(3,14)]
## [1] "Landslide" "landslide"
unique(df$hazardintensity)[c(1,6)]
## [1] "Medium" "medium"
unique(df$hazardintensity)[c(2,8)]
## [1] "Very_large" "Extra Large"
unique(df$hazardintensity)[c(3,7)]
## [1] "Large" "large"
unique(df$hazardintensity)[c(4,9)]
## [1] "Small" "small"
Let’s correct these duplicate entries by collapsing them into common categories. We can take care of many of these issues by simply converting all of the characters to lowercase, and in the instances where that doesn’t work, we can replace values.
# mutate existing columns to lowercase, which fixes most encoding errors
df <- df %>%
mutate(trigger = tolower(trigger), # coerce characters to lowercase
hazardtype = tolower(hazardtype), # same as above
hazardintensity = tolower(hazardintensity)) # same as above
Combine the remaining similar entries.
library(stringr)
# replace terms in the trigger category
df$trigger <- df$trigger %>% str_replace("snowfall_snowmelt", "snowfall")
# replace terms in the hazardtype category
df$hazardtype <- df$hazardtype %>% str_replace("rock_fall", "rockfall")
# replace terms in the hazardintensity category
df$hazardintensity <- df$hazardintensity %>% str_replace("extra large", "very_large")
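As a quick sanity check (not part of the original code), we can confirm the collapsing worked by inspecting the remaining unique levels:
# inspect the cleaned category levels
sort(unique(df$trigger))
sort(unique(df$hazardtype))
sort(unique(df$hazardintensity))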
With these cleaned categorical variables, we can overwrite the data in our SpatialPointsDataFrame.
d@data <- df
Are there obvious spatial patterns in hazard triggers, type, or intensity?
# make a world basemap as an sf object
world <- sf::st_as_sf(maps::map('world', plot = FALSE, fill = TRUE))
# convert the spdf into an sf object for plotting
d_sf <- sf::st_as_sf(d)
# map of landslide triggers
d_sf %>%
ggplot() +
geom_sf(data = world) +
geom_sf(aes(color = trigger), alpha = .6) +
labs(title = "Landslide Triggers")
# map of hazard type
d_sf %>%
ggplot() +
geom_sf(data = world) +
geom_sf(aes(color = hazardtype), alpha = .6) +
labs(title = "Landslide Hazard Type")
# map of hazard intensity
d_sf %>%
ggplot() +
geom_sf(data = world) +
geom_sf(aes(color = hazardintensity), alpha = .6) +
labs(title = "Landslide Hazard Intensity")
Large and Very Large landslides are defined in Kirschbaum (2015) as landslides with moderate to high numbers of fatalities. Combined, Large and Very Large landslides account for less than 7% of the total landslides in this dataset…
df %>%
count(hazardintensity) %>%
filter(hazardintensity != "unknown") %>%
ggplot(aes(reorder(hazardintensity, n), n)) +
geom_col(fill = c("red", "red", "grey50","grey50")) +
coord_flip() +
labs(title = "Counts of Worldwide Landslide Hazard Intensity",
subtitle = "Period of Record: 1997-2015",
y = "Count",
x = "Hazard Intensity")
df %>%
filter(hazardintensity %in% c("large", "very_large")) %>% # filter for large and vl landslides
nrow() / nrow(df) # calculate proportion of these landslides
## [1] 0.06672678
…but account for roughly 84% of fatalities and 63% of injuries.
df %>%
filter(hazardintensity != "unknown") %>%
group_by(hazardintensity) %>%
summarise(fatalities = sum(fatalities),
injuries = sum(injuries)) %>%
mutate(prop_fatalities = (fatalities / sum(fatalities)) %>% round(2),
prop_injuries = (injuries / sum(injuries)) %>% round(2)) %>%
knitr::kable()
hazardintensity | fatalities | injuries | prop_fatalities | prop_injuries |
---|---|---|---|---|
large | 1837 | 432 | 0.12 | 0.62 |
medium | 2357 | 216 | 0.15 | 0.31 |
small | 136 | 44 | 0.01 | 0.06 |
very_large | 11390 | 7 | 0.72 | 0.01 |
Let’s combine large and very large landslides into a category called disaster, and small and medium landslides into another category called non-disaster. Hazard intensity will be our binary response variable in the model.
# re-encode "large" and "very large" landslides "disaster", and "small" and "medium" landslides as "non-disaster"
df %>%
mutate(severity = ifelse(hazardintensity == "very_large", "disaster",
ifelse(hazardintensity == "medium", "non-disaster",
ifelse(hazardintensity == "large", "disaster",
ifelse(hazardintensity == "small", "non-disaster", hazardintensity))))) -> df
# now overwrite data in d with this new dataframe
d@data <- df
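Before modeling, it's worth noting how imbalanced the severity classes are; a quick tabulation (a small addition, using the df object above) makes this explicit:
# tabulate the new binary severity classes (the "unknown" class is removed below)
df %>% count(severity)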
Where are the disasters located? It looks like they're well-distributed throughout the world, with a large cluster in Oceania and along the Himalaya mountain range between India and China. The obvious spatial dependency of landslide intensity suggests that "continent" might be a feature class that improves the prediction accuracy of the model. If time permits, this feature class will be added.
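As a rough sketch of how such a feature could be derived (hypothetical, not part of this analysis), the countrycode package maps country names to continents; the country strings here may need some cleaning first:
# hypothetical sketch: derive a continent feature from the country column
# (assumes the countrycode package; underscores in country names are replaced with spaces first)
library(countrycode)
d@data$continent <- countrycode(gsub("_", " ", d@data$country),
                                origin = "country.name",
                                destination = "continent")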
First, remove the 14 landslides of "unknown" severity, since we're not interested in predicting these, and they complicate the model.
d <- d[d$severity != "unknown" , ] # remove "unknown" hazards
#d@data$hazardintensity <- factor(d@data$hazardintensity) # convert response to factor
d@data$country <- factor(d@data$country) # convert country predictor to factor
d@data$severity <- factor(d@data$severity) # convert severity response to factor
View the disasters, or fatal landslides.
dis <- d[d$severity == "disaster", ]
maps::map("world", fill=TRUE, col="white", bg="lightblue", ylim=c(-60, 90), mar=c(0,0,0,0), main = "Landslides in the World")
points(dis, cex = .5, col = "red")
Let’s proceed with a suite of models to predict fatal landslides.
We begin with a rather simple method: Linear Discriminant Analysis (LDA). To validate our model, we will split the data randomly into two parts: one set for training, and another for testing. We specify the model as follows:
lda(severity ~ elevation + slope + bio12 + bio13 + bio14 + bio15 + bio16 + bio17 + bio18 + bio19)
where bio12
through bio19
are precipitation indices.
library(MASS) # for LDA and QDA models
set.seed(678983) # set seed for reproducibility
# split into training and testing
train_id <- sample(1:nrow(d), nrow(d)/2) # index of half of the data
train <- d[train_id, ] # training set: half of the data
test <- d[-train_id, ] # test set: the other half
# fit the model
lda_fit <- lda(formula = severity ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train)
# predict the test set
pred_prob <- predict(lda_fit, test, type = "response")
# view the confusion matrix
cm <- table(pred_prob$class, test$severity)
cm
##
## disaster non-disaster
## disaster 0 0
## non-disaster 114 1543
# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
## [1] 1
# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[ , 2])
picnd
## [1] 0
# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.06879903
LDA perfectly predicts non-disasters, but fails to predict any disasters. This is because LDA seeks to minimize the overall error, irrespective of class importance. The LDA classifier is essentially the same as the null classifier, which predicts that all events are non-disasters. This achieves an overall error rate of 7%, because the LDA classifier approximates the Bayes classifier, which returns the smallest possible total number of misclassified observations, regardless of which class the observations are from. In disaster management, we're more concerned about misclassified disasters. We need a test with high sensitivity to disaster events.
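To make this concrete, we can compute the class-specific rates directly from the confusion matrix (a small sketch using the cm object created above):
# sensitivity: proportion of true disasters correctly identified
sensitivity <- cm["disaster", "disaster"] / sum(cm[, "disaster"])
# specificity: proportion of true non-disasters correctly identified
specificity <- cm["non-disaster", "non-disaster"] / sum(cm[, "non-disaster"])
c(sensitivity = sensitivity, specificity = specificity)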
Looking more closely, the posterior probability of any disaster event is always less than .5:
# the first column of the posterior matrix corresponds to disasters
pred_prob$posterior[ , 1] %>% max()
## [1] 0.3569364
Because it uses a .5 threshold, LDA will always classify disasters as non-disasters to reduce the overall error. We can overcome this limitation by varying the posterior probability threshold at which LDA classifies an event as a disaster. This will come with some error, because sometimes we'll incorrectly classify a non-disaster as a disaster, but we can tolerate some of these errors. For example, we can lower the posterior probability threshold from .5 to .1. Formally, we change \(Pr(severity = disaster | X = x) > 0.5\) to \(Pr(severity = disaster | X = x) > 0.1\).
# change the posterior probability decision boundary from .5 to .1
pred <- rep("non-disaster", times = nrow(test))
pred[pred_prob$posterior[, "disaster"] > .1] <- "disaster"
cm <- table(pred, test$severity) # confusion matrix of predicted classes and actual class
cm
##
## pred disaster non-disaster
## disaster 40 232
## non-disaster 74 1311
# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
## [1] 0.6491228
# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.1846711
We see that for a posterior probability threshold of .1, the overall error increases from around 7% to about 18%, but we also go from a 100% error rate in our disaster prediction to about a 65% error rate in predicting these events. Because we're more concerned with class-specific performance, in this case the sensitivity of the test (the proportion of true disasters correctly identified), we will investigate the tradeoff between error rates across a range of thresholds to inform which threshold to select.
pp <- seq(0.01, 0.5, .001) # posterior probability thresholds to iterate over
# function to compute error rates at a given posterior probability threshold
calc_errors <- function(pp, fit){
# predict the test set
pred_prob <- predict(fit, test, type = "response")
# change the posterior probability decision boundary from .5 to pp
pred <- rep("non-disaster", times = nrow(test))
pred[pred_prob$posterior[, "disaster"] > pp] <- "disaster"
cm <- table(pred, test$severity)
# proportion of incorrectly classified disasters
#pic <- cm[2, 1] / sum(cm[, 1])
pic <- ifelse(nrow(cm) == 2,
cm[2, 1] / sum(cm[, 1]),
cm[1, 1] / sum(cm[, 1]))
# proportion of incorrectly classified non-disasters
#picnd <- cm[1, 2] / sum(cm[, 2])
picnd <- ifelse(nrow(cm) == 2,
cm[1, 2] / sum(cm[, 2]),
0)
# overall error rate: sum of off diagonals divided by total
#oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
oe <- ifelse(nrow(cm) == 2,
(sum(cm) - sum(diag(cm))) / sum(cm),
cm[1] / sum(cm))
# return the posterior probability threshold, and associated errors
return(c(pp, pic, picnd, oe))
}
# apply a range of posterior probabilities over this function and combine output
temp <- lapply(pp, calc_errors, fit = lda_fit) %>% do.call(rbind, .)
# make into df for plotting and rename columns
temp <- as.data.frame(temp)
colnames(temp) <- c("Threshold",
"Incorrectly Classified Disasters",
"Incorrectly Classified Non-Disasters",
"Overall Error Rate")
# plot
temp %>%
gather(key, value, -Threshold) %>% # convert data from wide to long format
ggplot(aes(Threshold, value)) +
geom_line(aes(color = key), size = .7) +
labs(title = "LDA at varying Posterior Probability Thresholds",
subtitle = "with associated error rates",
y = "Error Rate",
color = NULL) +
theme(legend.position = "bottom")
From this we can see that applying LDA with a threshold around .05 allows us to predict fatal landslides with about 75% accuracy, albeit with the caveat that we incorrectly classify 50% of non-disasters.
Let's imagine that the highest error in terms of incorrectly classified non-disasters we're willing to tolerate is 33%. In other words, we're willing to flag up to 1 in every 3 true non-disaster locations as a disaster. This corresponds to a threshold of 0.069 and an accuracy of about 60% in predicting disaster landslides.
temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
## Threshold Incorrectly Classified Disasters
## 1 0.069 0.3947368
## Incorrectly Classified Non-Disasters Overall Error Rate
## 1 0.3266364 0.3313217
It's important to remember that LDA assumes that observations in the \(k\)th class are drawn from a multivariate Gaussian distribution \(N(\mu_k, \Sigma)\), where \(\mu_k\) is a class-specific mean vector, and \(\Sigma\) is a covariance matrix common to all classes \(k\).
Quadratic Discriminant Analysis (QDA) is very similar to LDA, except it assumes observations of the \(k\)th class have a class-specific covariance matrix in addition to a class-specific mean vector. In other words, they are drawn from the multivariate Gaussian distribution \(N(\mu_k, \Sigma_k)\). QDA forms quadratic decision boundaries between classes, where LDA can only form linear ones. Without knowledge of the covariance matrix of our predictors, we will try both LDA and QDA.
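Before doing so, an informal check (a sketch, not part of the original analysis) is to compare class-specific covariances of a couple of the precipitation predictors in the training set created above; strongly different covariances across classes would favor QDA:
# compare class-specific covariance matrices of two precipitation predictors;
# large differences between the classes would favor QDA over LDA
by(train@data[, c("bio16", "bio19")], train@data$severity, cov)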
# set seed
set.seed(56743)
# fit the model
qda_fit <- qda(formula = severity ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train)
# apply a range of posterior probabilities over this function and combine output
temp <- lapply(seq(.01, .5, .001), calc_errors, fit = qda_fit) %>% do.call(rbind, .)
# make into df for plotting and rename columns
temp <- as.data.frame(temp)
colnames(temp) <- c("Threshold",
"Incorrectly Classified Disasters",
"Incorrectly Classified Non-Disasters",
"Overall Error Rate")
# plot
temp %>%
gather(key, value, -Threshold) %>% # convert data from wide to long format
ggplot(aes(Threshold, value)) +
geom_line(aes(color = key), size = .7) +
labs(title = "QDA at varying Posterior Probability Thresholds",
subtitle = "With Associated Error Rates",
y = "Error Rate",
color = NULL) +
theme(legend.position = "bottom") +
coord_cartesian(ylim = c(0,1))
Unlike LDA, with QDA we notice that when \(Pr(severity = disaster | X = x) > 0.5\), we actually have an error rate lower than 100%! This shows that QDA is much better at picking out decision boundaries between the disaster classes. However, let's see how QDA performs according to our acceptable non-disaster error rate of 33%. This corresponds to a threshold of 0.06 and an accuracy of about 56% in predicting disaster landslides, which isn't as good as what we achieved with LDA.
temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
## Threshold Incorrectly Classified Disasters
## 1 0.06 0.4385965
## Incorrectly Classified Non-Disasters Overall Error Rate
## 1 0.3266364 0.3343392
Let’s see how well a random forest with the same model specification predicts landslide disasters.
library(randomForest)
# set seed
set.seed(4576789)
# fit the model
rf_fit <-
randomForest(
formula = severity ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train, importance = TRUE)
# predict the test set
yhat <- predict(rf_fit, test)
# confusion matrix
cm <- table(yhat, test$severity)
cm
##
## yhat disaster non-disaster
## disaster 5 11
## non-disaster 109 1532
# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
## [1] 0.9561404
# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[ , 2])
picnd
## [1] 0.00712897
# overall error rate: sum of off diagonals divided by total
error <- (sum(cm) - sum(diag(cm))) / sum(cm)
error
## [1] 0.07242004
At first glance, a random forest actually performs worse than the threshold-adjusted LDA and QDA models: it correctly predicts disasters only about 4% of the time. Remember how we modified the thresholds at which the LDA and QDA models predicted classes? In the same way, we can modify the "decision boundaries" of the random forest with the cutoff argument. In classification problems, this parameter controls how a "winning class" is determined from the ensemble of trees grown (in regression, a prediction is simply the mean of the values returned by the trees). By default, the "winning class" for an observation is determined by \(1/k\), which for a 2-class response is simply the majority vote. As in LDA and QDA, we will adjust these boundaries. We will assign a class of "non-disaster" to observations where 66% of the grown trees vote for the "non-disaster" class, and we will investigate the tradeoffs in error associated with various cutoff levels for the "disaster" class.
set.seed(9876325)
rf_mod <- function(x){
# fit the model
rf_fit <-
randomForest(
formula = severity ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train, importance = TRUE,
cutoff = c(x, .66)) # apply the cutoff for disaster and non-disaster class determination
# predict the test set
yhat <- predict(rf_fit, test)
# confusion matrix
cm <- table(yhat, test$severity)
cm
# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[ , 2])
picnd
# overall error rate: sum of off diagonals divided by total
oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
oe
# store errors
return(c(x, pic, picnd, oe))
}
# apply the random forest over a vector of disaster cutoffs from .01 to .34
rf_list <- lapply(seq(.01,.34, .01), rf_mod)
# collapse into a matrix
temp <- rf_list %>% do.call(rbind, .)
# make into df for plotting and rename columns
temp <- as.data.frame(temp)
colnames(temp) <- c("Threshold",
"Incorrectly Classified Disasters",
"Incorrectly Classified Non-Disasters",
"Overall Error Rate")
# plot
temp %>%
gather(key, value, -Threshold) %>% # convert data from wide to long format
ggplot(aes(Threshold, value)) +
geom_line(aes(color = key), size = .7) +
labs(title = "Random Forest at varying Cutoff Thresholds",
subtitle = "with associated error rates",
y = "Error Rate", x = "Cutoff for 'disaster' class",
color = NULL) +
theme(legend.position = "bottom")
Interestingly, applying the same acceptable limit of incorrectly classified non-disasters (less than or equal to 33%), the random forest gives us results comparable to LDA, correctly predicting about 57% of disasters (versus roughly 61% for LDA).
temp %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
## Threshold Incorrectly Classified Disasters
## 1 0.05 0.4298246
## Incorrectly Classified Non-Disasters Overall Error Rate
## 1 0.3097861 0.3180447
One useful output from the random forest algorithm is a ranking of variable importance, which helps us interpret the relationship between our response and predictors. We see that, in our selected model, BIO19, elevation, and BIO16 are the most important predictors. BIO19 is the precipitation of the coldest quarter, and BIO16 is the precipitation of the wettest quarter, both of which make sense. Disastrous landslides appear to be primarily influenced by the severity of winter storms, elevational gradients, and the maximum precipitation observed in a quarter.
# rf model with the lowest error in disaster prediction given an acceptable 33% error rate in non-disasters
rf_fit <-
randomForest(
formula = severity ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train, importance = TRUE,
cutoff = c(.05, .66))
# Variable Importance
temp <- importance(rf_fit)[, "MeanDecreaseGini"]
data.frame(variable = names(temp),
mean_decrease_gini = temp) %>%
ggplot() +
geom_col(aes(forcats::fct_reorder(variable, mean_decrease_gini), # x = sorted mean decrease in gini
mean_decrease_gini, # y = mean decrease in gini
fill = mean_decrease_gini)) + # fill by mean decrease in gini
coord_flip() + # flip coordinates
labs(title = "Variable Importance", subtitle = "Mean Decrease in Gini Coefficient",
y = "Mean Decrease in Gini Coefficient", x = "Variable") +
scale_fill_viridis_c() + # viridis color scale
guides(fill = FALSE) # remove legend
In random forests, trees are grown in parallel. In boosting, trees are grown sequentially on the residuals of the previous tree. We use distribution = "bernoulli" because this is a binary classification problem. If it were a regression problem, we would use distribution = "gaussian".
library(gbm)
# set a seed
set.seed(763459823)
# encode severity as a binary response for gbm
train@data <- train@data %>% mutate(severity_binary = ifelse(severity == "non-disaster", 0, 1))
# fit the model
boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train,
distribution = "bernoulli", # binary response in 1, 0
n.trees = 5000, # number of trees to train
interaction.depth = 4) # maximum depth of each tree
# predict the test set
boost_pred <- predict(boost_fit, test,
n.trees = 5000,
type = "response") # return the predicted probability of 1 (disaster)
# confusion matrix
temp <- rep("non-disaster", length(boost_pred))
temp[boost_pred >= .1] <- "disaster" # use a probability threshold of .1
cm <- table(temp, test$severity)
cm
##
## temp disaster non-disaster
## disaster 38 223
## non-disaster 76 1320
Let’s explore the relationship between the probability threshold and various errors.
boost_error <- function(p, id){
# fit the model
boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train,
distribution = "bernoulli", # binary response in 1, 0
n.trees = 5000, # number of trees to train
interaction.depth = id) # depth of trees
# predict the test set
boost_pred <- predict(boost_fit, test,
n.trees = 5000,
type = "response") # return the predicted probability of 1 (disaster)
# set probability threshold for disasters
pred <- rep("non-disaster", times = length(boost_pred))
pred[boost_pred >= p] <- "disaster"
cm <- table(pred, test$severity)
# proportion of incorrectly classified disasters
pic <- ifelse(nrow(cm) == 2,
cm[2, 1] / sum(cm[, 1]),
cm[1, 1] / sum(cm[, 1]))
# proportion of incorrectly classified non-disasters
picnd <- ifelse(nrow(cm) == 2,
cm[1, 2] / sum(cm[, 2]),
0)
# overall error rate: sum of off diagonals divided by total
oe <- ifelse(nrow(cm) == 2,
(sum(cm) - sum(diag(cm))) / sum(cm),
cm[1] / sum(cm))
# return the posterior probability threshold, and associated errors
return(c(p, pic, picnd, oe, id))
}
# apply a range of posterior probabilities over this function and combine output
# takes a while
temp <- list()
for(i in 1:4){
temp[[i]] <- lapply(seq(.01, .5, .01), boost_error, id = i) %>% do.call(rbind, .)
}
# make into df for plotting and rename columns
temp <- temp %>% do.call(rbind, .)
temp <- as.data.frame(temp)
colnames(temp) <- c("Threshold",
"Incorrectly Classified Disasters",
"Incorrectly Classified Non-Disasters",
"Overall Error Rate",
"Interaction Depth")
# save as rds for faster loading when knitting
write_rds(temp, "temp.rds")
# read in data
temp <- read_rds("temp.rds")
# plot
temp %>%
#dplyr::select(-`Interaction Depth`) %>%
gather(key, value, -Threshold, -`Interaction Depth`) %>% # convert data from wide to long format
#mutate(`Interaction Depth` = rep(1:4, each = 150)) %>%
dplyr::filter(Threshold > 0.02) %>%
ggplot(aes(Threshold, value)) +
geom_line(aes(color = key), size = .7) +
geom_hline(yintercept = .33, linetype = "dashed") +
labs(title = "Boosting Trees Error Rates",
subtitle = "At Varying Thresholds and Interaction Depths",
y = "Error Rate",
color = NULL) +
facet_grid(~`Interaction Depth`) +
theme(legend.position = "bottom") +
coord_cartesian(xlim = c(0,.5))
The plot above displays the 4 interaction depths used for boosting. The dashed black horizontal line at 33% error shows that model performance appears insensitive to interaction depth. These calculations are computationally intensive. Rather than exhaustively searching the parameter space at a fine scale, we can use the graph to determine that incorrectly classified disasters reach a minimum at an interaction depth of 3, and more closely inspect this region to assess the overall performance of the boosted model.
# apply a range of posterior probabilities over this function and combine output
# takes a while
temp <- list()
temp <- lapply(seq(.05, .075, .001), boost_error, id = 3) %>% do.call(rbind, .)
temp <- as.data.frame(temp)
colnames(temp) <- c("Threshold",
"Incorrectly Classified Disasters",
"Incorrectly Classified Non-Disasters",
"Overall Error Rate",
"Interaction Depth")
write_rds(temp, "temp_2.rds")
The optimal boosted tree model uses an interaction depth of 3 with a disaster acceptance threshold of 0.064, and correctly predicts 64% of disastrous landslides, which is an improvement over the LDA, QDA, and random forest models.
temp_2 <- read_rds("temp_2.rds")
temp_2 %>% filter(`Incorrectly Classified Non-Disasters` <= .33) %>% head(1)
## Threshold Incorrectly Classified Disasters
## 1 0.064 0.3596491
## Incorrectly Classified Non-Disasters Overall Error Rate
## 1 0.3298769 0.3319252
## Interaction Depth
## 1 3
As with random forests, we can view the variable importance for our optimal boosted model. Compared to our random forest model, there is agreement in the top 2 variables: bio19 and elevation.
# run the optimal model
boost_fit <- gbm(severity_binary ~ elevation + slope + bio12 + bio13 +
bio14 + bio15 + bio16 + bio17 + bio18 + bio19,
data = train,
distribution = "bernoulli", # binary response in 1, 0
n.trees = 5000, # number of trees to train
interaction.depth = 3) # depth of trees
# predict the test set
boost_pred <- predict(boost_fit, test,
n.trees = 5000,
type = "response") # return the predicted probability of 1 (disaster)
# plot variable importance
p %>% ggplot() + # p comes from a hidden chunk in the source code, which is essentially `summary(boost_fit)`
geom_col(aes(forcats::fct_reorder(var, rel.inf), rel.inf, fill = rel.inf)) +
coord_flip() +
scale_fill_viridis_c() +
guides(fill = FALSE) +
labs(title = "Variable Importance", subtitle = "Optimal Boosting Model",
y = "Relative Influence", x = "Variable")
Can we use our model predictions to predict regions of the world where fatal landslides are most likely?
We begin by assembling our prediction data: the independent variables used in the model, but on a worldwide scale.
library(raster)
library(sp)
bio <- getData("worldclim", var="bio", res=10)
Elevation data is a bit trickier because it involves downloading and merging rasters.
country_dem <- list()
# not all countries are available, so we filter them out
# (iso_codes is assumed to be a character vector of country ISO codes defined outside this chunk)
for(i in iso_codes[-c(4,12,21,28,37,41,49,55,56,75,82,84,85,98,104,106,114,121,137,142,144,147,
154,159,163,166,175,178,187,196,197,205,215,216,218,223,239,242,248,249,250,
251, 252)]){
country_dem[[i]] <- getData("alt", country = i)
}
# merge, save, plot
names(country_dem) <- NULL # do.call won't work with a named list
world_dem <- do.call(raster::merge, country_dem) # merge all rasters into one
write_rds(world_dem, "world_dem.rds") # save
plot(world_dem)
# a small test -- some scratch code for russia
list(country_dem[[162]]$`/Users/richpauloo/Desktop/GEO_200CN/Lab14/RUS1_msk_alt.grd`, country_dem[[162]]$`/Users/richpauloo/Desktop/GEO_200CN/Lab14/RUS2_msk_alt.grd`) %>%
do.call(raster::merge, .) %>% plot()
# do this for all 8 list elements with columns of values.
A bit of data wrangling is necessary to merge some of the larger rasters, which are split into multiple elements in a list.
# initialize vector
x <- vector(length = length(country_dem))
# logical test to flag list elements that hold multiple rasters
# (a single RasterLayer has length equal to its cell count, which is much larger than 10)
for(i in 1:length(x)){
x[i] = length(country_dem[[i]]) <= 10
}
# extract these elements to perform a special merge on their data
fix <- country_dem[x]
# function to merge these data
merge_rasters <- function(x){
  # an element may hold 2, 3, or 4 separate rasters; merge them into a single raster
  merged_raster <- NA
  if(length(x) %in% 2:4){
    names(x) <- NULL # do.call won't work with a named list
    merged_raster <- do.call(raster::merge, x)
  }
  return(merged_raster)
}
# apply the function
fix_merge <- lapply(fix, merge_rasters)
write_rds(fix_merge, "fix_merge.rds") # save this because it takes a while to generate
# merge all raster layers
all_rasters <- c(fix_merge, country_dem[!x])
world <- all_rasters %>% do.call(merge, .)
write_rds(world, "world.rds") # save this because it takes a while to generate
Let’s view the global DEM we just made.
# read in previously written world elevation raster
world <- read_rds("world.rds")
# plot
plot(world, main = "Elevation (meters)")
To save on computing time and avoid aggregating all of the independent variables, let's build a simpler model with which to predict landslide severity worldwide. We will use the top 4 predictors from our boosted tree instead of all 10. The new model is specified as severity_binary ~ elevation + bio13 + bio18 + bio19. This results in only a small loss of accuracy (roughly one percentage point) in predicting disastrous landslides.
# run the simplified model
boost_fit <- gbm(severity_binary ~ elevation + bio13 + bio18 + bio19,
data = train,
distribution = "bernoulli", # binary response in 1, 0
n.trees = 5000, # number of trees to train
interaction.depth = 3) # depth of trees
# predict the test set
boost_pred <- predict(boost_fit, test,
n.trees = 5000,
type = "response") # return the predicted probability of 1 (disaster)
# set probability threshold for disasters
pred <- rep("non-disaster", times = length(boost_pred))
pred[boost_pred >= .064] <- "disaster"
cm <- table(pred, test$severity)
# proportion of incorrectly classified disasters
pic <- cm[2, 1] / sum(cm[, 1])
pic
## [1] 0.3684211
# proportion of incorrectly classified non-disasters
picnd <- cm[1, 2] / sum(cm[, 2])
picnd
## [1] 0.3324692
# overall error rate: sum of off diagonals divided by total
oe <- (sum(cm) - sum(diag(cm))) / sum(cm)
oe
## [1] 0.3349427
We need to aggregate the world raster to the scale of the bio raster before running the model.
res(bio)/res(world) # find the scaling factor for x and y between the bio and world rasters
## [1] 20 20
# takes a while, so save it for future use
world_agg <- aggregate(world, c(20,20), fun = mean) # aggregate world raster by the x and y scaling factors
write_rds(world_agg, "world_agg.rds")
We also need to make sure the rasters line up in terms of rows, columns, and extent. So we must crop and fix the extent of the rasters.
world_agg <- read_rds("world_agg.rds")
bio_crop <- crop(bio, world_agg) # crop bioclim variables to the elevation raster
extent(bio_crop) <- extent(world_agg) # fix extent of bioclim
# stack rasters
world_stack <- stack(bio_crop[[13]], # bio13
bio_crop[[18]], # bio18
bio_crop[[19]], # bio19
world_agg) # elevation
names(world_stack)[4] <- "elevation" # give the 4th raster layer its correct name
Now we can look at our independent variables.
plot(world_stack)
And fit the model to this new data:
# coerce stack to data frame for the model
wsdf <- as.data.frame(world_stack)
# predict with the fitted model - outputs a vector of predictions with length = number of cells in the raster layer
boost_pred <- predict(boost_fit, wsdf,
n.trees = 5000,
type = "response")
We can view the predicted probability of a fatal landslide for every cell across the globe.
# reorganize the predictions into a matrix -- the same dimensions as the raster stack
out <- matrix(boost_pred, nrow = 839, ncol = 2160, byrow = TRUE)
# shapefile of the world coastline
coast = shapefile("ne_50m_coastline/ne_50m_coastline.shp")
# replace values
rout <- raster(out) # turn prediction matrix into raster for plotting
extent(rout) <- extent(world_stack) # define extent of prediction raster
proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
raster::mask(rout, world_stack$bio13) %>% # mask to only include land
plot(main = "Posterior Probability Threshold for Fatal Landslides", xlab = "Longitude", ylab = "Latitude")
plot(coast, add = TRUE) # add coastlines
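An alternative way to get the predictions back into a raster (a sketch, not the approach used above) is to copy a template layer from the stack and overwrite its values, which avoids hard-coding the matrix dimensions. This works because as.data.frame(world_stack) returns cells in raster cell order, so the prediction vector lines up with the template's cells:
# alternative: fill a template raster with the predictions instead of building a matrix
rout_alt <- world_stack[[1]] # template with the same extent, CRS, and dimensions
values(rout_alt) <- boost_pred # predictions are in the same cell order as as.data.frame(world_stack)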
Applying our threshold of 0.064 from above, we can map the areas where the model predicts disastrous landslides to occur, and overlay the locations where we have data on fatal landslide occurrence.
# identify areas predicted by model to have fatal landslides
bin_response <- rep(0, times = length(boost_pred))
bin_response[boost_pred >= .064] = 1
out <- matrix(bin_response, nrow = 839, ncol = 2160, byrow = TRUE) # put into matrix for raster
# plot
rout <- raster(out)
extent(rout) <- extent(world_stack) # define extent of prediction raster
proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
raster::mask(rout, world_stack$bio13) %>%
plot(main = "Predicted Fatal Landslide Areas",
xlab = "Longitude", ylab = "Latitude", legend = FALSE)
plot(coast, add = TRUE)
points(dis, cex = .5, col = "red") # fatal landslides
We examine the effect of changing our acceptance probability threshold in the plots below. As the threshold increases, less land is predicted to have fatal landslide danger, which decreases our overall error and our misclassification of small and medium landslides as fatal landslides, but increases the error of misclassifying fatal landslides, as illustrated in the error plots above.
#par(mfrow = c(2,2)) # set up graphical device
# iterate over probability thresholds
for(i in c(.064,.1, .15,.2)){
# identify areas predicted by model to have fatal landslides
bin_response <- rep(0, times = length(boost_pred))
bin_response[boost_pred >= i] = 1
out <- matrix(bin_response, nrow = 839, ncol = 2160, byrow = TRUE) # put into matrix for raster
# plot
rout <- raster(out)
extent(rout) <- extent(world_stack) # define extent of prediction raster
proj4string(rout) <- proj4string(world_stack) # define CRS of prediction raster
raster::mask(rout, world_stack$bio13) %>%
plot(main = paste("Probability Threshold =", i),
xlab = "Longitude", ylab = "Latitude", legend = FALSE)
plot(coast, add = TRUE)
points(dis, cex = .5, pch = 16, col = "red")
}
Green areas represent locations predicted by the boosted tree classification model where fatal landslides are likely to occur. Keep in mind that:
If we take this landslide dataset (1997-2015) to represent the population-level spatial distribution of global landslides, the model appears to overpredict areas susceptible to fatal landslides.
Inaccurate model specification is the most likely cause for this error. Comparing our predictions with the data on where disastrous landslides are recorded, we see that the model predicts fatal landslides in many areas where there is no data to support this prediction.
Very northern regions in Alaska, Canada, Greenland, Russia, and Siberia as well as large areas throughout mainland Africa and in the Amazon are predicted to have fatal landslides.
By adding "latitude" into the model, we could likely correct for inaccurate predictions in northern regions. It is probable that landslides are related to soil type, underlying geology, and patterns of land use. In the absence of a worldwide soil, geology, or land use raster, we might lazily use "continent" as an easy-to-add covariate that should covary with these predictors.
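A minimal sketch of that latitude extension (hypothetical; not fit in this analysis), pulling latitude from the spatial coordinates and adding it to the simplified boosted model:
# hypothetical extension: add latitude as a predictor
train@data$latitude <- sp::coordinates(train)[, 2] # second coordinate column is latitude
test@data$latitude <- sp::coordinates(test)[, 2]
lat_fit <- gbm(severity_binary ~ elevation + bio13 + bio18 + bio19 + latitude,
               data = train@data,
               distribution = "bernoulli",
               n.trees = 5000,
               interaction.depth = 3)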