Ant data: k nearest neighbors model algorithm

Brett Melbourne 21 Jan 2022

KNN for the regression case, illustrated with the ants data. This code is much the same as ants_cv.R, but instead of a smoothing spline we use KNN as the model algorithm.

library(ggplot2)
library(dplyr)
library(tidyr)

Forest ant data:

forest_ants <- read.csv("data/ants.csv") %>% 
    filter(habitat=="forest")

K Nearest Neighbors (KNN) algorithm for 1 new value of x, translating our pseudocode to R code.

# Set k = number of nearest neighbors
k <- 4

# Input (x, y) = x, y data pairs
x <- forest_ants$latitude
y <- forest_ants$richness

# Input x_new = x value at which to predict y_new
x_new <- 42.25

# Calculate d = distance of x_new to other x 
d <- abs(x - x_new)
 
# Sort y data ascending by d; break ties randomly
y_sort <- y[order(d, sample(1:length(d)))]

# Predict new y = mean of k nearest y data
y_pred <- mean(y_sort[1:k])

The predicted richness of 10.5 makes sense, as we can see by plotting it with the data.

y_pred
## [1] 10.5
forest_ants %>% 
    ggplot() +
    geom_point(aes(x=latitude, y=richness)) +
    geom_point(aes(x=x_new, y=y_pred), col="red") +
    geom_text(aes(x=x_new, y=y_pred, label="Predicted"), col="red", hjust=-0.2) +
    coord_cartesian(ylim=c(0,20))
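
A quick way to see where this prediction comes from is to inspect the k = 4 sites nearest to latitude 42.25 directly; their mean richness is the prediction (ties, if any, are broken randomly in the algorithm above).

# The k sites nearest to x_new; their mean richness is the prediction
forest_ants %>% 
    mutate(d = abs(latitude - x_new)) %>% 
    arrange(d) %>% 
    head(k)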

Now as a function that can accept a vector of new x values and return a vector of predictions.

# KNN function for a vector of x_new
# x:       x data (vector, numeric)
# y:       y data (vector, numeric)
# x_new:   x values at which to predict y (vector, numeric)
# k:       number of nearest neighbors to average (scalar, integer)
# return:  predicted y at x_new (vector, numeric)
#
knn <- function(x, y, x_new, k) {
    y_pred <- NA * x_new
    for ( i in 1:length(x_new) ) {
    #   Distance of x_new to other x
        d <- abs(x - x_new[i])
    #   Sort y ascending by d; break ties randomly
        y_sort <- y[order(d, sample(1:length(d)))]
    #   Mean of k nearest y data
        y_pred[i] <- mean(y_sort[1:k])
    }
    return(y_pred)
}

Test the output of the knn function.

knn(forest_ants$latitude, forest_ants$richness, x_new=42:45, k=4)
## [1] 14.25  6.50  5.75  6.50
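
As a further check of an edge case, with k = 1 a prediction at an observed latitude should simply return the richness of a site at that latitude (chosen at random if several sites share the latitude). We can compare the prediction to the observed richness values there.

# With k = 1, a prediction at an observed latitude returns an observed richness
knn(forest_ants$latitude, forest_ants$richness, x_new=forest_ants$latitude[1], k=1)
forest_ants$richness[forest_ants$latitude == forest_ants$latitude[1]]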

Plot the KNN predictions over the data. Use this block of code to try different values of k (i.e. different numbers of nearest neighbors).

grid_latitude  <- seq(min(forest_ants$latitude), max(forest_ants$latitude), length.out=201)
pred_richness <- knn(forest_ants$latitude, forest_ants$richness, x_new=grid_latitude, k=8)
preds <- data.frame(grid_latitude, pred_richness)

forest_ants %>% 
    ggplot() +
    geom_point(aes(x=latitude, y=richness)) +
    geom_line(data=preds, aes(x=grid_latitude, y=pred_richness)) +
    coord_cartesian(ylim=c(0,20))
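
To compare several values of k side by side, one option is to stack predictions over a set of k values and facet the plot by k. A sketch (the particular k values chosen here are just for illustration):

# Compare KNN fits for several values of k in one figure
k_values <- c(1, 4, 8, 16)
preds_k <- lapply(k_values, function(k) {
    data.frame(k,
               grid_latitude,
               pred_richness = knn(forest_ants$latitude, forest_ants$richness,
                                   x_new=grid_latitude, k=k))
})
preds_k <- bind_rows(preds_k)

forest_ants %>% 
    ggplot() +
    geom_point(aes(x=latitude, y=richness)) +
    geom_line(data=preds_k, aes(x=grid_latitude, y=pred_richness)) +
    facet_wrap(vars(k), labeller=label_both) +
    coord_cartesian(ylim=c(0,20))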

k-fold CV for KNN. Be careful not to confuse the k’s!

# Function to partition a data set into random folds for cross-validation
# n:       length of dataset (scalar, integer)
# k:       number of folds (scalar, integer)
# return:  fold labels (vector, integer)
# 
random_folds <- function(n, k) {
    min_n <- floor(n / k)
    extras <- n - k * min_n
    labels <- c(rep(1:k, each=min_n), seq_len(extras))
    folds <- sample(labels, n)
    return(folds)
}
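
A quick check that the folds come out as evenly sized as possible: with n = 22 observations and k = 5 folds, each fold should get either 4 or 5 observations.

# Fold sizes for n = 22 and k = 5 folds
table(random_folds(n=22, k=5))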

# Function to perform k-fold CV for the KNN model algorithm on ants data
# k_cv:    number of folds (scalar, integer)
# k_knn:   number of nearest neighbors to average (scalar, integer)
# return:  CV error as MSE (scalar, numeric)
#
cv_ants <- function(k_cv, k_knn) {
    forest_ants$fold <- random_folds(nrow(forest_ants), k_cv)
    e <- rep(NA, k_cv)
    for ( i in 1:k_cv ) {
        test_data <- forest_ants %>% filter(fold == i)
        train_data <- forest_ants %>% filter(fold != i)
        pred_richness <- knn(train_data$latitude, 
                             train_data$richness, 
                             x_new=test_data$latitude, 
                             k=k_knn)
        e[i] <- mean((test_data$richness - pred_richness) ^ 2)
    }
    cv_error <- mean(e)
    return(cv_error)
}

Test the function.

cv_ants(k_cv=10, k_knn=8)
## [1] 12.60286
cv_ants(k_cv=nrow(forest_ants), k_knn=7) #LOOCV
## [1] 12.75046

Explore a grid of values for k_cv and k_knn.

grid <- expand.grid(k_cv=c(5,10,nrow(forest_ants)), k_knn=1:16)
grid
##    k_cv k_knn
## 1     5     1
## 2    10     1
## 3    22     1
## 4     5     2
## 5    10     2
## 6    22     2
## 7     5     3
## 8    10     3
## 9    22     3
## 10    5     4
## 11   10     4
## 12   22     4
## 13    5     5
## 14   10     5
## 15   22     5
## 16    5     6
## 17   10     6
## 18   22     6
## 19    5     7
## 20   10     7
## 21   22     7
## 22    5     8
## 23   10     8
## 24   22     8
## 25    5     9
## 26   10     9
## 27   22     9
## 28    5    10
## 29   10    10
## 30   22    10
## 31    5    11
## 32   10    11
## 33   22    11
## 34    5    12
## 35   10    12
## 36   22    12
## 37    5    13
## 38   10    13
## 39   22    13
## 40    5    14
## 41   10    14
## 42   22    14
## 43    5    15
## 44   10    15
## 45   22    15
## 46    5    16
## 47   10    16
## 48   22    16
cv_error <- rep(NA, nrow(grid))
set.seed(6363) #For reproducible results in this text
for ( i in 1:nrow(grid) ) {
    cv_error[i] <- cv_ants(grid$k_cv[i], grid$k_knn[i])
}
result1 <- cbind(grid,cv_error)

Plot the result.

result1 %>% 
    ggplot() +
    geom_line(aes(x=k_knn, y=cv_error, col=factor(k_cv))) +
    labs(col="k_cv")

LOOCV (k_cv = 22) identifies the KNN model with k_knn = 7 nearest neighbors as having the best predictive performance. We see again that there is a lot of variance in 5-fold and 10-fold CV, so we would not want to trust a single run of k-fold CV. For the KNN model, we probably don't want to trust a single run of LOOCV either, since there is a random component to breaking ties in the KNN algorithm. Let's look at 5-fold CV and LOOCV over many replicate CV runs.

grid <- expand.grid(k_cv=c(5,22), k_knn=1:16)
grid

reps <- 250
cv_error <- matrix(NA, nrow=nrow(grid), ncol=reps)
set.seed(7419) #For reproducible results in this text
for ( j in 1:reps ) {
    for ( i in 1:nrow(grid) ) {
        cv_error[i,j] <- cv_ants(grid$k_cv[i], grid$k_knn[i])
    }
    print(j) #monitor
}
result2 <- cbind(grid,cv_error)
result2$mean_cv <- rowMeans(result2[,-(1:2)])

Plot the result.

result2 %>%
    select(k_cv, k_knn, mean_cv) %>%
    rename(cv_error=mean_cv) %>%
    ggplot() +
    geom_line(aes(x=k_knn, y=cv_error, col=factor(k_cv))) +
    labs(title=paste("Mean across",reps,"k-fold CV runs"), col="k_cv")
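
One way to quantify the run-to-run variability mentioned above is the standard deviation of the CV error across the replicate runs. A sketch:

# Run-to-run variability (SD across replicate CV runs) for each k_cv, k_knn
result2$sd_cv <- apply(cv_error, 1, sd)

result2 %>%
    select(k_cv, k_knn, sd_cv) %>%
    ggplot() +
    geom_line(aes(x=k_knn, y=sd_cv, col=factor(k_cv))) +
    labs(title=paste("SD across", reps, "k-fold CV runs"), col="k_cv")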

Print the table to see the detailed numbers.

result2 %>% 
    arrange(k_cv) %>% 
    select(k_cv, k_knn, mean_cv)
##    k_cv k_knn  mean_cv
## 1     5     1 26.41756
## 2     5     2 18.48430
## 3     5     3 15.31755
## 4     5     4 13.90081
## 5     5     5 13.41545
## 6     5     6 13.01379
## 7     5     7 13.02613
## 8     5     8 13.32905
## 9     5     9 13.75747
## 10    5    10 14.89716
## 11    5    11 15.52593
## 12    5    12 15.85196
## 13    5    13 16.09378
## 14    5    14 17.02255
## 15    5    15 18.02458
## 16    5    16 18.93897
## 17   22     1 27.65818
## 18   22     2 19.29545
## 19   22     3 16.17879
## 20   22     4 13.98970
## 21   22     5 13.40444
## 22   22     6 12.94657
## 23   22     7 12.62650
## 24   22     8 12.94886
## 25   22     9 12.69278
## 26   22    10 12.94079
## 27   22    11 13.62847
## 28   22    12 14.81282
## 29   22    13 15.40425
## 30   22    14 15.54777
## 31   22    15 15.53584
## 32   22    16 16.33452

We see (in both the plot and the table) that 5-fold CV identifies k_knn = 6 nearest neighbors as having the best predictive performance, but k_knn = 7 has essentially the same performance. LOOCV averaged over multiple runs still picks k_knn = 7. Had we relied on our earlier single run of LOOCV, which also identified k_knn = 7, we would still have made a good choice for k_knn, since these models have similar performance.
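
We can pull out these minima programmatically as a check, for example with dplyr (slice_min requires dplyr >= 1.0; filter(mean_cv == min(mean_cv)) would do the same):

# Best k_knn (smallest mean CV error) for each number of folds
result2 %>%
    group_by(k_cv) %>%
    slice_min(mean_cv) %>%
    select(k_cv, k_knn, mean_cv)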

What about the KNN model versus the smoothing spline model? Which has the best predictive performance? Collating the results from our CV inference algorithms, we have the following estimated MSEs:

| Model              | LOOCV | 5-fold CV |
|--------------------|-------|-----------|
| KNN 6              | 12.95 | 13.01     |
| KNN 7              | 12.63 | 13.03     |
| Smoothing spline 3 | 12.52 | 12.77     |

We see that both LOOCV and 5-fold CV give the edge to the smoothing spline model. Nevertheless, for these data any of these three models has about the same predictive performance. It is worth noting that a single run of LOOCV would have picked KNN 7 as the best model with LOOCV = 12.48; this illustrates that one should be wary wherever randomness is involved, whether in the model, training, or inference algorithm.
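
To see how much a single LOOCV estimate for the KNN model can move around, we could repeat the LOOCV run a handful of times: the folds are identical in every run (each observation is its own fold), so any variation among the estimates comes from the random tie breaking alone. A sketch (the number of repeats is arbitrary):

# Variation among LOOCV estimates for KNN with k_knn = 7 is due solely to
# random tie breaking, because the LOOCV folds are the same in every run
replicate(10, cv_ants(k_cv=nrow(forest_ants), k_knn=7))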