loss_accuracy returns 0 for mean_dropout_loss #535

Closed
JeffreyRStevens opened this issue Dec 29, 2022 · 9 comments
Labels
invalid ❕ This doesn't seem right, potential bug R 🐳 Related to R

Comments

@JeffreyRStevens

I would like to use loss_accuracy as my loss function in model_parts(), but whenever I use it, the mean_dropout_loss is always 0. I have tried loss_accuracy for regression, classification, and multiclass classification (see the reprex below). Am I using it correctly?

library(DALEX)
#> Welcome to DALEX (version: 2.4.2).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
library(ranger)
df <- mtcars[, c('mpg', 'cyl', 'disp', 'hp', 'vs')]
# Regression
reg <- lm(mpg ~ ., data = df)
explainer_reg <- explain(reg, data = df[,-1], y = df[,1])
#> Preparation of a new explainer is initiated
#>   -> model label       :  lm  (  default  )
#>   -> data              :  32  rows  4  cols 
#>   -> target variable   :  32  values 
#>   -> predict function  :  yhat.lm  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.2 , task regression (  default  ) 
#>   -> predicted values  :  numerical, min =  12.56206 , mean =  20.09062 , max =  27.04625  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -4.019038 , mean =  1.010303e-14 , max =  6.976988  
#>   A new explainer has been created!
feature_importance(explainer_reg, loss_function = loss_accuracy)
#>       variable mean_dropout_loss label
#> 1 _full_model_                 0    lm
#> 2          cyl                 0    lm
#> 3         disp                 0    lm
#> 4           hp                 0    lm
#> 5           vs                 0    lm
#> 6   _baseline_                 0    lm
# Classification
classif <- glm(vs ~ ., data = df, family = binomial)
explainer_classif <- explain(classif, data = df[,-5], y = df[,5])
#> Preparation of a new explainer is initiated
#>   -> model label       :  lm  (  default  )
#>   -> data              :  32  rows  4  cols 
#>   -> target variable   :  32  values 
#>   -> predict function  :  yhat.glm  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.2 , task classification (  default  ) 
#>   -> predicted values  :  numerical, min =  7.696047e-06 , mean =  0.4375 , max =  0.9920295  
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  numerical, min =  -0.9474062 , mean =  -1.483608e-12 , max =  0.5318376  
#>   A new explainer has been created!
feature_importance(explainer_classif, loss_function = loss_accuracy)
#>       variable mean_dropout_loss label
#> 1 _full_model_                 0    lm
#> 2          mpg                 0    lm
#> 3          cyl                 0    lm
#> 4         disp                 0    lm
#> 5           hp                 0    lm
#> 6   _baseline_                 0    lm
# Multiclass classification
multiclass <- ranger(cyl ~ ., data = df, probability = TRUE)
explainer_multiclass <- explain(multiclass, data = df[,-2], y = df[,2])
#> Preparation of a new explainer is initiated
#>   -> model label       :  ranger  (  default  )
#>   -> data              :  32  rows  4  cols 
#>   -> target variable   :  32  values 
#>   -> predict function  :  yhat.ranger  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task multiclass (  default  ) 
#>   -> model_info        :  Model info detected multiclass task but 'y' is a numeric .  (  WARNING  )
#>   -> model_info        :  By deafult multiclass tasks supports only factor 'y' parameter. 
#>   -> model_info        :  Consider changing to a factor vector with true class names.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  predict function returns multiple columns:  3  (  default  ) 
#>   -> residual function :  difference between 1 and probability of true class (  default  )
#>   -> residuals         :  the residual_function returns an error when executed (  WARNING  ) 
#>   A new explainer has been created!
feature_importance(explainer_multiclass, loss_function = loss_accuracy)
#>       variable mean_dropout_loss  label
#> 1 _full_model_                 0 ranger
#> 2          mpg                 0 ranger
#> 3         disp                 0 ranger
#> 4           hp                 0 ranger
#> 5           vs                 0 ranger
#> 6   _baseline_                 0 ranger

Created on 2022-12-29 with reprex v2.0.2

When I try other loss functions (e.g., loss_root_mean_square for regression, loss_one_minus_auc for classification), they return non-zero values.

feature_importance(explainer_reg, loss_function = loss_root_mean_square)
#>       variable mean_dropout_loss label
#> 1 _full_model_          2.844520    lm
#> 2           vs          2.861546    lm
#> 3           hp          3.328176    lm
#> 4         disp          4.201312    lm
#> 5          cyl          4.498485    lm
#> 6   _baseline_          7.777811    lm
feature_importance(explainer_classif, loss_function = loss_one_minus_auc)
#>       variable mean_dropout_loss label
#> 1 _full_model_        0.03571429    lm
#> 2          mpg        0.04603175    lm
#> 3         disp        0.04642857    lm
#> 4           hp        0.31785714    lm
#> 5          cyl        0.36031746    lm
#> 6   _baseline_        0.51884921    lm

Created on 2022-12-29 with reprex v2.0.2

Is there something different about how loss_accuracy is used?

I'm using DALEX v2.4.2, R v4.2.2, and RStudio v2022.12.0+353 on Ubuntu 22.04.1.

@hbaniecki
Member

hbaniecki commented Dec 29, 2022

Hi, I think loss_accuracy was never really used, and it seems to be invalid. Or at least, for it to work you would have to change predict_function to return a class (i.e., 0/1) instead of probabilities, see

#' @rdname loss_functions
#' @export
loss_accuracy <- function(observed, predicted, na.rm = TRUE)
  mean(observed == predicted, na.rm = na.rm)
attr(loss_accuracy, "loss_name") <- "Accuracy"

I think that loss_accuracy could use model_performance_accuracy:

tp = sum((observed == 1) * (predicted >= cutoff))
fp = sum((observed == 0) * (predicted >= cutoff))
tn = sum((observed == 0) * (predicted < cutoff))
fn = sum((observed == 1) * (predicted < cutoff))

DALEX/R/model_performance.R, lines 166 to 168 at b855207:

model_performance_accuracy <- function(tp, fp, tn, fn) {
(tp + tn)/(tp + fp + tn + fn)
}

and it should probably also be a decreasing measure, 1 - Accuracy, as with 1 - AUC.
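
For reference, a minimal sketch of that first suggestion (the explainer name and the 0.5 cutoff here are illustrative assumptions, not from the thread): give explain() a predict_function that thresholds the glm probabilities into 0/1 classes, so that the == comparison inside loss_accuracy() compares classes with classes.

explainer_classif_cls <- explain(
  classif,
  data = df[, -5],
  y = df[, 5],
  predict_function = function(model, newdata) {
    # return hard 0/1 classes instead of probabilities
    as.numeric(predict(model, newdata, type = "response") >= 0.5)
  }
)
feature_importance(explainer_classif_cls, loss_function = loss_accuracy)

Note that accuracy is an increasing measure (higher is better), so dropping an important variable lowers the score rather than raising it, which is why the decreasing 1 - Accuracy variant is suggested above.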

@hbaniecki hbaniecki added R 🐳 Related to R invalid ❕ This doesn't seem right, potential bug labels Dec 29, 2022
@JeffreyRStevens
Author

Thanks! So putting that all together, something like this?

loss_one_minus_accuracy <- function(observed, predicted, na.rm = TRUE, cutoff = 0.5) {
  # confusion-matrix counts at the given probability cutoff
  tp <- sum((observed == 1) * (predicted >= cutoff), na.rm = na.rm)
  fp <- sum((observed == 0) * (predicted >= cutoff), na.rm = na.rm)
  tn <- sum((observed == 0) * (predicted < cutoff), na.rm = na.rm)
  fn <- sum((observed == 1) * (predicted < cutoff), na.rm = na.rm)
  # decreasing measure: 1 - accuracy
  1 - (tp + tn) / (tp + fp + tn + fn)
}
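
As a usage sketch with the classification explainer from the reprex above (permutation importances vary from run to run, so no output is shown):

set.seed(1)
model_parts(explainer_classif, loss_function = loss_one_minus_accuracy)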

@hbaniecki
Member

@JeffreyRStevens yes, would you like to make a PR?

@JeffreyRStevens
Author

I would be happy to. Would you like me to do anything with loss_accuracy() or just add loss_one_minus_accuracy()?

@hbaniecki
Member

perhaps also remove loss_accuracy() since it's wrong @pbiecek?

@pbiecek
Member

pbiecek commented Dec 30, 2022

@hbaniecki what's wrong with loss_accuracy?
It should work for classification models that return classes, and it is supposed to be compatible with the yardstick approach of validating models both with scores and with classes.
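
A minimal illustration of that contract: loss_accuracy() compares observed and predicted with ==, so it behaves as expected for hard class labels but returns 0 for probability scores, which is exactly the behavior in the reprex above.

loss_accuracy(observed = c(0, 1, 1, 0), predicted = c(0, 1, 0, 0))
#> [1] 0.75
# probabilities are virtually never exactly equal to 0 or 1, so every
# comparison is FALSE and the mean is 0
loss_accuracy(observed = c(0, 1, 1, 0), predicted = c(0.1, 0.9, 0.4, 0.2))
#> [1] 0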

@pbiecek
Member

pbiecek commented Dec 30, 2022

Currently loss_accuracy does not assume that predicted is a number, so if you are going to add loss_one_minus_accuracy then it should have a consistent contract.

suggested approach:

  • use a different name (to avoid conflicts with loss_accuracy)
  • be precise in the documentation

@pbiecek
Member

pbiecek commented Dec 30, 2022

Maybe add model_performance_one_minus_accuracy and use this function?

@hbaniecki
Member

TODO

model_parts(explainer, loss_function = get_loss_yardstick(reverse=TRUE))
model_parts(explainer, loss_function = get_loss_accuracy(cutoff=0.5)) # returns DALEX::loss_one_minus_acc
model_parts(explainer, loss_function = DALEX::loss_one_minus_acc) # baseline cutoff=0.5

hbaniecki added a commit that referenced this issue Jan 8, 2023
pbiecek pushed a commit that referenced this issue Jan 26, 2023
* add loss_one_minus_accuracy #535

* fix typo, update doc

* warn -> warning

* update package version

* add more tests

* fix checks

* fix tests