reproduce error in print_report() function #172

Open
wants to merge 30 commits into
base: main
d87e0e3
coerce id column into character
Karim-Mane Apr 5, 2024
ed4c52c
rename some files and functions and use standardize
Karim-Mane Apr 8, 2024
8e93416
allow for using column names before column names standardisation is p…
Karim-Mane Apr 8, 2024
9ed144f
Automatic readme update
actions-user Apr 8, 2024
a97c504
remove unnecesary line in DESCRIPTION file
Karim-Mane Apr 9, 2024
207d829
disable option for removing bad date sequences found by check_date_se…
Karim-Mane Apr 9, 2024
a85ce72
disable possibility to provide a comma-separated list of column names.
Karim-Mane Apr 9, 2024
47821a3
remove the internal call of standardize_dates() from check_date_seque…
Karim-Mane Apr 9, 2024
4e46588
use || instead of |
Karim-Mane Apr 9, 2024
0792230
optimize on the code in detect_to_numeric_columns() as suggested by Hugo
Karim-Mane Apr 9, 2024
6dafba4
use '.data$var' from rlang in some dplyr functions
Karim-Mane Apr 9, 2024
f7595f8
use regex to check prefix and suffix
Karim-Mane Apr 10, 2024
5ad6e62
allow for multiple prefix and suffix
Karim-Mane Apr 10, 2024
4c4aec9
fix pkgdown issue
Karim-Mane Apr 10, 2024
87eb266
create function to set the default cleaning operations.
Karim-Mane Apr 11, 2024
36917e8
update clean_data() documentation and account for the default cleanin…
Karim-Mane Apr 11, 2024
9f3f819
send warning the missing character is not found.
Karim-Mane Apr 11, 2024
05bd258
update documentation for standardize_dates() function
Karim-Mane Apr 11, 2024
84654b2
fine check params argument in clean_data() and fix linters
Karim-Mane Apr 15, 2024
2b02dd3
added new line to separate between messages and warning/errors
Karim-Mane Apr 15, 2024
b9bf764
allow for spelling check to run
Karim-Mane Apr 15, 2024
a270623
document how the common_na_strings data was generated
Karim-Mane Apr 15, 2024
65c0d49
fix typos in design vignette
Karim-Mane Apr 15, 2024
e641dd2
add the lang argument needed by convert_to_numeric()
Karim-Mane Apr 15, 2024
eac31c2
use a vector for the rename argument in standardize_column_names()
Karim-Mane Apr 15, 2024
23926f9
reduce else in scan_columns()
Karim-Mane Apr 16, 2024
07f237a
update the way the keep argument is handle in standardize_column_names()
Karim-Mane Apr 16, 2024
b638d9a
update replace_missing_values() with the usage of dplyr
Karim-Mane Apr 16, 2024
eebe908
update add_to_report()
Karim-Mane Apr 16, 2024
248e9ce
Automatic readme update
actions-user Aug 12, 2024
1 change: 1 addition & 0 deletions .Rbuildignore
Expand Up @@ -11,3 +11,4 @@
^doc$
^Meta$
^CITATION\.cff$
^data-raw$
3 changes: 1 addition & 2 deletions DESCRIPTION
Expand Up @@ -22,7 +22,6 @@ Authors@R: c(
person("Joshua W.", "Lambert", , "[email protected]", role = "rev",
comment = c(ORCID = "0000-0001-5218-3046"))
)
Maintainer: Karim Mané <[email protected]>
Description: cleanepi provides functions for cleaning and standardizing tabular data,
tailored specifically for curating epidemiological data.
License: MIT + file LICENSE
Expand All @@ -44,6 +43,7 @@ Imports:
numberize,
R.utils,
readr,
rlang,
snakecase,
stringr,
utils
Expand All @@ -53,7 +53,6 @@ Suggests:
lintr,
markdown,
reactable,
rlang,
rmarkdown,
spelling,
testthat (>= 3.0.0)
Expand Down
3 changes: 2 additions & 1 deletion NAMESPACE
Expand Up @@ -11,7 +11,7 @@ export(convert_to_numeric)
export(correct_subject_ids)
export(find_duplicates)
export(print_report)
export(remove_constant)
export(remove_constants)
export(remove_duplicates)
export(replace_missing_values)
export(scan_data)
Expand All @@ -20,4 +20,5 @@ export(standardize_column_names)
export(standardize_dates)
importFrom(lubridate,"%--%")
importFrom(magrittr,"%>%")
importFrom(rlang,.data)
importFrom(utils,browseURL)
68 changes: 30 additions & 38 deletions R/check_date_sequence.R
Expand Up @@ -5,44 +5,49 @@
#'
#' @param data A data frame
#' @param target_columns A vector of event column names. Users should specify at
#' least 2 column names in the expected order.
#' For example: target_columns = c("date_symptoms_onset",
#' "date_hospitalization", "date_death"). When the input data is a `linelist`
#' object, this parameter can be set to `linelist_tags` if you wish to
#' the date sequence across tagged columns only.
#' @param remove A Boolean to specify if rows with incorrect order
#' should be filtered out or not. The default is FALSE
#' least 2 column names in the expected order. For example:
#' target_columns = c("date_symptoms_onset", "date_hospitalization",
#' "date_death").
#' When the input data is a `linelist` object, this parameter can be set to
#' `linelist_tags` if you wish to check the date sequence across tagged
#' columns only.
#' The values in these columns should be in the ISO 8601 format
#' (e.g. 2024-12-31). Otherwise, use the `standardize_dates()` function to
#' standardize them.
#'
#' @returns Rows of the input data frame with incorrect date sequence
#' if `remove = FALSE`, the input data frame without those
#' rows if not.
#' @returns The input dataset. When found, the incorrect date sequences will be
#' stored in the report where they can be accessed using
#' `attr(data, "report")`.
#' @export
#'
#' @examples
#' # import the data
#' data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
#'
#' # standardize the date values
#' data <- data |>
#' standardize_dates(
#' target_columns = c("date_first_pcr_positive_test", "date.of.admission"),
#' error_tolerance = 0.4,
#' format = NULL,
#' timeframe = NULL
#' )
#'
#' good_date_sequence <- check_date_sequence(
#' data = readRDS(system.file("extdata", "test_df.RDS",
#' package = "cleanepi")),
#' target_columns = c("date_first_pcr_positive_test", "date.of.admission"),
#' remove = FALSE
#' data = data,
#' target_columns = c("date_first_pcr_positive_test", "date.of.admission")
#' )
check_date_sequence <- function(data, target_columns,
remove = FALSE) {

check_date_sequence <- function(data, target_columns) {
checkmate::assert_vector(target_columns, any.missing = FALSE, min.len = 1L,
max.len = dim(data)[2], null.ok = FALSE,
unique = TRUE)
checkmate::assert_data_frame(data, null.ok = FALSE)
checkmate::assert_logical(remove, any.missing = FALSE, len = 1L,
null.ok = FALSE)

# check if input is character string
if (all(grepl(",", target_columns, fixed = TRUE))) {
target_columns <- as.character(unlist(strsplit(target_columns, ",",
fixed = TRUE)))
target_columns <- trimws(target_columns)
}
# get the correct names in case some have been modified - see the
# `retrieve_column_names()` function for more details
target_columns <- retrieve_column_names(data, target_columns)
target_columns <- get_target_column_names(data, target_columns, cols = NULL)


# check if all columns are part of the data frame
if (!all(target_columns %in% names(data))) {
idx <- which(!(target_columns %in% names(data)))
Expand All @@ -54,14 +59,6 @@ check_date_sequence <- function(data, target_columns,
}
}

# check and convert to Date if required
for (cols in target_columns) {
if (!lubridate::is.Date(data[[cols]])) {
data <- standardize_dates(data, cols, timeframe = NULL,
error_tolerance = 0.5)
}
}

# checking the date sequence
tmp_data <- data %>% dplyr::select(dplyr::all_of(target_columns))
order_date <- apply(tmp_data, 1L, is_date_sequence_ordered)
Expand All @@ -76,11 +73,6 @@ check_date_sequence <- function(data, target_columns,
" incorrect date sequences at line(s): ",
glue::glue_collapse(bad_order, sep = ", "),
call. = FALSE)
if (remove) {
data <- data[-bad_order, ]
warning("The incorrect date sequences have been removed.",
call. = FALSE)
}
}

return(data)
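
Reviewer note: after this change `check_date_sequence()` no longer drops rows or calls `standardize_dates()` itself. It returns the input unchanged and records the offending rows in the `report` attribute. The snippet below is a minimal base-R sketch of that behaviour, not the package's implementation. `check_sequence_sketch()` and the report element name `incorrect_date_sequence` are hypothetical. Treating ties and missing dates as correctly ordered is also an assumption, and the warning emitted by the real function is omitted.

```r
# Hypothetical sketch: flag rows whose dates are not in the expected order
# and store them in the "report" attribute instead of removing them.
check_sequence_sketch <- function(data, target_columns) {
  stopifnot(is.data.frame(data), all(target_columns %in% names(data)))

  # One numeric column per target column; values are assumed to be Date
  # objects or ISO 8601 strings (no standardisation is attempted here).
  dates <- do.call(
    cbind,
    lapply(data[target_columns], function(x) as.numeric(as.Date(x)))
  )

  # A row is ordered when its non-missing dates never decrease
  # (assumption: ties and NA values do not count as errors).
  is_ordered <- apply(dates, 1L, function(row) {
    row <- row[!is.na(row)]
    all(diff(row) >= 0)
  })

  bad_rows <- which(!is_ordered)
  if (length(bad_rows) > 0L) {
    report <- attr(data, "report")
    if (is.null(report)) {
      report <- list()
    }
    # "incorrect_date_sequence" is an illustrative element name only.
    report[["incorrect_date_sequence"]] <-
      data[bad_rows, target_columns, drop = FALSE]
    attr(data, "report") <- report
  }

  data
}

df <- data.frame(
  id = 1:3,
  date_onset = as.Date(c("2024-01-01", "2024-02-10", "2024-03-05")),
  date_admission = as.Date(c("2024-01-03", "2024-02-01", "2024-03-05"))
)

res <- check_sequence_sketch(df, c("date_onset", "date_admission"))

# All three rows are kept; only the second row (admission before onset)
# is recorded in the report.
flagged <- attr(res, "report")[["incorrect_date_sequence"]]
```

Callers that relied on `remove = TRUE` would now need to filter the flagged rows themselves after inspecting the report.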
Expand Down