Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow NA as a factor level with fread stringsAsFactors = TRUE #1408

Merged
merged 1 commit into from
Oct 27, 2015
Merged

Conversation

DexGroves
Copy link

The old fread with stringsAsFactors = TRUE was producing undesirable and difficult to manipulate factors. NA gets assigned as a possible level of the factor, and is.na doesn't produce TRUE.

devtools::install_github("Rdatatable/data.table")

library("data.table")

data <- "a   |b |c  |d
         this|is|row|1
         this|is|row|2
         this|NA|NA|3
         this|is|row|4
         this|is|row|5
         this|is|row|6
         this|is|row|7
         this|is|row|8
         this|is|row|9
         this|is|row|10"

tmp_filename <- tempfile()
write(data, tmp_filename)

test_dt <- fread(tmp_filename, sep = "|", stringsAsFactors = TRUE)
test_df <- read.delim(tmp_filename, sep = "|", stringsAsFactors = TRUE)

# fread produces an NA that produces FALSE with is.na
is.na(test_dt$b[3])
# [1] FALSE
is.na(test_df$b[3])
# [1] TRUE

# Even weirder, `levels(x) <- levels(x)` 'fixes' the is.na behaviour.
is.na(test_dt$b[3])
# [1] FALSE
levels(test_dt$b) <- levels(test_dt$b)
is.na(test_dt$b[3])
# [1] TRUE

This is annoying especially for situations where you want to recast an NA so you can use it with model.matrix or similar.

dt[is.na(factor_with_NAs), factor_with_NAs := "Missing"]
# arrrrgh

Simple fix forbids NA as a factor level, in line with the the default behavior of as.factor.

@arunsrinivasan
Copy link
Member

Thanks. That makes an unnecessary copy of lev for all cases (even when there's no NA). A more efficient fix is to replace the line:

lev = forderv(x, retGrp = TRUE)

with

lev = forderv(x, retGrp = TRUE, na.last = NA)

I'd be happy to accept the PR if you could edit / rebase with this instead, add tests in tests.Rraw file and update README with this PR under bugs for v1.9.7.

@arunsrinivasan
Copy link
Member

Please merge/rebase.

Tidying up PR #1408

Fix referencing the wrong issue
arunsrinivasan added a commit that referenced this pull request Oct 27, 2015
Don't allow NA as a factor level with fread stringsAsFactors = TRUE
@arunsrinivasan arunsrinivasan merged commit 67069da into Rdatatable:master Oct 27, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants