fread ignores colClasses factor assignment #721
Comments
Hi @AmyMikhail, thanks for the report. Just a couple of things on the code you use to convert to So, the idiomatic way to do it would be:
cols <- a character vector of all columns that should be factor type
amrdt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]
(Although it'd be more appropriate for Matt to comment on your actual post reg. your FR) I think it'd be nice to have
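A minimal, self-contained sketch of the conversion suggested above. The table `amrdt` below is a small made-up stand-in for the real data, and the column names `hospital_id`, `patient_id` and `result` are hypothetical.

```r
library(data.table)

# Toy stand-in for the real table; all column names are hypothetical.
amrdt <- data.table(
  hospital_id = c(101L, 102L, 101L),
  patient_id  = c("9001", "9002", "9003"),
  result      = c(0.5, 1.2, 0.7)
)

# Character vector of all columns that should be factor type.
cols <- c("hospital_id", "patient_id")

# := replaces only the listed columns, in place; the table itself is not copied.
amrdt[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

# Inspect the resulting column classes.
sapply(amrdt, class)
```

Columns not named in `cols` (here `result`) keep their original type.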
Thanks @arunsrinivasan; your suggestion has worked for the columns that I want to convert to factors. However, I could not get this to work with as.Date, as there seems to be no way of specifying the input date format with this syntax? I look forward to hearing if colClasses will be expanded.
@AmyMikhail I'm not sure I follow, but couldn't you just do:
amrdt[, (cols) := lapply(.SD, function(x) as.Date(x, <all_other_arguments>)), .SDcols=cols]
where @mattdowle any thoughts on
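As an illustration of the anonymous-function approach above, here is a small sketch that passes an explicit input format to as.Date. The column names and the day/month/year format are assumptions made for the example, not taken from the original data.

```r
library(data.table)

# Toy data with dates stored as day/month/year strings (assumed format).
amrdt <- data.table(
  specimen_date = c("03/01/2014", "15/02/2014"),
  report_date   = c("05/01/2014", "17/02/2014")
)

# Hypothetical names of the columns holding date strings.
cols <- c("specimen_date", "report_date")

# The anonymous function forwards the input format to as.Date for each column.
amrdt[, (cols) := lapply(.SD, function(x) as.Date(x, format = "%d/%m/%Y")),
      .SDcols = cols]

sapply(amrdt, class)
```

Because lapply forwards extra arguments to the function it calls, `lapply(.SD, as.Date, format = "%d/%m/%Y")` should be an equivalent, shorter spelling.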
Many thanks @arunsrinivasan; I'm still getting used to the data.table syntax... Both these changes still take a little while to run (since I have thousands of factor levels); if these variable type assignments can be incorporated directly into fread, would it be faster?
I'm quite positive that this FR would be a great addition, and Matt (and therefore But you should know that, unless you're plotting (where Out of curiosity, could you briefly explain what you do with these factor columns?
I'll be using the data.table to create summary tables of counts and proportions by date for different factor groupings, which I'll then work on further with the package surveillance. The surveillance package is quite fussy about the input format, so I will always need dates in the first column and each subsequent column to contain counts of records that are in various subgroups, by date, where each variable name is the subgroup (factor level). This differs a little from the standard output of summary tables with data.table (where subgroups are defined in a second variable rather than becoming the variable names) and I have worked out how to do this with dcast but have no idea how dependent my solution is on the grouping variables being factors.
Admittedly I didn't try my code with the variables in character form as I didn't know this would work. However, I also need to keep tabs on the number of factor levels in each variable (as this in itself is a descriptive summary of the data that I need, such as how many hospitals are represented in the dataset, or how many patients don't have NHS numbers etc.). I also need to keep track of the number of NAs and check that I haven't missed any items that need to be re-coded as NAs (the data is not entirely clean). str, summary etc. will not produce summary information for character vectors, but I'd be curious to know if there is another way of getting this information for character vectors?
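One possible way to get the information described above while the grouping variable stays a character column, sketched on made-up data. The table `dt` and its columns `date` and `hospital` are hypothetical; whether the real code behaves the same depends on details not shown in this thread.

```r
library(data.table)

# Toy data: one record per row, with a missing hospital in the last row.
dt <- data.table(
  date     = as.Date(c("2014-01-01", "2014-01-01", "2014-01-02", "2014-01-02")),
  hospital = c("A", "B", "A", NA)
)

# Names of all character columns.
char_cols <- names(dt)[sapply(dt, is.character)]

# Number of distinct non-missing values per character column
# (the analogue of the number of factor levels).
n_levels <- dt[, lapply(.SD, function(x) uniqueN(x[!is.na(x)])), .SDcols = char_cols]

# Number of NAs per character column.
n_missing <- dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols = char_cols]

# Wide table of counts: dates in the first column, one column per subgroup.
wide <- dcast(dt[!is.na(hospital)], date ~ hospital,
              fun.aggregate = length, value.var = "hospital")
```

In this sketch dcast works directly on the character column, so the wide layout does not require converting `hospital` to a factor first.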
Hi there, I've stumbled on this issue too. I'm trying to specify colClasses beforehand because I will use the data to train a GBM. It's possible to convert it manually, for sure, but is specifying colClasses on the roadmap?
Put me down as yet another person that would like to see full support for factors in the fread() colClasses argument. (And thanks to @arunsrinivasan for the tip for mass converting columns to factors!)
I asked for it already in 2013: http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-October/002178.html
I'm using the fread function of data.table to read in a large csv file (5 million records of 28 variables) as efficiently as possible on a laptop with just 4GB RAM. Many of my variables look numeric but they are actually factors (various id numbers). In order to avoid incorrect interpretation of these variables when reading into R, and also because I have read this further improves the speed of the import, I individually specified colClasses() for each variable in the call to fread. However, my assignments are ignored, see below:
Here are the results of fread. Seemingly fread has defaulted to automatic interpretation of colClasses:
I can convert everything to a factor with the following code:
... but this is very slow, copies the data (which I want to avoid) and I would need to adjust the colClasses() for those few columns that are not supposed to be factors.
From reading other posts it seems this behaviour has something to do with "factor" not being a basic type column class? In any case it would be enormously useful if fread could accept factors and other non-basic types of column class (such as the date type that I created above).
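One possible workaround for the situation described above, sketched under assumptions: read the id-like columns as character through colClasses so they are not interpreted as numbers, then convert only those columns to factor in place. The file contents and the column names `hospital_id`, `patient_id` and `value` are invented for the example, and the sketch says nothing about whether reading straight to factor would be faster.

```r
library(data.table)

# Write a tiny example file; the ids look numeric but are really labels.
tmp <- tempfile(fileext = ".csv")
writeLines(c("hospital_id,patient_id,value",
             "00101,9001,0.5",
             "00102,9002,1.2",
             "00101,9003,0.7"), tmp)

# Hypothetical names of the id columns.
id_cols <- c("hospital_id", "patient_id")

# Read the id columns as character so they are not parsed as numbers;
# the remaining columns are left to fread's automatic type detection.
dt <- fread(tmp, colClasses = list(character = id_cols))

# Convert just those columns to factor, in place.
dt[, (id_cols) := lapply(.SD, as.factor), .SDcols = id_cols]

unlink(tmp)
sapply(dt, class)
```

Reading the ids as character also preserves leading zeros, which would be lost if the column were read as a number.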
As a (related) aside, a feature that would be really useful is if one could convert an ffdf object directly to a data.table. The reason I have read in a .csv file with the above code is because I couldn't figure out a way to do this (so wrote to .csv with write.csv.ffdf, which took about 30 mins to write to a 1.002 GB file and then read in the .csv with fread, which took 3 mins 12 secs). If it were possible to convert directly from ffdf to a data.table with fread, without dumping into a .csv first, that would be a significant time saving.
Many thanks for your help.