data.table_1.9.4 fread failing on field with newline in quoted field. #1201

GitKlip · 2015-06-27T19:21:14Z

When I use fread on a large CSV file that has a newline inside a quoted column, I get the error below. I see that similar issues have been fixed, but I still have the same problem.

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
Running Version: data.table_1.9.4

Error in fread(input = unzFileName, sep = ",", header = TRUE, select = c("EVTYPE", :
Expected sep (',') but new line or EOF ends field 36 on line 249453 when reading data: 1.00,12/18/1996 0:00:00,"02:00:00 PM","CST",28.00,"ALZ028>029 - 035>038 - 040>049","AL","WINTER STORM",0.00,,,12/18/1996 0:00:00,"07:00:00 PM",0.00,,0.00,,,0.00,0.00,,0.00,0.00,0.00,240.00,"K",320.00,"K","BMX","ALABAMA, Central","CLAY - CLAY - RANDOLPH - CHILTON - COOSA - TALLAPOOSA - CHAMBERS - DALLAS - AUTAUGA - LOWNDES - ELMORE - MONTGOMERY - MACON - BULLOCK - LEE - RUSSELL - PIKE",0.00,0.00,0.00,0.00,"A snow storm that began in the early afternoon hours across the central sections of the state dumped 1 to 3 inches of snow on parts of the state. It was over by early evening. Schools and businesses let out early on the 18th across much of the area affected. A few roads became slick but there were no major travel problems reported. The snow remained on the ground in some areas for about two days. Here is a list of snowfall totals by county:
Autauga 2-3"" Bullock 1"" C

jangorecki · 2015-06-27T19:30:32Z

It would be good to check this with 1.9.5.
To narrow the issue down, you could extract rows 249452-249454 from your file and check whether they are read correctly on their own.
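
One way to pull out such a window of rows is sketched below in base R. The helper name extract_lines and its arguments are made up for illustration, and it works on physical lines, so a quoted field that contains a newline spans more than one of them; widening the window by a few lines on each side may be needed.

```r
# Hypothetical helper: copy physical lines from..to of a text file into `out`,
# optionally keeping the first line (the header) so the excerpt stays readable.
extract_lines <- function(path, from, to, out, keep_header = TRUE) {
  stopifnot(from >= 1, to >= from)
  # Read only as far as the end of the window, not the whole file.
  lines <- readLines(path, n = to, warn = FALSE)
  if (length(lines) < from) stop("file has fewer than 'from' lines")
  window <- lines[from:length(lines)]
  if (keep_header && from > 1) window <- c(lines[1], window)
  writeLines(window, out)
  invisible(out)
}

# Small demonstration on a throwaway file (contents are invented).
demo <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:9, letters[1:9], sep = ",")), demo)
small <- extract_lines(demo, from = 4, to = 6, out = tempfile(fileext = ".csv"))
readLines(small)
```

For the file in this thread the call would look something like extract_lines("redata-pa2-StormData.csv", 249452, 249454, "excerpt.csv"), after which fread can be tried on the excerpt alone.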

GitKlip · 2015-06-28T02:36:53Z

Thanks for the quick response. I'm having issues compiling 1.9.5... not sure what I'm doing wrong. (error at bottom of post).

When I extracted several examples of the issue into a test file (along with the rows before and after), I found that 1.9.4 was able to read them in successfully. But when I try to process the entire file, those same rows give me errors.
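
A tiny self-contained check of the same pattern (a quoted field with a line break in it) could look like the sketch below. The column names and contents are invented, base R's read.csv is used as a reference parser, and the fread call only runs if data.table is installed; a file this small may well not trigger whatever goes wrong on the full file.

```r
# Build a small CSV whose last field contains a newline inside quotes.
# All names and values here are made up for illustration.
csv_lines <- c(
  "id,state,remarks",
  "1,\"AL\",\"Snowfall totals by county:",   # quoted field opens here ...
  "Autauga 2-3 in, Bullock 1 in\"",          # ... and closes on the next line
  "2,\"AL\",\"No remarks\""
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Reference: base R treats the two physical lines as one record.
ref <- read.csv(tmp, stringsAsFactors = FALSE)
print(nrow(ref))

# Same file through fread, if data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(tmp, sep = ",", header = TRUE)
  print(nrow(dt))
}
```

If both row counts agree on the excerpt but fread still fails on the whole file, that points at something that depends on position in the file rather than at the rows themselves.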

The zipped file can be found here:
https://github.com/GitKlip/RepData_PeerAssessment2/tree/master/data
I unzip it using bunzip2 (from the R.utils package):
bunzip2(filename = zipFileName, destname = unzFileName, overwrite=TRUE, remove=FALSE)

ERROR WHEN TRYING TO BUILD 1.9.5

install_github("Rdatatable/data.table", build_vignettes = FALSE)
Downloading github repo Rdatatable/data.table@master
Installing data.table
"C:/PROGRA~1/R/R-32~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL
"C:/Users/Erik/AppData/Local/Temp/RtmpOsbhar/devtools313033432982/Rdatatable-data.table-c9cb395"
--library="C:/Users/Erik/Documents/R/win-library/3.2" --install-tests

installing source package 'data.table' ...
** libs
*** arch - i386
Warning: running command 'make -f "Makevars" -f "C:/PROGRA~1/R/R-32~1.0/etc/i386/Makeconf" -f "C:/PROGRA~1/R/R-32~1.0/share/make/winshlib.mk" SHLIB="data.table.dll" OBJECTS="assign.o bmerge.o chmatch.o dogroups.o fastmean.o fastradixdouble.o fastradixint.o fcast.o fmelt.o forder.o frank.o fread.o gsumm.o ijoin.o init.o rbindlist.o reorder.o shift.o transpose.o uniqlist.o vecseq.o wrappers.o"' had status 127
ERROR: compilation failed for package 'data.table'

removing 'C:/Users/Erik/Documents/R/win-library/3.2/data.table'

restoring previous 'C:/Users/Erik/Documents/R/win-library/3.2/data.table'
Error: Command failed (1)
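
An exit status of 127 is the usual shell code for "command not found", so one plausible reading of the log above (an assumption, nothing in this thread confirms it) is that make, which on Windows normally comes with Rtools, is not on the PATH seen by R. A quick check from the R console, as a sketch:

```r
# Look for 'make' on the PATH that this R session sees.
# Sys.which() returns "" when the command cannot be found.
make_path <- unname(Sys.which("make"))
have_make <- nzchar(make_path)

if (have_make) {
  message("make found at: ", make_path)
} else {
  message("make not found on PATH; building a source package with C code will likely fail")
}
```

If make is missing, installing Rtools and making sure its bin directory is on the PATH is the usual remedy before retrying install_github.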

jangorecki · 2015-06-28T19:50:57Z

Just to add the status of fread in 1.9.5 on your source file:

> dt <- fread("redata-pa2-StormData.csv", verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.523066 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 37 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: "STATE__",
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 1307675 (including 1 at the end)
Count of sep: 34819802
nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
Type codes (   first 5 rows): 3444344430000303003343333430000333303
Type codes (+ middle 5 rows): 3444344434444303443343333434440333343
Type codes (+   last 5 rows): 3444344434444303443343333434444333343
Type codes: 3444344434444303443343333434444333343 (after applying colClasses and integer64)
Type codes: 3444344434444303443343333434444333343 (after applying drop or select (if supplied)
Allocating 37 column slots (37 - 0 dropped)
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:13
   0.090s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   1.975s ( 15%) Count rows (wc -l)
   0.001s (  0%) Column type detection (first, middle and last 5 rows)
   1.833s ( 14%) Allocation of 902297x37 result (xMB) in RAM
   9.068s ( 70%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.044s (  0%) Changing na.strings to NA
  13.011s        Total
Warning message:
In fread("redata-pa2-StormData.csv", verbose = TRUE) :
  Read less rows (902297) than were allocated (967216). Run again with verbose=TRUE and please report.

Judging by the warning, the data were only partially loaded:

> str(dt)
Classes ‘data.table’ and 'data.frame':  902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
...
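
The row estimate in the verbose output above can be reproduced by hand, which helps when reading the warning. The printed expression "nsep / ncol -1" only matches the reported 967216 if it is read as nsep / (ncol - 1), that is, separators per row. The numbers below are copied from the log; the explanation in the comments for why the estimate overshoots is an interpretation, not something stated in the output.

```r
# Counts taken from the verbose output above.
nsep   <- 34819802   # Count of sep
neol   <- 1307675    # Count of eol
ncol   <- 37         # Detected columns
nblank <- 1
rows_read <- 902297  # "Read 902297 rows"

# A row with 37 columns has 36 separators, hence the division by (ncol - 1).
est <- min(nsep %/% (ncol - 1), neol - nblank)
est              # 967216, the allocated row count in the warning

# Likely cause of the overshoot: commas and newlines inside quoted fields
# (for example "ALABAMA, Central") are counted as if they were delimiters.
est - rows_read  # 64919 rows allocated but not filled
```

An over-allocation of this kind is what the "Read less rows than were allocated" warning reports; on its own it does not show which rows, if any, are missing.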

arunsrinivasan · 2015-09-09T10:59:52Z

With 1.9.5, the file is read in properly (and completely). With that, the warning Jan sees is a result of #1116 / #1239. Closing as duplicate.

arunsrinivasan added the bug label Jun 29, 2015

arunsrinivasan added the fread label Sep 4, 2015

arunsrinivasan closed this as completed Sep 9, 2015

arunsrinivasan mentioned this issue Sep 9, 2015

fread: Read less rows (1327) than were allocated (1607) #1116

Closed
