data.table_1.9.4 fread failing on field with newline in quoted field. #1201

Closed
GitKlip opened this issue Jun 27, 2015 · 4 comments

GitKlip commented Jun 27, 2015

When I use fread on a large CSV file that has a newline inside a quoted column, I get the following error. I see that similar issues have been fixed, but I am still having the same problem.

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
Running Version: data.table_1.9.4

Error in fread(input = unzFileName, sep = ",", header = TRUE, select = c("EVTYPE", :
Expected sep (',') but new line or EOF ends field 36 on line 249453 when reading data:
1.00,12/18/1996 0:00:00,"02:00:00 PM","CST",28.00,"ALZ028>029 - 035>038 - 040>049","AL","WINTER STORM",0.00,,,12/18/1996 0:00:00,"07:00:00 PM",0.00,,0.00,,,0.00,0.00,,0.00,0.00,0.00,240.00,"K",320.00,"K","BMX","ALABAMA, Central","CLAY - CLAY - RANDOLPH - CHILTON - COOSA - TALLAPOOSA - CHAMBERS - DALLAS - AUTAUGA - LOWNDES - ELMORE - MONTGOMERY - MACON - BULLOCK - LEE - RUSSELL - PIKE",0.00,0.00,0.00,0.00,"A snow storm that began in the early afternoon hours across the central sections of the state dumped 1 to 3 inches of snow on parts of the state. It was over by early evening. Schools and businesses let out early on the 18th across much of the area affected. A few roads became slick but there were no major travel problems reported. The snow remained on the ground in some areas for about two days. Here is a list of snowfall totals by county:
Autauga 2-3"" Bullock 1"" C
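For anyone trying to reproduce this on a small scale, here is a minimal sketch of a CSV whose quoted field contains a newline. The file contents and column names are made up for illustration and are not taken from the storm data. It reads the file with base R's read.csv as a reference point and, only if data.table happens to be installed, with fread as well. As noted later in the thread, small extracts may read fine even when the full file does not, so this only checks the basic case.

```r
# Minimal sketch: a tiny CSV with a newline inside a quoted field.
# File contents and column names are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'ID,EVTYPE,REMARKS',
  '1,"WINTER STORM","Snowfall totals by county:',
  'Autauga 2-3 in, Bullock 1 in"',
  '2,"HAIL","No damage reported."'
), csv_path)

# Base R treats the quoted newline as part of the REMARKS field,
# so the four physical lines become a header plus two records.
df <- read.csv(csv_path, stringsAsFactors = FALSE)
print(nrow(df))
print(grepl("\n", df$REMARKS[1], fixed = TRUE))

# Optional comparison with fread, skipped if data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(csv_path, sep = ",", header = TRUE)
  print(dim(dt))
}
```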

@jangorecki
Member

It would be good to check it using 1.9.5.
To better track down the issue, you could extract rows 249452-249454 from your file and check whether they are read correctly.
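One way to pull out such a window of lines is sketched below using base R only. `extract_lines` is a hypothetical helper, not part of data.table. It counts physical lines, so if a quoted field spans several lines the window may begin or end in the middle of a record; widening the range a little avoids that.

```r
# Hypothetical helper: copy the header plus physical lines `from`..`to`
# (1-based, from >= 2) of a text file into a new file and return its path.
extract_lines <- function(path, from, to, out = tempfile(fileext = ".csv")) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)                # line 1: column names
  if (from > 2L) readLines(con, n = from - 2L)    # discard lines 2..(from - 1)
  body <- readLines(con, n = to - from + 1L)      # keep lines from..to
  writeLines(c(header, body), out)
  out
}

# Small demonstration on a generated file.
demo <- tempfile(fileext = ".csv")
writeLines(c("id,txt", paste0(1:10, ",row", 1:10)), demo)
sample_file <- extract_lines(demo, from = 5L, to = 7L)
print(readLines(sample_file))
```

On the real file the call would look something like `extract_lines("redata-pa2-StormData.csv", 249452L, 249454L)`, and the resulting small file can then be passed to fread.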

GitKlip commented Jun 28, 2015

Thanks for the quick response. I'm having issues compiling 1.9.5... not sure what I'm doing wrong. (error at bottom of post).

When I extracted several examples of the issue into a test file (along with the rows before and after), I found that 1.9.4 was able to read them in successfully. But when I try to process the entire file, those same rows give me errors.

The zipped (bzip2) file can be found here:
https://github.com/GitKlip/RepData_PeerAssessment2/tree/master/data
I unzip using:
bunzip2(filename = zipFileName, destname = unzFileName, overwrite=TRUE, remove=FALSE)
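For readers without the package that provides bunzip2, the same decompression step can be sketched with base R connections. `decompress_bz2` is a hypothetical helper written for this illustration. It copies raw bytes, so the line endings of the original file are left untouched.

```r
# Hypothetical helper: decompress a .bz2 file using only base R connections.
decompress_bz2 <- function(src, dest, chunk_size = 1024L * 1024L) {
  inp <- bzfile(src, open = "rb")
  on.exit(close(inp), add = TRUE)
  out <- file(dest, open = "wb")
  on.exit(close(out), add = TRUE)
  repeat {
    chunk <- readBin(inp, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break    # end of compressed stream
    writeBin(chunk, out)
  }
  invisible(dest)
}

# Round-trip demonstration on a generated file.
src <- tempfile(fileext = ".csv.bz2")
con <- bzfile(src, open = "w")
writeLines(c("a,b", "1,2"), con)
close(con)

dest <- tempfile(fileext = ".csv")
decompress_bz2(src, dest)
print(readLines(dest))
```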

ERROR WHEN TRYING TO BUILD 1.9.5

install_github("Rdatatable/data.table", build_vignettes = FALSE)
Downloading github repo Rdatatable/data.table@master
Installing data.table
"C:/PROGRA1/R/R-321.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL
"C:/Users/Erik/AppData/Local/Temp/RtmpOsbhar/devtools313033432982/Rdatatable-data.table-c9cb395"
--library="C:/Users/Erik/Documents/R/win-library/3.2" --install-tests

  • installing source package 'data.table' ...
    ** libs
    *** arch - i386
    Warning: running command 'make -f "Makevars" -f "C:/PROGRA1/R/R-321.0/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-321.0/share/make/winshlib.mk" SHLIB="data.table.dll" OBJECTS="assign.o bmerge.o chmatch.o dogroups.o fastmean.o fastradixdouble.o fastradixint.o fcast.o fmelt.o forder.o frank.o fread.o gsumm.o ijoin.o init.o rbindlist.o reorder.o shift.o transpose.o uniqlist.o vecseq.o wrappers.o"' had status 127
    ERROR: compilation failed for package 'data.table'
  • removing 'C:/Users/Erik/Documents/R/win-library/3.2/data.table'
  • restoring previous 'C:/Users/Erik/Documents/R/win-library/3.2/data.table'
    Error: Command failed (1)
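A note on the build log above: R reports status 127 when a command could not be run at all, and the command here is make. One possible cause, which is an assumption and not something confirmed in this thread, is that the build tools are not on the PATH that R sees. A quick base R check is sketched below; `tool_on_path` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE if R can locate the given command-line tool.
# Sys.which() returns an empty string for tools it cannot find.
tool_on_path <- function(tool) {
  unname(nzchar(Sys.which(tool)))
}

# Check the tools a source install of a package with C code typically needs.
for (tool in c("make", "gcc")) {
  cat(sprintf("%-5s found: %s\n", tool, tool_on_path(tool)))
}
```

If either line prints FALSE, installing from source is likely to fail until the toolchain is installed and visible to R.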

@jangorecki
Member

Just to add the status of fread in 1.9.5 on your source file.

> dt <- fread("redata-pa2-StormData.csv", verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.523066 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 37 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: "STATE__",
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 1307675 (including 1 at the end)
Count of sep: 34819802
nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
Type codes (   first 5 rows): 3444344430000303003343333430000333303
Type codes (+ middle 5 rows): 3444344434444303443343333434440333343
Type codes (+   last 5 rows): 3444344434444303443343333434444333343
Type codes: 3444344434444303443343333434444333343 (after applying colClasses and integer64)
Type codes: 3444344434444303443343333434444333343 (after applying drop or select (if supplied)
Allocating 37 column slots (37 - 0 dropped)
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:13
   0.090s (  1%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   1.975s ( 15%) Count rows (wc -l)
   0.001s (  0%) Column type detection (first, middle and last 5 rows)
   1.833s ( 14%) Allocation of 902297x37 result (xMB) in RAM
   9.068s ( 70%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.044s (  0%) Changing na.strings to NA
  13.011s        Total
Warning message:
In fread("redata-pa2-StormData.csv", verbose = TRUE) :
  Read less rows (902297) than were allocated (967216). Run again with verbose=TRUE and please report.

According to the warning, the data were only partially loaded:

> str(dt)
Classes 'data.table' and 'data.frame':  902297 obs. of  37 variables:
 $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
 $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
...
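The allocation figure in the verbose log can be reproduced from the printed counts. The sketch below assumes the log's "nsep / ncol -1" is meant as nsep divided by (ncol - 1), which is the reading that yields 967216. A plausible reason the estimate exceeds the rows actually read, offered here as an assumption, is that commas and newlines inside quoted fields are included in the raw counts.

```r
# Counts copied from the verbose output above.
nsep   <- 34819802   # "Count of sep"
neol   <- 1307675    # "Count of eol"
nblank <- 1
ncol   <- 37
rows_read <- 902297  # "Read 902297 rows"

# Upper-bound row estimate, reading "nsep / ncol -1" as nsep / (ncol - 1).
estimate <- min(nsep %/% (ncol - 1), neol - nblank)
print(estimate)

# Rows allocated but never filled, which is what the warning reports.
print(estimate - rows_read)
```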

@arunsrinivasan
Member

With 1.9.5, the file is read in properly (and completely). With that, the warning Jan sees is a result of #1116 / #1239. Closing as duplicate.
