Add fwf_cols function #616
This adds a helper function `fwf_cols` that is a more intuitive way of specifying fixed width column start and end points. While `fwf_positions` requires three vectors for start, end, and names, `fwf_cols` accepts a named list of length-2 vectors of the column start and end positions.
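As a rough illustration of the difference between the two interfaces, the sketch below turns named length-2 vectors into the three parallel vectors that `fwf_positions` expects. The helper `named_pairs_to_vectors` is hypothetical and written in base R only; it is not the implementation in this PR.

```r
# Hypothetical helper (not part of readr): split named (start, end) pairs
# into the parallel start / end / col_names vectors.
named_pairs_to_vectors <- function(...) {
  x <- list(...)
  # Every argument must be named and hold exactly two values.
  stopifnot(length(x) > 0, !is.null(names(x)), all(names(x) != ""))
  stopifnot(all(lengths(x) == 2))
  list(
    start = vapply(x, function(p) as.integer(p[1]), integer(1), USE.NAMES = FALSE),
    end = vapply(x, function(p) as.integer(p[2]), integer(1), USE.NAMES = FALSE),
    col_names = names(x)
  )
}

spec <- named_pairs_to_vectors(name = c(1, 10), ssn = c(30, 42))
str(spec)
```

The three elements of `spec` line up with the `start`, `end`, and names arguments described above.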
I think this is an improvement, but I wonder if a wrapper around tribble would be even nicer.
I was thinking about whether a wrapper around a data frame would be useful and almost included a version in it, but decided against it. My thinking was that there are two main ways you could get the column specifications: (1) the column specifications are already in a data frame with two columns (variable, widths) or three columns (variable, start, end), or (2) they are entered by hand.

If the column specifications are already in a data frame (I'll call it `cols`), the call is fwf_positions(cols$start, cols$end, cols$varname). To me, that's still pretty clear, and not too much typing.

The second case is entering it by hand (when there are not too many columns). In that case, having the columns as argument names and the widths or (start, end) as values seems most natural.

# with widths
fwf_cols(foo = 1, bar = 5)
# with (start, end) tuples
fwf_cols(foo = c(1, 4), bar = c(5, 10))

This came up when I was helping a student read a fixed-width file. I foolishly didn't RTFM before writing code, and assumed that the format was something like what I just wrote. When we got an error and actually read the documentation, I was too lazy to change the code and used
It needs the .Rd file.
What if we allowed Then you'd have:

read_fwf(fwf_sample, tibble(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, tibble(name = 10, skip = 20, ssn = 12))
I don't know. It seems more natural and easy to document that If a user is able to write the following, it's about as concise as the code above, and I'd say as readable.

read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
read_fwf(fwf_sample, fwf_cols(name = 10, skip = 20, ssn = 12))

And the following would still work:

x <- tribble(
  ~col_name, ~start, ~end,
  "name", 1, 10,
  "ssn", 30, 42
)
read_fwf(fwf_sample, x)
- read_fwf arg col_positions will check for column names and whether the data frame is widths or a start/end data frame.
- rewrite fwf_cols to accept named args of length 1 or 2. This makes it more concise. Also accept a data frame as the first argument
- More checks for argument validity
- Use tibbles instead of lists where appropriate

Some tests failing. Still need to debug.
Now I have it so that
#' # 1. Guess based on position of empty columns
#' read_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#' # 2. A vector of field widths
#' read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#' # 3. Paired vectors of start and end positions
#' read_fwf(fwf_sample, fwf_positions(c(1, 30), c(10, 42), c("name", "ssn")))
#' # 4. Named arguments with start and end positions
#' read_fwf(fwf_sample, fwf_cols(name = c(1, 10), ssn = c(30, 42)))
Can you include the width form here too?
R/read_fwf.R
Outdated
return(tibble::data_frame())
return(tibble::tibble())
}
if (!is.list(col_positions)) {
This feels too complicated to me. If we have fwf_cols(), I don't think we need to worry about list/data.frame inputs.
R/read_fwf.R
Outdated
}

tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na, comment = comment)
tokenizer <- tokenizer_fwf(col_positions$begin, col_positions$end, na = na,
Can you change this back please?
@@ -1,3 +1,10 @@
fwf_col_names <- function(nm, n) {
This should be much lower in the file
#' @rdname read_fwf
#' @export
#' @param ... If the first element is a data frame,
This feels too flexible to me. But if you really think it's a good idea to keep it, the function signature should be x, ...
R/read_fwf.R
Outdated
names(x) <- fwf_col_names(names(x), length(x))
x <- tibble::as_tibble(x)
if (nrow(x) == 2) {
  fwf_positions(as.integer(x[1, ]),
Indenting style
R/read_fwf.R
Outdated
if (is.list(x[[1]])) {
  x <- x[[1]]
}
x <- try(lapply(x, as.integer), silent = TRUE)
I don't think I like this approach. I'd say just let the error bubble up to the user.
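One way to see the trade-off, as a sketch in base R with a made-up input (`bad` is illustrative, not taken from the PR): with `silent = TRUE`, `try()` swallows the failure and returns a "try-error" object that every later step has to remember to check, whereas an unwrapped call stops immediately with R's own message.

```r
# A malformed specification: the element is a list holding a length-2 vector,
# which as.integer() cannot coerce.
bad <- list(name = list(1:2, 3))

# Wrapped in try(): no error is raised, the failure is hidden in the result.
res <- try(lapply(bad, as.integer), silent = TRUE)
inherits(res, "try-error")

# Unwrapped, the same call raises an error. tryCatch() is used here only to
# capture the message so that this script keeps running.
msg <- tryCatch(
  lapply(bad, as.integer),
  error = function(e) conditionMessage(e)
)
print(msg)
```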
tests/testthat/test-read-fwf.R
Outdated
@@ -127,6 +127,43 @@ test_that("error on empty spec (#511, #519)", {
  expect_error(read_fwf(txt, pos), "Zero-length.*specifications not supported")
})

# fwf_cols
test_that("fwf_cols produces correct fwf_positions object with elements of length 2", {
  expected <- fwf_positions(c(1, 9, 4),
Can you please fix the indenting here too?
If the arguments don't fit on one line it should look like:
function_name(
  arg1,
  argument_name = arg2,
  ...
)
So this is a different style than the one in http://adv-r.had.co.nz/Style.html, which would be
function_name(arg1,
              argument_name = arg2,
              ...)
Move fwf_col_names function lower in file. See tidyverse#616
Add widths form of fwf_cols to documentation See tidyverse#616
This is too complicated; since we have fwf_cols, don't worry about list inputs. See tidyverse#616
Revert added newline See tidyverse#616
Remove a try() call, since the preference is for errors to bubble up to users. See tidyverse#616
This seems too flexible, so I'll change it to just use ... See tidyverse#616
See comments in tidyverse#616
Revert this section in read_fwf since it is unnecessary to handle data list objects with the availability of fwf_cols See tidyverse#616
Convert some numeric constants to integer constants so that addition/subtraction does not coerce columns to numeric if they were integer. This is not a big deal, but since the positions represent integers anyway, they might as well stay integers if they were specified as such.
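The coercion mentioned in this commit message is standard R arithmetic behavior and can be checked in isolation. The vector `begin` below is illustrative, not taken from the package source.

```r
begin <- c(1L, 30L)  # integer column positions

# Subtracting the double constant 1 promotes the result to double.
typeof(begin - 1)

# Subtracting the integer constant 1L keeps the result integer.
typeof(begin - 1L)
```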
Looks good. Now just needs a bullet point in NEWS.md in the appropriate place
NEWS.md
Outdated
@@ -1,5 +1,31 @@
# readr 1.1.0

* `fwf_cols()` allows for specifying the `col_positions` argument of
I think something went wrong with your merge 😞
Oops. wtf did that merge do? Bad git :-( Sorry about that, and fixed now.
weird things happened to NEWS.md. They are fixed now.
Thanks!