-
Notifications
You must be signed in to change notification settings - Fork 369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation for Query.jl #1105
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's really useful! I'd rather create a "Querying frameworks" page where both Queries.jl and StructuredQueries.jl will be presented in separate sections. Else, users might not immediately see from the "Query package" title why they should be interested in this.
@@ -0,0 +1,69 @@ | |||
# Query package | |||
|
|||
The [Query.jl](https://github.com/davidanthoff/Query.jl) package provides advanced data manipulation capabilities for `DataFrames` (and many other data structures). This section provides a short introduction to the package, the [Query.jl](https://github.com/davidanthoff/Query.jl) [documentation](http://www.david-anthoff.com/Query.jl/stable/) has a more comprehensive documentation of the package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this a single link to the docs, as there's already a link to the package project above (same remark for the last paragraph of the page).
Pkg.add("Query") | ||
``` | ||
|
||
A query is started with the `@from` macro and consists of a series of query commands. Query provides commands that can filter, project, join, group, flatten and group data from a `DataFrame`. A query returns either an iterator of the transformed data, or one can materialize the results of a query into a variety of data structures, including a new `DataFrame`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query -> Query.jl? Else it's going to be quite confusing with the term "query".
Syntax is a bit weird in the second part of the sentence: it doesn't match well with "returns either".
```julia | ||
using DataFrames, Query | ||
|
||
df = DataFrame(name=["John", "Sally", "Roger"], age=[54., 34., 79.], children=[0, 2, 4]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not integer ages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because age is a continuous variable. Doesn't really matter, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not really.
|
||
df = DataFrame(name=["John", "Sally", "Roger"], age=[54., 34., 79.], children=[0, 2, 4]) | ||
|
||
x = @from i in df begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a way of printing the results too? I think a doctest should do it. That would make it much easier to follow, without running the code in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the output for now. Doctests would be great, but the dependency management for doctests is not great at this point, i.e. you would have to manually make sure Query.jl
is installed. Maybe easier to wait until Documenter.jl has a better story for managing dependencies and then converting things to doctests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean, it will need to be installed in order to build the docs? Doesn't sound too much of a requirement. I'm not sure how we can ensure building the online docs work, though, but it would probably be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add it to the travis script in the same place as Documenter
is installed?
We'll get some kind of docs/REQUIRE
eventually, but it'll need to be in Base
to be of much use to anyone. I'll get around to implementing it at some point...
end | ||
``` | ||
|
||
The `@where` command in this query will filter the source data by applying the filter condition `i.age > 40`. This filters out any rows in which the `age` column is not larger than 40. The `@select` command then projects the columns of the source data onto a new column structure. The example here applies three specific modifications: 1) it only keeps a subset of the columns in the source `DataFrame`, i.e. the `age` column will not be part of the transformed data; 2) it changes the order of the two columns that are selected; and 3) it renames one of the columns that is selected from `children` to `number_of_children`. The `@collect` statement determines the data structure that the query returns. In this example the results are returned as a `DataFrame`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe explain what i
is? Same for {}
.
A query without a `@collect` statement returns a standard julia iterator that can be used with any normal julia language construct that can deal with iterators. Here are some examples: | ||
|
||
```julia | ||
using DataFrames, Query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't repeat this line, nor the one below (same for next example).
end | ||
``` | ||
|
||
A query that ends with a `@collect` statement without a specific type will materialize the query results into an array. Note also the difference in the `@select` statement: The previous queries all used the `{}` syntax in the `@select` statement to return results that had a column structure. The last query instead just selects a single value from each row in the `@select` statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I first understood "to return results that had a column structure" to mean "return results whose structure was columnar" instead of "return results in a columnar form". Can you make this more explicit? Maybe "return results in a tabular form" or something like that?
@nalimilan Thanks for the great feedback! I think I addressed all your points. When I run |
│ 2 │ 4 │ Roger │ | ||
``` | ||
|
||
The query starts with the ``@from`` macro. The first argument ``i`` is the name of the range variable that will be used to refer to an individual row in later query commands. The next argument ``df`` is the data source that one wants to query. The `@where` command in this query will filter the source data by applying the filter condition `i.age > 40`. This filters out any rows in which the `age` column is not larger than 40. The `@select` command then projects the columns of the source data onto a new column structure. The example here applies three specific modifications: 1) it only keeps a subset of the columns in the source `DataFrame`, i.e. the `age` column will not be part of the transformed data; 2) it changes the order of the two columns that are selected; and 3) it renames one of the columns that is selected from `children` to `number_of_children`. The example query uses the ``{}`` syntax to achieve this. A ``{}`` in a Query.jl expression instantiates a new [NamedTuple](https://github.com/blackrock/NamedTuples.jl), i.e. it is a shortcut for writing ``@NT(number_of_children=>i.children, name=>i.name)``. The `@collect` statement determines the data structure that the query returns. In this example the results are returned as a `DataFrame`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use single backquotes everywhere, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC single backticks are for code and double are for LaTeX math.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, single backticks for these ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh, sorry, I was still in github markdown mode. Fixed with the next push.
Sorry, no idea. I'm not familiar yet with Documenter.jl Not sure if that's expected, but on Julia 0.6, I get this error if I don't load julia> q1 = @from i in df begin
@where i.age > 40
@select {number_of_children=i.children, i.name}
@collect DataFrame
end
ERROR: UndefVarError: @NT not defined |
I'll have a look, not come across missing stylesheet problems before. |
@where i.age > 40 | ||
@select {number_of_children=i.children, i.name} | ||
end | ||
```` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One too many backticks here.
|
||
Or one can use a comprehension to extract the name of a subset of rows: | ||
|
||
````julia |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One too many backticks.
@davidanthoff the stylesheets are loading alright for me on a local build of this branch. What are you versions of Documenter and Julia? Which web browser are you viewing the docs in? |
I need to tag a new version, things should work on Query.jl |
julia is 0.5.0 on Windows, Documenter is 0.6.0. I narrowed the problem down: The link to "Querying frameworks" in the navigation area on the left is So I think all that needs to be done is make sure that there are no two |
Can you change examples into doctests then? |
Done. |
@collect DataFrame | ||
end | ||
|
||
println(q1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right that you shouldn't need println
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, removed in the next push.
|
||
```@meta | ||
DocTestSetup = quote | ||
using DataFrames, Query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no way to avoid repeating the code block from above? There's always the risk of them not being in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Each jldoctest block is run independently and no state is preserved between these code blocks. I guess the only alternative would be to repeat the query itself in the jldoctest, and only have the using
and DataFrame construction code in the @meta
section.
│ 2 │ 4 │ Roger │ | ||
``` | ||
|
||
```@meta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary (and below too)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@meta
blocks are like global state, i.e. they are run before every jldoctest
block that follows the @meta
block. I think "clearing" that out after it has been used is safer, otherwise if someone adds another jldoctest
later in the documentation, it might accidentally pick up this init code from a previous jldoctest
block.
Really it would be nice if one could specify an init code block for a given jldoctest
that is only run before that code block. But I don't think that can be done right now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially I'm just following the guidance at the bottom of the page here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
@MichaelHatherly Do you think it would be possible to have a way of specifying that a series of doctests goes together and should share state? Looks like it could also be useful to allow setting a DocTestSetup
block to be used only by the following doctests, but not the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we probably could do. There's already precedence for having "named" groups with the @repl
and @example
blocks, so we could probably do the same thing with doctests.
FWIW, named @example
blocks might be a better fit for this section of docs anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's exactly what I have in mind. @davidagold Do you feel like changing the examples to use the same @example
name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump!
There are conflicts with master. |
I probably missed a conversation somewhere about it, but why are we running Query doctests from DataFrames? |
@ararslan @nalimilan suggested that we add a section about query frameworks to the documentation, and also that I convert the examples into doc tests. |
The idea is that if we don't ensure examples work and give the expected result, they will stop working at some point (witness the recent issue with constructor examples). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, LGTM then.
@davidanthoff Can you just make the move to |
I just tried to implement the Maybe we should merge this PR at this point, given that it works, and then we can still change this at a later point? |
@MichaelHatherly Any suggestions?
We can certainly do that if |
With the following diff everything displays as it should on Linux/Windows Julia 0.5: https://gist.github.com/MichaelHatherly/48f57ea5a60e0334e1cfc77feb68b4f4. |
Thanks, now it seems to work, I must have missed something yesterday. The Also is there some way to have some text between the code example block and the output? Right now there is no visual indication that one is code and the other one is output, which looks strange to me. |
Just as an aside, it's best to avoid using @ in commit messages, as it always pings the GitHub user with that name. Common replacements would be "example macro" or "at-example." |
Is there no way to check the output of |
I don't think so. |
(Sorry, been busy with non-Julia things.) There's no way to check
Is this the character width for the unicode chars that form the table frame? If so there's not much that can be done in Documenter I don't think aside from changing the font again to something that properly supports those characters.
Probably can do. Proposals/PRs for style changes to the output would definitely be welcomed. |
OK, though that's kind of unfortunate. Any plans to change this in the future? I don't like encouraging people to write untested examples... |
Thanks @davidanthoff! |
Hm, when I look at this here now it doesn't show any of the example output... |
Package dependencies need fixing, |
Why not make the |
That would be quite a bad breaking change.
I agree with that, but it does need a gradual deprecation period to not cause too much trouble for all the packages that are currently ignoring those warnings. Since Documenter is currently going through the ssh-key deprecation I'd also prefer to only do one major deprecation at a time. |
For what its worth, I think the doctests should actually run as part of the normal test suite, i.e. one should be able to add a line like |
For the failure itself: Query currently targets the tagged versions of DataFrames. So I guess this PR here really should go into a release branch for the 0.8.x series of DataFrames... Is there some way to publish new doc versions as stable before the current DataFrames master gets tagged? |
Just make a PR against |
@nalimilan Here is a first go at a section for the doc that deals with Query that you suggested. My idea was to give a hint of how things work, and then point folks to the Query documentation.