Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for Query.jl #1105

Merged
merged 8 commits into from
Nov 12, 2016
Merged

Conversation

davidanthoff
Copy link
Contributor

@nalimilan Here is a first go at a section for the doc that deals with Query that you suggested. My idea was to give a hint of how things work, and then point folks to the Query documentation.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's really useful! I'd rather create a "Querying frameworks" page where both Queries.jl and StructuredQueries.jl will be presented in separate sections. Else, users might not immediately see from the "Query package" title why they should be interested in this.

@@ -0,0 +1,69 @@
# Query package

The [Query.jl](https://github.com/davidanthoff/Query.jl) package provides advanced data manipulation capabilities for `DataFrames` (and many other data structures). This section provides a short introduction to the package, the [Query.jl](https://github.com/davidanthoff/Query.jl) [documentation](http://www.david-anthoff.com/Query.jl/stable/) has a more comprehensive documentation of the package.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this a single link to the docs, as there's already a link to the package project above (same remark for the last paragraph of the page).

Pkg.add("Query")
```

A query is started with the `@from` macro and consists of a series of query commands. Query provides commands that can filter, project, join, group, flatten and group data from a `DataFrame`. A query returns either an iterator of the transformed data, or one can materialize the results of a query into a variety of data structures, including a new `DataFrame`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query -> Query.jl? Else it's going to be quite confusing with the term "query".

Syntax is a bit weird in the second part of the sentence: it doesn't match well with "returns either".

```julia
using DataFrames, Query

df = DataFrame(name=["John", "Sally", "Roger"], age=[54., 34., 79.], children=[0, 2, 4])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not integer ages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because age is a continuous variable. Doesn't really matter, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not really.


df = DataFrame(name=["John", "Sally", "Roger"], age=[54., 34., 79.], children=[0, 2, 4])

x = @from i in df begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way of printing the results too? I think a doctest should do it. That would make it much easier to follow, without running the code in parallel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the output for now. Doctests would be great, but the dependency management for doctests is not great at this point, i.e. you would have to manually make sure Query.jl is installed. Maybe easier to wait until Documenter.jl has a better story for managing dependencies and then converting things to doctests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean, it will need to be installed in order to build the docs? Doesn't sound too much of a requirement. I'm not sure how we can ensure building the online docs work, though, but it would probably be possible.

@MichaelHatherly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add it to the travis script in the same place as Documenter is installed?

We'll get some kind of docs/REQUIRE eventually, but it'll need to be in Base to be of much use to anyone. I'll get around to implementing it at some point...

end
```

The `@where` command in this query will filter the source data by applying the filter condition `i.age > 40`. This filters out any rows in which the `age` column is not larger than 40. The `@select` command then projects the columns of the source data onto a new column structure. The example here applies three specific modifications: 1) it only keeps a subset of the columns in the source `DataFrame`, i.e. the `age` column will not be part of the transformed data; 2) it changes the order of the two columns that are selected; and 3) it renames one of the columns that is selected from `children` to `number_of_children`. The `@collect` statement determines the data structure that the query returns. In this example the results are returned as a `DataFrame`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain what i is? Same for {}.

A query without a `@collect` statement returns a standard julia iterator that can be used with any normal julia language construct that can deal with iterators. Here are some examples:

```julia
using DataFrames, Query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't repeat this line, nor the one below (same for next example).

end
```

A query that ends with a `@collect` statement without a specific type will materialize the query results into an array. Note also the difference in the `@select` statement: The previous queries all used the `{}` syntax in the `@select` statement to return results that had a column structure. The last query instead just selects a single value from each row in the `@select` statement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first understood "to return results that had a column structure" to mean "return results whose structure was columnar" instead of "return results in a columnar form". Can you make this more explicit? Maybe "return results in a tabular form" or something like that?

@davidanthoff
Copy link
Contributor Author

@nalimilan Thanks for the great feedback! I think I addressed all your points.

When I run make.jl locally, the docs don't seem to pick up the stylesheet, so I have a hard time checking the final look here. I don't understand why that is happening because I use pretty much the same setup for the Query.jl documentation, and there everything works great locally...

│ 2 │ 4 │ Roger │
```

The query starts with the ``@from`` macro. The first argument ``i`` is the name of the range variable that will be used to refer to an individual row in later query commands. The next argument ``df`` is the data source that one wants to query. The `@where` command in this query will filter the source data by applying the filter condition `i.age > 40`. This filters out any rows in which the `age` column is not larger than 40. The `@select` command then projects the columns of the source data onto a new column structure. The example here applies three specific modifications: 1) it only keeps a subset of the columns in the source `DataFrame`, i.e. the `age` column will not be part of the transformed data; 2) it changes the order of the two columns that are selected; and 3) it renames one of the columns that is selected from `children` to `number_of_children`. The example query uses the ``{}`` syntax to achieve this. A ``{}`` in a Query.jl expression instantiates a new [NamedTuple](https://github.com/blackrock/NamedTuples.jl), i.e. it is a shortcut for writing ``@NT(number_of_children=>i.children, name=>i.name)``. The `@collect` statement determines the data structure that the query returns. In this example the results are returned as a `DataFrame`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use single backquotes everywhere, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC single backticks are for code and double are for LaTeX math.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, single backticks for these ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, sorry, I was still in github markdown mode. Fixed with the next push.

@nalimilan
Copy link
Member

When I run make.jl locally, the docs don't seem to pick up the stylesheet, so I have a hard time checking the final look here. I don't understand why that is happening because I use pretty much the same setup for the Query.jl documentation, and there everything works great locally...

Sorry, no idea. I'm not familiar yet with Documenter.jl

Not sure if that's expected, but on Julia 0.6, I get this error if I don't load NamedTuples in addition to Query:

julia> q1 = @from i in df begin
                   @where i.age > 40
                   @select {number_of_children=i.children, i.name}
                   @collect DataFrame
              end

ERROR: UndefVarError: @NT not defined

@MichaelHatherly
Copy link
Contributor

When I run make.jl locally, the docs don't seem to pick up the stylesheet, so I have a hard time checking the final look here.

I'll have a look, not come across missing stylesheet problems before.

@where i.age > 40
@select {number_of_children=i.children, i.name}
end
````
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One too many backticks here.


Or one can use a comprehension to extract the name of a subset of rows:

````julia
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One too many backticks.

@MichaelHatherly
Copy link
Contributor

@davidanthoff the stylesheets are loading alright for me on a local build of this branch. What are you versions of Documenter and Julia? Which web browser are you viewing the docs in?

@davidanthoff
Copy link
Contributor Author

I get this error if I don't load NamedTuples in addition to Query

I need to tag a new version, things should work on Query.jl master.

@davidanthoff
Copy link
Contributor Author

What are you versions of Documenter and Julia? Which web browser are you viewing the docs in?

julia is 0.5.0 on Windows, Documenter is 0.6.0. I narrowed the problem down: The link to "Querying frameworks" in the navigation area on the left is file:///C:/Users/anthoff/.julia/v0.5/DataFrames/docs/build/man//querying_frameworks.html. Note that there are two / between man and querying_frameworks.html. When I click that it shows the content of that page, but the stylesheets are messed up. If I then manually remove the second /, everything looks correct.

So I think all that needs to be done is make sure that there are no two / in a row in the URLs.

@nalimilan
Copy link
Member

Can you change examples into doctests then?

@davidanthoff
Copy link
Contributor Author

Can you change examples into doctests then?

Done.

@collect DataFrame
end

println(q1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right that you shouldn't need println?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed in the next push.


```@meta
DocTestSetup = quote
using DataFrames, Query
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no way to avoid repeating the code block from above? There's always the risk of them not being in sync.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Each jldoctest block is run independently and no state is preserved between these code blocks. I guess the only alternative would be to repeat the query itself in the jldoctest, and only have the using and DataFrame construction code in the @meta section.

│ 2 │ 4 │ Roger │
```

```@meta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary (and below too)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meta blocks are like global state, i.e. they are run before every jldoctest block that follows the @meta block. I think "clearing" that out after it has been used is safer, otherwise if someone adds another jldoctest later in the documentation, it might accidentally pick up this init code from a previous jldoctest block.

Really it would be nice if one could specify an init code block for a given jldoctest that is only run before that code block. But I don't think that can be done right now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially I'm just following the guidance at the bottom of the page here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@MichaelHatherly Do you think it would be possible to have a way of specifying that a series of doctests goes together and should share state? Looks like it could also be useful to allow setting a DocTestSetup block to be used only by the following doctests, but not the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we probably could do. There's already precedence for having "named" groups with the @repl and @example blocks, so we could probably do the same thing with doctests.

FWIW, named @example blocks might be a better fit for this section of docs anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's exactly what I have in mind. @davidagold Do you feel like changing the examples to use the same @example name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump!

@nalimilan
Copy link
Member

There are conflicts with master.

@ararslan
Copy link
Member

I probably missed a conversation somewhere about it, but why are we running Query doctests from DataFrames?

@davidanthoff
Copy link
Contributor Author

@ararslan @nalimilan suggested that we add a section about query frameworks to the documentation, and also that I convert the examples into doc tests.

@nalimilan
Copy link
Member

The idea is that if we don't ensure examples work and give the expected result, they will stop working at some point (witness the recent issue with constructor examples).

Copy link
Member

@ararslan ararslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, LGTM then.

@nalimilan
Copy link
Member

@davidanthoff Can you just make the move to @example so that we can merge this? Thanks!

@davidanthoff
Copy link
Contributor Author

I just tried to implement the @example thing, but at least on my system it doesn't splice the results from the last line of such a block into the output. I'm on Windows trying this with Documenter v0.7.1.

Maybe we should merge this PR at this point, given that it works, and then we can still change this at a later point?

@nalimilan
Copy link
Member

I just tried to implement the @example thing, but at least on my system it doesn't splice the results from the last line of such a block into the output. I'm on Windows trying this with Documenter v0.7.1.

@MichaelHatherly Any suggestions?

Maybe we should merge this PR at this point, given that it works, and then we can still change this at a later point?

We can certainly do that if @example doesn't work, but I'd rather do it before merging if that's not too hard.

@MichaelHatherly
Copy link
Contributor

With the following diff everything displays as it should on Linux/Windows Julia 0.5: https://gist.github.com/MichaelHatherly/48f57ea5a60e0334e1cfc77feb68b4f4.

@davidanthoff
Copy link
Contributor Author

Thanks, now it seems to work, I must have missed something yesterday.

The DataFrame display formatting is pretty messed up. I think there is actually a bug both in the DataFrame show method and in Documenter here.

Also is there some way to have some text between the code example block and the output? Right now there is no visual indication that one is code and the other one is output, which looks strange to me.

@ararslan
Copy link
Member

ararslan commented Nov 9, 2016

Just as an aside, it's best to avoid using @ in commit messages, as it always pings the GitHub user with that name. Common replacements would be "example macro" or "at-example."

@nalimilan
Copy link
Member

Is there no way to check the output of @example like with doctests?

@davidanthoff
Copy link
Contributor Author

I don't think so.

@nalimilan
Copy link
Member

@MichaelHatherly?

@MichaelHatherly
Copy link
Contributor

(Sorry, been busy with non-Julia things.) There's no way to check @example output other than using a jldoctest block instead — I've not come up with any kind of satisfactory middle ground between the two.

The DataFrame display formatting is pretty messed up. I think there is actually a bug both in the DataFrame show method and in Documenter here.

Is this the character width for the unicode chars that form the table frame? If so there's not much that can be done in Documenter I don't think aside from changing the font again to something that properly supports those characters.

Also is there some way to have some text between the code example block and the output? Right now there is no visual indication that one is code and the other one is output, which looks strange to me.

Probably can do. Proposals/PRs for style changes to the output would definitely be welcomed.

@nalimilan
Copy link
Member

(Sorry, been busy with non-Julia things.) There's no way to check @example output other than using a jldoctest block instead — I've not come up with any kind of satisfactory middle ground between the two.

OK, though that's kind of unfortunate. Any plans to change this in the future? I don't like encouraging people to write untested examples...

@nalimilan nalimilan merged commit 4c47ed3 into JuliaData:master Nov 12, 2016
@nalimilan
Copy link
Member

Thanks @davidanthoff!

@davidanthoff davidanthoff deleted the query-doc branch November 13, 2016 00:33
@davidanthoff
Copy link
Contributor Author

Hm, when I look at this here now it doesn't show any of the example output...

@MichaelHatherly
Copy link
Contributor

I don't like encouraging people to write untested examples...

@example will still print a warning when any errors are encountered, which can be turned into an error by calling makedocs with the strict = true keyword. Not quite as thorough as a unit test of course, kind of more like a Jupyter notebook cell really.

Hm, when I look at this here now it doesn't show any of the example output...

Package dependencies need fixing, DataArrays isn't being installed: https://travis-ci.org/JuliaStats/DataFrames.jl/jobs/175315815#L277

@nalimilan
Copy link
Member

Why not make the strict=true behavior the default? I don't see the use case for building docs with errors in them.

@MichaelHatherly
Copy link
Contributor

Why not make the strict=true behavior the default?

That would be quite a bad breaking change. strict=true will error out when anything is slightly wrong, such as not including all docstrings in the generated docs — for example, building DataFrames docs lists quite a large number that haven't been included in @docs blocks.

I don't see the use case for building docs with errors in them.

I agree with that, but it does need a gradual deprecation period to not cause too much trouble for all the packages that are currently ignoring those warnings. Since Documenter is currently going through the ssh-key deprecation I'd also prefer to only do one major deprecation at a time.

@davidanthoff
Copy link
Contributor Author

For what its worth, I think the doctests should actually run as part of the normal test suite, i.e. one should be able to add a line like Documenter.test("DataFrames") to the runtest.jl file in DataFrames, so that the doctests actually run if someone does Pkg.test("DataFrames"). See JuliaDocs/Documenter.jl#198.

@davidanthoff
Copy link
Contributor Author

For the failure itself: Query currently targets the tagged versions of DataFrames. So I guess this PR here really should go into a release branch for the 0.8.x series of DataFrames... Is there some way to publish new doc versions as stable before the current DataFrames master gets tagged?

@nalimilan
Copy link
Member

Just make a PR against release-0.8, and I'll add a tag so that it appears in the online docs.

nalimilan pushed a commit that referenced this pull request Jul 8, 2017
nalimilan pushed a commit that referenced this pull request Jul 8, 2017
rofinn pushed a commit that referenced this pull request Aug 17, 2017
nalimilan pushed a commit that referenced this pull request Aug 25, 2017
quinnj pushed a commit that referenced this pull request Sep 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants