
POC: Vaex connector #6041

Closed · wants to merge 2 commits

Conversation

maartenbreddels

Vaex is a (lazy) out-of-core DataFrame library for Python that is used to visualize and explore big tabular data at ~ a billion rows per second (on a decent computer/laptop). The visualization part of vaex is similar to datashader (see #4492), but vaex is more general.

Vaex focuses strongly on binned statistics on N-d grids and, instead of groupby, uses binby, which can be used for instance to create 1-d histograms:

x_counts = ds.count(binby=ds.x, limits=[-10, 10], shape=64)

Or a 2d array with means of a column:

z_mean_map = ds.mean(ds.z, binby=[ds.x, ds.y], limits=[[-10, 10], [-20, 20]], shape=(64, 128))
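
For context, here is a minimal self-contained sketch of these binned statistics using vaex's built-in example dataset (which happens to have x, y and z columns); this is just an illustrative stand-in, not the taxi data used below:

import vaex

# vaex's bundled demo dataset; any vaex DataFrame with numeric columns works the same way
ds = vaex.example()

# 1-d histogram: counts of rows binned by x into 64 bins over [-10, 10]
x_counts = ds.count(binby=ds.x, limits=[-10, 10], shape=64)

# 2-d grid of mean(z), binned by x and y
z_mean_map = ds.mean(ds.z, binby=[ds.x, ds.y],
                     limits=[[-10, 10], [-20, 20]], shape=(64, 128))

print(x_counts.shape)    # (64,)
print(z_mean_map.shape)  # (64, 128)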

I thought it would be interesting to see if I could integrate this in superset, hence this PR, which is only a proof of concept.

I managed to get some visualizations up using the New York Taxi dataset: https://docs.vaex.io/en/latest/datasets.html which contains over 1 billion rows (although for this test I only used the 2015 data, which contains ~150 million rows).

I got the table view working:
[screenshot: table view]

Pie charts:
[screenshot: pie chart]

And time series:
[screenshot: time series]

And I think the most beautiful one is the heatmap:
[screenshot: heatmap]

Note that producing the data for these visualizations takes just a fraction of a second for these ~150 million rows; ~1 billion rows per second is about the expected performance (per computer).

I just put this out here to gauge interest in this, and as an additional example to #3492.

@maartenbreddels maartenbreddels changed the title POC: Vaex support POC: Vaex connector Oct 5, 2018
@mistercrunch
Member

mistercrunch commented Oct 7, 2018

For reference, someone wrote a pandas connector (#3492) in the past that we never merged. The main reason it wasn't merged is that it was a fair amount of code to manage coming from a non-committer, while the connector interface wasn't super well-defined and "settled" at that point. Evolving the interface would mean carrying the pandas connector along for the ride.

There's also the problem of where to persist the dataframe. Since our web servers are stateless, the pandas dataframe needs to be brought into memory from the network prior to performing aggregations / filters. With something like Arrow that becomes somewhat reasonable, but it feels like there should be a dedicated service (one that resembles a database quite a bit) loading/caching/computing on those files.

@maartenbreddels
Author

Hi Maxime,

Evolving the interface would mean carrying the pandas connector along for the ride.

would you say it is now more stable, or fully settled?

Also the problem of where to persist the dataframe.

Vaex mostly uses memory-mapped HDF5 files and relies on the OS cache, meaning there is practically zero cost if a process gets restarted (apart from Python startup/import cost). It doesn't matter if you open a 100 MB or a 2 TB file; opening it is practically free.
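
To illustrate (the file path, column name and bounds here are hypothetical), opening a memory-mapped HDF5 file with vaex is essentially instant, and computations stream through the memory map:

import vaex

# hypothetical path; vaex.open() memory-maps HDF5 files instead of reading them up front
ds = vaex.open('/data/nyc_taxi_2015.hdf5')

# no data is loaded until a computation touches it; repeated runs mostly hit the OS page cache
print(len(ds))
x_counts = ds.count(binby=ds.pickup_longitude, limits=[-74.05, -73.75], shape=128)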

I always planned to add Arrow support, but I never saw the same performance, so I haven't bothered yet. However, for compatibility, it would be great to have.

There is also a (stateless) vaex server for accessing remote datasets, and built on top of that a distributed part (although less mature), which is almost trivial since almost everything vaex computes is embarrassingly parallel.

I'd like to know if a connector for vaex is potentially interesting. I don't know much about Druid's performance, so I don't know how fast it could produce the data for these visualizations. If vaex were an order of magnitude faster, it could be interesting; otherwise it's probably not worth the effort. Maybe you have some statistics on this.

Also, to really make superset work well with vaex, it probably needs some custom viz as well, like something I demonstrate here:
https://youtu.be/bP-JBbjwLM8?t=1125 (the 150 million taxi dataset)
https://youtu.be/bP-JBbjwLM8?t=1741 (1 billion stars)
This is similar to the Heatmap viz, except it uses numerical data with limits/bounds and adds the ability to zoom/pan (and select/filter).
Again, I don't know if this is potentially interesting for superset.

Regards,

Maarten

@mistercrunch
Member

Maintaining both the Druid and SQLAlchemy interfaces has been quite a burden, and the direction we're moving towards is killing the Druid connector and going through SQLAlchemy for Druid (SQL support in Druid is somewhat new, but it's the way of the future). There's a Superset improvement proposal open on this.

Is it planned for the stateless Vaex server to support SQL? SQL support for database-like compute frameworks is becoming easier with the help of Apache Calcite. Having SQL support opens lots of doors way beyond Superset, and work in that direction is probably a better investment than a custom Superset connector.

@maartenbreddels
Author

The issue with SQL is that I don't know of a way to express regular N-d binning in it, or would you know of a natural way to do this?
For instance, the line above:

ds.mean(ds.z, binby=[ds.x, ds.y], limits=[[-10, 10], [-20, 20]], shape=(64, 128))

I would not know how to translate that to SQL.

@maartenbreddels
Author

Apart from maybe dealing with missing values, I think a query like this would work:

SELECT binx, biny, COUNT(*)
FROM (
    SELECT
        CAST((pickup_longitude - {long_min}) / {long_delta} * {Nbins} AS INT) AS binx,
        CAST((pickup_latitude - {lat_min}) / {lat_delta} * {Nbins} AS INT) AS biny
    FROM trips
) AS binned
WHERE binx >= 0 AND binx < {Nbins}
  AND biny >= 0 AND biny < {Nbins}
GROUP BY binx, biny

It would be up to vaex to recognize the GROUP BY + WHERE + CAST expression triplet and translate it to a binby operation.
For aggregate queries, vaex would be really fast.
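
For comparison, a rough sketch of what that query corresponds to in vaex's own API, assuming the table is opened as a vaex dataset ds, and with long_max = long_min + long_delta and lat_max = lat_min + lat_delta:

# dense (Nbins, Nbins) grid of counts, the vaex-native form of the SQL above
counts = ds.count(
    binby=[ds.pickup_longitude, ds.pickup_latitude],
    limits=[[long_min, long_max], [lat_min, lat_max]],
    shape=(Nbins, Nbins),
)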

@kristw kristw added the enhancement:request Enhancement request submitted by anyone from the community label Feb 7, 2019
@stale
Copy link

stale bot commented Apr 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 10, 2019
@stale stale bot closed this Apr 17, 2019