
POC: Vaex connector #6041

Closed · wants to merge 2 commits

Conversation

maartenbreddels

Vaex is a (lazy) out-of-core DataFrame library for Python that is used to visualize and explore big tabular data at ~ a billion rows per second (on a decent computer/laptop). The visualization part of vaex is similar to datashader (see #4492), but vaex is more general.

Vaex focuses strongly on binned statistics on N-d grids and, instead of groupby, uses binby, which can be used for instance to create 1-d histograms:

x_counts = ds.count(binby=ds.x, limits=[-10, 10], shape=64)

Or a 2d array with means of a column:

z_mean_map = ds.mean(ds.z, binby=[ds.x, ds.y], limits=[[-10, 10], [-20, 20]], shape=(64, 128))
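
For context, here is a minimal self-contained sketch of these binned statistics using vaex's built-in example dataset (which happens to have x, y and z columns); this is just an illustrative stand-in, not the taxi data used below:

import vaex

# vaex's bundled demo dataset; any vaex DataFrame with numeric columns works the same way
ds = vaex.example()

# 1-d histogram: counts of rows binned by x into 64 bins over [-10, 10]
x_counts = ds.count(binby=ds.x, limits=[-10, 10], shape=64)

# 2-d grid of mean(z), binned by x and y
z_mean_map = ds.mean(ds.z, binby=[ds.x, ds.y],
                     limits=[[-10, 10], [-20, 20]], shape=(64, 128))

print(x_counts.shape)    # (64,)
print(z_mean_map.shape)  # (64, 128)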

I thought it would be interesting to see if I could integrate this in superset, hence this PR, which is only a proof of concept.

I managed to get some visualizations up using the New York Taxi dataset: https://docs.vaex.io/en/latest/datasets.html which contains over 1 billion rows (although for this test I only used the 2015 data, which contains ~150 million rows).

I got the table view working:
[screenshot: table view]

Pie charts:
[screenshot: pie chart]

And time series:
[screenshot: time series]

And I think the most beautiful one is the heatmap:
[screenshot: heatmap]

Note that producing the data for these visualizations takes just a fraction of a second for these ~150 million rows; ~1 billion rows per second is about the expected performance (per computer).

I just put this out here to gauge interest in this, and as an additional example to #3492.

@maartenbreddels maartenbreddels changed the title POC: Vaex support POC: Vaex connector Oct 5, 2018
@mistercrunch
Member

mistercrunch commented Oct 7, 2018

For reference, someone wrote a pandas connector (#3492) in the past that we never merged. The main reason it wasn't merged is that it was a fair amount of code to manage coming from a non-committer, while the connector interface wasn't super well-defined and "settled" at that point. Evolving the interface would mean carrying the pandas connector along for the ride.

There's also the problem of where to persist the dataframe. Since our web servers are stateless, the pandas dataframe needs to be brought into memory from the network prior to performing aggregations / filters. With something like Arrow that becomes somewhat reasonable, but it feels like there should be a dedicated service (one that resembles a database quite a bit) loading/caching/computing on those files.

@maartenbreddels
Author

Hi Maxime,

Evolving the interface would mean carrying the pandas connector along for the ride.

would you say it is now more stable, or fully settled?

Also the problem of where to persist the dataframe.

Vaex mostly uses memory-mapped HDF5 files and relies on the OS cache, meaning there is practically zero cost if a process gets restarted (apart from Python startup/import cost). It doesn't matter if you open a 100 MB or a 2 TB file; opening it is practically free.
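
To illustrate (the file path, column name and bounds here are hypothetical), opening a memory-mapped HDF5 file with vaex is essentially instant, and computations stream through the memory map:

import vaex

# hypothetical path; vaex.open() memory-maps HDF5 files instead of reading them up front
ds = vaex.open('/data/nyc_taxi_2015.hdf5')

# no data is loaded until a computation touches it; repeated runs mostly hit the OS page cache
print(len(ds))
x_counts = ds.count(binby=ds.pickup_longitude, limits=[-74.05, -73.75], shape=128)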

I always planned to add Arrow support, but I never saw the same performance, so I haven't bothered yet. However, for compatibility, it would be great to have.

There is also a (stateless) vaex server for accessing remote datasets, and built on top of that a distributed part (although less mature), which is almost trivial since almost everything vaex computes is embarrassingly parallel.

I'd like to know if a connector for vaex is potentially interesting. I don't know much about Druid's performance, so I don't know how fast it could produce the data for these visualizations. If vaex were an order of magnitude faster, it could be interesting; otherwise it's probably not worth the effort. Maybe you have some statistics on this.

Also, to really make superset work well with vaex, it probably needs some custom viz as well, like something I demonstrate here:
https://youtu.be/bP-JBbjwLM8?t=1125 (the 150 million taxi dataset)
https://youtu.be/bP-JBbjwLM8?t=1741 (1 billion stars)
This is similar to the Heatmap viz, except it uses numerical data with limits/bounds and adds the ability to zoom/pan (and select/filter).
Again, I don't know if this is potentially interesting for superset.

Regards,

Maarten

@mistercrunch
Member

Maintaining both the Druid and SQLAlchemy interfaces has been quite a burden, and the direction we're moving towards is killing the Druid connector and going through SQLAlchemy for Druid (SQL support in Druid is somewhat new, but it's the way of the future). There's a Superset improvement proposal open on this.

Is it planned for the stateless Vaex server to support SQL? SQL support for database-like compute frameworks is becoming easier with the help of Apache Calcite. Having SQL support opens lots of doors way beyond Superset, and work in that direction is probably a better investment than a custom Superset connector.

@maartenbreddels
Author

The issue with SQL is that I don't know of a way to express regular N-d binning in it, or would you know of a natural way to do this?
For instance, the line above:

ds.mean(ds.z, binby=[ds.x, ds.y], limits=[[-10, 10], [-20, 20]], shape=(64, 128))

I would not know how to translate that to SQL.

@maartenbreddels
Author

Apart from maybe dealing with missing values, I think a query like this would work:

SELECT binx, biny, COUNT(*)
FROM (
    SELECT
        CAST((pickup_longitude - {long_min}) / {long_delta} * {Nbins} AS INT) AS binx,
        CAST((pickup_latitude - {lat_min}) / {lat_delta} * {Nbins} AS INT) AS biny
    FROM trips
) AS binned
WHERE binx >= 0 AND binx < {Nbins}
  AND biny >= 0 AND biny < {Nbins}
GROUP BY binx, biny

It would be up to vaex to recognize the GROUP BY + WHERE + CAST expression triplet and translate it to a binby operation.
For aggregate queries, vaex would be really fast.
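
For comparison, a rough sketch of what that query corresponds to in vaex's own API, assuming the table is opened as a vaex dataset ds, and with long_max = long_min + long_delta and lat_max = lat_min + lat_delta:

# dense (Nbins, Nbins) grid of counts, the vaex-native form of the SQL above
counts = ds.count(
    binby=[ds.pickup_longitude, ds.pickup_latitude],
    limits=[[long_min, long_max], [lat_min, lat_max]],
    shape=(Nbins, Nbins),
)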

@kristw kristw added the enhancement:request Enhancement request submitted by anyone from the community label Feb 7, 2019
@stale
Copy link

stale bot commented Apr 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 10, 2019
@stale stale bot closed this Apr 17, 2019