POC: Vaex connector #6041
Conversation
For reference, someone wrote a …

Also, there is the problem of where to persist the dataframe. Since our web servers are stateless, the pandas dataframe needs to be brought up in memory from the network prior to performing aggregations / filters. With something like Arrow that becomes somewhat reasonable, but it feels like there should be a dedicated service (that resembles a database quite a bit) loading/caching/computing on those files.
Hi Maxime,
would you say it is more stable, or stable?
Vaex mostly uses memory-mapped hdf5 files, relying on the OS cache, meaning there is practically zero cost if a process gets restarted (apart from Python startup/import cost). It doesn't matter whether you open a 100MB or a 2TB file; it is practically for free. I always planned to add Arrow support, but I never saw the same performance, so I didn't bother yet. However, for compatibility, it would be great to have. There is also a (stateless) vaex server for accessing remote datasets, and built on top of that a distributed part (although less mature), which is almost trivial since almost everything vaex computes is embarrassingly parallel.

I'd like to know if a connector for vaex is potentially interesting. I don't know much about Druid's performance, so I don't know how fast it could produce the data for the visualizations I made. If vaex would be an order of magnitude faster, it could be interesting; otherwise it is probably not worth the effort. Maybe you have some statistics on this. Also, to really make Superset work well with vaex, it probably needs some custom viz as well, like something I demonstrate here:

Regards,
Maarten
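To illustrate the memory-mapping point, here is a minimal sketch (the file path and column name are placeholders, not taken from this PR): opening the file only maps it, and the cost is paid when an aggregation actually runs.

```python
import vaex

# Opening memory-maps the hdf5 file; no data is read into RAM up front,
# so this is effectively free even for a very large file (placeholder path).
df = vaex.open('yellow_taxi_2015.hdf5')

# Work happens only when an aggregation is evaluated; after a process
# restart the OS page cache still holds the hot pages, so this stays fast.
n_rows = len(df)
fare_min, fare_max = df.min(df.fare_amount), df.max(df.fare_amount)
```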
Maintaining both the Druid and SQLAlchemy interfaces has been quite a burden, and the direction we're moving towards is killing the Druid connector and going through SQLAlchemy for Druid (SQL support in Druid is somewhat new, but it is the way of the future). There's a Superset improvement proposal open on this. Is it planned for the stateless vaex server to support SQL? SQL support for database-like compute frameworks is becoming easier with the help of Apache Calcite. Having SQL support opens lots of doors way beyond Superset, and work in that direction is probably a better investment than a custom Superset connector.
The issue with SQL is that there is no way I know of to express regular N-d binning in it; would you know of a natural way to do this? I would not know how to translate it to SQL.
Apart from maybe dealing with missing values, I think a query like this would work:

```sql
SELECT
    CAST((pickup_longitude - {long_min}) / {long_delta} * {Nbins} AS int) AS binx,
    CAST((pickup_latitude - {lat_min}) / {lat_delta} * {Nbins} AS int) AS biny,
    COUNT(*)
FROM trips
WHERE binx >= 0 AND binx < {Nbins}
  AND biny >= 0 AND biny < {Nbins}
GROUP BY binx, biny
```

It would be up to vaex to recognize the …
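For comparison, a rough sketch of what the same regular 2d binning looks like as a vaex binned aggregation; the file path, column names, and bin ranges below are placeholders, not values from this PR.

```python
import vaex

df = vaex.open('trips.hdf5')  # placeholder path

# Placeholder ranges standing in for {long_min}/{long_delta} etc. above
long_min, long_max = -74.05, -73.75
lat_min, lat_max = 40.58, 40.90
Nbins = 256

# Counts on a regular Nbins x Nbins grid, computed in a single pass
counts = df.count(
    binby=[df.pickup_longitude, df.pickup_latitude],
    limits=[[long_min, long_max], [lat_min, lat_max]],
    shape=(Nbins, Nbins),
)
```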
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Vaex is a (lazy) out-of-core DataFrame library for Python that is used to visualize and explore big tabular data at ~ a billion rows per second (on a decent computer/laptop). The visualization part of vaex is similar to datashader (see #4492), but vaex is more general.
Vaex focuses strongly on binned statistics on N-d grids and, instead of groupby, uses binby, which can be used for instance to create 1d histograms:
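As a minimal sketch of what such a call looks like (file path, column name, and range are placeholders):

```python
import vaex

df = vaex.open('yellow_taxi_2015.hdf5')  # placeholder path

# 1d histogram: counts in 64 regular bins over the given range,
# returned as a plain numpy array
counts = df.count(binby=df.trip_distance, limits=[0, 50], shape=64)
```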
Or a 2d array with means of a column:
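And a sketch of the 2d case, the mean of a column on a regular grid (again with placeholder names and ranges):

```python
import vaex

df = vaex.open('yellow_taxi_2015.hdf5')  # placeholder path

# Mean fare on a 128 x 128 grid over pickup location;
# the result is a 2d numpy array with one value per bin
mean_fare = df.mean(
    df.fare_amount,
    binby=[df.pickup_longitude, df.pickup_latitude],
    limits=[[-74.05, -73.75], [40.58, 40.90]],
    shape=(128, 128),
)
```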
I thought it would be interesting to see if I could integrate this into Superset, hence this PR, which is only a proof of concept.
I managed to get some visualizations up using the New York Taxi dataset: https://docs.vaex.io/en/latest/datasets.html which contains over 1 billion rows (although for this test I only used the 2015 data, which contains ~150 million rows).
I got the table view working:

Pie charts:

And time series:

And I think the most beautiful one is the heatmap:

Note that producing the data for these viz takes just a fraction of a second for these ~150 million rows; ~1 billion rows per second is about the expected performance (per computer).
I just put this out here to gauge interest in it, and as an additional example to #3492