Push down column selections when using __dataframe__ protocol #386
Comments
Thanks for raising the issue @ivirshup. This is currently expected. VegaFusion builds a query against the source dataset that references and produces only the required columns, but the current default DataFusion implementation converts the whole input object to Arrow before evaluating the query against it.
Could you say a little more about your use case? For your
In vega/altair#3134, I was referring to an idea that @jcrist is looking at for pushing computation into an Ibis table. In this (future) scenario, Altair would use the DataFrame interchange protocol to perform column type inference, and VegaFusion would then push calculations into Ibis, where Ibis would control which columns are pulled in from the database and would perform the data transformations.
I think it would be possible to add support for using the DataFusion query engine directly against
Thanks for the quick response!
Sure! This is for anndata. A simplified explanation of this object: a large 2D array (often sparse) with labeled dimensions, aligned with a dataframe. The set of possible columns would be one of those dimensions plus other annotations, which typically leads to tens of thousands of possible columns. We really don't want to create an "actual" in-memory dataframe with all those columns, since that would require densifying the whole matrix, which is expensive and often needs more memory than we have available. Also, this object is sometimes stored on disk. We can handle fairly efficient random access to the on-disk format, but really wouldn't want to load the whole thing into memory and densify it for a plot.
I would like to be able to use VegaFusion's computation engine(s).
This is what I'm looking for right now. Maybe something more complex would be nice, but I think there would be a ton of value from just column selection.
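To make the use case above concrete, here is a minimal sketch of an object that knows about many columns but densifies only the ones that are requested. `LazySparseTable` and the `gene_*` column names are hypothetical and are not anndata's API; the method names simply mirror `column_names`, `get_column_by_name`, and `select_columns_by_name` from the dataframe interchange protocol.

```python
from typing import Dict, List


class LazySparseTable:
    """Hypothetical stand-in for a large labeled sparse matrix (not anndata).

    Values are stored per column as {row_index: value}; missing entries are 0.
    """

    def __init__(self, n_rows: int, sparse_columns: Dict[str, Dict[int, float]]):
        self.n_rows = n_rows
        self._sparse = sparse_columns
        self.densified: List[str] = []  # log of columns that were materialized

    def column_names(self) -> List[str]:
        # Listing names is cheap and never touches the values.
        return list(self._sparse)

    def get_column_by_name(self, name: str) -> List[float]:
        # Densify a single column on demand and record the access.
        self.densified.append(name)
        stored = self._sparse[name]
        return [stored.get(i, 0.0) for i in range(self.n_rows)]

    def select_columns_by_name(self, names: List[str]) -> "LazySparseTable":
        # Return a narrower view without materializing anything.
        view = LazySparseTable(self.n_rows, {n: self._sparse[n] for n in names})
        view.densified = self.densified  # share the access log with the parent
        return view


# 10,000 candidate columns, but the plot only needs two of them.
table = LazySparseTable(
    n_rows=5,
    sparse_columns={f"gene_{i}": {i % 5: float(i)} for i in range(10_000)},
)
subset = table.select_columns_by_name(["gene_3", "gene_7"])
columns = {name: subset.get_column_by_name(name) for name in subset.column_names()}
print(table.densified)
```

If a consumer narrows the object with a column selection before reading any data, only the two requested columns are ever densified, which is the behaviour the request is after.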
Thanks for the additional context. anndata sounds really useful, and the scenario you describe in the original issue does sound like a good fit.
To make this work, I think what we'd need to do is write a custom DataFusion DataSource (see the example in https://github.com/apache/arrow-datafusion/blob/main/datafusion-examples/examples/custom_datasource.rs). This DataSource implementation would be in Rust, but would use PyO3 to wrap a Python object that implements the
We'd then change this chunk of vegafusion's Python logic to not eagerly convert
A few more design notes:
In
Then in the vegafusion-python-embed's
When found, create a
We'll need to add a pyo3 feature flag to the
This approach should also improve the performance of working with pandas, and will hopefully close the performance gap between the DataFusion and DuckDB data connections.
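The idea behind the design notes above can be sketched in a few lines of Python. This is only an illustration of projection pushdown, loosely modeled on the optional projection that a query engine hands to a data source when it scans it. `ColumnSource`, `plan_projection`, and `mean_by_group` are hypothetical names, not VegaFusion or DataFusion APIs, and the real implementation described above would live in Rust behind PyO3.

```python
from typing import Dict, List, Optional, Sequence


class ColumnSource:
    """Hypothetical column store; not a VegaFusion or DataFusion class."""

    def __init__(self, data: Dict[str, list]):
        self._data = data
        self.loaded: List[str] = []  # log of columns actually read

    def schema(self) -> List[str]:
        return list(self._data)

    def scan(self, projection: Optional[Sequence[int]] = None) -> Dict[str, list]:
        # projection is a list of column indices; None means "every column".
        names = self.schema()
        wanted = names if projection is None else [names[i] for i in projection]
        result = {}
        for name in wanted:
            self.loaded.append(name)
            result[name] = list(self._data[name])
        return result


def plan_projection(schema: List[str], required: Sequence[str]) -> List[int]:
    # Map the column names a query references to indices in the source schema.
    return [schema.index(name) for name in required]


def mean_by_group(batch: Dict[str, list], key: str, value: str) -> Dict[str, float]:
    # Tiny stand-in for the query itself: average `value` within each `key`.
    groups: Dict[str, List[float]] = {}
    for k, v in zip(batch[key], batch[value]):
        groups.setdefault(k, []).append(v)
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}


source = ColumnSource({
    "Name": ["a", "b", "c"],
    "Origin": ["USA", "Europe", "USA"],
    "Miles_per_Gallon": [18.0, 26.0, 15.0],
    "Horsepower": [130, 95, 165],
})
projection = plan_projection(source.schema(), ["Origin", "Miles_per_Gallon"])
batch = source.scan(projection)
result = mean_by_group(batch, "Origin", "Miles_per_Gallon")
print(source.loaded)
print(result)
```

Calling `source.scan()` with no projection would read all four columns. That corresponds to the eager conversion the thread wants to avoid.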
This is available to try out in version
Hey @jonmmease, thanks for implementing this! I'm just getting a chance to try this now. Using the code above, I am still seeing every column get accessed. Is this expected, and should I be trying something different? It does look like your example from #438 has the desired behaviour, though:
Demo
import pandas as pd
import vega_datasets
import altair as alt
# import vegafusion
# vegafusion.enable()
alt.data_transformers.enable("vegafusion")
from pandas.core.interchange.dataframe import PandasDataFrameXchg
class NoisyDfInterface(pd.core.interchange.dataframe.PandasDataFrameXchg):
    def __dataframe__(self, allow_copy: bool = True):
        return NoisyDfInterface(self._df, allow_copy=allow_copy)
    def get_column_by_name(self, name):
        print(f"get_column_by_name('{name}')")
        return super().get_column_by_name(name)
cars = vega_datasets.data.cars()
chart = (
    alt.Chart(NoisyDfInterface(cars))
    .mark_rect()
    .encode(
        x=alt.X("Origin"),
        y=alt.Y("Miles_per_Gallon:Q", bin=True),
        color="count()",
    )
)
chart
movies = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/movies.json")
movies = pd.concat([movies]*3200, axis=0).reset_index(drop=False)
print(len(movies))
chart = alt.Chart(NoisyDfInterface(movies)).mark_bar().encode(
    alt.X("IMDB Rating:Q", bin=True),
    y='count()',
)
chart
Is there some sort of threshold under which this doesn't kick in?
Hi @ivirshup. I think the calls to
See vega/altair#3114 for how Altair is using the
The part that VegaFusion added in #438 was to add a call to
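A logging hook on `get_column_by_name` fires for every column lookup, whether the caller only inspects metadata (as type inference would) or actually copies data. The sketch below separates the two kinds of access so the difference is visible. `LoggingFrame` and `LoggingColumn` are hypothetical classes written for this illustration, with `dtype` and `get_buffers` named after the interchange protocol's column interface. This is not Altair's or VegaFusion's code, and the two phases are an assumption about how such a pipeline could behave.

```python
from typing import Dict, List, Optional, Tuple

AccessLog = List[Tuple[str, str]]


class LoggingColumn:
    """Hypothetical column that records metadata reads and data reads separately."""

    def __init__(self, name: str, values: list, log: AccessLog):
        self._name = name
        self._values = values
        self._log = log

    @property
    def dtype(self) -> str:
        # Metadata only: no data is copied here.
        self._log.append(("dtype", self._name))
        return type(self._values[0]).__name__

    def get_buffers(self) -> list:
        # This is the expensive part: the column's data is handed over.
        self._log.append(("buffers", self._name))
        return list(self._values)


class LoggingFrame:
    def __init__(self, data: Dict[str, list], log: Optional[AccessLog] = None):
        self._data = data
        self.log: AccessLog = [] if log is None else log

    def column_names(self) -> List[str]:
        return list(self._data)

    def get_column_by_name(self, name: str) -> LoggingColumn:
        return LoggingColumn(name, self._data[name], self.log)

    def select_columns_by_name(self, names: List[str]) -> "LoggingFrame":
        # The narrowed frame shares the parent's log.
        return LoggingFrame({n: self._data[n] for n in names}, self.log)


frame = LoggingFrame({
    "Name": ["a", "b"],
    "Origin": ["USA", "Europe"],
    "Miles_per_Gallon": [18.0, 26.0],
    "Horsepower": [130, 95],
})

# Phase 1 (assumed): type inference looks at every column's dtype.
dtypes = {n: frame.get_column_by_name(n).dtype for n in frame.column_names()}

# Phase 2 (assumed): only the selected columns have their data transferred.
selected = frame.select_columns_by_name(["Origin", "Miles_per_Gallon"])
data = {n: selected.get_column_by_name(n).get_buffers() for n in selected.column_names()}

dtype_reads = [name for kind, name in frame.log if kind == "dtype"]
buffer_reads = [name for kind, name in frame.log if kind == "buffers"]
print(dtype_reads)
print(buffer_reads)
```

Under these assumptions, a print inside `get_column_by_name` alone would report all four columns even though only two are materialized, so logging at the buffer level is a more direct check of whether pushdown happened.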
Hi,
I would like to expose a __dataframe__ protocol on a large object where you would never actually want to request all columns. In the feature request "df.select_columns_by_name to subselect columns coming from __dataframe__ protocol" (altair#3134) over on the altair repo, it was suggested that vegafusion should be able to push down column selections when handling the dataframe interchange protocol. At first glance, this does not seem to be happening.
In this example, I create a subclass of the pandas dataframe interchange object that prints the column name being retrieved every time get_column_by_name is called. Making a 2D histogram of "Origin" by "Miles_per_Gallon", I would only expect to see those two columns accessed. However:
This is using the latest altair and vegafusion.
Environment info
Output of sessioninfo.show(dependencies=True, html=False)
Any idea what's up here? Is my expectation correct?