Support dataframe columns containing lists #380

Closed
joelostblom opened this issue Aug 31, 2023 · 5 comments · Fixed by #393
Labels
bug Something isn't working

Comments

@joelostblom

This chart renders fine using the default renderer in Altair:

import altair as alt
import pandas as pd

alt.Chart(
    pd.DataFrame({
        'month': ['January', 'January'],
        'value': [3.0, 1.0],
        'label': [['Benzene', '(μg/m³)'], ['Carbon monoxide', '(mg/m³)']]
    })
).mark_point().encode(
    x='value',
    y='month'
).facet(
    'label'
)


Trying the same chart after enabling VegaFusion with alt.data_transformers.enable('vegafusion') yields this error:

ValueError: DataFusion error: External error: Internal error: Unsupported data type in hasher: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }). This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
    Context[0]: Failed to get node value

Lists are useful for splitting strings over multiple lines, and there might be a workaround here using labelExpr. However, this is for a class taught to first-year students and I want to avoid making it too complicated, so I might just stick with single-line string labels for now.
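
The single-line fallback mentioned above can be sketched as follows. This is a minimal sketch, assuming the label parts should simply be joined with a space; `flatten_label` is a hypothetical helper name, not part of Altair or VegaFusion, and plain dictionaries stand in for the DataFrame rows so the snippet runs without extra dependencies.

```python
def flatten_label(value, sep=" "):
    """Join a list/tuple of label parts into one string; pass other values through."""
    if isinstance(value, (list, tuple)):
        return sep.join(str(part) for part in value)
    return value


# Same data as in the chart above, written as plain rows
rows = [
    {"month": "January", "value": 3.0, "label": ["Benzene", "(μg/m³)"]},
    {"month": "January", "value": 1.0, "label": ["Carbon monoxide", "(mg/m³)"]},
]

# Replace each list-valued label with a single-line string
flat_rows = [{**row, "label": flatten_label(row["label"])} for row in rows]

for row in flat_rows:
    print(row["label"])
```

With pandas, the same helper could be applied as `df['label'] = df['label'].map(flatten_label)` before building the chart, so the facet column holds plain strings rather than lists.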

@jonmmease
Collaborator

jonmmease commented Aug 31, 2023

Thanks for the report!

My initial guess (based on the error message) is that the issue here is that the chart facets by the list column, which results in a GROUP BY query in DataFusion where the group-by column is a list, and it looks like DataFusion might not support that yet. I'll look at either fixing it in DataFusion or working around it by keeping the logic on the client in situations like this.
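
As a loose analogy only (this is plain Python, not DataFusion's code), a hash-based group-by needs a hashable key for every row, which is why a list-typed grouping column can fail where a string column works. `group_by` below is a hypothetical helper written purely for illustration.

```python
from collections import defaultdict


def group_by(rows, key):
    """Group rows into buckets using a hash table keyed on row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)  # the key is hashed here
    return dict(groups)


rows = [
    {"value": 3.0, "label": ["Benzene", "(μg/m³)"]},
    {"value": 1.0, "label": ["Carbon monoxide", "(mg/m³)"]},
]

try:
    group_by(rows, "label")
    list_key_error = None
except TypeError as exc:
    # Python lists are unhashable, so they cannot serve as hash-table keys
    list_key_error = str(exc)

# Converting each list to a tuple yields a hashable key, so grouping succeeds
tuple_rows = [{**row, "label": tuple(row["label"])} for row in rows]
groups = group_by(tuple_rows, "label")
```

The analogy is limited: in DataFusion the missing piece was hashing support for the Arrow `List` type, as the error message indicates, rather than a language-level restriction.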

@jonmmease added the bug label Aug 31, 2023
@jonmmease
Collaborator

Tracking upstream in apache/datafusion#7473

@jonmmease
Collaborator

Unfortunately, this isn't straightforward to work around. VegaFusion's planning logic, which decides which transforms should be evaluated on the server, doesn't currently have access to the schema of the input tables, so we can't just check the column types and keep the aggregate transform on the client if there's a list column.
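
A column-type check of the kind described here could, in principle, be done on the user side before enabling VegaFusion. The sketch below is an assumption about what such a check might look like, not VegaFusion's planner logic; `list_valued_columns` is a hypothetical helper, and rows are plain dictionaries so the snippet has no dependencies.

```python
def list_valued_columns(rows):
    """Return the sorted names of columns in which any value is a list or tuple."""
    found = set()
    for row in rows:
        for name, value in row.items():
            if isinstance(value, (list, tuple)):
                found.add(name)
    return sorted(found)


rows = [
    {"month": "January", "value": 3.0, "label": ["Benzene", "(μg/m³)"]},
    {"month": "January", "value": 1.0, "label": ["Carbon monoxide", "(mg/m³)"]},
]

columns = list_valued_columns(rows)
if columns:
    print("List-valued columns found:", columns)
```

If the result is non-empty, one option is to keep the default data transformer for that chart, or to flatten the affected columns first.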

@jonmmease
Collaborator

Upstream support in DataFusion was merged in apache/datafusion#7616. So when DataFusion 32 is released and VegaFusion updates to it, this bug will be fixed.

I'll keep this open until VegaFusion has the update.

@joelostblom
Author

That's great, thank you Jon!
