
Don't error when hashing data that can't be serialised to JSON #3161

Closed

Conversation


@btabram btabram commented Aug 22, 2023

This is a suggestion that would help a particular issue I've got but it might be a bit niche. Thanks for looking!

Converting non-serialisable data to a string instead of erroring seems fine when it's just for generating a hash. This allows `to_dict` to be called when the data contains values that `json.dumps` can't serialise.

In my case I've got datetime.time instances in the dataframe I'm plotting a chart from and am using Pydantic (which can convert datetime.time to JSON) to serialise the chart dictionary.

Converting non-serialisable data to a string seems fine when we're
just generating a hash. This allows `to_dict` to be called when
there's `datetime.datetime` etc. in the data
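The idea in the commit message can be sketched with plain stdlib calls (a hypothetical helper for illustration, not Altair's actual code): `json.dumps(..., default=str)` falls back to `str()` for values the encoder can't handle, which is acceptable when the JSON output only feeds a hash.

```python
import hashlib
import json
from datetime import time


def hash_data_for_dedup(data: dict) -> str:
    # default=str converts otherwise-unserialisable values (datetime.time,
    # datetime.date, ...) to strings instead of raising TypeError; that is
    # fine here because the JSON is only used to compute a dataset hash,
    # never to render the chart.
    serialised = json.dumps(data, default=str)
    return hashlib.md5(serialised.encode()).hexdigest()


payload = {"t": [time(second=i) for i in range(3)], "value": [0, 1, 2]}
digest = hash_data_for_dedup(payload)
```

Without `default=str`, `json.dumps(payload)` raises `TypeError` on the `datetime.time` values; with it, the hash is computed deterministically.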
@mattijn
Contributor

mattijn commented Aug 23, 2023

Thank you for the suggestion @btabram! You've already provided a textual description of how this could benefit you and others, which is nice, but it would be great if you could also provide a minimal working example in code. That would make this a bit easier to review. Would it be possible to provide one? Thanks again!

@btabram
Author

btabram commented Aug 24, 2023

Thanks for the quick response @mattijn! Here's some code that works with my change but fails without it:

import pandas as pd
from altair import Chart, TopLevelMixin
from datetime import time
from pydantic import BaseModel


class Result(BaseModel):
    chart: TopLevelMixin

    class Config:
        arbitrary_types_allowed = True
        extra = "forbid"
        json_encoders = {
            TopLevelMixin: TopLevelMixin.to_dict,
        }


# Create a chart
df = pd.DataFrame({"t": [time(second=i) for i in range(10)], "value": range(10)})
chart = Chart(df).mark_point().encode(x="t", y="value")

# Now try to serialise the result so the vega chart can be rendered elsewhere
result = Result(chart=chart)
print(result.json())

In my real workflow the data is coming from an Excel spreadsheet (and pandas.read_excel) and it would be nice to be able to create charts directly from the resulting dataframe, without having to convert any data that doesn't trivially serialise 🙂

@mattijn
Contributor

mattijn commented Aug 24, 2023

Thanks for adding context.

In your case I think this should work temporarily:

import altair as alt

with alt.data_transformers.enable(consolidate_datasets=False):
    print(result.json(encoder=str))

Related is this issue: #2199.

And the interesting thing is, if we use other, private, functions within Altair this hash can be computed correctly:

import pandas as pd
import datetime
from altair.utils.data import _data_to_json_string, _compute_data_hash

data_date = pd.DataFrame({
    "time": [datetime.date(2020, 5, 1), datetime.date(2020, 5, 3)],
    "value": [3, 5],
})

data_time = pd.DataFrame({
    "time": [datetime.time(second=1), datetime.time(second=2)],
    "value": [3, 5],
})

json_data_date = _data_to_json_string(data_date)
json_data_time = _data_to_json_string(data_time)

_compute_data_hash(json_data_date), _compute_data_hash(json_data_time)
# ('156c826cd7fbaa9f799b300221f8bd31', 'a45ae23ead672db1c8c7aa63f72ac58c')

Regarding your proposed solution: I think it would be better to rework the `_dataset_name` function used within `_consolidate_data` (see https://github.com/altair-viz/altair/blob/main/altair/vegalite/v5/api.py#L55-L56) by using the approach taken in the `to_json()` function of the data model, defined here: https://github.com/altair-viz/altair/blob/main/altair/utils/data.py#L192-L193. It would be good to get verification on this from another maintainer.

A bit more debugging. Using the conventional sanitization of dataframes this works:

from altair.utils.core import sanitize_dataframe
sanitize_dataframe(data_date)

But through our new dataframe interchange protocol this errors:

from altair.utils.core import sanitize_arrow_table
import pyarrow.interchange as pi

pa_table = sanitize_arrow_table(pi.from_dataframe(data_date))
# NotImplementedError: Non-string object dtypes are not supported yet

Basically the dataframe cannot be changed into an arrow table.
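For reference, one way a user could sidestep that interchange error today (a plain-pandas sketch, not a fix inside Altair): cast the object-dtype column to a proper datetime64 column before the frame reaches the interchange protocol.

```python
import datetime
import pandas as pd

data_date = pd.DataFrame({
    "time": [datetime.date(2020, 5, 1), datetime.date(2020, 5, 3)],
    "value": [3, 5],
})

# datetime.date values live in an object-dtype column, which the arrow
# interchange path rejects; casting to datetime64 avoids that entirely.
data_date["time"] = pd.to_datetime(data_date["time"])
```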

Another caveat: this can only work if the type is specified in the encoding, since automatic inference of dtypes is now done using the dataframe interchange protocol when it is available (if I remember correctly, @jonmmease; see https://github.com/altair-viz/altair/blob/main/altair/utils/core.py#L589)

@jonmmease
Contributor

> Another caveat is that this also means that this only can work if the type is specified in the encoding, since the automatic inference of dtypes is now done using the dataframe interchange protocol if it is available (if I remember correctly @jonmmease, see https://github.com/altair-viz/altair/blob/main/altair/utils/core.py#L589)

That's correct, we use the DataFrame interchange protocol approach unless pyarrow is not available or pandas is too old.

I'm not clear on how we would support datetime.time values, since Vega doesn't have a time type. @btabram are you converting these times into strings for display? Or to timestamps somehow?
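One way to answer the question above (a sketch, assuming pandas; the anchor date is arbitrary and made up for illustration): combine each `datetime.time` with a fixed dummy date so the column becomes a real timestamp, which Vega-Lite can treat as temporal.

```python
import pandas as pd
from datetime import date, datetime, time

df = pd.DataFrame({"t": [time(second=i) for i in range(3)], "value": range(3)})

# datetime.combine attaches an arbitrary anchor date to each time-of-day
# value, yielding full datetimes with a proper datetime64 dtype.
anchor = date(1900, 1, 1)
df["t"] = pd.to_datetime(df["t"].map(lambda t: datetime.combine(anchor, t)))
```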

@btabram
Author

btabram commented Sep 13, 2023

Apologies for taking so long to reply. I can't get that workaround working for me, @mattijn, but it's okay because I can just convert or drop the data that doesn't serialise.

@jonmmease I'm using a javascript library to render the chart and it looked fine, but when I started investigating I realised it was just plotting the values as strings, and I was lucky that my points happened to be evenly spaced. So it's not a very useful chart, and this request isn't so useful either. Sorry.

However, I do still have a use case where I'd appreciate this change, though I accept it's not a very important one, so feel free to just close this. I'm often reading a DataFrame from Excel and then passing it straight to Altair, and sometimes there's a datetime.time column which I'm not even plotting, but its presence causes to_dict to fail at the hashing stage. I could (and probably should) pass only the DataFrame columns that I'm actually plotting, but I'd appreciate the ability to pass in an arbitrary DataFrame and have to_dict succeed.
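The column-subsetting workaround mentioned above looks like this in practice (a sketch; the column names are made up for illustration):

```python
import pandas as pd
from datetime import time

df = pd.DataFrame({
    "t": [time(second=i) for i in range(3)],  # unplotted, not JSON-serialisable
    "x": range(3),
    "y": range(3),
})

# Pass only the columns that are actually encoded, so the datetime.time
# column never reaches to_dict's hashing step.
plot_df = df[["x", "y"]]
```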

Thanks for taking the time to consider this 🙂

@mattijn
Contributor

mattijn commented Mar 30, 2024

The intended behavior of OP might be fixed by this code change: #3377 (comment).

@btabram, could you have a look if Altair version 5.3 provides the behavior you are looking for?

@btabram
Author

btabram commented Apr 2, 2024

Hi @mattijn yeah 5.3 fixes the issue that I was having, thanks. Looks like #3377 includes the exact change I was proposing here so I'll close this now 👍

@btabram btabram closed this Apr 2, 2024
@btabram btabram deleted the allow-non-serialisable-values-in-to-dict branch April 2, 2024 11:45
@btabram btabram restored the allow-non-serialisable-values-in-to-dict branch April 5, 2024 22:19