Ingest from dataframe is broken for dataframes containing objects/dynamic columns - should ingest from JSON instead #567

Open
drhar opened this issue Feb 5, 2025 · 3 comments

drhar commented Feb 5, 2025

Problem description

I have an ADX table with dynamic columns that I want to store JSON objects in (e.g. arrays and dicts, as per https://learn.microsoft.com/en-us/kusto/query/scalar-data-types/dynamic?view=microsoft-fabric). Since these structures exist as objects in pandas dataframes, I was naively using ingest_from_dataframe and didn't initially notice anything wrong, as the data passes Kusto's type validation. However, arrays with length > 1 were actually being ingested as an array containing a single string of comma-separated, single-quoted items.

Having looked under the covers, it's clear that all this library does when ingesting from a dataframe is dump the dataframe to CSV and then ingest from that. Dumping to CSV ruins the JSON formatting in pandas (strings end up single-quoted). I feel the library should instead dump to JSON (ensuring the date format is compatible with ADX timestamps via df.to_json(filename, orient="records", date_format="iso")) and ingest from that file.
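
To illustrate, here's a minimal sketch with a hypothetical single-column frame:

    import pandas as pd

    df = pd.DataFrame({"Owners": [["a", "b"], ["c"]]})

    # The CSV path stringifies each list cell with Python's repr(),
    # producing single-quoted items rather than valid JSON:
    print(df.to_csv(index=False))
    # Owners
    # "['a', 'b']"
    # ['c']

    # to_json preserves the structure as real JSON arrays:
    print(df.to_json(orient="records"))
    # [{"Owners":["a","b"]},{"Owners":["c"]}]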

I'm now doing this instead as a workaround:

Old method (broken):

        log.info("Ingesting data into Azure Data Cluster")
        ingestion_properties = IngestionProperties(
            database=DATABASE,
            table=TABLE,
            data_format=DataFormat.JSON,
        )
        rsp = ingest_client.ingest_from_dataframe(
            df, ingestion_properties=ingestion_properties
        )
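
(As far as I can tell from the library source, the data_format set here doesn't actually matter: ingest_from_dataframe serializes the frame with df.to_csv() and uploads it as CSV regardless, which is where the single-quoting comes from.)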

New method:

        df.to_json(filename, orient="records", date_format="iso")

        log.info("Ingesting data into Azure Data Cluster")
        ingestion_properties = IngestionProperties(
            database=DATABASE,
            table=TABLE,
            data_format=DataFormat.MULTIJSON,
        )
        rsp = ingest_client.ingest_from_file(
            filename, ingestion_properties=ingestion_properties
        )
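
A note on the format choice: with orient="records", to_json writes a single JSON array, which matches DataFormat.MULTIJSON; Kusto's plain JSON format expects one record per line instead. If DataFormat.JSON is preferred, a line-delimited dump should work too (untested sketch):

    # Hypothetical alternative: newline-delimited records for DataFormat.JSON
    df.to_json(filename, orient="records", date_format="iso", lines=True)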

Output of pip freeze

aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
anyio==4.8.0
appdirs==1.4.4
astroid==3.3.8
async-timeout==5.0.1
attrs==24.3.0
azure-core==1.32.0
azure-identity==1.19.0
azure-kusto-data==4.6.3
azure-kusto-ingest==4.6.3
azure-storage-blob==12.24.0
azure-storage-queue==12.12.0
bandit==1.8.2
black==24.10.0
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cryptography==43.0.3
Deprecated==1.2.15
dill==0.3.9
docformatter==1.7.5
dparse==0.6.4
exceptiongroup==1.2.2
flake8==7.1.1
flake8-bandit==4.1.1
flake8-black==0.3.6
flake8-bugbear==24.12.12
flake8-docstrings==1.7.0
flake8-isort==6.1.1
frozenlist==1.5.0
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.7
httpx==0.28.1
hyperframe==6.0.1
idna==3.10
ijson==3.3.0
importlib_metadata==8.5.0
isodate==0.7.2
isort==5.13.2
markdown-it-py==3.0.0
mccabe==0.7.0
mdurl==0.1.2
microsoft-kiota-abstractions==1.9.0
microsoft-kiota-authentication-azure==1.9.0
microsoft-kiota-http==1.9.0
microsoft-kiota-serialization-form==1.9.0
microsoft-kiota-serialization-json==1.9.0
microsoft-kiota-serialization-multipart==1.9.0
microsoft-kiota-serialization-text==1.9.0
msal==1.31.1
msal-extensions==1.2.0
msgraph-core==1.2.0
msgraph-sdk==1.17.0
multidict==6.1.0
mypy==1.14.1
mypy-extensions==1.0.0
numpy==2.0.2
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
ossaudit==0.5.0
packaging==24.2
pandas==2.2.3
pandas-stubs==2.2.2.240807
pathspec==0.12.1
pbr==6.1.0
platformdirs==4.3.6
portalocker==2.10.1
propcache==0.2.1
pycodestyle==2.12.1
pycparser==2.22
pydocstyle==6.3.0
pyflakes==3.2.0
Pygments==2.19.1
PyJWT==2.10.1
pylint==3.3.3
python-dateutil==2.9.0.post0
python-static-checks==4.0.2
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
rich==13.9.4
six==1.17.0
sniffio==1.3.1
snowballstemmer==2.2.0
std-uritemplate==2.0.1
stevedore==5.4.0
-e git+https://git.datcon.co.uk/tools.core/subscription-migration.git@a63e428f3f941348e7a43942f41a19ed6f278db8#egg=subscription_migration
tenacity==9.0.0
texttable==1.7.0
tomli==2.2.1
tomlkit==0.13.2
types-Deprecated==1.2.15.20241117
types-docutils==0.21.0.20241128
types-openpyxl==3.1.5.20241225
types-Pygments==2.19.0.20250107
types-python-dateutil==2.9.0.20241206
types-pytz==2024.2.0.20241221
types-requests==2.32.0.20241016
types-setuptools==75.8.0.20250110
typing_extensions==4.12.2
tzdata==2024.2
untokenize==0.1.1
urllib3==2.3.0
wrapt==1.17.2
yarl==1.18.3
zipp==3.21.0

AsafMah (Collaborator) commented Feb 17, 2025

Sorry, I couldn't manage to reproduce the error.

Can you provide a sample of the data that fails, and the schema of the table? As minimal as possible, please.

drhar (Author) commented Feb 17, 2025

Sure.

For the schema, the table is:

.create table {TABLE} (Timestamp:datetime, AppId:string, ObjectId:string, OwnerNames:dynamic, OwnerEmails:dynamic)

Then as dummy data (imports included for completeness):

from datetime import datetime

import pandas

update_date = datetime.today()

records = [
    {
        "Timestamp": str(update_date),
        "AppId": "g1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "5629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": ["Me", "You", "Them"],
        "OwnerEmails": ["[email protected]", "[email protected]", "[email protected]"]
    },
    {
        "Timestamp": str(update_date),
        "AppId": "p1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "6629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": ["Me"],
        "OwnerEmails": ["[email protected]"]
    },
    {
        "Timestamp": str(update_date),
        "AppId": "v1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "7629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": [],
        "OwnerEmails": []
    }
]
df = pandas.DataFrame.from_records(records)
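
For what it's worth, dumping just one of the dynamic columns of that frame to CSV shows the single-quoting directly:

    # Each list cell is stringified via repr(), so the "array" Kusto
    # receives is one single-quoted string, not JSON:
    print(df[["OwnerNames"]].to_csv(index=False))
    # OwnerNames
    # "['Me', 'You', 'Them']"
    # ['Me']
    # []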

AsafMah (Collaborator) commented Feb 18, 2025

@drhar

Thank you! I have managed to recreate the issue with this data.

I think your assessment was right, and the correct way to handle this is to serialize as JSON.

I'll get a fix out soon.
