Ingest from dataframe is broken for dataframes containing objects/dynamic columns - should ingest from JSON instead #567

Open
drhar opened this issue Feb 5, 2025 · 3 comments

drhar commented Feb 5, 2025

Problem description

I have an ADX table with dynamic columns that I want to store JSON objects in (e.g. arrays and dicts, as per https://learn.microsoft.com/en-us/kusto/query/scalar-data-types/dynamic?view=microsoft-fabric). Since these structures exist as objects in pandas dataframes, I was naively using ingest_from_dataframe and didn't initially notice anything wrong, as the data passes Kusto's type validation. However, arrays with length > 1 were actually being ingested as an array containing a single string of comma-separated, single-quoted items.

Having looked under the covers, it's clear that all this library does when ingesting from a dataframe is dump the dataframe to CSV and then ingest from that. Dumping to CSV ruins the JSON formatting in pandas (strings end up single-quoted). I feel the library should instead dump to JSON (ensuring the date format is compatible with ADX timestamps via df.to_json(filename, orient="records", date_format="iso")) and ingest from that file.
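
To illustrate, here's a minimal sketch with a hypothetical single-column frame:

    import pandas as pd

    df = pd.DataFrame({"Owners": [["a", "b"], ["c"]]})

    # The CSV path stringifies each list cell with Python's repr(),
    # producing single-quoted items rather than valid JSON:
    print(df.to_csv(index=False))
    # Owners
    # "['a', 'b']"
    # ['c']

    # to_json preserves the structure as real JSON arrays:
    print(df.to_json(orient="records"))
    # [{"Owners":["a","b"]},{"Owners":["c"]}]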

I'm now doing this instead as a workaround:

Old method (broken):

        log.info("Ingesting data into Azure Data Cluster")
        ingestion_properties = IngestionProperties(
            database=DATABASE,
            table=TABLE,
            data_format=DataFormat.JSON,
        )
        rsp = ingest_client.ingest_from_dataframe(
            df, ingestion_properties=ingestion_properties
        )
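
(As far as I can tell from the library source, the data_format set here doesn't actually matter: ingest_from_dataframe serializes the frame with df.to_csv() and uploads it as CSV regardless, which is where the single-quoting comes from.)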

New method:

        df.to_json(filename, orient="records", date_format="iso")

        log.info("Ingesting data into Azure Data Cluster")
        ingestion_properties = IngestionProperties(
            database=DATABASE,
            table=TABLE,
            data_format=DataFormat.MULTIJSON,
        )
        rsp = ingest_client.ingest_from_file(
            filename, ingestion_properties=ingestion_properties
        )
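
A note on the format choice: with orient="records", to_json writes a single JSON array, which matches DataFormat.MULTIJSON; Kusto's plain JSON format expects one record per line instead. If DataFormat.JSON is preferred, a line-delimited dump should work too (untested sketch):

    # Hypothetical alternative: newline-delimited records for DataFormat.JSON
    df.to_json(filename, orient="records", date_format="iso", lines=True)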

Output of pip freeze

aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiosignal==1.3.2
anyio==4.8.0
appdirs==1.4.4
astroid==3.3.8
async-timeout==5.0.1
attrs==24.3.0
azure-core==1.32.0
azure-identity==1.19.0
azure-kusto-data==4.6.3
azure-kusto-ingest==4.6.3
azure-storage-blob==12.24.0
azure-storage-queue==12.12.0
bandit==1.8.2
black==24.10.0
certifi==2024.12.14
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cryptography==43.0.3
Deprecated==1.2.15
dill==0.3.9
docformatter==1.7.5
dparse==0.6.4
exceptiongroup==1.2.2
flake8==7.1.1
flake8-bandit==4.1.1
flake8-black==0.3.6
flake8-bugbear==24.12.12
flake8-docstrings==1.7.0
flake8-isort==6.1.1
frozenlist==1.5.0
h11==0.14.0
h2==4.1.0
hpack==4.0.0
httpcore==1.0.7
httpx==0.28.1
hyperframe==6.0.1
idna==3.10
ijson==3.3.0
importlib_metadata==8.5.0
isodate==0.7.2
isort==5.13.2
markdown-it-py==3.0.0
mccabe==0.7.0
mdurl==0.1.2
microsoft-kiota-abstractions==1.9.0
microsoft-kiota-authentication-azure==1.9.0
microsoft-kiota-http==1.9.0
microsoft-kiota-serialization-form==1.9.0
microsoft-kiota-serialization-json==1.9.0
microsoft-kiota-serialization-multipart==1.9.0
microsoft-kiota-serialization-text==1.9.0
msal==1.31.1
msal-extensions==1.2.0
msgraph-core==1.2.0
msgraph-sdk==1.17.0
multidict==6.1.0
mypy==1.14.1
mypy-extensions==1.0.0
numpy==2.0.2
opentelemetry-api==1.29.0
opentelemetry-sdk==1.29.0
opentelemetry-semantic-conventions==0.50b0
ossaudit==0.5.0
packaging==24.2
pandas==2.2.3
pandas-stubs==2.2.2.240807
pathspec==0.12.1
pbr==6.1.0
platformdirs==4.3.6
portalocker==2.10.1
propcache==0.2.1
pycodestyle==2.12.1
pycparser==2.22
pydocstyle==6.3.0
pyflakes==3.2.0
Pygments==2.19.1
PyJWT==2.10.1
pylint==3.3.3
python-dateutil==2.9.0.post0
python-static-checks==4.0.2
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
rich==13.9.4
six==1.17.0
sniffio==1.3.1
snowballstemmer==2.2.0
std-uritemplate==2.0.1
stevedore==5.4.0
-e git+https://git.datcon.co.uk/tools.core/subscription-migration.git@a63e428f3f941348e7a43942f41a19ed6f278db8#egg=subscription_migration
tenacity==9.0.0
texttable==1.7.0
tomli==2.2.1
tomlkit==0.13.2
types-Deprecated==1.2.15.20241117
types-docutils==0.21.0.20241128
types-openpyxl==3.1.5.20241225
types-Pygments==2.19.0.20250107
types-python-dateutil==2.9.0.20241206
types-pytz==2024.2.0.20241221
types-requests==2.32.0.20241016
types-setuptools==75.8.0.20250110
typing_extensions==4.12.2
tzdata==2024.2
untokenize==0.1.1
urllib3==2.3.0
wrapt==1.17.2
yarl==1.18.3
zipp==3.21.0

AsafMah (Collaborator) commented Feb 17, 2025

Sorry, I couldn't manage to reproduce the error.

Can you provide a sample of the data that fails, and the schema of the table? As minimal as possible, please.

drhar (Author) commented Feb 17, 2025

Sure.

For the schema, the table is:

.create table {TABLE} (Timestamp:datetime, AppId:string, ObjectId:string, OwnerNames:dynamic, OwnerEmails:dynamic)

Then as dummy data (imports included for completeness):

from datetime import datetime

import pandas

update_date = datetime.today()

records = [
    {
        "Timestamp": str(update_date),
        "AppId": "g1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "5629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": ["Me", "You", "Them"],
        "OwnerEmails": ["[email protected]", "[email protected]", "[email protected]"]
    },
    {
        "Timestamp": str(update_date),
        "AppId": "p1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "6629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": ["Me"],
        "OwnerEmails": ["[email protected]"]
    },
    {
        "Timestamp": str(update_date),
        "AppId": "v1efsd95-b7ab-4624-a6b3-0dee20243b50",
        "ObjectId": "7629a029-82ec-82e5-97af-a720dd8607f4",
        "OwnerNames": [],
        "OwnerEmails": []
    }
]
df = pandas.DataFrame.from_records(records)
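
For what it's worth, dumping just one of the dynamic columns of that frame to CSV shows the single-quoting directly:

    # Each list cell is stringified via repr(), so the "array" Kusto
    # receives is one single-quoted string, not JSON:
    print(df[["OwnerNames"]].to_csv(index=False))
    # OwnerNames
    # "['Me', 'You', 'Them']"
    # ['Me']
    # []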

AsafMah (Collaborator) commented Feb 18, 2025

@drhar

Thank you! I have managed to recreate the issue with this data.

I think your assessment was right, and the correct way to handle this is to serialize as JSON.

I'll get a fix out soon.
