Fixed bug where nested JSON inside pandas DataFrames wouldn't be ingested correctly #568

Open · wants to merge 5 commits into base: master
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### Changed
- [BREAKING] Dropped support for Python 3.7, as it has been EOL for over a year.
- [BREAKING] Aligned the Connection String Builder keywords with the rest of the SDKs.
This means that some keywords were removed, and they will no longer be parsed as part of the Connection String.
@@ -17,7 +18,12 @@ The following keywords have been removed:
- `interactive_login`
- `az_cli`

### Fixed
- Fixed ingestion of nested objects in pandas DataFrames.

## [4.6.3] - 2025-01-08

### Fixed
- Explicitly export members in `__init__.py` via `__all__`

## [4.6.2] - 2025-01-07
13 changes: 5 additions & 8 deletions azure-kusto-ingest/azure/kusto/ingest/base_ingest_client.py
@@ -1,22 +1,20 @@
import gzip
import ipaddress
import os
import tempfile
import time
import uuid
from abc import ABCMeta, abstractmethod
from copy import copy
from enum import Enum
from io import TextIOWrapper
from typing import TYPE_CHECKING, Union, IO, AnyStr, Optional, Tuple
from urllib.parse import urlparse

from azure.kusto.data.data_format import DataFormat
from azure.kusto.data.exceptions import KustoClosedError

from .descriptors import FileDescriptor, StreamDescriptor
from .ingestion_properties import IngestionProperties


if TYPE_CHECKING:
import pandas

@@ -117,12 +115,11 @@ def ingest_from_dataframe(self, df: "pandas.DataFrame", ingestion_properties: In
if not isinstance(df, DataFrame):
raise ValueError("Expected DataFrame instance, found {}".format(type(df)))

-file_name = "df_{id}_{timestamp}_{uid}.csv.gz".format(id=id(df), timestamp=int(time.time()), uid=uuid.uuid4())
+file_name = "df_{id}_{timestamp}_{uid}.json.gz".format(id=id(df), timestamp=int(time.time()), uid=uuid.uuid4())
Collaborator:
This is a breaking change: it won't work the same for users who configured a CSV mapping.

Collaborator (Author):
That's a good point, but previously we had:

        ingestion_properties.format = DataFormat.CSV

So if a user configured a different format mapping, it wouldn't work anyway.

I can roll this into the next breaking version (with a note in the release notes), or I can check for a mapping and raise an error, or raise an error if the format doesn't match.

I don't think we can avoid converting it to JSON.
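A minimal sketch of the "error instead of silently overriding" option discussed above. The guard function, its name, and the trimmed `DataFormat` enum are hypothetical stand-ins, not part of the SDK:

```python
from enum import Enum
from typing import Optional


class DataFormat(Enum):
    # Trimmed stand-in for azure.kusto.data.data_format.DataFormat
    CSV = "csv"
    JSON = "json"


def ensure_dataframe_format(requested: Optional[DataFormat]) -> DataFormat:
    """Hypothetical guard: DataFrame ingestion always serializes to JSON,
    so reject an explicitly requested non-JSON format instead of silently
    overriding it."""
    if requested is not None and requested is not DataFormat.JSON:
        raise ValueError(
            "DataFrame ingestion writes JSON; cannot honor format {!r}".format(requested.value)
        )
    return DataFormat.JSON
```

With a guard like this, `ingest_from_dataframe` could validate `ingestion_properties.format` before serializing, rather than overwriting it.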

Contributor:

Note: there's another hidden breaking change here.
CSV mapping is commonly ordinal, while JSON mapping is by name.
This means that if the DataFrame column names do not match the table column names, the mapping will fail.
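To see why the move away from CSV matters for nested values, here is a stdlib-only sketch (no pandas; the column names and values are made up): a nested value survives a JSON-lines round trip with its field name, while a CSV row is purely positional and flattens the nested value to its Python string repr.

```python
import csv
import io
import json

# A record with a nested (dynamic) value, like a DataFrame row would hold.
record = {"xtextWithNulls": "Zero", "xdynamicWithNulls": ["a@example.com", "b@example.com"]}

# CSV: ordinal, no field names; the nested list is stringified to its repr.
buf = io.StringIO()
csv.writer(buf).writerow(record.values())
csv_row = buf.getvalue().strip()

# JSON lines: fields are named and the nested value stays structured.
json_line = json.dumps(record)
round_tripped = json.loads(json_line)

print(csv_row)
print(round_tripped["xdynamicWithNulls"])
```

The CSV side carries no column names at all (hence ordinal mapping), and the dynamic column arrives as an opaque string rather than an array.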

temp_file_path = os.path.join(tempfile.gettempdir(), file_name)

-df.to_csv(temp_file_path, index=False, encoding="utf-8", header=False, compression="gzip")
-ingestion_properties.format = DataFormat.CSV
+with gzip.open(temp_file_path, "wb") as temp_file:
+    df.to_json(temp_file, orient="records", date_format="iso", lines=True)
+ingestion_properties.format = DataFormat.JSON
Contributor:

I thought JSON was obsolete. Should this be multijson?


try:
return self.ingest_from_file(temp_file_path, ingestion_properties)
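The new write path can be mimicked with the stdlib alone. This is a hedged sketch: plain dicts stand in for the DataFrame's records, and a manual JSON-lines loop stands in for `df.to_json(..., orient="records", lines=True)`; the file-name scheme mirrors the one in the diff above.

```python
import gzip
import json
import os
import tempfile
import time
import uuid

# Stand-in for a DataFrame with a nested (dynamic) column.
rows = [{"id": 0, "xdynamicWithNulls": ["a@example.com", None]}]

file_name = "df_{id}_{timestamp}_{uid}.json.gz".format(
    id=id(rows), timestamp=int(time.time()), uid=uuid.uuid4()
)
temp_file_path = os.path.join(tempfile.gettempdir(), file_name)

# Write one JSON object per line, gzip-compressed.
with gzip.open(temp_file_path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back: nested values survive intact, unlike the old CSV path.
with gzip.open(temp_file_path, "rt", encoding="utf-8") as f:
    first = json.loads(f.readline())

os.remove(temp_file_path)
```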
11 changes: 8 additions & 3 deletions azure-kusto-ingest/tests/test_e2e_ingest.py
@@ -529,15 +529,20 @@ async def test_streaming_ingest_from_dataframe(self):
"xtextWithNulls",
"xdynamicWithNulls",
Contributor:

Nit: the column name is no longer correct (not sure whether the table is created ad hoc or not).

]
-rows = [
-    [0, "00000000-0000-0000-0001-020304050607", 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, "2014-01-01T01:01:01Z", "Zero", "Zero", "0", "00:00:00", None, ""]
-]
+guid = uuid.uuid4()
+
+dynamic_value = ["[email protected]", "[email protected]", "[email protected]"]
+rows = [[0, str(guid), 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, "2014-01-01T01:01:01Z", "Zero", "Zero", "0", "00:00:00", None, dynamic_value]]
df = DataFrame(data=rows, columns=fields)
ingestion_properties = IngestionProperties(database=self.test_db, table=self.test_table, flush_immediately=True, data_format=DataFormat.CSV)
Contributor:
Leftover CSV

Collaborator (Author):
That's actually good to have, to make sure it forces the right format (or do we want to throw in this case?).

self.ingest_client.ingest_from_dataframe(df, ingestion_properties)

await self.assert_rows_added(1, timeout=120)

a = self.client.execute(self.test_db, f"{self.test_table} | where rowguid == '{guid}'")
assert a.primary_results[0].rows[0]["xdynamicWithNulls"] == dynamic_value

@pytest.mark.asyncio
async def test_streaming_ingest_from_blob(self, is_managed_streaming):
ingestion_properties = IngestionProperties(
4 changes: 2 additions & 2 deletions azure-kusto-ingest/tests/test_kusto_ingest_client.py
@@ -518,11 +518,11 @@ def test_simple_ingest_from_dataframe(self, mock_pid, mock_time, mock_uuid, mock
result = ingest_client.ingest_from_dataframe(df, ingestion_properties=ingestion_properties)
assert result.status == IngestionStatus.QUEUED

-expected_url = "https://storageaccount.blob.core.windows.net/tempstorage/database__table__11111111-1111-1111-1111-111111111111__df_{0}_100_11111111-1111-1111-1111-111111111111.csv.gz?".format(
+expected_url = "https://storageaccount.blob.core.windows.net/tempstorage/database__table__11111111-1111-1111-1111-111111111111__df_{0}_100_11111111-1111-1111-1111-111111111111.json.gz?".format(
    id(df)
)

-assert_queued_upload(mock_put_message_in_queue, mock_upload_blob_from_stream, expected_url)
+assert_queued_upload(mock_put_message_in_queue, mock_upload_blob_from_stream, expected_url, format="json")

@responses.activate
@patch("azure.kusto.ingest.managed_streaming_ingest_client.ManagedStreamingIngestClient.MAX_STREAMING_SIZE_IN_BYTES", new=0)