
Python client is slow: takes 20min to query data that takes 6 seconds using a cURL command #72

Closed · pjayathissa opened this issue Mar 16, 2020 · 15 comments · Fixed by #74
Labels: bug (Something isn't working) · Milestone: 1.6.0

Comments

@pjayathissa commented Mar 16, 2020

I am finding the Python client to be very, very slow.

I'm querying the data using Python as follows, which takes about 20 minutes to execute:

result = query_api.query_data_frame(
    'from(bucket: "testing")'
    ' |> range(start: 2020-02-19T23:30:00Z, stop: now())'
    ' |> filter(fn: (r) => r._measurement == "awair-api")',
    org=credentials.org)

I run the same query using a shell script, which extracts the data in about 6 seconds:

curl 'https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/query?org=<your-org>' -XPOST -sS \
  -H 'Authorization: Token tokencode==' \
  -H 'Accept: application/csv' \
  -H 'Content-type: application/vnd.flux' \
  -d 'from(bucket:"testing") |> range(start: 2020-02-19T23:30:00Z, stop: now()) |> filter(fn: (r) => r._measurement == "awair-api")' >> test.csv
@bednar (Contributor) commented Mar 16, 2020

Hi @pjayathissa,

Thanks for opening the issue. Could you please share some information about what your data looks like (cardinality, amount, ...)?

Regards

bednar added this to the 1.6.0 milestone Mar 16, 2020
bednar added the bug (Something isn't working) label Mar 16, 2020
@pjayathissa (Author)

Attached is a zip of the CSV that was extracted by running the cURL command above and appending the result to a file:
awaircopy.csv.zip

@bednar (Contributor) commented Mar 16, 2020

Thanks @pjayathissa, I am currently investigating this issue...

@bednar (Contributor) commented Mar 17, 2020

Hi @pjayathissa,

I've prepared a fixed version in the branch fix/pandas-performance.

If you would like to test it, install the client via:

pip install git+https://github.com/influxdata/influxdb-client-python.git@fix/pandas-performance

Regards

@joranbeasley commented Aug 12, 2020

This is still very, very slow:

import pandas

# this test returns 194k rows of data
query_api().query(qs, org=org)             # ~28 s
query_api().query_data_frame(qs, org=org)  # ~31.5 s

def custom_query_dataframe(qs, org):
    http_resp = query_api().query_raw(qs, org=org)
    # skip the three annotated-CSV annotation rows (not sure about "groups")
    headers = [http_resp.readline() for _ in range(3)]
    df = pandas.read_csv(http_resp)
    # drop the leading "result" and "table" columns, which I don't need
    return df.drop(columns=df.columns[:2])

custom_query_dataframe(qs, org=org)        # ~1-2 s

@bednar (Contributor) commented Aug 13, 2020

Hi @joranbeasley,

Could you share what your data looks like?

One possible speedup is to install ciso8601, which speeds up date parsing considerably:

pip install ciso8601
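
For reference, a minimal micro-benchmark sketch of the parsing difference (the timestamp below is illustrative, and python-dateutil is assumed to be installed only as a pure-Python comparison; absolute numbers vary by machine):

import timeit

import ciso8601
from dateutil import parser as dateutil_parser

ts = "2020-02-19T23:30:00.123456Z"

# ciso8601's C parser is typically much faster than pure-Python date parsing
print("dateutil:", timeit.timeit(lambda: dateutil_parser.parse(ts), number=100_000))
print("ciso8601:", timeit.timeit(lambda: ciso8601.parse_datetime(ts), number=100_000))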

Regards

@franz101

Painfully slow here as well with the latest dev version

@bednar (Contributor) commented Jan 21, 2021

Hi @franz101,

Could you please share some information about what your data looks like (cardinality, amount, an example, ...)?

Which version of Python do you use?

Regards

@idubinets

I can confirm that query_api is super slow! A 200k-record query takes 62 seconds.

Do you have a solution or workaround?

@bednar (Contributor) commented Nov 30, 2021

@idubinets, see #371 (comment)

@sarjarapu

@bednar's recommendation to install the ciso8601 dependency improved my query execution time from 1.23 s to 0.34 s. Thank you!

@ojdo commented Jan 6, 2023

For me, the current best workaround is still @joranbeasley's idea of going through query_raw() + pandas, roughly like this for simple cases with a single table:

import pandas as pd
from io import BytesIO

from influxdb_client import InfluxDBClient  # client from which query_api is obtained

def perform_simple_query(
        query_api,
        organization: str,
        query: str,
        field: str,
) -> pd.DataFrame:
    """Perform simple query against InfluxDB query API.

    Left as an exercise: generalize to results with multiple groups.
    """
    response = query_api.query_raw(
        query=query,
        org=organization,
    )
    try:
        df = pd.read_csv(
            BytesIO(response.data),
            skiprows=[0, 1, 2],  # group header rows
        )
    except pd.errors.EmptyDataError:
        return pd.DataFrame()

    df.rename(columns={'_value': field}, inplace=True)
    df.drop(['Unnamed: 0', '_field', '_start', '_stop', 'result',
             '_measurement', 'table'], axis=1, inplace=True)  # customize as needed
    df['_time'] = pd.to_datetime(df['_time'])
    df.set_index('_time', inplace=True)
    return df
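
A hypothetical usage sketch (the URL, token, org, bucket, and field names below are placeholders, not from this issue):

from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
df = perform_simple_query(
    client.query_api(),
    organization="my-org",
    query='from(bucket: "testing")'
          ' |> range(start: -1d)'
          ' |> filter(fn: (r) => r._field == "temperature")',
    field="temperature",
)
client.close()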

@calumroy commented Aug 14, 2023

@ojdo Thanks.
When multiple tables are returned by a raw query, I found that simply splitting on the byte sequence b'\r\n\r\n' and processing each chunk works:

import pandas as pd
from io import BytesIO

response = query_api.query_raw(
    query=query,
    org=organization,
)

# each table in the raw annotated-CSV response is separated by a blank line
dfs = []
for raw_table in response.data.split(b'\r\n\r\n'):
    if raw_table == b'':
        continue
    try:
        dfs.append(pd.read_csv(
            BytesIO(raw_table),
            skiprows=[0, 1, 2],  # group header rows
        ))
    except pd.errors.EmptyDataError:
        continue
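
If the resulting tables share the same columns (an assumption about the query's output shape), they can then be stacked into a single frame:

combined = pd.concat(dfs, ignore_index=True)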

@donggu-kang

(Translated from Korean.) For me, the current best workaround is still @joranbeasley's idea of using query_raw() + pandas for simple single-table cases, i.e. @ojdo's perform_simple_query snippet above, combined with:

pip install ciso8601

Oh! If you apply the two methods above at the same time, the processing time is greatly reduced!!

It used to take more than 10 minutes for 2.5 million records... but now I get the query results in 5 seconds!!

@tientr commented Jan 24, 2025

This problem still persists to this day. Using the two solutions provided by @ojdo and @bednar works: query execution time dropped from 200 seconds to 8 seconds for 1 million records. Thanks to @ojdo and @bednar.
