
Python client is slow: takes 20min to query data that takes 6 seconds using a cURL command #72

Closed · pjayathissa opened this issue Mar 16, 2020 · 15 comments · Fixed by #74
Labels: bug (Something isn't working) · Milestone: 1.6.0

Comments

@pjayathissa commented Mar 16, 2020

I am finding the Python client to be very, very slow.

I'm querying the data using Python as follows, which takes about 20 minutes to execute:

result = query_api.query_data_frame(
    'from(bucket: "testing")'
    ' |> range(start: 2020-02-19T23:30:00Z, stop: now())'
    ' |> filter(fn: (r) => r._measurement == "awair-api")',
    org=credentials.org)

I run the same query using a shell script, which extracts the data in about 6 seconds:

curl 'https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/query?org=<your-org>' -XPOST -sS \
  -H 'Authorization: Token tokencode==' \
  -H 'Accept: application/csv' \
  -H 'Content-type: application/vnd.flux' \
  -d 'from(bucket:"testing") |> range(start: 2020-02-19T23:30:00Z, stop: now()) |> filter(fn: (r) => r._measurement == "awair-api")' >> test.csv
@bednar (Contributor) commented Mar 16, 2020

Hi @pjayathissa,

Thanks for opening the issue. Could you please share some information about what your data looks like (cardinality, amount, ...)?

Regards

bednar added this to the 1.6.0 milestone Mar 16, 2020
bednar added the bug (Something isn't working) label Mar 16, 2020
@pjayathissa (Author)

Attached is a zip of the CSV that was extracted by running the cURL command above and appending the result to a file:
awaircopy.csv.zip

@bednar (Contributor) commented Mar 16, 2020

Thanks @pjayathissa, I am currently investigating this issue...

@bednar (Contributor) commented Mar 17, 2020

Hi @pjayathissa,

I've prepared a fixed version in the branch fix/pandas-performance.

If you would like to test it, install the client via:

pip install git+https://github.com/influxdata/influxdb-client-python.git@fix/pandas-performance

Regards

@joranbeasley commented Aug 12, 2020

This is still very, very slow:

import pandas

# this test returns 194k rows of data
query_api().query(qs, org=org)             # ~28 s
query_api().query_data_frame(qs, org=org)  # ~31.5 s

def custom_query_dataframe(qs, org):
    http_resp = query_api().query_raw(qs, org=org)
    # skip the three annotated-CSV annotation rows (not sure about "groups")
    headers = [http_resp.readline() for _ in range(3)]
    df = pandas.read_csv(http_resp)
    # drop the leading "result" and "table" columns, which I don't need
    return df.drop(columns=df.columns[:2])

custom_query_dataframe(qs, org=org)        # ~1-2 s

@bednar (Contributor) commented Aug 13, 2020

Hi @joranbeasley,

Could you share what your data looks like?

One possible speedup is to install ciso8601, which speeds up date parsing considerably:

pip install ciso8601
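
For reference, a minimal micro-benchmark sketch of the parsing difference (the timestamp below is illustrative, and python-dateutil is assumed to be installed only as a pure-Python comparison; absolute numbers vary by machine):

import timeit

import ciso8601
from dateutil import parser as dateutil_parser

ts = "2020-02-19T23:30:00.123456Z"

# ciso8601's C parser is typically much faster than pure-Python date parsing
print("dateutil:", timeit.timeit(lambda: dateutil_parser.parse(ts), number=100_000))
print("ciso8601:", timeit.timeit(lambda: ciso8601.parse_datetime(ts), number=100_000))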

Regards

@franz101

Painfully slow here as well with the latest dev version

@bednar (Contributor) commented Jan 21, 2021

Hi @franz101,

Could you please share some information about what your data looks like (cardinality, amount, an example, ...)?

Which version of Python do you use?

Regards

@idubinets

I can confirm that query_api is super slow! A 200k-record query takes 62 seconds.

Do you have a solution or workaround?

@bednar (Contributor) commented Nov 30, 2021

@idubinets, see #371 (comment)

@sarjarapu

@bednar's recommendation to install the ciso8601 dependency improved my query execution time from 1.23 s to 0.34 s. Thank you!

@ojdo commented Jan 6, 2023

For me, the current best workaround is still @joranbeasley's idea of going through query_raw() + pandas, roughly like this for simple cases with a single table:

import pandas as pd
from io import BytesIO

from influxdb_client import InfluxDBClient  # client from which query_api is obtained

def perform_simple_query(
        query_api,
        organization: str,
        query: str,
        field: str,
) -> pd.DataFrame:
    """Perform simple query against InfluxDB query API.

    Left as an exercise: generalize to results with multiple groups.
    """
    response = query_api.query_raw(
        query=query,
        org=organization,
    )
    try:
        df = pd.read_csv(
            BytesIO(response.data),
            skiprows=[0, 1, 2],  # group header rows
        )
    except pd.errors.EmptyDataError:
        return pd.DataFrame()

    df.rename(columns={'_value': field}, inplace=True)
    df.drop(['Unnamed: 0', '_field', '_start', '_stop', 'result',
             '_measurement', 'table'], axis=1, inplace=True)  # customize as needed
    df['_time'] = pd.to_datetime(df['_time'])
    df.set_index('_time', inplace=True)
    return df
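
A hypothetical usage sketch (the URL, token, org, bucket, and field names below are placeholders, not from this issue):

from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
df = perform_simple_query(
    client.query_api(),
    organization="my-org",
    query='from(bucket: "testing")'
          ' |> range(start: -1d)'
          ' |> filter(fn: (r) => r._field == "temperature")',
    field="temperature",
)
client.close()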

@calumroy commented Aug 14, 2023

@ojdo Thanks.
When multiple tables are returned by a raw query, I found that simply splitting on the byte sequence b'\r\n\r\n' and processing each chunk works:

import pandas as pd
from io import BytesIO

response = query_api.query_raw(
    query=query,
    org=organization,
)

# each table in the raw annotated-CSV response is separated by a blank line
dfs = []
for raw_table in response.data.split(b'\r\n\r\n'):
    if raw_table == b'':
        continue
    try:
        dfs.append(pd.read_csv(
            BytesIO(raw_table),
            skiprows=[0, 1, 2],  # group header rows
        ))
    except pd.errors.EmptyDataError:
        continue
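
If the resulting tables share the same columns (an assumption about the query's output shape), they can then be stacked into a single frame:

combined = pd.concat(dfs, ignore_index=True)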

@donggu-kang

(Translated from Korean.) For me, the current best workaround is still @joranbeasley's idea of using query_raw() + pandas for simple single-table cases, i.e. @ojdo's perform_simple_query snippet above, combined with:

pip install ciso8601

Oh! If you apply the two methods above at the same time, the processing time is greatly reduced!!

It used to take more than 10 minutes for 2.5 million records... but now I get the query results in 5 seconds!!

@tientr commented Jan 24, 2025

This problem still persists to this day. Using the two solutions provided by @ojdo and @bednar works: query execution time dropped from 200 seconds to 8 seconds for 1 million records. Thanks to @ojdo and @bednar.
