
Request: a 'schema' param for the "read_sql" method (to help lower latency, and avoid some errors) #182

Open
alexander-beedie opened this issue Dec 2, 2021 · 12 comments


@alexander-beedie
Contributor

alexander-beedie commented Dec 2, 2021

This would be hugely useful - in our case we always know the result schema before issuing a query. If we could pass that schema into the read_sql function, connector-x would not have to issue a query to the database to determine it. This would improve latency (fewer queries required for setup) and potentially allow more complex queries to be executed (when the caller knows the schema).

I tried some fairly typical join queries on postgres (loading into a polars.DataFrame), but they fail to run; it looks like the query adjustments connector-x makes at the schema-determination stage result in a SQL syntax error. If we could pass the schema directly, this problem would be avoided and the given SQL could run unmodified (assuming no partitioning). Thanks! :)

Example:
cx.read_sql( postgres_url, query, schema=... )
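
Fleshed out a little - the 'schema' keyword and the dict format below are purely hypothetical, just sketching what we'd like to be able to write:

import connectorx as cx

postgres_url = "postgresql://user:pass@host:5432/db"   # placeholder connection string
query = "SELECT d, dt, z FROM some_table"              # placeholder query

# Hypothetical: no `schema` parameter exists in connectorx today; the dict
# format is only an assumption about what such a parameter could accept.
df = cx.read_sql(
    postgres_url,
    query,
    return_type="polars",
    schema={"d": "str", "dt": "date", "z": "float64"},
)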

@wangxiaoying
Contributor

Hi @alexander-beedie, thanks for the suggestion, I think it's a great idea! The main concern we had with letting users specify the schema is that if the given types are not accurate, the process might fail outright. But as you said, it is common to know the schema in advance, and it would improve performance. We will definitely consider it and do some investigation. Will come back to you later.

Btw, can you show what your join query looks like when it fails? We will also take a look at that. Thank you!

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 5, 2021

That would be fantastic - this feature would make connector-x usable for us and add a huge amount of value; I'd be very happy to help test it! The failing join query looks something like this (I removed functions and simplified table & column names, so the query looks a bit odd, but it preserves the subquery structure), run against a Redshift database -

SELECT DISTINCT u."d",r."dt",r."z"
FROM
  (SELECT "d","e","f"
   FROM
    (SELECT "a","b","c","d","e","f"
     FROM
      (SELECT "a","b","c","d","e","f" FROM abc) u 
       WHERE ((("e" >= '2021-12-01'::date) AND ("e" <= '2021-12-05'::date)) OR (("e" <= '2021-12-01'::date) AND ("f" >= '2021-12-01'::date)))) u
   WHERE (u."a" = 'value')) u
INNER JOIN
  (SELECT "w","x" AS "dt","y","z" FROM xyz) r 
ON ((u."d" = r."w") AND (r."dt" >= u."e") AND (r."dt" <= u."f"))

Table schemas:

{'abc': ({
   'a': String(20),
   'b': String(50),
   'c': String(10),
   'd': String(20),
   'e': Date,
   'f': Date,
 }),
 'xyz': ({
   'w': String(8),
   'x': Date,
   'y': String(3),
   'z': Float,
 })}

The error raised is...

  File "...\lib\site-packages\connectorx\__init__.py", line 102, in read_sql
    result = _read_sql(
RuntimeError: db error: ERROR: syntax error at or near "("

...but running the same query (unmodified) against a direct connection succeeds.

@wangxiaoying
Contributor

wangxiaoying commented Dec 6, 2021

Hi @alexander-beedie, thanks for the query example! However, the query seems to work in my test environment (using PostgreSQL). May I ask which version of connectorx you are using? We made some bug fixes in our latest versions. Another possible cause of the error is the protocol used: Redshift does not support the COPY command, so it can only work by setting the protocol to cursor.
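
For example - a minimal sketch (the connection string is a placeholder; protocol is an existing read_sql keyword):

import connectorx as cx

# Redshift speaks the Postgres wire protocol but lacks COPY support, so the
# default binary protocol fails; "cursor" fetches rows via a plain cursor.
redshift_uri = "postgresql://user:pass@cluster.example:5439/db"  # placeholder
df = cx.read_sql(redshift_uri, "SELECT 1", protocol="cursor")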

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 6, 2021

The latest stable (0.2.2). In fact, we get the same error from even simpler queries (also against Redshift).

As we currently have much of the same data available in both Postgres and Redshift, I tested identical SQL against both DBs - interestingly, the queries succeed against Postgres but fail against Redshift. So it looks like a Redshift-specific problem, if that helps!

>>> import connectorx
>>> connectorx.__version__
'0.2.2'
>>> frame = connectorx.read_sql( redshift_uri, 'SELECT * FROM test_schema.test_table LIMIT 100', return_type='polars' )
RuntimeError: db error: ERROR: syntax error at or near "("

EDIT
Hmm... there may also be some confusion in our URI schemes, as we were taking them from the sqlalchemy connection, which introduces some noise (e.g. 'redshift+psycopg2'). I'll try some variations.

UPDATE
Ok, got it :) You were right about the protocol param - setting it to 'cursor' (rather than the default of 'binary') now allows the Redshift query to succeed.

The 'schema' param would still be hugely useful for us, but we can make some progress now - thanks for the suggestion! (Also, I'll submit a pull request to the polars project to expose the 'protocol' parameter from their 'read_sql' passthrough method to connectorx).

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 6, 2021

FYI: when a URI with the 'redshift' prefix is passed, it would be nice if connectorx switched the protocol to 'cursor' automatically and proceeded as if a postgresql URI had been supplied (at the moment it panics ;)

pyo3_runtime.PanicException: not implemented: Connection: redshift://...

Have just created a pull request with this behaviour - #187.
Our initial testing is showing great performance, so thanks for this project ;)
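
For anyone hitting this before the PR lands, a small client-side shim along these lines should work (a sketch only - it assumes the Redshift endpoint is otherwise Postgres-compatible, and also strips sqlalchemy-style driver suffixes):

import connectorx as cx

def read_redshift(uri: str, query: str, **kwargs):
    # Map redshift:// (and sqlalchemy-style redshift+psycopg2://) URIs onto
    # the postgresql scheme, and force the cursor protocol Redshift requires.
    scheme, sep, rest = uri.partition("://")
    if scheme.startswith("redshift"):
        uri = "postgresql" + sep + rest
    kwargs.setdefault("protocol", "cursor")
    return cx.read_sql(uri, query, **kwargs)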

@wangxiaoying
Contributor

Hi @alexander-beedie, it's great that it worked out! I think adding redshift as a connection string scheme is a good suggestion!

@ldacey

ldacey commented Dec 20, 2021

Is there a workaround to get connectorx to work with Snowflake as well? I just tested all of our sources (MySQL, MS SQL, Postgres, Redshift) and everything seemed great, but I have one client using Snowflake where it failed. I am currently using turbodbc, but it would be nice to swap over completely (turbodbc has issues with pip installs alongside pyarrow; you basically have to use conda for it).

@wangxiaoying
Contributor

wangxiaoying commented Dec 21, 2021

> Is there a workaround to get connectorx to work with Snowflake as well? [...]

Hi @ldacey, we do want to support Snowflake in the future. However, for now there is no Rust driver we could directly adopt, and their wire protocol is not open-sourced. So the only option for us is to port one of their provided drivers, which needs more investigation. We may need some time to weigh the development and maintenance cost against the need.

@char101

char101 commented May 20, 2022

This would also be useful for specifying a dtype other than int64 for integer columns in SQLite.
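
Until a schema param exists, a post-load cast is the workaround - a sketch with polars (the db path, table, and column names are made up):

import connectorx as cx
import polars as pl

df = cx.read_sql(
    "sqlite:///data.db",            # placeholder path
    "SELECT id, flag FROM events",  # placeholder query
    return_type="polars",
)
# SQLite integers arrive as Int64; downcast the columns known to be small.
df = df.with_columns(pl.col("flag").cast(pl.Int8))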

@mcrumiller

Has there been any update on this?

@klaerik

klaerik commented Sep 8, 2023

We've experienced this issue too with connector-x.

More complex queries may fully execute during the LIMIT 1 step: if they contain a lot of CTEs, aggregations, etc., the database can't push the LIMIT 1 down and instead has to apply it to the completed result.

It would be nice to have a mode where the user could capture the schema from the returned query rather than having to execute an additional query (or queries if using partitioning), though I realize that might not be easy to do. It would make connector-x much more usable with complex queries.
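
Roughly, the probe amounts to something like this (a sketch - the actual wrapping happens in connector-x's Rust core, and the SQL it generates may differ):

def schema_probe(user_query: str) -> str:
    # Wrap the user's query and fetch one row to learn the result types.
    # For a query full of CTEs/aggregations, the database may have to
    # materialize the whole result before the LIMIT 1 applies.
    return f"SELECT * FROM ({user_query}) AS cx_probe LIMIT 1"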

@mcrumiller

We also have some fairly massive/complex queries where the LIMIT 1 cannot be pushed down, so our query (which already takes a while) effectively takes twice the time. It would be really nice to be able to specify the schema.
