
Request: a 'schema' param for the "read_sql" method (to help lower latency, and avoid some errors) #182

Open
alexander-beedie opened this issue Dec 2, 2021 · 12 comments


@alexander-beedie
Contributor

alexander-beedie commented Dec 2, 2021

This would be hugely useful - in our case we always know the result schema before issuing a query. If we could pass that schema into the read_sql function, connector-x would not have to issue a query to the database to determine it. This would improve latency (fewer queries required for setup) and potentially allow more complex queries to be executed (when the caller knows the schema).

I tried some fairly typical join queries on postgres (loading into a polars.DataFrame), but they fail to run; it looks like the query adjustments connector-x makes at the schema-determination stage result in a SQL syntax error. If we could pass the schema directly, this problem would be avoided and the given SQL could run unmodified (assuming no partitioning). Thanks! :)

Example:
cx.read_sql( postgres_url, query, schema=... )
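
Fleshed out a little - the 'schema' keyword and the dict format below are purely hypothetical, just sketching what we'd like to be able to write:

import connectorx as cx

postgres_url = "postgresql://user:pass@host:5432/db"   # placeholder connection string
query = "SELECT d, dt, z FROM some_table"              # placeholder query

# Hypothetical: no `schema` parameter exists in connectorx today; the dict
# format is only an assumption about what such a parameter could accept.
df = cx.read_sql(
    postgres_url,
    query,
    return_type="polars",
    schema={"d": "str", "dt": "date", "z": "float64"},
)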

@wangxiaoying
Contributor

Hi @alexander-beedie, thanks for the suggestion, I think it's a great idea! The main concern we had with letting users specify the schema is that if the given types are not accurate, the process might fail outright. But as you said, it is common to know the schema in advance, and it would improve performance. We will definitely consider it and do some investigation. Will come back to you later.

Btw, can you show what your join query looks like when it fails? We will also take a look at that. Thank you!

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 5, 2021

That would be fantastic - this feature would make connector-x usable for us and add a huge amount of value; I'd be very happy to help test it! The failing join query looks something like this (I removed functions and simplified table & column names, so the query looks a bit odd, but it preserves the subquery structure), run against a Redshift database -

SELECT DISTINCT u."d",r."dt",r."z"
FROM
  (SELECT "d","e","f"
   FROM
    (SELECT "a","b","c","d","e","f"
     FROM
      (SELECT "a","b","c","d","e","f" FROM abc) u 
       WHERE ((("e" >= '2021-12-01'::date) AND ("e" <= '2021-12-05'::date)) OR (("e" <= '2021-12-01'::date) AND ("f" >= '2021-12-01'::date)))) u
   WHERE (u."a" = 'value')) u
INNER JOIN
  (SELECT "w","x" AS "dt","y","z" FROM xyz) r 
ON ((u."d" = r."w") AND (r."dt" >= u."e") AND (r."dt" <= u."f"))

Table schemas:

{'abc': ({
   'a': String(20),
   'b': String(50),
   'c': String(10),
   'd': String(20),
   'e': Date,
   'f': Date,
 }),
 'xyz': ({
   'w': String(8),
   'x': Date,
   'y': String(3),
   'z': Float,
 })}

The error raised is...

  File "...\lib\site-packages\connectorx\__init__.py", line 102, in read_sql
    result = _read_sql(
RuntimeError: db error: ERROR: syntax error at or near "("

...but running the same query (unmodified) against a direct connection succeeds.

@wangxiaoying
Contributor

wangxiaoying commented Dec 6, 2021

Hi @alexander-beedie, thanks for the query example! However, the query seems to work in my test environment (using PostgreSQL). May I ask which version of connectorx you are using? We made some bug fixes in our latest versions. Another possible cause of the error is the protocol used: Redshift does not support the COPY command, so it can only work by setting the protocol to cursor.
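
For example - a minimal sketch (the connection string is a placeholder; protocol is an existing read_sql keyword):

import connectorx as cx

# Redshift speaks the Postgres wire protocol but lacks COPY support, so the
# default binary protocol fails; "cursor" fetches rows via a plain cursor.
redshift_uri = "postgresql://user:pass@cluster.example:5439/db"  # placeholder
df = cx.read_sql(redshift_uri, "SELECT 1", protocol="cursor")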

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 6, 2021

The latest stable (0.2.2). In fact, we get the same error from even simpler queries (also against Redshift).

As we currently have much of the same data available in both Postgres and Redshift, I tested identical SQL against both DBs - interestingly, the queries succeed against Postgres but fail against Redshift. So it looks like a Redshift-specific problem, if that helps!

>>> import connectorx
>>> connectorx.__version__
'0.2.2'
>>> frame = connectorx.read_sql( redshift_uri, 'SELECT * FROM test_schema.test_table LIMIT 100', return_type='polars' )
RuntimeError: db error: ERROR: syntax error at or near "("

EDIT
Hmm... there may also be some confusion in our URI schemes, as we were taking them from the sqlalchemy connection, which introduces some noise (e.g. 'redshift+psycopg2'). I'll try some variations.

UPDATE
Ok, got it :) You were right about the protocol param - setting it to 'cursor' (rather than the default of 'binary') now allows the Redshift query to succeed.

The 'schema' param would still be hugely useful for us, but we can make some progress now - thanks for the suggestion! (Also, I'll submit a pull request to the polars project to expose the 'protocol' parameter from their 'read_sql' passthrough method to connectorx).

@alexander-beedie
Contributor Author

alexander-beedie commented Dec 6, 2021

FYI: when a URI with the 'redshift' prefix is passed, it would be nice if connectorx switched the protocol to 'cursor' automatically and proceeded as if a postgresql URI had been supplied (at the moment it panics ;)

pyo3_runtime.PanicException: not implemented: Connection: redshift://...

Have just created a pull request with this behaviour - #187.
Our initial testing is showing great performance, so thanks for this project ;)
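
For anyone hitting this before the PR lands, a small client-side shim along these lines should work (a sketch only - it assumes the Redshift endpoint is otherwise Postgres-compatible, and also strips sqlalchemy-style driver suffixes):

import connectorx as cx

def read_redshift(uri: str, query: str, **kwargs):
    # Map redshift:// (and sqlalchemy-style redshift+psycopg2://) URIs onto
    # the postgresql scheme, and force the cursor protocol Redshift requires.
    scheme, sep, rest = uri.partition("://")
    if scheme.startswith("redshift"):
        uri = "postgresql" + sep + rest
    kwargs.setdefault("protocol", "cursor")
    return cx.read_sql(uri, query, **kwargs)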

@wangxiaoying
Contributor

Hi @alexander-beedie, it's great that it worked out! I think adding redshift as a connection string scheme is a good suggestion!

@ldacey

ldacey commented Dec 20, 2021

Is there a workaround to get connectorx to work with Snowflake as well? I just tested all of our sources (MySQL, MS SQL, Postgres, Redshift) and everything seemed great, but I have one client using Snowflake where it failed. I am currently using turbodbc, but it would be nice to swap over completely (turbodbc has issues with pip installs alongside pyarrow; you basically have to use conda for it).

@wangxiaoying
Contributor

wangxiaoying commented Dec 21, 2021

> Is there a workaround to get connectorx to work with Snowflake as well? [...]

Hi @ldacey, we do want to support Snowflake in the future. However, for now there is no Rust driver we could directly adopt, and their wire protocol is not open-sourced. So the only option for us is to port one of their provided drivers, which needs more investigation. We may need some time to weigh the development and maintenance cost against the need.

@char101

char101 commented May 20, 2022

This would also be useful for specifying a dtype other than int64 for integer columns in SQLite.
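
Until a schema param exists, a post-load cast is the workaround - a sketch with polars (the db path, table, and column names are made up):

import connectorx as cx
import polars as pl

df = cx.read_sql(
    "sqlite:///data.db",            # placeholder path
    "SELECT id, flag FROM events",  # placeholder query
    return_type="polars",
)
# SQLite integers arrive as Int64; downcast the columns known to be small.
df = df.with_columns(pl.col("flag").cast(pl.Int8))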

@mcrumiller

Has there been any update on this?

@klaerik

klaerik commented Sep 8, 2023

We've experienced this issue too with connector-x.

More complex queries may fully execute during the LIMIT 1 step: if they contain a lot of CTEs, aggregations, etc., the database can't push the LIMIT 1 down and instead has to apply it to the completed result.

It would be nice to have a mode where the user could capture the schema from the returned query rather than having to execute an additional query (or queries if using partitioning), though I realize that might not be easy to do. It would make connector-x much more usable with complex queries.
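
Roughly, the probe amounts to something like this (a sketch - the actual wrapping happens in connector-x's Rust core, and the SQL it generates may differ):

def schema_probe(user_query: str) -> str:
    # Wrap the user's query and fetch one row to learn the result types.
    # For a query full of CTEs/aggregations, the database may have to
    # materialize the whole result before the LIMIT 1 applies.
    return f"SELECT * FROM ({user_query}) AS cx_probe LIMIT 1"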

@mcrumiller

We also have some fairly massive/complex queries where the LIMIT 1 cannot be pushed down, so our query (which already takes a while) effectively takes twice the time. It would be really nice to be able to specify the schema.
