Request: a 'schema' param for the "read_sql" method (to help lower latency, and avoid some errors) #182
Hi @alexander-beedie , thanks for the suggestion - I think it's a great idea! The main concern we had with letting the user specify the schema is that if the types given are not accurate, the process might fail outright. But as you said, it is common to know the schema, and it would improve performance. We will definitely consider it and do some investigation; will come back to you later. Btw, can you show what your failing join query looks like? We will also take a look at that. Thank you!
That would be fantastic - this feature would make connector-x usable for us and add a huge amount of value; I'd be very happy to help test it! The failing join query looks something like this (I removed functions and simplified table & column names, so the query looks a bit odd, but it preserves the subquery structure), against a Redshift database:

SELECT DISTINCT u."d",r."dt",r."z"
FROM
(SELECT "d","e","f"
FROM
(SELECT "a","b","c","d","e","f"
FROM
(SELECT "a","b","c","d","e","f" FROM abc) u
WHERE ((("e" >= '2021-12-01'::date) AND ("e" <= '2021-12-05'::date)) OR (("e" <= '2021-12-01'::date) AND ("f" >= '2021-12-01'::date)))) u
WHERE (u."a" = 'value')) u
INNER JOIN
(SELECT "w","x" AS "dt","y","z" FROM xyz) r
ON ((u."d" = r."w") AND (r."dt" >= u."e") AND (r."dt" <= u."f"))

Table schemas:

{'abc': ({
'a': String(20),
'b': String(50),
'c': String(10),
'd': String(20),
'e': Date,
'f': Date,
}),
'xyz': ({
'w': String(8),
'x': Date,
'y': String(3),
'z': Float,
})}

The error raised is:

  File "...\lib\site-packages\connectorx\__init__.py", line 102, in read_sql
result = _read_sql(
RuntimeError: db error: ERROR: syntax error at or near "("

...but running the same query (unmodified) against a direct connection succeeds.
Hi @alexander-beedie , thanks for the query example! However, the query works in my test environment (using PostgreSQL). May I ask which version of connectorx you are using? We did some bug fixes in our latest versions. Another possible cause of the error is the protocol used: Redshift does not support the COPY command, so it can only work by specifying the cursor protocol.
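To illustrate the protocol point above, a caller could pick the wire protocol from the URI scheme before calling read_sql. This is a rough sketch: `choose_protocol` is a hypothetical helper (not part of connectorx), and treating 'binary' as the default for plain PostgreSQL is an assumption.

```python
from urllib.parse import urlparse

def choose_protocol(uri: str) -> str:
    """Pick a connectorx wire protocol from the URI scheme.

    Redshift does not support COPY, so fall back to the row-by-row
    'cursor' protocol; for plain PostgreSQL assume the 'binary' path.
    (Hypothetical helper for illustration, not part of connectorx.)
    """
    scheme = urlparse(uri).scheme
    return "cursor" if scheme == "redshift" else "binary"
```

The chosen value would then be passed as the `protocol` keyword argument to `read_sql`.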
The latest stable (0.2.2). In fact, we get the same error from even simpler queries (also against Redshift). As we currently have a lot of the same data available in both Postgres and Redshift, I tested the same queries against both DBs - and, interestingly, the queries succeed against Postgres but fail against Redshift with identical SQL. So it looks like a Redshift-specific problem, if that helps!

>>> import connectorx
>>> connectorx.__version__
'0.2.2'
>>> frame = connectorx.read_sql( redshift_uri, 'SELECT * FROM test_schema.test_table LIMIT 100', return_type='polars' )
RuntimeError: db error: ERROR: syntax error at or near "("

EDIT/UPDATE: The 'schema' param would still be hugely useful for us, but we can make some progress now - thanks for the suggestion! (Also, I'll submit a pull request to the polars project to expose the 'protocol' parameter from their 'read_sql' passthrough method to connectorx.)
FYI: if passing a URI with a 'redshift' prefix, it would be nice if connectorx changed the protocol to 'cursor' automatically and proceeded as if postgresql had been supplied in the URI instead. At the moment it panics ;)

pyo3_runtime.PanicException: not implemented: Connection: redshift://...

Have just created a pull request with this behaviour - #187.
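The rewrite the comment proposes can be sketched as a small pure function: map a redshift:// URI to its postgresql:// equivalent and select the 'cursor' protocol. This is an illustrative helper only, not the connectorx API and not necessarily how PR #187 implements it.

```python
def normalize_redshift_uri(uri: str):
    """Rewrite a redshift:// URI to postgresql:// and pick the 'cursor'
    protocol, as suggested above. Returns (uri, protocol); protocol is
    None for other schemes, meaning 'keep the caller's choice'.
    (Illustrative sketch, not part of connectorx.)
    """
    prefix = "redshift://"
    if uri.startswith(prefix):
        return "postgresql://" + uri[len(prefix):], "cursor"
    return uri, None
```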
Hi @alexander-beedie , it's great that it worked out! I think adding redshift to the supported connection string schemes is a good suggestion!
Is there a workaround to get connectorx to work with Snowflake as well? I just tested all of our sources (MySQL, MS SQL, Postgres, Redshift) and everything seemed great. I have one client that uses Snowflake, though, which failed. I am currently using turbodbc, but it would be nice to swap over completely (turbodbc has issues with pip install working with pyarrow; you basically have to use conda for it).
Hi @ldacey , we do want to support Snowflake in the future. However, it looks like for now there is no Rust driver that we could directly adopt, and their wire protocol is not open-sourced. So the only option for us is to port one of their provided drivers, which needs more investigation. We may need some time to evaluate the cost of development and maintenance compared with the need.
This would also be useful for specifying a dtype other than int64 for integers in SQLite.
Has there been any update on this? |
We've experienced this issue too with connector-x. More complex queries might fully execute during the schema-determination step. It would be nice to have a mode where the user could capture the schema from the returned query rather than having to execute an additional query (or queries, if using partitioning), though I realize that might not be easy to do. It would make connector-x much more usable with complex queries.
We also have some fairly massive/complex queries that cannot push down the …
This would be hugely useful - in our case we always know the result schema before issuing a query. If we can pass that schema into the read_sql function, then connector-x will not have to issue a query to the database to determine it. This will give us better latency (fewer queries required for setup) and potentially allow more complex queries to be executed (if the caller knows the schema).

I tried some fairly typical join queries on Postgres (to load into a polars.DataFrame), but they fail to run; it looks like the query adjustments done by connector-x at the schema-determination stage result in a SQL syntax error. If we can pass the schema directly, this problem will be avoided, and the given SQL can be run unmodified (*if no partitioning). Thanks! :)

Example:

cx.read_sql( postgres_url, query, schema=... )
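A minimal, library-agnostic sketch of the contract such a parameter implies: the caller declares column types up front, so type coercion can happen client-side with no schema-probing round trip. The `SCHEMA` mapping and `coerce_rows` helper here are hypothetical illustrations (loosely mirroring the 'xyz' table above), not part of the connectorx API.

```python
import datetime as dt

# Hypothetical caller-supplied schema: column name -> converter.
SCHEMA = {
    "w": str,
    "dt": dt.date.fromisoformat,
    "z": float,
}

def coerce_rows(rows, schema):
    """Coerce raw string values into the caller-declared types.

    Illustrates the feature request: the types are known before the
    query runs, so no extra query is needed to discover them.
    """
    return [{col: schema[col](val) for col, val in row.items()} for row in rows]

raw = [{"w": "ABC12345", "dt": "2021-12-01", "z": "3.5"}]
typed = coerce_rows(raw, SCHEMA)
```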