Improve support for BigQuery, Redshift, Oracle, Db2, Snowflake #5827
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #5827      +/-   ##
==========================================
- Coverage   73.32%   73.25%   -0.08%
==========================================
  Files          67       67
  Lines        9604     9615      +11
==========================================
+ Hits         7042     7043       +1
- Misses       2562     2572      +10
Continue to review full report at Codecov.
Hey @villebro I'm trying to run some tests and ran into some weird issues on Redshift. It's the same error I've been running into with Redshift aggregations.
Also, here's a screenshot.
Thanks @minh5 for testing, I'll make some adjustments and push through an update soon.
I read up on the Redshift dialect, can you give it another go @minh5?
I may not be doing this right; I'm not too familiar with npm. But I got the same error
Then there's this error when I run
@minh5 I'm also now getting some npm errors which I think are coming from master. Regarding testing on Redshift, I've spun up a Redshift cluster to make it easier to test, hoping to complete this feature during the weekend.
One question @minh5, is
Yeah, right now the SUM is an "old school" metric. I used to get around this issue by changing
In this branch adhoc metrics should now work, but not old school ones. By the looks of it, making the old school ones work automatically is slightly more challenging, as they don't have a separate label to override.
@minh5 Ready for another round of testing.
Same error. The traceback is below. I'm pretty new to this but I just want to make sure my dev environment is set up. I'm running
@minh5 Hmm, the stacktrace is referencing code that has changed in this branch. Are you sure you are using the
Hey @villebro, got another error this time around
@minh5 Sorry, just putting on the finishing touches, I think I just got the last bugs sorted. Feeling pretty confident about this PR, but wouldn't be surprised if there are still some small typos lurking somewhere.
I'm not sure whether this is the right approach. There's a lot going on in here, and passing the
@mistercrunch I agree that this might seem excessive, but let me explain the reasoning behind the changes: The main argument is that
Currently (in
What this PR attempts to do is move from 2) to 1), i.e. encapsulate all SQLA-specific logic in db_engine_specs.py. While it might appear excessively complicated, I think the heterogeneous nature of the SQLA ecosystem requires a lot of flexibility from the backend to be able to conform to the quirks of every individual engine. However, if this still feels like the wrong approach I am open to suggestions.
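To make the encapsulation argument concrete, here is a minimal sketch; the class names, the mutate_label hook placement, and the alias-map helper are illustrative assumptions, not Superset's actual API:

```python
# Illustrative sketch only: names and structure are assumptions, not Superset code.
class BaseEngineSpec:
    @staticmethod
    def mutate_label(label):
        """Default behavior: engines without naming quirks keep labels unchanged."""
        return label


class LowercasingEngineSpec(BaseEngineSpec):
    @staticmethod
    def mutate_label(label):
        """Engines that fold unquoted identifiers to lowercase get lowercase aliases."""
        return label.lower()


def build_alias_map(labels, spec):
    """Map each mutated alias (used in the generated SQL) back to its original label,
    so result columns can be renamed back before being shown to the user."""
    return {spec.mutate_label(label): label for label in labels}


alias_map = build_alias_map(["SUM(Sales)", "COUNT(*)"], LowercasingEngineSpec)
print(alias_map)  # {'sum(sales)': 'SUM(Sales)', 'count(*)': 'COUNT(*)'}
```

The point is that the query-building code never needs to know which engine it is talking to; each quirk lives in exactly one spec class.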
Anyway, I'll park this for now. @minh5 the functionality should now be testable, and I would appreciate feedback on whether or not this works in your context.
No problem, @villebro, I don't mind testing since I would really love for this bug to be ironed out. Right now I just have a very hacky way of dealing with Redshift data since my org only uses Redshift. However, running the latest test I've run into this
@minh5 I'm thinking this error might be related to the timestamp with/without tz bug that's been reported by Redshift/Postgres users. Can you test other charts that don't have a time dimension, e.g. table viz and such, or plot the line chart without a time grain, which I think someone reported now works?
I think pushing that dict around is very error prone and super hard to reason about. It adds a layer on top of an already overloaded model and breaks all sorts of design patterns. Personally I think we need to go back to the design board on this one. |
@villebro Where is this replacement happening in 'master'? Can you point me to the code please? I can't seem to find it. :-)
@mistercrunch I think the label mutation logic is sound, but do agree that the dict pushing is overly complicated. As the original code wasn't designed to handle this type of added complexity, some changes are probably inevitable, but should be less invasive than was proposed here. Will revert with a better proposal. @JamshedRahman check the following lines: https://github.com/apache/incubator-superset/blob/7098ada8c5e241ba59b985478c1249da89b9b676/superset/db_engine_specs.py#L1384-L1394
@minh5 This WIP together with the fix from #6453 should now make life easier on Redshift. This PR has been in use in production for a few months and has been tested to work very reliably with Snowflake. BigQuery and Redshift have also been tested to work, although not as extensively. I also don't see any reason why Oracle and DB2 shouldn't now work (along with any other quirky dialect/engine), but haven't tested against them. If you have the time to give this a go I would be very thankful. If all is good I can make a last thorough check of the code before submitting this as a SIP, as I'm sure this will need to be thoroughly reviewed.
…in db_engine_specs
This is much needed and looks good to me on a first pass. I'm adding a label about this being a bit hard to test, though outside of the platforms where it's needed it should be straightforward.
@staticmethod
def mutate_label(label):
    """
    Oracle 12.1 and earlier support a maximum of 30 byte length object names, which
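The method body is cut off in the excerpt above; based on the description further down in this thread ("the first 30 chars from a MD5 hash"), a hedged sketch of what the implementation could look like (the wrapper class name and the exact cutoff handling are assumptions):

```python
import hashlib


class OracleEngineSpecSketch:  # hypothetical wrapper class, for illustration only
    @staticmethod
    def mutate_label(label):
        """Replace labels longer than 30 characters with the first 30 characters
        of an MD5 hash of the original label, to stay within Oracle's limit."""
        if len(label) > 30:
            return hashlib.md5(label.encode("utf-8")).hexdigest()[:30]
        return label


# A long metric label gets a stable, deterministic 30-character replacement:
print(OracleEngineSpecSketch.mutate_label("SUM(a_rather_long_column_name_indeed)"))
```

Because the hash is derived from the original label, the same metric always maps to the same 30-character alias.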
OMG they finally fixed this in Oracle!!!!!!!!!!!!!!! I thought this day would never happen.
@staticmethod
def mutate_label(label):
    """
    Db2 for z/OS supports a maximum of 30 byte length object names, which usually
OMG db2 too does this? Wow.
Did another pass and it LGTM. We may want another set of 👀 on this one though. Maybe @john-bodley or someone else from Airbnb.
I have actually tested this semi-thoroughly on Snowflake, BigQuery, Redshift and Oracle, so the main stuff should work. But the devil is in the details. Will update the description to describe exactly what is going on and why, and change this to a SIP today if needed.
Thanks for reviewing @mistercrunch. I changed the title to a SIP to highlight the fact that this is a fairly substantial change that might bring regressions with it. I also updated the original description (with pics!) to make it easier for @john-bodley or other reviewers to understand the reasoning behind the changes and see before/after.
I'm thinking about merging this and shipping as
Ping @mistercrunch @john-bodley, any chance of getting additional comments or merging this?
A continuation of PR #5686 for databases that have non-standard handling of column/alias names. The purpose of this PR is to make all engines 'just work' regardless of connector quirks or database restrictions. Based on available documentation and empirical experience, the following holds for dbapi1 query results used by Superset:
This PR changes the following:
- Label mutation is now handled in /connectors/sqla/models.py, which modifies labels as necessary on the fly and returns a dataframe with the original column headers, irrespective of database type. All SQL Alchemy specific logic has been removed from viz.py.
- A mutate_label method in db_engine_specs.py changes the label as needed; for BigQuery the logic is sketched below.

Below are some examples of before and after this PR.
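First, a minimal sketch of what the BigQuery label mutation could look like. The exact steps are not spelled out in this excerpt, so the helper name and sanitization rules below are assumptions, reconstructed from the notes further down (parentheses replaced by underscores, plus a hash appended to avoid collisions):

```python
import hashlib
import re


def mutate_bigquery_label(label):
    """Sketch only: keep characters BigQuery accepts in column names and append a
    short hash of the original label so that e.g. SUM(x) and SUM[x] stay distinct."""
    sanitized = re.sub(r"[^\w]+", "_", label).strip("_")
    suffix = hashlib.md5(label.encode("utf-8")).hexdigest()[:6]
    return "{}_{}".format(sanitized, suffix)


print(mutate_bigquery_label("SUM(x)"))  # e.g. SUM_x_ab12cd
print(mutate_bigquery_label("SUM[x]"))  # same sanitized prefix, different hash suffix
```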
BigQuery
Currently BigQuery mutates column names to comply with its naming requirements. This causes funny column names where e.g. parentheses are replaced by underscores:
This PR changes the columns back to their original state:
Looking at the query one can see that a hash has been added to the mutated column to avoid collisions, e.g. if the columns SUM(x) and SUM[x] are both present.

Redshift
Currently Redshift shows all lowercase column names in tables:
On the other hand timeseries don't work at all:
After this PR timeseries work just fine:
Oracle
Currently column names that exceed 30 characters don't work:
This PR makes it possible to use arbitrarily long column names:
This is done by replacing column names that exceed 30 characters with the first 30 characters of an MD5 hash in the query:
Snowflake
Currently timeseries graphs don't work due to forced quotes being missing from temporal column names (my bad, I forgot to add it to the original PR):
After this PR timeseries work fine:
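A note on the "forced quotes" mentioned above: SQLAlchemy can be told to always quote an identifier by wrapping it in quoted_name, which preserves the label exactly for case-sensitive engines such as Snowflake. The snippet below is only a minimal demonstration of that mechanism (using SQLAlchemy's 1.x list-style select), not the code path used in this PR:

```python
from sqlalchemy import column, select
from sqlalchemy.sql.elements import quoted_name

# Without forcing quotes, the alias is emitted unquoted and many engines normalize it.
plain = select([column("ds").label("__timestamp")])

# quoted_name(..., quote=True) tells the compiler to always quote the identifier.
forced = select([column("ds").label(quoted_name("__timestamp", True))])

print(plain)   # roughly: SELECT ds AS __timestamp
print(forced)  # roughly: SELECT ds AS "__timestamp"
```

Forcing quotes this way keeps labels such as __timestamp or mixed-case names intact on engines that would otherwise fold or normalize unquoted identifiers.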
Db2
Not tested at all, but should work similarly to Oracle.