Improve support for BigQuery, Redshift, Oracle, Db2, Snowflake #5827

Merged
merged 9 commits into from
Jan 18, 2019

Conversation

@villebro (Member) commented Sep 5, 2018

A continuation of PR #5686 for databases that have non-standard handling of column/alias names. The purpose of this PR is to make all engines 'just work' regardless of connector quirks or database restrictions. Based on available documentation and empirical experience, the following holds for DB-API query results used by Superset:

  • BigQuery: Column names can be no longer than 128 characters, must start with a letter, and may contain only letters, numbers and underscores. Mixed case is preserved.
  • Redshift: Column names in query results are all lowercase, even if a mixed-case alias is quoted.
  • Oracle: Column names have a maximum length of 30 characters.
  • DB2: Same as Oracle.
  • Snowflake: Unquoted lowercase aliases result in all-uppercase column names. In addition, there is a 256 character limit on column names.

This PR changes the following:

  • No functional changes to any existing databases; the PR only modifies the behavior of the five databases mentioned above.
  • Centralizes all label mutation and quoting logic in /connectors/sqla/models.py, which modifies labels as necessary on the fly and returns a dataframe with the original column headers, irrespective of database type. All SQLAlchemy specific logic is removed from viz.py.
  • Extends the work started in #5641 (Field names in BigQuery can contain only alphanumeric and underscore), where engines with special restrictions on labels can provide a custom mutate_label method in db_engine_specs.py to change the label as needed. For BigQuery the logic is as follows:
    1. If the label contains unsupported characters, replace all characters in violation with underscores and append an md5 hash to the end of the column name to avoid collisions.
    2. Return the new label from 1) if it is no longer than 128 characters, otherwise return only the hash.
  • Other engines use a similar approach: Redshift lowercases everything, while Oracle and DB2 restrict the column length to 30 characters. If a label is mutated, the original and mutated labels are listed in the View Query view.
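The BigQuery logic in steps 1-2 above can be sketched roughly as follows. This is a simplified illustration, not the actual mutate_label implementation; the exact regex, the hash placement, and the 'c' prefix for names not starting with a letter are assumptions:

```python
import hashlib
import re

def mutate_bigquery_label(label: str) -> str:
    """Sketch of the label mutation described in steps 1-2 above."""
    # A valid BigQuery column name starts with a letter and contains
    # only letters, digits and underscores (per the description above).
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", label) and len(label) <= 128:
        return label
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    # Step 1: replace offending characters with underscores and append
    # an md5 hash so that e.g. SUM(x) and SUM[x] do not collide.
    mutated = re.sub(r"[^A-Za-z0-9_]", "_", label) + "_" + digest
    if not mutated[0].isalpha():
        mutated = "c" + mutated  # assumed fix-up: names must start with a letter
    # Step 2: if still over 128 characters, fall back to the hash alone.
    if len(mutated) > 128:
        mutated = "c_" + digest
    return mutated
```

Since the hash is derived from the original label, the mutation is deterministic and collision-resistant across otherwise identical sanitized names.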

Below are some examples of behavior before and after this PR.

BigQuery

Currently BigQuery mutates column names to comply with its naming requirements. This causes odd column names where e.g. parentheses are replaced by underscores:

[screenshot: BigQuery before]

This PR changes the columns back to their original state:

[screenshot: BigQuery after]

Looking at the query, one can see that a hash has been added to the mutated column to avoid collisions, e.g. if the columns SUM(x) and SUM[x] are both present.

[screenshot: BigQuery after, query]

Redshift

Currently Redshift shows all lowercase column names in tables:

[screenshot: Redshift before, table]

Timeseries, on the other hand, don't work at all:

[screenshot: Redshift before, timeseries]

After this PR timeseries work just fine:

[screenshot: Redshift after, timeseries]

Oracle

Currently column names that exceed 30 characters don't work:

[screenshot: Oracle before]

This PR makes it possible to use arbitrarily long column names:

[screenshot: Oracle after]

This is done by changing column names that exceed 30 characters to the first 30 characters of an MD5 hash in the query:

[screenshot: Oracle after, query]
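The truncation described above can be sketched with a hypothetical helper (the function name and exact truncation strategy are assumptions, not the actual engine-spec code):

```python
import hashlib

ORACLE_MAX_LABEL_LENGTH = 30  # Oracle 12.1 and earlier limit object names to 30 bytes

def truncate_label(label: str, max_length: int = ORACLE_MAX_LABEL_LENGTH) -> str:
    """Replace labels exceeding max_length with a truncated md5 hash."""
    if len(label) <= max_length:
        return label
    # md5 hex digests are 32 characters; keeping the first max_length
    # characters yields a short, deterministic label per input.
    return hashlib.md5(label.encode("utf-8")).hexdigest()[:max_length]
```

Because the hash is deterministic, the same original label always maps to the same shortened name, which is what allows the original/mutated pair to be listed in the View Query view.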

Snowflake

Currently timeseries graphs don't work, due to forced quotes missing from temporal column names (my bad, I forgot to add them in the original PR):

[screenshot: Snowflake before]

After this PR timeseries work fine:

[screenshot: Snowflake after]

Db2

Not tested at all, but should work similarly to Oracle.

@codecov-io commented Sep 5, 2018

Codecov Report

Merging #5827 into master will decrease coverage by 0.07%.
The diff coverage is 62.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5827      +/-   ##
==========================================
- Coverage   73.32%   73.25%   -0.08%     
==========================================
  Files          67       67              
  Lines        9604     9615      +11     
==========================================
+ Hits         7042     7043       +1     
- Misses       2562     2572      +10
Impacted Files Coverage Δ
superset/db_engine_specs.py 54.72% <28.57%> (-0.44%) ⬇️
superset/viz.py 72.04% <60%> (-0.2%) ⬇️
superset/connectors/sqla/models.py 77.91% <92%> (+0.07%) ⬆️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96f5106...9ea3c58. Read the comment docs.

@minh5 (Contributor) commented Sep 6, 2018

Hey @villebro I'm trying to run some tests and ran into some weird issues on Redshift. It's the same error I've been running into with Redshift aggregations.

2018-09-05 22:01:49,675:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://[email protected]:5439/testdb
2018-09-05 22:01:49,681:INFO:root:SELECT day AS __timestamp, COUNT(*) AS count
FROM test.sales_table
WHERE day >= '2018-08-29 00:00:00' AND day <= '2018-09-05 22:01:49' GROUP BY day ORDER BY count DESC
 LIMIT 10000
2018-09-05 22:01:49,692:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://[email protected]:5439/testdb
2018-09-05 22:01:51,754:DEBUG:root:[stats_logger] (incr) loaded_from_source
2018-09-05 22:01:51,791:INFO:werkzeug:127.0.0.1 - - [05/Sep/2018 22:01:51] "POST /superset/explore_json/ HTTP/1.1" 200 -
2018-09-05 22:01:51,826:DEBUG:root:[stats_logger] (incr) log
2018-09-05 22:01:51,829:INFO:werkzeug:127.0.0.1 - - [05/Sep/2018 22:01:51] "POST /superset/log/?slice_id=0 HTTP/1.1" 200 -
2018-09-05 22:02:02,313:DEBUG:root:[stats_logger] (incr) explore_json
2018-09-05 22:02:02,402:INFO:root:Cache key: a1a934775176c2934b028cad8f959e86
2018-09-05 22:02:02,404:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://cavagrill:[email protected]:5439/testdb
2018-09-05 22:02:02,413:INFO:root:SELECT day AS __timestamp, SUM(itemsales) AS "SUM(itemsales)"
FROM test.sales_table
WHERE day >= '2018-08-29 00:00:00' AND day <= '2018-09-05 22:02:02' GROUP BY day ORDER BY "SUM(itemsales)" DESC
 LIMIT 10000
2018-09-05 22:02:02,425:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://cavagrill:[email protected]:5439/testdb
2018-09-05 22:02:04,715:DEBUG:root:[stats_logger] (incr) loaded_from_source
2018-09-05 22:02:04,716:ERROR:root:'SUM(itemsales)'
Traceback (most recent call last):
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/views/core.py", line 1105, in generate_json
    payload = viz_obj.get_payload()
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 359, in get_payload
    payload['data'] = self.get_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1243, in get_data
    df = self.process_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1147, in process_data
    values=utils.get_metric_names(fd.get('metrics')))
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/frame.py", line 5303, in pivot_table
    margins_name=margins_name)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/reshape/pivot.py", line 61, in pivot_table
    raise KeyError(i)
KeyError: 'SUM(itemsales)'

Also here's a screen shot

[screenshot]

@villebro (Member, Author) commented Sep 6, 2018

Thanks @minh5 for testing, I'll make some adjustments and push through an update soon.

@villebro changed the title from "Add missing label quotes to sqla queries" to "[WIP] Add missing label quotes to sqla queries" (Sep 6, 2018)
@villebro (Member, Author) commented Sep 6, 2018

I read up on the Redshift dialect, can you give it another go @minh5?

@villebro changed the title from "[WIP] Add missing label quotes to sqla queries" to "[WIP] Improve column/alias handling for case insensitive engines" (Sep 6, 2018)
@minh5 (Contributor) commented Sep 7, 2018

I may not be doing this right, as I'm not too familiar with npm, but I got the same error:

2018-09-07 10:53:42,410:INFO:root:SELECT day AS __timestamp, SUM(itemsales) AS "SUM(itemsales)"
FROM test.sales_table
WHERE day >= '2018-07-20 00:00:00' AND day <= '2018-09-07 10:53:42' GROUP BY day ORDER BY "SUM(itemsales)" DESC
 LIMIT 10000
2018-09-07 10:53:42,434:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://testuser:[email protected]:5439/testdb
2018-09-07 10:53:43,575:DEBUG:root:[stats_logger] (incr) loaded_from_source
2018-09-07 10:53:43,576:ERROR:root:'SUM(itemsales)'
Traceback (most recent call last):
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/views/core.py", line 1105, in generate_json
    payload = viz_obj.get_payload()
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 359, in get_payload
    payload['data'] = self.get_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1243, in get_data
    df = self.process_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1147, in process_data
    values=utils.get_metric_names(fd.get('metrics')))
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/frame.py", line 5303, in pivot_table
    margins_name=margins_name)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/reshape/pivot.py", line 61, in pivot_table
    raise KeyError(i)
KeyError: 'SUM(itemsales)'

Then there's this error when I run npm run dev-server

WARNING in ./node_modules/luma.gl/dist/esm/webgl-context/create-headless-context.js
Module not found: Error: Can't resolve 'gl' in '/Users/minhmai/incubator-superset/superset/assets/node_modules/luma.gl/dist/esm/webgl-context'
 @ ./node_modules/luma.gl/dist/esm/webgl-context/create-headless-context.js
 @ ./node_modules/luma.gl/dist/esm/webgl-context/index.js
 @ ./node_modules/luma.gl/dist/esm/webgl/functions.js
 @ ./node_modules/luma.gl/dist/esm/index.js
 @ ./node_modules/@deck.gl/layers/dist/esm/line-layer/line-layer.js
 @ ./node_modules/@deck.gl/layers/dist/esm/index.js
 @ ./node_modules/deck.gl/dist/esm/index.js
 @ ./src/visualizations/deckgl/layers/geojson.jsx
 @ ./src/visualizations/index.js
 @ ./src/modules/AnnotationTypes.js
 @ ./src/chart/chartAction.js
 @ ./src/dashboard/containers/Dashboard.jsx
 @ ./src/dashboard/index.jsx
 @ multi (webpack)-dev-server/client?http://localhost:9000 (webpack)/hot/dev-server.js babel-polyfill ./src/dashboard/index.jsx

WARNING in ./node_modules/luma.gl/dist/esm/webgl-utils/webgl-types.js
Module not found: Error: Can't resolve 'gl/wrap' in '/Users/minhmai/incubator-superset/superset/assets/node_modules/luma.gl/dist/esm/webgl-utils'
 @ ./node_modules/luma.gl/dist/esm/webgl-utils/webgl-types.js
 @ ./node_modules/luma.gl/dist/esm/webgl-utils/index.js
 @ ./node_modules/luma.gl/dist/esm/webgl/functions.js
 @ ./node_modules/luma.gl/dist/esm/index.js
 @ ./node_modules/@deck.gl/layers/dist/esm/line-layer/line-layer.js
 @ ./node_modules/@deck.gl/layers/dist/esm/index.js
 @ ./node_modules/deck.gl/dist/esm/index.js
 @ ./src/visualizations/deckgl/layers/geojson.jsx
 @ ./src/visualizations/index.js
 @ ./src/modules/AnnotationTypes.js
 @ ./src/chart/chartAction.js
 @ ./src/dashboard/containers/Dashboard.jsx
 @ ./src/dashboard/index.jsx
 @ multi (webpack)-dev-server/client?http://localhost:9000 (webpack)/hot/dev-server.js babel-polyfill ./src/dashboard/index.jsx
ℹ 「wdm」: Compiled with warnings.

@villebro (Member, Author) commented Sep 7, 2018

@minh5 I'm also now getting some npm errors which I think are coming from master. Regarding testing on Redshift, I've spun up a Redshift cluster to make it easier to test, hoping to complete this feature during the weekend.

@villebro (Member, Author) commented Sep 7, 2018

One question @minh5: is SUM(itemsales) an adhoc metric or an "old school" metric (defined in the datasource)? And if you manually change the name to all-lowercase sum(itemsales), does the error go away?

@minh5 (Contributor) commented Sep 7, 2018

Yeah, right now the SUM is an "old school" metric. I used to get around this issue by changing SUM to sum or making some ad hoc modification similar to this.

Here's a screenshot with changing the metric name
[screenshot: 2018-09-07, 3:05 PM]

@villebro (Member, Author) commented Sep 7, 2018

In this branch adhoc metrics should now work, but not old school ones. By the looks of it making the old school ones work automatically is slightly more challenging, as they don't have a separate label to override.

@villebro (Member, Author) commented Sep 8, 2018

@minh5 Ready for another round of testing.

@minh5 (Contributor) commented Sep 9, 2018

Same error; the traceback is below. I'm pretty new to this, but I just want to make sure my dev environment is set up correctly: I'm running npm run dev-server, then superset runserver -d, and viewing my graph on port 9000. However, the traceback I see points into my virtualenv's pythonpath. Just wanted to make sure.

Traceback (most recent call last):
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/views/core.py", line 1105, in generate_json
    payload = viz_obj.get_payload()
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 359, in get_payload
    payload['data'] = self.get_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1243, in get_data
    df = self.process_data(df)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/superset/viz.py", line 1147, in process_data
    values=utils.get_metric_names(fd.get('metrics')))
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/frame.py", line 5303, in pivot_table
    margins_name=margins_name)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/pandas/core/reshape/pivot.py", line 61, in pivot_table
    raise KeyError(i)
KeyError: 'SUM(itemsales)'

@villebro (Member, Author) commented Sep 9, 2018

@minh5 Hmm, the stacktrace is referencing code that has changed in this branch. Are you sure you are using the timestamp_label branch? The npm part sounds right to me. Once this hopefully starts to work, can you also verify that the metric name shows up as SUM(itemsales) in the legend, not sum(itemsales) (lowercase)?

@minh5 (Contributor) commented Sep 11, 2018

Hey @villebro Got another error this time around

2018-09-11 18:04:30,711:ERROR:root:'tuple' object has no attribute 'lower'
Traceback (most recent call last):
  File "/Users/minhmai/incubator-superset/superset/viz.py", line 385, in get_df_payload
    df = self.get_df(query_obj)
  File "/Users/minhmai/incubator-superset/superset/viz.py", line 190, in get_df
    self.results = self.datasource.query(query_obj)
  File "/Users/minhmai/incubator-superset/superset/connectors/sqla/models.py", line 813, in query
    sql, mutated_labels = self.get_query_str(query_obj)
  File "/Users/minhmai/incubator-superset/superset/connectors/sqla/models.py", line 482, in get_query_str
    sql = self.database.compile_sqla_query(qry)
  File "/Users/minhmai/incubator-superset/superset/models/core.py", line 819, in compile_sqla_query
    compile_kwargs={'literal_binds': True},
  File "<string>", line 1, in <lambda>
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/elements.py", line 442, in compile
    return self._compiler(dialect, bind=bind, **kw)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/elements.py", line 448, in _compiler
    return dialect.statement_compiler(dialect, self, **kw)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 453, in __init__
    Compiled.__init__(self, dialect, statement, **kwargs)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 219, in __init__
    self.string = self.process(self.statement, **compile_kwargs)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 245, in process
    return obj._compiler_dispatch(self, **kwargs)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/visitors.py", line 81, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 1785, in visit_select
    for name, column in select._columns_plus_names
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 1785, in <listcomp>
    for name, column in select._columns_plus_names
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 1557, in _label_select_column
    **column_clause_args
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/visitors.py", line 81, in _compiler_dispatch
    return meth(self, **kw)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 684, in visit_label
    self.preparer.format_label(label, labelname)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 3089, in format_label
    return self.quote(name or label.name)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 3062, in quote
    if self._requires_quotes(ident):
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/sqlalchemy/sql/compiler.py", line 3033, in _requires_quotes
    lc_value = value.lower()
AttributeError: 'tuple' object has no attribute 'lower'

@villebro (Member, Author) commented

@minh5 Sorry, just putting on the finishing touches, I think I just got the last bugs sorted. Feeling pretty confident about this PR, but wouldn't be surprised if there are still some small typos lurking somewhere.

@mistercrunch (Member) commented

I'm not sure whether this is the right approach. There's a lot going on in here, and passing the mutated_labels dict around seems very hard to track and error-prone. Do we really need this reverse lookup?

@villebro (Member, Author) commented

@mistercrunch I agree that this might seem excessive, but let me explain the reasoning behind the changes:

The main argument is that viz.py should be independent of database type. In order to achieve this, any column names or labels in query_obj need to be in their original format. Let's assume we have a metric with the label SUM(col). In the case of Redshift and BigQuery this would cause the following:

  • Redshift: the resulting dataframe will contain a column called sum(col) (the name gets automatically lowercased by the database)
  • BigQuery: the query will fail unless the label is mutated.

Currently (in master) BigQuery solves this by replacing the parens with underscores, resulting in a column called SUM_col_. Aside from the mostly theoretical risk of collisions, this introduces a discrepancy between the dataframe and form_data/query_obj. Furthermore, the mapping from metric name to verbose name used by all charts (data.verbose_map in /connectors/base/models.py) doesn't work unless it is also made aware of the new mutated labels. The way I see it, there are two ways around this:

  1. Either the columns in the dataframe returned by /connectors/sqla/models.py need to be renamed to their original state prior to being passed to their respective Viz, or
  2. The Viz need to be aware that some labels have changed (in this case SUM_col_ actually refers to SUM(col)).

What this PR attempts to do is move from 2) to 1), i.e. encapsulate all SQLA specific logic in /connectors/sqla/models.py. In this proposal, that is done by passing around a dict that collects all mutated labels, and renaming the dataframe columns to their original state before the dataframe is returned. I agree that this looks clumsy, but it seemed like the best solution at the time; it can probably be refactored into something more maintainable/understandable. While this approach adds complexity to the SQLA model logic, it completely decouples Viz logic from the database backend, which I think is a good thing.
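The renaming in option 1) boils down to a reverse rename of the dataframe columns. A minimal pandas sketch, with hypothetical label values (the actual PR collects the mapping while building the query):

```python
import pandas as pd

# Hypothetical mapping from mutated label (as returned by the engine)
# back to the original label used in form_data/query_obj.
mutated_labels = {"SUM_col__3f2b": "SUM(col)"}

# Dataframe as returned by the database, with the mutated column name.
df = pd.DataFrame({"SUM_col__3f2b": [1, 2, 3]})

# Rename columns back to their original state before handing the
# dataframe to the Viz layer, keeping viz.py database-agnostic.
df = df.rename(columns=mutated_labels)
print(list(df.columns))  # ['SUM(col)']
```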

While it might appear excessively complicated, I think the heterogeneous nature of the SQLA ecosystem requires a lot of flexibility from the backend to conform to the quirks of each individual engine. However, if this still feels like the wrong approach I am open to suggestions.

@villebro (Member, Author) commented

Anyway, I'll park this for now. @minh5 the functionality should now be testable, would appreciate feedback on whether or not this works in your context.

@minh5 (Contributor) commented Sep 12, 2018

No problem, @villebro, I don't mind testing since I would really love for this bug to be ironed out. Right now I have a very hacky way of dealing with Redshift data, since my org only uses Redshift. However, running the latest version I ran into this:

2018-09-12 11:15:01,596:INFO:root:SELECT DATE_TRUNC('day', day) AT TIME ZONE 'UTC' AS __timestamp, SUM(itemsales) AS "sum(itemsales)"
FROM test.sales_table
WHERE day >= '2017-09-12 00:00:00' AND day <= '2018-09-12 00:00:00' GROUP BY DATE_TRUNC('day', day) AT TIME ZONE 'UTC' ORDER BY "sum(itemsales)" DESC
 LIMIT 10000
2018-09-12 11:15:01,622:INFO:root:Database.get_sqla_engine(). Masked URL: redshift+psycopg2://testuser:[email protected]:5439/testdb
2018-09-12 11:15:05,136:DEBUG:root:[stats_logger] (incr) loaded_from_source
2018-09-12 11:15:05,506:INFO:werkzeug:127.0.0.1 - - [12/Sep/2018 11:15:05] "POST /superset/explore_json/ HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/minhmai/incubator-superset/superset/models/core.py", line 1010, in wrapper
    value = f(*args, **kwargs)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/flask_appbuilder/security/decorators.py", line 52, in wraps
    return f(self, *args, **kwargs)
  File "/Users/minhmai/incubator-superset/superset/views/core.py", line 1180, in explore_json
    force=force)
  File "/Users/minhmai/incubator-superset/superset/views/core.py", line 1114, in generate_json
    return json_success(viz_obj.json_dumps(payload), status=status)
  File "/Users/minhmai/incubator-superset/superset/viz.py", line 444, in json_dumps
    sort_keys=sort_keys,
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/simplejson/__init__.py", line 399, in dumps
    **kw).encode(obj)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/minhmai/.pyenv/versions/3.4.3/envs/superset/lib/python3.4/site-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
  File "/Users/minhmai/incubator-superset/superset/utils.py", line 380, in json_int_dttm_ser
    obj = datetime_to_epoch(obj)
  File "/Users/minhmai/incubator-superset/superset/utils.py", line 366, in datetime_to_epoch
    return (dttm - epoch_with_tz).total_seconds() * 1000
  File "pandas/_libs/tslibs/timestamps.pyx", line 311, in pandas._libs.tslibs.timestamps._Timestamp.__sub__

TypeError: Timestamp subtraction must have the same timezones or no timezones
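The root cause here is mixing tz-aware and tz-naive timestamps in the datetime_to_epoch subtraction; the same failure can be reproduced with plain stdlib datetimes (pandas Timestamps behave the same way, with the message shown above):

```python
from datetime import datetime, timezone

aware = datetime(2018, 9, 12, tzinfo=timezone.utc)  # e.g. a tz-aware value from Redshift
naive = datetime(1970, 1, 1)                        # a tz-naive epoch reference

try:
    delta = aware - naive
except TypeError as exc:
    # Python refuses to mix offset-aware and offset-naive datetimes
    print("subtraction failed:", exc)
```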

@villebro (Member, Author) commented

@minh5 I'm thinking this error might be related to the timestamp with/without tz bug that's been reported by Redshift/Postgres users. Can you test other charts that don't have a time dimension, e.g. table viz and such, or plot the line chart without a time grain, which I think someone reported now works?

@mistercrunch (Member) commented

I think pushing that dict around is very error prone and super hard to reason about. It adds a layer on top of an already overloaded model and breaks all sorts of design patterns. Personally I think we need to go back to the design board on this one.

@JamshedRahman (Contributor) commented

@villebro Where is this replacement happening in master? Can you point me to the code please? I can't seem to find it. :-)

> Currently (in master) BigQuery solves this by replacing the parens with underscores, resulting in a column called SUM_col_.

@villebro (Member, Author) commented

@mistercrunch I think the label mutation logic is sound, but I do agree that the dict pushing is overly complicated. As the original code wasn't designed to handle this type of added complexity, some changes are probably inevitable, but they should be less invasive than what was proposed here. Will revert with a better proposal.

@JamshedRahman check the following lines: https://github.com/apache/incubator-superset/blob/7098ada8c5e241ba59b985478c1249da89b9b676/superset/db_engine_specs.py#L1384-L1394

@villebro (Member, Author) commented

@minh5 This WIP, together with the fix from #6453, should now make life easier on Redshift. This PR has been in use in production for a few months and has been tested to work very reliably with Snowflake. BigQuery and Redshift have also been tested to work, although not as extensively. I also don't see any reason why Oracle and DB2 shouldn't now work (along with any other quirky dialect/engine), but I haven't tested against them. If you have the time to give this a go I would be very thankful. If all is good, I can make a last thorough check of the code before submitting this as a SIP, as I'm sure it will need to be thoroughly reviewed.

@mistercrunch added the labels "reviewable" and "risk:hard-to-test" (Dec 11, 2018)
@mistercrunch (Member) commented

This is well needed and looks good to me on a first pass. Adding a label about this being a bit hard to test, though outside of the platforms where this is needed it should be straightforward.

@staticmethod
def mutate_label(label):
"""
Oracle 12.1 and earlier support a maximum of 30 byte length object names, which
A member commented:

OMG they finally fixed this in Oracle!!!!!!!!!!!!!!! I thought this day would never happen.

@staticmethod
def mutate_label(label):
"""
Db2 for z/OS supports a maximum of 30 byte length object names, which usually
A member commented:

OMG db2 too does this? Wow.

@mistercrunch (Member) left a review comment:

Did another pass and it LGTM. We may want another set of 👀 on this one though. Maybe @john-bodley or someone else from Airbnb

@villebro (Member, Author) commented

> This is well needed and looks good to me on a first pass. Adding a label about this being a bit hard to test, though outside of the platforms where this is needed it should be straightforward.

I have actually tested this semi-thoroughly on Snowflake, BigQuery, Redshift and Oracle, so the main functionality should work. But the devil is in the details. Will update the description to explain exactly what is going on and why, and change to a SIP today if needed.

@villebro changed the title from "[WIP] Improve column/alias handling for case insensitive engines" to "[SIP-10] Improve support for BigQuery, Redshift, Oracle, Db2, Snowflake" (Dec 11, 2018)
@villebro (Member, Author) commented

Thanks for reviewing @mistercrunch. I changed the title to a SIP to highlight that this is a fairly substantial change that might bring regressions with it. I also updated the original description (with pics!) to make it easier for @john-bodley or other reviewers to understand the reasoning behind the changes and see before/after.

@mistercrunch (Member) commented Dec 12, 2018

I'm thinking about merging this and shipping it as 0.31.x, which hasn't been cut yet. Let's maybe get a review from someone at Airbnb first, though.

@villebro changed the title from "[SIP-10] Improve support for BigQuery, Redshift, Oracle, Db2, Snowflake" to "Improve support for BigQuery, Redshift, Oracle, Db2, Snowflake" (Dec 18, 2018)
@villebro (Member, Author) commented

Ping @mistercrunch @john-bodley any chance of getting additional comments or merging this?

@mistercrunch merged commit 7ee8afb into apache:master (Jan 18, 2019)
@mistercrunch added the labels "🏷️ bot" and "🚢 0.34.0" (Feb 27, 2024)