fix: druid double percent #28270

betodealmeida · 2024-04-29T20:32:14Z

SUMMARY

Some DB API 2.0 drivers use format or pyformat for paramstyle. This means that queries should be parameterized with %s placeholders (or %(name)s for pyformat), like this:

cursor.execute("SELECT * FROM t WHERE role = %s", ("engineer",))

The driver then performs "old-school" string interpolation:

sql = "SELECT * FROM t WHERE role = %s"
parameters = ("engineer",)
escaped_parameters = escape_parameters(parameters)  # escapes and quotes each value; important to prevent SQL injection
final_query = sql % escaped_parameters
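A minimal runnable sketch of the interpolation step described above. `escape_parameter` and `interpolate` are hypothetical helpers written for illustration (only strings and numbers are handled, and quoting is done by doubling single quotes); real drivers ship their own escaping. The placeholder is left unquoted because the helper adds the quotes itself.

```python
def escape_parameter(value):
    """Hypothetical escaping helper: quote strings, pass numbers through."""
    if isinstance(value, str):
        # Double any single quote so the value cannot close the SQL literal.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError(f"unsupported parameter type: {type(value).__name__}")


def interpolate(sql, parameters):
    """Mimic a format-style driver: escape each parameter, then apply %."""
    escaped = tuple(escape_parameter(p) for p in parameters)
    return sql % escaped


final_query = interpolate("SELECT * FROM t WHERE role = %s", ("engineer",))
injected = interpolate("SELECT * FROM t WHERE role = %s", ("x' OR '1'='1",))
```

With these helpers, `final_query` ends in `role = 'engineer'`, and the injection attempt stays inside a single quoted literal instead of changing the WHERE clause.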

Because of the percent interpolation, when SQLAlchemy compiles a query for a database that uses pyformat or format, any percent symbols in the query are escaped by being replaced with %%:

https://github.com/sqlalchemy/sqlalchemy/blob/6888cf79db79d5e5660300ccf2a2a91f1eecf75f/lib/sqlalchemy/sql/compiler.py#L2652-L2653
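To see why the doubling is needed, here is a small sketch in plain Python. It is not SQLAlchemy's code: `double_percents` is a hypothetical stand-in for the escaping step, and the already-quoted `'engineer'` value is assumed to come from the driver's own escaping.

```python
def double_percents(text):
    """Hypothetical stand-in for the compiler step: escape literal % as %%."""
    return text.replace("%", "%%")


literal = "name LIKE '%a%'"
compiled = double_percents(literal) + " AND role = %s"

# The driver's % interpolation turns every %% back into a single %.
sent_to_database = compiled % ("'engineer'",)

# Without the doubling, the literal % is misread as a format directive.
try:
    (literal + " AND role = %s") % ("'engineer'",)
    failed = False
except (TypeError, ValueError):
    failed = True
```

The doubled form survives interpolation and reaches the database with single percents, while the undoubled form raises before any query is sent.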

For some reason we undo that process (introduced in #5178):

superset/superset/models/core.py, lines 668 to 669 in 76d897e:

if engine.dialect.identifier_preparer._double_percents:  # noqa
    sql = sql.replace("%%", "%")

The code above doesn't make sense. If SQLAlchemy is replacing % with %% for databases where dialect.identifier_preparer._double_percents is true, why would we reverse it when we compile the query for the same databases?

One clue can be found in another codepath where we compile the query, but where that replacement is missing. In values_for_column:

superset/superset/models/helpers.py, lines 1380 to 1384 in 44690fb:

sql = qry.compile(engine, compile_kwargs={"literal_binds": True})
sql = self._apply_cte(sql, cte)
sql = self.database.mutate_sql_based_on_config(sql)

df = pd.read_sql_query(sql=sql, con=engine)

@Vitor-Avila noticed that here, when the column is a calculated column containing a percent symbol, like:

case when column like '%a%' then 1 else 0 end

Then the generated SQL being sent to the database is:

case when column like '%%a%%' then 1 else 0 end

Note that the query above is completely valid and semantically equivalent to the original one, so everything works as expected when we run it... except that in Druid, the query performs extremely poorly compared to the one with a single percent. This suggests that the fix is to add sql = sql.replace("%%", "%") to the values_for_column method as well.
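The equivalence of the two LIKE patterns can be checked with a quick sketch. It uses an in-memory SQLite database purely as a stand-in (an assumption: SQLite follows the usual LIKE wildcard semantics), and the table `t` and its rows are made up for illustration. It says nothing about the Druid performance difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("banana",), ("cherry",), ("a",), ("",)],
)

# sqlite3 uses qmark placeholders, so % is never interpolated by the driver
# and both patterns reach the database exactly as written.
single = conn.execute(
    "SELECT col FROM t WHERE col LIKE '%a%' ORDER BY col"
).fetchall()
double = conn.execute(
    "SELECT col FROM t WHERE col LIKE '%%a%%' ORDER BY col"
).fetchall()
conn.close()
```

Both queries return the same rows, since two consecutive wildcards match exactly what one does.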

But again... why are we undoing what SQLAlchemy is doing?

Looking deeper into the problem, I found a bug in pydruid:

https://github.com/druid-io/pydruid/blob/1d72d26c3e14bc9a7c6725dfa877c98a7afbe6f3/pydruid/db/api.py#L430-L435

Note that in the code above, when no parameters are passed to execute (which is the case when Superset calls the method), the string interpolation never happens, because the SQL is returned early! This means that any escaped percent symbols (%%) will not be unescaped to %.
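The behavior described above can be sketched as follows. These two functions are illustrative only and are not copied from pydruid: the names are hypothetical, and parameter escaping is omitted to keep the focus on the early return.

```python
def apply_parameters_early_return(operation, parameters):
    """Sketch of the reported behavior: skip % when there are no parameters."""
    if not parameters:
        return operation  # any %% in the SQL is left untouched
    return operation % parameters


def apply_parameters_always_interpolate(operation, parameters):
    """Sketch of the expected behavior: always apply %, so %% becomes %."""
    return operation % (parameters or {})


compiled = "CASE WHEN col LIKE '%%a%%' THEN 1 ELSE 0 END"
buggy = apply_parameters_early_return(compiled, None)
fixed = apply_parameters_always_interpolate(compiled, None)
```

In this sketch `buggy` still contains `'%%a%%'`, while `fixed` contains `'%a%'`, which is what the compiler's escaping assumes the driver will produce.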

I've fixed pydruid in druid-io/pydruid#317, and made a new release. This PR bumps the version to the fixed one, and removes the sql = sql.replace("%%", "%") logic completely. This way, when double percents are passed to pydruid, they will be unescaped by the driver, as expected.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

I tested with Postgres, since psycopg2 uses pyformat. Queries run as expected.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

betodealmeida · 2024-04-29T20:59:34Z

Ugh, reading #5178 again, it seems that this behavior is desirable for MySQL/Presto/Hive. Will revisit this.

fix: druid double percent

9f97adf

pull-request-size bot added the size/XS label Apr 29, 2024

betodealmeida marked this pull request as draft April 29, 2024 20:52

github-actions bot added the preset-io label Apr 29, 2024

betodealmeida closed this Apr 29, 2024

mistercrunch deleted the fix-druid-percent branch November 25, 2024 06:38
