Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: druid double percent #28270

Closed
wants to merge 1 commit into from
Closed

fix: druid double percent #28270

wants to merge 1 commit into from

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Apr 29, 2024

SUMMARY

Some DB API 2.0 drivers use pyformat for paramstyle. This means that queries should be parameterized with %s for the placeholders, like this:

cursor.execute("SELECT * FROM t WHERE role = '%s'", ("engineer",))

The driver then performs "old-school" string interpolation:

sql = "SELECT * FROM t WHERE role = '%s'"
parameters = ("engineer",)
escaped_parameters = escape_parameters(parameters)  # important to prevent SQL injection
final_query = sql % escaped_parameters

Because of the percent interpolation, when SQL compiles a query for a database that uses pyformat or format, any percent symbols in the query are escaped by being replaced with %%:

https://github.com/sqlalchemy/sqlalchemy/blob/6888cf79db79d5e5660300ccf2a2a91f1eecf75f/lib/sqlalchemy/sql/compiler.py#L2652-L2653

For some reason we undo that process (introduced in #5178):

if engine.dialect.identifier_preparer._double_percents: # noqa
sql = sql.replace("%%", "%")

The code above doesn't make sense. If SQLAlchemy is replacing % with %% for databases where dialect.identifier_preparer._double_percents is true, why would we reverse it when we compile the query for the same databases?

One clue can be found in another codepath were we compile the query, but that replacement is missing. In values_for_column:

sql = qry.compile(engine, compile_kwargs={"literal_binds": True})
sql = self._apply_cte(sql, cte)
sql = self.database.mutate_sql_based_on_config(sql)
df = pd.read_sql_query(sql=sql, con=engine)

@Vitor-Avila noticed that here, when the column is a calculated column containing a percent symbol, like:

case when column like '%a%' then 1 else 0 end

Then the generated SQL being sent to the database is:

case when column like '%%a%%' then 1 else 0 end

Note that the query above is completely valid and syntactically equivalent to the original one, so everything works as expected when we run it... except that in Druid, the query performs extremely poorly, compared to the one with a single percent. This suggests that the fix is to add sql = sql.replace("%%", "%") to the values_for_column method as well.

But again... why are we undoing what SQLAlchemy is doing?

Looking deeper into the problem, I found a bug in pydruid:

https://github.com/druid-io/pydruid/blob/1d72d26c3e14bc9a7c6725dfa877c98a7afbe6f3/pydruid/db/api.py#L430-L435

Note that in the code above, when no parameters are passed to execute — which is the case when Superset calls the method — the string interpolation never happens, because the SQL is returned early! This means that any escaped percent symbols (%%) will not be unescaped to %.

I've fixed pydruid in druid-io/pydruid#317, and made a new release. This PR bumps the version to the fixed one, and removes the sql = sql.replace("%%", "%") logic completely. This way, when double percents are passed to pydruid, they will be unescaped by the driver, as expected.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

I tested with Postgres, since psycopg2 uses pyformat. Queries run as expected:

Screenshot 2024-04-29 at 16-42-51 Superset

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@betodealmeida betodealmeida marked this pull request as draft April 29, 2024 20:52
@betodealmeida
Copy link
Member Author

Ugh, reading #5178 again, it seems that this behavior is desirable for MySQL/Preseto/Hive. Will revisit this.

@mistercrunch mistercrunch deleted the fix-druid-percent branch November 25, 2024 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant