-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sql lab] a better approach at limiting queries #4947
Conversation
Currently there are two mechanisms that we use to enforce the row limiting constraints, depending on the database engine: 1. use dbapi's `cursor.fetchmany()` 2. wrap the SQL into a limiting subquery Method 1 isn't great as it can result in the database server storing larger than required result sets in memory expecting another fetch command while we know we don't need that. Method 2 has a positive side of working with all database engines, whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy does the work as specified for the dialect. On the downside though the query optimizer might not be able to optimize this as much as an approach that doesn't use a subquery. Since most modern DBs use the LIMIT syntax, this adds a regex approach to modify the query and force a LIMIT clause without using a subquery for the database that support this syntax and uses method 2 for all others.
Codecov Report
@@ Coverage Diff @@
## master #4947 +/- ##
==========================================
+ Coverage 77.1% 77.13% +0.02%
==========================================
Files 44 44
Lines 8644 8649 +5
==========================================
+ Hits 6665 6671 +6
+ Misses 1979 1978 -1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I have a few comments.
superset/db_engine_specs.py
Outdated
) | ||
return database.compile_sqla_query(qry) | ||
elif LimitMethod.FORCE_LIMIT: | ||
no_limit = re.sub(r'(?i)\s+LIMIT\s+\d+;?(\s|;)*$', '', sql) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scary! :-O
BTW, you can use backreferences here to replace the LIMIT number in one pass (untested):
sql = re.sub(r'(?i)(.*\s+LIMIT\s+)\d+\s*;?\s*$', r'\1 {0}'.format(limit), sql)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I barely understand what I wrote here :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Yeah, I just realized my example does not work when there's no LIMIT clause. :-/
You know about verbose regexes? Eg:
no_limit = re.sub(r"""
(?ix) # case insensitive, verbose
\s+ # whitespace
LIMIT\s+\d+ # LIMIT $ROWS
;? # optional semi-colon
(\s|;)*$ # any number of trailing whitespace or semi-colons, til the end
""", '', sql)
Though in this case I think it's pretty straightforward. You might wanna rstrip
whitespace and semicolons from the end of sql
, that would simplify the regex, no?
superset/models/core.py
Outdated
).limit(limit) | ||
) | ||
return self.compile_sqla_query(qry) | ||
return self.db_engine_spec.apply_limit_to_sql(sql, limit, self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better update the method name here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, rename wrap_sql_limit
to something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
went with apply_limit_to_sql
tests/db_engine_specs_test.py
Outdated
db = Database(sqlalchemy_uri='mysql://localhost') | ||
limited = MySQLEngineSpec.apply_limit_to_sql(sql, 1000, db) | ||
expected = 'SELECT * FROM (SELECT * FROM a LIMIT 10) LIMIT 1000' | ||
self.assertEquals(expected, limited) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm nervous about the regular expression, I'd add a few more unit tests covering some weird cases, eg:
SELECT
'LIMIT 1000' AS a
, b
FROM
table
LIMIT
1000
<EOF>
And maybe some stuff with extra space at the end:
...
LIMIT 1000 ;
<EOF>
Addressed comments, merging |
@betodealmeida @mistercrunch are you sure we want to be augmenting a query that someone authors? This seems overly magical to me, i.e., if you believe you're running the SQL is a powerful language and it seems like educating users about how best to use SQL rather than adding somewhat opaque handrails. |
* [sql lab] a better approach at limiting queries Currently there are two mechanisms that we use to enforce the row limiting constraints, depending on the database engine: 1. use dbapi's `cursor.fetchmany()` 2. wrap the SQL into a limiting subquery Method 1 isn't great as it can result in the database server storing larger than required result sets in memory expecting another fetch command while we know we don't need that. Method 2 has a positive side of working with all database engines, whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy does the work as specified for the dialect. On the downside though the query optimizer might not be able to optimize this as much as an approach that doesn't use a subquery. Since most modern DBs use the LIMIT syntax, this adds a regex approach to modify the query and force a LIMIT clause without using a subquery for the database that support this syntax and uses method 2 for all others. * Fixing build * Fix lint * Added more tests * Fix tests
I think we agree that we need a way to prevent users from running infinite billion rows query that will hurt the db server, network, web server and ultimately crash their browser. Now we've been doing this two ways depending on the DB engine, I also looked at how other tools seem to be doing this, and they edit the user query much like we do here, likely for the reasons listed above. |
* [sql lab] a better approach at limiting queries Currently there are two mechanisms that we use to enforce the row limiting constraints, depending on the database engine: 1. use dbapi's `cursor.fetchmany()` 2. wrap the SQL into a limiting subquery Method 1 isn't great as it can result in the database server storing larger than required result sets in memory expecting another fetch command while we know we don't need that. Method 2 has a positive side of working with all database engines, whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy does the work as specified for the dialect. On the downside though the query optimizer might not be able to optimize this as much as an approach that doesn't use a subquery. Since most modern DBs use the LIMIT syntax, this adds a regex approach to modify the query and force a LIMIT clause without using a subquery for the database that support this syntax and uses method 2 for all others. * Fixing build * Fix lint * Added more tests * Fix tests
* [sql lab] a better approach at limiting queries Currently there are two mechanisms that we use to enforce the row limiting constraints, depending on the database engine: 1. use dbapi's `cursor.fetchmany()` 2. wrap the SQL into a limiting subquery Method 1 isn't great as it can result in the database server storing larger than required result sets in memory expecting another fetch command while we know we don't need that. Method 2 has a positive side of working with all database engines, whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy does the work as specified for the dialect. On the downside though the query optimizer might not be able to optimize this as much as an approach that doesn't use a subquery. Since most modern DBs use the LIMIT syntax, this adds a regex approach to modify the query and force a LIMIT clause without using a subquery for the database that support this syntax and uses method 2 for all others. * Fixing build * Fix lint * Added more tests * Fix tests
* [sql lab] a better approach at limiting queries Currently there are two mechanisms that we use to enforce the row limiting constraints, depending on the database engine: 1. use dbapi's `cursor.fetchmany()` 2. wrap the SQL into a limiting subquery Method 1 isn't great as it can result in the database server storing larger than required result sets in memory expecting another fetch command while we know we don't need that. Method 2 has a positive side of working with all database engines, whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy does the work as specified for the dialect. On the downside though the query optimizer might not be able to optimize this as much as an approach that doesn't use a subquery. Since most modern DBs use the LIMIT syntax, this adds a regex approach to modify the query and force a LIMIT clause without using a subquery for the database that support this syntax and uses method 2 for all others. * Fixing build * Fix lint * Added more tests * Fix tests
Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
cursor.fetchmany()
Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.
Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.
Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.