[sqllab] How can we make large Superset queries load faster? #4588
So say a user runs a query that returns 1M rows. We know they're not going to read those 1M rows like a book, and 1K rows is usually a decent sample. What are some legitimate things they might do that actually require all 1M rows?
In any case, it may be a good idea to always return 1K rows and let them push a button to get the whole set (and wait longer, and maybe crash their browser). Perhaps the data table could show a message at the top; I haven't looked at the implementation.
Oh, and I want to make it clear that we absolutely should not run the query twice; that's just not right.
I second this. I have a table with 2 million rows and I have to remember to manually add a LIMIT clause.
Some more ideas were shared at today's meeting. One is always limiting the query before sending it to the database (this should help reduce Presto load); a sketch of that idea follows this comment. The open question is that when the query is eventually run for a slice, it still runs in full. Do we want to limit that too? Another, potentially separate, idea is to respond early and display results on the frontend as soon as the database has the first 1000 rows ready (this can be known from the stats object returned while polling), then render the rest when the user clicks to see all. We will have to form an opinion on how Superset should run queries and communicate it clearly in the interface, so that users don't, for example, think the first 1K rows are all the results there are.
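To make the "always limit before sending to the database" idea concrete, here is a minimal sketch. The function name and the subquery-wrapping strategy are assumptions for illustration, not Superset's actual code:

```python
# A minimal sketch of forcing a row limit before a query reaches the
# database. Wrapping in a subquery avoids having to parse the original
# statement for an existing LIMIT clause; most engines will push the
# limit down into the inner query anyway.

def apply_limit(sql: str, limit: int = 1000) -> str:
    """Wrap an arbitrary SELECT so a LIMIT always applies."""
    stripped = sql.strip().rstrip(";")
    return f"SELECT * FROM ({stripped}) AS limited_subquery LIMIT {limit}"


print(apply_limit("SELECT * FROM events WHERE ds = '2018-03-01'"))
# SELECT * FROM (SELECT * FROM events WHERE ds = '2018-03-01') AS limited_subquery LIMIT 1000
```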
Another use case for this is putting the query results in a cache so that dashboards load faster.
@CoryChaplin you should already be able to achieve this by using the warm_up_cache endpoint. See #1063 for details. The selection of the charts to refresh and the scheduling have to be implemented elsewhere, though (e.g., via Apache Airflow).
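For reference, a minimal sketch of driving that endpoint from a scheduled job (e.g., an Airflow task). The host, slice ids, and authentication are placeholders, and you should verify which parameters your deployment's warm_up_cache accepts:

```python
# A minimal sketch of warming chart caches on a schedule. Authentication
# is deployment-specific (cookies, tokens, etc.) and omitted here.
import requests

SUPERSET_URL = "https://superset.example.com"  # placeholder host
SLICE_IDS = [42, 43, 44]                       # charts backing a dashboard

session = requests.Session()

for slice_id in SLICE_IDS:
    # Hitting the endpoint recomputes and caches the chart's results.
    resp = session.get(
        f"{SUPERSET_URL}/superset/warm_up_cache/",
        params={"slice_id": slice_id},
        timeout=300,
    )
    resp.raise_for_status()
```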
The newly proposed approach is to have a limit in the UI that the user can see and configure. This UI limit will have some JavaScript validation to prevent the user from exceeding a particular maximum. The UI will always show at most 1K rows (also configurable), and only by exporting CSV will you be able to see more than the UI limit. It will look something like the mockup in Jeff's PR (#4941).
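The server would presumably enforce the same bounds, since client-side validation alone can be bypassed. A minimal sketch of that counterpart check; the constant names and values are hypothetical, not actual Superset configuration keys:

```python
# A minimal sketch of clamping a user-supplied row limit on the server,
# mirroring the client-side JavaScript validation.
from typing import Optional

DEFAULT_ROW_LIMIT = 1000   # rows rendered in the UI by default
MAX_ROW_LIMIT = 100000     # hard ceiling users may not exceed


def clamp_row_limit(requested: Optional[int]) -> int:
    """Fall back to the default and never exceed the configured maximum."""
    if not requested or requested <= 0:
        return DEFAULT_ROW_LIMIT
    return min(requested, MAX_ROW_LIMIT)


assert clamp_row_limit(None) == 1000
assert clamp_row_limit(50000) == 50000
assert clamp_row_limit(10000000) == 100000
```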
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm looking into the possibility of making large Superset queries load faster, and I'd like members of the community to share ideas here.
Many times, users run a long query for a slice and then wait a long time for tens of thousands of rows they never intend to look at. Before users can see any results, the whole query has to run, and there is usually a round trip to S3. This takes a really long time.
For inspiration, the Presto/Hive CLI returns almost immediately because it pipes output through something like the less shell command, showing results as soon as some rows are available. There is a way to know whether any data has been loaded in handle_cursor:
https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L617
https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L185
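As a concrete illustration of that hook, here is a minimal sketch of polling a PyHive Presto cursor and detecting that the first rows exist. The host is a placeholder, and the exact shape of the stats dict (e.g., the processedRows key) depends on the Presto version, so treat those as assumptions:

```python
# A minimal sketch in the spirit of handle_cursor: poll the Presto REST
# stats until the query finishes, and notice as soon as any rows exist
# server-side, at which point the frontend could be notified early.
import time

from pyhive import presto  # pip install pyhive

cursor = presto.connect(host="presto.example.com", port=8080).cursor()
cursor.execute("SELECT * FROM events LIMIT 100000")

polled = cursor.poll()  # returns the raw REST status, or None when done
while polled:
    stats = polled.get("stats", {})
    if stats.get("processedRows", 0) > 0:
        print("first rows are available, state:", stats.get("state"))
    time.sleep(1)
    polled = cursor.poll()

rows = cursor.fetchall()
```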
The most basic idea is to turn every query into two queries: one with a small limit (100?) plus a "View more" button / loading icon, so users don't wrongly assume that's all the results, while the actual full query keeps running.
I think we can do better than this starting idea. In particular, we shouldn't need 2 queries. Please share your thoughts.
@fabianmenges @hughhhh @john-bodley @michellethomas @mistercrunch @jeffreythewang
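One possible single-query shape, as a minimal sketch: fetch and return the first page immediately, then drain the remaining rows on a background thread. The names and the threading approach are illustrative only, not a proposed implementation; whether the remainder is kept in memory, spooled to disk, or streamed to the client is a separate design decision.

```python
# A minimal sketch of serving the first rows from a single query while
# the rest keeps downloading in the background, so no second query is
# needed and the database does the work only once.
import threading

PAGE_SIZE = 1000


def run_query(cursor, sql):
    cursor.execute(sql)

    # Return the first page to the frontend immediately...
    first_page = cursor.fetchmany(PAGE_SIZE)

    remainder = []

    def drain():
        # ...while the remaining rows stream in on a worker thread.
        remainder.extend(cursor.fetchall())

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()
    return first_page, remainder, worker
```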