[sqllab] How can we make large Superset queries load faster? #4588

Closed

timifasubaa opened this issue Mar 9, 2018 · 8 comments
Labels
inactive Inactive for >= 30 days

Comments

@timifasubaa
Contributor

timifasubaa commented Mar 9, 2018

I'm looking into making large Superset queries load faster, and I'd like members of the community to share ideas here.
Users often run a long query for a slice and then wait a long time for tens of thousands of rows they never intend to look at. Before users can see any results, the whole query has to run, and there is usually a round trip to S3. This takes a really long time.

For inspiration, the Presto/Hive CLI returns almost immediately because it uses something like the less shell command to show results as soon as some rows are available.

There is a way to know whether any data has been loaded in handle_cursor:
(https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L617)

https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L185
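
For Presto, a rough sketch of what detecting "some rows are ready" could look like, assuming PyHive's Presto cursor and its poll() status payload; the host, table, loop structure, and the use of the processedRows field are illustrative, not Superset's actual handle_cursor code:

```python
import time

from pyhive import presto

# Hypothetical host/table names for illustration only.
cursor = presto.connect(host="presto.example.com", port=8080).cursor()
cursor.execute("SELECT * FROM some_large_table")

first_rows_ready = False
polled = cursor.poll()
while polled:
    stats = polled.get("stats", {})
    # As soon as Presto reports processed rows, a partial result could be
    # surfaced to SQL Lab instead of waiting for the full result set.
    if not first_rows_ready and stats.get("processedRows", 0) > 0:
        first_rows_ready = True
    time.sleep(1)
    polled = cursor.poll()
```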

The most basic idea is to turn every query into two queries: one with a small limit (100?) plus a "View more" button / loading icon so users don't wrongly assume that's all the results, while the actual full query keeps running.

I think we can do better than this starting idea. In particular, we shouldn't need 2 queries. Please share your thoughts.

@fabianmenges @hughhhh @john-bodley @michellethomas @mistercrunch @jeffreythewang

@mistercrunch
Member

mistercrunch commented Mar 12, 2018

So say a user runs a query that returns 1M rows. Now we know that they're not going to read those 1M rows like it's a book, and 1K is usually a decent sample. Now what are some legit things they might do that require 1M rows?

  • exporting the CSV to excel (different button)
  • searching client-side using the instant filtering table search functionality, which isn't a super-legit use-case since they could use a LIKE statement or something equivalent, though LIKE assumes you know the column. Anyhow. Returning 1M rows to full-text search a table isn't great.
  • infinite scrolling looking for patterns?
  • sort a column and use pagination?

In any case, it may be a good thing to always do 1K and allow them to push a button to get the whole set (and wait longer, maybe crash their browser). Perhaps the data table shows a message at the top: "showing first 1K rows, click here to load the entire data set".

I haven't looked at the implementation of fetchmany for Presto/Hive in PyHive, but in theory, given the DB-API spec, you should be able to fetchmany(1000) and then fetchmany(MAX_ROWS). Assuming all of this happens on the worker (async), you'd ship the 1K rows to a first results backend location and the whole set to another location later on (see the sketch after the list below). That implies the introduction of:

  • a new query state partially_saved
  • perhaps some db_engine_spec methods if the two_phase_save needs some specific handling on different engines
  • frontend logic that understands partially_saved, and exposes that to the user, allowing them to fetch the whole data set. CSV export should probably be greyed until the second phase is saved
  • configuration elements to turn this on and off (per db? per db engine as a BaseEngineSpec class attr?)
  • introduce 2 environment configuration settings PARTIAL_SAVE_ROW_COUNT, MAX_SAVE_ROW_COUNT or something equivalent?
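
A rough sketch of the two-phase save on the worker, assuming a DB-API cursor and a results backend with a set() method; the names (PARTIAL_SAVE_ROW_COUNT, MAX_SAVE_ROW_COUNT, partially_saved, partial_results_key) are the hypothetical ones proposed above, not existing Superset code:

```python
PARTIAL_SAVE_ROW_COUNT = 1000
MAX_SAVE_ROW_COUNT = 100000

def two_phase_save(cursor, query, results_backend, serialize):
    # Phase 1: grab a small sample and store it so the UI can render early.
    first_batch = cursor.fetchmany(PARTIAL_SAVE_ROW_COUNT)
    results_backend.set(query.partial_results_key, serialize(first_batch))
    query.status = "partially_saved"

    # Phase 2: keep fetching up to the hard cap, then store the full set.
    rest = cursor.fetchmany(MAX_SAVE_ROW_COUNT - PARTIAL_SAVE_ROW_COUNT)
    all_rows = first_batch + rest
    results_backend.set(query.results_key, serialize(all_rows))
    query.status = "success"
    return all_rows
```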

Oh and I wanted to make it clear that we absolutely should not run the query twice, that's just not right.

@yuha0

yuha0 commented Mar 19, 2018

I second this. I have a table with 2 million rows, and I have to remember to manually add a LIMIT to my query every time I want to try something new in SQL Lab. And with the limit, the generated visualization will also be limited, which is not ideal and makes it extremely difficult to test new visualizations.

@timifasubaa
Contributor Author

timifasubaa commented Mar 23, 2018

Some more ideas were shared at today's meeting.
One was having a separate flow for downloading query results as CSV.
Then, for non-CSV:

Always limit the query before sending it to the database (this should help reduce Presto load). The open question is that when the query is eventually run for a slice, it still runs the full query. Or do we want to limit that too?
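
A minimal sketch of limiting a query before it is sent, assuming the limit is enforced by wrapping the user's SQL in an outer query; Superset's engine specs handle this per dialect, so the function name and wrapping strategy here are illustrative only:

```python
def apply_default_limit(sql: str, limit: int = 1000) -> str:
    # Wrap the user's SQL so the database only ever returns `limit` rows.
    inner = sql.strip().rstrip(";")
    return f"SELECT * FROM ({inner}) AS limited_subquery LIMIT {limit}"

print(apply_default_limit("SELECT * FROM events"))
# SELECT * FROM (SELECT * FROM events) AS limited_subquery LIMIT 1000
```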

Another, potentially separate, idea is to respond early and display results on the frontend as soon as the database has the first 1000 rows ready (this can be known from the stats object returned from polling), then render the rest when the user clicks to see all.

We will have to have an opinion on how Superset should run queries and communicate that clearly in the interface, so that users, for example, don't think the first 1K rows are all the results there are.

@CoryChaplin
Contributor

Another use case for this is putting the query results in a cache so that dashboards load faster.
I'd love to have a way to schedule a cache refresh for queries that take 1 to 5 minutes (like the cache we have after first load) so that it's seamless for end users.

@rumbin
Contributor

rumbin commented Sep 1, 2018

@CoryChaplin you should already be able to achieve this by using the warm_up_cache endpoint. See #1063 for details.

The selection of the charts to refresh and the scheduling have to be implemented elsewhere, though (e.g., via Apache Airflow).
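
A hedged example of calling that endpoint with Python's requests; the host, authentication, and the exact query parameters depend on your deployment and Superset version, so verify against #1063 before relying on this:

```python
import requests

SUPERSET_URL = "https://superset.example.com"  # hypothetical host

session = requests.Session()
# ... authenticate the session against your Superset instance here ...

# Ask Superset to recompute and cache the query behind a given chart.
resp = session.get(
    f"{SUPERSET_URL}/superset/warm_up_cache",
    params={"slice_id": 42},  # illustrative chart id
)
resp.raise_for_status()
print(resp.json())
```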

@timifasubaa
Contributor Author

timifasubaa commented Oct 15, 2018

The new proposed approach is to have a limit in the UI that the user can see and configure. This UI limit will have some JavaScript validation to prevent the user from exceeding a particular value.
This value will be set to 10K by default (other querying tools use a similar default, but ours is more flexible).

The UI will always show at most 1K rows (also configurable), and only by exporting CSV will you be able to see more than the UI limit.
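
A sketch of what the two configurable values might look like in superset_config.py; the key names below are illustrative placeholders, not actual Superset settings, so check your version's config for the real keys:

```python
# superset_config.py (illustrative key names)

# Upper bound on the row limit a user can request in SQL Lab
# (the "10K by default" mentioned above).
SQLLAB_QUERY_ROW_LIMIT = 10000

# Maximum number of rows the results grid will actually render
# (the "at most 1K rows" mentioned above).
RESULTS_DISPLAY_ROW_LIMIT = 1000
```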

It will look something like the image below (from Jeff's PR #4941):
[image]

@stale

stale bot commented Apr 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 10, 2019
@stale stale bot closed this as completed Apr 17, 2019
@Rakesh-Online

Hi, I had a question regarding the Superset dashboard; it would be really helpful if you could share some insight. Is there any possibility of viewing the same Excel sheet structure in the dashboard for client reference? Exactly as you can see the 3 levels of data in Excel, I have to show it the same way in the dashboard. This is the client's requirement, so please help me out.
[screenshot: 2019-09-23 at 11:43:34 AM]
