[sqllab] How can we make large Superset queries load faster? #4588

Closed

timifasubaa opened this issue Mar 9, 2018 · 8 comments
Labels
inactive Inactive for >= 30 days

Comments

@timifasubaa
Contributor

timifasubaa commented Mar 9, 2018

I'm looking into making large Superset queries load faster, and I'd like members of the community to share ideas here.
Users often run a long query for a slice and then wait a long time for tens of thousands of rows they never intend to look at. Before users can see any results, the whole query has to run, and there is usually a round trip to S3. This takes a really long time.

For inspiration, the Presto/Hive CLI returns almost immediately because it uses something like the less shell command to show results as soon as some rows are available.

There is a way to know whether any data has been loaded in handle_cursor:
(https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L617)

https://github.com/apache/incubator-superset/blob/31a995714df49e55ff69474378845fd8af605d4b/superset/db_engine_specs.py#L185
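
For Presto, a rough sketch of what detecting "some rows are ready" could look like, assuming PyHive's Presto cursor and its poll() status payload; the host, table, loop structure, and the use of the processedRows field are illustrative, not Superset's actual handle_cursor code:

```python
import time

from pyhive import presto

# Hypothetical host/table names for illustration only.
cursor = presto.connect(host="presto.example.com", port=8080).cursor()
cursor.execute("SELECT * FROM some_large_table")

first_rows_ready = False
polled = cursor.poll()
while polled:
    stats = polled.get("stats", {})
    # As soon as Presto reports processed rows, a partial result could be
    # surfaced to SQL Lab instead of waiting for the full result set.
    if not first_rows_ready and stats.get("processedRows", 0) > 0:
        first_rows_ready = True
    time.sleep(1)
    polled = cursor.poll()
```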

The most basic idea is to turn every query into two queries: one with a small limit (100?) plus a "View more" button / loading icon so users don't wrongly assume that's all the results, while the actual full query keeps running.

I think we can do better than this starting idea. In particular, we shouldn't need 2 queries. Please share your thoughts.

@fabianmenges @hughhhh @john-bodley @michellethomas @mistercrunch @jeffreythewang

@mistercrunch
Member

mistercrunch commented Mar 12, 2018

So say a user runs a query that returns 1M rows. Now we know that they're not going to read those 1M rows like it's a book, and 1K is usually a decent sample. Now what are some legit things they might do that require 1M rows?

  • exporting the CSV to excel (different button)
  • searching client-side using the instant filtering table search functionality, which isn't a super-legit use-case since they could use a LIKE statement or something equivalent, though LIKE assumes you know the column. Anyhow. Returning 1M rows to full-text search a table isn't great.
  • infinite scrolling looking for patterns?
  • sort a column and use pagination?

In any case, it may be a good thing to always do 1K and allow them to push a button to get the whole set (and wait longer, maybe crash their browser). Perhaps the data table shows a message at the top: "showing first 1K rows, click here to load the entire data set".

I haven't looked at the implementation of fetchmany for Presto/Hive in PyHive, but in theory, given the DB-API spec, you should be able to fetchmany(1000) and then fetchmany(MAX_ROWS). Assuming all of this happens on the worker (async), you'd ship the 1K rows to a first results backend location and the whole set to another location later on (see the sketch after the list below). That implies the introduction of:

  • a new query state partially_saved
  • perhaps some db_engine_spec methods if the two_phase_save needs some specific handling on different engines
  • frontend logic that understands partially_saved, and exposes that to the user, allowing them to fetch the whole data set. CSV export should probably be greyed until the second phase is saved
  • configuration elements to turn this on and off (per db? per db engine as a BaseEngineSpec class attr?)
  • introduce 2 environment configuration settings PARTIAL_SAVE_ROW_COUNT, MAX_SAVE_ROW_COUNT or something equivalent?
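
A rough sketch of the two-phase save on the worker, assuming a DB-API cursor and a results backend with a set() method; the names (PARTIAL_SAVE_ROW_COUNT, MAX_SAVE_ROW_COUNT, partially_saved, partial_results_key) are the hypothetical ones proposed above, not existing Superset code:

```python
PARTIAL_SAVE_ROW_COUNT = 1000
MAX_SAVE_ROW_COUNT = 100000

def two_phase_save(cursor, query, results_backend, serialize):
    # Phase 1: grab a small sample and store it so the UI can render early.
    first_batch = cursor.fetchmany(PARTIAL_SAVE_ROW_COUNT)
    results_backend.set(query.partial_results_key, serialize(first_batch))
    query.status = "partially_saved"

    # Phase 2: keep fetching up to the hard cap, then store the full set.
    rest = cursor.fetchmany(MAX_SAVE_ROW_COUNT - PARTIAL_SAVE_ROW_COUNT)
    all_rows = first_batch + rest
    results_backend.set(query.results_key, serialize(all_rows))
    query.status = "success"
    return all_rows
```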

Oh and I wanted to make it clear that we absolutely should not run the query twice, that's just not right.

@yuha0

yuha0 commented Mar 19, 2018

I second this. I have a table with 2 million rows, and I have to remember to manually add a LIMIT to my query every time I want to try something new in SQL Lab. And with the limit, the generated visualization will also be limited, which is not ideal and makes it extremely difficult to test new visualizations.

@timifasubaa
Contributor Author

timifasubaa commented Mar 23, 2018

Some more ideas were shared at today's meeting.
One was having a separate flow for downloading query results as CSV.
Then, for non-CSV:

Always limit the query before sending it to the database (this should help reduce Presto load). The open question is that when the query is eventually run for a slice, it still runs the full query. Or do we want to limit that too?
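
A minimal sketch of limiting a query before it is sent, assuming the limit is enforced by wrapping the user's SQL in an outer query; Superset's engine specs handle this per dialect, so the function name and wrapping strategy here are illustrative only:

```python
def apply_default_limit(sql: str, limit: int = 1000) -> str:
    # Wrap the user's SQL so the database only ever returns `limit` rows.
    inner = sql.strip().rstrip(";")
    return f"SELECT * FROM ({inner}) AS limited_subquery LIMIT {limit}"

print(apply_default_limit("SELECT * FROM events"))
# SELECT * FROM (SELECT * FROM events) AS limited_subquery LIMIT 1000
```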

Another, potentially separate, idea is to respond early and display results on the frontend as soon as the database has the first 1000 rows ready (this can be known from the stats object returned from polling), then render the rest when the user clicks to see all.

We will have to have an opinion on how Superset should run queries and communicate that clearly in the interface, so that users, for example, don't think the first 1K rows are all the results there are.

@CoryChaplin
Contributor

Another use case for this is putting the query results in a cache so that dashboards load faster.
I'd love to have a way to schedule a cache refresh for queries that take 1 to 5 minutes (like the cache we have after first load) so that it's seamless for end users.

@rumbin
Contributor

rumbin commented Sep 1, 2018

@CoryChaplin you should already be able to achieve this by using the warm_up_cache endpoint. See #1063 for details.

The selection of the charts to refresh and the scheduling have to be implemented elsewhere, though (e.g., via Apache Airflow).
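
A hedged example of calling that endpoint with Python's requests; the host, authentication, and the exact query parameters depend on your deployment and Superset version, so verify against #1063 before relying on this:

```python
import requests

SUPERSET_URL = "https://superset.example.com"  # hypothetical host

session = requests.Session()
# ... authenticate the session against your Superset instance here ...

# Ask Superset to recompute and cache the query behind a given chart.
resp = session.get(
    f"{SUPERSET_URL}/superset/warm_up_cache",
    params={"slice_id": 42},  # illustrative chart id
)
resp.raise_for_status()
print(resp.json())
```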

@timifasubaa
Contributor Author

timifasubaa commented Oct 15, 2018

The new proposed approach is to have a limit in the UI that the user can see and configure. This UI limit will have some JavaScript validation to prevent the user from exceeding a particular value.
This value will be set to 10K by default (other querying tools use a similar default, but ours is more flexible).

The UI will always show at most 1K rows (also configurable), and only by exporting CSV will you be able to see more than the UI limit.
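
A sketch of what the two configurable values might look like in superset_config.py; the key names below are illustrative placeholders, not actual Superset settings, so check your version's config for the real keys:

```python
# superset_config.py (illustrative key names)

# Upper bound on the row limit a user can request in SQL Lab
# (the "10K by default" mentioned above).
SQLLAB_QUERY_ROW_LIMIT = 10000

# Maximum number of rows the results grid will actually render
# (the "at most 1K rows" mentioned above).
RESULTS_DISPLAY_ROW_LIMIT = 1000
```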

It will look something like the image below (from Jeff's PR #4941):
[image]

@stale

stale bot commented Apr 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the inactive Inactive for >= 30 days label Apr 10, 2019
@stale stale bot closed this as completed Apr 17, 2019
@Rakesh-Online

Hi, I had a question regarding the Superset dashboard; it would be really helpful if you could share some insight. Is there any possibility of viewing the same Excel sheet structure in the dashboard for client reference? Exactly as you can see the 3 levels of data in Excel, I have to show it the same way in the dashboard. This is the client's requirement, so please help me out.
[screenshot: 2019-09-23 at 11:43:34 AM]
