distsql: tracking issue for queries we expect to run through DistSQL #14288

Closed · 11 tasks done
rjnn opened this issue Mar 21, 2017 · 31 comments

@rjnn (Contributor) commented Mar 21, 2017

This is a TODO list of queries that we expect to run through DistSQL (in auto mode) by 1.0. Feel free to add as needed, cc @andreimatei @RaduBerinde @cuongdo @asubiotto.

  • SELECT COUNT(*) should always run through DistSQL for speed reasons. It is a common operation just after loading a bunch of data. It is currently exceedingly slow on large tables (without DistSQL). @arjunravinarayan
  • SELECT COUNT(DISTINCT column_name) @arjunravinarayan
  • Distributed Aggregations over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • Large scans with a very sparse WHERE clause. @arjunravinarayan
  • TPC-H queries that we support (see distsql: tracking issue for TPC-H queries #14295) @asubiotto
    • Queries 9 and 15 don't run through DistSQL, see issue for more info.
  • Basic join queries. For some helpful query ideas, the join order benchmark provides plenty of samples to try with DistSQL, even if we won't be optimally efficient on these for a while. @arjunravinarayan
  • Distributed sorting over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • LIMIT queries, particularly limits after JOINs that would cause large amounts of state to stream across machines. @arjunravinarayan
  • queries using MIN @asubiotto
  • queries using MAX @asubiotto
  • queries using AVG (see distsql: support local aggregation in STDDEV, VARIANCE #14351) @asubiotto

If there are any queries that we want turned on by 1.0, please add them to the list above.

Do not check off an item as done without adding a comment/issue tracking the queries actually attempted on a cluster. Try to report running times and EXPLAIN(query) output to show the DistSQL plan.
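
For anyone reproducing these runs: the pattern used throughout this thread is to toggle DistSQL per session, compare timings, and capture the plan. A minimal sketch, assuming the tpch database used in the comments below (EXPLAIN output format varies by version):

-- Compare the same query with DistSQL off and on (per-session setting).
SET distsql = off;
SELECT COUNT(*) FROM tpch.lineitem;

SET distsql = on;
SELECT COUNT(*) FROM tpch.lineitem;

-- Capture the plan for the report.
EXPLAIN SELECT COUNT(*) FROM tpch.lineitem;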

rjnn added this to the 1.0 milestone Mar 21, 2017
@asubiotto (Contributor) commented Mar 30, 2017

Ran

SELECT MIN(l_extendedprice) FROM lineitem

on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.028s | 9.520s |

@asubiotto (Contributor) commented Mar 30, 2017

Ran

SELECT MAX(l_extendedprice) FROM lineitem

on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.821s | 9.481s |

@tamird (Contributor) commented Mar 30, 2017

Hm, that's much slower than I'd have expected. Isn't DistSQL able to reduce network traffic on these queries to O(1)?

@RaduBerinde (Member):

Yes, the network traffic is O(1) in these plans. Note that these are local clusters (all nodes on the same machine) so network traffic is not really network traffic.
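
For intuition: the distributed plan computes a partial MIN on each node and a final MIN on the gateway, so only one row per node crosses the wire. A hand-written sketch of that two-stage shape (illustration only; DistSQL performs this decomposition internally, and the GROUP BY below merely stands in for the per-node split of ranges):

-- Stage 1: partial MINs per group (per node, conceptually).
-- Stage 2: final MIN over the partial results.
SELECT MIN(m)
FROM (
    SELECT MIN(l_extendedprice) AS m
    FROM lineitem
    GROUP BY l_shipmode
) AS partials;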

@tamird (Contributor) commented Mar 30, 2017 via email

@RaduBerinde (Member):

@danhhz mentioned a SELECT COUNT(*) experiment on a real 3-node cluster (I think it was ~200M rows; SELECT COUNT(*) took 7m33s with DistSQL, compared to over an hour without).

@tamird (Contributor) commented Mar 30, 2017 via email

@danhhz (Contributor) commented Mar 30, 2017

Indeed, it was the production lapis cluster.

@asubiotto (Contributor):

Ran

SELECT COUNT(*) FROM lineitem

on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 36.707s | 12.350s |

@asubiotto (Contributor) commented Apr 20, 2017

Ran

SELECT AVG(l_extendedprice) FROM lineitem

with f97a5c3 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 46.255s | 18.560s |

@rjnn (Contributor, Author) commented Apr 20, 2017

Can you report the cockroach sha you used for future reference?

@petermattis (Collaborator):

@asubiotto The ranges for the lineitem table are only spread across 4 nodes? Or is there another reason the query plan only has 4 TableReaders?

@rjnn (Contributor, Author) commented Apr 20, 2017

Is there a quick way to find this out? Last time I asked, there wasn't a clean way to determine, given a table, which ranges it was spread across.

@rjnn (Contributor, Author) commented Apr 20, 2017

Admin UI says there are 74 ranges for table lineitem (but I can't figure out how to find range IDs or node information for those ranges), so it being spread across just 4 nodes seems strange (but possible).

@petermattis (Collaborator):

SHOW TESTING_RANGES FROM TABLE <table-name>
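
For example, against the table under discussion (a sketch; this testing statement and its output columns are version-dependent):

SHOW TESTING_RANGES FROM TABLE tpch.lineitem;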

@rjnn (Contributor, Author) commented Apr 20, 2017

Thank you @petermattis. It appears there are replicas and lease holders on every node, so let me investigate.
Also, there are 24 ranges, for 72 replicas, so clearly the Admin UI is buggy when it says there are 74 ranges. I will file an issue.

@asubiotto (Contributor):

I updated the execution plan. I think the gateway node probably had a stale cache. The query plan now shows only 5 TableReaders, which is still odd.

@asubiotto (Contributor) commented Apr 20, 2017

Actually, node 1 doesn't seem to be the lease holder for any range in lineitem.

@rjnn (Contributor, Author) commented Apr 20, 2017

Sure, but it was showing 4 before, and that was certainly incorrect. cc @andreimatei.

@andreimatei (Contributor):

> Sure, but it was showing 4 before, and that was certainly incorrect.

The range-descriptor and leaseholder caches can be empty or stale. This explains it, right?

The state of the caches is supposed to be visible in the ranges_cached internal table, but I think this hasn't been implemented yet.

@asubiotto (Contributor):

All these queries were run using 4129fe0 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I started by trying to run

SELECT * FROM lineitem ORDER BY l_extendedprice

on the 6 million row (~2.5GiB) lineitem table, but this failed (see #15332; note that this query no longer causes an OOM crash with 4129fe0).

I then moved down to orders, a 1.5 million row (~600MiB) table, and ran

SELECT * FROM orders ORDER BY o_totalprice

The query runs correctly, the execution plan looks good (note that nodes 1 and 3 aren't leaseholders for any of orders' ranges), and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 3m51s | 3m34s |

I also ran:

SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10

to avoid running out of memory on the gateway node (execution plan here). The times for one run are as follows:

| Without DistSQL | With DistSQL |
| --- | --- |
| 1m45s | 15.389s |

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757cbc0 on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real	0m29.891s
user	0m0.028s
sys	0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real	0m9.481s
user	0m0.044s
sys	0m0.008s

@rjnn (Contributor, Author) commented May 1, 2017

I forgot to include the execution plan, which looks good.

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757cbc0 on a 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

This query was constructed to have a sparse WHERE clause and really nothing else.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_on

real	0m12.068s
user	0m0.072s
sys	0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_off

real	0m49.482s
user	0m0.076s
sys	0m0.008s


The execution plan looks good.

The results are correct:

cockroach@cockroach-navy-0006:~$ sort foo_on > foo_on_sorted
cockroach@cockroach-navy-0006:~$ sort foo_off > foo_off_sorted
cockroach@cockroach-navy-0006:~$ diff foo_off_sorted foo_on_sorted
cockroach@cockroach-navy-0006:~$ wc -l foo_off
3082 foo_off
cockroach@cockroach-navy-0006:~$ wc -l foo_on
3082 foo_on

I did not add an ORDER BY or a COUNT operation, to keep the number of moving parts down; when they are added, the runtimes are similar (see the illustrative variants below).
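
The variants referred to would look like this (hypothetical queries following the same pattern as above, not runs I am reporting numbers for):

SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000 ORDER BY l_extendedprice;
SELECT COUNT(*) FROM tpch.lineitem WHERE l_extendedprice < 1000;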

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757c on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

The query is SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey. This query was constructed to have a JOIN that could produce a lot of rows, followed by a LIMIT. It is extremely artificial and otherwise useless in any real analytics scenario.

The DistSQL execution plan is intimidating, but ultimately correct.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_on

real	0m0.124s
user	0m0.028s
sys	0m0.028s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_off

real	0m0.186s
user	0m0.036s
sys	0m0.016s

When run without the LIMIT, both versions exhaust the memory budget:

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 10240 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real	0m16.898s
user	0m0.036s
sys	0m0.012s

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 136007680 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real	0m35.439s
user	0m0.028s
sys	0m0.028s

While that is unsatisfactory and needs work, this query demonstrates that, in both time and memory usage, LIMIT queries in DistSQL (and regular SQL) do the right thing and match expected behavior.

@asubiotto (Contributor):

Query run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode;

The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 44.103s | 7.799s |

@petermattis (Collaborator):

@asubiotto Almost linear speedup. Nice!

@rjnn (Contributor, Author) commented May 2, 2017

Queries run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I ran a variety of join queries, but I'm not documenting all of them, since they all tell the same story: we always plan HashJoins with full bisection flows on all nodes that have a TableReader for that query. Sadly, this means we are very susceptible to running out of memory, which we still do on large datasets (and we sometimes kill nodes as well, since the memory accounting guardrails are not in 630757c).

Here is one sample execution plan: as you can see, the planner is planning HashJoins and doing full bisection flows between all the nodes.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real	0m52.902s
user	0m0.032s
sys	0m0.016s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real	0m13.531s
user	0m0.028s
sys	0m0.020s

There is a clear speedup: the HashJoin, while not the best possible plan for this query, is still a hefty improvement over local execution.

@rjnn (Contributor, Author) commented May 2, 2017

Closing this issue, as we have now empirically evaluated and learned the breadth and limits of our DistSQL processors and planning. All the credit to @asubiotto, who shepherded this through all those OOMs!

🎉

rjnn closed this as completed May 2, 2017
@asubiotto (Contributor) commented May 2, 2017

Spun up an azworker with the same specs as navy and ran all of these queries against Postgres (TPC-H scalefactor 1). These numbers are from one run only. Note that the single-node and distributed SQL numbers are copy-pasted from the runs above for convenience; they are from 6-node clusters (only the first query was run on a 3-node local cluster).

| Query | Postgres | CockroachDB (single-node SQL) | CockroachDB (distributed SQL) |
| --- | --- | --- | --- |
| SELECT MIN(l_extendedprice) FROM lineitem | 1.582s | 22.028s | 9.520s |
| SELECT COUNT(*) FROM lineitem | 0.925s | 36.707s | 12.350s |
| SELECT AVG(l_extendedprice) FROM lineitem | 1.839s | 46.255s | 18.560s |
| SELECT * FROM orders ORDER BY o_totalprice | 6.063s | 3m51s | 3m34s |
| SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10 | 1.476s | 1m45s | 15.389s |
| SELECT COUNT(DISTINCT l_suppkey) FROM lineitem LIMIT 1 | 5.234s | 29.891s | 9.481s |
| SELECT * FROM lineitem WHERE l_extendedprice < 1000 | 1.343s | 49.482s | 12.068s |
| SELECT * FROM lineitem JOIN supplier ON lineitem.l_suppkey = supplier.s_suppkey LIMIT 10 | 1.247ms | 1860ms | 1240ms |
| SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode | 3.548s | 44.103s | 7.799s |
| SELECT COUNT(*) FROM lineitem, supplier WHERE lineitem.l_suppkey = supplier.s_suppkey | 2.409s | 52.902s | 13.531s |

cc @petermattis @arjunravinarayan

@petermattis (Collaborator):

Thanks, @asubiotto. This will definitely motivate work in 1.1.
