distsql: tracking issue for queries we expect to run through DistSQL #14288

Closed · 11 tasks done
rjnn opened this issue Mar 21, 2017 · 31 comments

@rjnn (Contributor) commented Mar 21, 2017

This is a TODO list of queries that we expect to run through DistSQL (in auto mode) by 1.0. Feel free to add as needed, cc @andreimatei @RaduBerinde @cuongdo @asubiotto.

  • SELECT COUNT(*) should always run through DistSQL for speed reasons. It is a common operation just after loading a bunch of data. It is currently exceedingly slow on large tables (without DistSQL). @arjunravinarayan
  • SELECT COUNT(DISTINCT column_name) @arjunravinarayan
  • Distributed Aggregations over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • Large scans with a very sparse WHERE clause. @arjunravinarayan
  • TPC-H queries that we support (see distsql: tracking issue for TPC-H queries #14295) @asubiotto
    • Queries 9 and 15 don't run through DistSQL, see issue for more info.
  • Basic join queries. For some helpful query ideas, the join order benchmark provides plenty of samples to try with DistSQL, even if we won't be optimally efficient on these for a while. @arjunravinarayan
  • Distributed sorting over large datasets: might be implicitly tested by a TPC-H query. Identify which one and document. @asubiotto
  • LIMIT queries, particularly limits after JOINs that would cause large amounts of state to stream across machines. @arjunravinarayan
  • queries using MIN @asubiotto
  • queries using MAX @asubiotto
  • queries using AVG (see distsql: support local aggregation in STDDEV, VARIANCE #14351) @asubiotto

If there are any queries that we want turned on by 1.0, please add them to the list above.

Do not check off an item as done without adding a comment/issue tracking the queries actually attempted on a cluster. Try to report running times and EXPLAIN(query) output to show the DistSQL plan.
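
For anyone reproducing these runs: the pattern used throughout this thread is to toggle DistSQL per session, compare timings, and capture the plan. A minimal sketch, assuming the tpch database used in the comments below (EXPLAIN output format varies by version):

-- Compare the same query with DistSQL off and on (per-session setting).
SET distsql = off;
SELECT COUNT(*) FROM tpch.lineitem;

SET distsql = on;
SELECT COUNT(*) FROM tpch.lineitem;

-- Capture the plan for the report.
EXPLAIN SELECT COUNT(*) FROM tpch.lineitem;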

rjnn added this to the 1.0 milestone Mar 21, 2017
@asubiotto (Contributor) commented Mar 30, 2017

Ran

SELECT MIN(l_extendedprice) FROM lineitem

on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.028s | 9.520s |

@asubiotto (Contributor) commented Mar 30, 2017

Ran

SELECT MAX(l_extendedprice) FROM lineitem

on a 3-node local cluster, using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 22.821s | 9.481s |

@tamird (Contributor) commented Mar 30, 2017

Hm, that's much slower than I'd have expected. Isn't DistSQL able to reduce network traffic on these queries to O(1)?

@RaduBerinde (Member):

Yes, the network traffic is O(1) in these plans. Note that these are local clusters (all nodes on the same machine) so network traffic is not really network traffic.
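
For intuition: the distributed plan computes a partial MIN on each node and a final MIN on the gateway, so only one row per node crosses the wire. A hand-written sketch of that two-stage shape (illustration only; DistSQL performs this decomposition internally, and the GROUP BY below merely stands in for the per-node split of ranges):

-- Stage 1: partial MINs per group (per node, conceptually).
-- Stage 2: final MIN over the partial results.
SELECT MIN(m)
FROM (
    SELECT MIN(l_extendedprice) AS m
    FROM lineitem
    GROUP BY l_shipmode
) AS partials;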

@tamird (Contributor) commented Mar 30, 2017 via email

@RaduBerinde (Member):

@danhhz mentioned a SELECT COUNT(*) experiment on a real 3-node cluster (I think it was ~200M rows; SELECT COUNT(*) took 7m33s with DistSQL, compared to over an hour without).

@tamird (Contributor) commented Mar 30, 2017 via email

@danhhz (Contributor) commented Mar 30, 2017

Indeed, it was the production lapis cluster.

@asubiotto (Contributor):

Ran

SELECT COUNT(*) FROM lineitem

on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 36.707s | 12.350s |

@asubiotto (Contributor) commented Apr 20, 2017

Ran

SELECT AVG(l_extendedprice) FROM lineitem

with f97a5c3 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset. lineitem has 6 million rows (~2.5GiB).
The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 46.255s | 18.560s |

@rjnn (Contributor, Author) commented Apr 20, 2017

Can you report the cockroach sha you used for future reference?

@petermattis (Collaborator):

@asubiotto The ranges for the lineitem table are only spread across 4 nodes? Or is there another reason the query plan only has 4 TableReaders?

@rjnn (Contributor, Author) commented Apr 20, 2017

Is there a quick way to find this out? Last time I asked, there wasn't a clean way to determine, given a table, which ranges it was spread across.

@rjnn (Contributor, Author) commented Apr 20, 2017

Admin UI says there are 74 ranges for table lineitem (but I can't figure out how to find range IDs or node information for those ranges), so it being spread across just 4 nodes seems strange (but possible).

@petermattis (Collaborator):

SHOW TESTING_RANGES FROM TABLE <table-name>
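
For example, against the table under discussion (a sketch; this testing statement and its output columns are version-dependent):

SHOW TESTING_RANGES FROM TABLE tpch.lineitem;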

@rjnn (Contributor, Author) commented Apr 20, 2017

Thank you @petermattis. It appears there are replicas and lease holders on every node, so let me investigate.
Also, there are 24 ranges, for 72 replicas, so clearly the Admin UI is buggy when it says there are 74 ranges. I will file an issue.

@asubiotto (Contributor):

I updated the execution plan. I think the gateway node probably had a stale cache. The query plan now shows only 5 TableReaders, which is still odd.

@asubiotto (Contributor) commented Apr 20, 2017

Actually, node 1 doesn't seem to be the lease holder for any range in lineitem.

@rjnn (Contributor, Author) commented Apr 20, 2017

Sure, but it was showing 4 before, and that was certainly incorrect. cc @andreimatei.

@andreimatei (Contributor):

> Sure, but it was showing 4 before, and that was certainly incorrect.

The range-descriptor and leaseholder caches can be empty or stale. This explains it, right?

The state of the caches is supposed to be visible in the ranges_cached internal table, but I think this hasn't been implemented yet.

@asubiotto (Contributor):

All these queries were run using 4129fe0 on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I started by trying to run

SELECT * FROM lineitem ORDER BY l_extendedprice

on the 6 million row (~2.5GiB) lineitem table, but this failed (see #15332; note that this query no longer causes an OOM crash with 4129fe0).

I then moved down to orders, a 1.5 million row (~600MiB) table, and ran

SELECT * FROM orders ORDER BY o_totalprice

The query runs correctly, the execution plan looks good (note that nodes 1 and 3 aren't leaseholders for any of orders' ranges), and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 3m51s | 3m34s |

I also ran:

SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10

to avoid running out of memory on the gateway node (execution plan here). The times for one run are as follows:

| Without DistSQL | With DistSQL |
| --- | --- |
| 1m45s | 15.389s |

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757cbc0 on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real	0m29.891s
user	0m0.028s
sys	0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; select count(DISTINCT l_suppkey) FROM tpch.lineitem LIMIT 1;"
+---------------------------+
| count(DISTINCT l_suppkey) |
+---------------------------+
|                     10000 |
+---------------------------+
(1 row)

real	0m9.481s
user	0m0.044s
sys	0m0.008s

@rjnn (Contributor, Author) commented May 1, 2017

I forgot to include the execution plan, which looks good.

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757cbc0 on a 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

This query was constructed to have a sparse WHERE clause and really nothing else.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_on

real	0m12.068s
user	0m0.072s
sys	0m0.012s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000;" > foo_off

real	0m49.482s
user	0m0.076s
sys	0m0.008s


The execution plan looks good.

The results are correct:

cockroach@cockroach-navy-0006:~$ sort foo_on > foo_on_sorted
cockroach@cockroach-navy-0006:~$ sort foo_off > foo_off_sorted
cockroach@cockroach-navy-0006:~$ diff foo_off_sorted foo_on_sorted
cockroach@cockroach-navy-0006:~$ wc -l foo_off
3082 foo_off
cockroach@cockroach-navy-0006:~$ wc -l foo_on
3082 foo_on

I did not add an ORDER BY or a COUNT operation, to keep the number of moving parts down; when they are added, the runtimes are similar (see the illustrative variants below).
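
The variants referred to would look like this (hypothetical queries following the same pattern as above, not runs I am reporting numbers for):

SELECT * FROM tpch.lineitem WHERE l_extendedprice < 1000 ORDER BY l_extendedprice;
SELECT COUNT(*) FROM tpch.lineitem WHERE l_extendedprice < 1000;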

@rjnn (Contributor, Author) commented May 1, 2017

Queries run with 630757c on 6-node navy (Standard_D3_v2) using the TPC-H scalefactor=1 dataset.

The query is SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey. This query was constructed to have a JOIN that could produce a lot of rows, followed by a LIMIT. It is extremely artificial and otherwise useless in any real analytics scenario.

The DistSQL execution plan is intimidating, but ultimately correct.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_on

real	0m0.124s
user	0m0.028s
sys	0m0.028s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey limit 10" > foo_off

real	0m0.186s
user	0m0.036s
sys	0m0.016s

When run without the LIMIT, both versions exhaust the memory budget:

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 10240 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real	0m16.898s
user	0m0.036s
sys	0m0.012s

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT * FROM tpch.lineitem JOIN tpch.supplier ON tpch.lineitem.l_suppkey = tpch.supplier.s_suppkey" > foo_off
Error: pq: root: memory budget exceeded: 136007680 bytes requested, 3676112896 bytes in budget
Failed running "sql"

real	0m35.439s
user	0m0.028s
sys	0m0.028s

While that is unsatisfactory and needs work, this query demonstrates that, in both time and memory usage, LIMIT queries in DistSQL (and regular SQL) do the right thing and match expected behavior.

@asubiotto (Contributor):

Query run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode;

The query runs correctly, the execution plan looks good, and the times for one run are below:

| Without DistSQL | With DistSQL |
| --- | --- |
| 44.103s | 7.799s |

@petermattis (Collaborator):

@asubiotto Almost linear speedup. Nice!

@rjnn (Contributor, Author) commented May 2, 2017

Queries run using 630757c on 6-node navy (Standard_D3_v2), using the TPC-H scalefactor=1 dataset.

I ran a variety of join queries, but I'm not documenting all of them, since they all tell the same story: we always plan HashJoins with full bisection flows on all nodes that have a TableReader for that query. Sadly, this means we are very susceptible to running out of memory, which we still do on large datasets (and we sometimes kill nodes as well, since the memory accounting guardrails are not in 630757c).

Here is one sample execution plan: as you can see, the planner is planning HashJoins and doing full bisection flows between all the nodes.

cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = off; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real	0m52.902s
user	0m0.032s
sys	0m0.016s
cockroach@cockroach-navy-0006:~$ time ./cockroach sql --certs-dir=certs -e "set distsql = on; SELECT count(*) FROM tpch.lineitem, tpch.supplier where lineitem.l_suppkey = supplier.s_suppkey;"
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
(1 row)

real	0m13.531s
user	0m0.028s
sys	0m0.020s

There is a clear speedup: the HashJoin, while not the best possible plan for this query, is still a hefty improvement over local execution.

@rjnn (Contributor, Author) commented May 2, 2017

Closing this issue, as we have now empirically evaluated and learned the breadth and limits of our DistSQL processors and planning. All the credit to @asubiotto, who shepherded this through all those OOMs!

🎉

rjnn closed this as completed May 2, 2017
@asubiotto (Contributor) commented May 2, 2017

Spun up an azworker with the same specs as navy and ran all of these queries against Postgres (TPC-H scalefactor 1). These numbers are from one run only. Note that the single-node and distributed SQL numbers are copy-pasted from the runs above for convenience; they are from 6-node clusters (only the first query was run on a 3-node local cluster).

| Query | Postgres | CockroachDB (single-node SQL) | CockroachDB (distributed SQL) |
| --- | --- | --- | --- |
| SELECT MIN(l_extendedprice) FROM lineitem | 1.582s | 22.028s | 9.520s |
| SELECT COUNT(*) FROM lineitem | 0.925s | 36.707s | 12.350s |
| SELECT AVG(l_extendedprice) FROM lineitem | 1.839s | 46.255s | 18.560s |
| SELECT * FROM orders ORDER BY o_totalprice | 6.063s | 3m51s | 3m34s |
| SELECT * FROM lineitem ORDER BY l_extendedprice LIMIT 10 | 1.476s | 1m45s | 15.389s |
| SELECT COUNT(DISTINCT l_suppkey) FROM lineitem LIMIT 1 | 5.234s | 29.891s | 9.481s |
| SELECT * FROM lineitem WHERE l_extendedprice < 1000 | 1.343s | 49.482s | 12.068s |
| SELECT * FROM lineitem JOIN supplier ON lineitem.l_suppkey = supplier.s_suppkey LIMIT 10 | 1.247ms | 1860ms | 1240ms |
| SELECT l_shipmode, AVG(l_extendedprice) FROM lineitem GROUP BY l_shipmode | 3.548s | 44.103s | 7.799s |
| SELECT COUNT(*) FROM lineitem, supplier WHERE lineitem.l_suppkey = supplier.s_suppkey | 2.409s | 52.902s | 13.531s |

cc @petermattis @arjunravinarayan

@petermattis (Collaborator):

Thanks, @asubiotto. This will definitely motivate work in 1.1.
