Capacity planning and troubleshooting #2723

Closed
robert-s-lee opened this issue Mar 15, 2018 · 8 comments
Labels
O-sales-eng Internal source: Sales Engineering P-2 Normal priority; secondary task T-missing-info

Comments

@robert-s-lee
Contributor

robert-s-lee commented Mar 15, 2018

Exalate commented:

High-level steps below. More detail to be added.

  • run the workload at concurrency 1
    collect performance data: throughput, response time
    collect capacity data: CPU, RAM, Network bandwidth, Storage IOPS, Storage bandwidth, Storage capacity

  • run the workload at at least 2 other concurrency levels to construct a curve like the one below. Testing more concurrency levels produces a more accurate prediction

chart 3

  • The capacity consumed at the concurrency levels where the response time still meets the SLA can be used to forecast capacity requirements
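
As a rough illustration of the last step, the sketch below linearly interpolates between measured points to estimate the concurrency at which response time crosses the SLA. The function name `sla_concurrency`, the input format (one "concurrency response_time_ms" pair per line, sorted by concurrency), and the sample numbers are all assumptions made for this example; they are not measurements from this issue, and a real curve may call for a better fit than straight-line interpolation.

```shell
# Hypothetical helper: reads "concurrency response_time_ms" pairs on stdin
# (sorted by concurrency) and prints the interpolated concurrency at which
# the response time crosses the SLA passed as the first argument.
sla_concurrency() {
  awk -v sla="$1" '
    BEGIN { sla += 0 }
    NR > 1 && prev_rt <= sla && $2 + 0 > sla {
      # Linear interpolation between the two measurements that bracket the SLA.
      printf "%.1f\n", prev_c + (sla - prev_rt) * ($1 - prev_c) / ($2 - prev_rt)
      found = 1
      exit
    }
    { prev_c = $1 + 0; prev_rt = $2 + 0 }
    END { if (!found) print "SLA not crossed in measured range" }
  '
}

# Made-up measurements at concurrency 1, 8, and 16, with a 20 ms SLA.
printf '1 5\n8 9\n16 25\n' | sla_concurrency 20
```

The CPU, RAM, network, and storage figures collected at the nearest measured concurrency below that point would then be the basis for the forecast.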

Jira Issue: DOC-154

@robert-s-lee robert-s-lee self-assigned this Mar 15, 2018
@jseldess jseldess changed the title describe steps for estimating capacity Capacity planning guidance Mar 19, 2018
@jseldess jseldess added this to the 2.1 milestone Mar 19, 2018
@jseldess jseldess changed the title Capacity planning guidance Capacity planning and troubleshooting Apr 16, 2018
@jseldess
Contributor

jseldess commented Apr 16, 2018

Capacity troubleshooting from #2461:

Performance, availability, and stability issues can be caused by insufficient capacity.

Initial suggestions below:

Check Out of Capacity Conditions

First, check whether adequate capacity was available at the time of the incident for the following components.

| Type | Linux command | What to look for |
|---|---|---|
| File system storage capacity | df -kh | file system at 99% |
| Storage IOPS / bandwidth capacity | sar -d | consistent non-zero values in avgqu-sz |
| RAM capacity | sar -S | any non-zero value in kbswpused and %swpused |
| CPU capacity | sar -q | consistent non-zero values in runq-sz |
| Network capacity | sar -n ALL | any non-zero value in rxerr/s, txerr/s, coll/s, rxdrop/s, txdrop/s, txcarr/s, rxfram/s, rxfifo/s, txfifo/s |
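
The file system row can be scripted along these lines. This is only a sketch: the function name `full_filesystems` is made up for illustration, it reads `df -P` output (the POSIX single-line format, chosen here because its columns are easier to parse than `df -kh`), and it assumes mount points contain no spaces.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# point and usage of every file system at or above the given percentage.
full_filesystems() {
  awk -v t="${1:-99}" '
    NR > 1 {                      # skip the header line
      use = $5
      sub(/%/, "", use)           # "99%" -> "99"
      if (use + 0 >= t + 0) print $6, $5
    }
  '
}

# Example usage on a live system:
#   df -P | full_filesystems 99
```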

Check Near Out of Capacity Conditions

| Type | Linux command | What to look for |
|---|---|---|
| Storage bandwidth capacity | sar -d | consistently more than 80% or unusually high values in rd_sec/s and wr_sec/s (reported in 512-byte sectors, so not as accurate as iostat) |
| Storage IOPS capacity | sar -d | consistently more than 1 ms or unusually high values in await |
| RAM capacity | sar -r | consistently more than 80% in %memused |
| CPU capacity | sar -u | consistently less than 20% in %idle (i.e., 80% busy) |
| Network capacity | sar -n ALL | consistently more than 50% of line capacity in rxkB/s or txkB/s |
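
To judge "consistently" for the CPU row, one option is to count how many samples fall below the idle threshold. The sketch below does that for `sar -u` style output. The function name `cpu_busy_samples` is hypothetical, and the parsing assumes that %idle is the last column and that decimals use a dot, which can differ by sysstat version and locale.

```shell
# Hypothetical helper: reads `sar -u` output on stdin and prints
# "<samples below threshold>/<total samples>" for the %idle column.
cpu_busy_samples() {
  awk -v t="${1:-20}" '
    # Keep only data rows: last field numeric, and not the Average summary.
    $NF ~ /^[0-9.]+$/ && $1 != "Average:" {
      n++
      if ($NF + 0 < t + 0) busy++
    }
    END { printf "%d/%d\n", busy, n }
  '
}

# Example usage on a live system:
#   sar -u | cpu_busy_samples 20
```

A result such as 55/60 would suggest the CPU was near capacity for most of the hour, while 1/60 points to a brief spike.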

Check for abnormal conditions

Historical measurements are required to determine whether a metric is too high or too low. Maintain at least 1 month of historical data for comparison.

Linux sar in crontab

Enable sar data collection at 1-minute intervals

  • Ubuntu
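sudo bash
apt-get install sysstat
# Turn on sysstat data collection.
sed -i.bak -e 's/ENABLED="false"/ENABLED="true"/g' /etc/default/sysstat
# Collect every minute instead of every 10 minutes. The backup gets a .bak suffix
# because cron skips files in /etc/cron.d whose names contain a dot.
sed -i.bak -e 's|5-55/10 \* \* \* \*|\* \* \* \* \*|g' /etc/cron.d/sysstat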

@jseldess
Contributor

This docs work should wait on this related 2.1 roadmap work.

@jseldess
Contributor

Based on feedback from @drewdeally, we don't yet have the data for this. Moving to Later.

@jseldess jseldess removed this from the Later milestone Nov 10, 2018
@jseldess jseldess added the O-sales-eng Internal source: Sales Engineering label Nov 12, 2018
@lnhsingh lnhsingh added the P-2 Normal priority; secondary task label Nov 15, 2018
@Amruta-Ranade Amruta-Ranade self-assigned this Feb 26, 2019
@Amruta-Ranade
Contributor

@robert-s-lee I addressed this issue here: https://www.cockroachlabs.com/docs/dev/cluster-setup-troubleshooting.html#capacity-planning-issues
PTAL and let me know if I can close this issue.

@drewdeally

drewdeally commented Apr 17, 2019 via email

@Amruta-Ranade
Contributor

@drewdeally Can you share the correct values?

@taroface taroface self-assigned this Oct 6, 2020
@taroface
Contributor

taroface commented Oct 6, 2020

Relates to #7818 and #8547.

@Amruta-Ranade Amruta-Ranade removed their assignment Jan 26, 2021
@jseldess
Contributor

We have closed this issue because it is more than 3 years old. If this issue is still relevant, please add a comment and re-open the issue. Thank you for your contribution to CockroachDB docs!

6 participants