Add support for disk resource isolation #241

Merged
merged 5 commits into from
Feb 9, 2016
5 changes: 5 additions & 0 deletions docs/source/yelpsoa_configs.rst
@@ -28,6 +28,11 @@ instance MAY have:
There is currently no way to detect if this condition is met, other than a
``TASK_FAILED`` message.

* ``disk``: Disk (in MB) an instance needs. Defaults to 1024 (1GB). In Mesos,
disk is constrained to the specified limit. Tasks that attempt to exceed it
will receive 'No space left on device' errors and will be unable to write
any more data to disk.

* ``instances``: Marathon will attempt to run this many instances of the Service

* ``nerve_ns``: Specifies that this namespace should be routed to by another
@@ -2,8 +2,10 @@
test_instance:
cpus: 0.1
ram: 100
disk: 512.3

test_instance_2:
cpus: 0.1
ram: 250
disk: 256.7
deploy_group: test_cluster.test_instance
@@ -3,8 +3,10 @@ canary:
cpus: 0.1
instances: 1
mem: 500
disk: 750
nerve_ns: main
main:
cpus: 0.1
instances: 3
mem: 500
disk: 600
@@ -2,4 +2,5 @@
chronos_job:
cpus: .1
mem: 100
disk: 300
schedule: 'R/2015-08-14T10:00:00+00:00/PT10M'
@@ -2,6 +2,7 @@
main:
cpus: .1
mem: 100
disk: 450
instances: 1
env:
FOO: BAR
@@ -3,4 +3,5 @@
chronos_job:
cpus: .1
mem: 100
disk: 200
epsilon: foo
@@ -2,6 +2,7 @@
main:
cpus: .1
mem: 100
disk: 200
instances: 1
env:
FOO: BAR
@@ -2,4 +2,5 @@
chronos_job:
cpus: .1
mem: 100
disk: 250.2
schedule: 'R/2015-08-14T10:00:00+00:00/PT10M'
@@ -2,6 +2,7 @@
main:
cpus: .1
mem: 100
disk: 200.0
instances: 1
env:
FOO: BAR
2 changes: 1 addition & 1 deletion paasta_itests/docker-compose.yml
@@ -18,7 +18,7 @@ mesosslave:
- zookeeper
environment:
-CLUSTER: testcluster
command: 'mesos-slave --master=zk://zookeeper:2181/mesos-testcluster --resources="cpus(*):10; mem(*):512"'
command: 'mesos-slave --master=zk://zookeeper:2181/mesos-testcluster --resources="cpus(*):10; mem(*):512; disk(*):100"'
hostname: mesosslave.test_hostname

marathon:
8 changes: 8 additions & 0 deletions paasta_itests/paasta_metastatus.feature
@@ -17,6 +17,14 @@ Feature: paasta_metastatus describes the state of the paasta cluster
And a task belonging to the app with id "memtest" is in the task list
Then paasta_metastatus -v exits with return code "2" and output "CRITICAL: Less than 10% memory available."

# paasta_metastatus defines "high" disk usage as > 90% of the total cluster
# capacity. In docker-compose.yml, we set disk at 100MB for the 1 mesos slave in use.
Scenario: High disk usage
Given a working paasta cluster
When an app with id "disktest" using high disk is launched
And a task belonging to the app with id "disktest" is in the task list
Then paasta_metastatus -v exits with return code "2" and output "CRITICAL: Less than 10% disk available."

# paasta_metastatus defines 'high' cpu usage as > 90% of the total cluster
# capacity. in docker-compose.yml, we set cpus at 10 for the 1 mesos slave in use;
# mainly this is just to use a round number. It's important to note that this
5 changes: 5 additions & 0 deletions paasta_itests/steps/paasta_metastatus_steps.py
@@ -37,6 +37,11 @@ def run_paasta_metastatus_high_mem(context, app_id):
context.marathon_client.create_app(app_id, MarathonApp(cmd='/bin/sleep infinity', mem=490, instances=1))


@when(u'an app with id "{app_id}" using high disk is launched')
def run_paasta_metastatus_high_disk(context, app_id):
context.marathon_client.create_app(app_id, MarathonApp(cmd='/bin/sleep infinity', disk=95, instances=1))


@when(u'a chronos job with name "{job_name}" is launched')
def chronos_job_launched(context, job_name):
job = {'async': False, 'command': 'echo 1', 'epsilon': 'PT15M', 'name': job_name,
7 changes: 7 additions & 0 deletions paasta_tools/check_mesos_resource_utilization.py
@@ -70,9 +70,11 @@ def check_thresholds(percent):
output = ""
current_mem = stats['master/mem_percent']
current_cpu = stats['master/cpus_percent']
current_disk = stats['master/disk_percent']
percent = int(percent)
cpu_print_tuple = (percent, current_cpu)
mem_print_tuple = (percent, current_mem)
disk_print_tuple = (percent, current_disk)
if current_mem >= percent:
output += "CRITICAL: Memory usage is over %d%%! Currently at %f%%!\n" % mem_print_tuple
over_threshold = True
@@ -83,6 +85,11 @@ def check_thresholds(percent):
over_threshold = True
else:
output += "OK: CPU usage is under %d%%. (Currently at %f%%)\n" % cpu_print_tuple
if current_disk >= percent:
Review comment (Member): Can you add blank lines between each if/else in here? These blocks kind of run into each other.

output += "CRITICAL: Disk usage is over %d%%! Currently at %f%%!\n" % disk_print_tuple
over_threshold = True
else:
output += "OK: Disk usage is under %d%%. (Currently at %f%%)\n" % disk_print_tuple
if over_threshold is True:
status = 2
else:
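The three near-identical if/else blocks in `check_thresholds` could also be driven by a small table, which would sidestep the readability concern raised in the review comment. The following is a minimal sketch, not the code merged in this PR. It assumes the CPU critical message mirrors the memory and disk ones (that line is elided from the hunk), and `RESOURCES` and `summarize_usage` are hypothetical names that do not exist in paasta_tools.

```python
# Hypothetical refactor sketch; names below are illustrative only.
RESOURCES = [
    ("Memory", "master/mem_percent"),
    ("CPU", "master/cpus_percent"),
    ("Disk", "master/disk_percent"),
]


def summarize_usage(stats, percent):
    """Return (output, over_threshold) for a dict of Mesos master stats."""
    percent = int(percent)
    output = ""
    over_threshold = False
    for label, key in RESOURCES:
        current = stats[key]
        if current >= percent:
            # Same wording as the memory and disk messages in the diff.
            output += "CRITICAL: %s usage is over %d%%! Currently at %f%%!\n" % (
                label, percent, current)
            over_threshold = True
        else:
            output += "OK: %s usage is under %d%%. (Currently at %f%%)\n" % (
                label, percent, current)
    return output, over_threshold
```

Adding a fourth resource would then be a one-line change to the table rather than another copied block.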
4 changes: 3 additions & 1 deletion paasta_tools/chronos_tools.py
@@ -356,6 +356,7 @@ def check(self, param):
'retries': self.check_retries,
'cpus': self.check_cpus,
'mem': self.check_mem,
'disk': self.check_disk,
'schedule': self.check_schedule,
'scheduleTimeZone': self.check_schedule_time_zone,
'parents': self.check_parents,
@@ -385,6 +386,7 @@ def format_chronos_job_dict(self, docker_url, docker_volumes):
'environmentVariables': self.get_env(),
'mem': self.get_mem(),
'cpus': self.get_cpus(),
'disk': self.get_disk(),
'constraints': self.get_constraints(),
'command': self.get_cmd(),
'arguments': self.get_args(),
@@ -413,7 +415,7 @@ def validate(self):
# Use InstanceConfig to validate shared config keys like cpus and mem
error_msgs.extend(super(ChronosJobConfig, self).validate())

for param in ['epsilon', 'retries', 'cpus', 'mem', 'schedule', 'scheduleTimeZone']:
for param in ['epsilon', 'retries', 'cpus', 'mem', 'disk', 'schedule', 'scheduleTimeZone']:
check_passed, check_msg = self.check(param)
if not check_passed:
error_msgs.append(check_msg)
6 changes: 6 additions & 0 deletions paasta_tools/cli/schemas/chronos_schema.json
@@ -46,6 +46,12 @@
"exclusiveMinimum": true,
"default": 1024
},
"disk": {
"type": "number",
"minimum": 0,
"exclusiveMinimum": true,
"default": 1024
},
"bounce_method": {
"enum": [ "graceful" ],
"default": "graceful"
6 changes: 6 additions & 0 deletions paasta_tools/cli/schemas/marathon_schema.json
@@ -20,6 +20,12 @@
"exclusiveMinimum": true,
"default": 1024
},
"disk": {
"type": "number",
"minimum": 0,
"exclusiveMinimum": true,
"default": 1024
},
"instances": {
"type": "integer",
"minimum": 0,
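Both schemas use the boolean form of `exclusiveMinimum` from JSON Schema draft 4, so `"minimum": 0` together with `"exclusiveMinimum": true` means the value must be strictly greater than zero. The sketch below mirrors that rule in plain Python as an illustration only. `is_valid_disk` is a hypothetical helper, not part of paasta_tools, and it does not replace running a real schema validator.

```python
import numbers


def is_valid_disk(value):
    """Mimic {"type": "number", "minimum": 0, "exclusiveMinimum": true}."""
    # bool is a subclass of int in Python, but a JSON boolean is not a number.
    if isinstance(value, bool) or not isinstance(value, numbers.Real):
        return False
    # exclusiveMinimum: true makes the bound strict, so 0 itself is rejected.
    return value > 0
```

Under this rule the fixture values in this PR (for example `disk: 512.3` and `disk: 750`) are accepted, while `disk: 0` or a quoted string is rejected.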
2 changes: 2 additions & 0 deletions paasta_tools/marathon_tools.py
@@ -278,6 +278,7 @@ def format_marathon_app_dict(self, app_id, docker_url, docker_volumes, service_n
- env: environment variables for the container.
- mem: the amount of memory required.
- cpus: the number of cpus required.
- disk: the amount of disk space required.
- constraints: the constraints on the Marathon app.
- instances: the number of instances required.
- cmd: the command to be executed.
@@ -315,6 +316,7 @@ def format_marathon_app_dict(self, app_id, docker_url, docker_volumes, service_n
'env': self.get_env(),
'mem': float(self.get_mem()),
'cpus': float(self.get_cpus()),
'disk': float(self.get_disk()),
'constraints': self.get_constraints(service_namespace_config),
'instances': self.get_instances(),
'cmd': self.get_cmd(),
1 change: 1 addition & 0 deletions paasta_tools/mesos_tools.py
@@ -242,6 +242,7 @@ def status_mesos_tasks_verbose(job_id, get_short_task_id):
"Host deployed to",
"Ram",
"CPU",
"Disk",
"Deployed at what localtime"
]]
for task in running_and_active_tasks:
30 changes: 30 additions & 0 deletions paasta_tools/paasta_metastatus.py
@@ -63,6 +63,18 @@ def get_mesos_cpu_status(metrics):
return total, used, available


def get_mesos_disk_status(metrics):
"""Takes in the mesos metrics and analyzes them, returning the status
:param metrics: mesos metrics dictionary
:returns: Tuple of (total, used, available) disk
"""

total = metrics['master/disk_total']
used = metrics['master/disk_used']
available = total - used
return total, used, available


def get_extra_mesos_slave_data(mesos_state):
slaves = dict((slave['id'], {
'free_resources': slave['resources'],
@@ -121,6 +133,22 @@ def assert_memory_health(metrics, threshold=10):
False)


def assert_disk_health(metrics, threshold=10):
total = metrics['master/disk_total'] / float(1024)
used = metrics['master/disk_used'] / float(1024)
perc_used = percent_used(total, used)

if check_threshold(perc_used, threshold):
return ("Disk: %0.2f / %0.2fGB in use (%s)"
% (used, total, PaastaColors.green("%.2f%%" % perc_used)),
True)
else:
return (PaastaColors.red(
"CRITICAL: Less than %d%% disk available. (Currently using %.2f%%)"
% (threshold, perc_used)),
False)


def assert_tasks_running(metrics):
running = metrics['master/tasks_running']
staging = metrics['master/tasks_staging']
@@ -189,6 +217,7 @@ def assert_extra_slave_data(mesos_state):
slave['hostname'],
'%.2f' % slave['free_resources']['cpus'],
'%.2f' % slave['free_resources']['mem'],
'%.2f' % slave['free_resources']['disk'],
))
result = ('\n'.join((' %s' % row for row in format_table(rows)))[2:], True)
else:
@@ -207,6 +236,7 @@ def get_mesos_status(mesos_state, verbosity):
metrics_results = run_healthchecks_with_param(metrics, [
assert_cpu_health,
assert_memory_health,
assert_disk_health,
assert_tasks_running,
assert_slave_health,
])
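To see how the new health check lines up with the integration test, the sketch below runs the same arithmetic with the test's numbers: `disk(*):100` on the single slave and `disk=95` for the launched app. The bodies of `percent_used` and `check_threshold` are not shown in this diff, so the versions here are assumptions about their behaviour, and the colouring via `PaastaColors` is left out to keep the example self-contained.

```python
def percent_used(total, used):
    # Assumed helper: share of the total that is in use, as a percentage.
    return round(used / float(total) * 100.0, 2)


def check_threshold(perc_used, threshold):
    # Assumed helper: healthy while more than `threshold` percent is free.
    return (100 - perc_used) > threshold


def assert_disk_health(metrics, threshold=10):
    """Uncoloured variant of the check added in this PR."""
    # Mesos reports disk in MB; convert to GB for display.
    total = metrics['master/disk_total'] / float(1024)
    used = metrics['master/disk_used'] / float(1024)
    perc_used = percent_used(total, used)
    if check_threshold(perc_used, threshold):
        return ("Disk: %0.2f / %0.2fGB in use (%.2f%%)"
                % (used, total, perc_used), True)
    return ("CRITICAL: Less than %d%% disk available. (Currently using %.2f%%)"
            % (threshold, perc_used), False)


# Integration-test numbers: 100 MB total on the slave, 95 MB requested.
message, ok = assert_disk_health(
    {'master/disk_total': 100, 'master/disk_used': 95})
```

With 95 of 100 MB in use, only 5% remains free, which is below the default 10% threshold, so the check reports the critical message the feature file expects.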
32 changes: 24 additions & 8 deletions paasta_tools/utils.py
@@ -95,17 +95,26 @@ def get_mem(self):
Defaults to 1024 (1G) if no value specified in the config.

:returns: The amount of memory specified by the config, 1024 if not specified"""
mem = self.config_dict.get('mem')
return mem if mem else 1024
mem = self.config_dict.get('mem', 1024)
return mem

def get_cpus(self):
"""Gets the number of cpus required from the service's configuration.

Defaults to .25 (1/4 of a cpu) if no value specified in the config.

:returns: The number of cpus specified in the config, .25 if not specified"""
cpus = self.config_dict.get('cpus')
return cpus if cpus else .25
cpus = self.config_dict.get('cpus', .25)
return cpus

def get_disk(self):
"""Gets the amount of disk space required from the service's configuration.

Defaults to 1024 (1G) if no value is specified in the config.

:returns: The amount of disk space specified by the config, 1024 if not specified"""
disk = self.config_dict.get('disk', 1024)
return disk

def get_cmd(self):
"""Get the docker cmd specified in the service's configuration.
@@ -190,15 +199,22 @@ def get_force_bounce(self):
def check_cpus(self):
cpus = self.get_cpus()
if cpus is not None:
if not isinstance(cpus, float) and not isinstance(cpus, int):
return False, 'The specified cpus value "%s" is not a valid float.' % cpus
if not isinstance(cpus, (float, int)):
return False, 'The specified cpus value "%s" is not a valid float or int.' % cpus
return True, ''

def check_mem(self):
mem = self.get_mem()
if mem is not None:
if not isinstance(mem, float) and not isinstance(mem, int):
return False, 'The specified mem value "%s" is not a valid float.' % mem
if not isinstance(mem, (float, int)):
return False, 'The specified mem value "%s" is not a valid float or int.' % mem
return True, ''

def check_disk(self):
disk = self.get_disk()
if disk is not None:
if not isinstance(disk, (float, int)):
return False, 'The specified disk value "%s" is not a valid float or int.' % disk
Review comment (Contributor): isinstance can be passed a tuple for its second argument, so this could just be isinstance(cpus, (float, int)). Ditto for lines 202 and 209.

return True, ''
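One side effect of moving from `mem if mem else 1024` to `config_dict.get('mem', 1024)` is that the default now applies only when the key is absent. A key that is present but falsy (an explicit `0`, or an empty value that a YAML loader typically parses as `None`) used to fall back to the default and now passes through unchanged. The sketch below illustrates the difference. `get_old` and `get_new` are hypothetical stand-ins for the two lookup styles, and `check_disk` here takes the value directly rather than reading it from `self`.

```python
def get_old(config_dict, key, default):
    # Pre-PR style: any falsy value (None, 0, 0.0) falls back to the default.
    value = config_dict.get(key)
    return value if value else default


def get_new(config_dict, key, default):
    # Post-PR style: the default applies only when the key is absent.
    return config_dict.get(key, default)


def check_disk(disk):
    # Same rule as check_disk in the diff, applied to a plain value.
    if disk is not None:
        if not isinstance(disk, (float, int)):
            return False, 'The specified disk value "%s" is not a valid float or int.' % disk
    return True, ''
```

Note that `check_disk(None)` passes, so a `None` value would only surface later, for example when `float(self.get_disk())` is evaluated while building the Marathon app dict.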

def check(self, param):
2 changes: 2 additions & 0 deletions tests/cli/test_cmds_validate.py
@@ -156,12 +156,14 @@ def test_marathon_validate_schema_list_hashes_good(
cpus: 0.1
instances: 2
mem: 250
disk: 512
cmd: virtualenv_run/bin/python adindexer/adindex_worker.py
healthcheck_mode: cmd
main_http:
cpus: 0.1
instances: 2
mem: 250
disk: 512
"""
mock_get_file_contents.return_value = marathon_content
