RELEASE NOTES FOR SLURM VERSION 17.02
23 February 2017
IMPORTANT NOTES:
THE MAXJOBID IS NOW 67,108,863. ANY PRE-EXISTING JOBS WILL CONTINUE TO RUN BUT
NEW JOB IDS WILL BE WITHIN THE NEW MAXJOBID RANGE. Adjust your configured
MaxJobID value as needed to eliminate any confusion.
If using the slurmdbd (Slurm DataBase Daemon), you must update it first.
The 17.02 slurmdbd will work with Slurm daemons of version 15.08 and above.
You will not need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before updating
any other clusters making use of it. No real harm will come from updating
your systems before the slurmdbd, but they will not talk to each other
until you do. Also, at least the first time you run the slurmdbd, you need to
make sure your my.cnf file has innodb_buffer_pool_size equal to at least 64M.
You can accomplish this by adding the line
innodb_buffer_pool_size=64M
under the [mysqld] reference in the my.cnf file and restarting the mysqld. The
buffer pool size must be smaller than the size of the MySQL tmpdir. This is
needed when converting large tables over to the new database schema.
Slurm can be upgraded from version 15.08 or 16.05 to version 17.02 without loss
of jobs or other state information. Upgrading directly from an earlier version
of Slurm will result in loss of state information.
If using SPANK plugins that use the Slurm APIs, they should be recompiled when
upgrading Slurm to a new major release.
NOTE: systemd service files are installed automatically, but not enabled.
You will need to manually enable them on the appropriate systems:
- Controller: systemctl enable slurmctld
- Database: systemctl enable slurmdbd
- Compute Nodes: systemctl enable slurmd
NOTE: If you are not using Munge, but are using the "service" scripts to
start Slurm daemons, then you will need to remove this check from the
etc/slurm*service scripts.
NOTE: If you are upgrading with any jobs from 14.03 or earlier
(i.e. quick upgrade from 14.03 -> 15.08 -> 17.02) you will need
to wait until after those jobs are gone before you upgrade to 17.02.
NOTE: If you interact with any memory values in a job_submit plugin, you will
need to test against NO_VAL64 instead of NO_VAL, and change your printf
format as well.
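For example, a C job_submit plugin checking a memory field might change along
these lines (a minimal sketch; pn_min_memory is among the converted fields,
and the PRIu64 format macro comes from <inttypes.h>):
    /* Before 17.02: memory fields were 32-bit */
    if (job_desc->pn_min_memory != NO_VAL)
            info("requested memory: %u MB", job_desc->pn_min_memory);
    /* With 17.02: memory fields are 64-bit */
    if (job_desc->pn_min_memory != NO_VAL64)
            info("requested memory: %"PRIu64" MB", job_desc->pn_min_memory);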
HIGHLIGHTS
==========
-- Added infrastructure for managing workload across a federation of clusters.
(partial functionality in version 17.02, fully operational in May 2017)
-- In order to support federated jobs, the MaxJobID configuration parameter
default value has been reduced from 2,147,418,112 to 67,043,328 and its
maximum value is now 67,108,863. Upon upgrading, any pre-existing jobs that
have a job ID above the new range will continue to run and new jobs will get
job IDs in the new range.
-- Added "MailDomain" configuration parameter to qualify email addresses.
-- Automatically clean up task/cgroup cpuset and devices cgroups after steps
are completed.
-- Added burst buffer support for job arrays. Added new SchedulerParameters
configuration parameter of bb_array_stage_cnt=# to indicate how many pending
tasks of a job array should be made available for burst buffer resource
allocation.
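   For example, the following slurm.conf line (the count is illustrative) would
   make up to 10 pending tasks of a job array eligible for burst buffer staging:
       SchedulerParameters=bb_array_stage_cnt=10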
-- Added new sacctmgr commands: "shutdown" (shutdown the server), "list stats"
(get server statistics), and "clear stats" (clear server statistics).
-- The database index for jobs is now 64 bits. If you happen to be close to
4 billion jobs in your database, you will want to update your slurmctld at
the same time as your slurmdbd to prevent rollover of this variable, as
it is 32 bits in previous versions of Slurm.
-- All memory values (in MB) are now 64 bit. Previously, nodes with more than
2 TB of memory would not schedule or enforce memory limits correctly.
-- Removed AIX, BlueGene/L and BlueGene/P support.
-- Removed sched/wiki and sched/wiki2 plugins and associated code.
-- Added PrologFlags=Serial to disable concurrent execution of prolog/epilog
scripts.
RPMBUILD CHANGES
================
-- Removed separate slurm_blcr package. If Slurm is built with BLCR support,
the files will now be part of the main Slurm packages.
-- Replaced sjstat, seff and sjobexit RPM packages with a single "contribs"
package.
-- Prevent installation of both init scripts and systemd service files.
Note that systemd services are not automatically enabled at this time,
unlike the init scripts which were automatically enabled with chkconfig.
CONFIGURATION FILE CHANGES (see appropriate man page for details)
=================================================================
-- Added new "TRESWeights" configuration parameter to "NodeName" line for
calculating how busy a node is. Currently only used for federation
configurations.
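   For example, a NodeName line might include weights along these lines (the
   values and syntax shown are illustrative; see the slurm.conf man page):
       NodeName=tux[0-15] TRESWeights="CPU=1.0,Mem=0.25G"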
-- Added new "PriorityFlags" option of "INCR_ONLY", which prevents a job's
priority from being decremented.
-- Added "SchedulerParameters" configuration parameter of "default_gbytes",
which treats numeric only (no suffix) value for memory and tmp disk space as
being in units of Gigabytes. Mostly for compatability with LSF.
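   For example, with the following in slurm.conf:
       SchedulerParameters=default_gbytes
   a job submitted with "--mem=4" would be treated as requesting 4 gigabytes
   rather than the default interpretation of 4 megabytes.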
-- Added "BootTime" configuration parameter to knl.conf file to optimize
resource allocations with respect to required node reboots.
-- Added "SchedulingParameters" option of "bf_job_part_count_reserve". Jobs
below the specified threshold will not have resources reserved for them.
-- Added "SchedulerParameters" option of "spec_cores_first" to select
specialized cores from the lowest- rather than highest-numbered cores and
sockets.
-- Added "PrologFlags" option of "Serial" to disable concurrent launch of
Prolog and Epilog scripts.
-- Removed "SchedulerRootFilter" configuration parameter.
-- Added "SyscfgTimeout" parameter to knl.conf configuration file.
COMMAND CHANGES (see man pages for details)
===========================================
-- Added commands to sacctmgr for managing and displaying federations.
-- Added salloc, sbatch and srun option of "--delay-boot=<time>", which will
temporarily delay booting nodes into the desired state for a job in the
hope that nodes already in the proper state will become available at a
later time.
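   For example (the time, in minutes, is illustrative; "job.sh" is a
   placeholder):
       sbatch --delay-boot=30 job.sh
   would let the job wait up to 30 minutes for nodes already in the desired
   state before rebooting nodes into that state.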
-- Added salloc/sbatch/srun "--spread-job" option to distribute tasks over as
many nodes as possible. This also treats the "--ntasks-per-node" option as a
maximum value.
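   For example ("./my_app" is a placeholder):
       srun --spread-job -n32 ./my_app
   would distribute the 32 tasks across as many of the allocated nodes as
   possible.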
-- Added support for sbatch --bbf option to specify a burst buffer input file.
-- Added SLURM_ARRAY_TASK_COUNT environment variable. Total number of tasks in
a job array (e.g. "--array=2,4,8" will set SLURM_ARRAY_TASK_COUNT=3).
-- Modified cpu_bind and mem_bind map and mask options to accept a repetition
count to better support large task counts. For example:
"mask_mem:0x0f*2,0xf0*2" is equivalent to "mask_mem:0x0f,0x0f,0xf0,0xf0".
-- Added support for "--mem_bind=prefer" option to prefer, but not restrict
memory use to the identified NUMA node.
-- Added "--mem_bind=sort" option to run zonesort on KNL nodes at step start.
-- Do not process SALLOC_HINT, SBATCH_HINT or SLURM_HINT environment variables
if any of the following salloc, sbatch or srun command line options are
specified: -B, --cpu_bind, --hint, --ntasks-per-core, or --threads-per-core.
-- Added port info to "sinfo" and "scontrol show node" output.
-- Moved BatchScript to the end of each job's information when using
"scontrol -dd show job" to make it more readable.
-- The scancel command previously treated a non-numeric argument as the name of
jobs to be cancelled (an undocumented feature). Cancelling jobs by name now
requires the "--jobname=" command line argument.
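   For example, to cancel all of your jobs named "myjob" (the name is
   illustrative):
       scancel --jobname=myjob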
-- If GRES are configured with file IDs, then "scontrol -d show node" will
not only identify the count of currently allocated GRES, but their specific
index numbers (e.g. "GresUsed=gpu:alpha:2(IDX:0,2),gpu:beta:0(IDX:N/A)").
Ditto for job information with "scontrol -d show job".
-- Added "GresEnforceBind=Yes" to "scontrol show job" output if so configured.
-- Added "Partitions" field to "scontrol show node" output.
OTHER CHANGES
=============
-- Added event trigger on burst buffer errors (see strigger man page,
"--burst_buffer" option).
-- Added job AdminComment field which can only be set by a Slurm administrator.
-- Added mechanism to constrain kernel memory allocation using cgroups. New
cgroup.conf parameters added: ConstrainKmemSpace, MaxKmemPercent, and
MinKmemSpace.
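   For example, in cgroup.conf (values are illustrative; see the cgroup.conf
   man page for defaults and units):
       ConstrainKmemSpace=yes
       MaxKmemPercent=80
       MinKmemSpace=30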
-- burst_buffer/cray: Accept new jobs on backup slurmctld daemon without access
to dw_wlm_cli command. No burst buffer actions will take place.
-- node_features/knl_cray and knl_generic - Added capability to detect
Uncorrectable Memory Errors (UME) and, if detected, log the event to all
job and step stderr with a message of the form:
error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
-- If a job is allocated nodes which are powered down, its start time is reset
when the nodes are ready and the job is not charged for power-up time.
-- Added the ability to purge transactions from the database.
-- Added the ability to purge rolled up usage from the database.
-- Added new mcs/account plugin.
-- job_submit/lua - removed access to the internal reservation job_run_cnt and
job_pend_cnt fields.
-- Increased maximum file size supported by sbcast from 2 GB (32-bit offsets)
to 64-bit offsets.
API CHANGES
===========
Changed members of the following structs
========================================
Memory conversion to uint64_t in: job_descriptor, job_info, node_info,
partition_info, resource_allocation_response_msg, slurm_ctl_conf,
slurmd_status_msg, slurm_step_ctx_params_t
Variables converted are: actual_real_mem, def_mem_per_cpu, free_mem,
max_mem_per_cpu, mem_spec_limit, pn_min_memory, real_memory, req_mem.
In slurm_step_io_fds_t: changed "in" to "input".
In file_bcast_msg_t: changed block_no from uint16_t to uint32_t, changed
block_offset from uint32_t to uint64_t.
Added members to the following struct definitions
=================================================
In burst_buffer_info_t: Added unfree_space (drained or allocated space)
In burst_buffer_pool_t: Added unfree_space
In connection_arg_t: Added bool persist
In job_desc_msg_t: Added admin_comment, delay_boot, fed_siblings (to track which
clusters have sibling jobs), group_number, numpack, pack_leader,
pelog_env, pelog_env_size and restart_cnt.
In slurm_job_info_t: Added admin_comment, burst_buffer_state, delay_boot,
fed_origin_str, fed_siblings, fed_siblings_str, gres_detail_cnt and
gres_detail_str.
In job_step_info_t: Added packjobid, packstepid, srun_host and srun_pid.
In node_info_t: Added partitions and port.
In partition_info_t: Added over_time_limit
In resource_allocation_msg_t: added node_addr and working_cluster_rec for
federated interactive jobs.
In slurm_ctl_conf: Added mail_domain and sbcast_parameters.
In slurm_msg_t: Added buffer to retain the received message buffer for later
use.
In slurmctld_lock_t: Added federation
In slurmdb_cluster_cond_t: Added List federation_list
In slurmdb_cluster_rec_t: Added fed, lock, sockfd
In will_run_response_msg_t: Added double sys_usage_per to report back how busy a
cluster is.
Added the following struct definitions
======================================
Added slurmdb_cluster_fed_t to store federation information on
slurmdb_cluster_rec_t.
Added slurmdb_federation_cond_t for selecting federations from db.
Added slurmdb_federation_rec_t to represent federation objects.
Added job_fed_details_t for storing federated job information.
Added sib_msg_t for sending messages to siblings.
Added slurm_step_layout_req_t to improve slurmctld control over task layout.
Removed members from the following struct definitions
=====================================================
Removed schedport from slurm_ctl_conf.
Removed schedrootfltr from slurm_ctl_conf.
Changed the following enums and #defines
========================================
Added BB_STATE_PRE_RUN for burst buffer state information
Added CR_NHC_ABSOLUTELY_NO to disable Cray Node Health Check
Added DEBUG_FLAG_FEDR flag for federation debugging.
Added cluster_fed_states enum and defines for federation states.
Changed DEFAULT_MAX_JOB_ID from 0x7fff0000 to 0x03ff0000.
Added the following job state flags: JOB_REVOKED (to indicate that federated
job started on another cluster), JOB_REQUEUE_FED, JOB_POWER_UP_NODE
and JOB_RECONFIG_FAIL.
Added MAX_JOB_ID (0x03FFFFFF).
Added MEM_BIND_PREFER and MEM_BIND_SORT memory binding options.
Changed MEM_PER_CPU flag to 0x8000000000000000 from 0x80000000.
Added NO_VAL16 (0xfffe).
Changed NODE_STATE_FLAGS to 0xfffffff0 from 0xfff0.
Added NODE_STATE_REBOOT to request node reboot.
Added the following job options: NODE_MEM_CALC, NODE_REBOOT, SPREAD_JOB and
USE_MIN_NODES.
Added PROLOG_FLAG_SERIAL to serially execute prolog/epilog.
Added SELECT_NODEDATA_TRES_ALLOC_FMT_STR to select_nodedata_type.
Added SELECT_NODEDATA_TRES_ALLOC_WEIGHTED to select_nodedata_type.
Added SHOW_FED_TRACK to show only federated jobs.
Added SLURM_MSG_KEEP_BUFFER msg flag to instruct slurm_receive_msg() to save the
buffer ptr.
Added SLURM_PENDING_STEP (step for external process container)
Added TRIGGER_RES_TYPE_OTHER for burst buffer event trigger
Added TRIGGER_TYPE_BURST_BUFFER for burst buffer event trigger
Added the following job wait reasons: WAIT_PART_CONFIG, WAIT_ACCOUNT_POLICY,
WAIT_FED_JOB_LOCK.
Added the following API's
=========================
Added slurm_load_federation() to retrieve federation info from cluster.
Added slurm_print_federation() to print federation info retrieved from cluster.
Added slurm_get_priority_flags() to retrieve priority flags from slurmctld_conf.
Added slurm_populate_node_partitions() to identify all partitions associated
with nodes.
Changed the following API's
============================
Removed the following API's
===========================
Removed slurm_get_sched_port().
Removed slurm_get_root_filter().