
Commit

merge conflicts
wtripp180901 committed Mar 4, 2025
2 parents bdd265a + ede561f commit 4d6dee0
Showing 89 changed files with 1,402 additions and 553 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/stackhpc.yml
@@ -178,11 +178,11 @@ jobs:
ansible-playbook -v ansible/site.yml
ansible-playbook -v ansible/ci/check_slurm.yml
- name: Test reimage of compute nodes and compute-init (via rebuild adhoc)
- name: Test compute node reboot and compute-init
run: |
. venv/bin/activate
. environments/.stackhpc/activate
ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
ansible-playbook -v ansible/ci/check_slurm.yml
- name: Check sacct state survived reimage
28 changes: 18 additions & 10 deletions README.md
@@ -31,8 +31,7 @@ It requires an OpenStack cloud, and an Ansible "deploy host" with access to that

Before starting ensure that:
- You have root access on the deploy host.
- You can create instances using a Rocky 9 GenericCloud image (or an image based on that).
- **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI.
- You can create instances from the [latest Slurm appliance image](https://github.com/stackhpc/ansible-slurm-appliance/releases), which already contains the required packages. This is built and tested in StackHPC's CI. Although you can use a Rocky Linux 9 GenericCloud instead, it is not recommended.
- You have an SSH keypair defined in OpenStack, with the private key available on the deploy host.
- Created instances have access to the internet (note proxies can be set up through the appliance if necessary).
- Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance).
@@ -82,30 +81,39 @@ And generate secrets for it:

Create an OpenTofu variables file to define the required infrastructure, e.g.:

# environments/$ENV/terraform/terraform.tfvars:
# environments/$ENV/tofu/terraform.tfvars:

cluster_name = "mycluster"
cluster_net = "some_network" # *
cluster_subnet = "some_subnet" # *
cluster_networks = [
{
network = "some_network" # *
subnet = "some_subnet" # *
}
]
key_pair = "my_key" # *
control_node_flavor = "some_flavor_name"
login_nodes = {
login-0: "login_flavor_name"
login = {
# Arbitrary group name for these login nodes
interactive = {
nodes: ["login-0"]
flavor: "login_flavor_name" # *
}
}
cluster_image_id = "rocky_linux_9_image_uuid"
compute = {
# Group name used for compute node partition definition
general = {
nodes: ["compute-0", "compute-1"]
flavor: "compute_flavor_name"
flavor: "compute_flavor_name" # *
}
}

Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/terraform/terraform.tfvars`.
Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables and descriptions see `environments/$ENV/tofu/variables.tf`.

To deploy this infrastructure, ensure the venv and the environment are [activated](#create-a-new-environment) and run:

export OS_CLOUD=openstack
cd environments/$ENV/terraform/
cd environments/$ENV/tofu/
tofu init
tofu apply

4 changes: 4 additions & 0 deletions ansible/.gitignore
@@ -32,6 +32,8 @@ roles/*
!roles/mysql/**
!roles/systemd/
!roles/systemd/**
!roles/cacerts/
!roles/cacerts/**
!roles/cuda/
!roles/cuda/**
!roles/freeipa/
@@ -82,3 +84,5 @@ roles/*
!roles/slurm_stats/**
!roles/pytools/
!roles/pytools/**
!roles/rebuild/
!roles/rebuild/**
24 changes: 24 additions & 0 deletions ansible/adhoc/reboot_via_slurm.yml
@@ -0,0 +1,24 @@
# Reboot compute nodes via slurm. Nodes will be rebuilt if `image_id` in inventory is different to the currently-provisioned image.
# Example:
# ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml

- hosts: login
run_once: true
become: yes
gather_facts: no
tasks:
- name: Submit a Slurm job to reboot compute nodes
ansible.builtin.shell: |
set -e
srun --reboot -N 2 uptime
become_user: root
register: slurm_result
failed_when: slurm_result.rc != 0

- name: Fetch Slurm controller logs if reboot fails
ansible.builtin.shell: |
journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
become_user: root
register: slurm_logs
when: slurm_result.rc != 0
delegate_to: "{{ groups['control'] | first }}"
48 changes: 40 additions & 8 deletions ansible/bootstrap.yml
@@ -52,6 +52,13 @@
- import_role:
name: proxy

- hosts: chrony
tags: chrony
become: yes
tasks:
- import_role:
name: mrlesmithjr.chrony

- hosts: cluster
gather_facts: false
become: yes
@@ -126,22 +133,46 @@
ansible.builtin.assert:
that: dnf_repos_password is undefined
fail_msg: Passwords should not be templated into repofiles during configure, unset 'dnf_repos_password'
when: appliances_mode == 'configure'
- name: Replace system repos with pulp repos
ansible.builtin.include_role:
name: dnf_repos
tasks_from: set_repos.yml
when:
- appliances_mode == 'configure'
- not (dnf_repos_allow_insecure_creds | default(false)) # useful for development

- hosts: cacerts:!builder
tags: cacerts
gather_facts: false
tasks:
- name: Install custom cacerts
import_role:
name: cacerts

# --- tasks after here require access to package repos ---
- hosts: squid
tags: squid
gather_facts: yes
become: yes
tasks:
# - Installing squid requires working dnf repos
# - Configuring dnf_repos itself requires working dnf repos to install epel
# - Hence do this on squid nodes first in case they are proxying others
- name: Replace system repos with pulp repos
ansible.builtin.include_role:
name: dnf_repos
tasks_from: set_repos.yml
when: "'dnf_repos' in group_names"
- name: Configure squid proxy
import_role:
name: squid

- hosts: dnf_repos
tags: dnf_repos
gather_facts: yes
become: yes
tasks:
- name: Replace system repos with pulp repos
ansible.builtin.include_role:
name: dnf_repos
tasks_from: set_repos.yml

# --- tasks after here require general access to package repos ---
- hosts: tuned
tags: tuned
gather_facts: yes
@@ -282,10 +313,11 @@
- include_role:
name: azimuth_cloud.image_utils.linux_ansible_init

- hosts: k3s
- hosts: k3s:&builder
become: yes
tags: k3s
tasks:
- ansible.builtin.include_role:
- name: Install k3s
ansible.builtin.include_role:
name: k3s
tasks_from: install.yml
2 changes: 1 addition & 1 deletion ansible/ci/check_slurm.yml
@@ -6,7 +6,7 @@
shell: 'sinfo --noheader --format="%N %P %a %l %D %t" | sort' # using --format ensures we control whitespace: Partition,partition_state,max_jobtime,num_nodes,node_state,node_name
register: sinfo
changed_when: false
until: not ("boot" in sinfo.stdout or "idle*" in sinfo.stdout)
until: not ("boot" in sinfo.stdout or "idle*" in sinfo.stdout or "down" in sinfo.stdout)
retries: 10
delay: 5
- name: Check nodes have expected slurm state
2 changes: 1 addition & 1 deletion ansible/ci/retrieve_inventory.yml
@@ -7,7 +7,7 @@
gather_facts: no
vars:
cluster_prefix: "{{ undef(hint='cluster_prefix must be defined') }}" # e.g. ci4005969475
ci_vars_file: "{{ appliances_environment_root + '/terraform/' + lookup('env', 'CI_CLOUD') }}.tfvars"
ci_vars_file: "{{ appliances_environment_root + '/tofu/' + lookup('env', 'CI_CLOUD') }}.tfvars"
cluster_network: "{{ lookup('ansible.builtin.ini', 'cluster_net', file=ci_vars_file, type='properties') | trim('\"') }}"
tasks:
- name: Get control host IP
20 changes: 20 additions & 0 deletions ansible/extras.yml
@@ -1,3 +1,23 @@
- hosts: k3s_server:!builder
become: yes
tags: k3s
tasks:
- name: Start k3s server
ansible.builtin.include_role:
name: k3s
tasks_from: server-runtime.yml

# technically should be part of bootstrap.yml but hangs waiting on failed mounts
# if run before filesystems.yml after the control node has been reimaged
- hosts: k3s_agent:!builder
become: yes
tags: k3s
tasks:
- name: Start k3s agents
ansible.builtin.include_role:
name: k3s
tasks_from: agent-runtime.yml

- hosts: basic_users:!builder
become: yes
tags:
15 changes: 12 additions & 3 deletions ansible/roles/basic_users/README.md
@@ -2,16 +2,19 @@
basic_users
===========

Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e. without requiring LDAP etc. Features:
Setup users on cluster nodes using `/etc/passwd` and manipulating `$HOME`, i.e.
without requiring LDAP etc. Features:
- UID/GID is consistent across cluster (and explicitly defined).
- SSH key generated and propagated to all nodes to allow login between cluster nodes.
- An "external" SSH key can be added to allow login from elsewhere.
- Login to the control node is prevented.
- Login to the control node is prevented (by default).
- When deleting users, systemd user sessions are terminated first.

Requirements
------------
- $HOME (for normal users, i.e. not `centos`) is assumed to be on a shared filesystem.
- `$HOME` (for normal users, i.e. not `rocky`) is assumed to be on a shared
filesystem. Actions affecting that shared filesystem are run on a single host,
see `basic_users_manage_homedir` below.

Role Variables
--------------
@@ -22,9 +25,15 @@ Role Variables
- `shell` if *not* set will be `/sbin/nologin` on the `control` node and the default shell on other nodes. Explicitly setting this defines the shell for all nodes.
- An additional key `public_key` may optionally be specified to define a key to log into the cluster.
- An additional key `sudo` may optionally be specified giving a string (possibly multiline) defining sudo rules to be templated.
- `ssh_key_type` defaults to `ed25519` instead of the `ansible.builtin.user` default of `rsa`.
- Any other keys may be present for other purposes (i.e. not used by this role).
- `basic_users_groups`: Optional, default empty list. A list of mappings defining information for each group. Mapping keys/values are passed through as parameters to [ansible.builtin.group](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/group_module.html) and default values are as given there.
- `basic_users_override_sssd`: Optional bool, default false. Whether to disable `sssd` when ensuring users/groups exist with this role. Permits creating local users/groups even if they clash with users provided via sssd (e.g. from LDAP). Ignored if host is not in group `sssd` as well. Note with this option active `sssd` will be stopped and restarted each time this role is run.
- `basic_users_manage_homedir`: Optional bool, must be true on a single host to
determine which host runs tasks affecting the shared filesystem. The default
is to use the first play host which is not the control node, because the
default NFS configuration does not have the shared `/home` directory mounted
on the control node.
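
A minimal sketch of how these variables might be set, e.g. in an environment's group vars for the `basic_users` group. The group, user, uid/gid and key below are illustrative assumptions, not role defaults:

    basic_users_groups:
      - name: demo                  # passed through to ansible.builtin.group
        gid: 2001
    basic_users_users:
      - name: alice                 # passed through to ansible.builtin.user
        uid: 2001
        group: demo
        public_key: "ssh-ed25519 AAAA...example alice@laptop"  # optional external login key
        sudo: "alice ALL=(ALL) NOPASSWD: ALL"                   # optional sudo rule, templated verbatim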

Dependencies
------------
3 changes: 2 additions & 1 deletion ansible/roles/basic_users/defaults/main.yml
@@ -1,9 +1,10 @@
basic_users_manage_homedir: "{{ (ansible_hostname == (ansible_play_hosts | first)) }}"
basic_users_manage_homedir: "{{ ansible_hostname == (ansible_play_hosts | difference(groups['control']) | first) }}"
basic_users_userdefaults:
state: present
create_home: "{{ basic_users_manage_homedir }}"
generate_ssh_key: "{{ basic_users_manage_homedir }}"
ssh_key_comment: "{{ item.name }}"
ssh_key_type: ed25519
shell: "{{'/sbin/nologin' if 'control' in group_names else omit }}"
basic_users_users: []
basic_users_groups: []
6 changes: 3 additions & 3 deletions ansible/roles/basic_users/tasks/main.yml
@@ -46,21 +46,21 @@
- item.state | default('present') == 'present'
- item.public_key is defined
- basic_users_manage_homedir
run_once: true

- name: Write generated public key as authorized for SSH access
# this only runs on the basic_users_manage_homedir host so it has the registered var
# from that host too
authorized_key:
user: "{{ item.name }}"
state: present
manage_dir: no
key: "{{ item.ssh_public_key }}"
loop: "{{ hostvars[ansible_play_hosts | first].basic_users_info.results }}"
loop: "{{ basic_users_info.results }}"
loop_control:
label: "{{ item.name }}"
when:
- item.ssh_public_key is defined
- basic_users_manage_homedir
run_once: true

- name: Write sudo rules
blockinfile:
2 changes: 1 addition & 1 deletion ansible/roles/block_devices/README.md
@@ -11,7 +11,7 @@ This is a convenience wrapper around the ansible modules:

To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.

**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/terraform/control.userdata.tpl`.
**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/tofu/control.userdata.tpl`.

[^1]: See `environments/common/inventory/group_vars/builder/defaults.yml`
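
As a rough illustration of the cloud-init alternative (a minimal sketch: the device name, label and mount point are assumptions, not taken from the appliance templates):

    #cloud-config
    # Format an attached volume and mount it by label rather than by device name.
    fs_setup:
      - label: state
        filesystem: ext4
        device: /dev/vdb      # assumed device node for the attached volume
        partition: auto
    mounts:
      - [LABEL=state, /var/lib/state, auto, "defaults,nofail", "0", "2"]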

3 changes: 3 additions & 0 deletions ansible/roles/cacerts/defaults/main.yml
@@ -0,0 +1,3 @@
#cacerts_dest_dir: /etc/pki/ca-trust/source/anchors/
cacerts_cert_dir: "{{ appliances_environment_root }}/cacerts"
cacerts_update: true
16 changes: 16 additions & 0 deletions ansible/roles/cacerts/tasks/configure.yml
@@ -0,0 +1,16 @@
---

- name: Copy all certificates
copy:
src: "{{ item }}"
dest: /etc/pki/ca-trust/source/anchors/
owner: root
group: root
mode: 0644
with_fileglob:
- "{{ cacerts_cert_dir }}/*"
become: true

- name: Update trust store
command: update-ca-trust extract
become: true
11 changes: 11 additions & 0 deletions ansible/roles/cacerts/tasks/export.yml
@@ -0,0 +1,11 @@
- name: Copy cacerts from deploy host to /exports/cluster/cacerts/
copy:
src: "{{ item }}"
dest: /exports/cluster/cacerts/
owner: root
group: root
mode: 0644
with_fileglob:
- "{{ cacerts_cert_dir }}/*"
delegate_to: "{{ groups['control'] | first }}"
run_once: true
1 change: 1 addition & 0 deletions ansible/roles/cacerts/tasks/main.yml
@@ -0,0 +1 @@
- import_tasks: configure.yml