
[sled-agent] Disk Detection, Partition Management, and U.2 formatting #2176

Merged: 43 commits from detect-disks into main, Jan 30, 2023

Conversation

smklein (Collaborator) commented on Jan 17, 2023:

Features

Enables the Sled Agent to identify physical disks, format zpools on U.2 devices, and use those zpools as allocation targets for datasets within Nexus.

For anyone familiar with the hard-coded list of zpools in sled agent's config.toml, these hook into the same mechanism, but come from real devices. As such, specifying them ahead-of-time is no longer required.

Mechanism

Implements the mechanism described in RFD 352 to detect disks, partitions, and the zpools within them.

  • Uses illumos-devinfo to query libdevinfo for device information
  • Uses libefi-illumos to query libefi for partition information
  • Uses fstyp to bridge the gap from "partition path" to "zpool name".
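
As a rough illustration of the last step, here is a minimal sketch (not the PR's actual helper) of going from a partition path to a zpool name by shelling out to fstyp; the binary location and the `name: '...'` line in the `fstyp -a` output are assumptions rather than something taken from the PR:

```rust
// Hedged sketch: bridge "partition path" -> "zpool name" by shelling out to
// fstyp. The binary path and the `name: '...'` output line are assumptions.
use std::process::Command;

fn zpool_name_of_partition(partition_path: &str) -> Result<String, String> {
    let output = Command::new("/usr/sbin/fstyp")
        .arg("-a")
        .arg(partition_path)
        .output()
        .map_err(|e| format!("failed to run fstyp: {e}"))?;
    if !output.status.success() {
        return Err(format!("fstyp reported failure for {partition_path}"));
    }
    // We parse this programmatically, so reject invalid UTF-8 rather than
    // papering over it (see the from_utf8 discussion later in this review).
    let stdout = String::from_utf8(output.stdout)
        .map_err(|e| format!("fstyp emitted non-UTF-8 output: {e}"))?;
    stdout
        .lines()
        .filter_map(|line| {
            line.trim()
                .strip_prefix("name: '")
                .and_then(|rest| rest.strip_suffix('\''))
        })
        .map(str::to_string)
        .next()
        .ok_or_else(|| format!("no zpool name in fstyp output for {partition_path}"))
}
```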

Testing

Tested manually on gimlet-sn21 on the following device layout:

Slot    Device  Path
   0       u.2  /devices/pci@ab,0/pci1022,1483@1,1/pci1b96,0@0/blkdev@w0014EE81000BC153,0
   1       u.2  /devices/pci@ab,0/pci1022,1483@1,2/pci1b96,0@0/blkdev@w0014EE81000BC67A,0
   2       u.2  /devices/pci@ab,0/pci1022,1483@1,3/pci1b96,0@0/blkdev@w0014EE81000BC5F5,0
   3       u.2  /devices/pci@ab,0/pci1022,1483@1,4/pci1b96,0@0/blkdev@w0014EE81000BC5C5,0
   4       u.2  /devices/pci@0,0/pci1022,1483@3,4/pci1b96,0@0/blkdev@w0014EE81000BC5D4,0
   5       u.2  /devices/pci@0,0/pci1022,1483@3,3/pci1b96,0@0/blkdev@w0014EE81000BC5C9,0
   6       u.2  /devices/pci@0,0/pci1022,1483@3,2/pci1b96,0@0/blkdev@w0014EE81000BC5D2,0
   7       u.2  /devices/pci@38,0/pci1022,1483@1,3/pci1b96,0@0/blkdev@w0014EE81000BC5CB,0
  17       m.2  /devices/pci@0,0/pci1022,1483@1,3/pci1344,3100@0/blkdev@w00A07501340802D1,0
  • Used nvmeadm secure-erase to fully wipe U.2 devices, observed that the sled-agent detects and correctly formats them.
  • Used zpool destroy to wipe the zpool off the U.2 device while leaving the GPT partition, observed that the sled-agent correctly detects the partition and creates a new zpool on it.
  • Tested that in all cases Nexus is notified of the new zpools.

common/src/sql/dbinit.sql (review thread resolved; outdated)
smklein changed the title from "[sled-agent] Disk Detection and Partition Management" to "[sled-agent] Disk Detection, Partition Management, and U.2 formatting" on Jan 23, 2023
smklein marked this pull request as ready for review on January 23, 2023
smklein requested review from ahl, jclulow, and jgallagher on January 24, 2023
sled-agent/src/illumos/zpool.rs (three review threads resolved; outdated)
.ensure_using_exactly_these_disks(self.inner.hardware.disks())
.await
{
warn!(log, "Failed to add disk: {e}");
Contributor:

This warning might be misleading, because there are a bunch of possible failures from ensure_using_exactly_these_disks (including at least failure to remove a disk in addition to adding one). Is failure to ensure the disk set we expect just a warning? I'm early in the PR review so maybe this will become clear later, but - what are the ramifications of continuing to run if the set of disks we think we're using aren't the disks we're actually using?

smklein (Collaborator, PR author):

Failure to ensure any disks results in an error, not a warning. However, we still try to insert/remove everything else that needs to be changed.

I didn't want "one bad U.2" to cause a whole sled to eject itself, which is sorta the motivation behind "continuing when we've already seen an error anyway".
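
The shape described here, continuing past individual failures but still surfacing an error at the end, might look roughly like the following. This is a minimal generic sketch; the real StorageManager methods and types differ:

```rust
// Minimal sketch, not the PR's actual StorageManager code: apply every disk
// change we can, remember the first failure, and report it only after the
// whole pass, so one bad U.2 doesn't stop the rest from being processed.
fn ensure_disks<D, E>(
    desired: Vec<D>,
    mut apply: impl FnMut(D) -> Result<(), E>,
) -> Result<(), E> {
    let mut first_err = None;
    for disk in desired {
        if let Err(e) = apply(disk) {
            // Keep going; surface the first error once everything was attempted.
            first_err.get_or_insert(e);
        }
    }
    match first_err {
        Some(e) => Err(e),
        None => Ok(()),
    }
}
```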

sled-agent/src/storage_manager.rs (review thread resolved)
sled-agent/src/hardware/illumos/disk.rs (two review threads resolved; outdated)
sled-agent/src/hardware/illumos/mod.rs (review thread resolved)
sled-agent/src/hardware/illumos/mod.rs (two review threads resolved; outdated)
})
.map(|(key, _)| key);
for key in keys_to_be_removed {
if let Err(e) = self.delete_disk_locked(&mut disks, &key).await {
Contributor:

I don't think this is right - at this point disks is empty (because we swapped it on line 940), so delete_disk_locked won't do anything. Fixing this seems tricky. Playing with it a bit.

smklein (Collaborator, PR author):

My short-term workaround: I'm going back to my old mechanism of updating the set of disks in place.

This desperately needs to be refactored to have better unit tests, without the tight coupling to real hardware.
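
For context, one shape the in-place update could take (a hedged sketch with plain HashMap/String stand-ins rather than the PR's actual disk types):

```rust
// Hedged sketch of updating the disk map in place rather than swapping it
// out wholesale. HashMap/String are stand-ins, not the PR's real types.
use std::collections::HashMap;

fn update_disks_in_place(
    disks: &mut HashMap<String, String>, // key -> disk state (stand-in)
    desired: &HashMap<String, String>,
) {
    // Remove disks that are no longer present in the desired set.
    let stale: Vec<String> = disks
        .keys()
        .filter(|key| !desired.contains_key(*key))
        .cloned()
        .collect();
    for key in stale {
        disks.remove(&key);
        // ... the real code would also tear down associated state here ...
    }
    // Insert any newly-observed disks.
    for (key, state) in desired {
        disks.entry(key.clone()).or_insert_with(|| state.clone());
    }
}
```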

jclulow (Collaborator) left a comment:

Hi Sean!

With my apologies for the interminable delay, I have taken an initial look through this. I think the shape is about right on the Nexus/Omicron end. There are some things about the interaction with devinfo and NVMe and programs that we're running that I think might merit some changes.

Let me know if anything I've said needs more explanation. There are a lot of complex moving pieces which I am always happy to talk through if you'd like.

Thanks for picking this up, it's extremely critical!

sled-agent/src/hardware/illumos/mod.rs (review thread resolved)
let cmd = command.arg(FSTYP).arg("-a").arg(path);

let output = execute(cmd).map_err(Error::from)?;
let stdout = String::from_utf8_lossy(&output.stdout);
Collaborator:

I don't think we should use from_utf8_lossy() here, but rather from_utf8(), and propagate an error out if one should occur. We definitely don't expect an error at this moment.

In general I think we should use String::from_utf8() (and handle failures) on any stdout output that we intend to inspect programmatically. If you're just including stderr in a diagnostic message, then the lossy version seems fine (even preferable!).
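
A tiny sketch of the distinction being suggested (the surrounding error handling is a stand-in, not the PR's code):

```rust
// Hedged illustration of the suggestion above, with a stand-in error type.
use std::process::Output;
use std::string::FromUtf8Error;

// Output we intend to parse programmatically: fail loudly on invalid UTF-8.
fn parseable_stdout(output: &Output) -> Result<String, FromUtf8Error> {
    String::from_utf8(output.stdout.clone())
}

// Output destined only for a diagnostic message: lossy conversion is fine
// (even preferable), since a replacement character beats losing the message.
fn diagnostic_stderr(output: &Output) -> String {
    String::from_utf8_lossy(&output.stderr).into_owned()
}
```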

smklein (Collaborator, PR author):

Sure, sounds good. Updated.

Comment on lines +139 to +143
// TODO: If we see a completely empty M.2, should we create
// the expected partitions? Or would it be wiser to infer
// that this indicates an unexpected error condition that
// needs mitigation?
return Err(DiskError::CannotFormatM2NotImplemented);
Collaborator:

I'm not sure, but it feels like this will be in the purview of installinator, at least initially -- at least in the cases where manufacturing has failed to get it done.

smklein (Collaborator, PR author):

Yeah, that makes sense. I figured the sled agent could support it, but I'd rather just exit early for now, since it's not a priority.

Comment on lines 275 to 276
// For some reason, libdevfs doesn't prepend "/devices" when
// referring to the path, but it still returns an absolute path.
Collaborator:

This is because libdevinfo is working with the kernel-visible device node path. The devfs file system, which exposes this tree, is traditionally mounted at /devices -- but in theory it could also be mounted elsewhere, and is a construct of the user mode environment, etc.
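
In code terms, consumers end up doing roughly the following; a one-function hedged sketch, assuming the conventional /devices mount point:

```rust
// Hedged sketch: libdevinfo returns the kernel-visible device node path
// (e.g. "/pci@0,0/.../blkdev@w...,0"); joining it onto the devfs mount
// point, conventionally /devices, yields an openable path in user space.
fn devfs_path(node_path: &str) -> String {
    format!("/devices{node_path}")
}
```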

smklein (Collaborator, PR author):

Updated the comment with this detail

sled-agent/src/hardware/illumos/mod.rs (review thread resolved; outdated)
Self { tofino: TofinoSnapshot::new() }
// Walk the device tree to capture a view of the current hardware.
fn new(log: &Logger) -> Result<Self, Error> {
let mut device_info = DevInfo::new().map_err(Error::DevInfo)?;
Collaborator:

I think we should probably be doing DevInfo::new_force_load() here, rather than new(). How often is this polling being done? Perhaps we should do a force load occasionally, or just the first time we start up. I'm concerned that we might not see all of the disks if we don't, because they might not be attached when they are not already in use by a ZFS pool, etc.

smklein (Collaborator, PR author):

We are currently scanning the hardware every five seconds (though that polling choice is arbitrary). I figured we'd eventually switch to a sysevent-style interface, and only do a full scan if we missed an event, but for now, the "scanning devinfo" tries to do very little, and hand off work to a different tokio task.

I can update this to new_force_load anyway?
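
One possible middle ground, using only the two constructors already named in this thread (a sketch, not the PR's final code; the crate path in the import is assumed):

```rust
// Sketch of the compromise discussed above: force a full devinfo load on the
// first scan (or occasionally), and take the cheaper snapshot for the routine
// five-second polls. `Error::DevInfo` is the surrounding module's error
// variant shown in the snippet above.
use illumos_devinfo::DevInfo;

fn devinfo_snapshot(force_full_scan: bool) -> Result<DevInfo, Error> {
    if force_full_scan {
        // May attach devices that nothing currently holds open, e.g. disks
        // that are not yet part of any ZFS pool.
        DevInfo::new_force_load().map_err(Error::DevInfo)
    } else {
        DevInfo::new().map_err(Error::DevInfo)
    }
}
```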

sled-agent/src/hardware/illumos/mod.rs (review thread resolved)
sled-agent/src/illumos/fstyp.rs (review thread resolved; outdated)
sled-agent/src/illumos/zpool.rs (review thread resolved)
sled-agent/src/hardware/illumos/mod.rs (review thread resolved; outdated)
smklein (Collaborator, PR author) commented on Jan 30, 2023:

(The current CI failures are due to ghcr being down)

We pretty desperately need this PR in as the foundation for some follow-up work, so I'm going to merge. I'm still very open to follow-ups and additional feedback, but I'm prioritizing urgency in this particular case.

smklein merged commit 1925085 into main on Jan 30, 2023
smklein deleted the detect-disks branch on January 30, 2023
smklein added a commit that referenced this pull request Jan 31, 2023
Depends on #2176

- Restructure sled agent's "storage manager" to be better at reporting
notifications to Nexus
- Report notifications of disk attachment and removal to Nexus
- Create DB structures to represent physical disks
- Expose some information about those disks in the public API

TODO, potentially after this PR:
- [ ] Record a better history of "which disks have been attached to
which sleds"
- [ ] Make "Physical Disks" the parents of "Zpools", rather than their
current "parent" (which is just "sled").

Part of #2036
smklein added a commit that referenced this pull request Jan 31, 2023
Depends on a refactor made in
#2176

- Uses `libdevinfo` to grab `baseboard-*` fields about Gimlets
- Plumbs that information up to Nexus

Fixes #2211

TODO:
- [ ] Add tests
- [ ] Make it easier to "inject fake values" here, possibly via sled
agent configuration
- [ ] Actually index on these values in Nexus, rather than storing them
as arbitrary strings. E.g., we should probably ensure there aren't
duplicates across the fleet.