Add illumos FMD ZFS logic to ZED -- phase 2 (WIP) #5343
Conversation
@don-brady, thanks for your PR! By analyzing the history of the files in this pull request, we identified @dun, @behlendorf and @tonyhutter to be potential reviewers.
Nice work. Questions and comments inline.
```diff
 	$(top_builddir)/lib/libzpool/libzpool.la \
 	$(top_builddir)/lib/libzfs/libzfs.la \
-	$(top_builddir)/lib/libzfs_core/libzfs_core.la
+	$(top_builddir)/lib/libzfs_core/libzfs_core.la \
+	-lrt
```
The `-lrt` flag is for the POSIX timer support, right? If it's only used by the zed, I believe it should be added as `zed_LDFLAGS = -lrt`.
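For reference, the suggestion would look roughly like this in the zed `Makefile.am` (a sketch only; the library list is taken from the hunk above, and the exact variable layout in the real Makefile may differ):

```makefile
# Libraries zed links against
zed_LDADD = \
	$(top_builddir)/lib/libzpool/libzpool.la \
	$(top_builddir)/lib/libzfs/libzfs.la \
	$(top_builddir)/lib/libzfs_core/libzfs_core.la

# POSIX timer support (timer_create/timer_settime) lives in librt,
# so link it only into the zed binary rather than every consumer
zed_LDFLAGS = -lrt
```

Automake also accepts `-lrt` in `zed_LDADD`; either way it stays scoped to the zed program.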
Got it. Thanks.
> ereports and feeds them into a SERD engine which will generate a
> corresponding fault diagnosis when the SERD engine fires. The initial
> N and T values for the SERD engines are estimates inherited from
> illumos (10 errors in 10 minutes).
Are these values tunable? If so how?
Not directly yet. As stated in the README this was slated for Phase 3.
> module registration, memory allocation, module property accessors, basic
> case management, one-shot timers and SERD engines.
> For detailed information on the FMD module API, see the document --
> _"Fault Management Daemon Programmer's Reference Manual"_.
I had no idea this even existed, let alone was so extensive.
I see some results when Googling for this. Is there a canonical source such that you could list a URL here?
Not aware of a stable/canonical source. It was originally hosted on opensolaris.org.
```c
/*
 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
 *
```
Unneeded extra empty line.
fixed
```c
 *
 * In the ZED runtime, the modules are called from a single thread so no
 * locking is required in this emulated FMD environment.
 */
```
This comment should probably be above the includes.
fixed
```c
	    "ereport time %lld.%lld, pool load time = %lld.%lld",
	    pool_guid, er_when.ertv_sec, er_when.ertv_nsec,
	    pool_load.ertv_sec, pool_load.ertv_nsec);
#endif
```
Why doesn't this work on Linux? What bit of infrastructure are we missing?
We originally didn't have a time value for `er_when`. We do now. Also, these log messages should be limited to the rare case where we are skipping an event because it occurred before the pool was imported. I have moved them accordingly, and they are no longer excluded.
```c
	{ "io_T", FMD_TYPE_TIME, "10min" },
	{ "remove_timeout", FMD_TYPE_TIME, "15sec" },
	{ NULL, 0, NULL }
};
```
These values could be initialized via command line options to the zed.
Yep. Property config is slated for the next phase.
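Since `FMD_TYPE_TIME` property values are strings like `"10min"` and `"15sec"`, a command-line or config override would need a small parser. A minimal sketch of such a helper (hypothetical, not part of this patch; the real FMD property code supports more suffixes than shown here):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse an FMD_TYPE_TIME style string such as
 * "10min" or "15sec" into a number of seconds.  Returns -1 on a
 * malformed value.  Only a few suffixes are handled in this sketch.
 */
static long
parse_fmd_time(const char *s)
{
	char *end;
	long val = strtol(s, &end, 10);

	if (end == s || val < 0)
		return (-1);		/* no leading number */
	if (strcmp(end, "sec") == 0)
		return (val);
	if (strcmp(end, "min") == 0)
		return (val * 60);
	if (strcmp(end, "hour") == 0)
		return (val * 3600);
	return (-1);			/* unknown suffix */
}
```

With the defaults above, `parse_fmd_time("10min")` yields the 600-second window used by the I/O SERD engine.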
```c
{
	zcp->zc_data.zc_version = CASE_DATA_VERSION_SERD;
}
```
Is the intention to implement persistent records at a later date?
Yes, checkpointing of case data is a future feature. The current diagnosis state is just SERD values, so we're not losing much by not checkpointing. As the diagnosis/prediction model gets more complex, we may want to checkpoint it to avoid losing valuable state.
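A versioned case record like the one in this patch lends itself to flat-buffer checkpointing. The sketch below is hypothetical (the field names other than `zc_version`, and both helper functions, are invented for illustration); it shows why carrying a version field makes future checkpoint upgrades possible:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CASE_DATA_VERSION_SERD	2

/* Hypothetical on-disk case record; only zc_version is from the patch. */
typedef struct zfs_case_data {
	uint32_t	zc_version;
	uint64_t	zc_pool_guid;
	uint64_t	zc_vdev_guid;
	uint32_t	zc_serd_count;	/* events recorded so far */
} zfs_case_data_t;

/* Copy the case record into a checkpoint buffer; 0 means no room. */
static size_t
case_checkpoint(const zfs_case_data_t *zcp, void *buf, size_t len)
{
	if (len < sizeof (*zcp))
		return (0);
	memcpy(buf, zcp, sizeof (*zcp));
	return (sizeof (*zcp));
}

/* Restore a case record, rejecting unknown future versions. */
static int
case_restore(zfs_case_data_t *zcp, const void *buf, size_t len)
{
	if (len < sizeof (*zcp))
		return (-1);
	memcpy(zcp, buf, sizeof (*zcp));
	if (zcp->zc_version > CASE_DATA_VERSION_SERD)
		return (-1);
	return (0);
}
```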
```diff
@@ -1 +0,0 @@
-io-spare.sh
```
These scripts will also need to be removed from the Makefile.am.
Good catch. Fixed.
```diff
-	zed_log_msg(LOG_INFO, "zed_udev_monitor: %s waiting "
-	    "for slice", udev_device_get_devnode(dev));
+	// zed_log_msg(LOG_INFO, "zed_udev_monitor: %s waiting "
+	//     "for slice", udev_device_get_devnode(dev));
```
Did you intend to comment this out?
I removed it. It was just a noisy log entry.
> ### ZFS Fault Management Overview ###
>
> The primary purpose with ZFS fault management is automated diagnosis
> and isolation of VDEV faults. A Fault is something we can associate
Is Fault intentionally capitalized here?
A typo, will fix in next commit.
> 3. A _Disk Add Agent_ module (`agents/zfs_mod.c`)
>
> To begin with, a **Diagnosis Engine** consumes per-vdev I/O and checksum
> ereports and feeds them into a SERD engine which will generate a
What is "SERD"? I actually don't know, but also should it be expanded here?
I can add a basic definition.
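For readers in the same position: SERD stands for Soft Error Rate Discrimination. A SERD engine records error events and "fires" once N events have been seen within a sliding time window T (e.g. 10 I/O errors in 10 minutes). A toy standalone sketch of the idea, not the actual `fmd_serd.c` implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SERD_MAX_EVENTS	64

/*
 * Toy SERD engine: fires once N events have been recorded within a
 * sliding window of T seconds.
 */
typedef struct serd_eng {
	double	se_times[SERD_MAX_EVENTS];	/* timestamps in window */
	size_t	se_count;			/* events in window */
	size_t	se_n;				/* threshold N */
	double	se_t;				/* window length T, seconds */
	bool	se_fired;
} serd_eng_t;

static void
serd_init(serd_eng_t *sp, size_t n, double t)
{
	sp->se_count = 0;
	sp->se_n = n;
	sp->se_t = t;
	sp->se_fired = false;
}

/* Record an event at time 'now'; returns true if the engine has fired. */
static bool
serd_record(serd_eng_t *sp, double now)
{
	size_t i, kept = 0;

	/* Expire events that have aged out of the window. */
	for (i = 0; i < sp->se_count; i++) {
		if (now - sp->se_times[i] <= sp->se_t)
			sp->se_times[kept++] = sp->se_times[i];
	}
	sp->se_count = kept;

	if (sp->se_count < SERD_MAX_EVENTS)
		sp->se_times[sp->se_count++] = now;

	if (sp->se_count >= sp->se_n)
		sp->se_fired = true;

	return (sp->se_fired);
}
```

With N = 10 and T = 600 seconds this mirrors the "10 errors in 10 minutes" default inherited from illumos.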
```c
 * function can also be used in combination with fmd_serd_eng_record().
 */
if (sgp->sg_flags & FMD_SERD_FIRED) {
#if 1
```
Is this #if 1 leftover?
```c
}

/*
 * sourced from fmd_string.c
```
There are a couple of these "sourced from" comments. Is this copy-and-paste of some code? Should it be done differently?
It's code from the FMD runtime, under the same license and copyright. It's just a consolidation. For example: https://github.com/openzfs/openzfs/tree/master/usr/src/cmd/fm/fmd/common
```c
/*
 * There is no way to open a pool by GUID, or lookup a vdev by GUID. No
 * matter what we do, we're going to have to stomach a O(vdevs * cases)
```
"a" should be "an".
Typo, can fix in next commit.
```c
 * The ZFS retire agent is responsible for managing hot spares across all pools.
 * When we see a device fault or a device removal, we try to open the associated
 * pool and look for any hot spares. We iterate over any available hot spares
 * and attempt a 'zpool replace' for each one.
```
Policy for which ones to try (or the order) is for later?
Policy in this phase is the same as upstream (i.e., very basic). Subsequent patches will improve the hot-spare matching criteria (e.g., matching virtual spares for dRAID, or matching by size and device type).
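As an illustration of one of those criteria, a size-based selection helper might look like the following sketch (hypothetical types and function names, not from this patch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical size-aware hot-spare selection: among the spares large
 * enough to replace the faulted vdev, prefer the smallest (best fit)
 * so that big spares stay available for big vdevs.
 */
typedef struct spare {
	const char		*s_name;
	unsigned long long	s_size;		/* capacity in bytes */
} spare_t;

static const spare_t *
pick_spare(const spare_t *spares, size_t nspares, unsigned long long need)
{
	const spare_t *best = NULL;
	size_t i;

	for (i = 0; i < nspares; i++) {
		if (spares[i].s_size < need)
			continue;	/* too small to replace the vdev */
		if (best == NULL || spares[i].s_size < best->s_size)
			best = &spares[i];
	}
	return (best);		/* NULL if no spare fits */
}
```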
```c
fmd_module_t *mp = (fmd_module_t *)hdl;

if (fmd_serd_eng_lookup(&mp->mod_serds, name) != NULL) {
	zed_log_msg(LOG_ERR, "failed to create serd engine '%s': "
```
Should these error messages use all-caps "SERD" for consistency?
Hey this is UNIX/Linux. But point taken for being consistent.
My comments here are pretty much all trivial proofreading stuff. If you think that's a waste of time, feel free to ignore them.
@rlaager Thanks for taking time to review. All input is welcomed here. We also need to update the man page for ZED. I'll take a stab at it in the next commit and would welcome your input for that as well.
The phase 2 work primarily entails the Diagnosis Engine and the Retire Agent modules. It also includes infrastructure to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes I/O and checksum ereports and feeds them into a SERD engine which will generate a corresponding fault diagnosis when the SERD engine fires. All the diagnosis state data is collected into cases, one case per vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating the faulty VDEV. It will notify the ZFS kernel module of the new VDEV state (degraded or faulted). This agent is also responsible for managing hot spares across pools. When it encounters a device fault or a device removal it replaces the device with an appropriate spare if available.
I've been testing this on our JBODs and haven't hit any major issues. I tested auto-online, auto-replace, and spares on a multipath pool and it worked as advertised. LGTM
The phase 2 work primarily entails the Diagnosis Engine and the Retire Agent modules. It also includes infrastructure to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes I/O and checksum ereports and feeds them into a SERD engine which will generate a corresponding fault diagnosis when the SERD engine fires. All the diagnosis state data is collected into cases, one case per vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating the faulty VDEV. It will notify the ZFS kernel module of the new VDEV state (degraded or faulted). This agent is also responsible for managing hot spares across pools. When it encounters a device fault or a device removal it replaces the device with an appropriate spare if available.

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#5343
### Fault Management Logic for ZED ###

The integration of Fault Management Daemon (FMD) logic from illumos
is being deployed in three phases. This logic is encapsulated in several
software modules inside ZED. This phase 2 portion primarily entails the
Diagnosis Engine and the Retire Agent modules. It also includes
infrastructure to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes per-vdev I/O and checksum ereports and
feeds them into a SERD engine which will generate a corresponding
fault diagnosis when the SERD engine fires. All the diagnosis state
data is collected into cases, one case per vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating the
faulty VDEV. It will notify the ZFS kernel module of the new VDEV
state (degraded or faulted). This agent is also responsible for
managing hot spares across pools. When it encounters a device
fault or a device removal it replaces the device with an appropriate
spare if available.
### Implementation Notes ###

* An emulated FMD environment is provided in the `fmd_api.c` and
  `fmd_serd.c` source files. This support includes module registration,
  memory allocation, module property accessors, basic case management,
  one-shot timers and SERD engines.

* The event subscriptions for the modules (located in a module-specific
  configuration file on illumos) are currently hard-coded into the ZED
  `zfs_agent_dispatch()` function.

* The FMD modules are called from a single thread that consumes events
  queued to the modules. These events are sourced from the normal ZED
  events and also include events posted from the diagnosis engine and
  the libudev disk event monitor.
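The single-threaded delivery of queued events to the modules can be pictured as a simple loop that hands each event to every registered module in turn (a hypothetical sketch, not the actual ZED code; this is why the modules themselves need no locking):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MODULES	4

typedef void (*module_recv_t)(const char *event_class, void *arg);

/* Hypothetical single-threaded event dispatcher. */
typedef struct dispatcher {
	module_recv_t	d_recv[MAX_MODULES];
	void		*d_arg[MAX_MODULES];
	size_t		d_nmodules;
} dispatcher_t;

/* Deliver one event to all modules, one module at a time. */
static void
dispatch_event(dispatcher_t *dp, const char *event_class)
{
	size_t i;

	for (i = 0; i < dp->d_nmodules; i++)
		dp->d_recv[i](event_class, dp->d_arg[i]);
}

/* Demo callback that just counts deliveries. */
static void
count_events(const char *event_class, void *arg)
{
	(void) event_class;
	(*(int *)arg)++;
}
```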
* The ported modules are intentionally kept as similar as possible to
  their upstream source files.

* The illumos sysevent class names differ from those used on Linux; for
  example, `"resource.sysevent.EC_zfs.ESC_ZFS_vdev_remove"` on illumos
  corresponds to `"sysevent.fs.zfs.vdev_remove"` in ZED.
This work was performed under contract number B609815 between the U.S.
Department of Energy (DOE) and Intel Federal, LLC.