
Add illumos FMD ZFS logic to ZED -- phase 2 (WIP) #5343

Merged: merged 1 commit into openzfs:master, Nov 7, 2016

Conversation

don-brady (Contributor)

Fault Management Logic for ZED

The integration of Fault Management Daemon (FMD) logic from illumos
is being deployed in three phases. This logic is encapsulated in several
software modules inside ZED. This phase 2 portion primarily entails the
Diagnosis Engine and the Retire Agent modules. It also includes
infrastructure to support a crude FMD environment to host these modules.

  1. The Diagnosis Engine consumes I/O and checksum ereports and
    feeds them into a SERD engine, which generates a corresponding
    fault diagnosis when the SERD engine fires. All the diagnosis state
    data is collected into cases, one case per vdev being tracked
    (see the SERD sketch after this list).
  2. The Retire Agent responds to diagnosed faults by isolating the
    faulty VDEV. It will notify the ZFS kernel module of the new VDEV
    state (degraded or faulted). This agent is also responsible for
    managing hot spares across pools. When it encounters a device
    fault or a device removal, it replaces the device with an
    appropriate spare if available.
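
Since "SERD" is asked about later in this review, a rough illustration may help up front. SERD stands for Soft Error Rate Discrimination: an engine fires once N events are recorded within a sliding time window T. The following is a minimal self-contained sketch of that idea only, not the fmd_serd.c implementation; every name in it is hypothetical.

```c
/*
 * Minimal sketch of a SERD engine: fire when N events land within
 * a sliding window of T nanoseconds. Illustrative only; the real
 * logic lives in ZED's fmd_serd.c. All names are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef long long hrtime_t;

typedef struct serd_eng {
	unsigned int	sg_n;		/* event threshold (N) */
	hrtime_t	sg_t;		/* window length in ns (T) */
	unsigned int	sg_count;	/* events currently in window */
	hrtime_t	*sg_times;	/* event timestamps, oldest first */
} serd_eng_t;

static void
serd_init(serd_eng_t *sg, unsigned int n, hrtime_t t)
{
	memset(sg, 0, sizeof (*sg));
	sg->sg_n = n;
	sg->sg_t = t;
	sg->sg_times = calloc(n, sizeof (hrtime_t));
}

/* Record one event at time 'now'; return nonzero if the engine fires. */
static int
serd_record(serd_eng_t *sg, hrtime_t now)
{
	unsigned int i, live = 0;

	/* Slide the window: keep only events younger than T. */
	for (i = 0; i < sg->sg_count; i++) {
		if (now - sg->sg_times[i] < sg->sg_t)
			sg->sg_times[live++] = sg->sg_times[i];
	}
	sg->sg_count = live;

	if (sg->sg_count == sg->sg_n) {
		/* Window is full; drop the oldest to make room. */
		memmove(sg->sg_times, sg->sg_times + 1,
		    (sg->sg_n - 1) * sizeof (hrtime_t));
		sg->sg_count--;
	}
	sg->sg_times[sg->sg_count++] = now;

	return (sg->sg_count >= sg->sg_n);
}

int
main(void)
{
	const hrtime_t NSEC_PER_MIN = 60LL * 1000000000LL;
	serd_eng_t sg;
	int i;

	/* Defaults inherited from illumos: 10 errors in 10 minutes. */
	serd_init(&sg, 10, 10 * NSEC_PER_MIN);

	for (i = 0; i < 10; i++) {
		if (serd_record(&sg, (hrtime_t)i * NSEC_PER_MIN))
			printf("SERD engine fired on event %d\n", i + 1);
	}
	free(sg.sg_times);
	return (0);
}
```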

Implementation Notes

  • The FMD module API required for logic modules is emulated and implemented
    in the fmd_api.c and fmd_serd.c source files. This support includes
    module registration, memory allocation, module property accessors, basic
    case management, one-shot timers and SERD engines.
  • The event subscriptions for the modules (located in a module-specific
    configuration file on illumos) are currently hard-coded into the ZED
    zfs_agent_dispatch() function (see the dispatch sketch after this list).
  • The FMD modules are called one at a time from a single thread that
    consumes events queued to the modules. These events are sourced from
    the normal ZED events and also include events posted from the diagnosis
    engine and the libudev disk event monitor.
  • The FMD code modules have minimal changes and were intentionally left
    as similar as possible to their upstream source files.
  • The sysevent namespace in ZED differs from illumos. For example:
    • illumos uses "resource.sysevent.EC_zfs.ESC_ZFS_vdev_remove"
    • Linux uses "sysevent.fs.zfs.vdev_remove"
  • The FMD Modules port was produced by Intel Federal, LLC under award
    number B609815 between the U.S. Department of Energy (DOE) and Intel
    Federal, LLC.
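
To make the hard-coded subscription point above concrete, here is a hypothetical sketch of the pattern: match an event's class string against a static table and forward the event to each subscribed module. The table entries use real ZFS event class names, but the helper and type names are illustrative, not the actual zfs_agent_dispatch() code.

```c
/*
 * Hypothetical sketch of hard-coded event subscriptions in the
 * spirit of zfs_agent_dispatch(). On illumos these subscriptions
 * would come from per-module .conf files. Names are illustrative.
 */
#include <stdio.h>
#include <string.h>

typedef struct nvlist nvlist_t;	/* stand-in for libnvpair's nvlist_t */

typedef void (*agent_recv_f)(nvlist_t *, const char *);

static void
zfs_diagnosis_recv(nvlist_t *nvl, const char *class)
{
	(void) nvl;
	printf("diagnosis engine got %s\n", class);	/* feed SERD */
}

static void
zfs_retire_recv(nvlist_t *nvl, const char *class)
{
	(void) nvl;
	printf("retire agent got %s\n", class);		/* isolate vdev */
}

typedef struct agent_sub {
	const char	*as_class;	/* event class to match */
	agent_recv_f	as_recv;	/* subscribed module entry */
} agent_sub_t;

static const agent_sub_t agent_subs[] = {
	{ "ereport.fs.zfs.io",		 zfs_diagnosis_recv },
	{ "ereport.fs.zfs.checksum",	 zfs_diagnosis_recv },
	{ "sysevent.fs.zfs.vdev_remove", zfs_retire_recv },
	{ NULL,				 NULL }
};

static void
agent_dispatch(nvlist_t *nvl, const char *class)
{
	const agent_sub_t *sub;

	for (sub = agent_subs; sub->as_class != NULL; sub++) {
		if (strcmp(class, sub->as_class) == 0)
			sub->as_recv(nvl, class);
	}
}

int
main(void)
{
	agent_dispatch(NULL, "ereport.fs.zfs.checksum");
	return (0);
}
```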

@mention-bot

@don-brady, thanks for your PR! By analyzing the history of the files in this pull request, we identified @dun, @behlendorf and @tonyhutter to be potential reviewers.

@behlendorf (Contributor) left a comment

Nice work. Questions and comments inline.

$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzfs_core/libzfs_core.la \
-lrt

Contributor:

The -lrt flags are for the POSIX timer support, right? If they're only used by the zed, I believe they should be added as zed_LDFLAGS = -lrt.

Contributor Author:

Got it. Thanks.

ereports and feeds them into a SERD engine which will generate a
corresponding fault diagnosis when the SERD engine fires. The initial
N and T values for the SERD engines are estimates inherited from
illumos (10 errors in 10 minutes).

Contributor:

Are these values tunable? If so, how?

Contributor Author:

Not directly yet. As stated in the README, this was slated for Phase 3.

module registration, memory allocation, module property accessors, basic
case management, one-shot timers and SERD engines.
For detailed information on the FMD module API, see the document --
_"Fault Management Daemon Programmer's Reference Manual"_.

Contributor:

I had no idea this even existed, let alone was so extensive.

Member:

I see some results when Googling for this. Is there a canonical source such that you could list a URL here?

Contributor Author:

Not aware of a stable/canonical source. It was originally hosted on opensolaris.org.


/*
* Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
*

Contributor:

Unneeded extra empty line.

Contributor Author:

fixed

*
* In the ZED runtime, the modules are called from a single thread so no
* locking is required in this emulated FMD environment.
*/

Contributor:

This comment should probably be above the includes.

Contributor Author:

fixed

"ereport time %lld.%lld, pool load time = %lld.%lld",
pool_guid, er_when.ertv_sec, er_when.ertv_nsec,
pool_load.ertv_sec, pool_load.ertv_nsec);
#endif

Contributor:

Why doesn't this work on Linux? What bit of infrastructure are we missing?

Contributor Author:

We originally did not have a time value for er_when; we do now. Also, these log messages should be limited to the rare case where we are skipping an event because it occurred before the pool was imported. I have moved them accordingly, and they are no longer excluded.
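
A small sketch of the staleness check described in this reply, using the er_timeval_t fields visible in the snippet above; the struct layout and helper name are assumptions for illustration:

```c
/*
 * Sketch of the staleness check: an ereport whose timestamp
 * predates the pool's load time is skipped, and only that rare
 * case is logged. Struct layout and helper name are assumed.
 */
#include <stdint.h>

typedef struct er_timeval {
	uint64_t	ertv_sec;
	uint64_t	ertv_nsec;
} er_timeval_t;

/* Return nonzero if time 'a' is strictly earlier than time 'b'. */
static int
timeval_earlier(const er_timeval_t *a, const er_timeval_t *b)
{
	return (a->ertv_sec < b->ertv_sec ||
	    (a->ertv_sec == b->ertv_sec && a->ertv_nsec < b->ertv_nsec));
}

/*
 * Usage in the diagnosis path (pseudocode):
 *   if (timeval_earlier(&er_when, &pool_load))
 *       log the skip and drop the stale ereport;
 */
```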

{ "io_T", FMD_TYPE_TIME, "10min" },
{ "remove_timeout", FMD_TYPE_TIME, "15sec" },
{ NULL, 0, NULL }
};

Contributor:

These values could be initialized via command-line options to the zed.

Contributor Author:

Yep. Property config is slated for the next phase.
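
For context while the property plumbing lands, a sketch of how a module might latch these values through the accessors emulated in fmd_api.c and arm a SERD engine. The helper name is hypothetical, and the assumption that FMD_TYPE_TIME values come back in nanoseconds follows the illumos FMD module API:

```c
/*
 * Sketch: latch the SERD parameters above via the property
 * accessors emulated in fmd_api.c, then arm a SERD engine.
 * zfs_case_serd_arm() is a hypothetical helper; the nanosecond
 * unit for "io_T" is assumed from the illumos FMD module API.
 */
#include <stdint.h>
#include "fmd_api.h"

static void
zfs_case_serd_arm(fmd_hdl_t *hdl, const char *serd_name)
{
	uint32_t n = fmd_prop_get_int32(hdl, "io_N");	/* e.g. 10 */
	int64_t t = fmd_prop_get_int64(hdl, "io_T");	/* "10min" in ns */

	if (!fmd_serd_exists(hdl, serd_name))
		fmd_serd_create(hdl, serd_name, n, t);
}
```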

{
zcp->zc_data.zc_version = CASE_DATA_VERSION_SERD;
}

Contributor:

Is the intention to implement persistent records at a later date?

Contributor Author:

Yes, checkpointing of case data is a future feature. The current diagnosis state is just SERD values, so we are not losing much by not checkpointing. As the diagnosis/prediction model gets more complex, we may want to checkpoint it to avoid losing valuable state.
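
To make the checkpoint discussion concrete, a guess at the shape of the state involved. Only zc_version appears in the snippet above; the other fields are assumptions about what SERD-era diagnosis state would need to capture:

```c
/*
 * Illustrative case-data layout for a future checkpoint. Only
 * zc_version is taken from the snippet above; the other fields
 * are assumptions about what SERD-era diagnosis state needs.
 */
#include <stdint.h>

#define	MAX_SERDLEN	32	/* assumed engine-name buffer size */

typedef struct zfs_case_data {
	uint32_t	zc_version;		/* CASE_DATA_VERSION_SERD */
	uint64_t	zc_pool_guid;		/* pool being tracked */
	uint64_t	zc_vdev_guid;		/* vdev under diagnosis */
	char		zc_serd_io[MAX_SERDLEN];	/* io SERD name */
	char		zc_serd_checksum[MAX_SERDLEN];	/* cksum SERD name */
} zfs_case_data_t;
```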

@@ -1 +0,0 @@
io-spare.sh

Contributor:

These scripts will also need to be removed from the Makefile.am.

Contributor Author:

Good catch. Fixed.

zed_log_msg(LOG_INFO, "zed_udev_monitor: %s waiting "
"for slice", udev_device_get_devnode(dev));
// zed_log_msg(LOG_INFO, "zed_udev_monitor: %s waiting "
// "for slice", udev_device_get_devnode(dev));

Contributor:

Did you intend to comment this out?

Contributor Author:

I removed it. It was just a noisy log entry.

### ZFS Fault Management Overview ###

The primary purpose of ZFS fault management is automated diagnosis
and isolation of VDEV faults. A Fault is something we can associate

Member:

Is Fault intentionally capitalized here?

Contributor Author:

A typo, will fix in next commit.

3. A _Disk Add Agent_ module (`agents/zfs_mod.c`)

To begin with, a **Diagnosis Engine** consumes per-vdev I/O and checksum
ereports and feeds them into a SERD engine which will generate a

Member:

What is "SERD"? I actually don't know, but also should it be expanded here?

Contributor Author:

I can add a basic definition.


* function can also be used in combination with fmd_serd_eng_record().
*/
if (sgp->sg_flags & FMD_SERD_FIRED) {
#if 1

Member:

Is this #if 1 leftover?

}

/*
* sourced from fmd_string.c

Member:

There are a couple of these "sourced from" comments. Is this copy-and-paste of some code? Should it be done differently?

Contributor Author:

It's code from the FMD runtime, under the same license and copyright; it's just a consolidation. For example: https://github.com/openzfs/openzfs/tree/master/usr/src/cmd/fm/fmd/common


/*
* There is no way to open a pool by GUID, or lookup a vdev by GUID. No
* matter what we do, we're going to have to stomach a O(vdevs * cases)

Member:

"a" should be "an".

Contributor Author:

Typo, can fix in next commit.

* The ZFS retire agent is responsible for managing hot spares across all pools.
* When we see a device fault or a device removal, we try to open the associated
* pool and look for any hot spares. We iterate over any available hot spares
* and attempt a 'zpool replace' for each one.

Member:

Policy for which ones to try (or the order) is for later?

Contributor Author:

Policy in this phase is the same as upstream (i.e., very basic). Subsequent patches will improve the hot-spare matching criteria (e.g., matching virtual spares for dRAID, or matching by size and device type).
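
A sketch of that basic upstream policy for reference: walk the pool's spare list in config order and stop at the first spare for which a replace succeeds. The libzfs/nvpair calls are the standard ones; replace_with_spare() is a hypothetical stand-in for the agent's replace logic:

```c
/*
 * Sketch of the basic upstream hot-spare policy: iterate the pool's
 * spares in config order and attempt a replace with each until one
 * succeeds. replace_with_spare() is a hypothetical helper.
 */
#include <libzfs.h>

static int replace_with_spare(zpool_handle_t *, nvlist_t *, nvlist_t *);

static void
try_hot_spares(zpool_handle_t *zhp, nvlist_t *faulted_vdev)
{
	nvlist_t *config, *nvroot;
	nvlist_t **spares;
	uint_t s, nspares;

	config = zpool_get_config(zhp, NULL);
	if (config == NULL || nvlist_lookup_nvlist(config,
	    ZPOOL_CONFIG_VDEV_TREE, &nvroot) != 0)
		return;

	/* No refined matching yet (size, media type, dRAID class). */
	if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
	    &spares, &nspares) != 0)
		return;

	for (s = 0; s < nspares; s++) {
		if (replace_with_spare(zhp, faulted_vdev, spares[s]) == 0)
			break;	/* first successful replace wins */
	}
}
```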

fmd_module_t *mp = (fmd_module_t *)hdl;

if (fmd_serd_eng_lookup(&mp->mod_serds, name) != NULL) {
zed_log_msg(LOG_ERR, "failed to create serd engine '%s': "

Member:

Should these error messages use all-caps "SERD" for consistency?

Contributor Author:

Hey, this is UNIX/Linux. But point taken on being consistent.

@rlaager (Member) commented Nov 1, 2016

My comments here are pretty much all trivial proofreading stuff. If you think that's a waste of time, feel free to ignore them.

@don-brady (Contributor Author)

@rlaager Thanks for taking time to review. All input is welcome here. We also need to update the man page for ZED. I'll take a stab at it in the next commit and would welcome your input on that as well.

The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.

The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a
corresponding fault diagnosis when the SERD engine fires.
All the diagnosis state data is collected into cases, one
case per vdev being tracked.

The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.
@tonyhutter (Contributor)

I've been testing this on our JBODs and haven't hit any major issues. I tested auto-online, auto-replace, and spares on a multipath pool and it worked as advertised. LGTM

@behlendorf merged commit 976246f into openzfs:master Nov 7, 2016

Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes openzfs#5343