\subsection{Tools}
There is no technical barrier to cataloging, annotating, and investigating a dataset
by manually writing RDF files, populating an annotation database,
and running SQL and SPARQL queries against them. However,
the schema and vocabulary become accessible to a much wider
audience with the aid of tools that ease the learning curve and automate many of the processes.
In the course of this work we have developed a small collection
of tools for constructing and querying graphs and annotation databases.
%\begin{itemize}
%\item get.py for querying an annotation database
%\item what do we have for building an annotation database? (Ann's tool for %ingesting the mutrino and jobs info?)
%\item framework for working with rdf graph
%
%(plan is to have cataloging and running general queries supported at least, and ideally also slicing-and-fetching of timestamped logs in line-per-timestamp format)
%\end{itemize}
\subsubsection{Cataloging log sources and archives}
For easier use of the RDF graph we have developed a framework, in
the form of a Python package \textcolor{red}{cite github address?},
for constructing and querying the graph. The package is still pre-release
and in active development; so far, its capability is limited to
walking a directory tree to discover and catalog log files (with
some constraints on the types of log files it can inspect) and to some
basic querying. The design is extensible: a small interface
is defined for handlers of file and log-format types, and additional
handlers can easily be added to the toolkit.
A good deal of information must be collected to create useful metadata. Most of
it can be inferred automatically by querying the graph and inspecting
the file: the filename hints at the likely \texttt{LogSeries},
temporal boundaries can be discovered by reading the first and last records,
and the subjects for a tree of log data can be inferred from a coarse-grained subject (such as the cluster) together with information held by each \texttt{LogSeries}. Details requiring human input are acquired through
a simple question-and-answer interface that presents the user with
reasonable guesses and prompts for editing and confirmation. In this manner
thousands of logs can be described from only a handful of user
interactions.
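The handler interface is deliberately small. As a rough illustration (the class and method names below are hypothetical, not the package's actual API), a handler only needs to recognize a file and propose catalog metadata for it:

\begin{minted}{python}
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator, List, Optional

@dataclass
class CatalogEntry:
    path: Path
    log_series: str            # guessed LogSeries, to be confirmed by the user
    start: Optional[datetime]  # temporal boundaries from first/last records
    end: Optional[datetime]
    subject: str               # e.g., the cluster, refined per LogSeries

class LogHandler:
    """Hypothetical base interface for file and log-format handlers."""
    def claims(self, path: Path) -> bool:
        """Return True if this handler recognizes the file."""
        raise NotImplementedError
    def inspect(self, path: Path, coarse_subject: str) -> CatalogEntry:
        """Read just enough of the file to propose catalog metadata."""
        raise NotImplementedError

def walk_and_catalog(root: Path, handlers: List[LogHandler],
                     coarse_subject: str) -> Iterator[CatalogEntry]:
    """Walk a directory tree and yield proposed entries for confirmation."""
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        for handler in handlers:
            if handler.claims(path):
                yield handler.inspect(path, coarse_subject)
                break
\end{minted}

Entries proposed this way would then be presented through the question-and-answer interface for confirmation before being recorded in the graph.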
\subsubsection{Querying Annotations}
\label{s:querying}
A stated goal of our project is to enable annotation of log data so that researchers can analyze it with sufficient context to understand what was happening on the system when a given set of log lines was written. Such a tool would also enable consultants and users to make sense of log data in order to better address questions about job failures. Neither use case can be addressed, however, without an efficient means of querying the annotations, so we have created a Python-based query engine for that purpose.
The query engine offers a set of command-line flags that can be combined into highly specific queries without resorting to complex SQL. A user can refine a query by start (\texttt{-s} or \texttt{--start}) and/or end (\texttt{-e} or \texttt{--end}) time, by component name (e.g., node or slot cname; \texttt{-c} or \texttt{--component}), by type (a descriptive phrase such as ``link down''; \texttt{-t} or \texttt{--type}), or by any of the columns in the database schema (\texttt{-t columnname=foo}). In addition, a user can specify a job ID (\texttt{-j} or \texttt{--job}) to retrieve annotations on logs written during the time of the job and affecting any of the nodes on which it ran. Adding the \texttt{-a} or \texttt{--after} flag lets the user pose a complex query with a single timestamp and retrieve only the next matching annotation.
In addition to specifying a single component, a user can retrieve annotations for related components, traversing the architecture of the system to a specified depth (\texttt{-d} or \texttt{--depth}), or number of hops. For example, a query for a component specified to depth 2 would retrieve parent and grandparent components as well as child and grandchild components.
For any depth search, any supremum (e.g., the SMW) or `unknown' components are also queried.
This flag leverages a table of physical components in the database that reflects the architecture of the system for which the annotations were made.
The user also has options for viewing the retrieved annotations. Formatting options include a table (\texttt{-f table}), a JSON array (\texttt{-f json}), or the default textual listing shown in Figure~\ref{f:default_format}. The user may choose to increase verbosity (\texttt{-v}, \texttt{-vv}) or to limit the maximum number of retrieved annotations (\texttt{--limit}).
An additional query for jobs (\texttt{--jobs}) allows the user to enter one or more annotation identifiers and retrieve the list of job IDs that could be affected by the annotation, based on the time and the nodes on which the jobs ran.
The current query engine retrieves annotations only, though the annotation metadata provides sufficient information to locate relevant log files. We plan to enable retrieval of subsets of the logs through the tool in the future.
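To illustrate how such flags might be translated into a query over the annotation database (the table and column names here are assumptions for illustration, not the engine's actual schema), a simplified flag-to-SQL mapping could look like:

\begin{minted}{python}
import sqlite3

def build_query(start=None, end=None, component=None, ann_type=None, limit=None):
    """Translate query-engine-style flags into one parameterized SQL query."""
    clauses, params = [], []
    if start is not None:        # -s / --start
        clauses.append("end_time >= ?"); params.append(start)
    if end is not None:          # -e / --end
        clauses.append("start_time <= ?"); params.append(end)
    if component is not None:    # -c / --component
        clauses.append("components LIKE ?"); params.append(f"%{component}%")
    if ann_type is not None:     # -t / --type
        clauses.append("description LIKE ?"); params.append(f"%{ann_type}%")
    sql = "SELECT * FROM annotations"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if limit is not None:        # --limit
        sql += " LIMIT ?"; params.append(limit)
    return sql, params

# Example: annotations mentioning "link down" for one component in a time window.
sql, params = build_query(start="2015-04-29 00:00:00",
                          end="2015-04-30 00:00:00",
                          component="c0-0c1s8a0n0",
                          ann_type="link down", limit=10)
with sqlite3.connect("annotations.db") as conn:
    rows = conn.execute(sql, params).fetchall()
\end{minted}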
\begin{figure}
\begin{small}
\begin{minted}{text}
----------
[822659] by acg on system Mutrino
Time: 2015-04-29 18:16:32 to 2015-04-29 18:16:32
Start state: None ; End state: None
Description: Correctable memory error. This may result in degraded performance.
Manually invoked? False ; System down?: False
Components: ["c0-0c1s8a0n0"]
Tags:
LogDiver category group: NO
Baler pattern ID: 280
Relevant log files: hwerrlog
----------
[822660] by acg on system Mutrino
Time: 2015-04-29 18:16:42 to 2015-04-29 18:16:42
Start state: None ; End state: None
Description: Correctable memory error. This may result in degraded performance.
Manually invoked? False ; System down?: False
Components: ["c0-0c1s8a0n0"]
Tags:
LogDiver category group: NO
Baler pattern ID: 280
Relevant log files: hwerrlog
----------
[822661] by acg on system Mutrino
Time: 2015-04-29 18:16:52 to 2015-04-29 18:16:52
Start state: None ; End state: None
Description: Correctable memory error. This may result in degraded performance.
Manually invoked? False ; System down?: False
Components: ["c0-0c1s8a0n0"]
Tags:
LogDiver category group: NO
Baler pattern ID: 280
Relevant log files: hwerrlog
*** Done! ***
\end{minted}
\end{small}
\caption{Annotations as returned by the query engine in default format. }
\label{f:default_format}
\end{figure}
\subsubsection{Creating Annotations}
Annotations can be added to a database by several means. The simplest approach is an SQL \texttt{INSERT} at the SQLite command line, but this method can be tedious and relies on manual input. A simple way of adding a large batch of annotations is to enter them in a spreadsheet, export it as CSV, and load it with a small uploader script that we provide. These approaches work well for human-generated annotations, such as individual observations from a system administrator or output from a ticketing system.
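A minimal sketch of such an uploader (the column names below are illustrative, not the actual annotation schema) might look like:

\begin{minted}{python}
import csv
import sqlite3

def upload_annotations(csv_path: str, db_path: str = "annotations.db") -> None:
    """Insert each row of a CSV spreadsheet as one annotation record."""
    with sqlite3.connect(db_path) as conn, open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            conn.execute(
                "INSERT INTO annotations "
                "(authorid, system, start_time, end_time, description, components) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (row["authorid"], row["system"], row["start_time"],
                 row["end_time"], row["description"], row["components"]))
\end{minted}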
To facilitate the automated creation of log line annotations
and the identification of the occurrences of events to be annotated,
we have been using two external tools, Baler and LogDiver (described in more detail below),
and we have developed tools that create annotations from their output.
Note that our design does not require a single SQL database, or even SQL databases at all. A plugin interface to a component presenting an API that returns similar annotation information would also be appropriate (e.g., one could front a tool with its own datastore). However, to provide a simple, consistent format for portable general release of an annotated dataset, we use a single SQL database in this work, as described in the next section.
\section{Proof of Concept}
\label{s:proof}
To facilitate the creation of log line annotations
and the identification of the occurrences of events to be annotated,
we have been using two tools.
LogDiver~\cite{LogDiver} is a tool developed at UIUC that includes a set of regular expressions defining
events of interest in log files; the regular expressions are associated with categorizations
that are a subset of those described in the previous section. The
category name, \texttt{LDcategory}, was chosen to reflect our intention to map to the
LogDiver categorizations where possible.
LogDiver itself is used to discover occurrences of the regular
expressions in the logs and to derive statistics and information about event sequences,
such as statistics of failure events or of the timings of failures and recoveries.
LogDiver, or any such regex-based tool (e.g., SEC~\cite{SEC}), can be used to efficiently extract events
for subsequent annotation, based on the intent behind each regex.
For the dataset described in this work, we principally used Baler~\cite{Baler} for
identifying the log lines to be annotated and for extracting them from the dataset.
Baler extracts patterns from log files without requiring a priori knowledge of
the regular expressions of interest. Rather, Baler takes dictionaries of ``words''; words appearing in the log lines
are passed through to the pattern, and non-words become wildcards in the pattern.
Wildcards of certain formats (for example numbers, hex dumps, character arrays, hostnames, and link names
in cname format for Cray systems) are represented as that formatted type in the pattern.
For example, every instance of the log message \texttt{mutrino-smw 24626 found\_critical\_aries\_error: Processing ''PCI-e CMPL\_TIMEOUT'' critical error (0x660e)}
is represented by the pattern \texttt{<host> nlrd <pid> found\_critical\_aries\_error: Processing ''* *\_TIMEOUT'' critical error (<num>)}.
This illustrates where words, formatted wildcards, and unformatted wildcards (represented by \texttt{*}) appear in the pattern.
For Cray systems, we augment
the dictionary with an architecture-specific dictionary of about 100 words (e.g., Lustre, DIMM).
For 3 months of data from our Trinity test system, Mutrino, a 100-node XC40,
we had over 120 million text log lines, which were reduced to 15,500 patterns. To further identify patterns
of interest, we weight the patterns by the occurrence of 50 weighted
keywords (e.g., congestion = 1.5, error = 1.5, degrade = 0.75). This further reduced the set
to 2,500 significantly weighted patterns. For example, the pattern
\texttt{<host> nlrd <pid> ***ERROR***: Link recovery operation failed; error <num>} has
an aggregate weight of 5.5. From those, we chose 150
patterns to annotate with enhanced descriptions. This resulted in about 860,000
annotated log line instances.
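The keyword-weighting step can be sketched as follows; the keyword weights are the examples quoted above, and the aggregation shown (summing the weight of every keyword occurrence in a pattern) is an assumption about the general approach rather than the exact calculation used:

\begin{minted}{python}
# A few of the ~50 weighted keywords quoted in the text (illustrative subset).
KEYWORD_WEIGHTS = {"congestion": 1.5, "error": 1.5, "degrade": 0.75}

def pattern_weight(pattern: str) -> float:
    """Sum the weights of all keyword occurrences in a pattern (case-insensitive)."""
    text = pattern.lower()
    return sum(w * text.count(k) for k, w in KEYWORD_WEIGHTS.items())

# Patterns below some chosen threshold can then be dropped from consideration.
patterns = ["<host> nlrd <pid> ***ERROR***: Link recovery operation failed; error <num>"]
significant = [p for p in patterns if pattern_weight(p) >= 1.5]
\end{minted}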
It is our intention to
build a plugin to interface with Baler and to support the annotations there;
in the prototype, however, we simply annotated the patterns extracted by
Baler and loaded them into a single database.
We include the Baler pattern ID in the annotation fields
for ease of reference; only the annotation description, not the original log line or the pattern,
is stored in the annotation database.
Some example patterns, from which the originating log lines should be evident, and
the resulting annotations used in this work are given in Figure~\ref{f:baler}.
\begin{figure*}
\begin{annol}
Baler pattern, preceded by weight (W=#) and balerpatternid number:
(W=5) 258 HWERR[<host>][<num>]:<num>:SSID RSP A_STATUS_ORB_TIMEOUT Error:*=<num>:*=<num>:*=<num>
Annotation:
authorid:acg description: 'ORB timeout waiting on outstanding request(s) in the buffer' LDcatgroup: NE
Baler pattern and weight:
(W=3.75) 498 <host> nlrd <pid> do_set_alerts: <num> links failed, <num> blades failed, <num> blade critical faults, *_in_progress <num>, *_*_reroute <num>; reroute required
Annotation:
authorid:acg description: 'Setting alerts due to failures. A network reroute is required' LDcatgroup: NE
Baler pattern and weight:
(W=3.25) 748 <host> nlrd <pid> ***ERROR***: Warm swap operation failed; error <num>
Annotation:
authorid:acg description: 'Warm swap failed. This is in response to an operation intended to reset/reinit/replace a component (including network components).' LDcatgroup: NO
Baler pattern and weight:
(W=1.5) 705 <host> nlrd <pid> responder_work_*: Top <num> nodes involved with network congestion
Annotation:
authorid:acg description: 'System computing and listing congestion candidate applications' LDcatgroup: NE
\end{annol}
\caption{Example Baler patterns extracted from log lines and their annotated versions. Events to annotate are based on
knowledge of significant events. Annotation descriptions can provide additional context to non-self-explanatory log messages.}
\label{f:baler}
\end{figure*}
We fed Baler the Cray logs in the format in which they reside on the SMW. This required us
to do some file-specific extraction of messages, timestamps,
and components (e.g., for \texttt{netwatch}, \texttt{hwerrlog}), which we might not have had to do if
we had fed it raw syslog versions of the files or the data stream; however, this also enabled us
to include the log file type (e.g., \texttt{nlrd}, \texttt{hwerrlog}) in the pattern metadata,
which aids log file lookup.
In many cases, messages are reported
on the SMW with the SMW as the associated component. In some cases, we can
extract the actual component of interest. For example, from the Baler pattern
\texttt{<host> nlrd <pid> found\_critical\_aries\_error: handling failed * link on <host> (node )}
we can infer the fields from which to extract the \texttt{host} and \texttt{node} with which
the annotation should be associated. Other messages refer to actions by the SMW for which
the component cannot be inferred; in these cases the component assignment will be either
the SMW or `unknown'.
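A sketch of this kind of component extraction (the regular expression and the example line are constructed for illustration; the cname pattern is only an approximation) is:

\begin{minted}{python}
import re

# Rough approximation of a Cray cname such as c0-0c1s8a0n0 (assumed shape).
CNAME = r"c\d+-\d+c\d+s\d+(?:a\d+)?(?:n\d+)?"

line = ("mutrino-smw nlrd 24626 found_critical_aries_error: "
        "handling failed PT link on c0-0c1s8a0n0 (node )")

# Pull the component of interest out of the matched line so the annotation
# can be attached to it rather than to the SMW.
m = re.search(rf"handling failed \S+ link on ({CNAME})", line)
component = m.group(1) if m else "unknown"  # fall back to the SMW or 'unknown'
\end{minted}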
We used Baler for all major log processing except the \texttt{command} log, which required us
to associate the \texttt{START} and \texttt{END} of events; for this we used a Perl script. This
log includes both manually initiated and automatically invoked commands of
interest, such as warm swaps and boots. It was particularly useful for
determining manual actions that may not have been well documented by the
system administrators, and it yielded another 2,000 annotations. In addition,
we extracted the times of reboots from the datetime in the names of the \texttt{p0-XXX}
directories. All annotations from these two sources are attributed as manually induced.
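For consistency with the other sketches, the pairing idea used in that script can be illustrated in Python (the record layout here is assumed):

\begin{minted}{python}
def pair_command_events(records):
    """records: iterable of (timestamp, command_id, marker), marker is START or END.

    Pair each START with its matching END to form one annotation interval.
    """
    open_events, annotations = {}, []
    for ts, cmd_id, marker in records:
        if marker == "START":
            open_events[cmd_id] = ts
        elif marker == "END" and cmd_id in open_events:
            annotations.append({"start": open_events.pop(cmd_id),
                                "end": ts,
                                "command": cmd_id,
                                "manually_invoked": True})  # per attribution above
    return annotations
\end{minted}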
Some other system administrator actions (about 10, in this case) were recorded by manually generated annotations;
ticketing systems may be used to generate such annotations as well.
Knowledge of such events is useful for understanding the root causes and resolutions of errors
in the dataset.
Other non-log events include external actions by non-administrators,
such as facilities tests and fault-injection research, which require
annotations by different people. These were also generated manually
for this dataset and, like the Baler annotations, were loaded into the
same annotation database for the prototype.
Job information was extracted from the ALPS logs, ALPS being the scheduler in use for
this time period. This could be replaced with queries to a Slurm
(or similar) database or interface, where available.
Recall that the main aim of the annotations
is to provide a reduced set of searchable, understandable data, with the
ability to use the annotations to drive more detailed searches of the raw
logs, where possible and desired. This differs from tools
such as SEC, which is intended to enable action upon the run-time
occurrence of a log line matching a regex (e.g., notification
of a failed component), or Splunk, which is intended
to facilitate awareness of the occurrences of pre-defined
events with accompanying statistical plots.