Add design documentation for file_input operator (#203)

* Add file_input design documentation
open-telemetry · Jun 29, 2021 · 8b99134 · 8b99134
1 parent 9d00c8d
commit 8b99134
Showing 1 changed file with 181 additions and 0 deletions.
diff --git a/operator/builtin/input/file/design.md b/operator/builtin/input/file/design.md
@@ -0,0 +1,181 @@
+# File Matching
+
+The operator searches the file system for files that meet the following requirements:
+1. The file's path matches one or more patterns specified in the `include` setting.
+2. The file's path does not match any pattern specified in the `exclude` setting.
+
+The set of files that satisfy these requirements are known in this document as "matched files". 
+The effective search space (`include - exclude`) is referred to colloquially as the operator's "matching pattern".
+
+# Fingerprints
+
+Files are identified and tracked using fingerprints. A fingerprint is the first `N` bytes of the file, with the default for `N` being `1000`. 
+
+### Fingerprint Growth
+
+When a file is smaller than `N` bytes, the fingerprint is the entire contents of the file. A fingerprint that is less than `N` bytes will be compared to other fingerprints using a prefix check. As the file grows, its fingerprint will be updated, until it reaches the full size of `N`.
+
+### Deduplication of Files
+
+Multiple files with the same fingerprint are handled as if they are the same file. 
+
+Most commonly, this circumstance is observed during file rotation that depends on a copy/truncate strategy. After copying the file, but before truncating the original, two files with the same content briefly exist. If the `file_input` operator happens to observe both files at the same time, it will detect a duplicate fingerprint and ingest only one of the files.
+
+If logs are replicated to multiple files, or if log files are copied manually, it is not understood to be of any significant value to ingest the duplicates. As a result, fingerprints are not designed to differentiate between these files, and double ingestion of the same content is not supported automatically.
+
+In some rare circumstances, a logger may print a very verbose preamble to each log file. When this occurs, fingerprinting may fail to differentiate files from one another. This can be overcome by customizing the size of the fingerprint using the `fingerprint_size` setting.
+
+# Readers
+
+Readers are a convenience struct, which exist for the purpose of managing files and their associated metadata. 
+
+### Contents
+
+A Reader contains the following:
+- File handle (may be open or closed)
+- File fingerprint
+- File offset (aka checkpoint)
+- File path
+- Decoder (dedicated instance to avoid concurrency issues)
+
+### Functionality
+
+As implied by the name, Readers are responsible for consuming data as it is written to a file.
+
+Before a Reader begins consuming, it will seek the file's last known offset. If no offset is known for the file, then the Reader will seek either the beginning or end of the file, according to the `start_at` setting. It will then begin reading from there.
+
+While a file is shorter than the length of a fingerprint, its Reader will continuously append to the fingerprint, as it consumes newly written data.
+
+A Reader consumes a file using a `bufio.Scanner`, with the Scanner's buffer size defined by the `max_log_size` setting, and the Scanner's split func defined by the `multiline` setting. 
+
+As each log is read from the file, it is decoded according to the `encoding` function, and then emitted from the operator. 
+
+The Reader's offset is updated accordingly whenever a log is emitted.
+
+
+### Persistence
+
+Readers are always instantiated with an open file handle. Eventually, the file handle is closed, but the Reader is not immediately discarded. Rather, it is maintained for a fixed number of "poll cycles" (see Polling section below) as a reference to the file's metadata, which may be useful for detecting files that have been moved or copied, and for recalling metadata such as the file's previous path.
+
+Readers are maintained for a fixed period of time, and then discarded.
+
+When the `file_input` operator makes use of a persistence mechanism to save and recall its state, it is simply Setting and Getting a slice of Readers. These Readers contain all the information necessary to pick up exactly where the operator left off.
+
+
+# Polling
+
+The file system is polled on a regular interval, defined by the `poll_interval` setting. 
+
+Each poll cycle runs through a series of steps which are presented below.
+
+### Detailed Poll Cycle
+
+1. Dequeuing
+    1. If any matches are queued from the previous cycle, an appropriate number are dequeued, and processed the same as would a newly matched set of files.
+2. Aging
+    1. If no queued files were left over from the previous cycle, then all previously matched files have been consumed, and we are ready to query the file system again. Prior to doing so, we will increment the "generation" of all historical Readers. Eventually, these Readers will be discarded based on their age. Until that point, they may be useful references.
+3. Matching
+    1. The file system is searched for files with a path that matches the `include` setting.
+    2. Files that match the `exclude` setting are discarded.
+    3. As a special case, on the first poll cycle, a warning is printed if no files are matched. Execution continues regardless.
+4. Queueing
+    1. If the number of matched files is less than or equal to the maximum degree of concurrency, as defined by the `max_concurrent_files` setting, then no queueing occurs.
+    2. Else, queueing occurs, which means the following:
+        - Matched files are split into two sets, such that the first is small enough to respect `max_concurrent_files`, and the second contains the remaining files (called the queue).
+        - The current poll interval will begin processing the first set of files, just as if they were the only ones found during the matching phase.
+        - Subsequent poll cycles will pull matches off of the queue, until the queue is empty.
+        - The `max_concurrent_files` setting is respected at all times.
+5. Opening
+    1. Each of the matched files is opened. Note:
+        - A small amount of time has passed since the file was matched.
+        - It is possible that it has been moved or deleted by this point.
+        - Only a minimum set of operations should occur between file matching and opening.
+        - If an error occurs while opening, it is logged.
+6. Fingerprinting
+    1. The first `N` bytes of each file are read. (See fingerprinting section above.)
+7. Exclusion
+    1. Empty files are closed immediately and discarded. (There is nothing to read.)
+    2. Fingerprints found in this batch are cross referenced against each other to detect duplicates. Duplicate files are closed immediately and discarded.
+        - In the vast majority of cases, this occurs during file rotation that uses the copy/truncate method. (See fingerprinting section above.)
+8. Reader Creation
+    1. Each file handle is wrapped into a `Reader` along with some metadata. (See Reader section above)
+        - During the creation of a `Reader`, the file's fingerprint is cross referenced with previously known fingerprints.
+        - If a file's fingerprint matches one that has recently been seen, then metadata is copied over from the previous iteration of the Reader. Most importantly, the offset is accurately maintained in this way.
+        - If a file's fingerprint does not match any recently seen files, then its offset is initialized according to the `start_at` setting.
+9. Detection of Lost Files
+    1. Fingerprints are used to cross reference the matched files from this poll cycle against the matched file from the previous poll cycle. Files that were matched in the previous cycle but were not matched in this cycle are referred to as "lost files".
+    2. File become "lost" for several reasons:
+        - The file may have been deleted, typically due to rotation limits or ttl-based pruning.
+        - The file may have been rotated to another location.
+            - If the file was moved, the open file handle from the previous poll cycle may be useful.
+10. Consumption
+    1. Lost files are consumed. In some cases, such as deletion, this operation will fail. However, if a file was moved, we may be able to consume the remainder of its content. 
+        - We do not expect to match this file again, so the best we can do is finish consuming their current contents.
+        - We can reasonably expect in most cases that these files are no longer being written to.
+    2. Matched files (from this poll cycle) are consumed.
+        - These file handles will be left open until the next poll cycle, when they will be used to detect and potentially consume lost files.
+        - Typically, we can expect to find most of these files again. However, these files are consumed greedily in case we do not see them again.
+    3. All open files are consumed concurrently. This includes both the lost files from the previous cycle, and the matched files from this cycle.
+11. Closing
+    1. All files from the previous poll cycle are closed.
+12. Archiving
+    1. Readers created in the current poll cycle are added to the historical record.
+    2. The same Readers are also retained as a separate slice, for easy access in the next poll cycle.
+13. Pruning
+    1. The historical record is purged of Readers that have existed for 3 generations.
+        - This number is somewhat arbitrary, and should probably be made configurable. However, its exact purpose is quite obscure.
+14. Persistence
+    1. The historical record of readers is synced to whatever persistence mechanism was provided to the operator.
+15. End Poll Cycle
+    1. At this point, the operator sits idle until the poll timer fires again.
+
+
+
+# Additional Details
+
+### Startup Logic
+
+Whenever the operator starts, it:
+- Requests the historical record of Readers, as described in steps 12-14 of the poll cycle.
+- Starts the polling timer.
+
+### Shutdown Logic
+
+When the operator shuts down, the following occurs:
+- If a poll cycle is not currently underway, the operator simply closes any open files.
+- Otherwise, the current poll cycle is signaled to stop immediately, which in turn signals all Readers to stop immediately.
+    - If a Reader is idle or in between log entries, it will return immediately. Otherwise it will return after consuming one final log entry.
+    - Once all Readers have stopped, the remainder of the poll cycle completes as usual, which includes the steps labeled `Closing`, `Archiving`, `Pruning`, and `Persistence`.
+
+The net effect of the shut down routine is that all files are checkpointed in a normal manner (i.e. not in the middle of a log entry), and all checkpoints are persisted.
+
+
+# Known Limitations
+
+### Potential data loss when maximum concurrency must be enforced
+
+The operator may lose a small percentage of logs, if both of the following conditions are true:
+1. The number of files being matched exceeds the maximum degree of concurrency allowed by the `max_concurrent_files` setting. 
+2. Files are being "lost". That is, file rotation is moving files out of the operator's matching pattern, such that subsequent polling cycles will not find these files.
+
+When both of these conditions occur, it is impossible for the operator to both:
+1. Respect the specified concurrency limitation.
+2. Guarantee that if a file is rotated out of the matching pattern, it may still be consumed before being closed.
+
+When this scenario occurs, a design tradeoff must be made. The choice is between:
+1. Ensure that `max_concurrent_files` is always respected.
+2. Risk losing a small percentage of log entries.
+
+The current design chooses to guarantee the maximum degree of concurrency because failure to do so risks harming the operator's host system. While the loss of logs is not ideal, it is less likely to harm the operator's host system, and is therefore considered the more acceptable of the two options.
+
+### Potential data loss when file rotation via copy/truncate rotates backup files out of operator's matching pattern
+
+The operator may lose a small percentage of logs, if both of the following conditions are true:
+1. Files are being rotated using the copy/truncate strategy.
+2. Files are being "lost". That is, file rotation is moving files out of the operator's matching pattern, such that subsequent polling cycles will not find these files.
+
+When both of these conditions occur, it is possible that a file is written to (then copied elsewhere) and then truncated before the operator has a chance to consume the new data.
+
+### Potential failure to consume files when file rotation via move/create is used on Windows
+
+On Windows, rotation of files using the Move/Create strategy may cause errors and loss of data, because Golang does not currently support the Windows mechanism for `FILE_SHARE_DELETE`.