--DRAFT PR please review... lots of code changes, so many eyes are welcome--
This PR adds a mmap() interface that lets processors map FlowFile payloads to a memory address. This significantly improves efficiency and performance for some use cases. In almost all cases the change has no negative performance impact, as the benchmarks show.
Original/full reason/justification:
"Currently, MiNiFi - C++ only supports stream-oriented I/O to FlowFile payloads. This can limit performance in cases where in-place access to the payload is desirable. In cases where data can be accessed randomly and in-place, a significant speedup can be realized by mapping the payload into the system memory address space. This is natively supported at the kernel level for files: via mmap() on Linux and macOS, and via the equivalent file-mapping API on Windows. Other repositories, such as the VolatileRepository, already store the entire payload in memory, so it is natural to pass through this memory block as if it were a memory-mapped file. While the DatabaseContentRepository does not appear to natively support a memory map interface, access via an emulated memory-map interface should be possible with no performance degradation relative to a full read via the streaming interface.
Cases where in-place, random access is beneficial include, but are not limited to:
The interface should be accessible by processors via a mmap() call on ProcessSession (adjacent to read() and write()). A MemoryMapCallback should be provided, which is called back via a process() call where the argument is an instance of BaseMemoryMap. The BaseMemoryMap is extended for each type of repository that MiNiFi - C++ supports, including: FileSystemRepository, VolatileRepository, and DatabaseContentRepository.
As part of the change, in addition to extensive unit test coverage, benchmarks should be written such that the performance impact can be empirically measured and evaluated."
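To make the callback flow described above concrete, here is a minimal, self-contained sketch. Only the names `BaseMemoryMap`, `MemoryMapCallback`, and `process()` come from the description; the accessors `getData()` and `getSize()`, the `BufferMemoryMap` class, the `ReverseInPlace` callback, and the free function `mmap_payload()` (a stand-in for the `mmap()` call on ProcessSession) are assumptions made for illustration and may not match the actual signatures in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed shape of the map handed to a callback. The accessor names are
// hypothetical and not taken from the PR.
class BaseMemoryMap {
 public:
  virtual ~BaseMemoryMap() = default;
  virtual void* getData() = 0;
  virtual std::size_t getSize() = 0;
};

// Hypothetical buffer-backed map: roughly how a repository that already holds
// the payload in memory (or one that emulates mapping by reading the whole
// object) could expose it.
class BufferMemoryMap : public BaseMemoryMap {
 public:
  explicit BufferMemoryMap(std::vector<std::uint8_t> payload)
      : payload_(std::move(payload)) {}
  void* getData() override { return payload_.data(); }
  std::size_t getSize() override { return payload_.size(); }

 private:
  std::vector<std::uint8_t> payload_;
};

// Callback interface: process() receives the mapped payload.
// The return type is an assumption.
class MemoryMapCallback {
 public:
  virtual ~MemoryMapCallback() = default;
  virtual std::int64_t process(BaseMemoryMap& map) = 0;
};

// Example callback (hypothetical): reverses the payload in place, which needs
// random access and would be awkward with a purely streaming interface.
class ReverseInPlace : public MemoryMapCallback {
 public:
  std::int64_t process(BaseMemoryMap& map) override {
    auto* bytes = static_cast<std::uint8_t*>(map.getData());
    assert(bytes != nullptr || map.getSize() == 0);
    std::reverse(bytes, bytes + map.getSize());
    return static_cast<std::int64_t>(map.getSize());
  }
};

// Stand-in for the session-level mmap() call: hands the map to the callback.
std::int64_t mmap_payload(BaseMemoryMap& map, MemoryMapCallback& callback) {
  return callback.process(map);
}
```

In this sketch a processor would construct its callback, pass it to the session-level call, and work on the bytes directly instead of copying them out of a stream first.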
Here is the full benchmark suite:
The benchmarks show a significant performance increase in almost all cases. Both the FileSystemRepository and the VolatileRepository support memory mapping natively, but the DatabaseContentRepository has to simulate it by reading the full object. In most cases this has almost no performance impact, but it is somewhat slower in the "small" (131KB payload) benchmark cases. The random access benchmarks show the largest improvement, even with the DatabaseContentRepository.
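For readers less familiar with what "native" mapping means for a file-backed repository, the following is a minimal POSIX sketch (Linux/macOS only). It is not code from this PR: `write_file()` and `count_byte_mapped()` are hypothetical helpers that only illustrate mapping a file read-only and scanning it in place, without copying it through a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: writes `text` to `path`, returns true on success.
bool write_file(const char* path, const char* text) {
  std::FILE* f = std::fopen(path, "wb");
  if (f == nullptr) return false;
  bool ok = std::fputs(text, f) >= 0;
  return std::fclose(f) == 0 && ok;
}

// Hypothetical helper: counts occurrences of `needle` in the file at `path`
// by mapping it read-only. Returns -1 on failure.
long count_byte_mapped(const char* path, char needle) {
  assert(path != nullptr);
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;

  struct stat st;
  if (fstat(fd, &st) != 0) {
    close(fd);
    return -1;
  }
  if (st.st_size == 0) {  // mmap() rejects zero-length mappings
    close(fd);
    return 0;
  }

  const std::size_t size = static_cast<std::size_t>(st.st_size);
  void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping remains valid after the descriptor is closed
  if (addr == MAP_FAILED) return -1;

  // The payload is now addressable like an ordinary byte array.
  const char* bytes = static_cast<const char*>(addr);
  long count = 0;
  for (std::size_t i = 0; i < size; ++i) {
    if (bytes[i] == needle) ++count;
  }

  munmap(addr, size);
  return count;
}
```

A repository that cannot map its storage directly would instead read the whole object into a buffer and expose that buffer, which is consistent with the extra cost reported for small payloads.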
Caveats: