Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINIFICPP-929 mmap #603

Closed

Conversation

ai-christianson
Copy link
Contributor

@ai-christianson ai-christianson commented Jun 28, 2019

--DRAFT PR please review... lots of code changes, so many eyes are welcome--

This PR adds a mmap() interface to allow processors to map FlowFile payloads to a memory address. This increases efficiency and performance significantly for some use cases. The change does not negatively impact performance in almost all cases, as shown in benchmarks.

Original/full reason/justification:

"Currently, MiNiFi - C++ only support stream-oriented i/o to FlowFile payloads. This can limit performance in cases where in-place access to the payload is desirable. In cases where data can be accessed randomly and in-place, a significant speedup can be realized by mapping the payload into system memory address space. This is natively supported at the kernel level in Linux, MacOS, and Windows via the mmap() interface on files. Other repositories, such as the VolatileRepository, already store the entire payload in memory, so it is natural to pass through this memory block as if it were a memory-mapped file. While the DatabaseContentRepostory does not appear to natively support a memory map interface, accesses via an emulated memory-map interface should be possible with no performance degradation with respect to a full read via the streaming interface.

Cases where in-place, random access is beneficial include, but are not limited to:

in-place parsing of JSON (e.g. RapidJSON supports parsing in-place, at least for strings).
access of payload via protocol buffers
random access of large files on disk, where it would otherwise require many seek() and read() syscalls

The interface should be accessible by processors via a mmap() call on ProcessSession (adjacent to read() and write()). A MemoryMapCallback should be provided, which is called back via a process() call where the argument is an instance of BaseMemoryMap. The BaseMemoryMap is extended for each type of repository that MiNiFi - C++ supports, including: FileSystemRepository, VolatileRepository, and DatabaseContentRepository.

As part of the change, in addition to extensive unit test coverage, benchmarks should be written such that the performance impact can be empirically measured and evaluated."

Here is the full benchmark suite:

-------------------------------------------------------------------------------------------------------------------
Benchmark                                                                         Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_Read_Tiny                  2956 ns         2923 ns       240558
[2019-06-debugl 14:10:44.663] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::io::FileStream logger got sink
s from namespace root and level error from namespace root
FSMemoryMapBMFixture/Callback_FileSystemRepository_Read_Tiny                   4258 ns         4227 ns       164835
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_WriteRead_Tiny             7764 ns         7665 ns        91078
FSMemoryMapBMFixture/Callback_FileSystemRepository_WriteRead_Tiny             14152 ns        14022 ns        49870
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_Read_Small                15671 ns        15631 ns        44870
FSMemoryMapBMFixture/Callback_FileSystemRepository_Read_Small                 21020 ns        20977 ns        33246
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_WriteRead_Small           59944 ns        59772 ns        11701
FSMemoryMapBMFixture/Callback_FileSystemRepository_WriteRead_Small            57354 ns        57152 ns        12237
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_Read_Large              3592536 ns      3587026 ns          194
FSMemoryMapBMFixture/Callback_FileSystemRepository_Read_Large              17014790 ns     16979026 ns           41
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_WriteRead_Large        16578370 ns     16530633 ns           42
FSMemoryMapBMFixture/Callback_FileSystemRepository_WriteRead_Large         26228637 ns     26159193 ns           27
FSMemoryMapBMFixture/MemoryMap_FileSystemRepository_RandomRead_Large           53.7 ns         53.7 ns     13026678
FSMemoryMapBMFixture/Callback_FileSystemRepository_RandomRead_Large          170905 ns       170829 ns         4074
[2019-06-debugl 14:10:56.874] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::Repository logger got si
nks from namespace root and level error from namespace root
[2019-06-debugl 14:10:56.874] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileRepo
sitory<std::shared_ptr<org::apache::nifi::minifi::ResourceClaim> > logger got sinks from namespace root and level error from namespace root
[2019-06-debugl 14:10:56.874] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::VolatileCont
entRepository logger got sinks from namespace root and level error from namespace root
[2019-06-debugl 14:10:56.874] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::io::AtomicEntryMemoryMap<std::
shared_ptr<org::apache::nifi::minifi::ResourceClaim> > () logger got sinks from namespace root and level error from namespace root
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_Read_Tiny               267 ns          267 ns      2627306
[2019-06-debugl 14:10:57.877] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::io::AtomicEntryStream<std::sha
red_ptr<org::apache::nifi::minifi::ResourceClaim> > () logger got sinks from namespace root and level error from namespace root
VolatileMemoryMapBMFixture/Callback_VolatileRepository_Read_Tiny                360 ns          360 ns      1957163
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_WriteRead_Tiny          558 ns          558 ns      1255342
VolatileMemoryMapBMFixture/Callback_VolatileRepository_WriteRead_Tiny          1024 ns         1024 ns       682374
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_Read_Small             2654 ns         2653 ns       254700
VolatileMemoryMapBMFixture/Callback_VolatileRepository_Read_Small              7920 ns         7916 ns        96029
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_WriteRead_Small        7581 ns         7578 ns       105741
VolatileMemoryMapBMFixture/Callback_VolatileRepository_WriteRead_Small        11594 ns        11590 ns        60342
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_Read_Large          2438303 ns      2434904 ns          286
VolatileMemoryMapBMFixture/Callback_VolatileRepository_Read_Large          14859059 ns     14838872 ns           47
VolatileMemoryMapBMFixture/MemoryMap_VolatileRepository_WriteRead_Large     5889984 ns      5879759 ns          119
VolatileMemoryMapBMFixture/Callback_VolatileRepository_WriteRead_Large     17126978 ns     17105183 ns           41
[2019-06-debugl 14:11:07.870] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::core::repository::DatabaseCont
entRepository logger got sinks from namespace root and level error from namespace root
[2019-06-debugl 14:11:07.872] [org::apache::nifi::minifi::core::logging::LoggerConfiguration] [debug] org::apache::nifi::minifi::io::RocksDbStream logger got s
inks from namespace root and level error from namespace root
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_Read_Tiny               285 ns          285 ns      2469053
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_Read_Tiny                254 ns          254 ns      2757993
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_WriteRead_Tiny         4573 ns         4571 ns       150453
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_WriteRead_Tiny          3553 ns         3551 ns       197460
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_Read_Small            12882 ns        12876 ns        54293
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_Read_Small             11930 ns        11925 ns        58679
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_WriteRead_Small       88615 ns        88436 ns         7935
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_WriteRead_Small        90748 ns        90548 ns         7717
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_Read_Large         26695310 ns     26666793 ns           26
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_Read_Large          26571426 ns     26544032 ns           26
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_WriteRead_Large    49532071 ns     49459516 ns           14
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_WriteRead_Large     62205023 ns     62085915 ns           12
DatabaseMemoryMapBMFixture/MemoryMap_DatabaseRepository_RandomRead_Large       55.5 ns         55.4 ns     12612349
DatabaseMemoryMapBMFixture/Callback_DatabaseRepository_RandomRead_Large         514 ns          514 ns      1094727

The benchmarks show a significant performance increase in almost all cases. Both the FS repository and volatile can natively support memory mapping, but the DB repo has to simulate it by reading the full object. This has almost no performance impact in most cases, but is somewhat slower for the "small" (131KB payload) benchmark cases. The random access benchmarks show the most significant increase, even with the DB repo.

Caveats:

  • No Windows build yet, although it should be possible (https://docs.microsoft.com/en-us/windows/desktop/memory/file-mapping). I mainly need a set of Windows build instructions to test and validate against, as there's several possible ways to do a windows build.
  • There are some formatting changes due to clang-format (I included a .clang-format to hopefully reduce the issue going forward)
  • It's a fairly big code change so there could be some other things I missed
  • RocksDB is updated and needed RTTI to build on my machine. We can talk about this and/or extract out the rocks update.
  • RocksDB seems to have many MemoryMap references internally... maybe there is a way to natively support it with DB repo, but for now the change at least doesn't negatively impact performance. Would suggest that as a later change unless there's a clear/relatively simple route to do it in this PR.

options.use_direct_io_for_flush_and_compaction = true;
options.use_direct_reads = true;
// options.use_direct_io_for_flush_and_compaction = true;
// options.use_direct_reads = true;
Copy link
Contributor Author

@ai-christianson ai-christianson Jun 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't mean to commit this part, but I will note: on my machine without this, build failed, even on master, so we probably will need to look at this separately.

@ai-christianson ai-christianson force-pushed the mmap-2019-06-28b branch 2 times, most recently from f5435e3 to c790ca4 Compare July 26, 2019 20:29
@ai-christianson
Copy link
Contributor Author

Switched to mio library for cross-platform (windows) compatibility. Fixed the broken test. The benchmarks need some work, but should be close now.

@ai-christianson
Copy link
Contributor Author

The mio library does build for Windows, but I'm not happy with the performance. It's still faster than the non-mmap implemation (esp. w/ random access), but I think this PR should directly make the mmap calls.

@szaszm
Copy link
Member

szaszm commented Dec 16, 2020

Closing this because of inactivity. I think the community doesn't have the capacity to review such huge PRs in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants