
MPI-backend support #436

Closed
wants to merge 79 commits into from

Conversation

AhmedEleliemy
Collaborator

This fork implements support for communication primitives via MPI. Currently, the coordinator does not participate in communication. The code is also very similar to the gRPC implementation; much of it is redundant and could be abstracted to serve multiple backends, mainly gRPC and MPI.

aristotelis96 and others added 30 commits August 31, 2022 20:46
- Updated project structure
- Moved distributed-related kernels and data structures under runtime/distributed/coordinator/
- Generalized Broadcast & Distribute kernels
- DistributedWrapper implementation
- Updated Worker to support Vectorized execution.
- DistributedPipelineOp akin to VectorizedPipelineOp
  - main difference: pipeline body given as IR string, not as nested code region
- New DistributePipelinesPass rewriting VectorizedPipelineOp to DistributedPipelineOp.
  - including the generation of the IR string
  - distribution requires vectorization (--vec) to be effective
- RewriteToCallKernelOpPass lowers DistributedPipelineOp to CallKernelOp
  - special rewrite pattern for DistributedPipelineOp, for certain reasons it cannot be treated as any other op here
- distributedPipeline-kernel: currently just a stub that prints the information it receives, including the IR string.
- Limitations:
  - currently, we only support distributed pipelines with 1 or 2 outputs (we still have the same limitation for vectorized pipelines)
  - currently, we only support DenseMatrix<double>
  - currently, all vectorized pipelines are rewritten to distributed pipelines
- Updated distributedCollect Primitive
- TODO: generated IR code needs to be fixed
- TODO: additional debugging needed on the worker side
- Extended parseType for rows/cols

- Updated Distributed Runtime for COLS combine

- Updated DistributedTest

- DistributedDescriptor implementation (metadata for the distributed runtime)
- Distributed allocation descriptor implementation. simply holds object metadata information
- Distributed kernels (distribute/broadcast/compute, etc.) use template functions for each communication framework.
- MPI implementation missing.
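The per-framework template approach described above could look roughly like the following sketch. The backend tags and the `distribute` signature are illustrative assumptions, not the actual DAPHNE kernel API; the bodies are stubs standing in for real gRPC/MPI calls.

```cpp
#include <cstddef>
#include <string>

// Hypothetical backend tags for the communication frameworks.
struct GRPCBackend {};
struct MPIBackend {};

// Primary template left undefined: every communication framework must
// provide an explicit specialization of each distributed primitive.
template <typename Backend>
std::string distribute(const double *data, std::size_t n);

// gRPC specialization (stub: would serialize and send via gRPC).
template <>
std::string distribute<GRPCBackend>(const double *data, std::size_t n) {
    return "grpc:" + std::to_string(n);
}

// MPI specialization (stub: would call, e.g., MPI_Scatter).
template <>
std::string distribute<MPIBackend>(const double *data, std::size_t n) {
    return "mpi:" + std::to_string(n);
}
```

The benefit of this design is that the kernels calling `distribute<Backend>(...)` stay backend-agnostic, and adding MPI later means only adding specializations.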

- New enum type for distributed backend implementation
…/" .

- Some files containing gRPC code were located under distributed/worker. Moved class ProtoDataConverter and class CallData.
- Some files containing gRPC code were located under distributed/coordinator. Moved class DistributedGRPCCaller.
- Updated CMAKE files.
- Separated the worker implementation from gRPC.
- The worker's gRPC implementation now derives from the base class WorkerImpl.
- Base class WorkerImpl contains generic functions for storing data, computing pipelines, etc.
- class WorkerImplGRPC contains functions for communicating with gRPC and using parent class for storing/computing data.
- TODO WorkerImplMPI.
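The class split described above can be sketched as follows. The method names (`store`, `load`, `onReceive`) and the `double` payload are assumptions for illustration, not the actual DAPHNE interfaces; the point is that transport-agnostic storage/compute logic lives in the base class while subclasses handle only communication.

```cpp
#include <map>
#include <string>

// Base class: generic logic for storing data, independent of transport.
class WorkerImpl {
protected:
    std::map<std::string, double> storage; // identifier -> stored value
public:
    virtual ~WorkerImpl() = default;
    void store(const std::string &id, double value) { storage[id] = value; }
    double load(const std::string &id) const { return storage.at(id); }
};

// Transport-specific subclass: would unpack a gRPC message, then reuse
// the base-class store(). A WorkerImplMPI would mirror this for MPI.
class WorkerImplGRPC : public WorkerImpl {
public:
    void onReceive(const std::string &id, double value) { store(id, value); }
};
```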
- There was already a partial implementation transferring the recent changes from vectorized pipelines to distributed pipelines.

- However, a few pieces were still missing to make it work:

- The CallKernelOp generated for the DistributedPipelineOp in RewriteToCallKernelOpPass must have the attribute "hasVariadicResults" to ensure correct lowering in LowerToLLVMPass.

- The number of outputs must come after the outputs in the kernel, and must not be added as an operand to the CallKernelOp, since it is added automatically for variadic results in LowerToLLVMPass.
…ted types (for now we support only Dense<double>)
- Minor fixes.
- Due to current object-metadata limitations, we only support unique inputs for a distributed pipeline (no duplicate pipeline inputs).
- Distributed kernels have specializations for each distributed backend implementation.
- Distributed kernels update the metadata and handle the communication using the specific distributed-backend implementation.
- Distributed metadata now holds only information.
- TODO: Add simple transferTo/From functions in the metadata class for the distributed gRPC implementation.
- Various small changes.
- main includes the initial metadata implementation
- The MetaDataObject mdo field of class Structure is now public. Distributed kernels need to access and modify the metadata of an object.
- Various small updates to kernels in order to support the new metadata implementation.
- WorkerTest.cpp now tests the generic WorkerImpl class, instead of the gRPC specific implementation.
- TODO: Add a test for the gRPC WorkerImpl class.
- Removed unused utility function "StartDistributedWorker"
- Disabled the "DistributedRead" test. With the new distributed-pipeline implementation we do not support distributed read yet; therefore, this test does not actually test anything significant.
- Updated a few test-scripts for the distributed runtime, due to unique-pipeline-inputs limitations.
- Various warning fixes.
- Renamed and moved AllocationDescriptorGRPC.
- Renamed Worker::StoredInfo::filename to identifier.
- Added Status nested class to WorkerImpl.
- Improved serialization from CSRMatrix to protobuf.
- Changed MetaDataObject mdo in Structure class, from public to private.
- Added getter by reference for modifying MetaDataObject of a Structure.
- Improved CSRMatrix serialization from Daphne object to protobuf.
- Fixed various warnings.
- Minor changes.
- Added DistributedContext.h containing information about distributed workers.
- Removed duplicate code.
- Added command line argument "--distributed", in order to enable distributed execution for DAPHNE.
- Various small fixes after adding the "--distributed" flag.

Co-authored-by: Stratos Psomadakis <[email protected]>
Co-authored-by: Aristotelis Vontzalidis <[email protected]>
- Fixed headers.
- Changed some pass-by-value to const-reference.
- Channel map changed to inline static.
…neralize the use of certain structure see DistributedAlloactionHelper.h
When the worker receives a value, it must allocate memory, store the value, and keep the address for later use.
Since no memory was allocated, the stored address did not point to any value, which caused a bug.
@AhmedEleliemy
Collaborator Author

@pdamme @aristotelis96 @corepointer @psomas
Please have a look; it is ready for merge.

Collaborator

@corepointer corepointer left a comment


Dear @AhmedEleliemy,
I have checked out your code and compiled it. This produced some warnings; please fix them.
Also, please address the comments of my last review.
Regards, Mark

…en the number of workers is larger than the work items. Simply, some workers will not participate in the computations, i.e., at least each worker should get 1 work item
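The partitioning guard described above, where surplus workers simply receive no work items, can be sketched like this. The function name and signature are illustrative, not DAPHNE's actual scheduling code.

```cpp
#include <cstddef>
#include <vector>

// Returns how many work items each worker gets. If there are more
// workers than items, the first numItems workers get one item each
// and the remaining workers do not participate in the computation.
std::vector<std::size_t> itemsPerWorker(std::size_t numWorkers,
                                        std::size_t numItems) {
    std::vector<std::size_t> counts(numWorkers, 0);
    if (numWorkers > numItems) {
        for (std::size_t i = 0; i < numItems; ++i)
            counts[i] = 1; // at least one item per participating worker
    } else {
        std::size_t base = numItems / numWorkers;
        std::size_t rest = numItems % numWorkers;
        for (std::size_t i = 0; i < numWorkers; ++i)
            counts[i] = base + (i < rest ? 1 : 0);
    }
    return counts;
}
```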
@AhmedEleliemy AhmedEleliemy changed the title [WIP] MPI-backend support MPI-backend support Mar 14, 2023
@AhmedEleliemy
Collaborator Author

closes #195

@corepointer
Collaborator

Thank you for your contribution @AhmedEleliemy,
I have merged this now with some minor changes to make the test script work and to silence another warning. I further removed SLURM support in the dependency compilation and changed the OpenMPI version to the latest bugfix release (4.1.5).

EricMier pushed a commit that referenced this pull request Mar 28, 2023
* [MINOR] DenseMatrix<int64_t> output for vectorizedPipelineOp

* [BUGFIX] Wrong row offsets while slicing in vectorized task

Changing rowStart and rowEnd in accumulation of vectorized task results caused an out of bounds error when accessing a matrix. Passing altered row start/end values as temporaries fixes this.

* [BUGFIX] Workaround casting error occurring in vectorized cuda

* [MINOR] DAPHNE Binary Format Metadata

When writing in DAPHNE binary format, the *.dbdf.meta file was not written (as is the case with CSV).

* [DAPHNE-#198] Second order function map (#407)

- a first working version of the `map` built-in function in DaphneDSL
- applies a given unary UDF to each element of a DenseMatrix
- the UDF is passed as a function pointer to the map-kernel
- unit tests and script-level tests for map()
- Closes #198
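The mechanism described above, a unary UDF passed as a function pointer to the map-kernel, might be sketched as follows. This operates on a plain buffer instead of a DenseMatrix, and `mapKernel`/`square` are illustrative names, not the actual kernel signature.

```cpp
#include <cstddef>

// Applies a unary UDF, passed as a function pointer, to each element
// of a dense row-major buffer (simplification of a DenseMatrix kernel).
void mapKernel(double *res, const double *arg, std::size_t numElems,
               double (*udf)(double)) {
    for (std::size_t i = 0; i < numElems; ++i)
        res[i] = udf(arg[i]);
}

// Example UDF, standing in for a compiled DaphneDSL function.
double square(double x) { return x * x; }
```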

* [MINOR] Fix compilation on modern gcc

* [DAPHNE-451] FPGA GEMV Operation

This commit adds support for a single precision GEMV operation to the list of supported FPGA operations. The necessary bitstream can be downloaded from the supplemental-binaries git repository (https://github.com/daphne-eu/supplemental-binaries). This does not represent a separate DAPHNE kernel but branches out to call the appropriate function in the Matmult kernel.

* [DAPHNE-451] Initial FPGA SYRK integration

This is a follow-up that is merged in the same pull request (hence the same issue number).
It adds preliminary SYRK support for fp32 (the operation will see more work in the future).

Closes #451

* [DAPHNE-#453] Build from source artifact

This change enables the build.sh script to compile the third party dependencies without relying on the git submodule check out of llvm.

Closes #453

* [DAPHNE-#459] Enable CI

- prepares an ubuntu base image with required dependencies
- builds the daphne system
- runs test suite
- stores created artifacts
- simpler install of cmake
- run build on 20.04
- run only on pushes and PRs to main
- and manually via workflow dispatch

Closes #459, Closes #460

* Upgraded MLIR. (#468)

This commit upgrades MLIR from llvm/llvm-project@4763c8c (May 14, 2021) to llvm/llvm-project@20d454c (Jan 31, 2023).

The upgrade required a lot of changes in the DAPHNE code base, which are summarized below:
- Several things were renamed in MLIR (in some cases with small API changes):
  - C++
    - builder.getIdentifier() -> builder.getStringAttr()
    - FunctionPass -> OperationPass<func::FuncOp>
    - Likewise, runOnFunction() -> runOnOperation()
    - OwningRewritePatternList -> RewritePatternSet
    - OwningModuleRef -> OwningOpRef<ModuleOp>
    - BlockAndValueMapping -> IRMapping
  - TableGen
    - OpTrait -> Trait
    - NoSideEffect -> Pure
- TableGen-generated names of operations' operands/attributes/results/regions all start with "get" now and are in camelCase.
- Several header files were moved in MLIR.
- A few things were refactored in MLIR:
  - Passing of options to execution engine.
  - Signature of matchAndRewrite() changed: instead of ArrayRef<Value> operands, it now gets an OpAdaptor adaptor.
  - Several places that used to require an IndexAttr now need an I64IntegerAttr.
  - Instead of mlir::ConstantIntOp and mlir::ConstantFloatOp, only mlir::arith::ConstantOp.
  - Type parsing and printing was changed. We even need a workaround now.
  - daphne::ConstantOp cannot have a custom builder with an attribute as the parameter anymore, since attributes do not have a type anymore.
  - Folding interfaces.
  - scf::IfOp builder does not accept result types anymore.
  - The StandardOps dialect was replaced by several other dialects, e.g. Arith, Func, ControlFlow.
- Needed to add some extra (existing MLIR) passes to the DAPHNE compilation chain:
  - A pass for creating C wrappers.
  - A pass for eliminating unrealized conversion casts (we introduce some explicitly when lowering to LLVM now).
- Some things in DAPHNE didn't work anymore, needed to be fixed:
  - Lowering to LLVM: kernel calls shouldn't have the return type void, but they should simply have an empty vector of return types.
  - Lowering to LLVM: special treatment of ReturnOp in VectorizedPipelineOp via UnrealizedConversionCast was necessary (maybe there are better solutions).

Some useful resources related to some of these changes:
- https://discourse.llvm.org/t/psa-update-your-opconversionpattern-matchandrewrite-methods/4354
- https://reviews.llvm.org/D110293
- https://reviews.llvm.org/D110293#change-lfVb579kxHdZ
- llvm/llvm-project@610139d
- https://discourse.llvm.org/t/psa-new-improved-fold-method-signature-has-landed-please-update-your-downstream-projects/67618

* [BUGFIX] Fix build dependency

Move daphneir before compiler/utils to fix an error that seems to occur only when compiling with a low thread count (the GitHub Action compiles with 2 threads, where the issue surfaced).

* [DAPHNE-#470] Cache build dependencies to speed up github action  compilation

* trigger on push to branch
* fix work dir permissions after code checkout in github action
* thirdparty dir caching
* check for existence of submodule directory in .git for llvm-project
* fix build.sh error testing for .git
* improve build.sh -nf
* only checkout submodules if .git/modules exists
* main.yml debug output

Closes #470

* [MINOR] Switch off fancy in non-terminal environment

...and protect that tput call.

* [MINOR] build.sh parameter for installPrefix

Set installPrefix from command line with --installPrefix=/my/directory

* [MINOR] Install MLIR in installPrefix

This change adds the invocation of the installation target while compiling LLVM/MLIR. With that, we can omit specifying LLVM_DIR and MLIR_DIR explicitly. Furthermore, this cuts the dependency of buildPrefix, which might not contain a meaningful value in a container with prebuild 3rd party dependencies. Finally, this enables us to switch the directory of installed dependencies with the installPrefix parameter of build.sh.

* [MINOR] Flag to make dependency compilation optional

Dependency building will be omitted if the --no-deps flag is provided to build.sh.
This is particularly useful in combination with --installPrefix for containers which have the deps already installed somewhere else (coming soon), e.g., ./build.sh --no-deps --installPrefix /usr/local
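The flag handling described above could be sketched as a small parser. The function and variable names here are illustrative, not build.sh's actual internals, and only the `--installPrefix=<dir>` form is handled.

```shell
# Minimal sketch of build.sh-style flag parsing (names are assumptions).
parse_build_flags() {
    BUILD_DEPS=1
    INSTALL_PREFIX=""
    for arg in "$@"; do
        case "$arg" in
            --no-deps) BUILD_DEPS=0 ;;                          # skip dependency build
            --installPrefix=*) INSTALL_PREFIX="${arg#--installPrefix=}" ;; # install dir
        esac
    done
}
```

A caller would invoke it as `parse_build_flags --no-deps --installPrefix=/usr/local` and then branch on `$BUILD_DEPS`.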

* [DAPHNE-#466] Change behavior of build.sh --clean

* --clean now only cleans the build output of DAPHNE and not the third party stuff. See the Github issue #466 for further details.
* reverting a workaround introduced by #470 (dir check before updating submodules)
* documentation updates

Closes #466, Closes #471

* [MINOR] Change Github Action cache key to exact hash

Minor fixes by Patrick Damme <[email protected]>

- build.sh
  - removed duplicate echo
  - reset the flag for accepting with yes after call
  - line break in message: line shall not start with whitespace
- README.md
  - wording
- BuildingDaphne.md
  - escaped some "<" by "\<"
  - added some missing ">"
  - corrected default install prefix

Co-authored-by: Patrick Damme <[email protected]>

* [CLEANUP] Reformat build.sh and remove warnings

This commit contains the result of a code formatter set to 4 spaces of indentation. Furthermore, all ShellCheck warnings have been resolved.
In addition, the --no-deps parameter got an abbreviation and is now also available as -nd.

* [DAPHNE-#473] Improved Apache Arrow support

Cleaned up the integration of Apache Arrow for Parquet reader support.
* Arrow is now always built from the 11.0 release package
* CMake code is now free from hard-coded paths
* It seems we can drop the requirement to install the boost libraries, as everything compiles just fine without them, and we don't use much of Arrow at the moment anyway

Closes #473

* [BUGFIX] Resolving circular build dependency

Still trying to fix a build error when building on GitHub Actions where CompilerUtils depends on some tablegen generated headers and vice versa.

* [MINOR] Fix git submodule loaded from gh action cache

Another fix for this issue =)

* [MINOR] Github Action: avoid test.sh and dependency building

* running tests without test.sh, as it does not support passing arguments to build.sh yet.
* If test.sh is used again in the future, run it without piping output to `ts`, as this swallows the exit code.
* using a docker image with prebuilt dependencies now
* no need to build third party dependencies anymore as this is coming from the docker image
* deactivates the previously introduced caching of the third party dependency build

* [DAPHNE-#472, DAPHNE-#192] New Daphne Docker Images

* This is an initial version of our official Docker images for various purposes. At the moment, it contains an image for running our GitHub action more efficiently (skipping dependency building) and an interactive development environment (also containing precompiled deps).
* Shell scripts are provided to build and run these containers.
* Use the run scripts to bind-mount your local DAPHNE source tree and avoid permission issues (commands in the container will be run as your local user to not end up with files that belong to root)
* A longer standing issue of missing numpy is also resolved now (issue #192)
* Initial dockermentation in containers/Readme.md

Closes #472, Closes #192

* [MINOR] Fix a path in submodule checkout

This one fixes a fix (from f314fdd) where the full path to the .git file is needed (and was not supplied)

* [MINOR] Fixed DistributedWrapper, could not parse MLIR code.

DistributedWrapper::getPipelineInputTypes could not parse the MLIR fragment because FuncDialect was not registered.

Closes issue 475.

* [BUGFIX] Test case failed to trigger distributed execution.

- The `--vec` flag used to be sufficient to trigger distributed execution if the environment variable `DISTRIBUTED_WORKERS` is set.
- However, meanwhile, this behavior was changed, now `--distributed` must be given, but the test case didn't reflect this change.
- Thus, the test case was executed purely locally, thereby hiding any bugs in the distributed execution (see #475).
- This commit fixes this bug.

* [MINOR] Added a link in the DaphneDSL lang ref.

* [DAPHNE-#132] Frame column labels after (#478)

- Modified the labeling for the second column of the group-join operation to reflect the summation.
- Adapted the frame label inference for consistency.
- Modified the test case to reflect the changes.
- Closes #132.

* [MINOR] Updated broadcast kernel assertion.

The broadcast kernel has an assertion that checks whether the matrix/value is a nullptr, but when the value 0 was broadcast, the assertion still failed. Updated it so that the assertion applies only when broadcasting a matrix.
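The corrected check could be sketched as follows. `checkBroadcastInput`, `isScalar`, and the exception-based failure are illustrative stand-ins for the kernel's actual assertion; the point is that a null pointer is only an error when a matrix, not a scalar, is being broadcast.

```cpp
#include <stdexcept>

// Only a matrix broadcast requires a non-null pointer; broadcasting the
// scalar value 0 is legitimate and must not trip the check.
void checkBroadcastInput(bool isScalar, const void *mat) {
    if (!isScalar && mat == nullptr)
        throw std::invalid_argument("broadcast: matrix must not be null");
}
```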

* [MINOR] Change Github Action Docker Image

Changing the docker image of our CI back to daphneeu/github-action to include pre-main third party dependencies.

* [DAPHNE-195] Distributed Ops using OpenMPI

This commit adds OpenMPI as an alternative way of running distributed operations (besides gRPC).

Closes #195, Closes #436

* [DAPHNE-#481] Right indexing rows by bit vector for DenseMatrix.

- Harmonized parsing of right indexing for extracting/slicing and filtering.
- Added partial specialization of filterRow-kernel for DenseMatrix.
- Added unit tests and script-level tests.
- Adapted the user documentation.
- Contributes to #481, but indexing columns by bit vector is still an open todo.
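The row-filtering behavior described above might look like this for a row-major dense buffer. The function name and the flat `std::vector` representation are simplifications of the actual DenseMatrix specialization.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps exactly those rows of a row-major buffer whose bit-vector
// entry is non-zero (sketch of a filterRow-style kernel).
std::vector<double> filterRows(const std::vector<double> &arg,
                               std::size_t numCols,
                               const std::vector<std::uint8_t> &bitVec) {
    std::vector<double> res;
    for (std::size_t r = 0; r < bitVec.size(); ++r)
        if (bitVec[r])
            res.insert(res.end(),
                       arg.begin() + r * numCols,
                       arg.begin() + (r + 1) * numCols);
    return res;
}
```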

* [DAPHNE-#481] Right indexing columns by bit vector for DenseMatrix.

- New operation FilterColOp with some necessary inferences.
- New kernel filterCol with a partial specialization for DenseMatrix.
- Parsing of right indexing supports bit vectors for the columns.
- Unit test cases and script-level test cases.
- Updated the DaphneDSL language reference in the documentation.
- Closes #481.

* [MINOR] Fixed two typos in error messages.

* [MINOR] Removed a todo related to a closed issue.

* [MINOR] Updated DaphneDSL reference on right indexing.

- Simplified it by showing how matrix literals can be used to specify the positions.
- Added a remark on a required whitespace.

* [BUGFIX] genGivenVals() with zero rows.

- genGivenVals() used to crash if zero rows are specified.
- Now, it throws an exception.
- Note that, while matrices with 0 rows are allowed, using genGivenVals() to create them doesn't make sense (see comment in the code).
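The fixed behavior could be sketched as follows. `genGivenValsSketch` is a hypothetical stand-in for the genGivenVals() test utility, using nested vectors instead of a DenseMatrix; the essential change is throwing instead of crashing on zero rows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Builds a numRows x (vals.size()/numRows) matrix from the given values,
// throwing (instead of crashing) when zero rows are requested.
std::vector<std::vector<double>> genGivenValsSketch(
        std::size_t numRows, const std::vector<double> &vals) {
    if (numRows == 0)
        throw std::invalid_argument("genGivenVals: numRows must be > 0");
    std::size_t numCols = vals.size() / numRows;
    std::vector<std::vector<double>> m(numRows,
                                       std::vector<double>(numCols));
    for (std::size_t r = 0; r < numRows; ++r)
        for (std::size_t c = 0; c < numCols; ++c)
            m[r][c] = vals[r * numCols + c];
    return m;
}
```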

---------

Co-authored-by: Mark Dokter <[email protected]>
Co-authored-by: Simeon <[email protected]>
Co-authored-by: pratuszniak <[email protected]>
Co-authored-by: Benjamin Steinwender <[email protected]>
Co-authored-by: Patrick Damme <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Co-authored-by: DamianDinoiu <[email protected]>
Co-authored-by: Ahmed Eleliemy <[email protected]>