Pandas, Pytorch & Tensorflow Shared Memory Support (Issue #499) #585

Merged: 75 commits, Apr 26, 2024


danielwetzel
Contributor

Added Pandas Shared Memory Support for Frames

  • Different pandas object types (e.g., Series, sparse, categorical) are automatically converted to standard frames
  • With the argument "keepIndex=True" in the from_pandas function, the original DataFrame index is stored as the first column, named "index"
  • With the argument "useIndexColumn=True", the index column of a Daphne frame becomes the index of the pandas DataFrame instead of being stored as a separate column
  • Updated the frame operators in Python
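
As a rough illustration of what the two index options mean on the pandas side, the following sketch uses only pandas itself. It does not call DaphneLib, the variable names are hypothetical, and it assumes an unnamed index, for which pandas' reset_index() produces a first column called "index".

```python
import pandas as pd

# A small frame with a non-default, unnamed index.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Roughly what keepIndex=True is described to do: the index becomes
# the first column, named "index" (pandas' default for an unnamed index).
with_index = df.reset_index()

# Roughly what useIndexColumn=True is described to do on the way back:
# the "index" column becomes the pandas index again.
restored = with_index.set_index("index")
restored.index.name = None  # the original index had no name
```

Without the second step, the "index" column would simply stay a regular data column of the resulting DataFrame.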

Added PyTorch Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened to 2d)
Added TensorFlow Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened to 2d)

  • Tensors are transformed to matrices; the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods
  • Matrices from Daphne can be returned as PyTorch & TensorFlow tensors, using optional arguments of the compute function: "isTensorflow: bool", "isPytorch: bool", "shape" (original shape of the tensor)
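
To make the flattening and the role of the returned shape more concrete, here is a minimal sketch in plain Python. It is not the DaphneLib implementation: the helper to_2d_shape is hypothetical, and it assumes that the first dimension is kept as the rows while all remaining dimensions are collapsed into the columns.

```python
from array import array
from math import prod


def to_2d_shape(shape):
    """Collapse an nd shape to (rows, cols), keeping the first dimension as rows.

    Assumption for this sketch: the trailing dimensions are merged into columns.
    """
    if len(shape) < 2:
        raise ValueError("expected at least two dimensions")
    return (shape[0], prod(shape[1:]))


original_shape = (2, 3, 4)

# Row-major buffer with the values 0.0 .. 23.0 standing in for tensor data.
values = array("d", range(prod(original_shape)))

# Two views over the same buffer, no copy involved:
raw = memoryview(values).cast("B")
matrix = raw.cast("d", shape=to_2d_shape(original_shape))  # 2 x 12 matrix
tensor = raw.cast("d", shape=original_shape)               # original 2 x 3 x 4 view
```

Because the 2d view alone no longer tells how the columns were grouped, the original shape has to be kept somewhere if the tensor is to be restored later.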

Added Updates to the Numpy functions & to the Matrix Operators in Python

Added Support for Daphne SQL and Joins for Frames

  • Only inner joins work at the moment, because semi joins and group joins unexpectedly return multiple results in Daphne (demonstrated in "scripts/examples/daphnelib/issues-with-joins.daph")
  • The other two join functions are commented out for future adjustments

Added a delete function for Daphne Objects in Python to prevent Memory Overflow

Designed all functions in a zero-copy manner with a strong focus on performance
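
The general idea behind zero-copy transfer can be sketched with the Python standard library alone. This is an illustration under the assumption that the hand-over amounts to passing a buffer address within one process; it is not the code from this PR, and all names are hypothetical.

```python
import ctypes
from array import array

# Producer side: a 2x3 matrix stored row-major in one contiguous buffer.
data = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
address, length = data.buffer_info()  # (memory address, number of elements)

# Consumer side: wrap the very same memory by address instead of copying it.
DoubleArray = ctypes.c_double * length
view = DoubleArray.from_address(address)

# A write through the view is visible in the original buffer,
# which shows that no copy was made.
view[0] = 42.0
```

The producer must keep the buffer alive (and must not resize it) for as long as the consumer uses the address, which is the usual price of avoiding the copy.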

Examples for all the added functions

  • "scripts/examples/daphnelib/"
  • with a real world example "scripts/examples/daphnelib/daphne-python-realworld-example.py"

Benchmarks & Benchmark Results

  • "scripts/benchmarks/"
  • "scripts/benchmarks/testoutputs/"

Function Tests

  • "test/api/python/"

@danielwetzel changed the title from "Pandas, Pytorch & Tensorflow Shared Memory Support" to "Pandas, Pytorch & Tensorflow Shared Memory Support (Issue #499)" on Aug 7, 2023
@danielwetzel
Contributor Author

Solution for Issue #499 in the Context of the LDE Project Summer 2023

@pdamme self-requested a review on August 13, 2023 19:09
@pdamme added the label "LDE summer 2023" (Student project in the course Large-scale Data Engineering at TU Berlin, summer 2023) on Sep 13, 2023
@pdamme
Collaborator

pdamme commented Sep 13, 2023

I completely forgot to answer here, since we had already discussed the topic offline. Thanks for your contribution, @danielwetzel and @Niklas-Ventker. Connecting DAPHNE better to the data science ecosystem by enabling efficient data transfer with widely used Python libraries is a great new feature, and highly anticipated by our use-case partners.

I will finalize your code and merge it soon.

Niklas-Ventker and others added 20 commits April 23, 2024 21:47
Placeholders as they are not working yet
Overview.md
- DaphneLib data exchange via shared memory for pandas, tensorflow, pytorch

APIRef.md
- Added shared memory for from_numpy and from_pandas
- Added from_tensorflow and from_pytorch
- Added Frame and Matrix operations
- Removed things we don't need on the main branch.
  - Scripts for experiments/performance tests (and the CSV files containing their outputs).
  - Some new example scripts in "scripts/examples/daphnelib/" that were not ready to be showcased.
  - Everything related to SQL queries in DaphneLib, since the current approach with a separate compute_sql() is not intuitive (and a good solution to take the registerView()s into account in compute() is not trivial).
  - Additional relational frame ops that would require MultiReturn, since the implementation was not finished.
  - Everything related to freeing the DaphneLib results, since the current approach required manual freeing, which is not normally necessary in Python and could cause errors.
- Revised the changes to the docs.
  - E.g., simplified the examples for data transfer with TensorFlow/PyTorch.
- Simplified .gitignore
- Silenced TensorFlow warnings in DaphneLib test cases.
- To simplify the merge with the latest state of the main branch.
- Changed the Python import statements accordingly.
pdamme added 2 commits April 26, 2024 15:36
- They currently require TensorFlow and PyTorch as dependencies, but these do not exist in the current DAPHNE containers on DockerHub.
- A follow-up commit after the merge of this PR will make TensorFlow and PyTorch optional dependencies and switch on the test cases again.
- Currently, all tests pass in a container that has TensorFlow and PyTorch installed.
Collaborator

@pdamme pdamme left a comment

Again, thanks a lot for this valuable contribution @danielwetzel and @Niklas-Ventker! I finally found the time to get back to it now. It will be included in our upcoming v0.3 release.

I rebased your commits on the latest state of main and tidied up your code to make it ready to be merged. In that context, I also reduced it a bit to the core of the contribution regarding data exchange with Python libs. Some of the frame ops you added are also still in it.

We also appreciate the remaining things you started implementing. These were valuable for the LDE project. However, I removed some of them from this PR because they’re not in the state to be merged into main, e.g.:

  • SQL processing in DaphneLib through a separate compute_sql()
  • freeing of the DaphneLibResult
  • the partly finished work on MultiReturn
  • the scripts for your project experiments
  • some example scripts

Furthermore, I can confirm that all test cases run successfully in this PR, if TensorFlow and PyTorch are installed. Since these dependencies are not included in the container we use for CI testing, I commented out all DaphneLib test cases for now. After the merge I will make a follow-up commit that makes these dependencies optional and switches on the test cases again.
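
One common way to treat such dependencies as optional is to probe for them up front and skip the affected test cases when they are missing. The sketch below only illustrates that pattern; the helper name and the flags are hypothetical, and the follow-up commit may well solve it differently.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical flags a test runner could use to skip framework-specific cases.
HAS_TENSORFLOW = is_installed("tensorflow")
HAS_PYTORCH = is_installed("torch")
```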

@pdamme pdamme merged commit f359a77 into daphne-eu:main Apr 26, 2024
2 checks passed
pdamme added a commit that referenced this pull request Apr 26, 2024
… via shared memory (#585)"

- This reverts commit f359a77.
- Reason: I forgot to mention @Niklas-Ventker as a co-author when "squash & merge"ing the PR in GitHub, but we want to give full credit.
pdamme pushed a commit that referenced this pull request Apr 26, 2024
…red memory (#585)

- Efficient data transfer via shared memory in DaphneLib.
  - Designed all functions in a zero-copy manner with strong focus on performance.
  - Added pandas shared memory support for frames.
    - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
    - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
    - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
  - Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
    - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
    - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.

- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.

Co-authored-by: Niklas <[email protected]>
pdamme added a commit that referenced this pull request Apr 26, 2024
… via shared memory (#585)"

- This reverts commit 4d4ec47.
- Reason: When re-committing the changes with an additional co-author in the commit message, I forgot to include the newly added files...
- Sorry to clutter the commit history, but we have a rule of never ever force-pushing to main.
pdamme pushed a commit that referenced this pull request Apr 26, 2024
…red memory (#585)

- Efficient data transfer via shared memory in DaphneLib.
  - Designed all functions in a zero-copy manner with strong focus on performance.
  - Added pandas shared memory support for frames.
    - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
    - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
    - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
  - Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
    - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
    - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.

- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.
- So they were re-committed in 4d4ec47, but there, the newly added files from f359a77 were forgotten, which are added again now.

Co-authored-by: Niklas <[email protected]>