Pandas, Pytorch & Tensorflow Shared Memory Support (Issue #499) #585

danielwetzel · 2023-08-07T11:48:48Z

Added Pandas Shared Memory Support for Frames

Different pandas object types (e.g., Series, sparse, categorical) are automatically converted to standard frames
With the argument "keepIndex=True" in the from_pandas function, the original DataFrame index is stored as the first column, named "index"
With the argument "useIndexColumn=True", the index column of a Daphne frame becomes the index of the pandas DataFrame instead of a separate column
Updated Frame Operators within Python
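
The index handling described above can be pictured with plain pandas. This is only a sketch of the described behaviour, not DaphneLib code: it assumes that "keepIndex=True" acts like `reset_index()` and that "useIndexColumn=True" acts like `set_index("index")`. The column names and values are made up for illustration.

```python
import pandas as pd

# A small frame with a non-default index (illustrative data only).
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]}, index=[10, 20, 30])

# Assumed equivalent of keepIndex=True:
# the index is moved into a first column named "index".
with_index = df.reset_index()

# Assumed equivalent of useIndexColumn=True:
# the "index" column becomes the DataFrame index again.
restored = with_index.set_index("index")
restored.index.name = None  # drop the name so it matches the original frame

print(with_index.columns.tolist())
print(restored.equals(df))
```

Without such a step the index would not survive the transfer, because a frame that stores only columns has no separate slot for it.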

Added Pytorch Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened)
Added Tensorflow Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened)

Tensors are transformed to matrices; the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods
Matrices from Daphne can be returned as PyTorch & TensorFlow tensors via optional arguments of the compute function: "isTensorflow: bool, isPytorch: bool, shape (original shape of the tensor)"
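
A minimal sketch of the flatten-and-restore idea, using only the Python standard library rather than PyTorch, TensorFlow or DaphneLib. It assumes that an nd tensor keeps its first dimension as rows and collapses all remaining dimensions into columns; the PR text does not state the exact layout, so `to_2d_shape` is a hypothetical helper.

```python
from array import array
from math import prod


def to_2d_shape(shape):
    """Keep the first dimension, collapse the rest (assumed layout)."""
    return [shape[0], prod(shape[1:])]


original_shape = [2, 3, 4]
buffer = array("d", range(prod(original_shape)))  # 24 doubles, row-major

# All views below share the same memory; none of them copies the data.
raw = memoryview(buffer).cast("B")
tensor = raw.cast("d", shape=original_shape)               # 2 x 3 x 4 view
matrix = raw.cast("d", shape=to_2d_shape(original_shape))  # 2 x 12 view
restored = raw.cast("d", shape=original_shape)             # original shape again

print(matrix.shape, matrix[1, 0], tensor[1, 0, 0])
```

Keeping the original shape around (what "return_shape=True" is described to provide) is what makes the last step possible, since the 2d matrix alone no longer records how its columns were grouped.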

Added Updates to the Numpy functions & to the Matrix Operators in Python

Added Support for Daphne SQL and Joins for Frames

Only InnerJoins are working at the moment, as Semi Joins and Group Joins unexpectedly return multiple results in Daphne (Demonstrated in "scripts/examples/daphnelib/issues-with-joins.daph")
The other two Join functions are commented out for future adjustments

Added a delete function for Daphne Objects in Python to prevent Memory Overflow

Designed all functions in a zero copy manner with strong focus on performance
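
To make "zero copy" concrete, here is a generic standard-library sketch of sharing a buffer by address instead of copying it. It is an illustration of the technique under stated assumptions, not the DaphneLib implementation: the PR description does not show how addresses are exchanged, and the variable names are made up.

```python
import ctypes
from array import array

# Producer side: a contiguous buffer of doubles owned by Python.
values = array("d", [1.0, 2.0, 3.0, 4.0])
address, length = values.buffer_info()  # raw address and element count

# Consumer side: wrap the same memory instead of copying it.
# (A native runtime could receive `address` and `length` in the same way.)
shared = (ctypes.c_double * length).from_address(address)

shared[0] = 42.0         # write through the shared view...
first_value = values[0]  # ...and read the change from the original buffer

print(first_value)
```

The shared view is only valid while `values` stays alive and is not resized, which is why memory management matters for this kind of data exchange.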

Examples for all the added functions

"scripts/examples/daphnelib/"
with a real world example "scripts/examples/daphnelib/daphne-python-realworld-example.py"

Benchmarks & Benchmark Results

"scripts/benchmarks/"
"scripts/benchmarks/testoutputs/"

Function Tests

"test/api/python/"

danielwetzel · 2023-08-07T11:53:46Z

Solution for Issue #499 in the Context of the LDE Project Summer 2023

pdamme · 2023-09-13T15:27:40Z

I completely forgot to answer here, since we had already discussed the topic offline. Thanks for your contribution, @danielwetzel and @Niklas-Ventker. Connecting DAPHNE better to the data science ecosystem by enabling efficient data transfer with widely used Python libraries is a great new feature, and highly anticipated by our use-case partners.

I will finalize your code and merge it soon.

- Removed things we don't need on the main branch.
  - Scripts for experiments/performance tests (and the CSV files containing their outputs).
  - Some new example scripts in "scripts/examples/daphnelib/" that were not ready to be showcased.
  - Everything related to SQL queries in DaphneLib, since the current approach with a separate compute_sql() is not intuitive (and a good solution to take the registerView()s into account in compute() is not trivial).
  - Additional relational frame ops that would require MultiReturn, since the implementation was not finished.
  - Everything related to freeing the DaphneLib results, since the current approach required manual freeing, which is not normally necessary in Python and could cause errors.
- Revised the changes to the docs.
  - E.g., simplified the examples for data transfer with TensorFlow/PyTorch.
- Simplified .gitignore
- Silenced TensorFlow warnings in DaphneLib test cases.

- They currently require TensorFlow and PyTorch as dependencies, but these do not exist in the current DAPHNE containers on DockerHub.
- A follow-up commit after the merge of this PR will make TensorFlow and PyTorch optional dependencies and switch on the test cases again.
- Currently, all tests pass in a container that has TensorFlow and PyTorch installed.

pdamme

Again, thanks a lot for this valuable contribution @danielwetzel and @Niklas-Ventker! I finally found the time to get back to it now. It will be included in our upcoming v0.3 release.

I rebased your commits on the latest state of main and tidied up your code to make it ready to be merged. In that context, I also reduced it a bit to the core of the contribution regarding data exchange with Python libs. Some of the frame ops you added are also still in it.

We also appreciate the remaining things you started implementing. These were valuable for the LDE project. However, I removed some of them from this PR because they’re not in the state to be merged into main, e.g.:

SQL processing in DaphneLib through a separate compute_sql()
freeing of the DaphneLibResult
the partly finished work on MultiReturn
the scripts for your project experiments
some example scripts

Furthermore, I can confirm that all test cases run successfully in this PR, if TensorFlow and PyTorch are installed. Since these dependencies are not included in the container we use for CI testing, I commented out all DaphneLib test cases for now. After the merge I will make a follow-up commit that makes these dependencies optional and switches on the test cases again.
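
One common way to make such dependencies optional in Python tests is to probe for the module and skip the affected test cases when it is missing. The sketch below shows that pattern with the standard library only. It is an assumption about how this could be done, not the planned follow-up commit; `has_module` and the test class are hypothetical names.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


HAS_TORCH = has_module("torch")
HAS_TENSORFLOW = has_module("tensorflow")


class DataTransferTest(unittest.TestCase):
    @unittest.skipUnless(HAS_TORCH, "PyTorch is not installed")
    def test_pytorch_shape(self):
        import torch  # imported lazily, only when the test actually runs

        self.assertEqual(tuple(torch.ones(2, 3).shape), (2, 3))

    @unittest.skipUnless(HAS_TENSORFLOW, "TensorFlow is not installed")
    def test_tensorflow_shape(self):
        import tensorflow as tf  # imported lazily, only when the test runs

        self.assertEqual(tuple(tf.ones((2, 3)).shape), (2, 3))
```

With this pattern the suite reports the tensor tests as skipped in a container without TensorFlow or PyTorch, instead of failing at import time.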

@Niklas-Ventker

… via shared memory (#585)"
- This reverts commit f359a77.
- Reason: I forgot to mention @Niklas-Ventker as a co-author when "squash & merge"ing the PR in GitHub, but we want to give full credit.

@pdamme

…red memory (#585)
- Efficient data transfer via shared memory in DaphneLib.
- Designed all functions in a zero-copy manner with strong focus on performance.
- Added pandas shared memory support for frames.
  - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
  - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
  - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
- Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
  - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
  - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.
- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.
Co-authored-by: Niklas <[email protected]>

… via shared memory (#585)"
- This reverts commit 4d4ec47.
- Reason: When re-committing the changes with an additional co-author in the commit message, I forgot to include the newly added files...
- Sorry to clutter the commit history, but we have a rule of never ever force-pushing to main.

@pdamme

…red memory (#585)
- Efficient data transfer via shared memory in DaphneLib.
- Designed all functions in a zero-copy manner with strong focus on performance.
- Added pandas shared memory support for frames.
  - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
  - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
  - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
- Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
  - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
  - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.
- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.
- So they were re-committed in 4d4ec47, but there, the newly added files from f359a77 were forgotten, which are added again now.
Co-authored-by: Niklas <[email protected]>

danielwetzel changed the title ~~Pandas, Pytorch & Tensorflow Shared Memory Support~~ Pandas, Pytorch & Tensorflow Shared Memory Support (Issue #499) Aug 7, 2023

danielwetzel mentioned this pull request Aug 7, 2023

Connecting DAPHNE to the data science ecosystem: Efficient data exchange with popular Python libs #499

Closed

pdamme self-requested a review August 13, 2023 19:09

pdamme added the LDE summer 2023 Student project in the course Large-scale Data Engineering at TU Berlin (summer 2023). label Sep 13, 2023

danielwetzel and others added 24 commits April 23, 2024 21:44

CleanUp

7a2cf67

update .gitignore

c8a91dd

llvm removed

Working Pandas Extension

28d1130

Pandas Working Cleanup

93b4c11

Pandas with execution timing

ce7c7ba

Pandas Frame Type Check

c0523e6

Pandas (Verbose-Flag Argument)

0418fc5

TensorFlow_AND_PandasPerformanceTest

7337bda

PandasPerformanceTest_Adjustments

c22165a

CleanUp

8bd8df6

Cleanup main branch

Revert "CleanUp"

34d710b

This reverts commit 6082700.

Added from_pytorch function in daphne_context.py

6da8469

Renamed from_tensor to from_tensorflow Added from_pytorch

Added isTensorFlow and isPyTorch draft in compute function

bcb2801

Added isTensorFlow and isPyTorch draft in compute function of operation_node.py

Added data exchange test script for pytorch

e7d05e2

Script update

14b23c0

Approach for a better evaluation

Performance Test 1 & 2

483f028

Pandas Performance Test 3

f83419c

Pandas Performance Test 3 Fix

cd9a076

Small changes

de3f514

Uncommented tensorflow and pytorch functions

74fa874

TensorFlow and Torch adjustments

87fbc29

Updated Tensor Support, Memory Management & Benchmark

3147874

PyTorch & TensorFlow Adjustments + Memory Management Adjustments

e461eca

Added comment of missing indices handover for dataframes

d598751

Comment that from_pandas() does not carry over the indices of DataFrames: "Use pandas.DataFrame.reset_index() in advance to keep indices."

Niklas-Ventker and others added 20 commits April 23, 2024 21:47

Increased runs and added newly run benchmark results

3ada58d

Added tests in test/api/python/

f947455

Placeholders as they are not working yet

Daphne-Python-Realworld-Example

e5368af

Daphne Test Cases Fixed

49f4649

Added 3 further Test Cases

0d92835

Delete compiler-debug-cuda.txt

e08d405

Delete compiler-trace-cuda.txt

a869896

Delete daphne-output.txt

41d0e6a

Delete daphne.code-workspace

8f4fd55

Clean up

2305242

Benchmark outputs clean up

0d4f011

Delete containers-tmp directory

33589e8

Delete test.py

f868731

Delete import_success_2 2.txt

a04c546

Added test cases for series, sparse and categorical dataframe

b326ff9

Added documentation

c5ce184

Overview.md
- DaphneLib data exchange via shared memory for pandas, tensorflow, pytorch
APIRef.md
- Added shared memory for from_numpy and from_pandas
- Added from_tensorflow and from_pytorch
- Added Frame and Matrix operations

Quick fixes to make all tests pass after rebasing the commits in this…

09128cd

… PR on upstream/main.

Moved the entire source code of DaphneLib into a subdir "daphne".

6d2d849

- To simplify the merge with the latest state of the main branch.
- Changed the Python import statements accordingly.

Merge remote-tracking branch 'upstream/main'

2b2c021

pdamme force-pushed the main branch from f17ec67 to 28bb7c2 Compare April 26, 2024 13:01

pdamme added 2 commits April 26, 2024 15:36

Additional polishing.

c6ff10c

pdamme force-pushed the main branch from 28bb7c2 to 2d4cfc9 Compare April 26, 2024 13:46

pdamme approved these changes Apr 26, 2024

pdamme merged commit f359a77 into daphne-eu:main Apr 26, 2024
2 checks passed
