Pandas, Pytorch & Tensorflow Shared Memory Support (Issue #499) #585
Solution for Issue #499 in the Context of the LDE Project Summer 2023
I completely forgot to answer here, since we had already discussed the topic offline. Thanks for your contribution, @danielwetzel and @Niklas-Ventker. Connecting DAPHNE better to the data science ecosystem by enabling efficient data transfer with widely used Python libraries is a great new feature, and highly anticipated by our use-case partners. I will finalize your code and merge it soon.
llvm removed
This reverts commit 6082700.
Renamed from_tensor to from_tensorflow. Added from_pytorch.
Added isTensorFlow and isPyTorch draft in compute function of operation_node.py
Approach for a better evaluation
Comment that from_pandas() does not carry over the indices of DataFrames: "Use pandas.DataFrame.reset_index() in advance to keep indices."
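The workaround named in that comment can be illustrated with pandas alone. This is a minimal sketch, not code from the PR: the DataFrame contents and the variable names `df` and `flat` are made up for illustration, and the DaphneLib call itself is left out on purpose.

```python
import pandas as pd

# Hypothetical example data with a non-default index.
df = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])

# reset_index() moves the index into a regular column (named "index" for an
# unnamed index), so the values survive a transfer that only copies columns.
flat = df.reset_index()

print(list(flat.columns))
print(flat["index"].tolist())
```

After this step the former index is ordinary column data, so it is no longer lost if only the columns are transferred.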
Placeholders as they are not working yet
Overview.md
- DaphneLib data exchange via shared memory for pandas, tensorflow, pytorch

APIRef.md
- Added shared memory for from_numpy and from_pandas
- Added from_tensorflow and from_pytorch
- Added Frame and Matrix operations
… PR on upstream/main.
- Removed things we don't need on the main branch.
  - Scripts for experiments/performance tests (and the CSV files containing their outputs).
  - Some new example scripts in "scripts/examples/daphnelib/" that were not ready to be showcased.
  - Everything related to SQL queries in DaphneLib, since the current approach with a separate compute_sql() is not intuitive (and a good solution to take the registerView()s into account in compute() is not trivial).
  - Additional relational frame ops that would require MultiReturn, since the implementation was not finished.
  - Everything related to freeing the DaphneLib results, since the current approach required manual freeing, which is not normally necessary in Python and could cause errors.
- Revised the changes to the docs.
  - E.g., simplified the examples for data transfer with TensorFlow/PyTorch.
- Simplified .gitignore
- Silenced TensorFlow warnings in DaphneLib test cases.
- To simplify the merge with the latest state of the main branch.
- Changed the Python import statements accordingly.

- They currently require TensorFlow and PyTorch as dependencies, but these do not exist in the current DAPHNE containers on DockerHub.
- A follow-up commit after the merge of this PR will make TensorFlow and PyTorch optional dependencies and switch on the test cases again.
- Currently, all tests pass in a container that has TensorFlow and PyTorch installed.
Again, thanks a lot for this valuable contribution @danielwetzel and @Niklas-Ventker! I finally found the time to get back to it now. It will be included in our upcoming v0.3 release.
I rebased your commits on the latest state of main and tidied up your code to make it ready to be merged. In that context, I also reduced it a bit to the core of the contribution regarding data exchange with Python libs. Some of the frame ops you added are also still in it.
We also appreciate the remaining things you started implementing. These were valuable for the LDE project. However, I removed some of them from this PR because they’re not in the state to be merged into main, e.g.:
- SQL processing in DaphneLib through a separate compute_sql()
- freeing of the DaphneLibResult
- the partly finished work on MultiReturn
- the scripts for your project experiments
- some example scripts
Furthermore, I can confirm that all test cases run successfully in this PR, if TensorFlow and PyTorch are installed. Since these dependencies are not included in the container we use for CI testing, I commented out all DaphneLib test cases for now. After the merge I will make a follow-up commit that makes these dependencies optional and switches on the test cases again.
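One common way to make such dependencies optional in Python test suites is to skip the affected test cases when the library cannot be found. The following is only a sketch under that assumption, not the follow-up commit itself: the helper `has_module`, the flag `HAS_TORCH`, and the test class are hypothetical names, and the test body is a placeholder.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical flag; evaluated once when the test module is loaded.
HAS_TORCH = has_module("torch")


class TestPyTorchTransfer(unittest.TestCase):
    @unittest.skipUnless(HAS_TORCH, "PyTorch is not installed")
    def test_tensor_shape(self):
        # Import inside the test so that loading this file never fails
        # in a container without PyTorch.
        import torch

        t = torch.ones(2, 3)
        self.assertEqual(tuple(t.shape), (2, 3))
```

With this pattern the test cases stay switched on in CI and are reported as skipped, rather than failing, in containers that lack the library.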
… via shared memory (#585)"
- This reverts commit f359a77.
- Reason: I forgot to mention @Niklas-Ventker as a co-author when "squash & merge"ing the PR in GitHub, but we want to give full credit.
…red memory (#585)
- Efficient data transfer via shared memory in DaphneLib.
  - Designed all functions in a zero-copy manner with strong focus on performance.
  - Added pandas shared memory support for frames.
    - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
    - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
    - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
  - Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
    - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
    - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.
- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.

Co-authored-by: Niklas <[email protected]>
…red memory (#585)
- Efficient data transfer via shared memory in DaphneLib.
  - Designed all functions in a zero-copy manner with strong focus on performance.
  - Added pandas shared memory support for frames.
    - Different pandas frame types (e.g., Series, Sparse, Categorical) are automatically transformed to standard frames.
    - With the argument "keepIndex=True" in the from_pandas function, the original df index is stored as the first column named "index".
    - With the argument "useIndexColumn=True" the Index column from a DAPHNE Frame is stored as the index of the pandas df and no longer as separate column.
  - Added PyTorch and TensorFlow shared memory support for 2d & nd tensors (nd tensors will be flattened to 2d).
    - Tensors are transformed to matrices, the original shape can be returned with the argument "return_shape=True" in the from_pytorch & from_tensorflow methods.
    - Matrices from DAPHNE can be returned as PyTorch & TensorFlow tensors, with the optional function arguments for the compute() function: "asTensorflow: bool", "asPytorch: bool", "shape" (original shape of the tensor).
- Added additional frame operations in DaphneLib.
  - Intended for testing processing of data frames transferred from pandas.
- Script-level test cases.
  - Examples and/or test cases for all the added functions.
  - Currently, the test cases related to DaphneLib are commented out as they require TensorFlow and PyTorch as dependencies.
- Updated the DaphneLib documentation.
- Closes #499.
- These changes have been committed before in f359a77, but were reverted in 158772a, since the co-author note was forgotten in the commit message, when @pdamme "squash & merge"ed the pull request.
- So they were re-committed in 4d4ec47, but there, the newly added files from f359a77 were forgotten, which are added again now.

Co-authored-by: Niklas <[email protected]>
Added Pandas Shared Memory Support for Frames
Added PyTorch Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened)
Added TensorFlow Shared Memory Support for 2d & nd Tensors (nd Tensors will be flattened)
Added Updates to the NumPy functions & to the Matrix Operators in Python
Added Support for Daphne SQL and Joins for Frames
Added a delete function for Daphne Objects in Python to prevent Memory Overflow
Designed all functions in a zero copy manner with strong focus on performance
Examples for all the added functions
Benchmarks & Benchmark Results
Function Tests
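
To make the flattening of nd tensors and the zero-copy idea more concrete, here is a NumPy-only sketch. It is not the DaphneLib implementation: the helper `flatten_to_2d` is a hypothetical name, and keeping the first dimension as rows is an assumption about how the flattening is laid out.

```python
import numpy as np


def flatten_to_2d(arr):
    """Reshape an nd array to 2d and return it with the original shape.

    Assumption: the first dimension becomes the rows and all remaining
    dimensions are folded into the columns.
    """
    original_shape = arr.shape
    matrix = arr.reshape(original_shape[0], -1)
    return matrix, original_shape


# Hypothetical 3d input; contiguous, so reshape yields a view, not a copy.
tensor = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
matrix, shape = flatten_to_2d(tensor)

# The 2d matrix shares its buffer with the input (no data was copied).
print(matrix.shape, np.shares_memory(tensor, matrix))

# The stored shape is enough to recover the original layout afterwards.
restored = matrix.reshape(shape)
print(np.array_equal(restored, tensor))
```

Returning the shape alongside the matrix is what allows the result to be reshaped back into the original tensor later on.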