A C implementation of the python pickle serialization function. This includes two implementations, one in pure python under the py_pickle/ directory and one with a C-Python interface under the sea_pickle/ directory.
To setup and run the modules follow the following instructions:
-
Ensure that Python >= 3.8 is installed
-
Ensure that pip is installed
-
run the following commands
cd py_pickle pip install .
-
Ensure that Python >= 3.8 is installed
-
Ensure that pip is installed
-
Ensure that the clang compiler version 10 or later is installed, check with the following command
clang --version
-
Ensure that the Python header files are within your $PATH. If on mac, these come with xcode tools but you may need to export them in your .zshrc, .bash etc. file. Refer to this page for help if needed.
-
Run the following commands to get setup
cd sea_pickle # enter the sea_pickle directory pip install . # install base requirements (pytest, setuptools) ./scripts/build_script.sh # run setuptools building protocol
Warning
If the setup script fails due to an access permission, run the following command chmod +x scripts/build_script.sh
and run the third command again. If once again failing, default to the following base commands of which the script consists
python setup.py clean --all # clean all previous setuptools binaries
rm -rf ./build ./sea_pickle.egg-info # remove previous build directories
python setup.py build # build the extension with setuptools
pip install . # install the extension locally
Testing is operated with pytest, with the tests being under the tests/ directory. Tests can be run in the console with the following command
pytest -vv ./tests/test_sea_pickle.py
Note
Ensure that pytest is installed and if using a venv ensure that it is activated in the terminal.
This API consists of two functions, those being partial_pickle and merge_partials. These can be imported from the C module with the following code
from typing import Callable
import sea_pickle
partial_pickle: Callable = sea_pickle.partial_pickle
merge_partials: Callable = sea_pickle.merge_partials
These functions have the following signatures
partial_pickle(Any) -> bytes
merge_partials(bytes, bytes, Optional[bool]) -> bytes
partial_pickle accepts any type of python object besides sets, bytearrays and custom types. This is due to memoization constrictions which limit the algorithm and its speed. partial_pickle does NOT return a result equivalent to that of pickle.dumps, as it is of a special un-memoized form fit for merging.
merge_partials accepts two bytes objects, which each must be the result of a call to partial_pickle, and a boolean frame_info which defaults to True. If frame_info is true, the output of the function will be equivalent to that of pickle.dumps, however, if it is set to false it is assumed that this is an intermediary phase and will be merged again in the future. Below are the two use cases displayed
# correct base case usage
assert merge_partials(
partial_pickle((1, "a", (2, "b"))), partial_pickle((3, "c", (4, "d")))
) == pickle.dumps([(1, "a", (2, "b")), (3, "c", (4, "d"))])
# incorrect usage of frame_info
assert merge_partials(partial_pickle((300, 400)), partial_pickle([500, 600]), False) == pickle.dumps([(300, 400), [500, 600]])
# correct usage of frame_info, which allows for divide and conquer algorithm usage.
assert merge_partials(
partial_pickle([100, 200]),
merge_partials(partial_pickle((300, 400)), partial_pickle([500, 600]), False),
) == pickle.dumps([[100, 200], (300, 400), [500, 600]])
Default behavior depends on the type of pickled objects:
-
Two Strings/Two Bytes Objects: Merge strings into one
assert merge_partials( partial_pickle("Test string with spaces"), partial_pickle(" and more text.") ) == pickle.dumps("Test string with spaces and more text.") assert merge_partials( partial_pickle(b"Byte string with spaces "), partial_pickle(b"and more bytes.") ) == pickle.dumps(b"Byte string with spaces and more bytes.")
-
Two Inhomogenous Types/Two Non-Stringlike Types: Merge into a list containing the two objects
assert merge_partials( partial_pickle(1.23456789), partial_pickle(9.87654321) ) == pickle.dumps([1.23456789, 9.87654321]) assert merge_partials( partial_pickle([1, 2.5, 3]), partial_pickle([4.0, 5, 6]) ) == pickle.dumps([[1, 2.5, 3], [4.0, 5, 6]])
Warning
merge_partials does NOT support memoization. A lot of time was dedicated to it, and I might implement it in the future, but memoization is not very memory safe and is actually slower than the base pickle algorithm. This is explained further in Algorithm Analysis. What does this mean for the end user? You cannot merge_partials for the following objects:
# sequences with repetitive string-like entries (this applies recursively as well)
["hello", "I", "dont", "repeat"] -> correct usage
["hello", "I", "want", "to", "be", "memoized", "to", "bytes"] -> incorrect usage
^ first usage ^ second usage
# sequences with repetitive tuple entries (lists and dictionaries are fine though)
[(1, 2, 3), (4, 5, 6)] -> correct usage
[(1, 2, 3), (1, 2, 3)] -> incorrect usage
The current implementation utilizes setuptools (which is the standard), pytest for testing (cleaner than unittest), and C for the extension module. The module was designed in such a way to favor parallel programming capabilities, hence the frame_info tuple which allows for a large list to be subdivided, and then sequentially merged, all in
The partial_pickle algorithm follows the same steps as the base pickling algorithm, however it is automatically defaulted to the pickling protocol 4 and is without memoization, meaning that the runtime of both of these algorithms is
merge_partials runs in
Therefore, I made the decision to leave that type of encoding to the base pickling algorithm, and instead focus on the encoding where this algorithm offers major efficiency boosts with parallelization, those of large string-likes and sequences of 2D or greater.
I would like to implement a future mediary method which could take in any type of python object, determine it's type and size, and disbatch the work out either to partial_pickle and merge_partials with an optimized number of threads/processes or to the base pickling algorithm (where it is more efficient), allowing for a simpler interface for users just to import that one function and to expect it to work the same way that pickle.dumps does.