Add db benchmark script #1928

Merged: 11 commits, Mar 16, 2022


matthewmturner
Contributor

Which issue does this PR close?

Closes #1870

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@matthewmturner
Contributor Author

I think I'm close, but I'm having some issues due to Docker memory swapping, which isn't allowed by some db-benchmark helpers. Will come back to this.

@matthewmturner
Contributor Author

Lol. I unknowingly created a shell script whose name conflicted with a db-benchmark script, which caused the issue.

Comment on lines 4 to 8
RUN apt-get update && \
apt-get install -y git build-essential

# Install R, curl, and python deps
RUN apt-get update && apt-get -y install --no-install-recommends --no-install-suggests \
Member

the apt-get update seems to be redundant? looks like we can merge these two apt runs to reduce build time.

Contributor Author

Good catch. Thx.


# Clone datafusion-python and build python library
# Not sure if the wheel will be the same on all computers
RUN git clone https://github.com/datafusion-contrib/datafusion-python \
Member

would be good to clone a particular tag/commit to make this more reproducible.
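
A minimal sketch of what pinning the clone could look like, written as a plain shell helper. The function name `clone_at_ref` is hypothetical and is not part of this PR; the Dockerfile would still need a concrete tag or commit SHA chosen by the maintainers.

```bash
# Clone a repository and check out one fixed ref (a tag or a commit SHA).
# Usage: clone_at_ref <repo-url-or-path> <ref> <dest-dir>
# NOTE: clone_at_ref is an illustrative name, not something the PR defines.
clone_at_ref() {
  repo=$1
  ref=$2
  dest=$3
  git clone --quiet "$repo" "$dest" || return 1
  # A detached checkout ties the working tree to exactly this ref,
  # so later upstream commits do not change what gets built.
  git -C "$dest" checkout --quiet --detach "$ref"
}
```

In the Dockerfile this would amount to something like `clone_at_ref https://github.com/datafusion-contrib/datafusion-python "$REF" datafusion-python`, where `REF` is a placeholder for whichever tag or SHA is picked.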

@matthewmturner
Contributor Author

I have this working now.

Would at least one other person be able to give it a try? I run the following from the root directory of arrow-datafusion:

$ docker build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .
# I used --privileged as it's supposed to improve performance for CPU-intensive scripts
$ docker run --privileged db-benchmark

So far I'm not seeing results as good in Docker as I was natively, but I think that's expected, even using --privileged. I haven't had to performance-tune Docker much in the past, though, so if anyone has other ideas I'm open to them.

Right now, this only works with the official datafusion release (version 7). However, I added an option to pull in a local datafusion so that it's easier to benchmark local changes. Even if the Docker performance isn't optimal, it should at least give a baseline for making local changes and seeing if there are improvements, which I think is the intent of this script.

Let me know if you have any questions or see any areas for improvement.

@matthewmturner matthewmturner marked this pull request as ready for review March 6, 2022 04:47
@houqp
Member

houqp commented Mar 9, 2022

@matthewmturner you are on a mac machine right?

@matthewmturner
Contributor Author

@matthewmturner you are on a mac machine right?

Yes, M1 Mac

@houqp
Member

houqp commented Mar 9, 2022

that's expected then. the overhead should go away when it's executed on a linux box.

@bobtins

bobtins commented Mar 11, 2022

Is this specific to M1 Mac? I am running Ubuntu 20.04 on AMD64 and I get this error on docker build:

⚠️  Warning: No compatible platform tag found, using the linux tag instead. You won't be able to upload those wheels to PyPI.
📦 Built wheel for abi3 Python ≥ 3.6 to /datafusion-python/target/wheels/datafusion-0.4.0-cp36-abi3-linux_x86_64.whl
WARNING: Requirement 'target/wheels/datafusion-0.4.0-cp36-abi3-linux_aarch64.whl' looks like a filename, but the file does not exist
ERROR: datafusion-0.4.0-cp36-abi3-linux_aarch64.whl is not a supported wheel on this platform.
The command '/bin/sh -c cd datafusion-python     && maturin build --release     && python3 -m pip install target/wheels/datafusion-0.4.0-cp36-abi3-linux_aarch64.whl     && cd ..' returned a non-zero code: 1

Will try s/aarch64/x86_64/...

@matthewmturner
Contributor Author

@bobtins thanks for trying it out! You'll just need to update the wheel file name in the docker file. Looks like the below is what you'll need.

/datafusion-python/target/wheels/datafusion-0.4.0-cp36-abi3-linux_x86_64.whl
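
One way to avoid editing the wheel file name per machine is to look the wheel up instead of hard-coding it. The helper below is only a sketch of that alternative, not what this PR does: `find_wheel` is a hypothetical name, and it assumes `maturin build` leaves exactly one `datafusion-*.whl` in the wheels directory.

```bash
# Print the path of the single datafusion wheel in a directory.
# Fails if there is no wheel or more than one, rather than guessing.
# NOTE: find_wheel is an illustrative helper, not part of the PR.
find_wheel() {
  dir=$1
  # Expand the glob into this function's positional parameters.
  set -- "$dir"/datafusion-*.whl
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one datafusion wheel in $dir" >&2
    return 1
  fi
  printf '%s\n' "$1"
}
```

The install step could then read `python3 -m pip install "$(find_wheel target/wheels)"`, which works the same on x86_64 and aarch64 hosts as long as the single-wheel assumption holds.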

@bobtins

bobtins commented Mar 11, 2022

Yeah, that got it to work.
I found this for populating env variables automatically, but it requires BuildKit, which means I have to do export DOCKER_BUILDKIT=1 before running the build step.

@matthewmturner
Contributor Author

Nice find - I can look into integrating that

@matthewmturner
Contributor Author

@houqp If no other feedback I think this is good to merge for now.

@bobtins I'm a bit constrained on time right now to work on getting the architectures / wheels integrated properly into the script. Given you've shown interest in the benchmarking area, would you be willing to handle that as a follow-on PR?

@yjshen
Member

yjshen commented Mar 15, 2022

/datafusion-python/target/wheels/datafusion-0.4.0-cp36-abi3-linux_x86_64.whl

Is it possible to adjust the docker file to make it more generic? I hit the same problem here.

After some searching, Buildx might do the trick?

@bobtins

bobtins commented Mar 15, 2022

Almost got it...will post a diff as soon as it works.

@matthewmturner
Contributor Author

@bobtins thanks so much!

Run the following from root `arrow-datafusion` directory

```bash
$ docker build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .

@bobtins bobtins Mar 15, 2022

Suggested change
$ docker build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .
$ docker buildx build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .


@bobtins bobtins left a comment

Got it working with changes suggested.
Looking forward to playing around with this and maybe adding some more features.

# 1. datafusion-python that builds from datafusion version referenced datafusion-python
RUN cd datafusion-python \
&& maturin build --release \
&& python3 -m pip install target/wheels/datafusion-0.4.0-cp36-abi3-linux_aarch64.whl \
Suggested change
&& python3 -m pip install target/wheels/datafusion-0.4.0-cp36-abi3-linux_aarch64.whl \
&& case "${TARGETPLATFORM}" in \
*/amd64) CPUARCH=x86_64 ;; \
*/arm64) CPUARCH=aarch64 ;; \
*) exit 1 ;; \
esac \
&& python3 -m pip install target/wheels/datafusion-0.4.0-cp36-abi3-linux_${CPUARCH}.whl \
&& case "${TARGETPLATFORM}" in \
*/amd64) CPUARCH=x86_64 ;; \
*/arm64) CPUARCH=aarch64 ;; \
*) exit 1 ;; \
esac \
&& python3 -m pip install target/wheels/datafusion-0.4.0-cp36-abi3-linux_${CPUARCH}.whl \

that's weird, didn't mean to put it in twice; haven't used this github feature before

OK, ignore those suggestions...I just did a diff and attached it.

diff.txt
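
For reference, the `case` mapping from the suggestion above can be tried on its own as a small shell function. This is a sketch only: `cpuarch_for_platform` is a hypothetical name, and in the Dockerfile the input would come from `TARGETPLATFORM`, which BuildKit populates when the stage declares `ARG TARGETPLATFORM`.

```bash
# Map a Docker platform string (e.g. "linux/amd64") to the CPU name that
# appears in the wheel file name. Mirrors the case statement suggested in
# the review; the function name itself is illustrative.
cpuarch_for_platform() {
  case "$1" in
    */amd64) echo x86_64 ;;
    */arm64) echo aarch64 ;;
    *) echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}
```

With that mapping, the wheel name becomes `datafusion-0.4.0-cp36-abi3-linux_$(cpuarch_for_platform "$TARGETPLATFORM").whl`, and any platform other than amd64 or arm64 fails the build instead of installing the wrong file.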

@matthewmturner
Contributor Author

@bobtins thank you for the help! I am going to review and integrate tomorrow morning.

@matthewmturner
Contributor Author

@bobtins thank you again for the help. I have updated the PR.

@yjshen can you give it a try now?

Run the following from root `arrow-datafusion` directory

```bash
$ docker buildx -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .
Member

Suggested change
$ docker buildx -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .
$ docker buildx build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile .

Contributor Author

Ah sorry, did that locally but forgot to add it there.

@yjshen yjshen closed this Mar 16, 2022
@yjshen yjshen reopened this Mar 16, 2022
Member

@yjshen yjshen left a comment

LGTM. Thanks @matthewmturner !

@yjshen yjshen merged commit 8b249ae into apache:master Mar 16, 2022
Development

Successfully merging this pull request may close these issues.

Add a script for running full db-benchmark suite
4 participants