
Add performance suggestions to docs #2278

Closed
samuelcolvin opened this issue Apr 6, 2022 · 11 comments

@samuelcolvin
Contributor

samuelcolvin commented Apr 6, 2022

Hi, thanks so much for pyo3 - it's wonderful, and I'm using it a lot (see rtoml and watchfiles; I might be using it in pydantic soon too).

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Examples:

cast_as vs extract

I spent a lot of the morning very worried that PyO3 Rust code making heavy use of Python types was actually slower than Python, until by chance I tried cast_as instead of extract.

// fast
let list: &PyList = py_any.cast_as()?;
// slow
let list: &PyList = py_any.extract()?;

This change in my case made a ~4x improvement to performance.

Optimised compilation

While investigating the above, I noticed that the SO question referenced in #1470 didn't mention that performance is massively affected by debug/release compilation. I've added an answer to the question with suggestions.


These are both no doubt very obvious to the maintainers of this library and to seasoned users, but to those new to Rust and pyo3 they might be less obvious.

I'm sure there are other things which can have a dramatic effect on performance that I'm not aware of - I'd love to hear any other general suggestions.


Update:

Other things to suggest when this section is written:

  • interning strings where they're used a lot; a good example is creating dicts with common keys, where performance can be significantly improved by reusing PyStrings rather than Rust strings (see the sketch below)
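
A minimal sketch of what reusing the key PyStrings might look like (my own illustration, not from the thread; the function name and keys are made up):

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};

// Create the key strings once and reuse them, instead of converting the same
// Rust &str to a fresh Python string for every dict.
fn build_rows<'py>(py: Python<'py>, values: &[(i64, f64)]) -> PyResult<Vec<&'py PyDict>> {
    let id_key = PyString::new(py, "id");
    let score_key = PyString::new(py, "score");

    values
        .iter()
        .map(|(id, score)| {
            let row = PyDict::new(py);
            row.set_item(id_key, *id)?;
            row.set_item(score_key, *score)?;
            Ok(row)
        })
        .collect()
}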
@adamreichold
Member

I spent a lot of the morning very worried that PyO3 Rust code making heavy use of Python types was actually slower than Python, until by chance I tried cast_as instead of extract.

I am surprised by this. Shouldn't both cast_as and extract for a native type end up calling PyTryFrom::try_from? cast_as directly and extract via pyobject_native_type_extract! which should both end up here.

Is it possible to show that performance difference using a self-contained reproducer?

@adamreichold
Member

adamreichold commented Apr 6, 2022

I tried the following benchmark

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..ca4b80bb2 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,30 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

and found no measurable difference:

list_via_cast_as        time:   [1.1823 ns 1.1823 ns 1.1823 ns]                              
list_via_extract        time:   [1.8916 ns 1.8918 ns 1.8920 ns]                              

Could it be that something besides using extract or cast_as changed?

@samuelcolvin
Contributor Author

The code is currently closed source to avoid confusion, but if I can make it work it will end up being open source.

I'll invite you as a collaborator so you can see what I mean; feel free to ignore the invitation if you prefer.

@adamreichold
Member

adamreichold commented Apr 6, 2022

I'll invite you as a collaborator so you can see what I mean; feel free to ignore the invitation if you prefer.

Since the commits always change multiple things (like replacing PyObject and <&PyAny>::extract with just &PyAny, or returning an error instead of raising an exception), I would suggest rolling back the mechanical replacements of extract by cast_as and checking whether it still makes a difference. (I expect it will not.)

EDIT: Just noticed the PR you opened doing exactly that.

@mejrs
Member

mejrs commented Apr 6, 2022

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Regardless of the outcome of this issue I think this is a great idea 👍

@adamreichold
Member

adamreichold commented Apr 6, 2022

So I think the problem is the following: the benchmarked code is basically

if let Ok(list) = any.extract::<&PyList>() { ... } else { ... }

where the else branch is always taken. extract returns PyResult<T> instead of Result<T, PyDowncastError>, which means the conversion must go through From<PyDowncastError> for PyErr, which turns the PyDowncastError into a string here.

The difference is indeed significant, as shown by the benchmark results:

list_via_cast_as        time:   [3.0937 ns 3.2319 ns 3.3556 ns]                            
list_via_extract        time:   [3.0724 ns 3.0798 ns 3.0897 ns]                              
not_a_list_via_cast_as  time:   [3.4965 ns 3.4974 ns 3.4982 ns]           
not_a_list_via_extract  time:   [292.06 ns 292.80 ns 293.61 ns]                                   

given by the diff

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..86d2047df 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,52 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
+fn not_a_list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).cast_as::<PyList>().unwrap_err();
+        });
+    })
+}
+
+fn not_a_list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).extract::<&PyList>().unwrap_err();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
+    c.bench_function("not_a_list_via_cast_as", not_a_list_via_cast_as);
+    c.bench_function("not_a_list_via_extract", not_a_list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

This only affects the Err path, not the Ok path. But then again, if-else-if chains to determine the correct type are probably common in Rust code called from Python (see the sketch below).

EDIT: I also tried replacing err.to_string() with String::new and this is indeed almost all of the slowdown. Only a small part is due to PyTypeError::new_err itself.
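
For illustration, a minimal sketch (my own, not from the thread) of such a type-dispatch chain using cast_as, where each miss stays cheap because cast_as returns a plain PyDowncastError rather than constructing a full PyErr with a formatted message:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyList};

// Dispatch on the concrete Python type. Each failed cast_as only produces a
// PyDowncastError; using extract::<&PyList>() etc. here would construct a
// PyErr (including the stringified downcast error) on every miss.
fn describe(any: &PyAny) -> String {
    if let Ok(list) = any.cast_as::<PyList>() {
        format!("a list of length {}", list.len())
    } else if let Ok(dict) = any.cast_as::<PyDict>() {
        format!("a dict with {} items", dict.len())
    } else {
        format!("something else: {}", any)
    }
}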

@adamreichold
Member

Using something like

#[derive(FromPyObject)]
enum ListOrDictOrAny<'a> {
    List(&'a PyList),
    Dict(&'a PyDict),
    Any(&'a PyAny),
}

also does not help as the generated implementation is probably also an if-else-if chain of calls to extract.
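
Roughly speaking (this is my guess at the shape of the generated code, not the actual macro expansion), the derive presumably produces something along these lines, so every non-matching variant still pays the extract error cost:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyList};

enum ListOrDictOrAny<'a> {
    List(&'a PyList),
    Dict(&'a PyDict),
    Any(&'a PyAny),
}

// Hand-written sketch of the presumed derive output: each variant is tried
// with extract in turn, so every miss constructs (and discards) a PyErr.
impl<'a> FromPyObject<'a> for ListOrDictOrAny<'a> {
    fn extract(obj: &'a PyAny) -> PyResult<Self> {
        if let Ok(list) = obj.extract::<&PyList>() {
            return Ok(ListOrDictOrAny::List(list));
        }
        if let Ok(dict) = obj.extract::<&PyDict>() {
            return Ok(ListOrDictOrAny::Dict(dict));
        }
        Ok(ListOrDictOrAny::Any(obj))
    }
}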

@samuelcolvin
Contributor Author

This is really interesting, thank you so much.

Stuff like this scares me a little; the explanation is beyond my knowledge. If I hadn't happened upon this fix, I would probably have abandoned the entire project, assuming the performance gains I had hoped for weren't possible.

@davidhewitt
Member

Agreed, I've wanted to write some notes along these lines for a long time. I think the main thing holding me off has been lack of time, and I'd kind of been hoping I could make progress in #1308 first (which would reduce a few known PyO3 overheads and likely change a number of performance best practices).

@samuelcolvin
Contributor Author

Another point for the performance section:

as per pydantic/pydantic-core#501,

PyErr::new::<CustomError, _>(([args]))

is significantly slower than

let custom_error = CustomError::new([args]);
match Py::new(py, custom_error) {
    Ok(err) => PyErr::from_value(err.into_ref(py)),
    Err(err) => err,
}

For some errors, at least - I think because PyErr::new converts the args to Python objects and back again.

That change (found by chance) improved some of our benchmarks by 30%.

@davidhewitt
Member

Superseded by #3310
