
Add performance suggestions to docs #2278

Closed
samuelcolvin opened this issue Apr 6, 2022 · 11 comments

@samuelcolvin
Contributor

samuelcolvin commented Apr 6, 2022

Hi, thanks so much for pyo3 - it's wonderful, and I'm using it a lot (see rtoml and watchfiles; I might be using it in pydantic soon too).

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Examples:

cast_as vs extract

I spent a lot of the morning very worried that PyO3 Rust code making heavy use of Python types was actually slower than Python, until by chance I tried cast_as instead of extract.

// fast
let list: &PyList = py_any.cast_as()?;
// slow
let list: &PyList = py_any.extract()?;

This change in my case made a ~4x improvement to performance.

Optimised compilation

While investigating the above, I noticed that the SO question referenced in #1470 didn't mention that performance is massively affected by debug/release compilation. I've added an answer to the question with suggestions.


These are both no doubt very obvious to the maintainers of this library and to seasoned users, but to those new to Rust and pyo3 they might be less obvious.

I'm sure there are other things which can have a dramatic effect on performance that I'm not aware of - I'd love to hear any other general suggestions.


Update:

Other things to suggest when this section is written:

  • interning strings where they're used a lot; a good example is creating dicts with common keys, where performance can be significantly improved by reusing PyStrings rather than Rust strings (see the sketch below)
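
A minimal sketch of what reusing the key PyStrings might look like (my own illustration, not from the thread; the function name and keys are made up):

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};

// Create the key strings once and reuse them, instead of converting the same
// Rust &str to a fresh Python string for every dict.
fn build_rows<'py>(py: Python<'py>, values: &[(i64, f64)]) -> PyResult<Vec<&'py PyDict>> {
    let id_key = PyString::new(py, "id");
    let score_key = PyString::new(py, "score");

    values
        .iter()
        .map(|(id, score)| {
            let row = PyDict::new(py);
            row.set_item(id_key, *id)?;
            row.set_item(score_key, *score)?;
            Ok(row)
        })
        .collect()
}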
@adamreichold
Member

I spent a lot of the morning very worried that PyO3 Rust code making heavy use of Python types was actually slower than Python, until by chance I tried cast_as instead of extract.

I am surprised by this. Shouldn't both cast_as and extract for a native type end up calling PyTryFrom::try_from? cast_as directly and extract via pyobject_native_type_extract! which should both end up here.

Is it possible to show that performance difference using a self-contained reproducer?

@adamreichold
Member

adamreichold commented Apr 6, 2022

I tried the following benchmark

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..ca4b80bb2 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,30 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

and found no measurable difference:

list_via_cast_as        time:   [1.1823 ns 1.1823 ns 1.1823 ns]                              
list_via_extract        time:   [1.8916 ns 1.8918 ns 1.8920 ns]                              

Could it be that something besides using extract or cast_as changed?

@samuelcolvin
Contributor Author

The code is currently closed source to avoid confusion, but if I can make it work it will end up being open source.

I'll invite you as a collaborator so you can see what I mean; feel free to ignore the invitation if you prefer.

@adamreichold
Member

adamreichold commented Apr 6, 2022

I'll invite you as a collaborator so you can see what I mean; feel free to ignore the invitation if you prefer.

Since the commits always change multiple things (like replacing PyObject and <&PyAny>::extract with just &PyAny, or returning an error instead of raising an exception), I would suggest rolling back the mechanical replacements of extract by cast_as and checking whether it still makes a difference. (I expect it will not.)

EDIT: Just noticed the PR you opened doing exactly that.

@mejrs
Member

mejrs commented Apr 6, 2022

I wonder if it would be worth adding a "Performance Suggestions" or similar section to the docs, perhaps the FAQs?

Regardless of the outcome of this issue I think this is a great idea 👍

@adamreichold
Member

adamreichold commented Apr 6, 2022

So I think the problem is the following: the benchmarked code is basically

if let Ok(list) = any.extract::<&PyList>() { ... } else { ... }

where the else branch is always taken. extract returns PyResult<T> instead of Result<T, PyDowncastError>, which means the conversion must go through From<PyDowncastError> for PyErr, which turns the PyDowncastError into a string here.

The difference is indeed significant, as shown by the benchmark results:

list_via_cast_as        time:   [3.0937 ns 3.2319 ns 3.3556 ns]                            
list_via_extract        time:   [3.0724 ns 3.0798 ns 3.0897 ns]                              
not_a_list_via_cast_as  time:   [3.4965 ns 3.4974 ns 3.4982 ns]           
not_a_list_via_extract  time:   [292.06 ns 292.80 ns 293.61 ns]                                   

given by the diff

diff --git a/benches/bench_frompyobject.rs b/benches/bench_frompyobject.rs
index 8cf8a4da7..86d2047df 100644
--- a/benches/bench_frompyobject.rs
+++ b/benches/bench_frompyobject.rs
@@ -1,6 +1,6 @@
-use criterion::{criterion_group, criterion_main, Bencher, Criterion};
+use criterion::{black_box, criterion_group, criterion_main, Bencher, Criterion};
 
-use pyo3::{prelude::*, types::PyString};
+use pyo3::{prelude::*, types::{PyList, PyString}};
 
 #[derive(FromPyObject)]
 enum ManyTypes {
@@ -18,8 +18,52 @@ fn enum_from_pyobject(b: &mut Bencher<'_>) {
     })
 }
 
+fn list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).cast_as().unwrap();
+        });
+    })
+}
+
+fn list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyList::empty(py).into();
+
+        b.iter(|| {
+            let _list: &PyList = black_box(any).extract().unwrap();
+        });
+    })
+}
+
+fn not_a_list_via_cast_as(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).cast_as::<PyList>().unwrap_err();
+        });
+    })
+}
+
+fn not_a_list_via_extract(b: &mut Bencher<'_>) {
+    Python::with_gil(|py| {
+        let any: &PyAny = PyString::new(py, "foobar").into();
+
+        b.iter(|| {
+            black_box(any).extract::<&PyList>().unwrap_err();
+        });
+    })
+}
+
 fn criterion_benchmark(c: &mut Criterion) {
     c.bench_function("enum_from_pyobject", enum_from_pyobject);
+    c.bench_function("list_via_cast_as", list_via_cast_as);
+    c.bench_function("list_via_extract", list_via_extract);
+    c.bench_function("not_a_list_via_cast_as", not_a_list_via_cast_as);
+    c.bench_function("not_a_list_via_extract", not_a_list_via_extract);
 }
 
 criterion_group!(benches, criterion_benchmark);

This only affects the Err path, not the Ok path. But then again, if-else-if chains to determine the correct type are probably common in Rust code called from Python (see the sketch below).

EDIT: I also tried replacing err.to_string() with String::new and this is indeed almost all of the slowdown. Only a small part is due to PyTypeError::new_err itself.
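
For illustration, a minimal sketch (my own, not from the thread) of such a type-dispatch chain using cast_as, where each miss stays cheap because cast_as returns a plain PyDowncastError rather than constructing a full PyErr with a formatted message:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyList};

// Dispatch on the concrete Python type. Each failed cast_as only produces a
// PyDowncastError; using extract::<&PyList>() etc. here would construct a
// PyErr (including the stringified downcast error) on every miss.
fn describe(any: &PyAny) -> String {
    if let Ok(list) = any.cast_as::<PyList>() {
        format!("a list of length {}", list.len())
    } else if let Ok(dict) = any.cast_as::<PyDict>() {
        format!("a dict with {} items", dict.len())
    } else {
        format!("something else: {}", any)
    }
}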

@adamreichold
Member

Using something like

#[derive(FromPyObject)]
enum ListOrDictOrAny<'a> {
    List(&'a PyList),
    Dict(&'a PyDict),
    Any(&'a PyAny),
}

also does not help as the generated implementation is probably also an if-else-if chain of calls to extract.
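
Roughly speaking (this is my guess at the shape of the generated code, not the actual macro expansion), the derive presumably produces something along these lines, so every non-matching variant still pays the extract error cost:

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyList};

enum ListOrDictOrAny<'a> {
    List(&'a PyList),
    Dict(&'a PyDict),
    Any(&'a PyAny),
}

// Hand-written sketch of the presumed derive output: each variant is tried
// with extract in turn, so every miss constructs (and discards) a PyErr.
impl<'a> FromPyObject<'a> for ListOrDictOrAny<'a> {
    fn extract(obj: &'a PyAny) -> PyResult<Self> {
        if let Ok(list) = obj.extract::<&PyList>() {
            return Ok(ListOrDictOrAny::List(list));
        }
        if let Ok(dict) = obj.extract::<&PyDict>() {
            return Ok(ListOrDictOrAny::Dict(dict));
        }
        Ok(ListOrDictOrAny::Any(obj))
    }
}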

@samuelcolvin
Contributor Author

This is really interesting, thank you so much.

Stuff like this scares me a little; the explanation is beyond my knowledge. If I hadn't happened upon this fix, I would probably have abandoned the entire project, assuming the performance gains I had hoped for weren't possible.

@davidhewitt
Member

Agreed, I've wanted to write some notes along these lines for a long time. I think the main thing holding me off has been lack of time, and I'd kind of been hoping I could make progress in #1308 first (which would reduce a few known PyO3 overheads and likely change a number of performance best practices).

@samuelcolvin
Contributor Author

Another point for the performance section:

as per pydantic/pydantic-core#501,

PyErr::new::<CustomError, _>(([args]))

is significantly slower than

let custom_error = CustomError::new([args]);
match Py::new(py, custom_error) {
    Ok(err) => PyErr::from_value(err.into_ref(py)),
    Err(err) => err,
}

For some errors, at least - I think because PyErr::new converts the args to Python objects and back again.

That change (found by chance) improved some of our benchmarks by 30%.

@davidhewitt
Member

Superseded by #3310
