add PyString::chars #2451

robinvd · 2022-06-12T16:02:53Z

A helper method that iterates over a PyUnicode string without allocating anything. The AsUTF8 methods will allocate and cache the UTF-8 str on the Python heap.

Please consider adding the following to your pull request:

an entry in CHANGELOG.md
docs to all new functions and / or detail in the guide
tests for all new or changed functions

adamreichold · 2022-06-13T14:19:19Z

src/types/string.rs

+    pub fn chars(&self) -> impl ExactSizeIterator<Item = PyResult<char>> + '_ {
+        unsafe {
+            let len = ffi::PyUnicode_GetLength(self.as_ptr());
+            (0..len).map(move |i| {


Since the implementation is based on indexing, would it make sense to expose this in the interface as well? Something like char_at(&self, index: usize) -> Option<char>, so that the iterator can be produced on the outside?
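To illustrate the suggestion, here is a minimal sketch in plain Rust, without PyO3. `CodePoints` is a hypothetical stand-in for a Python string, not a PyO3 type, and its `char_at` returns `Option<char>` as in the signature proposed above rather than the `PyResult<char>` used in the diff. It only shows how an index-based accessor lets callers build the iterator themselves.

```rust
/// Hypothetical stand-in for a Python string: a buffer of code points.
/// This is NOT PyO3's `PyString`; it only models index-based access.
struct CodePoints(Vec<u32>);

impl CodePoints {
    /// Number of code points in the buffer.
    fn len(&self) -> usize {
        self.0.len()
    }

    /// Returns `None` if `index` is out of bounds or the code point is not
    /// a valid Rust `char` (Python strings may contain lone surrogates).
    fn char_at(&self, index: usize) -> Option<char> {
        self.0.get(index).copied().and_then(char::from_u32)
    }

    /// The iterator from the diff, rebuilt on top of `char_at`.
    /// A mapped `Range<usize>` still implements `ExactSizeIterator`.
    fn chars(&self) -> impl ExactSizeIterator<Item = Option<char>> + '_ {
        (0..self.len()).map(move |i| self.char_at(i))
    }
}
```

With such an accessor, a caller could equally write `(0..s.len()).map(|i| s.char_at(i))` on their side and skip the `chars` method entirely.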

mejrs · 2022-06-14T10:37:10Z

Thanks for the PR :)

> The AsUTF8 methods will allocate and cache the utf8 str on the python heap.

It won't if the string is ASCII, which most strings are.
Calling PyUnicode_AsUTF8/PyUnicode_AsUTF8AndSize caches the UTF-8 representation (as you mention), so any subsequent calls do not allocate (note that this is not true for PyString::to_str on its slower path).

I've benchmarked this method and found that it is generally around twice as slow on ASCII strings as to_str().chars(). It is only faster if the string is long and not ASCII, which is not representative of most strings. For example, https://peps.python.org/pep-0393/ mentions: "out of 36,000 strings (with 1,310,000 chars), 35713 where ASCII strings". I suspect these figures are similar for other applications.
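For context on why ASCII strings are the cheap case: PEP 393 stores each string with 1, 2, or 4 bytes per code point, chosen by the widest code point it contains, and pure ASCII data is already valid UTF-8. The sketch below models that choice in plain Rust. The function names are made up for illustration, and it operates on a Rust `&str` rather than on a real Python object.

```rust
/// Bytes per code point that the PEP 393 layout would pick for `s`:
/// 1 for code points up to U+00FF, 2 up to U+FFFF, 4 otherwise.
/// (Hypothetical helper; it inspects a Rust `&str`, not a Python object.)
fn pep393_width(s: &str) -> usize {
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    match max {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Only pure ASCII data is byte-for-byte identical to its UTF-8 encoding,
/// so only in that case can the stored bytes double as the UTF-8 form.
fn stored_bytes_are_utf8(s: &str) -> bool {
    s.is_ascii()
}
```

Note that a 1-byte (Latin-1) string such as "héllo" is not ASCII, so it still needs a separate UTF-8 buffer.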

robinvd · 2022-06-15T08:29:25Z

In my benchmarks this is slightly faster, but my use case is quite specific: I have to modify some Unicode chars, so any ASCII data just gets returned as is. All strings passed in are newly allocated and thus have no UTF-8 string cached.

As it is slower in the general case, maybe it's best to close this.

mejrs · 2022-06-15T09:14:50Z

I can see a use case for a method that forwards to to_str().chars() on ASCII strings and falls back to your method otherwise. The trouble with this approach is that there's no good way to check which representation a string has: that information is stored in a C bitfield 😭. (See also the documentation of https://pyo3.rs/internal/doc/pyo3/types/struct.PyString.html#method.data)
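A rough sketch of that dispatch in plain Rust, assuming the representation were already known. `StringData` here is a local, hypothetical enum loosely modeled on the three storage widths; it is not PyO3's type, and replacing lone surrogates with U+FFFD is a choice made for this sketch, not something the PR does.

```rust
/// Hypothetical model of the three storage widths (not a PyO3 type).
enum StringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

/// Calls `f` for every char without allocating.
/// ASCII data takes the `str::chars` fast path; everything else is
/// decoded one code unit at a time.
fn for_each_char(data: StringData<'_>, mut f: impl FnMut(char)) {
    // Assumption of this sketch: invalid code points become U+FFFD.
    let decode = |u: u32| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER);
    match data {
        StringData::Ucs1(bytes) if bytes.is_ascii() => {
            // ASCII bytes are already valid UTF-8, so no conversion is needed.
            let s = std::str::from_utf8(bytes).expect("ASCII is valid UTF-8");
            s.chars().for_each(f);
        }
        // Non-ASCII 1-byte data is Latin-1: each byte is one code point.
        StringData::Ucs1(bytes) => bytes.iter().for_each(|&b| f(char::from(b))),
        StringData::Ucs2(units) => units.iter().for_each(|&u| f(decode(u32::from(u)))),
        StringData::Ucs4(units) => units.iter().for_each(|&u| f(decode(u))),
    }
}
```

The hard part discussed here, finding out which variant a given Python string actually uses, is exactly what this sketch takes for granted.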

robinvd · 2022-06-15T13:34:38Z

Ah yes, the way I check for ASCII is indeed using the data method.

What I don't completely understand is why the method is unsafe. The C bitfield is decoded using functions in Python, and I would (maybe wrongly) assume that all functions provided by Python are safe and cross-platform. The docs don't seem to mention anything: https://docs.python.org/3/c-api/unicode.html

mejrs · 2022-06-20T23:14:11Z

Unfortunately the "function" for checking this (PyUnicode_KIND) isn't a function, it's a macro. See #1824 (comment) for a discussion on this.

Robin added 2 commits June 13, 2022 13:58

add PyString.iter method

2943330

add changlog entry

e5af86e

robinvd force-pushed the main branch from 53b4bf3 to e5af86e Compare June 13, 2022 11:59

Robin added 2 commits June 13, 2022 14:11

remove needless lifetime

b546a3e

format string.rs

b8c2ef3

adamreichold reviewed Jun 13, 2022


robinvd closed this Jul 28, 2022
