Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add PyString::chars #2451

Closed
wants to merge 4 commits into from
Closed

add PyString::chars #2451

wants to merge 4 commits into from

Conversation

robinvd
Copy link

@robinvd robinvd commented Jun 12, 2022

Useful helper method to avoid allocating anything while iterating over a PyUnicode string. The AsUTF8 methods will allocate and cache the utf8 str on the python heap.

Please consider adding the following to your pull request:

  • an entry in CHANGELOG.md
  • docs to all new functions and / or detail in the guide
  • tests for all new or changed functions

pub fn chars(&self) -> impl ExactSizeIterator<Item = PyResult<char>> + '_ {
unsafe {
let len = ffi::PyUnicode_GetLength(self.as_ptr());
(0..len).map(move |i| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the implementation is based on indexing, would it make sense to expose this a the interface as well? Something like char_at(&self, index: usize) -> Option<char> so that the iterator can be produced on the outside?

@mejrs
Copy link
Member

mejrs commented Jun 14, 2022

Thanks for the PR :)

The AsUTF8 methods will allocate and cache the utf8 str on the python heap.

  • it won't if the string is ascii, which most strings are.
  • calling PyUnicode_AsUTF8/PyUnicode_AsUTF8AndSize caches the utf8 representation (as you mention), so any subsequent calls do not allocate (note that this is not true for PyString::to_str on its slower path).

I've benchmarked this method and found that it generally is around twice as slow on ascii strings compared to to_str().chars(). It is only faster if the string is not ascii and long, which is not representative of most strings. For example, https://peps.python.org/pep-0393/ mentions: "out of 36,000 strings (with 1,310,000 chars), 35713 where ASCII strings". I suspect these figures are similar for other applications.

@robinvd
Copy link
Author

robinvd commented Jun 15, 2022

In my benchmarks this is slightly faster. But my usecase is quite specific. I have to modify some unicode chars. So any ascii data just gets returned as is. And all strings passed in are newly allocated and thus have no utf8 string cached.

As it is slower in the general case. Maybe its best to close this.

@mejrs
Copy link
Member

mejrs commented Jun 15, 2022

I see a use case where it forwards to to_str().chars() on ascii strings, but falls back to your method if it's not. The trouble with this approach is that there's no good way to check which representation it has - that information is stored in a C bitfield 😭 . (See also the documentation of https://pyo3.rs/internal/doc/pyo3/types/struct.PyString.html#method.data)

@robinvd
Copy link
Author

robinvd commented Jun 15, 2022

Ah yes the way i check for ascii is indeed using the data method.

What i dont completely understand is why the method is unsafe. The c bitfield is decoded using functions in python, and i would (maybe wrongly) assume that all functions provided by python are safe and cross platform. The docs dont seem to mentions anything https://docs.python.org/3/c-api/unicode.html

@mejrs
Copy link
Member

mejrs commented Jun 20, 2022

Unfortunately the "function" for checking this (PyUnicode_KIND) isn't a function, it's a macro. See #1824 (comment) for a discussion on this.

@robinvd robinvd closed this Jul 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants