-
Notifications
You must be signed in to change notification settings - Fork 792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ffi: define some cpython/unicodeobject bindings #1777
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. It's a particularly gnarly one; the representation of unicode objects is one of the more aggressively optimized parts of the Python interpreter. On top of that, any work here needs to be aware of PEP 623 (affecting Python 3.12).
As-is, it looks to me that the PyASCIIObjectState
struct is incorrect. I'm not entirely sure what the correct solution is. (See the review comment below.)
Because there's some very complex inline macros going on here, I would like to either see tests within this ffi module, or a thoroughly-tested safe API for #1776. I'm particuarly worried there's OS, arch and Python-version edge cases for this low-level code. For example this'll presumably crash on Python 3.12, when that comes along, so we'll need to catch that with tests.
symbols currently defined in
src/ffi/unicodeobject.rs
have been annotated withmove
.
I understand not wanting to edit the existing code, however I'd much rather that these were moved as part of this PR, perhaps as a standalone commit.
src/ffi/cpython/unicodeobject.rs
Outdated
// This type is defined inline in CPython source. | ||
#[repr(C)] | ||
pub struct PyASCIIObjectState { | ||
// SSTATE_* constants. | ||
pub interned: c_uint, | ||
// PyUnicode_*_KIND constants. | ||
pub kind: c_uint, | ||
pub compact: c_uint, | ||
pub ascii: c_uint, | ||
pub ready: c_uint, | ||
pub _a: c_uint, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This struct looks wrong; the upstream CPython source implements this state as a 32-bit bit field.
It's not possible to have an equivalent bitfield definition in Rust (e.g. see rust-lang/rfcs#314)
Looks like the cpython
crate defines the state
member as a single u32
field and uses bitwise operators to read values at various offsets. That seems to make sense, but as far as I'm aware the bitfield order isn't actually well-defined (https://stackoverflow.com/a/19376742) so that seems like it's actually a bug. I guess it'll work on some platforms and be UB on others?
I think some more research is needed on the correct way to handle this.
I was just typing up a comment talking about the limitations of this PR and your review beat me to it :) Yes, the struct definition is wrong. And the macros could use some tests. I'll address your review feedback. Also, I noticed that PyO3 is using soon-to-be-deprecated Python Unicode APIs. I'm unsure how much you care about that. If I proceed with #1776, I may have a go at flushing things out and improving the string handling story. Time will tell. |
Oh interesting, I'd mistakenly thought #1370 flushed them out! Yes, I'd really appreciate any help moving off deprecated APIs. |
3387187
to
ce91e47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the bitfield definition with this version.
I've added accessor functions for resolving each logical component of the bitfield to a c_uint
. The macros/functions now use these accessor functions. I figured it was clearer this way rather than doing bitwise manipulation inline in each macro function.
Tests over bitfield APIs and macros have been added. The behavior asserted in the ASCII and UCS-4 tests seem reasonable to me. Although I wouldn't be surprised if the assertions are non-portable across Python implementations.
A 2nd commit has been introduced to move limited API symbols to the appropriate file.
👍 thanks, I'll give this a proper review when I get a chance, got a busy few days ahead. Would be great to have it rebased once #1780 is merged so we can see whether the bitfield order is indeed stable on all platforms tested on CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the CI is passing on all platforms, so I'm reasonably convinced this is a safe way to handle the bitfield 👍
I've added some tweaks for PyPy and Python 3.10 below, I'll add them as a commit.
Looks like I broke this - I'll fixup and merge later, sorry! |
pyo3 doesn't currently define various Unicode bindings that allow the retrieval of raw data from Python strings. Said bindings are a prerequisite to possibly exposing this data in the Rust API (PyO3#1776). Even if those high-level APIs never materialize, the FFI bindings are necessary to enable consumers of the raw C API to utilize them. This commit partially defines the FFI bindings as defined in CPython's Include/cpython/unicodeobject.h file. I used the latest CPython 3.9 Git commit for defining the order of the symbols and the implementation of various inline preprocessor macros. I tried to be as faithful as possible to the original implementation, preserving intermediate `#define`s as inline functions. Missing symbols have been annotated with `skipped` and symbols currently defined in `src/ffi/unicodeobject.rs` have been annotated with `move`. The `state` field of `PyASCIIObject` is a bitfield, which Rust doesn't support. So we've provided accessor functions for retrieving these fields' values. No accessor functions are present because you shouldn't be touching these values from Rust code. Tests of the bitfield APIs and macro implementations have been added.
All symbols which are canonically defined in cpython/unicodeobject.h and had been defined in our unicodeobject.rs have been moved to our corresponding cpython/unicodeobject.rs file. This module is only present in non-limited build configurations, so we were able to drop the cfg annotations as part of moving the definitions.
Twas a missing |
Rust didn't seem to honor |
Ugh, something doesn't like |
This API is available in all supported Python versions and isn't deprecated.
Thanks! |
pyo3 doesn't currently define various Unicode bindings that allow the
retrieval of raw data from Python strings. Said bindings are a
prerequisite to possibly exposing this data in the Rust API (#1776).
Even if those high-level APIs never materialize, the FFI bindings are
necessary to enable consumers of the raw C API to utilize them.
This commit partially defines the FFI bindings as defined in
CPython's Include/cpython/unicodeobject.h file.
I used the latest CPython 3.9 Git commit for defining the order
of the symbols and the implementation of various inline preprocessor
macros. I tried to be as faithful as possible to the original
implementation, preserving intermediate
#define
s as inline functions.The structs are a bit wonky and probably warrant the most review
scrutiny. I haven't tested this code thoroughly.
Missing symbols have been annotated with
skipped
and symbols currentlydefined in
src/ffi/unicodeobject.rs
have been annotated withmove
.