gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672

Closed
wants to merge 1 commit into from

Conversation

vstinner
Member

@vstinner vstinner commented Nov 3, 2023

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.


📚 Documentation preview 📚: https://cpython-previews--111672.org.readthedocs.build/

Member Author

vstinner commented Nov 3, 2023

@serhiy-storchaka suggested in private that if PyUnicode_AsUTF8(str) raises an exception on an embedded null character, PyUnicode_AsUTF8AndSize(str, NULL) should raise as well. So I wrote this draft PR to implement this idea.

The change adds PyUnicode_AsUTF8Unsafe() (name open for bikeshedding) which is like PyUnicode_AsUTF8() but doesn't reject null characters.
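
To illustrate why an embedded null character matters for callers that do not receive the size, here is a minimal standalone sketch in plain C. It does not call the CPython API: `has_embedded_null()` is a hypothetical helper, and it only assumes the buffer contract described in the documentation (a null byte is appended after the `size` bytes of data).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of CPython.
 * Returns 1 if the buffer `buf` holding `size` bytes of data contains a
 * null byte before its end, and 0 otherwise.
 * Assumption: buf[size] == '\0', as for the buffer that
 * PyUnicode_AsUTF8AndSize() is documented to return. */
static int
has_embedded_null(const char *buf, size_t size)
{
    assert(buf != NULL);
    /* strlen() stops at the first null byte, so a length shorter than
     * `size` means that a null byte lies inside the data. */
    return strlen(buf) != size;
}
```

A caller that only has the `const char *` pointer sees the string as ending at the first null byte, which is the truncation that rejecting embedded null characters is meant to prevent.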

Member Author

vstinner commented Nov 3, 2023

Apparently, there is a disagreement about the PyUnicode_AsUTF8() change which rejects null characters: #111091 (comment)

Member

@serhiy-storchaka serhiy-storchaka left a comment

It is more consistent with PyUnicode_AsWideCharString() and PyBytes_AsStringAndSize().

In general LGTM (besides some nitpicks), but I would wait until the ongoing discussion has finished.

An alternative is to restore the previous PyUnicode_AsUTF8() behavior and introduce PyUnicode_AsUTF8Safe(). PyUnicode_AsUTF8() could then be removed from the Limited C API and deprecated, as was initially planned.

@@ -971,6 +971,12 @@ These are the UTF-8 codec APIs:
returned buffer always has an extra null byte appended (not included in
*size*), regardless of whether there are any other null code points.

If *size* is NULL and the *unicode* string contains embedded null

The wording differs from the one used for PyUnicode_AsWideCharString(). It would be better to use the same wording for the same behavior, so that users do not have to search for nonexistent differences.

If *size* is NULL and the *unicode* string contains embedded null
characters, raise an exception. To accept embedded null characters and
truncate on purpose at the first null byte, :c:func:`PyUnicode_AsUTF8Unsafe`
and :c:func:`PyUnicode_AsUTF8AndSize(unicode, &size)

This is a self-reference. It is unlikely to be useful.

Comment on lines +1003 to +1004
Similar to :c:func:`PyUnicode_AsUTF8AndSize(unicode, NULL)
<PyUnicode_AsUTF8AndSize>`, but does not store the size.

PyUnicode_AsUTF8AndSize(unicode, NULL) does not store the size either.

Maybe just say that it is equivalent to PyUnicode_AsUTF8AndSize(unicode, NULL)? And no more explanations will be needed.

Comment on lines +458 to +460
#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030D0000
PyAPI_FUNC(const char*) PyUnicode_AsUTF8Unsafe(PyObject *unicode);
#endif

Maybe do not add it to the Limited C API? PyUnicode_AsUTF8() was not in the Limited C API before 3.13.
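
For readers unfamiliar with the guard in the diff above: `0x030D0000` follows the `PY_VERSION_HEX` layout, where the top byte is the major version and the next byte is the minor version, so it stands for 3.13. A small sketch of the decoding, where `hex_major()` and `hex_minor()` are hypothetical helpers for illustration and not CPython macros:

```c
#include <assert.h>  /* only needed for checks such as those in the note below */

/* Hypothetical helpers, not part of CPython.
 * They decode a value that uses the PY_VERSION_HEX layout:
 * bits 24-31 hold the major version, bits 16-23 the minor version. */
static unsigned int
hex_major(unsigned long hexversion)
{
    return (unsigned int)((hexversion >> 24) & 0xFF);
}

static unsigned int
hex_minor(unsigned long hexversion)
{
    return (unsigned int)((hexversion >> 16) & 0xFF);
}
```

With these helpers, `hex_major(0x030D0000UL)` is 3 and `hex_minor(0x030D0000UL)` is 13, so the `#if` condition exposes the declaration only when the requested Limited C API version is 3.13 or newer (or when `Py_LIMITED_API` is not defined at all).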

@@ -451,7 +451,13 @@ PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
// This function caches the UTF-8 encoded string in the Unicode object
// and subsequent calls will return the same string. The memory is released
// when the Unicode object is deallocated.
-PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);
+PyAPI_FUNC(const char*) PyUnicode_AsUTF8(PyObject *unicode);

BTW, this function should only be available in the Limited C API starting with version 3.13.

Member Author

vstinner commented Nov 3, 2023

I abandon this PR in favor of the opposite approach: add PyUnicode_AsUTF8Safe(), PR #111688.

@vstinner vstinner closed this Nov 3, 2023
@vstinner vstinner deleted the asutf8_unsafe branch November 3, 2023 11:08