gh-111089: Add PyUnicode_AsUTF8Unsafe() function #111672

vstinner · 2023-11-03T01:10:22Z

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.

Issue: [C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

📚 Documentation preview 📚: https://cpython-previews--111672.org.readthedocs.build/

vstinner · 2023-11-03T01:13:28Z

@serhiy-storchaka suggested in private that if PyUnicode_AsUTF8(str) raises an exception on embedded null characters, PyUnicode_AsUTF8AndSize(str, NULL) should raise as well. So I wrote this draft PR to implement that idea.

The change adds PyUnicode_AsUTF8Unsafe() (name open for bikeshedding) which is like PyUnicode_AsUTF8() but doesn't reject null characters.
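To make the proposed contract concrete, here is a minimal sketch in plain C. It does not use the CPython API: `demo_utf8`, `demo_as_utf8_and_size()` and `demo_as_utf8_unsafe()` are hypothetical stand-ins that only model the behavior described above, and the real functions would set a Python exception rather than merely return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cached UTF-8 buffer of a str object. */
typedef struct {
    const char *data;   /* UTF-8 bytes, followed by a trailing '\0' */
    size_t size;        /* byte count, excluding the trailing '\0' */
} demo_utf8;

/* Return 1 if a null byte appears before the end of the buffer. */
static int demo_has_embedded_null(const demo_utf8 *s)
{
    return memchr(s->data, '\0', s->size) != NULL;
}

/* Models the proposed PyUnicode_AsUTF8AndSize() behavior:
   - size != NULL: always succeed and report the full length;
   - size == NULL: fail if the buffer contains an embedded null. */
static const char *demo_as_utf8_and_size(const demo_utf8 *s, size_t *size)
{
    assert(s != NULL && s->data != NULL);
    if (size != NULL) {
        *size = s->size;
        return s->data;
    }
    if (demo_has_embedded_null(s)) {
        return NULL;    /* the real function would also set an exception */
    }
    return s->data;
}

/* Models the proposed "unsafe" variant: it never rejects the input, so a
   caller that treats the result as a C string sees it truncated. */
static const char *demo_as_utf8_unsafe(const demo_utf8 *s)
{
    return s->data;
}
```

With a buffer such as `{"ab\0cd", 5}`, the size-less call fails, the call with a size pointer reports 5 bytes, and `strlen()` on the "unsafe" result stops at the first null byte.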

vstinner · 2023-11-03T01:14:44Z

Apparently, there is a disagreement about the PyUnicode_AsUTF8() change which rejects null characters: #111091 (comment)

serhiy-storchaka

It is more consistent with PyUnicode_AsWideCharString() and PyBytes_AsStringAndSize().

In general LGTM (besides some nitpicks), but I would wait until the ongoing discussion has been finished.

An alternative is to restore the PyUnicode_AsUTF8() behavior and introduce PyUnicode_AsUTF8Safe(). Then PyUnicode_AsUTF8() can be removed from the Limited C API and deprecated as it was initially planned.

serhiy-storchaka · 2023-11-03T08:02:03Z

Doc/c-api/unicode.rst

@@ -971,6 +971,12 @@ These are the UTF-8 codec APIs:
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

+   If *size* is NULL and the *unicode* string contains embedded null


The wording differs from the one for PyUnicode_AsWideCharString(). It would be better to use the same wording for the same behavior, so that users do not need to search for nonexistent differences.

serhiy-storchaka · 2023-11-03T08:02:54Z

Doc/c-api/unicode.rst

+   If *size* is NULL and the *unicode* string contains embedded null
+   characters, raise an exception. To accept embedded null characters and
+   truncate on purpose at the first null byte, :c:func:`PyUnicode_AsUTF8Unsafe`
+   and :c:func:`PyUnicode_AsUTF8AndSize(unicode, &size)


This is a self-reference. It is unlikely to be useful.

serhiy-storchaka · 2023-11-03T08:06:41Z

Doc/c-api/unicode.rst

+   Similar to :c:func:`PyUnicode_AsUTF8AndSize(unicode, NULL)
+   <PyUnicode_AsUTF8AndSize>`, but does not store the size.


PyUnicode_AsUTF8AndSize(unicode, NULL) does not store the size either.

Maybe just say that it is equivalent to PyUnicode_AsUTF8AndSize(unicode, NULL)? And no more explanations will be needed.

serhiy-storchaka · 2023-11-03T08:09:17Z

Include/unicodeobject.h

+#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x030D0000
+PyAPI_FUNC(const char*) PyUnicode_AsUTF8Unsafe(PyObject *unicode);
+#endif


Maybe not add it to the Limited C API? PyUnicode_AsUTF8() was not in the Limited C API before 3.13.
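For readers unfamiliar with the guard in the diff above, the following sketch shows how such a condition evaluates. `DEMO_LIMITED_API`, `DEMO_HAS_UNSAFE`, `demo_major()` and `demo_minor()` are hypothetical names used only for illustration; the sketch assumes the usual version-hex layout in which the top byte is the major version and the next byte is the minor version, so 0x030D0000 corresponds to 3.13.

```c
/* Hypothetical stand-in for Py_LIMITED_API: this build targets 3.12. */
#define DEMO_LIMITED_API 0x030C0000

/* Same shape as the guard in the diff. The "+0" keeps the expression valid
   when the macro is defined with an empty value (it then evaluates to 0). */
#if !defined(DEMO_LIMITED_API) || DEMO_LIMITED_API+0 >= 0x030D0000
#  define DEMO_HAS_UNSAFE 1   /* declaration would be visible */
#else
#  define DEMO_HAS_UNSAFE 0   /* declaration would be hidden */
#endif

/* Decode the major and minor fields of a version hex value. */
static unsigned demo_major(unsigned long hex)
{
    return (unsigned)((hex >> 24) & 0xFF);
}

static unsigned demo_minor(unsigned long hex)
{
    return (unsigned)((hex >> 16) & 0xFF);
}
```

Under these assumptions, a build that requests the 3.12 limited API would not see the new declaration, while a build that leaves the macro undefined or targets 3.13 or later would.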

serhiy-storchaka · 2023-11-03T08:13:36Z

Include/unicodeobject.h

@@ -451,7 +451,13 @@ PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
 // This function caches the UTF-8 encoded string in the Unicode object
 // and subsequent calls will return the same string. The memory is released
 // when the Unicode object is deallocated.
-PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);
+PyAPI_FUNC(const char*) PyUnicode_AsUTF8(PyObject *unicode);


BTW, this function should only be available in the Limited C API 3.13.

vstinner · 2023-11-03T11:08:20Z

I abandon this PR in favor of the opposite approach: add PyUnicode_AsUTF8Safe(), PR #111688.

pythongh-111089: Add PyUnicode_AsUTF8Unsafe() function

cb87653

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) now raises an exception if the string contains embedded null characters.

bedevere-app bot mentioned this pull request Nov 3, 2023

[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters #111089

Closed

serhiy-storchaka reviewed Nov 3, 2023

vstinner closed this Nov 3, 2023

vstinner deleted the asutf8_unsafe branch November 3, 2023 11:08
