pythongh-111089: Add PyUnicode_AsUTF8Safe() function
Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
vstinner committed Nov 7, 2023
1 parent 931f443 commit 36973d6
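
The hazard named in the commit message can be illustrated outside C. The following is a minimal Python sketch, not the C implementation in this commit: `as_utf8_safe` and `c_string_view` are hypothetical helper names introduced here, and the error message text is an assumption. It shows how a null-terminated `char*` view truncates silently, and what a null-character check would reject.

```python
import ctypes


def as_utf8_safe(text):
    """Hypothetical pure-Python stand-in for the null-character check."""
    if not isinstance(text, str):
        raise TypeError("expected str")
    # str.encode() raises UnicodeEncodeError for lone surrogates.
    encoded = text.encode("utf-8")
    if b"\0" in encoded:
        # Assumed message; the commit only states that ValueError is raised.
        raise ValueError("embedded null character")
    return encoded


def c_string_view(encoded):
    """Return what C code reading a null-terminated char* would see."""
    return ctypes.c_char_p(encoded).value


full = "abc\0def".encode("utf-8")
truncated = c_string_view(full)  # stops at the first null byte
```

Here `full` holds all seven bytes while `truncated` holds only the part before the null byte, which is the silent data loss that the check is meant to turn into an error.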
Showing 43 changed files with 270 additions and 201 deletions.
34 changes: 29 additions & 5 deletions Doc/c-api/unicode.rst
@@ -979,6 +979,15 @@ These are the UTF-8 codec APIs:
responsible for deallocating the buffer. The buffer is deallocated and
pointers to it become invalid when the Unicode object is garbage collected.
If *size* is NULL and the *unicode* string contains null characters, the
UTF-8 encoded string contains embedded null bytes, and the caller cannot
detect them because the string size is not stored. C functions processing
null-terminated ``char*`` strings truncate the string at the first embedded
null byte and ignore the bytes after it. The :c:func:`PyUnicode_AsUTF8Safe`
function can be used to raise an exception rather than truncating the string.
Alternatively,
:c:func:`PyUnicode_AsUTF8AndSize(unicode, &size) <PyUnicode_AsUTF8AndSize>`
can be used to store the size.
.. versionadded:: 3.3
.. versionchanged:: 3.7
@@ -990,12 +999,13 @@ These are the UTF-8 codec APIs:
.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
Similar to :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
Raise an exception if the *unicode* string contains embedded null
characters. To accept embedded null characters and truncate on purpose
at the first null byte, ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be
used instead.
If the *unicode* string contains null characters, the UTF-8 encoded string
contains embedded null bytes. C functions processing null-terminated ``char*``
strings truncate the string at the first embedded null byte and ignore the
bytes after it. The :c:func:`PyUnicode_AsUTF8Safe` function can be used to
raise an exception rather than truncating the string.
.. versionadded:: 3.3
@@ -1005,6 +1015,20 @@ These are the UTF-8 codec APIs:
.. versionchanged:: 3.13
Raise an exception if the string contains embedded null characters.
.. c:function:: const char* PyUnicode_AsUTF8Safe(PyObject *unicode)
Similar to :c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the
string contains embedded null characters.
The Unicode Character Set contains characters which can cause bugs or even
security issues depending on how they are processed. See for example `Unicode
Technical Report #36: Unicode Security Considerations
<https://unicode.org/reports/tr36/>`_. This function implements a single
check: it only tests whether the string contains null characters. Additional
checks are needed to prevent further issues caused by Unicode characters.
.. versionadded:: 3.13
UTF-32 Codecs
"""""""""""""
11 changes: 5 additions & 6 deletions Doc/whatsnew/3.13.rst
@@ -1137,6 +1137,11 @@ New Features
* Add :c:func:`PyUnicode_AsUTF8` function to the limited C API.
(Contributed by Victor Stinner in :gh:`111089`.)

* Add :c:func:`PyUnicode_AsUTF8Safe` function: similar to
:c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
contains embedded null characters.
(Contributed by Victor Stinner in :gh:`111089`.)


Porting to Python 3.13
----------------------
@@ -1207,12 +1212,6 @@ Porting to Python 3.13
Note that ``Py_TRASHCAN_BEGIN`` has a second argument which
should be the deallocation function it is in.

* The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the string
contains embedded null characters. To accept embedded null characters and
truncate on purpose at the first null byte,
``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be used instead.
(Contributed by Victor Stinner in :gh:`111089`.)

* On Windows, ``Python.h`` no longer includes the ``<stddef.h>`` standard
header file. If needed, it should now be included explicitly. For example, it
provides ``offsetof()`` function, and ``size_t`` and ``ptrdiff_t`` types.
4 changes: 4 additions & 0 deletions Include/cpython/unicodeobject.h
@@ -440,6 +440,10 @@ PyAPI_FUNC(PyObject*) PyUnicode_FromKindAndData(
    const void *buffer,
    Py_ssize_t size);

// Similar to PyUnicode_AsUTF8(), but raise ValueError if the string contains
// embedded null characters.
PyAPI_FUNC(const char *) PyUnicode_AsUTF8Safe(PyObject *unicode);


/* === Characters Type APIs =============================================== */

33 changes: 23 additions & 10 deletions Lib/test/test_capi/test_unicode.py
@@ -905,25 +905,38 @@ def test_fromordinal(self):
        self.assertRaises(ValueError, fromordinal, 0x110000)
        self.assertRaises(ValueError, fromordinal, -1)

    @support.cpython_only
    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
    def test_asutf8(self):
        """Test PyUnicode_AsUTF8()"""
        from _testcapi import unicode_asutf8

    def check_asutf8(self, unicode_asutf8):
        self.assertEqual(unicode_asutf8('abc', 4), b'abc\0')
        self.assertEqual(unicode_asutf8('абв', 7), b'\xd0\xb0\xd0\xb1\xd0\xb2\0')
        self.assertEqual(unicode_asutf8('\U0001f600', 5), b'\xf0\x9f\x98\x80\0')

        # disallow embedded null characters
        self.assertRaises(ValueError, unicode_asutf8, 'abc\0', 0)
        self.assertRaises(ValueError, unicode_asutf8, 'abc\0def', 0)

        self.assertRaises(UnicodeEncodeError, unicode_asutf8, '\ud8ff', 0)
        self.assertRaises(TypeError, unicode_asutf8, b'abc', 0)
        self.assertRaises(TypeError, unicode_asutf8, [], 0)
        # CRASHES unicode_asutf8(NULL, 0)

    @support.cpython_only
    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
    def test_asutf8(self):
        """Test PyUnicode_AsUTF8()"""
        from _testcapi import unicode_asutf8
        self.check_asutf8(unicode_asutf8)

        # allow embedded null characters
        self.assertEqual(unicode_asutf8('abc\0', 5), b'abc\0\0')
        self.assertEqual(unicode_asutf8('abc\0def', 8), b'abc\0def\0')

    @support.cpython_only
    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
    def test_asutf8safe(self):
        """Test PyUnicode_AsUTF8Safe()"""
        from _testcapi import unicode_asutf8safe
        self.check_asutf8(unicode_asutf8safe)

        # disallow embedded null characters
        self.assertRaises(ValueError, unicode_asutf8safe, 'abc\0', 0)
        self.assertRaises(ValueError, unicode_asutf8safe, 'abc\0def', 0)

    @support.cpython_only
    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
    def test_asutf8andsize(self):
@@ -0,0 +1,3 @@
Add :c:func:`PyUnicode_AsUTF8Safe` function: similar to
:c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
contains embedded null characters. Patch by Victor Stinner.
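
The documentation hunk for `PyUnicode_AsUTF8AndSize` says that storing the size avoids the truncation problem. The sketch below tries to show this from Python, assuming a CPython interpreter (`ctypes.pythonapi` is CPython-specific). It calls only the existing `PyUnicode_AsUTF8AndSize`; `PyUnicode_AsUTF8Safe` is not called because it exists only in this commit. The variable names are illustrative.

```python
import ctypes

# Existing C API:
#   const char *PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
as_utf8_and_size = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8_and_size.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
# Use a raw address so that ctypes does not stop at the first null byte itself.
as_utf8_and_size.restype = ctypes.c_void_p

text = "abc\0def"  # keep a reference: the UTF-8 buffer belongs to this object
size = ctypes.c_ssize_t()
address = as_utf8_and_size(text, ctypes.byref(size))

# Reading with the stored size keeps the bytes after the embedded null.
with_size = ctypes.string_at(address, size.value)
# Reading as a null-terminated C string stops at the embedded null.
nul_terminated = ctypes.string_at(address)
```

With the size available, the caller can either use all the bytes or compare the size against the null-terminated length to detect an embedded null.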
10 changes: 5 additions & 5 deletions Modules/_io/clinic/_iomodule.c.h
4 changes: 2 additions & 2 deletions Modules/_io/clinic/fileio.c.h
8 changes: 4 additions & 4 deletions Modules/_io/clinic/textio.c.h
4 changes: 2 additions & 2 deletions Modules/_io/clinic/winconsoleio.c.h
4 changes: 2 additions & 2 deletions Modules/_multiprocessing/clinic/multiprocessing.c.h
4 changes: 2 additions & 2 deletions Modules/_multiprocessing/clinic/semaphore.c.h
26 changes: 13 additions & 13 deletions Modules/_sqlite/clinic/connection.c.h

Some generated files are not rendered by default.
