Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid losing null-chars when casting to/from std::string. #48

Closed
wants to merge 5 commits into from

Conversation

Vayu
Copy link

@Vayu Vayu commented Dec 25, 2015

In Python, strings ofter carry binary data including NULL characters, we should try to avoid losing them when calling C++ functions, std::string can hold NULL char easily.

This PR changes casts to use *_AsStringAndSize functions, which copy full string length.

I kept unicode->utf8 in Python to C++ casts (since we have PyUnicode_Check).

Unfortunately in C++ to Python cast there is no way to know if byte sequence is intended to be utf8 string, so for consistency and to avoid data loss we should return str/bytes and do decoding on the Python side. (Maybe one can introduce a helper class py::utf8string which can indicate the API intent to decode utf8 string).

Python 2:
str -> std::string
unicode (utf8) -> std::string
std::string -> str

Python 3:
bytes -> std::string
str (utf8) -> std::string
std::string -> bytes

Do not try to utf8 encode std::string and char*, since many binary strings
cannot be encoded in utf8.

Python 2:
str -> std::string
unicode (utf8) -> std::string
std::string -> str

Python 3:
bytes -> std::string
str (utf8) -> std::string
std::string -> bytes
@Vayu
Copy link
Author

Vayu commented Dec 25, 2015

Tests are broken mostly because __repr__ and __str__ return bytes instead of str. It can be fixed by adding casts to utf8string.

One can also keep default utf8 behaviour and instead make bytes strings explicit (e.g. pybind11::bytestring). Let me know what you think.

// pytypes.h
class utf8string : public std::string {
public:
    using std::string::string;

    utf8string(const std::string& src) : std::string(src) {}
    utf8string() : std::string() {}
};

// cast.h
template <> class type_caster<pybind11::utf8string> : public type_caster<std::string> {
public:
    static PyObject *cast(const pybind11::utf8string &src, return_value_policy /* policy */, PyObject * /* parent */) {
        return PyUnicode_FromString(src.c_str());
    }
#if PY_MAJOR_VERSION >= 3
    PYBIND11_TYPE_CASTER(pybind11::utf8string, "str");
#else
    PYBIND11_TYPE_CASTER(pybind11::utf8string, "unicode");
#endif
};

Add pybind11::bytestring and pybind11::bytepchar which cast to bytes arrays.
@Vayu
Copy link
Author

Vayu commented Dec 25, 2015

Reverted to old logic, added pybind11::bytestring

Python 2:
std::string (utf8) -> unicode
pybind11::bytestring -> str

Python 3:
std::string (utf8) -> str
pybind11::bytestring -> bytes

@wjakob
Copy link
Member

wjakob commented Dec 26, 2015

Hi,

what about introducing a sort of templated type wrapper that can take different string types, e.g. py::bytes<std::string> and py::bytes<char *>?

I will probably also add such wrapper types for other STL data types, e.g. py::rawstd::vector<...> for cases where the argument should never be converted to a Python array and instead is treated as an opaque object.

Thanks,
Wenzel

@Vayu
Copy link
Author

Vayu commented Dec 26, 2015

That is an option. It doesn't save much though, since one still needs explicit specializations for type_caster<bytes<std::string>> and both type_caster<bytes<char*>> and type_caster<bytes<const char*>>.

To be fair if we are OK with an extra copy, I'd rather drop bytepchar completely and keep only bytestring. It will handle both char* and cost char* by implicit std::string conversion via inherited constructor.

@Vayu
Copy link
Author

Vayu commented Dec 28, 2015

more comments?

@wjakob
Copy link
Member

wjakob commented Dec 29, 2015

I do lean towards the py::bytes<> "modifier" style syntax -- I intend to use this approach for other kinds of binding-specific type tags as well, and having a uniform way of doing this would be preferable IMHO.

@Vayu
Copy link
Author

Vayu commented Dec 29, 2015

I'm not sure that notation uniformity will justify significantly increased code complexity. If bytes<> is a wrapper class then we will need 3 specializations: string, char* and const char*.

One can do a subclass (like bytestring), then only bytes<string> is sufficient, but I don't see what bytes<T> should do.

@Vayu
Copy link
Author

Vayu commented Dec 30, 2015

Is that what you had in mind? Here bytes<T> would work for any T which has const char* data() and int size() methods (like std::string or vector<char>). If you think it is fine, I can add it to PR

diff --git a/include/pybind11/cast.h b/include/pybind11/cast.h
index bcaed97..a6153b8 100644
--- a/include/pybind11/cast.h
+++ b/include/pybind11/cast.h
@@ -378,15 +378,15 @@ public:
 #endif
 };

-template <> class type_caster<bytestring> : public type_caster<std::string> {
+template <typename T> class type_caster<bytes<T>> : public type_caster<T> {
 public:
-    static PyObject *cast(const bytestring &src, return_value_policy /* policy */, PyObject * /* parent */) {
-        return PYBIND11_FROM_STRING_AND_SIZE(src.c_str(), src.size());
+    static PyObject *cast(const bytes<T> &src, return_value_policy /* policy */, PyObject * /* parent */) {
+        return PYBIND11_FROM_STRING_AND_SIZE(src.data(), src.size());
     }
 #if PY_MAJOR_VERSION >= 3
-    PYBIND11_TYPE_CASTER(bytestring, "bytes");
+    PYBIND11_TYPE_CASTER(bytes<T>, "bytes");
 #else
-    PYBIND11_TYPE_CASTER(bytestring, "str");
+    PYBIND11_TYPE_CASTER(bytes<T>, "str");
 #endif
 };

diff --git a/include/pybind11/pytypes.h b/include/pybind11/pytypes.h
index 1548423..0b1ad5e 100644
--- a/include/pybind11/pytypes.h
+++ b/include/pybind11/pytypes.h
@@ -259,13 +259,14 @@ inline iterator handle::end() { return iterator(nullptr); }
     PYBIND11_OBJECT(Name, Parent, CheckFun) \
     Name() : Parent() { }

-class bytestring : public std::string {
+template <typename T>
+class bytes : public T {
 public:
-    using std::string::string;
+    using T::T;

-    bytestring(const std::string& src) : std::string(src) { }
-    bytestring(std::string&& src) : std::string(std::move(src)) { }
-    bytestring() : std::string() { }
+    bytes(const T& src) : T(src) { }
+    bytes(T&& src) : T(std::move(src)) { }
+    bytes() : T() { }
 };

 class str : public object {

@wjakob
Copy link
Member

wjakob commented Dec 30, 2015

yes, that looks great!

py::bytes<T> will work with any T which has these methods
const char* data() and int size()

Other types can be accommodated by specializing bytes<T>
and type_caster<bytes<T>>
@Vayu
Copy link
Author

Vayu commented Dec 30, 2015

updated

@Vayu
Copy link
Author

Vayu commented Jan 12, 2016

more comments?

@wjakob
Copy link
Member

wjakob commented Jan 17, 2016

This is now implemented, though with a slightly different approach (a pybind11::bytes type -- sorry for the back and forth, this was ultimately the easiest to fit into the codebase)

@wjakob wjakob closed this Jan 17, 2016
@Vayu
Copy link
Author

Vayu commented Jan 18, 2016

That way there is no C++ type which is automatically typecast to Python bytes. One needs to explicitly create py::bytes. Also doesn't work with vector<char>

EricCousineau-TRI added a commit to EricCousineau-TRI/pybind11 that referenced this pull request Feb 25, 2022
Not a functional change per se, but good for simple code assurance
sschnug pushed a commit to sschnug/pybind11 that referenced this pull request Aug 2, 2024
* feat: support Universal2 wheels

* fix: support Apple Silicon

* Update setup.py

* Update test.py

* fix: local testing on AS, use pytest
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants