-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid losing null-chars when casting to/from std::string. #48
Conversation
Do not try to utf8 encode std::string and char*, since many binary strings cannot be encoded in utf8. Python 2: str -> std::string unicode (utf8) -> std::string std::string -> str Python 3: bytes -> std::string str (utf8) -> std::string std::string -> bytes
Tests are broken mostly because One can also keep default utf8 behaviour and instead make bytes strings explicit (e.g. // pytypes.h
class utf8string : public std::string {
public:
using std::string::string;
utf8string(const std::string& src) : std::string(src) {}
utf8string() : std::string() {}
};
// cast.h
template <> class type_caster<pybind11::utf8string> : public type_caster<std::string> {
public:
static PyObject *cast(const pybind11::utf8string &src, return_value_policy /* policy */, PyObject * /* parent */) {
return PyUnicode_FromString(src.c_str());
}
#if PY_MAJOR_VERSION >= 3
PYBIND11_TYPE_CASTER(pybind11::utf8string, "str");
#else
PYBIND11_TYPE_CASTER(pybind11::utf8string, "unicode");
#endif
}; |
Add pybind11::bytestring and pybind11::bytepchar which cast to bytes arrays.
Reverted to old logic, added pybind11::bytestring Python 2: Python 3: |
Hi, what about introducing a sort of templated type wrapper that can take different string types, e.g. I will probably also add such wrapper types for other STL data types, e.g. py::rawstd::vector<...> for cases where the argument should never be converted to a Python array and instead is treated as an opaque object. Thanks, |
That is an option. It doesn't save much though, since one still needs explicit specializations for To be fair if we are OK with an extra copy, I'd rather drop |
more comments? |
I do lean towards the |
I'm not sure that notation uniformity will justify significantly increased code complexity. If bytes<> is a wrapper class then we will need 3 specializations: string, char* and const char*. One can do a subclass (like bytestring), then only |
Is that what you had in mind? Here diff --git a/include/pybind11/cast.h b/include/pybind11/cast.h
index bcaed97..a6153b8 100644
--- a/include/pybind11/cast.h
+++ b/include/pybind11/cast.h
@@ -378,15 +378,15 @@ public:
#endif
};
-template <> class type_caster<bytestring> : public type_caster<std::string> {
+template <typename T> class type_caster<bytes<T>> : public type_caster<T> {
public:
- static PyObject *cast(const bytestring &src, return_value_policy /* policy */, PyObject * /* parent */) {
- return PYBIND11_FROM_STRING_AND_SIZE(src.c_str(), src.size());
+ static PyObject *cast(const bytes<T> &src, return_value_policy /* policy */, PyObject * /* parent */) {
+ return PYBIND11_FROM_STRING_AND_SIZE(src.data(), src.size());
}
#if PY_MAJOR_VERSION >= 3
- PYBIND11_TYPE_CASTER(bytestring, "bytes");
+ PYBIND11_TYPE_CASTER(bytes<T>, "bytes");
#else
- PYBIND11_TYPE_CASTER(bytestring, "str");
+ PYBIND11_TYPE_CASTER(bytes<T>, "str");
#endif
};
diff --git a/include/pybind11/pytypes.h b/include/pybind11/pytypes.h
index 1548423..0b1ad5e 100644
--- a/include/pybind11/pytypes.h
+++ b/include/pybind11/pytypes.h
@@ -259,13 +259,14 @@ inline iterator handle::end() { return iterator(nullptr); }
PYBIND11_OBJECT(Name, Parent, CheckFun) \
Name() : Parent() { }
-class bytestring : public std::string {
+template <typename T>
+class bytes : public T {
public:
- using std::string::string;
+ using T::T;
- bytestring(const std::string& src) : std::string(src) { }
- bytestring(std::string&& src) : std::string(std::move(src)) { }
- bytestring() : std::string() { }
+ bytes(const T& src) : T(src) { }
+ bytes(T&& src) : T(std::move(src)) { }
+ bytes() : T() { }
};
class str : public object { |
yes, that looks great! |
py::bytes<T> will work with any T which has these methods const char* data() and int size() Other types can be accommodated by specializing bytes<T> and type_caster<bytes<T>>
updated |
more comments? |
This is now implemented, though with a slightly different approach (a |
That way there is no C++ type which is automatically typecast to Python bytes. One needs to explicitly create py::bytes. Also doesn't work with |
Not a functional change per se, but good for simple code assurance
* feat: support Universal2 wheels * fix: support Apple Silicon * Update setup.py * Update test.py * fix: local testing on AS, use pytest
In Python, strings ofter carry binary data including NULL characters, we should try to avoid losing them when calling C++ functions, std::string can hold NULL char easily.
This PR changes casts to use *_AsStringAndSize functions, which copy full string length.
I kept unicode->utf8 in Python to C++ casts (since we have PyUnicode_Check).
Unfortunately in C++ to Python cast there is no way to know if byte sequence is intended to be utf8 string, so for consistency and to avoid data loss we should return str/bytes and do decoding on the Python side. (Maybe one can introduce a helper class py::utf8string which can indicate the API intent to decode utf8 string).
Python 2:
str -> std::string
unicode (utf8) -> std::string
std::string -> str
Python 3:
bytes -> std::string
str (utf8) -> std::string
std::string -> bytes