Consolidate string encoding handling across C API #1040
…ailable on Python 2; also start to port the first C functions in FreeBSD
Note: stdlib |
… 2; also get rid of the pstuil_ prefix and use the original Python names
I have a couple of things for your consideration. I am no expert on this, these encoding issues are problematic. I think you really have several problems to deal with.
Of course, problems 2 and 3 are trivially solved once the harder problem 1 is addressed. So what you really want is a way to ensure that strings are always represented using the same method throughout psutil. This means you want to encode/decode strings "at the edge" of your code base. You probably want to deal with them internally as unicode (or utf-8, which would involve re-encoding). Exhibit. Take for example the re module:
>>> import re
>>> bs = b'Bytes data.'
>>> us = u'Unicode data.'
>>> re.search(b'data', bs)
<_sre.SRE_Match object; span=(6, 10), match=b'data'>
>>> re.search(b'data', us)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/re.py", line 173, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a bytes pattern on a string-like object
This is convenient because I control both parameters. This leaves the encoding/decoding issues in my hands. Now let's take a look at a fake example involving psutil. Let's say that there is a function in psutil that allows me to check if a process with a given name exists.
Take a look at this piece of code: https://github.com/giampaolo/psutil/blob/master/psutil/arch/osx/process_info.c#L180 vs. https://github.com/giampaolo/psutil/blob/master/psutil/arch/windows/process_info.c#L683
In the first example, you are using the Python version to dictate how to decode the string. But in the second, you know (because of the MSDN documentation) the encoding that is in use. The second example is the pattern you want to follow. The API you rely on will dictate the encoding being used (which may be configured globally on the system). You should basically use that information to decode everything to unicode and then deal with that internally.
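The "decode at the edge, let the API dictate the encoding" idea can be sketched in Python as below. This is only an illustration, not psutil code: `API_ENCODINGS`, its keys and `decode_from_api` are hypothetical names, and the sketch assumes a Windows wide-character API hands back UTF-16-LE bytes while a POSIX path API hands back bytes in the filesystem encoding.

```python
# -*- coding: utf-8 -*-
import sys

# Hypothetical table: which encoding each kind of OS API is documented to use.
API_ENCODINGS = {
    "windows_wide": "utf-16-le",               # wide-character (W) Windows APIs
    "posix_fs": sys.getfilesystemencoding(),   # paths returned by POSIX calls
}


def decode_from_api(raw, api):
    """Decode bytes once, right where they enter the code base."""
    return raw.decode(API_ENCODINGS[api], "replace")


def process_exists(wanted, decoded_names):
    """Internal code only ever compares text with text."""
    return wanted in decoded_names


names = [
    decode_from_api(u"ƒőő".encode("utf-16-le"), "windows_wide"),
    decode_from_api(b"init", "posix_fs"),
]
print(process_exists(u"ƒőő", names))
```

The point of the table is that the Python version never enters the decision; only the documented behaviour of the underlying API does.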
Hello Ben, thanks a lot for chiming in. You raised a very good point about
Yes, correct, but the problem is what to do on Python 2. The first example ( On Windows, cPython 3 uses Then there is Python 2. cPython 2 uses
To my understanding no information is lost (thanks to
# -*- coding: utf-8 -*-
import psutil, sys

PY2 = sys.version_info[0] == 2
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if PY2:
        # Python 2 only: turn the returned str (bytes) back into unicode
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
        print("process %s found" % proc)
At least, this is how I think it's supposed to work hehe.
Finally there is the third and (to me) most obscure problem for non-fs APIs: locale. To my understanding any operating system has 2 encodings: the fs encoding and the locale encoding (well, also the terminal encoding, but whatever). But what exactly is the "locale encoding"? Is it the language of the system including things like windows, menus etc.? If that's the case then maybe APIs such as
Fix memory_maps() which was returning an invalidly encoded path in case of a non-ASCII path on both Python 2 and 3. Use GetMappedFileNameW instead of GetMappedFileNameA in order to ask the system for an actual unicode path. Also, on Windows encode unicode back to str/bytes by using the default fs-encoding + "replace" error handler. This paves the way for fixing other APIs in an identical manner.
OK, this is what I meant: ace8d28 |
unicode. Also fixes #1048 (host IP address was invalid).
…to return proper unicode; also return a (domain, user) tuple instead of concatenating the string in C (I feel safer)
…service display_name() and username()
OK, I'm finally done with this (phew!). Anyway, here's what changed with #1052:
I think this is the best compromise in terms of usability for both Python 2 and 3. @btimby thanks again for chiming in |
This is really awesome... Do you have i18n support? I would like to translate messages to Korean, Russian and Chinese... |
Well, AFAIK internationalization has to be implemented in your app (by you). |
…e default error handler instead of guessing it
UPDATE: final situation
This issue has been fixed in PR #1052. Starting from version 5.3.0 psutil will fully support unicode. The notes below apply to any API returning a string such as process exe() or cwd(), including non-filesystem related APIs such as process username() or WindowsService.description(). This is what users will get with psutil 5.3.0:
- no API call is supposed to crash with UnicodeDecodeError
- badly encoded data is handled with the following error handlers:
  - Python 3: sys.getfilesystemencodeerrors() (PY 3.6+) or "surrogateescape" on POSIX and "replace" on Windows
  - Python 2: "replace"
- on Python 2 all APIs return bytes (str type), never unicode
- on Python 2 you can go back to unicode by doing unicode(p.exe(), sys.getdefaultencoding(), errors="replace") and do funky string comparisons.
Example which filters processes with a funky name working with both Python 2 and 3:
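A minimal sketch of such a filter follows. It is not the original example: `to_text` and `find_matching` are hypothetical helper names, and `names` stands in for whatever was collected from psutil (for instance the `name` attribute from `psutil.process_iter(attrs=['name'])`, as used earlier in this thread). The sketch decodes with the filesystem encoding by default, on the assumption that this is the encoding the returned bytes were produced with; the thread itself uses `sys.getdefaultencoding()`.

```python
# -*- coding: utf-8 -*-
import sys


def to_text(value, encoding=None):
    """Return value as text.

    Bytes (i.e. str on Python 2) are decoded with the "replace" handler;
    text (str on Python 3, unicode on Python 2) is passed through untouched.
    """
    if isinstance(value, bytes):
        return value.decode(encoding or sys.getfilesystemencoding(), "replace")
    return value


def find_matching(names, wanted, encoding=None):
    """Return the entries of names equal to wanted once both sides are text."""
    wanted = to_text(wanted, encoding)
    return [name for name in names if to_text(name, encoding) == wanted]


# Stand-in for names gathered from psutil; one entry is still raw bytes.
names = [u"init", u"ƒőő", u"ƒőő".encode("utf-8")]
print(len(find_matching(names, u"ƒőő", encoding="utf-8")))
```

Because the helper checks for bytes rather than for the interpreter version, the same code path serves Python 2 (where the APIs return str) and Python 3 (where they already return text).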
Original issue
(NOTE: this content is updated as I go)
So, psutil has different APIs returning a string, many of which misbehave when it comes to unicode.
Some return unicode on Python 2 instead of str.
Process.cmdline()
Process.connections()
Process.cwd()
Process.environ()
Process.exe()
Process.memory_maps()
Process.name()
Process.open_files()
Process.username()
disk_io_counters()
disk_partitions()
disk_usage(str)
net_connections()
net_if_addrs()
net_if_stats()
net_io_counters()
sensors_fans()
sensors_temperatures()
users()
WinService.binpath()
WinService.description()
WinService.display_name()
WinService.name()
WinService.status()
WinService.username()
Right now there are 3 distinct problems with this.
Filesystem or locale encoding?
The first problem is that the C extension currently uses 2 approaches when it comes to decoding and returning a string:
PyUnicode_DecodeFSDefault, or PyUnicode_Decode(Py_FileSystemDefaultEncoding, "replace") on Python 2 (kinda equivalent)
PyUnicode_DecodeLocale (Python 3 only)
Most of the time we use PyUnicode_DecodeFSDefault but not always. The first issue, then, is to figure out which APIs should use one or the other. It appears clear that PyUnicode_DecodeFSDefault should be used for all fs-related APIs such as process exe(), open_files() etc. It is less clear when to use PyUnicode_DecodeLocale. To my understanding maybe we should use it for things such as:
WindowsService.description()
WindowsService.display_name()
...and maybe (but less likely) for:
Process.username()
users()
UPDATE: decided it's better for the user to deal with one encoding only (filesystem) and not think about what API he/she is using
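At the Python level, sticking to the filesystem encoding everywhere roughly corresponds to what `os.fsdecode()` and `os.fsencode()` do on Python 3. The sketch below only illustrates that round trip; it is not how the C extension is written, and `fs_roundtrip` is a hypothetical helper name.

```python
import os
import sys


def fs_roundtrip(raw):
    """Decode bytes with the filesystem encoding, then encode them back.

    os.fsdecode() and os.fsencode() use sys.getfilesystemencoding() together
    with the filesystem error handler, so the caller never picks an encoding.
    """
    text = os.fsdecode(raw)
    return text, os.fsencode(text)


text, raw_again = fs_roundtrip(b"/tmp/caf\xc3\xa9")
print(sys.getfilesystemencoding(), type(text).__name__, raw_again)
```

The caller only ever sees one encoding, which is the usability argument made in the update above.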
Error handling
The second question is what to do in case the string cannot be correctly decoded.
About FS APIs
Right now we tend to use "surrogateescape", which is also the default for PyUnicode_DecodeFSDefault on Python 3, so I'm pretty sure for fs-related paths we should do this every time we have the chance (on Python 3 at least).
Note: on Windows the default is "surrogatepass" (py 3.6) or "replace" as per PEP-529.
It must be noted that AFAIK on Python 2 the os module has no fs-APIs returning a string (e.g. os.listdir()) which may crash with UnicodeDecodeError, so we should do the same and use "replace". There are already some tests for this, see test_unicode.py.
About other APIs
Shall we use "strict" (and raise an exception) or "surrogateescape"? Not sure.
Python 2 vs. 3
And here comes the trouble. Whereas it appears kind of clear what to do on Python 3, Python 2 is different. In order to attempt to correctly handle and represent all kinds of strings on Python 2 we should return... well, unicode instead of str, but I don't want to do that, nor do I want APIs which return two different types depending on the circumstance. Since unicode support is already broken in Python 2 and its stdlib (see bpo-18695) I'm happy to always return str, use the "replace" error handler and consider unicode support in psutil + python 2 broken (EDIT: it turns out it's not, as you can retrieve the correct string by doing unicode(proc.exe(), sys.getdefaultencoding(), errors="replace")).
There's still the question about when to use PyUnicode_DecodeFSDefault and (a variant of) PyUnicode_DecodeLocale but on Python 2 this is less important as unicode handling is broken anyway.
Summary / TODO
Python 3
PyUnicode_DecodeFSDefault and PyUnicode_DecodeLocale
PyUnicode_DecodeLocale (if used)
PyUnicode_DecodeFSDefault and as such may crash
Python 2
UnicodeDecodeError in case of PyUnicode_DecodeLocale
unicode, always str
and have tests for it
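The practical difference between the error handlers discussed in this issue, and the kind of check a test could make, can be sketched on Python 3 as follows. This is an illustration under assumptions, not a copy of test_unicode.py: the helper names are hypothetical, and UTF-8 is used as a stand-in for the filesystem encoding.

```python
# Python 3 only: "surrogateescape" does not exist on Python 2.
INVALID = b"f\xc0\x80"  # 0xC0 and 0x80 can never appear like this in valid UTF-8


def decode_lossless(raw, encoding="utf-8"):
    """Undecodable bytes become lone surrogates and can be restored later."""
    return raw.decode(encoding, "surrogateescape")


def decode_lossy(raw, encoding="utf-8"):
    """Undecodable bytes become U+FFFD; the original bytes are gone."""
    return raw.decode(encoding, "replace")


def decode_strict(raw, encoding="utf-8"):
    """Raises UnicodeDecodeError on the first undecodable byte."""
    return raw.decode(encoding, "strict")


escaped = decode_lossless(INVALID)
replaced = decode_lossy(INVALID)

# Only the surrogateescape result can be turned back into the original bytes.
restored = escaped.encode("utf-8", "surrogateescape")
print(restored == INVALID, "\ufffd" in replaced)
```

Neither of the first two helpers raises, which is the property the "no UnicodeDecodeError" goal depends on; only "surrogateescape" additionally keeps the original bytes recoverable.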