Consolidate string encoding handling across C API #1040

giampaolo · 2017-04-30T19:44:15Z

UPDATE: final situation

This issue has been fixed in PR #1052. Starting from version 5.3.0 psutil will fully support unicode. The notes below apply to any API returning a string such as process exe() or cwd() including non-filesystem related APIs such as process username() or WindowsService.description(). This is what users will get with psutil 5.3.0:

all strings are encoded by using the OS filesystem encoding which varies depending on the platform (e.g. UTF-8 on Linux, mbcs on Win)
unicode is now correctly supported on Windows (no corrupted data is returned) by using specific unicode Windows APIs
no API call is supposed to crash with UnicodeDecodeError
instead, in case of badly encoded data returned by the OS, the following error handlers are used to replace the bad characters in the string:
- Python 3: sys.getfilesystemencodeerrors() (PY 3.6+) or "surrogatescape" on POSIX and "replace" on Windows
- Python 2: "replace"
on Python 2 all APIs return bytes (str type), never unicode
on Python 2 you can go back to unicode by doing:
unicode(p.exe(), sys.getdefaultencoding(), errors="replace")and do funky string comparisons.ù
Example which filters processes with a funky name working with both Python 2 and 3:

# -*- coding: utf-8 -*-
import psutil, sys

PY3 = sys.version_info[0] == 2
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if not PY3:
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
         print("process %s found" % p)

Original issue

(NOTE: this content is updated as I go)

So, psutil has different APIs returning a string, many of which misbehaving when it comes to unicode.

A: may raise decoding error on python 3 in case of non-ASCII string
B: return unicode on Python 2 instead of str
C returns incorrect / invalid encoded data in case of non-ASCII string

API	Linux	Win	OSX	FreeBSD	NetBSD	OpenBSD	SunOS
`Process.cmdline()`
`Process.connections()`			A	A	A
`Process.cwd()`
`Process.environ()`		B
`Process.exe()`
`Process.memory_maps()`		B,C	A	A			A
`Process.name()`
`Process.open_files()`				A
`Process.username()`

`disk_io_counters()`
`disk_partitions()`	A		A	A	A	A	A
`disk_usage(str)`
`net_connections()`				A	A
`net_if_addrs()`
`net_if_stats()`		B
`net_io_counters()`
`sensors_fans()`
`sensors_temperatures()`
`users()`	A		A	A	A	A	A

`WinService.binpath()`		B
`WinService.description()`		B,C
`WinService.display_name()`		B,C
`WinService.name()`
`WinService.status()`
`WinService.username()`		B

Right now there are 3 distinctive problems about it.

Filesystem or locale encoding?

First problem is that the C extension currently uses 2 approaches when it comes to decode and return a string:

PyUnicode_DecodeFSDefault
- PyUnicode_Decode(Py_FileSystemDefaultEncoding, "replace") on Python 2 (kinda equivalent)
PyUnicode_DecodeLocale (Python 3 only)

Most of the times we use PyUnicode_DecodeFSDefault but not always. First issue, then, is to figure out which APIs should use one or the other. It appears clear that PyUnicode_DecodeFSDefault should be used for all fs-related APIs such as process exe(), open_files() etc. It is less clear when to use PyUnicode_DecodeLocale. To my understanding maybe we should use it for things such as:

WindowsService.description()
WindowsService.display_name()

...and maybe (but less likely) for:

Process.username()
users()

UPDATE: decided it's better for the user to deal with one encoding only (filesystem) and not think about what API he/she is using

Error handling

Second question is what to do in case the string cannot be correctly decoded.

About FS APIs

Right now we tend to use "surrogateescape", which is also the default for PyUnicode_DecodeFSDefault on Python 3, so I'm pretty sure for fs-related paths we should do this every time we have the chance (on Python 3 at least).

Note: on Windows the default is "surrogatepass" (py 3.6) or "replace" as per PEP-529.

It must be noted that AFAIK on Python 2 the os module has no fs-APIs returning a string (e.g. os.listdir()) which may crash with UnicodeDecodeError so we should do the same and use "replace". There are already some tests for this, see see test_unicode.py).

About other APIs

Shall we use "strict" (and raise exception) or "surrogateescape"? Not sure.

Python 2 vs. 3

And here comes the troubles. Whereas it appears kind of clear what to do on Python 3, Python 2 is different. In order to attempt to correctly handle and represent all kind of strings on Python 2 we should return... well, unicode instead of str, but I don't want to do that, and neither have APIs which return two different types depending on the circumstance. Since unicode support is already broken in Python 2 and its stdlib (see bpo-18695) I'm happy to always return str, use "replace" error handler and consider unicode support in psutil + python 2 broken (EDIT: it turns out it's not as you can retrieve the correct string by doing unicode(proc.exe(), sys.getdefaultencoding(), errors="replace")).

There's still the question about when to use PyUnicode_DecodeFSDefault and (a variant of) PyUnicode_DecodeLocale but on Python 2 this is less important as unicode handling is broken anyway.

Summary / TODO

Python 3

figure out whether / when to use PyUnicode_DecodeFSDefault and PyUnicode_DecodeLocale
figure out the default error handler for PyUnicode_DecodeLocale (if used)
discover APIs which do not use PyUnicode_DecodeFSDefault and as such may crash

Python 2

same as above but never fail with UnicodeDecodeError in case of PyUnicode_DecodeLocale
never return unicode, always str and have tests for it

The text was updated successfully, but these errors were encountered:

…ailable on Python 2; also start to port the first C functions in FreeBSD

… error on py3

…open_files()

giampaolo · 2017-05-01T02:14:54Z

Note: stdlib pwd.getpwall() which lists users uses PyUnicode_DecodeFSDefault, not PyUnicode_DecodeLocale.

… 2; also get rid of the pstuil_ prefix and use the original Python names

btimby · 2017-05-01T21:18:35Z

I have a couple of things for your consideration. I am no expert on this, these encoding issues are problematic. I think you really have several problems to deal with.

Interpreting the strings you get from the underlying libraries you are using (C extension code).
Providing those strings to consumers of psutil so that they can use them properly.
Accepting strings from consumers for things like filtering or getting a process by name etc.

Of course, problems 2 and 3 are trivially solved once the harder problem 1 is addressed. So what you really want is a way to ensure that strings are always represented using the same method throughout psutil. This means you want to encode/decode strings "at the edge" of your code base. You probably want to deal with them internally as unicode (or utf-8, which would involve re-encoding).

Exhibit.

Take for example the re module. You can use either bytes or unicode for patterns, but it requires that the data being processed uses the same representation.

>>> import re
>>> bs = b'Bytes data.'
>>> us = u'Unicode data.'
>>> re.search(b'data', bs)
<_sre.SRE_Match object; span=(6, 10), match=b'data'>
>>> re.search(b'data', us)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/re.py", line 173, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a bytes pattern on a string-like object

This is convenient because I control both parameters. This leaves the encoding/decoding issues in my hands.

Now let's take a look at a fake example involving psutil. Let's say that there is a function in psutil that allows me to check if a process with a given name exists get_process(name). Now let's suppose that the C extension code for OSX returns unicode process names but utf8 on Windows. As a user, I only control the name I wish to search for. Probably the easiest way for you to handle this is to ensure user input is immediately converted to unicode (or functions only accept unicode). Further, the C extension code must convert to unicode as soon as data is obtained from the underlying API (Win32 API or similar). Then all of your internal comparison/filtering etc. can be done implicitly since all strings are unicode.

Take a look at this piece of code:

https://github.com/giampaolo/psutil/blob/master/psutil/arch/osx/process_info.c#L180

vs.

https://github.com/giampaolo/psutil/blob/master/psutil/arch/windows/process_info.c#L683

If the first example, you are using the python version to dictate how to decode the string. But in the second, you know (because of MSDN documentation) the encoding that is in use. The second example is the pattern you want to follow. The API you rely on will dictate the encoding being used (which may be configured globally on the system). You should basically use that information to decode everything to unicode and then deal with that internally.

giampaolo · 2017-05-02T17:39:42Z

Hello Ben, thanks a lot for chiming in.
First of all let me clarify that when it comes to strings psutil is all about returning information except for one API (disk_usage(path)).

You raised a very good point about re. On Python 2 disk_usage shoud behave like re and os modules, meaning it should be able to accept both bytes and strings, so that's another TODO to add to this list (I believe POSIX is fine already but not Windows).

Take a look at this piece of code:
https://github.com/giampaolo/psutil/blob/master/psutil/arch/osx/process_info.c#L180
vs.
https://github.com/giampaolo/psutil/blob/master/psutil/arch/windows/process_info.c#L683
If the first example, you are using the python version to dictate how to decode the string. But in the second, you know (because of MSDN documentation) the encoding that is in use. The second example is the pattern you want to follow.

Yes, correct, but the problem is what to do on Python 2.

The first example (PyUnicode_DecodeFSDefault) is what cPython 3 on POSIX uses all over the place, for instance for os.listdir and os.getcwd.

On Windows, cPython 3 uses PyUnicode_FromWideChar (e.g. for os.getcwd), which is also what I do in psutil. So Python 3 is easier.

Then there is Python 2. cPython 2 uses PyString_FromString all over the place, which returns a str (which are basically bytes), and PyUnicode_FromWideChar only when explicitly requested (e.g. os.getcwdu() or os.listdir(u'.')). Again, I believe this is what I should do and not return unicode (are you suggesting otherwise?). That means that on Python 2 + Windows if the C API returns unicode I should try to be consistent with non-unicode APIs and encode to str as in:

s.encode(s, encoding=sys.getfilesystemencoding(), errors='replace')`

To my understanding no information is lost (thanks to errors='replace') and on Python 2 you can still do comparisons by encoding the string back to unicode, which is supposed to be what was originally returned by PyUnicode_FromWideChar. For instance, to look for a process with a funky name() you should do:

# -*- coding: utf-8 -*-
import psutil, sys

PY2 = sys.version_info[0] == 2
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if PY2:
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
         print("process %s found" % p)

At least, this is how I think it's supposed to work hehe.

Finally there is the third and (to me) most obscure problem for non-fs APIs: locale. To my understanding any operating system has 2 encodings: fs encoding and the locale encoding (well also the terminal encoding but whatever). But what exactly is "locale encoding"? Is it the language of the system including things like windows, menus etc.? If that's the case then maybe APIs such as WindowsService.description() should use the locale encoding instead of the fs-encoding. But again, I am not sure.

Fix memory_maps() which was returning an invalid encoded path in case of non ASCII path on both Python 2 and 3. Use GetMappedFileNameW instead of GetMappedFilenameA in order to ask the system an actual unicode path. Also, on Windows encode unicode back to str/bytes by using default fs-encoding + "replace" error handler. This paves the way for fixing other APIs in an identical manner.

giampaolo · 2017-05-03T02:09:07Z

OK, this is what I meant: ace8d28

…nicode on python 2

…tes to unicode

unicode. Also fixes #1048 (host IP address was invalid).

…to return proper unicode; also return a (domain, user) tuple instead of concatenating the string in C (I feel safer)

…ice description()

… enumerate() and display_name()

…service display_name() and username()

#1040 fix unicode

giampaolo · 2017-05-04T00:04:31Z

OK, I'm finally done with this (phew!).
In the end I decided to use FS encoding everywhere and leave locale alone. I think it's more convenient because as long as psutil keeps the promise to return fs-encoded strings everywhere the user has to think about and deal with 1 encoding only instead of 2.

Anyway, here's what changed with #1052:

all strings use the FS encoding (never locale)
unicode is now correctly supported on Windows (no corrupted data is returned) by using specific unicode Windows APIs
str type is always returned on both python 2 and 3, never unicode
use "surrogateescape" err handler on UNIX and "replace" on Windows so no API call is supposed to crash with UnicodeDecodeError and also information is not lost in case of corrupted data

I think this is the best compromise in terms of usability for both Python 2 and 3.

@btimby thanks again for chiming in

FJplant · 2017-05-04T18:46:23Z

This is really awesome...

Do you have i18n support? I would like to translate messages to Korean, Russian and Chinese...

giampaolo · 2017-05-04T19:43:48Z

Well, AFAIK internationalization has to be implemented in your app (by you).

…e default error handler instead of guessing it

giampaolo mentioned this issue Apr 30, 2017

FreeBSD / OSX: Process.connections('unix') on python 3 does not properly handle unicode #1029

Closed

giampaolo added the bug label Apr 30, 2017

giampaolo mentioned this issue Apr 30, 2017

Consolidate API types #1039

Closed

giampaolo added a commit that referenced this issue Apr 30, 2017

#1040: provide an alias for PyUnicode_DecodeFSDefault which is not av…

8646be5

…ailable on Python 2; also start to port the first C functions in FreeBSD

giampaolo added a commit that referenced this issue Apr 30, 2017

#1040 / FreeBSD: fix net_connections('unix') which may raise decoding…

cdc9b4f

… error on py3

giampaolo mentioned this issue Apr 30, 2017

Add test for memory_maps() which loads a library by path #1041

Closed

giampaolo added a commit that referenced this issue May 1, 2017

#1040: add funky unicode test for memory_maps()

a3c1bc2

giampaolo added a commit that referenced this issue May 1, 2017

#1040 / unicode / freebsd: fix decode handling for memory_maps() and …

3685cfd

…open_files()

giampaolo added a commit that referenced this issue May 1, 2017

#1040 / unicode / bsd: fix unicode handling for disk_partitions()

0fab2e5

giampaolo added a commit that referenced this issue May 1, 2017

#1040 / unicode / bsd: fix unicode handling for users()

b19ddf8

giampaolo added a commit that referenced this issue May 1, 2017

#1040 fix unicode for connections() on netbsd

b6c2636

giampaolo added a commit that referenced this issue May 1, 2017

#1040 fix unicode for memory_maps() on osx

d93a437

giampaolo added a commit that referenced this issue May 1, 2017

#1040 use psutil_PyUnicode_DecodeFSDefault everywhere (osx)

40d73c4

giampaolo added a commit that referenced this issue May 1, 2017

#1040: add alias for psutil_PyUnicode_DecodeFSDefaultAndSize

043d573

giampaolo added a commit that referenced this issue May 1, 2017

#1040: add replacement for PyUnicode_DecodeFSDefaultAndSize on Python…

0cab2d1

… 2; also get rid of the pstuil_ prefix and use the original Python names

giampaolo added a commit that referenced this issue May 2, 2017

#1040: remove dead unicode C code

9c6e126

giampaolo added a commit that referenced this issue May 2, 2017

#1040 / users() / sunos: fix unicode

d1ddeb5

giampaolo added a commit that referenced this issue May 2, 2017

#1040 / disk_partitions() / sunos: fix unicode

f23d7b2

giampaolo added a commit that referenced this issue May 2, 2017

#1040 / memory_maps() / sunos: fix unicode

986fb8a

giampaolo added a commit that referenced this issue May 2, 2017

#1040 / users() / osx: fix unicode

bb60602

giampaolo added a commit that referenced this issue May 2, 2017

#1040 / disk_èartitions() / osx: fix unicode

3681cd7

giampaolo added a commit that referenced this issue May 2, 2017

#1040 disk_partitions() / linux: fix unicode

79e30c2

giampaolo added a commit that referenced this issue May 2, 2017

#1040 users() / linux: fix unicode

753faf8

giampaolo added a commit that referenced this issue May 3, 2017

#1040: have net_if_stats() and proc environ() return str instead of u…

2d1028c

…nicode on python 2

giampaolo added a commit that referenced this issue May 3, 2017

#1040: use default fs encoding instead of utf8 when decoding from uby…

a655055

…tes to unicode

giampaolo added a commit that referenced this issue May 3, 2017

re #1040 have users() use WTSQuerySessionInformationW in order to return

b06bb03

unicode. Also fixes #1048 (host IP address was invalid).

giampaolo added a commit that referenced this issue May 3, 2017

#1040: for proc username() on Windows use LookupAccountSidW in order …

734d603

…to return proper unicode; also return a (domain, user) tuple instead of concatenating the string in C (I feel safer)

giampaolo added a commit that referenced this issue May 3, 2017

#1040: just use Py_BuildValue on both py 2 and 3

e47981f

giampaolo added a commit that referenced this issue May 3, 2017

#1040: use QueryServiceConfig2W() and return unicode for windows serv…

7e86fc4

…ice description()

giampaolo added a commit that referenced this issue May 3, 2017

#1040: use EnumServiceStatusExW and return proper unicode for service…

3fccfda

… enumerate() and display_name()

giampaolo added a commit that referenced this issue May 3, 2017

#1040 use QueryServiceConfigW() and return proper unicode strings on …

4fe26b0

…service display_name() and username()

giampaolo added a commit that referenced this issue May 3, 2017

#1040: add notes about unicode in the doc

9ea0636

giampaolo added a commit that referenced this issue May 3, 2017

#1040 update doc/notes

de2b223

giampaolo added a commit that referenced this issue May 3, 2017

Merge pull request #1052 from giampaolo/1040-fix-unicode

91f26bb

#1040 fix unicode

giampaolo added a commit that referenced this issue May 4, 2017

#1040: on py 3.6 use sys.getfilesystemencodeerrors() to determined th…

982f255

…e default error handler instead of guessing it

giampaolo closed this as completed May 7, 2017

giampaolo added the enhancement label May 7, 2017

giampaolo mentioned this issue Jun 9, 2017

the result of net_if_stats() on windows server2008 seems incorrect #682

Closed

giampaolo added new-api and removed bug labels Nov 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate string encoding handling across C API #1040

Consolidate string encoding handling across C API #1040

giampaolo commented Apr 30, 2017 •

edited

Loading

giampaolo commented May 1, 2017

btimby commented May 1, 2017

giampaolo commented May 2, 2017 •

edited

Loading

giampaolo commented May 3, 2017

giampaolo commented May 4, 2017 •

edited

Loading

FJplant commented May 4, 2017

giampaolo commented May 4, 2017

Consolidate string encoding handling across C API #1040

Consolidate string encoding handling across C API #1040

Comments

giampaolo commented Apr 30, 2017 • edited Loading

UPDATE: final situation

Original issue

Filesystem or locale encoding?

Error handling

Python 2 vs. 3

Summary / TODO

giampaolo commented May 1, 2017

btimby commented May 1, 2017

giampaolo commented May 2, 2017 • edited Loading

giampaolo commented May 3, 2017

giampaolo commented May 4, 2017 • edited Loading

FJplant commented May 4, 2017

giampaolo commented May 4, 2017

giampaolo commented Apr 30, 2017 •

edited

Loading

giampaolo commented May 2, 2017 •

edited

Loading

giampaolo commented May 4, 2017 •

edited

Loading