Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate string encoding handling across C API #1040

Closed
giampaolo opened this issue Apr 30, 2017 · 7 comments
Closed

Consolidate string encoding handling across C API #1040

giampaolo opened this issue Apr 30, 2017 · 7 comments

Comments

@giampaolo
Copy link
Owner

giampaolo commented Apr 30, 2017

UPDATE: final situation

This issue has been fixed in PR #1052. Starting from version 5.3.0 psutil will fully support unicode. The notes below apply to any API returning a string such as process exe() or cwd() including non-filesystem related APIs such as process username() or WindowsService.description(). This is what users will get with psutil 5.3.0:

  • all strings are encoded by using the OS filesystem encoding which varies depending on the platform (e.g. UTF-8 on Linux, mbcs on Win)
  • unicode is now correctly supported on Windows (no corrupted data is returned) by using specific unicode Windows APIs
  • no API call is supposed to crash with UnicodeDecodeError
  • instead, in case of badly encoded data returned by the OS, the following error handlers are used to replace the bad characters in the string:
    • Python 3: sys.getfilesystemencodeerrors() (PY 3.6+) or "surrogatescape" on POSIX and "replace" on Windows
    • Python 2: "replace"
  • on Python 2 all APIs return bytes (str type), never unicode
  • on Python 2 you can go back to unicode by doing:
    unicode(p.exe(), sys.getdefaultencoding(), errors="replace")and do funky string comparisons.ù
    Example which filters processes with a funky name working with both Python 2 and 3:
# -*- coding: utf-8 -*-
import psutil, sys

PY3 = sys.version_info[0] == 2
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if not PY3:
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
         print("process %s found" % p)

Original issue

(NOTE: this content is updated as I go)

So, psutil has different APIs returning a string, many of which misbehaving when it comes to unicode.

  • A: may raise decoding error on python 3 in case of non-ASCII string
  • B: return unicode on Python 2 instead of str
  • C returns incorrect / invalid encoded data in case of non-ASCII string
API Linux Win OSX FreeBSD NetBSD OpenBSD SunOS
Process.cmdline()
Process.connections() A A A
Process.cwd()
Process.environ() B
Process.exe()
Process.memory_maps() B,C A A A
Process.name()
Process.open_files() A
Process.username()
disk_io_counters()
disk_partitions() A A A A A A
disk_usage(str)
net_connections() A A
net_if_addrs()
net_if_stats() B
net_io_counters()
sensors_fans()
sensors_temperatures()
users() A A A A A A
WinService.binpath() B
WinService.description() B,C
WinService.display_name() B,C
WinService.name()
WinService.status()
WinService.username() B

Right now there are 3 distinctive problems about it.

Filesystem or locale encoding?

First problem is that the C extension currently uses 2 approaches when it comes to decode and return a string:

  • PyUnicode_DecodeFSDefault
    • PyUnicode_Decode(Py_FileSystemDefaultEncoding, "replace") on Python 2 (kinda equivalent)
  • PyUnicode_DecodeLocale (Python 3 only)

Most of the times we use PyUnicode_DecodeFSDefault but not always. First issue, then, is to figure out which APIs should use one or the other. It appears clear that PyUnicode_DecodeFSDefault should be used for all fs-related APIs such as process exe(), open_files() etc. It is less clear when to use PyUnicode_DecodeLocale. To my understanding maybe we should use it for things such as:

  • WindowsService.description()
  • WindowsService.display_name()

...and maybe (but less likely) for:

  • Process.username()
  • users()

UPDATE: decided it's better for the user to deal with one encoding only (filesystem) and not think about what API he/she is using

Error handling

Second question is what to do in case the string cannot be correctly decoded.

About FS APIs

Right now we tend to use "surrogateescape", which is also the default for PyUnicode_DecodeFSDefault on Python 3, so I'm pretty sure for fs-related paths we should do this every time we have the chance (on Python 3 at least).

Note: on Windows the default is "surrogatepass" (py 3.6) or "replace" as per PEP-529.

It must be noted that AFAIK on Python 2 the os module has no fs-APIs returning a string (e.g. os.listdir()) which may crash with UnicodeDecodeError so we should do the same and use "replace". There are already some tests for this, see see test_unicode.py).

About other APIs

Shall we use "strict" (and raise exception) or "surrogateescape"? Not sure.

Python 2 vs. 3

And here comes the troubles. Whereas it appears kind of clear what to do on Python 3, Python 2 is different. In order to attempt to correctly handle and represent all kind of strings on Python 2 we should return... well, unicode instead of str, but I don't want to do that, and neither have APIs which return two different types depending on the circumstance. Since unicode support is already broken in Python 2 and its stdlib (see bpo-18695) I'm happy to always return str, use "replace" error handler and consider unicode support in psutil + python 2 broken (EDIT: it turns out it's not as you can retrieve the correct string by doing unicode(proc.exe(), sys.getdefaultencoding(), errors="replace")).

There's still the question about when to use PyUnicode_DecodeFSDefault and (a variant of) PyUnicode_DecodeLocale but on Python 2 this is less important as unicode handling is broken anyway.

Summary / TODO

Python 3

  • figure out whether / when to use PyUnicode_DecodeFSDefault and PyUnicode_DecodeLocale
  • figure out the default error handler for PyUnicode_DecodeLocale (if used)
  • discover APIs which do not use PyUnicode_DecodeFSDefault and as such may crash

Python 2

  • same as above but never fail with UnicodeDecodeError in case of PyUnicode_DecodeLocale
  • never return unicode, always str and have tests for it
@giampaolo
Copy link
Owner Author

Note: stdlib pwd.getpwall() which lists users uses PyUnicode_DecodeFSDefault, not PyUnicode_DecodeLocale.

giampaolo added a commit that referenced this issue May 1, 2017
… 2; also get rid of the pstuil_ prefix and use the original Python names
@btimby
Copy link

btimby commented May 1, 2017

I have a couple of things for your consideration. I am no expert on this, these encoding issues are problematic. I think you really have several problems to deal with.

  1. Interpreting the strings you get from the underlying libraries you are using (C extension code).
  2. Providing those strings to consumers of psutil so that they can use them properly.
  3. Accepting strings from consumers for things like filtering or getting a process by name etc.

Of course, problems 2 and 3 are trivially solved once the harder problem 1 is addressed. So what you really want is a way to ensure that strings are always represented using the same method throughout psutil. This means you want to encode/decode strings "at the edge" of your code base. You probably want to deal with them internally as unicode (or utf-8, which would involve re-encoding).

Exhibit.

Take for example the re module. You can use either bytes or unicode for patterns, but it requires that the data being processed uses the same representation.

>>> import re
>>> bs = b'Bytes data.'
>>> us = u'Unicode data.'
>>> re.search(b'data', bs)
<_sre.SRE_Match object; span=(6, 10), match=b'data'>
>>> re.search(b'data', us)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/re.py", line 173, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a bytes pattern on a string-like object

This is convenient because I control both parameters. This leaves the encoding/decoding issues in my hands.

Now let's take a look at a fake example involving psutil. Let's say that there is a function in psutil that allows me to check if a process with a given name exists get_process(name). Now let's suppose that the C extension code for OSX returns unicode process names but utf8 on Windows. As a user, I only control the name I wish to search for. Probably the easiest way for you to handle this is to ensure user input is immediately converted to unicode (or functions only accept unicode). Further, the C extension code must convert to unicode as soon as data is obtained from the underlying API (Win32 API or similar). Then all of your internal comparison/filtering etc. can be done implicitly since all strings are unicode.

Take a look at this piece of code:

https://github.com/giampaolo/psutil/blob/master/psutil/arch/osx/process_info.c#L180

vs.

https://github.com/giampaolo/psutil/blob/master/psutil/arch/windows/process_info.c#L683

If the first example, you are using the python version to dictate how to decode the string. But in the second, you know (because of MSDN documentation) the encoding that is in use. The second example is the pattern you want to follow. The API you rely on will dictate the encoding being used (which may be configured globally on the system). You should basically use that information to decode everything to unicode and then deal with that internally.

@giampaolo
Copy link
Owner Author

giampaolo commented May 2, 2017

Hello Ben, thanks a lot for chiming in.
First of all let me clarify that when it comes to strings psutil is all about returning information except for one API (disk_usage(path)).

You raised a very good point about re. On Python 2 disk_usage shoud behave like re and os modules, meaning it should be able to accept both bytes and strings, so that's another TODO to add to this list (I believe POSIX is fine already but not Windows).

Take a look at this piece of code:
https://github.com/giampaolo/psutil/blob/master/psutil/arch/osx/process_info.c#L180
vs.
https://github.com/giampaolo/psutil/blob/master/psutil/arch/windows/process_info.c#L683
If the first example, you are using the python version to dictate how to decode the string. But in the second, you know (because of MSDN documentation) the encoding that is in use. The second example is the pattern you want to follow.

Yes, correct, but the problem is what to do on Python 2.

The first example (PyUnicode_DecodeFSDefault) is what cPython 3 on POSIX uses all over the place, for instance for os.listdir and os.getcwd.

On Windows, cPython 3 uses PyUnicode_FromWideChar (e.g. for os.getcwd), which is also what I do in psutil. So Python 3 is easier.

Then there is Python 2. cPython 2 uses PyString_FromString all over the place, which returns a str (which are basically bytes), and PyUnicode_FromWideChar only when explicitly requested (e.g. os.getcwdu() or os.listdir(u'.')). Again, I believe this is what I should do and not return unicode (are you suggesting otherwise?). That means that on Python 2 + Windows if the C API returns unicode I should try to be consistent with non-unicode APIs and encode to str as in:

s.encode(s, encoding=sys.getfilesystemencoding(), errors='replace')`

To my understanding no information is lost (thanks to errors='replace') and on Python 2 you can still do comparisons by encoding the string back to unicode, which is supposed to be what was originally returned by PyUnicode_FromWideChar. For instance, to look for a process with a funky name() you should do:

# -*- coding: utf-8 -*-
import psutil, sys

PY2 = sys.version_info[0] == 2
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if PY2:
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
         print("process %s found" % p)

At least, this is how I think it's supposed to work hehe.

Finally there is the third and (to me) most obscure problem for non-fs APIs: locale. To my understanding any operating system has 2 encodings: fs encoding and the locale encoding (well also the terminal encoding but whatever). But what exactly is "locale encoding"? Is it the language of the system including things like windows, menus etc.? If that's the case then maybe APIs such as WindowsService.description() should use the locale encoding instead of the fs-encoding. But again, I am not sure.

giampaolo added a commit that referenced this issue May 3, 2017
Fix memory_maps() which was returning an invalid encoded path in case of
non ASCII path on both Python 2 and 3. Use GetMappedFileNameW instead of GetMappedFilenameA in
order to ask the system an actual unicode path. Also, on Windows
encode unicode back to str/bytes by using default fs-encoding + "replace"
error handler.
This paves the way for fixing other APIs in an identical manner.
@giampaolo
Copy link
Owner Author

OK, this is what I meant: ace8d28

giampaolo added a commit that referenced this issue May 3, 2017
unicode. Also fixes #1048 (host IP address was invalid).
giampaolo added a commit that referenced this issue May 3, 2017
…to return proper unicode; also return a (domain, user) tuple instead of concatenating the string in C (I feel safer)
giampaolo added a commit that referenced this issue May 3, 2017
giampaolo added a commit that referenced this issue May 3, 2017
giampaolo added a commit that referenced this issue May 3, 2017
giampaolo added a commit that referenced this issue May 3, 2017
giampaolo added a commit that referenced this issue May 3, 2017
@giampaolo
Copy link
Owner Author

giampaolo commented May 4, 2017

OK, I'm finally done with this (phew!).
In the end I decided to use FS encoding everywhere and leave locale alone. I think it's more convenient because as long as psutil keeps the promise to return fs-encoded strings everywhere the user has to think about and deal with 1 encoding only instead of 2.

Anyway, here's what changed with #1052:

  • all strings use the FS encoding (never locale)
  • unicode is now correctly supported on Windows (no corrupted data is returned) by using specific unicode Windows APIs
  • str type is always returned on both python 2 and 3, never unicode
  • use "surrogateescape" err handler on UNIX and "replace" on Windows so no API call is supposed to crash with UnicodeDecodeError and also information is not lost in case of corrupted data

I think this is the best compromise in terms of usability for both Python 2 and 3.

@btimby thanks again for chiming in

@FJplant
Copy link

FJplant commented May 4, 2017

This is really awesome...

Do you have i18n support? I would like to translate messages to Korean, Russian and Chinese...

@giampaolo
Copy link
Owner Author

Well, AFAIK internationalization has to be implemented in your app (by you).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants