Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failures due to .es file name suffix for text/javascript file #75

Closed
maxnikulin opened this issue Jan 25, 2024 · 24 comments
Closed
Labels
enhancement New feature or request

Comments

@maxnikulin
Copy link
Contributor

The following tests may fail:

  • tests.test_scrapbook_indexer.TestUnSingleHtmlConverter.test_rewrite_svg
  • tests.test_scrapbook_indexer.TestUnSingleHtmlConverter.test_rewrite_svg_file
-   <script href="313d6864fa48b411d082f7692efd0c0892788fc4.es"></script>
?                                                          ^
+   <script href="313d6864fa48b411d082f7692efd0c0892788fc4.js"></script>
?                                                          ^

Originally I mentioned it in #34 (comment)

I think the origin of the issue is

In [1]: import mimetypes

In [2]: mimetypes.guess_extension("text/javascript")
Out[2]: '.es'

due to

grep -w es /etc/mime.types 
text/javascript                                 es js mjs

It is specific to Debian Linux and its derivatives
https://salsa.debian.org/debian/media-types/-/commit/f67681aabbce05378e0ac45f92d398f8b7a31d5e
while RedHat-based distributions (Fedora, etc.) are not affected
https://pagure.io/mailcap/blob/master/f/mime.types#_1990

While I believe that it was oversight of the Debian media-types project to put es suffix before js, PyWebScrapBook should either explicitly set text/javascript to .js mapping or should allow .es in tests.

@danny0838 danny0838 added the enhancement New feature or request label Jan 25, 2024
@maxnikulin
Copy link
Contributor Author

I have filed https://bugs.debian.org/1061477 "Use js as first file name suffix for text/javascript", however Debian 12 bookworm (current stable release) is affected, so a workaround at the side of applications is desired.

@danny0838
Copy link
Owner

danny0838 commented Jan 25, 2024

Thank you for the information and report.

During the investigation I found something bizarre. According to Python source code, mimetypes.init should generate an internal database that:

  1. loads the default MIME types map defined in the source code
  2. loads MIME types map from Windows registry (if available)
  3. loads MIME types map from the defined config files (mimetypes.knownfiles) which resides in most *nix devices (whenever available)

When loading any of the above map, mimetypes.MimeTypes.add_type is called to add a new MIME->ext and ext->MIME mapping to the database for each entry. According to the source code, an ext->MIME mapping is last-win, while a MIME->ext mapping is first-win, and therefore the system MIME types definition should not be able to overwrite the default text/javascript->.js mapping in the source code.

This seems to be true for Python 3.12 to 3.7, and I currently cannot figure out why your system doesn't do so...

Can you provide your Python implementation for further investigation? (or better, check the source code)

@maxnikulin
Copy link
Contributor Author

Prior to Python 3.12 it was application/javascript, not text/javascript and Debian 12 bookworm was released with Python 3.11

I was disappointed by first win vs. last win discrepancy as well. My first idea was to suggest set_default instead of assignment for extension to media type mapping. It gives some consistency, but unfortunately it would be impossible to override mappings by e.g. calling mimetypes.init with an application-specific file.

Perhaps it is possible to extend mimetypes.add_type() and its callers to specify whether provided info should be treated with higher priority or as a fallback. For completeness remove methods for types and extensions should be added as well. It is not clear whether earlier or later files in the list should be treated with higher priority. Likely earlier ones. Some applications have ~/.mime.types before /etc/mime.types.

@danny0838
Copy link
Owner

danny0838 commented Jan 25, 2024

Sorry, I did not check versions < 3.12 carefully enough and failed to notice that it's application/javascript.

I wonder the last-win and first-win behavior is implemented as how most mime.types like files work. Unfortunately the native mimetypes module does not provide a way to modify the already added MIME->ext mapping. Although we can subclass the mimetypes.MimeTypes class and generate our own db, it's not safe enough to do so as the underlying packages such as flask and werkzeug may still rely on the native mimetypes.

I implemented a quick patch that may have fixed the issue, and feel free to check if it does the job. Though it's still suboptimal as it accesses the hidden mimetypes._db, which is not part of standard library and is at higher risk of breakage in future Python releases.

@maxnikulin
Copy link
Contributor Author

Danny, have you considered the following workaround?

  • Let's do not care concerning Python built-in mapping. It may be obsolete, but it should not hamper since it is not incorrect.
  • Provide a package-specific mime.types file with entries currently present in webscrapbook._polyfill.mimetypes. If this file is prepended to mimetypes.knownfiles (a part of the public API) then, unlike mimetypes.add_type() calls, it gets precedence in comparison to /etc/mime.types.
  • Read the file once more after standard knownfiles for the sake of last win overrides.

I have tried the following:

#!/usr/bin/env python3
import mimetypes
import os.path

user_mime_types = os.path.join(os.path.expanduser('~'), '.mime.types')
pkg_mime_types = os.path.join(os.path.dirname(__file__), 'mime.types')


def _init(files):
    mimetypes.knownfiles[:] = files + mimetypes.knownfiles.copy() + list(reversed(files))
    mimetypes.init()


_init([user_mime_types, pkg_mime_types])

if __name__ == '__main__':
    import sys
    for typ in sys.argv[1:]:
        all = mimetypes.guess_all_extensions(typ)
        print(f'{typ}: {all}')
        for ext in all:
            name = 'stub' + ext
            print(f'{ext}: {mimetypes.guess_type(name)}')

It reports

text/javascript: ['.es', '.js', '.mjs']
.es: ('text/javascript', None)
.js: ('text/javascript', None)
.mjs: ('text/javascript', None)

However with mime.types as

text/javascript  js mjs
# Obsolete
text/ecmascript  es
application/x-javascript  js mjs
application/x-ecmascript  es
application/javascript  js mjs
application/ecmascript  es
text/x-javascript  js mjs
text/x-ecmascript  es
# Duplicate primary records.
# Python mimetypes last win idiosyncrasy for extension to type mapping.
text/ecmascript  es
text/javascript  js mjs

it gives desired result

text/javascript: ['.js', '.mjs', '.es']
.js: ('text/javascript', None)
.mjs: ('text/javascript', None)
.es: ('text/ecmascript', None)

So the only complication is shipping and extra resource file.

@danny0838
Copy link
Owner

danny0838 commented Jan 26, 2024

There are already known issues in the Python default mapping and Windows registry, which are loaded before the knownfiles. Although they are ext=>MIME cases and can be overwritten, we can't guarantee there won't be a MIME=>exe case in the future. The solution shoild be able to cover such potential cases, I think.

@maxnikulin
Copy link
Contributor Author

There are already known issues in the Python default mapping and Windows registry, which are loaded before the knownfiles.

Could you, please, be more specific? Do you mean the following?

I have reread mimetypes docs. There is the types_map dictionary that is applied in the MimeTypes constructor, so before Windows registry. So the following should be reliable enough while using public API only. Perhaps there are a couple of media types with several extensions that may be treated specially, but it is just a several lines of code.

types_map_overrides = {
    '.js': 'text/javascript',
    '.es': 'text/ecmascript'
}


def _init2(files=None, overrides=None):
    # First wins for media type to extension mapping.
    if overrides:
        mimetypes.types_map.update(overrides)
    # Load Windows registry and POSIX /etc/mime.types.
    mimetypes.init()
    if files:
        mimetypes.init(files)
    # Ensure stable mapping of extension to media type.
    if overrides:
        for ext, typ in overrides.items():
            mimetypes.add_type(typ, ext)


_init2([user_mime_types], types_map_overrides)

It seems, first wins for media type to file name suffix is a bug:
mimetypes.init(files=None)

Each file named in files or knownfiles takes precedence over those named before it.

@danny0838
Copy link
Owner

danny0838 commented Jan 27, 2024

I'm considering the similar approach. See the latest commits.

@danny0838
Copy link
Owner

danny0838 commented Jan 27, 2024

There are already known issues in the Python default mapping and Windows registry, which are loaded before the knownfiles.

Could you, please, be more specific? Do you mean the following?

See the code comments, some patches are explicitly for such issues.

It seems, first wins for media type to file name suffix is a bug: mimetypes.init(files=None)

Each file named in files or knownfiles takes precedence over those named before it.

I'm not very sure about what "those named before it" means. But I think raising an issue to the Python developers for this should be OK.

@maxnikulin
Copy link
Contributor Author

Danny, f47648f breaks tests because '.js': 'application/javascript' from mimetypes.types_map in Python-3.11 overrides '.js': 'text/javascript' in patch_types_map from _polyfill.

@danny0838
Copy link
Owner

Danny, f47648f breaks tests because '.js': 'application/javascript' from mimetypes.types_map in Python-3.11 overrides '.js': 'text/javascript' in patch_types_map from _polyfill.

Have you tested it? This shouldn't happen as it will be overwritten again by the add_type.

@danny0838
Copy link
Owner

Unfortunately, there is an issue for the "patch default types_map" approach: This approach won't work if mimetypes.inited==True, which can happen during the init of a library. E.g. In Python < 3.9, import flask imports http.server, which triggers mimetypes.init() during the initialization of SimpleHTTPRequestHandler.

Another problem is how should we deal with the user mime config. Should we allow it to overwrite our patch for native Python and OS configs? Should we allow it to overwrite MIME=>ext mapping as the last-win manner?

Applying the patch and user config when mimetypes.inited==True may be unsafe, as the super script/project may have already configured and used the mimetypes some other where, and an inconsistency of the behavior of WSB and its super script/project may happen.

@danny0838
Copy link
Owner

danny0838 commented Jan 29, 2024

Reworked, see the latest commits. It seems that patching the internal _db is still the easiest and safest way to do the job. Please check if it works on your device.

@maxnikulin
Copy link
Contributor Author

The rabbit hole is deeper than I expected.

I do not think you need tricks with importlib.reload(). Have you seen the following?
mimetypes.init(files=None)

If files is None the internal data structure is completely rebuilt to its initial default value.

(A side note. I do not like the init() interface though. I am considering discussion of an alternative for python/cpython#68715 on python-ideas.)

I still prefer the types_map approach. I believe, it is better to ensure loading _polyfill.mimetypes before flask. Even with your hack with _db patching, you likely want to patch http.server.HTTPSimpleRequestHandler.extensions_map (when this attribute is present) as well.

Another problem is how should we deal with the user mime config. Should we allow it to overwrite our patch for native Python and OS configs? Should we allow it to overwrite MIME=>ext mapping as the last-win manner?

In my opinion, ideally the priorities should be the following (from higher to lower):

  • WSB_USER_DIR/mime.types;
  • webscrapbook internal mappings;
  • ~/.mime.types that is currently ignored, but used by enough application on Linux, mostly with text terminal UI;
  • system files and registry.

As a workaround for the first wins bug in the case of media types to extension mapping, perhaps it is possible to use mimetypes.types_map.update() calls with result of mimetypes.read_mime_types().

@danny0838
Copy link
Owner

danny0838 commented Jan 29, 2024

First all all: do you confirm whether the current commit works on your device?

I do not think you need tricks with importlib.reload(). Have you seen the following? mimetypes.init(files=None)

If files is None the internal data structure is completely rebuilt to its initial default value.

I forgot to revise the related code comments, which is for the previous "patch default types_map" implementation, which permanently changes the default map and can not be reset without reloading the module, and is no longer valid for the current "patch _db" implementation.

However, reloading the module is still required since in Python < 3.7.5 there are no default maps and any loaded config files permanently change the default maps, as well as mimetypes.init() and mimetypes.MimeTypes().

(A side note. I do not like the init() interface though. I am considering discussion of an alternative for python/cpython#68715 on python-ideas.)

I don't know what is the exact issue you care about, before you elaborate it. The referenced issue is so old and no longer valid since Python 3.7.5, I think.

I still prefer the types_map approach. I believe, it is better to ensure loading _polyfill.mimetypes before flask. Even with your hack with _db patching, you likely want to patch http.server.HTTPSimpleRequestHandler.extensions_map (when this attribute is present) as well.

It is not quite easy to ensure that all modules possibly call mimetypes.init() are not loaded before our mimetypes script. It also violates the common routine of importing native modules, 3rd party modules, and then package modules in order, which is already a de facto "standard" and has been written in some code linters , and will make code management more complicated and difficult.

Another problem is how should we deal with the user mime config. Should we allow it to overwrite our patch for native Python and OS configs? Should we allow it to overwrite MIME=>ext mapping as the last-win manner?

In my opinion, ideally the priorities should be the following (from higher to lower):

* `WSB_USER_DIR/mime.types`;

* webscrapbook internal mappings;

* `~/.mime.types` that is currently ignored, but used by enough application on Linux, mostly with text terminal UI;

* system files and registry.

As a workaround for the first wins bug in the case of media types to extension mapping, perhaps it is possible to use mimetypes.types_map.update() calls with result of mimetypes.read_mime_types().

Reading mimetypes.read_mime_types() is not a good idea as it reintroduces the default maps and overwrites our patch. One way I can think of is to create a new MimeTypes instance, clear its internal dicts, and then read the config files and merge the result dicts into the internal _db. But it's less performant and I'm kind of lazy to implement it, and not sure whether making user MIME=>ext rules go before everything is really a good way to go. Actually I even wonder whether users are so enthusiastic about customizing mime types.

@maxnikulin
Copy link
Contributor Author

First all all: do you confirm whether the current commit works on your device?

Yes, I do

Ran 1518 tests in 2.930s

OK (skipped=26)

f15fa0e 2024-01-26 00:22:37 +0800 Danny Lin: Rework mimetypes handling

However, reloading the module is still required since in Python < 3.7.5

It is sour if you still have to support 3.7. Accordingly to https://devguide.python.org/versions/ that branch reached end of life status on 2023-06-27.

@danny0838
Copy link
Owner

First all all: do you confirm whether the current commit works on your device?

Yes, I do

Ran 1518 tests in 2.930s

OK (skipped=26)

Good. We probably are going on this way.

However, reloading the module is still required since in Python < 3.7.5

It is sour if you still have to support 3.7. Accordingly to https://devguide.python.org/versions/ that branch reached end of life status on 2023-06-27.

I know this. But we would avoid dropping previously supported versions too early.

Also, currently the implementation is one-time, i.e. not inherited by new mimetypes.init(). We still need module reloading for testing different cases safely even without considering these old versions.

@danny0838
Copy link
Owner

Forgot to say, we don't need to patch http.server.HTTPSimpleRequestHandler.extensions_map because it's not used by flask/werkzeug.

@maxnikulin
Copy link
Contributor Author

Forgot to say, we don't need to patch http.server.HTTPSimpleRequestHandler.extensions_map because it's not used by flask/werkzeug.

Ack. I see usage of http.server.BaseHTTPRequestHandler in werkzeug, but not of SimpleHTTPRequestHandler, so its extension_map can be safely ignored.

@maxnikulin
Copy link
Contributor Author

However, reloading the module is still required since in Python < 3.7.5 there are no default maps and any loaded config files permanently change the default maps, as well as mimetypes.init() and mimetypes.MimeTypes().

Reloading the module may be important to test namely patching of mapping. Otherwise types_map is available since first commit python/cpython@ac8a9f3ee9 and reinit of _db should work since v2.2 python/cpython@eeee4ec#diff-aed43839a49bace08b60186baa4b27ad69ecd6b61f928bd696b4fb670750774fR190 I have not try to tun code in ancient versions though.

However it does not really matter and I agree that the approach, I suggest, is a trade-off. It allows to avoid usage private API, but the price is config files processed twice. It may be reconsidered later if another issue will be found.

@danny0838
Copy link
Owner

reinit of _db should work since v2.2 python/cpython@eeee4ec

This is not true. Try this in Python 3.7.4:

Python 3.7.4 ... on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> print(mimetypes.guess_type('123.js'))
('application/javascript', None)
>>> mimetypes.add_type('text/myjs', '.js')
>>> print(mimetypes.guess_type('123.js'))
('text/myjs', None)
>>> mimetypes.init()
>>> print(mimetypes.guess_type('123.js'))
('text/myjs', None)
Python 3.7.4 ... on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mimetypes
>>> print(mimetypes.guess_type('123.js'))
('application/javascript', None)
>>> mimetypes.add_type('text/myjs', '.js')
>>> print(mimetypes.guess_type('123.js'))
('text/myjs', None)
>>> mimetypes._db.__init__()
>>> print(mimetypes.guess_type('123.js'))
('text/myjs', None)

Apparently re-running init() or _db.__init__() cannot reset the global types_map which has been changed by add_type() or so. This is expected according to the source code before Python 3.7.5.

Also, as mentioned before: our polyfill is one-time, and reinit will remove all our patches. Although we do can rewrite the script to make it apply again during reinit, we currently don't see a strong rationale to do so, with the cost of complicating the code.

Intercepting with the hidden attribute is not uncommon for a patch or polyfill script. Would try to avoid it, but only when there's no too much extra cost.

Slightly updated the code, which should be the final version unless there is a strong real world use case that suggests a further change.

@maxnikulin
Copy link
Contributor Author

Apparently re-running init() or _db.__init__() cannot reset the global types_map which has been changed by add_type() or so. This is expected according to the source code before Python 3.7.5.

I do not see any issue for the application or if you are going to test result of _db modification. It affects only tests of procedure of _db patching.

The real issue is that behavior is different from newer versions. I do not like that there are no way to specify directly what types_map is used to initialize internal _db or a custom MimeTypes instance.

Another problem is that Python-3.7.3 does not support incremental mimetypes.init().

I do not think it is necessary to add workarounds against code invoking mimetypes.init() from other modules besides the issue with http.server. This case can be solved by calling polyfill early enough. Actually I believe, it is responsibility of an application, not of any module to eagerly update mimetypes. A module may do it lazily.

I have made another attempt to use purely public API. The result is fragile, but it seems it works in 3.11 and 3.7.3. It assumes that wsb custom mappings have lower priority than local configuration files.

class MimeTypesSafeRead(mimetypes.MimeTypes):
    """Constructor `file` argument may contain missed files"""
    def read(self, filename, strict=True):
        """Ignore inaccessible files"""
        try:
            f = open(filename, encoding='utf-8')
        except OSError:
            return None
        with f:
            self.readfp(f, strict)


def _init(files=None, overrides=None):
    """A workaround for first wins for media type to extension mapping"""
    if files is not None and not len(files):
        files = None
    if overrides is not None and not len(overrides):
        overrides = None
    if files is None and overrides is None:
        return

    override_types_map = None
    saved_types_map_default = None
    saved_types_map = None
    # `MimeTypes` constructor uses `mimetypes.types_map in Python-3.7.3
    # and `mimetypes._types_map_default` in 3.11.
    # Fragile, `mimetypes.init()` decouples `mimetypes.types_map`
    # and `mimetypes._types_map_default` e.g. in Python-3.11.
    internal_types_map = mimetypes.types_map
    try:
        saved_knownfiles = None
        try:
            saved_types_map = mimetypes.types_map.copy()
            # Build combined `types_map` for `files` and `overrides`.

            # `MimeTypes` constructor below may call `mimetypes.init()`.
            # Suppress loading of system files, it will be done later.
            # There is no way to avoid loading of Widnows registry.
            if not mimetypes.inited:
                saved_knownfiles = mimetypes.knownfiles.copy()
                mimetypes.knownfiles.clear()
                mimetypes.init()

            saved_types_map_default = internal_types_map.copy()

            # Avoid module and system defaults in the override map.
            internal_types_map.clear()
            mimetypes.types_map.clear()
            if overrides:
                internal_types_map.update(overrides)
                mimetypes.types_map.update(overrides)
            override_types_map = MimeTypesSafeRead(files or ()).types_map[True]

            # Set overrides before loading system settings.
            internal_types_map.clear()
            mimetypes.types_map.clear()
            internal_types_map.update(saved_types_map_default)
            internal_types_map.update(override_types_map)
            mimetypes.types_map.update(internal_types_map)
            if not overrides:
                override_types_map = None

        finally:
            if saved_knownfiles:
                mimetypes.knownfiles[:] = saved_knownfiles

        # Modify `knownfiles` to get at least some overrides if some code
        # will call `mimetypes.init()` later.
        if files:
            mimetypes.knownfiles.extend(files)
        # Replaces `mimetypes.types_map`.
        mimetypes.init()
        saved_types_map = None
    finally:
        # Something goes wrong. Restore initial state.
        if saved_types_map_default:
            internal_types_map.clear()
            internal_types_map.update(saved_types_map_default)
        if saved_types_map:
            mimetypes.types_map.clear()
            mimetypes.types_map.update(saved_types_map)
            mimetypes.init()

    # Widows registry or system files may overwrite entries from `overrides`.
    # Python < 3.7.5 does not support incremental `init()`
    # bpo-4963 https://github.com/python/cpython/issues/49213
    # So it is impossible to apply `overrides` in between of loading
    # of system configuration and `files`
    # and it is necessary to apply combined overrides instead.
    if override_types_map:
        for ext, typ in override_types_map.items():
            mimetypes.add_type(typ, ext)

danny0838 added a commit that referenced this issue Jan 31, 2024
- Fix an issue that native (Python and platform) MIME=>ext conversions are not patched.
- Allow user config to overwrite the patch.
- Patch mimetypes.init() to defer patching and improve the performance, and allow unittest mockings for the user config directory to work for the mimetypes module.
@danny0838
Copy link
Owner

danny0838 commented Jan 31, 2024

I'm going to end this issue. Our goal is to fix the issue of instability caused by some bad platform rules, while maintain the largest compatibility with previous versions. We are not going to complicate things up unless there is a significant real world use case. After all, the user can always fix any mimetype conversion issue by modifying the platform rules directly or by adding a little code in the launcher script (such as the app.py generated by wsb config -ba) if the user config file isn't enough.

@maxnikulin
Copy link
Contributor Author

Thank you, Danny.

My attempt to make media type to extension mapping configurable out of the box:
[Py] [Ideas] Mimetypes: Allow to override media type to file extension mapping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants