Breaking change in beautifulsoup4 4.13 #79

Closed
watschi opened this issue Feb 5, 2025 · 2 comments · Fixed by #80

Comments


watschi commented Feb 5, 2025

beautifulsoup4 4.13 introduces a change that breaks the text processing module at /src/commoncode/text.py; see #4129.

Starting with 4.13, as_unicode(s) returns bytes instead of str, which in turn breaks is_markup(location)/is_markup_text(text) in scancode.

From the Changelog:

  • UnicodeDammit.markup is now always a bytestring representing the
    original markup (sans BOM), and UnicodeDammit.unicode_markup is
    always the converted Unicode equivalent of the original
    markup. Previously, UnicodeDammit.markup was treated inconsistently
    and would often end up containing Unicode. UnicodeDammit.markup was
    not a documented attribute, but if you were using it, you probably
    want to switch to using .unicode_markup instead.

If UnicodeDammit(s).unicode_markup is used here instead of UnicodeDammit(s).markup, a Unicode string (str) is returned.
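A minimal sketch of the suggested switch, for illustration only. The helper name `to_text` is hypothetical and is not the actual `as_unicode` implementation in commoncode; the UTF-8 fallback for environments without bs4 is also an assumption made so that the snippet runs on its own.

```python
def to_text(s):
    """Return `s` as a str, converting bytes if needed (hypothetical helper)."""
    if isinstance(s, str):
        return s
    if not s:
        return ""
    try:
        from bs4.dammit import UnicodeDammit
    except ImportError:
        # Assumption for this sketch: without bs4, fall back to a lenient
        # UTF-8 decode instead of encoding detection.
        return s.decode("utf-8", errors="replace")
    # Per the changelog quoted above, .unicode_markup is always the converted
    # Unicode text, whereas .markup is always a bytestring as of 4.13.
    return UnicodeDammit(s).unicode_markup


result = to_text(b"<p>hello</p>")
print(type(result).__name__, result)
```

Reading `.unicode_markup` gives a `str` both before and after 4.13, so callers such as `is_markup_text(text)` would not need to care which version is installed.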

Originally posted by @watschi in #4129

@stefan6419846
Contributor

Unfortunately, this also seems to affect other functionality, such as extracting archives, and it took me quite some time to track down (output with extractcode.extract.TRACE = True):

(venv) stefan@localhost:~/tmp/license_tools$ python -c 'from extractcode.api import extract_archive; list(extract_archive("test.zip", "tmp_test_zip"))'
DEBUG:extractcode.extract:extract_file: extractor: for: test.zip with kinds: (2, 3, 4, 5, 1, 6, 7): extractcode.archive.extract_with_fallback
DEBUG:extractcode.extract:extract_file: ERROR: test.zip: ['normalize() argument 2 must be str, not bytes']
normalize() argument 2 must be str, not bytes
Traceback (most recent call last):
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/archive.py", line 416, in extract_with_fallback
    warnings = extractor1(abs_location, temp_target1)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/libarchive2.py", line 206, in extract
    _target_path = entry.write(abs_target_dir, transform_path=partial(paths.safe_path, preserve_spaces=True))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/libarchive2.py", line 447, in write
    clean_path = transform_path(self.path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/paths.py", line 45, in safe_path
    if not is_posixpath(path):
           ^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/fileutils.py", line 150, in is_posixpath
    has_slashes = "/" in location
                  ^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/extract.py", line 276, in extract_file
    warns = extractor(abs_location, tmp_tgt) or []
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/archive.py", line 423, in extract_with_fallback
    warnings = extractor2(abs_location, temp_target2)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/sevenzip.py", line 242, in extract
    return extractor(
           ^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/sevenzip.py", line 273, in extract_all_files_at_once
    rc, stdout, stderr = command.execute(**ex_args)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/command.py", line 113, in execute
    sop = text.toascii(sor).strip()
          ^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/text.py", line 113, in toascii
    converted = unicodedata.normalize("NFKD", s)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: normalize() argument 2 must be str, not bytes

(venv) stefan@localhost:~/tmp/license_tools$ python -c 'from extractcode.api import extract_archive; list(extract_archive("test.tar.gz", "tmp_test_zip"))'
DEBUG:extractcode.extract:extract_file: extractor: for: test.tar.gz with kinds: (2, 3, 4, 5, 1, 6, 7): extractcode.libarchive2.extract
DEBUG:extractcode.extract:extract_file: ERROR: test.tar.gz: ["a bytes-like object is required, not 'str'"]
a bytes-like object is required, not 'str'
Traceback (most recent call last):
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/extract.py", line 276, in extract_file
    warns = extractor(abs_location, tmp_tgt) or []
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/libarchive2.py", line 206, in extract
    _target_path = entry.write(abs_target_dir, transform_path=partial(paths.safe_path, preserve_spaces=True))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/extractcode/libarchive2.py", line 447, in write
    clean_path = transform_path(self.path)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/paths.py", line 45, in safe_path
    if not is_posixpath(path):
           ^^^^^^^^^^^^^^^^^^
  File "/home/stefan/tmp/license_tools/venv/lib/python3.12/site-packages/commoncode/fileutils.py", line 150, in is_posixpath
    has_slashes = "/" in location
                  ^^^^^^^^^^^^^^^
TypeError: a bytes-like object is required, not 'str'
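Both TypeErrors in the tracebacks above can be reproduced with the standard library alone once a bytes value reaches code that expects str. The sketch below is only an illustration: the sample value and the helper `describe_failure` are made up, and it assumes the bytes originate from the changed `as_unicode` behaviour, which the tracebacks themselves do not show.

```python
import unicodedata


def describe_failure(func, *args):
    """Call func(*args) and report whether it raised a TypeError."""
    try:
        func(*args)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


# Hypothetical archive entry path that arrives as bytes instead of str.
path_as_bytes = b"dir/file.txt"

# Same kind of check as `"/" in location` in is_posixpath().
print(describe_failure(lambda p: "/" in p, path_as_bytes))

# Same kind of call as unicodedata.normalize("NFKD", s) in toascii().
print(describe_failure(unicodedata.normalize, "NFKD", path_as_bytes))

# Once the value is decoded to str, both operations succeed.
path_as_text = path_as_bytes.decode("utf-8")
print(describe_failure(lambda p: "/" in p, path_as_text))
print(describe_failure(unicodedata.normalize, "NFKD", path_as_text))
```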

@AyanSinhaMahapatra
Member

Thanks @watschi and @stefan6419846. This was fixed by @jloehel and has since been released.

AyanSinhaMahapatra added a commit to aboutcode-org/scancode-toolkit that referenced this issue Feb 14, 2025
There was an issue in copyright scan and other places which was
caused by a breaking change in bs4, which is fixed in commoncode
with https://github.com/aboutcode-org/commoncode/releases/tag/v32.2.0

Reference: #4129
Reference: aboutcode-org/commoncode#79
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>