❇️ Improve chunk extraction with multi-byte string #95

Merged
@Ousret merged 3 commits into master from bugfix-mb-chunk-extraction-renew on Sep 14, 2021

@Ousret (Member) commented on Sep 14, 2021

Second attempt following draft #90

Abstract

When charset_normalizer measures the mess/coherence of a given byte sequence for a given encoding, it extracts small chunks of data.
Sometimes, due to bad luck, it can split the data at the wrong initial index, causing the decoded str for that chunk to be gibberish.

This PR addresses that problem by trying to find the appropriate start index.
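
To illustrate the failure mode, here is a minimal sketch, not the code merged in this PR. The helper `extract_chunk`, its parameters, and the 16-character probe are hypothetical names and values chosen for the example. It assumes the whole payload decodes cleanly with the candidate encoding, and it moves the start index back a few bytes until the beginning of the decoded chunk can be found in the fully decoded payload.

```python
def extract_chunk(payload: bytes, start: int, size: int, encoding: str,
                  max_backtrack: int = 3) -> str:
    """Decode payload[start:start + size], backing up the start index by up to
    max_backtrack bytes if the cut landed in the middle of a character.

    Hypothetical helper for illustration only.
    """
    # Reference decoding of the whole payload (assumed to succeed).
    decoded_payload = payload.decode(encoding)
    end = start + size
    lowest = max(start - max_backtrack, 0)

    for candidate in range(start, lowest - 1, -1):
        # errors="ignore" drops a character truncated at the end of the slice.
        chunk = payload[candidate:end].decode(encoding, errors="ignore")
        probe = chunk[:16]
        # A correctly aligned chunk must appear verbatim in the full text.
        if probe and probe in decoded_payload:
            return chunk

    # No aligned start found nearby: fall back to the naive cut.
    return payload[start:end].decode(encoding, errors="ignore")


text = "héllo wörld, " * 4
payload = text.encode("utf_16_le")  # every character here takes two bytes

# Cutting at an odd offset pairs each byte with the wrong neighbour.
naive = payload[5:25].decode("utf_16_le", errors="ignore")
fixed = extract_chunk(payload, 5, 20, "utf_16_le")

print(repr(naive))
print(repr(fixed))
```

UTF-16-LE is used here because a misaligned cut still decodes without raising an error, so the gibberish would go unnoticed without a check of this kind. The substring probe is a heuristic: a misaligned chunk could, in rare cases, still match part of the decoded payload.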

@Ousret added the bug, enhancement, detection, and flourish labels on Sep 14, 2021
@codecov-commenter commented on Sep 14, 2021

Codecov Report

Merging #95 (d835ae2) into master (9ddd725) will increase coverage by 0.29%.
The diff coverage is 90.00%.

@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
+ Coverage   85.25%   85.54%   +0.29%     
==========================================
  Files          11       11              
  Lines        1187     1197      +10     
==========================================
+ Hits         1012     1024      +12     
+ Misses        175      173       -2     

| Impacted Files | Coverage Δ |
|---|---|
| charset_normalizer/api.py | 84.21% <90.00%> (+1.79%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9ddd725...d835ae2.

@Ousret merged commit ab17ac9 into master on Sep 14, 2021
@Ousret deleted the bugfix-mb-chunk-extraction-renew branch on September 14, 2021 18:09
@Ousret mentioned this pull request on Sep 14, 2021
Labels
bug: Something isn't working
detection: Related to the charset detection mechanism, chaos/mess/coherence
enhancement: New feature or request
flourish: Not really needed but nice to have!