❇️ Improve chunk extraction with multi-byte string #95

Ousret · 2021-09-14T17:45:24Z

Second attempt following draft #90

Abstract

When charset_normalizer measures the mess/coherence of a byte sequence for a given encoding, it extracts small chunks of data.
Sometimes the split lands on the wrong start index, in the middle of a multi-byte character, so the decoded str for that chunk is gibberish.

This PR addresses that problem by trying to find the appropriate start index.
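
To illustrate the failure mode, here is a minimal sketch of one way to recover from a bad split. It is not the code from api.py: the function name `extract_chunk`, the `max_backtrack` parameter, and the 16-character probe are assumptions made for this example. The idea is to decode the chunk leniently, check whether its first characters appear in the fully decoded payload, and if they do not, move the start index back one byte at a time until they do.

```python
def extract_chunk(
    payload: bytes,
    decoded: str,
    start: int,
    size: int,
    encoding: str,
    max_backtrack: int = 3,  # hypothetical limit, chosen for illustration
) -> str:
    """Decode payload[start:start + size], shifting the start index
    backwards when the slice appears to begin mid-character."""
    end = start + size
    chunk = payload[start:end].decode(encoding, errors="ignore")

    # A correctly aligned chunk should appear verbatim in the full text.
    # Only the first characters are probed to keep the check cheap.
    if chunk[:16] in decoded:
        return chunk

    # Try earlier start indexes, never going below the payload start.
    lowest = max(start - max_backtrack, 0)
    for j in range(start - 1, lowest - 1, -1):
        candidate = payload[j:end].decode(encoding, errors="ignore")
        if candidate[:16] in decoded:
            return candidate

    # No better alignment found, fall back to the original chunk.
    return chunk


text = "hello world"
raw = text.encode("utf-16-le")

# Starting at byte 1 splits every 2-byte code unit in half.
print(raw[1:9].decode("utf-16-le", errors="ignore"))  # CJK gibberish
print(extract_chunk(raw, text, 1, 8, "utf-16-le"))    # prints "hell"
```

UTF-16-LE is used here only because every character is exactly two bytes, which makes the misalignment easy to follow. The substring probe can in principle match by coincidence, so this should be read as a heuristic rather than a guarantee.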

Second attempt

codecov-commenter · 2021-09-14T18:04:25Z

Codecov Report

Merging #95 (d835ae2) into master (9ddd725) will increase coverage by 0.29%.
The diff coverage is 90.00%.

@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
+ Coverage   85.25%   85.54%   +0.29%     
==========================================
  Files          11       11              
  Lines        1187     1197      +10     
==========================================
+ Hits         1012     1024      +12     
+ Misses        175      173       -2

Impacted Files	Coverage Δ
charset_normalizer/api.py	`84.21% <90.00%> (+1.79%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9ddd725...d835ae2.

❇️ Improve chunk extraction with multi-byte string

72b7afc

Second attempt

Ousret added the labels bug (Something isn't working), enhancement (New feature or request), detection (Related to the charset detection mechanism, chaos/mess/coherence), and flourish (Not really needed but nice to have!) on Sep 14, 2021

Ousret mentioned this pull request Sep 14, 2021

❇️ Improve chunk extraction with multi-byte string #90

Closed

Ousret added 2 commits September 14, 2021 18:47

🎨 Reformat api.py

8a90063

✔️ Add test for the use case / trigger mb bad split on chunk extr

d835ae2

Ousret merged commit ab17ac9 into master Sep 14, 2021

Ousret deleted the bugfix-mb-chunk-extraction-renew branch September 14, 2021 18:09

Ousret mentioned this pull request Sep 14, 2021

Version 2.0.5 #98

Merged
