Description
When processing text files in the document conversion pipeline, we currently assume UTF-8 encoding, which fails for files that use other character encodings common in European languages.
Current Behavior
The system assumes UTF-8 encoding for all text files
Files with different encodings (e.g., Windows-1252, ISO-8859-1) fail with UnicodeDecodeError
Example error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 1160: invalid start byte
(Byte 0x93 is the left curly quote in Windows-1252, a strong hint that such files are cp1252-encoded rather than malformed UTF-8.)
Use Case Context
Working with digital forensics in a European agency
Need to handle documents in various European languages
Files may use different character encodings depending on their origin and creation time
Proposed Solution
Implement a more robust text file reading mechanism that:
Attempts to read files with multiple common European encodings:
UTF-8
Windows-1252 (cp1252)
ISO-8859-1 (also known as Latin-1)
Provides a fallback mechanism for unrecognized encodings
Potentially uses a character encoding detection library (like chardet); see the sketch after this list
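A minimal sketch of how these pieces could fit together. The helper name read_text_with_fallback and the encoding order are illustrative, not existing code; it assumes the chardet package is installed:
import chardet

def read_text_with_fallback(file_path):
    """Read a text file, trying several encodings before giving up."""
    with open(file_path, 'rb') as f:
        raw = f.read()
    # Try chardet's guess first, then the common European encodings.
    guess = chardet.detect(raw).get('encoding')
    candidates = ([guess] if guess else []) + ['utf-8', 'cp1252']
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # ISO-8859-1 (Latin-1) maps every byte value, so this final step cannot fail.
    return raw.decode('iso-8859-1')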
Technical Details
Current problematic code:
server/app/routes/convert.py (177-178)
# For txt files, just read the content
with open(file_path, 'r', encoding='utf-8') as f:
    content = f.read()
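With a helper along the lines of the read_text_with_fallback sketch above, those two lines could become something like:
# For txt files, read the content with encoding fallback
content = read_text_with_fallback(file_path)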