Handle non-UTF8 text file encodings for European language support #298

lemig · 2025-01-30T10:31:39Z

Description

When processing text files in the document conversion pipeline, we currently assume UTF-8 encoding, which fails for files that use other character encodings common in European languages.

Current Behavior

- The system assumes UTF-8 encoding for all text files
- Files with other encodings (e.g., Windows-1252, ISO-8859-1) fail with UnicodeDecodeError

Example error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 1160: invalid start byte
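
For context, byte 0x93 is the left curly double quote in Windows-1252, so a file saved in that encoding is a likely (though unconfirmed) source of this error. A minimal reproduction, using a made-up sample string rather than the actual file:

```python
# Hypothetical sample: 0x93 and 0x94 are curly double quotes in Windows-1252,
# but 0x93 is not a valid start byte in UTF-8.
raw = b"He said \x93hello\x94"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    error_message = str(exc)  # same shape as the error reported above

# The same bytes decode cleanly as Windows-1252.
decoded_cp1252 = raw.decode("cp1252")
```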

Use Case Context

- Working with digital forensics in a European agency
- Need to handle documents in various European languages
- Files may use different character encodings depending on their origin and creation time

Proposed Solution

Implement a more robust text file reading mechanism that:

- Attempts to read files with multiple common European encodings:
  - UTF-8
  - Windows-1252 (cp1252)
  - ISO-8859-1 (also known as Latin-1; the two names refer to the same encoding)
- Provides a fallback mechanism for unrecognized encodings
- Optionally uses a character encoding detection library (such as chardet)
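
A rough sketch of what this could look like, not necessarily what the eventual fix does. The function names, the encoding order, and the lossy last-resort behavior are all assumptions made for illustration:

```python
from pathlib import Path

# Hypothetical default order. Strict UTF-8 goes first because it rejects most
# non-UTF-8 input. ISO-8859-1 maps every byte value, so it never fails and
# only makes sense as the last candidate.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "iso-8859-1")


def decode_with_fallback(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort if every candidate failed: keep going with U+FFFD
    # replacement characters instead of raising.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_text_with_fallback(file_path, encodings=CANDIDATE_ENCODINGS):
    """Read a file as bytes, then decode it with decode_with_fallback."""
    return decode_with_fallback(Path(file_path).read_bytes(), encodings)
```

One caveat with this approach: Windows-1252 accepts almost any byte sequence, so a wrong guess shows up as garbled characters rather than an exception. If a detection library is added, a guess from something like `chardet.detect(raw)` could be tried before the fixed list, with the list kept as the fallback.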

Technical Details

Current problematic code:

server/app/routes/convert.py (177-178)

# For txt files, just read the content
with open(file_path, 'r', encoding='utf-8') as f:
    content = f.read()

lemig mentioned this issue on Jan 30, 2025, in "Allow import of text files in non utf-8 encoding" #299 (merged).

shreyashankar closed this as completed in #299 on Jan 31, 2025.
