Could not decode to UTF-8 column 'json_data' with text .... #1

Open
dpieski opened this issue Dec 21, 2023 · 0 comments · May be fixed by #2


dpieski commented Dec 21, 2023

I am getting the above error message.

Log:

 [ADMIN ] Starting user script with executable='/sist2-admin/scripts/test/run.sh', index_path='/sist2-admin/scan-TEST_v2-2023-12-11 15:27:05.254099.sist2', extra_args=''
 [INFO ] Instantiating the Index...
 [INFO ] Iterating through the documents...
 [INFO ] Could not decode to UTF-8 column 'json_data' with text '{"extension":"pdf","name":"DOC NAME","path":"PATH/TO/DOC
 [INFO ] [ERROR] Something went wrong with the doc loop!
 [INFO ] Finished Processing 5040 documents.

User Script:

import sys

from sist2 import Sist2Index

print("Instantiating the Index...")
index = Sist2Index(sys.argv[1])

print("Iterating through the documents...")
docs = 0

try:
    for doc in index.document_iter():
        docs += 1

except Exception as error:
    print(error)
    print("[ERROR] Something went wrong with the doc loop!")

print("Finished Processing %d documents." % docs)

I am not sure where the non-UTF-8 data is coming from. The document identified is a PDF, and it includes pages that were OCR'd during the scan, so maybe it came in through that?

I am unsure how to identify the character that is causing the issue. The SQLite reader I have used seems to handle it gracefully.
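
One possible way to locate the offending bytes is to open the index file directly with Python's `sqlite3` module and set `text_factory = bytes`, so TEXT values come back undecoded and can be checked by hand. This is only a minimal sketch: the helper name `find_undecodable_rows` is hypothetical, and the table name `document` is an assumption about the index schema (only the `json_data` column name comes from the error message). It also assumes the table is an ordinary rowid table.

```python
import sqlite3


def find_undecodable_rows(db_path, table="document", column="json_data"):
    """Return (rowid, byte_offset, context_bytes) for each value that is not valid UTF-8.

    The default table name is an assumption; adjust it to the real schema.
    """
    conn = sqlite3.connect(db_path)
    # Return TEXT values as raw bytes so sqlite3 does not try to decode them.
    conn.text_factory = bytes
    bad = []
    try:
        # Names are interpolated into the query, so pass only trusted values.
        query = f'SELECT rowid, "{column}" FROM "{table}"'
        for rowid, raw in conn.execute(query):
            if not isinstance(raw, bytes):
                continue  # NULL or a non-text value
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # Keep a few bytes on either side of the first invalid byte.
                context = raw[max(0, err.start - 20): err.end + 20]
                bad.append((rowid, err.start, context))
    finally:
        conn.close()
    return bad


# Example usage (the path would be the index file passed to the user script):
# for rowid, offset, context in find_undecodable_rows(sys.argv[1]):
#     print(rowid, offset, context)
```

Printing the context as a bytes literal shows the invalid bytes as `\x..` escapes, which should make it easier to tell whether they sit inside the OCR'd text.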

simondmorias linked a pull request Aug 9, 2024 that will close this issue