
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte #8

Closed
ghost opened this issue Mar 30, 2020 · 12 comments


ghost commented Mar 30, 2020

@fengxinjie
When running predict.py I get the error below.
I am using IC15.pth and the image below, along with resnet101.pth from https://download.pytorch.org/models/resnet101-5d3b4d8f.pth

(mben) home@home-desktop:~/p13/Transformer-OCR$ python predict.py 
/home/home/p13/Transformer-OCR/model.py:255: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  nn.init.xavier_uniform(p)
Traceback (most recent call last):
  File "predict.py", line 81, in <module>
    do_folder('./images/1.jpg')
  File "predict.py", line 66, in do_folder
    for line in open(root).readlines():
  File "/home/home/anaconda3/envs/mben/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The image I am trying to predict: [attached image, 1.jpg]

@decoder746

From where did you get the "IC15.pth" file?

ghost commented Mar 30, 2020

@decoder746 were you able to train a new model?

decoder746 commented Mar 30, 2020

As for your problem: do_folder(root) requires as input a text file with one image path and label per line, not an image. Passing './images/1.jpg' makes Python try to read the JPEG bytes as UTF-8 text, which is why it fails on the 0xff byte (the first byte of a JPEG file). To get a prediction for a single image, you can use the following function instead of do_folder:

def do_image(img): 
    img = cv2.imread(img)
    img = resize(img) / 255.
    img = np.transpose(img, (2, 0, 1))
    img = torch.from_numpy(img).float().unsqueeze(0).cuda()
    pred = greedy_decode(img)
    print(pred)
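
For example, with the model and helpers from predict.py already set up, you can call it on the image from your traceback (a minimal usage sketch):

do_image('./images/1.jpg')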

@decoder746

@deepseek I thought it was the pretrained model given by the author

ghost commented Mar 30, 2020

It is.
Download the repo:
https://github.com/fengxinjie/Transformer-OCR/tree/76c321fb89be51c1718b98f5c5c446633614f97b

then

cd checkpoints && cat IC1500 IC1501 > IC15.zip && unzip IC15.zip
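
As a quick sanity check that the reassembled checkpoint is valid, you can try loading it into the model (a minimal sketch, assuming the unzipped file is named IC15.pth and using make_model/char2token from this repo, as in predict.py below):

import torch
from model import make_model
from dataset import char2token

# Build the model with the repo's vocabulary size and load the reassembled weights on CPU.
model = make_model(len(char2token))
state = torch.load('IC15.pth', map_location='cpu')
model.load_state_dict(state)
print('checkpoint loaded, total parameters:', sum(p.numel() for p in model.parameters()))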

ghost commented Mar 30, 2020

@decoder746
Please upload your modified predict.py so I can run it.
Also, from where did you download resnet101.pth?

@decoder746

I downloaded resnet101.pth from https://download.pytorch.org/models/resnet101-5d3b4d8f.pth
My predict.py is below, but its predictions seem to be completely off (this might be the wrong way to do it).

import torch
from torch.autograd import Variable
import numpy as np
from model import make_model
from dataset import vocab, char2token, token2char
from dataset import subsequent_mask
import cv2
import sys, os

# Build the model with the repo's vocabulary and load the IC15 checkpoint.
model = make_model(len(char2token))
model.load_state_dict(torch.load('IC15.pth'))
model.cuda()
model.eval()
# All-ones source mask of shape [1, 1, 36] (np.bool is deprecated; plain bool works).
src_mask = Variable(torch.from_numpy(np.ones([1, 1, 36], dtype=bool)).cuda())
SIZE = 96  # images are resized and padded to SIZE x SIZE in resize() below

def greedy_decode(src, max_len=36, start_symbol=1):
    global model
    global src_mask
    # Encode the image once, then decode one character at a time.
    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).long().cuda()  # start token
    for i in range(max_len-1):
        out = model.decode(memory, src_mask,
                           Variable(ys),
                           Variable(subsequent_mask(ys.size(1))
                                    .long().cuda()))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)  # greedy: take the most likely token
        next_word = next_word.data[0]
        ys = torch.cat([ys,
                        torch.ones(1, 1).long().cuda().fill_(next_word)], dim=1)
        if token2char[next_word.item()] == '>':  # '>' marks end of sequence
            break
    ret = ys.cpu().numpy()[0]
    out = [token2char[i] for i in ret]
    out = "".join(out[1:-1])  # drop the start and end tokens
    return out

def resize(img):
    # Resize keeping the aspect ratio, then zero-pad to a SIZE x SIZE square.
    h, w, c = img.shape
    if w > h:
        nw, nh = SIZE, int(h * SIZE/w)
        if nh < 10 : nh = 10
        #print(h, w, nh, nw)
        img = cv2.resize(img, (nw, nh))
        a1 = int((SIZE-nh)/2)
        a2= SIZE-nh-a1
        pad1 = np.zeros((a1, SIZE, c), dtype=np.uint8)
        pad2 = np.zeros((a2, SIZE, c), dtype=np.uint8)
        img = np.concatenate((pad1, img, pad2), axis=0)
    else:
        nw, nh = int(w * SIZE/h), SIZE
        if nw < 10 : nw = 10
        #print(h, w, nh, nw)
        img = cv2.resize(img, (nw, nh))
        a1 = int((SIZE-nw)/2)
        a2= SIZE-nw-a1
        pad1 = np.zeros((SIZE, a1, c), dtype=np.uint8)
        pad2 = np.zeros((SIZE, a2, c), dtype=np.uint8)
        img = np.concatenate((pad1, img, pad2), axis=1)
    return img

def do_folder(root):
    # root is a text file where each line is "image_path<TAB>label".
    hit = 0  # counts mismatches, so the printed ratio is the error rate
    all = 0
    for line in open(root).readlines():
        all += 1
        imp, label = line.strip('\n').split('\t')
        img = cv2.imread(imp)
        img = resize(img) / 255.
        img = np.transpose(img, (2, 0, 1))  # HWC -> CHW
        img = torch.from_numpy(img).float().unsqueeze(0).cuda()
        pred = greedy_decode(img)
        if pred != label:
            hit += 1
            print('imp:', imp, 'label:', label, 'pred:', pred, hit, all, hit/all)
    print(hit, all, hit/all)

def do_image(img):
    # Predict the text in a single cropped image file.
    img = cv2.imread(img)
    img = resize(img) / 255.
    img = np.transpose(img, (2, 0, 1))
    img = torch.from_numpy(img).float().unsqueeze(0).cuda()
    pred = greedy_decode(img)
    print(pred)

if __name__ == '__main__':
    do_image("img88.jpg")
    # do_folder('your-test-lines')

@ghost
Copy link
Author

ghost commented Mar 30, 2020

Hmmmm...
What about training?
For train.py, what should be the structure of the list for your-train-lines?

@decoder746

The your-train-lines file consists of lines where each line has the structure image_path \t label \n, as can be seen in the comments in dataset.py under __getitem__; a sketch of the format follows below. If there are one or more such files, you have to pass them as a list of files. I don't know how to handle the case where one image has more than one label (some text written on the top and some on the bottom).
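
A minimal sketch of what such a training-lines file and its parsing could look like (hypothetical file name, paths, and labels; the real parsing is in dataset.py under __getitem__):

# your-train-lines: one "image_path\tlabel" pair per line, for example
#   ./crops/word_001.jpg\tHELLO
#   ./crops/word_002.jpg\tWORLD
with open('your-train-lines', encoding='utf-8') as f:
    for line in f:
        imp, label = line.strip('\n').split('\t')
        print(imp, label)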

ghost commented Mar 31, 2020

@decoder746 that's why I am asking, since it's not clear.
But according to the developer, it seems he cropped the words and then trained with the image \t label structure.
If you can give it a try, that would be great. Though I am still sceptical, since the developer "stated" having a very high accuracy rate, and he is neither replying back nor disclosing any sort of training documentation.

ghost closed this as completed Mar 31, 2020
@gussmith

@deepseek

cd checkpoints && cat IC1500 IC1501 > IC15.zip && unzip IC15.zip

How did you know to do this? (It works, by the way, and the checkpoint loads without complaining.)
Is splitting the model across multiple files some kind of convention to work around file size limits on some sites?

@gussmith

@fengxinjie
When running predict.py I get the error below.
I am using IC15.pth and the image below, along with resnet101.pth from https://download.pytorch.org/models/resnet101-5d3b4d8f.pth

Use this to strip the hidden symbol (the byte-order mark) at the start of the file:

            lines = open(f,encoding='utf-8-sig').readlines()
            self.lines += [i for i in lines if not illegal(i.strip('\n').split(', ')[1].strip('"'))]
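
For context, here is a small self-contained demonstration of that hidden symbol being stripped: a file written with a UTF-8 byte-order mark keeps an invisible '\ufeff' prefix when read as plain utf-8, while utf-8-sig removes it (a sketch with a made-up file name and label):

import codecs

# Write a label file that starts with a UTF-8 BOM, as some editors do.
with open('labels_with_bom.txt', 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'img88.jpg\tHELLO\n')

# Plain utf-8 keeps the BOM as an invisible '\ufeff' on the first line...
print(repr(open('labels_with_bom.txt', encoding='utf-8').readline()))
# -> '\ufeffimg88.jpg\tHELLO\n'

# ...while utf-8-sig strips it, so the first path parses cleanly.
print(repr(open('labels_with_bom.txt', encoding='utf-8-sig').readline()))
# -> 'img88.jpg\tHELLO\n'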
