Skip to content
This repository has been archived by the owner on Apr 9, 2020. It is now read-only.
/ ocr-space Public archive

Run OCR process on pdf files via ocr.space API

Notifications You must be signed in to change notification settings

CoProcure/ocr-space

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OCR.Space

Description

Run pdf files through the OCR process, receive aggregated chunk of parsed text from an array of urls. This is useful when extracting the text of pdf documents in order to make them fully searchable by creating an AWS cloudsearch ingestable JSON document.

Accepts:

  • apiKey for your OCR.space account
  • Array of urls of publicly accessible pdf files
  • callback function

Returns:

  • JSON compatible single text string containing the parsed content from all the files sent

The ocr.space API is good at extracting text from any type of pdf file, even ones that are partially images and partially searchable text but is limited in the number of parallel requests any one endpoint called with your API Key will process. This code will spread the requests out through the list of available ocr.space endpoints as recommended by OCR.space tech support to reduce throttling errors.

Returned text has newline characters normalized and is processed with the following regex to allow it to be used safely in JSON files.

normalizeNewline(data).replace(/\n/g, " ").replace(/\"/g,'\\"').replace(/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g,'').trim();

How to use

Install the package:

npm install @coprocure/ocr-space

Use it in your node.js code:

const ocr = require('@coprocure/ocr-space')

ocr('myAPIKey',[documentURL1, documentURL2], (text) => {
  // do something with the returned text
})

We often want to run many documents through this process and the ocr.space API will throttle too many simultaneous requests so use await to avoid errors:

textContentOfAllUrlArrayPDFFiles = await new Promise((resolve, reject) => {
  ocr(myAPIKey, urlArray, (text) => {
    resolve(true);
  })
});

About

Run OCR process on pdf files via ocr.space API

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published