GitHub - TalkToMePST/Unredact: Helping out juniors

Setup

Install these two programs:
- Poppler, which pdf2image requires to convert PDFs into images for OCR processing. Extract the ZIP file (e.g., poppler-23.11.0), then add the path of its bin folder to your system PATH variable (e.g., C:\poppler-23.11.0\Library\bin).
- Tesseract, the OCR engine. Run the installer and note the installation path (default: C:\Program Files\Tesseract-OCR\). During installation, select Additional Language Data if you need it for multilingual PDFs.
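
To confirm that both programs are reachable before running the project, a quick check along the following lines may help. It is only a sketch: it assumes the executables are named pdftoppm (shipped with Poppler and used by pdf2image) and tesseract, and the helper name find_tools is made up for this example.

```python
import shutil

# Assumed executable names: pdftoppm comes with Poppler, tesseract with Tesseract.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If tesseract is reported as missing, pytesseract can instead be pointed at the executable by setting pytesseract.pytesseract.tesseract_cmd to the installation path noted above.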

Run the commands below to install libraries for the project

Create a virtual environment and activate it

python -m venv venv
.\venv\Scripts\activate

Install PyTorch for the GPU (the default install uses the CPU). The index URL below selects the CUDA 11.8 builds.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
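
To see whether the GPU build is actually in use, something like the sketch below can be run inside the virtual environment. The function name describe_torch_device is hypothetical, and the check deliberately falls back to a message instead of failing when torch is not installed.

```python
import importlib.util


def describe_torch_device():
    """Return a short description of the device PyTorch would use."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible CUDA device.
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

A result of "cpu" after installing from the CUDA index usually points to a driver or CUDA version mismatch rather than a problem with the project itself.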

Install the Python libraries

pip install numpy nltk requests beautifulsoup4 PyPDF2 torch pytesseract pdf2image tensorflow spacy

Install the spaCy model that will help us tag entities (this needs spaCy, so run it after the command above)

python -m spacy download en_core_web_sm
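
Once spaCy and the en_core_web_sm model are installed, entity tagging could look roughly like the sketch below. The helper names load_model and tag_entities and the sample sentence are illustrative assumptions, not code taken from this repository.

```python
def load_model(model_name="en_core_web_sm"):
    """Load a spaCy pipeline, or return None if spaCy or the model is missing."""
    try:
        import spacy

        return spacy.load(model_name)
    except (ImportError, OSError):
        # ImportError: spaCy is not installed; OSError: the model is not downloaded.
        return None


def tag_entities(nlp, text):
    """Return (entity text, entity label) pairs found in the text."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


if __name__ == "__main__":
    nlp = load_model()
    if nlp is None:
        print("spaCy or en_core_web_sm is not installed")
    else:
        sample = "The memo was sent from Washington to London in 1962."
        for entity, label in tag_entities(nlp, sample):
            print(f"{label}: {entity}")
```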

Obtain the API keys for the following:
1. LANGSMITH_API_KEY: create a LangSmith workspace and a project, and the constant names will be generated for you.
2. API_KEY: the key for the Google Custom Search JSON API.
3. SEARCH_ENGINE_ID: create a custom search engine and copy its ID.
Make a .env file in the root of the project folder to hold all sensitive API keys:

LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY=""
LANGSMITH_PROJECT=""
API_KEY=""
SEARCH_ENGINE_ID=""
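
As an illustration of how these values might be consumed, the sketch below parses .env-style text with the standard library only and builds a request URL for the Custom Search JSON API. The helpers parse_env and build_search_url are hypothetical; the project may well load the file differently, for example with a dotenv library.

```python
from urllib.parse import urlencode

# Endpoint of the Google Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def parse_env(text):
    """Parse KEY=value lines, tolerating spaces around '=' and surrounding quotes."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_search_url(query, env):
    """Build the search URL from API_KEY (key) and SEARCH_ENGINE_ID (cx)."""
    params = {"key": env["API_KEY"], "cx": env["SEARCH_ENGINE_ID"], "q": query}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    from pathlib import Path

    env = parse_env(Path(".env").read_text(encoding="utf-8"))
    print(build_search_url("declassified memo", env))
```

Keeping the keys in .env and out of the source files also makes it easy to exclude them from version control.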

How to

Copy the link of a document page on the CIA website, for example https://www.cia.gov/readingroom/document/51112a4a993247d4d8394487, and enter the whole link in the terminal.
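
The repository's own input handling is not shown here, but a link of this shape could be validated before use along the lines of the sketch below. The function extract_document_id is a made-up helper, and the expected host and path layout are assumptions based only on the example link above.

```python
from urllib.parse import urlparse


def extract_document_id(url):
    """Return the last path segment of a CIA Reading Room document link."""
    parsed = urlparse(url.strip())
    parts = [part for part in parsed.path.split("/") if part]
    # Assumed layout: https://www.cia.gov/readingroom/document/<id>
    if (
        parsed.netloc != "www.cia.gov"
        or len(parts) != 3
        or parts[:2] != ["readingroom", "document"]
    ):
        raise ValueError(f"not a Reading Room document link: {url!r}")
    return parts[2]


if __name__ == "__main__":
    link = input("Paste the document link: ")
    print(extract_document_id(link))
```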

Repository files:
- data
- src
- .gitignore
- README.md
- model_pipeline.ipynb
- rag.ipynb
