-
Install these 2 programs:
- Poppler, which is required by pdf2image to convert PDFs into images for OCR processing. Extract the ZIP file (e.g., poppler-23.11.0), then add the path of its bin folder to your system PATH variable (e.g., C:\poppler-23.11.0\Library\bin).
- Tesseract. Run the installer and remember the installation path (default: C:\Program Files\Tesseract-OCR\). During installation, select Additional Language Data if you need it for multilingual PDFs.
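A quick way to confirm that both programs can be found is sketched below. It uses only the standard library; the helper name `find_tool` and the two fallback folders are assumptions based on the example paths above, so adjust them to your machine.

```python
import shutil
from pathlib import Path

# Example locations taken from the steps above; adjust them to your machine.
POPPLER_BIN = Path(r"C:\poppler-23.11.0\Library\bin")
TESSERACT_DIR = Path(r"C:\Program Files\Tesseract-OCR")


def find_tool(name, fallback_dir=None):
    """Return the full path of an executable, or None if it cannot be found.

    Looks on PATH first, then in an optional fallback directory.
    """
    found = shutil.which(name)
    if found is None and fallback_dir is not None:
        found = shutil.which(name, path=str(fallback_dir))
    return found


if __name__ == "__main__":
    # pdftoppm is one of the Poppler command-line tools that pdf2image calls.
    for tool, fallback in (("pdftoppm", POPPLER_BIN), ("tesseract", TESSERACT_DIR)):
        location = find_tool(tool, fallback)
        print(f"{tool}: {location or 'NOT FOUND'}")
```

If a tool is found only in its fallback folder, it is not on PATH. In that case you can either fix PATH or point the libraries at the folders directly: pytesseract reads the executable location from `pytesseract.pytesseract.tesseract_cmd`, and pdf2image's `convert_from_path` accepts a `poppler_path` argument.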
-
Run the commands below to install libraries for the project
- Create a virtual environment and activate it
python -m venv venv
.\venv\Scripts\activate
- Install PyTorch with CUDA 11.8 support to use the GPU (the default install uses the CPU)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Install the Python libraries
pip install numpy nltk requests beautifulsoup4 PyPDF2 torch pytesseract pdf2image tensorflow spacy
- Install the spaCy model that will help us tag entities (this needs spaCy itself, so run it after the libraries are installed)
python -m spacy download en_core_web_sm
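To check that the installs above landed in the active virtual environment, a small script like the following can help. It is a sketch, not part of the project: the function names `missing_packages` and `cuda_status` are made up for illustration, and the module list simply mirrors the pip command above.

```python
import importlib.util

# Import names for the packages installed above. Some differ from their
# pip names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = [
    "numpy", "nltk", "requests", "bs4", "PyPDF2",
    "torch", "pytesseract", "pdf2image", "tensorflow", "spacy",
]


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def cuda_status():
    """Report whether PyTorch can see a CUDA GPU (imports torch only if present)."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    return "CUDA available" if torch.cuda.is_available() else "CPU only"


if __name__ == "__main__":
    print("Missing packages:", missing_packages(REQUIRED_MODULES) or "none")
    print("PyTorch:", cuda_status())
```

A "CPU only" result after installing the CUDA build usually points to a driver or CUDA version mismatch rather than a problem with the project.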
-
Obtain API keys for the following:
- LANGSMITH_API_KEY: create a LangSmith workspace and a project, and the constant names will be generated for you.
- API_KEY: for the Google Custom Search JSON API.
- SEARCH_ENGINE_ID: create a custom search engine and copy its ID.
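For orientation, here is roughly how API_KEY and SEARCH_ENGINE_ID fit together in a Custom Search JSON API request. This is a minimal sketch and may differ from how the project builds its requests; `build_search_url` is a hypothetical helper, and the credentials shown are placeholders.

```python
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, search_engine_id, query):
    """Return the request URL for one search query."""
    # "key" carries the API key, "cx" the search engine ID, "q" the query.
    params = {"key": api_key, "cx": search_engine_id, "q": query}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values; substitute the real API_KEY and SEARCH_ENGINE_ID.
    print(build_search_url("YOUR_API_KEY", "YOUR_SEARCH_ENGINE_ID", "declassified documents"))
```

Fetching that URL (for example with `requests.get(url).json()`) returns JSON whose `items` list holds the search results.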
-
Create a .env file in the root of the project folder to hold all sensitive API keys:
LANGSMITH_TRACING=true
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_API_KEY=""
LANGSMITH_PROJECT=""
API_KEY=""
SEARCH_ENGINE_ID=""
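Before running the project, it can be worth checking that no key in the .env file was left empty. The sketch below uses a small standard-library parser so it runs without extra installs; `parse_env` and `empty_keys` are illustrative names, and the project itself may load the file differently (python-dotenv's `load_dotenv` is a common choice).

```python
import os


def parse_env(text):
    """Parse KEY=value lines into a dict, ignoring blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and strip one pair of surrounding quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


REQUIRED_KEYS = ("LANGSMITH_API_KEY", "LANGSMITH_PROJECT", "API_KEY", "SEARCH_ENGINE_ID")


def empty_keys(values):
    """Return the required keys that are missing or left as empty strings."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            print("Keys still to fill in:", empty_keys(parse_env(handle.read())))
    else:
        print("No .env file found in the current directory")
```

Run it from the project root. An empty list means every required key has a value; it does not verify that the values are valid credentials.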
- Copy the page link from the CIA website, https://www.cia.gov/readingroom/document/51112a4a993247d4d8394487, and enter the whole link in the terminal.
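Since the whole link has to be entered, a truncated or mistyped URL is an easy mistake. The check below is a hypothetical sketch of how a pasted link could be validated; `is_reading_room_link` is not part of the project, and the accepted URL shape is inferred only from the example link above.

```python
from urllib.parse import urlparse

DOCUMENT_PREFIX = "/readingroom/document/"


def is_reading_room_link(url):
    """Check that a URL points at a document page in the CIA reading room."""
    parts = urlparse(url.strip())
    return (
        parts.scheme == "https"
        and parts.netloc == "www.cia.gov"
        and parts.path.startswith(DOCUMENT_PREFIX)
        # Require a document ID after the prefix.
        and len(parts.path) > len(DOCUMENT_PREFIX)
    )


if __name__ == "__main__":
    example = "https://www.cia.gov/readingroom/document/51112a4a993247d4d8394487"
    print(example, "->", "ok" if is_reading_room_link(example) else "not a document link")
```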