Vision Parse

🚀 Parse PDF documents into well-formatted Markdown using state-of-the-art Vision Language Models, all with just a few lines of code!

🎯 Introduction

Vision Parse uses Vision Language Models to turn PDF pages into structured Markdown:

📝 Smart Content Extraction: Identifies and extracts text and tables from each page with high precision
🎨 Content Formatting: Preserves document hierarchy, styling, and indentation in the Markdown output
🤖 Multi-LLM Support: Supports Vision LLMs from multiple providers, e.g. OpenAI, Meta Llama (via Ollama), and Google Gemini, so you can choose between accuracy and speed
🔄 PDF Document Support: Handles multi-page PDF documents by converting each page into a base64-encoded image
📁 Local Model Hosting: Supports locally hosted models via Ollama for secure, offline document processing
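Converting a page into a base64-encoded image means turning the rendered image bytes into ASCII text that can be embedded in an API request. The sketch below shows only that encoding step with Python's standard library. It assumes the page has already been rendered to PNG bytes by some PDF renderer (not shown), and the function names and the `data:` URL wrapper are illustrative, not Vision Parse's actual internals.

```python
import base64


def encode_page_image(png_bytes: bytes) -> str:
    """Return the page image as base64 text (ASCII-safe)."""
    return base64.b64encode(png_bytes).decode("ascii")


def to_data_url(png_bytes: bytes) -> str:
    """Wrap the encoded image in a data URL, one common way to inline images
    in vision-model requests (illustrative; each provider has its own format)."""
    return "data:image/png;base64," + encode_page_image(png_bytes)


# Stand-in for real rendered page bytes (PNG signature plus padding).
fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
encoded = encode_page_image(fake_png)

# Decoding restores the exact original bytes, so no image data is lost.
assert base64.b64decode(encoded) == fake_png
print(to_data_url(fake_png)[:40])
```

Base64 text is 4/3 the size of the input (rounded up to a multiple of 4 characters), so larger page images mean proportionally larger requests.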

🚀 Getting Started

Prerequisites

🐍 Python >= 3.9
🖥️ Ollama (if you want to use local models)
🤖 An API key for OpenAI or Google Gemini (if you want to use their hosted models)
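A quick way to check these prerequisites before running anything. This is a hypothetical helper, not part of Vision Parse. In particular, the environment variable names `OPENAI_API_KEY` and `GEMINI_API_KEY` are an assumed convention, since the examples in this README pass the key directly through the `api_key` argument.

```python
import os
import sys

# Assumed environment variable names; Vision Parse itself receives the key
# through the api_key argument, so these are only a convention for this sketch.
KEY_VARS = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def check_prerequisites(provider=None, env=None, version=None):
    """Return a list of problems; an empty list means all checks passed.

    provider=None means a local Ollama model, which needs no API key.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if tuple(version) < (3, 9):
        problems.append(f"Python >= 3.9 required, found {version[0]}.{version[1]}")
    if provider in KEY_VARS and not env.get(KEY_VARS[provider]):
        problems.append(f"set {KEY_VARS[provider]} to use {provider} models")
    return problems


for problem in check_prerequisites(provider="openai"):
    print("missing:", problem)
```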

Installation

Install the package using pip (Recommended):

pip install vision-parse

Install the optional dependencies for OpenAI or Gemini:

pip install 'vision-parse[openai]'

pip install 'vision-parse[gemini]'
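After installing an extra, you can check whether its client library is importable. This is a small sketch using `importlib.util.find_spec`; the module names below (`openai`, `google.generativeai`) are assumptions about what each extra installs, so check `pyproject.toml` for the actual dependencies.

```python
import importlib.util

# Assumed mapping from extra name to the client module it provides.
EXTRA_MODULES = {"openai": "openai", "gemini": "google.generativeai"}


def is_importable(module_name: str) -> bool:
    """True if module_name can be located.

    The module itself is not executed, but for dotted names find_spec
    imports the parent package first.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "google") is not installed.
        return False


def installed_extras() -> dict:
    """Map each optional extra to whether its client module is available."""
    return {extra: is_importable(mod) for extra, mod in EXTRA_MODULES.items()}


print(installed_extras())
```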

Setting up Ollama (Optional)

See examples/ollama_setup.md for how to set up Ollama locally.
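If you use Ollama, you can confirm the server is running and the model is pulled before calling the parser. This sketch queries Ollama's `/api/tags` endpoint at its default address, `http://localhost:11434`; adjust the URL if your setup differs. The helper names are hypothetical. A missing model can be fetched with `ollama pull llama3.2-vision:11b`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address


def parse_model_names(tags_response: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list:
    """Return the names of models already pulled into the local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


def has_model(names: list, model_name: str) -> bool:
    """True if model_name is present; untagged names also match ':latest'."""
    if ":" in model_name:
        return model_name in names
    return model_name in names or f"{model_name}:latest" in names


if __name__ == "__main__":
    try:
        names = list_local_models()
    except (OSError, ValueError) as err:  # URLError is an OSError subclass
        print("Could not reach Ollama:", err)
    else:
        model = "llama3.2-vision:11b"
        if not has_model(names, model):
            print(f"Model not found locally; run: ollama pull {model}")
```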

⌛️ Usage

Basic Example Usage

from vision_parse import VisionParser

# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b", # Local Ollama models don't need an API key
    temperature=0.4,
    top_p=0.3,
    extraction_complexity=False # Set to True for more detailed extraction
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
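Judging by the loop above, `convert_pdf` returns one Markdown string per page. To save the whole document as a single file, you could join the pages yourself. The helpers below are a hypothetical sketch, and the `---` page separator is an arbitrary choice.

```python
from pathlib import Path


def join_pages(pages, separator="\n\n---\n\n"):
    """Combine per-page Markdown strings into one document string."""
    return separator.join(page.strip() for page in pages) + "\n"


def save_markdown(pages, output_path):
    """Write the combined Markdown to output_path, creating parent folders."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(join_pages(pages), encoding="utf-8")
    return path
```

For example, call `save_markdown(markdown_pages, "output/document.md")` after the conversion above.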

PDF Page Configuration

from vision_parse import VisionParser, PDFPageConfig

# Configure PDF processing settings
page_config = PDFPageConfig(
    dpi=400,
    color_space="RGB",
    include_annotations=True,
    preserve_transparency=False
)

# Initialize parser with custom page config
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.4,
    extraction_complexity=False,
    page_config=page_config
)

# Convert PDF to markdown
pdf_path = "path/to/your/document.pdf"
markdown_pages = parser.convert_pdf(pdf_path)
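To get a feel for what the `dpi` setting costs, here is the arithmetic, assuming `dpi` sets the rendering resolution in the usual sense. PDF page sizes are measured in points (72 per inch), so pixels = points / 72 * dpi. The helper names are hypothetical.

```python
import math

POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 text for n_bytes of binary image data."""
    return 4 * math.ceil(n_bytes / 3)


w, h = rendered_size(612, 792, dpi=400)  # US Letter is 612 x 792 pt
print(w, h, w * h)  # 3400 4400 14960000
```

Halving `dpi` to 200 gives 1700 x 2200 pixels, a quarter of the pixel count, which generally means smaller images and requests at some cost in legibility for fine print.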

OpenAI or Gemini Model Usage

from vision_parse import VisionParser

# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key", # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    extraction_complexity=True # True enables more detailed extraction
)

# Alternatively, initialize the parser with a Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key", # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    extraction_complexity=True # True enables more detailed extraction
)
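Rather than hardcoding keys as in the snippet above, you might read them from environment variables. The sketch below builds keyword arguments for `VisionParser` using only the parameters shown in this README; the environment variable names and the prefix-based provider detection are assumptions, not behavior of the package.

```python
import os

# Assumed convention: model-name prefix -> environment variable holding the key.
# Models matching no prefix (e.g. Ollama models) get no api_key.
KEY_ENV_BY_PREFIX = {"gpt-": "OPENAI_API_KEY", "gemini-": "GEMINI_API_KEY"}


def parser_kwargs(model_name: str, env=None, **overrides) -> dict:
    """Build keyword arguments for VisionParser without hardcoding secrets."""
    env = os.environ if env is None else env
    kwargs = {"model_name": model_name, "temperature": 0.7, "top_p": 0.4}
    for prefix, var in KEY_ENV_BY_PREFIX.items():
        if model_name.startswith(prefix):
            key = env.get(var)
            if not key:
                raise RuntimeError(f"{var} is not set; required for {model_name}")
            kwargs["api_key"] = key
            break
    kwargs.update(overrides)  # caller-supplied values win over defaults
    return kwargs
```

You would then pass the result along, e.g. `VisionParser(**parser_kwargs("gpt-4o", extraction_complexity=True))`.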

Supported Models

This package supports the following Vision LLM models:

OpenAI: gpt-4o, gpt-4o-mini
Google Gemini: gemini-1.5-flash, gemini-2.0-flash-exp, gemini-1.5-pro
Meta Llama and LLaVA via Ollama: llava:13b, llava:34b, llama3.2-vision:11b, llama3.2-vision:70b
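If you want to fail fast on a typo in `model_name`, you could validate it against the list above. This is a hypothetical snapshot of the README's list, not the package's own validation, so it would need updating whenever the supported models change.

```python
# Snapshot of the supported models listed in this README.
SUPPORTED_MODELS = {
    "openai": {"gpt-4o", "gpt-4o-mini"},
    "gemini": {"gemini-1.5-flash", "gemini-2.0-flash-exp", "gemini-1.5-pro"},
    "ollama": {"llava:13b", "llava:34b", "llama3.2-vision:11b", "llama3.2-vision:70b"},
}


def provider_for(model_name: str) -> str:
    """Return the provider for a listed model, or raise ValueError."""
    for provider, models in SUPPORTED_MODELS.items():
        if model_name in models:
            return provider
    supported = sorted(m for models in SUPPORTED_MODELS.values() for m in models)
    raise ValueError(
        f"unsupported model {model_name!r}; choose one of: {', '.join(supported)}"
    )
```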

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

