PolicyPipeline

Local GPU-accelerated pipeline that extracts policy text from PDFs

Many organisations need to distill policy-relevant content from long PDF documents but cannot use cloud services for privacy or cost reasons. PolicyPipeline chains seven checkpointed stages (PDF parsing, cleaning, chunking, embedding, RAG extraction, validation, and final assembly) using Marker, LangChain, ChromaDB, and an Ollama-hosted LLM, all on a single GPU. It is aimed at developers and analysts who want a reproducible, self-hosted solution for policy mining, with crash-recoverable runs and full control over their data. Compared to generic OCR-plus-LLM scripts, it provides an end-to-end, GPU-optimized workflow that makes no external API calls.
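To illustrate what "checkpointed stages" and "crash-recoverable runs" can mean in practice, here is a minimal sketch using only the Python standard library. It is not PolicyPipeline's actual code: the function names (`run_pipeline`, `make_placeholder_stage`), the JSON checkpoint layout, and the short stage labels are all assumptions made for illustration, and the placeholder stages stand in for the real Marker, ChromaDB, and Ollama work.

```python
import json
from pathlib import Path
from typing import Any, Callable, List, Tuple

# Short labels for the seven stages named in the description (labels are hypothetical).
STAGE_NAMES = ["parse", "clean", "chunk", "embed", "extract", "validate", "assemble"]

Stage = Tuple[str, Callable[[Any], Any]]


def make_placeholder_stage(name: str) -> Callable[[Any], Any]:
    """Build a stand-in stage that only records its own name in the data."""
    def stage(data: List[str]) -> List[str]:
        return data + [name]
    return stage


def run_pipeline(stages: List[Stage], initial_input: Any,
                 checkpoint_dir: Path) -> Tuple[Any, List[str]]:
    """Run stages in order, reusing any checkpoint already on disk.

    Returns the final output and the names of the stages that actually ran.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    data = initial_input
    executed: List[str] = []
    for index, (name, func) in enumerate(stages):
        checkpoint = checkpoint_dir / f"{index:02d}_{name}.json"
        if checkpoint.exists():
            # A previous run finished this stage: load its output and skip it.
            data = json.loads(checkpoint.read_text(encoding="utf-8"))
            continue
        data = func(data)
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(data), encoding="utf-8")
        tmp.replace(checkpoint)
        executed.append(name)
    return data, executed
```

If a run fails partway through (for example at the embedding stage), calling `run_pipeline` again with the same `checkpoint_dir` reloads the saved outputs of the completed stages and resumes from the first stage that has no checkpoint. This sketch assumes each stage's output is JSON-serialisable; a real pipeline would likely store embeddings and other large artifacts in a different format.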

GitHub: Hamza-Malik05/PolicyPipeline