A fast, helpful, and open-source document parser
Updated May 26, 2026 - Rust
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.
3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.
Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.
The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.
Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.
Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.
Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.
PDF- and image-to-text converter with XFA forms support. It extracts embedded text and/or renders pages into upscaled images for OCR, to handle complex layouts and scans. Single static binary, reads stdin/writes stdout. Built for n8n, Power Automate, and containerized workflows.
Pure Rust OCR inference engine powered by GLM-OCR vision-language model. No Python. No PyTorch. Just cargo build and go.
Logstash-like RAG ingestion daemon for local files, chunking, embeddings, and Qdrant indexing.
Convert scans of handwritten notes to PDF.
📄 Ingest documents into structured datasets for LLMs, ensuring numeric integrity and easy export across multiple frameworks with doc2dataset.
High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
Document engineering IDE. Rust + Tauri 2.0 + Typst backend, visual pipelines, AI-assisted document creation.
Drag a PDF, get it in your language. 22 Indic languages. Layout preserved.
Desktop app (Tauri 2 + Svelte 5 + Rust) for parsing PDF/DOCX/TXT with image context extraction, document classification, and local SQLite storage. AGPL-3.0 + commercial.