document-processing

Here are 25 public repositories matching this topic...

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated May 26, 2026
Rust

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated May 26, 2026
Rust

bzsanti / oxidizePdf

Pure Rust PDF library for AI/RAG: structure-aware chunking, no ML, no C deps.

Updated May 24, 2026
Rust

3DCF-Labs / doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

nlp rust cli machine-learning ocr evaluation dataset-generation data-pipeline document-processing fine-tuning rag document-understanding llm 3dcf doc2dataset numguard

Updated Feb 10, 2026
Rust

zircote / rlm-rs

Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.

Updated May 25, 2026
Rust

yfedoseev / office_oxide

The fastest Office document library for Python, Rust, Go, JS/TS, C# and WASM. DOCX, XLSX, PPTX, DOC, XLS, PPT. Up to 100× faster than python-docx/openpyxl/python-pptx. 100% pass rate on valid Office files. MIT/Apache-2.0.

Updated May 20, 2026
Rust

kreuzberg-dev / kreuzberg-cloud

Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.

api kubernetes rust pdf microservices ocr nextjs helm postgresql self-hosted saas nats text-extraction cloud-native document-processing document-extraction busl axum kreuzberg

Updated May 26, 2026
Rust

clark-labs-inc / pdfsink-rs

Fast pure-Rust PDF extraction library and CLI by Clark Labs Inc. — 10–50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction.

rust pdf text-extraction rust-library pdf-to-text rust-crate table-extraction pdf-parser document-processing layout-analysis pdf-to-json pdf-extraction pdfplumber document-ai clark-labs

Updated Apr 27, 2026
Rust

carles-abarca / docling-rs

Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.

text-extraction document-processing pdf-document-processor xlsx-parser docx-converter rag-pipeline docling

Updated Dec 15, 2025
Rust

KimSeogyu / undocx

Extract clean, structured Markdown from DOCX for LLM and RAG contexts.

python markdown rust converter docx document-processing rag llm

Updated Mar 23, 2026
Rust

wmahfoudh / crabocr

PDF and image-to-text converter with XFA forms support. It extracts embedded text and/or renders pages into upscaled images for OCR to handle complex layouts and scans. Single static binary, reads stdin/writes stdout. Built for n8n, Power Automate, and containerized workflows.

Updated Feb 20, 2026
Rust

saravananravi08 / glm-ocr-rs

Pure Rust OCR inference engine powered by GLM-OCR vision-language model. No Python. No PyTorch. Just cargo build and go.

rust ocr cuda candle vlm document-processing vision-language-model glm-ocr

Updated Mar 4, 2026
Rust

laofahai / linch-docx-rs

A reliable DOCX reading and writing library for Rust with round-trip preservation

rust word office docx ooxml document-processing

Updated Mar 20, 2026
Rust

ragloom / ragloom

Logstash-like RAG ingestion daemon for local files, chunking, embeddings, and Qdrant indexing.

rust embeddings ingestion document-processing rag vector-database qdrant retrieval-augmented-generation

Updated May 26, 2026
Rust

gumienny / cn

Convert scans of handwritten notes to PDF.

rust cli entropy notes image-processing clean image-thresholding k-means document-processing separation foreground-background tsallis

Updated Sep 5, 2018
Rust

reisel-g / doc2dataset

📄 Ingest documents into structured datasets for LLMs, ensuring numeric integrity and easy export across multiple frameworks with doc2dataset.

nlp rust cli machine-learning ocr big-data text evaluation dataset document data-pipeline document-processing fine-tuning interleaved multimodal document-understanding 3dcf numguard

Updated May 26, 2026
Rust

oeo / processor-rs

High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.

rust text-extraction tesseract-ocr image-optimization parallel-processing document-processing

Updated Jan 15, 2025
Rust

oneKn8 / Syzygy

Document engineering IDE. Rust + Tauri 2.0 + Typst backend, visual pipelines, AI-assisted document creation.

desktop-app rust ide document-processing tauri typst

Updated Feb 17, 2026
Rust

sovereign-shovels / sarvam-pdf

Drag a PDF, get it in your language. 22 Indic languages. Layout preserved.

i18n rust pdf translation indic document-processing indian-languages sarvam

Updated May 11, 2026
Rust

build-on-ai / document-processor

Desktop app (Tauri 2 + Svelte 5 + Rust) for parsing PDF/DOCX/TXT with image context extraction, document classification, and local SQLite storage. AGPL-3.0 + commercial.

desktop-app rust cross-platform sqlite svelte agplv3 pdf-parser document-processing tauri docx-parser

Updated May 22, 2026
Rust
