What is PaperJet?

PaperJet is a platform for extracting structured data from documents that runs entirely on your own infrastructure.

Features

  • Structured data extraction: define a schema and extract it from any supported document (DOCX, PDF, images)
  • Fully open source: the web and self-hosted versions have the same feature set
  • Zero cloud dependencies: PaperJet doesn’t depend on any cloud services; everything is self-contained in Docker
  • Built for large documents: easily ingest hundreds of pages at once
  • Use any LLM with your own keys (BYOK): major cloud providers such as OpenAI and Gemini, as well as local providers such as vLLM, LM Studio, and Ollama

Getting started

If you’re looking to start using PaperJet, head over to the User Guide. If you want to set it up yourself, check out the Administrator Guide.