Synthetic OCR + Document AI Training Data

Generate labeled synthetic documents for OCR and document AI.

DocSet Generator is a Windows & Linux desktop generator for OCR training data, document AI evaluation, and extraction pipeline testing. 33 document types, controllable degradation, and ML-ready exports — COCO, LayoutLM, FUNSD, and DocVQA — with zero real PII.

Purchase ($199) → | Try the Live Demo

One-time purchase. Free updates. Runs locally: nothing leaves your machine.

court_document.pdf

STATE OF CALIFORNIA

LOS ANGELES COUNTY SUPERIOR COURT

━━━━━━━━━━━━━━━━━━━━━━━

ASHLEY VINCENT, Plaintiff,

Case No.: 2020-CV-3360

invoice.pdf

INVOICE #7291

Date: March 24, 2026

Subtotal: $12,450.00

Tax (8%): $996.00

TOTAL: $13,446.00

receipt_Corrupt_Etext.pdf — OCR: 15%

RÒB1Ñ$OIN BI<

Dsle: QE/7?|2026 7îlna.

Srih7c+a] $3zQ20

Yoy {/%> $7Z91

TQLÂI $8q2·07

OCR DEGRADATION ACTIVE — 15%

33 Document Types

300 Docs / Second (clean)

0 Real PII Generated

100% Local: Nothing Uploaded

Free Sample

See the output before you commit.

Download 50 sample documents generated by DocSet Generator, spanning multiple types in both clean and corrupted versions. No email required. No strings attached.

PDF documents across 10+ types
Clean and OCR-degraded versions
Corrupt text layer examples
Native formats included (.docx, .eml, .csv)
50 documents total, ready to inspect

↓ Download Free Sample Pack

// 50 documents · .zip archive · No signup required · .7z version

// Or generate your own right now — try the live browser demo →

Receipt — OCR Degraded 15%

Receipt — Clean 100%

The Application

Clean interface. No configuration required.

Main Interface: 33 document types organized by category. Select individual types or entire families. Estimated output count updates in real time.

Generation Complete: a completed run shows COMPLETE status, the generation time, and the output folder open with timestamped files. In this example, 5 documents with Bates stamping, watermarks, and a corrupt text layer took 1.3 seconds.

Performance

Fast enough to not be your bottleneck.

Mode	Documents	Time	Workers
Clean (100% quality)	330 docs	1.1s	8 workers
OCR Degraded (50%)	330 docs	11s	8 workers
Corrupt Text Layer	330 docs	11s	4 workers
Corrupt Text Layer (scale)	3,300 docs	104s	4 workers

// Worker count scales automatically based on your CPU and workload type. No configuration required.

01 — EXPORTS

ML-Ready Annotations

Export directly to COCO, LayoutLM, FUNSD, and DocVQA — word-detection boxes, token/label/normalized-box JSONL, question/answer/header roles with semantic links, and derived question-answer pairs. Train and fine-tune without writing a labeling pipeline.
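For readers new to these formats: LayoutLM-style models conventionally expect boxes scaled to a 0-1000 integer grid relative to the page, while COCO boxes are `[x, y, width, height]` in pixels. The sketch below shows that conversion under those conventions. The function name and sample values are illustrative assumptions, not DocSet Generator's own code or output schema.

```python
def coco_to_layoutlm(bbox, page_width, page_height):
    """Convert a COCO [x, y, width, height] pixel box into a
    LayoutLM-style [x0, y0, x1, y1] box on a 0-1000 integer grid."""
    x, y, w, h = bbox
    scaled = (
        int(1000 * x / page_width),
        int(1000 * y / page_height),
        int(1000 * (x + w) / page_width),
        int(1000 * (y + h) / page_height),
    )
    # Clamp so boxes that touch or slightly exceed the page edge stay in range.
    return [max(0, min(1000, v)) for v in scaled]


# Hypothetical word box on an 850 x 1100 pixel page image.
example_box = coco_to_layoutlm([85, 110, 170, 22], page_width=850, page_height=1100)
```

If the export already ships normalized boxes, as the LayoutLM JSONL is described as doing, this step is only needed when deriving boxes from the COCO file yourself.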

02 — GROUND TRUTH

Canonical Text + Geometry

A pre-render document model captures the exact text of every type — titles, paragraphs, list items, tables with merged-cell spans — before any corruption. Ships as a ground-truth text sidecar per document plus word → line → block boxes and key/value pairs.

03 — DATASETS

Splits & Recipes

Deterministic, type-balanced train / validation / test splits with a configurable seed. Every run writes a replayable dataset_recipe.json, SHA-256 checksums, and a quality_report.json so datasets are reproducible and auditable.
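To make "deterministic, type-balanced" concrete, here is one way such a split could work: shuffle each document type separately with a seed derived from the run seed, then cut each type by the same ratios. This is a minimal sketch under assumed names (`balanced_split`, the 80/10/10 ratios); the tool's actual algorithm is not documented here.

```python
import random
from collections import defaultdict


def balanced_split(docs, seed, ratios=(0.8, 0.1, 0.1)):
    """docs is a list of (doc_id, doc_type) pairs.
    Returns {doc_id: "train" | "validation" | "test"}."""
    by_type = defaultdict(list)
    for doc_id, doc_type in docs:
        by_type[doc_type].append(doc_id)

    assignment = {}
    for doc_type in sorted(by_type):
        ids = sorted(by_type[doc_type])  # fixed order before shuffling
        # Seeding per type keeps one type's split stable when others change.
        random.Random(f"{seed}:{doc_type}").shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_val = int(len(ids) * ratios[1])
        for i, doc_id in enumerate(ids):
            if i < n_train:
                assignment[doc_id] = "train"
            elif i < n_train + n_val:
                assignment[doc_id] = "validation"
            else:
                assignment[doc_id] = "test"
    return assignment
```

Running it twice with the same seed yields the same assignment, and every type contributes to each split in the same proportion.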

04 — DEGRADATION

OCR Corruption Slider

Control exactly how hard your model has to work. From clean ground truth to heavily degraded scans — continuous spectrum, not presets. Realistic character substitutions based on actual OCR failure patterns.
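As a rough illustration of slider-controlled character substitution, the sketch below swaps look-alike characters with a probability tied to a quality value. The confusion table, the function name, and the mapping from quality to error rate are all assumptions made for this example and are not the product's actual substitution model.

```python
import random

# Hypothetical table of look-alike confusions often seen in OCR output.
CONFUSIONS = {"O": "0", "0": "O", "l": "1", "1": "l", "S": "5", "5": "S", "B": "8", "m": "rn"}


def degrade(text, quality, seed=0):
    """quality is in [0, 1]: 1.0 leaves the text untouched,
    lower values substitute a larger share of confusable characters."""
    rng = random.Random(seed)  # seeded so the same input degrades the same way
    error_rate = 1.0 - quality
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Because the probability is continuous, any point between clean and heavily degraded can be requested rather than a fixed set of presets.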

05 — STEALTH

Corrupt Text Layer

Clean visual page, corrupted hidden text layer. Forces OCR fallback on tools that read embedded text directly. Tests the gap between what a document looks like and what an extractor actually reads.
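One way to quantify that gap is to compare whatever text an extractor returns against the ground-truth text using character error rate (edit distance divided by reference length). The sketch below is a plain Levenshtein implementation; how you obtain the two strings (the extractor output and the ground-truth sidecar) is left as an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

A pipeline that reads the corrupted embedded layer directly should score a high error rate, while one that falls back to visual OCR on the clean page should score close to zero.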

06 — LEGAL

Bates & Image-Only PDFs

Sequential Bates identifiers on every page for eDiscovery and legal AI pipelines, plus fully flattened image-only PDFs with no selectable text — a true visual-only extraction challenge, generated at scale.
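Bates numbering assigns one identifier per page, continuing across documents in a production set. The sketch below illustrates the idea; the `DOC` prefix, six-digit width, and function name are assumptions for this example rather than the tool's actual defaults.

```python
def stamp_documents(page_counts, prefix="DOC", start=1, width=6):
    """page_counts is a list of (doc_name, number_of_pages).
    Returns {doc_name: [bates_id_for_each_page]}, numbering continuously."""
    result = {}
    next_number = start
    for name, pages in page_counts:
        result[name] = [
            f"{prefix}{n:0{width}d}"
            for n in range(next_number, next_number + pages)
        ]
        next_number += pages  # the next document picks up where this one ended
    return result
```

For instance, a two-page file followed by a one-page file would receive `DOC000001`, `DOC000002`, and then `DOC000003`.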

07 — REPRODUCIBLE

Seeded & Byte-Identical

Seed a run and regenerate it exactly, down to byte-identical PDFs on the same platform. Derived per-document seeds, a fixed reference date, and normalized metadata make clean/degraded twin pairs and dataset replays deterministic.
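Derived per-document seeds are commonly produced by hashing the run seed together with something that identifies the document, so each document gets its own stable random stream. The following sketch shows that pattern; the key format and function name are assumptions, not a description of the tool's internals.

```python
import hashlib
import random


def derive_seed(run_seed, doc_type, index):
    """Derive a stable 64-bit seed for one document from the run seed."""
    key = f"{run_seed}:{doc_type}:{index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


# A clean document and its degraded twin can share the same content seed,
# so both are built from identical underlying text.
content_rng = random.Random(derive_seed(42, "invoice", 0))
```

Because each seed depends only on the run seed, type, and index, regenerating one document does not require replaying the whole run.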

08 — PRIVACY

Zero Real PII

All names, addresses, companies, SSNs, account numbers, and financial figures are synthetically generated — reserved .example domains, 555-01xx phones, invalid SSNs. Mathematically accurate but entirely fake.
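The conventions named above can be sketched in a few lines: the `.example` top-level domain is reserved for documentation, 555-0100 through 555-0199 are set aside for fictional phone numbers, and SSNs with area number 000 or 666 are never issued. The name pools and function below are hypothetical and only illustrate the pattern.

```python
import random

FIRST_NAMES = ["ashley", "jordan", "morgan"]  # hypothetical name pool
COMPANIES = ["northwind", "acme", "globex"]   # hypothetical company pool


def fake_contact(rng):
    """Build contact details that cannot collide with a real person."""
    name = rng.choice(FIRST_NAMES)
    company = rng.choice(COMPANIES)
    area_code = rng.randrange(200, 1000)
    return {
        # .example is a reserved TLD, so the address can never be delivered.
        "email": f"{name}@{company}.example",
        # 555-01xx line numbers are reserved for fictional use.
        "phone": f"{area_code}-555-01{rng.randrange(100):02d}",
        # Area numbers 000 and 666 are never assigned to real SSNs.
        "ssn": f"{rng.choice(['000', '666'])}-{rng.randrange(1, 100):02d}-{rng.randrange(1, 10000):04d}",
    }


contact = fake_contact(random.Random(1))
```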

09 — LOCAL

Runs On-Prem

Windows and Linux desktop application. No internet required after install. No data sent to any server. Your training data stays on your machine — critical for regulated industries.

Coverage

33 document types across every domain your pipeline will encounter.

General

Letter
Memo
Report
Fax
Meeting Notes
Scheduler
Transmittal

Business

Email
Mass CC Email
Invoice
Receipt
Check
Financial
Corporate
Presentation
Real Estate

Legal & Govt

Agreement
Court Document
Government
Patent
Certificate
Form

Data & Misc

Media
Documentation
Personal Info
Publication
Table / List
Transcript

Native Formats

Word (.docx)
Excel (.xlsx)
Email (.eml)
CSV (.csv)
Plain Text (.txt)

Latest Release · v1.5

From OCR test data to a full training dataset.

New in v1.5

COCO · LayoutLM · FUNSD · DocVQA exports

Every run can emit ML-ready annotations: COCO word boxes in image pixels, LayoutLM token/label/normalized-box JSONL, FUNSD question/answer/header roles with links, and DocVQA Q&A pairs with word and page references. Drop it straight into training.

New in v1.5

Canonical ground truth & rich annotations

A pre-render model captures the true text of all 33 types — headings, tables, merged-cell spans — before any corruption. Annotation schema 2.1 adds word → line → block geometry, table/row/cell records, entities, and key/value pairs, with a ground-truth text sidecar per document.

New in v1.5

Reproducible datasets & recipes

Deterministic, type-balanced train/validation/test splits, a replayable dataset_recipe.json, byte-identical seeded PDFs, SHA-256 checksums, and a quality_report.json with per-type completeness metrics. Manifest schema 2.0 with built-in JSON Schema validation.
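Checking a file against a published SHA-256 checksum takes only the standard library. The sketch below hashes a file in chunks; where the expected digests are stored and how they are keyed is not specified here, so the comparison step is left as an assumption.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks
    so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with an expected value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```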

New in v1.5

Now on Linux, too

Ships as a Linux AppImage and Flatpak alongside the Windows build, with high-fidelity LibreOffice rendering for native formats when available. This release also adds automatic parallel generation in the GUI and expands the test suite from 33 tests to 104.

Read the full v1.5 changelog →

Pricing

One price. No subscriptions. No usage limits.

Built for ML engineers and QA teams who need training data now, not after a procurement process. Buy once, generate as many documents as you need, keep every update.

Questions? Contact us at
[email protected]

$199

One-time purchase · Windows & Linux

Windows .exe + Linux AppImage & Flatpak
50-document sample pack included
33 document types across 5 categories
COCO, LayoutLM, FUNSD & DocVQA exports
Ground-truth text + word/line/block annotations
Train / validation / test dataset splits
OCR degradation slider + corrupt text layer
Bates stamping, watermarks, image-only PDFs
Seeded, reproducible generation with recipes
Free updates — re-download anytime
Runs fully offline — no data leaves machine

Purchase Now →

Or download the free sample first

FAQ

Common questions.

Anything else? Email [email protected]

Is any of the generated data real?

No. All names, addresses, companies, SSNs, EINs, account numbers, and financial figures are synthetically generated. The math is accurate but every entity is entirely fabricated. Safe to use in any environment.

Does it require an internet connection?

Only for the initial download. After that it runs entirely offline. No telemetry, no license servers, no API calls. Your generated data never leaves your machine.

What's the difference between OCR Degradation and Corrupt Text Layer?

OCR Degradation visually corrupts the document — characters are substituted, text becomes hard to read. Corrupt Text Layer keeps the document looking clean but corrupts the hidden embedded text, forcing tools to fall back to visual OCR. They can be used independently or together.

What OS does it run on?

Windows and Linux. Windows ships as a standalone .exe; Linux ships as an AppImage and a Flatpak. No Python installation or dependencies required on either platform.

What annotation and dataset export formats does it produce?

Every run can write ML-ready exports: COCO word-detection boxes in rendered-image pixels, LayoutLM token/label/normalized-box JSONL, FUNSD question/answer/header roles with semantic links, and DocVQA question-answer pairs with word and page references. It also emits word/line/block annotations with page geometry, a ground-truth text sidecar per document, clean page images, and deterministic train/validation/test splits — all validated against bundled JSON Schemas.
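For orientation, a COCO detection file has three top-level lists: `images`, `annotations`, and `categories`, with each annotation carrying an `[x, y, width, height]` pixel box. The sketch below builds a minimal structure of that shape for word boxes. The `text` field, the file name, and the helper name are illustrative assumptions and may not match the fields DocSet Generator writes.

```python
import json


def build_coco(image_name, width, height, words):
    """words is a list of (text, x, y, w, h) tuples in image pixels."""
    annotations = []
    for ann_id, (text, x, y, w, h) in enumerate(words, start=1):
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": 1,
            "bbox": [x, y, w, h],  # COCO uses top-left x, y plus width, height
            "area": w * h,
            "iscrowd": 0,
            "text": text,          # non-standard extra field, assumed here
        })
    return {
        "images": [{"id": 1, "file_name": image_name, "width": width, "height": height}],
        "annotations": annotations,
        "categories": [{"id": 1, "name": "word"}],
    }


coco = build_coco("invoice_page_1.png", 850, 1100, [("INVOICE", 85, 110, 170, 22)])
coco_json = json.dumps(coco, indent=2)
```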

What happens when you release updates?

Updates are free for all existing customers. Re-download the latest version from your original purchase link anytime.

Can I use the generated documents for commercial ML training?

Yes. Documents generated by DocSet Generator are yours to use however you need, including as training data for commercial ML products.