Generate synthetic documents for OCR and document AI.
DocSet Generator is a Windows desktop synthetic document generator for OCR training data, document AI evaluation, and extraction pipeline testing. 33 document types. Controllable degradation. Zero real PII.
See the output before you commit.
Download 50 sample documents generated by DocSet Generator, spanning multiple types in both clean and corrupted versions. No email required. No strings attached.
- PDF documents across 10+ types
- Clean and OCR-degraded versions
- Corrupt text layer examples
- Native formats included (.docx, .eml, .csv)
- 50 documents total, ready to inspect
// 50 documents · .7z archive · No signup required
Clean interface. No configuration required.
Fast enough to not be your bottleneck.
| Mode | Documents | Time (s) | Workers |
|---|---|---|---|
| Clean (100% quality) | 330 | 1.1 | 8 |
| OCR Degraded (50%) | 330 | 11 | 8 |
| Corrupt Text Layer | 330 | 11 | 4 |
| Corrupt Text Layer (scale) | 3,300 | 104 | 4 |
// Worker count scales automatically based on your CPU and workload type. No configuration required.
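To put the benchmark figures in perspective, the short Python sketch below converts each row into documents per second and extrapolates a run time for a larger batch. The function names are illustrative, and the linear extrapolation is an assumption rather than a measured result.

```python
# Benchmark rows copied from the table above: (mode, documents, seconds, workers).
BENCHMARKS = [
    ("Clean (100% quality)", 330, 1.1, 8),
    ("OCR Degraded (50%)", 330, 11.0, 8),
    ("Corrupt Text Layer", 330, 11.0, 4),
    ("Corrupt Text Layer (scale)", 3300, 104.0, 4),
]


def throughput(documents: int, seconds: float) -> float:
    """Documents generated per second of wall-clock time."""
    return documents / seconds


def estimate_seconds(target_documents: int, documents: int, seconds: float) -> float:
    """Linear extrapolation from one benchmark row (an assumption, not a guarantee)."""
    return target_documents / throughput(documents, seconds)


if __name__ == "__main__":
    for mode, docs, secs, workers in BENCHMARKS:
        rate = throughput(docs, secs)
        print(f"{mode}: {rate:.1f} docs/s total, {rate / workers:.1f} docs/s per worker")
```

By this arithmetic the clean mode runs at roughly 300 documents per second, and the two corrupt text layer rows both land near 30 documents per second, which suggests that mode scales close to linearly between 330 and 3,300 documents.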
OCR Corruption Slider
Control exactly how hard your model has to work. From clean ground truth to heavily degraded scans — continuous spectrum, not presets. Realistic character substitutions based on actual OCR failure patterns.
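As a rough illustration of slider-controlled degradation, the Python sketch below swaps confusable characters with a probability set by a `level` parameter. It is not DocSet Generator's implementation: the `CONFUSIONS` table, the `degrade` function, and the mapping of the slider to a 0.0 to 1.0 probability are all assumptions made for this example.

```python
import random
from typing import Optional

# Illustrative confusion pairs of the kind often seen in OCR output.
# Hypothetical table, not the product's actual substitution set.
CONFUSIONS = {
    "O": "0", "0": "O", "l": "1", "1": "l", "I": "l",
    "S": "5", "5": "S", "B": "8", "m": "rn", "e": "c",
}


def degrade(text: str, level: float, seed: Optional[int] = None) -> str:
    """Replace each confusable character with probability `level` (0.0 to 1.0)."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < level:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    clean = "Invoice 1050: Balance due $815.00"
    for slider in (0, 25, 50, 100):
        print(slider, degrade(clean, slider / 100, seed=42))
```

Keeping the clean string alongside each degraded variant gives you a ground-truth pair at every difficulty level, which is what makes a continuous slider useful for plotting accuracy against degradation.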
Corrupt Text Layer
Clean visual page, corrupted hidden text layer. Forces OCR fallback on tools that read embedded text directly. Tests the gap between what a document looks like and what an extractor actually reads.
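One way a pipeline might catch this gap is to compare the embedded text against an OCR pass over the rendered page and fall back to OCR when the two disagree too much. The sketch below uses only the Python standard library and assumes you already have both strings. The function names and the 0.2 threshold are hypothetical choices for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def text_layer_is_suspect(embedded_text: str, ocr_text: str, threshold: float = 0.2) -> bool:
    """Flag the embedded layer when it diverges from what OCR sees on the page.

    The OCR text is treated as the reference because the visual page is clean.
    The threshold is an arbitrary starting point, not a recommended value.
    """
    return char_error_rate(ocr_text, embedded_text) > threshold


if __name__ == "__main__":
    print(text_layer_is_suspect("T0ta1 du3: $l,2OO", "Total due: $1,200"))
```

An extractor that never runs this kind of check will trust the corrupted layer, which is exactly the failure this mode is designed to expose.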
Bates Stamping
Sequential review-style identifiers on every page. Essential for eDiscovery and legal document AI pipelines. Generates at scale without manual numbering.
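For readers unfamiliar with Bates numbering, the sketch below shows the general idea in Python: a fixed prefix plus a zero-padded counter that keeps running across documents, so every page in a set gets a unique identifier. The `ABC` prefix, the six-digit width, and both function names are assumptions for this example and do not reflect DocSet Generator's actual format.

```python
from typing import Iterator, List, Tuple


def bates_numbers(prefix: str, start: int, page_count: int, width: int = 6) -> Iterator[str]:
    """Yield one zero-padded identifier per page, e.g. ABC000001."""
    for n in range(start, start + page_count):
        yield f"{prefix}{n:0{width}d}"


def bates_ranges(page_counts: List[int], prefix: str = "ABC",
                 start: int = 1, width: int = 6) -> List[Tuple[str, str]]:
    """Return the (first, last) identifier for each document, numbering continuously."""
    ranges = []
    next_number = start
    for count in page_counts:
        if count < 1:
            raise ValueError("each document needs at least one page")
        stamps = list(bates_numbers(prefix, next_number, count, width))
        ranges.append((stamps[0], stamps[-1]))
        next_number += count  # the next document continues where this one ended
    return ranges


if __name__ == "__main__":
    # Three documents with 2, 3, and 1 pages respectively.
    for first, last in bates_ranges([2, 3, 1]):
        print(first, "to", last)
```

The per-document ranges double as ground truth when testing whether a review pipeline detects document boundaries and page order correctly.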
Image-Only PDFs
Flatten documents to pure image — no selectable text layer at all. Tests OCR systems that can't rely on embedded text as a fallback. True visual-only extraction challenge.
Zero Real PII
All names, addresses, companies, SSNs, account numbers, and financial figures are synthetically generated. Mathematically accurate but entirely fake. Safe for any environment.
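To illustrate what fake but arithmetically consistent data can look like, the Python sketch below builds a synthetic invoice whose line amounts, subtotal, tax, and total all agree. It is a minimal example under stated assumptions, not the product's generator: the `fake_invoice` function, the price range, and the 8% tax rate are all hypothetical.

```python
import random
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def fake_invoice(seed: int, tax_rate: Decimal = Decimal("0.08")) -> dict:
    """Build a synthetic invoice whose figures are internally consistent."""
    rng = random.Random(seed)  # seeded so the same invoice can be regenerated
    items = []
    for _ in range(rng.randint(2, 5)):
        qty = rng.randint(1, 10)
        unit_price = Decimal(rng.randint(100, 50000)) / 100  # 1.00 to 500.00
        items.append({"qty": qty, "unit_price": unit_price, "amount": qty * unit_price})
    subtotal = sum((item["amount"] for item in items), Decimal("0"))
    tax = (subtotal * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"items": items, "subtotal": subtotal, "tax": tax, "total": subtotal + tax}


if __name__ == "__main__":
    invoice = fake_invoice(seed=1)
    for item in invoice["items"]:
        print(item["qty"], "x", item["unit_price"], "=", item["amount"])
    print("Subtotal", invoice["subtotal"], "Tax", invoice["tax"], "Total", invoice["total"])
```

Using `Decimal` instead of floats avoids rounding drift, so an extraction test can check that the parsed total equals the parsed subtotal plus tax exactly.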
Runs On-Prem
Windows desktop application. No internet required after install. No data sent to any server. Your training data stays on your machine — critical for regulated industries.
33 document types across every domain your pipeline will encounter.
General
- Letter
- Memo
- Report
- Fax
- Meeting Notes
- Scheduler
- Transmittal
Business
- Mass CC Email
- Invoice
- Receipt
- Check
- Financial
- Corporate
- Presentation
- Real Estate
Legal & Govt
- Agreement
- Court Document
- Government
- Patent
- Certificate
- Form
Data & Misc
- Media
- Documentation
- Personal Info
- Publication
- Table / List
- Transcript
Native Formats
- Word (.docx)
- Excel (.xlsx)
- Email (.eml)
- CSV (.csv)
- Plain Text (.txt)
One price. No subscriptions. No usage limits.
Built for ML engineers and QA teams who need training data now, not after a procurement process. Buy once, generate as many documents as you need, keep every update.
Questions? Contact us at
[email protected]
- DocumentGenerator.exe — runs immediately
- 50-document sample pack included
- 33 document types across 5 categories
- OCR degradation slider (0–100%)
- Corrupt text layer mode
- Bates stamping + watermarks
- Image-only PDF generation
- Adaptive parallel processing
- Free updates — re-download anytime
- Runs fully offline — no data leaves machine
Or download the free sample first