The problem

80–90% of historical data is invisible to AI

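Institutions have spent decades digitizing records, but the data stays locked in flat files, with no relationships, no GPS coordinates, and no structure AI can use. The "data wall", the point at which AI developers run short of fresh high-quality training data, is projected to arrive in 2026–27.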

Today's reality

✗ Millions of digitized items stored as flat JSON, with no relationships
✗ 35+ inconsistent schemas across archives, impossible to join
✗ No entity extraction, no GPS metadata, no semantic search
✗ Researchers spend 40+ hours per collection on manual curation
✗ LLMs trained on post-2010 web data; historical knowledge is dark
✗ Google / AWS OCR: vendor lock-in, no graph, you don't own the rows

GeoGraph OCR solves this

✓ Automatic entity extraction from any photo: people, places, dates, orgs
✓ Knowledge graph auto-built with semantic cross-document links
✓ GPS + GIS metadata on every document, normalized to WGS84
✓ Reduce curation from 40 hours to under 5 minutes per collection
✓ Structured AI training datasets ready to license
✓ User-owned Supabase rows with Row-Level Security, not vendor-locked

Capabilities

Everything you need to unlock dark data

Built on Google Gemini 2.5 Flash with a privacy-first, offline-capable architecture — 50,000+ lines of production TypeScript.

AI-Powered OCR

Extract text from historical documents, artifacts, signs, and scenery. Multi-language detection built in. Configurable: Gemini, OpenAI, or local models.

Knowledge Graph

Automatically links entities across documents. Interactive force-directed D3.js visualization. Export to JSON, CSV, or GraphML.
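
To make the GraphML export concrete, here is a minimal sketch of how a graph could be serialized to that format. The `GraphNode` and `GraphEdge` shapes and the `toGraphML` function are hypothetical illustrations, not GeoGraph's actual schema or exporter; only the GraphML element and attribute names come from the public GraphML format.

```typescript
// Hypothetical minimal graph shape; not GeoGraph's actual schema.
interface GraphNode { id: string; label: string; kind: "person" | "place" | "date" | "org"; }
interface GraphEdge { source: string; target: string; relation: string; }

// Escape the five XML special characters so labels cannot break the markup.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Serialize nodes and edges to a GraphML document string.
function toGraphML(nodes: GraphNode[], edges: GraphEdge[]): string {
  const lines: string[] = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">',
    '  <key id="label" for="node" attr.name="label" attr.type="string"/>',
    '  <key id="kind" for="node" attr.name="kind" attr.type="string"/>',
    '  <key id="relation" for="edge" attr.name="relation" attr.type="string"/>',
    '  <graph id="G" edgedefault="undirected">',
  ];
  for (const n of nodes) {
    lines.push(`    <node id="${escapeXml(n.id)}">`);
    lines.push(`      <data key="label">${escapeXml(n.label)}</data>`);
    lines.push(`      <data key="kind">${escapeXml(n.kind)}</data>`);
    lines.push("    </node>");
  }
  edges.forEach((e, i) => {
    lines.push(`    <edge id="e${i}" source="${escapeXml(e.source)}" target="${escapeXml(e.target)}">`);
    lines.push(`      <data key="relation">${escapeXml(e.relation)}</data>`);
    lines.push("    </edge>");
  });
  lines.push("  </graph>", "</graphml>");
  return lines.join("\n");
}
```

A file produced this way should open in common graph tools that read GraphML, with `label`, `kind`, and `relation` available as attributes.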

GIS Metadata

Enrich every scan with GPS coordinates, zone classification, and historical location correlation. Coordinates normalized to WGS84.
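
As an illustration of what coordinate normalization can involve, the sketch below converts EXIF-style degrees/minutes/seconds to signed decimal degrees and validates the result. It assumes the input already uses the WGS84 datum, as phone GPS readings typically do; converting from other datums needs a projection library and is not shown. The `DmsCoordinate` and `GeoPoint` types and both function names are hypothetical.

```typescript
// EXIF-style coordinate: degrees, minutes, seconds plus a hemisphere letter.
interface DmsCoordinate { degrees: number; minutes: number; seconds: number; ref: "N" | "S" | "E" | "W"; }
interface GeoPoint { lat: number; lon: number; }

// Convert degrees/minutes/seconds to signed decimal degrees.
// South and West are negative by convention.
function dmsToDecimal(c: DmsCoordinate): number {
  const magnitude = c.degrees + c.minutes / 60 + c.seconds / 3600;
  return c.ref === "S" || c.ref === "W" ? -magnitude : magnitude;
}

// Reject out-of-range latitude and wrap longitude into [-180, 180).
function normalizePoint(lat: number, lon: number): GeoPoint {
  if (!Number.isFinite(lat) || !Number.isFinite(lon)) {
    throw new RangeError("coordinates must be finite numbers");
  }
  if (lat < -90 || lat > 90) {
    throw new RangeError(`latitude out of range: ${lat}`);
  }
  const wrappedLon = ((((lon + 180) % 360) + 360) % 360) - 180;
  // Six decimal places is roughly 0.1 m at the equator.
  const round6 = (v: number): number => Math.round(v * 1e6) / 1e6;
  return { lat: round6(lat), lon: round6(wrappedLon) };
}
```

For example, `normalizePoint(10, 190)` wraps the longitude to `-170`, and a latitude of `95` is rejected rather than silently clamped.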

3D Metaverse Explore

Navigate your entire document corpus in an immersive 3D spatial environment powered by Three.js. Semantic clustering groups related records visually.

Smart Deduplication

Semantic NLP detects and merges duplicate entities across thousands of documents automatically. No manual cleanup required.
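
The general shape of duplicate merging can be sketched as follows. Note the substitution: this example uses plain token-set Jaccard similarity with union-find grouping as a stand-in for the semantic NLP described above, which would more likely compare embeddings. All names and the 0.5 threshold are assumptions for illustration.

```typescript
// Normalize a label into a deduplicated list of lowercase tokens.
// Limitation: this regex keeps only a-z and 0-9, so it suits Latin-script names only.
function tokenize(label: string): string[] {
  const tokens = label
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
  return Array.from(new Set(tokens));
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: string[], b: string[]): number {
  const setB = new Set(b);
  const shared = a.filter((t) => setB.has(t)).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Group labels whose similarity meets the threshold. Union-find makes matches
// chain transitively: if A matches B and B matches C, all three share a group.
function clusterDuplicates(labels: string[], threshold = 0.5): string[][] {
  const parent = labels.map((_, i) => i);
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving
      i = parent[i];
    }
    return i;
  };
  const tokens = labels.map(tokenize);
  for (let i = 0; i < labels.length; i++) {
    for (let j = i + 1; j < labels.length; j++) {
      if (jaccard(tokens[i], tokens[j]) >= threshold) parent[find(j)] = find(i);
    }
  }
  const groups = new Map<number, string[]>();
  labels.forEach((label, i) => {
    const root = find(i);
    const group: string[] = groups.get(root) ?? [];
    group.push(label);
    groups.set(root, group);
  });
  return Array.from(groups.values());
}
```

The pairwise comparison is quadratic in the number of labels, so a corpus of thousands of documents would need blocking or an approximate nearest-neighbour index in front of it.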

Batch Processing

Process hundreds or thousands of documents in parallel. Server-side queue with pause/resume/cancel controls and real-time progress tracking.
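
One way to model pause/resume/cancel with progress tracking is a small bookkeeping class like the sketch below. It is an assumption about how such a queue could be structured, not the actual server-side implementation: `BatchQueue`, `take`, and `settle` are hypothetical names, and the OCR work itself would be done by workers that call them.

```typescript
type QueueStatus = "running" | "paused" | "cancelled" | "done";

interface QueueProgress {
  total: number; completed: number; failed: number; percent: number; status: QueueStatus;
}

// Synchronous bookkeeping for a pausable batch. Workers call take() to get
// the next item and settle() to report the outcome.
class BatchQueue<T> {
  private pending: T[];
  private readonly total: number;
  private readonly concurrency: number;
  private inFlight = 0;
  private completed = 0;
  private failed = 0;
  private status: QueueStatus = "running";

  constructor(items: T[], concurrency = 4) {
    this.pending = items.slice();
    this.total = items.length;
    this.concurrency = concurrency;
    if (this.total === 0) this.status = "done";
  }

  // Next item, or undefined when paused, cancelled, drained, or at the concurrency limit.
  take(): T | undefined {
    if (this.status !== "running" || this.inFlight >= this.concurrency) return undefined;
    const item = this.pending.shift();
    if (item !== undefined) this.inFlight++;
    return item;
  }

  // Record the outcome of one in-flight item.
  settle(ok: boolean): void {
    if (this.inFlight === 0) throw new Error("nothing in flight");
    this.inFlight--;
    if (ok) this.completed++; else this.failed++;
    const drained = this.pending.length === 0 && this.inFlight === 0;
    if (drained && this.status !== "cancelled") this.status = "done";
  }

  pause(): void { if (this.status === "running") this.status = "paused"; }
  resume(): void { if (this.status === "paused") this.status = "running"; }

  // Cancelling drops pending work; items already in flight may still settle.
  cancel(): void {
    if (this.status !== "done") { this.status = "cancelled"; this.pending = []; }
  }

  progress(): QueueProgress {
    const finished = this.completed + this.failed;
    const percent = this.total === 0 ? 100 : Math.round((finished / this.total) * 100);
    return { total: this.total, completed: this.completed, failed: this.failed, percent, status: this.status };
  }
}
```

Keeping the state transitions synchronous makes them easy to test; the `progress()` snapshot is what a real-time progress indicator would poll or receive.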

Privacy-First Storage

All data stored locally in IndexedDB by default. Cloud sync with Supabase is fully opt-in. End-to-end encryption. You control what gets shared.

AI Training Marketplace

License your structured datasets to AI companies. Earn passive income as a fractional owner when your data is licensed via the Web3 marketplace.

Mobile PWA

Installable progressive web app with full offline capability. Point your phone at any document and capture it on the spot — museum, estate sale, or archive.

How it works

Three steps from photo to structured knowledge

No complex setup. No vendor lock-in. Point, capture, explore.

📸 Capture

Point your camera at any document, artifact, sign, or page. GeoGraph captures the image and automatically tags the GPS location.

→

🧠 Extract

Gemini 2.5 Flash extracts raw text, then identifies entities (people, places, dates, orgs), temporal era, and semantic relationships.
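
A model reply should be validated before it reaches the graph. The sketch below assumes the model is prompted to return a JSON object with `text`, `era`, and `entities` fields; that shape, the type names, and `parseExtraction` are hypothetical, not GeoGraph's actual response format.

```typescript
type EntityKind = "person" | "place" | "date" | "org";
interface ExtractedEntity { kind: EntityKind; name: string; }
interface ExtractionResult { text: string; era: string | null; entities: ExtractedEntity[]; }

const ENTITY_KINDS: EntityKind[] = ["person", "place", "date", "org"];

function isRecord(v: unknown): v is { [key: string]: unknown } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parse a model reply that is expected to contain one JSON object.
// Models sometimes add prose or a code fence around the JSON, so take the
// span from the first "{" to the last "}" before parsing.
function parseExtraction(reply: string): ExtractionResult {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found in reply");
  const data: unknown = JSON.parse(reply.slice(start, end + 1));
  if (!isRecord(data)) throw new Error("reply is not a JSON object");

  const text = data.text;
  const era = data.era;
  const rawEntities = data.entities;
  if (typeof text !== "string" || !Array.isArray(rawEntities)) {
    throw new Error("reply does not match the expected extraction shape");
  }

  const entities: ExtractedEntity[] = [];
  for (const raw of rawEntities) {
    // Drop malformed or unknown-kind entries instead of failing the whole scan.
    if (!isRecord(raw)) continue;
    const kindValue = raw.kind;
    const nameValue = raw.name;
    if (typeof nameValue !== "string" || nameValue.trim() === "") continue;
    const kind = ENTITY_KINDS.find((k) => k === kindValue);
    if (kind !== undefined) entities.push({ kind, name: nameValue.trim() });
  }
  return { text, era: typeof era === "string" ? era : null, entities };
}
```

Skipping bad entries while rejecting a structurally wrong reply is a design choice: one hallucinated entity type should not discard an otherwise usable transcription.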

→

🔗 Connect

Entities are linked across your entire corpus in an interactive knowledge graph. Search, query, visualize in 3D, export, or license to AI companies.
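
Cross-document linking can be pictured as an inverted index from entity to documents. The sketch below is a simplified illustration that matches entities by exact normalized name; `ScanDoc`, `DocLink`, and `linkDocuments` are hypothetical names, and the real system may match on meaning rather than spelling.

```typescript
interface ScanDoc { id: string; entities: string[]; }
interface DocLink { a: string; b: string; shared: string[]; }

// Link every pair of documents that mention at least one common entity.
function linkDocuments(docs: ScanDoc[]): DocLink[] {
  // Inverted index: normalized entity name -> ids of documents that mention it.
  const index = new Map<string, string[]>();
  for (const doc of docs) {
    const seen = new Set<string>();
    for (const entity of doc.entities) {
      const key = entity.trim().toLowerCase();
      if (key === "" || seen.has(key)) continue; // count each entity once per document
      seen.add(key);
      const ids: string[] = index.get(key) ?? [];
      ids.push(doc.id);
      index.set(key, ids);
    }
  }

  // Each pair of documents under the same entity gets, or extends, one link.
  const links = new Map<string, DocLink>();
  index.forEach((ids, key) => {
    for (let i = 0; i < ids.length; i++) {
      for (let j = i + 1; j < ids.length; j++) {
        const a = ids[i] < ids[j] ? ids[i] : ids[j];
        const b = ids[i] < ids[j] ? ids[j] : ids[i];
        const pairKey = JSON.stringify([a, b]);
        const link: DocLink = links.get(pairKey) ?? { a, b, shared: [] };
        link.shared.push(key);
        links.set(pairKey, link);
      }
    }
  });
  return Array.from(links.values());
}
```

The length of `shared` gives a natural edge weight: two documents that share five entities are probably more closely related than two that share one.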

The app

See GeoGraph OCR in action

A full-featured web app that works on desktop and mobile, online and offline.

geographocrnode.vercel.app

Competitive comparison

Built for what the others missed

Existing tools give you OCR. GeoGraph gives you a structured, user-owned, monetizable knowledge base.

Capability	Google Vision	AWS Textract	Smithsonian JSON	GeoGraph OCR ✦
Text extraction (OCR)	✓	✓	Partial	✓
Entity extraction (NLP)	Limited	Limited	✗	✓ Full
Knowledge graph auto-build	✗	✗	✗	✓
GPS / GIS metadata	✗	✗	✗	✓
User-owned DB rows (RLS)	✗ Vendor lock	✗ Vendor lock	✗	✓ Supabase RLS
Offline-first PWA	✗	✗	✗	✓
AI training data monetization	✗	✗	✗	✓ Marketplace
3D spatial visualization	✗	✗	✗	✓ Three.js

Built for

Who uses GeoGraph OCR

From solo researchers to enterprise institutions — if you work with documents, this is for you.

🏛️

Archivists & Museums

Process entire collections in hours. Auto-extract entities, build provenance graphs, and make holdings searchable by anyone.

~50K institutions globally

📚

Researchers & Historians

Stop spending 40 hours manually cataloging each collection. Let AI extract the knowledge so you can focus on the insights.

10M+ knowledge workers

⚖️

Legal Firms

Automate discovery. Extract key entities from thousands of documents in minutes. Full confidentiality with offline-first storage.

~1.3M firms globally

🤖

AI Companies

Buy structured, verified, historically rich training datasets from the marketplace — the kind of data that pushes models past the data wall.

$1.2B market opportunity

Pricing

Simple, transparent pricing

Start for free with your own API keys. Scale as your corpus grows. Cancel anytime.

Free

$0 / month

For individuals and hobbyists starting their first collections.

100 OCR scans / month
Local IndexedDB storage
Basic knowledge graph
GIS metadata capture
Bring your own API keys
Community support

Get started free

Get paid to live life.
Capture history. Earn income.

GeoGraph is the only OCR platform that gives you fractional ownership of the data you create. When AI companies license the corpus, revenue flows back to contributors — proportional to what you captured.

Step 1

📸

Capture documents

Photograph records at museums, archives, estate sales, workplaces. Every scan is structured and recorded in your account.

Step 2

🧩

Own a data shard

Your structured records are minted as GARD Data Shards — ERC-1155 tokens on-chain. You hold fractional ownership of the corpus.

Step 3

💰

AI companies license

OpenAI, Anthropic, and others buy structured historical datasets from the marketplace. Revenue splits proportionally to shard holders.

Step 4

🔄

Earn more, capture more

To grow your share, capture more records. Visit museums. Explore archives. Build a data portfolio while living your life.
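
The proportional split described in Step 3 comes down to simple arithmetic. The sketch below works in integer cents and uses the largest-remainder method so that payouts always sum to the licensed amount exactly. It is an off-chain illustration only, assuming non-negative integer shard counts; `ShardHolder`, `Payout`, and `splitRevenue` are hypothetical names, and an on-chain distribution would live in a smart contract.

```typescript
interface ShardHolder { holder: string; shards: number; }
interface Payout { holder: string; cents: number; }

// Split an integer number of cents pro rata by shard count. Each share is
// floored, then leftover cents go to the largest fractional remainders.
// Assumes totalCents * shards stays below Number.MAX_SAFE_INTEGER.
function splitRevenue(totalCents: number, holders: ShardHolder[]): Payout[] {
  if (!Number.isInteger(totalCents) || totalCents < 0) {
    throw new RangeError("totalCents must be a non-negative integer");
  }
  const totalShards = holders.reduce((sum, h) => sum + h.shards, 0);
  if (totalShards <= 0) throw new RangeError("no shards outstanding");

  const rows = holders.map((h, order) => {
    const numerator = totalCents * h.shards; // exact share = numerator / totalShards
    return {
      holder: h.holder,
      cents: Math.floor(numerator / totalShards),
      remainder: numerator % totalShards,
      order,
    };
  });

  // Leftover is always smaller than the number of holders.
  let leftover = totalCents - rows.reduce((sum, r) => sum + r.cents, 0);
  // Largest remainder first; ties go to the earlier holder for determinism.
  const byRemainder = rows.slice().sort((x, y) => y.remainder - x.remainder || x.order - y.order);
  for (const row of byRemainder) {
    if (leftover === 0) break;
    row.cents += 1; // same object as in rows, so the update is visible there
    leftover -= 1;
  }
  return rows.map((r) => ({ holder: r.holder, cents: r.cents }));
}
```

With a $100.00 license fee and three holders of one shard each, the split is 3334, 3333, and 3333 cents, so no cent is lost to rounding.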

ERC-1155 Shards · Supabase RLS · Ethers.js 6 · On-chain provenance · Fractionalized ownership

For Investors

Seeking $150K for a $1.2B opportunity

Pre-seed round — 8–10% equity to structure the first 100 archival collections and prove the AI licensing model. The AI training data crisis is real, and the window to own this market is closing.

✓ 50K+ lines TypeScript
✓ Production on Vercel + Supabase
✓ Security audited
✓ v2.16, bi-weekly releases

Talk to us →
Read pitch deck

Questions

Frequently asked

What types of documents can GeoGraph OCR process?

Any photo or scan — handwritten letters, printed records, historical maps, newspaper clippings, legal documents, safety posters, artifacts, and museum placards. Multi-language detection covers Latin scripts, CJK characters, and more.

Does it work offline?

Yes. GeoGraph OCR is an offline-first PWA. All data is stored in IndexedDB on your device. Cloud sync with Supabase is fully opt-in — your data never has to leave your device unless you choose to enable it.

How is my data kept private?

By default everything stays on your device. When you enable cloud sync, data is stored in your own Supabase instance with Row-Level Security — even we cannot read your rows. OCR API calls go directly from your browser to your AI provider using your own key.
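
To show what a direct browser-to-provider call with your own key can look like, here is a sketch that builds a request for the Gemini REST `generateContent` endpoint. The request shape follows Google's public REST documentation as best understood here and should be checked against the current API reference; the prompt wording, the `OcrRequest` type, and `buildGeminiOcrRequest` are assumptions, not GeoGraph's actual code.

```typescript
interface OcrRequest {
  url: string;
  init: { method: "POST"; headers: { [header: string]: string }; body: string };
}

// Build (but do not send) an OCR request for one base64-encoded image.
// The API key travels in a header, so it never appears in the URL.
function buildGeminiOcrRequest(apiKey: string, imageBase64: string, mimeType: string): OcrRequest {
  const model = "gemini-2.5-flash";
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
      body: JSON.stringify({
        contents: [
          {
            parts: [
              { text: "Transcribe all text visible in this image." }, // hypothetical prompt
              { inline_data: { mime_type: mimeType, data: imageBase64 } },
            ],
          },
        ],
      }),
    },
  };
}
```

In the browser the result would be passed straight to `fetch(req.url, req.init)`, so the image and key go to the AI provider without passing through an intermediate server.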

What is the Web3 / data ownership model?

When you capture and structure documents, you can opt in to mint them as GARD Data Shards: ERC-1155 tokens representing fractional ownership of the corpus. When AI companies license datasets, revenue is distributed proportionally to shard holders. Think of it as royalties for the data you help create.

Can I export the knowledge graph?

Yes. Export your entire graph as JSON, CSV, or GraphML. Pro users also get structured AI-training dataset exports compatible with Hugging Face and OpenAI fine-tuning formats.
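
For a sense of what a fine-tuning export involves, the sketch below converts structured records into the JSON Lines chat format that OpenAI documents for fine-tuning: one JSON object per line, each with a `messages` array. The `TrainingRecord` shape, the system prompt, and `toFineTuneJsonl` are hypothetical; the actual Pro export may differ.

```typescript
interface TrainingRecord {
  sourceText: string;
  entities: { kind: string; name: string }[];
}

// One training example per line: the document text is the user turn and the
// structured entities are the assistant turn.
function toFineTuneJsonl(records: TrainingRecord[]): string {
  return records
    .map((r) =>
      JSON.stringify({
        messages: [
          { role: "system", content: "Extract the entities mentioned in the document as JSON." },
          { role: "user", content: r.sourceText },
          { role: "assistant", content: JSON.stringify({ entities: r.entities }) },
        ],
      })
    )
    .join("\n");
}
```

Because `JSON.stringify` escapes newlines inside strings, multi-line document text still produces exactly one line per record, which is what JSONL readers expect.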

Do I need a Google Gemini API key?

On the Free plan, yes — bring your own Gemini or OpenAI key (Gemini 2.5 Flash has a generous free tier). On paid plans, processing credits are included. Local model support via Ollama and LM Studio is also available.

Turn any document into structured intelligence
