Over an intense three-month period, I had the privilege of leading the architecture and development of Docuvera - Loss360 at Trellissoft—an enterprise SaaS platform that uses AI to automate insurance loss run document processing. The accelerated timeline demanded architectural foresight: we had to anticipate scalability challenges and technical bottlenecks before they materialized. It was one of those projects that starts with a deceptively simple problem statement and quickly reveals layers of fascinating technical challenges.
In this post, I'll share the architectural decisions, technical challenges, and lessons learned from building a production-grade multi-tenant AI platform from the ground up. Whether you're building document processing systems, multi-tenant SaaS, or integrating LLMs into enterprise products, I hope these insights prove valuable.
"The best architectures emerge not from following best practices blindly, but from deeply understanding your constraints and making deliberate trade-offs."
The Problem: Why OCR Alone Isn't Enough
Insurance underwriters deal with loss run reports daily—documents containing historical claims data that are critical for risk assessment and pricing. Here's the catch: every insurance carrier formats these reports differently. Some are PDFs with neat tables, others are scanned images of faxed documents from the 1990s, and some are Excel files with creative layouts that defy any templating logic.
The traditional approach of using OCR (Optical Character Recognition) gets you 60% of the way there—you extract text from the document. But then what? You have unstructured text with:
- Tables that don't follow a consistent structure
- Field labels that vary by carrier ("Paid Loss" vs "Total Paid" vs "Indemnity Paid")
- Contextual information scattered across pages
- Claims data intermixed with policy information
- Handwritten notes on scanned documents
This is where Large Language Models (LLMs) become game-changers. An LLM doesn't just read text—it understands context, can handle variations in terminology, and can extract structured data from unstructured layouts. The combination of OCR + LLM is what makes automated loss run processing actually work.
Architectural Decision #1: Multi-Tenant Isolation Strategy
One of the first major decisions I faced was how to implement multi-tenancy. We had three main options:
Option 1: Shared Schema with Tenant ID
All tenants share the same database tables, with a tenant_id column on every table.
Simple to implement, easy to scale horizontally.
The Problem: A single bug in a WHERE clause can leak data across tenants. In insurance, that's not just embarrassing—it's potentially illegal. The fail-open nature of this approach was a non-starter.
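To make the fail-open risk concrete, here is a minimal sketch of the shared-schema pattern, using a hypothetical SQLAlchemy Claim model rather than our actual code:

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Claim(Base):
    __tablename__ = "claims"
    id = Column(Integer, primary_key=True)
    tenant_id = Column(Integer, nullable=False, index=True)  # every table needs this
    claim_number = Column(String)

def get_claims(session: Session, tenant_id: int):
    # Correct: every single query must remember the tenant filter
    return session.query(Claim).filter(Claim.tenant_id == tenant_id).all()

def get_claims_buggy(session: Session):
    # One forgotten filter and the query happily returns every
    # tenant's claims. The failure mode is silent, which is the problem.
    return session.query(Claim).all()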
Option 2: Separate Database Per Tenant
Each tenant gets their own PostgreSQL database. Maximum isolation, zero chance of cross-tenant queries.
The Problem: Operational nightmare. Managing hundreds of database instances, backup strategies, connection pooling, and monitoring? Not scalable for a startup.
Option 3: Separate Schema Per Tenant (The Winner)
Each tenant gets their own PostgreSQL schema within a shared database cluster. We use PostgreSQL's
search_path to isolate queries:
from functools import wraps
from flask import g, request, jsonify
import jwt

def set_tenant_context(f):
    """Middleware to set PostgreSQL schema based on authenticated tenant"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract and verify the tenant from the JWT bearer token
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        payload = jwt.decode(token, app.config['SECRET_KEY'], algorithms=['HS256'])
        tenant_id = payload['tenant_id']

        # Set PostgreSQL search_path to the tenant schema
        # (tenant_id comes from a signed token, never from raw user input)
        schema_name = f"tenant_{tenant_id}"
        g.db.execute(f"SET search_path TO {schema_name}, public")

        # Store in request context for logging
        g.tenant_id = tenant_id
        g.schema_name = schema_name
        g.user_id = payload.get('user_id')  # user id claim from the token, used for audit fields

        return f(*args, **kwargs)
    return decorated_function

# Every API endpoint gets this decorator
@app.route('/api/submissions')
@set_tenant_context
def get_submissions():
    # This query is automatically scoped to the tenant's schema
    results = db.session.query(Submission).all()
    return jsonify(results)
Why this works: It's fail-closed by design. If something goes wrong with tenant context detection, the query fails rather than returning another tenant's data. We get excellent isolation without the operational overhead of separate databases. Plus, we can easily export a tenant's data by dumping their schema.
Architectural Decision #2: Asynchronous Processing Pipeline
Processing a 20-page scanned loss run document takes time:
- OCR processing: 30-60 seconds
- LLM extraction: 40-90 seconds
- Data validation & storage: 5-10 seconds
That adds up to roughly one to three minutes per document. You can't hold an HTTP connection open that long. We needed an asynchronous architecture.
The Pipeline Architecture
import uuid
from datetime import datetime

import redis
from rq import Queue

# Initialize Redis queue
redis_conn = redis.Redis()
processing_queue = Queue('document_processing', connection=redis_conn)

# Step 1: User uploads document (API endpoint)
@app.route('/api/upload', methods=['POST'])
@set_tenant_context
def upload_document():
    file = request.files['document']
    lob_id = request.form.get('lob_id')

    # Save to S3 under a tenant-prefixed path
    # (s3_client and BUCKET_NAME are configured at application startup)
    s3_key = f"{g.tenant_id}/submissions/{uuid.uuid4()}/{file.filename}"
    s3_client.upload_fileobj(file, BUCKET_NAME, s3_key)

    # Create submission record
    submission = Submission(
        tenant_id=g.tenant_id,
        lob_id=lob_id,
        document_path=s3_key,
        status='queued',
        uploaded_by=g.user_id
    )
    db.session.add(submission)
    db.session.commit()

    # Queue processing job
    job = processing_queue.enqueue(
        process_document_task,
        submission_id=submission.id,
        tenant_id=g.tenant_id,
        job_timeout='10m'
    )

    return jsonify({
        'submission_id': submission.id,
        'status': 'queued',
        'job_id': job.id
    }), 202

# Step 2: Background worker processes document
def process_document_task(submission_id, tenant_id):
    # Set tenant context in worker
    set_tenant_schema(tenant_id)

    submission = Submission.query.get(submission_id)
    submission.status = 'processing'
    submission.started_at = datetime.utcnow()
    db.session.commit()

    try:
        # Download from S3
        document = s3_client.get_object(
            Bucket=BUCKET_NAME,
            Key=submission.document_path
        )

        # Get tenant's LOB configuration
        lob = LineOfBusiness.query.get(submission.lob_id)
        ocr_model = get_ocr_model(lob.ocr_model_id)
        llm_model = get_llm_model(lob.llm_model_id)
        extraction_schema = lob.get_fields_config()

        # OCR processing
        extracted_text = ocr_model.process(document)

        # LLM extraction
        structured_data = llm_model.extract(
            text=extracted_text,
            schema=extraction_schema
        )

        # Store results
        for record in structured_data:
            result = ExtractionResult(
                submission_id=submission_id,
                data=record
            )
            db.session.add(result)

        submission.status = 'completed'
        submission.completed_at = datetime.utcnow()
        db.session.commit()

        # Notify user via WebSocket
        notify_user(
            tenant_id=tenant_id,
            user_id=submission.uploaded_by,
            message=f"Submission {submission_id} processed successfully"
        )

    except Exception as e:
        # Roll back any partial work before recording the failure
        db.session.rollback()
        submission.status = 'failed'
        submission.error_message = str(e)
        db.session.commit()

        notify_user(
            tenant_id=tenant_id,
            user_id=submission.uploaded_by,
            message=f"Submission {submission_id} failed: {e}",
            type='error'
        )
Key Insights:
- Immediate Response: The upload endpoint returns a 202 Accepted immediately, giving users instant feedback
- Worker Scalability: We can run multiple worker instances on different machines, scaling processing capacity independently from the web tier
- Tenant Context in Workers: Critical detail—workers must also set tenant context to ensure data isolation (a sketch of the set_tenant_schema helper follows this list)
- Real-Time Updates: WebSocket notifications keep users informed without polling
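The set_tenant_schema helper the worker calls isn't shown above; here is a minimal sketch of what it can look like, assuming the worker reuses the same SQLAlchemy session object as the web tier:

from sqlalchemy import text

def set_tenant_schema(tenant_id: str) -> None:
    """Point the worker's database session at the tenant's schema."""
    schema_name = f"tenant_{tenant_id}"
    # Same fail-closed idea as the web tier: if the schema doesn't exist,
    # subsequent queries fail instead of reading another tenant's data
    db.session.execute(text(f'SET search_path TO "{schema_name}", public'))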
The LLM Integration Challenge: Dynamic Schemas
Here's a requirement that kept me up at night: Each tenant should be able to define their own extraction fields per line of business.
Tenant A might want: claim_number, loss_date, paid_amount,
reserved_amount.
Tenant B might want: claim_id, incident_date, total_incurred,
claim_status, adjuster_name.
Same documents, different extraction requirements. How do you design a database schema for that?
Solution: Metadata-Driven Extraction with JSONB
-- LOB Configuration Table
CREATE TABLE line_of_business (
id UUID PRIMARY KEY,
name VARCHAR(255),
ocr_model_id VARCHAR(100),
llm_model_id VARCHAR(100),
-- Field definitions as JSONB
fields_config JSONB,
-- Example: [
-- {
-- "key": "claim_number",
-- "label": "Claim Number",
-- "type": "string",
-- "scope": "per_claim",
-- "required": true
-- },
-- {
-- "key": "loss_date",
-- "label": "Date of Loss",
-- "type": "date",
-- "scope": "per_claim",
-- "required": true
-- }
-- ]
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Extraction Results Table (flexible schema)
CREATE TABLE extraction_results (
id UUID PRIMARY KEY,
submission_id UUID REFERENCES submissions(id),
-- Store extracted data as JSONB
extracted_data JSONB,
-- Example: {
-- "claim_number": "CLM-2024-001234",
-- "loss_date": "2024-03-15",
-- "paid_amount": 25000.00,
-- "reserved_amount": 10000.00
-- }
-- Track human edits
is_edited BOOLEAN DEFAULT FALSE,
edited_by UUID REFERENCES users(id),
edited_at TIMESTAMP,
confidence_score DECIMAL(5,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- GIN index for fast JSONB queries
CREATE INDEX idx_extracted_data_gin
ON extraction_results USING gin(extracted_data);
How it works in practice:
- Tenant admin configures fields through the UI
- Configuration stored as JSONB in fields_config
- At processing time, we read the config and construct the LLM prompt dynamically
- LLM extracts data according to the schema
- Results stored as JSONB, queryable via PostgreSQL's excellent JSON operators
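As a rough illustration of that last point, here is the shape of a dashboard-style query against the JSONB results via SQLAlchemy's raw SQL support; the specific filter is just an example, but the @> containment operator is exactly the kind of query the GIN index accelerates:

from sqlalchemy import text

def count_open_claims(session, submission_id):
    # JSONB containment (@>) is what the GIN index speeds up
    query = text("""
        SELECT count(*)
        FROM extraction_results
        WHERE submission_id = :submission_id
          AND extracted_data @> '{"claim_status": "Open"}'
    """)
    return session.execute(query, {"submission_id": submission_id}).scalar()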
The LLM Prompt Construction
def build_extraction_prompt(text: str, fields_config: list) -> str:
    """Build LLM prompt from field configuration"""
    # Group fields by scope
    per_document_fields = [f for f in fields_config if f['scope'] == 'per_document']
    per_claim_fields = [f for f in fields_config if f['scope'] == 'per_claim']

    prompt = f"""You are an expert insurance document analyst. Extract structured data from the following loss run document.
Document Text:
{text}
Extract the following information:
Document-Level Fields (extract once):
"""
    for field in per_document_fields:
        prompt += f"\n- {field['label']} ({field['key']}): {field.get('description', '')}"

    prompt += """
Claim-Level Fields (extract for each claim in the document):
"""
    for field in per_claim_fields:
        prompt += f"\n- {field['label']} ({field['key']}): {field.get('description', '')}"

    prompt += """
Return the data as JSON in this format:
{
  "document_fields": {
    "field_key": "extracted_value"
  },
  "claims": [
    {
      "field_key": "extracted_value"
    }
  ]
}
If a field cannot be found, use null. Be precise with dates and numbers."""

    return prompt
Handling Poor Quality Documents
One of the biggest surprises: how bad some loss run documents are. We're talking about:
- 20-year-old faxed documents that have been photocopied multiple times
- Skewed scans at 45-degree angles
- Handwritten notes in margins
- Coffee stains obscuring critical data
- Multi-column layouts with text flowing in unexpected directions
Standard OCR fails spectacularly on these. We had to build a preprocessing pipeline:
import cv2
import numpy as np
from PIL import Image

def preprocess_image(image_path: str) -> np.ndarray:
    """Preprocess scanned document for better OCR"""
    # Load image
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect skew: threshold the inverted image so that only text pixels
    # (not the white background) contribute to the bounding rectangle
    inverted = cv2.bitwise_not(gray)
    text_mask = cv2.threshold(
        inverted, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU
    )[1]
    coords = np.column_stack(np.where(text_mask > 0))
    angle = cv2.minAreaRect(coords)[-1]
    # (angle handling below follows the convention of older OpenCV releases)
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate image to correct skew
    (h, w) = gray.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        gray, M, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )

    # Denoise
    denoised = cv2.fastNlMeansDenoising(rotated)

    # Increase contrast using adaptive thresholding
    thresh = cv2.adaptiveThreshold(
        denoised,
        255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        11,
        2
    )

    # Morphological opening to remove small specks of noise
    kernel = np.ones((1, 1), np.uint8)
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)

    return opening
This preprocessing improved OCR accuracy from ~65% to ~88% on poor quality documents. The LLM then handles the remaining errors by using context to correct OCR mistakes.
The Human-in-the-Loop Design Pattern
Here's a truth that took me a while to accept: No AI system will be 100% accurate for financial data extraction. And when you're dealing with insurance claims worth millions of dollars, 95% accuracy isn't good enough.
The solution isn't to keep improving the AI until it's perfect (impossible). The solution is to design for human validation from the start.
The Review Interface Design
We built a split-screen interface: original document on the left, extracted data table on the right. Key features:
- Visual Highlighting: Click a cell in the table, the corresponding text in the PDF highlights
- Inline Editing: Double-click any cell to edit. Changes saved with audit trail
- Confidence Indicators: Low-confidence extractions highlighted in yellow
- Quick Navigation: Jump to next uncertain field with keyboard shortcuts
- Add/Remove Rows: If AI missed a claim or hallucinated one, users can fix it
The result: Users can validate a 50-claim loss run in 3-5 minutes. Without our system, manual entry took 45-60 minutes. That's a 10-15x productivity improvement, even with human validation.
Real-Time Notifications: Getting WebSockets Right
Users upload documents and need to know when processing completes. Polling is terrible UX and wasteful. We needed WebSockets.
The challenge: maintaining WebSocket connections in a multi-tenant environment where users might be connected for hours.
import redis
from flask_socketio import SocketIO, join_room

# message_queue lets processes other than the web server (for example the
# RQ workers) publish Socket.IO events through Redis
socketio = SocketIO(app, cors_allowed_origins="*", message_queue="redis://")
redis_client = redis.Redis()

# When user connects
@socketio.on('connect')
def handle_connect():
    # Authenticate user from the WebSocket handshake
    token = request.args.get('token')
    user = authenticate_token(token)
    if not user:
        return False  # Reject connection

    # Join tenant-specific room
    room = f"tenant_{user.tenant_id}"
    join_room(room)

    # Store connection mapping
    redis_client.hset(
        f"ws_connections:{user.tenant_id}",
        user.id,
        request.sid
    )

# Function called by background workers
def notify_user(tenant_id: str, user_id: str, message: str, type: str = 'info'):
    """Send notification to specific user"""
    # Store notification in database (for history)
    notification = Notification(
        tenant_id=tenant_id,
        user_id=user_id,
        message=message,
        type=type,
        read=False
    )
    db.session.add(notification)
    db.session.commit()

    # Get user's socket connection
    connection_id = redis_client.hget(
        f"ws_connections:{tenant_id}",
        user_id
    )

    if connection_id:
        # User is currently connected, send real-time notification
        socketio.emit(
            'notification',
            {
                'id': notification.id,
                'message': message,
                'type': type,
                'timestamp': notification.created_at.isoformat()
            },
            room=connection_id.decode()
        )
Key Design Decisions:
- Redis for Connection Tracking: WebSocket connections are ephemeral, so Redis tracks which users are currently connected (see the disconnect cleanup sketch after this list)
- Database Persistence: All notifications stored in DB, so users see them even if they weren't connected when sent
- Tenant Rooms: Users join tenant-specific rooms, preventing cross-tenant notification leaks
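One detail worth adding: the Redis mapping should be cleaned up when a socket closes, or stale entries accumulate. Here is a sketch of a disconnect handler, assuming the connect handler also records a reverse mapping from socket id to tenant and user (a hypothetical key not shown above):

@socketio.on('disconnect')
def handle_disconnect():
    # Assumes connect also stored a reverse mapping
    # "ws_sid:<socket id>" -> "tenant_id:user_id" (hypothetical key,
    # not shown in the connect handler above)
    entry = redis_client.get(f"ws_sid:{request.sid}")
    if not entry:
        return
    tenant_id, user_id = entry.decode().split(':', 1)
    redis_client.hdel(f"ws_connections:{tenant_id}", user_id)
    redis_client.delete(f"ws_sid:{request.sid}")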
Lessons Learned: What I'd Do Differently
1. Start with Observability
I added comprehensive logging and monitoring about 6 weeks into the project. Should have done it on day 1. When an LLM extraction fails, you need to see the prompt, the raw response, the document text, and the configuration that generated the prompt. We lost debugging hours early on because we didn't have this.
Recommendation: Implement structured logging from the start. Every processing step should emit events with correlation IDs that link back to the submission.
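A minimal sketch of that idea using Python's standard logging module; the logger name and event fields are illustrative:

import json
import logging
import uuid

logger = logging.getLogger("loss360.pipeline")

def log_event(step: str, submission_id: str, correlation_id: str, **fields):
    """Emit one structured event per pipeline step, keyed by submission."""
    logger.info(json.dumps({
        "step": step,                      # e.g. "ocr", "llm_extraction"
        "submission_id": submission_id,
        "correlation_id": correlation_id,  # links every step of one processing run
        **fields,
    }))

# Usage inside the worker (illustrative values):
correlation_id = str(uuid.uuid4())
log_event("ocr", submission_id="sub-123", correlation_id=correlation_id,
          pages=20, duration_ms=41000)
log_event("llm_extraction", submission_id="sub-123", correlation_id=correlation_id,
          prompt_chars=5870, model="llm-model-id")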
2. Build Admin Tools Early
Later in the project, we needed to give specific tenants beta access to new features. I had to manually update database rows. Should have built a proper admin interface for feature flags from the beginning.
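Something as simple as the following would have been enough; a hypothetical sketch of a tenant-scoped feature flag check (the feature_flags table and helper are made up for illustration):

from sqlalchemy import text

def feature_enabled(session, tenant_id: str, flag: str) -> bool:
    """Return True if a feature flag is switched on for this tenant."""
    # Hypothetical feature_flags table: (tenant_id, flag, enabled)
    row = session.execute(
        text("""
            SELECT enabled FROM feature_flags
            WHERE tenant_id = :tenant_id AND flag = :flag
        """),
        {"tenant_id": tenant_id, "flag": flag},
    ).first()
    return bool(row and row[0])

# Gate a beta feature in an endpoint:
# if feature_enabled(db.session, g.tenant_id, "beta_bulk_export"): ...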
3. LLM Response Validation is Critical
LLMs occasionally hallucinate or return malformed JSON. Early on, this would crash the processing pipeline. We learned to:
- Use JSON Schema validation on LLM responses (a sketch follows this list)
- Retry with modified prompts if validation fails
- Have fallback extraction strategies
- Log all failures with context for manual review
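Here is a condensed sketch of the first two points, using the jsonschema package; it assumes a lower-level complete(prompt) call on the model wrapper, and the schema shown is a toy version (the real one is generated from the tenant's fields_config):

import json
import jsonschema

CLAIMS_SCHEMA = {
    "type": "object",
    "required": ["document_fields", "claims"],
    "properties": {
        "document_fields": {"type": "object"},
        "claims": {"type": "array", "items": {"type": "object"}},
    },
}

def extract_with_validation(llm_model, prompt: str, max_attempts: int = 3) -> dict:
    """Call the LLM and retry with a corrective prompt if the output is invalid."""
    last_error = None
    for attempt in range(max_attempts):
        raw = llm_model.complete(prompt)
        try:
            data = json.loads(raw)
            jsonschema.validate(data, CLAIMS_SCHEMA)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as e:
            last_error = e
            # Tell the model what was wrong and ask for corrected JSON only
            prompt = (
                f"{prompt}\n\nYour previous answer was invalid ({e}). "
                "Return only valid JSON matching the requested format."
            )
    raise ValueError(f"LLM output failed validation after {max_attempts} attempts: {last_error}")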
4. Don't Underestimate Data Migration Complexity
When you have separate schemas per tenant and need to deploy a database migration, you need to run that migration on every tenant schema. We built migration tooling that:
- Lists all tenant schemas
- Runs migrations in parallel with progress tracking
- Handles failures gracefully (some tenants might be mid-operation)
- Validates migrations in a test schema first
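Our tooling is tied to our migration framework, but the core loop looks roughly like this; a simplified sketch that discovers tenant schemas from information_schema and applies a raw SQL migration to each:

from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # connection string elided

def list_tenant_schemas():
    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        ))
        return [r[0] for r in rows]

def migrate_schema(schema: str, migration_sql: str):
    # Each schema migrates in its own transaction, so one failure
    # doesn't leave other tenants half-migrated
    with engine.begin() as conn:
        conn.execute(text(f'SET LOCAL search_path TO "{schema}"'))
        conn.execute(text(migration_sql))
    return schema

def migrate_all(migration_sql: str, workers: int = 4):
    schemas = list_tenant_schemas()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for done in pool.map(lambda s: migrate_schema(s, migration_sql), schemas):
            print(f"migrated {done}")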
Performance Optimizations That Mattered
GPU Resource Pooling
LLM inference requires GPUs. We can't allocate one GPU per tenant. Solution: shared GPU pool with request queuing and model caching. Key insight—keep models warm in GPU memory for frequently used configurations.
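The cache itself can be a small keyed structure with LRU eviction; a sketch where the loader function and capacity are illustrative, and where real eviction also has to release GPU memory:

from collections import OrderedDict

class ModelCache:
    """Keep the most recently used models resident; evict the oldest."""

    def __init__(self, load_fn, capacity: int = 4):
        self._load = load_fn          # e.g. loads weights onto the GPU
        self._capacity = capacity
        self._models = OrderedDict()  # model_id -> loaded model

    def get(self, model_id: str):
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as recently used
            return self._models[model_id]
        if len(self._models) >= self._capacity:
            evicted_id, evicted = self._models.popitem(last=False)
            del evicted                          # real code frees GPU memory here
        model = self._load(model_id)
        self._models[model_id] = model
        return model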
Incremental OCR for Large Documents
Instead of processing a 50-page document as one blob, we process pages in parallel and stream results. Users see extracted data appear progressively rather than waiting for the entire document.
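Conceptually it looks like the following sketch, where the page list, OCR call, and callback are stand-ins for our actual implementation:

from concurrent.futures import ThreadPoolExecutor, as_completed

def ocr_document_incremental(pages, ocr_model, on_page_done, max_workers: int = 8):
    """OCR pages in parallel and surface each result as soon as it's ready."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_model.process, page): idx
                   for idx, page in enumerate(pages)}
        for future in as_completed(futures):
            idx = futures[future]
            results[idx] = future.result()
            on_page_done(idx, results[idx])   # e.g. push partial text over WebSocket
    # Reassemble in page order once everything has finished
    return [results[i] for i in range(len(pages))]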
PostgreSQL JSONB Indexing
The GIN indexes on JSONB columns are crucial for dashboard queries. Without them, aggregations across thousands of extracted records were taking 10+ seconds. With proper indexing: sub-200ms.
The Business Impact
By the numbers:
- 85% time reduction in loss run processing (45 min → 5 min with validation)
- 92% extraction accuracy before human review
- 10+ active tenants in production
- ~2 minutes average processing time per document
But the real impact? Underwriters told us they can now focus on actually underwriting instead of data entry. That's the goal—AI augmenting human expertise, not replacing it.
Advice for Building Similar Systems
If You're Building Multi-Tenant SaaS
- Choose your isolation strategy early and stick with it
- Build tenant context handling as middleware, not sprinkled throughout code
- Test cross-tenant isolation with dedicated test suites (example after this list)
- Make tenant ID visible in all logs and traces
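For the isolation-testing point above, here is the kind of pytest-style check worth automating; the fixtures and the per-submission endpoint are hypothetical:

def test_cross_tenant_isolation(client, tenant_a_token, tenant_b_submission_id):
    """A tenant must never be able to read another tenant's submission."""
    response = client.get(
        f"/api/submissions/{tenant_b_submission_id}",
        headers={"Authorization": f"Bearer {tenant_a_token}"},
    )
    # With schema-per-tenant isolation the row simply doesn't exist in
    # tenant A's schema, so the API should return 404, never 200
    assert response.status_code == 404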
If You're Integrating LLMs
- LLMs are probabilistic—design for graceful failure
- Always validate LLM outputs with schemas
- Cache identical requests, since LLM inference is expensive (a caching sketch follows this list)
- Log everything—prompts, responses, tokens used
- Version your prompts and track which version produced which results
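For the caching point above, a sketch of a Redis-backed cache keyed by a hash of the prompt and model version; the key format and the complete(prompt) call are assumptions:

import hashlib
import json
import redis

cache = redis.Redis()

def cached_llm_call(llm_model, prompt: str, model_version: str, ttl_seconds: int = 86400):
    """Return a cached extraction if the exact same prompt was seen before."""
    key = "llm_cache:" + hashlib.sha256(
        f"{model_version}:{prompt}".encode("utf-8")
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = llm_model.complete(prompt)   # assumed lower-level model call
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result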
If You're Processing Documents
- Document quality varies wildly—build preprocessing pipelines
- OCR + LLM is more powerful than either alone
- Always show users the source document alongside extracted data
- Make human corrections easy and track them
Conclusion: Architecture is About Trade-offs
Building Loss360 taught me that great architecture isn't about using the latest technologies or following best practices religiously. It's about understanding your constraints—performance, security, user needs, team capabilities—and making deliberate trade-offs.
We chose separate schemas over shared tables because data isolation was paramount. We chose asynchronous processing over real-time because documents take time to process. We chose human-in-the-loop validation over pure AI because accuracy matters more than full automation.
Each decision had pros and cons. The key is making those trade-offs explicit and building systems that align with your actual requirements, not ideal theoretical scenarios.
"The best system is the one that ships, works reliably, and solves the actual problem—not the one that uses the most cutting-edge technology."
I hope these insights help you in your own journey building AI-powered systems. Feel free to reach out if you want to discuss any of these topics in more depth!