Over an intense three-month period, I had the privilege of leading the architecture and development of Docuvera - Loss360 at Trellissoft—an enterprise SaaS platform that uses AI to automate insurance loss run document processing. The accelerated timeline demanded architectural foresight: we had to anticipate scalability challenges and technical bottlenecks before they materialized. It was one of those projects that starts with a deceptively simple problem statement and quickly reveals layers of fascinating technical challenges.
In this post, I'll share the architectural decisions, technical challenges, and lessons learned from building a production-grade multi-tenant AI platform from the ground up. Whether you're building document processing systems, multi-tenant SaaS, or integrating LLMs into enterprise products, I hope these insights prove valuable.
"The best architectures emerge not from following best practices blindly, but from deeply understanding your constraints and making deliberate trade-offs."
The Problem: Why OCR Alone Isn't Enough
Insurance underwriters deal with loss run reports daily—documents containing historical claims data that are critical for risk assessment and pricing. Here's the catch: every insurance carrier formats these reports differently. Some are PDFs with neat tables, others are scanned images of faxed documents from the 1990s, and some are Excel files with creative layouts that defy any templating logic.
The traditional approach of using OCR (Optical Character Recognition) gets you 60% of the way there—you extract text from the document. But then what? You have unstructured text with:
- Tables that don't follow a consistent structure
- Field labels that vary by carrier ("Paid Loss" vs "Total Paid" vs "Indemnity Paid")
- Contextual information scattered across pages
- Claims data intermixed with policy information
- Handwritten notes on scanned documents
This is where Large Language Models (LLMs) become game-changers. An LLM doesn't just read text—it understands context, can handle variations in terminology, and can extract structured data from unstructured layouts. The combination of OCR + LLM is what makes automated loss run processing actually work.
Architectural Decision #1: Multi-Tenant Isolation Strategy
One of the first major decisions I faced was how to implement multi-tenancy. We had three main options:
Option 1: Shared Schema with Tenant ID
All tenants share the same database tables, with a tenant_id column on every table.
Simple to implement, easy to scale horizontally.
The Problem: A single bug in a WHERE clause can leak data across tenants. In insurance, that's not just embarrassing—it's potentially illegal. The fail-open nature of this approach was a non-starter.
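To make the fail-open risk concrete, here is a minimal sketch of the shared-schema pattern, using a hypothetical SQLAlchemy Claim model rather than our actual code:

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Claim(Base):
    __tablename__ = "claims"
    id = Column(Integer, primary_key=True)
    tenant_id = Column(Integer, nullable=False, index=True)  # every table needs this
    claim_number = Column(String)

def get_claims(session: Session, tenant_id: int):
    # Correct: every single query must remember the tenant filter
    return session.query(Claim).filter(Claim.tenant_id == tenant_id).all()

def get_claims_buggy(session: Session):
    # One forgotten filter and the query happily returns every
    # tenant's claims. The failure mode is silent, which is the problem.
    return session.query(Claim).all()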
Option 2: Separate Database Per Tenant
Each tenant gets their own PostgreSQL database. Maximum isolation, zero chance of cross-tenant queries.
The Problem: Operational nightmare. Managing hundreds of database instances, backup strategies, connection pooling, and monitoring? Not scalable for a startup.
Option 3: Separate Schema Per Tenant (The Winner)
Each tenant gets their own PostgreSQL schema within a shared database cluster. We use PostgreSQL's
search_path to isolate queries:
from functools import wraps
from flask import g, request, jsonify
import jwt

def set_tenant_context(f):
    """Middleware to set PostgreSQL schema based on authenticated tenant"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract and verify the tenant from the JWT bearer token
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        payload = jwt.decode(token, app.config['SECRET_KEY'], algorithms=['HS256'])
        tenant_id = payload['tenant_id']

        # Set PostgreSQL search_path to the tenant schema
        # (tenant_id comes from a signed token, never from raw user input)
        schema_name = f"tenant_{tenant_id}"
        g.db.execute(f"SET search_path TO {schema_name}, public")

        # Store in request context for logging
        g.tenant_id = tenant_id
        g.schema_name = schema_name
        g.user_id = payload.get('user_id')  # user id claim from the token, used for audit fields

        return f(*args, **kwargs)
    return decorated_function

# Every API endpoint gets this decorator
@app.route('/api/submissions')
@set_tenant_context
def get_submissions():
    # This query is automatically scoped to the tenant's schema
    results = db.session.query(Submission).all()
    return jsonify(results)
Why this works: It's fail-closed by design. If something goes wrong with tenant context detection, the query fails rather than returning another tenant's data. We get excellent isolation without the operational overhead of separate databases. Plus, we can easily export a tenant's data by dumping their schema.
Architectural Decision #2: Asynchronous Processing Pipeline
Processing a 20-page scanned loss run document takes time:
- OCR processing: 30-60 seconds
- LLM extraction: 40-90 seconds
- Data validation & storage: 5-10 seconds
That adds up to roughly one to three minutes per document. You can't hold an HTTP connection open that long. We needed an asynchronous architecture.
The Pipeline Architecture
import uuid
from datetime import datetime

import redis
from rq import Queue

# Initialize Redis queue
redis_conn = redis.Redis()
processing_queue = Queue('document_processing', connection=redis_conn)

# Step 1: User uploads document (API endpoint)
@app.route('/api/upload', methods=['POST'])
@set_tenant_context
def upload_document():
    file = request.files['document']
    lob_id = request.form.get('lob_id')

    # Save to S3 under a tenant-prefixed path
    # (s3_client and BUCKET_NAME are configured at application startup)
    s3_key = f"{g.tenant_id}/submissions/{uuid.uuid4()}/{file.filename}"
    s3_client.upload_fileobj(file, BUCKET_NAME, s3_key)

    # Create submission record
    submission = Submission(
        tenant_id=g.tenant_id,
        lob_id=lob_id,
        document_path=s3_key,
        status='queued',
        uploaded_by=g.user_id
    )
    db.session.add(submission)
    db.session.commit()

    # Queue processing job
    job = processing_queue.enqueue(
        process_document_task,
        submission_id=submission.id,
        tenant_id=g.tenant_id,
        job_timeout='10m'
    )

    return jsonify({
        'submission_id': submission.id,
        'status': 'queued',
        'job_id': job.id
    }), 202

# Step 2: Background worker processes document
def process_document_task(submission_id, tenant_id):
    # Set tenant context in worker
    set_tenant_schema(tenant_id)

    submission = Submission.query.get(submission_id)
    submission.status = 'processing'
    submission.started_at = datetime.utcnow()
    db.session.commit()

    try:
        # Download from S3
        document = s3_client.get_object(
            Bucket=BUCKET_NAME,
            Key=submission.document_path
        )

        # Get tenant's LOB configuration
        lob = LineOfBusiness.query.get(submission.lob_id)
        ocr_model = get_ocr_model(lob.ocr_model_id)
        llm_model = get_llm_model(lob.llm_model_id)
        extraction_schema = lob.get_fields_config()

        # OCR processing
        extracted_text = ocr_model.process(document)

        # LLM extraction
        structured_data = llm_model.extract(
            text=extracted_text,
            schema=extraction_schema
        )

        # Store results
        for record in structured_data:
            result = ExtractionResult(
                submission_id=submission_id,
                data=record
            )
            db.session.add(result)

        submission.status = 'completed'
        submission.completed_at = datetime.utcnow()
        db.session.commit()

        # Notify user via WebSocket
        notify_user(
            tenant_id=tenant_id,
            user_id=submission.uploaded_by,
            message=f"Submission {submission_id} processed successfully"
        )

    except Exception as e:
        # Roll back any partial work before recording the failure
        db.session.rollback()
        submission.status = 'failed'
        submission.error_message = str(e)
        db.session.commit()

        notify_user(
            tenant_id=tenant_id,
            user_id=submission.uploaded_by,
            message=f"Submission {submission_id} failed: {e}",
            type='error'
        )
Key Insights:
- Immediate Response: The upload endpoint returns a 202 Accepted immediately, giving users instant feedback
- Worker Scalability: We can run multiple worker instances on different machines, scaling processing capacity independently from the web tier
- Tenant Context in Workers: Critical detail—workers must also set tenant context to ensure data isolation (a sketch of the set_tenant_schema helper follows this list)
- Real-Time Updates: WebSocket notifications keep users informed without polling
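The set_tenant_schema helper the worker calls isn't shown above; here is a minimal sketch of what it can look like, assuming the worker reuses the same SQLAlchemy session object as the web tier:

from sqlalchemy import text

def set_tenant_schema(tenant_id: str) -> None:
    """Point the worker's database session at the tenant's schema."""
    schema_name = f"tenant_{tenant_id}"
    # Same fail-closed idea as the web tier: if the schema doesn't exist,
    # subsequent queries fail instead of reading another tenant's data
    db.session.execute(text(f'SET search_path TO "{schema_name}", public'))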
The LLM Integration Challenge: Dynamic Schemas
Here's a requirement that kept me up at night: Each tenant should be able to define their own extraction fields per line of business.
Tenant A might want: claim_number, loss_date, paid_amount,
reserved_amount.
Tenant B might want: claim_id, incident_date, total_incurred,
claim_status, adjuster_name.
Same documents, different extraction requirements. How do you design a database schema for that?
Solution: Metadata-Driven Extraction with JSONB
-- LOB Configuration Table
CREATE TABLE line_of_business (
id UUID PRIMARY KEY,
name VARCHAR(255),
ocr_model_id VARCHAR(100),
llm_model_id VARCHAR(100),
-- Field definitions as JSONB
fields_config JSONB,
-- Example: [
-- {
-- "key": "claim_number",
-- "label": "Claim Number",
-- "type": "string",
-- "scope": "per_claim",
-- "required": true
-- },
-- {
-- "key": "loss_date",
-- "label": "Date of Loss",
-- "type": "date",
-- "scope": "per_claim",
-- "required": true
-- }
-- ]
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Extraction Results Table (flexible schema)
CREATE TABLE extraction_results (
id UUID PRIMARY KEY,
submission_id UUID REFERENCES submissions(id),
-- Store extracted data as JSONB
extracted_data JSONB,
-- Example: {
-- "claim_number": "CLM-2024-001234",
-- "loss_date": "2024-03-15",
-- "paid_amount": 25000.00,
-- "reserved_amount": 10000.00
-- }
-- Track human edits
is_edited BOOLEAN DEFAULT FALSE,
edited_by UUID REFERENCES users(id),
edited_at TIMESTAMP,
confidence_score DECIMAL(5,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- GIN index for fast JSONB queries
CREATE INDEX idx_extracted_data_gin
ON extraction_results USING gin(extracted_data);
How it works in practice:
- Tenant admin configures fields through the UI
- Configuration stored as JSONB in fields_config
- At processing time, we read the config and construct the LLM prompt dynamically
- LLM extracts data according to the schema
- Results stored as JSONB, queryable via PostgreSQL's excellent JSON operators
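As a rough illustration of that last point, here is the shape of a dashboard-style query against the JSONB results via SQLAlchemy's raw SQL support; the specific filter is just an example, but the @> containment operator is exactly the kind of query the GIN index accelerates:

from sqlalchemy import text

def count_open_claims(session, submission_id):
    # JSONB containment (@>) is what the GIN index speeds up
    query = text("""
        SELECT count(*)
        FROM extraction_results
        WHERE submission_id = :submission_id
          AND extracted_data @> '{"claim_status": "Open"}'
    """)
    return session.execute(query, {"submission_id": submission_id}).scalar()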
The LLM Prompt Construction
def build_extraction_prompt(text: str, fields_config: list) -> str:
    """Build LLM prompt from field configuration"""
    # Group fields by scope
    per_document_fields = [f for f in fields_config if f['scope'] == 'per_document']
    per_claim_fields = [f for f in fields_config if f['scope'] == 'per_claim']

    prompt = f"""You are an expert insurance document analyst. Extract structured data from the following loss run document.
Document Text:
{text}
Extract the following information:
Document-Level Fields (extract once):
"""
    for field in per_document_fields:
        prompt += f"\n- {field['label']} ({field['key']}): {field.get('description', '')}"

    prompt += """
Claim-Level Fields (extract for each claim in the document):
"""
    for field in per_claim_fields:
        prompt += f"\n- {field['label']} ({field['key']}): {field.get('description', '')}"

    prompt += """
Return the data as JSON in this format:
{
  "document_fields": {
    "field_key": "extracted_value"
  },
  "claims": [
    {
      "field_key": "extracted_value"
    }
  ]
}
If a field cannot be found, use null. Be precise with dates and numbers."""

    return prompt
Handling Poor Quality Documents
One of the biggest surprises: how bad some loss run documents are. We're talking about:
- 20-year-old faxed documents that have been photocopied multiple times
- Skewed scans at 45-degree angles
- Handwritten notes in margins
- Coffee stains obscuring critical data
- Multi-column layouts with text flowing in unexpected directions
Standard OCR fails spectacularly on these. We had to build a preprocessing pipeline:
import cv2
import numpy as np
from PIL import Image

def preprocess_image(image_path: str) -> np.ndarray:
    """Preprocess scanned document for better OCR"""
    # Load image
    img = cv2.imread(image_path)

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect skew: threshold the inverted image so that only text pixels
    # (not the white background) contribute to the bounding rectangle
    inverted = cv2.bitwise_not(gray)
    text_mask = cv2.threshold(
        inverted, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU
    )[1]
    coords = np.column_stack(np.where(text_mask > 0))
    angle = cv2.minAreaRect(coords)[-1]
    # (angle handling below follows the convention of older OpenCV releases)
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate image to correct skew
    (h, w) = gray.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(
        gray, M, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE
    )

    # Denoise
    denoised = cv2.fastNlMeansDenoising(rotated)

    # Increase contrast using adaptive thresholding
    thresh = cv2.adaptiveThreshold(
        denoised,
        255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY,
        11,
        2
    )

    # Morphological opening to remove small specks of noise
    kernel = np.ones((1, 1), np.uint8)
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)

    return opening
This preprocessing improved OCR accuracy from ~65% to ~88% on poor quality documents. The LLM then handles the remaining errors by using context to correct OCR mistakes.
The Human-in-the-Loop Design Pattern
Here's a truth that took me a while to accept: No AI system will be 100% accurate for financial data extraction. And when you're dealing with insurance claims worth millions of dollars, 95% accuracy isn't good enough.
The solution isn't to keep improving the AI until it's perfect (impossible). The solution is to design for human validation from the start.
The Review Interface Design
We built a split-screen interface: original document on the left, extracted data table on the right. Key features:
- Visual Highlighting: Click a cell in the table, the corresponding text in the PDF highlights
- Inline Editing: Double-click any cell to edit. Changes saved with audit trail
- Confidence Indicators: Low-confidence extractions highlighted in yellow
- Quick Navigation: Jump to next uncertain field with keyboard shortcuts
- Add/Remove Rows: If AI missed a claim or hallucinated one, users can fix it
The result: Users can validate a 50-claim loss run in 3-5 minutes. Without our system, manual entry took 45-60 minutes. That's a 10-15x productivity improvement, even with human validation.
Real-Time Notifications: Getting WebSockets Right
Users upload documents and need to know when processing completes. Polling is terrible UX and wasteful. We needed WebSockets.
The challenge: maintaining WebSocket connections in a multi-tenant environment where users might be connected for hours.
import redis
from flask_socketio import SocketIO, join_room

# message_queue lets processes other than the web server (for example the
# RQ workers) publish Socket.IO events through Redis
socketio = SocketIO(app, cors_allowed_origins="*", message_queue="redis://")
redis_client = redis.Redis()

# When user connects
@socketio.on('connect')
def handle_connect():
    # Authenticate user from the WebSocket handshake
    token = request.args.get('token')
    user = authenticate_token(token)
    if not user:
        return False  # Reject connection

    # Join tenant-specific room
    room = f"tenant_{user.tenant_id}"
    join_room(room)

    # Store connection mapping
    redis_client.hset(
        f"ws_connections:{user.tenant_id}",
        user.id,
        request.sid
    )

# Function called by background workers
def notify_user(tenant_id: str, user_id: str, message: str, type: str = 'info'):
    """Send notification to specific user"""
    # Store notification in database (for history)
    notification = Notification(
        tenant_id=tenant_id,
        user_id=user_id,
        message=message,
        type=type,
        read=False
    )
    db.session.add(notification)
    db.session.commit()

    # Get user's socket connection
    connection_id = redis_client.hget(
        f"ws_connections:{tenant_id}",
        user_id
    )

    if connection_id:
        # User is currently connected, send real-time notification
        socketio.emit(
            'notification',
            {
                'id': notification.id,
                'message': message,
                'type': type,
                'timestamp': notification.created_at.isoformat()
            },
            room=connection_id.decode()
        )
Key Design Decisions:
- Redis for Connection Tracking: WebSocket connections are ephemeral, so Redis tracks which users are currently connected (see the disconnect cleanup sketch after this list)
- Database Persistence: All notifications stored in DB, so users see them even if they weren't connected when sent
- Tenant Rooms: Users join tenant-specific rooms, preventing cross-tenant notification leaks
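One detail worth adding: the Redis mapping should be cleaned up when a socket closes, or stale entries accumulate. Here is a sketch of a disconnect handler, assuming the connect handler also records a reverse mapping from socket id to tenant and user (a hypothetical key not shown above):

@socketio.on('disconnect')
def handle_disconnect():
    # Assumes connect also stored a reverse mapping
    # "ws_sid:<socket id>" -> "tenant_id:user_id" (hypothetical key,
    # not shown in the connect handler above)
    entry = redis_client.get(f"ws_sid:{request.sid}")
    if not entry:
        return
    tenant_id, user_id = entry.decode().split(':', 1)
    redis_client.hdel(f"ws_connections:{tenant_id}", user_id)
    redis_client.delete(f"ws_sid:{request.sid}")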
Lessons Learned: What I'd Do Differently
1. Start with Observability
I added comprehensive logging and monitoring about 6 weeks into the project. Should have done it on day 1. When an LLM extraction fails, you need to see the prompt, the raw response, the document text, and the configuration that generated the prompt. We lost debugging hours early on because we didn't have this.
Recommendation: Implement structured logging from the start. Every processing step should emit events with correlation IDs that link back to the submission.
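A minimal sketch of that idea using Python's standard logging module; the logger name and event fields are illustrative:

import json
import logging
import uuid

logger = logging.getLogger("loss360.pipeline")

def log_event(step: str, submission_id: str, correlation_id: str, **fields):
    """Emit one structured event per pipeline step, keyed by submission."""
    logger.info(json.dumps({
        "step": step,                      # e.g. "ocr", "llm_extraction"
        "submission_id": submission_id,
        "correlation_id": correlation_id,  # links every step of one processing run
        **fields,
    }))

# Usage inside the worker (illustrative values):
correlation_id = str(uuid.uuid4())
log_event("ocr", submission_id="sub-123", correlation_id=correlation_id,
          pages=20, duration_ms=41000)
log_event("llm_extraction", submission_id="sub-123", correlation_id=correlation_id,
          prompt_chars=5870, model="llm-model-id")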
2. Build Admin Tools Early
Later in the project, we needed to give specific tenants beta access to new features. I had to manually update database rows. Should have built a proper admin interface for feature flags from the beginning.
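Something as simple as the following would have been enough; a hypothetical sketch of a tenant-scoped feature flag check (the feature_flags table and helper are made up for illustration):

from sqlalchemy import text

def feature_enabled(session, tenant_id: str, flag: str) -> bool:
    """Return True if a feature flag is switched on for this tenant."""
    # Hypothetical feature_flags table: (tenant_id, flag, enabled)
    row = session.execute(
        text("""
            SELECT enabled FROM feature_flags
            WHERE tenant_id = :tenant_id AND flag = :flag
        """),
        {"tenant_id": tenant_id, "flag": flag},
    ).first()
    return bool(row and row[0])

# Gate a beta feature in an endpoint:
# if feature_enabled(db.session, g.tenant_id, "beta_bulk_export"): ...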
3. LLM Response Validation is Critical
LLMs occasionally hallucinate or return malformed JSON. Early on, this would crash the processing pipeline. We learned to:
- Use JSON Schema validation on LLM responses (a sketch follows this list)
- Retry with modified prompts if validation fails
- Have fallback extraction strategies
- Log all failures with context for manual review
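Here is a condensed sketch of the first two points, using the jsonschema package; it assumes a lower-level complete(prompt) call on the model wrapper, and the schema shown is a toy version (the real one is generated from the tenant's fields_config):

import json
import jsonschema

CLAIMS_SCHEMA = {
    "type": "object",
    "required": ["document_fields", "claims"],
    "properties": {
        "document_fields": {"type": "object"},
        "claims": {"type": "array", "items": {"type": "object"}},
    },
}

def extract_with_validation(llm_model, prompt: str, max_attempts: int = 3) -> dict:
    """Call the LLM and retry with a corrective prompt if the output is invalid."""
    last_error = None
    for attempt in range(max_attempts):
        raw = llm_model.complete(prompt)
        try:
            data = json.loads(raw)
            jsonschema.validate(data, CLAIMS_SCHEMA)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as e:
            last_error = e
            # Tell the model what was wrong and ask for corrected JSON only
            prompt = (
                f"{prompt}\n\nYour previous answer was invalid ({e}). "
                "Return only valid JSON matching the requested format."
            )
    raise ValueError(f"LLM output failed validation after {max_attempts} attempts: {last_error}")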
4. Don't Underestimate Data Migration Complexity
When you have separate schemas per tenant and need to deploy a database migration, you need to run that migration on every tenant schema. We built migration tooling that:
- Lists all tenant schemas
- Runs migrations in parallel with progress tracking
- Handles failures gracefully (some tenants might be mid-operation)
- Validates migrations in a test schema first
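Our tooling is tied to our migration framework, but the core loop looks roughly like this; a simplified sketch that discovers tenant schemas from information_schema and applies a raw SQL migration to each:

from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://...")  # connection string elided

def list_tenant_schemas():
    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        ))
        return [r[0] for r in rows]

def migrate_schema(schema: str, migration_sql: str):
    # Each schema migrates in its own transaction, so one failure
    # doesn't leave other tenants half-migrated
    with engine.begin() as conn:
        conn.execute(text(f'SET LOCAL search_path TO "{schema}"'))
        conn.execute(text(migration_sql))
    return schema

def migrate_all(migration_sql: str, workers: int = 4):
    schemas = list_tenant_schemas()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for done in pool.map(lambda s: migrate_schema(s, migration_sql), schemas):
            print(f"migrated {done}")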
Performance Optimizations That Mattered
GPU Resource Pooling
LLM inference requires GPUs. We can't allocate one GPU per tenant. Solution: shared GPU pool with request queuing and model caching. Key insight—keep models warm in GPU memory for frequently used configurations.
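The cache itself can be a small keyed structure with LRU eviction; a sketch where the loader function and capacity are illustrative, and where real eviction also has to release GPU memory:

from collections import OrderedDict

class ModelCache:
    """Keep the most recently used models resident; evict the oldest."""

    def __init__(self, load_fn, capacity: int = 4):
        self._load = load_fn          # e.g. loads weights onto the GPU
        self._capacity = capacity
        self._models = OrderedDict()  # model_id -> loaded model

    def get(self, model_id: str):
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as recently used
            return self._models[model_id]
        if len(self._models) >= self._capacity:
            evicted_id, evicted = self._models.popitem(last=False)
            del evicted                          # real code frees GPU memory here
        model = self._load(model_id)
        self._models[model_id] = model
        return model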
Incremental OCR for Large Documents
Instead of processing a 50-page document as one blob, we process pages in parallel and stream results. Users see extracted data appear progressively rather than waiting for the entire document.
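Conceptually it looks like the following sketch, where the page list, OCR call, and callback are stand-ins for our actual implementation:

from concurrent.futures import ThreadPoolExecutor, as_completed

def ocr_document_incremental(pages, ocr_model, on_page_done, max_workers: int = 8):
    """OCR pages in parallel and surface each result as soon as it's ready."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ocr_model.process, page): idx
                   for idx, page in enumerate(pages)}
        for future in as_completed(futures):
            idx = futures[future]
            results[idx] = future.result()
            on_page_done(idx, results[idx])   # e.g. push partial text over WebSocket
    # Reassemble in page order once everything has finished
    return [results[i] for i in range(len(pages))]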
PostgreSQL JSONB Indexing
The GIN indexes on JSONB columns are crucial for dashboard queries. Without them, aggregations across thousands of extracted records were taking 10+ seconds. With proper indexing: sub-200ms.
The Business Impact
By the numbers:
- 85% time reduction in loss run processing (45 min → 5 min with validation)
- 92% extraction accuracy before human review
- 10+ active tenants in production
- ~2 minutes average processing time per document
But the real impact? Underwriters told us they can now focus on actually underwriting instead of data entry. That's the goal—AI augmenting human expertise, not replacing it.
Advice for Building Similar Systems
If You're Building Multi-Tenant SaaS
- Choose your isolation strategy early and stick with it
- Build tenant context handling as middleware, not sprinkled throughout code
- Test cross-tenant isolation with dedicated test suites (example after this list)
- Make tenant ID visible in all logs and traces
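For the isolation-testing point above, here is the kind of pytest-style check worth automating; the fixtures and the per-submission endpoint are hypothetical:

def test_cross_tenant_isolation(client, tenant_a_token, tenant_b_submission_id):
    """A tenant must never be able to read another tenant's submission."""
    response = client.get(
        f"/api/submissions/{tenant_b_submission_id}",
        headers={"Authorization": f"Bearer {tenant_a_token}"},
    )
    # With schema-per-tenant isolation the row simply doesn't exist in
    # tenant A's schema, so the API should return 404, never 200
    assert response.status_code == 404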
If You're Integrating LLMs
- LLMs are probabilistic—design for graceful failure
- Always validate LLM outputs with schemas
- Cache identical requests, since LLM inference is expensive (a caching sketch follows this list)
- Log everything—prompts, responses, tokens used
- Version your prompts and track which version produced which results
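For the caching point above, a sketch of a Redis-backed cache keyed by a hash of the prompt and model version; the key format and the complete(prompt) call are assumptions:

import hashlib
import json
import redis

cache = redis.Redis()

def cached_llm_call(llm_model, prompt: str, model_version: str, ttl_seconds: int = 86400):
    """Return a cached extraction if the exact same prompt was seen before."""
    key = "llm_cache:" + hashlib.sha256(
        f"{model_version}:{prompt}".encode("utf-8")
    ).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)

    result = llm_model.complete(prompt)   # assumed lower-level model call
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result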
If You're Processing Documents
- Document quality varies wildly—build preprocessing pipelines
- OCR + LLM is more powerful than either alone
- Always show users the source document alongside extracted data
- Make human corrections easy and track them
Conclusion: Architecture is About Trade-offs
Building Loss360 taught me that great architecture isn't about using the latest technologies or following best practices religiously. It's about understanding your constraints—performance, security, user needs, team capabilities—and making deliberate trade-offs.
We chose separate schemas over shared tables because data isolation was paramount. We chose asynchronous processing over real-time because documents take time to process. We chose human-in-the-loop validation over pure AI because accuracy matters more than full automation.
Each decision had pros and cons. The key is making those trade-offs explicit and building systems that align with your actual requirements, not ideal theoretical scenarios.
"The best system is the one that ships, works reliably, and solves the actual problem—not the one that uses the most cutting-edge technology."
I hope these insights help you in your own journey building AI-powered systems. Feel free to reach out if you want to discuss any of these topics in more depth!