Project Overview

The Context Search Engine is an AI-powered semantic document search platform that demonstrates the full pipeline from document ingestion to context-aware query results using modern vector search. It is designed as a learning and experimentation lab for students, researchers, and developers who want to understand how semantic search works beyond simple keyword matching.


The system processes documents (PDF, DOCX, TXT), chunks them intelligently, converts text to embeddings using transformer models, and performs similarity search using FAISS. Unlike traditional keyword search, this semantic approach understands meaning and context, returning conceptually related results even when exact words don't match.

Technical Architecture


Document Upload → Text Extraction → Chunking → Embeddings → FAISS Index → Semantic Search

Key Components

NLP & Embeddings: Transformers, PyTorch, Hugging Face, DistilBERT

Vector Search: FAISS, NumPy, similarity search, indexing

Document Processing: PyPDF2, python-docx, Flask, HTML/CSS/JS
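
To make the document-processing component concrete, plain-text extraction for the three supported formats might look roughly like the sketch below. The extract_text dispatcher is an illustrative assumption rather than the project's actual code; the PyPDF2 and python-docx calls are the libraries' standard APIs.

Python - Text Extraction (illustrative sketch)
from pathlib import Path
from PyPDF2 import PdfReader
from docx import Document

def extract_text(path: str) -> str:
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Join the text of every page; extract_text() can return None for image-only pages
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")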

Implementation Details

The core logic of the system can be summarized as: configurable chunking → embedding with a transformer model → FAISS similarity search over chunk vectors. Below is a simplified implementation:

Python - Core Search Engine
from pathlib import Path
from typing import List, Dict, Any
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch

class SemanticSearchEngine:
    def __init__(self, model_name: str = "distilbert-base-uncased", dimension: int = 768):
        self.model_name = model_name
        self.dimension = dimension
        # Load embedding model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        # Initialize FAISS index (L2 distance; with L2-normalized embeddings,
        # ranking by L2 distance is equivalent to ranking by cosine similarity)
        self.index = faiss.IndexFlatL2(dimension)
        self.metadata: Dict[int, Dict[str, Any]] = {}
        self._next_id = 0

Chunking & Embedding Process

Documents are split into overlapping chunks to preserve context, then converted to embeddings using mean pooling over transformer outputs:

Python - Chunking & Embeddings
    def _simple_chunk(self, text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk_words = words[start:end]
            chunks.append(" ".join(chunk_words))
            start += max(chunk_size - overlap, 1)
        return chunks

    @torch.no_grad()
    def _embed_texts(self, texts: List[str]) -> np.ndarray:
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        outputs = self.model(**inputs)
        # Mean pooling over token embeddings, ignoring padding tokens via the attention mask
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # L2-normalize so that L2 distance behaves like cosine similarity
        embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
        return embeddings.cpu().numpy().astype("float32")
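
For completeness, indexing and querying can be sketched on top of these pieces. The add_document and search methods below are illustrative assumptions (the method names and the plain-text-only loading are not necessarily the project's exact API); the FAISS calls themselves (index.add, index.search) are the library's real interface.

Python - Indexing & Search (illustrative sketch)
    def add_document(self, path: str) -> int:
        # Read a plain-text file, chunk it, embed the chunks, and index them
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        chunks = self._simple_chunk(text)
        vectors = self._embed_texts(chunks)
        self.index.add(vectors)
        # FAISS assigns sequential ids, so track which chunk each id refers to
        for chunk in chunks:
            self.metadata[self._next_id] = {"source": path, "text": chunk}
            self._next_id += 1
        return len(chunks)

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        # Embed the query and return the top_k nearest chunks with their metadata
        query_vec = self._embed_texts([query])
        distances, ids = self.index.search(query_vec, top_k)
        results = []
        for dist, idx in zip(distances[0], ids[0]):
            if idx == -1:  # FAISS pads with -1 when fewer than top_k vectors exist
                continue
            hit = dict(self.metadata[int(idx)])
            hit["distance"] = float(dist)
            results.append(hit)
        return results

A call such as engine.search("how does chunk overlap preserve context?", top_k=5) would then return the most semantically similar chunks along with their source file and distance.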

Project Results

Embedding dimensions: 768
Query latency: <200 ms
Supported file formats: 3 (PDF, DOCX, TXT)
Deployment: fully local, privacy-first

System Capabilities

Feature | Description | Status
Semantic Search | Context-aware query results beyond keywords | ✅ Live
Document Processing | PDF, DOCX, TXT with metadata extraction | ✅ Live
Configurable Pipeline | Adjustable models, chunk size, overlap | ✅ Live
Index Management | FAISS-based vector storage & search | ✅ Live

Live Demo


Upload your documents and experience semantic search that understands meaning, not just keywords!


Challenges & Solutions

Document Boundary & Context Loss: Splitting long text into chunks can cut important context at chunk boundaries. Solved by making chunk size and overlap configurable, so that information near a boundary appears in multiple chunks and semantic continuity is preserved across splits.
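
A quick way to see the overlap at work is to chunk a synthetic document and inspect the boundaries (note that constructing the engine loads the DistilBERT weights, so this is a one-off illustration rather than a unit test):

Python - Overlap Check (illustrative)
engine = SemanticSearchEngine()
text = " ".join(f"w{i}" for i in range(1200))  # 1,200 dummy "words"
chunks = engine._simple_chunk(text, chunk_size=500, overlap=50)
# Produces three chunks covering words 0-499, 450-949, and 900-1199:
# words 450-499 and 900-949 each appear in two chunks, so content near a
# boundary is never stranded in a single chunk.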

Model Compatibility & Dimensions: Different models output embeddings of different dimensions, and the FAISS index must match. Solved with an explicit dimension parameter in the configuration, validated when the index is built; the UI prompts users to rebuild the index whenever the model or dimension changes.
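
That validation could be as simple as comparing the configured value against the model's config before building the index. The helper below is a sketch, not the project's actual check; note that DistilBERT-style configs expose the width as dim, while BERT-style configs call it hidden_size.

Python - Dimension Validation (illustrative)
from transformers import AutoConfig

def validate_dimension(model_name: str, dimension: int) -> None:
    # Read the model's output width from its config without loading the weights
    cfg = AutoConfig.from_pretrained(model_name)
    width = getattr(cfg, "hidden_size", None) or getattr(cfg, "dim", None)
    if width is not None and width != dimension:
        raise ValueError(
            f"{model_name} produces {width}-dim embeddings but the FAISS index "
            f"is configured for {dimension}; rebuild the index after changing models."
        )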

Balancing Speed vs Quality: Larger models and higher top_k values improve result quality but increase latency. Solved by making the model, top_k, and num_search_results configurable, letting users empirically find the best trade-off for their specific use case and corpus size.
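
One way to explore those trade-offs empirically is to time the same query under different settings; the helper below assumes the search method sketched earlier and simply reports latency in milliseconds.

Python - Latency Probe (illustrative)
import time

def time_query(engine, query: str, top_k: int) -> float:
    # Return query latency in milliseconds for a given retrieval depth
    start = time.perf_counter()
    engine.search(query, top_k=top_k)
    return (time.perf_counter() - start) * 1000

# Example: compare retrieval depths on the same index
# for k in (5, 10, 20):
#     print(k, round(time_query(engine, "vector similarity search", k), 1), "ms")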

Future Enhancements

Project Links

GitHub Repository · Documentation · Contact Author