Context Search Engine Overview

If you've ever tried to find something in a long PDF or a folder full of documents, you know how painful keyword search can be. You type a word and get dozens of hits that contain it, but keyword matching has no idea what you actually mean.

The Context Search Engine project was built to solve exactly that problem—and to serve as a learning lab for modern semantic search. In this post, we'll walk through what the project does, how it's built, and how you can use it to understand and experiment with semantic search end to end.

Why I Built This Project

I didn't want to build just another search demo. I wanted something that is genuinely useful for searching my own documents and transparent enough to learn from.

So the objective of Context Search Engine is two-fold:

  1. Practical: Provide a local, privacy-friendly semantic search engine for PDFs, Word docs, and text files.
  2. Educational: Expose the entire semantic search pipeline so you can understand, debug, and experiment.

Who This Project Is For

If you want a local, privacy-friendly way to search your own documents, or you're learning how chunking, embeddings, and vector indexes fit together, this project is aimed at you.

The User Experience: A Google-Style Search Lab

When you run the app and open it in your browser (http://localhost:5000), you get a clean, centered search box—very similar to Google.

From there, you can upload PDFs, Word documents, and text files and then query them in natural language.

Each search result includes a relevance score, the source document, the page it came from, and the matching text snippet.

High-Level Architecture

At a high level, the Context Search Engine works like this:

Context Search Engine Architecture

Upload → Extract → Chunk → Embed → Index → Search → Results

The Seven-Step Pipeline

Step       | What Happens                          | Technology
1. Upload  | User uploads files via web UI         | Flask file handling
2. Extract | Text extraction from documents        | PyPDF2, python-docx
3. Chunk   | Split text with configurable overlap  | Custom chunking logic
4. Embed   | Convert chunks to vectors             | HuggingFace Transformers
5. Index   | Store vectors for fast search         | FAISS
6. Search  | Find similar chunks for query         | FAISS similarity search
7. Results | Display with metadata                 | Web UI with ranking
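
The extraction step (step 2) isn't shown in the core class later in this post, so here is a minimal sketch of what it could look like. It assumes PyPDF2 and python-docx as listed in the table above; extract_text is a hypothetical helper for illustration, not the project's actual function.

Python - Text Extraction (Sketch)
from pathlib import Path
from PyPDF2 import PdfReader
from docx import Document

def extract_text(path: str) -> str:
    """Return plain text from a PDF, Word document, or text file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # PyPDF2: concatenate the extracted text of every page
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        # python-docx: join the text of all paragraphs
        return "\n".join(p.text for p in Document(path).paragraphs)
    # Fall back to treating the file as UTF-8 text
    return Path(path).read_text(encoding="utf-8", errors="ignore")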

A Look at the Core Logic

At its heart, the engine is doing three things:

  1. Chunk the text into overlapping windows.
  2. Embed each chunk with a transformer model.
  3. Search with FAISS using the query embedding.

Here's a simplified, self-contained version of the core logic:

Python - Semantic Search Engine Core
from typing import List, Dict, Any
import numpy as np
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

class SemanticSearchEngine:
    def __init__(self, model_name: str = "distilbert-base-uncased", dimension: int = 768):
        self.model_name = model_name
        self.dimension = dimension
        # 1. Load model & tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        # 2. Initialize FAISS index (L2 distance)
        self.index = faiss.IndexFlatL2(dimension)
        # 3. Keep metadata for each vector
        self.metadata: Dict[int, Dict[str, Any]] = {}
        self._next_id = 0

Chunking Logic

Documents are split into overlapping chunks to preserve context at boundaries:

Python - Chunking with Overlap
    def _chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk_words = words[start:end]
            chunks.append(" ".join(chunk_words))
            # Step forward by chunk_size - overlap
            start += max(chunk_size - overlap, 1)
        return chunks
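
To see the overlap at a glance, here is a tiny standalone snippet (not part of the project's code) that reuses the same stepping logic with small numbers:

Python - Overlap Demo (Sketch)
# chunk_size=5, overlap=2 -> the window start advances by 3 words each step
words = [f"w{i}" for i in range(12)]
chunk_size, overlap, start = 5, 2, 0
while start < len(words):
    print(words[start:start + chunk_size])
    start += max(chunk_size - overlap, 1)
# Prints windows starting at words 0, 3, 6, 9 -- adjacent chunks share 2 words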

Embedding and Search

Each chunk is converted to a vector; because the embeddings are L2-normalized, FAISS's L2 distance ranks results in the same order as cosine similarity:

Python - Embedding & Search
    @torch.no_grad()
    def _embed_batch(self, texts: List[str]) -> np.ndarray:
        inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        outputs = self.model(**inputs)
        # Mean pooling over tokens
        embeddings = outputs.last_hidden_state.mean(dim=1)
        embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
        return embeddings.cpu().numpy().astype("float32")

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        query_vec = self._embed_batch([query])
        distances, indices = self.index.search(query_vec, top_k)
        results = []
        # FAISS returns L2 distances, so a smaller score means a closer match
        for score, idx in zip(distances[0], indices[0]):
            if idx == -1 or idx not in self.metadata:
                continue
            meta = self.metadata[idx]
            results.append({
                "score": float(score),
                "document_id": meta["document_id"],
                "page": meta["page"],
                "text": meta["text"],
            })
        return results
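
For search to return anything, chunks first have to be embedded and added to the index along with their metadata. In the project this happens in the upload flow; the add_document method below is a hypothetical sketch of how the pieces above fit together, not the project's actual API.

Python - Indexing Documents (Sketch)
    def add_document(self, document_id: str, text: str, page: int = 1) -> int:
        """Chunk, embed, and index one document; return the number of chunks added."""
        chunks = self._chunk_text(text)
        if not chunks:
            return 0
        vectors = self._embed_batch(chunks)
        # IndexFlatL2 assigns sequential ids starting at 0, so metadata is keyed the same way
        self.index.add(vectors)
        for chunk in chunks:
            self.metadata[self._next_id] = {
                "document_id": document_id,
                "page": page,
                "text": chunk,
            }
            self._next_id += 1
        return len(chunks)

And a quick end-to-end usage example under the same assumptions:

Python - Usage Example (Sketch)
engine = SemanticSearchEngine()
engine.add_document("handbook.pdf", "Employees accrue one vacation day per month worked ...")
for hit in engine.search("how much paid time off do I get?", top_k=3):
    print(hit["score"], hit["document_id"], hit["text"][:80])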

Configuration as a First-Class Citizen

Instead of hard-coding choices, the app uses a JSON configuration file: app_config.json.

JSON - Configuration
{
  "model_repo_id": "distilbert-base-uncased",
  "chunk_size": 500,
  "chunk_overlap": 50,
  "num_search_results": 5,
  "top_k": 10,
  "dimension": 768
}

When you change key parameters like model_repo_id, dimension, or chunking, the app prompts you to rebuild the index, ensuring the FAISS index and embeddings always stay in sync.

Experiment Ideas

Here are some concrete experiments you can run with the Context Search Engine:

1. Compare Chunk Sizes

Vary chunk_size and chunk_overlap in app_config.json and compare how well the returned chunks preserve surrounding context.

2. Test Different Models

Switch the model_repo_id in app_config.json between different embedding models (updating dimension and rebuilding the index as prompted), and observe how query latency and result relevance change.

3. Domain-Specific Documents

Upload documents from a particular domain (medical, legal, technical, finance) and ask domain-specific questions. See how well a general-purpose model performs. This can guide you toward whether you need fine-tuning or a domain-specific model.

Challenges & Lessons Learned

Handling Context at Chunk Boundaries

Problem: Splitting text into chunks risks cutting important sentences in half.

Approach: Use overlapping chunks. The chunk_overlap parameter ensures that adjacent chunks share some words, so relevant information near boundaries appears in more than one chunk.

Matching Models and FAISS

Problem: Different models have different output dimensions. FAISS index dimensions must match exactly.

Approach: Make dimension a required configuration parameter and validate it against the selected model. When either changes, force a full index rebuild so the vectors and FAISS structure remain consistent.
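
One simple way to enforce this is to compare the configured dimension against the model's hidden size when the config is loaded. The snippet below is a sketch of that idea, assuming a BERT-style model whose AutoConfig exposes hidden_size; check_dimension is a hypothetical helper, not the project's actual code.

Python - Dimension Check (Sketch)
import json
from transformers import AutoConfig

def check_dimension(config_path: str = "app_config.json") -> dict:
    with open(config_path) as f:
        config = json.load(f)
    # hidden_size is the embedding width for BERT-style models
    model_config = AutoConfig.from_pretrained(config["model_repo_id"])
    if model_config.hidden_size != config["dimension"]:
        raise ValueError(
            f"dimension={config['dimension']} does not match the model's hidden "
            f"size {model_config.hidden_size}; update the config and rebuild the index"
        )
    return config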

Balancing Speed and Quality

Problem: Bigger models and large top_k values improve result quality but increase latency.

Approach: Make top_k, num_search_results, and model choice configurable so users can find their own speed/accuracy trade-off. On a laptop, top_k = 10 is a good starting point.

Running the Project Locally

You only need Python and pip to get started:

Context Search Engine Dashboard
Bash - Setup Commands
# 1. Clone the repo
git clone https://github.com/inboxpraveen/context-search-engine.git
cd context-search-engine

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the app
python app.py

# 4. Open in browser
# Navigate to http://localhost:5000

On first run, the model will be downloaded (around 250 MB). After that, everything runs locally.

Closing Thoughts

The Context Search Engine is not just a tool—it's a sandbox for learning how modern semantic search systems really work. You can peek under the hood, change parameters, swap models, and see in real time how those decisions affect retrieval quality.

If you're interested in information retrieval, NLP, or building intelligent document systems, cloning the repo and playing with your own documents is one of the most practical ways to get started.

"The best way to understand semantic search is to build one yourself. This project gives you all the pieces—now it's your turn to experiment."

Resources & Links

Tags:

#SemanticSearch #VectorSearch #FAISS #NLP #Embeddings #InformationRetrieval