If you've ever tried to find something in a long PDF or a folder full of documents, you know how painful keyword search can be. You type a word and get dozens of hits that contain it but miss what you actually mean.
The Context Search Engine project was built to solve exactly that problem—and to serve as a learning lab for modern semantic search. In this post, we'll walk through what the project does, how it's built, and how you can use it to understand and experiment with semantic search end to end.
Why I Built This Project
I didn't want just another search demo. I wanted:
- A complete pipeline: upload documents → extract text → chunk → embed → index → search → explain.
- A Google-style interface that feels familiar, but powered by embeddings instead of keywords.
- A playground for students, researchers, and developers to learn, tweak, and break things safely.
So the objective of Context Search Engine is two-fold:
- Practical: Provide a local, privacy-friendly semantic search engine for PDFs, Word docs, and text files.
- Educational: Expose the entire semantic search pipeline so you can understand, debug, and experiment.
Who This Project Is For
- Students & College Grads
Learn what embeddings, FAISS, and semantic search actually do by playing with real documents and queries. - Researchers
Test different chunking strategies, overlap sizes, and models on your own domain-specific corpora. - Developers & Data Scientists
Use it as a microservice, a prototype, or a reference implementation for document search in your own apps.
The User Experience: A Google-Style Search Lab
When you run the app and open it in your browser (http://localhost:5000), you get a clean, centered search box, very similar to Google's.
From there, you can:
- Upload Documents: Drag-and-drop PDFs, DOCX, or TXT files.
- Configure the Engine: Choose your embedding model, chunk size, overlap, and search parameters.
- Manage Documents: View and delete existing documents through a simple dashboard.
- Search Semantically: Type at least 3 characters and see results update with a slight delay (500ms debounce).
Each search result includes:
- Rank (#1, #2, …)
- Document name
- Page number (for PDFs)
- Chunk number
- Matched text snippet
- Relevance score
- A "View Source" button to jump back to context
High-Level Architecture
At a high level, the Context Search Engine works like this:
Upload → Extract → Chunk → Embed → Index → Search → Results
The Seven-Step Pipeline
| Step | What Happens | Technology |
|---|---|---|
| 1. Upload | User uploads files via web UI | Flask file handling |
| 2. Extract | Text extraction from documents | PyPDF2, python-docx |
| 3. Chunk | Split text with configurable overlap | Custom chunking logic |
| 4. Embed | Convert chunks to vectors | HuggingFace Transformers |
| 5. Index | Store vectors for fast search | FAISS |
| 6. Search | Find similar chunks for query | FAISS similarity search |
| 7. Results | Display with metadata | Web UI with ranking |
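Step 2 is worth a closer look, because extraction is where most real-world messiness lives. As a rough sketch (not the project's exact code), per-page extraction with PyPDF2 and paragraph extraction with python-docx could look like this:

from typing import List, Tuple
from PyPDF2 import PdfReader
from docx import Document

def extract_pdf_pages(path: str) -> List[Tuple[int, str]]:
    """Return (page_number, text) pairs so page numbers can be shown in results."""
    reader = PdfReader(path)
    return [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]

def extract_docx_text(path: str) -> str:
    """Return the full text of a .docx file, one paragraph per line."""
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs)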
A Look at the Core Logic
At its heart, the engine is doing three things:
- Chunk the text into overlapping windows.
- Embed each chunk with a transformer model.
- Search with FAISS using the query embedding.
Here's a simplified, self-contained version of the core logic:
from typing import List, Dict, Any
import numpy as np
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
class SemanticSearchEngine:
    def __init__(self, model_name: str = "distilbert-base-uncased", dimension: int = 768):
        self.model_name = model_name
        self.dimension = dimension
        # 1. Load model & tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        # 2. Initialize FAISS index (L2 distance)
        self.index = faiss.IndexFlatL2(dimension)
        # 3. Keep metadata for each vector
        self.metadata: Dict[int, Dict[str, Any]] = {}
        self._next_id = 0
Chunking Logic
Documents are split into overlapping chunks to preserve context at boundaries:
def _chunk_text(self, text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunks.append(" ".join(chunk_words))
        # Step forward by chunk_size - overlap
        start += max(chunk_size - overlap, 1)
    return chunks
Embedding and Search
Each chunk is converted to a vector; because the vectors are L2-normalized, ranking by FAISS L2 distance is equivalent to ranking by cosine similarity:
@torch.no_grad()
def _embed_batch(self, texts: List[str]) -> np.ndarray:
    inputs = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = self.model(**inputs)
    # Mean pooling over real tokens only (mask out padding so short texts aren't skewed)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # L2-normalize so FAISS L2 distance ranks the same way as cosine similarity
    embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)
    return embeddings.cpu().numpy().astype("float32")

def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
    query_vec = self._embed_batch([query])
    distances, indices = self.index.search(query_vec, top_k)
    results = []
    for score, idx in zip(distances[0], indices[0]):
        # FAISS returns -1 when fewer than top_k vectors are indexed
        if idx == -1 or int(idx) not in self.metadata:
            continue
        meta = self.metadata[int(idx)]
        results.append({
            "score": float(score),  # L2 distance: lower means more similar
            "document_id": meta["document_id"],
            "page": meta["page"],
            "text": meta["text"],
        })
    return results
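One piece is intentionally omitted above: adding vectors to the index in the first place. A minimal ingestion sketch, assuming the class shown here (the method name and metadata fields are illustrative, not the project's exact API), could look like this:

def add_document(self, document_id: str, text: str, page: int = 1) -> None:
    """Chunk a document, embed the chunks, and register them in FAISS plus the metadata map."""
    chunks = self._chunk_text(text)
    if not chunks:
        return
    vectors = self._embed_batch(chunks)   # shape: (num_chunks, dimension), float32
    self.index.add(vectors)               # IndexFlatL2 assigns sequential integer ids
    for chunk_text in chunks:
        self.metadata[self._next_id] = {
            "document_id": document_id,
            "page": page,
            "text": chunk_text,
        }
        self._next_id += 1

Because IndexFlatL2 assigns ids sequentially, keeping _next_id in lockstep with index.ntotal is all that is needed to map search hits back to their chunks.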
Configuration as a First-Class Citizen
Instead of hard-coding choices, the app uses a JSON configuration file: app_config.json.
{
"model_repo_id": "distilbert-base-uncased",
"chunk_size": 500,
"chunk_overlap": 50,
"num_search_results": 5,
"top_k": 10,
"dimension": 768
}
When you change key parameters like model_repo_id, dimension, or chunking, the app prompts you to rebuild the index, ensuring the FAISS index and embeddings always stay in sync.
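Loading this file is straightforward. As a rough illustration (the project's actual wiring may differ), the config can be turned into an engine like this:

import json

def load_engine(config_path: str = "app_config.json") -> SemanticSearchEngine:
    """Build a SemanticSearchEngine from the JSON config (illustrative sketch)."""
    with open(config_path) as f:
        config = json.load(f)
    return SemanticSearchEngine(
        model_name=config["model_repo_id"],
        dimension=config["dimension"],
    )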
Experiment Ideas
Here are some concrete experiments you can run with the Context Search Engine:
1. Compare Chunk Sizes
- Try chunk_size = 300, 500, and 1000 (see the sketch after this list)
- Keep overlap constant (e.g., 50)
- Upload the same documents and rebuild the index each time
- Do smaller chunks give more precise but fragmented answers?
- Do larger chunks give more context but sometimes irrelevant text?
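Before touching the UI at all, you can build intuition by comparing chunk counts directly in Python. A small sketch, assuming the SemanticSearchEngine class from earlier and any plain-text file of your own:

engine = SemanticSearchEngine()

with open("sample_document.txt") as f:   # substitute any text file you have locally
    sample_text = f.read()

for chunk_size in (300, 500, 1000):
    chunks = engine._chunk_text(sample_text, chunk_size=chunk_size, overlap=50)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks")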
2. Test Different Models
Switch between:
- distilbert-base-uncased (fast, good starter)
- sentence-transformers/all-MiniLM-L6-v2 (384-dim, lightweight)
- sentence-transformers/all-mpnet-base-v2 (768-dim, high quality)
Observe query latency and relevance of results.
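For the latency side, a rough timing loop is enough to compare models (a sketch, not a rigorous benchmark; it assumes the engine instance from earlier with some documents already indexed):

import time

queries = ["what are the payment terms", "summarize the security requirements"]

start = time.perf_counter()
for q in queries:
    engine.search(q, top_k=5)
elapsed = time.perf_counter() - start
print(f"average query latency: {elapsed / len(queries) * 1000:.1f} ms")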
3. Domain-Specific Documents
Upload documents from a particular domain (medical, legal, technical, finance) and ask domain-specific questions. See how well a general-purpose model performs. This can guide you toward whether you need fine-tuning or a domain-specific model.
Challenges & Lessons Learned
Handling Context at Chunk Boundaries
Problem: Splitting text into chunks risks cutting important sentences in half.
Approach: Use overlapping chunks. The chunk_overlap parameter ensures
that adjacent chunks share some words, so relevant information near boundaries appears in more than one chunk.
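A toy example makes the overlap visible. Using the _chunk_text method from earlier with tiny, word-level settings (chunk_size=5, overlap=2), adjacent chunks repeat their boundary words:

text = "the quick brown fox jumps over the lazy dog near the river"
for i, chunk in enumerate(engine._chunk_text(text, chunk_size=5, overlap=2)):
    print(i, chunk)
# 0 the quick brown fox jumps
# 1 fox jumps over the lazy
# 2 the lazy dog near the
# 3 near the river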
Matching Models and FAISS
Problem: Different models have different output dimensions. FAISS index dimensions must match exactly.
Approach: Make dimension a required configuration parameter and validate it
against the selected model. When either changes, force a full index rebuild so the vectors and FAISS structure
remain consistent.
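A lightweight way to catch a mismatch early (a sketch, not the project's exact validation logic) is to compare the model's reported hidden size against the configured dimension before building the index:

from transformers import AutoConfig

def validate_dimension(model_repo_id: str, configured_dim: int) -> None:
    """Fail fast if the configured FAISS dimension doesn't match the model's output size."""
    config = AutoConfig.from_pretrained(model_repo_id)
    # Most BERT-style configs expose hidden_size
    hidden_size = getattr(config, "hidden_size", None)
    if hidden_size is not None and hidden_size != configured_dim:
        raise ValueError(
            f"{model_repo_id} produces {hidden_size}-dim embeddings, "
            f"but 'dimension' is set to {configured_dim}; fix the config and rebuild the index."
        )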
Balancing Speed and Quality
Problem: Bigger models and large top_k values improve result quality but increase latency.
Approach: Make top_k, num_search_results, and model choice configurable
so users can find their own speed/accuracy trade-off. On a laptop, top_k = 10 is a good starting point.
Running the Project Locally
You only need Python and pip to get started:
# 1. Clone the repo
git clone https://github.com/inboxpraveen/context-search-engine.git
cd context-search-engine
# 2. Install dependencies
pip install -r requirements.txt
# 3. Run the app
python app.py
# 4. Open in browser
# Navigate to http://localhost:5000
On first run, the embedding model is downloaded (around 250 MB). After that, everything runs locally.
Closing Thoughts
The Context Search Engine is not just a tool—it's a sandbox for learning how modern semantic search systems really work. You can peek under the hood, change parameters, swap models, and see in real time how those decisions affect retrieval quality.
If you're interested in information retrieval, NLP, or building intelligent document systems, cloning the repo and playing with your own documents is one of the most practical ways to get started.
"The best way to understand semantic search is to build one yourself. This project gives you all the pieces—now it's your turn to experiment."
Resources & Links
- GitHub Repository
- Full Documentation
- Sentence Transformers Documentation
- FAISS Documentation
- HuggingFace Transformers