Project Overview

The Movie Recommendation System is a production-ready Django application that delivers content-based movie recommendations. Built to scale from thousands to millions of movies, it combines TF-IDF vectorization, SVD dimensionality reduction, and cosine similarity to return relevant suggestions in under 50 milliseconds.

Unlike basic keyword-matching systems, this recommender captures latent semantic relationships between movie features: genres, keywords, production companies, plot summaries, and more. The system evolved from handling 10,000 movies in the MovieLens dataset to processing 930,000+ movies from the TMDB dataset while keeping recommendation time under 50 ms and preserving recommendation quality.

The Challenge

Building a recommendation system that works at scale presents several key challenges:

Technical Architecture

System Architecture

Feature Engineering → TF-IDF Vectorization → SVD Reduction → Cosine Similarity → Recommendations
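
The pipeline above can be sketched end to end with scikit-learn. This is a minimal illustration on toy data, not the project's actual code: the four feature strings, the stop-word setting, and the choice of 2 SVD components are assumptions made so the example runs on a tiny corpus (the training configuration shown later uses 500 components).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "feature documents": one string of combined features per movie (illustrative data)
docs = [
    "action space adventure rebels empire",
    "action space adventure aliens marines",
    "romance drama paris love letters",
    "romance comedy wedding love friends",
]

# 1. TF-IDF: sparse matrix of shape (n_movies, n_terms)
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# 2. SVD: project onto a few latent components (2 here only because the corpus is tiny)
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
variance_retained = svd.explained_variance_ratio_.sum()

# 3. Cosine similarity between every pair of movies in the reduced space
sim = cosine_similarity(X_reduced)

# 4. Recommendations: movies ranked by similarity to movie 0, excluding itself
order = [i for i in np.argsort(-sim[0]) if i != 0]
top_match = order[0]
```

Here the two space-adventure movies end up closest to each other, and `variance_retained` is the quantity a "variance retained" figure would be read from.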

Core ML Pipeline

Machine Learning

TF-IDF, SVD, Cosine Similarity, scikit-learn

Backend & Storage

Django 6.0, Python 3.10+, Parquet, NumPy/SciPy

Data Processing

pandas, NLTK, Sparse Matrices, Chunked Processing

Key Features

Advanced Content-Based Filtering

Scalability & Performance

User Experience

Implementation Highlights

Python - Core Recommendation Logic
import json
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.sparse import load_npz


class MovieRecommender:
    """Production-ready recommendation engine"""

    def __init__(self, model_dir):
        model_dir = Path(model_dir)  # accept either a str or a Path
        # Load pre-trained model artifacts
        self.metadata = pd.read_parquet(model_dir / 'movie_metadata.parquet')
        # Keep the matrix sparse (CSR): densifying all of it with .toarray()
        # would undo the memory savings of sparse storage
        self.similarity_matrix = load_npz(model_dir / 'similarity_matrix.npz').tocsr()

        with open(model_dir / 'title_to_idx.json') as f:
            self.title_to_idx = json.load(f)

    def get_recommendations(self, movie_title, n=15, min_rating=None):
        # Fuzzy match title (find_movie is defined elsewhere in the class)
        matched_title = self.find_movie(movie_title)
        if not matched_title:
            return {'error': 'Movie not found'}

        # Get similarity scores: densify only the query movie's row
        movie_idx = self.title_to_idx[matched_title]
        scores = self.similarity_matrix[movie_idx].toarray().ravel()
        # Rank by descending similarity, excluding the query movie by index
        ranked = [i for i in np.argsort(-scores) if i != movie_idx]

        # Apply filters and format results
        recommendations = []
        for idx in ranked:
            if len(recommendations) >= n:
                break

            movie = self.metadata.iloc[idx]

            # Explicit None check so that a threshold of 0 is not ignored
            if min_rating is not None and movie['vote_average'] < min_rating:
                continue

            recommendations.append({
                'title': movie['title'],
                'rating': f"{movie['vote_average']:.1f}/10",
                'genres': ', '.join(movie['genres']),
                'similarity_score': f"{scores[idx]:.3f}",
                'imdb_link': f"https://www.imdb.com/title/{movie['imdb_id']}"
            })

        return {'recommendations': recommendations}
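
The fuzzy title matching step is not shown in the excerpt above. One plausible way to implement it uses the standard library's `difflib`. This is a sketch under assumptions, not the project's `find_movie`: the standalone function signature, the `cutoff` value, and the sample titles are all hypothetical.

```python
import difflib


def find_movie(query, titles, cutoff=0.6):
    """Return the known title that best matches `query`, or None if nothing is close."""
    lowered = {t.lower(): t for t in titles}  # case-insensitive lookup table
    q = query.strip().lower()
    if q in lowered:                          # exact match, ignoring case
        return lowered[q]
    # Closest title by difflib's similarity ratio; empty list if none reach the cutoff
    matches = difflib.get_close_matches(q, list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


titles = ["The Matrix", "Toy Story", "Inception"]  # illustrative data
best = find_movie("the matrx", titles)
```

With a very large catalogue, a linear scan like this could become a bottleneck for queries without an exact match, so a production system might restrict the candidate set first.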

Training Pipeline

The system includes a complete training pipeline for creating custom models from any movie dataset:

Python - Model Training
from training.train import MovieRecommenderTrainer

# Initialize trainer with configuration
trainer = MovieRecommenderTrainer(
    output_dir='./models',
    use_dimensionality_reduction=True,
    n_components=500  # SVD components
)

# Train on TMDB dataset
df, sim_matrix = trainer.train(
    'path/to/TMDB_movie_dataset.csv',
    quality_threshold='medium',  # 50+ votes
    max_movies=100000          # Top 100K by quality
)

# Models saved automatically:
# - movie_metadata.parquet
# - similarity_matrix.npz
# - title_to_idx.json
# - tfidf_vectorizer.pkl
# - svd_model.pkl

Performance Metrics

Movies Supported: 930K+
Recommendation Time: <50ms
Model Size (100K movies): 180MB
Variance Retained (SVD): 85%

System Capabilities

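| Configuration   | Movies | Training Time | Memory | Model Size |
|-----------------|--------|---------------|--------|------------|
| Demo (Included) | 2K     | N/A           | 50MB   | 8MB        |
| Small           | 10K    | 2 min         | 500MB  | 40MB       |
| Medium ⭐       | 100K   | 15 min        | 2GB    | 180MB      |
| Large           | 930K+  | 60 min        | 6GB    | 800MB      |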

Live Demo

See the System in Action

Watch how the recommendation engine handles user queries with fuzzy matching, filtering, and rich metadata display!

GitHub Repository Documentation

Deployment Options

The system is deployment-ready with configurations for multiple platforms:

Quick Deploy to Render (Free Tier)

Bash - Deploy to Render
# 1. Push to GitHub
git push origin main

# 2. Connect to Render (auto-detects render.yaml)
# 3. Configure environment variables
SECRET_KEY=
DEBUG=False
MODEL_DIR=./models

# 4. Deploy! (Render handles build & deployment)
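
On the Django side, the three environment variables above have to be read at startup. The helper below is a hypothetical sketch of how a settings module might do this; the function name `load_settings` and the defaults are assumptions and are not taken from the project.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read the deployment variables into Django-style settings values."""
    secret_key = env.get("SECRET_KEY", "")
    if not secret_key:
        # Fail fast rather than start with an empty secret
        raise RuntimeError("SECRET_KEY must be set")
    return {
        "SECRET_KEY": secret_key,
        # Environment values are strings, so "False" must be parsed explicitly
        "DEBUG": env.get("DEBUG", "False").strip().lower() in ("1", "true", "yes"),
        "MODEL_DIR": Path(env.get("MODEL_DIR", "./models")),
    }


example = load_settings({"SECRET_KEY": "dev-only", "DEBUG": "False", "MODEL_DIR": "./models"})
```

Parsing `DEBUG` explicitly matters because any non-empty string, including "False", is truthy in Python.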

Also Supports

Challenges & Solutions

Challenge 1: Memory Explosion with Large Datasets
Computing pairwise similarity for 100K movies produces a 100K × 100K matrix, about 40GB if stored densely as 4-byte floats. Solution: Implemented sparse matrix storage (scipy.sparse), reducing size to ~150MB, plus chunked computation to prevent memory overflow during training.
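
The dense figure follows from 100,000 × 100,000 entries × 4 bytes = 4 × 10^10 bytes, or about 40GB. One common way to combine sparse storage with chunked computation is to process a block of rows at a time and keep only each movie's top-k neighbours. The sketch below illustrates that idea; whether the project keeps top-k neighbours, and the values of `k` and `chunk_size`, are assumptions.

```python
import numpy as np
from scipy import sparse


def topk_similarity(X, k=50, chunk_size=1000):
    """Cosine similarity keeping only the top-k neighbours per row, computed in chunks.

    X: (n_movies, n_features) array of L2-normalised feature vectors, n_movies >= 2.
    Returns a CSR matrix of shape (n_movies, n_movies) with k stored entries per row.
    """
    n = X.shape[0]
    k = min(k, n - 1)
    rows, cols, vals = [], [], []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = X[start:stop] @ X.T  # only a (chunk, n) block is ever dense
        # Mask each movie's similarity with itself so it is never selected
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Column indices of the k largest scores in each row (unordered)
        top = np.argpartition(-block, k - 1, axis=1)[:, :k]
        rows.append(np.repeat(np.arange(start, stop), k))
        cols.append(top.ravel())
        vals.append(np.take_along_axis(block, top, axis=1).ravel())
    return sparse.csr_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n, n), dtype=np.float32,
    )


# Small illustrative run on random unit vectors
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 8)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = topk_similarity(X, k=5, chunk_size=10)
```

Peak memory is bounded by one `chunk_size × n_movies` block, and the stored result grows linearly with the number of movies instead of quadratically.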

Challenge 2: Slow Feature Extraction
Initial implementation took 30+ minutes to process 100K movies due to inefficient JSON parsing. Solution: Optimized with vectorized operations, Parquet format (10x faster loading), and parallel processing where possible. Training time reduced to 15 minutes.

Challenge 3: Poor Recommendation Quality at Scale
When scaling to 1M+ movies, recommendations included obscure films with only 2-3 votes. Solution: Implemented three-tier quality filtering (low/medium/high) and a quality score combining rating and vote count, which dramatically improved relevance.
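
The exact quality score is not given here. A widely used way to combine rating and vote count is a Bayesian-style weighted rating, which pulls movies with few votes toward the global mean. The sketch below uses that formula as a stand-in: the formula, the `vote_count` column name, and the `low` and `high` thresholds are assumptions (only the 50-vote `medium` tier appears in the training example).

```python
import pandas as pd

# Minimum vote counts per tier. Only 'medium' (50+) comes from the source;
# the 'low' and 'high' values are placeholders.
MIN_VOTES = {"low": 10, "medium": 50, "high": 500}


def filter_by_quality(df, threshold="medium", max_movies=None):
    """Drop movies below the tier's vote minimum and rank the rest by quality score."""
    m = MIN_VOTES[threshold]
    C = df["vote_average"].mean()  # global mean rating
    kept = df[df["vote_count"] >= m].copy()
    v, R = kept["vote_count"], kept["vote_average"]
    # Weighted rating: the more votes a movie has, the more its own rating counts
    kept["quality_score"] = (v / (v + m)) * R + (m / (v + m)) * C
    kept = kept.sort_values("quality_score", ascending=False)
    return kept.head(max_movies) if max_movies else kept


movies = pd.DataFrame({  # illustrative data
    "title": ["A", "B", "C"],
    "vote_count": [3, 2000, 60],
    "vote_average": [9.5, 8.0, 6.0],
})
ranked = filter_by_quality(movies, "medium")
```

In this toy data, movie A is dropped despite its 9.5 average because three votes fall below the tier's minimum.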

Challenge 4: Cold Start - No User History
A new deployment has no user interaction data to learn from. Solution: Content-based filtering does not depend on user history, and rich feature engineering (genres, companies, plot, keywords) allows high-quality recommendations from the first query.
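
A typical way to feed several metadata fields into TF-IDF is to merge them into one text document per movie, repeating the fields that should carry more weight. The sketch below illustrates this; the column names, the weights, and the repetition trick are assumptions rather than the project's confirmed approach.

```python
import pandas as pd

# Hypothetical per-field weights: a field repeated w times gets a higher term frequency
DEFAULT_WEIGHTS = {"genres": 3, "keywords": 2, "production_companies": 1, "overview": 1}


def build_feature_text(df, weights=None):
    """Combine metadata columns into one lowercase text document per movie."""
    weights = weights or DEFAULT_WEIGHTS
    combined = pd.Series("", index=df.index)
    for col, w in weights.items():
        # Missing values become empty strings so they contribute nothing
        text = df[col].fillna("").astype(str).str.lower() + " "
        combined = combined + text.str.repeat(w)
    # Collapse repeated whitespace left by empty fields
    return combined.str.split().str.join(" ")


sample = pd.DataFrame({  # illustrative data
    "genres": ["Action"],
    "keywords": ["space"],
    "production_companies": [None],
    "overview": ["A hero"],
})
feature_text = build_feature_text(sample)
```

The resulting strings are what a TF-IDF vectorizer would consume, so every movie with metadata can be compared from the first query.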

Evolution & Impact

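| Metric              | Version 1.0 | Version 2.0 | Improvement |
|---------------------|-------------|-------------|-------------|
| Dataset Size        | 10K movies  | 930K+ movies| 93x larger  |
| Model Size          | 320MB       | 180MB       | 44% smaller |
| Memory Usage        | 800MB       | 350MB       | 56% less    |
| Recommendation Time | ~100ms      | <50ms       | 2x faster   |
| Features Used       | 4           | 10+         | 2.5x richer |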

Future Enhancements

Key Learnings

Project Links

GitHub Repository Full Documentation Read Blog Post

Technical Resources: