The Movie Recommendation System is a production-ready Django application that delivers intelligent movie recommendations using machine learning. Built to scale from thousands to nearly a million movies, it combines TF-IDF vectorization and SVD dimensionality reduction in a content-based filtering pipeline that returns highly relevant suggestions in under 50 milliseconds.
Unlike basic keyword-matching systems, this recommender captures the semantic meaning of movie features: genres, keywords, production companies, plot summaries, and more. The system evolved from handling 10,000 movies in the MovieLens dataset to processing 930,000+ movies from the full TMDB dataset while maintaining performance and recommendation quality.
Building a recommendation system that works at scale presents several key challenges, from memory usage and training speed to recommendation quality; each is covered in the challenge write-ups further down. At its core, the engine follows a straightforward pipeline:
Feature Engineering → TF-IDF Vectorization → SVD Reduction → Cosine Similarity → Recommendations
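The same flow can be prototyped in a few lines of scikit-learn. The sketch below runs on toy data and assumes a pre-built `combined_features` text column; the column name, sample movies, and component count are illustrative, not the project's actual training code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

movies = pd.DataFrame({
    'title': ['Alien', 'Aliens', 'Toy Story'],
    'combined_features': [
        'horror science fiction space crew xenomorph',
        'action science fiction space marines xenomorph',
        'animation family comedy toys friendship',
    ],
})

# Feature engineering -> TF-IDF vectorization
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['combined_features'])

# SVD reduction to a dense, low-dimensional space
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

# Cosine similarity between every pair of movies
similarity = cosine_similarity(reduced)

# Recommendations: most similar titles, excluding the query movie itself
query_idx = 0  # 'Alien'
ranked = similarity[query_idx].argsort()[::-1][1:]
print(movies['title'].iloc[ranked].tolist())  # 'Aliens' ranks above 'Toy Story'
```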
At serving time, the `MovieRecommender` class loads the pre-computed artifacts once and answers every query from memory:

```python
import json
from pathlib import Path

import pandas as pd
from scipy.sparse import load_npz


class MovieRecommender:
    """Production-ready recommendation engine."""

    def __init__(self, model_dir: Path):
        # Load pre-trained model artifacts
        self.metadata = pd.read_parquet(model_dir / 'movie_metadata.parquet')
        self.similarity_matrix = load_npz(model_dir / 'similarity_matrix.npz').toarray()
        with open(model_dir / 'title_to_idx.json') as f:
            self.title_to_idx = json.load(f)

    def get_recommendations(self, movie_title, n=15, min_rating=None):
        # Fuzzy-match the requested title against the catalogue
        matched_title = self.find_movie(movie_title)
        if not matched_title:
            return {'error': 'Movie not found'}

        # Similarity scores for the matched movie, best first,
        # skipping the movie itself (always the top match)
        movie_idx = self.title_to_idx[matched_title]
        sim_scores = list(enumerate(self.similarity_matrix[movie_idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:]

        # Apply filters and format results
        recommendations = []
        for idx, score in sim_scores:
            if len(recommendations) >= n:
                break
            movie = self.metadata.iloc[idx]
            if min_rating and movie['vote_average'] < min_rating:
                continue
            recommendations.append({
                'title': movie['title'],
                'rating': f"{movie['vote_average']:.1f}/10",
                'genres': ', '.join(movie['genres']),
                'similarity_score': f"{score:.3f}",
                'imdb_link': f"https://www.imdb.com/title/{movie['imdb_id']}"
            })
        return {'recommendations': recommendations}
```
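A minimal usage sketch; the model directory and query title are illustrative:

```python
from pathlib import Path

recommender = MovieRecommender(Path('./models'))
result = recommender.get_recommendations('Inception', n=5, min_rating=7.0)

for rec in result.get('recommendations', []):
    print(f"{rec['title']} ({rec['rating']}) - similarity {rec['similarity_score']}")
```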
The system includes a complete training pipeline for creating custom models from any movie dataset:
```python
from training.train import MovieRecommenderTrainer

# Initialize trainer with configuration
trainer = MovieRecommenderTrainer(
    output_dir='./models',
    use_dimensionality_reduction=True,
    n_components=500  # SVD components
)

# Train on TMDB dataset
df, sim_matrix = trainer.train(
    'path/to/TMDB_movie_dataset.csv',
    quality_threshold='medium',  # 50+ votes
    max_movies=100000            # top 100K movies by quality
)

# Models saved automatically:
# - movie_metadata.parquet
# - similarity_matrix.npz
# - title_to_idx.json
# - tfidf_vectorizer.pkl
# - svd_model.pkl
```
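For reference, persisting those artifacts takes only a few lines. The helper below is an illustrative sketch of such a save step, not the trainer's actual internals; the `save_artifacts` name and signature are assumptions.

```python
import json
import pickle
from pathlib import Path

from scipy.sparse import csr_matrix, save_npz


def save_artifacts(output_dir, df, sim_matrix, vectorizer, svd):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    df.to_parquet(out / 'movie_metadata.parquet')                    # compact columnar metadata
    save_npz(out / 'similarity_matrix.npz', csr_matrix(sim_matrix))  # sparse similarity matrix
    with open(out / 'title_to_idx.json', 'w') as f:
        json.dump({title: i for i, title in enumerate(df['title'])}, f)
    with open(out / 'tfidf_vectorizer.pkl', 'wb') as f:
        pickle.dump(vectorizer, f)
    with open(out / 'svd_model.pkl', 'wb') as f:
        pickle.dump(svd, f)
```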
| Configuration | Movies | Training Time | Memory | Model Size |
|---|---|---|---|---|
| Demo (Included) | 2K | N/A | 50MB | 8MB |
| Small | 10K | 2 min | 500MB | 40MB |
| Medium ⭐ | 100K | 15 min | 2GB | 180MB |
| Large | 930K+ | 60 min | 6GB | 800MB |
Watch how the recommendation engine handles user queries with fuzzy matching, filtering, and rich metadata display!
The system is deployment-ready with configurations for multiple platforms:
```bash
# 1. Push to GitHub
git push origin main

# 2. Connect to Render (auto-detects render.yaml)

# 3. Configure environment variables
SECRET_KEY=
DEBUG=False
MODEL_DIR=./models

# 4. Deploy! (Render handles build & deployment)
```
Challenge 1: Memory Explosion with Large Datasets
Computing pairwise similarity for 100K movies produces a 100K × 100K matrix, roughly 40GB if stored dense as float32.
Solution: Switched to sparse matrix storage (scipy.sparse), cutting the stored size to ~150MB, and computed the similarity in chunks to keep peak memory bounded during training.
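A sketch of that chunked, sparsified computation; the chunk size and top-k cutoff below are illustrative assumptions, not the project's exact values.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity


def chunked_topk_similarity(features, chunk_size=1000, top_k=50):
    """Compute cosine similarity in row chunks, keeping only each movie's
    top-k neighbours so the stored matrix stays sparse."""
    n = features.shape[0]
    chunks = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Similarity of this chunk of movies against the full feature matrix
        sims = cosine_similarity(features[start:end], features)
        # Zero out everything except each row's top-k scores
        keep = np.argpartition(sims, -top_k, axis=1)[:, -top_k:]
        mask = np.zeros_like(sims, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        sims[~mask] = 0.0
        chunks.append(sparse.csr_matrix(sims))
    return sparse.vstack(chunks)
```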
Challenge 2: Slow Feature Extraction
Initial implementation took 30+ minutes to process 100K movies due to inefficient JSON parsing.
Solution: Optimized with vectorized operations, Parquet format (10x faster loading),
and parallel processing where possible. Training time reduced to 15 minutes.
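As an example of the kind of change involved, the snippet below swaps row-wise JSON parsing for a vectorized string extraction and caches the cleaned frame as Parquet; the serialized genre format shown is an assumption about the raw CSV.

```python
import pandas as pd

# Toy stand-in for the raw TMDB CSV; the genre column format is an assumption
df = pd.DataFrame({
    'title': ['Alien', 'Toy Story'],
    'genres': ["[{'name': 'Horror'}, {'name': 'Science Fiction'}]",
               "[{'name': 'Animation'}, {'name': 'Family'}]"],
})

# Vectorized extraction over the whole column, instead of json/ast parsing
# inside .apply() for every row
df['genre_names'] = df['genres'].str.findall(r"'name':\s*'([^']+)'")

# Cache the cleaned frame as Parquet; later runs load this instead of re-parsing the CSV
df.to_parquet('movies_clean.parquet')
df = pd.read_parquet('movies_clean.parquet')
print(df['genre_names'].tolist())
```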
Challenge 3: Poor Recommendation Quality at Scale
When scaling to 1M+ movies, recommendations included obscure films with 2-3 votes.
Solution: Implemented three-tier quality filtering (low/medium/high vote-count thresholds) and a quality score that combines rating and vote count, which dramatically improved relevance.
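One common way to combine rating and vote count is an IMDB-style Bayesian weighted rating. The sketch below illustrates the idea; the project's exact formula, global-mean rating, and minimum-vote values are assumptions here.

```python
def weighted_quality_score(vote_average, vote_count, global_mean=6.5, min_votes=50):
    """Bayesian average: movies with few votes are pulled toward the global
    mean instead of dominating the ranking on a near-perfect score."""
    v, m = vote_count, min_votes
    return (v / (v + m)) * vote_average + (m / (v + m)) * global_mean


# A 9.0 from 3 votes scores well below a 7.8 from 5,000 votes
print(round(weighted_quality_score(9.0, 3), 2))      # 6.64
print(round(weighted_quality_score(7.8, 5000), 2))   # 7.79
```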
Challenge 4: Cold Start - No User History
With no user interaction history available, the system must produce good recommendations from movie content alone.
Solution: Rich feature engineering (genres, companies, plot, keywords) allows
high-quality recommendations from the first query without any user interaction data.
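In practice that means folding every content signal into one weighted text field that TF-IDF then vectorizes. The helper below is an illustrative sketch; the field names, weights, and sample movie are assumptions, not the project's exact code.

```python
def build_feature_text(movie: dict, genre_weight=3) -> str:
    """Concatenate weighted movie features into a single text field for TF-IDF.
    Repeating a field boosts its TF-IDF contribution."""
    parts = (
        movie.get('genres', []) * genre_weight        # genres weighted most heavily
        + movie.get('keywords', [])
        + movie.get('production_companies', [])
        + [movie.get('overview', '')]                 # plot summary
    )
    return ' '.join(p.lower() for p in parts if p)


movie = {
    'genres': ['Science Fiction', 'Horror'],
    'keywords': ['space', 'xenomorph'],
    'production_companies': ['20th Century Fox'],
    'overview': 'The crew of a commercial spacecraft encounters a deadly lifeform.',
}
print(build_feature_text(movie))
```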
| Metric | Version 1.0 | Version 2.0 | Improvement |
|---|---|---|---|
| Dataset Size | 10K movies | 930K+ movies | 93x larger |
| Model Size | 320MB | 180MB | 44% smaller |
| Memory Usage | 800MB | 350MB | 56% less |
| Recommendation Time | ~100ms | <50ms | 2x faster |
| Features Used | 4 | 10+ | 2.5x richer |