Project Overview

ML Algorithms from Scratch is a comprehensive educational repository that implements fundamental machine learning algorithms using only Python and NumPy, with 18 algorithms in scope (4 complete, 14 planned). The project demonstrates complete algorithm implementations with mathematical rigor, featuring more than 2,500 lines of documentation that explain the theory, mathematics, and code behind each algorithm.


Unlike typical machine learning tutorials that rely on scikit-learn or TensorFlow, this repository builds every algorithm from first principles. Each implementation follows a consistent object-oriented design pattern with a scikit-learn-like API (fit, predict, score), making it ideal for students, researchers, and engineers who want to understand how ML algorithms actually work under the hood.

The Challenge

Understanding machine learning algorithms at a fundamental level presents several key challenges:

  • Library abstractions hide what algorithms actually compute
  • Formulas are memorized rather than derived and implemented
  • Hyperparameters are tuned by trial and error, without understanding their effects
  • Concepts remain abstract without concrete code examples
  • Most resources offer only limited documentation

Solution Architecture

The project addresses these challenges through a structured approach combining mathematical foundations with production-quality code:

| Traditional Learning | This Project |
|---|---|
| Import library and use | Build algorithm from scratch |
| Memorize formulas | Derive and implement equations |
| Trial-and-error tuning | Understand parameter effects |
| Abstract concepts | Concrete code examples |
| Limited documentation | 2,500+ lines of explanations |

Target Audience

This repository serves multiple user segments in the ML community:

  • Students building a first-principles understanding of ML
  • Engineers preparing for technical ML interviews
  • Researchers who need transparent reference implementations
  • Practitioners who want to see what happens under the hood of libraries like scikit-learn

Technical Specifications

  • 4 Algorithms: fully implemented, with 14 more planned
  • 2,500+ Lines: comprehensive documentation across all algorithms
  • NumPy Only: pure implementations, no ML libraries used
  • Interview Ready: targeted preparation for FAANG-style technical interviews

Implementation Status

The repository currently features 4 completed algorithms with comprehensive documentation and working code. Development follows a phased approach with 14 additional algorithms planned:

| # | Algorithm | Type | Code Lines | Doc Lines | Status |
|---|---|---|---|---|---|
| 1 | Linear Regression | Regression | 160 | 391 | ✅ Complete |
| 2 | Multiple Regression | Regression | 173 | 356 | ✅ Complete |
| 3 | Ridge Regression | Regression | 256 | 696 | ✅ Complete |
| 4 | Logistic Regression | Classification | 414 | 873 | ✅ Complete |

Coming Soon

The roadmap includes 14 more essential algorithms; see the Project Roadmap section below for the phased breakdown.

Core Features

1. 📖 Comprehensive Documentation

Every algorithm includes a detailed markdown file (300-900 lines) covering the underlying theory and intuition, the mathematics behind the method, a walkthrough of the implementation, and worked examples.

2. 💻 Production-Quality Code

All implementations follow best practices:

Python - Clean Architecture Pattern
class AlgorithmName:
    def __init__(self, hyperparameter1=None, hyperparameter2=None):
        """Initialize with hyperparameters"""

    def fit(self, X, y):
        """Train the model on training data"""
        return self

    def predict(self, X):
        """Make predictions on new data"""

    def score(self, X, y):
        """Evaluate model performance"""

    def get_coefficients(self):
        """Get learned parameters"""

Features include a consistent fit/predict/score interface across algorithms, docstrings on every public method, explicit handling of the bias term, and accessors for learned parameters.

3. 🧮 Mathematical Rigor

Understanding the math is critical, so each algorithm's documentation derives the equations that its code implements. For example, Linear Regression training reduces to the closed-form Normal Equation:

Python - Linear Regression Normal Equation
# Normal Equation: θ = (X^T X)^(-1) X^T y
def fit(self, X, y):
    # Add bias term (column of ones)
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    
    # Calculate coefficients using Normal Equation
    # Inverts (X^T X) and multiplies by X^T y
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias
    ) @ X_with_bias.T @ y
    
    return self
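For ill-conditioned feature matrices, explicitly inverting X^T X with np.linalg.inv can amplify numerical error. A minimal sketch of a more stable alternative (an illustration, not part of the repository) solves the least-squares problem directly:

Python - Numerically Stable Fit (hypothetical variant)
import numpy as np

def fit_stable(X, y):
    """Hypothetical variant: solve min ||X_b @ theta - y||^2 without an explicit inverse."""
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    # lstsq uses an SVD-based solver, avoiding the explicit (X^T X)^(-1)
    coefficients, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
    return coefficients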

Mathematical coverage spans linear algebra (matrix products, transposes, inverses), calculus (gradients and partial derivatives), probability (likelihoods and cross-entropy loss), and optimization (gradient descent and L2 regularization).

Implementation Example: Logistic Regression

The complete implementation below demonstrates the gradient descent approach for binary classification.
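The update rule in the code follows from the binary cross-entropy loss. With the bias column folded into X and predictions ŷ = sigmoid(Xθ):

Loss:      L(θ) = -(1/n) Σ [ y log(ŷ) + (1 - y) log(1 - ŷ) ]
Gradient:  ∇L(θ) = (1/n) X^T (ŷ - y)
Update:    θ ← θ - learning_rate · ∇L(θ)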

Python - Logistic Regression with Gradient Descent
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.coefficients = None
        self.losses = []
    
    def _sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        """Train using gradient descent"""
        n_samples, n_features = X.shape
        
        # Add bias term
        X_with_bias = np.c_[np.ones((n_samples, 1)), X]
        
        # Initialize coefficients
        self.coefficients = np.zeros(n_features + 1)
        
        # Gradient descent
        for i in range(self.iterations):
            # Forward pass
            y_pred = self._sigmoid(X_with_bias @ self.coefficients)
            
            # Calculate loss (binary cross-entropy)
            loss = -np.mean(
                y * np.log(y_pred + 1e-15) + 
                (1 - y) * np.log(1 - y_pred + 1e-15)
            )
            self.losses.append(loss)
            
            # Calculate gradients
            error = y_pred - y
            gradients = (1 / n_samples) * (X_with_bias.T @ error)
            
            # Update coefficients
            self.coefficients -= self.learning_rate * gradients
        
        return self
    
    def predict(self, X):
        """Make binary predictions"""
        X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
        probabilities = self._sigmoid(X_with_bias @ self.coefficients)
        return (probabilities >= 0.5).astype(int)
    
    def score(self, X, y):
        """Calculate accuracy"""
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Usage Example
data = load_breast_cancer()
X, y = data.data, data.target

# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LogisticRegression(learning_rate=0.1, iterations=2000)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: ~0.96

Algorithm Progression

Algorithms are structured in progressive difficulty, building foundational concepts before advanced techniques:

Week 1: Linear Models

  1. Linear Regression (2 hours)
    Foundation of all ML • Normal Equation • Bias & Slope
  2. Multiple Regression (2 hours)
    Multiple features • Matrix operations • Multicollinearity

Week 2: Regularization & Classification

  3. Ridge Regression (2-3 hours)
    Overfitting • L2 regularization • Hyperparameter tuning
  4. Logistic Regression (3 hours)
    Classification • Sigmoid function • Gradient descent • Binary cross-entropy

Usage Recommendations

Work through the algorithms in numbered order, read each algorithm's documentation before its code, and run the provided examples before adapting the implementations to your own data.

Technical Deep-Dive

Ridge Regression Implementation

The Ridge Regression implementation demonstrates L2 regularization through the modified Normal Equation.
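With regularization strength λ (exposed as alpha in the code), Ridge minimizes

J(θ) = ||Xθ - y||² + λ||θ||²

Setting the gradient 2 X^T (Xθ - y) + 2λθ to zero gives the closed form implemented below; note that the implementation excludes the bias term from the penalty:

θ = (X^T X + λI)^(-1) X^T y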

Python - Ridge Regression with Regularization
# Regularized Normal Equation: θ = (X^T X + λI)^(-1) X^T y
def fit(self, X, y):
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    
    # Create identity matrix (don't penalize bias term)
    identity = np.eye(X_with_bias.shape[1])
    identity[0, 0] = 0  # Don't regularize intercept
    
    # Add regularization term (λI)
    regularization_term = self.alpha * identity
    
    # Solve regularized Normal Equation
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias + regularization_term
    ) @ X_with_bias.T @ y
    
    return self
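A brief usage sketch (hypothetical: it assumes the repository's class is named RidgeRegression and exposes the alpha hyperparameter and coefficients attribute used in fit above) shows how larger α shrinks the learned weights:

Python - Effect of the Regularization Strength (hypothetical usage)
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

for alpha in (0.0, 1.0, 100.0):
    model = RidgeRegression(alpha=alpha)  # hypothetical class name
    model.fit(X, y)
    # Coefficient magnitudes shrink toward zero as alpha grows
    print(alpha, np.round(model.coefficients[1:], 3))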

Performance Characteristics

| Algorithm | Training Time | Prediction Time | Memory Usage |
|---|---|---|---|
| Linear Regression | O(n³) | O(n) | O(n²) |
| Multiple Regression | O(n³) | O(n) | O(n²) |
| Ridge Regression | O(n³) | O(n) | O(n²) |
| Logistic Regression | O(i·n²) | O(n) | O(n) |

where n = number of features and i = number of gradient-descent iterations; the O(n³) training cost is dominated by inverting the n×n matrix X^T X (forming that matrix also scales linearly with the number of samples)

Installation & Setup

The project requires Python 3.7+ and NumPy. Setup process:

Bash - Installation & Setup
# 1. Clone the repository
git clone https://github.com/inboxpraveen/ML-Algorithms-from-scratch.git
cd ML-Algorithms-from-scratch

# 2. Install dependencies
pip install numpy

# 3. Install optional dependencies for examples
pip install matplotlib scikit-learn pandas jupyter

# 4. Run an example
python "1. Linear Regression/_1_linear_regressions.py"

Quick Start Example

Python - Quick Start
import numpy as np
import sys
sys.path.append('1. Linear Regression')
from _1_linear_regressions import LinearRegression

# Create sample data (years of experience vs salary)
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y_train = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
X_test = np.array([11, 12, 15]).reshape(-1, 1)
predictions = model.predict(X_test)

# Evaluate
r2_score = model.score(X_train, y_train)
print(f"R² Score: {r2_score:.4f}")
print(f"Predictions: {predictions}")

# Get learned coefficients
coeffs = model.get_coefficients()
print(f"Equation: y = {coeffs['intercept']:.2f} + {coeffs['slope']:.2f}x")

Use Cases

Academic Applications

The implementations serve as course supplements for machine learning classes, teaching material for instructors, and transparent reference code for research prototypes.

Professional Development

The repository addresses common technical interview requirements:

Technical Questions:

  • "Implement linear regression from scratch"
  • "Explain the difference between L1 and L2 regularization"
  • "How does gradient descent work?"
  • "What is the difference between classification and regression?"
  • "When would you use Ridge vs Lasso regression?"

Conceptual Questions:

  • "What is overfitting and how do you prevent it?"
  • "Explain the bias-variance tradeoff"
  • "How do you choose between different ML algorithms?"
  • "What is the intuition behind logistic regression?"

Design Philosophy

Educational Clarity

Code prioritizes readability and understanding over computational optimization: descriptive names, step-by-step operations, and comments that tie each step back to the underlying equations.

Consistent Interface Design

All algorithms follow the same API pattern, inspired by scikit-learn, as shown in the Clean Architecture Pattern above. The sketch below illustrates the payoff: models can be swapped without changing the surrounding code.
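A minimal sketch (assuming LinearRegression and a hypothetical RidgeRegression class have been imported, as in the Quick Start example, along with training and test splits):

Python - Swapping Models via the Shared API (hypothetical sketch)
# Because every class exposes fit/predict/score, models are interchangeable:
for ModelClass in (LinearRegression, RidgeRegression):
    model = ModelClass()  # assumes usable default hyperparameters
    model.fit(X_train, y_train)
    print(ModelClass.__name__, model.score(X_test, y_test))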

Documentation Standards

The project maintains a 2.3:1 documentation-to-code ratio, ensuring comprehensive coverage.
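The ratio follows directly from the status table above:

(391 + 356 + 696 + 873) doc lines / (160 + 173 + 256 + 414) code lines = 2,316 / 1,003 ≈ 2.3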

Project Roadmap

Phase 1: Foundation ✅ Complete (4/18)

✅ Linear Regression
✅ Multiple Regression
✅ Ridge Regression
✅ Logistic Regression

Phase 2: Classification 🔄 In Progress (0/4)

⏳ K-Nearest Neighbors (KNN)
⏳ Naive Bayes
⏳ Support Vector Machines (SVM)
⏳ Decision Trees

Phase 3: Ensemble Methods 📋 Planned (0/4)

📅 Random Forests
📅 AdaBoost
📅 Gradient Boosting
📅 XGBoost

Phase 4: Unsupervised Learning 📋 Planned (0/4)

📅 k-Means Clustering
📅 Hierarchical Clustering
📅 Principal Component Analysis (PCA)
📅 t-SNE

Contributing

The project welcomes contributions across multiple areas:

Priority Contribution Areas

High Priority
  • Implement remaining 14 algorithms
  • Add more examples to existing algorithms
  • Create visualization utilities
  • Add unit tests

Medium Priority
  • Bug fixes and improvements
  • Documentation enhancements
  • Translate documentation
  • Create video tutorials

Always Welcome
  • Typo fixes
  • Grammar improvements
  • Better explanations
  • More examples

Project Impact

This repository provides foundational understanding of machine learning algorithms through complete, well-documented implementations. The project serves as both a learning resource and reference material for algorithm internals, supporting students, researchers, and professionals in developing deeper ML expertise.

Getting Started

  1. Clone the repository from GitHub
  2. Review documentation for Linear Regression as the foundation
  3. Progress through algorithms in numbered order
  4. Execute provided examples with included datasets
  5. Modify implementations for specific use cases
  6. Contribute improvements or additional algorithms

Resources & Links

Repository: https://github.com/inboxpraveen/ML-Algorithms-from-scratch

Tags:

#MachineLearning #Python #NumPy #FromScratch #Education #MLAlgorithms #DataScience #LinearRegression #LogisticRegression #GradientDescent