Project Overview

ML Algorithms from Scratch is a comprehensive educational repository that implements fundamental machine learning algorithms using only Python and NumPy, with 18 algorithms in scope (4 complete, 14 planned). The project demonstrates complete algorithm implementations with mathematical rigor, featuring more than 2,500 lines of documentation that explain the theory, mathematics, and code behind each algorithm.


Unlike typical machine learning tutorials that rely on scikit-learn or TensorFlow, this repository builds every algorithm from first principles. Each implementation follows a consistent object-oriented design pattern with a scikit-learn-like API (fit, predict, score), making it ideal for students, researchers, and engineers who want to understand how ML algorithms actually work under the hood.

The Challenge

Understanding machine learning algorithms at a fundamental level presents several key challenges:

  • Library abstractions hide what algorithms actually compute
  • Formulas are memorized rather than derived and implemented
  • Hyperparameters are tuned by trial and error, without understanding their effects
  • Concepts remain abstract without concrete code examples
  • Most resources offer only limited documentation

Solution Architecture

The project addresses these challenges through a structured approach combining mathematical foundations with production-quality code:

| Traditional Learning | This Project |
|---|---|
| Import library and use | Build algorithm from scratch |
| Memorize formulas | Derive and implement equations |
| Trial-and-error tuning | Understand parameter effects |
| Abstract concepts | Concrete code examples |
| Limited documentation | 2,500+ lines of explanations |

Target Audience

This repository serves multiple user segments in the ML community:

  • Students building a first-principles understanding of ML
  • Engineers preparing for technical ML interviews
  • Researchers who need transparent reference implementations
  • Practitioners who want to see what happens under the hood of libraries like scikit-learn

Technical Specifications

  • 4 Algorithms: fully implemented, with 14 more planned
  • 2,500+ Lines: comprehensive documentation across all algorithms
  • NumPy Only: pure implementations, no ML libraries used
  • Interview Ready: targeted preparation for FAANG-style technical interviews

Implementation Status

The repository currently features 4 completed algorithms with comprehensive documentation and working code. Development follows a phased approach with 14 additional algorithms planned:

| # | Algorithm | Type | Code Lines | Doc Lines | Status |
|---|---|---|---|---|---|
| 1 | Linear Regression | Regression | 160 | 391 | ✅ Complete |
| 2 | Multiple Regression | Regression | 173 | 356 | ✅ Complete |
| 3 | Ridge Regression | Regression | 256 | 696 | ✅ Complete |
| 4 | Logistic Regression | Classification | 414 | 873 | ✅ Complete |

Coming Soon

The roadmap includes 14 more essential algorithms; see the Project Roadmap section below for the phased breakdown.

Core Features

1. 📖 Comprehensive Documentation

Every algorithm includes a detailed markdown file (300-900 lines) covering the underlying theory and intuition, the mathematics behind the method, a walkthrough of the implementation, and worked examples.

2. 💻 Production-Quality Code

All implementations follow best practices:

Python - Clean Architecture Pattern
class AlgorithmName:
    def __init__(self, hyperparameter1=None, hyperparameter2=None):
        """Initialize with hyperparameters"""

    def fit(self, X, y):
        """Train the model on training data"""
        return self

    def predict(self, X):
        """Make predictions on new data"""

    def score(self, X, y):
        """Evaluate model performance"""

    def get_coefficients(self):
        """Get learned parameters"""

Features include a consistent fit/predict/score interface across algorithms, docstrings on every public method, explicit handling of the bias term, and accessors for learned parameters.

3. 🧮 Mathematical Rigor

Understanding the math is critical, so each algorithm's documentation derives the equations that its code implements. For example, Linear Regression training reduces to the closed-form Normal Equation:

Python - Linear Regression Normal Equation
# Normal Equation: θ = (X^T X)^(-1) X^T y
def fit(self, X, y):
    # Add bias term (column of ones)
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    
    # Calculate coefficients using Normal Equation
    # Inverts (X^T X) and multiplies by X^T y
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias
    ) @ X_with_bias.T @ y
    
    return self
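For ill-conditioned feature matrices, explicitly inverting X^T X with np.linalg.inv can amplify numerical error. A minimal sketch of a more stable alternative (an illustration, not part of the repository) solves the least-squares problem directly:

Python - Numerically Stable Fit (hypothetical variant)
import numpy as np

def fit_stable(X, y):
    """Hypothetical variant: solve min ||X_b @ theta - y||^2 without an explicit inverse."""
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    # lstsq uses an SVD-based solver, avoiding the explicit (X^T X)^(-1)
    coefficients, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
    return coefficients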

Mathematical coverage spans linear algebra (matrix products, transposes, inverses), calculus (gradients and partial derivatives), probability (likelihoods and cross-entropy loss), and optimization (gradient descent and L2 regularization).

Implementation Example: Logistic Regression

The complete implementation below demonstrates the gradient descent approach for binary classification.
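The update rule in the code follows from the binary cross-entropy loss. With the bias column folded into X and predictions ŷ = sigmoid(Xθ):

Loss:      L(θ) = -(1/n) Σ [ y log(ŷ) + (1 - y) log(1 - ŷ) ]
Gradient:  ∇L(θ) = (1/n) X^T (ŷ - y)
Update:    θ ← θ - learning_rate · ∇L(θ)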

Python - Logistic Regression with Gradient Descent
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.coefficients = None
        self.losses = []
    
    def _sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))
    
    def fit(self, X, y):
        """Train using gradient descent"""
        n_samples, n_features = X.shape
        
        # Add bias term
        X_with_bias = np.c_[np.ones((n_samples, 1)), X]
        
        # Initialize coefficients
        self.coefficients = np.zeros(n_features + 1)
        
        # Gradient descent
        for i in range(self.iterations):
            # Forward pass
            y_pred = self._sigmoid(X_with_bias @ self.coefficients)
            
            # Calculate loss (binary cross-entropy)
            loss = -np.mean(
                y * np.log(y_pred + 1e-15) + 
                (1 - y) * np.log(1 - y_pred + 1e-15)
            )
            self.losses.append(loss)
            
            # Calculate gradients
            error = y_pred - y
            gradients = (1 / n_samples) * (X_with_bias.T @ error)
            
            # Update coefficients
            self.coefficients -= self.learning_rate * gradients
        
        return self
    
    def predict(self, X):
        """Make binary predictions"""
        X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
        probabilities = self._sigmoid(X_with_bias @ self.coefficients)
        return (probabilities >= 0.5).astype(int)
    
    def score(self, X, y):
        """Calculate accuracy"""
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Usage Example
data = load_breast_cancer()
X, y = data.data, data.target

# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LogisticRegression(learning_rate=0.1, iterations=2000)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: ~0.96

Algorithm Progression

Algorithms are structured in progressive difficulty, building foundational concepts before advanced techniques:

Week 1: Linear Models

  1. Linear Regression (2 hours)
    Foundation of all ML • Normal Equation • Bias & Slope
  2. Multiple Regression (2 hours)
    Multiple features • Matrix operations • Multicollinearity

Week 2: Regularization & Classification

  3. Ridge Regression (2-3 hours)
    Overfitting • L2 regularization • Hyperparameter tuning
  4. Logistic Regression (3 hours)
    Classification • Sigmoid function • Gradient descent • Binary cross-entropy

Usage Recommendations

Work through the algorithms in numbered order, read each algorithm's documentation before its code, and run the provided examples before adapting the implementations to your own data.

Technical Deep-Dive

Ridge Regression Implementation

The Ridge Regression implementation demonstrates L2 regularization through the modified Normal Equation.
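With regularization strength λ (exposed as alpha in the code), Ridge minimizes

J(θ) = ||Xθ - y||² + λ||θ||²

Setting the gradient 2 X^T (Xθ - y) + 2λθ to zero gives the closed form implemented below; note that the implementation excludes the bias term from the penalty:

θ = (X^T X + λI)^(-1) X^T y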

Python - Ridge Regression with Regularization
# Regularized Normal Equation: θ = (X^T X + λI)^(-1) X^T y
def fit(self, X, y):
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
    
    # Create identity matrix (don't penalize bias term)
    identity = np.eye(X_with_bias.shape[1])
    identity[0, 0] = 0  # Don't regularize intercept
    
    # Add regularization term (λI)
    regularization_term = self.alpha * identity
    
    # Solve regularized Normal Equation
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias + regularization_term
    ) @ X_with_bias.T @ y
    
    return self
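A brief usage sketch (hypothetical: it assumes the repository's class is named RidgeRegression and exposes the alpha hyperparameter and coefficients attribute used in fit above) shows how larger α shrinks the learned weights:

Python - Effect of the Regularization Strength (hypothetical usage)
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

for alpha in (0.0, 1.0, 100.0):
    model = RidgeRegression(alpha=alpha)  # hypothetical class name
    model.fit(X, y)
    # Coefficient magnitudes shrink toward zero as alpha grows
    print(alpha, np.round(model.coefficients[1:], 3))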

Performance Characteristics

| Algorithm | Training Time | Prediction Time | Memory Usage |
|---|---|---|---|
| Linear Regression | O(n³) | O(n) | O(n²) |
| Multiple Regression | O(n³) | O(n) | O(n²) |
| Ridge Regression | O(n³) | O(n) | O(n²) |
| Logistic Regression | O(i·n²) | O(n) | O(n) |

where n = number of features and i = number of gradient-descent iterations; the O(n³) training cost is dominated by inverting the n×n matrix X^T X (forming that matrix also scales linearly with the number of samples)

Installation & Setup

The project requires Python 3.7+ and NumPy. Setup process:

Bash - Installation & Setup
# 1. Clone the repository
git clone https://github.com/inboxpraveen/ML-Algorithms-from-scratch.git
cd ML-Algorithms-from-scratch

# 2. Install dependencies
pip install numpy

# 3. Install optional dependencies for examples
pip install matplotlib scikit-learn pandas jupyter

# 4. Run an example
python "1. Linear Regression/_1_linear_regressions.py"

Quick Start Example

Python - Quick Start
import numpy as np
import sys
sys.path.append('1. Linear Regression')
from _1_linear_regressions import LinearRegression

# Create sample data (years of experience vs salary)
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y_train = np.array([30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000])

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
X_test = np.array([11, 12, 15]).reshape(-1, 1)
predictions = model.predict(X_test)

# Evaluate
r2_score = model.score(X_train, y_train)
print(f"R² Score: {r2_score:.4f}")
print(f"Predictions: {predictions}")

# Get learned coefficients
coeffs = model.get_coefficients()
print(f"Equation: y = {coeffs['intercept']:.2f} + {coeffs['slope']:.2f}x")

Use Cases

Academic Applications

The implementations serve as course supplements for machine learning classes, teaching material for instructors, and transparent reference code for research prototypes.

Professional Development

The repository addresses common technical interview requirements:

Technical Questions:

  • "Implement linear regression from scratch"
  • "Explain the difference between L1 and L2 regularization"
  • "How does gradient descent work?"
  • "What is the difference between classification and regression?"
  • "When would you use Ridge vs Lasso regression?"

Conceptual Questions:

  • "What is overfitting and how do you prevent it?"
  • "Explain the bias-variance tradeoff"
  • "How do you choose between different ML algorithms?"
  • "What is the intuition behind logistic regression?"

Design Philosophy

Educational Clarity

Code prioritizes readability and understanding over computational optimization: descriptive names, step-by-step operations, and comments that tie each step back to the underlying equations.

Consistent Interface Design

All algorithms follow the same API pattern, inspired by scikit-learn, as shown in the Clean Architecture Pattern above. The sketch below illustrates the payoff: models can be swapped without changing the surrounding code.
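A minimal sketch (assuming LinearRegression and a hypothetical RidgeRegression class have been imported, as in the Quick Start example, along with training and test splits):

Python - Swapping Models via the Shared API (hypothetical sketch)
# Because every class exposes fit/predict/score, models are interchangeable:
for ModelClass in (LinearRegression, RidgeRegression):
    model = ModelClass()  # assumes usable default hyperparameters
    model.fit(X_train, y_train)
    print(ModelClass.__name__, model.score(X_test, y_test))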

Documentation Standards

The project maintains a 2.3:1 documentation-to-code ratio, ensuring comprehensive coverage.
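The ratio follows directly from the status table above:

(391 + 356 + 696 + 873) doc lines / (160 + 173 + 256 + 414) code lines = 2,316 / 1,003 ≈ 2.3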

Project Roadmap

Phase 1: Foundation ✅ Complete (4/18)

✅ Linear Regression
✅ Multiple Regression
✅ Ridge Regression
✅ Logistic Regression

Phase 2: Classification 🔄 In Progress (0/4)

⏳ K-Nearest Neighbors (KNN)
⏳ Naive Bayes
⏳ Support Vector Machines (SVM)
⏳ Decision Trees

Phase 3: Ensemble Methods 📋 Planned (0/4)

📅 Random Forests
📅 AdaBoost
📅 Gradient Boosting
📅 XGBoost

Phase 4: Unsupervised Learning 📋 Planned (0/4)

📅 k-Means Clustering
📅 Hierarchical Clustering
📅 Principal Component Analysis (PCA)
📅 t-SNE

Contributing

The project welcomes contributions across multiple areas:

Priority Contribution Areas

High Priority
  • Implement remaining 14 algorithms
  • Add more examples to existing algorithms
  • Create visualization utilities
  • Add unit tests

Medium Priority
  • Bug fixes and improvements
  • Documentation enhancements
  • Translate documentation
  • Create video tutorials

Always Welcome
  • Typo fixes
  • Grammar improvements
  • Better explanations
  • More examples

Project Impact

This repository provides foundational understanding of machine learning algorithms through complete, well-documented implementations. The project serves as both a learning resource and reference material for algorithm internals, supporting students, researchers, and professionals in developing deeper ML expertise.

Getting Started

  1. Clone the repository from GitHub
  2. Review documentation for Linear Regression as the foundation
  3. Progress through algorithms in numbered order
  4. Execute provided examples with included datasets
  5. Modify implementations for specific use cases
  6. Contribute improvements or additional algorithms

Resources & Links

Repository: https://github.com/inboxpraveen/ML-Algorithms-from-scratch

Tags:

#MachineLearning #Python #NumPy #FromScratch #Education #MLAlgorithms #DataScience #LinearRegression #LogisticRegression #GradientDescent