ML Algorithms from Scratch is a comprehensive educational repository that implements 18 fundamental machine learning algorithms using only Python and NumPy. The project demonstrates complete algorithm implementations with mathematical rigor, featuring more than 2,500 lines of documentation that explain the theory, mathematics, and code behind each algorithm.
Unlike typical machine learning tutorials that rely on scikit-learn or TensorFlow, this repository builds
every algorithm from first principles. Each implementation follows a consistent object-oriented design
pattern with a scikit-learn-like API (fit, predict, score), making
it ideal for students, researchers, and engineers who want to understand how ML algorithms actually work
under the hood.
Understanding machine learning algorithms at a fundamental level is hard: library calls hide the underlying mathematics, formulas get memorized rather than derived, and hyperparameters are tuned by trial and error. The project addresses these challenges through a structured approach that combines mathematical foundations with production-quality code:
| Traditional Learning | This Project |
|---|---|
| Import library and use | Build algorithm from scratch |
| Memorize formulas | Derive and implement equations |
| Trial-and-error tuning | Understand parameter effects |
| Abstract concepts | Concrete code examples |
| Limited documentation | 2,500+ lines of explanations |
This repository serves multiple user segments in the ML community, from students and researchers to practicing engineers. At a glance:

- **4 algorithms** fully implemented, with 14 more planned
- **2,500+ lines** of comprehensive documentation across all algorithms
- **Pure NumPy**: no ML libraries used in the implementations
- **Interview prep**: well-suited preparation for FAANG-style technical interviews
The repository currently features 4 completed algorithms with comprehensive documentation and working code. Development follows a phased approach with 14 additional algorithms planned:
| # | Algorithm | Type | Code Lines | Doc Lines | Status |
|---|---|---|---|---|---|
| 1 | Linear Regression | Regression | 160 | 391 | ✅ Complete |
| 2 | Multiple Regression | Regression | 173 | 356 | ✅ Complete |
| 3 | Ridge Regression | Regression | 256 | 696 | ✅ Complete |
| 4 | Logistic Regression | Classification | 414 | 873 | ✅ Complete |
The roadmap includes 14 more essential algorithms; the full list appears in the roadmap section below.
Every algorithm includes a detailed markdown file (300-900 lines) covering the theory, the mathematics, and the code behind the implementation.
All implementations follow best practices:

```python
class AlgorithmName:
    def __init__(self, hyperparameter1=default1, ...):
        """Initialize with hyperparameters"""

    def fit(self, X, y):
        """Train the model on training data"""

    def predict(self, X):
        """Make predictions on new data"""

    def score(self, X, y):
        """Evaluate model performance"""

    def get_coefficients(self):
        """Get learned parameters"""
```
Features include a consistent scikit-learn-style API across all algorithms, a docstring for every public method, and pure-NumPy internals.
Understanding the math is critical, so each algorithm's documentation derives its key equations next to the code. Linear regression, for example, is solved in closed form with the Normal Equation:
```python
# Normal Equation: θ = (X^T X)^(-1) X^T y
def fit(self, X, y):
    # Add bias term (column of ones)
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]

    # Calculate coefficients using the Normal Equation:
    # invert (X^T X) and multiply by X^T y
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias
    ) @ X_with_bias.T @ y
    return self
```
Mathematical coverage includes closed-form solutions (the Normal Equation above and its regularized Ridge variant below), gradient-descent update rules, and loss functions such as binary cross-entropy.
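As a one-step sketch of where the Normal Equation comes from: setting the gradient of the squared-error loss to zero yields the closed form.

$$
\nabla_\theta \,\lVert X\theta - y\rVert^2 = 2X^\top X\theta - 2X^\top y = 0
\quad\Longrightarrow\quad
\theta = (X^\top X)^{-1}X^\top y
$$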
Complete implementation demonstrating the gradient descent approach for binary classification:
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.coefficients = None
        self.losses = []

    def _sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        """Train using gradient descent"""
        n_samples, n_features = X.shape

        # Add bias term
        X_with_bias = np.c_[np.ones((n_samples, 1)), X]

        # Initialize coefficients
        self.coefficients = np.zeros(n_features + 1)

        # Gradient descent
        for i in range(self.iterations):
            # Forward pass
            y_pred = self._sigmoid(X_with_bias @ self.coefficients)

            # Calculate loss (binary cross-entropy)
            loss = -np.mean(
                y * np.log(y_pred + 1e-15) +
                (1 - y) * np.log(1 - y_pred + 1e-15)
            )
            self.losses.append(loss)

            # Calculate gradients
            error = y_pred - y
            gradients = (1 / n_samples) * (X_with_bias.T @ error)

            # Update coefficients
            self.coefficients -= self.learning_rate * gradients

        return self

    def predict(self, X):
        """Make binary predictions"""
        X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]
        probabilities = self._sigmoid(X_with_bias @ self.coefficients)
        return (probabilities >= 0.5).astype(int)

    def score(self, X, y):
        """Calculate accuracy"""
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Usage example
data = load_breast_cancer()
X, y = data.data, data.target

# Split and standardize
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LogisticRegression(learning_rate=0.1, iterations=2000)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")  # Output: ~0.96
Algorithms are structured in progressive difficulty, building foundational concepts before advanced techniques; the roadmap below follows this ordering.
The ridge regression implementation demonstrates L2 regularization through a modified Normal Equation:
```python
# Regularized Normal Equation: θ = (X^T X + λI)^(-1) X^T y
def fit(self, X, y):
    X_with_bias = np.c_[np.ones((X.shape[0], 1)), X]

    # Create identity matrix (don't penalize bias term)
    identity = np.eye(X_with_bias.shape[1])
    identity[0, 0] = 0  # Don't regularize intercept

    # Add regularization term (λI)
    regularization_term = self.alpha * identity

    # Solve regularized Normal Equation
    self.coefficients = np.linalg.inv(
        X_with_bias.T @ X_with_bias + regularization_term
    ) @ X_with_bias.T @ y
    return self
```
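A minimal usage sketch, assuming the method above lives in a `RidgeRegression` class whose constructor stores `alpha` (the class name and constructor signature are assumptions; `self.alpha` is referenced in the snippet):

```python
import numpy as np

# Toy data: 3 informative features plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = RidgeRegression(alpha=1.0)  # hypothetical constructor
model.fit(X, y)
print(model.coefficients)  # [intercept, w1, w2, w3], shrunk toward zero
```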
| Algorithm | Training Time | Prediction Time | Memory Usage |
|---|---|---|---|
| Linear Regression | O(nd² + d³) | O(d) | O(d²) |
| Multiple Regression | O(nd² + d³) | O(d) | O(d²) |
| Ridge Regression | O(nd² + d³) | O(d) | O(d²) |
| Logistic Regression | O(i·n·d) | O(d) | O(d) |

where n = number of samples, d = number of features, and i = gradient-descent iterations; prediction cost is per sample, and memory excludes the training data itself. The d³ term comes from inverting the d×d matrix X^T X in the Normal Equation, which gradient descent avoids at the cost of iterating over the data.
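To see the cubic inversion cost in practice, here is a quick and deliberately unscientific timing sketch (all names are local to this snippet):

```python
import time
import numpy as np

for d in (200, 400, 800, 1600):
    A = np.random.rand(d, d) + d * np.eye(d)  # keep the matrix well-conditioned
    t0 = time.perf_counter()
    np.linalg.inv(A)
    # Runtime grows roughly cubically: each doubling of d costs ~8x
    print(f"d={d}: {time.perf_counter() - t0:.4f}s")
```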
The project requires Python 3.7+ and NumPy. Setup process:
```bash
# 1. Clone the repository
git clone https://github.com/inboxpraveen/ML-Algorithms-from-scratch.git
cd ML-Algorithms-from-scratch

# 2. Install dependencies
pip install numpy

# 3. Install optional dependencies for examples
pip install matplotlib scikit-learn pandas jupyter

# 4. Run an example
python "1. Linear Regression/_1_linear_regressions.py"
```
You can also use an implementation directly from Python:

```python
import numpy as np
import sys
sys.path.append('1. Linear Regression')
from _1_linear_regressions import LinearRegression

# Create sample data (years of experience vs salary)
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y_train = np.array([30000, 35000, 40000, 45000, 50000,
                    55000, 60000, 65000, 70000, 75000])

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
X_test = np.array([11, 12, 15]).reshape(-1, 1)
predictions = model.predict(X_test)

# Evaluate
r2_score = model.score(X_train, y_train)
print(f"R² Score: {r2_score:.4f}")
print(f"Predictions: {predictions}")

# Get learned coefficients
coeffs = model.get_coefficients()
print(f"Equation: y = {coeffs['intercept']:.2f} + {coeffs['slope']:.2f}x")
```
The repository addresses common technical interview requirements: implementing algorithms without libraries, deriving the underlying equations, and explaining the effect of each parameter.
Code prioritizes readability and understanding over computational optimization.
All algorithms follow the same API pattern, inspired by scikit-learn:
- `__init__()`: Set hyperparameters
- `fit(X, y)`: Train the model
- `predict(X)`: Make predictions
- `score(X, y)`: Evaluate performance
- `get_coefficients()`: Inspect learned parameters

The project maintains a 2.3:1 documentation-to-code ratio, ensuring comprehensive coverage.
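Because every class exposes the same methods, models can be swapped behind a single evaluation helper; a small sketch (the `evaluate` helper below is illustrative, not part of the repository):

```python
def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit any of the repository's models and report the test score."""
    model.fit(X_train, y_train)          # same call for every algorithm
    return model.score(X_test, y_test)   # accuracy or R², depending on model

# Works identically for LinearRegression, LogisticRegression, etc.
```

The full 18-algorithm roadmap, with ✅ complete, ⏳ in progress, and 📅 planned: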
✅ Linear Regression
✅ Multiple Regression
✅ Ridge Regression
✅ Logistic Regression
⏳ K-Nearest Neighbors (KNN)
⏳ Naive Bayes
⏳ Support Vector Machines (SVM)
⏳ Decision Trees
📅 Random Forests
📅 AdaBoost
📅 Gradient Boosting
📅 XGBoost
📅 k-Means Clustering
📅 Hierarchical Clustering
📅 Principal Component Analysis (PCA)
📅 t-SNE
The project welcomes contributions across several areas:
| Priority | Contribution Areas |
|---|---|
| High | Implement the remaining 14 algorithms • Add more examples to existing algorithms • Create visualization utilities • Add unit tests |
| Medium | Bug fixes and improvements • Documentation enhancements • Translate documentation • Create video tutorials |
| Always Welcome | Typo fixes • Grammar improvements • Better explanations • More examples |
This repository provides foundational understanding of machine learning algorithms through complete, well-documented implementations. The project serves as both a learning resource and reference material for algorithm internals, supporting students, researchers, and professionals in developing deeper ML expertise.