BareBonesML: Inner workings of ML algorithms
This repository contains Python scripts that implement machine learning
algorithms from scratch. The purpose of this repository is to provide simple and
pedagogic implementations of machine learning algorithms to aid in understanding
their inner workings. The focus is on clarity and simplicity, and the code is
not intended to replace standard machine learning libraries for use in
production.
Supervised Algorithms
Support Vector Machine (SVM)
Central to the SVM is the concept of maximizing the margin between classes,
achieved by finding the hyperplane that separates the data points of different
classes while maximizing the distance between the hyperplane and the nearest
data points, known as support vectors.
Here is a minimal implementation of SVM. This implementation incorporates a
regularization parameter, \(\lambda\), which controls the trade-off
between minimizing the training error and preventing overfitting by penalizing
large parameter values. The logic is as follows:
Input: Training dataset (X, y), hyperparameters (learning_rate, lambda_param, num_iterations)

1. Initialize weights (w) and bias (b) to zeros or small random values.
2. Repeat for num_iterations, for each training sample (x_i, y_i):
   a. Compute the hinge term: distance = 1 - y_i * (x_i . w - b)
      (the regularized loss being minimized is
       L = lambda_param * ||w||^2 + sum(max(0, distance)) / num_samples)
   b. Compute the gradient of the loss with respect to the weights:
      dw = 2 * lambda_param * w - y_i * x_i   if distance > 0,   else   2 * lambda_param * w
   c. Compute the gradient of the loss with respect to the bias:
      db = y_i   if distance > 0,   else   0
   d. Update weights and bias using gradient descent:
      w = w - learning_rate * dw
      b = b - learning_rate * db

Output: Trained SVM model with weights (w) and bias (b)
- X: input feature matrix
- y: target labels (encoded as -1 or +1)
- lambda_param: regularization parameter controlling the trade-off between margin maximization and error minimization
- learning_rate: step size for gradient descent
- num_iterations: number of gradient-descent iterations
- num_samples: number of samples in the training dataset
- distance: 1 - y_i * (x_i . w - b), a measure of how far a training sample is from satisfying the margin; the hinge loss is active when it is positive
- hinge loss: max(0, distance), which penalizes misclassified samples and samples that fall inside the margin
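Here is a minimal Python sketch that follows the logic above (a simplified illustration, not necessarily identical to the script in this repository):

```python
import numpy as np

class LinearSVM:
    """Minimal linear SVM trained with stochastic sub-gradient descent on the hinge loss."""

    def __init__(self, learning_rate=0.001, lambda_param=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.lambda_param = lambda_param
        self.num_iterations = num_iterations

    def fit(self, X, y):
        # y is expected to contain the labels -1 and +1
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.num_iterations):
            for x_i, y_i in zip(X, y):
                distance = 1 - y_i * (np.dot(x_i, self.w) - self.b)
                if distance > 0:
                    # hinge loss is active: pull the hyperplane toward classifying x_i correctly
                    dw = 2 * self.lambda_param * self.w - y_i * x_i
                    db = y_i
                else:
                    # sample lies outside the margin: only the regularizer contributes
                    dw = 2 * self.lambda_param * self.w
                    db = 0.0
                self.w -= self.learning_rate * dw
                self.b -= self.learning_rate * db
        return self

    def predict(self, X):
        return np.sign(np.dot(X, self.w) - self.b)
```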
This works well for datasets that:
- have no outliers, i.e. no overlapping class labels
- are linearly separable

Real-world datasets with noise and non-linear separation boundaries are handled as follows:
Dealing with outliers: The Slack term
Slack refers to the allowance of misclassification or errors in the SVM
optimization process. Introducing slack enables the SVM to handle data that is
not perfectly linearly separable by allowing some data points to fall within the
margin or even on the wrong side of the decision boundary. This flexibility helps
the SVM generalize better to complex datasets, although allowing too much slack
can lead to underfitting.
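For reference, the standard soft-margin objective with slack variables \(\xi_i\) can be written as (here \(C\) weights the slack penalty and plays the inverse role of \(\lambda\) above):

\[
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w \cdot x_i - b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
\]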
Dealing with non-linear separation boundaries: Kernel Trick
The kernel trick is a method used to extend SVMs to non-linear decision
boundaries without explicitly mapping the input data into a higher-dimensional
feature space. It works by implicitly computing the dot product between data
points in a higher-dimensional space, as if they were explicitly mapped. This
allows SVMs to effectively learn complex decision boundaries while avoiding the
computational burden of explicit feature mapping.
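As an illustrative sketch (not necessarily how this repository implements it), the widely used RBF kernel computes these implicit dot products directly from pairwise distances:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Radial basis function (Gaussian) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2)."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_matrix(X, gamma=0.5):
    """Pairwise kernel (Gram) matrix; the SVM dual needs only these implicit dot products."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
    return K
```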
Together, slack and the kernel trick contribute to SVMs’ ability to handle
diverse datasets and achieve high classification accuracy in various real-world
applications.
Naive Bayes
Naive Bayes is a probabilistic algorithm commonly used for classification
tasks. Its simplicity, speed, and effectiveness make it a popular choice for
text classification, spam filtering, and other tasks involving high-dimensional
data with discrete features. The core principle behind Naive Bayes is Bayes’
theorem, which describes the probability of a hypothesis given evidence. In the
context of classification, Naive Bayes calculates the probability of each class
given a set of features and selects the class with the highest probability as
the predicted class for a given instance.
One of the key assumptions of Naive Bayes is the “naive” assumption of feature
independence. This assumption simplifies the calculation of probabilities by
assuming that each feature contributes independently to the probability of a
class label, given the input data. While this assumption may not always hold
true in practice, Naive Bayes often performs remarkably well even when the
independence assumption is violated. Another advantage of Naive Bayes is its
ability to handle large feature spaces efficiently. It requires only a small
amount of training data to estimate the parameters of the probability
distributions, making it particularly useful for datasets with many
features. Despite the simplicity of its underlying assumptions, Naive Bayes has
proven to be effective in a wide range of
real-world applications, making it a valuable tool in the machine learning
practitioner’s toolkit.
Here is a minimal implementation of Naive Bayes classification.
The logic is as follows:
Input: Training dataset (X, y)

1. Calculate prior probabilities for each class:
   a. Count the occurrences of each class label in y.
   b. Divide the count of each class by the total number of samples to get the prior probability P(c).
2. For each feature:
   a. Calculate likelihood probabilities for each feature value given each class:
      i. Count the occurrences of each feature value within each class.
      ii. Divide the count by the number of samples in that class to get P(feature value | c).
3. For each instance x in the test set:
   a. For each class c:
      i. Calculate the posterior probability P(c | x) using Bayes' theorem: P(c | x) = P(x | c) * P(c) / P(x)
      ii. Compute the product of likelihood probabilities for each feature value in x given class c.
      iii. Multiply the product by the prior probability of class c.
   b. Predict the class with the highest posterior probability as the label for instance x.

Output: Predicted class labels for the test set.
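A minimal Python sketch of this logic, assuming categorical (discrete) features and using Laplace smoothing to avoid zero probabilities (the class name and structure are illustrative, not the repository's exact code):

```python
import numpy as np

class CategoricalNaiveBayes:
    """Naive Bayes for discrete features with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y, alpha=1.0):
        X, y = np.asarray(X), np.asarray(y)
        self.classes = np.unique(y)
        self.priors = {c: np.mean(y == c) for c in self.classes}
        # likelihoods[(c, j)] = (table of P(value | class c) for feature j, smoothed default)
        self.likelihoods = {}
        for c in self.classes:
            X_c = X[y == c]
            for j in range(X.shape[1]):
                n_values = len(np.unique(X[:, j]))            # distinct values of feature j
                denom = len(X_c) + alpha * n_values
                values, counts = np.unique(X_c[:, j], return_counts=True)
                table = {v: (cnt + alpha) / denom for v, cnt in zip(values, counts)}
                self.likelihoods[(c, j)] = (table, alpha / denom)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            scores = {}
            for c in self.classes:
                # work in log space to avoid numerical underflow of small probability products
                score = np.log(self.priors[c])
                for j, v in enumerate(x):
                    table, default = self.likelihoods[(c, j)]
                    score += np.log(table.get(v, default))
                scores[c] = score
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```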
Limitations
- Assumption of Feature Independence
- Handling of Continuous Features: Naive Bayes is inherently designed for categorical
  features and assumes a discrete distribution for feature values. While it can handle
  continuous features by discretizing them, this discretization may lead to loss of
  information and reduced model performance.
- Imbalanced Datasets: Naive Bayes may exhibit biased predictions on imbalanced datasets,
  where one class significantly outweighs the others. It may prioritize the majority
  class and struggle to accurately classify instances from the minority class,
  particularly when combined with the assumption of feature independence.
Despite these limitations, Naive Bayes remains a popular and useful algorithm,
especially in text classification and other domains with high-dimensional and
sparse feature spaces.
Ensemble Learning
Ensemble learning is an ML technique that combines the predictions of multiple
individual models to produce a stronger overall prediction. The idea behind
ensemble learning is to leverage the diversity of different models’ predictions
to improve the accuracy and robustness of the final prediction. Ensemble methods
can be applied to both classification and regression problems.
The major types of ensemble learning are:
- Bagging: Multiple models are independently trained on subsets of the
  training data. These subsets are obtained by sampling with replacement (i.e.
  bootstrapping). The predictions of all the models are combined to obtain the final
  prediction, either by taking a majority vote (for classification) or by averaging
  (for regression).
- Boosting: Multiple weak learners (i.e. models that perform only slightly better than
  chance) are sequentially trained on differently weighted versions of the
  data. At each step in the sequence, the subsequent weak learner tries to
  minimize the error on the subset of training datapoints that had the largest
  error in the previous step.
Limitations
- Both Bagging and Boosting methods are good at reducing variance
- Bagging doesn’t reduce the bias
- Boosting is prone to overfitting
Random Forest (Bagging)
Random Forest is a popular ensemble learning algorithm based on Bagging.
During the training phase, multiple decision trees are trained. Each decision
tree is trained on a bootstrapped subset (sampled with replacement) of the training
data and makes predictions based on a random subset of features at each node
split. This randomness introduces diversity among the trees, reducing
overfitting and improving the model’s generalization performance. The final
output is either the mode or the mean of the individual trees’ predictions, for
classification or regression respectively.
It can easily handle high-dimensional datasets with complex interactions between
features. By aggregating predictions from multiple decision trees, Random Forest
can capture a wide range of patterns and relationships present in the data,
making it robust to noisy or incomplete datasets. Additionally, it
can identify the most influential features in the prediction process.
Here is the code implementation in Python.
The logic is as follows:
Input: Training dataset (X_train, y_train), number of trees (n_trees),
       number of features to consider at each split (max_features)

RandomForest(X_train, y_train, n_trees, max_features):
    forest = []
    for _ in range(n_trees):
        # Create a bootstrap sample
        X_bootstrap, y_bootstrap = bootstrap_sample(X_train, y_train)
        # Train a decision tree on the bootstrap sample
        tree = DecisionTree(X_bootstrap, y_bootstrap, max_features)
        # Add the trained tree to the forest
        forest.append(tree)
    return forest

DecisionTree(X, y, max_features):
    tree = Tree()
    tree.build(X, y, max_features)
    return tree

bootstrap_sample(X, y):
    indices = random.choices(range(len(X)), k=len(X))   # sample with replacement
    return X[indices], y[indices]

predict(forest, X_test):
    predictions = []
    for tree in forest:
        predictions.append(tree.predict(X_test))
    # Aggregate predictions from all trees
    # (e.g., using majority voting for classification)
    return aggregate(predictions)
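For a runnable sketch of the same idea, here is a version that assumes scikit-learn's DecisionTreeClassifier as the base learner instead of the from-scratch tree above:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=10, max_features="sqrt", rng=None):
    """Train n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = rng or np.random.default_rng(0)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))        # sample with replacement
        tree = DecisionTreeClassifier(max_features=max_features)
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_random_forest(forest, X):
    """Majority vote across trees for each test sample."""
    all_preds = np.array([tree.predict(X) for tree in forest])   # shape (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])
```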
Gradient Boosting Machine
Gradient boosting works by combining multiple weak learners, typically decision trees, to
create a strong learner that can make accurate predictions. Unlike traditional
boosting algorithms that focus on re-weighting misclassified samples, gradient
boosting optimizes the model by sequentially fitting new weak learners to the
residuals or errors made by the previous ones. This iterative process involves
training each new weak learner to predict the residuals left over by the
ensemble of existing weak learners, gradually reducing the errors in
prediction. It is widely used in various domains, including regression,
classification, and ranking tasks, and popular implementations like XGBoost and
LightGBM have made it a cornerstone of modern machine learning workflows.
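A minimal sketch of this residual-fitting loop for squared-error regression, assuming shallow scikit-learn regression trees as the weak learners (function names and defaults are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Sequentially fit trees to the residuals of the current ensemble (squared-error loss)."""
    f0 = np.mean(y)                   # initial prediction: the mean of the targets
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred          # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees, learning_rate

def predict_gradient_boosting(model, X):
    f0, trees, learning_rate = model
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```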
Gaussian Discriminant Analysis
Gaussian Discriminant Analysis (GDA) is a probabilistic generative model used
for classification tasks. In GDA, it is assumed that the features of each class
follow a multivariate normal (Gaussian) distribution. The model estimates the
parameters of these distributions for each class from the training data. By
utilizing Bayes’ theorem, GDA calculates the posterior probability of each class
given a feature vector. During prediction, the class with the highest posterior
probability is assigned to the input sample. GDA also allows for the estimation
of class priors, which represent the likelihood of encountering each class in
the population. Despite its simplicity and strong assumptions, GDA often
performs well in practice, especially when the underlying data distribution
closely resembles a Gaussian distribution. However, GDA can be sensitive to
deviations from Gaussianity and may not perform optimally in high-dimensional
spaces or with highly skewed or multimodal data.
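As an illustrative sketch, assuming one full covariance matrix per class (the quadratic variant; sharing a single covariance across classes would give the linear variant):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gda(X, y):
    """Estimate class priors, means, and covariance matrices from the training data."""
    params = {}
    for c in np.unique(y):
        X_c = X[y == c]
        params[c] = {
            "prior": len(X_c) / len(X),
            "mean": X_c.mean(axis=0),
            "cov": np.cov(X_c, rowvar=False),
        }
    return params

def predict_gda(params, X):
    """Assign each sample to the class with the highest posterior (prior * Gaussian likelihood)."""
    classes = list(params)
    scores = np.column_stack([
        np.log(p["prior"]) + multivariate_normal.logpdf(X, p["mean"], p["cov"])
        for p in params.values()
    ])
    return np.array(classes)[np.argmax(scores, axis=1)]
```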
K-Nearest Neighbours
K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning
algorithm used for both classification and regression tasks. In KNN, the
prediction of a new data point is determined by the majority vote (for
classification) or the average (for regression) of its K nearest neighbors in
the feature space. The “nearest” neighbors are typically defined by a distance
metric, commonly the Euclidean distance. KNN is a non-parametric algorithm,
meaning it does not make explicit assumptions about the underlying data
distribution, and it is often referred to as a lazy learner because it does not build a
model during training; instead, it memorizes the entire training dataset. KNN’s
simplicity and effectiveness make it a popular choice for many applications,
although it may suffer from computational inefficiency, especially with large
datasets, and it requires careful tuning of the hyperparameter K to balance
between bias and variance.
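A minimal Python sketch of KNN classification with Euclidean distance (an illustration, not necessarily the repository's exact code):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote among its k nearest training points."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    preds = []
    for x in np.asarray(X_test):
        distances = np.linalg.norm(X_train - x, axis=1)    # Euclidean distance to every training point
        nearest = np.argsort(distances)[:k]                 # indices of the k closest points
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)
```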
Unsupervised Algorithms
K-Means Clustering
K-Means is an algorithm used for partitioning a dataset into K distinct,
non-overlapping clusters. The algorithm works by iteratively assigning data
points to the nearest cluster centroid and then updating the centroids based on
the mean of the data points assigned to each cluster. This process continues
until the centroids no longer change significantly or a specified number of
iterations is reached. K-means aims to minimize the within-cluster sum of
squares, essentially minimizing the distance between data points within the same
cluster while maximizing the distance between different clusters.
Here is the logic:
1. Initialize K cluster centroids randomly.
2. Repeat until convergence:
   a. Assign each data point to the nearest cluster centroid.
   b. Recompute the cluster centroids as the mean of the data points assigned to each cluster.
3. Convergence criteria:
   - No change in cluster assignments for any data point.
   - Maximum number of iterations reached.
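A minimal NumPy sketch of this loop (Lloyd's algorithm; initialization and convergence test are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iterations=100, rng=None):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    rng = rng or np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iterations):
        # assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # recompute centroids as the mean of the assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):              # converged
            break
        centroids = new_centroids
    return centroids, labels
```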
PCA
Principal Component Analysis (PCA) is a dimensionality reduction technique
commonly used in data analysis and machine learning. Its primary goal is to
reduce the dimensionality of a dataset while preserving most of its important
information. PCA achieves this by transforming the original features into a new
set of uncorrelated variables called principal components. These components are
linear combinations of the original features, ordered in such a way that the
first principal component captures the maximum variance in the data, the second
component captures the second highest variance, and so on. By selecting only a
subset of these principal components, often those that capture the most
variance, one can effectively reduce the dimensionality of the dataset while
retaining most of its important characteristics. PCA is widely used for data
visualization, noise reduction, feature extraction, and speeding up machine
learning algorithms by reducing the computational complexity associated with
high-dimensional data.
The logic is as follows:
- Input: Data matrix X (n x m), where n is the number of samples and m is the number of features.
- Compute the mean of each feature and subtract it from the corresponding feature in each sample to center the data.
- Compute the covariance matrix of the centered data.
- Compute the eigenvalues and eigenvectors of the covariance matrix.
- Sort the eigenvectors by their corresponding eigenvalues in descending order.
- Select the top k eigenvectors corresponding to the largest eigenvalues to form the principal components.
- Project the centered data onto the selected principal components to obtain the transformed data.
- Output: Transformed data and the selected principal components.
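These steps translate almost directly into NumPy; here is a short sketch (function name and return values are illustrative):

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                       # center each feature
    cov = np.cov(X_centered, rowvar=False)                # covariance matrix of the features
    eigenvalues, eigenvectors = np.linalg.eigh(cov)       # eigh: symmetric matrix, ascending order
    order = np.argsort(eigenvalues)[::-1]                 # sort by eigenvalue, descending
    components = eigenvectors[:, order[:k]]               # top-k eigenvectors as columns
    return X_centered @ components, components            # transformed data and the components
```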
Sequential Data
Hidden Markov Model
Hidden Markov Models (HMMs) have been successfully used in fields ranging from
speech recognition to bioinformatics for modeling sequential data. At its core,
an HMM models the dynamics of the hidden underlying states that influence observed outcomes.
Here’s how it works:
Hidden States: These are the unseen variables that govern the system’s
behavior. For instance, in weather forecasting, hidden states could represent
weather conditions like sunny, rainy, or cloudy.
Observations: These are the visible outcomes influenced by the hidden
states. For instance, the observable outcomes in weather forecasting could be
temperature, humidity, or precipitation.
Transition Probabilities: Hidden Markov Models capture how the system
transitions between hidden states over time.
Emission Probabilities: These determine the likelihood of observing a particular
outcome given the current hidden state. For example, in our weather analogy,
emission probabilities could represent the likelihood of observing specific
temperatures or humidity levels for each weather condition.
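To make these pieces concrete, here is a small sketch of the forward algorithm, which combines the initial, transition, and emission probabilities to compute the likelihood of an observation sequence (the two-state weather example and all numbers are hypothetical):

```python
import numpy as np

# Hypothetical two-state weather HMM: states = (sunny, rainy), observations = (dry, humid)
initial = np.array([0.6, 0.4])                    # P(first hidden state)
transition = np.array([[0.7, 0.3],                # P(next state | current state)
                       [0.4, 0.6]])
emission = np.array([[0.8, 0.2],                  # P(observation | hidden state)
                     [0.3, 0.7]])

def forward(observations, initial, transition, emission):
    """Forward algorithm: P(observation sequence), summed over all hidden-state paths."""
    alpha = initial * emission[:, observations[0]]        # state probabilities after the first observation
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]    # propagate through time and reweight
    return alpha.sum()

print(forward([0, 1, 1], initial, transition, emission))   # e.g. P(dry, humid, humid)
```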
Applications: HMMs find applications in diverse domains. In speech recognition,
they help decipher spoken words by modeling phonetic states and observed
acoustic features. In finance, they forecast market trends by modeling hidden
states representing market conditions.
Challenges and Limitations: While HMMs are versatile, they have
limitations. They assume a fixed number of hidden states and discrete
observations, which may not always align with real-world
complexities. Additionally, training HMMs can be computationally intensive.
Despite their limitations, Hidden Markov Models remain invaluable for modeling
sequential data and making predictions in various domains. Understanding their
basic principles can unlock a world of possibilities in data analysis and
decision-making.
Optimization techniques
Simulated annealing
A technique for finding an approximate global optimum in highly rugged energy
landscapes.
Simulated annealing is like a smart explorer searching for the highest peak in a
rugged landscape. Instead of following a strict path, it takes detours and
explores different routes, sometimes even climbing uphill initially. This
randomness allows it to escape from valleys and potentially find the highest
summit. As it progresses, the explorer becomes more methodical, gradually
reducing its willingness to take risky moves, akin to the cooling of a metal
during annealing. Eventually, it settles on a solution that might not be the
absolute peak but is pretty close. This approach is incredibly useful in
tackling complex problems where traditional methods might get stuck in local
optima. Whether it’s designing efficient circuits or optimizing logistical
routes, simulated annealing offers a versatile tool to navigate through the
complexities and find near-optimal solutions.
Here is a demo of how to use simulated annealing to solve the classical
travelling salesman problem:
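A minimal sketch of that idea, using random city coordinates, a swap-two-cities neighborhood, and geometric cooling (the actual demo in the repository may differ):

```python
import numpy as np

def tour_length(cities, tour):
    """Total length of the closed tour visiting the cities in the given order."""
    ordered = cities[tour]
    return np.sum(np.linalg.norm(ordered - np.roll(ordered, -1, axis=0), axis=1))

def simulated_annealing_tsp(cities, temp=10.0, cooling=0.995, n_iterations=20000, rng=None):
    rng = rng or np.random.default_rng(0)
    tour = rng.permutation(len(cities))
    current_len = tour_length(cities, tour)
    best_tour, best_len = tour.copy(), current_len
    for _ in range(n_iterations):
        # propose a neighbor: swap two random cities in the tour
        i, j = rng.integers(0, len(cities), size=2)
        candidate = tour.copy()
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_len = tour_length(cities, candidate)
        # accept better tours always; accept worse tours with probability exp(-delta / temp)
        delta = candidate_len - current_len
        if delta < 0 or rng.random() < np.exp(-delta / temp):
            tour, current_len = candidate, candidate_len
            if current_len < best_len:
                best_tour, best_len = tour.copy(), current_len
        temp *= cooling                                   # gradually cool down
    return best_tour, best_len

cities = np.random.default_rng(1).random((20, 2))         # 20 random cities in the unit square
print(simulated_annealing_tsp(cities)[1])                  # length of the best tour found
```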