Code ML Algorithms From Scratch
Coding interviews can mean different things for “traditional” software engineers (backend, frontend, fullstack, etc.) and engineers with a machine learning focus. Apart from LeetCodestyle questions, ML engineers (as well as applied scientists, research engineers, and, occasionally, machine learning data scientists) may be asked to implement a classic ML algorithm from scratch during an interview.
This may sound scary if you’ve only used libraries to train models without understanding how learning algorithms work under the hood. Moreover, there are way too many algorithms to memorize. The good news is, only 4 algorithms are commonly asked in interviews: Linear regression, logistic regression, knearest neighbors (kNN), and kmeans. You’re probably not gonna code a transformer on the fly in 45 minutes. Let’s implement each using vanilla NumPy (and some builtin libraries).
Linear Regression
Linear regression uses a vector of realvalued inputs $\mathbf{x} \in \mathbb{R}^D$ (e.g., age, years of education, onehot encoded job categories) to predict a realvalue output $y \in \mathbb{R}$ (e.g., income): $\hat{y_i} = \mathbf{w}^T \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \ldots + b$ (predicted income for person $i$).
In the formula above, $\mathbf{w}$ is “regression coefficients” (statistics) or “weights” (ML), which quantify how much each feature impacts the outcome with all else being held constant. The bias term $b$ (“intercept” in stats) is the outcome value when all features are 0 — or, we can understand it as a force that moves $\hat{y}$ up and down.
We can use several methods to solve for $\mathbf{w}$ and $\mathrm{b}$, such as Ordinary Least Squares (OLS) and gradient descent. Since the latter method applies to a wide range of supervised learning algorithms, not just linear regression, I’ll implement it first.
Gradient Descent
When using gradient descent, we can initialize $\mathbf{w}$ and $b$ with random values, get terrible results at first, and then gradually learn from our mistakes. If using gradient descent, below are the building blocks of a linear regressor:

Cost function: Mean squared error (MSE), $J(\mathbf{w}, b) = \frac{\sum_{i}^{n} (y_i  \hat{y_i})^2}{n}$, can measure how “off” the model predictions are. $y_i$ is the observed value of observation $i$, $\hat{y_i}$ is the predicted value of $i$, and $n$ is the number of observations in the dataset. The objective is to find values of $\mathbf{w}$ and $b$ that minimize $J(\mathbf{w}, b)$.

Optimization routine: If you wanna brush of your memory of gradient descent, I strongly suggestion YouTube videos by StatQuest and 3Blue1Brown.
A gradient is the partial derivative of a function with respect to a parameter. For visualization purposes, say we only have one parameter and the picture below shows how the cost function changes with different parameter values. If the gradient is negative, we wanna move a bit to the right; if positive, a bit to the left. When the gradient is 0, it means we’re at a minimum, global or local.
Luckily, we don’t have to code up different scenarios: By subtracting gradients from parameter values, we’ll naturally move to the right when gradients are negative and to the left when they’re positive. One thing to note is that if we move too much at a time, we may well shoot past the minima — instead, we can adjust the step size using a learning rate parameter, which is usually a small number (e.g., 0.01, 0.001). However, if the learning rate is too small (e.g., 0.000001), training can take a long time and we may get stuck in local minima. We can use crossvalidation to find an ideal learning rate.
For each set of parameter values, each observation has its own gradients. In canonical gradient descent, we need to calculate the gradients of all observations. Let $\mathbf{X}$ be the feature matrix of $n$ observations and $\mathbf{y}$ the outcome vector; below are gradients with regard $\mathbf{w}$ and $b$ (see the derivation):
 With regard to $\mathbf{w}$: $\frac{\partial J(\mathbf{w}, b)}{\partial \mathbf{w}} = \frac{2 \mathbf{X}^T \cdot (\mathbf{y}\hat{\mathbf{y}})}{n}$
 With regard to $b$: $\frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{2 \sum_{i}^{n} y_i  \hat{y_i}}{n}$

Stopping criteria: We can stop when gradients are 0 (or sufficiently small), or if we’ve run the algorithm after a fixed number of epochs.
ScikitLearn doesn’t actually implement gradient descent for linear regression. SGDRegressor
uses stochastic gradient descent (SGD) to optimize parameter values. Instead of using all observations each time, SGD randomly chooses one observation per epoch to compute gradients, which speeds up learning and introduces randomness to the learning process. For large datasets, this added randomness is useful for avoiding getting stuck at local minima. For small datasets, it’s a nightmare.
1# import libraries
2from sklearn import datasets
3from sklearn.model_selection import train_test_split
4from sklearn.linear_model import SGDRegressor
5
6# load the diabetes dataset for demonstration
7X, y = datasets.load_diabetes(return_X_y=True)
8
9# use 80% for training and 20% for test
10X_train, X_test, y_train, y_test = train_test_split(
11 X, y, test_size=0.2, random_state=42
12)
13
14# create a new model
15lr = SGDRegressor(learning_rate="optimal", max_iter=10000)
16# fit model to training data
17lr.fit(X_train, y_train)
18# use fitted model to predict test data
19y_pred = lr.predict(X_test)
20
Spoiler: The diabetes dataset only has 442 data points, which turns out far from enough for the SGD regressor to convergeπ.
In interviews, you’re most likely not allowed to use highlevel libraries such as ScikitLearn. We can write a custom LinearRegression
class using just NumPy (for vectorized operations). The example below is adapted from GeeksforGeeks.
1class LinearRegression:
2 def __init__(self, lr=0.01, epoch=10):
3 # set hyperparameter values
4 self.lr = lr # learning rate
5 self.epoch = epoch # number of epochs
6
7 def fit(self, X, y):
8 # number of observations, number of features
9 self.n_obs, self.n_features = X.shape
10 # initialize weights and bias
11 self.w = np.zeros(self.n_features)
12 self.b = 0
13 # read data and labels from input
14 self.X = X
15 self.y = y
16 # use gradient descent to update weights
17 for i in range(self.epoch):
18 self.update_weights()
19
20 return self
21
22 def update_weights(self):
23 # use current parameters to predict
24 y_pred = self.predict(self.X)
25 # compute gradients with respect weights
26 grad_w = 2 * np.dot(self.X.T, (self.y  y_pred)) / self.n_obs
27 # compute gradient with respect to bias
28 grad_b = 2 * np.sum(self.y  y_pred) / self.n_obs
29 # update parameters by substracting lr * gradients
30 self.w = self.w  self.lr * grad_w
31 self.b = self.b  self.lr * grad_b
32
33 return self
34
35 def predict(self, X):
36 # use current parameters to make predictions
37 return X.dot(self.w) + self.b
A few things to highlight:

Initialization (
__init__
): In the ScikitLearn example earlier, we only provided hyperparameter values when initializing the model. If values are not provided, we use default values. We can mimic that pattern in our custom class. Here we got two hyperparameters: Learning rate (lr
) and the number of iterations (epoch
). 
Model fitting (
fit
): The.fit
method takes two arguments, the data (a $n \times m$ array, where $n$ is the number of observations and $m$ is the number of features) and the target (a $n$vector). Modeling fitting doesn’t return any values — rather, $\mathbf{w}$ and $b$ are modified in place using gradient descent.The helper function
update_weights
is where we implement gradient descent. Note that it calls the.predit
method each time to generate predictions using current parameter values. After calculating gradients of $\mathbf{w}$ and $b$ by translating the two corresponding formulas to code, we use them to update parameter values. This process repeats for the number of epochs specified (epoch
). 
Model prediction (
predict
): The.predict
method uses the best parameter values we found in.fit
to generate predictions for new data, $\mathbf{y} = \mathbf{X} \cdot \mathbf{w} + b$.
The rootmeansquare error (RMSE, which is the square root of MSE) is a common metric to evaluate regression models. Unlike classification metrics (e.g., precision, recall, F1 score, AUC), the absolute value of RMSE doesn’t tell us much about how good a model is. We need to compare RMSE of several models to find the winner.
OLS
Former Quora data scientist Zi Chong Kao has an extraordinary geometric explanation of the Ordinary Least Squares (OLS) method. Like gradient descent, the goal of OLS is to minimize prediction errors. The slight difference is that instead of minimizing MSE, OLS minimizes the sum of squared residuals (SSR), $\lVert \mathbf{y}  \mathbf{X}\mathbf{\beta} \rVert ^2$.
Here we’re using a new formula $\mathbf{\hat{y}} = \mathbf{X}\mathbf{\beta}$, which is similar to the one before, $\mathbf{\hat{y}} = \mathbf{X} \cdot \mathbf{w} + b$, except that now we added a column all 1’s to the beginning of the original feature matrix $\mathbf{X}$. Long story short, this transformation simplifies the closedform solution of the best $\beta$ (see OLS in Matrix Form)): $(\mathbf{X}^T \mathbf{X})^{1}\mathbf{X}^T \mathbf{y}$. Know that $\beta_0 = b$.
Fun fact: OLS is the method used by LinearRegression
in ScikitLearn.
1# import library
2from sklearn.linear_model import LinearRegression
3# create a new model
4lr = LinearRegression()
5# fit model to training data
6lr.fit(X_train, y_train)
7# use fitted model to predict test data
8y_pred = lr.predict(X_test)
Counterintuitively, I find gradient descent easier to implement than OLS, mostly due to the transformation (e.g., adding an $n$vector of 1’s to $\mathbf{X}$) required by the latter. The code below is simplified from blogposts by IBM and datascience+.
1import copy
2
3class LinearRegression:
4 def __init__(self):
5 # no hyperparameters to intialize
6 pass
7
8 def fit(self, X, y):
9 # read data and labels from input
10 self.X = X
11 self.y = y
12 # create a vector of all 1's to X
13 X = copy.deepcopy(X) # keep original X intact
14 dummy = np.ones(X.shape[0]) # create a vector of 1's
15 X = np.concatenate((dummy, X), 1) # add it to X
16 # use OLS to estimate betas
17 xT = X.transpose()
18 inversed = np.linalg.inv(xT.dot(X))
19 betas = inversed.dot(xT).dot(y)
20 # bias is the first column
21 self.b = betas[0]
22 # weights are the rest
23 self.w = betas[1:]
24
25 return self
26
27 def predict(self, X):
28 return X.dot(self.w) + self.b

Initialization (
__init__
): OLS doesn’t have any hyperparamters, so we can usepass
to skip hyperparameter initialization. 
Model fitting (
fit
): As mentioned, we need to transform $\mathbf{X}$ by adding a new column, but we don’t want to mess with the original matrix. We can create a copy of the original (need thecopy
library) and transform the copy instead. The centerpiece of the.fit
method is translating $(\mathbf{X}^T \mathbf{X})^{1}\mathbf{X}^T \mathbf{y}$ into code.Unless we also want to transform new data $\mathbf{X}$ when making new predictions, we can extract the bias ($\beta_0$) and the weights ($\beta_j$, where $j=1, 2, \ldots, m$) from fitted $\beta$ and apply $\mathbf{X}\cdot \mathbf{w} + b$ to the untransformed matrix.

Model prediction (
predict
): This part is the same as in gradient descent.
In this case, OLS is a bit worse (higher RMSE) than gradient descent, but it works faster on small datasets and doesn’t require us to find the best learning rate.
Logistic Regression
Some say the name “logistic regression” is a tad misleading because it’s a classification method. However, I think the name appropriately highlights the connection between logistic regression and linear regression. In logistic regression, we still use realvalued features (e.g., income, age, onehot encoded move genres) to first predict some realvalued output. However, the direct output doesn’t carry much meaning — we want to transform it into a probability ([0, 1]) and make a decision based on thAT probability ($p \geq 0.5$: will buy popcorn; $p < 0.5$: won’t buy popcorn).
 The part same as in linear regression: $z = \mathbf{w}^T \mathbf{x} + b$ (for one observation)
 Transformation using the logistic function: $p = \frac{1}{1 + \exp(z)}$
 Make a decision: $\hat{y} = 1$ if $p \geq t$ and 0 if $p < t$ ($t$ is the decision threshold)
Rather than using MSE as the cost function, we can use cross entropy (explained by StatQuest): $\frac{\sum_{i}^{n} y_i \log(\hat{y_i})}{n}$, where $y_i$ is the actual outcome (0 or 1) and $\hat{y_i}$ the predicted outcome. We don’t want to use classification metrics (e.g., accuracy, precision, recall) as the cost function because they don’t have sensitive gradients.
Unlike linear regression, OLS is no longer appropriate for logistic regression, because it may predict values greater than 1 or less than 0. However, we can still use gradient descent to find the best $\mathbf{w}$ and $b$ values. I’ll skip the ScikitLearn implementation because it’s just a matter of importing different models.
First, we can use make_classification
to generate some toy data:
1# create some toy data
2from sklearn.datasets import make_classification
3
4X, y = make_classification(
5 n_samples=1000,
6 n_features=2,
7 n_redundant=0,
8 n_informative=2,
9 random_state=1,
10 n_clusters_per_class=1,
11)
12
13# use 80% for training and 20% for test
14X_train, X_test, y_train, y_test = train_test_split(
15 X, y, test_size=0.2, random_state=42
16)
17
Under cross entropy, the gradients of the parameters are almost the same as when we used MSE, except that now we don’t have the scalar 2:
 With regard to $\mathbf{w}$: $\frac{\partial J(\mathbf{w}, b)}{\partial \mathbf{w}} = \frac{\mathbf{X}^T \cdot (\mathbf{y}\hat{\mathbf{y}})}{n}$
 With regard to $b$: $\frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{\sum_{i}^{n} y_i  \hat{y_i}}{n}$
1class LogisticRegression:
2 def __init__(self, lr=0.01, epoch=10):
3 # set hyperparameter values
4 self.lr = lr # learning rate
5 self.epoch = epoch # number of epochs
6
7 def fit(self, X, y):
8 # number of observations, number of features
9 self.n_obs, self.n_features = X.shape
10 # initialize weights and bias
11 self.w = np.zeros(self.n_features)
12 self.b = 0
13 # read data and labels from input
14 self.X = X
15 self.y = y
16 # use gradient descent to update weights
17 for i in range(self.epoch):
18 self.update_weights()
19
20 return self
21
22 def update_weights(self):
23 # use current parameters to predict
24 y_pred = self.predict(self.X)
25 # calculate gradients with respect to weights
26 grad_w = np.dot(self.X.T, (self.y  y_pred)) / self.n_obs
27 # calculate gradients with respect to bias
28 grad_b = np.sum(self.y  y_pred) / self.n_obs
29 # update parameters by substracting lr * gradients
30 self.w = self.w  self.lr * grad_w
31 self.b = self.b  self.lr * grad_b
32
33 return self
34
35 def sigmoid(self, z):
36 return 1 / (1 + np.exp(z))
37
38 def predict(self, X):
39 # linear part
40 z = X.dot(self.w) + self.b
41 # sigmoid > transform [0, 1]
42 p = self.sigmoid(z)
43 # make decisions
44 return np.where(p < 0.5, 0, 1)
The implementation is similar to that of linear exception, except that we need extra steps to turn realvalued predictions into binary (0 or 1) decisions.
Compared to regression metrics, classification metrics are way more complex. With the same decision threshold, we can derive accuracy, precision, recall, F1score, and some more obscure ones (e.g., specificity) from the confusion matrix. To find a good threshold, we can use ROC AUC or PrecisionRecall AUC. To learn these metrics in detail, I highly recommend StatQuest’s new book. The results look quite good.
precision recall f1score support
0 0.90 0.97 0.93 108
1 0.96 0.87 0.91 92
accuracy 0.93 200
macro avg 0.93 0.92 0.92 200
weighted avg 0.93 0.93 0.92 200
KNN
Unlike linear regression or logistic regression, the knearest neighbors (kNN) algorithm is a nonparametric method in that it makes no assumptions about population distributions. Rather, kNN makes predictions by memorizing the training data: When a new observation comes in, we compute its distance (e.g., Euclidean, Manhattan, etc.) from all the training data to find its knearest neighbors. We can use crossvalidation to find the best k. For classification, we can take a majority vote on k neighbors' labels to predict the label of the new observation. For regression, we can take the average target value of k neighbors to make a prediction. I’ll just implement a kNN classifier, which you can easily change into a regressor.
The code below is adapted from a GeeksforGeeks post.
1# import library (to get mode)
2from scipy.stats import mode
3
4class KNeighborsClassifier:
5 def __init__(self, n_neighbors):
6 # set hyperparameter value
7 self.n_neighbors = n_neighbors
8
9 def fit(self, X_train, y_train):
10 # read train data and labels
11 self.X_train = X_train
12 self.y_train = y_train
13 # number of train observations
14 self.n_train = X_train.shape[0]
15
16 return self
17
18 def predict(self, X_test):
19 # read test data
20 self.X_test = X_test
21 # number of test observations
22 self.n_test = X_test.shape[0]
23 # collect predicted labels
24 y_pred = np.zeros(self.n_test)
25 # loop over all points in test data
26 for i in range(self.n_test):
27 # find k nearest neighbors of given point
28 p = self.X_test[i]
29 neighbors = np.zeros(self.n_neighbors)
30 neighbors = self.find_neighbors(p)
31 # take a majority vote on the final label
32 y_pred[i] = mode(neighbors)[0][0]
33 # return predicted labels
34 return y_pred
35
36 def euclidean(self, p1, p2):
37 # compute Eucleadian distance between two points
38 return np.sqrt(np.sum((p1  p2) ** 2))
39
40 def find_neighbors(self, p):
41 # compute distance between given point and all train points
42 distances = np.zeros(self.n_train)
43 # loop over all points in training data
44 for i in range(self.n_train):
45 distance = self.euclidean(p, self.X_train[i])
46 distances[i] = distance
47 # sort training labels by distance (ascending)
48 _, y_train_sorted = zip(*sorted(zip(distances, y_train)))
49 # return top k labeles
50 return y_train_sorted[: self.n_neighbors]
 Initialization (
__init__
): The only hyperparameter used here is the number of neighbors. In reality, you can also change the distance metric, whether all neighbors weigh equally into the final prediction, and so on.  Modeling fitting (
.fit
): Not much is going on when we “fit” the model — we just read in the training data and labels, without modifying any parameters (none exists).  Modeing prediction (
.predict
): The prediction process is quite computationally expensive. First, we need to loop over all test data. For each point in the test dataset, we loop over all the training data to compute pairwise distances and sort training labels by distances to get the nearest neighbors. Finally, we find the mode label for each test data point to classify it.
For the iris dataset, the results are “too perfect” when $k=3$ — this is not a coincidence: Small k values often lead to overfitting (because we’re mimicking the training data too closely) whereas large values often lead the opposite problem.
precision recall f1score support
0 1.00 1.00 1.00 10
1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
KMeans
Kmeans is a clustering algorithm, which is not to be confused with kNN, a supervised algorithm that we just coded. The idea behind kmeans is that we first determine how many clusters there are (by visualizing the data or using crossvalidation) and choose random points to be the centers of the clusters (centroids). Then we assign a cluster label to each point based on the closet centroid and then choose the average location of each cluster as the new centroid. We repeat this process for a given number of times to find final locations of the centroids.
The trained model is the locations of the centroids: When a new observation comes in, whichever centroid that’s closest to it determines its clsuter label.
1class KMeans:
2 def __init__(self, n_clusters=3, epoch=100):
3 # initialize hyperparameter values
4 self.n_clusters = n_clusters
5 self.epoch = epoch
6
7 def fit(self, X_train):
8 # read in the data
9 self.X_train = X_train
10 # number of training observations
11 self.n_train = X_train.shape[0]
12 # initialize with random centroids (cannot exceed training boundaries)
13 min_, max_ = np.min(self.X_train, axis=0), np.max(self.X_train, axis=0)
14 centroids = [
15 np.random.uniform(min_, max_) for _ in range(self.n_clusters)
16 ] # this generates n_clusters points with the correct dimensions
17
18 # train for given number of epochs
19 for i in range(self.epoch):
20 centroids = self.update_centroids(centroids)
21 # add final centroids to model attributes
22 self.centroids = centroids
23 return self
24
25 def update_centroids(self, centroids):
26 # store cluster labels of training data
27 clusters = np.zeros(self.n_train)
28 # assign each training data point to closest centroid
29 for i in range(self.n_train):
30 p = self.X_train[i] # isolate a data point
31 dists = [self.euclidean(p, centroid) for centroid in centroids]
32 clusters[i] = np.argmin(dists) # index of closest centroid is cluster label
33 # update centroids by averaging points in given cluster
34 for i in range(self.n_clusters):
35 # all points assigned to given cluster
36 points = self.X_train[np.array(clusters) == i]
37 # update new centroid to be the average point
38 centroids[i] = points.mean(axis=0)
39
40 return centroids
41
42 def euclidean(self, p1, p2):
43 # compute Eucleadian distance between two points
44 return np.sqrt(np.sum((p1  p2) ** 2))
45
46 def predict(self, X_test):
47 # read test data
48 self.X_test = X_test
49 # number of test observations
50 self.n_test = X_test.shape[0]
51 # store cluster labels of test data
52 clusters = np.zeros(self.n_test)
53 # assign each test data point to closest centroid
54 for i in range(self.n_test):
55 p = self.X_test[i]
56 dists = [self.euclidean(p, centroid) for centroid in self.centroids]
57 clusters[i] = np.argmin(dists)
58 # return predicted clusters of test data
59 return clusters

Initialization (
__init__
): We need two hyperparameters — the number of clusters and the number of epochs. 
Model training (
.fit
): Unlike supervised learning algorithm, kmeans only takes one argument, which is the feature array (X_train
), but not the label vector (y_train
). This is because the goal of clustering is no longer to predict ground truth labels, but to find intrinsic structures in the data.One thing to note is that we can’t just initialize centroids with whatever values we want — they should lie inside the training data, so we can find the best centroid locations faster. The
update_centroids
function updates the centroid locations in each epoch: It first clusters each training data point with the closest centroid to it; after going through all training data, it returns the average points of each of the k clusters as the new centroid locations. 
Model prediction (
.predict
): The prediction part is quite like training, except that we don’t update centroid locations anymore. Rather, we use the locations found during training to assign cluster labels to each test data point.
Since we don’t know the ground truth anymore, evaluating clustering results is quite complex. You can read about different evaluation metrics for clustering in this Wikipedia article. Some popular ones are the rand index (when you actually have the labels) and the silhouette coefficient (when you truly don’t have labels).
Resources
While learning to write the 4 algorithms above from scratch, I read many blog posts and tutorials. However, the code in these articles is often more convoluted than one manage during interviews. I (re)wrote these algorithms in as clean and straightforward a manner as possible, so that you and I can hopefully reproduce them from understanding and memory. To learn more, check out the resources below:
 The StatQuest Illustrated Guide To Machine Learning by StatQuest π If you’re new to ML, I highly recommend this lovely little book for building vivid intuitions about how ML algorithms work in general. Even though I’ve been doing ML stuff for a while, I still had several aha moments from my read.
 Source code of ScikitLearn (GitHub) π Of course, the model implementations in ScikitLearn are much more thorough than what you’re expected to write during interviews, but it’s illuminating to learn from how it’s done in practice.
 The Machine Learning series on GeeksforGeeks π I learned most of my implementations from GeeksforGeeks (tweaked the code quite a bit to make it more accessible). It’s totally a candy shop for ML kids.
 Machine Learning Algorithms From Scratch by Jason Brownlee π I haven’t read this book and I’m pretty sure you can learn all this stuff without buying this book. But if somehow you’re in a hurry and need to quick access to implementations of common algorithms, I guess you can check this one out.