This lecture will be a broad introduction to machine learning, framed in the context of specifying three separate aspects of machine learning algorithms. We will first cover a broad introduction to the idea of (supervised machine learning), define some notation on the topic, and then finally define the three elements of a machine learning algorithm: the hypothesis class, the loss function, and optimization methods.

Despite the seemingly vast number of machine learning algorithms, they all share a common framework that consists of three key ingredients: the hypothesis class, the loss function, and the optimization method. Understanding these three components is essential for situating the wide range of machine learning algorithms into a common framework.

**Hypothesis Class**: This is the set of all functions that the algorithm can choose from when learning from the data. For example, in the case of linear classification, the hypothesis class would consist of all linear functions.**Loss Function**: The loss function measures the discrepancy between the predicted outputs of the model and the true outputs. It guides the learning algorithm by providing a metric for the quality of a particular hypothesis.**Optimization Method**: This refers to the algorithm used to search through the hypothesis class and find the function that minimizes the loss function. Common optimization methods include gradient descent and its variants.

By specifying these three ingredients for a given task, one can define a machine learning algorithm. These ingredients are foundational to all machine learning algorithms, whether they are decision trees, boosting, neural networks, or any other method.

The running example used throughout the lecture series is multi-class linear classification. This example serves as a practical application of the concepts of supervised machine learning. It involves classifying images into categories based on their pixel values. For instance, an image of a handwritten digit is represented as a vector of pixel values, and the goal is to classify it as one of the possible digit categories (0 through 9).

In multi-class linear classification, the hypothesis class consists of linear functions, the loss function could be something like the softmax loss, and the optimization method could be stochastic gradient descent (SGD). This example will be used to illustrate the general principles of machine learning and will be implemented in code to provide a hands-on understanding of the concepts.

Machine learning offers an alternative approach to classical programming. Instead of manually encoding the logic for digit recognition, machine learning relies on providing a large set of examples of the desired input-output mappings. The machine learning algorithm then “figures out” the underlying logic from these examples.

Supervised machine learning, in particular, involves collecting numerous examples of digits, each paired with its correct label (e.g., a collection of images of the digit ‘5’ labeled as ‘5’). This collection of labeled examples is known as the training set. The training set is fed into a machine learning algorithm, which processes the data and learns to map new inputs to their appropriate outputs.

The process of machine learning is sometimes described as data-driven programming. Rather than specifying the logic explicitly, the programmer provides examples of the input-output pairs and lets the algorithm deduce the patterns and rules that govern the mapping. This approach can handle the variability and complexity of real-world data more effectively than classical programming.

The training set is a crucial component of supervised machine learning. It consists of numerous examples of inputs (e.g., images of digits) along with their corresponding outputs (the actual digits they represent). The machine learning algorithm uses this training set to learn the patterns and features that are predictive of the output.

The machine learning algorithm itself can be thought of as a “black box” that takes the training set and produces a model capable of making predictions on new, unseen data. The specifics of how the algorithm learns from the training set and what kind of model it produces will be discussed in further detail throughout the course.

Two primary examples of machine learning tasks are digit classification and language modeling. In digit classification, the goal is to classify images of handwritten digits into their corresponding numerical categories. Language modeling, on the other hand, deals with predicting the next word in a sequence given the beginning of a sentence. For example, given the input “the quick brown,” the model should predict the next word, such as “fox.”

Despite the apparent differences between images and text, machine learning algorithms can treat them similarly by operating on vector representations of the data. This approach allows for a unified treatment of various types of data.

Inputs in machine learning algorithms are represented as vectors, denoted by \(x \in \mathbb{R}^n\), which reside in an \(n\)-dimensional space. This means that each input vector is a collection of \(n\) real-valued numbers. For instance, an example of such a vector could be \[ x = \begin{bmatrix} 0.1 \\ 0 \\ -5 \\ \end{bmatrix}. \]

To refer to individual elements within this vector, subscripts are used. For example, \(x_2\) denotes the second element of the vector \(x\). In general \(x_j\) represents the \(j\)-th element of the vector.

In practice, machine learning algorithms work with not just a single input but a set of inputs known as a training set. To denote different vectors within a training set, superscripts enclosed in parentheses are used. For example, \(x^{(i)}\) represents the \(i\)-th vector in the training set, where \(i\) ranges from \(1\) to \(m\). Here, \(m\) is the number of training examples, and \(n\) is the dimensionality of the input space.

The targets or outputs, denoted by \(y\), correspond to the desired outcome for each input. In a multi-class classification setting, which is the focus of this discussion, the targets are a set of \(k\) numbers. Each output \(y^{(i)}\) is associated with an input vector \(x^{(i)}\) and is a discrete number in the set \(\{1, 2, ..., k\}\), where \(k\) is the number of possible classes. The evaluation set, often referred to as the test set, is another collection of ordered pairs, denoted as \(\bar{x}^{(i)}, \bar{y}^{(i))}\), where \(i\) ranges from 1 to \(\bar{m}\). The vectors \(\bar{x}^{(i)}\) are in the same space \(\mathbb{R}^n\), and the targets \(\bar{y}^{(i)}\) are from the same discrete set 1 to \(k\). The evaluation set is used to assess the performance of the machine learning model.

Multiple different settings are possible depending on the type of target:

**Multi-class Classification**: This is the scenario where there are multiple classes, and each input is classified into one of these classes. For example, in digit recognition, there are 10 classes (digits 0 through 9).**Binary Classification**: A special case of classification where there are only two classes. It is distinct enough to have its own name.**Regression**: If the target is a real value, the problem is referred to as regression. There are also more complex forms such as multi-value or vector-value regression and structured prediction, which deals with complex structured output spaces.

It is important to note that while modern AI may seem to produce complex outputs like paragraphs, the underlying algorithms often operate by outputting simpler elements, like one word at a time, which collectively form a structured output. Understanding multi-class classification provides a foundation to grasp these more complex outputs.

To illustrate the concept of the training set, consider the digit classification problem. Images of handwritten digits are represented as matrices of pixel values, where each pixel value ranges from 0 to 1, with 0 representing a black pixel and 1 representing a white pixel. For a 28 by 28 pixel image, the matrix is flattened into a vector \(x\) in \(\mathbb{R}^{784}\), where 784 is the product of the image dimensions. Each image corresponds to a high-dimensional vector, and the target value \(y\) is the actual digit the image represents, ranging from 0 to 9.

In language modeling, the input can be a sentence, such as “The quick brown fox.” The target for this input could be the next word in the sequence, for example, “jumps.” Each word in the vocabulary is assigned a unique number. For example, the word “the” might be assigned the number 283, “quick” the number 78, “brown” the number 151 and so on. This numbering is arbitrary but must remain consistent across all inputs and examples within the model. The vocabulary size is defined, such as 1,000 possible words, and this set of words encompasses all the terms the language model needs to recognize.

One-hot encoding is used to represent words numerically. In this encoding, a word is represented by a vector that is mostly zeros except for a single one at the position corresponding to the word’s assigned number. For instance, if the vocabulary size is 1,000 words, and the word “the” is number 253, then the one-hot encoding for “the” would be a vector with a one at position 23 and zeros everywhere else.

The input to the model \(x\), could be a vector contains the contatenation of some past history of words. If the model is designed to take three words as input for each prediction, and the vocabulary size is 1,000 words, then \(x\) would be a 3,000-dimensional vector. This vector would contain mostly zeros, with ones at the positions corresponding to the input words’ numbers. For example, if the input words are numbered 283, 78, and 151, then \(X_{283}\), \(X_{1078}\), and \(X_{2151}\) would be set to one, with all other positions in the vector set to zero.

The output of the model, denoted as \(Y\), is not one-hot encoded but is simply the index number of the target word. If the target word is “fox” and it corresponds to index 723, then \(Y\) would be equal to 723.

The order of words is crucial in language modeling. The input vector’s structure encodes the order of the words, with different segments of the vector representing the positions of the words in the input sequence. The concatenation of these segments ensures that the order is preserved, which is essential for the model to understand the meaning of sentences correctly.

In summary, language modeling involves assigning numbers to words, representing these words with one-hot encodings, and processing the input through a model that predicts the next word based on the numerical representation of the input sequence. The process of tokenization and the importance of maintaining the order of words are also highlighted as key components of language modeling.

In the context of machine learning, a hypothesis function, also referred to as a model, is a function that maps inputs to predictions. These predictions are made based on the input data, and if the hypothesis function is well-chosen, these predictions should be close to the actual targets. The formal representation of a hypothesis function is a function \[h : \mathbb{R}^n \rightarrow \mathbb{R}^k\] that maps inputs from \(\mathbb{R}^n\) to \(\mathbb{R}^k\), where \(n\) is the dimensionality of the input space and \(k\) is the number of classes in a classification problem.

For multiclass classification, the output of the hypothesis function is a vector in \(\mathbb{R}^k\), where each element of the vector represents a measure of belief for the corresponding class. This measure of belief is not necessarily a probability but indicates the relative confidence that the input belongs to each class. The \(j\)-th element of the vector \(h(x)\), denoted as \(h(x)_j\), corresponds to the belief that the input \(x\) is of class \(j\). For example, if $ \[h(x) = \begin{bmatrix} -5.2 \\ 1.3 \\ 0.2 \end{bmatrix}\] this would suggest a low belief in class 1 and higher beliefs in classes 2 and 3. The class with the highest value in the vector is typically taken as the predicted class.

In practice, hypothesis functions are often parameterized by a set of parameters \(\theta\), which is denoted \(h_\theta\). These parameters define which specific hypothesis function from the hypothesis class is being used. The hypothesis class itself is a set of potential models from which the learning algorithm selects the most appropriate one based on the data. The choice of parameters \(\theta\) determines the behavior of the hypothesis function and its predictions.

A common example of a hypothesis class is the linear hypothesis class. In this class, the hypothesis function \(h_\theta(x)\) is defined as the matrix product of \(\theta\) and \(x\), \[ h_\theta(x) = \theta^T x \] where \(\theta\) is a matrix of parameters and \(x\) is an input vector. The dimensions of \(\theta\) are determined by the dimensions of the input and output spaces. Specifically, for an input space of \(n\) dimensions and an output space of \(k\) dimensions, \(\theta\) must be a matrix of size \(n \times k\) to ensure the output is \(k\)-dimensional. Here, \(\theta\) is a matrix and the transpose operation \(\theta^T\) swaps its rows and columns. This function takes \(n\)-dimensional input vectors and produces \(k\)-dimensional output vectors, satisfying the requirements of the hypothesis class.

The parameters of a hypothesis function, often referred to as weights in machine learning terminology, define the specific instance of the function within the hypothesis class. These parameters are the elements of the matrix \(\theta\) in the case of a linear hypothesis class. The number of parameters, or weights, in a model can vary greatly depending on the complexity of the model. For instance, a neural network may have a large number of parameters, sometimes in the billions, which define the specific function it represents within its hypothesis class.

To fully grasp the operation of a linear hypothesis function, one must be familiar with matrix-vector products. The product of a matrix \(\theta\) and a vector \(x\) results in a new vector, where each element is a linear combination of the elements of \(x\) weighted by the corresponding elements of \(\theta\).

Linear models are surprisingly effective in many machine learning tasks. Despite their simplicity, they can achieve high accuracy in problems such as digit classification. The success of linear models raises questions about why they work well and under what circumstances more complex models might be necessary. These considerations will be explored further in subsequent discussions, along with practical examples and coding demonstrations.

One can gain some insight into the performance of such a linear hypothesis function by looking more closely at the computations being performed. When the transposed matrix \(\theta^T\) is multiplied by the input vector \(x\), the result is a \(k \times 1\) dimensional vector. This operation can be expressed as follows:

\[ \theta^T x = \begin{bmatrix} \theta_1^T \\ \theta_2^T \\ \vdots \\ \theta_k^T \end{bmatrix} x = \begin{bmatrix} \theta_1^T x \\ \theta_2^T x \\ \vdots \\ \theta_k^T x \end{bmatrix} \]

The \(i\)-th element of the resulting vector is the inner product of \(\theta_i^T\) and \(x\), which can be written as:

\[ (\theta^T x)_i = \theta_i^T x = \sum_{j=1}^{n} \theta_{ij} x_j \]

This represents the sum of the products of the corresponding elements of the \(i\)-th row of \(\theta^T\) and the input vector \(x\). Each element of the resulting vector is a scalar, as it is the product of a \(1 \times n\) matrix and an \(n \times 1\) matrix.

The idea of template matches can help explain the effectiveness of such a model. Each \(\theta_i\) can be visualized as an image, where positive values indicate pixels that increase the likelihood of class \(i\), and negative values indicate pixels that decrease it. When visualized, \(\theta_i\) resembles a template that matches the features of the corresponding digit. For example, \(\theta_1\) might have positive values where a digit ‘1’ typically has pixels and negative values around it, forming a template that matches the general shape of a ‘1’.

This template matching is a fundamental aspect of linear models in vision systems, where the weights in the model serve as generic templates for the classifier. The effectiveness of this approach is demonstrated by the fact that a linear classifier can achieve approximately 93% accuracy on a digit classification task, significantly better than random guessing, which would yield only 10% accuracy.

In the context of machine learning, particularly for vision systems, inputs and targets are often represented in a batch or minibatch format. Inputs, denoted as \(x^{(i)}\), are vectors in \(\mathbb{R}^n\), where \(n\) is the dimension of the input. Targets, denoted as \(y^{(i)}\), are elements in the set \(\{1, 2, \ldots, k\}\), where \(k\) is the number of classes. These inputs and targets are typically organized into matrices for batch processing.

Inputs are defined as an \(m \times n\) matrix \(X\), where \(m\) is the number of examples in the training set. Each row of \(X\) corresponds to an input vector transposed, such that the first row is \(x_1^T\), the second row is \(x_2^T\), and so on, up to \(x_m^T\). The matrix \(X\) is expressed as:

\[ X = \begin{bmatrix} {x^{(1)}}^T \\ {x_{(2)}}^T \\ \vdots \\ {x_{(m)}}^T \end{bmatrix} \]

Targets are similarly organized into a vector \(Y\), which is an \(m\)-dimensional vector containing the target values for each example in the training set. The vector \(Y\) is expressed as:

\[ Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \]

The hypothesis function \(h_\theta\) can be also applied to an entire set of examples. When applied to a batch, the notation \(h_\theta(X)\) represents the application of \(h_\theta\) to every example within the batch. The result is a matrix where each row corresponds to the hypothesis function applied to the respective input vector. The expression for the hypothesis function applied to a batch is:

\[ h_{\theta}(X) = \begin{bmatrix} h_{\theta}(x^{(1)})^T \\ h_{\theta}(x^{(2)})^T \\ \vdots \\ h_{\theta}(x^{(m)})^T \end{bmatrix} \]

For a linear hypothesis function \(h_\theta(x) = \theta^T x\), this takes a very simple form

\[ h_{\theta}(X) = \begin{bmatrix} (\theta^T x^{(1)} )^T \\ (\theta^T x^{(2)} )^T \\ \vdots \\ (\theta^T x^{(2)} )^T \end{bmatrix} = \begin{bmatrix} {x^{(1)}}^T \theta \\ {x^{(2)}}^T \theta \\ \vdots \\ {x^{(m)}}^T \theta \end{bmatrix} = X \theta \]

In other words, the hypothesis class applied to every example in the dataset has the extremely simple form of a single matrix operation \(X \theta\)

We can implement all these operations very easily using libraries like PyTorch. Below is code that loads the MNIST dataset and computes a linear hypothesis function applied to all the data

```
import torch
from torchvision.datasets import MNIST
= MNIST('.', train=True, download=True)
dataset = dataset.data.reshape(60000,784)/255.
X = dataset.targets
Y
= torch.randn(784,10)
theta = X@theta H
```

The second ingredient of a machine learning algorithm is the loss function. This function quantifies the difference between the predictions made by the classifier and the actual target labels. It formalizes the concept of a ‘good’ prediction by assigning a numerical value to the accuracy of the predictions. The loss function is a critical component of the training process, as it guides the optimization of the model parameters to improve the classifier’s performance.

A loss function, denoted as \[ \ell : \mathbb{R}^k \times \{1,\ldots,k\} \] is a mapping from the output of hypothesis functions, which are vectors in \(\mathbb{R}^k\) for multiclass classification and true classes \(\{1, \ldots, k\}\) to positive real numbers. This mapping is applicable not only to multiclass classification but also has analogs for binary classification, regression, and other machine learning tasks.

One of the most straightforward loss functions is the zero-one loss, also known as the error. This loss function is defined such that it equals zero if the prediction made by the classifier is correct and one otherwise. Formally, the zero-one loss function can be expressed as follows:

\[ \ell(h_\theta(x), y) = \begin{cases} 0 & \text{if } h_\theta(x)_y > h_\theta(x)_{y'} \text{ for all } y' \\ 1 & \text{otherwise} \end{cases} \]

This means that the loss is zero if the element with the highest value in the hypothesis output corresponds to the true class, indicating a correct prediction. Conversely, if any other element is higher, indicating an incorrect prediction, the loss is one.

Despite its intuitive appeal, the zero-one loss function is not ideal for two main reasons. Firstly, it is not differentiable, meaning it does not provide a smooth gradient that can guide the improvement of the classifier. This lack of differentiability means that the loss function does not offer a nuanced way to adjust the classifier’s parameters based on the degree of error in the predictions.

Secondly, the zero-one loss function does not handle the notion of stochastic or uncertain outputs well. In tasks like language modeling, where there is no single correct answer, the zero-one loss function fails to capture the probabilistic nature of the predictions. It assigns a hard loss value without considering the closeness of the prediction to the true class or the possibility of multiple plausible predictions.

The most commonly used loss function in modern machine learning is the cross-entropy loss. This loss function addresses the issue of transforming hypothesis outputs, which are often fuzzy and amorphous in terms of “belief,” into concrete probabilities. To achieve this, we introduce the softmax operator

To define cross-entropy loss, we need a mechanism to convert real-valued predictions to probabilities. This is achieved using the softmax function. Given a hypothesis function \(h\) that maps \(n\)-dimensional real-valued inputs to \(k\)-dimensional real-valued outputs (where \(k\) is the number of classes), the softmax function is defined as follows:

\[ \text{softmax}(h(x))_i = \frac{\exp(h(x)_i)}{\sum_{j=1}^{k} \exp(h(x)_j)} \]

This function ensures that the output is a probability distribution: each element is non-negative and the sum of all elements is 1.

The goal of cross-entropy loss is to maximize the probability of the correct class. However, for practical and technical reasons, loss functions are typically minimized rather than maximized. Therefore, the negative log probability is used. The cross-entropy loss for a single prediction and the true class is defined as:

\[ \ell_{ce}(h(x), y) = -\log(\text{softmax}(h(x))_y) \]

This loss function takes the negative natural logarithm of the softmax probability corresponding to the true class \(y\). The use of the logarithm helps to manage the scale of probability values, which can become very small or very large due to the exponential nature of the softmax function. While the initial intention might be to maximize the probability of the correct class, the cross-entropy loss is designed to be minimized. This is a common practice in optimization, where minimizing a loss function is equivalent to maximizing some form of likelihood or probability.

In the context of machine learning, the term “log” typically refers to the natural logarithm, sometimes denoted as \(\ln\). However, for simplicity, the notation \(\log\) is used with the understanding that it implies the natural logarithm unless otherwise specified.

The cross-entropy loss function can be simplified by examining the softmax function. The softmax function is defined as the exponential of a value over the sum of exponentials of a set of elements. When the logarithm of the softmax function is taken, due to the properties of logarithms, the expression simplifies to the difference between the log of the numerator and the log of the denominator. Since the numerator is an exponential function, and the logarithm is being applied to it, these operations cancel each other out, leaving the \(y\)th element of the hypothesis function \(h_{\theta}(x)\).

The simplified expression for the first element of the cross-entropy loss, including the negation, is given by: \[ \ell_{ce}(h(x), y) = - h_{\theta}(x)_y + \log \sum_{j=1}^{k} \exp\left( h_{\theta}(x)_j \right) \] where \(k\) is the number of classes. The second term in the simplified cross-entropy loss expression involves the logarithm of a sum of exponentials, which is a non-simplifiable function known as the log-sum-exp function.

To facilitate the computation, especially when taking derivatives, the cross-entropy loss can be expressed using vector notation. We specifically introduce the unit basis vector \(e_y\), which is a one-hot vector with all elements being zero except for a one in the \(y\)th position. The cross-entropy loss can then be written as the inner product of \(E_y\) and the hypothesis function \(h_{\beta}(x)\), negated: \[ \ell_{ce}(h(x), y) = -e_y^T h_{\theta}(x) + \log \sum_{j=1}^{K} \exp\left( h_{\theta}(x)_j \right) \]

The loss function can also be expressed in batch form, which is useful for processing multiple examples simultaneously. The batch loss is defined as the average of the individual losses over all examples in the batch. Let \(X \in \mathbb{R}^{M \times N}\) be the matrix of input features for all examples in the batch, and \(Y\) be the corresponding vector of labels. The batch loss is given by: \[ \ell_{ce}(h_{\beta}(X), Y) = \frac{1}{m} \sum_{i=1}^{m} \ell_{ce}(h_{\theta}(x^{(i)}), y^{(i)}) \] where \(\ell_{ce}\) is the cross-entropy loss for a single example, \(,\) is the number of examples in the batch, and \(x^{(i)}\) and \(y^{(i)}\) are the \(i\)th example and target, respectively.

Implementing both the zero-one loss and the cross entropy loss in Python is straightforward.

The zero-one loss is a simple loss function that counts the number of
misclassifications. It is defined as zero if the predicted class (the
class with largest value of the hypothesis function) for an example
matches the actual class, and one otherwise. To compute the 0-1 loss in
Python, the `argmax`

function is used to identify the column
that achieves the maximum value for each sample. The `argmax`

function takes an argument specifying the dimension over which to
perform the operation. Using `-1`

as the argument indicates
that the operation should be performed over the last dimension, yielding
a list of indices corresponding to the maximum values in each row of the
tensor. The matrix \(H\) represents the
hypothesis function applied to the input matrix \(X\), and vector \(Y\) contains the actual class labels for
each example.

```
def loss_01(H, Y):
return (H.argmax(-1) != Y).float().mean()
```

The cross-entropy loss function is more complex than the 0-1 loss but can also be implemented concisely in Python. To compute the first term of the cross entropy loss, we simply index intro the \(y\)th element for each row of \(H\). The second term, the log-sum-exp, is computed by exponentiating the hypothesis matrix \(H\), summing over the last dimension (each row), and then taking the logarithm of the result. This term accounts for the normalization of the predicted probabilities and is summed over all samples to complete the cross-entropy loss calculation.

```
def loss_cross_entropy(H, Y):
""" Cross entropy loss """
= H.exp().sum(-1).log()
lse_H return -H[torch.arange(len(Y)),Y].mean() + lse_H.mean()
```

An example is provided where the cross-entropy loss is calculated between a prediction \(H\) and the true labels \(Y\), resulting in a high loss value of 15. This high value is indicative of the fact that when predictions are incorrect, they can be significantly off, which is captured by the cross-entropy loss. A comparison is made to a scenario where the predictions are all zeros, which would result in a lower cross-entropy loss, highlighting the sensitivity of this loss function to the probabilities associated with the true class labels.

The final ingredient of machine learning is the optimization method. Specifically, the goal of the machine learning optimization problem is to find the set of paramaters that minimizes the average loss of the corresponding hypothesis function and target output, over the entire training set. This is written as finding the set of parameters \(\theta\) that minimizes the average loss over the entire training set:

\[ \DeclareMathOperator{\minimize}{minimize} \minimize_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell(h_{\theta}(x^{(i)}), y^{(i)}) \]

This optimization problem is at the core of all machine learning algorithms, regardless of the specific type, such as neural networks, linear regression, or boosted decision trees. The goal is to search among the class of allowable functions determined by the parameters \(\theta\) to find the one that best fits the training data according to the chosen loss function.

While the lecture will initially focus on the manual computation of derivatives, modern libraries like PyTorch offer automatic differentiation, which eliminates the need for manual derivative calculations. However, understanding the underlying process is valuable. The course will later cover automatic differentiation at a high level, and after some initial manual calculations, we will rely on PyTorch to handle derivatives.

For linear classification problems utilizing cross-entropy loss, the optimization problem is formalized as minimizing the average cross-entropy loss over all training examples. This is given by

\[ \minimize_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell_{ce}(x^{(i)}), y^{(i)}) \] where \(\theta\) represents the parameters of the linear classifier.

The gradient descent algorithm is an iterative method to find the parameters that minimize the loss function. The process involves taking small steps in the direction opposite to the gradient (a multivariate analog of the derivate derivative) of the loss function at the current point.

An analogous update rule for the parameters in the one-dimensional case is given by: \[ x_{n+1} = x_n - \alpha \cdot f'(x_n) \] where \(x_n\) is some current parameter value, \(\alpha\) is the step size (also known as the learning rate), and \(f'(x_n)\) is the derivative of the loss function with respect to the parameter at \(x_n\).

The derivative provides the direction of the steepest ascent, and by moving in the opposite direction (negative derivative), the algorithm seeks to reduce the loss function value, moving towards a local minimum.

The gradient is a generalization of the derivative for functions with multivariate inputs, such as vectors or matrices. It is a matrix of partial derivatives and is denoted as \(\nabla f(\theta)\), where \(\theta\) is the point at which the gradient is evaluated. The gradient is only defined for functions with scalar-valued outputs.

For a function \(f: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}\), the gradient \(\nabla_\theta f(\theta)\) is an \(n \times k\) matrix given by:

\[ \nabla_\theta f(\theta) = \begin{bmatrix} \frac{\partial f(\theta)}{\partial \theta_{11}} & \cdots & \frac{\partial f(\theta)}{\partial \theta_{1k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\theta)}{\partial \theta_{n1}} & \cdots & \frac{\partial f(\theta)}{\partial \theta_{nk}} \end{bmatrix} \]

Each element of this matrix is the partial derivative of \(f\) with respect to the corresponding element of \(\theta\). A partial derivative is computed by treating all other elements of \(\theta\) as constants.

The gradient has a crucial property that it always points in the direction of the steepest ascent in the function’s value. Conversely, the negative gradient points in the direction of the steepest descent, which is useful for optimization. This property holds true for functions with vector or matrix inputs, just as the derivative points in the direction of steepest ascent for functions with a single input. This concept is fundamental as it implies that by evaluating the gradient, one can determine the direction that is most uphill or, conversely, most downhill by considering the negative gradient. This property is crucial in machine learning because it allows for scanning every possible direction around a current point to find the single direction that points most uphill or downhill.

Gradient descent is highlighted as one of the single most important algorithms in computer science, particularly in the field of artificial intelligence. It is the underlying algorithm for training various AI models and is used in almost every advance in AI. Gradient descent is a procedure for iteratively minimizing a function, and it consists of the following steps:

**Initialization**: Begin with an initial point \(\theta\).**Iteration**: For a predetermined number of iterations \(T\), repeat the following update: \[ \theta := \theta - \alpha \nabla_\theta f(\theta) \] where \(\alpha\) is a positive real value known as the step size, and \(\nabla_\theta f(\theta)\) is the gradient of the function \(f\) with respect to \(\theta\) evaluated at the current point \(\theta\).

The function \(f\) is the one to be optimized, and the choice of step size \(\alpha\) and the number of iterations \(T\) can significantly affect the optimization process. Libraries such as PyTorch can compute gradients automatically using a technique called automatic differentiation, simplifying the optimization process even further.

While the basic premise of gradient descent is straightforward, practical implementation involves several considerations. The choice of step size is critical, and the number of iterations must be chosen carefully to ensure convergence. In the context of neural networks, initialization plays a significant role, although it is less critical for convex problems.

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm, particularly useful when the objective function is an average of many individual functions. This is often the case in machine learning, where the objective is to minimize the average loss across a dataset.

For example, consider the applciation of SGD to multi-class linear classification. The objective function in such cases is typically the sum or average of loss functions over the training examples:

\[ \min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \ell(h_{\theta}(x^{(i)}), y^{(i)}) \]

where \(\ell\) is the loss function, \(h_{\theta}\) is the hypothesis function parameterized by \(\theta\), \(x^{(i)}\) is the \(i\)-th training example, and \(y^{(i)}\) is the corresponding target value.

Instead of computing the gradient over the entire dataset, which can be computationally expensive, SGD approximates the gradient using a random subset of the data, known as a batch or minibatch. The size of the batch, denoted by \(|B|\), is typically much smaller than the size of the full dataset, allowing for more frequent updates of the parameters within the same computational budget. The approximation of the objective function using a batch is given by: \[ \frac{1}{|B|} \sum_{i \in B} \ell(h_{\theta}(x^{(i)}), y^{(i)}) \] where \(B\) is a randomly selected subset of the indices from \(1\) to \(m\). This subset is used to compute the gradient and update the parameters.

The SGD algorithm proceeds by initializing the parameters and then iteratively updating them by subtracting a scaled gradient computed on a random batch. The scaling factor is often referred to as the learning rate. The algorithm can be summarized as follows:

- Initialize \(\theta\).
- Repeat for a number of iterations (or epochs):
- Select a random batch $B {1,,m}.
- Update \(\theta\) by subtracting a fraction of the gradient. \[ \theta := \theta - \frac{\alpha}{|B|} \sum_{i\in B} \nabla_\theta \ell(h_{\theta}(x^{(i)}), y^{(i)})\]

We emphasize that while the intermediate steps of deriving gradients may appear complex, the process is manageable and will eventually be simplified by using automatic differentiation tools. However, understanding the mechanics of SGD is important for grasping the underlying principles of optimization in machine learning.

In practice, rather than selecting a completely random subset each time, the dataset is often divided into batches of a fixed size, and the algorithm iterates over these batches. This approach is known as mini-batch gradient descent. The size of the mini-batch is a trade-off between computational efficiency and the frequency of updates. While a batch size of one (pure SGD) maximizes the number of updates, it is not computationally efficient due to the reliance on matrix-vector products rather than matrix-matrix products. Common batch sizes are 32 or 64, which balance the speed of updates with computational efficiency.

Cyclic SGD is a variant where the entire dataset is iterated over in chunks or batches, possibly in a randomized order. This method is harder to analyze mathematically compared to the true form of SGD, where a new random subset is selected at each iteration. However, cyclic SGD is a reasonable approximation used in practice.

The choice of batch size in SGD is influenced by the trade-off between the accuracy of the gradient direction and the speed of computation. Smaller batch sizes allow for more frequent updates but may provide a rougher approximation of the true gradient. Conversely, larger batch sizes use computational resources more efficiently but update the parameters less frequently. The optimal batch size depends on the specific problem and computational constraints.

To perform SGD updates, we need to compute the gradient of the loss function with respect to the parameters. In the context of multi-class linear classification, the loss function is the cross-entropy loss of the hypothesis class, which is a linear function of the parameters \(\beta\) and the input features \(x\). The cross-entropy loss can be expressed as:

\[ \ell(\theta^T x, y) = -e_y^T (\theta^T x) + \log \left( \sum_{j=1}^{k} e^{\theta_j^T x} \right) \]

where \(e_y\) is the basis vector corresponding to the true class, \(\theta\) is the parameter matrix, and \(k\) is the number of classes. The \(j\)-th column of \(\theta\), denoted as \(\theta_j\), represents the parameters corresponding to the \(j\)-th class.

The gradient of the loss function with respect to \(\theta\) is a matrix of partial derivatives. To compute this gradient, we can consider each element of the matrix, which involves taking the partial derivative with respect to each element \(\theta_{rs}\) of the parameter matrix \(\theta\). The gradient with respect to \(\theta_{rs}\) of the loss function is given by: \[ \frac{\partial}{\partial \theta_{rs}} -e_y^T (\theta^T x) = -x_r \cdot \mathbb{1}_{\{s=y\}} \] where \(\mathbb{1}(y = s)\) is the indicator function that is 1 if the true class label \(y\) \(s\) and 0 otherwise, and \(x_r\) is the \(r\)-th feature of the \(x\).

Next we consider the partial derivative of the log-sum-exp term. The derivative of the log of a function is the derivative of the function divided by the function itself. Applying this to the softmax function, the derivative with respect to \(\theta_{rs}\) is computed as follows:

\[ \begin{align} \frac{\partial}{\partial \theta_{rs}} \log \left( \sum_{j=1}^{K} \exp \left( \sum_{i=1}^{N} \theta_{ij} X_i \right) \right) & = \frac{\frac{\partial}{\partial \theta_{rs}} \left( \sum_{j=1}^{K} \exp \left( \sum_{i=1}^{N} \theta_{ij} x_i \right) \right)}{ \sum_{j=1}^{K} \exp \left( \sum_{i=1}^{N} \theta_{ij} X_i \right) } \\ & = \frac{\exp \left( \sum_{i=1}^{N} \theta_{is} x_i \right) \cdot x_r}{ \sum_{j=1}^{K} \exp \left( \sum_{i=1}^{N} \theta_{ij} X_i \right) } \end{align} \]

The gradient of a function with respect to its parameters can be
expressed in a native matrix form. This form simplifies the
representation and calculation of the gradient. Looking at the
expression for the \(rs\) element of
the gradient, we can respresent *all* elements of the gradient
simultaneously as \[
\nabla_\theta \ell_{ce}(\theta^T x, y) = -x (\mathbf{e}_y -
\text{softmax}(\theta^T x))^T.
\]

Thus, the gradient of the loss for a given example is the product of the example vector \(\mathbf{x}\) and the difference between the probabilistic prediction and the one-hot vector of the targets. This form of the gradient is intuitive as it adjusts the parameters corresponding to the weights of the matrix to minimize the difference between the predicted probabilities and the actual target distribution.

Although we can always derive gradients element-by-element in this manner, there is a way that is easier in practice, and works well for most practical cases. The core idea is to treat all the elements in an expression as scalars, compute the derivates using normal scalar calculus, and then determine the gradient form by matching sizes. This may not always work, but it often in fact gives the correct gradient. To see how this works, we can first define the cross entropy loss of a linear classifier in a “vector” form \[ \ell_{ce}(\theta^T x,y) = -e_y^T \theta^T x + \log(\mathbf{1}^T \exp(\theta^T x)) \] where \(\mathbf{1}\) represents the vector of all ones. Now taking the derivative with respect to \(\theta\), assuming that all values are scalars, and applying the chain rule we have \[ \frac{\partial}{\partial \theta} \ell_{ce}(\theta^T x,y) = -e_y x + \frac{\exp(\theta^T x) x}{1^T \exp(\theta^T x)} = x (-e_y^T + \text{softmax}(\theta^T x)). \] Re-arranging terms so that the sizes match, we have as above that \[ \nabla_\theta \ell_{ce}(\theta^T x, y) = -x (\mathbf{e}_y - \text{softmax}(\theta^T x))^T. \]

This gradient has be easily extended to a batch of examples. The loss function is overloaded to handle batches, and the derivative with respect to \(\theta\) for the entire batch is computed. The batch version of the hypothesis function is represented by the matrix product \(X\theta\), where \(X\) is the matrix of input features for the entire batch. The derivative for the batch is given by: \[ \frac{\partial}{\partial \theta} \ell_{ce}(X\theta, Y) = X^T \left(-I_y + \text{softmax}(X\theta)\right) \] where \(I_Y\) is the matrix of one-hot encoded targets in each row for the entire batch, and \(\text{softmax}(X\theta)\) is the softmax function applied to each row of the matrix product \(X\theta\). The resulting gradient is an \(n \times k\) dimensional matrix, where \(n\) is the number of features and \(k\) is the number of classes.

To verify the correctness of the derived gradient expression, a numerical gradient approximation method is introduced. This method involves perturbing each element of the parameter matrix \(\theta\) by a small value \(\epsilon\) and computing the resulting change in the cross-entropy loss. This can be computed using the following code:

```
def compute_cross_entropy_gradient_approx(theta, X, Y, eps=1e-3):
""" Example function to compute an approximate numerical gradient of cross entropy"""
= theta.shape
n,k = loss_cross_entropy(X@theta, Y)
loss = torch.zeros(n,k)
grad for i in range(n):
for j in range(k):
+= eps
theta[i,j] = (loss_cross_entropy(X@theta, Y) - loss) / eps
grad[i,j] -= eps
theta[i,j] return grad
```

The numerical gradient approximation serves as a check against the analytically derived gradient. If the numerical and analytical gradients match for a randomly initialized \(\theta\), it is likely that the analytical gradient is correct. However, it is cautioned that this method can be computationally expensive, as it requires iterating over every element of \(\theta\) and evaluating the loss function multiple times. Thus, it’s only used as a measure to check the analytical gradient, and then you would want to use the analytical computation of the gradient after that.

Let’s put all these elements together to implement a complete implementation of linear class classification in Python. While the resulting code is quite small, it’s important to emphasize the complexity of what is happening. Specifically, the implementation defines all the three ingredients of a machine learning algorithm: 1. We use a linear hypothesis function \(h_\theta(x) = \theta^T x\) 2. We use the cross entropy loss \(\ell_{ce}(h_\theta(x), y)\) 3. We solve the optimization problem of finding the parameters that minimize the loss over the training set using stochastic gradient descent updates by via the updates \[ \theta := \theta - \frac{\alpha}{|B|} \cdot X^T(\text{softmax}(X\theta) - I_Y) \]

```
def train_linear_classifier(X_train, Y_train, epochs=20, batch_size=100, alpha=0.5):
= X_train.shape
m,n = Y_train.max()+1
k = torch.zeros(n,k)
theta = []
losses
for t in range(epochs):
for X,Y in zip(X_train.split(batch_size), Y_train.split(batch_size)):
@theta, Y))
losses.append(loss_cross_entropy(X-= alpha / batch_size * X.T@(torch.softmax(X@theta,-1) - torch.eye(k)[Y])
theta return theta, losses
```

Running this example on the MNIST dataset results in a linear
classifier that achieves about 7.5% error, on a held out test set. We
can further try to visualize the results columns of `theta`

and the results look approximately like “templates” for the digits we
are trying to classify. Generally speaking, such visualization won’t be
possible for more complex classifiers, but this kind of “template
matching” nonetheless provides a reasonable intuition about what machine
learning methods are doing, even for more complex classifiers.