Notation#
Overview#
In mathematics, notation is a system for representing ideas and concepts symbolically. It consists of a combination of letters, numbers, symbols, and signs. For example, the notation \(\sum_{i=1}^{n}i\) represents the sum of integers from 1 to \(n\), and it contains:
- Symbols: \(\sum\), \(=\) 
- Letters: \(i\), \(n\) 
- Numbers: \(1\) 
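To make the reading of \(\sum_{i=1}^{n}i\) concrete, here is a minimal Python sketch that evaluates it term by term. The function name `sum_of_integers` is a hypothetical name chosen for this illustration, and the comparison with the closed form \(n(n+1)/2\) is included only as a sanity check.

```python
def sum_of_integers(n: int) -> int:
    """Evaluate sum_{i=1}^{n} i by stepping the index i from 1 to n inclusive."""
    total = 0
    for i in range(1, n + 1):  # range excludes its upper bound, hence n + 1
        total += i
    return total


n = 10
print(sum_of_integers(n))   # 55
print(n * (n + 1) // 2)     # 55, the closed form n(n+1)/2
```

The lower limit \(i=1\) and upper limit \(n\) of the notation map directly onto the bounds of the loop.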
Properties#
A notation has the following properties:
- Expressiveness: A notation must contain all the relevant information necessary to convey its meaning. 
- Uniqueness: A notation must specify a single meaning, leaving no ambiguity. 
Scalars, Vectors, and Matrices#
- Scalars are typically represented by lowercase letters \(x, y, z, \alpha, \beta, \gamma\) or uppercase Latin letters \(N, M, T\). The latter are often used to indicate a count (e.g., number of examples, features, timesteps) and are often accompanied by corresponding indices \(n, m, t\) (e.g., current example, feature, timestep). 
- Vectors are bold lowercase letters, e.g., \(\mathbf{x} = [x_1, x_2, x_3, \ldots, x_M]^T\), and are typically assumed to be column vectors. When handwritten, vectors are indicated by an over-arrow: \(\vec{x} = [x_1, x_2, \ldots, x_M]^T\). 
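- Matrices are bold uppercase letters, e.g., \(\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1M} \\ a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix}\). 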
Subscripts are used as indices into structured objects such as vectors or matrices.
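One practical wrinkle is that subscripts in the math are usually 1-based, while most programming languages index from 0. The sketch below, which assumes plain Python lists as the storage and uses the hypothetical helpers `vec_entry` and `mat_entry`, shows one way to keep the two conventions straight.

```python
# A vector x = [x_1, ..., x_M]^T with M = 3, stored as a plain list.
x = [2.0, -1.0, 0.5]
M = len(x)

# A matrix with N = 2 rows and M = 3 columns, stored as a list of rows.
A = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
N = len(A)


def vec_entry(v, m):
    """Return v_m, where m is the 1-based subscript used in the math."""
    return v[m - 1]


def mat_entry(mat, n, m):
    """Return the entry in row n and column m (both 1-based subscripts)."""
    return mat[n - 1][m - 1]


print(vec_entry(x, 1))      # 2.0
print(mat_entry(A, 2, 3))   # 6.0
```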
Sets#
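Sets are represented by calligraphic uppercase letters, e.g., \(\mathcal{X}, \mathcal{Y}, \mathcal{D}\). We often index a set by labels in parenthesized superscripts:
\[\mathcal{S} = \{s^{(1)}, s^{(2)}, \ldots, s^{(N)}\}, \quad N = |\mathcal{S}|\]
Alternatively, we can write:
\[\mathcal{S} = \{s^{(n)}\}_{n=1}^{N}\]
This shorthand is convenient when defining a set of training examples:
\[\mathcal{D} = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^{N}\]
which is equivalent to:
\[\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(N)}, y^{(N)})\}\]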
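As a rough illustration of the indexed-set notation, the sketch below stores a training set as a Python list of (feature vector, label) pairs. The toy values and the helper `example` are hypothetical. A list is used rather than a Python `set` because it preserves the order of the superscript index \(n\).

```python
# Hypothetical toy data: N = 3 examples, each with M = 2 features.
features = [[0.5, 1.0], [1.5, -0.5], [2.0, 0.0]]
labels = [0, 1, 1]

# The training set, built as a list of (feature vector, label) pairs
# ordered by the example index n = 1, ..., N.
D = list(zip(features, labels))
N = len(D)


def example(dataset, n):
    """Return the n-th training example, where n is the 1-based superscript."""
    return dataset[n - 1]


x_2, y_2 = example(D, 2)
print(N)          # 3
print(x_2, y_2)   # [1.5, -0.5] 1
```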
Random Variables#
Random variables are represented by uppercase Latin letters \(X, Y, Z\). When a random variable \(X_i\) and a scalar \(x_i\) are upper/lower-case versions of each other, we typically mean that the scalar is a value taken by the random variable.
Greek letters are often reserved for parameters (\(\theta, \phi\)) or hyperparameters (\(\alpha, \beta, \gamma\)).
For a random variable \(X\), we write \(X \sim \text{Gaussian}(\mu, \sigma^2)\) to indicate that \(X\) follows a 1D Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\). We write \(x \sim \text{Gaussian}(\mu, \sigma^2)\) to indicate that \(x\) is a value sampled from this distribution.
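The distinction between the variance \(\sigma^2\) in the notation and the standard deviation \(\sigma\) matters in code. The sketch below uses Python's standard `random` module, whose `gauss` method takes the standard deviation as its second argument. The wrapper `sample_gaussian` and the chosen values of \(\mu\) and \(\sigma^2\) are assumptions made for this example.

```python
import math
import random


def sample_gaussian(mu: float, sigma_sq: float, rng: random.Random) -> float:
    """Draw one value x ~ Gaussian(mu, sigma_sq), where sigma_sq is the variance."""
    # random.gauss expects the standard deviation, so take the square root.
    return rng.gauss(mu, math.sqrt(sigma_sq))


rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma_sq = 2.0, 9.0
samples = [sample_gaussian(mu, sigma_sq, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)  # close to 2.0 and 9.0 respectively
```

Each element of `samples` plays the role of a lowercase \(x\), a value drawn from the distribution of the random variable \(X\).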
A conditional probability distribution over random variable \(X\) given \(Y\) and \(Z\) is written \(P(X|Y, Z)\), and its probability mass function (pmf) or probability density function (pdf) is \(p(x|y, z)\).
The pmf/pdf can be expressed in several ways depending on the context:
- \(p(x|y, z; \alpha, \beta)\) (parameters are demarcated clearly) 
- \(p(x|y, z, \alpha, \beta)\) (parameters treated as random variables) 
- \(p_{\alpha,\beta}(x|y, z)\) (parameters as subscripts for brevity) 
To denote the pmf/pdf as a function over possible values of \(x\), we replace \(x\) with a dot: \(p_{\alpha,\beta}(\cdot|y, z)\).
Using this notation, we can write:
- \(X \sim p_{\alpha,\beta}(\cdot|y, z)\) (random variable \(X\) follows the distribution) 
- \(x \sim p_{\alpha,\beta}(\cdot|y, z)\) (sample \(x\) from the distribution) 
The expectation of a random variable \(X\) is denoted \(\mathbb{E}[X]\). When the generating distribution is important, we include it in the expectation:
\[\mathbb{E}_{x \sim p_{\alpha,\beta}(\cdot|y, z)}[f(x, y, z)]\]
This represents the expectation of \(f(x, y, z)\) where \(x\) is sampled from \(p_{\alpha,\beta}(\cdot|y, z)\), with \(y\) and \(z\) treated as constants.
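An expectation of this form can be approximated by averaging \(f\) over samples of \(x\) while holding \(y\) and \(z\) fixed. The following is a minimal Monte Carlo sketch, not a definitive recipe. It assumes, purely for illustration, that \(p(\cdot|y, z)\) is a Gaussian with mean \(y\) and variance \(z\), leaves out the parameters \(\alpha, \beta\), and uses the hypothetical integrand \(f(x, y, z) = (x - y)^2\), whose exact expectation is then the variance \(z\).

```python
import math
import random


def f(x, y, z):
    """Hypothetical integrand: squared distance of x from y (z is unused here)."""
    return (x - y) ** 2


def estimate_expectation(func, y, z, num_samples, rng):
    """Monte Carlo estimate of the expectation of func(x, y, z) over x.

    Assumption for illustration: x is Gaussian with mean y and variance z.
    """
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(y, math.sqrt(z))  # only x is sampled; y and z stay fixed
        total += func(x, y, z)
    return total / num_samples


rng = random.Random(0)
estimate = estimate_expectation(f, y=1.0, z=4.0, num_samples=200_000, rng=rng)
print(estimate)  # close to 4.0, the variance z
```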
Functions and Derivatives#
For a function \(f(x)\):
- The derivative with respect to \(x\) is written \(\frac{\partial f(x)}{\partial x}\) or \(\frac{df(x)}{dx}\)[1]; the two coincide because \(f\) has a single argument. 
- The first derivative is also denoted \(f'(x)\), the second derivative \(f''(x)\), and so on. 
- For a multivariate function \(f(\mathbf{x}) = f(x_1, \ldots, x_M)\), the gradient with respect to \(\mathbf{x}\) is written \(\nabla_{\mathbf{x}} f(\mathbf{x})\). The subscript is often omitted when the variable is clear from context: \(\nabla f(\mathbf{x})\). 
- Higher-order derivatives are sometimes written \(f^{(i)}(x)\), where \(i\) indicates the order of the derivative. For example, \(f^{(3)}(x)\) is the third derivative of \(f(x)\). 
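The derivative and gradient notation can be checked numerically. The sketch below approximates \(f'(x)\) and the gradient with central finite differences. The helper names, the step size `h`, and the two test functions are assumptions made for this example, and the results are approximations rather than exact derivatives.

```python
def derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x) for a scalar function f."""
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient(f, x, h=1e-5):
    """Approximate the gradient of f at the point x, a list of M floats."""
    grad = []
    for m in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[m] += h    # perturb only the m-th coordinate
        x_minus[m] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def cube(x):
    return x ** 3                   # exact derivative: 3 x^2


def sum_of_squares(x):
    return sum(v * v for v in x)    # exact gradient: 2 x


print(derivative(cube, 2.0))                        # approximately 12.0
print(gradient(sum_of_squares, [1.0, -2.0, 0.5]))   # approximately [2.0, -4.0, 1.0]
```

The gradient is a list with one partial derivative per input coordinate, so it has the same length \(M\) as the input.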
Common Notations#
- Disclaimer: For this course, we try to follow the notation system below as closely as possible, but the notation in the course material is not guaranteed to match it everywhere. Always read the directions of each problem carefully to understand the notation it uses. 
| Notation | Description | 
|---|---|
| \(N\) | Number of training examples | 
| \(M\) | Number of features | 
| \(K\) | Number of classes | 
| \(n\) or \(i\) | Current training example | 
| \(m\) | Current feature index | 
| \(k\) | Current class | 
| \(\mathbb{Z}\) | Set of integers | 
| \(\mathbb{N}\) | Set of natural numbers | 
| \(\mathbb{R}\) | Set of real numbers | 
| \(\mathbb{R}^M\) | Set of real-valued vectors of length \(M\) | 
| \(\{0,1\}^M\) | Set of binary vectors of length \(M\) | 
| \(\mathbf{x}\) | Feature vector (input); \(\mathbf{x} \in \mathbb{R}^M\) or \(\mathbf{x} \in \{0,1\}^M\) | 
| \(y\) | Label/regressand (output); varies by task (classification, regression, etc.) | 
| \(\mathcal{X}\) | Input space; \(\mathbf{x} \in \mathcal{X}\) | 
| \(\mathcal{Y}\) | Output space; \(y \in \mathcal{Y}\) | 
| \(\mathbf{x}^{(i)}\) | The \(i\)-th feature vector in the training data | 
| \(y^{(i)}\) | The \(i\)-th true output in the training data | 
| \((\mathbf{x}^{(i)}, y^{(i)})\) | The \(i\)-th training example (feature vector, true output) | 
| \(\mathcal{D}\) | Set of training examples; \(\mathcal{D} = \{(\mathbf{x}^{(n)}, y^{(n)})\}_{n=1}^N\) |