Matrix Operations
-
Matrices can be added and scaled just like vectors.
-
For addition, both matrices must have the same dimensions, and you add element by element: $(A + B)_{ij} = A_{ij} + B_{ij}$.
For scalar multiplication, you multiply every element by the scalar: $(cA)_{ij} = c\,A_{ij}$.
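A minimal sketch in jax.numpy (the example matrices here are arbitrary):

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
B = jnp.array([[10.0, 20.0],
               [30.0, 40.0]])

print(A + B)    # element-wise sum: [[11. 22.] [33. 44.]]
print(2.0 * A)  # every element scaled by 2: [[2. 4.] [6. 8.]]
```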
The simplest thing you can do with a matrix is multiply it by a vector. Matrix-vector multiplication $A\mathbf{x}$ combines the columns of $A$ using the entries of $\mathbf{x}$ as weights:
$$A\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n$$
where $\mathbf{a}_1, \dots, \mathbf{a}_n$ are the columns of $A$.
This is the core operation in ML. Every neural network layer computes $A\mathbf{x} + \mathbf{b}$: a matrix times an input vector, plus a bias.
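A small sketch in jax.numpy checking that $A\mathbf{x}$ really is a weighted sum of the columns of $A$ (the matrix, vector, and bias below are arbitrary examples):

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])        # 3x2
x = jnp.array([2.0, -1.0])         # entries of x act as weights on the columns
b = jnp.array([0.5, 0.5, 0.5])     # bias vector, as in a neural network layer

# A @ x combines the columns of A using the entries of x as weights.
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1]
print(jnp.allclose(A @ x, by_columns))   # True

# One layer (before the nonlinearity) is just A @ x + b.
print(A @ x + b)                         # [0.5 2.5 4.5]
```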
-
The general case is matrix multiplication. Given $A$ ($m \times n$) and $B$ ($n \times p$), the product $C = AB$ is an $m \times p$ matrix where each element is a dot product:
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
-
Each entry in the result is the dot product of a row from $A$ with a column from $B$. The inner dimensions must match ($n$), and the result takes the outer dimensions ($m \times p$).
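To make the formula concrete, here is a sketch that computes one entry by hand and compares it with the full product (the matrices are arbitrary examples):

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])    # 2x3
B = jnp.array([[7.0, 8.0],
               [9.0, 10.0],
               [11.0, 12.0]])       # 3x2

C = A @ B                           # inner dims (3) match; result is 2x2
print(C.shape)                      # (2, 2)

# C[0, 1] is the dot product of row 0 of A with column 1 of B.
print(jnp.dot(A[0, :], B[:, 1]))    # 1*8 + 2*10 + 3*12 = 64.0
print(C[0, 1])                      # 64.0
```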
-
Another way to see it: each column of the result is a weighted sum of the columns of $A$, where the weights come from the corresponding column of $B$.
-
If $B$ has column $[2, 3]^T$, the result column is $2 \times (\text{column 1 of } A) + 3 \times (\text{column 2 of } A)$.
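The same check in code, a sketch reusing the $[2, 3]^T$ weights on an arbitrary $A$:

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 4.0],
               [2.0, 5.0],
               [3.0, 6.0]])
B = jnp.array([[2.0, 0.0],
               [3.0, 1.0]])     # the first column of B is [2, 3]^T

C = A @ B
# Column 0 of C is 2 * (first column of A) + 3 * (second column of A).
weighted = 2.0 * A[:, 0] + 3.0 * A[:, 1]
print(C[:, 0])      # [14. 19. 24.]
print(weighted)     # [14. 19. 24.]
```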
-
A useful special case: multiplying a matrix by its transpose always gives a square matrix. $AA^T$ is $m \times m$ and $A^TA$ is $n \times n$:
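A quick shape check, sketched with an arbitrary $3 \times 2$ matrix:

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])     # m x n = 3 x 2

print((A @ A.T).shape)   # (3, 3) -> m x m
print((A.T @ A).shape)   # (2, 2) -> n x n
```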
-
Matrix multiplication has important rules:
-
Not commutative: $AB \neq BA$ in general. The order matters.
-
Associative: $(AB)C = A(BC)$. You can group multiplications however you like.
-
Distributive: $A(B + C) = AB + AC$.
-
Identity: $AI = IA = A$.
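A short sketch that spot-checks these rules numerically (the concrete matrices are arbitrary; any values would do):

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0], [3.0, 4.0]])
B = jnp.array([[0.0, 1.0], [1.0, 0.0]])
C = jnp.array([[2.0, 0.0], [0.0, 3.0]])
I = jnp.eye(2)

print(jnp.allclose(A @ B, B @ A))                       # False: not commutative
print(jnp.allclose((A @ B) @ C, A @ (B @ C)))           # True: associative
print(jnp.allclose(A @ (B + C), A @ B + A @ C))         # True: distributive
print(jnp.allclose(A @ I, A), jnp.allclose(I @ A, A))   # True True: identity
```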
-
-
The Hadamard product (element-wise product) multiplies two matrices of the same size entry by entry, written $A \odot B$:
$$(A \odot B)_{ij} = A_{ij}\,B_{ij}$$
Unlike standard matrix multiplication, the Hadamard product is commutative ($A \odot B = B \odot A$) and requires both matrices to have the same dimensions. It is used heavily in ML for gating: multiplying element-wise by a mask of values between 0 and 1 controls how much of each entry "passes through."
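In jax.numpy, the `*` operator on two same-shaped arrays is the Hadamard product; the gate values below are an arbitrary illustration:

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
B = jnp.array([[10.0, 20.0],
               [30.0, 40.0]])

print(A * B)                        # [[ 10.  40.] [ 90. 160.]]
print(jnp.allclose(A * B, B * A))   # True: the Hadamard product is commutative

# Gating: a mask of values in [0, 1] controls how much of each entry passes through.
gate = jnp.array([[1.0, 0.0],
                  [0.5, 1.0]])
print(gate * A)                     # [[1.  0. ] [1.5 4. ]]
```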
-
The outer product of two vectors $\mathbf{u}$ and $\mathbf{v}$ produces a matrix: $\mathbf{u}\mathbf{v}^T$. Each entry is the product of one element from $\mathbf{u}$ and one from $\mathbf{v}$:
$$(\mathbf{u}\mathbf{v}^T)_{ij} = u_i v_j$$
The result always has rank 1, because every row is a scaled version of $\mathbf{v}^T$. Any matrix can be written as a sum of rank-1 outer products, which is exactly what SVD does (covered in decompositions).
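A sketch with jnp.outer, checking the shape and the rank-1 claim (the vectors are arbitrary examples):

```python
import jax.numpy as jnp

u = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([4.0, 5.0])

M = jnp.outer(u, v)                 # 3x2 matrix with M[i, j] = u[i] * v[j]
print(M)                            # [[ 4.  5.] [ 8. 10.] [12. 15.]]
print(jnp.linalg.matrix_rank(M))    # 1: every row is a multiple of v
```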
-
Matrix multiplication is computationally expensive. Multiplying two $n \times n$ matrices takes $O(n^3)$ operations. For a $1000 \times 1000$ matrix, that is a billion multiplications.
-
When matrices are sparse (mostly zeros), naive multiplication wastes time multiplying by zero. The Compressed Sparse Row (CSR) format stores only the nonzero elements along with their positions:
- Values: the nonzero entries in row order
- Column indices: which column each value belongs to
- Row offsets: where each row starts in the values list
-
For example, the matrix
$$\begin{bmatrix} 5 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}$$
is stored as: values = [5, 2, 3, -1], columns = [0, 3, 2, 3], row offsets = [0, 2, 3, 4]. This skips all the zeros and makes sparse operations much faster.
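Here is a minimal sketch of a CSR matrix-vector product written with plain Python loops (JAX has experimental sparse formats in jax.experimental.sparse, but the hand-rolled version makes the three arrays easier to follow):

```python
# CSR data for the 3x4 example matrix above.
values      = [5.0, 2.0, 3.0, -1.0]
columns     = [0, 3, 2, 3]
row_offsets = [0, 2, 3, 4]

def csr_matvec(values, columns, row_offsets, x):
    """Multiply a CSR matrix by a dense vector, touching only the nonzero entries."""
    n_rows = len(row_offsets) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_offsets[i], row_offsets[i + 1]):
            y[i] += values[k] * x[columns[k]]
    return y

print(csr_matvec(values, columns, row_offsets, [1.0, 1.0, 1.0, 1.0]))  # [7.0, 3.0, -1.0]
```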
-
A core use of matrices is solving systems of linear equations. The system $A\mathbf{x} = \mathbf{b}$ asks: "what vector $\mathbf{x}$, when transformed by $A$, produces $\mathbf{b}$?"
-
For example, say you are buying fruit. Apples cost $x_1$ dollars each and bananas cost $x_2$ dollars each. You know that 2 apples and 1 banana cost \$5, and 1 apple and 3 bananas cost \$10. In matrix form:
$$\begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$
Multiplying the matrix by the vector row by row (each row dotted with $[x_1, x_2]^T$) gives two equations:
$$2x_1 + 1x_2 = 5 \qquad \text{(row 1)} \qquad \qquad x_1 + 3x_2 = 10 \qquad \text{(row 2)}$$
-
From row 1, $x_2 = 5 - 2x_1$. Substituting into row 2: $x_1 + 3(5 - 2x_1) = 10$, which gives $x_1 = 1$, then $x_2 = 3$. Apples cost \$1 and bananas cost \$3.
-
Verify by multiplying back; it checks out:
$$2(1) + 1(3) = 5 \qquad \qquad 1(1) + 3(3) = 10$$
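The same system solved numerically, a sketch using jnp.linalg.solve, which should recover the prices found above:

```python
import jax.numpy as jnp

A = jnp.array([[2.0, 1.0],
               [1.0, 3.0]])
b = jnp.array([5.0, 10.0])

x = jnp.linalg.solve(A, b)
print(x)        # [1. 3.] -> apples cost 1 dollar, bananas cost 3
print(A @ x)    # [ 5. 10.] -> reproduces b
```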
If $A$ has an inverse, the solution is simply $\mathbf{x} = A^{-1}\mathbf{b}$. But computing the inverse directly is expensive and numerically unstable. In practice, we use decompositions instead.
-
Not every matrix is square, and not every square matrix is invertible. The pseudo-inverse $A^+$ generalises the inverse to any matrix: it always exists and gives the "best possible" (least-squares) solution to $A\mathbf{x} = \mathbf{b}$. When the columns of $A$ are linearly independent, it can be computed as:
$$A^+ = (A^TA)^{-1}A^T$$
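A sketch comparing this formula with jnp.linalg.pinv on a tall matrix whose columns are independent (the matrix is an arbitrary example; for rank-deficient matrices only pinv applies):

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])     # 3x2, full column rank

pinv_formula = jnp.linalg.inv(A.T @ A) @ A.T
pinv_builtin = jnp.linalg.pinv(A)

print(jnp.allclose(pinv_formula, pinv_builtin, atol=1e-5))   # True
print(pinv_builtin @ A)                                      # close to the 2x2 identity
```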
-
When $A$ is lower triangular, solving $L\mathbf{x} = \mathbf{b}$ is easy by forward substitution: solve for $x_1$ first, then use it to find $x_2$, and so on down.
-
When $A$ is upper triangular, solving $U\mathbf{x} = \mathbf{b}$ works by back substitution: solve for the last variable first, then work upward.
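A small sketch of forward substitution for a lower-triangular system, checked against jnp.linalg.solve (the triangular matrix is an arbitrary example):

```python
import jax.numpy as jnp

L = jnp.array([[2.0, 0.0, 0.0],
               [1.0, 3.0, 0.0],
               [4.0, 1.0, 5.0]])
b = jnp.array([2.0, 7.0, 21.0])

def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L, one variable at a time from the top."""
    x = []
    for i in range(L.shape[0]):
        # Subtract the contribution of the already-known variables,
        # then divide by the diagonal entry.
        known = sum(L[i, j] * x[j] for j in range(i))
        x.append((b[i] - known) / L[i, i])
    return jnp.array(x)

x = forward_substitution(L, b)
print(x)                                         # [1. 2. 3.]
print(jnp.allclose(x, jnp.linalg.solve(L, b)))   # True
```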
-
This is why decomposing a matrix into triangular factors (as we will see in decompositions) is so useful. It turns a hard problem into two easy ones.
Coding Tasks (use Colab or a notebook)
- Multiply two matrices and verify the dimensions. Then swap the order and observe that the result changes (or that it fails if dimensions don't match).
```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0],
               [3.0, 4.0]])
B = jnp.array([[5.0, 6.0],
               [7.0, 8.0]])

# Shapes: (2, 2) @ (2, 2) -> (2, 2); the inner dimensions must match.
print(f"Shape of A @ B: {(A @ B).shape}")
print(f"A @ B:\n{A @ B}")
print(f"B @ A:\n{B @ A}")
print(f"Equal: {jnp.allclose(A @ B, B @ A)}")
```
- Solve a system of linear equations $A\mathbf{x} = \mathbf{b}$ and verify the solution by multiplying back. Try changing $\mathbf{b}$ to see how the solution shifts.
```python
import jax.numpy as jnp

A = jnp.array([[2.0, 1.0],
               [5.0, 3.0]])
b = jnp.array([4.0, 7.0])

# Solve A x = b, then multiply back to check that A @ x reproduces b.
x = jnp.linalg.solve(A, b)
print(f"Solution x: {x}")
print(f"A @ x: {A @ x}")
```