Deep Learning Book Notes, Chapter 2: Linear Algebra for Deep Learning

--

These are my notes for chapter 2 of the Deep Learning book. They can also serve as a quick intro to linear algebra for deep learning.

My notes for chapter 1 can be found below:

For this section I decided to make things a bit more intuitive using code, which should be appealing to the many of us who are coders first and math people second (or eighth). This means that my “notes” are probably on the same order of length as the chapter itself, but hopefully they are extra useful for people out there (and for myself, frankly).

The code I am including will be in the form of images because uploading chunks of notebooks to Medium is quite hard. To see the full notebook itself, go here.

Book recommendations

Recommendations from the Deep Learning book

  • The Matrix Cookbook by Petersen and Pedersen, for people who need a refresher.
    My comment: I haven’t (yet) read it, but the pdf is free (linked above), so probably good to check out.
  • Linear Algebra by Shilov, for a full course.
    My comment: I haven’t (yet?) read it. Seems to be getting good reviews and very cheap (only 10 bucks). Apparently pretty dense.

My recommendations

  • Essence of Linear Algebra by 3Blue1Brown, for building intuition.
    My comment: This is an amazing youtube playlist about linear algebra. I highly recommend you watch it. A much easier option than all of the rest since it is based on videos, but won’t give you as much practice.
  • Linear Algebra and Its Applications by Strang, for a full course.
    My comment: This was/is my main book for linear algebra. Engaging presentation and lots of applications.
    An alternative would be Introduction to Linear Algebra, by the same author. My understanding is that this book is more purely mathematical but doesn’t go as far as “and Its Applications”. Strang’s MIT video lectures are also available here.
  • Linear Algebra Done Right by Axler, for a full course.
    My comment: I haven’t yet read it, but I am planning to start soon. Axler supposedly takes a pretty different approach to teaching linear algebra that is more focused on pure math than on applications (so perhaps less applicable for deep learning) but also gives a different perspective on the field, which is why I’m interested in this book as a second look at linear algebra. It’s supposed to be one of the best (albeit a bit controversial) books on the subject.
  • The end of Calculus, Vol 1 by Apostol and much of Calculus, Vol 2, for a full course including multivariable calculus.
    My comment: Currently reading. This is particularly useful because it puts more emphasis on the multivariable calculus aspect while at the same time teaching linear algebra. The one thing that makes me sad is that the style is not nearly as engaging as in Spivak’s Calculus (considered to be Apostol’s main “competitor”), but sadly Spivak doesn’t cover multivariable calculus nor linear algebra in any non-super-advanced book of his.

Basic objects

  • Scalar: a single number, written in italics. A scalar is a vector with a single entry.
    A scalar in python: x = 0.0.
  • Vector: a 1D array of numbers, written in bold. You can “access” elements by using a subscript, eg x_1 is the first element of x. Can be thought of as a point in nD space where n is the number of entries of the vector. A vector is a matrix with a single column.
    A vector of 10 zeros in python: v = np.zeros(10)
  • Matrix: a 2D array of numbers. Written in bold and uppercase. Accessing elements is the same as with a vector but using two subscripts, the first for the row and the second for the column, eg A_{1,2} is the second element of the first row. A_{1,:} is the entire first row. Matrices can be transposed: A^T is simply A with the columns made into rows and the rows made into columns. A matrix is a tensor that happens to have 2 dimensions.
    A 10x7 matrix of all ones in python: M = np.ones((10, 7))
  • Tensor: an nD array of numbers. Indexing is the same as with a matrix or vector.
    A 2x3x7x4 tensor of all -1 in python: T = -np.ones((2, 3, 7, 4)).

All of these can be added to other objects of the same size (this is done elementwise) and multiplied by scalars (each element is multiplied by the scalar). The book also allows “broadcasting”, in which an object of lower dimension can be added to an object of higher dimension (eg a matrix + a vector); in that case the vector is added to each row of the matrix.
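
To make broadcasting concrete, here is a tiny sketch (the values are arbitrary): adding a vector to a matrix adds it to every row.

    import numpy as np

    M = np.ones((3, 4))   # a 3x4 matrix of ones
    v = np.arange(4)      # the vector [0, 1, 2, 3]

    # Broadcasting: v is added to every row of M.
    print(M + v)
    # [[1. 2. 3. 4.]
    #  [1. 2. 3. 4.]
    #  [1. 2. 3. 4.]]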

Multiplication

Dot product

Unlike the authors, I like to think of the dot product first and of matrix multiplication second, so let’s define the dot product first:

The dot product of two vectors that we can call a and b can be obtained by first multiplying each element of a by the corresponding element of b and then taking the sum of the result. In mathematical notation, it becomes (if n is the number of elements in both a and b):

a^T b = a_1 b_1 + a_2 b_2 + … + a_n b_n

In python, we can either implement it directly by using multiplication and sum, or simply use the .dot function.
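
Something like this (with two arbitrary vectors):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0])

    # "Manual" dot product: elementwise multiplication followed by a sum.
    print(np.sum(a * b))   # 32.0

    # The same thing using .dot.
    print(a.dot(b))        # 32.0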

Matrix multiplication

To multiply matrices A and B into a matrix C, we simply make it so that each entry C_{i,j} is the dot product of row i of A and column j of B, so we have, for all possible i and j:

C_{i,j} = Σ_k A_{i,k} B_{k,j}

This means that the length of every row of A must be the same as the length of every column of B, in other words that the number of columns of A is equal to the number of rows in B. C will have the same number of rows as A and the same number of columns as B. In other words, if A is of size m x n and B is of size n x k the multiplication is valid and C is of size m x k.

Often, A and B will both be “square” (meaning they have the same number of rows and columns), in which case C will have the same size as both of them.

Because matrix multiplication is so similar to the dot product, it is implemented using the .dot function in numpy as well.
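
For example (the shapes here are arbitrary, chosen to match the m x n times n x k rule above):

    import numpy as np

    A = np.ones((10, 7))   # m x n
    B = np.ones((7, 3))    # n x k

    C = A.dot(B)           # matrix multiplication
    print(C.shape)         # (10, 3), ie m x k
    print(C[0, 0])         # 7.0: the dot product of a row of A and a column of B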

In fact, you could even say that the vector dot product is the same as the multiplication of a matrix with only one row with a matrix with only one column, and indeed this is why the notation used by the book for the vector dot product is not a⋅b but a^T b.

Matrix multiplication is not commutative, ie in general AB ≠ BA

Proof:
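
A single counterexample is enough; the two matrices below are just an arbitrary pair:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[0, 1],
                  [1, 0]])

    print(A.dot(B))
    # [[2 1]
    #  [4 3]]

    print(B.dot(A))
    # [[3 4]
    #  [1 2]]
    # The two products differ, so AB != BA in general.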

This seems to tell us everything we need to know about matrix multiplication, but there are actually a couple other important things to note before we move on.

First, we viewed matrix multiplication on an entry-by-entry basis, and this is how the Deep Learning book (and most everybody else) presents it. However, it is sometimes useful to see matrix multiplication as involving either entire columns or entire rows! I will not go into details here, but Gilbert Strang explains this concept in his MIT OCW course; you can see his full explanation of matrix multiplication, including the “entire columns” and “entire rows” aspects, here:

Secondly, since we’ve already talked about transposes, it is interesting to ask what the transpose of a multiplication is. As it turns out, transposing the result of a matrix multiplication is the same as multiplying the original transposed matrices in reverse order, or, in math:

(AB)^T = B^T A^T

In code:
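
Something along these lines (the random matrices are arbitrary):

    import numpy as np

    np.random.seed(0)
    A = np.random.rand(3, 4)
    B = np.random.rand(4, 2)

    # (AB)^T equals B^T A^T, up to floating point error.
    print(np.allclose(A.dot(B).T, B.T.dot(A.T)))   # True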

Finally, note that there is a huge difference between matrix multiplication and element-wise multiplication of two matrices. In the element-wise product, we simply multiply each element in one matrix by the corresponding element in the other, whereas in matrix multiplication, we use the dot product of a row and a column.

In numpy, matrix multiplication is implemented using the .dot function whereas you can perform element-wise multiplication using the * operator. In the book's math notation, element-wise multiplication is represented by ⊙, though other operators are used by others. Matrix multiplication can just be represented by stringing the matrices together.
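
For two arbitrary square matrices, the two operations give very different results:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[5, 6],
                  [7, 8]])

    print(A * B)      # element-wise product
    # [[ 5 12]
    #  [21 32]]

    print(A.dot(B))   # matrix multiplication
    # [[19 22]
    #  [43 50]]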

Inverses

The identity matrix I has a very important property: multiplying it with any other matrix does nothing at all! It is simply a matrix with 0s everywhere and 1s on the diagonal going from the upper left to the lower right. In python, you can create an identity matrix of a given size using the np.eye function (I -> eye, get it?).

I is basically the equivalent of 1 in the matrix world (since multiplying by 1 does nothing to ordinary numbers).

Most (but not all) square matrices have a special matrix called an “inverse”, represented with a -1 superscript like so: A^{-1}. The inverse of a matrix functions very much like the inverse of a number, in that multiplying a matrix by its inverse gives the identity matrix (just like multiplying a number by its inverse gives 1). Basically, multiplying by the inverse of a matrix is very similar to dividing by that matrix. So:

A^{-1} A = A A^{-1} = I

You can get the inverse of a matrix in numpy using np.linalg.inv.

Inverses can help us solve equations using matrices! Suppose we have the following equation:

Ax = b

So a matrix A multiplies an unknown vector x to give a known vector b: what is the x that makes this possible?

Using the inverse we get:

A^{-1}Ax = A^{-1}b, ie x = A^{-1}b

And we can now compute x!
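
A quick sketch with an arbitrary (invertible) matrix:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])     # an arbitrary invertible matrix
    b = np.array([3.0, 5.0])

    x = np.linalg.inv(A).dot(b)    # x = A^{-1} b
    print(x)                       # [0.8 1.4]
    print(A.dot(x))                # recovers b: [3. 5.]

(In practice, np.linalg.solve(A, b) is usually preferred over computing the inverse explicitly, both for speed and for numerical stability.)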

Not all matrices have inverses! Next we will see why.

Spans

Let’s explore why our bad matrix couldn’t be inverted before. For this, the easiest way is to introduce a geometrical way of thinking about matrices and vectors. I highly recommend watching 3Blue1Brown’s videos (see resources above) to gain full intuition on this.

You can think of any vector as a point in space. With a 2D vector this is simple, the first component is its location on the X axis and the second on the Y axis. With 3D things are still relatively familiar, just add a Z axis. With more dimensions, the vector is still a point in space, even though you can’t imagine it.

To deal with a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.

Geoffrey Hinton

A matrix multiplication takes a point in space and moves it to another point in space, that’s all it does.

Viewed this way, we can say that a matrix inverse takes a point that (conceptually at least) has been moved by the original matrix and finds the original location of that point (before it was moved).

The problem is: does every matrix pair up source points and destination points neatly, one to one? What if a matrix maps several source points to the same destination point? What if it never maps any point to a given destination? In either case, we can’t have an inverse!

In our non-invertible matrix, both of these problems occur! There are at least two different vectors x that map to the point [0, 0], as we show in code below, and yet there is no way to get to the point [1, 1] (try it!)
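
Taking the bad matrix to be the one whose columns are described below ([1, 2] and [2, 4]), both problems are easy to demonstrate:

    import numpy as np

    bad = np.array([[1, 2],
                    [2, 4]])   # the second column is twice the first

    # Two different vectors that both get mapped to [0, 0]:
    print(bad.dot(np.array([2, -1])))    # [0 0]
    print(bad.dot(np.array([-4, 2])))    # [0 0]

    # Every result of bad.dot(x) has the form [s, 2*s], so [1, 1] is unreachable,
    # and asking numpy for the inverse fails:
    try:
        np.linalg.inv(bad)
    except np.linalg.LinAlgError as e:
        print(e)   # "Singular matrix"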

To understand this, let’s visualize the way the two matrices transform different points. We’ll represent the points using a grid:

So it turns out that our bad matrix projects the entire grid onto a line, whereas A just tilts things around a little bit. This means that any point outside of the line cannot have been created by the bad matrix, and thus an inverse cannot exist! (and as we saw before, several points in 2D can give us the same point on the line, which is also a problem).

This line is the span of our bad matrix (the set of all values it can map to). By contrast, the span of A is the entire plane, as can be guessed from the fact that it’s only slightly shearing the grid.

As it turns out, matrices can only ever span linear spaces such as points, lines, planes and hyperplanes (the higher-dimensional analogues of planes). Further, all these spaces always have to contain the origin, since multiplying any matrix by the 0 vector always gives the 0 vector. Only matrices that span the entire space they live in have an inverse.

If we look more closely at our bad matrix, we notice something strange about its columns: the second column ([2, 4]) is exactly twice the first column ([1, 2])! As it turns out, this is exactly why our matrix doesn’t span the whole space!

Multiplying a matrix and a vector can be thought of as combining the columns of the matrix based on the elements of the vector, so if I multiply a matrix M by the vector [1, 2, 3, 4], the final vector is 1 times the first column of M plus 2 times the second column of M and so on. So whenever we multiply our bad matrix with a vector, the result can only ever be a multiple of the vector [1, 2], which indeed forms a line!
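
A small check of this view (matrix and vector chosen arbitrarily):

    import numpy as np

    M = np.array([[1, 0, 2],
                  [0, 1, 3]])
    v = np.array([1, 2, 3])

    # The usual matrix-vector product...
    print(M.dot(v))                                   # [ 7 11]

    # ...is the same as a weighted sum of the columns of M.
    print(1 * M[:, 0] + 2 * M[:, 1] + 3 * M[:, 2])    # [ 7 11]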

A set of vectors is called (linearly) “dependent” if it is possible to generate one of the vectors by scaling and adding some of the other vectors (in our case, just scaling). If the columns of a matrix are dependent, the matrix doesn’t span the whole space and can’t be inverted.

Also note that if the columns of a matrix are dependent, its rows are also dependent, and vice-versa. We won’t prove this here.

Norms

Norms are a measure of the length of a vector. The most common types of norms are called the Lp norms, and they are of the form:

||x||_p = (Σ_i |x_i|^p)^(1/p)

The most common Lp norms are the L1, L2 and L∞ norms, which you might already know under the names of Manhattan distance (the distance to go from the origin to the tip of the vector if you can only move along the axes), Euclidean distance (the distance to go from the origin to the tip if you can go in any direction you want) and maximum (of the absolute values), respectively.

L∞ might seem like a weird name, but it is actually simply what happens as p reaches infinity.

You can access the norms using np.linalg.norm.
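
For example, for an arbitrary vector like [3, -4]:

    import numpy as np

    v = np.array([3.0, -4.0])

    print(np.linalg.norm(v, 1))        # L1 / Manhattan: |3| + |-4| = 7.0
    print(np.linalg.norm(v))           # L2 / Euclidean (the default): 5.0
    print(np.linalg.norm(v, np.inf))   # L-infinity / max absolute value: 4.0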

Finally, note that we can measure a matrix using the same norms, but that sometimes people give matrix norms different names! In particular, the Frobenius norm is simply the L2 norm applied to the entries of a matrix. Remember to use that word if you want to sound smart.

Special matrices

  • Diagonal matrix: only has non-zero entries on its upper left to lower right diagonal (the other diagonal doesn’t count!)
    Numpy can create a diagonal matrix from a vector using np.diag.
  • Symmetric matrix: equal to its own transpose (the entries are symmetric across the up-left to down-right diagonal).
    A matrix times its transpose is always symmetric.
  • Unit vector: a vector whose L2 norm is 1.
    You can make a vector into a unit vector by dividing by its L2 norm.
  • Orthogonal vectors: two vectors whose dot product is 0.
    [0, 0] is orthogonal to every vector. Non-zero orthogonal vectors are perpendicular.
  • Orthonormal vectors: two unit vectors that are also orthogonal.
  • Orthogonal matrix: a matrix whose columns (and rows) are mutually orthonormal (yes, it is called orthogonal but it is made up of orthonormal vectors). Amazing property: the transpose of an orthogonal matrix is its own inverse!!!
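
As a quick check of that last property, take a 90-degree rotation matrix, which is one (arbitrary) example of an orthogonal matrix:

    import numpy as np

    Q = np.array([[0.0, -1.0],
                  [1.0,  0.0]])   # rotation by 90 degrees

    print(Q.T.dot(Q))   # the identity matrix
    # [[1. 0.]
    #  [0. 1.]]

    print(np.allclose(Q.T, np.linalg.inv(Q)))   # True: the transpose is the inverse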

Eigen-stuff

The word “eigen” can seem scary, so I like to mentally replace it with “special” (that’s not what the word actually means in German, but it’s good enough for our purposes). So whenever I write “eigenvector” or “eigenvalue”, replace it with “special vector” and “special value”.

So how are these vectors and values special? The eigenvectors are special because they only get stretched when they are multiplied by the matrix (ie their direction doesn’t change, only their length, and they might also end up pointing “backwards” if they are stretched by a negative amount). The eigenvalues are the amounts by which the corresponding eigenvectors are stretched.

Eigen-stuff is accessible in numpy through np.linalg.eig. Let's look at what they do. Below the black vector is always the original one and the red vector is the transformed one.

Here we see that the first (arbitrary) vector is not just stretched, but also slightly rotated by the matrix, so it clearly isn’t an eigenvector. We compute the eigenvectors and show how they get transformed and indeed they do not change direction! We also see that in the first case the transformed vector is much longer than the original vector, and in the second case it is much shorter, so the amount of stretching is per-vector.
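
The same thing can be checked numerically; the matrix below is arbitrary (and symmetric, so its eigen-stuff is real):

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])

    values, vectors = np.linalg.eig(A)   # eigenvalues, and eigenvectors as columns
    print(values)                        # 4 and 2 (the order may vary)

    # Multiplying an eigenvector by A only stretches it by its eigenvalue.
    for i in range(2):
        v = vectors[:, i]
        print(np.allclose(A.dot(v), values[i] * v))   # True, True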

Now the two important questions are:

  • Does every matrix have eigenvectors and eigenvalues, and if so how many?
  • What is the point of all of this?

The answer for the first question is… sorta. Given the definition we have given so far, the answer should be an emphatic “no”, because there exist rotation matrices, which always rotate a given vector. Since we have defined eigenvectors as “vectors that don’t get rotated when they are multiplied with the matrix” and since rotation matrices rotate every vector, they should not have any eigenvectors. However, if you call np.linalg.eig on a rotation matrix such as [[0, -1], [1, 0]] (which rotates vectors by 90 degrees), you will find that it won't error out! The reason is that if you use complex eigenvalues and eigenvectors, you can find eigen-stuff for every possible matrix.
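
Trying it on that 90-degree rotation matrix:

    import numpy as np

    R = np.array([[0.0, -1.0],
                  [1.0,  0.0]])   # rotates every vector by 90 degrees

    values, vectors = np.linalg.eig(R)
    print(values)   # i and -i: complex eigenvalues, so no real eigenvectors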

Having to deal with complex numbers is not very convenient, of course. Thankfully, many of the matrices we’ll encounter in practice will have real eigenvalues and eigenvectors.

Now as to what the point of all this is. It so happens that we can decompose a matrix using its eigenvectors. Specifically, if we have a matrix A, we can put all its eigenvectors in the columns of a matrix V and all its eigenvalues along the diagonal of a diagonal matrix Λ, and we find that:

A = V Λ V^{-1}

An important way in which this is useful is that it allows us to multiply a matrix with itself repeatedly very efficiently: when we multiply A with itself, the V^{-1} of the leftmost decomposition cancels out the V of the rightmost decomposition, and we end up with:

A^2 = V Λ^2 V^{-1}, and more generally A^n = V Λ^n V^{-1}

Because Λ is a diagonal matrix, taking it to a power simply involves taking each of its elements to that power, which is much faster than doing a lot of matrix multiplications.
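
Here is a sketch of both ideas, using an arbitrary matrix that happens to have real eigenvalues:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    values, V = np.linalg.eig(A)
    L = np.diag(values)   # the diagonal matrix of eigenvalues (capital lambda)

    # The decomposition reconstructs A.
    print(np.allclose(V.dot(L).dot(np.linalg.inv(V)), A))   # True

    # Matrix powers through the decomposition: A^5 = V L^5 V^{-1}.
    A5 = V.dot(np.diag(values ** 5)).dot(np.linalg.inv(V))
    print(np.allclose(A5, np.linalg.matrix_power(A, 5)))    # True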

Finally we should note that for all symmetric matrices, the eigenvector matrix is orthogonal, which means that its transpose is its own inverse, so we get A = V Λ V^T if A is symmetric, which is particularly useful.

SVD and pseudoinverse

In the previous section we found that eigenvalues were convenient but sadly didn’t apply to all matrices: specifically, for some real square matrices they required the use of complex numbers, and they of course don’t work at all for non-square matrices.

Singular value decomposition (SVD) tries to solve this problem by providing a decomposition with two orthogonal matrices (U and V) and one diagonal matrix (D), such that:

A = U D V^T

The book does not go into many details about the uses of SVD (though there are many) except for finding the Moore-Penrose pseudoinverse.

SVD can be accessed using np.linalg.svd.
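
A minimal sketch with an arbitrary non-square matrix:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])    # 2x3, so no eigendecomposition exists

    U, s, Vt = np.linalg.svd(A)        # s contains the singular values
    print(U.shape, s.shape, Vt.shape)  # (2, 2) (2,) (3, 3)

    # Build the (non-square) diagonal matrix D and reconstruct A.
    D = np.zeros(A.shape)
    D[:len(s), :len(s)] = np.diag(s)
    print(np.allclose(U.dot(D).dot(Vt), A))   # True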

The Moore-Penrose pseudoinverse can be computed using SVD. How doesn’t matter too much here; the point is what it can do: basically it can find an “inverse” for non-invertible matrices. Of course, because they are not invertible, the “inverse” will lack some properties, but it will still be quite useful.

As you might recall, the inverse was useful for solving equations like:

Ax = b

In which case you could find x using:

x = A^{-1} b

With a non-invertible matrix, there might not be an x that satisfies Ax=b, but the pseudoinverse can find the x that comes closest (by minimizing the L2 distance between Ax and b).

The Moore-Penrose pseudoinverse is accessible by using np.linalg.pinv.
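
Reusing the singular bad matrix from before (the one with columns [1, 2] and [2, 4]), the pseudoinverse gives a best-effort answer even though no exact solution exists:

    import numpy as np

    bad = np.array([[1.0, 2.0],
                    [2.0, 4.0]])   # singular, so np.linalg.inv would fail
    b = np.array([1.0, 1.0])       # no x satisfies bad.dot(x) == b exactly

    x = np.linalg.pinv(bad).dot(b)
    print(x)              # [0.12 0.24]
    print(bad.dot(x))     # [0.6 1.2], the closest reachable point to [1, 1]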

Trace and determinant

The trace is simply an operator that sums the entries along the main diagonal of a matrix. It has some interesting properties (eg the trace of a transpose is the same as the original trace, and matrix products can be cyclically permuted inside the trace operator without changing the result), and its main use is to simplify some math by getting rid of explicit sums in equations in some cases.

It is accessible as np.trace
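
A quick check of the properties mentioned above, with arbitrary random matrices:

    import numpy as np

    np.random.seed(0)
    A = np.random.rand(3, 3)
    B = np.random.rand(3, 3)
    C = np.random.rand(3, 3)

    # The trace of a transpose is the same as the original trace.
    print(np.isclose(np.trace(A), np.trace(A.T)))   # True

    # Factors of a product can be cycled around inside the trace.
    print(np.isclose(np.trace(A.dot(B).dot(C)), np.trace(C.dot(A).dot(B))))   # True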

The determinant is explained very quickly in the book, although it has many interesting properties. Suffice it to say that it is a single number that describes a matrix; it has the following properties:

  • If det(A) is 0, then A is singular
  • It is the product of the eigenvalues
  • It can be thought of as the amount by which multiplying by the matrix stretches space: if the determinant is 2, then the matrix can be thought of as doubling the volume of space. If it is 1, volumes are left unchanged.
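
The first two properties are easy to check in numpy (using the singular bad matrix from before and an arbitrary invertible matrix):

    import numpy as np

    bad = np.array([[1.0, 2.0],
                    [2.0, 4.0]])
    print(np.linalg.det(bad))   # 0.0 (up to rounding): bad is singular

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])
    values, _ = np.linalg.eig(A)
    print(np.linalg.det(A))     # 8.0 (up to rounding)
    print(np.prod(values))      # also 8.0: the determinant is the product of the eigenvalues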

Finally, the chapter ends on a derivation of PCA with a single component. It is a cool example of derivation but I won’t go through it here since it doesn’t really introduce new material, just shows how to use what we saw above.

That’s it for this week’s notes. I think I’ll keep translating the book’s concepts into python code, it enlivens things a bit and makes them more concrete, but maybe next time’s notes will be shorter: this was a lot of work!
