Advanced Machine Learning 2 | Basics of PyTorch and Backpropagation


1. Basics For PyTorch

(1) History of PyTorch

PyTorch began as an internship project by Adam Paszke and builds on the Torch library. Torch is an open-source machine learning library, a scientific computing framework, and a scripting language based on the Lua programming language; it had its initial release in October 2002.

(2) Features of PyTorch

2. Mathematics Behind PyTorch

(1) Central Difference

Generally, we have two ways of calculating derivatives. In the previous sections (i.e., Introduction to Machine Learning), we used symbolic derivatives, which require the full symbolic form of the function. Even though this method is accurate, we can hardly calculate derivatives when the symbolic function is unclear. Therefore, the second approach we have here is called numerical derivatives.

Although this formula will not give us an exact value of the derivative, it provides flexibility. The central difference approximates the derivative as

$$f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$

where we select a small step $\epsilon$, which can be modified case by case.
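As a small sketch of this idea (the helper name and default step size here are illustrative, not from the original notes), the central difference can be written as:

```python
def central_difference(f, x, eps=1e-4):
    """Approximate f'(x) with the central difference (f(x+eps) - f(x-eps)) / (2*eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# example: the derivative of x^2 at x = 3 is 6
print(central_difference(lambda t: t * t, 3.0))  # close to 6.0
```

No symbolic form of `f` is needed; only the ability to evaluate it at two nearby points.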

(2) Autodifferentiation

Automatic differentiation (aka AD or autodiff) is a set of techniques to evaluate the derivative of a function specified by a computer program, based on the chain rule. There are two different kinds of pass strategies: the forward pass, which computes the output value of the function, and the backward pass, which propagates derivatives from the output back to the inputs.

Suppose now we would like to run a backward pass on the function $f(x) = x^2$, and we already know that it has derivative $f'(x) = 2x$. Then we can write it in code as,
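The code block is missing from these notes; a minimal sketch consistent with the later call `square(1).backward(1)` (the class and method names are assumed from that call) is:

```python
class square:
    def __init__(self, x):
        self.x = x                   # save the input for the backward pass

    def forward(self):
        return self.x * self.x       # f(x) = x^2

    def backward(self, d_out):
        return d_out * 2 * self.x    # chain rule: d_out * f'(x)
```

For example, `square(1).forward()` returns 1 and `square(1).backward(1)` returns 2.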

Note that if we want to calculate the backward pass of $f$ alone in this one-function situation, we should pass d_out = 1, as in square(1).backward(1).

Now let's consider the two-function backward pass with one argument. Here is a forward pass of functions $g$ and $f$ on $x$, composed as $f(g(x))$, shown as follows,

[Figure: box diagram of the two-function forward pass]

If we use the univariate chain rule, we can have,

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

This means we can extend the former class definitions by,
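The extended definitions are missing from these notes. As a self-contained sketch, each box saves its input and has its backward multiply d_out by its local derivative; composing square with itself here is an illustrative choice, not the notes' own example:

```python
class square:
    def __init__(self, x):
        self.x = x
    def forward(self):
        return self.x * self.x
    def backward(self, d_out):
        return d_out * 2 * self.x     # d_out * f'(x)

# forward pass through two boxes: h(x) = (x^2)^2
x = 2.0
first = square(x)                     # first box, g(x) = x^2
second = square(first.forward())      # second box, f(g(x))
print(second.forward())               # h(2) = 16.0

# backward pass: the d_out of the first box is the value
# returned by the backward of the second box, and the chain
# starts with d_out = 1
print(first.backward(second.backward(1)))   # h'(x) = 4x^3 = 32.0
```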

So, to calculate the derivative, the d_out passed to the backward of the first box should be the value returned by the backward of the second box. The 1 at the end is to start off the chain rule process with a value for d_out.

[Figure: box diagram of the two-function backward pass]

Now, let's see an example. Suppose we want to calculate the derivative of $f(x) = \sin(x^2)$ at $x = 3$. Based on the chain rule, we can derive that,

$$f'(x) = 2x\cos(x^2), \qquad f'(3) = 6\cos(9) \approx -5.4668$$

As we have defined the square class, now we have to define the sine class. Note that we use symbolic derivatives for square and numerical derivatives for sine in order to show both of these methods. So now these classes should be defined as,
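The class definitions are missing here; a sketch assuming square keeps its symbolic derivative $2x$ while sine approximates its derivative with a central difference:

```python
import math

class square:
    def __init__(self, x):
        self.x = x
    def forward(self):
        return self.x * self.x
    def backward(self, d_out):
        return d_out * 2 * self.x   # symbolic: f'(x) = 2x

class sine:
    def __init__(self, x, eps=1e-6):
        self.x = x
        self.eps = eps
    def forward(self):
        return math.sin(self.x)
    def backward(self, d_out):
        # numerical (central difference) approximation of cos(x)
        deriv = (math.sin(self.x + self.eps)
                 - math.sin(self.x - self.eps)) / (2 * self.eps)
        return d_out * deriv
```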

Then we can run the forward pass as,
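A sketch of that forward pass (the hypothetical class definitions are repeated so the snippet runs on its own):

```python
import math

class square:
    def __init__(self, x): self.x = x
    def forward(self): return self.x * self.x
    def backward(self, d_out): return d_out * 2 * self.x  # symbolic 2x

class sine:
    def __init__(self, x): self.x = x
    def forward(self): return math.sin(self.x)
    def backward(self, d_out, eps=1e-6):
        # central-difference approximation of cos(x)
        return d_out * (math.sin(self.x + eps) - math.sin(self.x - eps)) / (2 * eps)

inner = square(3)              # x^2
outer = sine(inner.forward())  # sin(x^2)
print(round(outer.forward(), 4))  # 0.4121
```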

And the output should be 0.4121.

Also, the backward pass is calculated as,
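A sketch of that backward pass (again repeating the hypothetical class definitions so the snippet is self-contained):

```python
import math

class square:
    def __init__(self, x): self.x = x
    def forward(self): return self.x * self.x
    def backward(self, d_out): return d_out * 2 * self.x  # symbolic 2x

class sine:
    def __init__(self, x): self.x = x
    def forward(self): return math.sin(self.x)
    def backward(self, d_out, eps=1e-6):
        # central-difference approximation of cos(x)
        return d_out * (math.sin(self.x + eps) - math.sin(self.x - eps)) / (2 * eps)

inner = square(3)
outer = sine(inner.forward())          # sin(3^2)
# run backward from the outside in, starting the chain with d_out = 1
d = inner.backward(outer.backward(1))
print(round(d, 4))  # -5.4668
```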

This result of -5.4668 matches the value we calculated by the chain rule.

(3) Backpropagation

The backward function tells us how to compute the derivative of one operation, and the chain rule tells us how to compute the derivative of two sequential operations. Backpropagation (aka backprop) shows us how to use these rules to compute the derivative of an arbitrary series of operations.

Now, let's suppose we have two arguments $x$ and $y$ and a function $f(x, y)$ defined as follows,

Let's suppose we would like to compute the derivatives of $f$ with respect to $x$ and $y$. Then, based on the function we have defined, we can construct a box graph as follows,

[Figure: box graph of the function on $x$ and $y$]

Based on the chain rule, we have,

Then let's consider the backpropagation. From right to left, we have d_out = 1 as the initial backward input. Since the derivatives of the final operation with respect to each of its two inputs are both 1, the first backward step passes 1s along as the d_out inputs for the next step.

[Figure: first backward step, starting with d_out = 1]

Then, let's suppose we first compute the backward of the function. From the backward method we have discussed, the result of this step should be computed as,

[Figure: result of this backward step in the box graph]

So the next issue is that it seems we can continue the backward pass to both the green box and the blue box, but the order of the backward pass really matters. Actually, in this case we have to process the blue box in the next backward step instead of the green one; but how can the machine know that?

(4) Topological Sort

To handle this issue, we will process the nodes in a topological order. Firstly, we have to note that our graph is not a random directed graph; it is actually a DAG (aka Directed Acyclic Graph). Please refer to this article if you cannot remember clearly what a DAG is. In this case, the directionality comes from the backward function, and the lack of cycles is a consequence of the choice that every function must create a new variable.
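As a sketch of how such an ordering can be computed (the adjacency-list representation and node names here are assumptions, not the notes' own data structure), a depth-first topological sort over a DAG looks like:

```python
def topological_sort(graph):
    """Return the nodes of a DAG so that every edge u -> v places u before v.

    `graph` maps each node to the list of nodes it points to.
    """
    visited = set()
    order = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for child in graph.get(node, []):
            visit(child)
        order.append(node)          # appended only after all descendants

    for node in graph:
        visit(node)
    return order[::-1]              # reverse post-order

# hypothetical box graph: the final add feeds both the blue and green boxes,
# and the blue box also feeds the green one
example = {"add": ["blue", "green"], "blue": ["green"], "green": []}
print(topological_sort(example))    # ['add', 'blue', 'green']
```

Processing nodes in this order guarantees that a box's backward runs only after every box that depends on it has already passed its d_out along.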