# The “gradient” argument in Pytorch’s “backward” function — explained by examples

This post is some examples for the `gradient` argument in Pytorch's `backward` function. The math of `backward(gradient)` is explained in this tutorialand these threads (thread-1, thread-2), along with some examples. Those were very helpful, but I wish there were more examples on how the numbers in the example correspond to the math, to help me more easily understand. I could not find many such examples so I will make some and write them here, so that I can look back when I forget this in two weeks.

In the examples, I run code in torch, write down the math, and run the math in numpy, and show that the torch result matches the math/numpy result.

The Jupyter notebook version of this post is on github is better formatted and runnable. You can download it and run the code.

Here’s how Pytorch tutorial explains the math:

We will make examples of `x` and `y=f(x)` (we omit the arrow-hats of `x` and `y` above), and manually calculate Jacobian `J`.

Pytorch tutorial goes on with the explanation:

The above basically says: if you pass `vᵀ` as the `gradient` argument, then `y.backward(gradient)` will give you not `J` but `vᵀ・J` as the result of `x.grad`.

We will make examples of `vᵀ`, calculate `vᵀ・J` in numpy, and confirm that the result is the same as `x.grad` after calling `y.backward(gradient)` where `gradient` is `vᵀ`.

All good? Let’s go.

`import torchimport numpy as npfrom torch import tensorfrom numpy import array`

# input is scalar, output is scalar

First, a simple example where `x=1` and `y = x^2` are both scalar.

In pytorch:

`x = tensor(1., requires_grad=True)print('x:', x)y = x**2print('y:', y)y.backward() # this is the same as y.backward(tensor(1.))print('x.grad:', x.grad)`

Out:

`x: tensor(1., requires_grad=True)y: tensor(1., grad_fn=<PowBackward0>)x.grad: tensor(2.)`

Now manually calculate Jacobian `J`. In this case `x` and `y` are both scalar (each only has one component `x_1` and `y_1` respectively). And we have

In numpy:

`x = x.detach().numpy()J = array([[2*x]])print('J:', J)`

Out:

`J: [[2.]]`

In this example, we did not pass the `gradient` argument to `backward()`, and this defaults to passing the value 1. As a reminder, `vᵀ` is our `gradient`with value 1. We can confirm that `vᵀ・J` gives the same result as `x.grad`. All good.

`vᵀ = array([[1,]])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`vᵀ: []vᵀ・J: [[2.]]`

# input is scalar, output is scalar, non-default gradient

We can keep everything else the same but pass a non-default `gradient` with the value 100 to `backward()` instead of the default value 1.

`x = tensor(1., requires_grad=True)print('x:', x)y = x**2print('y:', y)gradient_value = 100.y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor(1., requires_grad=True)y: tensor(1., grad_fn=<PowBackward0>)x.grad: tensor(200.)`

This is the same as setting the value `100` for `vᵀ`, and we can see `vᵀ・J` still matches `x.grad`. Still good.

`x = x.detach().numpy()J = array([[2*x]])print('J:', J)vᵀ = array([[gradient_value,]])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`J: [[2.]]vᵀ: [[100.]]vᵀ・J: [[200.]]`

# input is vector, output is scalar

Now let’s look at a more interesting example where `x=[x_1,x_2]=[1,2]` is a vector and `y=sum(x)` is a scalar.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = sum(x)print('y:', y)y.backward() print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)y: tensor(3., grad_fn=<AddBackward0>)x.grad: tensor([1., 1.])`

Now manually calculate Jacobian `J`. In this since `x` is a vector with components `x_1` and `x_2`, and `y=x_1+x_2` is a scalar. We have

In numpy:

`J = array([[1, 1]])print('J:')print(J)`

Out:

`J:[[1 1]]`

In this example, we did not pass the `gradient` argument to `backward()`, and this defaults to passing the value 1, i.e., `vᵀ` has value 1. We can confirm that `vᵀ・J` gives the same result as `x.grad`.

`vᵀ = array([])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`vᵀ: []vᵀ・J: [[1 1]]`

# input is vector, output is scalar, non-default gradient

We can keep everything else the same as above but pass a non-default `gradient` with the value 100 to `backward()` instead of the default value 1. Still, `x=[x_1,x_2]=[1,2]` is a vector and `y=sum(x)` is a scalar.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = sum(x)gradient_value = 100.y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)x.grad: tensor([100., 100.])`

This is the same as setting the value 100 for `vᵀ`, and we can see `vᵀ・J` still matches `x.grad`. Still good.

`J = array([[1, 1]])print('J:')print(J)vᵀ = array([[gradient_value,]])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`J:[[1 1]]vᵀ: [[100.]]vᵀ・J: [[100. 100.]]`

# input is vector, output is vector

Now let’s look at an example where both `x=[x_1,x_2]=[1,2]` and `y=3x^2` are vectors.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = 3*x**2print('y:', y)gradient_value = [1., 1.]y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)y: tensor([ 3., 12.], grad_fn=<MulBackward0>)x.grad: tensor([ 6., 12.])`

Now manually calculate Jacobian `J`. In this since `x` is a vector with components `x_1` and `x_2`, and `y=3x^2` is a vector with component `y_1=3x_1^2`and `y_2=3x_2^2`. We have

In numpy:

`x = x.detach().numpy()J = array([[6*x, 0], [0, 6*x]])print('J:')print(J)`

Out:

`J:[[ 6.  0.] [ 0. 12.]]`

In this example, because `y` is a vector, we must pass a `gradient` argument to `backward()`. We pass `vᵀ` with the same length as `y` and has values 1. We can confirm that `vᵀ・J` gives the same result as `x.grad`.

`vᵀ = array([gradient_value])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`vᵀ: [[1. 1.]]vᵀ・J: [[ 6. 12.]]`

# input is vector, output is vector, non-one gradient

We can keep everything else the same as above but pass a non-default `gradient` with the value [1, 10] to `backward()`.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = 3*x**2print('y:', y)gradient_value = [1., 10.]y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)y: tensor([ 3., 12.], grad_fn=<MulBackward0>)x.grad: tensor([  6., 120.])`

This is the same as setting the value [1,10] for `vᵀ`, and we can see `vᵀ・J` still matches `x.grad`. Still good.

`x = x.detach().numpy()J = array([[6*x, 0], [0, 6*x]])print('J:')print(J)vᵀ = array([gradient_value])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`J:[[ 6.  0.] [ 0. 12.]]vᵀ: [[ 1. 10.]]vᵀ・J: [[  6. 120.]]`

# input is vector, output is vector — another example

Now let’s look at a slightly more complex/full-fledged example where

is a vector with 2 components and

is a vector with 3 components.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = torch.empty(3)y = 3*x**2y = x**2 + 2*x**3y = 10*xprint('y:', y)gradient_value = [1., 10., 100.,]y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)y: tensor([ 3., 17., 20.], grad_fn=<CopySlices>)x.grad: tensor([  26., 1240.])`

Now manually calculate Jacobian `J`. In this since `x` is a vector with components `x_1` and `x_2`, and

We have

In numpy:

`x = x.detach().numpy()J = array([[6*x, 0],            [2*x, 6*x**2],            [0, 10]])print('J:')print(J)`

Out:

`J:[[ 6.  0.] [ 2. 24.] [ 0. 10.]]`

In this example, because `y` is a vector, we must pass a `gradient` argument to `backward()`. We pass `vᵀ` with the same length as `y` and has values [1., 10., 100.,]. We can confirm that `vᵀ・J` gives the same result as `x.grad`.

`vᵀ = array([gradient_value])print('vᵀ:', vᵀ)print('vᵀ・J:', vᵀ@J)`

Out:

`vᵀ: [[  1.  10. 100.]]vᵀ・J: [[  26. 1240.]]`

But there’s more. I’ve not seen it elsewhere other than here, but when `y` is a scalar, you can actually pass a vector as the `gradient`.

## extra cases-1: input is scalar, output is scalar, gradient is vector

As in our very first simple example, let `x=1` and `y = x^2`, both scalar. But instead of a scalar, we can pass a vector of arbitrary length as `gradient`.

In pytorch:

`x = tensor(1., requires_grad=True)print('x:', x)y = x**2print('y:', y)gradient_value = [1., 10., 100., 1000.]y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor(1., requires_grad=True)y: tensor(1., grad_fn=<PowBackward0>)x.grad: tensor(2222.)`

With this vector gradient, `backward` accumulates gradient for `x`:

`x = tensor(1., requires_grad=True)y = x**2gradient_value = [1., 10., 100., 1000.]for v in gradient_value:    y.backward(tensor(v), retain_graph=True)     print('x.grad:', x.grad)`

Out:

`x.grad: tensor(2.)x.grad: tensor(22.)x.grad: tensor(222.)x.grad: tensor(2222.)`

In the matrix multiplication universe, this behavior is as if `J` is broadcast to the same length of the `gradient`.

As before, the Jacobian:

In numpy:

`x = x.detach().numpy()J = array([[2*x]])print('J:', J)J_broadcast = np.repeat(J, len(gradient_value), axis=0)print('J_broadcast:')print(J_broadcast)`

Out:

`J: [[2.]]J_broadcast:[[2.] [2.] [2.] [2.]]`

We can confirm that `vᵀ・J_broadcast` gives the same result as `x.grad`. All good.

`vᵀ = array([gradient_value])print('vᵀ:', vᵀ)print('vᵀ・J_broadcast:', vᵀ@J_broadcast)`

Out:

`vᵀ: [[   1.   10.  100. 1000.]]vᵀ・J_broadcast: [[2222.]]`

## extra cases-2: input is vector, output is scalar, gradient is vector

Here’s another example of broadcast/accumulate. As in our second example, let `x=[x_1,x_2]=[1,2]` is a vector and `y=sum(x)`. But instead of a scalar, we can pass a vector of arbitrary length as `gradient`.

`x = tensor([1., 2.], requires_grad=True)print('x:', x)y = sum(x)print('y:', y)gradient_value = [1., 10., 100., 1000.]y.backward(tensor(gradient_value)) print('x.grad:', x.grad)`

Out:

`x: tensor([1., 2.], requires_grad=True)y: tensor(3., grad_fn=<AddBackward0>)x.grad: tensor([1111., 1111.])`

With this vector gradient, `backward` accumulates gradient for `x`:

`x = tensor([1., 2.], requires_grad=True)y = sum(x)gradient_value = [1., 10., 100., 1000.]for v in gradient_value:    y.backward(tensor(v), retain_graph=True)     print('x.grad:', x.grad)`

Out:

`x.grad: tensor([1., 1.])x.grad: tensor([11., 11.])x.grad: tensor([111., 111.])x.grad: tensor([1111., 1111.])`

In the matrix multiplication universe, this behavior is as if `J` is broadcast to the same length of the `gradient`.

In numpy:

`x = x.detach().numpy()J = array([[1, 1]])print('J:', J)J_broadcast = np.repeat(J, len(gradient_value), axis=0)print('J_broadcast:')print(J_broadcast)`

Out:

`J: [[1 1]]J_broadcast:[[1 1] [1 1] [1 1] [1 1]]`

We can confirm that `vᵀ・J_broadcast` gives the same result as `x.grad`.

`vᵀ = array([gradient_value])print('vᵀ:', vᵀ)print('vᵀ・J_broadcast:', vᵀ@J_broadcast)`

Out:

`vᵀ: [[   1.   10.  100. 1000.]]vᵀ・J_broadcast: [[1111. 1111.]]`

# That’s it.

Hope this helps. Since there’s some manual calculation of gradients here, if I made mistakes, please let me know so I can correct.