The “gradient” argument in Pytorch’s “backward” function — explained by examples

This post is some examples for the gradient argument in Pytorch's backward function. The math of backward(gradient) is explained in this tutorialand these threads (thread-1, thread-2), along with some examples. Those were very helpful, but I wish there were more examples on how the numbers in the example correspond to the math, to help me more easily understand. I could not find many such examples so I will make some and write them here, so that I can look back when I forget this in two weeks.

In the examples, I run code in torch, write down the math, and run the math in numpy, and show that the torch result matches the math/numpy result.

The Jupyter notebook version of this post is on github is better formatted and runnable. You can download it and run the code.

Here’s how Pytorch tutorial explains the math:

We will make examples of x and y=f(x) (we omit the arrow-hats of x and y above), and manually calculate Jacobian J.

Pytorch tutorial goes on with the explanation:

The above basically says: if you pass vᵀ as the gradient argument, then y.backward(gradient) will give you not J but vᵀ・J as the result of x.grad.

We will make examples of vᵀ, calculate vᵀ・J in numpy, and confirm that the result is the same as x.grad after calling y.backward(gradient) where gradient is vᵀ.

All good? Let’s go.

import torch
import numpy as np
from torch import tensor
from numpy import array

input is scalar, output is scalar

First, a simple example where x=1 and y = x^2 are both scalar.

In pytorch:

x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
y.backward() # this is the same as y.backward(tensor(1.))
print('x.grad:', x.grad)

Out:

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(2.)

Now manually calculate Jacobian J. In this case x and y are both scalar (each only has one component x_1 and y_1 respectively). And we have

In numpy:

x = x.detach().numpy()
J = array([[2*x]])
print('J:', J)

Out:

J: [[2.]]

In this example, we did not pass the gradient argument to backward(), and this defaults to passing the value 1. As a reminder, vᵀ is our gradientwith value 1. We can confirm that vᵀ・J gives the same result as x.grad. All good.

vᵀ = array([[1,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

vᵀ: [[1]]
vᵀ・J: [[2.]]

input is scalar, output is scalar, non-default gradient

We can keep everything else the same but pass a non-default gradient with the value 100 to backward() instead of the default value 1.

x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
gradient_value = 100.
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(200.)

This is the same as setting the value 100 for vᵀ, and we can see vᵀ・J still matches x.grad. Still good.

x = x.detach().numpy()
J = array([[2*x]])
print('J:', J)
vᵀ = array([[gradient_value,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

J: [[2.]]
vᵀ: [[100.]]
vᵀ・J: [[200.]]

input is vector, output is scalar

Now let’s look at a more interesting example where x=[x_1,x_2]=[1,2] is a vector and y=sum(x) is a scalar.

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = sum(x)
print('y:', y)
y.backward()
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
y: tensor(3., grad_fn=<AddBackward0>)
x.grad: tensor([1., 1.])

Now manually calculate Jacobian J. In this since x is a vector with components x_1 and x_2, and y=x_1+x_2 is a scalar. We have

In numpy:

J = array([[1, 1]])
print('J:')
print(J)

Out:

J:
[[1 1]]

In this example, we did not pass the gradient argument to backward(), and this defaults to passing the value 1, i.e., vᵀ has value 1. We can confirm that vᵀ・J gives the same result as x.grad.

vᵀ = array([[1]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

vᵀ: [[1]]
vᵀ・J: [[1 1]]

input is vector, output is scalar, non-default gradient

We can keep everything else the same as above but pass a non-default gradient with the value 100 to backward() instead of the default value 1. Still, x=[x_1,x_2]=[1,2] is a vector and y=sum(x) is a scalar.

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = sum(x)
gradient_value = 100.
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
x.grad: tensor([100., 100.])

This is the same as setting the value 100 for vᵀ, and we can see vᵀ・J still matches x.grad. Still good.

J = array([[1, 1]])
print('J:')
print(J)
vᵀ = array([[gradient_value,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

J:
[[1 1]]
vᵀ: [[100.]]
vᵀ・J: [[100. 100.]]

input is vector, output is vector

Now let’s look at an example where both x=[x_1,x_2]=[1,2] and y=3x^2 are vectors.

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = 3*x**2
print('y:', y)
gradient_value = [1., 1.]
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
y: tensor([ 3., 12.], grad_fn=<MulBackward0>)
x.grad: tensor([ 6., 12.])

Now manually calculate Jacobian J. In this since x is a vector with components x_1 and x_2, and y=3x^2 is a vector with component y_1=3x_1^2and y_2=3x_2^2. We have

In numpy:

x = x.detach().numpy()
J = array([[6*x[0], 0], [0, 6*x[1]]])
print('J:')
print(J)

Out:

J:
[[ 6. 0.]
[ 0. 12.]]

In this example, because y is a vector, we must pass a gradient argument to backward(). We pass vᵀ with the same length as y and has values 1. We can confirm that vᵀ・J gives the same result as x.grad.

vᵀ = array([gradient_value])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

vᵀ: [[1. 1.]]
vᵀ・J: [[ 6. 12.]]

input is vector, output is vector, non-one gradient

We can keep everything else the same as above but pass a non-default gradient with the value [1, 10] to backward().

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = 3*x**2
print('y:', y)
gradient_value = [1., 10.]
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
y: tensor([ 3., 12.], grad_fn=<MulBackward0>)
x.grad: tensor([ 6., 120.])

This is the same as setting the value [1,10] for vᵀ, and we can see vᵀ・J still matches x.grad. Still good.

x = x.detach().numpy()
J = array([[6*x[0], 0], [0, 6*x[1]]])
print('J:')
print(J)
vᵀ = array([gradient_value])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

J:
[[ 6. 0.]
[ 0. 12.]]
vᵀ: [[ 1. 10.]]
vᵀ・J: [[ 6. 120.]]

input is vector, output is vector — another example

Now let’s look at a slightly more complex/full-fledged example where

is a vector with 2 components and

is a vector with 3 components.

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = torch.empty(3)
y[0] = 3*x[0]**2
y[1] = x[0]**2 + 2*x[1]**3
y[2] = 10*x[1]
print('y:', y)
gradient_value = [1., 10., 100.,]
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
y: tensor([ 3., 17., 20.], grad_fn=<CopySlices>)
x.grad: tensor([ 26., 1240.])

Now manually calculate Jacobian J. In this since x is a vector with components x_1 and x_2, and

We have

In numpy:

x = x.detach().numpy()
J = array([[6*x[0], 0],
[2*x[0], 6*x[1]**2],
[0, 10]])
print('J:')
print(J)

Out:

J:
[[ 6. 0.]
[ 2. 24.]
[ 0. 10.]]

In this example, because y is a vector, we must pass a gradient argument to backward(). We pass vᵀ with the same length as y and has values [1., 10., 100.,]. We can confirm that vᵀ・J gives the same result as x.grad.

vᵀ = array([gradient_value])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

Out:

vᵀ: [[  1.  10. 100.]]
vᵀ・J: [[ 26. 1240.]]

extra cases: broadcast/accumulate

But there’s more. I’ve not seen it elsewhere other than here, but when y is a scalar, you can actually pass a vector as the gradient.

extra cases-1: input is scalar, output is scalar, gradient is vector

As in our very first simple example, let x=1 and y = x^2, both scalar. But instead of a scalar, we can pass a vector of arbitrary length as gradient.

In pytorch:

x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
gradient_value = [1., 10., 100., 1000.]
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(2222.)

With this vector gradient, backward accumulates gradient for x:

x = tensor(1., requires_grad=True)
y = x**2
gradient_value = [1., 10., 100., 1000.]
for v in gradient_value:
y.backward(tensor(v), retain_graph=True)
print('x.grad:', x.grad)

Out:

x.grad: tensor(2.)
x.grad: tensor(22.)
x.grad: tensor(222.)
x.grad: tensor(2222.)

In the matrix multiplication universe, this behavior is as if J is broadcast to the same length of the gradient.

As before, the Jacobian:

In numpy:

x = x.detach().numpy()
J = array([[2*x]])
print('J:', J)
J_broadcast = np.repeat(J, len(gradient_value), axis=0)
print('J_broadcast:')
print(J_broadcast)

Out:

J: [[2.]]
J_broadcast:
[[2.]
[2.]
[2.]
[2.]]

We can confirm that vᵀ・J_broadcast gives the same result as x.grad. All good.

vᵀ = array([gradient_value])
print('vᵀ:', vᵀ)
print('vᵀ・J_broadcast:', vᵀ@J_broadcast)

Out:

vᵀ: [[   1.   10.  100. 1000.]]
vᵀ・J_broadcast: [[2222.]]

extra cases-2: input is vector, output is scalar, gradient is vector

Here’s another example of broadcast/accumulate. As in our second example, let x=[x_1,x_2]=[1,2] is a vector and y=sum(x). But instead of a scalar, we can pass a vector of arbitrary length as gradient.

x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = sum(x)
print('y:', y)
gradient_value = [1., 10., 100., 1000.]
y.backward(tensor(gradient_value))
print('x.grad:', x.grad)

Out:

x: tensor([1., 2.], requires_grad=True)
y: tensor(3., grad_fn=<AddBackward0>)
x.grad: tensor([1111., 1111.])

With this vector gradient, backward accumulates gradient for x:

x = tensor([1., 2.], requires_grad=True)
y = sum(x)
gradient_value = [1., 10., 100., 1000.]
for v in gradient_value:
y.backward(tensor(v), retain_graph=True)
print('x.grad:', x.grad)

Out:

x.grad: tensor([1., 1.])
x.grad: tensor([11., 11.])
x.grad: tensor([111., 111.])
x.grad: tensor([1111., 1111.])

In the matrix multiplication universe, this behavior is as if J is broadcast to the same length of the gradient.

In numpy:

x = x.detach().numpy()
J = array([[1, 1]])
print('J:', J)
J_broadcast = np.repeat(J, len(gradient_value), axis=0)
print('J_broadcast:')
print(J_broadcast)

Out:

J: [[1 1]]
J_broadcast:
[[1 1]
[1 1]
[1 1]
[1 1]]

We can confirm that vᵀ・J_broadcast gives the same result as x.grad.

vᵀ = array([gradient_value])
print('vᵀ:', vᵀ)
print('vᵀ・J_broadcast:', vᵀ@J_broadcast)

Out:

vᵀ: [[   1.   10.  100. 1000.]]
vᵀ・J_broadcast: [[1111. 1111.]]

That’s it.

Hope this helps. Since there’s some manual calculation of gradients here, if I made mistakes, please let me know so I can correct.

Software Engineering SMTS at Salesforce Commerce Cloud Einstein