Automating Gradients
1. Gradients and tensors
Neural networks can get very complicated very quickly: there are several types of architectures, each possibly with many layers, and coding the gradients by hand becomes an error-prone task. Instead, libraries such as PyTorch or TensorFlow use a smart mechanism that computes gradients without our having to code them. Essentially, these libraries keep track of all the mathematical operations performed on a tensor object and use the chain rule to determine the derivatives.
So how do we calculate derivatives of tensors? Consider the scalar \(\Phi=\exp\Big(\sum_{ij} T_{ij}w_iw_j\Big)\) and its derivative with respect to the tensor $w_k$. Writing $u=\sum_{ij}T_{ij}w_iw_j$, we first calculate the derivative of $\Phi(u)$ with respect to $u$,
\(\frac{\partial\Phi}{\partial u}=\exp(u)\), and then use the chain rule, $\frac{\partial\Phi}{\partial w_k}=\frac{\partial\Phi}{\partial u}\frac{\partial u}{\partial w_k}$, together with the derivative
\[\frac{\partial u}{\partial w_k}=\sum_{ij}T_{ij}\frac{\partial w_i}{\partial w_k}w_j+\sum_{ij}T_{ij} w_i\frac{\partial w_j}{\partial w_k}+\sum_{ij}\frac{\partial T_{ij}}{\partial w_k}w_iw_j\]
But we can do this calculation differently. Instead of starting from the function $\Phi(u)$ and propagating the derivatives backwards, we can keep track of the results at each step of the forward computation. That is, we perform the calculation in the following order (a numerical check of these steps follows the list):
- Calculate the derivatives $\frac{\partial w_i}{\partial w_k}$ and $\frac{\partial T_{ij}}{\partial w_k}$ (for these seed tensors, $\frac{\partial w_i}{\partial w_k}=\delta_{ik}$, and $\frac{\partial T_{ij}}{\partial w_k}=0$ when $T$ is constant)
- Determine $X_i\equiv\sum_{j}T_{ij}w_j$ and $\frac{\partial X_i}{\partial w_k}=\sum_{j}\frac{\partial T_{ij}}{\partial w_k}w_j + \sum_{j}T_{ij}\frac{\partial w_j}{\partial w_k}$
- Determine $Y\equiv\sum_i X_iw_i$ and $\frac{\partial Y}{\partial w_k}=\sum_{i}\frac{\partial X_{i}}{\partial w_k}w_i + \sum_{i}X_{i}\frac{\partial w_i}{\partial w_k}$
- Finally calculate $\Phi=\exp(Y)$ and $\frac{\partial \Phi}{\partial w_k}=\Phi \frac{\partial Y}{\partial w_k}$
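As a sanity check, here is a minimal NumPy sketch of exactly these steps for a constant $T$ (so $\frac{\partial T_{ij}}{\partial w_k}=0$), compared against the closed-form result $\frac{\partial\Phi}{\partial w_k}=\Phi\big(\sum_j T_{kj}w_j+\sum_i T_{ik}w_i\big)$:
import numpy as np

np.random.seed(0)
d=4
T=np.random.normal(size=(d,d))   # constant tensor: dT_ij/dw_k = 0
w=np.random.normal(size=d)

# forward accumulation, one step at a time
dw=np.identity(d)                # dw_i/dw_k = delta_ik
X=T@w                            # X_i = sum_j T_ij w_j
dX=T@dw                          # dX_i/dw_k = sum_j T_ij dw_j/dw_k
Y=X@w                            # Y = sum_i X_i w_i
dY=dX.T@w+X@dw                   # product rule
Phi=np.exp(Y)
dPhi=Phi*dY                      # chain rule through exp

print(np.allclose(dPhi,Phi*(T@w+T.T@w)))   # True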
At each step, we calculate the resulting tensor and the corresponding derivative. To accomplish this, we need an object (class) that implements the various mathematical operations and keeps track of the gradients while the computation takes place. In pseudo-code:
class Tensor:
    def __init__(self,array,gradient):
        ...
    def __add__(self,tensor):
        result=add_operation(self,tensor)->array
        gradient=grad_operation(self,tensor)->array
        return Tensor(result,gradient)
    def __sub__(self,tensor):
        ...
    def __mul__(self,tensor):
        result=mul_operation(self,tensor)->array
        gradient=grad_operation(self,tensor)->array
        return Tensor(result,gradient)
We should also define non-linear functions, such as the sigmoid activation function
class Sigmoid:
    def __call__(self,x):
        return 1/(1+np.exp(-x))
    def grad(self,x):
        return np.exp(-x)/(1+np.exp(-x))**2
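As a quick check that this grad method really is $\sigma'(x)$, here is a minimal finite-difference sketch (assuming NumPy is imported):
import numpy as np

sig=Sigmoid()
z,eps=0.3,1e-6
fd=(sig(z+eps)-sig(z))/eps         # finite-difference approximation
print(np.isclose(sig.grad(z),fd))  # True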
As an example, say we want to calculate the scalar \(\phi=\sigma\Big(\sum_ix_i\omega_i\Big)\), where $\sigma(z)$ is the sigmoid function, and we are interested in the derivative with respect to $\omega_i$. First we create the Tensor objects,
w_tensor=Tensor(w_array,requires_grad=True)
x_tensor=Tensor(x_array,requires_grad=False)
where w_tensor and x_tensor have dimensions $(d,1)$ and $(1,d)$ respectively. The flag “requires_grad” specifies whether or not we want the derivative with respect to that tensor. We instantiate the function Sigmoid
sig=Sigmoid()
and calculate
phi=sig(x_tensor*w_tensor)
The operator * is overloaded in the Tensor class with matrix multiplication. The object “phi” is an instance of the class Tensor, which contains both the result of the mathematical operation and the gradient of “phi” with respect to $\omega_i$. We can then access the gradient through the attribute “phi.grad”.
2. Python Implementation
The Tensor class implements the linear mathematical operations together with utility methods such as transposing and squeezing. The code below assumes NumPy:
import numpy as np

class Tensor:
    # calc_grad: bool, class-level flag signaling whether to carry the
    # gradients while arithmetic operations take place
    calc_grad=True
    def __init__(self,array,
                 grad=None,
                 requires_grad=False):
        """
        array: numpy array
        grad: dict {id(object): numpy.array}
        requires_grad: bool, signals whether or not to calculate the derivative with respect to this tensor
        """
        self.array=array
        self.requires_grad=requires_grad
        if requires_grad:
            name=id(self)
            self.grad={name: self.make_grad()}
        else:
            self.grad={'none':0}
        if grad is not None:
            self.grad=grad
    @property
    def shape(self):
        return self.array.shape
    @property
    def ndim(self):
        return self.array.ndim
    @property
    def T(self):
        # note: returns the transposed numpy array, not a Tensor,
        # and leaves the gradients untouched
        return self.array.T
    def transpose(self,shape):
        # in-place transpose of the array and of the leading axes of each
        # gradient (the trailing derivative axes are left untouched)
        self.array=self.array.transpose(shape)
        if self.calc_grad:
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    extended=list(shape)+list(range(len(shape),self.grad[w].ndim))
                    self.grad[w]=self.grad[w].transpose(extended)
    def squeeze(self,axis=0):
        result=self.array.squeeze(axis)
        if self.calc_grad:
            grad={}
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    grad[w]=self.grad[w].squeeze(axis)
                else:
                    grad[w]=0
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
    def __getitem__(self,index):
        result=self.array[index]
        if self.calc_grad:
            grad={}
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    grad[w]=self.grad[w][index]
                else:
                    grad[w]=0
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
    def make_grad(self):
        # seed gradient of the tensor with respect to itself:
        # dw_{i1..in}/dw_{k1..kn}=delta_{i1 k1}...delta_{in kn},
        # built as a Kronecker product of identity matrices
        shape=self.array.shape
        Kron=1
        for d in shape:
            ID=np.identity(d)
            Kron=np.tensordot(Kron,ID,axes=0)
        # reorder the interleaved axes (i1,k1,i2,k2,...) into (i1,...,in,k1,...,kn)
        new_shape=[i for i in range(0,2*len(shape),2)]
        new_shape+=[i for i in range(1,2*len(shape),2)]
        Kron=Kron.transpose(new_shape)
        return Kron
    def check_grads(self,x):
        # make both gradient dictionaries track the same set of tensors,
        # filling missing entries with 0
        for w in self.grad:
            if w not in x.grad:
                x.grad[w]=0
        for w in x.grad:
            if w not in self.grad:
                self.grad[w]=0
    def __add__(self,x):
        if isinstance(x,Tensor):
            result=self.array+x.array
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    grad[w]=self.grad[w]+x.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
        if isinstance(x,int) or isinstance(x,float):
            result=self.array+x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')
    def __radd__(self,x):
        if isinstance(x,int) or isinstance(x,float):
            result=self.array+x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')
    def __sub__(self,x):
        if isinstance(x,Tensor):
            result=self.array-x.array
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    grad[w]=self.grad[w]-x.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
        if isinstance(x,int) or isinstance(x,float):
            result=self.array-x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')
    def __rsub__(self,x):
        if isinstance(x,int) or isinstance(x,float):
            result=x-self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=-self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
    def __mul__(self,x):
        if isinstance(x,int) or isinstance(x,float):
            result=x*self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=x*self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
        if isinstance(x,Tensor):
            # tensor-tensor product contracts the last axis of self with
            # the first axis of x (matrix multiplication)
            result=np.tensordot(self.array,x.array,axes=([-1],[0]))
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    if isinstance(x.grad[w],np.ndarray):
                        grad1=np.tensordot(self.array,x.grad[w],axes=([-1],[0]))
                    else:
                        grad1=0
                    if isinstance(self.grad[w],np.ndarray):
                        i=len(self.array.shape)
                        grad2=np.tensordot(self.grad[w],x.array,axes=([i-1],[0]))
                        # reorder axes so the result axes come first and the
                        # derivative axes inherited from self.grad[w] come last
                        n1=self.grad[w].ndim
                        n2=self.array.ndim
                        n3=x.array.ndim
                        r1=[j for j in range(n2-1)]+[j for j in range(n1-1,n1+n3-2)]
                        r2=[j for j in range(n2-1,n1-1)]
                        grad2=grad2.transpose(r1+r2)
                    else:
                        grad2=0
                    grad[w]=grad1+grad2
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
    def __rmul__(self,x):
        if isinstance(x,int) or isinstance(x,float):
            result=x*self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=x*self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
    def __neg__(self):
        result=-self.array
        if self.calc_grad:
            grad={}
            for w in self.grad:
                grad[w]=-self.grad[w]
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
    def sum(self,axis):
        result=self.array.sum(axis=axis)
        if self.calc_grad:
            grad={}
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    grad[w]=self.grad[w].sum(axis=axis)
                else:
                    grad[w]=0
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
    def __repr__(self):
        return f'Tensor({self.array},dtype {self.array.dtype},requires_grad={self.requires_grad})'
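Before adding non-linearities, here is a quick sanity check of the arithmetic (a minimal sketch, assuming the Tensor class above is in scope). For the product $xw$, the derivative with respect to $w_k$ is simply $x_k$:
w_tensor=Tensor(np.random.rand(3,1),requires_grad=True)
x_tensor=Tensor(np.random.rand(1,3))

out=x_tensor*w_tensor              # tensordot over the inner dimension
g=out.grad[id(w_tensor)]           # d(out)/dw, shape (1,1,3,1)
print(np.allclose(g[0,0,:,0],x_tensor.array[0]))   # True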
Non-linear functions (some examples) can be defined as:
class Sigmoid:
    """
    returns: Tensor with gradients
    """
    def __call__(self,x):
        u=np.exp(-x.array)
        out=1/(1+u)
        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    # broadcast the local derivative over the trailing
                    # derivative axes of the incoming gradient
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(u)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0
            return Tensor(out,grad=grad)
        else:
            return Tensor(out,grad='NA')
    @staticmethod
    def grad(u):
        den=(1+u)*(1+u)
        gd=u/den
        return gd
class Log:
    def __call__(self,x):
        out=np.log(x.array)
        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(x)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0
            return Tensor(out,grad=grad)
        else:
            return Tensor(out,grad='NA')
    @staticmethod
    def grad(x):
        gd=1/x.array
        return gd
class ReLU:
    def __call__(self,x):
        sign=(x.array<0)
        z=x.array.copy()
        z[sign]=0
        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(x,sign)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0
            return Tensor(z,grad=grad)
        else:
            return Tensor(z,grad='NA')
    @staticmethod
    def grad(x,sign):
        z=x.array.copy()
        z[sign]=0
        z[~sign]=1
        return z
class Softmax:
    def __call__(self,x):
        """calculate grads after the softmax operation
        Args:
            x (Tensor): shape=(batch,num_classes)
        Returns:
            Tensor: contains gradients relative to the softmax function
        """
        # subtract the row-wise max before exponentiating; this leaves the
        # softmax unchanged but avoids overflow for large inputs
        prob=np.exp(x.array-x.array.max(axis=1,keepdims=True))
        Z=prob.sum(1).reshape(-1,1)
        prob=prob/Z
        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=np.expand_dims(prob,axis=expand)
                    dp=grad_func*x.grad[w]
                    grad[w]=dp-grad_func*np.expand_dims(dp.sum(1),axis=1)
                else:
                    grad[w]=0
            return Tensor(prob,grad=grad)
        else:
            return Tensor(prob,grad='NA')
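These classes can be validated against finite differences. For instance, a minimal check of the $\phi=\sigma\big(\sum_i x_i\omega_i\big)$ example from section 1, assuming the Tensor and Sigmoid classes above are in scope:
sig=Sigmoid()
w_tensor=Tensor(np.random.rand(3,1),requires_grad=True)
x_tensor=Tensor(np.random.rand(1,3))

phi=sig(x_tensor*w_tensor)         # Tensor of shape (1,1)
g=phi.grad[id(w_tensor)]           # gradient, shape (1,1,3,1)

# perturb one component of w and compare with the finite difference
eps=1e-6
w_pert=w_tensor.array.copy()
w_pert[2,0]+=eps
phi_pert=1/(1+np.exp(-(x_tensor.array@w_pert)))
fd=(phi_pert-phi.array)/eps
print(np.allclose(g[:,:,2,0],fd,atol=1e-5))   # True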
With these definitions, it is now easy to build a simple feed-forward neural network without the trouble of coding the gradients explicitly. Here we show an example. First we define a LinearLayer class:
class LinearLayer:
    def __init__(self,in_dim,out_dim,bias=True):
        self.in_dim=in_dim
        self.out_dim=out_dim
        weight_,bias_=self.init_params()
        self.weight=Tensor(weight_,requires_grad=True)
        if bias:
            self.bias=Tensor(bias_,requires_grad=True)
            self.trainable={id(self.weight): self.weight,
                            id(self.bias): self.bias}
        else:
            # without a bias the only trainable tensor is the weight
            self.bias=0
            self.trainable={id(self.weight): self.weight}
    def __call__(self,x):
        """
        x: Tensor [batch,in_dim]
        """
        out=x*self.weight+self.bias
        return out
    def init_params(self):
        weight=np.random.normal(0,1,(self.in_dim,self.out_dim))
        bias=np.random.normal(0,1,(1,self.out_dim))
        return weight, bias
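For example, a quick shape check (random values, hypothetical dimensions):
layer=LinearLayer(3,2)
x=Tensor(np.random.rand(5,3))         # batch of 5 samples
out=layer(x)
print(out.shape)                      # (5, 2)
print(id(layer.weight) in out.grad)   # True: gradient w.r.t. the weights
print(id(layer.bias) in out.grad)     # True: gradient w.r.t. the bias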
and the FeedForward model is obtained by stacking linear layers,
class FeedForward:
    def __init__(self,input_dim,hidden_dim,out_dim=1,n_hid_layers=0):
        self.grads=None   # set by LogLoss.backward()
        self.train()
        self.in_layer=LinearLayer(input_dim,hidden_dim)
        self.hid_layers=[LinearLayer(hidden_dim,hidden_dim) for i in range(n_hid_layers)]
        self.out_layer=LinearLayer(hidden_dim,out_dim)
        self.relu=ReLU()
        self.sig=Sigmoid()
    def __call__(self,x):
        """
        assumes a two-class problem
        """
        out=self.in_layer(x)
        out=self.relu(out)
        for layer in self.hid_layers:
            out=layer(out)
            out=self.relu(out)
        out=self.out_layer(out)
        out=self.sig(out)
        return out
    def predict(self,x):
        """
        predict hard class labels
        """
        # set model to eval mode so we don't need to calculate the derivatives
        self.eval()
        pred=self(x)
        prob=pred.array.squeeze(1)         # numpy array of probabilities
        y_pred=(prob>=0.5).astype('int8')
        return y_pred
    def train(self):
        Tensor.calc_grad=True
    def eval(self):
        Tensor.calc_grad=False
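A forward pass then returns the probabilities together with the gradients of every trainable tensor (a small sketch with hypothetical dimensions):
model=FeedForward(input_dim=4,hidden_dim=8)
out=model(Tensor(np.random.rand(5,4)))
print(out.shape)       # (5, 1): one probability per sample
print(len(out.grad))   # 5: one entry per trainable tensor plus the 'none' placeholder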
For a two-class problem, we define the loss function and the optimizer.
class LogLoss:
    def __init__(self,model):
        self.model=model
        self.back_grads=None
        self.log=Log()
    def __call__(self,prob,y):
        """
        prob: Tensor
        y: Tensor
        """
        not_y=(1-y.array).reshape(-1,1).T
        not_y=Tensor(not_y)
        y_=y.array.reshape(-1,1).T
        y_=Tensor(y_)
        not_prob=1-prob.array
        grad={}
        for w in prob.grad:
            grad[w]=-prob.grad[w]
        not_prob=Tensor(not_prob,grad=grad)
        size=1/prob.shape[0]
        L=y_*self.log(prob)+not_y*self.log(not_prob)
        L=-L.sum(axis=0)
        L=size*L
        self.back_grads=L.grad
        return L.array[0]
    def backward(self):
        self.model.grads=self.back_grads
class Optimizer:
    def __init__(self,model,lr=0.01):
        self.model=model
        self.lr=lr
        self.tensors=self.find_tensor()
    def zero_grad(self):
        # reset every leaf tensor to its own seed gradient, dropping entries
        # accumulated from other tensors during the forward pass
        for idx, tensor in self.tensors.items():
            if tensor.requires_grad:
                grad={}
                grad[idx]=tensor.grad[idx]
                tensor.grad=grad
            else:
                grad={'none':0}
                tensor.grad=grad
    def step(self):
        if self.model.grads is not None:
            for idx, tensor in self.tensors.items():
                if idx in self.model.grads:
                    tensor.array-=self.lr*self.model.grads[idx].squeeze(0)
        else:
            print('No grads!')
    def find_tensor(self):
        # collect every trainable tensor, including those stored inside
        # lists such as FeedForward.hid_layers
        tensors={}
        def collect(obj):
            if isinstance(obj,Tensor):
                tensors[id(obj)]=obj
            elif isinstance(obj,list):
                for item in obj:
                    collect(item)
            elif hasattr(obj,'__dict__'):
                for _,param in obj.__dict__.items():
                    if isinstance(param,Tensor):
                        tensors[id(param)]=param
        for _,param1 in self.model.__dict__.items():
            collect(param1)
        return tensors
For mini-batch gradient descent, we use a simple data loader:
class DataSet:
    def __init__(self,x,y,batch_size=28):
        self.data_x=x
        self.data_y=y
        self.bsz=batch_size
    def __len__(self):
        return self.data_x.shape[0]
    def __iter__(self):
        L=self.data_x.shape[0]
        bsz=self.bsz
        for i in range(0,L,bsz):
            batch_x=Tensor(self.data_x[i:i+bsz])
            batch_y=Tensor(self.data_y[i:i+bsz])
            yield batch_x, batch_y
Now we are ready to train a one-hidden-layer model. As an example, we load the breast cancer dataset from the scikit-learn API.
from sklearn.datasets import load_breast_cancer
from tqdm import tqdm
data=load_breast_cancer()
x=data['data']
y=data['target']
x=x/x.max()   # crude normalization by the global maximum
data_loader=DataSet(x,y,128)
Define model, loss function and optimizer
model=FeedForward(30,50)
loss=LogLoss(model)
opt=Optimizer(model,0.1)
Perform training
def train(model,loss,optimizer,data_loader,epochs):
    L=len(data_loader)
    model.train()
    for epoch in tqdm(range(epochs)):
        total_loss=0
        for batch in data_loader:
            x_batch, y_batch=batch
            bsz=x_batch.shape[0]
            optimizer.zero_grad()
            out=model(x_batch)
            total_loss+=loss(out,y_batch)*bsz
            loss.backward()
            optimizer.step()
        if epoch%10==0:
            print('Loss: ',total_loss/L)
train(model,loss,opt,data_loader,20)
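Finally, we can check the fit on the training data (for a real experiment one would of course evaluate on a held-out split):
y_pred=model.predict(Tensor(x))
print('Accuracy: ',(y_pred==y).mean())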