1. Gradients and tensors

Neural networks can get very complicated very quickly. There are several types of architectures, and each can have multiple layers, so coding the gradients by hand quickly becomes a complicated task. Instead, libraries such as PyTorch or TensorFlow use a smart mechanism that calculates gradients without us having to code them. Essentially, these libraries keep track of all the mathematical operations performed on a tensor object and use the chain rule to determine the derivatives.

So how do we calculate derivatives of tensors? Consider the scalar \(\Phi=\exp\Big(\sum_{ij} T_{ij}w_iw_j\Big)\) and its derivative with respect to the tensor $w_k$. Writing $u\equiv\sum_{ij} T_{ij}w_iw_j$, we first calculate the derivative of $\Phi(u)$ with respect to $u$

\(\frac{\partial\Phi}{\partial u}=\exp(u)\) and then use the chain rule together with the derivatives

\[\frac{\partial u}{\partial w_k}=\sum_{ij}T_{ij}\frac{\partial w_i}{\partial w_k}w_j+\sum_{ij}T_{ij} w_i\frac{\partial w_j}{\partial w_k}+\sum_{ij}\frac{\partial T_{ij}}{\partial w_k}w_iw_j\]

so that \(\frac{\partial\Phi}{\partial w_k}=\exp(u)\,\frac{\partial u}{\partial w_k}\).

But we can do this calculation differently. Instead of starting from the function $\Phi(u)$ and propagating the derivatives backwards, we can keep track of the results at each step of the forward computation. That is, we perform the calculation in the following order (a small numerical check follows the list):

  1. Calculate the derivatives $\frac{\partial w_i}{\partial w_k}$ and $\frac{\partial T_{ij}}{\partial w_k}$
  2. Determine $X_i\equiv\sum_{j}T_{ij}w_j$ and $\frac{\partial X_i}{\partial w_k}=\sum_{j}\frac{\partial T_{ij}}{\partial w_k}w_j + \sum_{j}T_{ij}\frac{\partial w_j}{\partial w_k}$
  3. Determine $Y\equiv\sum_i X_iw_i$ and $\frac{\partial Y}{\partial w_k}=\sum_{i}\frac{\partial X_{i}}{\partial w_k}w_i + \sum_{i}X_{i}\frac{\partial w_i}{\partial w_k}$
  4. Finally calculate $\Phi=\exp(Y)$ and $\frac{\partial \Phi}{\partial w_k}=\Phi \frac{\partial Y}{\partial w_k}$
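
To make the forward accumulation concrete, here is a minimal NumPy check of these four steps, assuming $T$ is a fixed matrix (so $\partial T_{ij}/\partial w_k=0$) and $w$ is a vector; the shapes and values are purely illustrative:

import numpy as np

d=3
T=np.random.normal(0,1,(d,d))          # fixed matrix, dT/dw = 0
w=np.random.normal(0,1,d)

dw=np.identity(d)                      # step 1: dw_i/dw_k
X=T@w                                  # step 2: X_i = sum_j T_ij w_j
dX=T@dw                                # dX_i/dw_k = sum_j T_ij dw_j/dw_k
Y=X@w                                  # step 3: Y = sum_i X_i w_i
dY=w@dX+X@dw                           # dY/dw_k
Phi=np.exp(Y)                          # step 4
dPhi=Phi*dY                            # dPhi/dw_k

# agrees with the analytic result dPhi/dw_k = Phi * sum_j (T_kj + T_jk) w_j
print(np.allclose(dPhi,Phi*((T+T.T)@w)))   # True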

At each step, we calculate the resulting tensor and the corresponding derivative. To accomplish this, we need an object (class) that implements the various mathematical operations and keeps track of the gradients while these operations take place. In pseudo-code:

class Tensor:
    def __init__(self,array,gradient):
        ...

    def __add__(self,tensor):
        result=add_operation(self,tensor)->array
        gradient=grad_operation(self,tensor)->array
        return Tensor(result,gradient)

    def __sub__(self,tensor):
        ...

    def __mul__(self,tensor):
        result=mul_operation(self,tensor)->array
        gradient=grad_operation(self,tensor)->array
        return Tensor(result,gradient)

We should also define non-linear functions, such as the sigmoid activation function

class Sigmoid:

    def __call__(self,x):
        return 1/(1+np.exp(-x))

    def grad(self,x):
        return np.exp(-x)/(1+np.exp(-x))**2

As an example, say we want to calculate the scalar \(\phi=\sigma\Big(\sum_ix_i\omega_i\Big)\), where $\sigma(z)$ is the sigmoid function, and we are interested in the derivative with respect to $\omega_k$. First we create the Tensor objects,

w_tensor=Tensor(w_array,requires_grad=True)
x_tensor=Tensor(x_array,requires_grad=False)

where w_tensor and x_tensor have dimensions $(d,1)$ and $(1,d)$, respectively. The flag “requires_grad” specifies whether or not we want the derivative with respect to that tensor. We then instantiate the Sigmoid function

sig=Sigmoid()

and calculate

phi=sig(x_tensor*w_tensor)

The operator * is overloaded in the Tensor class to perform matrix multiplication. The object “phi” is an instance of the Tensor class, which contains both the result of the mathematical operation and the gradient of “phi” with respect to $\omega_k$. We can then access the gradient through the attribute “phi.grad”.
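
As a sanity check on what this machinery should produce, the chain rule gives $\frac{\partial\phi}{\partial\omega_k}=\sigma'\big(\sum_i x_i\omega_i\big)\,x_k$. The following plain-NumPy sketch (arrays and step size are illustrative) compares this analytic gradient with central finite differences:

import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

d=4
x=np.random.normal(0,1,d)
w=np.random.normal(0,1,d)

# analytic gradient: dphi/dw_k = sigmoid'(x.w) * x_k
z=x@w
analytic=sigmoid(z)*(1-sigmoid(z))*x

# central finite differences
eps=1e-6
numeric=np.array([(sigmoid(x@(w+eps*e))-sigmoid(x@(w-eps*e)))/(2*eps)
                  for e in np.identity(d)])

print(np.allclose(analytic,numeric))   # True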

2. Python Implementation

The Tensor class implements the basic linear mathematical operations, together with utility methods such as transpose and squeeze.

import numpy as np

class Tensor:
    """calc_grad: bool, signaling whether to carry the gradients while arithmetic operations take place"""

    calc_grad=True

    def __init__(self,array,
                grad=None,
                requires_grad=False):
        """
        array: numpy array
        grad: dic={id(object): numpy.array}
        requires_grad: bool, signaling whether to calculate or not the derivative relative to this tensor
        
        """
        self.array=array
        self.requires_grad=requires_grad
        
        if requires_grad:
            name=id(self) 
            self.grad={name: self.make_grad()}
        else:
            self.grad={'none':0}
        if grad is not None:
            self.grad=grad

    @property
    def shape(self):
        return self.array.shape
    @property
    def ndim(self):
        return self.array.ndim

    @property
    def T(self):
        return self.array.T

    def transpose(self,shape):
        self.array=self.array.transpose(shape)
        if self.calc_grad:
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    self.grad[w]=self.grad[w].transpose(shape)


    def squeeze(self,axis=0):
        result=self.array.squeeze(axis)
        if self.calc_grad:
            grad={}
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    grad[w]=self.grad[w].squeeze(axis)
                else:
                    grad[w]=0
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
        
    def __getitem__(self,index):
        result=self.array[index]
        grad={}
        for w in self.grad:
            if isinstance(self.grad[w],np.ndarray):
                grad[w]=self.grad[w][index]
            else:
                grad[w]=0
        
        return Tensor(result,grad=grad)

    def make_grad(self):
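        # gradient of this tensor with respect to itself: a Kronecker delta of
        # shape self.shape + self.shape, built as an outer product of identities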
        shape=self.array.shape
        Kron=1
        for d in shape:
            ID=np.identity(d)
            Kron=np.tensordot(Kron,ID,axes=0)
        new_shape=[i for i in range(0,2*len(shape),2)]
        new_shape+=[i for i in range(1,2*len(shape),2)]
        Kron=Kron.transpose(new_shape)

        return Kron

    def check_grads(self,x):
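        # make sure both tensors carry gradient entries for the same set of
        # variables, filling any missing entry with 0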

        for w in self.grad:
            if w not in x.grad:
                x.grad[w]=0
        for w in x.grad:
            if w not in self.grad:
                self.grad[w]=0

    def __add__(self,x):
        
        if isinstance(x,Tensor):
            result=self.array+x.array
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    grad[w]=self.grad[w]+x.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')

        if isinstance(x,int) or isinstance(x,float):
            result=self.array+x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')
    
    
    def __radd__(self,x):

        if isinstance(x,int) or isinstance(x,float):
            result=self.array+x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')

    def __sub__(self,x):
        
        if isinstance(x,Tensor):
            result=self.array-x.array
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    grad[w]=self.grad[w]-x.grad[w]

                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')

        if isinstance(x,int) or isinstance(x,float):
            result=self.array-x
            if self.calc_grad:
                return Tensor(result,grad=self.grad.copy())
            else:
                return Tensor(result,grad='NA')
    
    def __rsub__(self,x):
        
        if isinstance(x,int) or isinstance(x,float):
            result=x-self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=-self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')

    def __mul__(self,x):

        if isinstance(x,int) or isinstance(x,float):
            result=x*self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=x*self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')

        if isinstance(x,Tensor):
            result=np.tensordot(self.array,x.array,axes=([-1],[0]))
            if self.calc_grad:
                self.check_grads(x)
                grad={}
                for w in self.grad:
                    if not isinstance(x.grad[w],np.ndarray):
                        grad1=0
                    else:
                        grad1=np.tensordot(self.array,x.grad[w],axes=([-1],[0]))
                        
                    if not isinstance(self.grad[w],np.ndarray):
                        grad2=0
                    else:
                        i=len(self.array.shape)
                        grad2=np.tensordot(self.grad[w],x.array,axes=([i-1],[0]))
                        n1=self.grad[w].ndim
                        n2=self.array.ndim
                        n3=x.array.ndim
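                        # reorder the axes so that the result axes (from self, then x)
                        # come first and the gradient axes of w come last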
                        r1=[j for j in range(n2-1)]+[j for j in range(n1-1,n1+n3-2)]
                        r2=[j for j in range(n2-1,n1-1)]
                        grad2=grad2.transpose(r1+r2)
                    
                    grad[w]=grad1+grad2

                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
    
    def __rmul__(self, x):
        if isinstance(x,int) or isinstance(x,float):
            result=x*self.array
            if self.calc_grad:
                grad={}
                for w in self.grad:
                    grad[w]=x*self.grad[w]
                return Tensor(result,grad=grad)
            else:
                return Tensor(result,grad='NA')
    
    def __neg__(self):
        result=-self.array
        if self.calc_grad:
            grad={}
            for w in self.grad:
                grad[w]=-self.grad[w]
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')
        
    def sum(self,axis):
        result=self.array.sum(axis=axis)
        if self.calc_grad:
            grad={}
            for w in self.grad:
                if isinstance(self.grad[w],np.ndarray):
                    grad[w]=self.grad[w].sum(axis=axis)
                else:
                    grad[w]=0
            return Tensor(result,grad=grad)
        else:
            return Tensor(result,grad='NA')

    def __repr__(self):
        return f'Tensor({self.array},dtype {self.array.dtype},requires_grad={self.requires_grad})'
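
Before defining the non-linear functions, here is a quick sketch of how the class propagates gradients through a matrix product (the shapes are illustrative):

w=Tensor(np.random.normal(0,1,(3,2)),requires_grad=True)
x=Tensor(np.random.normal(0,1,(5,3)))      # requires_grad=False by default

y=x*w                                      # tensordot over the shared dimension
print(y.shape)                             # (5, 2)
print(y.grad[id(w)].shape)                 # (5, 2, 3, 2): dy_ab/dw_cd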

Non-linear functions (some examples) can be defined as:

class Sigmoid:
    """
    returns: Tensor with gradients
    """
    def __call__(self,x):

        u=np.exp(-x.array)
        out=1/(1+u)

        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(u)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0

            return Tensor(out,grad=grad)
        else:
            return Tensor(out,grad='NA')

    @staticmethod
    def grad(u):
        den=(1+u)*(1+u)
        gd=u/den

        return gd

class Log:

    def __call__(self,x):
        out=np.log(x.array)

        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(x)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0

            return Tensor(out,grad=grad)
        else:
            return Tensor(out,grad='NA')

    @staticmethod
    def grad(x):
        gd=1/x.array

        return gd

class ReLU:
    def __call__(self,x):
        sign=(x.array<0)
        z=x.array.copy()
        z[sign]=0

        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=self.grad(x,sign)
                    grad_func=np.expand_dims(grad_func,axis=expand)
                    grad[w]=grad_func*x.grad[w]
                else:
                    grad[w]=0

            return Tensor(z,grad=grad)
        else:
            return Tensor(z,grad='NA')

    @staticmethod
    def grad(x,sign):
        z=x.array.copy()
        z[sign]=0
        z[~sign]=1

        return z

class Softmax:

    def __call__(self,x):
        """calculate grads after softmaz operation

        Args:
            x (Tensor): shape=(batch,num_classes)

        Returns:
            Tensor: contains gradients relative to softmax function
        """
    
        prob=np.exp(x.array)
        Z=prob.sum(1).reshape(-1,1)
        prob=prob/Z

        if Tensor.calc_grad:
            grad={}
            for w in x.grad:
                if isinstance(x.grad[w],np.ndarray):
                    i=x.ndim
                    l=x.grad[w].ndim
                    expand=tuple([k for k in range(i,l)])
                    grad_func=np.expand_dims(prob,axis=expand)
                    dp=grad_func*x.grad[w]
                    grad[w]=dp-grad_func*np.expand_dims(dp.sum(1),axis=1)
                else:
                    grad[w]=0

            return Tensor(prob,grad=grad)
        else:
            return Tensor(prob,grad='NA')
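
A quick way to gain confidence in these classes is to compare the propagated gradients with finite differences. A sketch for the Sigmoid case (random inputs, illustrative shapes):

w=Tensor(np.random.normal(0,1,(4,1)),requires_grad=True)
x=Tensor(np.random.normal(0,1,(1,4)))

sig=Sigmoid()
phi=sig(x*w)                         # the forward pass carries the gradient along
auto=phi.grad[id(w)]                 # shape (1, 1, 4, 1)

eps=1e-6
numeric=np.zeros_like(auto)
for k in range(4):
    wp=w.array.copy(); wp[k,0]+=eps
    wm=w.array.copy(); wm[k,0]-=eps
    fp=1/(1+np.exp(-(x.array@wp)))
    fm=1/(1+np.exp(-(x.array@wm)))
    numeric[:,:,k,0]=(fp-fm)/(2*eps)

print(np.allclose(auto,numeric))     # True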

With these definitions it is now easy to build a simple feed-forward neural network, without the trouble of coding the gradients explicitly.

Here we show an example. First we define a LinearLayer class:

class LinearLayer:
    def __init__(self,in_dim,out_dim,bias=True):
        self.in_dim=in_dim
        self.out_dim=out_dim

        weight_,bias_=self.init_params()

        self.weight=Tensor(weight_,requires_grad=True)
        if bias:
            self.bias=Tensor(bias_,requires_grad=True)
        else:
            self.bias=0

        self.trainable={id(self.weight): self.weight,
                        id(self.bias): self.bias}
        
    def __call__(self,x):
        """
        x: Tensor [batch,in_dim]
        """
        out=x*self.weight+self.bias
        return out

    def init_params(self):
        weight=np.random.normal(0,1,(self.in_dim,self.out_dim))
        bias=np.random.normal(0,1,(1,self.out_dim))
        return weight, bias
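
A short usage sketch of the layer (the shapes are illustrative):

layer=LinearLayer(3,2)
x=Tensor(np.random.normal(0,1,(8,3)))          # batch of 8
out=layer(x)
print(out.shape)                               # (8, 2)
print(out.grad[id(layer.weight)].shape)        # (8, 2, 3, 2)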

The Feed Forward model is then obtained by stacking linear layers,

class FeedForward:

    def __init__(self,input_dim,hidden_dim,out_dim=1,n_hid_layers=0):
        self.train() 
        self.in_layer=LinearLayer(input_dim,hidden_dim)
        self.hid_layers=[LinearLayer(hidden_dim,hidden_dim) for i in range(n_hid_layers)]
        self.out_layer=LinearLayer(hidden_dim,out_dim)
        self.relu=ReLU()
        self.sig=Sigmoid()

    def __call__(self,x):
        """
        assume two class problem
        """
        out=self.in_layer(x)
        out=self.relu(out)
        for layer in self.hid_layers:
            out=layer(out)
            out=self.relu(out)
        out=self.out_layer(out)
        out=self.sig(out)

        return out
    
    def predict(self,x):
        """
        predict
        """
        #set model to eval mode so we don't need to calculate the derivatives
        self.eval()
        pred=self(x)
        pred=pred.array.squeeze(1)
        y_pred=(pred>=0.5).astype('int8')

        return y_pred

    def train(self):
        Tensor.calc_grad=True
    
    def eval(self):
        Tensor.calc_grad=False 
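
A minimal forward-pass sanity check of the model (the dimensions are illustrative):

net=FeedForward(input_dim=3,hidden_dim=5)
xb=Tensor(np.random.normal(0,1,(8,3)))
out=net(xb)
print(out.shape)                                  # (8, 1)
print(id(net.in_layer.weight) in out.grad)        # True: gradient w.r.t. the first layer's weights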

For a two-class problem we define the loss function as well as the optimizer

class LogLoss:
    def __init__(self,model):
        self.model=model
        self.back_grads=None
        self.log=Log()

    def __call__(self,prob,y):
        """
        prob: Tensor
        y: Tensor
        """
        not_y=(1-y.array).reshape(-1,1).T
        not_y=Tensor(not_y)
        y_=y.array.reshape(-1,1).T
        y_=Tensor(y_)

        not_prob=1-prob.array
        grad={}
        for w in prob.grad:
            grad[w]=-prob.grad[w]
        not_prob=Tensor(not_prob,grad=grad)

        size=1/prob.shape[0]
        L=y_*self.log(prob)+not_y*self.log(not_prob)
        L=-L.sum(axis=0)
        L=size*L

        self.back_grads=L.grad

        return L.array[0]
    
    def backward(self):
        self.model.grads=self.back_grads

class Optimizer:
    def __init__(self,model,lr=0.01):
        self.model=model
        self.lr=lr
        self.tensors=self.find_tensor()
    
    def zero_grad(self):
        for idx, tensor in self.tensors.items():
            if tensor.requires_grad:
                grad={}
                grad[idx]=tensor.grad[idx]
                tensor.grad=grad
            else:
                grad={'none':0}
                tensor.grad=grad 

    def step(self):
        if self.model.grads is not None:
            for idx, tensor in self.tensors.items():
                if idx in self.model.grads:
                    tensor.array-=self.lr*self.model.grads[idx].squeeze(0)
        else:
            print('No grads!')

    def find_tensor(self):
        tensors={}
        params=[]
        # flatten lists of layers (e.g. hid_layers) so that their parameters are found too
        for _,param1 in self.model.__dict__.items():
            if isinstance(param1,list):
                params+=param1
            else:
                params.append(param1)
        for param1 in params:
            if isinstance(param1,Tensor):
                tensors[id(param1)]=param1
            elif hasattr(param1,'__dict__'):
                for _,param2 in param1.__dict__.items():
                    if isinstance(param2,Tensor):
                        tensors[id(param2)]=param2
        return tensors

For mini-batch gradient descent we use a simple data loader

class DataSet:
    def __init__(self,x,y,batch_size=28):
        self.data_x=x
        self.data_y=y
        self.bsz=batch_size

    def __len__(self):

        return self.data_x.shape[0]
        
    def __iter__(self):
        L=self.data_x.shape[0]
        bsz=self.bsz
        for i in range(0,L,bsz):
            batch_x=Tensor(self.data_x[i:i+bsz])
            batch_y=Tensor(self.data_y[i:i+bsz])
            yield batch_x, batch_y

Now we are ready to train a model with one hidden layer. As an example, we load the breast cancer dataset from the scikit-learn API.

from sklearn.datasets import load_breast_cancer
from tqdm import tqdm 

data=load_breast_cancer()
x=data['data']
y=data['target']
x=x/x.max()

data_loader=DataSet(x,y,128)
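
Each iteration of the loader yields a pair of Tensor objects; for example:

x_batch,y_batch=next(iter(data_loader))
print(x_batch.shape,y_batch.shape)     # (128, 30) (128,)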

Define the model, the loss function, and the optimizer


model=FeedForward(30,50)
loss=LogLoss(model)
opt=Optimizer(model,0.1)

Perform training

def train(model,loss,optimizer,data_loader,epochs):
        
    L=len(data_loader)
    model.train()
    for epoch in tqdm(range(epochs)):
        total_loss=0
        for batch in data_loader:
            x_batch, y_batch=batch
            bsz=x_batch.shape[0]

            optimizer.zero_grad()
            out=model(x_batch)
            total_loss+=loss(out,y_batch)*bsz
            loss.backward()
            optimizer.step()

        if epoch%10==0:
            print('Loss: ',total_loss/L)
    
train(model,loss,opt,data_loader,20)
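
After training, a quick in-sample accuracy check could look like this (a sketch; we reuse the training data rather than a proper held-out split):

y_pred=model.predict(Tensor(x))         # predict() switches the model to eval mode
accuracy=(y_pred==y).mean()
print(f'train accuracy: {accuracy:.3f}')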