Facial Expression Recognition Using PyTorch

By Tumin Sharma on June 21, 2020

Computer vision is a very well-known term in 2020. There is real hype around it, with research teams and many others working to make computers see and understand the way humans do. One feature of human vision is reading the facial expressions of friends, family, and strangers. In this project, we will teach a computer to do the same!

The Main Idea

Computer Vision is a field of artificial intelligence that trains computers to understand the visual world by replicating the complexity of the human visual system. Using digital images acquired from cameras and videos, machines can be "trained" to accurately identify and classify objects and even react to what they perceive.

Facial expression recognition is one of the classification tasks people would really like to have in the computer vision toolbox. Our project will look through a camera, which serves as the machine's eyes, and classify the face of the person in view (if any) by their current expression or mood.


The Planning

So, now we will plan out a checklist to follow from the start to the completion of the project. It makes the strategy easier to understand and gives us more clarity about what we need to do, and what to avoid, to make the project work.

First, we will find a dataset to train the model on, then explore it a little to gain insights into how we might adjust the model later.

Second, we will get the data ready so that we can work on the model, then create a model pipeline. Once the model is ready, we will train it and tweak the hyperparameters as needed. After the model is fully trained, we will test it on the test data and save it.

Finally, we will write a Python script that uses a Haar cascade classifier from the OpenCV module to detect faces through the camera, then uses the saved model to classify the expression.


plan for the project


Our roadmap is ready. Now we can get on with the project!


Working With The Data

After searching the internet for some time, I found a dataset that interested me and fits the problem we are going to solve in this project very well!

The dataset is called FER2013, but I could only find it here. Unfortunately, there was no information about which class index stands for which expression, so we have to extract that information ourselves to make the project fruitful.



After taking a look at the data frame, we can clearly see that it has 3 columns. The first holds the class index of the expression (say, 0 for a happy face and 1 for a sad one), but we do not know which index means what yet, so we will work that out ourselves a little later.

The second column contains the pixel values for the photo of each face, and the Usage column states which rows are for training and which are for testing.

After a little exploration of the dataset, this is what I found:



So now we know that the data frame contains 35887 rows in total. That is actually a lot and should be plenty for training. There are three values in the Usage column: Training, PrivateTest, and PublicTest. Probably the dataset came from a contest in which contestants trained on the Training rows and made predictions on PublicTest, while the online leaderboard used PrivateTest to measure accuracy.

We can see that the two test sets each take up about 10% of the data frame, while the training set takes up about 80% of it.
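
The split shares above can be checked with a few lines of code. Here is a minimal sketch using only the standard library. The column names `emotion`, `pixels`, and `Usage` are my assumption about the CSV layout, and the rows are a tiny made-up sample rather than real FER2013 data.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed FER2013 layout: emotion, pixels, Usage.
SAMPLE_CSV = """emotion,pixels,Usage
0,70 80 82,Training
3,151 150 147,Training
2,231 212 156,Training
4,24 32 36,Training
6,4 0 0,Training
3,55 55 55,Training
5,20 17 19,Training
1,77 78 79,Training
0,254 254 254,PublicTest
4,156 184 198,PrivateTest
"""

def split_shares(csv_text):
    """Return {usage value: fraction of rows} for a FER2013-style CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    counts = Counter(row["Usage"] for row in rows)
    total = sum(counts.values())
    return {usage: n / total for usage, n in counts.items()}

shares = split_shares(SAMPLE_CSV)
for usage, share in shares.items():
    print(f"{usage}: {share:.0%}")
```

On the real file you would pass the file contents instead of `SAMPLE_CSV`; counting the `emotion` column the same way gives the number of images per expression class.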

We can also see that there are 7 expression classes in total. So let us look at the photos alongside their expression class numbers and work out which class represents which mood.



While running the code, I learned that each image has 2304 pixel values, that is, each picture is 48x48. After looking at the photos, I also concluded that 0 represents anger, 1 disgust, 2 fear, 3 happiness, 4 sadness, 5 surprise, and 6 neutral.
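
To make the 2304 = 48x48 relationship concrete, here is a small sketch of how one space-separated pixel string can be turned into a 48x48 grid. The function name `decode_pixels` is hypothetical, and the pixel string is generated rather than taken from the dataset.

```python
def decode_pixels(pixel_string, side=48):
    """Turn a space-separated string of grey values into a side x side grid."""
    values = [int(v) for v in pixel_string.split()]
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(values)}")
    # Row r holds the values from r*side up to (r+1)*side.
    return [values[r * side:(r + 1) * side] for r in range(side)]

# Generated stand-in for one row of the pixels column (not real data).
fake_pixels = " ".join(str(i % 256) for i in range(48 * 48))
image = decode_pixels(fake_pixels)
print(len(image), len(image[0]))
```

The dataset class later in the article does the same thing with NumPy's `reshape(48, 48, 1)`.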

classes = {
    0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'
}

Since we have fairly little data for a big deep learning model, we will apply some transformations to the images. To effectively get more data out of the dataset and avoid overfitting, I am going to apply a random 48x48 crop to each image after padding it by 4 pixels on each side. I will also flip the images horizontally at random with a probability of 50%. After everything else, I am adding normalization with a mean of 0.5 and a standard deviation of 0.5.

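# this is for the transforms (ToPILImage, the flip and ToTensor are filled in from the description above)
train_trfm = transforms.Compose([
        transforms.ToPILImage(),                   # numpy array -> PIL image
        transforms.RandomCrop(48, padding=4, padding_mode='reflect'),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),                     # PIL image -> float tensor in [0, 1]
        transforms.Normalize((0.5,), (0.5,), inplace=True),
])
val_trfm = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
])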
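
To see what the normalization step does to a single pixel, here is a dependency-free sketch of the arithmetic. It assumes the image was first scaled from 0..255 to 0..1, which is what a tensor conversion typically does for 8-bit images. The helper names are hypothetical.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit grey value to the range the model sees."""
    scaled = value / 255.0          # 0..255 -> 0..1
    return (scaled - mean) / std    # 0..1 -> -1..1 for mean = std = 0.5

def horizontal_flip(image):
    """Mirror every row of a 2D list, as a horizontal flip does."""
    return [row[::-1] for row in image]

print(normalize_pixel(0), normalize_pixel(255))
print(horizontal_flip([[1, 2, 3], [4, 5, 6]]))
```

So black pixels end up at -1, white pixels at 1, and mid-grey close to 0, which keeps the inputs centred around zero.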

Now we will create the dataset class, which also transforms our pixel strings into shaped tensors that the model can use later. I am going to use the test data as the validation dataset, since the labels for the test images are available.

# Creating the class for our dataset for the FER
class FERDataset(Dataset):
    def __init__(self, images, labels, transforms):
        self.X = images
        self.y = labels
        self.transforms = transforms
    def __len__(self):
        return len(self.X)
    def __getitem__(self, i):
        data = [int(m) for m in self.X[i].split(' ')]
        data = np.asarray(data).astype(np.uint8).reshape(48,48,1)
        data = self.transforms(data)
        label = self.y[i]
        return (data, label)
# assigning the transformed data
train_data = FERDataset(train_images, train_labels, train_trfm)
val_data = FERDataset(test_images, test_labels, val_trfm)

Now that the dataset is fully ready, we will get our data loaders ready with a batch size of 400 for training and twice that for validation. We keep the extra validation data loader because it is good practice: it lets us know when the model may be overfitting and helps us understand the accuracy better.

batch_num = 400

train_dl = DataLoader(train_data, batch_num, shuffle=True, num_workers=4, pin_memory=True)
val_dl = DataLoader(val_data, batch_num*2, num_workers=4, pin_memory=True)

Since everything is ready now with the data, let us take a look at the photos of the first batch of the training data loader.



Everything is perfect now with the dataset!! So now we can begin with our model!


Getting The Model Ready

Since the data is fully ready to utilize, we will be creating the model now. The first thing I will be doing is creating a base classifier class that will hold on to the validation data loss and accuracy and the overall epoch loss and accuracy.

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds==labels).item()/len(preds))

class FERBase(nn.Module):
    # this takes in a batch from the training dl
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                     # calls the training model and generates predictions
        loss = F.cross_entropy(out, labels)    # calculates loss compare to real labels using cross entropy
        return loss
    # this takes in batch from validation dl
    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)            # calls the accuracy function to measure the accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc}
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()    # finds out the mean loss of the epoch batch
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()       # finds out the mean acc of the epoch batch
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}
    def epoch_end(self, epoch, result):
        print("Epoch [{}], last_lr: {:.5f}, train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['lrs'][-1], result['train_loss'], result['val_loss'], result['val_acc']))
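
For readers who want to see what the `accuracy` function computes without any tensors involved, here is a plain-Python sketch of the same idea: pick the index of the largest score in each row and compare it with the label. The name `accuracy_plain` and the numbers are made up for illustration.

```python
def accuracy_plain(outputs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in outputs]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

# Four made-up score rows for a 3-class problem.
scores = [
    [0.1, 2.0, -1.0],   # predicts class 1
    [3.0, 0.2, 0.1],    # predicts class 0
    [0.0, 0.1, 0.5],    # predicts class 2
    [1.0, 0.9, 0.8],    # predicts class 0
]
print(accuracy_plain(scores, [1, 0, 2, 2]))
```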

Then I am going to create a convolutional neural network with residual connections. It gradually increases the number of channels of the facial data while decreasing the spatial dimensions, and is followed by a fully connected layer that outputs an array of 7 raw scores (logits), one per facial expression class. A softmax turns these scores into probabilities at prediction time.

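def conv_block(in_chnl, out_chnl, pool=False, padding=1):
    layers = [
        nn.Conv2d(in_chnl, out_chnl, kernel_size=3, padding=padding),
        # batch norm and ReLU are assumed here, as in the usual ResNet9 conv block
        nn.BatchNorm2d(out_chnl),
        nn.ReLU(inplace=True)]
    if pool: layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)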

class FERModel(FERBase):
    def __init__(self, in_chnls, num_cls):
        super().__init__()    # needed so that nn.Module can register the layers below
        self.conv1 = conv_block(in_chnls, 64, pool=True)           # 64x24x24 
        self.conv2 = conv_block(64, 128, pool=True)                # 128x12x12
        self.resnet1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))    # Resnet layer 1: includes 2 conv2d
        self.conv3 = conv_block(128, 256, pool=True)       # 256x6x6 
        self.conv4 = conv_block(256, 512, pool=True)       # 512x3x3
        self.resnet2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))    # Resnet layer 2: includes 2 conv2d
        self.classifier = nn.Sequential(nn.MaxPool2d(3),            # 512x1x1
                                        nn.Flatten(),               # 512
                                        nn.Linear(512, num_cls))    # num_cls
    def forward(self, xb):
        out = self.conv1(xb)
        out = self.conv2(out)
        out = self.resnet1(out) + out
        out = self.conv3(out)
        out = self.conv4(out)
        out = self.resnet2(out) + out
        return self.classifier(out)

The network architecture I used here is commonly known as ResNet9.
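
The shape comments in the model code can be verified with simple arithmetic. Here is a sketch that traces a 48x48 input through the stages, assuming 3x3 convolutions with padding 1 (which keep the size) and 2x2 max pooling (which halves it). The function names are hypothetical, and each residual stage is traced as a single size-preserving step.

```python
def conv_out(size, kernel=3, padding=1):
    """Output size of a stride-1 convolution."""
    return size + 2 * padding - kernel + 1

def pool_out(size, window=2):
    """Output size of a max pool whose stride equals its window."""
    return size // window

def trace_shapes(side=48):
    """Return (stage, channels, height, width) after each stage of the model."""
    stages = [
        ("conv1", 64, True), ("conv2", 128, True), ("resnet1", 128, False),
        ("conv3", 256, True), ("conv4", 512, True), ("resnet2", 512, False),
    ]
    shapes = []
    for name, channels, pool in stages:
        side = conv_out(side)
        if pool:
            side = pool_out(side)
        shapes.append((name, channels, side, side))
    side = pool_out(side, window=3)   # the MaxPool2d(3) in the classifier
    shapes.append(("classifier pool", 512, side, side))
    return shapes

for stage in trace_shapes():
    print(stage)
```

The last entry is 512x1x1, which is why the final linear layer takes 512 inputs.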

The model is all working. Next, we set up torch so that it can use the GPU for training and load the datasets onto the device available to us, which is a GPU in our case. We can now happily head on to training our model!


Training The Model

For this model, I am going to use the 1Cycle learning rate scheduler, where the learning rate is not set by hand: it starts very low, increases to a maximum, and then decreases again.

@torch.no_grad()    # disables gradient tracking, which evaluation does not need
def evaluate(model, val_loader):
    # This function will evaluate the model and give back the val acc and loss
    model.eval()    # put layers such as batch norm into inference mode
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

# getting the current learning rate
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

# this fit function follows the intuition of 1cycle lr
def fit(epochs, max_lr, model, train_loader=train_dl, val_loader=val_dl, weight_decay=0, grad_clip=None, opt_func=torch.optim.Adam):
    history = []    # keep track of the evaluation results
    # setting up custom optimizer including weight decay
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # setting up 1cycle lr scheduler
    sched = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        # training
        model.train()    # back to training mode after the previous evaluation
        train_losses = []
        lrs = []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss.detach())
            loss.backward()
            # gradient clipping
            if grad_clip:
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            # record the lr, then advance the 1cycle schedule
            lrs.append(get_lr(optimizer))
            sched.step()
        # validation
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        result['lrs'] = lrs
        model.epoch_end(epoch, result)
        history.append(result)
    return history
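
To get a feel for the shape of a 1Cycle schedule, here is a simplified, dependency-free sketch with cosine warm-up and cool-down. It is not the exact scheduler PyTorch implements; the defaults `pct_start=0.3`, `div_factor=25` and `final_div_factor=1e4` are chosen to mirror the scheduler's documented defaults as I understand them, and the function names are hypothetical.

```python
import math

def _cosine_blend(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

def one_cycle_lr(step, total_steps, max_lr,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step of a simplified one-cycle schedule."""
    initial_lr = max_lr / div_factor          # where the warm-up starts
    min_lr = initial_lr / final_div_factor    # where the cool-down ends
    peak_step = pct_start * (total_steps - 1)
    if step <= peak_step:
        return _cosine_blend(initial_lr, max_lr, step / peak_step)
    remaining = total_steps - 1 - peak_step
    return _cosine_blend(max_lr, min_lr, (step - peak_step) / remaining)

# 101 steps with the maximum learning rate used in this article.
schedule = [one_cycle_lr(s, 101, 0.001) for s in range(101)]
print(schedule[0], max(schedule), schedule[-1])
```

The rate climbs from a small starting value to `max_lr` over roughly the first 30% of the steps and then decays to a value far below where it started.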

I am going to clip each gradient value to the range -0.1 to 0.1 so that no single gradient descent step jumps too far. After a lot of training runs, I finally settled on 30 epochs and a maximum learning rate of 0.001 for the 1Cycle schedule. This gives me 70% accuracy at the moment.
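
Clipping by value simply clamps every gradient entry into a fixed range. Here is a plain-Python sketch of that idea on a made-up list of gradient values; the function name `clip_values` is hypothetical.

```python
def clip_values(grads, clip_value):
    """Clamp every gradient entry into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

# Made-up gradient values; only the large ones are changed.
print(clip_values([0.05, -0.3, 2.0, -0.1], 0.1))
```

Small gradients pass through untouched, while large ones are capped, so a single noisy batch cannot push the weights very far.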



You can find the log of all the model architectures I have tried and their corresponding accuracies in the Jupyter notebook in the completed repo, which I link at the end of this article.


Saving The Model

Since training the model is successfully done, we can now go on and save the model, so we can access it without retraining it every time we try to use the facial expression recognition script.

torch.save(model.state_dict(), 'FER2013-Resnet9.pth')


Getting The Scripts Ready

First things first, I got the FERModel.py file ready. Going through that file here would take a lot of brainstorming, so you can just go to the repo link and look at the file yourself.

Now time to script the main script!

Since we need to take the ROI (region of interest) of the face, transform it into a tensor, and feed it to the model to get a prediction, we need to define some functions that load the saved model and predict.

# function to turn photos to tensor
def img2tensor(x):
    transform = transforms.Compose(
            [transforms.ToTensor(),    # 48x48 uint8 array -> 1x48x48 float tensor in [0, 1]
             transforms.Normalize((0.5,), (0.5,))])
    return transform(x)

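# the model for predicting
model = FERModel(1, 7)
softmax = torch.nn.Softmax(dim=1)
# get_default_device() is assumed to return the GPU if one is available, otherwise the CPU
model.load_state_dict(torch.load('FER2013-Resnet9.pth', map_location=get_default_device()))
model.eval()    # inference mode

def predict(x):
    # x is the tensor returned by img2tensor; [None] adds the batch dimension
    with torch.no_grad():
        out = model(x[None])
    scaled = softmax(out)
    prob = torch.max(scaled).item()
    label = classes[torch.argmax(scaled).item()]
    return {'label': label, 'probability': prob}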
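
The softmax step in `predict` is what turns the 7 raw scores into probabilities that sum to 1. Here is a dependency-free sketch of that step; the scores are made up, and `predict_from_scores` is a hypothetical helper, not part of the repo.

```python
import math

CLASSES = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy',
           4: 'Sad', 5: 'Surprise', 6: 'Neutral'}

def softmax_plain(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_from_scores(scores):
    """Pick the most probable class and report its probability."""
    probs = softmax_plain(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {'label': CLASSES[best], 'probability': probs[best]}

# Made-up scores for one face; index 3 is the largest.
result = predict_from_scores([0.2, -1.3, 0.1, 2.5, 0.4, -0.7, 1.1])
print(result['label'])
```

Subtracting the largest score before exponentiating does not change the result but avoids overflow for large scores.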

Now we will write the script that connects to our camera, loads the Haar cascade .xml file to detect faces, feeds each face it finds to the prediction function, and gets the predicted label.

face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
vid = cv2.VideoCapture(0)    # 0 is assumed to be the default camera

while vid.isOpened():
    ok, frame = vid.read()
    if not ok:    # stop if the camera returns no frame
        break

    # takes in a gray coloured filter of the frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # running the haarcascade face detector on the gray frame
    faces = face_cascade.detectMultiScale(gray)
    for (x,y,w,h) in faces:
        cv2.rectangle(frame, (x,y), (x+w,y+h), (0,255,0), 2)

        # takes the region of interest of the face only in gray
        roi_gray = gray[y:y+h, x:x+w]
        resized = cv2.resize(roi_gray, (48, 48))    # resizes to 48x48 sized image

        # predict the mood
        img = img2tensor(resized)
        prediction = predict(img)

        cv2.putText(frame, f"{prediction['label']}", (x, y-10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,255,0))

    cv2.imshow('video', frame)
    if cv2.waitKey(1) == 27:    # the Esc key quits
        break

vid.release()
cv2.destroyAllWindows()

Now everything is done all right and ready to run!!

End Note

The code is done! Setting up the data was a bit tiring. The data is also somewhat biased: in most cases the fearful, sad, and angry faces look very much alike, which is why even the final model finds it difficult to tell them apart.

Here is the link to the repository. You can download it and try it out yourself or take reference from it.

This was one of the most interesting deep learning projects I have ever worked on. Thanks to zerotogans.com for the inspiration!!


About Author

Tumin Sharma

Hello! I am Tumin Sharma. I'm an aspiring Data Scientist and I am intermediately proficient in python.
