similarfaces2.me

A distributed facial recognition pipeline

Instructions

Drag and drop an image from your favorite social media, or click to upload

Example GIF

Detection performance deteriorates with higher resolution photos. To minimize scope, this app was built and tested using lower resolution photos like those on LinkedIn.

Facial Analysis

Age, gender, emotion, and race results populate here once an image is analyzed.

About

Hello, I am a software engineer currently studying at the University of Alberta.

This is a machine learning project that I built to develop my skills in deploying, serving and scaling machine learning models.

There is zero caching of results in this application. Every inference request is run fresh; what you see are results being computed and delivered to you in real time.

The Tech Stack

Technical Details

System Design

System Design Diagram

Let's go over how requests are processed and why I am using certain technologies.

  1. A user sends a request through the frontend. This is simple HTTP with an image as the payload. It is received by my Go backend, converted into JPEG (ideally, this conversion would live in my preprocessing step), and then temporarily uploaded to an S3 bucket for further processing on different nodes. Since facial recognition pipelines are multi-stage, I split the processing steps into individual microservices. A sketch of this handler appears after this list.
  2. After the image is available in S3, the middleware triggers a call to the preprocessing service. All internal communication is handled via gRPC.
  3. The preprocessing service downloads the image from S3, then rescales and pads it into the 800x800 format my face extraction model expects. The image is then put through a face extraction model, which extracts the face found in the image and uploads it to S3 for use by the other models. The 800x800 shape is an arbitrary size I settled on after compiling my models with the Neuron compiler; compilation produces a model with a fixed input size so it can take full advantage of hardware-specific optimizations.
  4. Once the preprocessed image is available, the backend triggers multiple inference calls (in parallel) to the model servers, which pull the preprocessed image down from S3 and run their models on it. Six models are run across the pipeline: face extraction (during preprocessing), face embedding, emotion classification, age classification, race classification, and gender classification. You might have noticed that face extraction and the similarity search run significantly faster than everything else. That is because inf1.xlarge instances are expensive to run, so I only run the largest models on them (face extraction and embedding creation) and put everything else on CPU instances. By delivering results as soon as they are individually available, I can hide some of the latency and make the app feel a lot faster than it is.
  5. The frontend polls the backend for results, populating fields as they become available.
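
To make steps 1 and 2 concrete, here is a minimal sketch of what the upload handler on the Go backend could look like. Everything here is illustrative: the bucket name, route, job key, and the preprocessing trigger are placeholders I made up, and the real handler may accept multipart form data rather than a raw image body.

```go
// Hypothetical sketch of the upload flow described in steps 1 and 2.
// Bucket, route, and service names are placeholders, not the real code.
package main

import (
	"bytes"
	"context"
	"image"
	_ "image/gif" // register decoders for common upload formats
	"image/jpeg"
	_ "image/png"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
	"github.com/google/uuid"
)

func handleUpload(w http.ResponseWriter, r *http.Request) {
	// 1. Decode whatever the user uploaded and normalize it to JPEG.
	img, _, err := image.Decode(r.Body)
	if err != nil {
		http.Error(w, "could not decode image", http.StatusBadRequest)
		return
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: 90}); err != nil {
		http.Error(w, "could not encode jpeg", http.StatusInternalServerError)
		return
	}

	// 2. Stage the JPEG in S3 so downstream services can fetch it by key.
	key := uuid.NewString() + ".jpg"
	uploader := s3manager.NewUploader(session.Must(session.NewSession()))
	_, err = uploader.UploadWithContext(r.Context(), &s3manager.UploadInput{
		Bucket: aws.String("similarfaces-uploads"), // placeholder bucket name
		Key:    aws.String(key),
		Body:   bytes.NewReader(buf.Bytes()),
	})
	if err != nil {
		http.Error(w, "upload failed", http.StatusInternalServerError)
		return
	}

	// 3. Kick off preprocessing over gRPC and return a job key the
	//    frontend can poll for results.
	go triggerPreprocessing(context.Background(), key)
	w.Write([]byte(key))
}

// triggerPreprocessing would call the preprocessor service's gRPC endpoint.
// The client setup and request type are omitted here.
func triggerPreprocessing(ctx context.Context, key string) {
	// preprocessorClient.Preprocess(ctx, &pb.PreprocessRequest{S3Key: key}) // placeholder
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```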

Service Definitions

These are proto stubs; they define the endpoints my services use to communicate with each other.

1. Preprocessor

Preprocessor service definition

The preprocessor service rescales and pads input images to the desired 800x800 shape. It then extracts the image of the face and uploads it to S3 so the other model servers can use it. Byproducts of face extraction are the bounding box of the face and coordinates for the eyes, which are drawn on the frontend. This runs on an inf1.xlarge instance and is accelerated.
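
As an illustration of the rescale-and-pad step, here is one way to letterbox an image into an 800x800 canvas using golang.org/x/image/draw. The actual service may well do this differently (and in a different language); the function name and package are mine, not the service's.

```go
// letterbox scales src to fit inside a size x size canvas while preserving
// aspect ratio, then pads the remainder with black. Illustrative sketch of
// the preprocessing step described above, not the service's actual code.
package preprocess

import (
	"image"
	"image/color"
	"image/draw"

	xdraw "golang.org/x/image/draw"
)

func letterbox(src image.Image, size int) *image.RGBA {
	dst := image.NewRGBA(image.Rect(0, 0, size, size))
	// Fill the canvas with black so the padded regions are uniform.
	draw.Draw(dst, dst.Bounds(), &image.Uniform{C: color.Black}, image.Point{}, draw.Src)

	// Compute scaled dimensions that preserve the aspect ratio.
	b := src.Bounds()
	scale := float64(size) / float64(maxInt(b.Dx(), b.Dy()))
	w := int(float64(b.Dx()) * scale)
	h := int(float64(b.Dy()) * scale)

	// Center the scaled image on the canvas.
	offset := image.Pt((size-w)/2, (size-h)/2)
	target := image.Rect(offset.X, offset.Y, offset.X+w, offset.Y+h)
	xdraw.CatmullRom.Scale(dst, target, src, b, xdraw.Over, nil)
	return dst
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}
```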

2. Embedder

Embedder service definition

The embedder model server pulls down the preprocessed image of the face and transforms it into a 512-dimensional vector. This vector is used by the backend to query the vector database. This runs on an inf1.xlarge instance and is accelerated.
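
The similarity search itself comes down to comparing 512-dimensional embeddings. The vector database and distance metric aren't spelled out here, but cosine similarity over L2-normalized vectors is the usual choice for face embeddings; the sketch below shows both operations.

```go
// Illustrative helpers for comparing face embeddings. With L2-normalized
// vectors, cosine similarity reduces to a plain dot product, which is what
// most vector databases compute internally.
package embeddings

import "math"

// normalize returns v scaled to unit length (L2 norm of 1).
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)
	out := make([]float32, len(v))
	if norm == 0 {
		return out
	}
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

// cosine returns the cosine similarity between two equal-length embeddings.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```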

3. Analyzer

Analyzer service definition

The 'analyzers' are a group of four models (age, gender, race, and emotion) that I bundle together because they all do similar things. They run on CPU instances because they are smaller and don't need as much compute to respond to requests in a reasonable amount of time.
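
Step 4 above mentions that the analyzer calls go out in parallel and that each result is surfaced as soon as it lands. One common way to express that fan-out in Go is a goroutine per analyzer with a shared store callback; the Analyzer interface and store function below are placeholders, not the generated gRPC types the backend actually uses.

```go
// Fan-out sketch for the parallel analyzer calls described in step 4.
// Analyzer and store are stand-ins for the real gRPC clients and result store.
package backend

import (
	"context"
	"sync"
)

type Analyzer interface {
	Analyze(ctx context.Context, s3Key string) (string, error)
}

// runAnalyzers calls every analyzer concurrently and hands each result to
// store as soon as it arrives, so the frontend's polling loop can display it
// without waiting for the slowest model.
func runAnalyzers(ctx context.Context, key string, analyzers map[string]Analyzer, store func(name, value string)) {
	var wg sync.WaitGroup
	for name, a := range analyzers {
		wg.Add(1)
		go func(name string, a Analyzer) {
			defer wg.Done()
			out, err := a.Analyze(ctx, key)
			if err != nil {
				return // the real service would log and surface this error
			}
			store(name, out)
		}(name, a)
	}
	wg.Wait()
}
```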

Scaling

The distributed architecture of this app was built with scaling in mind. Because each service in the pipeline is (relatively) isolated, I am able to set up autoscaling node groups in EKS which will spin up new backends, analyzers, or preprocessors whenever there is continuous load on my services. My Kubernetes deployments and services will then adjust to these node groups, create replicas, and route requests to them when they are available. (I don't do this right now because I don't want Bezos to take all of my tuition money.)
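
For illustration only, the deployment-level half of that story could be expressed as a HorizontalPodAutoscaler like the one below. The names, replica counts, and CPU threshold are all made up, and as noted above, autoscaling isn't actually enabled in the deployed app.

```yaml
# Hypothetical HorizontalPodAutoscaler for the analyzer deployment.
# Names and thresholds are placeholders; this is not currently applied.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analyzer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analyzer
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```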