Case Study: MockAI — Syed Asjad Abbas

Problem & Challenge

Traditional interview preparation is bottlenecked by slow manual review, subjective assessment bias, and the inability to track non-verbal signals (such as emotional fluctuations and vocal rhythm) in real time. Automated grading systems typically evaluate only the text of a candidate's answer, missing vital cues such as confidence level, eye-contact drift, and the structural relevance of the answer.

The core engineering challenge was to build a system that concurrently processes live WebRTC camera streams, transcribes the candidate's speech, monitors emotional attributes from video frames, and semantically grades textual responses, all without introducing lag or socket connection timeouts on the client interface.

Technical Approach & Architecture

We designed an asynchronous micro-pipeline around a FastAPI WebSocket backend that acts as a central coordinator, routing raw client media feeds to isolated processing loops. This decoupled structure ensures that heavy computer vision calculations do not interfere with frame capture or audio processing.

Live Pipeline Stream Routing

Client browser (React WebRTC)

Streams video/audio frames via WebSocket links

↓

FastAPI Core Gateway

Ingests sockets, chunks buffer streams, routes asynchronously

↓

Isolated Worker Processes

Queue 1: DeepFace CV (Emotions) · Queue 2: Whisper API (Transcription)

↓

BERT Scoring & DB Flush

Semantic comparison against response schemas & save to MongoDB

Rather than running multi-threaded loops inside a single process, the system offloads heavy computation to independent workers. Frame analysis runs on a sampled subset of frames, while transcription tasks run in an asynchronous consumer queue.
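The routing step in the gateway can be sketched with plain asyncio. This is a minimal illustration, not the project's code: the JSON envelope with a "kind" field, the function name route_messages, and the queue sizes are all assumptions made for the example.

```python
import asyncio
import json


async def route_messages(incoming, video_queue, audio_queue):
    """Read tagged messages and hand each one to the matching queue.

    `incoming` is any async iterator of JSON strings (hypothetical envelope:
    {"kind": "video" | "audio", "ts": <client timestamp>, "data": ...}).
    """
    async for raw in incoming:
        message = json.loads(raw)
        if message["kind"] == "video":
            await video_queue.put(message)
        elif message["kind"] == "audio":
            await audio_queue.put(message)
        # Unknown kinds are ignored in this sketch.


async def demo():
    # Stand-in for a live socket: three messages, two video and one audio.
    async def fake_socket():
        yield json.dumps({"kind": "video", "ts": 1, "data": "frame-0"})
        yield json.dumps({"kind": "audio", "ts": 1, "data": "chunk-0"})
        yield json.dumps({"kind": "video", "ts": 2, "data": "frame-1"})

    video_queue = asyncio.Queue(maxsize=8)
    audio_queue = asyncio.Queue(maxsize=8)
    await route_messages(fake_socket(), video_queue, audio_queue)
    return video_queue.qsize(), audio_queue.qsize()
```

In a FastAPI endpoint, the accepted WebSocket's text iterator could plausibly play the role of incoming, with separate consumer tasks draining each queue.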

Engineering Challenges & Solutions

1. The GIL Bottleneck & Socket Dropouts

During initial testing, running DeepFace frame evaluations directly on incoming WebSocket buffers blocked the main asyncio event loop: the calls are synchronous and CPU-bound, and Python's Global Interpreter Lock (GIL) limits how much of that work can run in parallel with the loop. The stalled loop could not service the socket in time, causing the client connection to drop within 10–12 seconds of streaming.

Solution: We resolved this by isolating the computer vision pipeline in a separate process using Python's multiprocessing.Process class, with multiprocessing.Queue for inter-process communication. Frame capture runs continuously on the asyncio event loop and places frames on the queue, while the CV worker process pulls and processes frames independently at its own pace.
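A minimal sketch of this process isolation follows. The names analyze_frame, cv_worker, offer_frame, and start_cv_worker are hypothetical, analyze_frame is a placeholder for the real emotion model call, and dropping frames when the queue is full is an assumption of this sketch rather than something the project is known to do.

```python
import multiprocessing as mp
import queue


def analyze_frame(frame):
    """Placeholder for the real emotion model call (hypothetical)."""
    return {"ts": frame["ts"], "emotion": "neutral"}


def cv_worker(frame_queue, result_queue):
    """Runs in the worker process: pull frames until a None sentinel arrives."""
    while True:
        frame = frame_queue.get()
        if frame is None:
            break
        result_queue.put(analyze_frame(frame))


def offer_frame(frame_queue, frame):
    """Non-blocking enqueue for the event loop side.

    Returns False (and drops the frame) when the worker is behind, so the
    event loop never waits on the CV pipeline. Dropping is an assumption.
    """
    try:
        frame_queue.put_nowait(frame)
        return True
    except queue.Full:
        return False


def start_cv_worker(maxsize=32):
    """Create the queues and start the CV worker as a separate process."""
    frame_queue = mp.Queue(maxsize=maxsize)
    result_queue = mp.Queue()
    worker = mp.Process(
        target=cv_worker, args=(frame_queue, result_queue), daemon=True
    )
    worker.start()
    return worker, frame_queue, result_queue
```

The bounded queue gives simple backpressure: the producer side stays responsive, and the worker sets the effective analysis rate. On platforms that use the spawn start method, start_cv_worker should be called from under an `if __name__ == "__main__":` guard.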

2. Audio-Video Synchronization Drift

Network latency jitter during live WebRTC streaming created drift between the timestamp of candidate emotional changes and their corresponding verbal responses, leading to skewed evaluation charts.

Solution: We implemented a timestamp synchronization protocol in which each video frame and audio packet is tagged with a client-generated epoch timestamp. On the backend, evaluations are matched and aligned on these timestamps, so the final report charts accurately show what candidates were saying at the moment they displayed key emotional markers.
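The alignment step can be illustrated as follows. This is a sketch under assumptions: the tuple layouts, the millisecond unit, and the function name align_emotions are invented for the example, and transcript segments are assumed to be sorted and non-overlapping.

```python
from bisect import bisect_right


def align_emotions(segments, emotions):
    """Attach each emotion sample to the transcript segment being spoken.

    segments: list of (start_ms, end_ms, text), sorted by start_ms and
              non-overlapping (assumed layout).
    emotions: list of (ts_ms, label) using the same client clock.
    """
    starts = [start for start, _, _ in segments]
    aligned = {i: [] for i in range(len(segments))}
    for ts, label in emotions:
        # Index of the last segment that starts at or before this timestamp.
        i = bisect_right(starts, ts) - 1
        # Keep the sample only if it falls before that segment's end.
        if i >= 0 and ts < segments[i][1]:
            aligned[i].append(label)
    return [(text, aligned[i]) for i, (_, _, text) in enumerate(segments)]


# Small relative timestamps are used here for readability; real values
# would be client epoch timestamps.
segments = [(0, 2000, "Hello"), (2000, 5000, "I led a team")]
emotions = [(500, "neutral"), (2100, "happy"), (4900, "happy"), (6000, "sad")]
report = align_emotions(segments, emotions)
```

Because both streams carry timestamps from the same client clock, network jitter changes when a packet arrives but not where it lands in the report. Samples that fall outside every segment, such as the last one above, are left out.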

Optimization Decisions

Dynamic Frame Skipping: Rather than processing all 30 frames per second (which drove server CPU utilization to 100%), the system processes only every 4th frame (7.5 FPS). Emotion tracking remains accurate while the frame-analysis workload drops by 75%.
Model Cache Warmup: Model weights for the BERT semantic evaluator and the face detectors are preloaded into memory at server startup. This cuts the delay before the first candidate answer is evaluated from 8.2 seconds to under 0.4 seconds.
Optimized Database Writing: Metric data points destined for MongoDB Atlas are collected in an in-memory buffer and written in a single bulk write every 5 seconds, rather than one write per frame. This reduces round trips and keeps the database connections from being saturated.
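Two of these decisions, fixed-stride frame sampling and buffered bulk writes, can be sketched together. The names should_process and MetricBuffer are hypothetical, and bulk_write stands in for whatever database call performs the batch insert.

```python
def should_process(frame_index, stride=4):
    """Keep every `stride`-th frame: 30 FPS with stride 4 averages 7.5 FPS."""
    return frame_index % stride == 0


class MetricBuffer:
    """Collect metric documents and hand them off in batches.

    `bulk_write` is any callable that accepts a list of documents
    (hypothetical hook for the database layer).
    """

    def __init__(self, bulk_write, flush_interval=5.0):
        self.bulk_write = bulk_write
        self.flush_interval = flush_interval
        self.pending = []
        self.last_flush = None

    def add(self, doc, now):
        """Buffer one document; flush once the interval has elapsed."""
        if self.last_flush is None:
            self.last_flush = now
        self.pending.append(doc)
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        """Write everything pending as one batch and restart the timer."""
        if self.pending:
            self.bulk_write(self.pending)
            self.pending = []
        self.last_flush = now
```

With pymongo, a collection's insert_many method would be a natural fit for bulk_write. This sketch only flushes when a new document arrives, so a production version would also want a timer-driven flush and a final flush at session end.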

Deployment & Production Notes

The system is deployed on a Linux VPS using Gunicorn with Uvicorn workers behind an Nginx reverse proxy. The frontend is built and served via Vercel to speed up asset delivery, and media streams travel over an authenticated, TLS-secured WebSocket connection.

For scaling, a Docker Compose setup isolates the FastAPI backend nodes and manages the pool of worker instances, which keeps upgrades modular.