In Part 2, we successfully trained a Transformer model to map sequences of body keypoints to sign language glosses using CTC loss. However, training on pre-segmented videos is one thing; making it work in the real world—where a webcam stream is infinite and boundaries are unknown—is an entirely different beast. In this article, we tear down inference/realtime.py, the beating heart of the asl-to-vo