Lip sync is the alignment of audio with the corresponding lip movements in live or recorded video, and a critical element of good user experience. Managing lip sync is particularly challenging in live environments, since the viewer can simultaneously watch the event itself and a reproduction of the event on a monitor or screen. Not only is the window of acceptable latency for lip sync razor thin in this scenario, the problem is compounded by the fact that the audio and video are often processed by separate systems. Thus, it's important to manage customers' expectations with regard to the latency they should expect from their media distribution system.

Before we focus on the camera’s role in the latency chain, let’s revisit some of the other factors involved in media transmission.

What is Latency?
In the context of networks, latency is the time it takes for a signal to travel over the connection (be it copper or fiber) from the source endpoint/node to its destination endpoint/node. In signal processing, it's the amount of time from the moment a signal enters a processing circuit until it emerges at the output. This processing can take the form of an equalizer for an audio signal, resolution scaling of a video signal, or even the amount of time it takes for a monitor to display an image (i.e., display lag). System latency is additive: each step in the processing chain adds its share to the overall transmission latency.
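To make that additive behavior concrete, here is a minimal sketch. The stage names and millisecond values are illustrative assumptions, not measured figures:

```python
# Minimal sketch: system latency is the sum of every stage's latency.
# Stage names and millisecond values are hypothetical, for illustration only.
stages_ms = {
    "audio EQ": 1.5,
    "video scaling": 8.0,
    "network transit": 4.0,
    "display lag": 16.7,
}

total_ms = sum(stages_ms.values())
print(f"End-to-end latency: {total_ms:.1f} ms")  # 30.2 ms
```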

Another common case to consider is when the same audio or video signal is transmitted to multiple endpoints (speakers and displays) within the same space or room. In that scenario, it's important for the signal to reach each endpoint simultaneously, or as close to it as possible. If there is noticeable variation in delivery times, the un-synced video can be very distracting, while the misaligned audio can result in phase cancellations.

All systems have latency, but it may be small enough to have no noticeable impact on the user experience. However, it's never "zero." If a product is touted as having zero latency, the data is likely erroneous (or the marketing department is in full spin mode).

Processing Steps in Video Transmission
It's important to understand the basic steps involved in transmitting video over a network in order to interpret manufacturers' latency claims and relate them to lip sync performance. Once a signal is received at the input port of an encoder, a number of DSP operations take place: 1) scaling the signal for transmission, 2) frame rate conversion (if needed), 3) chroma sub-sampling (if needed), 4) compressing the signal (if needed), and more. The signal is then transmitted over the network (for pro AV applications, the transmission has to be deterministic, not best effort), and lastly, the signal is reconverted into a replica of the original stream.
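As a rough illustration of that ordering (the frame descriptor and stage logic below are hypothetical, not any encoder's actual API), the encode side can be modeled as a short sequence of conditional transforms:

```python
# Hypothetical model of the encode-side DSP chain; each step is applied
# only when the source and target formats differ.
def encode_pipeline(frame: dict, target: dict) -> dict:
    if frame["resolution"] != target["resolution"]:
        frame = {**frame, "resolution": target["resolution"]}  # 1) scaling
    if frame["fps"] != target["fps"]:
        frame = {**frame, "fps": target["fps"]}                # 2) frame rate conversion
    if frame["chroma"] != target["chroma"]:
        frame = {**frame, "chroma": target["chroma"]}          # 3) chroma sub-sampling
    if target.get("codec"):
        frame = {**frame, "codec": target["codec"]}            # 4) compression
    return frame

source = {"resolution": "3840x2160", "fps": 60, "chroma": "4:4:4", "codec": None}
target = {"resolution": "1920x1080", "fps": 30, "chroma": "4:2:0", "codec": "H.264"}
print(encode_pipeline(source, target))
```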

[Figure: the video transmission processing chain]

You really need to read the fine print on manufacturers' latency measurements, because the number of processing steps involved may differ from one product to the next. For instance, if Product A claims network transit latency of 1 frame (17ms at 60fps – the single dark blue step above) and Product B states port-to-port system transit latency of 2 frames (33ms at 60fps – including all of the steps above), Product A is not necessarily faster than Product B, since the two figures don't measure the same span of the chain. Remember to note this distinction and educate your customers about it, since 4K cameras are often – but not always – the single greatest contributor to latency¹ in a typical pro AV signal chain.
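The frame-to-millisecond conversion, and the pitfall of comparing unlike measurement spans, can be sketched like this (the product figures are the hypothetical ones from the example above, and the 1.5-frame encode/decode overhead added to Product A is an assumption):

```python
# Convert a latency quoted in frames to milliseconds at a given frame rate.
def frames_to_ms(frames: float, fps: float) -> float:
    return frames * 1000.0 / fps

# Product A quotes network transit only; Product B quotes port-to-port.
# To compare fairly, add the missing stages to Product A's figure.
# The 1.5-frame encode/decode overhead here is a hypothetical assumption.
product_a_transit = frames_to_ms(1, 60)                             # ~16.7 ms, network only
product_a_port_to_port = product_a_transit + frames_to_ms(1.5, 60)  # assumed overhead
product_b_port_to_port = frames_to_ms(2, 60)                        # ~33.3 ms, all steps

print(f"A (normalized): {product_a_port_to_port:.1f} ms")  # ~41.7 ms
print(f"B:              {product_b_port_to_port:.1f} ms")  # ~33.3 ms
```

Once both figures cover the same span, the "faster" product can turn out to be the slower one.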

The Basics of Digital Camera Processing
Many digital cameras use a charge-coupled device (CCD) to convert incoming light into an electrical charge, which is then read out and digitized. In a CCD image sensor, there is a photoactive region constructed from an ultra-thin layer of silicon, followed by an underlying transmission region made out of a shift register (nerd out on Wikipedia if you like).

An image is projected through a lens onto the photoactive region, causing each capacitor (there's one capacitor for each pixel) to accumulate an electric charge proportional to the light intensity at that location. An A/D converter measures the charge and creates a digital value representing the light level at each pixel. Then, an onboard signal processor interpolates the data from neighboring pixels (demosaicing) to create natural color. On many cameras, it's possible to see the output on a flip-out LCD at this stage. Lastly, some cameras may perform a preset level of compression on the data before outputting the video stream.
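The charge-to-number step is straightforward quantization; here is a toy sketch (the bit depth and full-well capacity are illustrative assumptions):

```python
# Toy A/D conversion: map an accumulated charge to an n-bit digital code.
# full_well (maximum charge the pixel can hold) and bit depth are
# illustrative assumptions, not any particular sensor's specs.
def adc(charge: float, full_well: float = 30000.0, bits: int = 12) -> int:
    charge = max(0.0, min(charge, full_well))  # clip to the sensor's range
    return round(charge / full_well * (2**bits - 1))

print(adc(15000.0))  # mid-level charge -> 2048 of 4095 (12-bit range)
```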

Doing the Math
A 4K frame contains more than 8 million pixels, each with its own chroma and luminance data, and 4K cameras can very easily introduce 3-4 frames of latency (51-66ms at 60fps) before the video signal even reaches the input port of the encoder. Adding two frames of system transit latency brings the total to approximately 84-99ms. Finally, add about one frame (or more, depending on the display and stream parameters) for display lag at the output, for an overall latency of 101-116ms.
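Here's that budget as a quick calculation; small differences from the figures above come from rounding (the text uses 17ms per frame, the code uses the exact 1000/60):

```python
FRAME_MS = 1000 / 60  # ~16.7 ms per frame at 60 fps

camera_frames = (3, 4)  # 4K camera processing, per the estimate above
transit_frames = 2      # port-to-port system transit
display_frames = 1      # display lag (can be higher)

low = (camera_frames[0] + transit_frames + display_frames) * FRAME_MS
high = (camera_frames[1] + transit_frames + display_frames) * FRAME_MS
print(f"Overall latency: {low:.0f}-{high:.0f} ms")  # ~100-117 ms
```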

For comparison, lip sync error (the arrival of the audio signal relative to the video) should stay within +45 to -125 milliseconds² for most people not to find it annoying. In general, if the audio is offset by more than 200 milliseconds, it begins to have a negative impact on viewers' experience. At 101-116ms of video transmission latency, we are already approaching the outside edge of imperceptibility (and under theoretical conditions, no less): if the audio path is faster, that difference becomes the lip sync offset the viewer perceives. Processing the audio and video through separate systems will often introduce even more latency.
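A simple check of a measured offset against that window might look like this (thresholds as cited above; positive values mean the audio leads the video):

```python
# Lip sync offset check, using the detectability window cited above:
# audio leading by more than 45 ms, or lagging by more than 125 ms,
# is likely to be noticed by most viewers.
def lip_sync_ok(offset_ms: float,
                lead_limit: float = 45.0,
                lag_limit: float = -125.0) -> bool:
    return lag_limit <= offset_ms <= lead_limit

print(lip_sync_ok(-110.0))  # True: audio lags 110 ms, still inside the window
print(lip_sync_ok(-150.0))  # False: audio lags too far behind the video
```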

Key Takeaways
Deterministic transmission of AV is a must for live environments – the buffering involved in best-effort delivery causes too much delay. Networked media transmission protocols like AVB/TSN, CobraNet®, and Dante™ guarantee a network-wide deterministic latency, but only AVB/TSN can transport both audio and video signals (contrary to recent announcements, there are still no true video-over-Dante products). AVB-based solutions, such as Biamp's Tesira platform, provide significant advantages over processing audio and video separately. Since Tesira controls the entire signal path, it automatically calculates and implements all required internal delays to ensure that audio and video signals remain synchronized throughout the signal chain.
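Conceptually, that compensation amounts to delaying the faster path to match the slower one. Here is a minimal sketch (not Biamp's actual implementation, and the path latencies are hypothetical):

```python
# Sketch of audio/video alignment: delay the faster path so both arrive together.
# Path latencies are hypothetical; a real system would measure or derive them.
video_path_ms = 110.0  # e.g., camera + encode + transit + display
audio_path_ms = 12.0   # e.g., mic preamp + DSP + transit

audio_delay_ms = max(0.0, video_path_ms - audio_path_ms)
print(f"Insert {audio_delay_ms:.0f} ms of audio delay to align with video")  # 98 ms
```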

However, latency is unavoidable; it can only be mitigated. All of the components in the signal path contribute to the latency aggregate, so it’s very difficult to get an accurate sense of the overall latency when looking at only a portion of the system. One thing is certain: if a manufacturer claims their product has “zero latency,” further investigation is warranted since they’re breaking at least one immutable law of physics.


¹ Audio latency is negligible by comparison.

² ITU-R Recommendation BT.1359 specifies the recommended relative timing of sound and vision for broadcasting.