Understanding WebRTC: Real-Time Media Communication in Browsers
WebRTC (Web Real-Time Communication) is the core, and in practice the only widely supported, way to do real-time media communication from inside a browser. Applications that need sub-second latency rely heavily on it.
Why WebRTC?
WebRTC is indispensable for applications where immediate interaction is crucial. Here are some common use cases:
Multi-party calls: Platforms like Zoom and Google Meet.
1:1 calls: Applications like Omegle or virtual teaching platforms.
Real-time gaming: WebRTC also supports arbitrary data transfer (through data channels), which suits fast-updating applications such as 30 FPS games.
Despite its capabilities, WebRTC can become expensive as the number of users grows. For use cases where latency is less critical (e.g., live cricket matches or YouTube live streams), HLS (HTTP Live Streaming) is preferred.
HLS typically adds a delay of roughly 10 seconds but delivers high video quality. In contrast, WebRTC delivers ultra-low latency (on the order of 0.1 seconds) but demands more resources and costs more.
The Architecture of WebRTC
P2P (Peer-to-Peer)
WebRTC is a peer-to-peer (P2P) protocol. This means media is transferred directly between browsers without the need for a central media server. However, establishing a connection involves several components.
Signaling Server
Before browsers can communicate directly, they need to exchange their connection details (IP addresses and ports). This initial handshake is facilitated by a signaling server, typically implemented using WebSocket or HTTP.
Once the connection details are shared, the signaling server is no longer required for media transfer.
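To make the signaling server's role concrete, here is a minimal in-memory sketch of a relay. The message shapes and the SignalingRoom class are hypothetical; a real deployment would sit behind WebSocket or HTTP as described above, and WebRTC itself does not prescribe a wire format.

```typescript
// Messages the two browsers pass through the signaling server.
// These shapes are illustrative, not a standard wire format.
type SignalMessage =
  | { kind: "offer"; sdp: string }
  | { kind: "answer"; sdp: string }
  | { kind: "ice-candidate"; candidate: string };

type Deliver = (msg: SignalMessage) => void;

// A hypothetical in-memory "room" standing in for the signaling server.
class SignalingRoom {
  private peers: { id: string; deliver: Deliver }[] = [];

  join(id: string, deliver: Deliver): void {
    this.peers.push({ id, deliver });
  }

  // Relay a message from one peer to every other peer in the room.
  // The server only passes messages along; it never touches media.
  relay(fromId: string, msg: SignalMessage): void {
    this.peers
      .filter((p) => p.id !== fromId)
      .forEach((p) => p.deliver(msg));
  }
}
```

Each browser would call join once and then relay for its offer, answer, and candidates. Once both sides have what they need, nothing else has to flow through the room.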
NAT and STUN Servers: Handling Network Challenges
What is NAT?
Network Address Translation (NAT) translates local (private) IP addresses into public ones, allowing devices in a private network to access the internet. NAT also rewrites port numbers, so a device does not know the public address and port it is reachable on, which complicates direct P2P connections.
Learn more: GeeksforGeeks - NAT
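As a rough illustration of the translation NAT performs, the toy model below maps each private (IP, port) pair to a port on a single public IP. The ToyNat class and the addresses are made up for this sketch; real NAT devices differ in how they allocate and expire mappings.

```typescript
interface Endpoint {
  ip: string;
  port: number;
}

// A toy NAT: each private (ip, port) pair gets its own port on one public IP.
class ToyNat {
  private table: Record<string, number> = {};
  private publicIp: string;
  private nextPort: number;

  constructor(publicIp: string, firstPort: number) {
    this.publicIp = publicIp;
    this.nextPort = firstPort;
  }

  // Translate the source endpoint of an outgoing packet to its public form.
  translateOutbound(src: Endpoint): Endpoint {
    const key = src.ip + ":" + src.port;
    let port = this.table[key];
    if (port === undefined) {
      // First packet from this endpoint: allocate a new public port.
      port = this.nextPort++;
      this.table[key] = port;
    }
    return { ip: this.publicIp, port };
  }
}
```

The private device only ever sees its own local address, which is why it needs outside help to learn the public endpoint.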
STUN Servers
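To overcome NAT, browsers query a STUN (Session Traversal Utilities for NAT) server. The STUN server tells the browser which public IP address and port its traffic appears to come from, and the browser adds this to its list of possible network endpoints (called ICE candidates).
The process:
The browser sends a request to the STUN server, which replies with the browser's public IP and port as seen from outside the NAT. Together with the local addresses, this forms the list of ICE candidates (IP and port combinations).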
These ICE candidates are sent to the signaling server, which exchanges them with the other party.
The peers attempt to establish a direct connection using these ICE candidates.
If direct communication is blocked (e.g., due to strict network rules), a fallback mechanism like a TURN server is used.
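In the browser, STUN and TURN servers are passed to the peer connection through its iceServers configuration. The sketch below only builds that configuration object. The helper name buildPeerConfig and the TURN URL and credentials are hypothetical, and the STUN URL is a commonly used public Google server, assumed here purely for illustration.

```typescript
// Shapes mirroring the browser's RTCConfiguration / RTCIceServer dictionaries.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServer[];
}

interface TurnSettings {
  url: string;
  username: string;
  credential: string;
}

// Build a configuration with a STUN server and, optionally, a TURN fallback.
function buildPeerConfig(stunUrl: string, turn?: TurnSettings): PeerConfig {
  const iceServers: IceServer[] = [{ urls: stunUrl }];
  if (turn) {
    iceServers.push({
      urls: turn.url,
      username: turn.username,
      credential: turn.credential,
    });
  }
  return { iceServers };
}

// Example values (assumptions, not taken from the article).
const exampleConfig = buildPeerConfig("stun:stun.l.google.com:19302", {
  url: "turn:turn.example.com:3478",
  username: "demo-user",
  credential: "demo-secret",
});
// In a browser this object would be passed as: new RTCPeerConnection(exampleConfig)
```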
ICE Candidates
ICE (Interactive Connectivity Establishment) candidates represent possible endpoints for establishing a connection. WebRTC tries different candidates until a successful connection is made. Examples of candidates:
Host candidates: Local IP addresses (used for connections within the same network, e.g., a hostel Wi-Fi).
Server-reflexive candidates (srflx): Public IP addresses discovered via the STUN server.
Relay candidates: Routes through a TURN server, used as a last resort.
Try it out: https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/
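The three candidate types show up as the word after "typ" in each candidate line. As a small sketch, the function below parses such a line into its main fields. The function and interface names are made up for this example, and it reads only the leading fields, ignoring extras such as raddr and rport.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  foundation: string;
  component: number;
  transport: string;
  priority: number;
  address: string;
  port: number;
  type: CandidateType;
}

// Layout: candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
const CANDIDATE_PATTERN =
  /^(?:a=)?candidate:(\S+) (\d+) (\S+) (\d+) (\S+) (\d+) typ (host|srflx|prflx|relay)\b/;

// Returns null when the line is not a recognizable candidate line.
function parseCandidate(line: string): ParsedCandidate | null {
  const m = CANDIDATE_PATTERN.exec(line.trim());
  if (!m) return null;
  return {
    foundation: m[1],
    component: Number(m[2]),
    transport: m[3].toLowerCase(),
    priority: Number(m[4]),
    address: m[5],
    port: Number(m[6]),
    type: m[7] as CandidateType,
  };
}
```

Running it over the lines printed by the trickle ICE sample page is one way to see which candidate types a given network produces.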
TURN Servers
A TURN (Traversal Using Relays around NAT) server acts as a relay for media when direct P2P communication is blocked. Although TURN servers enable robust connections, they are resource-intensive and increase latency, making them a fallback option.
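One way to see why relay candidates end up as the last resort is the candidate priority formula recommended in RFC 8445 (section 5.1.2). The sketch below uses the type preferences that RFC recommends. Actual browsers may choose different values, and the function name is made up for this example.

```typescript
// Type preferences recommended by RFC 8445; implementations may differ.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

// priority = 2^24 * typePreference + 2^8 * localPreference + (256 - componentId)
function candidatePriority(
  type: keyof typeof TYPE_PREFERENCE,
  localPreference: number, // 0..65535, chosen by the implementation
  componentId: number // 1 for RTP
): number {
  return 2 ** 24 * TYPE_PREFERENCE[type] + 2 ** 8 * localPreference + (256 - componentId);
}
```

Because the type preference is multiplied by the largest factor, any host or server-reflexive candidate outranks any relay candidate under these values, regardless of local preference.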
Connecting the two sides
The steps to create a WebRTC connection between the two sides are:
- Browser 1 creates an RTCPeerConnection
- Browser 1 creates an offer
- Browser 1 sets the local description to the offer
- Browser 1 sends the offer to the other side through the signaling server
- Browser 2 receives the offer from the signaling server
- Browser 2 sets the remote description to the offer
- Browser 2 creates an answer
- Browser 2 sets the local description to be the answer
- Browser 2 sends the answer to the other side through the signaling server
- Browser 1 receives the answer and sets the remote description
These steps only establish the P2P connection between the two parties.
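The offer/answer steps above can be sketched as three small functions. To keep the example runnable outside a browser, it is written against a minimal PeerLike interface that covers only the calls used here. In a browser a real RTCPeerConnection would play that role, and the send callback stands in for whatever signaling channel is used. The function names are made up for this sketch.

```typescript
// Minimal shapes covering only the calls used below (an assumption for this
// sketch; the browser's RTCPeerConnection offers these methods and many more).
interface SessionDescription {
  type: string;
  sdp?: string;
}

interface PeerLike {
  createOffer(): Promise<SessionDescription>;
  createAnswer(): Promise<SessionDescription>;
  setLocalDescription(desc: SessionDescription): Promise<void>;
  setRemoteDescription(desc: SessionDescription): Promise<void>;
}

type SendSignal = (desc: SessionDescription) => void;

// Browser 1: create an offer, set it as the local description, send it.
async function startCall(pc: PeerLike, send: SendSignal): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  send(offer);
}

// Browser 2: set the offer as remote, create an answer, set it as local, send it.
async function acceptCall(
  pc: PeerLike,
  offer: SessionDescription,
  send: SendSignal
): Promise<void> {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  send(answer);
}

// Browser 1: set the received answer as the remote description.
async function finishCall(pc: PeerLike, answer: SessionDescription): Promise<void> {
  await pc.setRemoteDescription(answer);
}
```

After finishCall resolves, each side holds both a local and a remote description, which is the state the step list describes.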
To actually send media, we have to:
- Ask for camera/mic permissions
- Get the audio and video streams
- Call addTrack on the RTCPeerConnection (pc)
- This triggers the ontrack callback on the other side (once the tracks have been included in an offer/answer exchange)
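The media steps can be sketched in the same style. The function below takes the media source as a parameter so that it runs outside a browser. In a browser that parameter would wrap navigator.mediaDevices.getUserMedia, which is what triggers the permission prompt. The interface and function names are assumptions made for this sketch.

```typescript
// Minimal shapes for the parts of MediaStreamTrack, MediaStream and
// RTCPeerConnection that this sketch touches.
interface TrackLike {
  kind: string;
}

interface StreamLike {
  getTracks(): TrackLike[];
}

interface SenderPeerLike {
  addTrack(track: TrackLike, stream: StreamLike): unknown;
}

type GetMedia = (constraints: { audio: boolean; video: boolean }) => Promise<StreamLike>;

// Request audio and video, then hand every track to the peer connection.
// Resolves with the number of tracks that were added.
async function publishLocalMedia(getMedia: GetMedia, pc: SenderPeerLike): Promise<number> {
  const stream = await getMedia({ audio: true, video: true });
  const tracks = stream.getTracks();
  tracks.forEach((track) => {
    pc.addTrack(track, stream);
  });
  return tracks.length;
}

// In a browser (not executed here):
//   publishLocalMedia((c) => navigator.mediaDevices.getUserMedia(c), pc);
// and on the receiving side:
//   pc.ontrack = (event) => { videoElement.srcObject = event.streams[0]; };
```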
How WebRTC Establishes Connections
The WebRTC connection setup involves the exchange of the following:
Offer: The initiating browser sends a description of the session it proposes.
Answer: The receiving browser responds with its own session description.
SDP (Session Description Protocol): The text format used for both the offer and the answer. Each description contains:
ICE candidates (these can also be sent separately as they are discovered, which is known as trickle ICE).
Media details (e.g., type, codecs).
An example SDP:
v=0
o=- 423904492236154649 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 49170 RTP/AVP 0
c=IN IP4 192.168.1.101
a=rtpmap:0 PCMU/8000
a=ice-options:trickle
a=candidate:1 1 UDP 2122260223 192.168.1.101 49170 typ host
a=candidate:2 1 UDP 2122194687 10.0.1.1 49171 typ host
a=candidate:3 1 UDP 1685987071 93.184.216.34 49172 typ srflx raddr 10.0.1.1 rport 49171
a=candidate:4 1 UDP 41819902 10.1.1.1 3478 typ relay raddr 93.184.216.34 rport 49172
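Since SDP is line-oriented text, with each line of the form letter=value, it is easy to pull apart. The sketch below groups the lines of a description like the one above into media sections, candidates, and other attributes. The function name and the summary shape are made up for this example, and it is not a complete SDP parser.

```typescript
interface SdpSummary {
  media: string[]; // values of "m=" lines
  candidates: string[]; // values of "a=candidate:..." lines
  attributes: string[]; // all other "a=" lines
}

function summarizeSdp(sdp: string): SdpSummary {
  const summary: SdpSummary = { media: [], candidates: [], attributes: [] };
  sdp.split(/\r?\n/).forEach((raw) => {
    const line = raw.trim();
    // Every SDP line has the form "<single letter>=<value>"; skip anything else.
    if (line.length < 2 || line.charAt(1) !== "=") return;
    const key = line.charAt(0);
    const value = line.slice(2);
    if (key === "m") {
      summary.media.push(value);
    } else if (key === "a" && value.indexOf("candidate:") === 0) {
      summary.candidates.push(value);
    } else if (key === "a") {
      summary.attributes.push(value);
    }
  });
  return summary;
}
```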
Scaling Challenges with P2P
While P2P is efficient for small groups (e.g., 1:1 calls), it becomes impractical for larger groups (e.g., 50 participants). In such cases, applications built on WebRTC adopt centralized architectures.
With 3 people, sending and receiving directly works well. However, with 50 people in a P2P setup, every participant would have to upload its stream to 49 others and download 49 streams in return, which quickly exceeds typical bandwidth and CPU limits.
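The growth is easy to quantify. In a full mesh of n participants, each peer keeps n - 1 connections and the call as a whole has n(n - 1)/2. A small sketch (the function name is made up for this example):

```typescript
// Connection and stream counts for a full P2P mesh of `participants` peers.
function meshCost(participants: number): {
  totalConnections: number;
  uploadsPerPeer: number;
  downloadsPerPeer: number;
} {
  const n = Math.max(0, Math.floor(participants));
  const others = n < 2 ? 0 : n - 1;
  return {
    totalConnections: n < 2 ? 0 : (n * (n - 1)) / 2,
    uploadsPerPeer: others, // one outgoing copy of the stream per other peer
    downloadsPerPeer: others, // one incoming stream per other peer
  };
}

// meshCost(3)  -> 3 connections, 2 uploads and 2 downloads per peer
// meshCost(50) -> 1225 connections, 49 uploads and 49 downloads per peer
```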
Centralized Architectures: SFU and MCU
Selective Forwarding Unit (SFU)
An SFU acts as a central media server that forwards streams to participants selectively. Advantages of SFUs:
Efficient for multi-party calls.
Optimizes bandwidth by forwarding only the streams a participant actually needs (e.g., pausing video for users who have switched to a different tab).
SFUs are widely used in industry (e.g., Zoom and Google Meet).
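The "selective" part of an SFU comes down to deciding, for each incoming packet, who should receive it. The sketch below shows that decision for video, under the assumption that the server tracks a videoPaused flag per subscriber. The types and function name are hypothetical and do not reflect any particular SFU's API.

```typescript
interface Subscriber {
  id: string;
  videoPaused: boolean; // e.g., set when the user's tab is in the background
}

// Decide which subscribers should receive a video packet sent by `senderId`.
// The sender never gets its own video back, and paused subscribers are skipped.
function selectReceivers(senderId: string, subscribers: Subscriber[]): string[] {
  return subscribers
    .filter((s) => s.id !== senderId && !s.videoPaused)
    .map((s) => s.id);
}
```

With this model every participant uploads a single stream to the server, and the server fans it out, instead of each participant uploading one copy per peer.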
Multipoint Control Unit (MCU)
An MCU mixes audio and video streams on the server before distributing a single merged stream to participants. While this reduces processing on the client side, it is resource-intensive for the server.
Advantages of MCU:
Can optimize audio by suppressing low-decibel noises.
Delivers a consistent experience for all participants.
Disadvantage:
High server cost, making it less scalable for large groups.
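To give a feel for the mixing an MCU does, here is a simplified audio-only sketch. It drops any participant frame whose peak amplitude is below a threshold (a crude stand-in for suppressing low-decibel noise), sums the rest sample by sample, and clamps the result. The function names and the gate parameter are assumptions for this example; production mixers decode, resample, and level audio far more carefully.

```typescript
// Peak absolute amplitude of one frame; samples are floats in [-1, 1].
function peak(frame: number[]): number {
  return frame.reduce((max, s) => Math.max(max, Math.abs(s)), 0);
}

// Mix one audio frame per participant into a single output frame.
// `gate` is a hypothetical amplitude threshold below which a frame is ignored.
function mixFrames(frames: number[][], gate: number): number[] {
  if (frames.length === 0) return [];
  // Use the shortest frame length so every index is valid in every frame.
  const length = frames.reduce((min, f) => Math.min(min, f.length), Infinity);
  const audible = frames.filter((f) => peak(f) >= gate);
  const mixed: number[] = [];
  for (let i = 0; i < length; i++) {
    const sum = audible.reduce((acc, f) => acc + f[i], 0);
    // Clamp so the mixed sample stays within [-1, 1].
    mixed.push(Math.max(-1, Math.min(1, sum)));
  }
  return mixed;
}
```

Every participant then receives this single mixed frame, which is why client-side work drops while server-side work grows with the number of participants.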
Distributed SFU
For extremely large-scale events (e.g., 2,000+ participants), distributed SFU setups are used. Here, multiple SFU servers work together to handle the load. Platforms like Unacademy implement distributed SFUs to support massive user bases.
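A first-order question in a distributed setup is how many SFU nodes are needed and how to spread participants across them. The sketch below answers only that, assuming a fixed hypothetical capacity per node. It does not model how the nodes forward streams to one another, which real deployments also have to solve.

```typescript
// Split `participants` across as few nodes as possible, as evenly as possible,
// given a hypothetical per-node capacity. Returns the load of each node.
function planSfuNodes(participants: number, capacityPerNode: number): number[] {
  if (capacityPerNode < 1) {
    throw new Error("capacityPerNode must be at least 1");
  }
  const nodeCount = Math.ceil(participants / capacityPerNode);
  const loads: number[] = [];
  for (let i = 0; i < nodeCount; i++) {
    const base = Math.floor(participants / nodeCount);
    // The first (participants % nodeCount) nodes take one extra participant.
    loads.push(i < participants % nodeCount ? base + 1 : base);
  }
  return loads;
}
```

For example, 2,000 participants with an assumed capacity of 300 per node need 7 nodes, each carrying 285 or 286 participants.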
Simulcast: Handling Variable Network Speeds
Simulcast allows participants to receive video in different quality levels based on their network speed and preferences. The sender's browser encodes and sends multiple versions of the same video (e.g., 480p, 720p, 1080p). This ensures that:
Users with limited bandwidth can receive lower-quality streams.
High-speed connections can enjoy better quality.
The encoding is handled by the sender's browser, not the server. The server only chooses which quality level to forward to each participant, which reduces server complexity.
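On the receiving side, simulcast reduces to picking one of the available layers per participant. The sketch below chooses the highest layer that fits an estimated bandwidth. The layer names and bitrates are illustrative assumptions, not measured values, and the function name is made up for this example.

```typescript
interface Layer {
  name: string;
  height: number;
  bitrateKbps: number;
}

// Hypothetical layers matching the 480p / 720p / 1080p example.
const LAYERS: Layer[] = [
  { name: "low", height: 480, bitrateKbps: 500 },
  { name: "mid", height: 720, bitrateKbps: 1500 },
  { name: "high", height: 1080, bitrateKbps: 3000 },
];

// Pick the highest-bitrate layer that fits the receiver's estimated bandwidth.
// If nothing fits, fall back to the lowest layer rather than sending nothing.
function pickLayer(layers: Layer[], availableKbps: number): Layer {
  if (layers.length === 0) {
    throw new Error("at least one layer is required");
  }
  const sorted = layers.slice().sort((a, b) => a.bitrateKbps - b.bitrateKbps);
  let choice = sorted[0];
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].bitrateKbps <= availableKbps) {
      choice = sorted[i];
    }
  }
  return choice;
}
```

On the sending side, browsers expose simulcast through the sendEncodings option of addTransceiver, which is not shown here.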