How Remoto Playback Handles Multichannel Audio Fold Down
At a Glance
If you are joining a Remoto Playback session from a web browser, iPhone, iPad, or Apple TV and the host is streaming surround sound (5.1 or 7.1), you will automatically hear a stereo version of the mix. Dialogue, music, and effects are all preserved. Surrounds are blended into Left and Right, and the subwoofer channel is removed. No setup is required on your end. If you need the full surround mix, use the Desktop application with a multichannel audio output device.
Overview
When a Remoto Playback session is configured to stream multichannel audio (5.1, 7.1, or up to 16 discrete channels), participants who are monitoring on a stereo device, or who explicitly select stereo output, will hear a fold down (also called a downmix) of the multichannel stream into two channels (Left and Right).
This article explains:
- What a fold down is and why it happens
- How Remoto performs the fold down on each client platform
- The exact fold down coefficients used
- What production teams should keep in mind
Note on Streaming Type: Sessions can be configured as either Desktop Only or Desktop & Web. This setting controls which clients can join the session. It does not change how the fold down works. In a "Desktop Only" session, only Desktop app users participate, so the fold down only applies if a Desktop user selects stereo output. In a "Desktop & Web" session, web, iOS, and Apple TV guests can also join and will automatically receive the stereo fold down described in this article.
What Is a Fold Down?
A fold down is the process of combining a multichannel audio signal (for example, 5.1 surround with six discrete channels) into a signal with fewer channels (typically stereo). The goal is to preserve the spatial intent of the original mix as faithfully as possible, while ensuring all content (dialogue, music, effects, surrounds) remains audible in the reduced channel count.
The standard fold down formula for 5.1-to-stereo is defined by ITU-R BS.775-1 and has been widely adopted across broadcast, cinema, and streaming.
Supported Audio Input Modes
The Remoto host (Desktop application) can configure a session with any of the following audio input modes. The input mode defines the channel layout and ordering, and this is critical because the fold down matrix must know which channel is which.
| Input Mode | Channels | Channel Order |
|---|---|---|
| Stereo | 2 | L, R |
| 5.1 Film | 6 | L, C, R, Ls, Rs, LFE |
| 5.1 DCP | 6 | L, R, C, LFE, Ls, Rs |
| 5.1 SMPTE | 6 | L, R, C, LFE, Ls, Rs |
| 7.1 Film | 8 | L, C, R, Ls, Rs, Lrs, Rrs, LFE |
| 7.1 DCP | 8 | L, R, C, LFE, Ls, Rs, Lrs, Rrs |
| 7.1 SMPTE | 8 | L, R, C, LFE, Ls, Rs, Lrs, Rrs |
| 16 Unrelated | 16 | Discrete (no spatial relationship assumed) |
| Custom | 2 to 16 | User-defined |
Important: The distinction between Film, DCP, and SMPTE orderings affects where the Center, LFE, and Surround channels sit in the stream. The fold down engine must respect this ordering to avoid misassigning channels (for example, treating LFE as a Center channel).
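As a minimal sketch of why the ordering matters, the snippet below labels a raw frame of samples using the channel orders from the table above. The names `CHANNEL_ORDER` and `label_frame` are hypothetical and only illustrate the idea; they are not Remoto's actual API.

```python
# Channel orders copied from the Supported Audio Input Modes table.
# CHANNEL_ORDER and label_frame are illustrative names, not Remoto identifiers.
CHANNEL_ORDER = {
    "5.1 Film":  ["L", "C", "R", "Ls", "Rs", "LFE"],
    "5.1 DCP":   ["L", "R", "C", "LFE", "Ls", "Rs"],
    "5.1 SMPTE": ["L", "R", "C", "LFE", "Ls", "Rs"],
    "7.1 Film":  ["L", "C", "R", "Ls", "Rs", "Lrs", "Rrs", "LFE"],
    "7.1 DCP":   ["L", "R", "C", "LFE", "Ls", "Rs", "Lrs", "Rrs"],
    "7.1 SMPTE": ["L", "R", "C", "LFE", "Ls", "Rs", "Lrs", "Rrs"],
}


def label_frame(input_mode, frame):
    """Attach channel labels to one frame of samples for the given input mode."""
    order = CHANNEL_ORDER[input_mode]
    if len(frame) != len(order):
        raise ValueError(
            f"{input_mode} expects {len(order)} channels, got {len(frame)}"
        )
    return dict(zip(order, frame))


# The same six samples mean different things under different orderings:
frame = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
film = label_frame("5.1 Film", frame)    # index 1 is Center, index 5 is LFE
smpte = label_frame("5.1 SMPTE", frame)  # index 2 is Center, index 3 is LFE
```

Reading a Film-ordered stream as SMPTE would treat the Left Surround sample (index 3) as LFE and discard it, which is the kind of misassignment the note above warns about.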
Fold Down Coefficients
Standard 5.1 to Stereo (Lo/Ro), ITU-R BS.775-1
This is the reference fold down used across all Remoto clients. Coefficients are expressed in both linear gain and dB:
| Source Channel | Left Output Coefficient | Right Output Coefficient | Gain (dB) |
|---|---|---|---|
| L (Left) | 1.0 | 0.0 | 0 dB |
| R (Right) | 0.0 | 1.0 | 0 dB |
| C (Center) | 0.707 (~1/√2) | 0.707 (~1/√2) | -3.0 dB |
| LFE | 0.0 | 0.0 | -∞ (discarded) |
| Ls (Left Surround) | 0.707 (~1/√2) | 0.0 | -3.0 dB |
| Rs (Right Surround) | 0.0 | 0.707 (~1/√2) | -3.0 dB |
Resulting formula:
L_out = L + (C × 0.707) + (Ls × 0.707)
R_out = R + (C × 0.707) + (Rs × 0.707)
Note on LFE: The LFE (Low Frequency Effects / ".1") channel is discarded in the standard Lo/Ro fold down per ITU-R BS.775. This is intentional. The LFE carries content meant for subwoofers and is reproduced at +10 dB relative to the main channels. Mixing it into stereo speakers, even at an attenuated level, can cause distortion, bass overload, and an unnatural tonal balance. Discarding it is common practice in professional broadcast and streaming downmixes.
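The 5.1 table and formula above can be sketched in a few lines. This is an illustration of the arithmetic only, assuming one frame of samples keyed by channel label; `LO_RO_51`, `fold_down`, and `to_db` are hypothetical names, not part of any Remoto client.

```python
import math

G = 1 / math.sqrt(2)  # ~0.707, the -3 dB coefficient from the table

# (left, right) coefficients per source channel, as listed in the 5.1 table.
LO_RO_51 = {
    "L":   (1.0, 0.0),
    "R":   (0.0, 1.0),
    "C":   (G, G),
    "LFE": (0.0, 0.0),  # discarded
    "Ls":  (G, 0.0),
    "Rs":  (0.0, G),
}


def fold_down(frame, matrix=LO_RO_51):
    """Fold one labeled frame (dict of channel -> sample) down to (L_out, R_out)."""
    left = sum(frame.get(ch, 0.0) * coeff[0] for ch, coeff in matrix.items())
    right = sum(frame.get(ch, 0.0) * coeff[1] for ch, coeff in matrix.items())
    return left, right


def to_db(gain):
    """Convert a linear gain to decibels; a gain of 0 maps to -infinity."""
    return float("-inf") if gain == 0 else 20 * math.log10(gain)


# Dialogue on Center only: appears equally in both outputs at about -3 dB.
center_only = fold_down({"C": 1.0})

# Content on LFE only: silent in the fold down.
lfe_only = fold_down({"LFE": 1.0})
```

`to_db(G)` evaluates to roughly -3.01 dB, which the tables round to -3.0 dB.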
Standard 7.1 to Stereo (Lo/Ro)
Extends the 5.1 matrix with the rear surround pair:
| Source Channel | Left Output Coefficient | Right Output Coefficient | Gain (dB) |
|---|---|---|---|
| L (Left) | 1.0 | 0.0 | 0 dB |
| R (Right) | 0.0 | 1.0 | 0 dB |
| C (Center) | 0.707 | 0.707 | -3.0 dB |
| LFE | 0.0 | 0.0 | -∞ (discarded) |
| Ls (Left Surround) | 0.707 | 0.0 | -3.0 dB |
| Rs (Right Surround) | 0.0 | 0.707 | -3.0 dB |
| Lrs (Left Rear Surround) | 0.707 | 0.0 | -3.0 dB |
| Rrs (Right Rear Surround) | 0.0 | 0.707 | -3.0 dB |
Resulting formula:
L_out = L + (C × 0.707) + (Ls × 0.707) + (Lrs × 0.707)
R_out = R + (C × 0.707) + (Rs × 0.707) + (Rrs × 0.707)
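To show how the 7.1 matrix combines with channel ordering, here is a hedged sketch that folds down frames given as plain lists in either Film or SMPTE order. The function and constant names are assumptions made for illustration; the coefficients and orders are taken from the tables in this article.

```python
import math

G = 1 / math.sqrt(2)  # ~0.707 (-3 dB)

# (left, right) coefficients per channel label, from the 7.1 table.
LO_RO_71 = {
    "L": (1.0, 0.0), "R": (0.0, 1.0), "C": (G, G), "LFE": (0.0, 0.0),
    "Ls": (G, 0.0), "Rs": (0.0, G), "Lrs": (G, 0.0), "Rrs": (0.0, G),
}

FILM_71 = ["L", "C", "R", "Ls", "Rs", "Lrs", "Rrs", "LFE"]
SMPTE_71 = ["L", "R", "C", "LFE", "Ls", "Rs", "Lrs", "Rrs"]


def fold_down_frames(frames, order):
    """Fold a list of 8-sample frames to stereo, honoring the given channel order."""
    coeffs = [LO_RO_71[label] for label in order]  # one (l, r) pair per index
    stereo = []
    for frame in frames:
        left = sum(s * c[0] for s, c in zip(frame, coeffs))
        right = sum(s * c[1] for s, c in zip(frame, coeffs))
        stereo.append((left, right))
    return stereo


# The same program content, laid out in two different channel orders.
content = {"L": 0.1, "R": 0.2, "C": 0.3, "LFE": 0.9,
           "Ls": 0.05, "Rs": 0.06, "Lrs": 0.07, "Rrs": 0.08}
film_frame = [content[label] for label in FILM_71]
smpte_frame = [content[label] for label in SMPTE_71]

from_film = fold_down_frames([film_frame], FILM_71)[0]
from_smpte = fold_down_frames([smpte_frame], SMPTE_71)[0]
```

When the order passed in matches the stream, both layouts produce the same stereo result (up to floating-point rounding), and the large LFE value has no effect on either output.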
Alternative Coefficients (Lt/Rt, Dolby Pro Logic Compatible)
Some playback systems use the Lt/Rt (Left-total/Right-total) matrix, which preserves surround encoding for downstream Dolby Pro Logic decoding. The coefficients are listed here for reference only:
| Source Channel | Lt Coefficient | Rt Coefficient |
|---|---|---|
| L | 1.0 | 0.0 |
| R | 0.0 | 1.0 |
| C | 0.707 | 0.707 |
| LFE | 0.0 | 0.0 |
| Ls | 0.707 × j (−90° phase shift) | −0.707 × j |
| Rs | 0.707 × j | −0.707 × j |
The Lt/Rt matrix is not currently used by Remoto's live streaming fold down.
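For readers curious why the phase-shifted coefficients matter, the sketch below treats each channel as a single-frequency phasor (a complex number) and applies the Lt/Rt table as written. This is a simplification for illustration: real encoders use a wideband 90° phase-shift network on audio signals, and the names `LT_RT` and `encode_lt_rt` are hypothetical.

```python
G = 0.707  # -3 dB coefficient from the Lt/Rt table

# (Lt, Rt) coefficients per source channel; 1j stands for the 90-degree
# phase shift (the "j" in the table). Values mirror the table as written.
LT_RT = {
    "L":   (1.0, 0.0),
    "R":   (0.0, 1.0),
    "C":   (G, G),
    "LFE": (0.0, 0.0),
    "Ls":  (G * 1j, -G * 1j),
    "Rs":  (G * 1j, -G * 1j),
}


def encode_lt_rt(phasors):
    """Combine per-channel phasors (dict of channel -> complex) into (Lt, Rt)."""
    lt = sum(phasors.get(ch, 0) * coeff[0] for ch, coeff in LT_RT.items())
    rt = sum(phasors.get(ch, 0) * coeff[1] for ch, coeff in LT_RT.items())
    return lt, rt


# Center-only content is identical in Lt and Rt (in phase).
c_lt, c_rt = encode_lt_rt({"C": 1.0})

# Surround-only content lands in Lt and Rt with opposite polarity.
s_lt, s_rt = encode_lt_rt({"Ls": 1.0})
```

Because the surround contribution has opposite polarity in Lt and Rt, it cancels in the sum `s_lt + s_rt` and doubles in the difference `s_lt - s_rt`. A matrix decoder can use that difference to recover surround content, which a plain Lo/Ro fold down cannot offer.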
Discrete 16-Channel Streams
When the session is configured as 16 Unrelated (16 discrete channels with no defined spatial relationship), a standard surround fold down matrix does not apply. In this scenario:
- Playback for Web / iOS / Apple TV: Receive a stereo stream from the server. The server-side Opus encoder applies its default channel coupling, which typically sums all channels equally into L/R with gain normalization. This is a lossy simplification and is not meant to represent a meaningful stereo fold.
- Desktop App: Can route individual discrete channels to specific output pairs. When stereo output is selected, the behavior depends on the host's routing configuration. If no custom routing is defined, the first two channels are passed through as L/R.
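The two behaviors described above can be contrasted with a small sketch. The exact normalization used on the server is not documented here, so dividing by the channel count is an assumption, as are the function names.

```python
def sum_and_normalize(frame):
    """Equal-weight sum of all channels, scaled by the channel count.

    Assumption: normalization divides by the number of channels. Because every
    channel feeds both outputs equally, the result is effectively dual mono.
    """
    mixed = sum(frame) / len(frame)
    return mixed, mixed


def first_pair_passthrough(frame):
    """Desktop default with no custom routing: channels 1 and 2 become L/R."""
    return frame[0], frame[1]


# One frame of 16 unrelated channels (sample values are arbitrary).
frame = [float(i) for i in range(16)]

web_style = sum_and_normalize(frame)           # every channel contributes
desktop_style = first_pair_passthrough(frame)  # channels 3 to 16 are not heard
```

Neither result is a meaningful mix of unrelated channels, which is why custom routing on the Desktop app is the better option for 16-channel discrete sessions.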
How Each Client Handles the Fold Down
Playback for Web (Browser)
| Aspect | Detail |
|---|---|
| Fold down location | Server-side (MediaMTX streaming server) |
| When it happens | At WebRTC stream negotiation time |
| Mechanism | Playback for Web always negotiates 2-channel stereo Opus via the WebRTC SDP offer (opusChannels: 2 with stereo=1; sprop-stereo=1). MediaMTX receives the full multichannel Opus stream from the Desktop host, decodes it, applies the Lo/Ro fold down via libopus, and re-encodes a stereo Opus stream for the browser. |
| Codec | Opus (compressed, lossy) at up to 320 kbps stereo |
| User control | None. Playback for Web always receives stereo. Volume and mute controls are available, but output channel configuration is not. |
| Audio processing chain | MediaStream > AudioProcessor (Web Audio API: source > delay > gain > mute > destination). No channel manipulation occurs client-side. |
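As an illustration of the negotiation described in the table, the sketch below checks whether an SDP body asks for stereo Opus by reading the `stereo` and `sprop-stereo` parameters on the `a=fmtp` line. The SDP fragment is a made-up example, and payload type 111 is only a commonly seen value; actual offers from Playback for Web may differ.

```python
def parse_fmtp(sdp, payload_type):
    """Return the fmtp parameters for a payload type as a dict of strings."""
    prefix = f"a=fmtp:{payload_type} "
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        params = {}
        for item in line[len(prefix):].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition("=")
                params[key.strip()] = value.strip()
        return params
    return {}


def requests_stereo_opus(sdp, payload_type=111):
    """True when both stereo=1 and sprop-stereo=1 are present for the payload."""
    params = parse_fmtp(sdp, payload_type)
    return params.get("stereo") == "1" and params.get("sprop-stereo") == "1"


# Hypothetical SDP fragment for illustration only.
EXAMPLE_SDP = """\
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1;stereo=1; sprop-stereo=1
"""

is_stereo = requests_stereo_opus(EXAMPLE_SDP)
```

A check like this can help when debugging a session: if the parameters are missing from the negotiated SDP, the browser may decode the stream as mono even though the server sent two channels.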
Desktop Application (macOS / Windows)
| Aspect | Detail |
|---|---|
| Fold down location | Client-side (local audio engine within the Desktop app) |
| When it happens | When the user selects a stereo output device or stereo output mode while a multichannel session is active |
| Mechanism | The Desktop app receives the full multichannel stream (up to 16 channels) over the network. Its native C++ audio engine applies a Lo/Ro downmix matrix locally, using the session's AudioInputMode to determine the correct channel ordering (Film, DCP, or SMPTE). |
| Codec | Opus (compressed) for streaming; PCM internally for mixing |
| User control | Full. The user can select output mode (stereo, 5.1, 7.1, discrete), choose audio output devices, and adjust per-channel routing. The fold down is only applied when the output mode has fewer channels than the input. |
| Quality | Highest fidelity fold down, as the app has access to discrete channels before mixing |
iOS Application (iPhone / iPad)
| Aspect | Detail |
|---|---|
| Fold down location | Server-side (MediaMTX) + device-level (Apple Core Audio / AVAudioSession) |
| When it happens | At WebRTC stream negotiation, identical to Playback for Web |
| Mechanism | The iOS app connects to the same MediaMTX WHEP streaming endpoint used by Playback for Web and negotiates stereo Opus via WebRTC SDP. The server performs the fold down before the audio reaches the device. |
| Codec | Opus (compressed) decoded by WebRTC, rendered via AVAudioSession |
| User control | None. iOS devices output stereo (or mono on single-speaker devices). The device's built-in speaker or connected headphones/AirPods receive the stereo downmix. |
| Spatial Audio note | When using AirPods Pro or AirPods Max with Spatial Audio enabled, iOS may apply head-tracked spatialization to the stereo signal. This is an Apple system-level feature and is independent of Remoto's fold down. |
Apple TV Application (tvOS)
| Aspect | Detail |
|---|---|
| Fold down location | Server-side (MediaMTX) + device-level (Apple Core Audio / AVAudioEngine) |
| When it happens | At WebRTC stream negotiation |
| Mechanism | The Apple TV app uses the same WebRTC/WHEP pipeline and negotiates stereo Opus from the server. The fold down occurs server-side before audio reaches the device. Although Apple TV hardware supports multichannel output (5.1, 7.1, Atmos) to connected receivers and soundbars, the Remoto app currently requests a stereo stream from the server, so there is no multichannel audio to pass through. |
| Codec | Opus (compressed) decoded by WebRTC |
| User control | None within the Remoto app. Users can configure their Apple TV's audio output format in tvOS Settings (Stereo, Dolby Digital 5.1, Dolby Atmos), but since the Remoto stream arrives as stereo, the Apple TV will output stereo regardless of this setting. |
Platform Comparison Matrix
| | Desktop App | Playback for Web | iOS App | Apple TV App |
|---|---|---|---|---|
| Fold down location | Client-side | Server-side | Server-side | Server-side |
| Receives multichannel? | Yes (up to 16ch) | No (stereo only) | No (stereo only) | No (stereo only) |
| Fold down engine | Native C++ audio engine | MediaMTX / libopus | MediaMTX / libopus | MediaMTX / libopus |
| Fold down standard | ITU-R BS.775 (Lo/Ro) | ITU-R BS.775 (Lo/Ro via Opus) | ITU-R BS.775 (Lo/Ro via Opus) | ITU-R BS.775 (Lo/Ro via Opus) |
| LFE handling | Discarded in stereo mode | Discarded | Discarded | Discarded |
| User can select output mode | Yes | No | No | No |
| Channel ordering aware | Yes (Film/DCP/SMPTE) | N/A (receives stereo) | N/A (receives stereo) | N/A (receives stereo) |
| Audio codec | Opus to PCM (local decode) | Opus (stereo) | Opus (stereo) | Opus (stereo) |
| Adjustable fold down coefficients | No | No | No | No |
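The matrix above boils down to a simple decision per client. The sketch below encodes that decision; the client identifiers and the `delivery` function are hypothetical and exist only to summarize the table.

```python
# Clients that always negotiate a stereo stream from the server (per the matrix).
SERVER_FOLDED_CLIENTS = {"web", "ios", "tvos"}


def delivery(client, session_channels, output_channels=2):
    """Summarize what a client receives and where any fold down happens.

    output_channels only matters for the Desktop app, where the user picks
    the output mode; the other clients always receive stereo.
    """
    if client in SERVER_FOLDED_CLIENTS:
        fold = "server" if session_channels > 2 else "none"
        return {"received_channels": min(2, session_channels), "fold_down": fold}
    if client == "desktop":
        fold = "client" if output_channels < session_channels else "none"
        return {"received_channels": session_channels, "fold_down": fold}
    raise ValueError(f"unknown client: {client}")


web_guest = delivery("web", session_channels=6)
desktop_surround = delivery("desktop", session_channels=6, output_channels=6)
desktop_stereo = delivery("desktop", session_channels=6, output_channels=2)
```

In this sketch a Desktop guest always receives all six channels of a 5.1 session, and a fold down is applied locally only when the chosen output has fewer channels than the input.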
What Will I Hear Differently?
If you are used to hearing content in surround and are now listening through a stereo fold down (on Playback for Web, iOS, or Apple TV), here is what changes perceptually:
| What you're listening for | What happens in the fold down |
|---|---|
| Dialogue | Sounds the same or very similar. The Center channel (where dialogue typically lives) is mixed equally into Left and Right at a slight reduction (-3 dB). Dialogue remains clear and centered in the stereo image. |
| Music and effects in the front L/R channels | No change. These pass through to stereo at full level. |
| Surround effects (ambience, rear sound design, crowd noise) | Still audible, but no longer "behind" you. They are mixed into the front Left and Right channels at a slight reduction (-3 dB). You will hear them, but they will feel like they are part of the front soundstage rather than coming from behind. |
| Subwoofer / LFE (deep bass rumbles, explosions) | Not audible. The LFE channel is removed entirely. Most bass content in film and TV is also present in the main channels, so typical low-end will still be heard. However, LFE-only content (for example, a standalone deep rumble) will be silent. |
| Overall loudness | May be slightly louder than the surround version, because multiple channels are being summed together. This is normal and expected. |
| Spatial sense | Reduced. Surround sound creates a 360-degree experience. The stereo fold down preserves Left-Right separation but collapses front-back depth into the front stereo field. |
In short: You will hear all the important content (dialogue, music, and effects) clearly. The main difference is that surround envelopment is reduced to a front-facing stereo image, and isolated LFE content is removed.
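The loudness change noted in the table can be bounded with a short calculation. This is a rough sketch, not a measurement. It compares two idealized cases for one output channel: every contributing channel carrying the same signal in phase (worst case), and every contributing channel carrying unrelated signals of equal power. Real mixes usually fall between these, closer to the second.

```python
import math

G = 1 / math.sqrt(2)  # ~0.707 (-3 dB)

# Coefficients feeding the Left output, from the fold down tables.
LEFT_51 = [1.0, G, G]     # L, C, Ls
LEFT_71 = [1.0, G, G, G]  # L, C, Ls, Lrs


def correlated_gain_db(coeffs):
    """Level increase if all inputs carry the same signal in phase (amplitudes add)."""
    return 20 * math.log10(sum(coeffs))


def uncorrelated_gain_db(coeffs):
    """Level increase if inputs are unrelated and of equal power (powers add)."""
    return 10 * math.log10(sum(c * c for c in coeffs))


worst_51 = correlated_gain_db(LEFT_51)      # about 7.7 dB
typical_51 = uncorrelated_gain_db(LEFT_51)  # about 3.0 dB
worst_71 = correlated_gain_db(LEFT_71)      # about 9.9 dB
typical_71 = uncorrelated_gain_db(LEFT_71)  # about 4.0 dB
```

These figures are relative to a single full-level front channel. They explain why a fold down can sound louder, and why mixes with heavy, correlated content across channels may approach clipping when summed.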
Production Guidance
What to Expect as a Viewer
- If you are joining a session on Playback for Web, iOS, or Apple TV and the host is streaming in 5.1 or 7.1 surround, you will automatically hear a stereo fold down. No action is required.
- The fold down is a reference-quality stereo preview. All dialogue (Center channel) will be clearly audible, mixed equally into both Left and Right. Surround content will be mixed in at -3 dB.
- The LFE channel is not included in the stereo fold down. If the content has significant bass-only elements in the LFE, those will not be audible in the stereo preview.
What to Expect as a Host / Organizer
- The fold down is automatic and transparent. You do not need to configure it.
- Your Desktop application continues to receive the full multichannel stream. If you select stereo output on the Desktop app, you will hear the same Lo/Ro fold down that remote participants hear.
- The fold down does not alter the source stream. All guests on multichannel-capable endpoints (Desktop app with a surround output device) will receive the full discrete channels.
Recommendations for Critical Listening
- For critical stereo evaluation of multichannel content, perform the fold down in your DAW or NLE before streaming, where you have full control over coefficients, limiter settings, and monitoring levels.
- Use the Desktop app with a calibrated stereo output for the most faithful fold down during a live session.
- Remember that the Playback for Web/iOS/Apple TV fold down goes through an additional encode-decode cycle (Opus compression), which may introduce subtle artifacts compared to a direct PCM fold down.
Known Limitations
| Limitation | Detail |
|---|---|
| No user-adjustable coefficients | The fold down matrix is fixed (ITU-R BS.775 Lo/Ro). Users cannot adjust center or surround levels in any client. |
| LFE is always discarded | The .1 channel is not included in the stereo fold down on any platform. |
| 16-channel discrete has no standard fold | Streams configured as "16 Unrelated" have no spatial relationship. The server-side fold down for Playback for Web/iOS/Apple TV is a simple sum-and-normalize, not a perceptually optimized mix. |
| Playback for Web/iOS/Apple TV are stereo-only | These clients cannot receive or output multichannel audio. Multichannel monitoring requires the Desktop application. |
| Codec artifacts | The Playback for Web/iOS/Apple TV fold down occurs on a compressed (Opus) signal. For critical stereo evaluation, use the Desktop app with a calibrated stereo output. |
Frequently Asked Questions
Q: I'm joining a session from my browser. Will I hear the surround mix?
No. Your browser will receive a stereo version. All content is still audible (dialogue, music, effects) but spatial surround information is reduced to a stereo Left/Right image. This happens automatically and requires no setup.
Q: Why does the mix sound slightly different on my laptop compared to the screening room?
In a screening room with a surround speaker system, you hear discrete channels placed around you (front, sides, rear, subwoofer). On your laptop, all of those channels are folded into two speakers. The center channel (dialogue) is blended into both left and right, surround elements move to the front, and the subwoofer channel is removed. The artistic intent is preserved, but the immersive spatial experience is inherently limited by stereo speakers.
Q: Can I hear the full surround mix on my iPhone or Apple TV?
Not currently. The iOS and Apple TV apps receive a pre-mixed stereo stream. To monitor the full discrete multichannel mix, use the Desktop application with a connected surround audio output device (for example, an audio interface routed to a 5.1 or 7.1 speaker system).
Q: Will I miss any dialogue or important audio?
No. Dialogue is typically carried on the Center channel, which is mixed into both Left and Right at nearly full volume (-3 dB). Music and effects in the front Left/Right channels pass through at full level. Surround effects are still audible, just repositioned to the front. The only content that is removed is the LFE (subwoofer) channel, which carries deep bass rumbles and is not intended for stereo speakers.
Q: Why is the subwoofer / LFE channel removed?
The LFE channel carries content at +10 dB relative to the main channels and is designed exclusively for subwoofer playback. Mixing it into stereo speakers, even at a reduced level, can cause distortion and bass overload, and can mask other elements. Removing it is common industry practice (per ITU-R BS.775) for stereo fold downs in streaming and broadcast.
Q: The stereo fold down sounds louder than the surround version. Is that normal?
Yes. When multiple channels are summed into two, the combined signal level increases. This is expected behavior. If the increase is distracting, reduce your output volume. The relative balance between dialogue, music, and effects remains correct.
Q: Can I adjust the fold down settings (for example, make surrounds louder or quieter)?
Not currently. The fold down uses fixed, industry-standard coefficients (Center and Surrounds at -3 dB). This is not adjustable in any Remoto client. For critical work requiring custom fold down settings, we recommend performing the fold down in your DAW or NLE before streaming.
Q: Does the fold down affect the source stream or what other participants hear?
No. The fold down is applied independently to each participant's output. The original multichannel stream is never altered. A guest on the Desktop app with a 5.1 system will hear full surround, while a guest on Playback for Web will hear stereo, simultaneously, in the same session.
Q: I'm connected to a soundbar / HomePod / AirPods. What will I hear?
You will hear stereo. Even if your device supports Dolby Atmos or Spatial Audio, the Remoto stream arrives as stereo on Playback for Web, iOS, and Apple TV. Apple's Spatial Audio (on AirPods Pro/Max) may apply head-tracked spatialization to the stereo signal for a wider feel, but this is an Apple system feature unrelated to Remoto's fold down.
How the Audio Flows: A Simple View
Below is a simplified view of what happens when a host streams a 5.1 surround session and participants join on different devices:
             HOST (Desktop App)
   Streams 5.1: L | R | C | LFE | Ls | Rs
                      │
                      ▼
              ┌───────────────┐
              │   MediaMTX    │  (Streaming Server)
              │    Server     │
              └───┬───────┬───┘
                  │       │
        Full 5.1  │       │  Stereo fold down
         (6 ch)   │       │  (2 ch)
                  │       │
                  ▼       ▼
        ┌──────────┐  ┌──────────────────────────────┐
        │ Desktop  │  │ Playback for Web / iOS /     │
        │ App      │  │ Apple TV                     │
        │ Guest    │  │ Receives:                    │
        │          │  │ L_out = L + C×0.7 + Ls×0.7   │
        │ Hears    │  │ R_out = R + C×0.7 + Rs×0.7   │
        │ full 5.1 │  │                              │
        │ surround │  │ Hears stereo                 │
        └──────────┘  └──────────────────────────────┘
- Desktop guests with a surround system hear the full 5.1 mix as the host intended.
- Playback for Web, iOS, and Apple TV guests hear a stereo version where dialogue remains centered, effects remain clear, and surround elements are blended into the front Left/Right.
- The host's original stream is never modified. Each participant receives the version appropriate for their device.
Glossary
| Term | Definition |
|---|---|
| Lo/Ro | Left-only / Right-only. A straightforward stereo fold down where surround channels are mixed into L/R without phase encoding. The standard for broadcast monitoring. |
| Lt/Rt | Left-total / Right-total. A matrix-encoded stereo fold down compatible with Dolby Pro Logic decoders. Surround channels are phase-shifted before mixing. |
| ITU-R BS.775 | International standard defining the reference stereo downmix coefficients for multichannel audio (published by the International Telecommunication Union). |
| LFE | Low Frequency Effects, the ".1" channel in 5.1/7.1. Carries bass content at +10 dB relative level, intended for subwoofer reproduction only. |
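| Opus | An open, royalty-free audio codec optimized for interactive speech and music. Used by Remoto for real-time streaming over WebRTC. Supports multichannel via the MultiOpus extension, which relies on multistream channel mapping of the kind described in RFC 7845 (RFC 8486 extends that mapping for Ambisonics). |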
| MultiOpus | Extension of the Opus codec for more than 2 channels. Uses multiple coupled and uncoupled Opus streams with a channel mapping table to represent 5.1, 7.1, and higher channel counts. |
| MediaMTX | Open-source real-time media server used by Remoto for WebRTC (WHEP) stream distribution. Handles the server-side fold down when serving stereo to Playback for Web, iOS, and Apple TV. |
| WHEP | WebRTC-HTTP Egress Protocol. The standard used by clients to subscribe to a WebRTC stream from MediaMTX. |
| SDP | Session Description Protocol. Describes the media capabilities (codecs, channels, etc.) negotiated between WebRTC peers. The channel count and Opus parameters are set here. |
| AVAudioSession | Apple's iOS/tvOS framework for managing audio behavior, routing, and session configuration on Apple devices. |