Cinema C2: Hiding a Control Channel Inside a Video File

The idea

In the early days of television, not every line on your screen carried a picture. A handful of scan lines at the top of each field, invisible to viewers, made up the Vertical Blanking Interval. Broadcasters used the VBI to transmit closed captions, teletext pages, time codes, and copy-protection signals. The control layer rode inside the content itself. No sideband. No second connection. If you had the signal, you had the commands.

The thing about the VBI that people forget is that it was not a hack. It was a deliberate allocation of bandwidth for purposes other than picture. The engineers who designed NTSC understood that a communication channel is more valuable than a picture channel, because a communication channel can carry anything – including instructions about what to do with the picture. They gave up a few scan lines and got an entire control plane.

Cinema C2 applies this principle to modern video files. An MKV container carries video and audio tracks that any player can decode, plus one or two additional tracks that no player recognizes. Those tracks contain a timed sequence of cryptographically signed commands: play, pause, seek, synchronize, display an overlay, rotate keys. A lightweight process called the shim reads the hidden tracks and executes commands at precise timestamps while driving an unmodified video player via IPC. The viewer sees a movie. The shim sees a script.

The question was whether you could embed a full command-and-control channel inside a video container, keep multiple remote viewers synchronized to the same timeline, and let authorized operators inject live amendments – without modifying the player, the container format, or the network protocol.

You can.

The NTSC standard was adopted by the FCC in 1941. VBI data services
are specified in EIA-608 (closed captions, 1979) and ETSI EN 300 706
(World System Teletext, 1976). Macrovision's Analog Copy Protection,
introduced commercially in the mid-1980s, placed its pulses in VBI
lines rather than in the picture. The VBI occupied lines 1-21 of the
262.5-line NTSC field -- roughly 8% of the vertical resolution,
deliberately sacrificed for non-picture data.

The container

Most people think of video files as monolithic. But a container format is just an envelope describing how multiple data streams are interleaved and timestamped. The video codec handles compression. The audio codec handles sound. The container holds them together and says which bytes go where.

MKV – Matroska – is built on EBML, a binary markup language designed to be extended. You can add tracks with arbitrary codec IDs and compliant parsers will skip anything they do not recognize. No error, no warning. MP4, by contrast, requires codec identifiers from a fixed registry. Add something unknown and many parsers reject the entire file.

Cinema C2 adds two tracks with codec IDs beginning with S_CC2/. One carries timed command blocks. The other carries an encrypted session key. mpv, VLC, and FFmpeg all ignore these tracks. They are invisible in every player’s UI. The C2 tracks do not exploit a vulnerability or abuse a parser bug. They use the format exactly as designed – a track with a codec ID that no player has a decoder for. The player skips it. The shim reads it. Both are behaving correctly.

You can take any existing MKV and bolt C2 tracks onto it without touching the original content. The video, audio, and subtitle tracks pass through unchanged.

EBML is specified in RFC 8794 (2020). The Matroska container format is
specified in RFC 9559 (2024), which grew out of the IETF CELLAR draft
draft-ietf-cellar-matroska, and is documented at matroska.org. Codec
IDs are free-form strings; the official registry is at
matroska.org/technical/codec_specs.html but the format places no
restriction on using unregistered IDs. MP4's sample description box
(stsd) is defined in ISO/IEC 14496-12; its codec registry (the "codecs"
parameter space) is maintained by MP4RA at mp4ra.org.

The shim

The shim is a Rust binary that sits between the file and the player. At startup it scans the MKV for tracks whose codec IDs begin with S_CC2/, derives a session key from the data track content, and loads every command block into a priority queue ordered by timestamp.

Then it launches mpv with a JSON IPC socket and begins polling the playback position. When the player’s clock reaches a block’s timestamp, the shim executes it – sending commands back to mpv (seek here, pause now, show this text) or updating internal state. Simultaneously, it joins an Iroh gossip mesh using a topic ID embedded in the session key, and begins broadcasting its playback position to other peers watching the same file.

The shim never modifies the file. It never patches the player. It reads the hidden channel and translates it into player commands and network messages.
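
The scheduling loop described above can be sketched in a few lines of Python. This is an illustration, not the shim's Rust code: CommandBlock, Schedule, and the action strings are hypothetical names, and the playback position is passed in as a plain number where the real shim would poll mpv over JSON IPC (for example by reading the time-pos property).

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class CommandBlock:
    pts_ms: int                        # timestamp on the data track timeline
    action: str = field(compare=False) # hypothetical action label


class Schedule:
    """Min-heap of command blocks keyed by timestamp."""

    def __init__(self, blocks):
        self._heap = list(blocks)
        heapq.heapify(self._heap)

    def due(self, position_ms):
        """Pop and return every block whose timestamp has been reached."""
        ready = []
        while self._heap and self._heap[0].pts_ms <= position_ms:
            ready.append(heapq.heappop(self._heap))
        return ready


# Blocks can be loaded in any order; the heap sorts them by timestamp.
schedule = Schedule([
    CommandBlock(35_000, "gate"),
    CommandBlock(8_000, "show-text"),
    CommandBlock(0, "heartbeat"),
])

# Simulated polls of the player's clock, in milliseconds.
fired = [b.action for pos in (0, 5_000, 9_000, 40_000) for b in schedule.due(pos)]
```

Each poll drains only the blocks that have come due, so a late poll (or a forward seek) fires everything it skipped over in timestamp order.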

The session bootstrap

There is a pleasing bit of circular-dependency avoidance in how the session key works. The key is encrypted. The encryption key derives from a SHA-256 hash of the data track. But the session key has to live in the file alongside the data track.

The solution: two separate MKV tracks, authored in sequence. The data track is written first and hashed. Then a random session key is generated, encrypted with a key derived from that hash, and written as a separate session track. The data track hash is stable before the session track exists. No circularity.

This means the shared secret for a basic session is the data track content itself. The file is the key. You can re-encode the video at a different bitrate, swap the audio language, add subtitles – as long as the C2 data track is preserved, the session key still works. The identity of a session is the sequence of commands, not the pixels or the sound.
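
A minimal sketch of the first half of that bootstrap, using only the Python standard library. It hashes the data track and runs the hash through HKDF-SHA256 to get the key that would wrap the session key. The info label is an assumption made for illustration, and the ChaCha20-Poly1305 wrapping step is left out because the standard library has no AEAD.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes = b"", info: bytes = b"", length: int = 32) -> bytes:
    """HKDF as described in RFC 5869 (extract, then expand), with SHA-256."""
    if length > 255 * 32:
        raise ValueError("HKDF-SHA256 output is limited to 8160 bytes")
    # Extract: an empty salt is replaced by a block of zero bytes.
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def wrapping_key(data_track: bytes) -> bytes:
    """Derive the key that protects the session key from the data track alone."""
    track_hash = hashlib.sha256(data_track).digest()
    # The info label below is hypothetical, chosen for this sketch.
    return hkdf_sha256(track_hash, info=b"cc2-session-wrap")


data_track = b"block-0|block-1|block-2"
key_a = wrapping_key(data_track)
key_b = wrapping_key(data_track)          # same track, different video encode
key_c = wrapping_key(data_track + b"|x")  # any change to the track changes the key
```

Nothing about the video or audio bytes enters the derivation, which is why re-encoding the picture leaves the session intact.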

mpv's JSON IPC protocol is documented at mpv.io/manual/master/#json-ipc.
Session key derivation uses HKDF-SHA256 (RFC 5869) with the data track
SHA-256 hash (FIPS 180-4) as input keying material. The session key
blob is encrypted with ChaCha20-Poly1305 (RFC 8439). Command block
authentication uses HMAC-SHA256 (RFC 2104). The shim is built in Rust
using the matroska-demuxer crate for MKV parsing and the RustCrypto
crate ecosystem (hkdf, sha2, hmac, chacha20poly1305) for all
cryptographic operations.

Synchronized playback

One viewer with embedded commands is a parlor trick. The interesting problem is multiple viewers on different machines, watching the same file, staying in lockstep.

Every shim broadcasts its playback state over Iroh gossip every two seconds: current position, drift from the data track timeline, a confidence score based on buffer health, and whether it is playing, paused, buffering, or seeking. Every peer receives every other peer’s state. No central server. No coordinator chosen in advance. Two peers behind carrier-grade NAT on different continents find each other using nothing but a 32-byte topic ID from the session key.

The conductor

Somebody has to notice when a peer has drifted too far. Rather than designating a leader, the system elects one continuously.

Think about what you want from a conductor. You want the peer tracking the timeline most closely – the one with the smallest absolute drift and a confidence score above 0.7. Ties break by node ID. A peer must hold the lead for two consecutive gossip rounds before assuming the role, which prevents flip-flopping when messages arrive in different orders on different machines.
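
The election rule is small enough to sketch. The version below is one reading of the description above, with several assumptions: the lowest node ID wins a tie, a sitting conductor keeps the role while a challenger is still unconfirmed, and the role is vacated when no peer qualifies. PeerState and ConductorElection are hypothetical names.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.7   # from the text: confidence must be above 0.7
ROUNDS_TO_CONFIRM = 2  # from the text: two consecutive gossip rounds


@dataclass(frozen=True)
class PeerState:
    node_id: str
    drift_ms: float
    confidence: float


def best_candidate(states):
    """Smallest absolute drift among confident peers; ties break by node ID."""
    eligible = [s for s in states if s.confidence > MIN_CONFIDENCE]
    if not eligible:
        return None
    return min(eligible, key=lambda s: (abs(s.drift_ms), s.node_id)).node_id


class ConductorElection:
    def __init__(self):
        self.conductor = None
        self._leader = None
        self._streak = 0

    def observe_round(self, states):
        """Feed one gossip round; return the current conductor (or None)."""
        leader = best_candidate(states)
        if leader is not None and leader == self._leader:
            self._streak += 1
        else:
            self._leader = leader
            self._streak = 1 if leader is not None else 0
        if leader is None:
            self.conductor = None      # nobody qualifies
        elif self._streak >= ROUNDS_TO_CONFIRM:
            self.conductor = leader    # lead held long enough to take over
        return self.conductor
```

Because every peer runs the same rule over the same gossip, they converge on the same conductor without exchanging votes.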

When the conductor detects excessive drift, it issues a correction: a Seek command to nudge a straggler back into range, or a Gate command that pauses everyone until the slow peer catches up. If every peer is buffering and nobody qualifies, a Gate fires automatically. Playback halts until the group reconstitutes.

Drift zones

Not every moment in a video needs the same precision. A dialogue scene tolerates half a second of drift. A synchronized reveal – fifty people seeing the same frame at the same instant – needs a hundred milliseconds or less.

Content authors embed DriftZone blocks at authoring time. Each zone specifies a time range, a threshold, and an enforcement policy: Soft (warn the peer), Hard (force an immediate seek), or Gate (pause everyone at the boundary until the whole group arrives). Outside marked zones, the default threshold is 500 milliseconds.

The video’s own timeline carries the synchronization policy. The shim enforces it.
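
As a sketch of how a shim might evaluate zones, the function below looks up the zone covering the current position and decides which policy, if any, applies to a measured drift. The field names are hypothetical, and the policy used outside marked zones is an assumption (Soft); the text only specifies the 500 millisecond default threshold.

```python
from dataclasses import dataclass
from enum import Enum

DEFAULT_THRESHOLD_MS = 500  # default outside marked zones, per the text


class Policy(Enum):
    SOFT = "soft"  # warn the peer
    HARD = "hard"  # force an immediate seek
    GATE = "gate"  # pause everyone until the group arrives


@dataclass(frozen=True)
class DriftZone:
    start_ms: int
    end_ms: int
    threshold_ms: int
    policy: Policy


def check_drift(zones, position_ms, drift_ms, default_policy=Policy.SOFT):
    """Return the policy to enforce, or None if drift is within tolerance."""
    threshold, policy = DEFAULT_THRESHOLD_MS, default_policy
    for zone in zones:
        if zone.start_ms <= position_ms < zone.end_ms:
            threshold, policy = zone.threshold_ms, zone.policy
            break
    return policy if abs(drift_ms) > threshold else None


# A tight zone around a synchronized reveal at thirty to thirty-six seconds.
zones = [DriftZone(30_000, 36_000, 100, Policy.GATE)]
```

With these zones, 150 milliseconds of drift is ignored during dialogue at ten seconds but triggers a Gate inside the reveal.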

Iroh is a peer-to-peer networking library built by n0 (n0.computer),
providing gossip pub/sub, NAT traversal, and relay fallback over QUIC.
Peer discovery uses Pkarr (pkarr.org), a system for publishing node
addresses to DNS. The conductor election algorithm is original to this
project; the two-round confirmation requirement is a stability measure
in the spirit of Raft's leader election (Ongaro & Ousterhout, "In
Search of an Understandable Consensus Algorithm," 2014), which uses
randomized timeouts to keep elections from repeatedly splitting. It is
adapted here for a leaderless gossip context where the elected role
carries no state authority -- only the right to issue drift
corrections.

Live commands

Everything so far is deterministic – baked into the file at authoring time. But the system also supports live amendments from authorized operators while peers are watching. This is where it becomes more than a fancy subtitle track.

pts_anchor

A live command does not execute “now.” It executes at a specific point on the timeline, declared when the operator signs it:

SignedCommand {
  command:     (the action)
  pts_anchor:  (data track timestamp where this takes effect)
  session_id:  (which session)
  sig:         (Ed25519 signature over all of the above)
}

The pts_anchor field is the idea that makes live injection work. Anchors are expressed in milliseconds on the data track timeline. When an operator broadcasts a command with pts_anchor = 35000, every peer that has not yet reached that point executes it at exactly thirty-five seconds into the video, regardless of when the gossip message arrives. A peer at second thirty-four queues it for one second. A peer at second thirty-six has passed the anchor and logs it without executing.

Think about what this means for replay. Record every SignedCommand broadcast during a live session. Play them back against the same file a week later. You get the same behavior. Every command fires at its anchored timestamp. The system is simultaneously live and deterministic – a broadcast medium and a recording medium – because liveness is pinned to the video clock, not the wall clock. And the video clock is reproducible.
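
The queue-or-skip rule can be sketched as follows. Signature checking is left out, LiveCommandQueue is a hypothetical name, and treating a command that arrives exactly at its anchor as still executable is an assumption.

```python
import heapq


class LiveCommandQueue:
    """Holds live commands until the playback clock reaches their anchor."""

    def __init__(self):
        self._pending = []  # min-heap of (pts_anchor_ms, arrival_seq, command)
        self._seq = 0       # arrival counter, keeps heap entries comparable
        self.skipped = []   # commands that arrived after their anchor had passed

    def receive(self, command, pts_anchor_ms, position_ms):
        """Queue the command, or log it if playback is already past the anchor."""
        if position_ms > pts_anchor_ms:
            self.skipped.append((pts_anchor_ms, command))
            return False
        heapq.heappush(self._pending, (pts_anchor_ms, self._seq, command))
        self._seq += 1
        return True

    def due(self, position_ms):
        """Pop every queued command whose anchor has been reached."""
        fired = []
        while self._pending and self._pending[0][0] <= position_ms:
            anchor, _, command = heapq.heappop(self._pending)
            fired.append((anchor, command))
        return fired


early, late = LiveCommandQueue(), LiveCommandQueue()
early.receive("overlay", 35_000, position_ms=34_000)  # queued for one second
late.receive("overlay", 35_000, position_ms=36_000)   # logged, not executed
```

Replaying a recorded session amounts to feeding the same commands into a fresh queue at position zero: each one fires at its anchor again, since arrival time never enters the decision.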

Authority

Commands carry different weight depending on where they come from:

Level 0  Embedded data track     Pre-signed at authoring time
Level 1  Keyholder (live)        Signed with session key
Level 2  Peer (gossip)           Validated against capability tokens
Level 3  Local shim              Drift correction, buffering

A peer command cannot override an embedded command. A keyholder can amend the timeline but not contradict a Level 0 instruction. The hierarchy is enforced at the shim. No configuration flag relaxes it.
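
One way to picture the rule is as an ordered enum. The sketch below assumes that a lower level number always outranks a higher one and that a command can only displace one from a strictly lower-ranked source; the text does not say how two commands at the same level interact, so that part is a guess.

```python
from enum import IntEnum


class Authority(IntEnum):
    EMBEDDED = 0    # pre-signed at authoring time
    KEYHOLDER = 1   # live, signed with the session key
    PEER = 2        # gossip, validated against capability tokens
    LOCAL_SHIM = 3  # drift correction, buffering


def may_override(new: Authority, existing: Authority) -> bool:
    """True if a command from `new` may displace one issued by `existing`."""
    # Assumption: strictly lower number wins; equal levels do not override.
    return new < existing
```

Because the check is a plain comparison with no parameters, there is nothing for a configuration flag to relax.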

Capability tokens let a session operator grant scoped, time-limited permissions – “you may inject commands for the next sixty seconds of video” – without sharing the master signing key. The expiry is measured in video time, not wall clock time. Fast-forward and your token expires faster. Pause and it does not tick down.
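
A token with video-time expiry can be sketched as a window on the timeline. The field names are hypothetical and the signature over the token is omitted; the point is only that validity is a function of playback position, never of the wall clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityToken:
    holder: str
    valid_from_pts_ms: int   # video time at which the grant begins
    valid_until_pts_ms: int  # video time at which the grant ends

    def is_valid_at(self, position_ms: int) -> bool:
        """Check the token against the playback position, not the wall clock."""
        return self.valid_from_pts_ms <= position_ms < self.valid_until_pts_ms


# "You may inject commands for the next sixty seconds of video."
token = CapabilityToken("peer-1", valid_from_pts_ms=10_000, valid_until_pts_ms=70_000)
```

A paused player keeps reporting the same position, so the answer does not change however long the pause lasts; a forward seek past seventy seconds ends the grant immediately.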

Ed25519 signatures are specified in RFC 8032. The SPAKE2 password-
authenticated key exchange used in the Wormhole handshake is specified
in RFC 9382. Live commands are serialized with rmp-serde (msgpack
struct encoding, deterministic array form) to ensure identical byte
output across implementations before signing. The pts_anchor concept
-- pinning live commands to the media timeline rather than wall clock
-- is original to this project; a related approach exists in MPEG-DASH
(ISO 23009-1) where media presentation time anchors segment requests,
though DASH does not support authenticated command injection.

The steganographic fallback

The data track works when you control file distribution. But upload a video to YouTube and the platform transcodes it – re-encodes the pixels, re-encodes the audio, strips everything it does not recognize. The C2 tracks vanish.

The stego path solves this by encoding commands in the video signal itself. The data lives in the same domain as the picture. To destroy the commands, you would have to destroy the image.

Three channels span the robustness/bandwidth tradeoff:

S1 operates at roughly two bits per second by placing scene cuts at controlled frame offsets. The shim detects them from keyframe flags in the bitstream – no pixel decoding required. Scene cuts are structural features of the video; they survive every transcoding pipeline tested. S1 is a bootstrap channel. It delivers a salt and an encrypted hint telling the shim where to find the primary payload.

S2 runs at roughly eight bits per second. In each I-frame, 64 pairs of mid-frequency DCT coefficients are selected by a PRNG seeded with the extraction key. For each pair, the relative magnitude encodes one bit. Reed-Solomon error correction recovers bits that transcoding flipped. Five minutes of video carries about 300 bytes – enough for a full command schedule. Without the key, you do not know which coefficients to compare. The video looks normal. There is no fixed pattern to find.
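
The pair-comparison idea behind S2 can be shown on a plain list of integers standing in for one I-frame's mid-frequency coefficients. Everything here is a sketch under stated assumptions: Python's random module stands in for whatever keyed PRNG the real tooling uses (it is not cryptographically secure), the margin value is invented, and Reed-Solomon coding is left out.

```python
import hashlib
import random

PAIRS_PER_FRAME = 64  # from the text


def select_pairs(key: bytes, frame_index: int, candidate_count: int):
    """Pick 64 disjoint coefficient pairs, determined by the key and frame."""
    seed = hashlib.sha256(key + frame_index.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(seed, "big"))  # stand-in PRNG
    picks = rng.sample(range(candidate_count), 2 * PAIRS_PER_FRAME)
    return list(zip(picks[0::2], picks[1::2]))


def embed_bits(coeffs, bits, pairs, margin=2):
    """Force |first| > |second| for a 1 bit and |first| < |second| for a 0."""
    out = list(coeffs)
    for bit, (i, j) in zip(bits, pairs):
        hi, lo = (i, j) if bit else (j, i)
        if abs(out[hi]) < abs(out[lo]):
            out[hi], out[lo] = out[lo], out[hi]
        if abs(out[hi]) - abs(out[lo]) < margin:
            sign = -1 if out[hi] < 0 else 1
            out[hi] = sign * (abs(out[lo]) + margin)  # widen the gap
    return out


def extract_bits(coeffs, pairs):
    """Read each bit back from the relative magnitude of its pair."""
    return [1 if abs(coeffs[i]) > abs(coeffs[j]) else 0 for i, j in pairs]


key = b"demo key"
coeffs = [((k * 37) % 41) - 20 for k in range(256)]  # fake coefficients
bits = [(k // 3) % 2 for k in range(PAIRS_PER_FRAME)]
pairs = select_pairs(key, frame_index=0, candidate_count=len(coeffs))
recovered = extract_bits(embed_bits(coeffs, bits, pairs), pairs)
```

The margin is what buys robustness: requantization has to move a pair's magnitudes by more than the gap before a bit flips, and the error correction is there for the pairs where it does.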

S3 is the fallback at four bits per second, activated when transcoding damage exceeds S2’s Reed-Solomon capacity. It divides each frame into a grid and compares average luminance between selected cell pairs. Luminance averages integrate over hundreds of pixels and barely shift even under brutal requantization. Coarser than S2, but harder to kill.
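
The extraction side of S3 can be sketched on a frame represented as a 2-D list of luma values. The grid size and the cell pairs are passed in as parameters here; in the scheme described above the pairs would be chosen by the key, and the 4x4 default is an assumption.

```python
def cell_means(frame, rows, cols):
    """Average luminance of each grid cell, in row-major order."""
    height, width = len(frame), len(frame[0])
    means = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            total = sum(sum(row[x0:x1]) for row in frame[y0:y1])
            means.append(total / ((y1 - y0) * (x1 - x0)))
    return means


def extract_s3_bits(frame, cell_pairs, rows=4, cols=4):
    """One bit per cell pair: 1 if the first cell is brighter on average."""
    means = cell_means(frame, rows, cols)
    return [1 if means[a] > means[b] else 0 for a, b in cell_pairs]


# An 8x8 test frame: bright left half, dark right half.
frame = [[100] * 4 + [50] * 4 for _ in range(8)]
# Per-pixel noise of +/-3 in a checkerboard, a crude stand-in for requantization.
noisy = [[v + (3 if (x + y) % 2 else -3) for x, v in enumerate(row)]
         for y, row in enumerate(frame)]
clean_bits = extract_s3_bits(frame, [(0, 1), (3, 2)], rows=2, cols=2)
noisy_bits = extract_s3_bits(noisy, [(0, 1), (3, 2)], rows=2, cols=2)
```

The noise changes every pixel, yet both frames decode to the same bits, because the per-cell averages barely move.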

All three channels are key-dependent. Without the passphrase or a Wormhole-delivered key, the shim scans the first thirty I-frames for a calibration pattern at key-dependent locations and finds nothing. It exits silently. No output, no error, no indication that it looked. The calibration scan also measures the bit error rate, which tells the shim whether to use S2 or fall back to S3. One set of test frames yields two answers: is this C2 content, and how damaged is the channel.
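
The two-answers-from-one-scan idea reduces to a bit error rate and two cutoffs. Both cutoffs below are invented for illustration; the only fixed point is that a wrong key reads essentially random bits, so its error rate sits near one half.

```python
def bit_error_rate(expected, observed):
    """Fraction of calibration bits that came back wrong."""
    if not expected or len(expected) != len(observed):
        raise ValueError("calibration patterns must be non-empty and equal length")
    errors = sum(e != o for e, o in zip(expected, observed))
    return errors / len(expected)


def choose_channel(ber, s2_limit=0.05, presence_limit=0.25):
    """Return "S2", "S3", or None when no calibration pattern is present.

    Both limits are hypothetical values chosen for this sketch.
    """
    if ber > presence_limit:
        return None  # looks like noise: not C2 content, or the wrong key
    return "S2" if ber <= s2_limit else "S3"


pattern = [1, 0, 1, 1, 0, 0, 1, 0]
lightly_damaged = [1, 0, 1, 1, 0, 0, 1, 1]  # one flipped bit out of eight
```

A None result is the silent exit described above: the caller simply stops, with nothing to report.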

The adversary here is not a human trying to scrub a watermark. It is YouTube’s transcoder, indifferent and mechanical, trying to compress a video. The stego data needs to survive one round of automated, unintentional destruction. If the channel is too damaged, the shim falls back. If the fallback fails, it stays silent. No data is better than wrong data.

The transform stage of video compression, an integer approximation of
the Discrete Cosine Transform, is defined in ITU-T H.264 (ISO/IEC
14496-10), sec. 8.5. Mid-frequency coefficient
steganography builds on Jsteg (Upham, 1993) and the F5 algorithm
(Westfeld, 2001), adapted here for video I-frames rather than JPEG
stills. Reed-Solomon error correction was introduced by Reed & Solomon,
"Polynomial Codes Over Certain Finite Fields" (1960). Passphrase-based
key derivation uses Argon2id (RFC 9106). The observation that platform
transcoders are indifferent adversaries -- destroying stego data as
collateral damage rather than targeting it -- distinguishes this threat
model from traditional steganographic security (Cachin, "An Information-
Theoretic Model for Steganography," 1998), where the adversary
actively searches for hidden content.

What this proves

The proof of concept runs end to end. A demo orchestrator authors a fifty-second video with nine embedded command blocks – heartbeats, scene markers, drift zones, on-screen text at eight seconds, a Gate command at thirty-five seconds that pauses all peers. The shim reads the hidden tracks, derives the session key, launches mpv, and executes blocks at their timestamps. The player has no idea. A second shim joins via gossip, the conductor election runs, and within four seconds the two peers are synchronized with active drift monitoring. A keyholder injects a live command; both peers execute it at the same video timestamp regardless of network latency.

Separate Python tooling encodes commands into I-frame DCT coefficients and extracts them given a passphrase. The stego path is implemented but not yet wired into the shim’s automatic fallback – the proof of concept focused on the data track path first. The next test is uploading to a platform, downloading the transcoded result, and measuring what survives.

The source code is at crates/cc2-shim (shim binary), crates/cc2-core
(block schema), crates/cc2-mux (embedding pipeline), and crates/
cc2-stego (steganographic channels). The demo orchestrator is demo.sh.
Test fixtures including the nine-block demo schedule are in tests/
fixtures/demo_schedule.json. Python prototype tooling is in tools/
(mkvembed.py, stego_embed.py, stego_extract.py).

Why it matters

We think of video as inert content. Data that flows in one direction, from storage to screen, and does nothing except become light and sound. You open a file. You watch it. It does not act.

But the VBI engineers understood something that still holds: if you have a timed data channel, you can make it carry instructions. The instructions ride alongside the content at zero marginal cost, because the channel already exists. You do not need a second connection, a signaling server, or the viewer’s awareness.

A film distributor sends an MKV to fifty theaters; each runs the shim, and every screen fires the same embedded cues at the same points on the timeline, with no network dependency during playback. A sports analyst injects live overlays that all subscribers see at the same point in the timeline, even if they started playback hours apart. A journalist uploads a video to YouTube with commands encoded in the pixels; anyone with the passphrase extracts them after downloading via yt-dlp. A lecture carries timed quizzes that replay deterministically when a student rewinds.

None of these require modifying the player, the container format, or the network. The control layer rides inside the content.

The VBI carried closed captions and teletext. Cinema C2 carries a programmable command channel. The principle is the same one those engineers understood eighty years ago: if you have a signal, you can make it carry its own instructions. The channel is there. You just have to use it.

The HbbTV standard (ETSI TS 102 796) is a modern descendant of the
VBI concept, embedding interactive applications in DVB broadcast
streams. SCTE-35 (ANSI/SCTE 35) carries ad insertion cues in MPEG
transport streams -- another case of in-band signaling riding alongside
video content. Cinema C2 differs from both in that the control channel
is cryptographically authenticated, encrypted, peer-synchronized, and
designed to survive container-stripping via steganographic fallback.