A post-production facility built an AI system to analyze raw multi-track audio from film and television shoots. The system identifies technical issues (noise, clipping, hum), classifies content by track type, transcribes dialogue, and generates automated production notes - work that previously required a sound editor to manually review hours of raw footage.

The Problem Context

Film production generates large volumes of raw audio: production dialogue, boom microphone tracks, wireless lapel feeds, ambient recording, and scratch tracks. For a typical television episode, this is 12-20 hours of raw material per shooting day.

Before an editor can work with this material, someone must review it: identify which takes are usable, note technical problems, verify that sync marks are present, and produce a set of notes describing the audio quality and content of each clip. Manual review takes 4-6 hours per shooting day.

Architecture

Ingestion and demuxing - Raw multi-track audio files (typically 48-channel BWF files) are demuxed into individual channel tracks using FFmpeg in an AWS Lambda function triggered by S3 upload. Metadata from the production sound mixer (track assignments, scene/take labels from the recorder) is parsed and attached as S3 object metadata.

Technical analysis layer - A custom DSP analysis function runs on each track and computes: peak levels, noise floor, clipping events (samples at 0dBFS), frequency spectrum profile, and dynamic range. This analysis runs without AI - it is deterministic signal processing. Results flag tracks with clipping, excessive noise floor, or unusual spectral characteristics for attention.

Transcription and speaker identification - Dialogue tracks are sent to Amazon Transcribe with custom vocabulary enabled for production-specific terminology (character names, location names, technical terms from the script). Speaker diarization identifies individual speakers per clip.

AI analysis and note generation - An LLM receives the technical analysis results, transcription, and production metadata (scene number, take number, script excerpt for the scene) and generates production notes: whether the take is technically clean, which specific issues were detected, whether the dialogue matches the script, and a recommendation (print/hold/print-if-needed).

Integration with Production Workflow

The system writes notes directly to the production’s editorial database in a format compatible with the editing software. Notes appear in the editor’s interface alongside the audio clips, reducing the time between shooting and editorial pickup.

A confidence threshold determines which notes are automatically written versus queued for sound editor review. Technical analysis findings are always written automatically. AI-generated recommendations are written automatically only when confidence exceeds 0.85; lower-confidence notes are held for review.

Key Lessons

Custom vocabulary has significant impact on transcription accuracy. Production dialogue contains character names, location names, and script-specific terms not in the general vocabulary. Adding a 200-term custom vocabulary reduced transcription error rate on proper nouns by 60%.

The LLM note generation step works best with structured context. Initially, raw transcription text was passed to the model. Adding structured context (scene number, character names in scene, expected dialogue from the script) dramatically improved the relevance of generated notes and reduced hallucinated observations.

Track assignment metadata from the recorder is often inconsistent. Sound mixers use different labeling conventions, abbreviations, and channel assignments across productions. Building a normalization step that maps recorder-specific labels to canonical track types before analysis was necessary to make the system work across multiple productions without per-production configuration.