Building an AI Video Pipeline on AWS

Architecture guide for an end-to-end AI video pipeline: S3 ingest, Lambda trigger, Rekognition analysis, Bedrock processing, FFmpeg editing, and Step Functions orchestration.

Added 24 Mar 2026 4 min read Updated 30 May 2026

#media-processing #advanced #video-pipeline #media #architecture #transcoding #aws

Learn this your way

Read Guided course

An AI video pipeline automates the process of ingesting raw video, extracting intelligence from it, and producing edited or enriched output. This article describes a production-ready architecture built on AWS that handles media ingest through final output delivery.

VideoFlow: the pipeline in two minutes

VideoFlow is the worked example behind this guide: one upload becomes three finished cuts, automatically, for about 34 cents. The full beginner course walks the architecture layer by layer.

Pipeline Overview

The pipeline has five conceptual stages:

Ingest - video arrives in S3 and triggers the pipeline
Normalize - MediaConvert converts raw formats to a consistent baseline
Analyze - Rekognition extracts labels, scenes, faces, and text; Transcribe produces a transcript
Process - Bedrock summarizes content, identifies highlights, generates metadata
Edit and output - FFmpeg assembles selected segments; output lands in S3

Step Functions orchestrates the entire workflow, with EventBridge triggering execution on S3 upload.

Stage 1: Ingest

Raw video arrives in an S3 bucket at the raw/ prefix. Videos can come from direct upload (content producers using presigned URLs), API submission, or automated feed ingestion from broadcast systems.

An EventBridge rule matches ObjectCreated events at the raw/ prefix and starts a Step Functions execution, passing the S3 bucket and key as input. The trigger is near-instantaneous - processing begins within seconds of upload completion.

Stage 2: Normalize

A Lambda function submits a MediaConvert job that converts the raw file to H.264 MP4 at 1080p with AAC audio. The output lands at the normalized/ prefix. MediaConvert publishes a completion event to EventBridge, which the Step Functions waitForTaskToken pattern uses to resume the workflow.

For cost optimization, the pipeline also creates a 360p proxy simultaneously. Downstream analysis runs against the proxy; final output uses the 1080p normalized file.

Stage 3: Analyze

Analysis runs in parallel (Step Functions parallel state) to minimize total pipeline time:

Rekognition video analysis - a Lambda function starts an async Rekognition StartLabelDetection job against the normalized video. The job identifies objects, activities, and scenes throughout the video with timestamps and confidence scores. Completion triggers via SNS topic configured in the Rekognition request.

Amazon Transcribe - a Lambda function submits a transcription job. The output is a JSON transcript with word-level timestamps and speaker labels (diarization enabled). On completion, a Lambda function formats the transcript as plain text and stores it alongside the video.

Frame extraction - Lambda runs FFmpeg to extract one frame per 5 seconds as JPEG thumbnails, stored in S3. These feed into Rekognition image analysis for any frame-level capabilities not covered by the video job (celebrity recognition, faces).

Stage 4: Bedrock Processing

A Lambda function aggregates the analysis results and calls Bedrock (Claude) with a structured prompt:

Rekognition labels and their timestamps provide the scene timeline
The transcript provides the audio content
The request: identify the 3-5 most significant segments, generate a summary, extract metadata tags

Bedrock returns a structured JSON response (via tool use or structured output) listing segment timestamps, a 150-word summary, and category tags. A Lambda function parses this and stores it as a metadata document alongside the video.

Stage 5: Edit and Output

Lambda runs FFmpeg to trim and concatenate the highlight segments identified by Bedrock. For each segment: ffmpeg -i normalized.mp4 -ss [start] -t [duration] -c copy segment_N.mp4. After all segments are cut, FFmpeg concatenates them: ffmpeg -f concat -safe 0 -i filelist.txt -c copy highlights.mp4.

The output (full normalized video, highlight reel, transcript, metadata JSON, thumbnails) lands at the output/ prefix. A final Step Functions state sends an SNS notification with the output location and a summary of extracted metadata.

Error Handling

Each Step Functions state has a configured Catch block routing failures to an error handler Lambda that logs structured error details to CloudWatch and publishes to an alert SNS topic. The Step Functions execution history provides a visual timeline of where failures occurred.

For transient errors (Rekognition throttle, temporary S3 unavailability), states use Retry with exponential backoff before failing.

Cost Profile

For a typical 10-minute news segment: MediaConvert ($0.015), Rekognition video ($0.10), Transcribe ($0.024), Bedrock Claude call ($0.03), Lambda executions ($0.001), S3 storage (negligible). Total approximately $0.17 per 10-minute video.

Amazon S3 - storage layer
Amazon EventBridge - pipeline trigger
AWS Elemental MediaConvert - normalization
FFmpeg - editing and assembly
Amazon Rekognition - video analysis

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session