Building an AI Video Pipeline on AWS
Architecture guide for an end-to-end AI video pipeline: S3 ingest, Lambda trigger, Rekognition analysis, Bedrock processing, FFmpeg editing, and Step Functions orchestration.
An AI video pipeline automates the process of ingesting raw video, extracting intelligence from it, and producing edited or enriched output. This article describes a production-ready architecture built on AWS that handles media ingest through final output delivery.
Pipeline Overview
The pipeline has five conceptual stages:
- Ingest - video arrives in S3 and triggers the pipeline
- Normalize - MediaConvert converts raw formats to a consistent baseline
- Analyze - Rekognition extracts labels, scenes, faces, and text; Transcribe produces a transcript
- Process - Bedrock summarizes content, identifies highlights, generates metadata
- Edit and output - FFmpeg assembles selected segments; output lands in S3
Step Functions orchestrates the entire workflow, with EventBridge triggering execution on S3 upload.
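The orchestration described above can be sketched as an Amazon States Language definition, shown here as a Python dict. The Lambda function names and the input paths in Parameters are placeholders invented for this sketch; only the service-integration resource ARNs (lambda:invoke, sns:publish) are real Step Functions identifiers.

```python
# Hypothetical ASL skeleton for the five-stage pipeline. Function names such
# as "submit-mediaconvert-job" are illustrative, not part of the article.
state_machine = {
    "Comment": "AI video pipeline: normalize -> analyze -> process -> edit -> notify",
    "StartAt": "Normalize",
    "States": {
        "Normalize": {
            "Type": "Task",
            # waitForTaskToken pauses until the MediaConvert completion event
            # returns the token (see Stage 2).
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "submit-mediaconvert-job",
                "Payload": {"taskToken.$": "$$.Task.Token", "input.$": "$"},
            },
            "Next": "Analyze",
        },
        "Analyze": {
            "Type": "Parallel",  # Rekognition, Transcribe, frame extraction run concurrently
            "Branches": [
                {"StartAt": "Rekognition", "States": {"Rekognition": {
                    "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": "start-rekognition"}, "End": True}}},
                {"StartAt": "Transcribe", "States": {"Transcribe": {
                    "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": "start-transcribe"}, "End": True}}},
                {"StartAt": "Frames", "States": {"Frames": {
                    "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": "extract-frames"}, "End": True}}},
            ],
            "Next": "Process",
        },
        "Process": {
            "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "bedrock-highlights"}, "Next": "EditAndOutput",
        },
        "EditAndOutput": {
            "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "ffmpeg-assemble"}, "Next": "Notify",
        },
        "Notify": {
            "Type": "Task", "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn.$": "$.notifyTopicArn", "Message.$": "$.summary"},
            "End": True,
        },
    },
}
```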
Stage 1: Ingest
Raw video arrives in an S3 bucket at the raw/ prefix. Videos can come from direct upload (content producers using presigned URLs), API submission, or automated feed ingestion from broadcast systems.
An EventBridge rule matches ObjectCreated events at the raw/ prefix and starts a Step Functions execution, passing the S3 bucket and key as input. The trigger is near-instantaneous - processing begins within seconds of upload completion.
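The EventBridge rule and the execution input can be sketched as follows. The bucket name is an assumption; the "Object Created" detail-type and the prefix-matching syntax are standard EventBridge patterns for S3 notifications.

```python
import json

# Hypothetical EventBridge event pattern matching uploads under raw/
# in an assumed ingest bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["my-video-ingest-bucket"]},  # assumed bucket name
        "object": {"key": [{"prefix": "raw/"}]},
    },
}

def execution_input(event: dict) -> str:
    """Build the Step Functions execution input (bucket and key)
    from a matched S3 Object Created event."""
    detail = event["detail"]
    return json.dumps({
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
    })
```

The rule's target is the state machine itself, so no Lambda is needed just to start the workflow.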
Stage 2: Normalize
A Lambda function submits a MediaConvert job that converts the raw file to H.264 MP4 at 1080p with AAC audio. The output lands at the normalized/ prefix. MediaConvert publishes a completion event to EventBridge, which the Step Functions waitForTaskToken pattern uses to resume the workflow.
For cost optimization, the same MediaConvert job also produces a 360p proxy. Downstream analysis runs against the proxy; the final output uses the 1080p normalized file.
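A minimal sketch of the MediaConvert job Settings for this stage, producing the 1080p normalized file and the 360p proxy in one job. Field names follow the MediaConvert API; the bitrate and quality values are assumptions, not recommendations from the article.

```python
def mediaconvert_settings(input_uri: str, dest_uri: str) -> dict:
    """Sketch of MediaConvert Settings: H.264 MP4 at 1080p plus a 360p proxy,
    both with AAC audio, written under the given S3 destination prefix."""
    def output(height: int, modifier: str) -> dict:
        return {
            "NameModifier": modifier,
            "ContainerSettings": {"Container": "MP4"},
            "VideoDescription": {
                "Height": height,  # width follows the source aspect ratio
                "CodecSettings": {
                    "Codec": "H_264",
                    "H264Settings": {
                        "RateControlMode": "QVBR",
                        "QvbrSettings": {"QvbrQualityLevel": 8},  # assumed quality level
                        "MaxBitrate": 5_000_000,
                    },
                },
            },
            "AudioDescriptions": [{
                "CodecSettings": {
                    "Codec": "AAC",
                    "AacSettings": {"Bitrate": 96000, "CodingMode": "CODING_MODE_2_0",
                                    "SampleRate": 48000},
                },
            }],
        }
    return {
        "Inputs": [{"FileInput": input_uri,
                    "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}}}],
        "OutputGroups": [{
            "OutputGroupSettings": {"Type": "FILE_GROUP_SETTINGS",
                                    "FileGroupSettings": {"Destination": dest_uri}},
            "Outputs": [output(1080, "_1080p"), output(360, "_proxy360")],
        }],
    }
```

The Lambda would pass this dict to the MediaConvert CreateJob call along with the account's role and queue.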
Stage 3: Analyze
Analysis runs in parallel (Step Functions parallel state) to minimize total pipeline time:
Rekognition video analysis - a Lambda function starts an async Rekognition StartLabelDetection job against the normalized video. The job identifies objects, activities, and scenes throughout the video with timestamps and confidence scores. Completion is signaled through the SNS topic configured in the request's NotificationChannel.
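Once the job completes, its results are fetched with GetLabelDetection. A small helper can flatten the response into a timeline for the Bedrock prompt; the response shape (Labels, Timestamp in milliseconds, nested Label with Name and Confidence) follows the Rekognition API, while the 80% threshold is an assumption of this sketch.

```python
def label_timeline(response: dict, min_confidence: float = 80.0) -> list[tuple[float, str]]:
    """Flatten a Rekognition GetLabelDetection response into sorted
    (seconds, label_name) pairs, keeping only confident detections."""
    return sorted(
        (item["Timestamp"] / 1000.0, item["Label"]["Name"])  # Timestamp is in ms
        for item in response.get("Labels", [])
        if item["Label"]["Confidence"] >= min_confidence
    )
```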
Amazon Transcribe - a Lambda function submits a transcription job. The output is a JSON transcript with word-level timestamps and speaker labels (diarization enabled). On completion, a Lambda function formats the transcript as plain text and stores it alongside the video.
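The formatting step can be sketched against the Transcribe output JSON, whose results object carries both the full transcript and word-level items with timings (punctuation items have no timestamps):

```python
def plain_transcript(result: dict) -> str:
    """Extract the plain-text transcript from a Transcribe job result."""
    return result["results"]["transcripts"][0]["transcript"]

def word_timings(result: dict) -> list[tuple[float, str]]:
    """(start_seconds, word) pairs from word-level items; punctuation
    items carry no start_time and are skipped."""
    return [
        (float(item["start_time"]), item["alternatives"][0]["content"])
        for item in result["results"]["items"]
        if item["type"] == "pronunciation"
    ]
```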
Frame extraction - Lambda runs FFmpeg to extract one frame every 5 seconds as JPEG thumbnails, stored in S3. These feed into Rekognition image APIs for any frame-level capabilities the video job does not cover (for example, Custom Labels, which has no video equivalent).
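The frame-extraction command the Lambda would run can be built as an argv list; the fps filter expression and the -q:v JPEG quality flag are standard FFmpeg usage, while the output path is an assumption.

```python
def frame_extract_cmd(src: str, out_dir: str, every_seconds: int = 5) -> list[str]:
    """ffmpeg argv extracting one JPEG every `every_seconds` seconds.
    fps=1/N emits one frame per N seconds; -q:v 2 is near-best JPEG quality."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"fps=1/{every_seconds}",
        "-q:v", "2",
        f"{out_dir}/frame_%04d.jpg",  # numbered thumbnails for upload to S3
    ]
```

In the Lambda this list would be passed to subprocess.run with a static FFmpeg binary bundled in a layer.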
Stage 4: Bedrock Processing
A Lambda function aggregates the analysis results and calls Bedrock (Claude) with a structured prompt:
- Rekognition labels and their timestamps provide the scene timeline
- The transcript provides the audio content
- The request: identify the 3-5 most significant segments, generate a summary, extract metadata tags
Bedrock returns a structured JSON response (via tool use or structured output) listing segment timestamps, a 150-word summary, and category tags. A Lambda function parses this and stores it as a metadata document alongside the video.
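The parsing Lambda should validate the model output before any FFmpeg work depends on it. The JSON shape below (segments with start/end seconds, summary, tags) is this sketch's assumption about the structured output, not a Bedrock-defined schema:

```python
import json

def parse_highlights(model_output: str) -> dict:
    """Validate the structured JSON returned by the Bedrock call.
    Assumed shape: {"segments": [{"start": sec, "end": sec}, ...],
                    "summary": str, "tags": [str]}."""
    doc = json.loads(model_output)
    segments = doc["segments"]
    if not 1 <= len(segments) <= 5:
        raise ValueError("expected between 1 and 5 highlight segments")
    for seg in segments:
        if not 0 <= seg["start"] < seg["end"]:
            raise ValueError(f"bad segment times: {seg}")
    if "summary" not in doc or "tags" not in doc:
        raise ValueError("missing summary or tags")
    return doc
```

Rejecting malformed output here keeps a bad model response from producing a broken highlight reel downstream.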
Stage 5: Edit and Output
Lambda runs FFmpeg to trim and concatenate the highlight segments identified by Bedrock. Each segment is cut with ffmpeg -ss [start] -i normalized.mp4 -t [duration] -c copy segment_N.mp4 (placing -ss before -i seeks quickly on input; because -c copy avoids re-encoding, cut points snap to the nearest keyframe, so boundaries are approximate). After all segments are cut, FFmpeg concatenates them with the concat demuxer: ffmpeg -f concat -safe 0 -i filelist.txt -c copy highlights.mp4.
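The cut and concat steps can be sketched as command builders; the flags match the commands above, and the file names are placeholders.

```python
def cut_cmd(src: str, start: float, duration: float, out: str) -> list[str]:
    """ffmpeg argv for one segment cut. -ss before -i seeks on input;
    -c copy avoids re-encoding, so cuts land on keyframes."""
    return ["ffmpeg", "-ss", f"{start}", "-i", src,
            "-t", f"{duration}", "-c", "copy", out]

def concat_list(paths: list[str]) -> str:
    """Contents of filelist.txt in the concat demuxer's 'file' syntax."""
    return "\n".join(f"file '{p}'" for p in paths) + "\n"

def concat_cmd(list_path: str, out: str) -> list[str]:
    """ffmpeg argv joining the segments listed in filelist.txt.
    -safe 0 permits the absolute paths Lambda writes under /tmp."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out]
```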
The output (full normalized video, highlight reel, transcript, metadata JSON, thumbnails) lands at the output/ prefix. A final Step Functions state sends an SNS notification with the output location and a summary of extracted metadata.
Error Handling
Each Step Functions state has a configured Catch block routing failures to an error handler Lambda that logs structured error details to CloudWatch and publishes to an alert SNS topic. The Step Functions execution history provides a visual timeline of where failures occurred.
For transient errors (Rekognition throttle, temporary S3 unavailability), states use Retry with exponential backoff before failing.
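A Retry policy of this shape goes in each Task state's Retry field; the error names here are assumptions to illustrate the pattern and should match the errors your tasks actually raise (States.TaskFailed is a built-in Step Functions error name).

```python
# Hypothetical Retry block for a Step Functions Task state.
retry_policy = [
    {
        "ErrorEquals": ["ThrottlingException", "States.TaskFailed"],
        "IntervalSeconds": 2,   # first retry after 2 s
        "MaxAttempts": 5,
        "BackoffRate": 2.0,     # waits of 2 s, 4 s, 8 s, 16 s between attempts
    }
]
```

With these values a throttled call is retried for roughly 30 seconds of cumulative backoff before the state's Catch block takes over.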
Cost Profile
For a typical 10-minute news segment: MediaConvert ($0.015), Rekognition video ($0.10), Transcribe ($0.024), Bedrock Claude call ($0.03), Lambda executions ($0.001), S3 storage (negligible). Total approximately $0.17 per 10-minute video.
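The per-service figures above sum to the quoted total, which is easy to sanity-check:

```python
# Per-video cost figures from the article, summed as a sanity check.
costs = {
    "MediaConvert": 0.015,
    "Rekognition video": 0.10,
    "Transcribe": 0.024,
    "Bedrock (Claude)": 0.03,
    "Lambda": 0.001,
}
total = sum(costs.values())  # approximately 0.17
```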
Related Articles
- Amazon S3 - storage layer
- Amazon EventBridge - pipeline trigger
- AWS Elemental MediaConvert - normalization
- FFmpeg - editing and assembly
- Amazon Rekognition - video analysis