YTVideoTranscriber: Automated YouTube Transcription

Production-ready system for automated YouTube channel monitoring and video transcription. Leverages OpenAI Whisper and WhisperX for speech-to-text with speaker diarization, featuring a Python backend, FastAPI REST API, and Next.js dashboard.

Python, FastAPI, OpenAI Whisper, WhisperX, Next.js 15, React 19, SQLAlchemy, yt-dlp, Speaker Diarization

The Challenge

Content creators, researchers, and businesses need to monitor YouTube channels and extract accurate transcriptions from videos at scale. Manual transcription is time-consuming, expensive, and lacks speaker identification capabilities.

• Monitoring multiple YouTube channels for new content requires constant attention
• Manual transcription is slow and error-prone, especially for long-form content
• Identifying who said what (speaker diarization) is nearly impossible manually
• Exporting to multiple formats (SRT, JSON, TXT) requires additional processing

Our Solution

We built YTVideoTranscriber as a comprehensive automated pipeline that monitors YouTube channels via RSS feeds, downloads audio using yt-dlp, and transcribes content using OpenAI Whisper with WhisperX for speaker diarization. The system includes a full-stack dashboard for management and search.

• Automated channel monitoring discovers new videos via RSS feeds
• High-accuracy transcription using OpenAI Whisper models (tiny to large)
• Speaker diarization identifies and labels different speakers in conversations
• Multi-format output: JSON with timestamps, plain text, and SRT subtitles
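
The audio download step could look roughly like the sketch below. It assumes yt-dlp's Python API (`yt_dlp.YoutubeDL`) with ffmpeg installed for audio extraction. The helper names, the output directory, and the MP3 codec are illustrative assumptions, not the project's actual configuration.

```python
from pathlib import Path


def build_audio_options(output_dir: str, codec: str = "mp3") -> dict:
    """Build a yt-dlp options dict that keeps only the audio track.

    The output layout and codec are assumptions made for this sketch.
    """
    return {
        # Prefer an audio-only stream, fall back to the best combined one.
        "format": "bestaudio/best",
        # Name each file after the video ID so repeated runs are easy to detect.
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        # Convert the download to a single audio codec via ffmpeg.
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec}
        ],
        "quiet": True,
    }


def download_audio(video_url: str, output_dir: str = "audio") -> None:
    """Download one video's audio track (needs yt-dlp and ffmpeg installed)."""
    import yt_dlp  # imported lazily so the options helper has no dependencies

    with yt_dlp.YoutubeDL(build_audio_options(output_dir)) as ydl:
        ydl.download([video_url])
```

Keeping the options in a separate function makes the configuration easy to inspect and test without touching the network.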

System Architecture

YTVideoTranscriber follows a three-tier architecture with clear separation between the CLI/orchestration layer, REST API, and web dashboard. The core pipeline handles video discovery, audio extraction, and AI-powered transcription with speaker identification.

Orchestrator

Central coordinator managing the entire transcription pipeline from discovery to output

Python, Click CLI, State machine for video processing

YouTube Monitor

Discovers new videos from subscribed channels using RSS feeds and yt-dlp

RSS parsing, yt-dlp integration, Duplicate detection
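
A minimal sketch of RSS-based discovery with duplicate detection might look like this. It assumes YouTube's public Atom feed per channel (`https://www.youtube.com/feeds/videos.xml?channel_id=...`); the function names and the in-memory `seen_ids` set are hypothetical stand-ins for whatever storage the project really uses.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by YouTube's Atom channel feeds.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}


def feed_url(channel_id: str) -> str:
    """Return the Atom feed URL for a channel ID."""
    return f"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"


def find_new_videos(feed_xml: str, seen_ids: set) -> list:
    """Return feed entries whose video ID is not yet in seen_ids.

    seen_ids is updated in place, so a second call with the same feed
    returns nothing (simple duplicate detection).
    """
    root = ET.fromstring(feed_xml)
    new_videos = []
    for entry in root.findall("atom:entry", NS):
        video_id = entry.findtext("yt:videoId", namespaces=NS)
        if video_id is None or video_id in seen_ids:
            continue
        seen_ids.add(video_id)
        new_videos.append(
            {
                "video_id": video_id,
                "title": entry.findtext("atom:title", default="", namespaces=NS),
                "published": entry.findtext("atom:published", default="", namespaces=NS),
            }
        )
    return new_videos
```

Fetching the feed itself (for example with `urllib.request.urlopen(feed_url(channel_id))`) is left out so the parsing logic stays testable offline.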

Transcription Engine

Core AI engine using Whisper for STT and WhisperX for alignment and speaker diarization

OpenAI Whisper, WhisperX, GPU acceleration, Multiple model sizes

FastAPI REST Server

Full-featured API with 25+ endpoints for channel management, transcription control, and search

FastAPI, SQLAlchemy ORM, SQLite/PostgreSQL, Background tasks

Technology Stack

AI & Transcription

OpenAI Whisper, WhisperX, PyAnnote, Speaker Diarization, GPU Acceleration

Backend

Python 3.8+, FastAPI, SQLAlchemy, Click CLI, yt-dlp

Frontend

Next.js 15, React 19, TypeScript, Tailwind CSS, Radix UI, React Query

Database

SQLite, PostgreSQL, Alembic Migrations, Full-text Search

Output Formats

JSON with Timestamps, SRT Subtitles, Plain Text, Speaker Labels

AI-Powered Transcription Pipeline

The core of the system is a multi-stage pipeline that chains video discovery, audio extraction, speech recognition, and speaker identification into a single automated workflow.

Whisper Model Selection

Supports the Whisper model sizes tiny, base, small, medium, and large. Smaller models run faster; larger models are more accurate, so the size can be chosen to fit the workload.
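
Model selection could be wired up roughly as follows. The sketch assumes the `openai-whisper` package (`whisper.load_model` and `model.transcribe`); the helper names and the `"base"` default are assumptions made for illustration.

```python
# Model sizes named in this write-up.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


def resolve_model_name(requested: str) -> str:
    """Normalize and validate a requested Whisper model size."""
    name = requested.strip().lower()
    if name not in WHISPER_SIZES:
        raise ValueError(
            f"unknown model size {requested!r}; expected one of {WHISPER_SIZES}"
        )
    return name


def transcribe_file(audio_path: str, model_size: str = "base") -> list:
    """Transcribe one audio file and return its timestamped segments.

    Requires the openai-whisper package and ffmpeg to be installed.
    """
    import whisper  # imported lazily so validation works without the package

    model = whisper.load_model(resolve_model_name(model_size))
    result = model.transcribe(audio_path)
    # Keep only the fields the later pipeline stages need.
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
```

Validating the size before loading avoids a slow model download for a mistyped name.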

WhisperX Alignment

Precise word-level timestamps through forced alignment, enabling accurate subtitle generation
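
To show how timestamps feed subtitle generation, here is a small sketch that renders segments as SRT. The segment dictionary shape (`start`, `end`, `text`, optional `speaker`) and the bracketed speaker prefix are assumptions for this example, not the project's documented output schema.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render timestamped segments as the contents of an SRT file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        # Prefix the line with the speaker label when one is available.
        text = f"[{speaker}] {seg['text']}" if speaker else seg["text"]
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```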

Speaker Diarization

PyAnnote-powered speaker identification labels each segment with SPEAKER_00, SPEAKER_01, etc.
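
Conceptually, labeling a segment means matching it against the diarization output. The sketch below assigns each transcript segment the speaker whose turn overlaps it the most. It illustrates the general idea only and is not WhisperX's or PyAnnote's actual code; the `turns` structure and function names are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: list, turns: list) -> list:
    """Label each segment with the speaker whose turn overlaps it most.

    segments: dicts with "start", "end", "text"
    turns:    dicts with "start", "end", "speaker" (e.g. "SPEAKER_00")
    Segments that overlap no turn get speaker None.
    """
    labeled = []
    for seg in segments:
        best_label, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_label, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_label})
    return labeled
```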

State Machine Processing

Videos progress through states: PENDING → DOWNLOADING → TRANSCRIBING → COMPLETED with full error recovery
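
The state progression can be sketched as an explicit transition table. The four main states come from the description above; the `FAILED` state and the retry path back to `PENDING` are assumptions about how error recovery might work, and the names are illustrative.

```python
from enum import Enum


class VideoState(Enum):
    PENDING = "pending"
    DOWNLOADING = "downloading"
    TRANSCRIBING = "transcribing"
    COMPLETED = "completed"
    FAILED = "failed"  # assumption: not named in the description above


# Allowed transitions; anything not listed here is rejected.
ALLOWED = {
    VideoState.PENDING: {VideoState.DOWNLOADING},
    VideoState.DOWNLOADING: {VideoState.TRANSCRIBING, VideoState.FAILED},
    VideoState.TRANSCRIBING: {VideoState.COMPLETED, VideoState.FAILED},
    VideoState.FAILED: {VideoState.PENDING},  # assumed retry: re-queue the video
    VideoState.COMPLETED: set(),
}


def advance(current: VideoState, target: VideoState) -> VideoState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting unlisted transitions keeps a crashed or repeated job from silently skipping a stage.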

Full-text Search

Search across all transcriptions to find specific content, speakers, or topics instantly
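
One way to implement this on SQLite is an FTS5 virtual table, sketched below with the standard-library `sqlite3` module. Using FTS5 is an assumption (the write-up only says full-text search on SQLite/PostgreSQL), it requires a SQLite build with FTS5 enabled, and the table and column names are hypothetical.

```python
import sqlite3


def build_index(rows: list) -> sqlite3.Connection:
    """Create an in-memory FTS5 index from (video_id, speaker, text) rows."""
    conn = sqlite3.connect(":memory:")
    # video_id is stored but not searched; speaker and text are indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE transcripts "
        "USING fts5(video_id UNINDEXED, speaker, text)"
    )
    conn.executemany(
        "INSERT INTO transcripts (video_id, speaker, text) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching rows, best match first."""
    cursor = conn.execute(
        "SELECT video_id, speaker, text FROM transcripts "
        "WHERE transcripts MATCH ? ORDER BY rank",
        (query,),
    )
    return cursor.fetchall()
```

A PostgreSQL deployment would need a different mechanism (for example `tsvector` columns), which this sketch does not cover.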

Platform Metrics

YTVideoTranscriber is a production-ready system with comprehensive tooling for automated YouTube transcription at scale.

• 25+ API Endpoints
• 8,000+ Lines of Code
• Whisper Models: tiny, base, small, medium, large
• Output Formats: JSON with timestamps, plain text, SRT subtitles
