YouTubeAgent

YouTube research and media operations agent — searches videos, extracts metadata, downloads media, reads comments, fetches transcripts, and transcribes audio locally with Whisper when captions are unavailable.

Project Structure
  • YouTubeAgent/
    • CLAUDE.md - Agent instructions: tools, workflows, core rules
    • AGENTS.md - CoreAgent registry entry
    • TOOLS/
      • YouTubeSearch/ - Search YouTube by query
        • youtube_search.py - Main script: yt-dlp search wrapper
        • youtube-search - CLI entrypoint
        • README.md - Usage docs and flags
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/ - Exported search results
      • YouTubeVideoInfo/ - Get detailed video metadata
        • youtube_video_info.py
        • youtube-video-info
        • README.md
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/
      • YouTubeChannelInfo/ - Channel metadata and video list
        • youtube_channel_info.py
        • youtube-channel-info
        • README.md
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/
      • YouTubeVideoComments/ - Fetch video comments and replies
        • youtube_video_comments.py
        • youtube-video-comments
        • README.md
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/
      • YouTubeTranscript/ - Get video captions/subtitles
        • youtube_transcript.py
        • youtube-transcript
        • README.md
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/
      • YouTubeDownload/ - Download video or audio
        • youtube_download.py
        • youtube-download
        • README.md
        • requirements.txt
        • TOOL-OUTPUT-RESULTS/
          • DOWNLOADED-VIDEOS/ - Downloaded media files
      • Whisper-CLI/ - Local audio transcription via WhisperKit
        • README.md - Workflow docs, command reference
        • AI-Models-Comparison.md - Model benchmarks and selection guide
        • scripts/
          • transcribe.sh - Main transcription script
          • export_segments_md.sh - JSON/SRT → Markdown converter
        • models/ - WhisperKit CoreML models (~632 MB)
        • tokenizers/ - Tokenizer configs for whisper-large-v3
        • out/ - Transcription output (.md files)
        • logs/ - Benchmark and timing logs
      • VideoClipper/ - Clip segments from audio/video with ffmpeg
        • README.md - ffmpeg commands and parameters
        • TOOL-OUTPUT-RESULTS/ - Clipped segments
About this Agent

YouTubeAgent is a research-and-media automation agent built on top of the CoreAgent framework. It gives Claude the ability to search YouTube, pull detailed metadata from videos and channels, read comments, fetch transcripts, download media, clip segments, and transcribe audio locally — all through a set of 8 purpose-built CLI tools orchestrated by a single CLAUDE.md instruction file.

The agent follows a tool-routing pattern: when a user makes a request, the agent reads the relevant tool's README.md to determine the exact flags and expected output format, then invokes the corresponding shell command. Six of the eight tools are thin Python wrappers around yt-dlp, the de facto YouTube extraction library. The two exceptions are Whisper-CLI, which uses WhisperKit (Argmax's CoreML-based Whisper implementation for Apple hardware) for on-device speech-to-text, and VideoClipper, which shells out to ffmpeg to cut segments, re-encoding audio for precise cuts and copying the video stream unchanged.

A key architectural decision is the transcript fallback chain: the agent first attempts to fetch YouTube's own captions via YouTubeTranscript; if none are available, it downloads the audio with YouTubeDownload and passes the resulting file to Whisper-CLI for local transcription. A similar pattern exists for clip transcription: VideoClipper extracts the segment, then Whisper-CLI processes it. As long as the audio can be downloaded and WhisperKit is available on the host, the agent can produce text for a video whether or not the uploader provided captions.
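The fallback chain above can be sketched as a small function. This is only an illustration, not the agent's actual code: the three callables are hypothetical stand-ins for the YouTubeTranscript, YouTubeDownload, and Whisper-CLI commands, injected so the control flow can be read (and tested) without touching the network.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins for the three CLI tools (not part of the project).
FetchCaptions = Callable[[str], Optional[str]]  # YouTubeTranscript; None if no captions
DownloadAudio = Callable[[str], str]            # YouTubeDownload --mode audio; returns a path
Transcribe = Callable[[str], str]               # Whisper-CLI; returns transcript text


def get_transcript(video: str,
                   fetch_captions: FetchCaptions,
                   download_audio: DownloadAudio,
                   transcribe: Transcribe) -> Tuple[str, str]:
    """Return (source, text): captions first, local Whisper transcription second."""
    captions = fetch_captions(video)
    if captions:
        return "captions", captions
    # Only reached when YouTube has no usable captions for this video.
    audio_path = download_audio(video)
    return "whisper", transcribe(audio_path)
```

Returning the source alongside the text lets the caller tell the user whether the transcript came from YouTube's captions or from local transcription.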

Each tool keeps its output in its own directory: TOOL-OUTPUT-RESULTS/ for the yt-dlp tools and VideoClipper, and out/ for Whisper-CLI. Exports are Markdown, and YouTubeDownload also reports a JSON result with the file path and metadata. This makes results easy to reference across multi-step research workflows. The agent is well suited to tasks like competitive analysis, content research, audience sentiment extraction, and media archival, combining metadata lookup, transcript retrieval, and comment analysis into a single conversational flow.

Agent Architecture

YouTubeAgent is a tool-routing agent: it receives a user request, determines which CLI tool to invoke, reads that tool's README for the exact flags, and executes the command. All 6 Python tools share the same pattern: a youtube_*.py script wrapped by a shell entrypoint, using yt-dlp as the YouTube backend. The other two tools are not Python. Whisper-CLI uses WhisperKit (CoreML) for on-device transcription and only runs when YouTube captions are unavailable or the user explicitly asks for local transcription. VideoClipper is a thin wrapper around ffmpeg for cutting segments. Tools write to their own TOOL-OUTPUT-RESULTS/ directory, except Whisper-CLI, which writes transcripts to out/.

flowchart TD
  USER["User Request"]
  AGENT["YouTubeAgent"]

  SEARCH["YouTubeSearch"]
  INFO["YouTubeVideoInfo"]
  CHANNEL["YouTubeChannelInfo"]
  COMMENTS["YouTubeVideoComments"]
  TRANSCRIPT["YouTubeTranscript"]
  DOWNLOAD["YouTubeDownload"]
  WHISPER["Whisper-CLI"]
  CLIPPER["VideoClipper"]

  USER --> AGENT
  AGENT --> SEARCH
  AGENT --> INFO
  AGENT --> CHANNEL
  AGENT --> COMMENTS
  AGENT --> TRANSCRIPT
  AGENT --> DOWNLOAD

  TRANSCRIPT -.->|"No captions"| DOWNLOAD
  DOWNLOAD -.->|"Audio file"| WHISPER
  DOWNLOAD -.->|"Media file"| CLIPPER
  CLIPPER -.->|"Clipped segment"| WHISPER

  style USER fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style AGENT fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style SEARCH fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style INFO fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style CHANNEL fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style COMMENTS fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style TRANSCRIPT fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style DOWNLOAD fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style WHISPER fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
  style CLIPPER fill:#f0f0f0,stroke:#888,color:#333,stroke-width:1.5px,rx:10,ry:10
        
Available Tools

Eight CLI tools organized by function: 6 Python tools that use yt-dlp to extract data from YouTube, 1 shell-based transcriber using WhisperKit CoreML, and 1 ffmpeg wrapper for media clipping. All Python tools require Python 3.10+ and yt-dlp; five of them also list Node.js, while YouTubeDownload lists FFmpeg instead. They follow a consistent --export pattern for Markdown output, with --output-dir to override the export directory where the tool documents it.

YouTubeSearch
Searches YouTube by query string. Returns results in the format: Channel – Title – ID – Views – Uploaded on Date. Defaults to 10 results. Supports Markdown export with table formatting.
./youtube-search --query "search term" --results 30 --export
Python 3.10+ · yt-dlp · Node.js
| Flag | Default | Description |
| --- | --- | --- |
| --query | Required | Search query string |
| --results | 10 | Number of results to return |
| --export | false | Export results to Markdown |
| --output-dir | TOOL-OUTPUT-RESULTS/ | Custom export directory |
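As a rough sketch of what a yt-dlp search wrapper of this kind might look like (the real youtube_search.py is not shown here, so the function names, the fallback strings, and the date and view formatting below are assumptions), the `ytsearchN:` pseudo-URL asks yt-dlp for the first N results, and a small formatter produces the documented result line:

```python
from datetime import datetime
from typing import List


def format_result(entry: dict) -> str:
    """Render one yt-dlp entry as 'Channel – Title – ID – Views – Uploaded on Date'."""
    raw_date = entry.get("upload_date")  # yt-dlp reports YYYYMMDD when it is known
    date = (datetime.strptime(raw_date, "%Y%m%d").strftime("%Y-%m-%d")
            if raw_date else "unknown")
    views = entry.get("view_count")
    views_text = f"{views:,} views" if views is not None else "unknown views"
    return " – ".join([
        entry.get("channel") or entry.get("uploader") or "unknown",
        entry.get("title") or "untitled",
        entry.get("id", ""),
        views_text,
        f"Uploaded on {date}",
    ])


def search(query: str, results: int = 10) -> List[str]:
    """Run a 'ytsearchN:' query through yt-dlp (needs network access)."""
    import yt_dlp  # lazy import so format_result works without yt-dlp installed

    with yt_dlp.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(f"ytsearch{results}:{query}", download=False)
    return [format_result(e) for e in info.get("entries") or []]
```

Keeping the formatter separate from the network call makes the output format easy to check in isolation.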
YouTubeVideoInfo
Retrieves detailed metadata for a single video: title, description, views, likes, upload date, duration, channel, live status, and tags. Accepts full URLs or video IDs.
./youtube-video-info --video "VIDEO_URL_OR_ID" --export
Python 3.10+ · yt-dlp · Node.js
| Flag | Default | Description |
| --- | --- | --- |
| --video | Required | YouTube URL or video ID |
| --export | false | Export metadata to Markdown |
| --output-dir | TOOL-OUTPUT-RESULTS/ | Custom export directory |
YouTubeChannelInfo
Returns channel metadata and video list using YouTube's native ordering. Supports sorting by recent uploads or popularity. Defaults to 50 results.
./youtube-channel-info --channel "@OpenAI" --sort popularity --results 40 --export
Python 3.10+ · yt-dlp · Node.js
| Flag | Default | Description |
| --- | --- | --- |
| --channel | Required | Channel handle (e.g. @OpenAI) |
| --sort | recent | Sort order: recent or popularity |
| --results | 50 | Number of videos to fetch |
| --export | false | Export to Markdown |
YouTubeVideoComments
Fetches comments and threaded replies for a single video. Each comment includes author, text, and like count. Supports sorting by popularity or recency.
./youtube-video-comments --video "VIDEO_URL_OR_ID" --comments 60 --sort recent --export
Python 3.10+ · yt-dlp · Node.js
| Flag | Default | Description |
| --- | --- | --- |
| --video | Required | YouTube URL or video ID |
| --comments | 20 | Number of top-level comments |
| --replies-per-comment | 3 | Replies per comment (0 to disable) |
| --sort | popularity | Sort: popularity or recent |
| --export | false | Export to Markdown |
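To illustrate how the --comments and --replies-per-comment limits could be applied, here is a minimal sketch that groups a flat comment list into threads. It assumes the list shape yt-dlp produces when comments are requested (each item has `id`, `parent`, `author`, `text`, `like_count`, with `parent` set to `"root"` for top-level comments); the function name and output shape are hypothetical.

```python
from typing import Dict, List


def thread_comments(comments: List[dict],
                    max_comments: int = 20,
                    replies_per_comment: int = 3) -> List[dict]:
    """Group a flat comment list into top-level comments with capped replies."""
    threads: Dict[str, dict] = {}
    # First pass: keep the first `max_comments` top-level comments, in input order.
    for c in comments:
        if c.get("parent", "root") == "root" and len(threads) < max_comments:
            threads[c["id"]] = {
                "author": c.get("author"),
                "text": c.get("text"),
                "likes": c.get("like_count") or 0,
                "replies": [],
            }
    # Second pass: attach replies to kept parents, up to the per-comment cap.
    for c in comments:
        parent = threads.get(c.get("parent"))
        if parent is not None and len(parent["replies"]) < replies_per_comment:
            parent["replies"].append({
                "author": c.get("author"),
                "text": c.get("text"),
                "likes": c.get("like_count") or 0,
            })
    return list(threads.values())
```

Setting `replies_per_comment=0` leaves every `replies` list empty, matching the "0 to disable" behavior in the table.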
YouTubeTranscript
Gets video captions without timestamps. Supports line-by-line or paragraph format. Can request translation if a translatable track exists. If no captions are available, returns an error prompting the Whisper fallback workflow.
./youtube-transcript --video "VIDEO_URL_OR_ID" --format paragraph --translate-to en --export
Python 3.10+ · yt-dlp · Node.js
| Flag | Default | Description |
| --- | --- | --- |
| --video | Required | YouTube URL or video ID |
| --format | lines | Output format: lines or paragraph |
| --language | auto | Preferred source language code |
| --translate-to | none | Target translation language |
| --export | false | Export to Markdown |
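The difference between the lines and paragraph formats can be shown with a short sketch. It assumes the captions arrive as WebVTT text (one of the subtitle formats yt-dlp can fetch); the function name and the de-duplication of consecutive repeated lines are illustrative choices, not a description of youtube_transcript.py.

```python
import re
from typing import List

# A cue timing line such as "00:00:01.000 --> 00:00:03.000 align:start"
TIMING = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")
# Inline markup such as <c>...</c> or <00:00:01.500>
TAG = re.compile(r"<[^>]+>")


def vtt_to_text(vtt: str, fmt: str = "lines") -> str:
    """Strip timestamps and markup from WebVTT, returning plain caption text."""
    if fmt not in ("lines", "paragraph"):
        raise ValueError("fmt must be 'lines' or 'paragraph'")
    lines: List[str] = []
    seen_cue = False
    for raw in vtt.splitlines():
        stripped = raw.strip()
        if TIMING.match(stripped):
            seen_cue = True  # everything before the first cue is header material
            continue
        text = TAG.sub("", stripped).strip()
        # Skip the header, blank lines, and consecutive duplicate lines.
        if not seen_cue or not text or (lines and lines[-1] == text):
            continue
        lines.append(text)
    return " ".join(lines) if fmt == "paragraph" else "\n".join(lines)
```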
YouTubeDownload
Downloads a single YouTube video or its audio track. Video mode supports quality selection from 360p to 4K. Audio mode supports format conversion (MP3, M4A, OPUS, WAV, FLAC) with configurable bitrate. Outputs a JSON result with file path and metadata.
./youtube-download --video "VIDEO_URL_OR_ID" --mode audio --audio-format mp3 --audio-quality 192
Python 3.10+ · yt-dlp · FFmpeg
| Flag | Default | Description |
| --- | --- | --- |
| --video | Required | YouTube URL or video ID |
| --mode | video | Download mode: video or audio |
| --quality | best | Video quality: best, 2160, 1080, 720, etc. |
| --audio-format | best | Audio format: best, mp3, m4a, wav, etc. |
| --audio-quality | best | Audio bitrate (e.g. 192) |
| --list-formats | false | List all available formats |
| --export | false | Export summary to Markdown |
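One plausible way to translate these flags into yt-dlp options is sketched below. The option keys (`format`, `outtmpl`, `noplaylist`, and the `FFmpegExtractAudio` postprocessor) are standard yt-dlp options; the function name, the format selectors, and the output template are assumptions rather than the tool's actual settings.

```python
def build_ydl_opts(mode: str = "video",
                   quality: str = "best",
                   audio_format: str = "best",
                   audio_quality: str = "best",
                   output_dir: str = "TOOL-OUTPUT-RESULTS/DOWNLOADED-VIDEOS") -> dict:
    """Map the CLI flags onto a yt-dlp options dict (illustrative mapping)."""
    opts = {
        "outtmpl": f"{output_dir}/%(title)s [%(id)s].%(ext)s",
        "noplaylist": True,  # the tool handles a single video at a time
    }
    if mode == "audio":
        opts["format"] = "bestaudio/best"
        if audio_format != "best":
            # Conversion is done by ffmpeg, which is why FFmpeg is a dependency.
            post = {"key": "FFmpegExtractAudio", "preferredcodec": audio_format}
            if audio_quality != "best":
                post["preferredquality"] = str(audio_quality)
            opts["postprocessors"] = [post]
    elif mode == "video":
        if quality == "best":
            opts["format"] = "bestvideo+bestaudio/best"
        else:
            height = int(quality)  # e.g. "1080" caps the video height at 1080 pixels
            opts["format"] = (f"bestvideo[height<={height}]+bestaudio"
                              f"/best[height<={height}]")
    else:
        raise ValueError("mode must be 'video' or 'audio'")
    return opts
```

The resulting dict would be passed to `yt_dlp.YoutubeDL(opts)` before calling its `download` method.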
Whisper-CLI
Transcribes local audio/video files using WhisperKit (CoreML, on-device). Produces a Markdown file with timestamped segments. Uses the whisper-large-v3 turbo model (~632 MB). Only used when YouTube captions are unavailable or when the user explicitly requests local transcription.
./scripts/transcribe.sh -a "/path/to/audio.mp3" -l en
WhisperKit CLI · CoreML models · macOS
| Flag | Default | Description |
| --- | --- | --- |
| -a, --audio | Required | Absolute path to input audio file |
| -l, --language | auto | Force language (en, es, etc.) |
| -o, --out-dir | out/ | Override output directory |
| --keep-artifacts | false | Keep JSON, SRT, TXT, and logs |
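When the agent calls transcribe.sh from another tool's output, the absolute-path requirement is easy to miss. A small helper along these lines (hypothetical, not part of the project) could build the argument list from the documented flags and reject relative paths early:

```python
from pathlib import PurePosixPath
from typing import List, Optional


def transcribe_argv(audio: str,
                    language: Optional[str] = None,
                    out_dir: Optional[str] = None,
                    keep_artifacts: bool = False) -> List[str]:
    """Build the transcribe.sh command line from the documented flags."""
    # POSIX path rules are used because the tool is documented as macOS-only.
    if not PurePosixPath(audio).is_absolute():
        raise ValueError("transcribe.sh expects an absolute audio path")
    argv = ["./scripts/transcribe.sh", "-a", audio]
    if language:
        argv += ["-l", language]   # omit to let the script auto-detect
    if out_dir:
        argv += ["-o", out_dir]    # omit to use the default out/ directory
    if keep_artifacts:
        argv.append("--keep-artifacts")
    return argv
```

The list can be handed to `subprocess.run(argv, check=True, cwd=...)` with the Whisper-CLI directory as the working directory.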
VideoClipper
Clips a segment from a local audio or video file using ffmpeg. Re-encodes audio for precise timestamp cuts (MP3 frames don't align to arbitrary times). For video, copies the video stream without re-encoding for speed.
ffmpeg -i "input.mp3" -ss 00:02:30 -to 00:05:00 -acodec libmp3lame -b:a 192k "output.mp3"
FFmpeg
| Flag | Default | Description |
| --- | --- | --- |
| -i | Required | Input file (absolute path) |
| -ss | Required | Start timestamp (HH:MM:SS) |
| -to | Required | End timestamp (HH:MM:SS) |
| -acodec | libmp3lame | Audio codec for re-encoding |
| -b:a | 192k | Audio bitrate |
| -c:v copy |  | Copy video stream (video files only) |
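Since VideoClipper is documented as plain ffmpeg commands, a helper that assembles the argument list can prevent malformed timestamps from reaching ffmpeg. The sketch below mirrors the audio command shown above; the function name and the validation rules are assumptions, and for video it only adds `-c:v copy`, leaving audio handling to ffmpeg's defaults.

```python
import re
from typing import List

TIMESTAMP = re.compile(r"^\d{2}:[0-5]\d:[0-5]\d$")  # HH:MM:SS


def clip_argv(src: str, start: str, end: str, dst: str,
              video: bool = False) -> List[str]:
    """Build an ffmpeg command that cuts [start, end] out of src into dst."""
    for ts in (start, end):
        if not TIMESTAMP.match(ts):
            raise ValueError(f"expected HH:MM:SS, got {ts!r}")
    # Fixed-width HH:MM:SS strings sort the same way as the times they encode.
    if start >= end:
        raise ValueError("start must be earlier than end")
    argv = ["ffmpeg", "-i", src, "-ss", start, "-to", end]
    if video:
        argv += ["-c:v", "copy"]  # fast, but the cut can only begin on a keyframe
    else:
        argv += ["-acodec", "libmp3lame", "-b:a", "192k"]  # re-encode for exact cuts
    return argv + [dst]
```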
Workflows
Research a video
1. YouTubeVideoInfo: get metadata (title, channel, views, duration, description)
2. YouTubeTranscript: get captions. If unavailable, offer the Whisper fallback
3. YouTubeVideoComments: get top comments if the user wants audience reaction
Research a channel
1. YouTubeChannelInfo: get channel metadata and video list (recent or popular)
2. Optionally drill into specific videos with YouTubeVideoInfo or YouTubeTranscript
Download and transcribe (no captions available)
1. YouTubeDownload with --mode audio to get the audio file
2. Whisper-CLI with the downloaded audio path to generate a Markdown transcript
Download, clip, and transcribe a specific section
1. YouTubeVideoInfo: get metadata to identify chapter timestamps
2. YouTubeDownload with --mode audio to get the full audio file
3. VideoClipper: clip the relevant segment using the timestamps
4. Whisper-CLI: transcribe only the clipped segment
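The hand-off from the first step to the third in this workflow depends on turning chapter metadata into HH:MM:SS timestamps. A minimal sketch, assuming the metadata follows yt-dlp's `chapters` structure (a list of dicts with `title`, `start_time`, and `end_time` in seconds); the helper names are hypothetical:

```python
from typing import Tuple


def hhmmss(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def chapter_range(info: dict, title_contains: str) -> Tuple[str, str]:
    """Return (start, end) timestamps of the first chapter whose title matches."""
    for chapter in info.get("chapters") or []:
        if title_contains.lower() in chapter.get("title", "").lower():
            return hhmmss(chapter["start_time"]), hhmmss(chapter["end_time"])
    raise LookupError(f"no chapter matching {title_contains!r}")
```

The two strings can be used directly as the -ss and -to values for VideoClipper.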
Search and explore
1. YouTubeSearch: find videos by query
2. Pick a result and use YouTubeVideoInfo, YouTubeTranscript, or YouTubeVideoComments for deeper analysis
Improvement Areas
No batch operations
Every tool operates on a single video or channel at a time. There's no way to pass a list of video IDs and process them in batch — the agent must loop manually, running each tool sequentially.
What's missing: A --batch flag or a file-based input mode where the user provides a list of URLs and the tool processes them all, aggregating results into a single export.
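A file-based batch mode could be quite small. The sketch below is one possible shape, not an existing feature: `read_batch` parses a list of URLs or IDs (one per line, `#` for comments), and `run_batch` calls a per-video function and collects successes and failures so one bad video does not abort the run. Both names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def read_batch(lines: Iterable[str]) -> List[str]:
    """Parse a batch file: one URL or ID per line, '#' comments, no duplicates."""
    seen: List[str] = []
    for line in lines:
        item = line.strip()
        if item and not item.startswith("#") and item not in seen:
            seen.append(item)
    return seen


def run_batch(videos: Iterable[str],
              run_one: Callable[[str], object]) -> Dict[str, dict]:
    """Run `run_one` per video, aggregating results and error messages."""
    report: Dict[str, dict] = {"ok": {}, "failed": {}}
    for video in videos:
        try:
            report["ok"][video] = run_one(video)
        except Exception as exc:  # keep going; record what went wrong
            report["failed"][video] = str(exc)
    return report
```

The aggregated report could then be written as a single export instead of one file per video.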
No caching or rate-limit handling
Each tool invocation makes fresh requests to YouTube via yt-dlp. Repeated lookups for the same video (e.g. info, then comments, then transcript) hit YouTube three separate times. There's no local cache and no rate-limit detection or retry logic.
What's missing: A shared metadata cache (even a simple JSON file per video ID) that stores previously fetched data. Rate-limit detection that backs off and retries instead of failing silently.
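The cache and back-off ideas can be combined in one helper, sketched below under stated assumptions: `RateLimitError` is a hypothetical exception that the fetch function would raise when it detects throttling (for example an HTTP 429 surfaced by yt-dlp), and the cache is one JSON file per video ID and data kind.

```python
import json
import time
from pathlib import Path
from typing import Callable


class RateLimitError(Exception):
    """Hypothetical signal that YouTube throttled the request."""


def cached_fetch(video_id: str, kind: str, fetch: Callable[[str], dict],
                 cache_dir: str, retries: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Return cached data if present; otherwise fetch with exponential back-off."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    path = Path(cache_dir) / f"{video_id}.{kind}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    for attempt in range(retries):
        try:
            data = fetch(video_id)
            break
        except RateLimitError:
            if attempt == retries - 1:
                raise  # out of attempts: fail loudly rather than silently
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Passing `sleep` as a parameter keeps the back-off testable without real delays.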
Whisper-CLI is macOS-only
The transcription tool uses WhisperKit with CoreML, which only runs on macOS with Apple Silicon. There's no Linux or Windows fallback, and no alternative transcription backend configured.
What's missing: A platform detection step that falls back to whisper.cpp or the OpenAI Whisper API on non-macOS systems. The README doesn't mention this platform limitation.
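A platform detection step might look like the following sketch. The backend labels are placeholders, and choosing whisper.cpp over the OpenAI Whisper API as the non-macOS default is an assumption made here only to keep the fallback local.

```python
import platform
from typing import Optional


def pick_backend(system: Optional[str] = None,
                 machine: Optional[str] = None) -> str:
    """Choose a transcription backend from the host OS and CPU architecture."""
    system = system or platform.system()     # "Darwin" on macOS
    machine = machine or platform.machine()  # "arm64" on Apple Silicon
    if system == "Darwin" and machine == "arm64":
        return "whisperkit"
    # Anything else (Linux, Windows, Intel Macs) needs a different backend.
    return "whisper.cpp"
```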
No unified export format
Each tool generates its own Markdown format with slightly different heading structures, metadata sections, and table layouts. There's no shared template or consistent schema across exports, making it harder to parse results programmatically.
What's missing: A shared Markdown template with consistent frontmatter (YAML), standardized metadata table format, and uniform heading hierarchy across all tools.
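As an illustration of what a shared template could produce, the sketch below renders YAML frontmatter, a title heading, and a uniform metadata table. The field names are hypothetical. Values are written with `json.dumps`, which yields double-quoted strings that are also valid YAML scalars, so titles containing colons or quotes stay parseable.

```python
import json
from typing import Dict


def render_export(tool: str, video_id: str, title: str, exported_at: str,
                  fields: Dict[str, object], body: str) -> str:
    """Render one export with frontmatter, heading, metadata table, and body."""
    meta = {"tool": tool, "video_id": video_id,
            "title": title, "exported_at": exported_at}
    front = (["---"]
             + [f"{key}: {json.dumps(value, ensure_ascii=False)}"
                for key, value in meta.items()]
             + ["---"])
    table = (["| Field | Value |", "| --- | --- |"]
             + [f"| {key} | {value} |" for key, value in fields.items()])
    return "\n".join(front + ["", f"# {title}", ""] + table + ["", body.strip(), ""])
```

If every tool called the same renderer, a downstream script could read the frontmatter of any export without tool-specific parsing.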
TOOL-OUTPUT-RESULTS folders have no cleanup policy
Each tool writes exports to its own TOOL-OUTPUT-RESULTS/ directory. These files accumulate indefinitely. There's no age-based cleanup, no size limit, and no command to purge old results.
What's missing: A cleanup command or retention policy. At minimum, a --clean flag on each tool or a top-level script that removes exports older than N days.
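A top-level cleanup script could be as simple as the sketch below, which removes files older than N days from every TOOL-OUTPUT-RESULTS/ directory under the TOOLS root. The function name and the `dry_run` option are assumptions. As written it would also delete old media under DOWNLOADED-VIDEOS/, so a real policy might exclude that folder.

```python
import time
from pathlib import Path
from typing import List, Optional


def clean_exports(root: str, max_age_days: float,
                  now: Optional[float] = None,
                  dry_run: bool = False) -> List[Path]:
    """Delete files older than `max_age_days` under any TOOL-OUTPUT-RESULTS/."""
    current = now if now is not None else time.time()
    cutoff = current - max_age_days * 86400  # seconds per day
    removed: List[Path] = []
    for results_dir in Path(root).rglob("TOOL-OUTPUT-RESULTS"):
        if not results_dir.is_dir():
            continue
        # Materialize the listing first so deletions don't disturb the walk.
        for item in sorted(results_dir.rglob("*")):
            if item.is_file() and item.stat().st_mtime < cutoff:
                if not dry_run:
                    item.unlink()
                removed.append(item)
    return removed
```

Running it once with `dry_run=True` shows what would be removed before anything is deleted.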