yt-dlp to fetch the captions/transcript. This text is then processed by a Large Language Model (LLM). For specific cases (like some Chinese videos), I manually process the audio using whisper.cpp to generate a transcript, though this isn't fully automated on the site yet.
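
For illustration, here is a rough sketch of what the caption step might look like. It is not the site's actual code. It assumes the captions are requested as WebVTT through yt-dlp's command-line subtitle options, and that the cue text is flattened to plain text before it goes to the LLM. The helper names `build_caption_command` and `vtt_to_text`, the output template, and the default language are all hypothetical.

```python
import re


def build_caption_command(video_url: str, lang: str = "en") -> list[str]:
    """Build a yt-dlp invocation that saves captions only (no video download)."""
    return [
        "yt-dlp",
        "--skip-download",        # captions only, skip the media file
        "--write-subs",           # uploader-provided subtitles, if any
        "--write-auto-subs",      # fall back to auto-generated captions
        "--sub-langs", lang,
        "--sub-format", "vtt",
        "-o", "%(id)s.%(ext)s",   # assumed output template
        video_url,
    ]


_TAG = re.compile(r"<[^>]+>")                      # inline cue tags such as <c> or <00:00:01.000>
_HEADER = ("WEBVTT", "Kind:", "Language:", "NOTE")  # header and comment lines to drop


def vtt_to_text(vtt: str) -> str:
    """Flatten WebVTT captions into plain text suitable for an LLM prompt."""
    kept: list[str] = []
    for raw in vtt.splitlines():
        line = raw.strip()
        # Skip blanks, header lines, cue timings, and numeric cue identifiers.
        if not line or line.startswith(_HEADER) or "-->" in line or line.isdigit():
            continue
        line = _TAG.sub("", line).strip()
        # Auto-generated captions often repeat the previous line; keep one copy.
        if line and (not kept or kept[-1] != line):
            kept.append(line)
    return " ".join(kept)
```

The command list can be passed to `subprocess.run(build_caption_command(url), check=True)`, and the resulting `.vtt` file read and handed to `vtt_to_text` before prompting the model. Dropping consecutive duplicate lines is a simplification; rolling auto-captions may need more careful merging.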