Why it matters
One API call to Gemini 3 Flash Preview: speaker labels by name, timestamps, emotion tags, language detection with English translation, and a full summary. That is the audio understanding layer that underlies everything else Thor Schaeff demos here, including speech generation directed by a "director's note" rather than
My takeaway: "From Transcription to Live Music: Gemini's Audio Stack" (Thor Schaeff, Google DeepMind) is an AI-engineering signal. The practical read is to connect the implementation pattern to reliability, data boundaries, observability, and the controls needed when AI leaves prototypes.
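To make the single-call idea concrete, here is a minimal sketch of the response shape such a call could be asked to return, plus a strict parser for it. Everything named here is an assumption for illustration: the field names, the prompt wording, and the sample payload are hand-written, not taken from the talk or from real model output. The request to Gemini itself is left out so the sketch runs offline.

```python
import json
from dataclasses import dataclass
from typing import List


# Hypothetical schema: the source lists the kinds of output
# (speakers, timestamps, emotion, language, translation, summary),
# not the actual field names.
@dataclass
class Segment:
    speaker: str      # speaker label by name
    start: str        # segment start timestamp, e.g. "00:00"
    end: str          # segment end timestamp
    text: str         # transcript in the spoken language
    language: str     # detected language
    translation: str  # English translation of `text`
    emotion: str      # emotion tag


@dataclass
class AudioAnalysis:
    summary: str
    segments: List[Segment]


# Hypothetical prompt asking for everything in one response.
PROMPT = (
    "Transcribe the audio. Return JSON with a 'summary' string and a "
    "'segments' list; each segment has speaker, start, end, text, "
    "language, translation, and emotion."
)


def parse_analysis(raw: str) -> AudioAnalysis:
    """Parse the JSON reply; missing or unexpected segment keys raise TypeError."""
    data = json.loads(raw)
    segments = [Segment(**item) for item in data["segments"]]
    return AudioAnalysis(summary=data["summary"], segments=segments)


def speakers_in_order(analysis: AudioAnalysis) -> List[str]:
    """Return each speaker once, in order of first appearance."""
    seen: List[str] = []
    for segment in analysis.segments:
        if segment.speaker not in seen:
            seen.append(segment.speaker)
    return seen


# Hand-written illustrative payload, NOT real model output.
SAMPLE_RESPONSE = """
{
  "summary": "Two people greet each other.",
  "segments": [
    {"speaker": "Ana", "start": "00:00", "end": "00:02",
     "text": "Hola, ¿qué tal?", "language": "es",
     "translation": "Hi, how are you?", "emotion": "friendly"},
    {"speaker": "Ben", "start": "00:02", "end": "00:04",
     "text": "Doing well, thanks.", "language": "en",
     "translation": "Doing well, thanks.", "emotion": "calm"}
  ]
}
"""

if __name__ == "__main__":
    analysis = parse_analysis(SAMPLE_RESPONSE)
    print(analysis.summary)
    print(speakers_in_order(analysis))
```

Parsing into typed records rather than passing raw text downstream is one way to act on the reliability and observability points above: a reply that drifts from the expected shape fails loudly at the boundary, where it can be logged, instead of surfacing later as a silent gap.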