Audio & Video Intelligence
Deeptrain bridges the gap between temporal media and Large Language Models, allowing your AI agents to "listen" to audio and "watch" video content. By converting these multi-dimensional streams into structured data, Deeptrain enables models to perform sentiment analysis, summarization, and context-aware reasoning on non-textual data.
Audio Processing
The Audio Intelligence module allows you to ingest audio files and streams for training or real-time interaction. It handles noise reduction and voice activity detection internally to ensure the LLM receives high-fidelity information.
Key Features:
- Multi-format Support: Process `.mp3`, `.wav`, `.flac`, and more.
- Contextual Embedding: Convert audio segments directly into vector embeddings for localized database retrieval.
- Speech-to-Text Integration: Seamlessly pipe audio data through the Transcribe API to feed textual prompts.
Video Intelligence
Deeptrain’s video capabilities go beyond simple transcription. It treats video as multi-dimensional data, analyzing both the visual frames and the synchronized audio track to provide a comprehensive understanding of the content.
Supported Sources:
- Local Storage: Process `.mp4`, `.mkv`, and `.avi` files stored on your server.
- Cloud Platforms: Direct integration with YouTube and Vimeo URLs.
- Self-Hosted: Support for private HLS or MP4 streams via URL.
Transcribe API Reference
The Transcribe API is the primary interface for converting media files into LLM-readable formats.
transcribe(source, options)
Processes a media source and returns a structured transcription object.
Parameters:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | `string` | The path to a local file or a valid URL (YouTube, Vimeo, etc.). |
| `options` | `object` | Configuration for processing (see below). |
Options Object:
- `model`: `string` - The specific transcription model to use (default: `standard`).
- `includeTimestamps`: `boolean` - Whether to return timestamps for each sentence.
- `detectSpeakers`: `boolean` - Enables diarization to identify different speakers.
- `visualContext`: `boolean` - (Video only) Enables frame-by-frame analysis to supplement transcription.
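Putting these options together, a fully populated options object might look like the following sketch. The values shown are purely illustrative, not recommended settings:

```typescript
// An options object enabling every documented flag.
const options = {
  model: "standard",       // default transcription model
  includeTimestamps: true, // per-sentence timestamps
  detectSpeakers: true,    // speaker diarization
  visualContext: true      // video only: frame-by-frame visual analysis
};

console.log(Object.keys(options));
```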
Returns: `Promise<TranscriptionResult>`

```typescript
{
  text: string;            // The full transcribed content
  segments: Array<{        // Detailed breakdown of the media
    start: number;
    end: number;
    text: string;
    speaker?: string;
  }>;
  metadata: {
    duration: number;
    sourceType: string;
    language: string;
  };
}
```
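To illustrate how this structure can be consumed, the sketch below walks a hand-written object matching the shape above and renders each segment with its time range and speaker. The sample data is invented for illustration only:

```typescript
// A sample object matching the TranscriptionResult shape.
const sample = {
  text: "Hello everyone. Welcome to the demo.",
  segments: [
    { start: 0, end: 2.5, text: "Hello everyone.", speaker: "S1" },
    { start: 2.5, end: 5.0, text: "Welcome to the demo.", speaker: "S2" },
  ],
  metadata: { duration: 5.0, sourceType: "audio", language: "en" },
};

// Render each segment as "[start-end] speaker: text".
function formatSegments(result: typeof sample): string[] {
  return result.segments.map(
    (s) =>
      `[${s.start.toFixed(1)}-${s.end.toFixed(1)}] ${s.speaker ?? "unknown"}: ${s.text}`
  );
}

console.log(formatSegments(sample).join("\n"));
// → [0.0-2.5] S1: Hello everyone.
//   [2.5-5.0] S2: Welcome to the demo.
```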
Usage Examples
Processing a YouTube Video
This example demonstrates how to ingest a remote video and prepare its content for an AI agent.
```typescript
import { Deeptrain } from 'deeptrain-sdk';

const dt = new Deeptrain('your-api-key');

async function processVideoContent() {
  const result = await dt.transcribe('https://www.youtube.com/watch?v=example', {
    visualContext: true,
    detectSpeakers: true
  });

  console.log("Transcribed Text:", result.text);

  // Feed the transcription into your LLM context
  await dt.agents.updateContext('marketing-agent', result.text);
}
```
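Because `detectSpeakers` labels each returned segment, a small post-processing step can merge consecutive segments from the same speaker into conversational turns before they reach an agent. The helper below is a sketch against the documented segment shape, not part of the SDK:

```typescript
interface Segment { start: number; end: number; text: string; speaker?: string; }

// Merge consecutive segments from the same speaker into single turns.
function mergeTurns(segments: Segment[]): { speaker: string; text: string }[] {
  const turns: { speaker: string; text: string }[] = [];
  for (const s of segments) {
    const speaker = s.speaker ?? "unknown";
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      last.text += " " + s.text; // same speaker: extend the current turn
    } else {
      turns.push({ speaker, text: s.text }); // new speaker: start a new turn
    }
  }
  return turns;
}

const turns = mergeTurns([
  { start: 0, end: 1, text: "Hi.", speaker: "A" },
  { start: 1, end: 2, text: "Welcome.", speaker: "A" },
  { start: 2, end: 3, text: "Thanks.", speaker: "B" },
]);
console.log(turns);
// → [ { speaker: "A", text: "Hi. Welcome." }, { speaker: "B", text: "Thanks." } ]
```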
Local Audio Training
Use local audio files to enhance an agent's knowledge base via localized embeddings.
```typescript
import { Deeptrain } from 'deeptrain-sdk';

const dt = new Deeptrain('your-api-key');

const trainFromAudio = async (filePath: string) => {
  const audioData = await dt.transcribe(filePath, {
    includeTimestamps: false
  });

  // Store in the localized embedding database
  await dt.embeddings.store({
    content: audioData.text,
    metadata: { source: filePath, type: 'audio' }
  });
};

trainFromAudio('./recordings/meeting_notes.wav');
```
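Long recordings can produce transcripts larger than a single embedding comfortably holds. One common approach, sketched below as plain code rather than an SDK feature, is to split the transcript into fixed-size word chunks and store each chunk separately:

```typescript
// Split a transcript into chunks of at most `maxWords` words,
// so long recordings can be embedded piece by piece.
function chunkTranscript(text: string, maxWords = 200): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

console.log(chunkTranscript("one two three four five", 2));
// → ["one two", "three four", "five"]
```

Each chunk could then be passed to `dt.embeddings.store` individually, carrying the same source metadata.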
Internal Components
- FrameProcessor (Internal): Handles the extraction of visual features from video files. Users do not interact with this directly, but its output is toggled via the `visualContext` flag in the Transcribe API.
- StreamBuffer (Internal): Manages data chunking for high-latency cloud sources like Vimeo to ensure stable processing.