Audio & Video Intelligence
Deeptrain bridges the gap between temporal media and Large Language Models, allowing your AI agents to "listen" to audio and "watch" video content. By converting these multi-dimensional streams into structured data, Deeptrain enables models to perform sentiment analysis, summarization, and context-aware reasoning on non-textual data.
Audio Processing
The Audio Intelligence module allows you to ingest audio files and streams for training or real-time interaction. It handles noise reduction and voice activity detection internally to ensure the LLM receives high-fidelity information.
Key Features:
- Multi-format Support: Process `.mp3`, `.wav`, `.flac`, and more.
- Contextual Embedding: Convert audio segments directly into vector embeddings for localized database retrieval.
- Speech-to-Text Integration: Seamlessly pipe audio data through the Transcribe API to feed textual prompts.
Video Intelligence
Deeptrain’s video capabilities go beyond simple transcription. It treats video as multi-dimensional data, analyzing both the visual frames and the synchronized audio track to provide a comprehensive understanding of the content.
Supported Sources:
- Local Storage: Process `.mp4`, `.mkv`, and `.avi` files stored on your server.
- Cloud Platforms: Direct integration with YouTube and Vimeo URLs.
- Self-Hosted: Support for private HLS or MP4 streams via URL.
Transcribe API Reference
The Transcribe API is the primary interface for converting media files into LLM-readable formats.
transcribe(source, options)
Processes a media source and returns a structured transcription object.
Parameters:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | `string` | The path to a local file or a valid URL (YouTube, Vimeo, etc.). |
| `options` | `object` | Configuration for processing (see below). |
Options Object:
- `model`: `string` - The specific transcription model to use (default: `standard`).
- `includeTimestamps`: `boolean` - Whether to return timestamps for each sentence.
- `detectSpeakers`: `boolean` - Enables diarization to identify different speakers.
- `visualContext`: `boolean` - (Video only) Enables frame-by-frame analysis to supplement transcription.
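Putting these options together, a fully populated options object might look like the following sketch. The values shown are purely illustrative, not recommended settings:

```typescript
// An options object enabling every documented flag.
const options = {
  model: "standard",       // default transcription model
  includeTimestamps: true, // per-sentence timestamps
  detectSpeakers: true,    // speaker diarization
  visualContext: true      // video only: frame-by-frame visual analysis
};

console.log(Object.keys(options));
```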
Returns: `Promise<TranscriptionResult>`

```typescript
{
  text: string;            // The full transcribed content
  segments: Array<{        // Detailed breakdown of the media
    start: number;
    end: number;
    text: string;
    speaker?: string;
  }>;
  metadata: {
    duration: number;
    sourceType: string;
    language: string;
  };
}
```
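To illustrate how this structure can be consumed, the sketch below walks a hand-written object matching the shape above and renders each segment with its time range and speaker. The sample data is invented for illustration only:

```typescript
// A sample object matching the TranscriptionResult shape.
const sample = {
  text: "Hello everyone. Welcome to the demo.",
  segments: [
    { start: 0, end: 2.5, text: "Hello everyone.", speaker: "S1" },
    { start: 2.5, end: 5.0, text: "Welcome to the demo.", speaker: "S2" },
  ],
  metadata: { duration: 5.0, sourceType: "audio", language: "en" },
};

// Render each segment as "[start-end] speaker: text".
function formatSegments(result: typeof sample): string[] {
  return result.segments.map(
    (s) =>
      `[${s.start.toFixed(1)}-${s.end.toFixed(1)}] ${s.speaker ?? "unknown"}: ${s.text}`
  );
}

console.log(formatSegments(sample).join("\n"));
// → [0.0-2.5] S1: Hello everyone.
//   [2.5-5.0] S2: Welcome to the demo.
```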
Usage Examples
Processing a YouTube Video
This example demonstrates how to ingest a remote video and prepare its content for an AI agent.
```typescript
import { Deeptrain } from 'deeptrain-sdk';

const dt = new Deeptrain('your-api-key');

async function processVideoContent() {
  const result = await dt.transcribe('https://www.youtube.com/watch?v=example', {
    visualContext: true,
    detectSpeakers: true
  });

  console.log("Transcribed Text:", result.text);

  // Feed the transcription into your LLM context
  await dt.agents.updateContext('marketing-agent', result.text);
}
```
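Because `detectSpeakers` labels each returned segment, a small post-processing step can merge consecutive segments from the same speaker into conversational turns before they reach an agent. The helper below is a sketch against the documented segment shape, not part of the SDK:

```typescript
interface Segment { start: number; end: number; text: string; speaker?: string; }

// Merge consecutive segments from the same speaker into single turns.
function mergeTurns(segments: Segment[]): { speaker: string; text: string }[] {
  const turns: { speaker: string; text: string }[] = [];
  for (const s of segments) {
    const speaker = s.speaker ?? "unknown";
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      last.text += " " + s.text; // same speaker: extend the current turn
    } else {
      turns.push({ speaker, text: s.text }); // new speaker: start a new turn
    }
  }
  return turns;
}

const turns = mergeTurns([
  { start: 0, end: 1, text: "Hi.", speaker: "A" },
  { start: 1, end: 2, text: "Welcome.", speaker: "A" },
  { start: 2, end: 3, text: "Thanks.", speaker: "B" },
]);
console.log(turns);
// → [ { speaker: "A", text: "Hi. Welcome." }, { speaker: "B", text: "Thanks." } ]
```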
Local Audio Training
Use local audio files to enhance an agent's knowledge base via localized embeddings.
```typescript
import { Deeptrain } from 'deeptrain-sdk';

const dt = new Deeptrain('your-api-key');

const trainFromAudio = async (filePath: string) => {
  const audioData = await dt.transcribe(filePath, {
    includeTimestamps: false
  });

  // Store in the localized embedding database
  await dt.embeddings.store({
    content: audioData.text,
    metadata: { source: filePath, type: 'audio' }
  });
};

trainFromAudio('./recordings/meeting_notes.wav');
```
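Long recordings can produce transcripts larger than a single embedding comfortably holds. One common approach, sketched below as plain code rather than an SDK feature, is to split the transcript into fixed-size word chunks and store each chunk separately:

```typescript
// Split a transcript into chunks of at most `maxWords` words,
// so long recordings can be embedded piece by piece.
function chunkTranscript(text: string, maxWords = 200): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

console.log(chunkTranscript("one two three four five", 2));
// → ["one two", "three four", "five"]
```

Each chunk could then be passed to `dt.embeddings.store` individually, carrying the same source metadata.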
Internal Components
- FrameProcessor (Internal): Handles the extraction of visual features from video files. Users do not interact with this directly, but its output is toggled via the `visualContext` flag in the Transcribe API.
- StreamBuffer (Internal): Manages data chunking for high-latency cloud sources like Vimeo to ensure stable processing.