Multi-dimensional Video Processing
Deeptrain's multi-dimensional video capabilities let you transform raw video data into structured, AI-ready information. Unlike standard transcription services, Deeptrain processes video along multiple dimensions (audio, visual context, and temporal metadata) so your AI agents understand the full context of the content.
Supported Data Sources
Deeptrain supports ingestion from a variety of sources to ensure flexibility in your data pipeline:
- Public Platforms: Native support for YouTube and Vimeo URLs.
- Self-Hosted: Direct integration with self-hosted video repositories via URI.
- Local Storage: Upload and process .mp4, .mkv, .avi, and other common formats directly from your local environment.
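As a rough illustration of how a pipeline might route these three source types, the helper below inspects the source string. The function name `classify_source` and the routing labels are hypothetical, not part of the Deeptrain SDK:

```python
from urllib.parse import urlparse

def classify_source(source: str) -> str:
    """Route a source string to one of the three supported ingestion paths
    (hypothetical client-side helper, not an SDK call)."""
    parsed = urlparse(source)
    host = parsed.netloc.lower()
    if any(h in host for h in ("youtube.com", "youtu.be", "vimeo.com")):
        return "platform"      # native YouTube/Vimeo support
    if parsed.scheme in ("http", "https"):
        return "self_hosted"   # generic self-hosted video URI
    return "local"             # treated as a local file path

print(classify_source("https://www.youtube.com/watch?v=example"))  # platform
print(classify_source("./data/internal_training_session.mp4"))     # local
```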
Transcribe API
The Transcribe API is the primary interface for converting video content into consumable data for LLMs. It handles the extraction of dialogue, visual descriptions, and metadata.
API Interface
```ts
interface TranscribeRequest {
  source: string;                    // URL (YouTube/Vimeo) or local file path
  mode: 'audio' | 'visual' | 'full'; // 'full' includes both transcript and visual scene descriptions
  options?: {
    enable_timestamps: boolean;
    language_code?: string;
    frame_sampling_rate?: number;    // Frequency of visual analysis (for visual/full mode)
  };
}

interface TranscribeResponse {
  job_id: string;
  transcript: string;
  metadata: {
    duration: number;
    resolution: string;
    author?: string;
  };
  segments: Array<{
    start: number;
    end: number;
    text: string;
    visual_description?: string;
  }>;
}
```
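For readers working in Python, the response shape can be mirrored with `TypedDict`s. This is a sketch for reference only; field names follow the interface above, but the SDK may expose the response as object attributes rather than a dict, and all values below are illustrative:

```python
from typing import List, TypedDict

class Segment(TypedDict, total=False):
    start: float
    end: float
    text: str
    visual_description: str  # present only in 'visual'/'full' mode

class Metadata(TypedDict, total=False):
    duration: float
    resolution: str
    author: str

class TranscribeResponse(TypedDict):
    job_id: str
    transcript: str
    metadata: Metadata
    segments: List[Segment]

# Example payload shaped like the interface (illustrative values).
response: TranscribeResponse = {
    "job_id": "job_123",
    "transcript": "Welcome to the session...",
    "metadata": {"duration": 420.0, "resolution": "1920x1080"},
    "segments": [
        {"start": 0.0, "end": 4.2, "text": "Welcome to the session...",
         "visual_description": "Title slide with agenda"},
    ],
}
```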
Usage Examples
Processing a YouTube Video
To integrate a YouTube video into your agent's knowledge base, pass the URL to the transcription engine.
```python
from deeptrain import VideoConnector

# Initialize the connector
connector = VideoConnector(api_key="your_api_key")

# Process a remote video
video_data = connector.transcribe(
    source="https://www.youtube.com/watch?v=example",
    mode="full",
    options={
        "enable_timestamps": True,
        "language_code": "en-US",
    },
)

print(f"Processed Video: {video_data.transcript[:100]}...")
```
Local Video Ingestion
For private data or local datasets, provide the path to the video file.
```python
# Process a local file for embedding
local_video = connector.transcribe(
    source="./data/internal_training_session.mp4",
    mode="audio",
)

# The output can now be passed to your local embedding database
connector.save_to_vector_store(local_video)
```
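Before embedding, each segment's text and visual description can be flattened into one timestamped chunk. The function below is a hypothetical pre-processing step over the `segments` array, not a Deeptrain API:

```python
def segments_to_chunks(segments):
    """Merge each segment's transcript text and (optional) visual
    description into one timestamped string, ready for embedding."""
    chunks = []
    for seg in segments:
        parts = [seg["text"]]
        if seg.get("visual_description"):
            parts.append(f"[visual: {seg['visual_description']}]")
        chunks.append(f"[{seg['start']:.1f}-{seg['end']:.1f}s] " + " ".join(parts))
    return chunks

# Sample segments shaped like the TranscribeResponse
segments = [
    {"start": 0.0, "end": 4.5, "text": "Intro remarks",
     "visual_description": "Title slide"},
    {"start": 4.5, "end": 9.0, "text": "Agenda overview"},
]
print(segments_to_chunks(segments)[0])
# [0.0-4.5s] Intro remarks [visual: Title slide]
```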
Multi-modal Synchronization
The "Multi-dimensional" aspect of Deeptrain ensures that the output is synchronized. When using mode: 'full', the API returns a temporal map where visual scene descriptions are aligned with the audio transcript. This allows your AI agents to answer complex queries such as:
- "What was shown on the screen when the speaker mentioned 'architecture'?"
- "Summarize the visual steps shown between the 2-minute and 5-minute marks."
Performance and Limits
- Maximum File Size: 2GB for local uploads.
- Processing Time: Typically 1x–1.5x the video duration for "full" multi-dimensional processing.
- Model Agnostic: The resulting data can be injected into the context window of any of the 200+ supported LLMs (OpenAI, Anthropic, Llama, etc.).
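Because local uploads are capped at 2GB, a client can verify file size before submitting a job. This guard is a hypothetical client-side check, not an SDK feature:

```python
import os

MAX_UPLOAD_BYTES = 2 * 1024**3  # 2GB local-upload limit

def within_upload_limit(path: str) -> bool:
    """Return True if the file fits under the local-upload cap."""
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES
```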