Multi-dimensional Video Processing
Deeptrain's multi-dimensional video capabilities let you transform raw video data into structured, AI-ready information. Unlike standard transcription services, Deeptrain processes video along multiple dimensions (audio, visual context, and temporal metadata) so your AI agents understand the full context of the content.
Supported Data Sources
Deeptrain supports ingestion from a variety of sources to ensure flexibility in your data pipeline:
- Public Platforms: Native support for YouTube and Vimeo URLs.
- Self-Hosted: Direct integration with self-hosted video repositories via URI.
- Local Storage: Upload and process .mp4, .mkv, .avi, and other common formats directly from your local environment.
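As a rough illustration of how a pipeline might route these three source types, the helper below inspects the source string. The function name `classify_source` and the routing labels are hypothetical, not part of the Deeptrain SDK:

```python
from urllib.parse import urlparse

def classify_source(source: str) -> str:
    """Route a source string to one of the three supported ingestion paths
    (hypothetical client-side helper, not an SDK call)."""
    parsed = urlparse(source)
    host = parsed.netloc.lower()
    if any(h in host for h in ("youtube.com", "youtu.be", "vimeo.com")):
        return "platform"      # native YouTube/Vimeo support
    if parsed.scheme in ("http", "https"):
        return "self_hosted"   # generic self-hosted video URI
    return "local"             # treated as a local file path

print(classify_source("https://www.youtube.com/watch?v=example"))  # platform
print(classify_source("./data/internal_training_session.mp4"))     # local
```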
Transcribe API
The Transcribe API is the primary interface for converting video content into consumable data for LLMs. It handles the extraction of dialogue, visual descriptions, and metadata.
API Interface
```ts
interface TranscribeRequest {
  source: string;                    // URL (YouTube/Vimeo) or local file path
  mode: 'audio' | 'visual' | 'full'; // 'full' includes both transcript and visual scene descriptions
  options?: {
    enable_timestamps: boolean;
    language_code?: string;
    frame_sampling_rate?: number;    // Frequency of visual analysis (for visual/full mode)
  };
}

interface TranscribeResponse {
  job_id: string;
  transcript: string;
  metadata: {
    duration: number;
    resolution: string;
    author?: string;
  };
  segments: Array<{
    start: number;
    end: number;
    text: string;
    visual_description?: string;
  }>;
}
```
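For readers working in Python, the response shape can be mirrored with `TypedDict`s. This is a sketch for reference only; field names follow the interface above, but the SDK may expose the response as object attributes rather than a dict, and all values below are illustrative:

```python
from typing import List, TypedDict

class Segment(TypedDict, total=False):
    start: float
    end: float
    text: str
    visual_description: str  # present only in 'visual'/'full' mode

class Metadata(TypedDict, total=False):
    duration: float
    resolution: str
    author: str

class TranscribeResponse(TypedDict):
    job_id: str
    transcript: str
    metadata: Metadata
    segments: List[Segment]

# Example payload shaped like the interface (illustrative values).
response: TranscribeResponse = {
    "job_id": "job_123",
    "transcript": "Welcome to the session...",
    "metadata": {"duration": 420.0, "resolution": "1920x1080"},
    "segments": [
        {"start": 0.0, "end": 4.2, "text": "Welcome to the session...",
         "visual_description": "Title slide with agenda"},
    ],
}
```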
Usage Examples
Processing a YouTube Video
To integrate a YouTube video into your agent's knowledge base, pass the URL to the transcription engine.
```python
from deeptrain import VideoConnector

# Initialize the connector
connector = VideoConnector(api_key="your_api_key")

# Process a remote video
video_data = connector.transcribe(
    source="https://www.youtube.com/watch?v=example",
    mode="full",
    options={
        "enable_timestamps": True,
        "language_code": "en-US",
    },
)

print(f"Processed Video: {video_data.transcript[:100]}...")
```
Local Video Ingestion
For private data or local datasets, provide the path to the video file.
```python
# Process a local file for embedding
local_video = connector.transcribe(
    source="./data/internal_training_session.mp4",
    mode="audio",
)

# The output can now be passed to your local embedding database
connector.save_to_vector_store(local_video)
```
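Before embedding, each segment's text and visual description can be flattened into one timestamped chunk. The function below is a hypothetical pre-processing step over the `segments` array, not a Deeptrain API:

```python
def segments_to_chunks(segments):
    """Merge each segment's transcript text and (optional) visual
    description into one timestamped string, ready for embedding."""
    chunks = []
    for seg in segments:
        parts = [seg["text"]]
        if seg.get("visual_description"):
            parts.append(f"[visual: {seg['visual_description']}]")
        chunks.append(f"[{seg['start']:.1f}-{seg['end']:.1f}s] " + " ".join(parts))
    return chunks

# Sample segments shaped like the TranscribeResponse
segments = [
    {"start": 0.0, "end": 4.5, "text": "Intro remarks",
     "visual_description": "Title slide"},
    {"start": 4.5, "end": 9.0, "text": "Agenda overview"},
]
print(segments_to_chunks(segments)[0])
# [0.0-4.5s] Intro remarks [visual: Title slide]
```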
Multi-modal Synchronization
The "Multi-dimensional" aspect of Deeptrain ensures that the output is synchronized. When using mode: 'full', the API returns a temporal map where visual scene descriptions are aligned with the audio transcript. This allows your AI agents to answer complex queries such as:
- "What was shown on the screen when the speaker mentioned 'architecture'?"
- "Summarize the visual steps shown between the 2-minute and 5-minute marks."
Performance and Limits
- Maximum File Size: 2GB for local uploads.
- Processing Time: Typically 1x–1.5x the video duration for "full" multi-dimensional processing.
- Model Agnostic: The resulting data can be injected into the context window of any of the 200+ supported LLMs (OpenAI, Anthropic, Llama, etc.).
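Because local uploads are capped at 2GB, a client can verify file size before submitting a job. This guard is a hypothetical client-side check, not an SDK feature:

```python
import os

MAX_UPLOAD_BYTES = 2 * 1024**3  # 2GB local-upload limit

def within_upload_limit(path: str) -> bool:
    """Return True if the file fits under the local-upload cap."""
    return os.path.getsize(path) <= MAX_UPLOAD_BYTES
```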