Video Processing

Video Processing Overview

Deeptrain's video processing engine lets AI agents ingest, interpret, and learn from video content. It converts raw video into a transcript, time-coded segments, and optional visual descriptions that fit in an LLM context window, so your agents can work with both the visual and the audio content of local files and major streaming platforms.

The platform supports three primary video sources:

Local Storage: Direct file paths from your server or local environment.
YouTube: Public or unlisted video URLs.
Vimeo: Professional video hosting links.

The Transcribe API

The Transcribe API is the primary interface for processing video data. It extracts the audio, converts speech to text, and synchronizes visual metadata, returning the result as a single dataset for your AI models.

Usage Example

from deeptrain import VideoManager

# Initialize the manager
vm = VideoManager(api_key="your_api_key")

# Process a video for AI training
processed_video = vm.transcribe(
    source="https://www.youtube.com/watch?v=example",
    provider="youtube",
    config={
        "extract_metadata": True,
        "language": "en"
    }
)

print(processed_video.transcript)
print(processed_video.metadata)

API Reference: `transcribe()`

Returns: A VideoData object containing the transcript, timestamps, and extracted visual context.

Working with Local Video Files

To process videos stored on your local file system, pass the file path as the source and set the provider to local. The path must be accessible from the environment where Deeptrain runs. Local processing is well suited to proprietary data, internal recordings, or sensitive training material.

# Processing a local MP4 file
local_data = vm.transcribe(
    source="/path/to/video/training_demo.mp4",
    provider="local"
)
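
Because a local path must be reachable from the environment where Deeptrain runs, it can help to validate the path before calling transcribe(). The following is a minimal sketch using only the Python standard library. The helper name resolve_local_video and the extension allow-list are assumptions made for illustration; they are not part of the Deeptrain API, and the formats Deeptrain accepts are not specified here.

```python
from pathlib import Path

# Hypothetical allow-list for illustration only; adjust to the formats you use.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm"}


def resolve_local_video(path: str) -> str:
    """Return the absolute path of a video file, or raise if it is unusable."""
    p = Path(path).expanduser().resolve()
    if not p.is_file():
        raise FileNotFoundError(f"Video file not found: {p}")
    if p.suffix.lower() not in VIDEO_EXTENSIONS:
        raise ValueError(f"Unexpected video extension: {p.suffix!r}")
    return str(p)
```

The returned string can then be passed as the source argument, for example vm.transcribe(source=resolve_local_video("/path/to/video/training_demo.mp4"), provider="local").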

Integrating YouTube and Vimeo

Deeptrain abstracts the complexity of web scraping and API management for video platforms. When you provide a URL, Deeptrain fetches the necessary streams for processing, so no manual download is required.

YouTube Integration

Use the youtube provider to ingest public educational content, tutorials, or webinars directly into your agent's knowledge base.

# Syncing a YouTube tutorial
yt_context = vm.transcribe(
    source="https://youtu.be/dQw4w9WgXcQ",
    provider="youtube"
)

Vimeo Integration

For high-quality professional content or private enterprise videos, use the vimeo provider.

# Syncing a Vimeo presentation
vimeo_context = vm.transcribe(
    source="https://vimeo.com/123456789",
    provider="vimeo"
)
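
The examples above pass the provider explicitly. If your sources come from mixed inputs, one option is to derive the provider from the source string before calling transcribe(). This is a sketch under stated assumptions: infer_provider is a hypothetical helper, not a Deeptrain function, and the host list is a guess at the common domains for each platform. The provider names "local", "youtube", and "vimeo" are taken from the examples in this document.

```python
from urllib.parse import urlparse

# Assumed mapping from host name to provider; extend as needed.
_PROVIDER_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "vimeo.com": "vimeo",
}


def infer_provider(source: str) -> str:
    """Guess the provider name for a source string.

    Anything that is not an http(s) URL is treated as a local path.
    """
    parsed = urlparse(source)
    if parsed.scheme not in ("http", "https"):
        return "local"
    host = (parsed.hostname or "").lower()
    for domain, provider in _PROVIDER_HOSTS.items():
        # Match the bare domain and any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return provider
    raise ValueError(f"No known provider for host: {host!r}")
```

A call could then look like vm.transcribe(source=url, provider=infer_provider(url)). Raising on unknown hosts, rather than falling back to a default, keeps unsupported URLs from being sent for processing by mistake.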

Multi-dimensional Output

When a video is processed, Deeptrain generates structured output that can be fed directly into an LLM or a vector database:

Textual Transcript: A full text-based representation of the audio.
Temporal Metadata: Time-coded segments that allow the AI to reference specific moments in the video.
Visual Descriptions (Optional): Keyframe analysis that describes the visual scene, enabling non-vision models to "understand" the video content.

Data Structure

The returned VideoData object follows this structure:

{
  "source_id": "unique_video_id",
  "full_text": "The transcript content...",
  "segments": [
    {
      "start_time": 0.0,
      "end_time": 10.5,
      "text": "Introduction to the topic."
    }
  ],
  "metadata": {
    "duration": 120,
    "resolution": "1080p",
    "provider": "youtube"
  }
}
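
To illustrate how the time-coded segments can be used, the sketch below looks up the segment that covers a given moment and renders the segments as timestamped lines suitable for an LLM prompt. It assumes the data is available as a plain Python dict with the field names shown above; how a VideoData object is converted to such a dict is not specified here, and the helper names segment_at and to_timestamped_text are hypothetical.

```python
from typing import Optional

# Sample data copied from the structure shown above.
sample = {
    "source_id": "unique_video_id",
    "full_text": "The transcript content...",
    "segments": [
        {"start_time": 0.0, "end_time": 10.5, "text": "Introduction to the topic."}
    ],
    "metadata": {"duration": 120, "resolution": "1080p", "provider": "youtube"},
}


def segment_at(video: dict, t: float) -> Optional[dict]:
    """Return the segment whose [start_time, end_time) range contains t seconds."""
    for seg in video["segments"]:
        if seg["start_time"] <= t < seg["end_time"]:
            return seg
    return None


def to_timestamped_text(video: dict) -> str:
    """Render each segment as '[MM:SS] text', one segment per line."""
    lines = []
    for seg in video["segments"]:
        minutes, seconds = divmod(int(seg["start_time"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)
```

Treating each segment as a half-open interval means a timestamp that falls exactly on a boundary belongs to the segment that starts there, so adjacent segments never both match.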
