Video Knowledge Extraction
Deeptrain's video processing engine allows AI agents to ingest, interpret, and learn from temporal data. By converting video content into structured, searchable knowledge, you can bridge the gap between visual media and text-based Large Language Models.
Supported Video Sources
Deeptrain provides a unified interface for various video formats and hosting platforms:
- Public Platforms: Seamlessly ingest content from YouTube and Vimeo using standard URLs.
- Local Storage: Process .mp4, .avi, .mov, and other standard video formats stored on your local file system.
- Self-Hosted/Direct Links: Integrate videos hosted on private servers or cloud storage via direct HTTP/S links.
Transcribe API
The Transcribe API is the primary interface for extracting knowledge from video sources. It handles the multi-dimensional task of audio transcription and visual context extraction, preparing the data for LLM consumption.
Input Parameters
The Transcribe API accepts the following parameters (source and source_type are required; options may be omitted):
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | string | The URL (YouTube/Vimeo/Direct) or the local file path to the video. |
| source_type | string | The type of source: "youtube", "vimeo", "local", or "url". |
| options | dict | (Optional) Configuration for transcription accuracy, language hints, or timestamping. |
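In practice, the correct source_type can usually be inferred from the source string itself. The helper below is an illustrative sketch, not part of the Deeptrain SDK, showing one way to map a source string to the four documented values:

```python
from urllib.parse import urlparse

def infer_source_type(source: str) -> str:
    """Guess the Transcribe API source_type for a given source string.

    Illustrative helper only -- not part of the Deeptrain SDK.
    """
    host = urlparse(source).netloc.lower()
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if "vimeo.com" in host:
        return "vimeo"
    if source.startswith(("http://", "https://")):
        return "url"  # direct HTTP/S link to a self-hosted video
    return "local"    # otherwise, treat it as a local file path
```

A helper like this keeps calling code uniform: agents can pass any URL or path through one entry point instead of branching on the source kind themselves.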
Output Schema
The API returns a structured object containing the extracted knowledge:
{
  "video_id": "string",
  "transcript": "string",
  "metadata": {
    "duration": "float",
    "resolution": "string",
    "platform": "string"
  },
  "knowledge_chunks": [
    {
      "timestamp": "00:01:20",
      "content": "Segment text or visual description"
    }
  ]
}
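A response shaped like this schema can be post-processed in plain Python, for example by converting each chunk's HH:MM:SS timestamp into seconds for indexing. The payload below is an illustrative sample, not real API output:

```python
def timestamp_to_seconds(ts: str) -> int:
    """Convert an "HH:MM:SS" chunk timestamp into an offset in seconds."""
    hours, minutes, seconds = (int(part) for part in ts.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Illustrative payload shaped like the Transcribe API output schema.
result = {
    "video_id": "abc123",
    "transcript": "Full transcript text...",
    "metadata": {"duration": 95.0, "resolution": "1920x1080", "platform": "youtube"},
    "knowledge_chunks": [
        {"timestamp": "00:01:20", "content": "Speaker introduces the demo."},
    ],
}

for chunk in result["knowledge_chunks"]:
    offset = timestamp_to_seconds(chunk["timestamp"])  # 80 for "00:01:20"
    print(offset, chunk["content"])
```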
Usage Examples
Processing a YouTube Video
To expand your agent's knowledge base with a YouTube tutorial or lecture:
from deeptrain import VideoModule

# Initialize the Video Module
video_engine = VideoModule(api_key="your_api_key")

# Extract knowledge from a YouTube URL
result = video_engine.transcribe(
    source="https://www.youtube.com/watch?v=example",
    source_type="youtube"
)

print(f"Extracted Transcript: {result['transcript'][:100]}...")
Processing Local Video Files
For proprietary data or internal training videos stored locally:
# Extract knowledge from a local MP4 file
local_result = video_engine.transcribe(
    source="./data/internal_demo.mp4",
    source_type="local"
)

# Inject the extracted content into your AI agent's memory
agent.learn(local_result['transcript'])
Key Capabilities
- Multi-Dimensional Analysis: Deeptrain doesn't just look at text; it processes audio and visual cues to provide a comprehensive context that standard transcription services often miss.
- Temporal Indexing: Every piece of extracted knowledge is timestamped, allowing your AI agent to reference specific moments within a video during a conversation.
- Model Agnostic: Extracted video data can be fed into any of the 200+ supported LLMs, whether they are private deployments or open-source models.
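Temporal indexing makes it straightforward for an agent to cite the moment a statement was made. A minimal sketch, assuming knowledge_chunks are sorted by timestamp as in the output schema above (the sample chunks are illustrative):

```python
import bisect

def chunk_at(chunks: list[dict], second: int) -> dict:
    """Return the knowledge chunk covering a given second of the video.

    Assumes chunks are sorted ascending by their "timestamp" field.
    """
    def to_seconds(ts: str) -> int:
        h, m, s = (int(p) for p in ts.split(":"))
        return h * 3600 + m * 60 + s

    starts = [to_seconds(c["timestamp"]) for c in chunks]
    # Index of the last chunk that starts at or before `second`.
    i = max(bisect.bisect_right(starts, second) - 1, 0)
    return chunks[i]

chunks = [
    {"timestamp": "00:00:00", "content": "Intro"},
    {"timestamp": "00:01:20", "content": "Architecture overview"},
    {"timestamp": "00:05:00", "content": "Live demo"},
]
print(chunk_at(chunks, 200)["content"])  # 200 s falls in the 00:01:20 chunk
```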
Note: For high-resolution or long-form videos, processing time may vary based on the complexity of the visual data. It is recommended to use the asynchronous processing flag for videos exceeding 30 minutes.
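The 30-minute guideline above can be encoded as a small gate before submitting a job. The exact name of the asynchronous processing flag is not specified here, so the helper below only makes the decision; wiring it to the actual flag is left to the SDK call:

```python
LONG_FORM_THRESHOLD_S = 30 * 60  # 30 minutes, per the note above

def should_process_async(duration_seconds: float) -> bool:
    """Return True when a video is long enough to warrant async processing."""
    return duration_seconds > LONG_FORM_THRESHOLD_S

print(should_process_async(45 * 60))  # a 45-minute lecture -> True
```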