Video Ingestion
Deeptrain allows you to transform video content into actionable data for your AI agents. By processing visual and auditory information from various sources, you can equip your models with context that was previously locked in video formats.
Supported Sources
Deeptrain supports video ingestion from both remote and local sources:
- Public Platforms: YouTube and Vimeo.
- Local Storage: Common video formats (MP4, AVI, MOV, etc.) stored on your machine or server.
- Self-Hosted: Direct links to video files hosted on private servers.
Ingesting Videos
To integrate video content, you use the VideoIngestor module. This module handles the fetching, frame analysis, and audio extraction required to prepare the data for your LLM.
Using Public URLs (YouTube/Vimeo)
You can ingest videos directly by providing the URL. Deeptrain handles the stream resolution and data extraction.
```python
from deeptrain import VideoIngestor

# Initialize the ingestor
ingestor = VideoIngestor()

# Ingest a YouTube video
video_data = ingestor.ingest(
    source="https://www.youtube.com/watch?v=example",
    include_timestamps=True
)

print(video_data.summary)
```
Using Local Files
For local files, provide the absolute or relative path to the video file.
```python
# Ingest a local MP4 file
local_video = ingestor.ingest(
    source="./data/recordings/meeting_01.mp4",
    sampling_rate=1.0  # Capture 1 frame per second for visual context
)
```
Transcribe API
The Transcribe API is the primary interface for converting video speech into text that AI agents can index and query. It supports multi-dimensional processing, meaning it tracks visual changes alongside the transcribed text.
API Interface
Method: transcribe(source, **options) — options are passed as keyword arguments, as in the usage example below.
Inputs:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | str | The URL (YouTube/Vimeo) or local file path. |
| language | str | (Optional) ISO 639-1 language code. Defaults to auto-detection. |
| output_format | str | The desired format: json, text, or srt. |
| visual_context | bool | If True, includes image descriptions mapped to timestamps. |
Output:
Returns a TranscriptionResponse object containing:
- text: The full transcript string.
- segments: A list of dictionaries containing start_time, end_time, and content.
- metadata: Video duration, resolution, and source details.
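When output_format is srt, timestamps follow the standard SubRip convention (HH:MM:SS,mmm). If you need to produce or validate such timestamps yourself, the formatting is straightforward (a standalone sketch, independent of the Deeptrain API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print(srt_timestamp(3725.5))  # 01:02:05,500
```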
Usage Example
```python
from deeptrain.api import TranscribeAPI

# Set up the API
api = TranscribeAPI(api_key="your_api_key")

# Process a video for AI training
response = api.transcribe(
    source="https://vimeo.com/123456789",
    visual_context=True
)

# Access the processed data
for segment in response.segments:
    print(f"[{segment['start_time']}] {segment['content']}")
```
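The loop above prints segments one at a time; the same data can be flattened into a single timestamped transcript string, which is often the shape you want for indexing. A hypothetical helper (segments modeled as plain dicts, per the output schema described earlier):

```python
def segments_to_text(segments: list[dict]) -> str:
    """Join transcript segments into one timestamped string."""
    return "\n".join(
        f"[{seg['start_time']}] {seg['content']}" for seg in segments
    )

# Example segments in the documented shape
segments = [
    {"start_time": "00:00:01", "end_time": "00:00:04", "content": "Welcome to the demo."},
    {"start_time": "00:00:05", "end_time": "00:00:09", "content": "Today we cover ingestion."},
]
print(segments_to_text(segments))
```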
Integration with AI Agents
Once ingested, the video data can be fed into your local embedding database or used directly as context for supported language models.
```python
from deeptrain.agents import Agent

agent = Agent(model="gpt-4")

# Attach the ingested video context to the agent
agent.add_context(video_data)

# Query the agent about the video content
response = agent.ask("What were the key takeaways from the video presentation?")
```
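Long videos can produce more transcript than a model's context window comfortably holds. One pragmatic approach (a sketch, not a Deeptrain feature) is to cap the joined segment text at a character budget before attaching it as context:

```python
def budget_context(segments: list[dict], max_chars: int) -> str:
    """Concatenate segment contents until the character budget is reached."""
    parts: list[str] = []
    used = 0
    for seg in segments:
        content = seg["content"]
        if used + len(content) > max_chars:
            break  # stop before exceeding the budget
        parts.append(content)
        used += len(content) + 1  # +1 for the joining space
    return " ".join(parts)

segments = [{"content": "alpha"}, {"content": "beta"}, {"content": "gamma"}]
print(budget_context(segments, 11))  # "alpha beta" fits; "gamma" does not
```

A token-based budget (via your model's tokenizer) is more precise, but a character cap is a reasonable first approximation.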
Configuration Note
While the ingestion process is automated, ensure your environment has the necessary codecs installed (e.g., ffmpeg) for processing local files. For public URLs, Deeptrain manages dependencies internally via its cloud-optimized scrapers.
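A quick way to verify the codec dependency before ingesting local files is to check that ffmpeg is on your PATH (standard-library only; the `check_ffmpeg` helper is illustrative):

```python
import shutil

def check_ffmpeg() -> str:
    """Report whether ffmpeg is available on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return "ffmpeg not found; install it (e.g. apt install ffmpeg or brew install ffmpeg)"
    return f"ffmpeg found at {path}"

print(check_ffmpeg())
```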