# Training Workflows
Deeptrain provides a streamlined path for transforming disparate multi-modal data into training-ready formats for AI agents. Whether you are performing supervised fine-tuning (SFT) or implementing Retrieval-Augmented Generation (RAG), the following workflows outline best practices for utilizing Deeptrain’s data connectors.
## Data Preparation and Ingestion
Before training or inference, data must be normalized through Deeptrain’s ingestion layer. This ensures that multi-modal inputs—such as video transcripts or flowchart logic—are converted into a format your LLM can process.
### Processing Video and Audio
The Transcribe API is the primary entry point for video and audio data. It converts temporal media into structured text and metadata, which can then be used to populate your training datasets.
```python
from deeptrain import TranscribeAPI

# Initialize the API for a remote source
transcriber = TranscribeAPI(source_type="youtube")

# Process a video to retrieve training-ready text
transcription_data = transcriber.process(
    url="https://www.youtube.com/watch?v=example",
    output_format="json",
)

# Use transcription_data for fine-tuning or context injection
print(transcription_data["text"])
```
### Bridging Non-Vision Models
For models that do not natively support computer vision, Deeptrain acts as an intermediary. It processes images, flowcharts, and graphs into descriptive or structural representations (e.g., DOT language for graphs or detailed semantic descriptions for images).
- Best Practice: Use Deeptrain to generate "Visual Context Tokens" that are appended to your text-based training prompts.
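As an illustration of the structural representations described above, a flowchart reduced to a list of edges can be rendered as DOT text and injected into a text-only prompt. The helper below is a minimal sketch of the idea, not part of Deeptrain's API:

```python
def graph_to_dot(edges, name="flow"):
    """Render a list of (source, target) edges as DOT text for prompt injection."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# A two-step flowchart becomes compact, model-readable structure
dot = graph_to_dot([("Start", "Validate"), ("Validate", "Deploy")])
```

Because DOT is plain text, the resulting string can be appended to a prompt exactly like any other context token.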
## Workflow: Retrieval-Augmented Generation (RAG)
To extend an agent's knowledge without full model retraining, use the localized embedding database. This workflow is ideal for real-time data sources.
1. Source Data: Connect Deeptrain to your live data source (e.g., a local directory or a web feed).
2. Vectorization: Deeptrain generates embeddings for the incoming multi-modal content.
3. Real-Time Retrieval: During inference, the agent queries the localized database to retrieve contextually relevant information.
```python
from deeptrain import DataConnector

# Connect to a live data source for real-time retrieval
connector = DataConnector(mode="live")
connector.sync_source("./my_knowledge_base")

# The agent can now retrieve data beyond its fixed context window
context = connector.query("What are the latest updates in the project flow?")
```
## Workflow: Multi-modal Fine-Tuning
When building specialized agents, you may need to fine-tune a model on your specific multi-modal domain. Deeptrain facilitates this by creating unified datasets from varied sources.
### Step 1: Aggregate Multi-modal Content
Combine text, transcribed audio, and image descriptions into a standardized JSONL format.
| Data Type | Deeptrain Output for Training |
| :--- | :--- |
| Video | Timestamps, transcriptions, and visual summaries. |
| Graphs | Adjacency lists or logical flow descriptions. |
| Audio | Semantic transcriptions and tone analysis. |
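As an illustration, aggregated records of these types might serialize to JSONL like this. The field names here are hypothetical, not a fixed Deeptrain schema:

```python
import json

# Hypothetical record shapes -- field names are illustrative only
records = [
    {"type": "video", "text": "Speaker walks through the deploy pipeline.",
     "timestamp": "00:01:23", "visual_summary": "Terminal showing CI logs"},
    {"type": "graph", "text": "Start -> Validate -> Deploy",
     "structure": {"Start": ["Validate"], "Validate": ["Deploy"]}},
    {"type": "audio", "text": "We should ship on Friday.", "tone": "confident"},
]

# One JSON object per line, as most fine-tuning pipelines expect
jsonl = "\n".join(json.dumps(r) for r in records)
```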
### Step 2: Configure the Training Pipeline
Since Deeptrain is model-agnostic, you can export the processed data to any training framework (e.g., Hugging Face Transformers, PyTorch).
```python
# Example: Exporting Deeptrain-processed data for a fine-tuning loop
dataset = connector.export_training_set(
    format="huggingface",
    include_visual_descriptions=True,
)

# dataset is now ready for your Trainer class
```
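If your training framework expects prompt/completion pairs, exported records can be reshaped with a small adapter. The sketch below assumes hypothetical `instruction`, `response`, and `visual_description` fields; substitute whatever fields your export actually contains:

```python
def to_sft_pairs(records):
    """Reshape exported records into prompt/completion pairs for SFT."""
    pairs = []
    for rec in records:
        visual = rec.get("visual_description", "")
        # Prepend visual context so non-vision models can still use it
        prompt = (f"[VISUAL CONTEXT] {visual}\n{rec['instruction']}"
                  if visual else rec["instruction"])
        pairs.append({"prompt": prompt, "completion": rec["response"]})
    return pairs
```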
## Optimization Strategies
To get the most out of your AI agents using Deeptrain, follow these configuration guidelines:
- Context Window Management: For text-heavy training, use Deeptrain’s chunking utilities to ensure data fits within the target model's limits while maintaining semantic coherence.
- Model-Agnostic Selection: Since Deeptrain supports 200+ models, test your processed data across both proprietary (e.g., OpenAI, Anthropic) and open-source (e.g., Llama, Mistral) models to find the best performance-to-cost ratio.
- Transcribe API Buffering: When processing high volumes of video data (Vimeo/YouTube), use batch processing to optimize throughput for large-scale training runs.
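The chunking behavior described in the first guideline can be approximated with a greedy splitter that packs whole paragraphs up to a size budget, preserving semantic coherence at paragraph boundaries. This is an illustrative sketch, not Deeptrain's actual chunking utility:

```python
def chunk_text(text, max_chars=2000):
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            if current:
                chunks.append(current)  # close the full chunk
            current = para  # start a new chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks
```

A production utility would count tokens rather than characters and handle paragraphs that exceed the budget on their own, but the packing logic is the same.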
## Internal Components (Reference)
While users interact primarily with the high-level `DataConnector` and `TranscribeAPI`, the following internal systems handle the heavy lifting:
- Vector Engine (Internal): Manages the localized embedding database and similarity searches.
- Vision Bridge (Internal): Translates visual pixels into semantic text for non-vision LLMs.
- Multi-modal Parser (Internal): Normalizes inputs from different file formats into a consistent internal schema.
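For reference, the similarity search at the heart of a vector engine can be sketched in a few lines. This is a simplified, brute-force illustration; a real engine would use an optimized approximate-nearest-neighbor index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=1):
    """Return names of the k stored vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]),
                    reverse=True)
    return ranked[:k]
```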