Architecture Overview
High-Level System Architecture
Deeptrain (VMTP) operates as a sophisticated middleware layer positioned between diverse data sources and Large Language Models (LLMs). Its primary objective is to act as a Multi-modal Data Connector, abstracting the complexity of data ingestion, processing, and vectorization so that AI agents can consume non-textual information as seamlessly as text.
The architecture is built on three core pillars: Ingestion, Transformation, and Delivery.
Core Components
1. The Multi-modal Connector Engine
The engine serves as the entry point for all data. It handles the authentication and streaming of data from various sources, including local file systems, cloud storage, and third-party platforms like YouTube and Vimeo.
- Public Interface: The engine exposes a unified API to register data streams.
- Transcribe API: A specific gateway for audio and video inputs that handles real-time or batch conversion of spoken word into searchable text.
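To make the registration flow concrete, here is a minimal sketch of what a unified stream-registration API could look like. All names (`ConnectorEngine`, `DataStream`, `register`) are illustrative assumptions, not the actual Deeptrain interface, and authentication is elided.

```python
from dataclasses import dataclass, field

@dataclass
class DataStream:
    source_type: str   # e.g. "file", "cloud", "youtube", "vimeo"
    uri: str
    metadata: dict = field(default_factory=dict)

class ConnectorEngine:
    """Hypothetical entry point that registers heterogeneous data streams."""

    def __init__(self):
        self._streams = []

    def register(self, source_type: str, uri: str, **metadata) -> DataStream:
        # Real implementation would authenticate against the source here.
        stream = DataStream(source_type, uri, metadata)
        self._streams.append(stream)
        return stream

engine = ConnectorEngine()
s = engine.register(
    "youtube",
    "https://www.youtube.com/watch?v=example",
    category="technical_training",
)
```

The single `register` call is the point: callers describe the source, and the engine hides whichever authentication and streaming mechanics that source requires.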
2. Multi-dimensional Processing Layer
This layer is responsible for "transducing" non-LLM-native formats into representations that transformers can process.
- Vision-to-Text Bridge: Converts images, flowcharts, and diagrams into descriptive metadata or structured data.
- Temporal Processing: For video content, the system processes frames and audio tracks concurrently to maintain context over time.
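One way to picture temporal processing is as a merge of two timestamped streams, per-frame visual captions and audio transcript segments, into a single time-ordered context. The sketch below is a toy illustration of that idea, not Deeptrain's actual pipeline.

```python
def merge_temporal(frames, audio):
    """frames and audio are lists of (timestamp_sec, text).
    Returns one timeline sorted by timestamp, so an LLM sees
    visual and spoken context interleaved in order."""
    return sorted(frames + audio, key=lambda item: item[0])

frames = [(0.0, "[frame] title slide"),
          (5.0, "[frame] diagram of the pipeline")]
audio = [(1.2, "[audio] welcome to the install guide"),
         (6.0, "[audio] step one: download the package")]
timeline = merge_temporal(frames, audio)
```

Interleaving by timestamp is what preserves context over time: a caption for a diagram lands next to the narration that explains it.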
3. Localized Embedding Database
To overcome context window limitations, Deeptrain utilizes a localized vector database. Instead of feeding raw data directly into an LLM, Deeptrain:
- Chunks and embeds incoming data.
- Stores it locally.
- Retrieves relevant context in real-time based on the agent's query.
4. Model-Agnostic Adapter
Deeptrain is designed to be model-independent. It features a translation layer that formats processed data into the specific prompt structures required by over 200 different private and open-source models.
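A translation layer like this can be pictured as a registry of formatters keyed by model family: one canonical (context, question) payload in, a model-specific prompt structure out. The two formats below are simplified examples, not Deeptrain's actual adapters.

```python
def to_openai_chat(context: str, question: str) -> list:
    # Chat-style models expect a list of role-tagged messages.
    return [{"role": "system", "content": context},
            {"role": "user", "content": question}]

def to_plain_completion(context: str, question: str) -> str:
    # Completion-style models expect a single flat prompt string.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

ADAPTERS = {
    "openai-chat": to_openai_chat,
    "completion": to_plain_completion,
}

def format_for(model_family: str, context: str, question: str):
    return ADAPTERS[model_family](context, question)

msgs = format_for("openai-chat", "video transcript...", "What are the steps?")
flat = format_for("completion", "video transcript...", "What are the steps?")
```

Supporting a new model then means registering one more formatter, with no change to the ingestion or retrieval pipeline upstream.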
Data Flow Workflow
The following steps illustrate how Deeptrain bridges external data to an LLM:
- Source Integration: Data is pulled via the Connector (e.g., a YouTube URL or a local .mp4).
- Preprocessing: The Transcribe API or Vision modules extract text and visual context.
- Vectorization: The extracted content is converted into high-dimensional vectors using a localized embedding model.
- Retrieval-Augmented Generation (RAG): When an AI agent makes a request, Deeptrain queries the local database for the most relevant snippets.
- Model Inference: The retrieved context is injected into the LLM prompt, providing the agent with "knowledge" it didn't previously have.
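The final injection step in the workflow above can be sketched as a small prompt builder: retrieved snippets are spliced into the prompt before the model call (which is stubbed out here; the builder itself is illustrative).

```python
def build_prompt(snippets, question):
    """Inject retrieved snippets into an instruction-style prompt."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQ: {question}")

prompt = build_prompt(
    ["At 01:32 the presenter runs the installer.",
     "At 03:10 the config file is edited."],
    "What are the installation steps shown in the video?",
)
```

This is the "knowledge it didn't previously have": the model never saw the video, only the retrieved, text-form snippets placed in its prompt.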
Public Interface & Usage
Developers interact with the architecture through a simplified set of APIs. Below is a conceptual example of how to connect a video source for AI processing.
Example: Integrating Video Context
```python
from deeptrain import DeeptrainConnector

# Initialize the connector
dt = DeeptrainConnector(api_key="your_api_key")

# Connect a multi-modal source (e.g., a YouTube video).
# This triggers the Transcribe API and the Vectorization engine.
source = dt.sources.add(
    type="video",
    url="https://www.youtube.com/watch?v=example",
    metadata={"category": "technical_training"}
)

# Query your AI agent with real-time video context
response = dt.query(
    model="gpt-4-vision",
    prompt="Based on the video, what are the three main steps of the installation?"
)
print(response.content)
```
Supported Data Modalities
| Modality | Input Types | Processing Method |
| :--- | :--- | :--- |
| Text | PDF, TXT, Live Docs | Localized Embedding / RAG |
| Images | JPG, PNG, Diagrams | Computer Vision OCR & Contextual Labeling |
| Audio | MP3, WAV, Streams | Transcribe API (Speech-to-Text) |
| Video | Local, YouTube, Vimeo | Multi-dimensional Temporal Processing |
| Graphs | Flowcharts, Diagrams | Structural Analysis & Node-Edge Mapping |
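A router over these modalities can be pictured as a simple file-extension dispatch. The mapping below mirrors the table for illustration; it is a sketch, not Deeptrain's real routing logic, and covers only a few extensions.

```python
# Hypothetical extension → modality map (sketch only).
MODALITY = {
    "pdf": "text", "txt": "text",
    "jpg": "image", "png": "image",
    "mp3": "audio", "wav": "audio",
    "mp4": "video",
}

def processing_method(filename: str) -> str:
    """Pick a processing method for a file, per the modality table."""
    ext = filename.rsplit(".", 1)[-1].lower()
    modality = MODALITY.get(ext, "unknown")
    return {
        "text": "localized embedding / RAG",
        "image": "vision OCR + contextual labeling",
        "audio": "Transcribe API (speech-to-text)",
        "video": "multi-dimensional temporal processing",
    }.get(modality, "unsupported")

method = processing_method("lecture.mp4")
```

Centralizing the dispatch keeps downstream code modality-agnostic: callers hand over a source and receive processed, LLM-ready context regardless of input type.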
Integration Flexibility
The VMTP architecture is designed to be Model-Agnostic. This means you can swap the underlying LLM (e.g., moving from a private Llama-3 instance to a hosted OpenAI model) without re-configuring your data pipelines. The Deeptrain layer ensures that the multi-modal context is delivered in a format compatible with the selected model's input requirements.