# Data Processing Pipeline
The Deeptrain pipeline transforms unstructured, multi-modal data into a structured format that Large Language Models (LLMs) can ingest and reason over. The pipeline follows a four-stage lifecycle: Ingestion, Transformation, Embedding, and Retrieval.
## 1. Data Ingestion
Deeptrain acts as a connector between external data silos and your AI environment. It supports various input streams, including local files, self-hosted media, and third-party platforms.
- Text/Live Data: Real-time fetching from live sources.
- Visuals: Uploads of images, flowcharts, and diagrams.
- Video/Audio: Support for local storage, Vimeo, and YouTube via the Transcribe API.
Example: Ingesting a Video Source
```python
from deeptrain import DataConnector

# Initialize the connector for a video source (here, a YouTube URL)
connector = DataConnector(source_type="video")
data_stream = connector.ingest(url="https://www.youtube.com/watch?v=example")
```
## 2. Multi-modal Transformation
Once data is ingested, Deeptrain's processing engine breaks down the content based on its modality. This stage is critical for model-agnostic support, as it converts non-textual data into interpretable formats.
- Transcription: Audio and video files are processed via the Transcribe API, converting speech to timestamped text.
- Computer Vision (CV) Analysis: For models without native vision support, Deeptrain analyzes images, flowcharts, and graphs, generating descriptive metadata and relational mappings that describe the visual content.
- Video Dimensionality: Videos are parsed into frames and audio tracks to extract multi-dimensional context.
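The output of this stage can be pictured as a stream of timestamped, text-interpretable segments. Deeptrain's internal schema is not documented here, so the `Segment` record and `to_prompt_text` helper below are purely illustrative of how mixed-modality results might be flattened into model-readable text:

```python
from dataclasses import dataclass

# Hypothetical normalized record; field names are illustrative, not Deeptrain's API.
@dataclass
class Segment:
    modality: str   # "audio", "video", or "image"
    start_s: float  # timestamp in seconds (0.0 for still images)
    text: str       # transcript text or generated visual description

def to_prompt_text(segments: list[Segment]) -> str:
    """Flatten mixed-modality segments into one interpretable text block."""
    return "\n".join(f"[{s.modality} @ {s.start_s:.1f}s] {s.text}" for s in segments)

segments = [
    Segment("audio", 0.0, "Welcome to the architecture overview."),
    Segment("image", 12.5, "Flowchart: ingestion feeds transformation, then embedding."),
]
print(to_prompt_text(segments))
```

Representing visuals as descriptive text is what makes the pipeline work with models that lack native vision support.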
## 3. Vectorization and Localized Storage
To overcome the context window limitations of standard LLMs, Deeptrain utilizes a localized embedding database.
- Chunking: Transformed text and metadata are broken into optimized segments.
- Vectorization: Data is converted into high-dimensional vectors using embedding models.
- Indexing: Vectors are stored in a localized database, allowing for semantic search rather than simple keyword matching.
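The chunk-embed-index mechanics can be sketched in plain Python. This is a toy model: the bag-of-words `embed` stands in for the learned embedding models Deeptrain actually supports, and the fixed word-count `chunk` is a stand-in for whatever segmentation heuristic the real pipeline uses:

```python
import math
from collections import Counter

def chunk(text: str, max_words: int = 8) -> list[str]:
    """Break transformed text into fixed-size segments (a stand-in heuristic)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'vector'; a real pipeline calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index: chunk two transformed documents and store (chunk, vector) pairs.
docs = [
    "Transcripts are chunked and embedded before indexing.",
    "Retrieval injects the most relevant context into the prompt.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# Query: vectorize the question, then rank chunks by similarity.
query_vec = embed("which chunk covers retrieval and context injection")
best_chunk = max(index, key=lambda pair: cosine(query_vec, pair[1]))[0]
```

Ranking by vector similarity rather than exact term matches is what distinguishes semantic search from keyword search.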
| Feature | Description |
| :--- | :--- |
| Model Agnostic | Supports 200+ models for generating embeddings. |
| Real-time Updates | Content from live sources is indexed and updated dynamically. |
## 4. Retrieval and Model Integration
The final stage of the pipeline facilitates the interaction between the AI agent and the processed data. Instead of feeding the entire dataset into the LLM, Deeptrain performs Retrieval-Augmented Generation (RAG).
When a user or agent queries the system:
- The query is vectorized.
- The pipeline retrieves the most relevant context from the localized embedding database.
- The relevant context is injected into the LLM's prompt, enabling it to provide accurate, real-time responses based on data it was not originally trained on.
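The final injection step amounts to assembling a prompt from the retrieved chunks. The exact prompt template Deeptrain uses is not specified, so `build_rag_prompt` below is a minimal sketch of the general RAG pattern:

```python
def build_rag_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Inject retrieved context ahead of the user query (assumed prompt shape)."""
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

prompt = build_rag_prompt(
    "What were the key takeaways?",
    ["Speaker outlines a four-stage pipeline.", "Embeddings enable semantic search."],
)
print(prompt)
```

Because only the top-ranked chunks are injected, the LLM can answer over arbitrarily large datasets without exceeding its context window.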
Example: Querying the Integrated Pipeline
```python
# Query the agent with context retrieved from a previously processed video.
# `agent` is assumed to be a previously configured Deeptrain agent instance;
# `data_stream` comes from the ingestion example above.
response = agent.query(
    "What were the key takeaways from the video documentation?",
    context_source=data_stream.id,
)
print(response.content)
```
## Summary of Data Flow
| Stage | Input | Output |
| :--- | :--- | :--- |
| Ingestion | Raw Files (URL, MP4, PNG) | Raw Data Stream |
| Transformation | Raw Data Stream | Transcripts, Vision Metadata |
| Embedding | Transcripts/Metadata | Vector Embeddings |
| Retrieval | User Query | Context-Aware LLM Response |