Embedding Databases
Deeptrain uses localized vector storage for real-time content retrieval, allowing your AI agents to reference data that exceeds standard context window limits. By indexing multi-modal content into a local embedding database, Deeptrain enables Retrieval-Augmented Generation (RAG) directly on your own infrastructure.
Localized Vector Storage
The embedding database acts as a bridge between your raw data sources (text, video transcripts, flowcharts) and the LLM. It stores high-dimensional representations (embeddings) of your data, enabling semantic search rather than simple keyword matching.
Key Features:
- Context Expansion: Bypass the token limits of models like GPT-4 or Claude by retrieving only the most relevant snippets.
- Privacy-Centric: Store your vector indices locally to ensure sensitive data does not leave your environment during the retrieval phase.
- Real-Time Sync: Update the database with live content to provide your agents with the most current information.
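To make the contrast with keyword matching concrete, here is a minimal, self-contained sketch of semantic search: each document is represented by a vector, and the query is matched by cosine similarity rather than shared words. The toy 3-dimensional vectors and example strings below are illustrative stand-ins for real embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; a real embedding model produces
# hundreds of dimensions per chunk of content.
index = {
    "How to cancel a subscription": [0.9, 0.1, 0.2],
    "API rate limits explained":    [0.1, 0.8, 0.3],
}

# Pretend embedding of the query "stop my billing plan" -- note it shares
# no keywords with the best-matching document.
query_vector = [0.85, 0.15, 0.25]

best = max(index, key=lambda text: cosine_similarity(query_vector, index[text]))
print(best)  # the semantically closest entry, despite zero keyword overlap
```

Because similarity is computed in the embedding space, "stop my billing plan" retrieves the cancellation document even though the two strings share no words.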
Configuration and Initialization
To begin using the embedding database, you must initialize the storage manager and define the embedding model to be used. Deeptrain is model-agnostic and supports over 200 private and open-source models for generating embeddings.
```python
from deeptrain import VectorManager

# Initialize the localized database
vm = VectorManager(
    database_type="local",  # Options: "local", "chroma", "faiss"
    storage_path="./data/vectors",
    embedding_model="text-embedding-3-small"  # Supports local and API-based models
)
```
| Parameter | Type | Description |
| :--- | :--- | :--- |
| database_type | str | The backend for vector storage (default: local). |
| storage_path | str | The directory where the localized index files are saved. |
| embedding_model | str | The model identifier used to transform content into vectors. |
Ingesting Data
You can populate the embedding database with various data types. Deeptrain automatically handles the heavy lifting of chunking and vectorization.
Adding Text Content
```python
vm.add_content(
    source_type="text",
    data="Detailed technical documentation about the new Transcribe API...",
    metadata={"source": "api_docs", "version": "1.0"}
)
```
Adding Video Transcripts
When using the Transcribe API, you can feed the resulting transcription directly into the vector store.
```python
import deeptrain

transcript = deeptrain.transcribe("https://vimeo.com/example-video")

vm.add_content(
    source_type="video",
    data=transcript,
    metadata={"url": "vimeo.com/example-video", "type": "tutorial"}
)
```
Inputs:
- source_type (str): The format of the input (e.g., "text", "video", "audio").
- data (str | bytes): The actual content to be embedded.
- metadata (dict): Optional key-value pairs for filtering during retrieval.
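The metadata attached at ingestion is what makes filtered retrieval possible: chunks can be narrowed to a subset before ranking. A minimal sketch of metadata pre-filtering over an in-memory list of chunks; the store layout and filter style here are illustrative, not Deeptrain's internal representation.

```python
# A tiny in-memory stand-in for stored chunks and their metadata.
store = [
    {"text": "Transcribe API documentation...",
     "metadata": {"source": "api_docs", "version": "1.0"}},
    {"text": "Video tutorial transcript...",
     "metadata": {"source": "tutorial", "version": "1.0"}},
]

def filter_by_metadata(store, **criteria):
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        chunk for chunk in store
        if all(chunk["metadata"].get(k) == v for k, v in criteria.items())
    ]

docs_only = filter_by_metadata(store, source="api_docs")
print(len(docs_only))  # 1
```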
Retrieving Relevant Content
Retrieve the most semantically relevant information based on a user query. This output is typically injected into the prompt of your AI agent.
```python
results = vm.query(
    query_text="How does the Transcribe API handle YouTube links?",
    top_k=3
)

for match in results:
    print(f"Content: {match.text}")
    print(f"Similarity Score: {match.score}")
```
Query Parameters:
- query_text (str): The natural language question or prompt.
- top_k (int): The number of relevant documents to return (default: 5).
Output Object:
Returns a list of Match objects containing:
- text (str): The retrieved content chunk.
- score (float): The similarity confidence score (0.0 to 1.0).
- metadata (dict): The metadata associated with the chunk.
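Conceptually, a top_k query embeds the question, scores every stored chunk, and returns the highest-scoring matches. The following self-contained sketch ranks an in-memory store by cosine similarity and returns match-shaped dicts; the toy 2-dimensional vectors and sample texts are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# A tiny in-memory stand-in for the vector store: (embedding, text, metadata).
store = [
    ([0.9, 0.1], "Transcribe API supports YouTube and Vimeo URLs.", {"source": "api_docs"}),
    ([0.2, 0.9], "Billing is charged monthly per seat.",            {"source": "pricing"}),
    ([0.8, 0.3], "Pass a video URL to the transcribe endpoint.",    {"source": "api_docs"}),
]

def query(query_vector, top_k=5):
    """Score every chunk against the query, return the top_k matches."""
    scored = [
        {"text": text, "score": cosine(query_vector, emb), "metadata": meta}
        for emb, text, meta in store
    ]
    return sorted(scored, key=lambda m: m["score"], reverse=True)[:top_k]

results = query([0.9, 0.2], top_k=2)
for match in results:
    print(match["score"], match["text"])
```

The returned text of each match is what you would concatenate into the agent's prompt as retrieved context.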
Maintenance and Persistence
Because the database is localized, you must periodically commit changes so that your data is persisted to disk.
- vm.save(): Flushes the current in-memory index to the defined storage_path.
- vm.clear(): Wipes the localized index (recommended only for testing).
- vm.get_stats(): Returns the total number of vectors and the dimensions of the current index.
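The save/load pattern behind this can be sketched in a few lines: an in-memory index is serialized to the storage directory on save, and unsaved changes vanish when the process exits. The JSON layout and file name below are illustrative assumptions, not Deeptrain's actual on-disk format.

```python
import json
import os
import tempfile

# An in-memory index: vectors plus their chunks and metadata.
index = {
    "vectors": [[0.1, 0.2], [0.3, 0.4]],
    "texts": ["chunk one", "chunk two"],
    "metadata": [{"source": "api_docs"}, {"source": "tutorial"}],
}

def save_index(index, storage_path):
    """Flush the in-memory index to disk; unsaved changes are lost on exit."""
    os.makedirs(storage_path, exist_ok=True)
    with open(os.path.join(storage_path, "index.json"), "w") as f:
        json.dump(index, f)

def load_index(storage_path):
    """Restore a previously saved index from disk."""
    with open(os.path.join(storage_path, "index.json")) as f:
        return json.load(f)

storage_path = tempfile.mkdtemp()
save_index(index, storage_path)
restored = load_index(storage_path)
print(len(restored["vectors"]))  # 2
```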