Text & Embeddings
Deeptrain's text processing capabilities allow AI agents to work beyond the fixed limits of a model's context window. By leveraging a localized embedding database, you can store large datasets and retrieve only the most relevant information in real time, giving your models a "long-term memory" and access to live data sources.
Localized Embedding Database
Instead of sending your entire dataset to an LLM provider or relying on expensive cloud-based vector stores, Deeptrain utilizes a localized approach. This ensures low-latency retrieval and maintains data privacy while allowing your agent to interact with millions of tokens of information.
Key Benefits:
- Effectively Unlimited Context: Stay within token limits by injecting only the most relevant snippets into the prompt.
- Real-time Retrieval: Fetch content from live data sources dynamically.
- Model Agnostic: Use these embeddings with any of the 200+ supported private or open-source models.
Basic Usage
To manage text and embeddings, use the EmbeddingsManager interface. This allows you to ingest text data and query it later for context-aware responses.
Ingesting Text Data
Before an agent can retrieve information, the data must be indexed into the localized database.
```python
from deeptrain import EmbeddingsManager

# Initialize the manager with a local storage path
em = EmbeddingsManager(db_path="./local_store")

# Add text content to the local database
em.add_text(
    text="Deeptrain is a multi-modal data connector for LLMs.",
    metadata={"source": "documentation", "topic": "overview"},
)

# Bulk ingest from multiple strings
texts = ["First paragraph...", "Second paragraph..."]
em.add_bulk_text(texts)
```
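Long documents are usually split into smaller, overlapping chunks before ingestion so that each indexed fragment stays semantically focused. A minimal word-based chunking sketch in plain Python (`chunk_text` is a hypothetical helper, not part of the Deeptrain API):

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []
    chunks = []
    step = max_words - overlap  # how far the window advances each iteration
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final window reached the end of the text
    return chunks
```

Each chunk can then be passed to `add_bulk_text()`, keeping the original document reference in the metadata so retrieval results can be traced back to their source.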
Retrieving Relevant Context
When a user asks a question, use the query method to find the most semantically similar fragments from your local store.
```python
# Query the database for the top 3 most relevant matches
results = em.query(
    prompt="What is Deeptrain?",
    top_k=3,
)

for match in results:
    print(f"Content: {match['text']}")
    print(f"Score: {match['score']}")
```
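The retrieved snippets are then typically stitched into the prompt sent to the model. A sketch assuming the result shape shown above (dicts with `text` and `score` keys); `build_prompt` and the score threshold are illustrative, not part of the Deeptrain API:

```python
def build_prompt(question: str, matches: list[dict], min_score: float = 0.5) -> str:
    """Join high-scoring snippets into a context block placed before the question."""
    context = "\n\n".join(
        m["text"] for m in matches if m["score"] >= min_score
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

matches = [
    {"text": "Deeptrain is a multi-modal data connector for LLMs.", "score": 0.92},
    {"text": "Unrelated snippet.", "score": 0.31},
]
prompt = build_prompt("What is Deeptrain?", matches)
# prompt now contains only the high-scoring snippet plus the question
```

Filtering on a minimum score keeps weakly related fragments out of the prompt, which both saves tokens and reduces the chance of the model being distracted by irrelevant context.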
API Reference
add_text(text, metadata)
Indexes a single string into the localized vector store.
- Inputs:
  - `text` (str): The raw text to be embedded.
  - `metadata` (dict, optional): Key-value pairs to store alongside the text for filtering.
- Returns:
str: The unique ID of the indexed document.
query(prompt, top_k)
Performs a semantic search against the localized database.
- Inputs:
  - `prompt` (str): The natural language query or user input.
  - `top_k` (int): The number of relevant results to return.
- Returns:
`List[Dict]`: A list of dicts, each containing `text`, `metadata`, and a similarity `score`.
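Semantic-search scores like the one returned by `query` are commonly cosine similarities between embedding vectors (a value near 1.0 means the vectors point in nearly the same direction). The exact metric Deeptrain uses is not specified here; a minimal illustration in plain Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```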
clear_database()
Wipes the localized embedding database. Note: This action is irreversible.
- Returns:
bool: True if successful.
Real-time Content Integration
Deeptrain can be configured to fetch content from live sources (e.g., web scrapers, API feeds) and update the local embedding database on the fly. This ensures that when your AI agent queries the database, it receives the most up-to-date information available, rather than relying on static training data.
To implement real-time updates, wrap your data ingestion in a scheduled task or event trigger that pushes new content to `add_text()`.
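One way to sketch such a scheduled task with the Python standard library; `fetch_latest` (your live data source) and the `add_text` callback are assumptions for illustration, and production setups would likely use a proper scheduler or message queue instead:

```python
import threading

def start_ingestion_loop(fetch_latest, add_text, interval_seconds: float = 60.0):
    """Periodically fetch new content and push each item into the embedding store."""
    def tick():
        for item in fetch_latest():          # e.g. a web scraper or API feed
            add_text(item)
        timer = threading.Timer(interval_seconds, tick)
        timer.daemon = True                  # don't block interpreter shutdown
        timer.start()
    tick()  # run the first ingestion immediately, then reschedule
```

Calling `start_ingestion_loop(my_scraper, em.add_text, interval_seconds=300)` would re-index fresh content every five minutes without any manual intervention.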