Text & Context Management
Deeptrain provides a robust framework for handling large-scale text data, specifically designed to overcome the physical token limitations of modern LLMs. By utilizing a localized embedding database, Deeptrain allows AI agents to access vast repositories of information through semantic retrieval rather than cramming all data into a single prompt.
Localized Embedding Database
The core of Deeptrain’s text management is the localized vector store. This allows you to index local documents or live data streams and retrieve only the most relevant "context chunks" based on the user's query.
Initialization
To start managing context, initialize the embedding provider. Deeptrain is model-agnostic and supports various embedding models.
```python
from deeptrain import ContextManager

# Initialize the manager with a preferred embedding model
cms = ContextManager(
    provider="openai",  # or "huggingface", "local"
    db_path="./vector_store/my_project"
)
```
Ingesting Text Data
Deeptrain supports both static file ingestion and real-time data sourcing. When text is ingested, it is automatically chunked, embedded, and stored in your localized database.
`ingest_text(content, metadata=None)`

Adds raw text or document content to the context database.

- Inputs:
  - `content` (str): The raw text or file path to be processed.
  - `metadata` (dict, optional): Key-value pairs (e.g., `source`, `timestamp`) to help filter results later.
- Returns:
  - `indexing_id` (str): a reference to the stored content.
```python
cms.ingest_text(
    content="The company's Q3 roadmap focuses on multi-modal integration...",
    metadata={"source": "internal_memo", "department": "R&D"}
)
```
Bypassing Context Window Limitations
Instead of sending an entire 100,000-word document to an LLM, Deeptrain uses a Retrieve-and-Synthesize workflow. It identifies the specific segments of text required to answer a prompt, effectively giving your agent "infinite" memory.
`get_context(query, top_k=5)`

Retrieves the most relevant snippets of text from the database based on semantic similarity.

- Inputs:
  - `query` (str): The user's question or the agent's current task.
  - `top_k` (int): The number of relevant text chunks to return.
- Returns:
  - `List[Dict]`: A list containing the retrieved text chunks and their associated metadata.
```python
# Retrieve relevant context for an LLM prompt
relevant_chunks = cms.get_context(
    query="What are the goals for Q3?",
    top_k=3
)

# Implementation example with an LLM call
context_string = "\n".join(chunk["text"] for chunk in relevant_chunks)
prompt = f"Use this context: {context_string}\n\nQuestion: What are the Q3 goals?"
```
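The retrieve-and-synthesize steps above can be factored into a small reusable helper. This sketch assumes only what the docs state: the retriever (such as `cms.get_context`) accepts `query` and `top_k` keyword arguments and returns dicts with a `"text"` key:

```python
from typing import Callable, Dict, List


def build_grounded_prompt(
    retrieve: Callable[..., List[Dict]],
    question: str,
    top_k: int = 3,
) -> str:
    """Fetch relevant chunks and splice them into a grounded prompt."""
    chunks = retrieve(query=question, top_k=top_k)
    context_string = "\n".join(chunk["text"] for chunk in chunks)
    return f"Use this context: {context_string}\n\nQuestion: {question}"


# Usage: build_grounded_prompt(cms.get_context, "What are the Q3 goals?")
```

Because the retriever is passed in as a callable, the same helper works in tests with a stub retriever and in production with the real context manager.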
Real-time Content Retrieval
Deeptrain can be configured to fetch and embed content from live data sources (such as web scrapers or API feeds) to ensure your AI agent's responses are grounded in current information rather than stale training data.
| Feature | Description |
| :--- | :--- |
| Live Sync | Automatically updates the embedding database when source files change. |
| Metadata Filtering | Restricts context retrieval to specific sources (e.g., "only search PDFs"). |
| Deduplication | Internal logic ensures that redundant data does not inflate the database. |
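Because `get_context` returns each chunk alongside its metadata, source restrictions can also be applied client-side after retrieval. This sketch filters retrieved chunks by their `source` field; the metadata schema is simply whatever key-value pairs you supplied at ingestion time:

```python
def filter_by_source(chunks: list[dict], allowed_sources: set[str]) -> list[dict]:
    """Keep only chunks whose metadata 'source' is in the allowed set."""
    return [
        c for c in chunks
        if c.get("metadata", {}).get("source") in allowed_sources
    ]


chunks = [
    {"text": "Q3 roadmap...", "metadata": {"source": "internal_memo"}},
    {"text": "Press release...", "metadata": {"source": "web_scrape"}},
]
memo_chunks = filter_by_source(chunks, {"internal_memo"})
# Only the internal_memo chunk remains.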
Configuration Options
You can tune the context management behavior during initialization or via a configuration file:
- `chunk_size`: The maximum number of characters/tokens per segment (default: `1000`).
- `chunk_overlap`: The amount of text to overlap between adjacent segments to maintain semantic continuity (default: `200`).
- `similarity_threshold`: A float (0.0 to 1.0) used to filter out low-relevance results.
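To make `similarity_threshold` concrete, here is a hedged sketch of the cosine-similarity cutoff it implies. Deeptrain performs this comparison internally over real embedding vectors; the two-dimensional vectors below are toy values for illustration only:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def above_threshold(query_vec, candidates, threshold=0.8):
    """Keep candidates whose similarity to the query meets the threshold."""
    return [v for v in candidates if cosine_similarity(query_vec, v) >= threshold]


q = [1.0, 0.0]
results = above_threshold(q, [[0.9, 0.1], [0.0, 1.0]], threshold=0.8)
# The near-parallel vector passes; the orthogonal one is filtered out.
```

Raising the threshold trades recall for precision: fewer, but more relevant, chunks reach the prompt.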