Text & Vector Embeddings
Deeptrain provides a robust framework for managing text data through a localized embedding database. This system lets you work around the context window limitations of standard Large Language Models (LLMs) by indexing large datasets and retrieving only the most relevant information in real time.
Localized Embedding Database
Unlike cloud-based vector stores, Deeptrain's localized approach ensures that your data remains under your control while providing low-latency retrieval for Retrieval-Augmented Generation (RAG) workflows.
Key Features:
- Real-time Retrieval: Fetch context from live data sources as your agent processes queries.
- Context Expansion: Support for datasets that exceed the token limits of models like GPT-4 or Claude.
- Model Agnostic: Compatible with embeddings from more than 200 proprietary and open-source models.
Basic Usage
To get started with text embeddings, first initialize the Deeptrain engine and configure your local vector store.
Initializing the Vector Store
```python
from deeptrain import DeepTrain

# Initialize the Deeptrain instance
dt = DeepTrain(
    model_provider="openai",  # or "huggingface", "anthropic", etc.
    embedding_model="text-embedding-3-small"
)

# Configure the localized database
dt.initialize_vector_db(path="./local_embeddings_db")
```
Indexing Content
You can add text from various sources into the localized database. Deeptrain handles the chunking and embedding generation automatically.
```python
# Adding raw text data
dt.index_text(
    content="Deeptrain is a multi-modal data connector for LLMs.",
    metadata={"source": "readme", "topic": "overview"}
)

# Adding content from a file
dt.index_file(path="./documents/user_manual.pdf")
```
Real-time Content Retrieval (RAG)
The core strength of the Text module is its ability to perform semantic searches and inject relevant context into LLM prompts.
Querying the Database
The retrieve method searches the localized database for content semantically similar to the input query.
```python
results = dt.retrieve(
    query="How does Deeptrain handle multi-modal data?",
    top_k=3
)

for doc in results:
    print(f"Score: {doc.score}")
    print(f"Content: {doc.text}")
```
Integrating with LLM Chains
You can automatically augment your AI agent's responses by connecting the retriever to the model's generation loop.
```python
response = dt.query_with_context(
    prompt="Explain the benefits of localized embeddings.",
    stream=False
)
print(response)
```
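Under the hood, query_with_context amounts to a retrieve-then-generate loop: fetch the top-k chunks, inject them into the prompt, and send the augmented prompt to the model. The sketch below shows the prompt-injection step only; the helper name and prompt template are illustrative assumptions, not part of the Deeptrain API.

```python
# Illustrative sketch of context injection for RAG. The function name and
# template are hypothetical, not Deeptrain's actual implementation.
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks to the user's question as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# With real results this would be:
#   chunks = [doc.text for doc in dt.retrieve(query=question, top_k=3)]
chunks = ["Deeptrain is a multi-modal data connector for LLMs."]
prompt = build_augmented_prompt("What is Deeptrain?", chunks)
print(prompt)
```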
API Reference
dt.initialize_vector_db(path: str, collection_name: str = "default")
Sets up the local storage for vector embeddings.
- path: The local directory where embedding data will be persisted.
- collection_name: (Optional) Groups related embeddings into a single namespace.
dt.index_text(content: str, metadata: dict = None)
Processes and stores a string of text.
- content: The raw text to be embedded.
- metadata: A dictionary of key-value pairs for filtering during retrieval.
dt.retrieve(query: str, top_k: int = 5, filters: dict = None) -> List[SearchResult]
Performs a similarity search.
- query: The natural language string to match against.
- top_k: The number of relevant document chunks to return.
- filters: (Optional) Metadata filters to narrow the search scope.
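To make the retrieve parameters concrete, here is a minimal pure-Python sketch of what a similarity search with metadata filters and top-k selection looks like. The store layout, function, and toy 2-D vectors are illustrative assumptions, not Deeptrain internals.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(store, query_vec, top_k=5, filters=None):
    # 1. Narrow the candidate set with metadata filters (if given).
    candidates = [
        r for r in store
        if not filters
        or all(r["metadata"].get(k) == v for k, v in filters.items())
    ]
    # 2. Rank remaining candidates by similarity and keep the top k.
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy store with 2-D vectors for illustration.
store = [
    {"text": "overview doc", "vector": [1.0, 0.0], "metadata": {"source": "readme"}},
    {"text": "manual page",  "vector": [0.0, 1.0], "metadata": {"source": "manual"}},
]
best = retrieve(store, [0.9, 0.1], top_k=1)
print(best[0]["text"])  # the readme chunk is closest to the query
```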
Advanced Configuration
For specialized use cases, you can configure the chunking strategy to better suit your data structure:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| chunk_size | int | The maximum number of characters/tokens per embedding chunk. |
| chunk_overlap | int | The number of overlapping units between consecutive chunks to maintain context. |
| distance_metric | str | The mathematical method used for similarity (e.g., cosine, euclidean). |
Note: While Deeptrain manages these processes internally by default, they can be overridden in the global configuration file or during class initialization for fine-tuned performance.
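To illustrate how chunk_size and chunk_overlap interact, here is a minimal character-based chunking sketch. It is a simplified stand-in for Deeptrain's internal splitter, which may also operate on tokens.

```python
# Sliding-window chunking: each chunk starts chunk_size - chunk_overlap
# characters after the previous one, so consecutive chunks share
# chunk_overlap characters of context.
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Larger overlaps preserve more context across chunk boundaries at the cost of storing (and embedding) more redundant text.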