Storage Connectors
Deeptrain provides Storage Connectors that link your raw data repositories to your AI agents. These connectors let you ingest multi-modal content (text, images, audio, and video) from both local environments and self-hosted cloud infrastructure.
Local Storage Connector
The LocalStorageConnector is the primary interface for indexing and retrieving data from your local file system or mounted network drives. It is optimized for high-speed access to training datasets and real-time document retrieval.
Usage Example
from deeptrain.storage import LocalStorageConnector
# Initialize the connector pointing to your data directory
local_storage = LocalStorageConnector(
    base_path="./data/my_knowledge_base",
    recursive=True,
    supported_extensions=[".txt", ".pdf", ".mp4", ".png"]
)
# Index files to make them available for the AI agent
local_storage.index()
# Retrieve a specific file for processing
file_data = local_storage.get_file("project_specs/architecture.pdf")
Configuration Options
| Parameter | Type | Description |
| :--- | :--- | :--- |
| base_path | str | The root directory where your data is stored. |
| recursive | bool | Whether to scan subdirectories. Defaults to True. |
| supported_extensions | List[str] | Filter for specific file types to be processed by Deeptrain. |
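To illustrate how base_path, recursive, and supported_extensions interact, here is a minimal standard-library sketch of the kind of scan these options describe. It is not Deeptrain's implementation: `scan_files` is a hypothetical helper, and the case-insensitive extension matching is an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def scan_files(base_path: str, recursive: bool = True,
               supported_extensions: Iterable[str] = ()) -> List[Path]:
    """Hypothetical helper: list files under base_path that pass the filters."""
    root = Path(base_path)
    # Normalise extensions so ".PDF" and ".pdf" are treated alike (an assumption).
    allowed = {ext.lower() for ext in supported_extensions}
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and (not allowed or p.suffix.lower() in allowed)
    )
```

With recursive set to False, only files directly inside the root directory are considered; an empty extension list disables filtering in this sketch.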
Remote & Self-Hosted Connectors
For teams utilizing self-hosted platforms (such as MinIO, OpenStack Swift, or private S3-compatible buckets), Deeptrain offers a unified interface to stream data directly into your LLM's context window or embedding database.
S3-Compatible Connector
This connector allows you to interface with any self-hosted object storage that utilizes the S3 protocol.
from deeptrain.storage import S3Connector
s3_storage = S3Connector(
    endpoint_url="https://your-self-hosted-storage.com",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    bucket_name="ai-training-data"
)
# Open a video file as a stream that can be passed to the Transcribe API
video_stream = s3_storage.stream_file("videos/product_demo.mp4")
API Reference: S3Connector
Input Parameters:
- endpoint_url (str): The URL of your self-hosted storage service.
- access_key (str): Your authentication access key.
- secret_key (str): Your authentication secret key.
- bucket_name (str): The specific bucket to monitor.
Returns:
- stream_file(path): Returns a binary stream object compatible with Deeptrain’s transcription and vision processing pipelines.
- list_objects(prefix): Returns a list of available files within the specified path.
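The reference above does not specify the interface of the object returned by stream_file. Assuming it behaves like a standard Python binary file object with a `read(size)` method, it could be consumed in fixed-size chunks as sketched below. Here `io.BytesIO` stands in for the real stream, and `iter_chunks` is a hypothetical helper, not part of Deeptrain.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Hypothetical helper: yield successive chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


# io.BytesIO stands in for the object returned by stream_file (an assumption).
fake_stream = io.BytesIO(b"x" * 2500)
sizes = [len(chunk) for chunk in iter_chunks(fake_stream, chunk_size=1024)]
print(sizes)  # [1024, 1024, 452]
```

Reading in chunks keeps memory use bounded for large media files, which matters more for video than for text documents.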
Multi-modal Data Ingestion
Storage connectors act as the entry point for Deeptrain’s multi-modal processing. Once a connector is established, you can route data to specific processing modules:
- Text & Documents: Data is routed to the localized embedding database for RAG (Retrieval-Augmented Generation).
- Vision (Images/Graphs): Visual data is passed to the vision-integration layer to allow non-vision models to interpret diagrams and flowcharts.
- Video & Audio: Media files are sent to the Transcribe API, which converts audio/video content into processable text and metadata for the LLM.
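The routing described in the list above can be pictured as a lookup from file extension to processing module. The sketch below is illustrative only: the `ROUTES` table, its labels, and `route_for` are hypothetical names, and Deeptrain's actual dispatch logic may differ.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-module mapping; the labels mirror the list above.
ROUTES = {
    ".txt": "embedding_database",
    ".pdf": "embedding_database",
    ".png": "vision_layer",
    ".jpg": "vision_layer",
    ".mp4": "transcribe_api",
    ".mp3": "transcribe_api",
}


def route_for(path: str) -> str:
    """Return the processing module label for a file, or "unsupported"."""
    suffix = PurePosixPath(path).suffix.lower()
    return ROUTES.get(suffix, "unsupported")
```

For example, `route_for("videos/product_demo.mp4")` yields the transcription label, while a file with an unlisted extension falls through to "unsupported".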
Internal Data Manager
While the connectors handle the external fetching, Deeptrain uses an internal DataManager component to synchronize these sources.
Note: The DataManager is an internal utility and typically does not require manual configuration by the user; it automatically handles the lifecycle of data fetched via the connectors.
Best Practices
- Security: Always load sensitive credentials (API keys, secret keys) from environment variables rather than hard-coding them when configuring connectors.
- Filtering: Use the supported_extensions parameter in the LocalStorageConnector to prevent the system from attempting to index binary or system files that do not contain useful training data.
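As a minimal sketch of the security recommendation, credentials can be collected from the environment before a connector is constructed. The environment variable names and the `s3_settings_from_env` helper are assumptions made for illustration; only the four parameter names come from the S3Connector example in this document.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; choose whatever fits your deployment.
REQUIRED_VARS = {
    "endpoint_url": "DEEPTRAIN_S3_ENDPOINT_URL",
    "access_key": "DEEPTRAIN_S3_ACCESS_KEY",
    "secret_key": "DEEPTRAIN_S3_SECRET_KEY",
    "bucket_name": "DEEPTRAIN_S3_BUCKET",
}


def s3_settings_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Hypothetical helper: build connector keyword arguments from the environment."""
    # Fail early and name every missing variable instead of passing empty values on.
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {param: env[var] for param, var in REQUIRED_VARS.items()}
```

Because the dictionary keys match the parameter names shown earlier, the result could plausibly be passed as `S3Connector(**s3_settings_from_env())`, keeping secrets out of source control.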