Vision & Graph Processing
Vision Processing
Deeptrain’s Vision module bridges the gap between text-only Large Language Models and visual data. By converting visual information into structured text or descriptive embeddings, it enables non-vision models to interpret, analyze, and reason about images without requiring a native multimodal architecture.
Image-to-Text Integration
The vision interface allows you to process static images and generate comprehensive descriptions or data extractions that can be fed directly into an LLM's context window.
Usage Example
```python
from deeptrain import VisionProcessor

# Initialize the processor
vision = VisionProcessor(api_key="your_api_key")

# Process a local image for a text-only model
result = vision.process_image(
    source="./assets/product_screenshot.png",
    detail_level="high",
    task="description"
)

print(result["text_description"])
```
API Reference: process_image
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | str | Path to local file, or a public URL of the image. |
| task | str | The processing goal: "description", "ocr", or "object_detection". |
| detail_level | str | Granularity of output: "low", "medium", or "high". |
Returns: A dictionary containing the text_description, detected labels, and confidence scores.
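Once you have the returned dictionary, a common next step is to flatten it into prompt text for a text-only model. The sketch below assumes a result shaped like the return value described above (`text_description`, `labels`, `confidence_scores`); the sample dictionary and the `build_vision_prompt` helper are illustrative, not part of the Deeptrain API.

```python
# Sketch: format a vision result dict as context for a text-only LLM.
# The sample dict below is illustrative data, not real Deeptrain output.
def build_vision_prompt(result: dict) -> str:
    """Render the description and labeled detections as plain prompt text."""
    labels = ", ".join(
        f"{label} ({score:.0%})"
        for label, score in zip(result["labels"], result["confidence_scores"])
    )
    return (
        "Image description:\n"
        f"{result['text_description']}\n\n"
        f"Detected objects: {labels}"
    )

sample = {
    "text_description": "A product screenshot showing a pricing table.",
    "labels": ["table", "button"],
    "confidence_scores": [0.97, 0.88],
}
prompt = build_vision_prompt(sample)
print(prompt)
```

The confidence scores are rendered as percentages so the downstream model can weigh uncertain detections when reasoning about the image.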
Graph and Flowchart Interpretation
One of Deeptrain’s most powerful features is its ability to parse complex logical diagrams. This module translates visual flowcharts, hierarchy trees, and architectural diagrams into structured formats (like JSON or Mermaid syntax) that LLMs can use to understand workflows and relationships.
Processing Flowcharts
When a flowchart is processed, Deeptrain identifies nodes, edges, and directional logic, providing the AI agent with a logical map of the visual content.
```python
from deeptrain import GraphProcessor

graph_engine = GraphProcessor()

# Analyze a system architecture diagram
structure = graph_engine.analyze_diagram(
    image_path="system_design.jpg",
    output_format="mermaid"
)

# The output can now be passed to a standard LLM to explain the logic
print(structure["data"])
```
API Reference: analyze_diagram
| Parameter | Type | Description |
| :--- | :--- | :--- |
| image_path | str | Path to the diagram or flowchart image. |
| output_format | str | Format for the parsed logic: "mermaid", "json", or "text_summary". |
| extract_text | bool | Whether to perform OCR on text within nodes (default: True). |
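To make the "nodes, edges, and directional logic" concrete, the sketch below parses a small Mermaid flowchart string into an adjacency structure. `parse_mermaid_edges` is an illustrative helper written for this example, not a Deeptrain function; it handles only the simple `A --> B` edge form.

```python
import re

# Sketch: extract nodes and directed edges from a Mermaid flowchart string,
# mimicking the kind of logical map produced with output_format="mermaid".
# parse_mermaid_edges is an illustrative helper, not a Deeptrain API.
def parse_mermaid_edges(mermaid: str) -> dict:
    edges = []
    for line in mermaid.splitlines():
        m = re.match(r"\s*(\w+)\s*-->\s*(\w+)", line)
        if m:
            edges.append((m.group(1), m.group(2)))
    nodes = sorted({n for edge in edges for n in edge})
    return {"nodes": nodes, "edges": edges}

diagram = """
flowchart TD
    A --> B
    B --> C
    A --> C
"""
graph = parse_mermaid_edges(diagram)
print(graph)
```

A structure like this is what lets a text-only model answer questions such as "which step follows B?" without ever seeing the image.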
Key Capabilities
- Model Agnostic: Works with any of the 200+ supported LLMs by providing text-based visual context.
- Logic Mapping: Goes beyond simple OCR to understand the connection between elements in a graph or chart.
- Real-time Integration: Stream processed visual data directly into your agent's local embedding store for immediate retrieval.
Configuration
To configure the vision engine, ensure your config.yaml or environment variables include the necessary provider credentials (if you use third-party vision backends) or the local paths for Deeptrain’s internal processing modules.
```yaml
vision_settings:
  engine: "deeptrain-vision-v1"
  max_resolution: "2048x2048"
  supported_formats: ["jpg", "png", "webp", "tiff"]
```
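Misconfigured values are easier to catch before the engine starts. The sketch below validates the three settings shown above in plain Python; `validate_vision_settings` is an illustrative helper, not a Deeptrain utility, and it assumes the value shapes from the example config.

```python
import re

# Sketch: sanity-check vision_settings values before starting the engine.
# validate_vision_settings is an illustrative helper, not part of Deeptrain.
def validate_vision_settings(settings: dict) -> list:
    errors = []
    if not settings.get("engine"):
        errors.append("engine must be set")
    res = settings.get("max_resolution", "")
    if not re.fullmatch(r"\d+x\d+", res):
        errors.append(f"max_resolution {res!r} must look like '2048x2048'")
    allowed = {"jpg", "png", "webp", "tiff"}
    bad = set(settings.get("supported_formats", [])) - allowed
    if bad:
        errors.append(f"unsupported formats: {sorted(bad)}")
    return errors

settings = {
    "engine": "deeptrain-vision-v1",
    "max_resolution": "2048x2048",
    "supported_formats": ["jpg", "png", "webp", "tiff"],
}
print(validate_vision_settings(settings))  # empty list: config is valid
```

Returning a list of error strings (rather than raising on the first problem) lets you report every configuration issue in one pass.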