Vision & Graph Processing
Vision Processing
Deeptrain’s Vision module bridges the gap between text-only Large Language Models and visual data. By converting visual information into structured text or descriptive embeddings, it enables non-vision models to interpret, analyze, and reason about images without requiring a native multimodal architecture.
Image-to-Text Integration
The vision interface allows you to process static images and generate comprehensive descriptions or data extractions that can be fed directly into an LLM's context window.
Usage Example
```python
from deeptrain import VisionProcessor

# Initialize the processor
vision = VisionProcessor(api_key="your_api_key")

# Process a local image for a text-only model
result = vision.process_image(
    source="./assets/product_screenshot.png",
    detail_level="high",
    task="description"
)

print(result["text_description"])
```
API Reference: process_image
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | str | Path to local file, or a public URL of the image. |
| task | str | The processing goal: "description", "ocr", or "object_detection". |
| detail_level | str | Granularity of output: "low", "medium", or "high". |
Returns: A dictionary containing the text_description, detected labels, and confidence scores.
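Once you have the returned dictionary, a common next step is to flatten it into prompt text for a text-only model. The sketch below assumes a result shaped like the return value described above (`text_description`, `labels`, `confidence_scores`); the sample dictionary and the `build_vision_prompt` helper are illustrative, not part of the Deeptrain API.

```python
# Sketch: format a vision result dict as context for a text-only LLM.
# The sample dict below is illustrative data, not real Deeptrain output.
def build_vision_prompt(result: dict) -> str:
    """Render the description and labeled detections as plain prompt text."""
    labels = ", ".join(
        f"{label} ({score:.0%})"
        for label, score in zip(result["labels"], result["confidence_scores"])
    )
    return (
        "Image description:\n"
        f"{result['text_description']}\n\n"
        f"Detected objects: {labels}"
    )

sample = {
    "text_description": "A product screenshot showing a pricing table.",
    "labels": ["table", "button"],
    "confidence_scores": [0.97, 0.88],
}
prompt = build_vision_prompt(sample)
print(prompt)
```

The confidence scores are rendered as percentages so the downstream model can weigh uncertain detections when reasoning about the image.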
Graph and Flowchart Interpretation
One of Deeptrain’s most powerful features is its ability to parse complex logical diagrams. This module translates visual flowcharts, hierarchy trees, and architectural diagrams into structured formats (like JSON or Mermaid syntax) that LLMs can use to understand workflows and relationships.
Processing Flowcharts
When a flowchart is processed, Deeptrain identifies nodes, edges, and directional logic, providing the AI agent with a logical map of the visual content.
```python
from deeptrain import GraphProcessor

graph_engine = GraphProcessor()

# Analyze a system architecture diagram
structure = graph_engine.analyze_diagram(
    image_path="system_design.jpg",
    output_format="mermaid"
)

# The output can now be passed to a standard LLM to explain the logic
print(structure["data"])
```
API Reference: analyze_diagram
| Parameter | Type | Description |
| :--- | :--- | :--- |
| image_path | str | Path to the diagram or flowchart image. |
| output_format | str | Format for the parsed logic: "mermaid", "json", or "text_summary". |
| extract_text | bool | Whether to perform OCR on text within nodes (default: True). |
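To make the "nodes, edges, and directional logic" concrete, the sketch below parses a small Mermaid flowchart string into an adjacency structure. `parse_mermaid_edges` is an illustrative helper written for this example, not a Deeptrain function; it handles only the simple `A --> B` edge form.

```python
import re

# Sketch: extract nodes and directed edges from a Mermaid flowchart string,
# mimicking the kind of logical map produced with output_format="mermaid".
# parse_mermaid_edges is an illustrative helper, not a Deeptrain API.
def parse_mermaid_edges(mermaid: str) -> dict:
    edges = []
    for line in mermaid.splitlines():
        m = re.match(r"\s*(\w+)\s*-->\s*(\w+)", line)
        if m:
            edges.append((m.group(1), m.group(2)))
    nodes = sorted({n for edge in edges for n in edge})
    return {"nodes": nodes, "edges": edges}

diagram = """
flowchart TD
    A --> B
    B --> C
    A --> C
"""
graph = parse_mermaid_edges(diagram)
print(graph)
```

A structure like this is what lets a text-only model answer questions such as "which step follows B?" without ever seeing the image.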
Key Capabilities
- Model Agnostic: Works with any of the 200+ supported LLMs by providing text-based visual context.
- Logic Mapping: Goes beyond simple OCR to understand the connection between elements in a graph or chart.
- Real-time Integration: Stream processed visual data directly into your agent's local embedding store for immediate retrieval.
Configuration
To configure the vision engine, ensure your config.yaml or environment variables include the necessary provider credentials (if you use third-party vision backends) or the local paths for Deeptrain’s internal processing modules.
```yaml
vision_settings:
  engine: "deeptrain-vision-v1"
  max_resolution: "2048x2048"
  supported_formats: ["jpg", "png", "webp", "tiff"]
```
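Misconfigured values are easier to catch before the engine starts. The sketch below validates the three settings shown above in plain Python; `validate_vision_settings` is an illustrative helper, not a Deeptrain utility, and it assumes the value shapes from the example config.

```python
import re

# Sketch: sanity-check vision_settings values before starting the engine.
# validate_vision_settings is an illustrative helper, not part of Deeptrain.
def validate_vision_settings(settings: dict) -> list:
    errors = []
    if not settings.get("engine"):
        errors.append("engine must be set")
    res = settings.get("max_resolution", "")
    if not re.fullmatch(r"\d+x\d+", res):
        errors.append(f"max_resolution {res!r} must look like '2048x2048'")
    allowed = {"jpg", "png", "webp", "tiff"}
    bad = set(settings.get("supported_formats", [])) - allowed
    if bad:
        errors.append(f"unsupported formats: {sorted(bad)}")
    return errors

settings = {
    "engine": "deeptrain-vision-v1",
    "max_resolution": "2048x2048",
    "supported_formats": ["jpg", "png", "webp", "tiff"],
}
print(validate_vision_settings(settings))  # empty list: config is valid
```

Returning a list of error strings (rather than raising on the first problem) lets you report every configuration issue in one pass.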