# Vision & Image Processing

## Overview
The Vision & Image Processing module enables non-vision Large Language Models (LLMs) to interpret and interact with visual data. By transforming images, flowcharts, and diagrams into high-dimensional embeddings or structured textual descriptions, Deeptrain bridges the gap between text-only reasoning and visual intelligence.
## Vision Connector

The primary interface for handling visual data is the `VisionConnector`. This component processes raw visual inputs and prepares them for consumption by your chosen LLM.
### Processing Standard Images

To integrate image data into your AI agent's context, use the `process_image` method. This method extracts features and generates descriptive metadata that allows a non-vision model to "understand" the visual content.
```python
from deeptrain import VisionConnector

# Initialize the connector
vision = VisionConnector()

# Process a local image or a URL
image_data = vision.process_image(
    source="path/to/image.jpg",
    detail_level="high"
)

print(image_data.description)
```
**Parameters:**

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | `str` | Path to a local file or a valid image URL. |
| `detail_level` | `str` | Granularity of the extraction (`low`, `medium`, `high`). Defaults to `medium`. |
**Returns:**

- **Type:** `VisionResponse` object.
- **Attributes:**
  - `description` (`str`): A detailed textual representation of the image.
  - `tags` (`List[str]`): Extracted keywords and objects identified in the scene.
  - `embedding`: The vector representation for use in vector databases.
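The `embedding` attribute is what you would store in a vector database for similarity search. As a minimal sketch of the idea (the three-element vectors below are illustrative stand-ins, not real Deeptrain embeddings, which would be much higher-dimensional), cosine similarity can rank stored images against a query:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative stand-ins for image_data.embedding vectors
stored = {
    "sunset.jpg": [0.9, 0.1, 0.0],
    "invoice.png": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of the query image

# Rank stored images by similarity to the query
ranked = sorted(stored, key=lambda k: cosine_similarity(stored[k], query), reverse=True)
print(ranked[0])  # the closest stored image
```

In practice a vector database performs this ranking at scale; the snippet only shows the metric involved.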
## Flowcharts and Graphs

Deeptrain provides specialized logic for technical diagrams, including flowcharts, organizational charts, and statistical graphs. This allows agents to follow logic flows or interpret data trends.

### Interpreting Diagrams

The `interpret_diagram` function is optimized for structural recognition rather than just aesthetic description.
```python
# Interpret a complex flowchart
flowchart_logic = vision.interpret_diagram(
    source="process_flow.png",
    output_format="markdown"
)

# The output can be directly injected into an LLM prompt
prompt = f"Based on this flowchart: {flowchart_logic}, identify the bottleneck."
```
**Parameters:**

| Parameter | Type | Description |
| :--- | :--- | :--- |
| `source` | `str` | Path or URL to the diagram. |
| `output_format` | `str` | The format of the structural description (`text`, `markdown`, `json`). |
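With `output_format="json"`, the structural description can be consumed programmatically rather than pasted into a prompt. The exact schema is not documented here, so the nodes-and-edges layout below is a hypothetical example of what such output might look like:

```python
import json

# Hypothetical JSON structure for a parsed flowchart; the real schema
# returned by interpret_diagram may differ.
raw = """
{
  "nodes": [
    {"id": "start", "label": "Receive order"},
    {"id": "check", "label": "Inventory check"},
    {"id": "ship", "label": "Ship order"}
  ],
  "edges": [
    {"from": "start", "to": "check"},
    {"from": "check", "to": "ship"}
  ]
}
"""

diagram = json.loads(raw)

# Walk the edges to reconstruct the flow as plain text for an LLM prompt
labels = {n["id"]: n["label"] for n in diagram["nodes"]}
steps = [f'{labels[e["from"]]} -> {labels[e["to"]]}' for e in diagram["edges"]]
print("; ".join(steps))
```

A structured form like this lets an agent answer questions about reachability or ordering without re-parsing prose.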
## Integrating with LLM Chains

Because Deeptrain is model-agnostic, you can pipe vision outputs into any of the 200+ supported models.

### Example: Image-to-Text Reasoning
```python
import deeptrain

# 1. Process the image
vision = deeptrain.VisionConnector()
context = vision.process_image("invoice_sample.png")

# 2. Initialize a non-vision LLM (e.g., Llama-3 or GPT-3.5)
agent = deeptrain.Agent(model="meta-llama/Llama-3-8b")

# 3. Query the agent using visual context
response = agent.query(
    prompt=f"Extract the total amount and due date from this data: {context.description}"
)

print(response)
```
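Since `VisionResponse` also carries `tags`, you may want to fold both fields into a single prompt string. The helper below is a sketch of that pattern (it is not part of the Deeptrain API, and the description and tags are stand-in values):

```python
def build_vision_prompt(description: str, tags: list[str], question: str) -> str:
    """Assemble visual context plus a question into one LLM prompt."""
    tag_line = ", ".join(tags) if tags else "none"
    return (
        f"Image description: {description}\n"
        f"Detected objects: {tag_line}\n"
        f"Question: {question}"
    )

# Example with stand-in values for context.description and context.tags
prompt = build_vision_prompt(
    description="An invoice dated 2024-05-01 totalling $120.00.",
    tags=["invoice", "table", "logo"],
    question="Extract the total amount and due date.",
)
print(prompt)
```

Keeping the template in one place makes it easy to tune how much visual context each query receives.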
## Supported Formats

The Vision module supports the following standard formats:

- **Images:** `.jpg`, `.jpeg`, `.png`, `.webp`
- **Diagrams:** `.png`, `.svg` (rendered), `.pdf` (single page)
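A pre-flight check against these lists avoids sending unsupported files to the connector. A minimal sketch follows; the extension sets mirror the lists above, but the helper itself is not a Deeptrain API:

```python
from pathlib import Path

# Supported extensions, per the lists above
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}
DIAGRAM_EXTS = {".png", ".svg", ".pdf"}

def is_supported(path: str, diagram: bool = False) -> bool:
    """Return True if the file extension is in the relevant supported set."""
    ext = Path(path).suffix.lower()
    return ext in (DIAGRAM_EXTS if diagram else IMAGE_EXTS)

print(is_supported("photo.JPG"))               # case-insensitive image check
print(is_supported("chart.svg"))               # .svg is diagrams-only
print(is_supported("chart.svg", diagram=True))
```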
> **Note:** For real-time data retrieval, vision outputs are automatically indexed in your localized embedding database if the `auto_index` configuration option is set to `True`.