Audio Intelligence
Deeptrain’s Audio Intelligence module bridges the gap between raw auditory data and text-based Large Language Models. By converting spoken content into high-fidelity text and structured metadata, you can integrate podcasts, meeting recordings, and voice commands directly into your AI agent’s knowledge base.
Overview
The Audio Intelligence suite focuses on two primary workflows:
- Transcription: Converting audio files or streams into text.
- Contextual Injection: Feeding transcribed data into Deeptrain’s localized embedding database to enhance RAG (Retrieval-Augmented Generation) pipelines.
The Transcribe API
The Transcribe API is the primary interface for processing audio content. It supports both local files and remote sources (e.g., S3 buckets, public URLs).
Method: deeptrain.audio.transcribe()
Inputs:
| Parameter | Type | Description |
| :--- | :--- | :--- |
| source | string | The path to a local file or a valid URL pointing to a supported audio format (see Supported Formats and Limits below). |
| model | string | (Optional) The transcription model to use (default: base). |
| language | string | (Optional) ISO 639-1 language code for optimized accuracy. |
| timestamps | boolean | If true, returns word-level or segment-level timestamps. |
Output: TranscriptionResponse
{
text: string; // The full transcribed text
confidence: number; // Accuracy score (0.0 to 1.0)
duration: number; // Duration of the audio in seconds
segments: Array<{ // Included only if timestamps: true
start: number;
end: number;
text: string;
}>;
}
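When timestamps is true, the segments array can be rendered into a readable, time-coded transcript. A minimal sketch, assuming the segment shape shown in the schema above (the format_segments helper is illustrative, not part of the SDK):

```python
def format_segments(segments):
    """Render each TranscriptionResponse segment as a time-coded line."""
    return "\n".join(
        f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}"
        for seg in segments
    )

# Example segments matching the schema above
segments = [
    {"start": 0.0, "end": 2.4, "text": "Welcome to the call."},
    {"start": 2.4, "end": 5.1, "text": "Let's review the numbers."},
]
print(format_segments(segments))
# [0.0s - 2.4s] Welcome to the call.
# [2.4s - 5.1s] Let's review the numbers.
```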
Usage Examples
Processing a Local Audio File
To train an agent on a local recording, use the transcribe method and pass the resulting text to your agent's memory.
import deeptrain
# Transcribe the audio file
result = deeptrain.audio.transcribe(
source="./recordings/q3_earnings_call.mp3",
language="en",
timestamps=True
)
# Integrate into your AI Agent's context
agent = deeptrain.Agent(model="gpt-4")
agent.learn(result.text)
print(f"Processed {result.duration} seconds of audio.")
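Because the response includes a confidence score (0.0 to 1.0), you may want to gate what the agent learns from noisy recordings. A minimal sketch with an illustrative threshold (the 0.85 cutoff is an assumption, not a Deeptrain default):

```python
MIN_CONFIDENCE = 0.85  # illustrative threshold; tune for your use case

def should_ingest(result):
    """Return True if the transcription is accurate enough to learn from."""
    return result["confidence"] >= MIN_CONFIDENCE

# Example: dicts mirroring the TranscriptionResponse confidence field
print(should_ingest({"confidence": 0.93}))  # True
print(should_ingest({"confidence": 0.60}))  # False
```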
Real-time Content Sourcing
You can pull audio content from hosted platforms to provide real-time updates to your agents.
# Sourcing from a remote URL
remote_audio = "https://example.com/podcasts/latest_tech_trends.wav"
result = deeptrain.audio.transcribe(source=remote_audio)
# The transcribed text is now searchable via the localized embedding database
deeptrain.vector_db.upsert(
content=result.text,
metadata={"source": remote_audio, "type": "audio"}
)
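Long transcripts are typically split into overlapping chunks before embedding so that retrieval returns focused passages rather than an entire episode. A minimal word-based chunker sketch (the sizes are illustrative; Deeptrain's embedding database may chunk differently internally):

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-based chunks for embedding."""
    words = text.split()
    step = max_words - overlap  # stride between chunk starts
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break  # last chunk already covers the tail
    return chunks

# Each chunk can then be upserted individually with shared metadata:
# for chunk in chunk_text(result.text):
#     deeptrain.vector_db.upsert(content=chunk, metadata={...})
```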
Supported Formats and Limits
- Supported File Types: .mp3, .wav, .m4a, .ogg, .flac
- Max File Size: 25 MB via the standard API; for larger files, use the deeptrain.audio.stream_process internal utility, which handles chunking automatically.
- Multilingual Support: Over 50 languages. If no language is specified, the system attempts auto-detection during the first 30 seconds of audio.
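A simple client-side guard can route uploads to the correct entry point based on the 25 MB limit. A sketch, assuming the limit is measured in binary megabytes (an assumption; check your account's exact quota):

```python
import os

MAX_STANDARD_BYTES = 25 * 1024 * 1024  # 25 MB standard API limit (assumed binary MB)

def pick_endpoint(size_bytes):
    """Choose transcribe() for small files, stream_process for large ones."""
    return "stream_process" if size_bytes > MAX_STANDARD_BYTES else "transcribe"

# For a real file, pass its size on disk:
# pick_endpoint(os.path.getsize("./recordings/q3_earnings_call.mp3"))
print(pick_endpoint(10 * 1024 * 1024))   # transcribe
print(pick_endpoint(100 * 1024 * 1024))  # stream_process
```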
Advanced Configuration: Speaker Diarization
For interviews or meetings with multiple participants, you can enable speaker diarization to attribute text to specific individuals.
result = deeptrain.audio.transcribe(
source="meeting_minutes.mp3",
diarize=True  # Enable speaker identification
)
# Output includes speaker labels
# [Speaker 0]: Hello everyone.
# [Speaker 1]: Hi, let's start the review.
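The labeled lines can be regrouped per speaker for downstream use (e.g., attributing action items to participants). A minimal parser sketch, assuming the [Speaker N]: label format shown above (the actual diarized response structure may differ):

```python
import re
from collections import defaultdict

SPEAKER_RE = re.compile(r"^\[Speaker (\d+)\]:\s*(.*)$")

def group_by_speaker(lines):
    """Collect diarized transcript lines into a dict keyed by speaker id."""
    grouped = defaultdict(list)
    for line in lines:
        match = SPEAKER_RE.match(line)
        if match:
            grouped[int(match.group(1))].append(match.group(2))
    return dict(grouped)

transcript = [
    "[Speaker 0]: Hello everyone.",
    "[Speaker 1]: Hi, let's start the review.",
]
print(group_by_speaker(transcript))
# {0: ['Hello everyone.'], 1: ["Hi, let's start the review."]}
```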
By leveraging these audio capabilities, your AI agents move beyond static text files, allowing them to "listen" to and learn from the vast amount of information stored in voice and audio formats.