easwee.net // tiny digital playground

hobby guide to ai ep.2 - ai audio intelligence

TLDR

git clone git@github.com:easwee/ai-hobby-lab.git

curl -LsSf https://astral.sh/uv/install.sh | sh

uv sync

source .venv/bin/activate

echo "SONIOX_API_KEY=<your_soniox_api_key>" >> .env

python main.py -w youtube_audio_data_extractor -i '["https://www.youtube.com/watch?v=AyOxay5AtMM"]'

Output file will be created as ./output/<video_title>.html.
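main.py itself is not shown in this post. To give a rough idea of what the command above hands over to the workflow, here is a minimal sketch of an entry point that parses the two flags, assuming argparse and a JSON encoded url list. The parse_cli name and the long option names are made up for illustration; the real main.py in the repo may look different.

```python
import argparse
import json


def parse_cli(argv: list[str]) -> tuple[str, list[str]]:
    """Parse -w (workflow name) and -i (JSON list of urls). Hypothetical helper."""
    parser = argparse.ArgumentParser(description="Run a hobby lab workflow")
    parser.add_argument("-w", "--workflow", required=True, help="workflow module name")
    parser.add_argument("-i", "--input", required=True, help="JSON encoded list of urls")
    args = parser.parse_args(argv)

    # the -i value arrives as one string, so decode it into a Python list
    urls = json.loads(args.input)
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("-i must be a JSON list of strings")

    return args.workflow, urls
```

Passing the url list as JSON keeps the shell quoting simple: one single-quoted argument, no matter how many urls it contains.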

I want to know more

This workflow demonstrates how handy the native audio reasoning of the Soniox Omnio multimodal LLM can be.

Our workflow consists of 3 steps:

  1. Input: Download cooking videos from YouTube using yt-dlp and extract the audio for each url in the input list.
  2. Processing: Process the audio using the Soniox Omnio API and prompt for a cooking recipe format.
  3. Output: Convert the Soniox Omnio markdown response to HTML and save it as an .html file.

Let's go over the fully commented code:

import os
import base64

from typing import Tuple
from concurrent.futures import ThreadPoolExecutor
from yt_dlp import YoutubeDL
from openai import OpenAI
from markdown2 import Markdown

# Sample input - a list of youtube cooking videos:
# [
# "https://www.youtube.com/watch?v=AyOxay5AtMM",
# "https://www.youtube.com/watch?v=vlEGY8IPD-Q"
# ]

# USER_PROMPT_TEMPLATE
# - defines instructions for our model.
# - it is currently instructing Soniox Omnio multimodal LLM
# to extract formatted food recipes data from input audio
USER_PROMPT_TEMPLATE = """
Input audio contains cooking recipe data. Output formatted markdown document containing:
1. Recipe title
2. Short summary paragraph of what is being cooked.
3. List of ingredients where each list item contains "<ingredient> <required amount>"
4. Step by step cooking instructions as explained in the audio.

Use wording as if you were conveying the recipe over radio. Make sure every listed ingredient is also used in the cooking process, otherwise do not list it.
""".strip()

# define our workflow class (we could define a Workflow abstract to extend from, since .run method is always required)
class YoutubeAudioDataExtractor:

    # We will be triggering the run method and pass in a list of youtube video urls
    # from which we want to extract data
    def run(self, input: list[str]):
        print("Workflow started.")

        # We can speed up the processing with multi-threading - increase max_workers amount if you can afford more
        with ThreadPoolExecutor(max_workers=2) as executor:
            # execute the process method on each of our input objects
            results = executor.map(self.process, input)
            # Print out informational tuple list
            print(list(results))

    # Our process method will be in charge of chaining together multiple steps of data extraction
    # We want to run the process in a separate thread for each url, to speed things up in case of multiple urls
    def process(self, url: str) -> Tuple[str, str]:
        print(f"Processing {url}")

        # fall back to the url in the status tuple, in case the download step
        # fails before we know the file name
        file_name = url

        try:
            # combine all the steps of this workflow together
            # 1. download youtube video and obtain an audio file and its name
            file_name = self.download(url)

            # 2. run Soniox Omnio multimodal llm and extract data from audio
            data = self.extract_data(file_name)

            # 3. write output to an html file
            self.create_pdf(file_name, data)

            # success
            return (file_name, "Done.")
        except Exception as error:
            # indicate simple failure, but keep the reason visible
            print(f"[process] {url} failed: {error}")
            return (file_name, "Failed.")

    # In the download step we use yt-dlp to download video/audio from youtube
    def download(self, url: str) -> str:
        print(f"[download] Start {url}.")

        # configuration options for yt-dlp:
        # - we are interested in audio only, we like flac
        # - after the download finishes, the yt-dlp postprocessor converts the audio to .flac (this needs ffmpeg installed)
        options = {
            'format': 'bestaudio/best',
            'outtmpl': "./%(title)s.%(ext)s",
            'postprocessors': [{
                'key': 'FFmpegExtractAudio',
                'preferredcodec': 'flac',
                'preferredquality': '0',
            }],
        }

        # start downloading the file using yt-dlp library, and also extract audio file meta information
        # like downloaded file name, so we can reference the file later
        with YoutubeDL(options) as ydl:
            info = ydl.extract_info(url, download=True)
            # we want to store and return the file name because we need it later
            # first get the original filename that yt-dlp read from web (.m4a or .webm)
            original_file_name = ydl.prepare_filename(info)

        # Manually change the extension to .flac to match file on disk
        downloaded_file_name = os.path.splitext(original_file_name)[0] + '.flac'

        print(f"[download] {downloaded_file_name} done.")
        return downloaded_file_name

    def extract_data(self, file_name: str) -> str:
        print(f"[ExtractData]: {file_name}.")

        # We need to read the downloaded audio file and convert it to base64
        # so we can pass it to Soniox Omnio API
        with open(file_name, "rb") as file:
            audio_data = file.read()
        audio_data_b64 = base64.b64encode(audio_data).decode('utf-8')

        # Soniox Omnio API is fully compatible with the now standard OpenAI SDK,
        # we just need to point the base_url to Soniox api url, instead of the default OpenAI api
        # and also we use our SONIOX_API_KEY that is set in .env file
        client = OpenAI(
            api_key=os.getenv("SONIOX_API_KEY"),
            base_url="https://api.llm.soniox.com/v1",
        )
        print("[ExtractData]: Running audio intelligence...")

        # we use completions method to create a new request
        # important part here is that audio base64 data has to be set as a partial content message
        # as "audio_data_b64" prop, so the API will grab audio properly
        completion = client.chat.completions.create(
            model="omnio-chat-audio-preview",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"audio_data_b64": audio_data_b64},
                        {"text": USER_PROMPT_TEMPLATE},
                    ],
                }
            ],
        )

        # this is optional part, but if we don't need the downloaded file anymore
        # we can just remove it to not waste space
        if os.path.exists(file_name):
            os.remove(file_name)

        # read llm response - non-streamed all in one
        data = completion.choices[0].message.content

        print("-- [ExtractData] Done.")
        print(data)

        return data

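    def create_pdf(self, file_name: str, data: str):
        # note: despite its name, this method writes an .html file
        # since the LLM returns markdown output we need a parser for that
        markdowner = Markdown()
        output_dir = os.getenv("OUTPUT_DIR") or "./output"

        # make sure our output path exists
        os.makedirs(output_dir, exist_ok=True)

        # name the output after the video title: drop the directory and the .flac extension
        title = os.path.splitext(os.path.basename(file_name))[0]

        # create new writable file inside output_dir, convert markdown to HTML and write it
        # (the with block closes the file for us)
        with open(os.path.join(output_dir, f"{title}.html"), "w", encoding="utf-8") as recipe_file:
            recipe_file.write(markdowner.convert(data))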
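The comment above the class mentions that a Workflow abstract could be defined, since every workflow needs a run method. A minimal sketch of that idea, assuming Python's built-in abc module. The Workflow and EchoWorkflow names are hypothetical and not part of the repo, and run returns its status tuples here instead of printing them, which is a small deviation from the code above.

```python
from abc import ABC, abstractmethod


class Workflow(ABC):
    """Hypothetical base class: every workflow must implement run()."""

    @abstractmethod
    def run(self, input: list[str]) -> list[tuple[str, str]]:
        """Process the input items and return (item, status) tuples."""


class EchoWorkflow(Workflow):
    """Toy workflow, used only to show the contract."""

    def run(self, input: list[str]) -> list[tuple[str, str]]:
        # a real workflow would download and process each item here
        return [(item, "Done.") for item in input]
```

A subclass that forgets to implement run cannot be instantiated (Python raises a TypeError), so a missing run method shows up as soon as the workflow is created instead of halfway through a job.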

As you can see, writing custom AI workflows does not really require advanced AI frameworks - the abstraction layers those frameworks bring in mostly make the entire code less flexible and harder to understand and maintain.

While such frameworks do provide a lot of tested util functions, I would much rather just extract those utils out and include them in my workflow separately. Full control to tap in at each step is what we want.
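To make "tap in at each step" a bit more concrete, here is a small sketch in plain Python, no framework involved. It chains step functions and calls a hook after each one. run_steps, the hook signature and the two toy steps are hypothetical names for illustration, not code from the repo.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]
Hook = Callable[[str, Any], None]


def run_steps(value: Any, steps: list[Step], on_step: Hook) -> Any:
    """Feed value through each step in order, calling on_step(name, result) after each."""
    for step in steps:
        value = step(value)
        # this is the tap-in point: log, validate, cache or abort here
        on_step(step.__name__, value)
    return value


def strip_text(text: str) -> str:
    return text.strip()


def to_upper(text: str) -> str:
    return text.upper()


# collect what every step produced, in order
seen: list[tuple[str, str]] = []
result = run_steps("  pasta  ", [strip_text, to_upper], lambda name, out: seen.append((name, out)))
```

The same shape would fit the download, extract_data and create_pdf steps above: each one stays a plain function, and the hook decides what happens between them.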