Query Audio with Gemini 1.5

Published on April 15, 2024


Gemini 1.5

Google’s AI research keeps pushing the boundaries with the introduction of Gemini 1.5. This next-generation model builds on the strengths of its predecessors with significant performance improvements. At its core is a combination of Transformer and Mixture of Experts (MoE) architectures. In an MoE model, a routing step activates only the most relevant parts of the network (the “experts”) for each input, so less of the model has to run for any given task.
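To make the routing idea concrete, here is a toy sketch of top-k expert selection in plain Python. It is not Gemini's actual architecture, which Google has not published in this detail; the four "experts", the gate scores, and the function names are all hypothetical and exist only for illustration.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

def moe_layer(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    weights = route_top_k(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy "experts"; a real model would use learned feed-forward networks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 2.0]  # hypothetical router output for one input

selected = route_top_k(gate_scores, k=2)
print(sorted(selected))                      # indices of the experts that run
print(moe_layer(3.0, experts, gate_scores))  # blended output of those experts
```

Only two of the four experts are evaluated for this input, which is the efficiency argument in miniature: the model can hold many parameters while spending compute on a small subset of them per input.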

Furthermore, Gemini 1.5 takes a leap forward in understanding context. It can now comprehend massive passages of text, with a context window reaching up to 1 million tokens. This allows the model to reason across extensive stretches of text, leading to richer and more informative outputs.

Beyond text, Gemini 1.5 is multimodal. It can understand spoken audio, processing speech alongside text, which opens up tasks like generating captions for videos or summarizing audio recordings. For video analysis, Gemini 1.5 doesn’t just interpret the visuals; it can also analyze the accompanying audio, leading to a more comprehensive grasp of the video’s content.
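Since audio also consumes the context window, it can help to estimate how much room a recording will take. The sketch below is a rough back-of-the-envelope check only. The rate of 32 tokens per second of audio is an assumption (Google's documentation has quoted figures of this order, but the number may differ by model and version), and the function names are hypothetical.

```python
import math

CONTEXT_WINDOW_TOKENS = 1_000_000      # from the article: up to 1 million tokens
ASSUMED_AUDIO_TOKENS_PER_SECOND = 32   # assumption; check the current API docs

def estimate_audio_tokens(duration_seconds,
                          tokens_per_second=ASSUMED_AUDIO_TOKENS_PER_SECOND):
    """Estimate how many tokens an audio clip of the given length would use."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds * tokens_per_second)

def fits_in_context(duration_seconds, prompt_tokens=0,
                    context_window=CONTEXT_WINDOW_TOKENS):
    """Return True if the estimated audio tokens plus the prompt fit the window."""
    return estimate_audio_tokens(duration_seconds) + prompt_tokens <= context_window

one_hour = 60 * 60
print(estimate_audio_tokens(one_hour))  # estimated tokens for a one-hour clip
print(fits_in_context(one_hour))        # whether it fits under these assumptions
```

For an exact count, the API itself is the authority; this estimate is only useful for deciding early whether a recording is obviously too long.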

Keeping real-world applications in mind, Gemini 1.5 is designed for scalability. It finds a middle ground between size and power, making it more accessible for deployment in various projects. Developers can leverage Gemini 1.5 through Google AI Studio and Vertex AI, offering flexibility in integrating this powerful tool into their workflows.

Gemini 1.5 represents a major advancement in LLM technology. It boasts enhanced performance, broader context understanding, the ability to process multiple modalities, and a focus on practical applications. This powerful new model has the potential to revolutionize various fields, and Google AI’s ongoing development efforts promise even greater capabilities in the future.

Audio Processing Application

Imagine harnessing the power of AI to delve deep into audio files, extracting insights, summaries, and sentiments with a simple click. That’s exactly what the AI-Powered Audio Processing App delivers. Powered by Python, Streamlit, and the Gemini 1.5 model, this app is a game-changer in audio analysis.

The app greets users with a clean, organized interface, inviting them to enter custom AI prompts for personalized queries. Want a summary? Sentiment analysis? It’s all just a prompt away. At the heart of the app’s design philosophy is a commitment to user experience. A prominent “Upload Audio File” section with drag-and-drop functionality makes submitting audio content seamless, and it accepts WAV and MP3 files.

Once uploaded, users can play back the audio file directly within the app, ensuring they’ve got the right file and can listen to the content before processing. Beyond summarization, the app empowers users to get creative with their prompts, catering to a variety of audio processing needs. Simply type in what you want the AI to do with your audio, and let the Gemini 1.5 model work its magic.

After processing, the app shows the result in a dedicated output area. Whether it’s a summary or a more complex analysis, users can read the AI’s response directly below the audio upload section.

This AI-Powered Audio Processing App is more than just a utility; it’s a testament to where the world of audio analysis is heading. It’s a bridge connecting complex AI processing and everyday usability, exemplifying how cutting-edge technology can be accessible to all.

Implementation

This Streamlit-based script outlines a web application that leverages Google’s generative AI to process audio files according to user-provided prompts. The application begins by configuring the necessary API keys and setting up the environment. It introduces a function, ‘process_audio’, which invokes a model from Google’s Generative AI to interpret the uploaded audio file in the context of the user’s prompt. The prompt could be a request for summarization, sentiment analysis, or other audio analyses.
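As a minimal sketch of how ‘process_audio’ could be made a little more defensive, the variant below falls back to a default prompt when the user leaves the field empty and turns API failures into a readable message. The names ‘build_request’, ‘process_audio_safely’, and ‘DEFAULT_PROMPT’ are hypothetical, and the model and upload function are passed in as arguments (an assumption made here so the logic can be exercised without network access).

```python
DEFAULT_PROMPT = "Please summarize the audio."  # hypothetical fallback prompt

def build_request(user_prompt, audio_file):
    """Return the [prompt, audio] list, using the fallback for an empty prompt."""
    prompt = (user_prompt or "").strip() or DEFAULT_PROMPT
    return [prompt, audio_file]

def process_audio_safely(model, upload_file, audio_file_path, user_prompt):
    """Upload the audio, query the model, and report failures as text.

    `model` needs a generate_content(parts) method and `upload_file` must
    accept a `path` keyword, mirroring how the script uses the genai library.
    """
    try:
        audio_file = upload_file(path=audio_file_path)
        response = model.generate_content(build_request(user_prompt, audio_file))
        return response.text
    except Exception as exc:  # broad on purpose: show the error in the UI
        return f"Audio processing failed: {exc}"
```

In the app this would be called with the same objects the script already uses, for example ‘process_audio_safely(genai.GenerativeModel("models/gemini-1.5-pro-latest"), genai.upload_file, audio_path, user_prompt)’.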

A utility function, ‘save_uploaded_file’, manages the temporary storage of the uploaded audio, ensuring it’s accessible for processing. The Streamlit app interface provides a user-friendly environment with a title and a sidebar that includes the developer’s bio and a LinkedIn profile for professional connection. Custom CSS is injected directly into the app to enhance the visual elements, like buttons and text inputs, creating a more engaging user experience.
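The script's temporary files are created with ‘delete=False’ and never removed, so they accumulate over time. One possible alternative, sketched here with a hypothetical helper named ‘temporary_audio_copy’, wraps the temporary file in a context manager that deletes it once processing is done. It also uses ‘pathlib’ to read the extension, which handles file names without a dot more gracefully than splitting on ‘.’.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def temporary_audio_copy(file_name, data):
    """Write `data` to a temp file with the original extension, then clean up."""
    suffix = Path(file_name).suffix  # ".mp3", or "" when there is no extension
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()  # close so other code can open the file by path
        yield tmp.name
    finally:
        tmp.close()  # safe to call twice
        if os.path.exists(tmp.name):
            os.remove(tmp.name)

# Usage with in-memory bytes; in the app these would come from
# uploaded_file.name and uploaded_file.getvalue().
with temporary_audio_copy("clip.mp3", b"fake audio bytes") as path:
    print(Path(path).suffix)       # extension carried over from the upload
    print(os.path.getsize(path))   # bytes written to disk
```

The trade-off is that the path is only valid inside the ‘with’ block, so playback and processing both have to happen before the block exits.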

Users interact with the app through a text input field, where they can enter custom prompts that instruct the AI on what to do with the uploaded audio file. The audio file can be uploaded via a file uploader that accepts WAV and MP3 formats. Once uploaded, the file can be played back directly in the app, and with the press of a button, the audio is processed according to the user’s instructions. The results are displayed in a text area that can be scrolled through, allowing users to read through the processed output generated by the AI.
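One detail worth knowing is that Streamlit reruns the whole script on every interaction, so the upload is written to a new temporary file each time a widget changes. A possible way around this, sketched below, is to remember the saved path keyed by a hash of the file's bytes. The helper ‘cached_path’ is hypothetical, and a plain dictionary stands in for Streamlit's ‘st.session_state’ so the example runs on its own.

```python
import hashlib

def cached_path(cache, data, save):
    """Return the saved path for `data`, calling `save` only for new content."""
    key = hashlib.sha256(data).hexdigest()
    if key not in cache:
        cache[key] = save(data)
    return cache[key]

# Stand-in for a function that writes bytes to disk and returns the path.
calls = []
def fake_save(data):
    calls.append(data)
    return f"/tmp/audio_{len(calls)}.wav"

cache = {}  # in the app, st.session_state could play this role
first = cached_path(cache, b"same bytes", fake_save)
second = cached_path(cache, b"same bytes", fake_save)  # simulated rerun
print(first == second, len(calls))
```

Hashing the content rather than the file name means that re-uploading a different file under the same name still triggers a fresh save.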

This single script is the whole application: Streamlit renders the interface and runs the processing logic in the same file, which keeps AI-powered audio analysis accessible and easy to adapt for a wide range of applications.

import streamlit as st
import tempfile
import os
import google.generativeai as genai

from dotenv import load_dotenv

load_dotenv()

# Configure Google API for audio processing
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

def process_audio(audio_file_path, user_prompt):
    """Process the audio using the user's prompt with Google's Generative API."""
    model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
    audio_file = genai.upload_file(path=audio_file_path)
    response = model.generate_content(
        [
            user_prompt,
            audio_file
        ]
    )
    return response.text

def save_uploaded_file(uploaded_file):
    """Save uploaded file to a temporary file and return the path."""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix='.' + uploaded_file.name.split('.')[-1]) as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(f"Error handling uploaded file: {e}")
        return None

# Streamlit app interface
st.title('AI-Powered Audio Processing App')

# Profile Sidebar
st.sidebar.title('About Me')
st.sidebar.image('TOni ANotherpic.jpg', width=100)  # Replace with the path to your own image
st.sidebar.markdown("""
**Name:** Toni Ramchandani
**Bio:** Driven by Sports, Adventure, Technology & Innovations.
[LinkedIn Profile](https://www.linkedin.com/in/toni-ramchandani/)
""")

# Inject custom CSS for styling
st.markdown("""

""", unsafe_allow_html=True)

user_prompt = st.text_input("Enter your custom AI prompt:", placeholder="E.g., 'Please summarize the audio:'")

audio_file = st.file_uploader("Upload Audio File", type=['wav', 'mp3'])
if audio_file is not None:
    audio_path = save_uploaded_file(audio_file)

    # save_uploaded_file returns None on failure, so continue only with a valid path
    if audio_path is not None:
        st.audio(audio_path)

        if st.button('Process Audio'):
            with st.spinner('Processing...'):
                processed_text = process_audio(audio_path, user_prompt)
            st.text_area("Processed Output", processed_text, height=300)
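One gap in the script is that ‘os.getenv("GOOGLE_API_KEY")’ returns None when the variable is missing, and the problem only surfaces later as an API error. A small check at startup, sketched here with a hypothetical helper called ‘require_api_key’, can fail early with a clearer message. The environment is passed in as a mapping so the function is easy to try out.

```python
def require_api_key(env, name="GOOGLE_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or blank."""
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your environment or .env file."
        )
    return key

# Usage with a plain dict; in the app you would pass os.environ instead.
print(require_api_key({"GOOGLE_API_KEY": "example-key"}))
```

In the script this would replace the bare ‘os.getenv’ call, for example ‘genai.configure(api_key=require_api_key(os.environ))’ after ‘load_dotenv()’ has run.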

Find all the code in the GitHub repo toniramchandani1/AudioProcessingApplication.

About Me🚀
Hello! I’m Toni Ramchandani 👋. I’m deeply passionate about all things technology! My journey is about exploring the vast and dynamic world of tech, from cutting-edge innovations to practical business solutions. I believe in the power of technology to transform our lives and work. 🌐

Let’s connect at https://www.linkedin.com/in/toni-ramchandani/ and exchange ideas about the latest tech trends and advancements! 🌟


This story is published under Generative AI Publication.

Query Audio with Gemini 1.5 was originally published in Generative AI on Medium, where people are continuing the conversation by highlighting and responding to this story.