Cloud to Pocket — Redefining AI Accessibility: On-Device LLMs

Published on April 2, 2024

Imagine a world where your device whispers secrets of the universe to you, untethered from the vast cloud above. This is the promise of integrating large language models directly onto your handheld devices, a reality made possible by MediaPipe and TensorFlow Lite. The power of AI, once locked away in remote servers, now fits snugly in your pocket, ready to leap into action at your command. This innovation marks the dawn of a new era in technology, bringing us closer to an AI that is not just everywhere but also for everyone.

What’s the story?

In a pioneering move, Google has unveiled the MediaPipe LLM Inference API, which combines TensorFlow Lite’s on-device machine learning runtime with MediaPipe’s support for ML pipelines. The release lets Large Language Models (LLMs) run directly on a range of devices despite their significant memory and compute demands, and it supports popular models such as Gemma and Phi-2, so developers can integrate advanced AI functionality into mobile and web applications efficiently. Running models locally reduces latency and enables offline use, making AI tools accessible even in regions with limited internet connectivity. This shift not only democratizes AI but also opens new avenues for application development, bringing sophisticated AI interactions to a much wider range of devices.

Let’s understand the key players

MediaPipe

MediaPipe is an open-source framework developed by Google, designed to facilitate the building of both research and production-level machine learning pipelines for processing media content. This includes video, audio, and other time-series data. It offers ready-to-use yet customizable models and solutions for a variety of tasks such as facial detection, object tracking, and gesture recognition, enabling developers to easily integrate advanced machine learning capabilities into applications across platforms.
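
As a quick illustration of how those ready-made solutions are used from the web, the sketch below runs MediaPipe’s face-detection task on an image element. The model file name and the image element ID are illustrative placeholders, and the snippet assumes the @mediapipe/tasks-vision package loaded as an ES module.

import {FilesetResolver, FaceDetector} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision';

// Resolve the WASM assets, create a face detector, and run it on an <img> element.
const vision = await FilesetResolver.forVisionTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm');
const detector = await FaceDetector.createFromOptions(vision, {
  baseOptions: {modelAssetPath: 'blaze_face_short_range.tflite'},  // illustrative local model file
  runningMode: 'IMAGE',
});
const result = detector.detect(document.getElementById('photo'));  // 'photo' is a placeholder id
console.log(result.detections);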

The MediaPipe LLM Inference API is a cutting-edge tool that brings the power of large language models directly onto devices, enhancing apps and products with capabilities like text generation, information retrieval, and document summarization. It supports a variety of text-to-text models including Gemma 2B and others like Phi-2 and StableLM-3B, making state-of-the-art generative AI accessible for on-device applications. This experimental API is still under development and its use aligns with specific policies to ensure responsible application.
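
To give a feel for the API surface, here is a minimal sketch of the call pattern for the web version of the LLM Inference API. The model file name is only an example, and the non-streaming form of generateResponse is used for brevity; the full streaming version appears in the Implementation section below.

import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

// Resolve the WASM assets, load a local model file, and generate text.
const genaiFileset = await FilesetResolver.forGenAiTasks(
    'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
const llm = await LlmInference.createFromOptions(genaiFileset, {
  baseOptions: {modelAssetPath: 'gemma-2b-it-gpu-int4.bin'},  // example model file
});
const answer = await llm.generateResponse('Summarize on-device inference in one sentence.');
console.log(answer);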

TensorFlow Lite

TensorFlow Lite is an open-source deep learning framework developed by Google for on-device inference. It enables the deployment of machine learning models on mobile, embedded, and IoT devices, allowing for efficient and lightweight processing of AI applications directly on the device. TensorFlow Lite supports a wide range of devices, making machine learning models more accessible and faster, with the added benefit of operating offline. This capability enhances user privacy and reduces the dependency on cloud computing, making AI applications more versatile and responsive.
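
Although this post leans on MediaPipe, it can be useful to see how small a plain TensorFlow Lite inference looks in the browser. The sketch below assumes the @tensorflow/tfjs and @tensorflow/tfjs-tflite packages and an arbitrary model.tflite file; the input shape and preprocessing depend entirely on the model you use.

import * as tf from '@tensorflow/tfjs';
import * as tflite from '@tensorflow/tfjs-tflite';

// Load a .tflite model and run one inference on a placeholder input tensor.
async function runTfliteDemo() {
  const model = await tflite.loadTFLiteModel('model.tflite');  // hypothetical model file
  const input = tf.zeros([1, 224, 224, 3]);                    // placeholder input shape
  const prediction = model.predict(input);
  prediction.print();
}

runTfliteDemo();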

WebGPU

WebGPU is an emerging web standard designed to provide modern, high-performance graphics and computational capabilities in web browsers. It acts as a successor to WebGL, offering a more powerful interface for developers to tap into GPU resources for complex rendering tasks and parallel computations. This allows for the creation of sophisticated graphical applications and games directly in the browser, as well as leveraging the GPU for heavy computational tasks like machine learning, all with enhanced efficiency and performance.

To enable WebGPU in Chrome, navigate to `chrome://flags` in your browser’s address bar. Search for "WebGPU" in the available flags list, then select "Enabled" from the dropdown menu next to the WebGPU flag. After enabling it, restart Chrome for the changes to take effect. This will allow you to experiment with WebGPU features on supported websites and applications. For the most detailed and updated instructions, refer to the Chrome documentation or help resources.
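
It can also help to verify from JavaScript that WebGPU is actually exposed before trying to load a GPU-backed model. A minimal check, run from a module script so top-level await is allowed, looks like this:

// Detect WebGPU support before attempting GPU-backed inference.
if (!navigator.gpu) {
  console.warn('WebGPU is not available in this browser.');
} else {
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter?.requestDevice();
  console.log('WebGPU device ready:', device);
}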

Gemma

Gemma models, stemming from Google DeepMind and other Google teams, are a suite of open, lightweight, state-of-the-art models for generative AI applications. Named after the Latin word for "precious stone," Gemma models draw from the technology behind the Gemini models, offering adaptability for various computing resources and tasks. These models can be tuned for specific tasks, enhancing their performance in targeted applications, and are supported across multiple platforms, including TensorFlow, JAX, and PyTorch. For further details and development guides, visit the official Gemma documentation.

Gemini

Gemini, Google’s most advanced AI model, is now accessible for application development, detailed on its dedicated site. The latest version, Gemini 1.5 Pro, is in Public Preview on Google AI Studio, allowing for immediate, web-based prototyping. For newcomers, the platform offers a quickstart notebook and API guide, alongside tutorials in various programming languages. It also emphasizes the importance of using LLMs safely, providing comprehensive safety settings and guidelines documentation.

Phi-2

Phi-2, a Transformer model with 2.7 billion parameters, builds on Phi-1.5 using additional NLP synthetic texts and carefully selected websites to enhance safety and educational value. It excels in benchmarks for common sense, language understanding, and logical reasoning, ranking near the top among models under 13 billion parameters. Unlike many, it’s not fine-tuned with human feedback, aiming instead to offer the research community a tool for addressing safety challenges like toxicity reduction and bias understanding, all within an open-source framework.

Falcon-RW-1B

Falcon-RW-1B, developed by TII, is a causal decoder-only model with 1 billion parameters, trained on the RefinedWeb dataset comprising 350 billion tokens. This high-quality dataset utilizes extensive filtering and deduplication processes. Falcon-RW-1B, available under the Apache 2.0 license, demonstrates competitive performance against models trained on curated datasets. Now incorporated into the transformers library, it serves as a valuable resource for research on web-data training impacts. For cutting-edge models, Falcon-7B and Falcon-40B, trained on over 1,000 billion tokens, are recommended.

Implementation

To set up and run the MediaPipe LLM Inference task for web applications, follow these steps:
1. Ensure your browser supports WebGPU (for example, Chrome on macOS or Windows).
2. Create a folder named `llm_task`.
3. Copy the `index.html` and `index.js` files into your `llm_task` folder.
4. Download the Gemma 2B model (for example, the `gemma-2b-it-gpu-int4.bin` variant used below) into the `llm_task` folder, or convert an external LLM (Phi-2, Falcon, or StableLM) and place it there, making sure the model is compatible with a GPU backend.
5. In `index.js`, update the `modelFileName` variable to match your model file’s name.
6. Start a local server from within the `llm_task` folder with `python -m http.server 8080` (or `python -m SimpleHTTPServer 8080` for older Python versions).
7. Open `localhost:8080` in Chrome. The web interface will activate and be ready for use in about 10 seconds.

Please find below the content for `index.html` and `index.js` respectively.

<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>Running Large Language Models On-Device with MediaPipe and TensorFlow Lite</title>
</head>
<body>
  <!-- Author card; only the element IDs further below are required by index.js -->
  <div class="profile">
    <div class="label">Your Name</div>
    <div class="name">Toni Ramchandani</div>
    <div class="tagline">Driven by Sports, Adventure, Technology &amp; Innovations</div>
    <a href="https://www.linkedin.com/in/toni-ramchandani/" class="profile-link">LinkedIn Profile</a>
  </div>

  <h1>Running Large Language Models On-Device with MediaPipe and TensorFlow Lite</h1>

  <label for="input">Input:</label><br>
  <textarea id="input" rows="5" cols="80"></textarea><br>
  <input type="button" id="submit" value="Get Response" disabled><br>

  <label for="output">Result:</label>
  <div id="output"></div>

  <!-- index.js uses ES module imports, so it must be loaded as a module -->
  <script type="module" src="index.js"></script>
</body>
</html>

import {FilesetResolver, LlmInference} from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai';

const input = document.getElementById('input');
const output = document.getElementById('output');
const submit = document.getElementById('submit');

const modelFileName = 'gemma-2b-it-gpu-int4.bin';

/**
 * Display newly generated partial results to the output text box.
 */
function displayPartialResults(partialResults, complete) {
  output.textContent += partialResults;

  if (complete) {
    if (!output.textContent) {
      output.textContent = 'Result is empty';
    }
    submit.disabled = false;
  }
}

/**
 * Main function to run LLM Inference.
 */
async function runDemo() {
  const genaiFileset = await FilesetResolver.forGenAiTasks(
      'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm');
  let llmInference;

  submit.onclick = () => {
    output.textContent = '';
    submit.disabled = true;
    llmInference.generateResponse(input.value, displayPartialResults);
  };

  submit.value = 'Loading the model...';
  LlmInference
      .createFromOptions(genaiFileset, {
        baseOptions: {modelAssetPath: modelFileName},
      })
      .then(llm => {
        llmInference = llm;
        submit.disabled = false;
        submit.value = 'Get Response';
      })
      .catch(() => {
        alert('Failed to initialize the task.');
      });
}

runDemo();

Conclusion

The integration of the MediaPipe LLM Inference API into web applications marks a significant advancement in bringing powerful language models directly to devices. This development opens up new possibilities for creating more dynamic, efficient, and privacy-conscious applications. As we’ve explored the setup and implementation on web platforms, it’s worth noting that similar advancements can extend these capabilities to mobile devices in the future, further broadening the scope of on-device AI applications.

We’ll implement that some other day 😎.

About Me🚀
Hello! I’m Toni Ramchandani 👋. I’m deeply passionate about all things technology! My journey is about exploring the vast and dynamic world of tech, from cutting-edge innovations to practical business solutions. I believe in the power of technology to transform our lives and work. 🌐

Let’s connect at https://www.linkedin.com/in/toni-ramchandani/ and exchange ideas about the latest tech trends and advancements! 🌟

Engage & Stay Connected 📢
If you find value in my posts, please Clap 👏 | Like 👍 and share 📤 them. Your support inspires me to continue sharing insights and knowledge. Follow me for more updates and let’s explore the fascinating world of technology together! 🛰️

