Generative AI with Local LLM


Shamim Bhuyan

October 16, 2024
5 min read

Table of Contents

How to Use Llama 3.2 Vision Models: From Local Inference to API Integration, part 1
Running Llama 3.2 Vision Models Locally through Hugging Face
Clean UI for running the Llama 3.2 vision model

How to Use Llama 3.2 Vision Models: From Local Inference to API Integration, part 1

Llama 3.2, the latest iteration of the LLaMA series, brings enhanced multimodal capabilities, including a powerful vision model. Whether you're processing images for analysis, generating visual content, or building AI-driven applications, Llama 3.2's vision model opens up new possibilities in computer vision tasks. In this blog post series, we will explore how to leverage the vision model locally and via APIs, giving you flexibility based on your specific needs.


Before we dive into the "how," let’s touch on the "why." Llama 3.2’s vision model combines advanced image processing capabilities with language understanding, enabling tasks like:


Image captioning: Generating descriptive text based on images.

Visual question answering: Providing answers to questions about an image.

Image classification and object detection: Identifying objects or categories within an image.

Visual storytelling: Generating narrative text from images for creative or practical applications.

Whether you're a researcher, developer, or business owner, Llama 3.2 can simplify complex visual tasks and improve AI-driven projects.

Currently, llama.cpp doesn't support Llama 3.2 vision models, so using them for local inference through platforms like Ollama or LMStudio isn’t possible. However, there are other ways to utilize these models for personal use.


That said, lightweight versions of Llama 3.2, such as the 1B and 3B models, can be easily run locally. In fact, I tested the 1B version on a single-board computer like the Orange Pi 3 LTS with 2 GB of RAM.
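For reference, here is a minimal sketch, my own rather than the author's exact setup, of loading the 1B instruct variant with the Transformers pipeline API. It assumes you have accepted Meta's license, logged in to the Hugging Face Hub, and have enough memory for the unquantized weights.

import torch
from transformers import pipeline

# Load the lightweight Llama 3.2 1B instruct model; device_map="auto" places it on a GPU if one is available.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = pipe("Explain in one sentence what a vision-language model is.", max_new_tokens=60)
print(result[0]["generated_text"])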

Let's return to the main purpose of this blog post. The primary way to use the Llama 3.2 vision models locally is through the Hugging Face Transformers library.


Running Llama 3.2 Vision Models Locally through Hugging Face

To run the model locally, you'll need to ensure that your system meets the required hardware and software specifications, particularly a GPU with sufficient memory to handle image-based tasks. You'll also need a framework like PyTorch or TensorFlow, depending on your setup.

You also need to set up a Python Conda environment and Jupyter Lab to run the notebook. For a complete guide on setting up a Python environment with Conda and Jupyter Lab, please refer to the sample chapter in the book "Getting started with Generative AI", which describes each process step by step.


Step 1. Create a new notebook and add the following code:

!pip install -U transformers

import time
import requests
import torch
from PIL import Image
from huggingface_hub import interpreter_login
from transformers import (
    MllamaForConditionalGeneration,
    AutoProcessor,
)

# Log in to the Hugging Face Hub (you will be prompted for your access token).
interpreter_login()

This code snippet sets up your environment by ensuring that the necessary libraries are installed and imported, and it authenticates your access to the Hugging Face Hub. Use your Hugging Face access token to access the model, and make sure that Meta has granted you permission to use it.
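If you are working outside an interactive notebook, you can authenticate programmatically instead of calling interpreter_login(); a small sketch, assuming you have already created an access token in your Hugging Face account settings:

from huggingface_hub import login

# Replace the placeholder with your own Hugging Face access token.
login(token="hf_your_token_here")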

Step 2. Download the model from Hugging Face

model_id = "meta-llama/Llama-3.2-11B-Vision"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

It will take a while to download the entire model, which is approximately 20 GB in size. The code also initializes a processor to prepare data for inference. Using bfloat16 and automatic device mapping helps optimize the model's memory usage and performance during execution.
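Once the download finishes, you can quickly confirm how the weights were placed and roughly how much memory they take; a small sketch, assuming accelerate is installed (it is pulled in for device_map="auto"):

# Inspect the layer-to-device assignment produced by device_map="auto"
print(model.hf_device_map)

# Approximate memory footprint of the loaded weights, in gigabytes
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")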


Step 3. Load the image and give a prompt

url = "https://d2sofvawe08yqg.cloudfront.net/quickstartwithai/s_hero?1728376971"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "<|image|><|begin_of_text|>Describe the Image"
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=250)
print(processor.decode(output[0]))

For the image description, we use our book cover image, fetched from the URL above.

The above code snippet fetches an image from a specified URL, processes it with a prompt for description, and then generates and prints a description of the image using the Llama 3.2 vision model.


If everything goes well, you should see an output similar to the one shown below.

This is a book cover that describes a hands-on guide for AI development with local LLMs.
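Note that decode() as used above prints the prompt and the model's special tokens along with the answer. If you only want the newly generated text, a small variation (assuming the same inputs and output as in Step 3) is:

# Keep only the tokens generated after the prompt and drop special tokens
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))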

As always, you can use Google Colab or Kaggle to run the notebook. The next screenshot shows a running notebook on Google Colab, which requires 14 GB of RAM to operate the model.


This setup is ideal for developers who want direct control over the model's behavior and execution. You can fine-tune the model or integrate it into local applications with ease.
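As an illustration of what such an integration might look like, here is a hypothetical helper, my own naming and defaults rather than anything from the model's API, that wraps the steps above into a single reusable function using the model and processor already loaded:

def describe_image(url: str, question: str = "Describe the Image", max_new_tokens: int = 250) -> str:
    """Fetch an image from a URL and return the model's description."""
    image = Image.open(requests.get(url, stream=True).raw)
    prompt = f"<|image|><|begin_of_text|>{question}"
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output[0], skip_special_tokens=True)

print(describe_image("https://d2sofvawe08yqg.cloudfront.net/quickstartwithai/s_hero?1728376971"))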


Clean UI for running the Llama 3.2 vision model

This open-source project provides a simple way to run the Llama 3.2 vision model locally. However, to run the model through Clean UI, you need 12 GB of VRAM. The setup process is straightforward.

Step 1. Set up the environment.

As usual, for a complete guide on setting up a Python environment with Conda and Jupyter Lab, please refer to the sample chapter in the book "Getting started with Generative AI", which describes each process step by step.

Step 2. Clone the GitHub repository


git clone https://github.com/ThetaCursed/clean-ui.git

cd clean-ui


Step 3. Install dependencies.

pip install -r requirements.txt


Install torch with a separate command, as shown below:


pip install torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121
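After the install, it is worth checking that this CUDA build of PyTorch actually sees your GPU before launching the UI; a quick sanity check, run from Python:

import torch

# Should print True and your GPU's name if the cu121 wheel was installed correctly
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))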


Step 4. Start the UI


Use the following command to start the user interface:

python clean-ui.py


Upload an image, give a prompt like "describe the image", and adjust parameters such as temperature to experiment with the output.


This setup is straightforward for users who simply want to use the model for image processing.


Llama 3.2's vision model offers versatile applications, whether you're generating image descriptions, answering questions about visual content, or automating image classification tasks. With the choice between local inference and API integration, you can adapt the model to suit your workflow and requirements. Local inference provides a more customizable approach, while API integration offers convenience and accessibility.


Regardless of how you choose to deploy it, Llama 3.2 is a powerful tool for enhancing your AI-driven projects. Explore its potential and discover the method that works best for you. In the next part of this series, we will explore the Groq API and how to integrate it with the Llama 3.2 vision models. Stay tuned!