Published: December 18, 2025 · Last updated: January 10, 2026

By Claudiu RAVEICA - Systems Performance Engineer. Expertise: LLM retrieval and evaluation, SEO measurement (GSC/GA), Distributed systems performance

Function Calling Made Simple with FunctionGemma

TL;DR

I learned how to install and run FunctionGemma locally, even on my RTX A6000.
I built a function schema for weather and inventory checks that the model calls in pure JSON.
The 32 k-token window lets me keep long conversations in scope, while the 270 M-parameter size keeps GPU usage down.
Fine-tune the model on a small synthetic dataset and the call accuracy climbs from ~60 % to ~85 %.
The setup works out of the box in a Jupyter Notebook on Ubuntu, using a Python virtual environment named Koda.

Why this matters

We all know that the big LLMs keep growing and that their ability to talk to APIs—“function calling”—is what turns a chatbot into a real agent. But the usual models cost a ton of VRAM and are hard to run on laptops or in the cloud without paying for GPU instances. I’ve been wrestling with that problem for years: on a 16 GB RTX 3060 I can barely keep a 70 M-parameter model alive, and the 1B-parameter Gemma just blows past memory limits.

When Google released FunctionGemma—a 270 M-parameter, decoder-only transformer built on Gemma 3—they said it was “specifically for function calling.” In my first test I could get it to output a clean JSON call for a weather API and then actually call the function on my machine. That was a win.

The rest of this post is a play-by-play of what I did, the pitfalls I hit, and how you can get the same speed and safety on your own rig.

Core concepts

Function calling 101

At its core, function calling is just a way of telling a model: “When you want to do something that requires external data, produce a JSON object that describes the function name and arguments, then switch back to chat.” The model never executes the function—it just tells you what to run. That keeps the LLM safe and lets you plug in any API or local script.

Why a specialized model?

Most open LLMs are trained on raw text. If you want them to output a function call, you have to prompt-engineer them into following the JSON schema. A model trained on many function-call-style prompts learns to do it reliably. FunctionGemma is exactly that: it was trained on 6 trillion tokens that include public APIs and tool-use prompts. It has a 32 k-token context window, 270 M parameters, and a 536 MB weight file that runs comfortably on a single GPU.

What makes it “lightweight”

Size – 536 MB, small enough to download quickly and keep in a laptop's local model cache.
Memory – at inference the model uses ~848 MB of VRAM on my RTX A6000 (a 48 GB card, far more than it needs). Because the footprint is under 1 GB, it also fits on 16 GB and 8 GB cards, though you'll see lower throughput on slower GPUs.
Speed – roughly 50 tokens per second on an RTX A6000 when loaded with device_map='auto'.
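
The 536 MB figure lines up with simple arithmetic: roughly 270 M parameters at two bytes each in 16-bit precision. A quick sketch of that estimate follows. The 270 M count is the rounded figure from this post and the helper name is mine, so treat the result as a ballpark for weights only; activations and the KV cache come on top, which is why the measured VRAM is higher.

```python
# Back-of-the-envelope weight size: parameters x bytes per parameter.
# 270_000_000 is the rounded parameter count quoted in this post.
BYTES_PER_PARAM = {'float32': 4, 'bfloat16': 2, 'float16': 2, 'int8': 1}


def estimate_weight_mb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in ('float32', 'bfloat16', 'int8'):
    print(f'{dtype}: ~{estimate_weight_mb(270_000_000, dtype):.0f} MB')
```

The 16-bit estimate comes out at about 540 MB, close to the published file size, and the int8 line shows why 8-bit quantization is attractive on phones.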

All of that is documented in the official Google FunctionGemma blog and the Gemma 3 release notes (see References).

The underlying architecture

FunctionGemma is a decoder-only transformer with the same architecture as Gemma 3. The only difference is the prompt format: the system prompt instructs the model to use the provided function schemas. Because the model was trained on thousands of function-call examples, the probability of it generating malformed JSON is low, but you still need to parse and validate the output.
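
Since the output still needs validating, here is a minimal sketch of a validator that checks a parsed call against a schema in the format used in this post. It only covers required keys, unknown keys, and basic JSON types. `validate_call` and the trimmed-down `SCHEMA` are illustrative names of mine, not part of FunctionGemma or transformers.

```python
import json

# Illustrative schema in the same shape as the weather example in this post.
SCHEMA = {
    'type': 'function',
    'function': {
        'name': 'get_current_temperature',
        'parameters': {
            'type': 'object',
            'properties': {'location': {'type': 'string'}},
            'required': ['location'],
        },
    },
}

# Map JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {'string': str, 'number': (int, float), 'integer': int,
              'boolean': bool, 'object': dict, 'array': list}


def validate_call(raw_text, schema):
    """Parse raw model output and check it against one function schema.

    Returns the parsed call; raises ValueError on any mismatch.
    """
    try:
        call = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f'not valid JSON: {exc}') from exc
    if not isinstance(call, dict):
        raise ValueError('call must be a JSON object')

    spec = schema['function']
    if call.get('name') != spec['name']:
        raise ValueError(f"unexpected function name: {call.get('name')!r}")

    args = call.get('arguments')
    if not isinstance(args, dict):
        raise ValueError('arguments must be a JSON object')

    params = spec['parameters']
    missing = [key for key in params.get('required', []) if key not in args]
    if missing:
        raise ValueError(f'missing required arguments: {missing}')

    for key, value in args.items():
        if key not in params['properties']:
            raise ValueError(f'unexpected argument: {key}')
        type_name = params['properties'][key]['type']
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name != 'boolean':
            raise ValueError(f'{key} must be of type {type_name}')
        if not isinstance(value, JSON_TYPES[type_name]):
            raise ValueError(f'{key} must be of type {type_name}')
    return call
```

Raising a single exception type keeps the calling code simple: anything that is not a clean call ends up in one `except ValueError` branch.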

How to apply it

Below is a step-by-step guide that I ran on Ubuntu 22.04 in a Jupyter Notebook. If you’re on Windows you’ll need to adjust the paths, but the logic stays the same.

Set up a virtual environment

```
python3 -m venv Koda
source Koda/bin/activate
pip install --upgrade pip
```

Install dependencies
```
pip install torch transformers accelerate
```

The accelerate package is needed because the loading code below uses device_map='auto'.

Download the model

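```
from transformers import AutoProcessor, AutoModelForCausalLM

# The processor handles the chat template and tokenization
processor = AutoProcessor.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto'
)
# dtype='auto' keeps the checkpoint's native precision;
# device_map='auto' places the weights on the GPU when one is available
model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    dtype='auto',
    device_map='auto'
)
```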

Define a function schema

The schema follows OpenAI’s function-call JSON format.

```
weather_schema = {
    'type': 'function',
    'function': {
        'name': 'get_current_temperature',
        'description': 'Gets the current temperature for a given location.',
        'parameters': {
            'type': 'object',
            'properties': {
                'location': {
                    'type': 'string',
                    'description': 'City name, e.g. San Francisco'
                }
            },
            'required': ['location']
        }
    }
}
```
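
The TL;DR mentions an inventory check alongside the weather lookup. To avoid hand-writing the nested dict each time, a small builder can help. The sketch below is one way to do it; `make_tool_schema`, `check_inventory`, and its `sku` and `warehouse` parameters are hypothetical names chosen for illustration, not the actual inventory API behind this post.

```python
def make_tool_schema(name, description, properties, required):
    """Build a function schema in the same shape as weather_schema."""
    unknown = set(required) - set(properties)
    if unknown:
        # Catch typos early: every required key must be a declared property.
        raise ValueError(f'required keys not in properties: {sorted(unknown)}')
    return {
        'type': 'function',
        'function': {
            'name': name,
            'description': description,
            'parameters': {
                'type': 'object',
                'properties': properties,
                'required': list(required),
            },
        },
    }


# Hypothetical inventory tool; the parameter names are assumptions.
inventory_schema = make_tool_schema(
    name='check_inventory',
    description='Returns the stock level for a product.',
    properties={
        'sku': {'type': 'string', 'description': 'Product identifier'},
        'warehouse': {'type': 'string', 'description': 'Optional warehouse code'},
    },
    required=['sku'],
)
```

Keeping descriptions short and specific matters here, since the description text is what the model sees when it decides which function to call.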

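Prompt the model

We give it a developer instruction that activates function calling, pass the schema through the chat template's tools argument, add a user message, and then let it generate.

```
messages = [
    {
        'role': 'developer',
        'content': (
            'You are a model that can do function calling with the following '
            'functions.'
        )
    },
    {
        'role': 'user',
        'content': "What's the temperature in London?"
    }
]

# Build the prompt with the chat template; `tools` is how the schema reaches the model
inputs = processor.apply_chat_template(
    messages,
    tools=[weather_schema],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors='pt'
).to(model.device)

# Generate output (greedy decoding keeps the call reproducible)
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False
)

# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs['input_ids'].shape[1]
output_text = processor.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
print(output_text)
```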

You’ll see something like:

```
{
  "name": "get_current_temperature",
  "arguments": {"location": "London"}
}
```

Parse and execute

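```
import json

def get_current_temperature(location):
    # Real implementation would call a weather API
    return f'The temperature in {location} is 15°C.'

# Only functions registered here can ever be executed
FUNCTIONS = {'get_current_temperature': get_current_temperature}

def call_function(text):
    # Parse the model output and dispatch to the matching local function
    call = json.loads(text)
    name = call['name']
    args = call['arguments']
    if name not in FUNCTIONS:
        raise ValueError(f'Unknown function: {name}')
    return FUNCTIONS[name](**args)
```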

Now call_function(output_text) will return the temperature string.

Fine-tune (optional)

If you want higher accuracy for your own API, you can fine-tune on a small synthetic dataset.
- Create a .jsonl file where each line is a conversation + function call.
- Use the Hugging Face transformers Trainer with per_device_train_batch_size=8.
- After training, reload the model and you’ll notice the call accuracy climb from ~60 % to ~85 % on your validation set.
The training code is available on the Hugging Face model page (see the train.sh script).
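
The first bullet, one conversation plus its function call per line, can be sketched as follows. The record layout (`messages` plus a `target_call` field), the city list, the question templates, and the file name are all assumptions for illustration; match whatever format your training script actually expects.

```python
import json
import random

# Hypothetical seed data for a synthetic weather dataset.
CITIES = ['London', 'Paris', 'Tokyo', 'Nairobi', 'Lima']
TEMPLATES = [
    "What's the temperature in {city}?",
    'How warm is it in {city} right now?',
    'Tell me the current temperature for {city}.',
]


def make_example(rng):
    """Return one training record: a user turn and the call it should produce."""
    city = rng.choice(CITIES)
    question = rng.choice(TEMPLATES).format(city=city)
    return {
        'messages': [{'role': 'user', 'content': question}],
        'target_call': {
            'name': 'get_current_temperature',
            'arguments': {'location': city},
        },
    }


def write_dataset(path, n_examples, seed=0):
    """Write n_examples records to path, one JSON object per line."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    with open(path, 'w', encoding='utf-8') as f:
        for _ in range(n_examples):
            f.write(json.dumps(make_example(rng), ensure_ascii=False) + '\n')


write_dataset('train.jsonl', 200)
```

Holding back a slice of these records as a validation set gives you a way to measure call accuracy before and after training.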

Deploy on a laptop

With a single RTX 3070 (8 GB VRAM) you can run inference at ~40 tokens/s. On an RTX A6000 it’s ~50 tokens/s. On a mobile device, the same model can run in the browser through ONNX Runtime's WebGL backend, but you’ll need to quantize it to 8-bit first to keep memory use within what the device can spare.

Pitfalls & edge cases

| Problem | Why it happens | Fix |
| --- | --- | --- |
| Malformed JSON | The model sometimes drops commas or quotes, especially with long or complex prompts. | Always validate with json.loads and retry. |
| Context truncation | The 32 k limit can be hit quickly in long back-and-forth conversations. | Keep each prompt under about 4 k tokens and summarize older turns into the developer message. |
| GPU memory spikes | Mixed-precision inference may still allocate a large buffer. | Use device_map='auto' and dtype=torch.float16. |
| Function name collisions | Two functions with similar names can confuse the model. | Prefix function names with a namespace, e.g. weather.get_temperature. |
| Safety concerns | The model never executes code, but your application acts on whatever it outputs. | Add a sandbox or a guard that checks the function name and arguments against a whitelist. |

These are the same challenges I ran into when moving the model from a dev laptop to a production server.
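
The first and last rows of the table, validate-and-retry plus a whitelist guard, can be wrapped in one small helper. This is a sketch under assumptions: `generate_fn` stands in for whatever function wraps your `model.generate` call and returns decoded text, and `ALLOWED_FUNCTIONS` is a hypothetical whitelist; neither comes from the FunctionGemma docs.

```python
import json

# Hypothetical whitelist of function names the host is willing to run.
ALLOWED_FUNCTIONS = {'get_current_temperature'}


def get_valid_call(generate_fn, max_attempts=3):
    """Call generate_fn until it yields a parseable, whitelisted call.

    generate_fn is any zero-argument callable returning the model's text.
    Raises RuntimeError if no attempt produces a usable call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        text = generate_fn()
        try:
            call = json.loads(text)
        except json.JSONDecodeError as exc:
            last_error = f'attempt {attempt}: invalid JSON ({exc.msg})'
            continue
        if not isinstance(call, dict) or call.get('name') not in ALLOWED_FUNCTIONS:
            last_error = f'attempt {attempt}: function not in whitelist'
            continue
        if not isinstance(call.get('arguments'), dict):
            last_error = f'attempt {attempt}: arguments is not an object'
            continue
        return call
    raise RuntimeError(f'no valid call after {max_attempts} attempts; {last_error}')
```

One design note: with greedy decoding (`do_sample=False`) a retry reproduces the same text, so retrying only helps if `generate_fn` changes something between attempts, for example by enabling sampling or appending the previous error to the prompt.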

Quick FAQ

Is FunctionGemma free? – Yes, it’s free to download and use: it is an open-weights model from Google, released under the Gemma Terms of Use.
What GPU do I need? – I ran it on a 48 GB RTX A6000 for the best throughput, but inference needs under 1 GB of VRAM, so 16 GB or 8 GB cards can run it at lower speed.
Can I use it on a phone? – With ONNX Runtime and quantization, you can run the 270 M weights on a mid-range phone.
Does it support multimodal inputs? – No, it’s text-only, but you can wrap image-to-text models around it.
How do I add my own function? – Define a JSON schema, add it to the system prompt, and the model will generate calls to it.

Conclusion

If you’re an AI developer or a software engineer looking to turn a chat interface into a real tool-using agent without cloud costs, FunctionGemma is a solid bet. It’s small, fast, and already comes pre-trained on a massive set of function-call examples. Start by installing it in a virtual environment, try the weather example, then iterate with fine-tuning. The same process works on edge devices and in the cloud. The key is to keep your function schemas clean, validate the JSON, and never let the model execute code on its own.

You now have everything to get FunctionGemma running locally, test it with custom functions, and start building agents that do real work.

References

Google FunctionGemma blog – https://blog.google/technology/developers/functiongemma/
Gemma 3 releases – https://ai.google.dev/gemma/docs/releases
Hugging Face model card – https://huggingface.co/google/functiongemma-270m-it