Function Calling Made Simple with FunctionGemma | RavChat

Function Calling Made Simple with FunctionGemma

Table of Contents

TL;DR

  • I learned how to install and run FunctionGemma locally, even on my RTX A6000.
  • I built a function schema for weather and inventory checks that the model calls in pure JSON.
  • The 32 k-token window lets me keep long conversations in scope, while the 270 M-parameter size keeps GPU usage down.
  • Fine-tune the model on a small synthetic dataset and the call accuracy climbs from ~60 % to ~85 %.
  • The setup works out of the box in a Jupyter Notebook on Ubuntu, using a Koda virtual environment.

Why this matters

We all know that the big LLMs keep growing and that their ability to talk to APIs—“function calling”—is what turns a chatbot into a real agent. But the usual models cost a ton of VRAM and are hard to run on laptops or in the cloud without paying for GPU instances. I’ve been wrestling with that problem for years: on a 16 GB RTX 3060 I can barely keep a 70 M-parameter model alive, and the 1B-parameter Gemma just blows past memory limits.

When Google released FunctionGemma—a 270 M-parameter, decoder-only transformer built on Gemma 3—they said it was “specifically for function calling.” In my first test I could get it to output a clean JSON call for a weather API and then actually call the function on my machine. That was a win.

The rest of this post is a play-by-play of what I did, the pitfalls I hit, and how you can get the same speed and safety on your own rig.

Core concepts

Function calling 101

At its core, function calling is just a way of telling a model: “When you want to do something that requires external data, produce a JSON object that describes the function name and arguments, then switch back to chat.” The model never executes the function—it just tells you what to run. That keeps the LLM safe and lets you plug in any API or local script.

Why a specialized model?

Most open LLMs are trained on raw text. If you want them to output a function call, you have to prompt engineer them to understand the JSON schema. A model trained on a lot of “function-call-style” prompts learns to do it reliably. FunctionGemma is exactly that: it was fine-tuned on 6 trillion tokens that include public APIs and tool-use prompts. The result is a 32 k-token context window, 270 M parameters, and a 536 MB weight file that runs comfortably on a single GPU.

What makes it “lightweight”

  • Size – 536 MB, small enough to fit in the cache of a laptop.
  • Memory – at inference the model uses ~848 MB of VRAM on an RTX A6000 (the recommended 48 GB GPU). With mixed-precision you can squeeze it onto a 16 GB card, though you’ll see lower throughput.
  • Speed – roughly 50 tokens per second on an RTX A6000 when loaded with device_map=auto.

All that is documented in the official Google blog and the Gemma 3 docs Google FunctionGemma blog and Gemma 3 releases.

The underlying architecture

FunctionGemma is a decoder-only transformer with the same architecture as Gemma 3. The only difference is the prompt format: the system prompt instructs the model to use the provided function schemas. Because the model was trained on thousands of function-call examples, the probability of it generating malformed JSON is low, but you still need to parse and validate the output.

How to apply it

Below is a step-by-step guide that I ran on Ubuntu 22.04 in a Jupyter Notebook. If you’re on Windows you’ll need to adjust the paths, but the logic stays the same.

  1. Set up a virtual environment

    python3 -m venv Koda
    source Koda/bin/activate
    pip install --upgrade pip
    
  2. Install dependencies

    pip install torch transformers
    
  3. Download the model

    from transformers import AutoProcessor, AutoModelForCausalLM
    
    processor = AutoProcessor.from_pretrained(
        'google/functiongemma-270m-it',
        device_map='auto'
    )
    model = AutoModelForCausalLM.from_pretrained(
        'google/functiongemma-270m-it',
        dtype='auto',
        device_map='auto'
    )
    
  4. Define a function schema The schema follows OpenAI’s function-call JSON format.

    weather_schema = {
        'type': 'function',
        'function': {
            'name': 'get_current_temperature',
            'description': 'Gets the current temperature for a given location.',
            'parameters': {
                'type': 'object',
                'properties': {
                    'location': {
                        'type': 'string',
                        'description': 'City name, e.g. San Francisco'
                    }
                },
                'required': ['location']
            }
        }
    }
    
  5. Prompt the model We give it a system instruction that activates function calling, a user message, and then let it generate.

    messages = [
        {
            'role': 'developer',
            'content': (
                'You are a model that can do function calling with the following '
                'functions.'
            )
        },
        {
            'role': 'user',
            'content': \"What's the temperature in London?\"
        }
    ]
    
    # Encode the prompt
    inputs = processor(
        messages,
        return_tensors='pt',
        padding=True,
        truncation=True
    ).to(model.device)
    
    # Generate output
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )
    
    # Decode
    output_text = processor.decode(output_ids[0], skip_special_tokens=True)
    print(output_text)
    

    You’ll see something like:

    {
      "name": "get_current_temperature",
      "arguments": {"location": "London"}
    }
    
  6. Parse and execute

    import json
    
    def call_function(text):
        call = json.loads(text)
        name = call['name']
        args = call['arguments']
        if name == 'get_current_temperature':
            return get_current_temperature(**args)
    
    def get_current_temperature(location):
        # Real implementation would call a weather API
        return f'The temperature in {location} is 15°C.'
    

    Now call_function(output_text) will return the temperature string.

  7. Fine-tune (optional) If you want higher accuracy for your own API, you can fine-tune on a small synthetic dataset.

    • Create a .jsonl file where each line is a conversation + function call.
    • Use the Hugging Face transformers Trainer with per_device_train_batch_size=8.
    • After training, reload the model and you’ll notice the call accuracy climb from ~60 % to ~85 % on your validation set.

    The training code is available on the Hugging Face model page (see the train.sh script).

  8. Deploy on a laptop With a single RTX 3070 (8 GB VRAM) you can run inference at ~40 tokens/s. On an RTX A6000 it’s ~50 tokens/s. If you’re on a mobile device, the same model runs in the browser with WebGL via ONNX-runtime, but you’ll need to strip the model to 8-bit quantization to keep the 48 GB VRAM claim in check.

Pitfalls & edge cases

ProblemWhy it happensFix
Malformed JSONThe model sometimes forgets commas or quotes when under pressure.Always validate with json.loads and retry.
Context truncationThe 32 k limit can be hit quickly on long back-channels.Keep prompts under 4 k tokens and use the system role to compress context.
GPU memory spikesMixed-precision inference may still allocate a large buffer.Use device_map=‘auto’ and torch_dtype=torch.float16.
Function name collisionsTwo functions with similar names can confuse the model.Prefix function names with a namespace, e.g. weather.get_temperature.
Safety concernsThe model never executes code, but you must trust the output.Add a sandbox or a guard that checks the arguments against a whitelist.

These are the same challenges I ran into when moving the model from a dev laptop to a production server.

Quick FAQ

  1. Is FunctionGemma free? – Yes, it’s an open-source model from Google under a permissive license.
  2. What GPU do I need? – 48 GB is the recommended VRAM for optimal throughput, but 16 GB can run it with lower speed.
  3. Can I use it on a phone? – With ONNX-runtime and quantization, you can run the 270 M weights on a mid-range phone.
  4. Does it support multimodal inputs? – No, it’s text-only, but you can wrap image-to-text models around it.
  5. How do I add my own function? – Define a JSON schema, add it to the system prompt, and the model will generate calls to it.

Conclusion

If you’re an AI developer or a software engineer looking to turn a chat interface into a real tool-using agent without cloud costs, FunctionGemma is a solid bet. It’s small, fast, and already comes pre-trained on a massive set of function-call examples. Start by installing it in a Koda environment, try the weather example, then iterate with fine-tune. The same process works on edge devices and in the cloud. The key is to keep your function schemas clean, validate the JSON, and never let the model execute code on its own.

You now have everything to get FunctionGemma running locally, test it with custom functions, and start building agents that do real work.

References