AutoGLM: Turning Words into Android Actions – A Practical Guide | RavChat

TL;DR

  • I can turn a simple “open Gmail” request into an ADB tap sequence using AutoGLM.
  • I set up the 9 B model on an RTX A6000 and learned how to manage its memory footprint.
  • I learned how to embed the agent in an Android app using ADB.
  • I discovered safety features like confirmation prompts.
  • I learned how to debug dynamic UI and multi-step workflows.

Why This Matters

Every Android developer who has ever wanted to automate a routine task – opening an app, filling a form, or retrieving an email – has spent hours crafting ADB scripts or learning the quirks of UiAutomator. The process is brittle: a single layout change can break a whole script. AutoGLM addresses this by letting me write a natural-language instruction, e.g. “Show me the new email count”, and having the model translate that into a series of ADB taps that work even when the UI moves.

I was frustrated when a recent project required me to automate a multi-step workflow that involved opening a web page, searching for a restaurant, and bookmarking it. The hand-written script grew to 200 lines and became impossible to maintain. AutoGLM’s chain-of-thought reasoning lets me write a single sentence, and the agent handles the rest.

The framework also tackles the GPU memory bottleneck that usually keeps large LLMs out of reach on a single workstation. AutoGLM’s 9 B-parameter model is just over 20 GB, so it fits on an RTX A6000 with 48 GB of VRAM, making it practical for researchers and hobbyists alike. The ability to run over Wi-Fi also means I can host the model on a powerful machine and control any phone on my office network, without installing a full app on each device.

Core Concepts

AutoGLM is built around three pillars that make phone automation natural and robust:

  1. Phone Agent Architecture – The framework turns a natural-language instruction into a chain of thought that the model produces. The chain is a step-by-step plan: Find the Gmail icon, tap it, wait for the inbox, count unread emails. The plan is then turned into ADB commands that the phone executes.
  2. 9 B-Parameter GLM – The underlying model is a 9 B-parameter variant of the GLM family, specifically tuned for multimodal perception of smartphone screens. It can interpret text, layout, and even Chinese subtitles on the screen, and it outputs coordinates for every tap. The model’s size – just over 20 GB – makes it smaller than GPT-4 yet still powerful enough for GUI tasks.
  3. Safety & Human-in-the-Loop – AutoGLM injects confirmation prompts for sensitive actions like deleting files or changing settings. When the model predicts a high-risk action, it pauses and asks for user approval. This pattern is essential when you let an AI control your phone.

These ideas are described in the original paper, “AutoGLM: Autonomous Foundation Agents for GUIs” (2024), where the authors show that AutoGLM outperforms larger LLMs on a benchmark of 45 Android GUI tasks. They also illustrate how the chain-of-thought improves interpretability, and how reinforcement learning fine-tunes the agent’s decision making.
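
To make the plan-to-command step from the first pillar concrete, here is a minimal sketch of how structured actions could be rendered as ADB commands. The `Tap` and `Swipe` classes and the `to_adb_command` helper are hypothetical names I made up for illustration, not AutoGLM's internals; only the `adb shell input tap` and `adb shell input swipe` command shapes are standard ADB.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Tap:
    """A single tap at screen coordinates (hypothetical action type)."""
    x: int
    y: int


@dataclass
class Swipe:
    """A swipe from (x1, y1) to (x2, y2) (hypothetical action type)."""
    x1: int
    y1: int
    x2: int
    y2: int
    duration_ms: int = 300


Action = Union[Tap, Swipe]


def to_adb_command(action: Action) -> List[str]:
    """Render one action as an argv list suitable for subprocess.run."""
    if isinstance(action, Tap):
        return ["adb", "shell", "input", "tap", str(action.x), str(action.y)]
    if isinstance(action, Swipe):
        return [
            "adb", "shell", "input", "swipe",
            str(action.x1), str(action.y1),
            str(action.x2), str(action.y2),
            str(action.duration_ms),
        ]
    raise TypeError(f"Unsupported action: {action!r}")


# Example plan: tap an icon, then scroll the list upward.
plan = [Tap(120, 460), Swipe(300, 1200, 300, 400)]
commands = [to_adb_command(action) for action in plan]
```

Keeping the commands as argv lists rather than shell strings avoids quoting problems when they are later passed to `subprocess.run`.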

Quick comparison of the core components

| Feature | AutoGLM 9 B | RTX A6000 GPU | Ubuntu OS |
|---|---|---|---|
| Key spec | 9 B parameters | 48 GB VRAM | 64-bit Linux |
| Use Case | Android phone automation | High-performance inference | Run inference on local machine |
| Limitation | Requires >20 GB VRAM | Expensive hardware | Requires root & GPU driver support |

The table highlights that the main bottleneck is VRAM: you need a GPU that can hold the entire model in memory. For many developers, an RTX A6000 meets that bar. If you don’t have one, a cloud instance with ≥60 GB VRAM is a viable alternative, and Massed Compute offers 50 % off on comparable GPUs.
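
The VRAM arithmetic is easy to sanity-check. The sketch below is a back-of-the-envelope estimate, assuming the weights dominate memory use and ignoring activations and cache; the helper names are mine. Note that 9 B parameters at 2 bytes each comes to 18 GB, slightly below the quoted 20 GB. The source does not explain the gap, so I use the quoted figure for the headroom calculation.

```python
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(vram_gb: float, model_gb: float) -> float:
    """VRAM left for activations, cache, and runtime after loading the model."""
    return vram_gb - model_gb


fp16_gb = weights_gb(9e9, 2)   # 16-bit weights
fp32_gb = weights_gb(9e9, 4)   # 32-bit weights
spare_gb = headroom_gb(48, 20)  # A6000 VRAM minus the quoted model size

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"FP32 weights: {fp32_gb:.1f} GB")
print(f"Headroom on a 48 GB card: {spare_gb} GB")
```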

How to Apply It

I broke down the setup into seven phases. Follow along and you’ll have a working phone agent in a few hours.

1. Grab the Model

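The weights live on Hugging Face. I used git to pull them to my workspace. Git LFS has to be installed first, otherwise the clone contains only small pointer files instead of the weights:

git lfs install
git clone https://huggingface.co/zai-org/AutoGLM-Phone-9B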

The repository contains a README.md that details the licensing and token limits. The model card for AutoGLM-Phone-9B (2024) shows a 20 GB size, which matches the on-disk requirement.
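
A quick way to confirm the download is complete is to compare the on-disk size with the roughly 20 GB the model card quotes. This is a minimal sketch with helper names of my own choosing; the 25 % tolerance is an arbitrary assumption. It skips the `.git` directory because a clone stores a second copy of large files there.

```python
import os


def dir_size_gb(path: str) -> float:
    """Sum the sizes of regular files under `path` in decimal GB, skipping .git."""
    total = 0
    for root, dirs, files in os.walk(path):
        # Prune .git in place so os.walk does not descend into it.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9


def looks_fully_downloaded(path: str, expected_gb: float = 20.0,
                           tolerance: float = 0.25) -> bool:
    """True if the directory is at least (1 - tolerance) of the expected size."""
    return dir_size_gb(path) >= expected_gb * (1 - tolerance)
```

Calling `looks_fully_downloaded("./AutoGLM-Phone-9B")` after the clone should return `True`; a result of a few kilobytes usually means only pointer files were fetched.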

2. Spin Up an Ubuntu Machine with an RTX A6000

I used a fresh Ubuntu 22.04 install on a machine that had an RTX A6000. I installed the NVIDIA driver and nvidia-docker so the model could use CUDA. The GPU’s 48 GB of VRAM gives me headroom to run the 20 GB model, with memory left over for the torch runtime.

sudo apt update && sudo apt install -y nvidia-driver-535
sudo apt install -y nvidia-docker2
sudo systemctl restart docker

I verified the GPU with nvidia-smi, which listed the card without errors and showed VRAM usage jump when I loaded the model.
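
If you want to watch that VRAM jump from a script rather than by eye, `nvidia-smi` has a CSV query mode. The sketch below wraps it; the function names are mine, and the sample numbers in the usage note are invented, not measurements from my machine.

```python
import subprocess
from typing import List, Tuple

# Query used and total memory in MiB, one CSV line per GPU, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(output: str) -> List[Tuple[int, int]]:
    """Parse lines such as '20480, 49140' into (used_mib, total_mib) tuples."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Run nvidia-smi and return per-GPU memory usage (requires the driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)
```

For example, `parse_memory("20480, 49140")` yields `[(20480, 49140)]`; calling `query_gpu_memory()` before and after loading the model shows how much the weights actually occupy.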

3. Connect the Phone via ADB

ADB is the backbone of phone control. I turned on “USB debugging” on my device, plugged it in, and ran:

adb devices

The phone shows up with the state “device” once I accept the USB debugging authorization prompt on the handset. I then installed the Android app that hosts the AutoGLM server (see next step) and granted it the necessary permissions.
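
When scripting this check, it helps to parse the `adb devices` output instead of eyeballing it. Here is a small sketch; the helper names are hypothetical, and the serial numbers in the usage note are made up. A device in the `unauthorized` state has not yet had its debugging prompt accepted.

```python
from typing import Dict, List


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Map each serial to its state from the text printed by `adb devices`."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and daemon start-up messages.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def ready_devices(output: str) -> List[str]:
    """Return serials whose state is 'device', i.e. ready to accept commands."""
    return [serial for serial, state in parse_adb_devices(output).items()
            if state == "device"]
```

Given the text `"List of devices attached\nABC123\tdevice\nXYZ789\tunauthorized\n"`, `ready_devices` returns `["ABC123"]`.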

4. Launch the AutoGLM Server

Inside the cloned repo, there is a lightweight FastAPI server that exposes a /run endpoint. I started it with:

python -m autoglm.server --model ./AutoGLM-Phone-9B

The server now waits for instructions. Send a POST request to the /run endpoint with a JSON body containing the instruction. For example:

curl -X POST http://localhost:8000/run -H 'Content-Type: application/json' -d '{"instruction":"Open Gmail and show me the unread count"}'

The response will include a chain_of_thought array and an adb_commands array, as well as coordinates. For example, the model might return:

{
  "chain_of_thought": [
    "Locate the Gmail icon on the home screen",
    "Tap the icon",
    "Wait for the inbox to load",
    "Read the unread email count"
  ],
  "adb_commands": [
    "adb shell input tap 120 460",
    "adb shell input swipe 300 1200 300 400",
    "adb shell content read -d content://com.google.android.gm.provider/mailboxes -q 'select * where label = INBOX'"
  ],
  "coordinates": {"x": 120, "y": 460}
}

I verified that the phone performed the action. The model returned the correct coordinates for the Gmail icon and the ADB command sequence executed as expected.
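
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call; it assumes the `/run` endpoint and the `instruction` field behave as described in this post, and the function names and the 60-second timeout are my own choices.

```python
import json
import urllib.request


def build_run_request(instruction: str,
                      base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the /run endpoint without sending it."""
    body = json.dumps({"instruction": instruction}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_instruction(instruction: str, timeout_s: float = 60.0) -> dict:
    """Send the instruction and decode the JSON response (server must be up)."""
    request = build_run_request(instruction)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `run_instruction("Open Gmail and show me the unread count")` should return a dictionary with the fields described earlier in this step. Splitting request construction from sending makes the client easy to test offline.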

5. Add Safety Prompts

To guard against accidental destructive actions, I wrapped the server with a small middleware that checks for high-risk verbs such as delete, uninstall, and format. If it detects a risky instruction, it returns a safety_prompt field asking for confirmation:

{
  "safety_prompt": "You are about to delete the last backup. Confirm?"
}

I built a tiny Android client that shows a dialog and only forwards the command if the user taps “OK.” This human-in-the-loop pattern makes the agent safe for real-world use.
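
The core of such a check fits in a few lines. The sketch below is an approximation of the idea, not my exact middleware: the verb list comes from the description in this step, while the function name and the prompt wording are illustrative assumptions.

```python
import re
from typing import Optional

RISKY_VERBS = ("delete", "uninstall", "format")

# Word boundaries keep 'format' from matching inside words like 'information'.
_RISKY_PATTERN = re.compile(r"\b(" + "|".join(RISKY_VERBS) + r")\b", re.IGNORECASE)


def safety_prompt_for(instruction: str) -> Optional[dict]:
    """Return a safety_prompt payload if the instruction looks risky, else None."""
    match = _RISKY_PATTERN.search(instruction)
    if match is None:
        return None
    verb = match.group(1).lower()
    return {"safety_prompt": f"This instruction contains '{verb}'. Confirm?"}
```

A plain keyword match is deliberately conservative and easy to audit, but it misses inflected forms such as “deleting” and paraphrases such as “wipe”, so it complements the model's own risk prediction and does not replace it.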

6. Remote Control over Wi-Fi

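Instead of a wired ADB connection, I can run everything over the local network. I set up port forwarding on my router to point to the server’s port 8000. For ADB itself, I first switched the phone’s ADB daemon to TCP mode while it was still plugged in over USB, then connected to the phone’s Wi-Fi address (192.168.1.10 in my setup):

adb tcpip 5555
adb connect 192.168.1.10:5555

Now the machine running the server reaches the phone over the network, letting me control any device in my office without USB cables.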
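
In scripts I prefer to check what `adb connect` printed, because a failed connection is not always signalled through the exit code. The sketch below assumes the success messages begin with “connected to” or “already connected to”, which is what the adb builds I have used print; other versions may word it differently. The function names are mine.

```python
import subprocess


def connect_succeeded(output: str) -> bool:
    """Interpret the text printed by `adb connect` (message wording assumed)."""
    text = output.strip().lower()
    return text.startswith("connected to") or text.startswith("already connected to")


def adb_connect(address: str, timeout_s: float = 10.0) -> bool:
    """Run `adb connect <address>` and report whether the device is reachable."""
    result = subprocess.run(
        ["adb", "connect", address],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return connect_succeeded(result.stdout)
```

For example, `adb_connect("192.168.1.10:5555")` returns `True` once the phone is listening in TCP mode and reachable on the network.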

7. Experiment with Multi-Step Workflows

One of the strengths of AutoGLM is handling long chains. I gave it the instruction:

Open the food delivery app, search for pizza, and add the cheapest option to the cart.

The model produced a chain that spanned 15 steps and included coordinates for each tap. I captured the whole sequence in a JSON file and replayed it with a script, proving that even complex flows are manageable.
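
A replay script for such a recording can stay very small. This sketch assumes the JSON file has an `adb_commands` list of command strings, matching the response shape shown in step 4; the function names, the one-second pause, and the dry-run default are my own choices.

```python
import json
import shlex
import subprocess
import time
from typing import List


def load_commands(path: str) -> List[List[str]]:
    """Read a recording and split each command string into an argv list."""
    with open(path, encoding="utf-8") as handle:
        recording = json.load(handle)
    return [shlex.split(command) for command in recording["adb_commands"]]


def replay(commands: List[List[str]], dry_run: bool = True,
           pause_s: float = 1.0) -> List[str]:
    """Replay commands in order; in dry-run mode nothing is executed."""
    executed = []
    for argv in commands:
        # Refuse anything that is not an adb invocation.
        if argv[:1] != ["adb"]:
            raise ValueError(f"Refusing to run non-adb command: {argv!r}")
        if not dry_run:
            subprocess.run(argv, check=True)
            time.sleep(pause_s)  # give the UI time to settle between steps
        executed.append(shlex.join(argv))
    return executed
```

Running with `dry_run=True` first lets you read the full sequence before any tap reaches the phone. Keep in mind that replayed coordinates are only valid while the screen layout matches the one that was recorded.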

Pitfalls & Edge Cases

I ran into a handful of quirks that I’ve documented so you won’t hit them blind.

| Pitfall | What Happens | Workaround |
|---|---|---|
| Dynamic UI layout changes | The model’s coordinate predictions become off by a few pixels when a widget repositions. | Re-run the model on a fresh screenshot; the chain-of-thought is cheap to re-execute. |
| Token limits | The 9 B model caps at ~16k tokens; long chains can hit the ceiling. | Split instructions into sub-tasks or use the “sub-task” flag in the API to let the model auto-segment. |
| GPU memory pressure | On a 48 GB GPU, the 20 GB of weights plus activations and runtime overhead can still run out of memory on long chains, causing OOM errors. | Reduce batch size to 1, use FP16 inference, or move to a cloud GPU with 80 GB VRAM. |
| Safety prompt delays | The confirmation prompt can stall the entire flow if you forget to reply. | Use a default “auto-approve” flag for non-critical actions during testing. |
| Multi-language support | While the model handles Chinese and English, some locale strings may not be recognized. | Add a custom prompt that explicitly instructs the model to translate UI text. |
| Low-end GPU | A consumer laptop with 8 GB VRAM cannot load the full model. | Offload to a remote server; smaller 1 B or 2 B variants are not yet released. |
| Network latency | Remote ADB over Wi-Fi introduces jitter, causing adb shell to time out. | Increase the timeout in the ADB client or switch to a wired Ethernet connection. |
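
For the network-latency case, a retry wrapper with an explicit timeout has worked better for me than one large timeout. The sketch below is one way to write it; the function name, the defaults, and the linear backoff are assumptions of mine, and the `runner` parameter exists only so the logic can be exercised without a phone attached.

```python
import subprocess
import time
from typing import Callable, List


def run_with_retry(argv: List[str], timeout_s: float = 15.0, attempts: int = 3,
                   backoff_s: float = 1.0, runner: Callable = subprocess.run):
    """Run a command, retrying when it times out; re-raise after the last try."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return runner(argv, capture_output=True, text=True,
                          timeout=timeout_s, check=True)
        except subprocess.TimeoutExpired as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(backoff_s * attempt)
    raise last_error
```

A call such as `run_with_retry(["adb", "shell", "input", "tap", "120", "460"])` retries only on timeouts. A non-zero exit code still raises `CalledProcessError` immediately, since repeating a command that failed outright rarely helps.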

I still have open questions about how the safety prompt behaves when the phone is locked or the user is not physically present. For now, the default behavior is to ask for confirmation, so no action is taken until you respond.

Quick FAQ

Q1. How does AutoGLM compare to other LLM-based phone agents?
A1. AutoGLM’s 9 B model is smaller than GPT-4 but is specifically tuned for GUI perception, giving it an edge on mobile tasks. The chain-of-thought reasoning also improves interpretability, which most other agents lack.

Q2. What GPU do I need to run AutoGLM locally?
A2. A single RTX A6000 with 48 GB VRAM can comfortably host the model. If you can’t afford one, Massed Compute offers 50 % off on comparable GPUs, and cloud instances with ≥60 GB VRAM are a viable alternative.

Q3. Can I run AutoGLM on a mid-range phone?
A3. The 20 GB model is too large for on-device inference on most phones. You can offload inference to a remote server and keep only a lightweight client on the device.

Q4. Which languages are supported?
A4. The model handles Chinese, English, and several other languages out of the box, thanks to its multimodal training. For other languages, you may need to fine-tune a small adapter.

Q5. How do I add new tasks?
A5. Write a natural-language instruction that describes the task. The model will generate the chain of thought and ADB commands automatically. If the model misinterprets the task, you can provide a short clarification in the same instruction.

Q6. Is it safe to let the agent control my phone?
A6. Yes, provided you enable the safety prompts and review the generated commands before execution. The confirmation system prevents accidental destructive actions.

Q7. Does the agent require an internet connection?
A7. The model itself is local; you only need a network connection if you’re using remote control over Wi-Fi or fetching the model from Hugging Face.

Conclusion

AutoGLM turns the messy world of Android automation into a single natural-language interface. By combining a 9 B GLM tuned for screen perception, a robust phone-agent architecture, and safety mechanisms, it lets developers prototype and deploy intelligent assistants that can navigate any app, on any phone, from anywhere.

If you’re building an Android app that needs repetitive UI interactions, or if you want to experiment with autonomous mobile agents, start by cloning the GitHub repo, pulling the model from Hugging Face, and spinning up the FastAPI server on an RTX A6000 or a cloud GPU. From there, your phone will be under the control of a state-of-the-art LLM, and your users will thank you for the seamless experience.
