Definition

A transformer policy is a robot control policy built on the transformer architecture — the same attention-based neural network design that powers large language models. Instead of processing text tokens, a transformer policy processes robot tokens: image patches from cameras, proprioceptive state vectors (joint angles, gripper state), language instruction embeddings, and action history. The attention mechanism allows the model to dynamically weight which inputs are most relevant for each action decision, enabling flexible multi-modal reasoning.

Transformers have rapidly become the dominant architecture for robot learning, displacing the CNN+MLP and RNN designs that preceded them. Their success stems from three key properties: they handle variable-length, heterogeneous input sequences; they scale effectively to large datasets and model sizes; and they naturally support multi-task conditioning through language or task embeddings.

How It Works

A transformer policy converts all inputs into a sequence of tokens that are processed through self-attention layers. The typical pipeline:

1. Tokenization: Camera images are processed through a vision encoder (ResNet, ViT, EfficientNet, or SigLIP) to produce a set of visual tokens. Proprioceptive state (joint angles, end-effector pose, gripper opening) is projected through a linear layer into the token dimension. Language instructions are tokenized by a text encoder. Previous actions may be included as additional tokens.

2. Sequence construction: All tokens are concatenated into a single sequence, with positional embeddings to distinguish token types and positions. For multi-camera setups, each camera produces its own set of visual tokens.

3. Transformer processing: The token sequence passes through multiple transformer layers (typically 4-12 layers for manipulation policies). Self-attention allows each token to attend to all other tokens, enabling the model to correlate visual observations with proprioceptive state and language instructions.

4. Action decoding: A decoder head (MLP or transformer decoder) converts the processed token representations into action predictions. For action chunking policies, this outputs a sequence of future actions. For single-step policies, it outputs one action per forward pass.
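The four steps above can be sketched end-to-end in PyTorch. This is a minimal, illustrative toy (all dimensions, the single-camera setup, and the mean-pooled decoder head are our own simplifying assumptions; real policies use pretrained vision encoders and more elaborate decoders):

```python
import torch
import torch.nn as nn

class TinyTransformerPolicy(nn.Module):
    """Toy transformer policy: tokenize -> concatenate -> attend -> decode."""

    def __init__(self, d_model=256, n_layers=4, n_heads=8,
                 proprio_dim=14, action_dim=14, chunk_size=8):
        super().__init__()
        # 1. Tokenization: patchify the image, project proprio and language
        #    embeddings into the shared token dimension.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.proprio_proj = nn.Linear(proprio_dim, d_model)
        self.lang_proj = nn.Linear(512, d_model)  # assumed text-encoder dim
        self.pos_embed = nn.Parameter(torch.zeros(1, 256, d_model))
        # 3. Transformer trunk (self-attention over all tokens).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        # 4. Action decoding: MLP head predicting a chunk of future actions.
        self.head = nn.Linear(d_model, action_dim * chunk_size)
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, image, proprio, lang_emb):
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        prop = self.proprio_proj(proprio).unsqueeze(1)            # (B, 1, D)
        lang = self.lang_proj(lang_emb).unsqueeze(1)              # (B, 1, D)
        # 2. Sequence construction: one token sequence, plus positions.
        tokens = torch.cat([vis, prop, lang], dim=1)
        tokens = tokens + self.pos_embed[:, : tokens.shape[1]]
        out = self.trunk(tokens)
        # Pool and decode into (chunk_size, action_dim) future actions.
        actions = self.head(out.mean(dim=1))
        return actions.view(-1, self.chunk_size, self.action_dim)

policy = TinyTransformerPolicy()
image = torch.randn(1, 3, 224, 224)  # one camera frame
proprio = torch.randn(1, 14)         # joint angles + gripper
lang_emb = torch.randn(1, 512)       # instruction embedding
actions = policy(image, proprio, lang_emb)
print(actions.shape)  # torch.Size([1, 8, 14]) -- an 8-step action chunk
```

Note how the architecture is agnostic to where tokens come from: the trunk only ever sees a single sequence, which is what makes adding cameras or modalities cheap.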

Key Models

  • ACT (Action Chunking with Transformers) — A CVAE + transformer encoder-decoder that predicts chunks of 8-50 future joint-space actions. Designed for bimanual manipulation with the ALOHA system. Fast inference (~4ms), data-efficient (20-50 demonstrations), and the default policy in LeRobot. See our Action Chunking entry for details.
  • RT-1 (Robotics Transformer 1) — Google DeepMind's first large-scale transformer policy. Uses an EfficientNet vision backbone with a transformer trunk that processes tokenized images and outputs discretized actions. Trained on 130,000 real-world demonstrations across 700+ tasks. Demonstrated strong multi-task generalization but required massive data collection.
  • RT-2 (Robotics Transformer 2) — Extends RT-1 by using a Vision-Language Model (PaLM-E or PaLI-X) as the backbone. Actions are encoded as text tokens, allowing the model to leverage internet-scale visual and language pretraining for zero-shot transfer to novel objects and instructions.
  • Decision Transformer — Reformulates RL as sequence modeling: given a desired return, observation history, and action history, the transformer predicts the next action that would achieve that return. Eliminates the need for value functions and Bellman backups, treating control as a conditional sequence generation problem.
  • Octo — An open-source generalist robot policy from UC Berkeley. A transformer trained on the Open X-Embodiment dataset (800K+ episodes across 22 robot types). Designed for fine-tuning: users adapt Octo to their specific robot and task with a small amount of in-domain data.
  • Pi0 (Physical Intelligence) — A flow-matching-based transformer policy trained on diverse manipulation data. Combines the generality of foundation models with the precision of flow-matching action generation. Demonstrates strong zero-shot and fine-tuned performance across manipulation tasks.

Why Transformers Help

Long-context reasoning: Attention allows the model to relate current observations to events many timesteps ago. This is critical for multi-step tasks where the robot must remember what it has already done (e.g., which items it has already picked) or track slowly changing state.
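This difference from recurrent memory can be made concrete: if the last K observations are kept as tokens, the current token attends directly to any past token, rather than relying on information surviving K updates of a fixed-size hidden state. A small illustrative sketch (dimensions are arbitrary):

```python
import torch
import torch.nn as nn

d_model, history_len = 128, 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

# 32 past observation tokens kept in context.
obs_history = torch.randn(1, history_len, d_model)
out = layer(obs_history)
# The token at position 31 (current step) attends directly to the token
# at position 0 (32 steps ago) in a single attention operation.
print(out.shape)  # torch.Size([1, 32, 128])
```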

Multi-modal fusion: The token-based interface naturally accommodates heterogeneous inputs. Images, joint states, force readings, tactile data, and language instructions are all converted to tokens and processed uniformly. Adding a new modality requires only a new tokenizer, not a redesign of the entire architecture.
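As an illustration of this point, adding a force-torque or tactile modality to a token-based policy amounts to one new projection layer into the shared token dimension (the dimensions below are hypothetical):

```python
import torch
import torch.nn as nn

d_model = 256
force_proj = nn.Linear(6, d_model)     # 6-axis wrench -> one token
tactile_proj = nn.Linear(64, d_model)  # flattened tactile array -> one token

force = torch.randn(1, 6)
tactile = torch.randn(1, 64)
new_tokens = torch.stack(
    [force_proj(force), tactile_proj(tactile)], dim=1)  # (1, 2, 256)
# These tokens are simply concatenated with the existing visual,
# proprioceptive, and language tokens before the transformer trunk;
# the trunk itself is unchanged.
print(new_tokens.shape)
```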

Scalability: Transformers scale predictably with data and compute. Performance improves log-linearly with dataset size and model parameters, following scaling laws similar to those observed in language models. This makes them the architecture of choice for robot foundation model efforts.

Multi-task conditioning: Language-conditioned transformers can execute different tasks based on text instructions, enabling a single model to handle hundreds of manipulation skills. This is a fundamental shift from single-task policies that must be retrained for each new behavior.

Comparison with CNN-Based and RNN-Based Policies

CNN+MLP policies (e.g., ResNet encoder + MLP action head) are simpler, faster to train, and sufficient for single-task manipulation with one or two cameras. They lack temporal reasoning (no memory of past observations) and scale poorly to multi-task settings. Still a strong baseline for simple tasks.

RNN-based policies (LSTM, GRU) add temporal modeling by maintaining a hidden state across timesteps. However, they struggle with long-range dependencies (information decays over time), are difficult to parallelize during training, and do not naturally handle multi-modal inputs. Largely superseded by transformers for robot learning.

Transformers combine the spatial feature extraction of CNNs (in the vision encoder) with superior temporal and multi-modal reasoning. The cost is higher compute requirements and more complex training pipelines. For teams with limited compute, CNN+MLP remains practical; for teams building general-purpose manipulation systems, transformers are the clear direction.

Compute Requirements

Training: Small transformer policies (ACT-scale: ~10M parameters) train in 2-4 hours on a single RTX 4090 with 50-200 demonstrations. Medium-scale models (Octo-scale: ~100M parameters) require 1-4 A100 GPUs for 1-3 days. Large foundation models (RT-2-scale: 5-55B parameters) require clusters of 64-256 TPUs or A100s for weeks.

Inference: Real-time robot control requires action generation at 10-50 Hz. Small transformers (ACT) run at 200+ Hz on a consumer GPU, well within budget. Medium models (Octo) achieve 10-30 Hz on an RTX 4090. Large VLAs (RT-2) require server-grade GPUs (A100, H100) and may need action chunking to amortize inference cost across multiple timesteps.
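The amortization arithmetic behind that last point is simple: with action chunking, one forward pass yields several actions, so a slow model can still meet a real-time control rate. A back-of-envelope check (the specific numbers below are illustrative):

```python
def effective_control_hz(inference_hz: float, chunk_size: int) -> float:
    """Actions delivered per second when each forward pass yields a chunk."""
    return inference_hz * chunk_size

# A large model managing only 3 forward passes per second, but predicting
# 16-action chunks, still supplies 48 actions/s -- comfortably inside the
# 10-50 Hz control budget.
print(effective_control_hz(3.0, 16))  # 48.0
```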

Edge deployment: For robots without GPU workstations (mobile robots, low-cost arms), model distillation or quantization (INT8, INT4) enables deployment on NVIDIA Jetson Orin or similar edge devices. Inference latency increases but remains feasible for 10 Hz control.
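One low-effort route to the INT8 deployment described above is PyTorch's dynamic quantization, which converts linear-layer weights to 8-bit integers post-training. A sketch using a placeholder model (the layer sizes stand in for a real policy's MLP head):

```python
import torch
import torch.nn as nn

# Placeholder policy head; in practice this would be the trained model.
model = nn.Sequential(
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 14),  # 14-DoF action output
)

# Dynamically quantize all Linear layers to INT8 weights.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

obs = torch.randn(1, 256)
action = quantized(obs)
print(action.shape)  # torch.Size([1, 14])
```

Dynamic quantization only touches weights and is the simplest option; static quantization or INT4 schemes require calibration data but squeeze out more memory and latency on devices like Jetson Orin.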

Key Papers

  • Zhao, T. et al. (2023). "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS 2023. Introduces ACT, the most widely used transformer policy for imitation learning.
  • Brohan, A. et al. (2023). "RT-1: Robotics Transformer for Real-World Control at Scale." RSS 2023. Demonstrates large-scale multi-task transformer policies trained on 130K real demonstrations.
  • Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. Shows that VLM backbones enable zero-shot transfer to novel objects and instructions.
  • Chen, L. et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling." NeurIPS 2021. Reformulates RL as sequence prediction, eliminating value functions.
  • Team, O. et al. (2024). "Octo: An Open-Source Generalist Robot Policy." RSS 2024. Open-source cross-embodiment transformer policy designed for community fine-tuning.

Related Terms

  • Action Chunking — The chunked action prediction technique used by ACT and other transformer policies
  • Diffusion Policy — An alternative action generation method that can use transformer backbones
  • Foundation Model — Large pretrained models that transformer policies scale toward
  • Imitation Learning — The primary training paradigm for transformer manipulation policies
  • Zero-Shot Transfer — Language-conditioned transformers enable zero-shot task execution

Apply This at SVRC

Silicon Valley Robotics Center provides GPU workstations for training transformer policies, from lightweight ACT models on a single RTX 4090 to multi-GPU Octo fine-tuning runs. Our data platform integrates with LeRobot and Hugging Face for seamless dataset management and model training. We can help you select the right transformer architecture for your task, data budget, and compute constraints.

Explore Data Services   Contact Us