Teaching a Robot Arm to 'See': SO-101 Imitation Learning with WALL-OSS
OpenClaw SO101 Robot Arm Demo
1. Why Do This?
If large language models taught AI to "speak," then Embodied AI teaches AI to "act" — not just to understand human language, but to execute actions in the physical world.
This is much harder than it sounds. Traditional robot control relies on precise programming: every joint angle and every motion trajectory must be designed by hand. The moment the environment changes (say, an object shifts a few centimeters), the entire program may fail.
Imitation Learning offers a fundamentally different path: let a human demonstrate the task first, and the robot learns by watching. It is like a master teaching an apprentice — no control code needs to be written, you simply teach it "hand in hand."
The core goal of this project is end-to-end Vision-Language-Action (VLA) closed-loop control on a low-cost open-source robot arm: you give a one-sentence instruction, and the arm completes the task autonomously.

2. Hardware Platform: Open-Source Six-Axis Robot Arm SO-101
The project uses the SO-101 six-axis robot arm, co-designed by TheRobotStudio and HuggingFace. The reasons for choosing it are clear:
- Fully open-source: from 3D printing blueprints to control code, everything is public, with a BOM cost of approximately $300-400;
- Native LeRobot ecosystem support: HuggingFace's LeRobot framework has built-in SO-101 drivers, calibration, and teleoperation interfaces, ready to use out of the box;
- Leader-Follower dual-arm architecture: one arm (Leader) is used for human teleoperation demonstration, while the other arm (Follower) synchronously executes, naturally suited for imitation learning data collection workflows.
The hardware assembly process includes: servo assembly and ID configuration → 3D-printed part installation → dual-arm calibration. The entire process can be completed by a single person, taking about one to two days from parts to a working teleoperation setup.

3. Data Collection: 110 Teleoperation Demonstrations
The core of imitation learning is high-quality demonstration data. The data collection workflow is as follows:
Human operates Leader arm → Follower arm synchronously replicates → Dual cameras record → LeRobot dataset
For the task of "precisely placing an object onto the back of a mobile robot dog," I collected 110 demonstration trajectories via teleoperation. Each trajectory contains:
- Angle sequences for 6 joints
- Video from the wrist-mounted RGB camera (640×480, 30 fps)
- Video from the side-mounted Intel RealSense L515 depth camera (640×480, 30 fps)
- Natural language task description
The dataset is stored in LeRobot v3.0 format and can be directly used for model training within the HuggingFace ecosystem.
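To make the per-timestep contents concrete, here is a minimal sketch of one recorded frame and of the leader-to-follower mirroring step. It is illustrative only: the `Frame` class, its field names, and `record_step` are hypothetical and are not the LeRobot API or its on-disk schema. Camera frames are stood in for by raw bytes, and the choice to read the follower state before sending the new target is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

NUM_JOINTS = 6  # SO-101 joint count, from the text
FPS = 30        # camera frame rate, from the text


@dataclass
class Frame:
    """One timestep of a demonstration (hypothetical layout)."""
    timestamp: float           # seconds since the episode started
    joint_angles: List[float]  # follower joint readings (the observed state)
    action: List[float]        # leader joint readings (the commanded target)
    wrist_image: bytes         # placeholder for the wrist camera frame
    side_image: bytes          # placeholder for the side camera frame
    task: str                  # natural language task description

    def __post_init__(self) -> None:
        # Reject frames whose joint vectors do not match the arm.
        for name in ("joint_angles", "action"):
            if len(getattr(self, name)) != NUM_JOINTS:
                raise ValueError(f"{name} must have {NUM_JOINTS} values")


def record_step(
    index: int,
    read_leader: Callable[[], List[float]],
    read_follower: Callable[[], List[float]],
    write_follower: Callable[[List[float]], None],
    grab_wrist: Callable[[], bytes],
    grab_side: Callable[[], bytes],
    task: str,
) -> Frame:
    """Mirror the leader onto the follower and capture one frame."""
    target = read_leader()    # where the human moved the leader arm
    state = read_follower()   # follower state before the new command (assumed order)
    write_follower(target)    # follower replicates the leader
    return Frame(
        timestamp=index / FPS,
        joint_angles=state,
        action=target,
        wrist_image=grab_wrist(),
        side_image=grab_side(),
        task=task,
    )
```

Calling `record_step` once per camera tick and appending the results to a list would give one trajectory; in the actual project this bookkeeping is handled by the LeRobot recording tools.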
4. Model: Fine-tuning the WALL-OSS VLA Foundation Model
Classic imitation learning policies such as ACT and Diffusion Policy, in their standard form, map observations (camera images and robot state) directly to actions and are not designed to follow natural language instructions.
WALL-OSS is an open-source Vision-Language-Action (VLA) foundation model from the X-Square Robot team, built on the Qwen2.5-VL-3B backbone. Its key design highlights include:
- Tightly coupled MoE architecture: vision-language experts and action experts are separated, preserving VLM semantic understanding while enabling high-frequency continuous action generation;
- Flow Matching action generation: replaces discrete action tokenization with continuous flow matching, achieving smooth, high-precision trajectory prediction;
- Two-stage training (Inspiration → Integration): first injects discrete action priors into the VLM, then achieves fine-grained control through flow matching.
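The flow matching idea can be illustrated with a toy sampler: start from Gaussian noise and integrate a velocity field over time until the sample becomes an action vector. The sketch below is not WALL-OSS code. In a real VLA model the velocity comes from a learned network conditioned on images, language, and robot state; here `make_toy_velocity` is a hypothetical closed-form stand-in that points straight at a known target, and the convention that noise sits at t = 0 and the action at t = 1 is an assumption.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_action(
    velocity: Callable[[Vector, float], Vector],
    dim: int,
    steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                       # t stays strictly below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Velocity field of a straight-line path from the current point to `target`.

    Stand-in for a learned network: it already knows the answer.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


if __name__ == "__main__":
    target = [0.5, -0.2, 0.1, 0.0, 0.3, 1.0]  # made-up 6-joint action
    action = sample_action(make_toy_velocity(target), dim=len(target))
    print([round(a, 3) for a in action])
```

Because the output is a continuous vector rather than a sequence of discrete tokens, there is no quantization of joint values, which is the property the bullet on flow matching refers to.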
Fine-tuning was completed on a multi-GPU lab server using the LeRobot training pipeline. With 110 demonstrations, convergence was achieved within a few hours:
LeRobot dataset → WALL-OSS pretrained weights → Full fine-tuning → Deployable policy
The model was not quantized; inference on a single GPU occupies approximately 18GB of VRAM.
5. Inference Deployment: GPU Service + Real-Time Control
Deployment splits the system into two parts, a GPU inference service and a PC-side real-time controller:
| Role | Device | Responsibility |
|---|---|---|
| Inference Server | GPU Server (single card) | Loads the WALL-OSS model, receives observation data, returns action predictions |
| Control End | PC + SO-101 + Dual Cameras | Captures real-time images/joint states → sends inference requests → executes action commands |
In each control cycle, the PC packages wrist camera images and joint states, sends them to the GPU inference service, and immediately executes the received action predictions, forming an end-to-end perception-decision-execution closed loop.
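The control cycle described above can be sketched as a fixed-rate loop on the PC side. This is a hedged illustration, not the project's client script: `control_loop` and its callback parameters are hypothetical, the transport to the GPU server is hidden behind `request_action`, and the 30 Hz default is an assumed value rather than a figure from the project.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Observation = Dict[str, Any]
Action = List[float]


def control_loop(
    get_observation: Callable[[], Observation],
    request_action: Callable[[Observation], Optional[Action]],
    send_action: Callable[[Action], None],
    instruction: str,
    hz: float = 30.0,        # assumed control rate
    max_steps: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run perceive -> infer -> act cycles; return how many actions were executed."""
    period = 1.0 / hz
    steps_done = 0
    for _ in range(max_steps):
        start = time.monotonic()
        obs = get_observation()            # camera images + joint states
        obs["instruction"] = instruction   # language command goes with every request
        action = request_action(obs)       # round trip to the inference service
        if action is None:                 # service signals that the episode is over
            break
        send_action(action)                # execute on the follower arm immediately
        steps_done += 1
        remaining = period - (time.monotonic() - start)
        if remaining > 0:                  # keep a steady rate when inference is fast
            sleep(remaining)
    return steps_done
```

Keeping the model behind a single `request_action` call is what lets the PC stay lightweight: it only needs camera and servo drivers, while the roughly 18GB model lives on the GPU server.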
6. Results
After fine-tuning on 110 demonstration trajectories, the model successfully learned to grasp a target object from the table and precisely place it onto the back platform of a mobile robot dog — a task involving spatial localization, path planning, and precise end-effector control.
Key results:
- ✅ Task completion: reliably completes the full "grasp → move → place" closed-loop process
- ✅ Generalization capability: adapts to objects at different initial positions
- ✅ Memory efficiency: 18GB on a single GPU, full FP16 inference without quantization
More notably, the combination of low-cost open-source hardware, an open-source VLA model, and fine-tuning on a small demonstration set shows that embodied intelligence can be deployed quickly in a lab setting, without expensive industrial robot arms or massive datasets.
Subsequently, this SO-101 robot arm was integrated into the Multi-Robot Agent Collaboration Framework, forming a "two dogs, one arm" system together with two robot dogs (see the collaboration demo video below), achieving a full-stack closed loop of cloud decision-making, edge scheduling, and terminal execution.
OpenClaw Two-Dogs-One-Arm Demo Video. Task: Robot Dog 1 goes to the loading zone to pick up cargo, then moves to the unloading zone to unload. Meanwhile, Dog 2 goes to the docking zone and waits; when the robot arm starts loading, Dog 2 then moves to the charging zone.
Further Reading
- SO-101 Open-Source Hardware: TheRobotStudio/SO-ARM-100
- LeRobot Framework: huggingface/lerobot
- WALL-OSS Paper: Igniting VLMs toward the Embodied Space
- Wall-X Model Repository: X-Square-Robot/wall-oss-flow
- SO-101 Inference Deployment Adaptation: pgq18/wall-x (includes SO-101 inference launch scripts)
- SO-101 PC Control Tools: pgq18/Tools (includes teleoperation and inference client scripts)