Teaching a Robot Arm to 'See': SO-101 Imitation Learning with WALL-OSS

1. Why Do This?

If large language models taught AI to "speak," then Embodied AI teaches AI to "act" — not just to understand human language, but to execute actions in the physical world.

This is far harder than it sounds. Traditional robot control relies on precise programming: engineers must manually design every joint angle and every motion trajectory. The moment the environment changes (say, an object shifts a few centimeters), the entire program may fail.

Imitation Learning offers a fundamentally different path: a human demonstrates the task first, and the robot learns by watching. It is like a master teaching an apprentice: you write no control code and simply teach the robot hands-on.

The core goal of this project is end-to-end Vision-Language-Action (VLA) closed-loop control on a low-cost open-source robot arm: you give a one-sentence instruction, and the arm completes the task autonomously.

SO-101 Follower Arm, powered by the LeRobot ecosystem

2. Hardware Platform: Open-Source Six-Axis Robot Arm SO-101

The project uses the SO-101 six-axis robot arm, co-designed by TheRobotStudio and HuggingFace. The reasons for choosing it are clear:

Fully open-source: from 3D printing blueprints to control code, everything is public, with a BOM cost of approximately $300-400;
Native LeRobot ecosystem support: HuggingFace's LeRobot framework has built-in SO-101 drivers, calibration, and teleoperation interfaces, ready to use out of the box;
Leader-Follower dual-arm architecture: one arm (Leader) is used for human teleoperation demonstration, while the other arm (Follower) synchronously executes, naturally suited for imitation learning data collection workflows.

The hardware assembly process includes: servo assembly and ID configuration → 3D-printed part installation → dual-arm calibration. The entire process can be completed by a single person, taking about one to two days from parts to a working teleoperation setup.

SO-101 Leader Arm, used for human teleoperation data collection

3. Data Collection: 110 Teleoperation Demonstrations

The core of imitation learning is high-quality demonstration data. The data collection workflow is as follows:

Human operates Leader arm → Follower arm synchronously replicates → Dual cameras record → LeRobot dataset
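The workflow above can be sketched as a simple recording loop. This is a minimal illustration, not the LeRobot recording code: the callables `read_leader`, `write_follower`, `read_follower`, and the `cameras` mapping are hypothetical stand-ins for the real hardware interfaces, and the frame layout is an assumption made for readability.

```python
from typing import Callable, Dict, List, Sequence


def record_episode(
    read_leader: Callable[[], Sequence[float]],
    write_follower: Callable[[Sequence[float]], None],
    read_follower: Callable[[], Sequence[float]],
    cameras: Dict[str, Callable[[], bytes]],
    task: str,
    num_steps: int,
) -> List[dict]:
    """Record one teleoperated demonstration as a list of frames.

    All callables are hypothetical stand-ins for hardware drivers.
    """
    frames = []
    for step in range(num_steps):
        # Observe first: measured follower joints plus one image per camera.
        state = list(read_follower())
        images = {name: grab() for name, grab in cameras.items()}
        # The human-driven leader pose becomes the action label...
        action = list(read_leader())
        # ...and is mirrored onto the follower arm.
        write_follower(action)
        frames.append(
            {"step": step, "state": state, "images": images,
             "action": action, "task": task}
        )
    return frames
```

A real loop would also pace itself to the camera frame rate and write the frames into a LeRobot dataset instead of keeping them in memory.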

For the task of "precisely placing an object onto the back of a mobile robot dog," I collected 110 demonstration trajectories via teleoperation. Each trajectory contains:

Angle sequences for 6 joints
Wrist-mounted RGB camera (640×480, 30fps)
Side-mounted Intel RealSense L515 depth camera (640×480, 30fps)
Natural language task description
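To get a feel for how these streams compare in size, here is a back-of-the-envelope calculation. It assumes uncompressed 8-bit RGB frames and joint angles stored as 32-bit floats sampled at the camera rate; neither storage detail is stated above, so treat the numbers as rough estimates rather than measurements of the actual dataset.

```python
def raw_bytes_per_second(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """Uncompressed data rate of one video stream, in bytes per second."""
    return width * height * bytes_per_pixel * fps


# One 640x480 RGB stream at 30 fps (assumption: 3 bytes per pixel).
rgb_frame_bytes = 640 * 480 * 3
rgb_rate = raw_bytes_per_second(640, 480, 3, 30)

# Six joint angles as float32, sampled at 30 Hz (assumption).
joint_rate = 6 * 4 * 30

print(f"RGB frame:  {rgb_frame_bytes:,} bytes")
print(f"RGB stream: {rgb_rate / 1e6:.1f} MB/s uncompressed")
print(f"Joints:     {joint_rate} bytes/s")
```

Under these assumptions the images outweigh the joint data by several orders of magnitude, which is why camera streams are usually stored as encoded video rather than raw frames.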

The dataset is stored in LeRobot v3.0 format and can be directly used for model training within the HuggingFace ecosystem.

4. Model: Fine-tuning the WALL-OSS VLA Foundation Model

Conventional imitation learning policies such as ACT and Diffusion Policy map camera images and robot state directly to actions; in their standard form they do not take natural language instructions as input.

WALL-OSS is an open-source Vision-Language-Action (VLA) foundation model from the X-Square Robot team, built on the Qwen2.5-VL-3B backbone. Its key design highlights include:

Tightly coupled MoE architecture: vision-language experts and action experts are separated, preserving VLM semantic understanding while enabling high-frequency continuous action generation;
Flow Matching action generation: replaces discrete action tokenization with continuous flow matching, achieving smooth, high-precision trajectory prediction;
Two-stage training (Inspiration → Integration): first injects discrete action priors into the VLM, then achieves fine-grained control through flow matching.
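To make the flow matching idea concrete, here is a toy sketch in plain Python. It is not the WALL-OSS implementation. It assumes a straight-line path from Gaussian noise (t = 0) to the demonstrated action (t = 1), which is one common formulation. The `oracle_velocity` helper is a hypothetical stand-in for the trained action expert, which in the real model would be conditioned on images and language.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(noise: Vector, action: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise: Vector, action: Vector) -> Vector:
    """Constant velocity of that path: the regression target in training."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict: VelocityFn, noise: Vector, action: Vector, t: float) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    predicted = predict(interpolate(noise, action, t), t)
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def sample_action(predict: VelocityFn, noise: Vector, num_steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1, starting at noise."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        velocity = predict(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


def oracle_velocity(action: Vector) -> VelocityFn:
    """Toy 'perfect' velocity field for one known action (valid for t < 1)."""
    def predict(x: Vector, t: float) -> Vector:
        return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]
    return predict
```

With the oracle field, `flow_matching_loss` is zero up to rounding error and `sample_action` lands on the target action. The output is a continuous vector of joint values, with no discretization into action tokens.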

Fine-tuning was completed on a multi-GPU lab server using the LeRobot training pipeline. With 110 demonstrations, convergence was achieved within a few hours:

LeRobot dataset → WALL-OSS pretrained weights → Full fine-tuning → Deployable policy

The model was not quantized; inference on a single GPU occupies approximately 18GB of VRAM.

5. Inference Deployment: GPU Service + Real-Time Control

Deployment separates the system into two parts, a GPU inference service and a PC-side real-time controller:

Role	Device	Responsibility
Inference Server	GPU Server (single card)	Loads the WALL-OSS model, receives observation data, returns action predictions
Control End	PC + SO-101 + Dual Cameras	Captures real-time images/joint states → sends inference requests → executes action commands

In each control cycle, the PC packages wrist camera images and joint states, sends them to the GPU inference service, and immediately executes the received action predictions, forming an end-to-end perception-decision-execution closed loop.
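One control cycle can be sketched as follows. This is a minimal illustration under stated assumptions: `grab_image`, `read_joints`, `request_action`, and `send_joints` are hypothetical callables standing in for the camera, the arm driver, and the network call to the inference service, and the policy is assumed to return one joint-space target per request. The transport between the PC and the GPU server is not specified here, so it is left behind the `request_action` callable.

```python
from typing import Callable, Dict, List, Sequence

Observation = Dict[str, object]


def control_step(
    grab_image: Callable[[], bytes],
    read_joints: Callable[[], Sequence[float]],
    request_action: Callable[[Observation], Sequence[float]],
    send_joints: Callable[[Sequence[float]], None],
    task: str,
) -> List[float]:
    """Run one perception -> decision -> execution cycle."""
    state = list(read_joints())
    observation: Observation = {"image": grab_image(), "state": state, "task": task}
    # Round trip to the inference service (hypothetical callable).
    action = list(request_action(observation))
    if len(action) != len(state):
        raise ValueError("policy returned the wrong number of joint targets")
    # Execute the prediction immediately.
    send_joints(action)
    return action


def run_control_loop(
    grab_image: Callable[[], bytes],
    read_joints: Callable[[], Sequence[float]],
    request_action: Callable[[Observation], Sequence[float]],
    send_joints: Callable[[Sequence[float]], None],
    task: str,
    max_steps: int,
) -> List[List[float]]:
    """Repeat control_step a fixed number of times and return the actions."""
    return [
        control_step(grab_image, read_joints, request_action, send_joints, task)
        for _ in range(max_steps)
    ]
```

Keeping the hardware and the network behind plain callables means the loop can be exercised with fakes before it is connected to the real arm.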

OpenClaw 2 Dogs and 1 Arm Demo Video. Task: Robot Dog 1 goes to the loading area to load goods, then goes to the unloading area to unload. Meanwhile, Dog 2 goes to the docking area, docks, and waits. When the robotic arm loads, Dog 2 goes to the charging area.

6. Results

After fine-tuning on 110 demonstration trajectories, the model successfully learned to grasp a target object from the table and precisely place it onto the back platform of a mobile robot dog — a task involving spatial localization, path planning, and precise end-effector control.

Key outcomes:

✅ Task completion: reliably completes the full "grasp → move → place" closed-loop process
✅ Generalization capability: adapts to objects at different initial positions
✅ Memory footprint: approximately 18GB of VRAM on a single GPU, running full FP16 inference without quantization

More notably, the combination of low-cost open-source hardware, an open-source VLA model, and few-shot fine-tuning shows that embodied intelligence can be deployed quickly in a lab setting, without depending on expensive industrial robot arms or massive datasets.

Subsequently, this SO-101 robot arm was integrated into the Multi-Robot Agent Collaboration Framework, forming a "two dogs, one arm" system together with two robot dogs (see the collaboration demo video below), achieving a full-stack closed loop of cloud decision-making, edge scheduling, and terminal execution.