Giving a Robot Dog a 'Brain': An LLM-Driven Embodied Intelligent Dog System
Project Demo: Robot Dog + Cart + Smart Glasses Collaboration
1. Making the Robot Dog "Understand" Human Speech
Imagine this: you say to a quadruped robot dog, "go take a lap around the lab and see if anyone is there," and then it heads out on its own — navigating autonomously, recognizing its environment, and coming back to report. This is not science fiction; it is an LLM-driven embodied intelligent dog system that I built from scratch over the course of a year in my research group.
The core proposition of this work is straightforward: how do we enable a quadruped robot to understand natural language instructions and autonomously complete tasks in real-world environments? This requires bridging three key links — semantic understanding (comprehending human speech), environmental perception (seeing the world), and motion control (moving steadily). My answer: a three-tier large model system that organically connects these three links.

2. System Architecture: A Three-Tier Large Model System
Making a robot dog autonomously complete tasks is not as simple as hooking up an API to a large model. A robot system in the real world needs to simultaneously handle semantic understanding, spatial perception, long-term memory, real-time control, and many other requirements — a single model simply cannot handle all of them.
I designed a three-tier LLM system architecture, assigning different types of tasks to models at different levels:

| Tier | Role | Core Capability | Technical Solution |
|---|---|---|---|
| Top · Decision Layer | The Robot's "Cerebrum" | Natural language understanding, task decomposition and planning, human-robot dialogue | Large Language Model (cloud inference) |
| Middle · Perception & Memory Layer | The Robot's "Hippocampus" | Environmental memory construction and retrieval, SLAM mapping and localization, navigation planning | ReMEmbR + Point-LIO + Nav2 |
| Bottom · Execution Layer | The Robot's "Cerebellum" | Real-time motion control, sensor data acquisition, safety emergency stop | autonomy_stack_go2 (ROS2) |

The core philosophy of this architecture is "let the right model do the right thing": large models handle the semantic understanding and task planning that require deep reasoning (slow but smart), while lighter-weight models and classical algorithms handle the perception and control that require real-time response (fast and reliable).
A typical instruction execution flow: the user says "go check if there are empty seats in Zone B" → the top-tier LLM parses the intent and decomposes it into subtasks (navigate to Zone B → scan seats → return and report) → the middle-tier ReMEmbR retrieves environmental memory of Zone B, SLAM provides real-time localization, Nav2 plans the path → the bottom tier executes motion control, with real-time status feedback throughout.
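
The flow above can be sketched as a small routing table that attaches the responsible tiers to each subtask. This is a minimal illustration, not the project's actual code: the action names (`navigate`, `scan`, `report`), the `Subtask` class, and the assignment of actions to tiers are all assumptions, and the plan is hard-coded rather than produced by an LLM.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    DECISION = "top: LLM planning and dialogue"
    PERCEPTION_MEMORY = "middle: memory, localization, path planning"
    EXECUTION = "bottom: motion control and emergency stop"


# Hypothetical action names and the tiers each one passes through, in order.
ACTION_TIERS = {
    "navigate": [Tier.PERCEPTION_MEMORY, Tier.EXECUTION],
    "scan": [Tier.PERCEPTION_MEMORY],
    "report": [Tier.DECISION],
}


@dataclass
class Subtask:
    action: str
    target: str
    tiers: list = field(default_factory=list)


def route(plan):
    """Attach the responsible tiers to each (action, target) pair in a plan."""
    routed = []
    for action, target in plan:
        if action not in ACTION_TIERS:
            raise ValueError(f"no tier handles action {action!r}")
        routed.append(Subtask(action, target, list(ACTION_TIERS[action])))
    return routed


# Plan the decision layer might return for
# "go check if there are empty seats in Zone B" (hard-coded here, no LLM call).
plan = [
    ("navigate", "Zone B"),
    ("scan", "seats"),
    ("navigate", "start point"),
    ("report", "seat availability"),
]

for subtask in route(plan):
    names = " -> ".join(t.name for t in subtask.tiers)
    print(f"{subtask.action:<9}{subtask.target:<18}{names}")
```

Rejecting unknown actions in `route` keeps anything the planner invents from reaching the lower tiers.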
3. Core Innovation: Adapting ReMEmbR to Give the Robot "Long-Term Memory"
A major pain point of traditional robots is the lack of memory — every task execution starts from scratch as if amnesiac. Humans, by contrast, can act efficiently in complex environments largely thanks to long-term environmental memory.
ReMEmbR is a retrieval-augmented robot memory system proposed by NVIDIA. Building on it, I made targeted modifications to adapt it to the Go2 quadruped platform and to real-world deployment scenarios:

| Module | Our ReMEmbR Solution |
|---|---|
| Visual Description | Scene description compatible with any VLM |
| Vector Storage | MilvusDB general-purpose storage |
| Memory Retrieval | Hybrid retrieval strategy integrating spatiotemporal proximity |
| Inference Query | Multi-round progressive reasoning, supporting fuzzy spatiotemporal queries |

The modified ReMEmbR gives the robot dog time-stamped environmental memory: it can remember not only "where the kitchen is" but also that "half an hour ago, three people were discussing something in the kitchen." This capability is critical for long-duration deployments such as inspection and security patrol.
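
To make the idea of hybrid retrieval with spatiotemporal proximity concrete, here is a minimal pure-Python sketch. It is not ReMEmbR's API, the Milvus client, or the scoring used in this project. The `Memory` class, the weights, and the exponential decay scales are all illustrative assumptions. It ranks stored scene captions by a weighted sum of semantic similarity, spatial closeness, and temporal closeness.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    caption: str       # VLM-generated scene description
    embedding: list    # text embedding of the caption (toy 2-D vectors here)
    position: tuple    # (x, y) in the map frame, metres
    timestamp: float   # seconds


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_score(mem, query_emb, query_pos=None, query_time=None,
                 w_sem=0.6, w_space=0.2, w_time=0.2,
                 space_scale=5.0, time_scale=1800.0):
    """Weighted sum of semantic, spatial, and temporal closeness (assumed weights)."""
    score = w_sem * cosine(mem.embedding, query_emb)
    if query_pos is not None:
        score += w_space * math.exp(-math.dist(mem.position, query_pos) / space_scale)
    if query_time is not None:
        score += w_time * math.exp(-abs(mem.timestamp - query_time) / time_scale)
    return score


def retrieve(memories, query_emb, k=3, **hints):
    """Return the k best memories for a query, using any spatial/temporal hints given."""
    ranked = sorted(memories, key=lambda m: hybrid_score(m, query_emb, **hints),
                    reverse=True)
    return ranked[:k]


memories = [
    Memory("three people talking in the kitchen", [1.0, 0.0], (2.0, 3.0), 1000.0),
    Memory("empty kitchen", [0.9, 0.1], (2.0, 3.0), 10000.0),
    Memory("corridor with a closed door", [0.0, 1.0], (20.0, 0.0), 1000.0),
]

query = [0.9, 0.1]
by_text_only = retrieve(memories, query, k=1)[0]
with_time_hint = retrieve(memories, query, k=1, query_time=1200.0)[0]
print(by_text_only.caption)
print(with_time_hint.caption)
```

With text similarity alone the closest caption wins. Once the query carries a time hint, the observation recorded near that time is ranked first, which is the behavior a question such as "what was happening in the kitchen half an hour ago" depends on.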
4. Autonomous Navigation: Point-LIO + Nav2 Dual Engines
For a robot dog to move autonomously in real environments, both localization accuracy and path planning must be solved simultaneously.
For localization and mapping, I integrated Point-LIO, a LiDAR-inertial odometry and mapping algorithm with two significant advantages over traditional feature-based methods: it fuses LiDAR measurements at the level of individual points rather than extracted features, which keeps it stable even in degenerate environments such as long corridors; and it is computationally efficient enough to run in real time on the Jetson Orin Nano edge device.
For path planning, I adopted the Nav2 navigation stack from the ROS2 ecosystem. Nav2 provides a complete behavior-tree-driven navigation framework: the Global Planner handles long-distance path generation, the Controller handles real-time obstacle avoidance and velocity smoothing, and the Behavior Tree automates anomaly recovery during navigation (such as re-planning when blocked).
Point-LIO provides accurate real-time localization, and Nav2 uses this localization for path planning — together forming a complete navigation closed loop of "where am I → where am I going → how do I get there."
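
The "where am I → where am I going" part of this loop boils down to a little pose arithmetic. The sketch below uses no ROS imports. It builds a plain dictionary that mirrors the field layout of a `geometry_msgs/PoseStamped` goal (the message type Nav2 goals use) and computes the distance and relative bearing from the current pose to that goal. The function names are hypothetical, and a real system would fill an actual ROS2 message instead of a dict.

```python
import math


def yaw_to_quaternion(yaw):
    """Quaternion (x, y, z, w) for a rotation of `yaw` radians about the z axis."""
    return (0.0, 0.0, math.sin(yaw / 2.0), math.cos(yaw / 2.0))


def make_goal(x, y, yaw, frame_id="map"):
    """Dict mirroring the layout of a PoseStamped goal (illustrative, not a ROS message)."""
    qx, qy, qz, qw = yaw_to_quaternion(yaw)
    return {
        "header": {"frame_id": frame_id},
        "pose": {
            "position": {"x": x, "y": y, "z": 0.0},
            "orientation": {"x": qx, "y": qy, "z": qz, "w": qw},
        },
    }


def distance_and_bearing(current_xy, current_yaw, goal_xy):
    """Straight-line distance to the goal and its bearing relative to the robot's heading."""
    dx = goal_xy[0] - current_xy[0]
    dy = goal_xy[1] - current_xy[1]
    bearing = math.atan2(dy, dx) - current_yaw
    bearing = (bearing + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
    return math.hypot(dx, dy), bearing


goal = make_goal(3.0, 4.0, math.pi / 2)
dist, bearing = distance_and_bearing((0.0, 0.0), 0.0, (3.0, 4.0))
print(goal["pose"]["orientation"])
print(f"distance={dist:.2f} m, bearing={math.degrees(bearing):.1f} deg")
```

In the real stack the current pose would come from the localization output and the goal would be handed to the planner. The straight-line distance here is only a sanity check, not a substitute for the planned path.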
Navigation Demo
5. Prompt Engineering: Making the LLM Output "Executable" Instructions
LLMs are great at conversation, but getting an LLM to output robot motion commands is a completely different matter. I invested considerable effort in prompt engineering, primarily solving two problems:
- 🧩 Ambiguous Instruction Decomposition: users do not issue commands like programmers. An ambiguous instruction like "go check outside for me" is automatically decomposed by the LLM into a clear sequence of subtasks: "open the door → move outdoors → take photos → return → report what was seen."
- 🔄 Task Feedback Loop: after each subtask is completed, the execution layer reports the status (success/failure/exception) to the LLM, which dynamically adjusts subsequent plans accordingly. For example, if the path is found blocked during navigation, the LLM re-plans an alternative route.
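
Both points can be illustrated with a short sketch: the LLM is asked to reply in JSON, the reply is validated against a whitelist before anything is executed, and a failed step triggers a re-plan. Everything here is assumed for illustration, including the action names, the JSON shape, and the `fake_execute` and `fake_replan` stand-ins. A real system would call the LLM and the robot's execution layer in their place.

```python
import json

ALLOWED_ACTIONS = {"navigate", "capture_image", "report"}  # hypothetical whitelist


def parse_plan(llm_output):
    """Parse and validate the LLM's JSON reply; raise ValueError on anything unexpected."""
    try:
        steps = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must be a non-empty JSON list")
    for step in steps:
        if not isinstance(step, dict) or step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported step: {step!r}")
    return steps


def run_with_feedback(plan, execute, replan, max_replans=2):
    """Run steps in order; after a failure, replace the remaining plan via `replan`."""
    history, queue, replans = [], list(plan), 0
    while queue:
        step = queue.pop(0)
        status = execute(step)  # "success" or "failure"
        history.append((step, status))
        if status != "success":
            if replans >= max_replans:
                return False, history
            replans += 1
            queue = parse_plan(replan(step, status, queue))
    return True, history


def fake_execute(step):
    # Simulated robot: the route through the front door is blocked.
    return "failure" if step.get("target") == "front door" else "success"


def fake_replan(failed_step, status, remaining):
    # Stand-in for sending the failure report back to the LLM.
    return json.dumps([{"action": "navigate", "target": "side door"}] + remaining)


reply = json.dumps([
    {"action": "navigate", "target": "front door"},
    {"action": "capture_image", "target": "outdoors"},
    {"action": "report", "target": "user"},
])
ok, history = run_with_feedback(parse_plan(reply), fake_execute, fake_replan)
print(ok, [(step["target"], status) for step, status in history])
```

Validating the re-planned output with the same `parse_plan` function means a recovery plan is held to the same constraints as the original one, and `max_replans` bounds how long the loop can keep retrying.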
6. Results and Significance
After a year of iterative development, this three-tier large model system achieved a complete closed loop from natural language instructions to autonomous task execution on the Unitree Go2 platform:
- ✅ Natural Language Control: users issue commands in everyday spoken language, without any programming or predefined commands
- ✅ Long-Term Environmental Memory: a spatiotemporal memory system based on the modified ReMEmbR, supporting historical information retrieval and reasoning
- ✅ Autonomous Navigation and Obstacle Avoidance: Point-LIO + Nav2 dual engines, operating stably in complex indoor environments
- ✅ Safety and Controllability: multi-layer safety mechanisms (prompt constraints + behavior tree anomaly recovery + low-level emergency stop)
- ✅ Patent Output: one invention patent on "Embodied Intelligence of Multimodal Large Models," as first inventor
The greatest insight from this project is that the value of large models in robotics lies not in replacing traditional algorithms but in providing a unified "intent understanding layer" that transforms vague natural language into precise, machine-executable instruction sequences. From this perspective, the LLM is more akin to the robot system's "operating system" than to an "application."
Further Reading
- Unitree Go2: unitree.com/go2
- ReMEmbR Paper: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory
- ROS2 Navigation2: navigation.ros.org
- Point-LIO: hku-mars/Point-LIO