
Giving a Robot Dog a 'Brain': An LLM-Driven Embodied Intelligent Dog System


Project Demo: Robot Dog + Cart + Smart Glasses Collaboration


1. Making the Robot Dog "Understand" Human Speech

Imagine this: you say to a quadruped robot dog, "go take a lap around the lab and see if anyone is there," and then it heads out on its own — navigating autonomously, recognizing its environment, and coming back to report. This is not science fiction; it is an LLM-driven embodied intelligent dog system that I built from scratch over the course of a year in my research group.

The core proposition of this work is straightforward: how do we enable a quadruped robot to understand natural language instructions and autonomously complete tasks in real-world environments? This requires bridging three key links — semantic understanding (comprehending human speech), environmental perception (seeing the world), and motion control (moving steadily). My answer: a three-tier large model system that organically connects these three links.

Unitree Go2 Quadruped Robot Dog experimental platform, equipped with NVIDIA Orin Nano edge computing module

2. System Architecture: A Three-Tier Large Model System

Making a robot dog autonomously complete tasks is not as simple as hooking up an API to a large model. A robot system in the real world needs to simultaneously handle semantic understanding, spatial perception, long-term memory, real-time control, and many other requirements — a single model simply cannot handle all of them.

I designed a three-tier LLM system architecture, assigning different types of tasks to models at different levels:

Three-Tier System Architecture Diagram
Three-Tier LLM System Architecture: Decision Layer → Memory/Perception Layer → Execution Layer, with clearly defined responsibilities
| Tier | Role | Core Capability | Technical Solution |
| --- | --- | --- | --- |
| Top · Decision Layer | The Robot's "Cerebrum" | Natural language understanding, task decomposition and planning, human-robot dialogue | Large Language Model (cloud inference) |
| Middle · Perception & Memory Layer | The Robot's "Hippocampus" | Environmental memory construction and retrieval, SLAM mapping and localization, navigation planning | ReMEmbR + Point-LIO + Nav2 |
| Bottom · Execution Layer | The Robot's "Cerebellum" | Real-time motion control, sensor data acquisition, safety emergency stop | autonomy_stack_go2 (ROS2) |

The core philosophy of this architecture is "let the right model do the right thing": large models handle semantic understanding and task planning that require deep reasoning (slow but smart), while small models handle perception and control that require real-time response (fast and reliable).

A typical instruction execution flow: the user says "go check if there are empty seats in Zone B" → the top-tier LLM parses the intent and decomposes it into subtasks (navigate to Zone B → scan seats → return and report) → the middle-tier ReMEmbR retrieves environmental memory of Zone B, SLAM provides real-time localization, Nav2 plans the path → the bottom tier executes motion control, with real-time status feedback throughout.
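The flow above can be sketched in Python as a plan of subtasks, each routed to the tier that services it. This is a minimal illustration under stated assumptions, not the project's code: `Subtask`, `decompose`, `TIER_FOR_ACTION`, and `route` are hypothetical names, and `decompose` returns a hard-coded plan where the real system would call the cloud LLM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Subtask:
    action: str  # what to do, e.g. "navigate"
    target: str  # what to do it to, e.g. "Zone B"


# Hypothetical routing table: which tier services each kind of subtask.
TIER_FOR_ACTION = {
    "navigate": "perception_memory",  # memory lookup, localization, path planning
    "scan": "perception_memory",
    "report": "decision",             # the LLM phrases the answer for the user
}


def decompose(instruction: str) -> List[Subtask]:
    """Stand-in for the decision-layer LLM; returns a fixed plan for illustration."""
    return [
        Subtask("navigate", "Zone B"),
        Subtask("scan", "seats"),
        Subtask("navigate", "start"),
        Subtask("report", "user"),
    ]


def route(plan: List[Subtask]) -> List[Tuple[str, Subtask]]:
    """Pair every subtask with its tier; actions no tier handles are rejected."""
    routed = []
    for step in plan:
        if step.action not in TIER_FOR_ACTION:
            raise ValueError(f"no tier handles action {step.action!r}")
        routed.append((TIER_FOR_ACTION[step.action], step))
    return routed


if __name__ == "__main__":
    plan = decompose("go check if there are empty seats in Zone B")
    for tier, step in route(plan):
        print(f"{tier:>18}: {step.action} -> {step.target}")
```

Rejecting unknown actions at the routing step means a malformed plan fails before any motion command reaches the execution layer.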


3. Core Innovation: Hacking ReMEmbR — Giving the Robot "Long-Term Memory"

A major pain point of traditional robots is the lack of memory — every task execution starts from scratch as if amnesiac. Humans, by contrast, can act efficiently in complex environments largely thanks to long-term environmental memory.

ReMEmbR is a retrieval-augmented memory system for robots proposed by NVIDIA. I made targeted modifications to it, adapting it to the Go2 quadruped platform and to real-world application scenarios:

| Module | Our ReMEmbR Solution |
| --- | --- |
| Visual Description | Scene description compatible with any VLM |
| Vector Storage | MilvusDB general-purpose storage |
| Memory Retrieval | Hybrid retrieval strategy integrating spatiotemporal proximity |
| Inference Query | Multi-round progressive reasoning, supporting fuzzy spatiotemporal queries |

The hacked ReMEmbR gives the robot dog time-dimensioned environmental memory — it can not only remember "where the kitchen is," but also "half an hour ago, there were three people discussing something in the kitchen." This capability is critically important for long-duration deployment scenarios such as inspection and security patrol.
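One way to picture a hybrid retrieval strategy is a score that blends semantic similarity with spatial and temporal proximity. The sketch below is an assumption about how such a score could be composed, not the scoring used by ReMEmbR or by this project. The weights, the exponential decay scales, and the names `Memory`, `hybrid_score`, and `retrieve` are all illustrative, and a plain Python list stands in for the vector database.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Memory:
    caption: str                   # scene description produced by a VLM
    embedding: Sequence[float]     # text embedding of the caption
    position: Tuple[float, float]  # (x, y) in the map frame, metres
    timestamp: float               # seconds


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(mem: Memory, query_emb, query_pos, query_time,
                 w_sem=0.6, w_space=0.2, w_time=0.2,
                 space_scale=5.0, time_scale=1800.0) -> float:
    """Weighted sum of semantic, spatial, and temporal closeness (assumed weights)."""
    semantic = cosine(mem.embedding, query_emb)
    nearness = math.exp(-math.dist(mem.position, query_pos) / space_scale)
    recency = math.exp(-abs(mem.timestamp - query_time) / time_scale)
    return w_sem * semantic + w_space * nearness + w_time * recency


def retrieve(memories: List[Memory], query_emb, query_pos, query_time, k=3):
    """Return the k memories with the highest hybrid score."""
    ranked = sorted(
        memories,
        key=lambda m: hybrid_score(m, query_emb, query_pos, query_time),
        reverse=True,
    )
    return ranked[:k]
```

With a `time_scale` of 1800 seconds, a memory from half an hour before the queried time keeps about 37% of its temporal weight, so a query such as "who was in the kitchen half an hour ago" would favour observations close to that moment over older ones at the same place.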


4. Autonomous Navigation: Point-LIO + Nav2 Dual Engines

For a robot dog to move autonomously in real environments, both localization accuracy and path planning must be solved simultaneously.

For localization and mapping, I integrated the Point-LIO algorithm — a LiDAR-inertial odometry SLAM solution with two significant advantages over traditional methods: point-cloud-level fusion (rather than feature-level) ensures stability even in degenerate environments such as long corridors; and high computational efficiency allows real-time operation on the Orin Nano edge device.

For path planning, I adopted the Nav2 navigation stack from the ROS2 ecosystem. Nav2 provides a complete behavior-tree-driven navigation framework: the Global Planner handles long-distance path generation, the Controller handles real-time obstacle avoidance and velocity smoothing, and the Behavior Tree automates anomaly recovery during navigation (such as re-planning when blocked).

Point-LIO provides accurate real-time localization, and Nav2 uses this localization for path planning — together forming a complete navigation closed loop of "where am I → where am I going → how do I get there."
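The "where am I going" step comes down to handing the planner a goal pose. In ROS2 a pose carries its orientation as a quaternion, so a planar heading has to be converted first. The helper below is a small self-contained sketch of that conversion. `Goal2D`, `make_goal`, and the `"map"` frame default are illustrative assumptions, and in a real system the resulting values would populate a `geometry_msgs/msg/PoseStamped` sent to Nav2 rather than a plain dataclass.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Goal2D:
    frame_id: str  # frame the goal is expressed in
    x: float       # metres
    y: float       # metres
    qz: float      # quaternion z component
    qw: float      # quaternion w component


def yaw_to_quaternion(yaw_rad: float) -> Tuple[float, float, float, float]:
    """Quaternion (x, y, z, w) for a rotation of yaw_rad about the vertical axis."""
    return (0.0, 0.0, math.sin(yaw_rad / 2.0), math.cos(yaw_rad / 2.0))


def make_goal(x: float, y: float, yaw_deg: float, frame_id: str = "map") -> Goal2D:
    """Build a planar navigation goal from a position and a heading in degrees."""
    _, _, qz, qw = yaw_to_quaternion(math.radians(yaw_deg))
    return Goal2D(frame_id=frame_id, x=x, y=y, qz=qz, qw=qw)


if __name__ == "__main__":
    # Hypothetical coordinates for "Zone B", facing along the map's +y axis.
    print(make_goal(4.5, -1.2, 90.0))
```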


Navigation Demo


5. Prompt Engineering: Making the LLM Output "Executable" Instructions

LLMs are great at conversation, but getting an LLM to output robot motion commands is a completely different matter. I invested considerable effort in prompt engineering, primarily solving two problems:

  • 🧩 Ambiguous Instruction Decomposition: users do not issue commands like programmers. An ambiguous instruction like "go check outside for me" is automatically decomposed by the LLM into a clear sequence of subtasks: "open the door → move outdoors → take photos → return → report what was seen."
  • 🔄 Task Feedback Loop: after each subtask is completed, the execution layer reports the status (success/failure/exception) to the LLM, which dynamically adjusts subsequent plans accordingly. For example, if the path is found blocked during navigation, the LLM re-plans an alternative route.
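Both ideas can be sketched together: the LLM is asked to answer with a JSON list of steps, the reply is validated against a whitelist before anything moves, and a failed step triggers a re-plan. This is a hedged sketch rather than the project's implementation. `ALLOWED_ACTIONS`, `parse_plan`, `run_with_feedback`, and the JSON schema are assumptions, and `execute` and `replan` are placeholders for the execution layer and the LLM call.

```python
import json
from typing import Callable, Dict, List

# Hypothetical action whitelist; anything else in the LLM reply is rejected.
ALLOWED_ACTIONS = {"open_door", "navigate", "take_photo", "return", "report"}


def parse_plan(llm_output: str) -> List[Dict[str, str]]:
    """Parse the LLM reply as JSON and reject steps the robot cannot execute."""
    steps = json.loads(llm_output)  # malformed JSON raises a ValueError subclass
    if not isinstance(steps, list):
        raise ValueError("plan must be a JSON list of steps")
    for step in steps:
        if not isinstance(step, dict) or step.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported step: {step!r}")
    return steps


def run_with_feedback(plan: List[Dict[str, str]],
                      execute: Callable[[Dict[str, str]], str],
                      replan: Callable[[Dict[str, str], str, List[Dict[str, str]]], str],
                      max_replans: int = 2) -> List[str]:
    """Run steps in order; after a failure, replace the remaining plan with a new one."""
    log: List[str] = []
    queue = list(plan)
    replans = 0
    while queue:
        step = queue.pop(0)
        status = execute(step)  # "success", "failure", "blocked", ...
        log.append(f"{step['action']}:{status}")
        if status != "success":
            if replans >= max_replans:
                log.append("abort")  # give up rather than loop forever
                break
            replans += 1
            # The planner sees the failed step, its status, and what was left to do.
            queue = parse_plan(replan(step, status, queue))
    return log
```

Capping the number of re-plans keeps a persistently blocked task from cycling between the LLM and the execution layer indefinitely.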

6. Results and Significance

After a year of iterative development, this three-tier large model system achieved a complete closed loop from natural language instructions to autonomous task execution on the Unitree Go2 platform:

  • Natural Language Control: users issue commands in everyday spoken language, without any programming or predefined commands
  • Long-Term Environmental Memory: a spatiotemporal memory system based on hacked ReMEmbR, supporting historical information retrieval and reasoning
  • Autonomous Navigation and Obstacle Avoidance: Point-LIO + Nav2 dual engines, operating stably in complex indoor environments
  • Safety and Controllability: multi-layer safety mechanisms (prompt constraints + behavior tree anomaly recovery + low-level emergency stop)
  • Patent Output: one invention patent on "Embodied Intelligence of Multimodal Large Models," as first inventor

The greatest insight from this project is that the value of large models in robotics lies not in replacing traditional algorithms, but in providing a unified "intent understanding layer" that transforms vague human natural language into precise, machine-executable instruction sequences. From this perspective, the LLM is more akin to the robot system's "operating system" than to an "application."


Tags: Portfolio
Author: 月儿
Date: September 30, 2025
