Browser does not support (or has disabled) JavaScript, some features of this page may not work properly

From Point Cloud to Semantics: Deploying SpatialLM on the RDK X5 Edge Board

0:00
/1:49

SpatialLM on RDK X5 Demo

SpatialLM is a large AI model capable of "understanding" 3D space. This article documents how to deploy it onto a palm-sized embedded development board, achieving on-device spatial semantic understanding.


1. Why Squeeze a Large Model into an Edge Device?

In 2025, the ManyCore team published SpatialLM at NeurIPS — a large language model for 3D spatial understanding. Think of it as a "spatial version of ChatGPT": feed it a segment of point cloud data (3D scan results), and it tells you — where the walls are, where the doors are, where the bed is, which direction the sofa faces.

But SpatialLM natively depends on GPU execution. Robots, drones, AR glasses — these real-world devices have extremely limited computing power. If large models can only run in the cloud, they will never truly enter the physical world.

Thus, this project was born: deploy SpatialLM onto the D-Robotics RDK X5 (with only 10 TOPS of compute), enabling edge devices to "grow their own eyes."


2. System Architecture: PC Encoding + RDK X5 Inference

To balance precision and efficiency, we adopted a separated deployment architecture:

System Architecture Diagram
System Architecture: PC handles point cloud encoding, RDK X5 handles model inference and visualization format conversion
StageLocationTask
Feature EncodingPC (GPU)Compress raw point cloud into feature vectors using the SONATA encoder
Semantic InferenceRDK X5 (CPU)Load lightweight SpatialLM model, output semantic labels and 3D bounding boxes
VisualizationPC WebRender interactive 3D scenes using the Rerun tool

The entire workflow is automated: user uploads a .ply point cloud via the web interface → PC encodes → network transmission to RDK X5 → model inference → results sent back to PC and rendered.


3. Three Inference Modes, Flexible Adaptation to Different Scenarios

SpatialLM supports three structured inference tasks, covering spatial understanding needs from coarse to fine:

  • Layout Estimation: identify the spatial positions of walls, doors, and windows
  • Object Detection: detect furniture categories and 3D bounding boxes (supporting 59 furniture categories)
  • Structured Reconstruction: combine both above for complete spatial modeling

The most compelling feature is Zero-shot generalization capability: specify target categories using natural language (e.g., "find all chairs and tables"), and the model can focus detection without retraining.

SpatialLM Web Upload Interface
Web Upload Interface: supports .ply format, one-click to start the processing pipeline

4. Results Showcase

Below are the actual runtime results from point cloud input to semantic annotation. The model successfully identified objects such as beds, nightstands, windows, doors, wall art, and wardrobes in a bedroom scene, intuitively presented with 3D bounding boxes and semantic labels.

3D Point Cloud Scene Semantic Annotation Results
Structured Semantic Visualization: walls annotated with colored lines, furniture displayed as 3D bounding boxes, category labels clearly distinguishable

From raw point cloud to a semantically labeled 3D scene, the entire pipeline requires zero manual annotation.


5. Technical Highlights

1. Compute Partitioning — Trading Architecture for Performance

Offload the computationally intensive 3D point cloud encoding to a PC GPU, keeping only lightweight LLM inference on the embedded side. Even 10 TOPS of compute can drive a large model.

2. Large Model Lightweighting

Adapted the originally CUDA-dependent SpatialLM to run purely on CPU, and is currently exploring BPU quantization acceleration — aiming to reduce inference time from minutes to seconds.

3. The Magic of LLM Generalization

Traditional 3D detection models require re-annotation and retraining for every new object category. SpatialLM natively supports zero-shot detection of arbitrary categories — this is the core breakthrough brought by the language model paradigm.

4. Engineering Closed Loop

From point cloud upload, encoding transmission, model inference, to visualization rendering — the entire chain is connected end-to-end, with open-source, reproducible code.


6. Application Outlook

Edge-side spatial understanding capability is key infrastructure for embodied intelligence to enter the real world:

  • Service Robots: walk into an unfamiliar room and instantly understand the layout and object positions
  • Autonomous Navigation: drones / autonomous vehicles performing real-time perception and obstacle avoidance in dynamic environments
  • AR / VR: use a phone to record a short video and automatically generate a structured 3D model

When large models are no longer imprisoned in data centers but enter every embedded chip — that is when spatial intelligence truly begins to land.


7. Project Links

Tags: Portfolio
Author: 月儿
Date:2025年12月31日

Comments