Part 1: Why ROS Switched From TCPROS to DDS
Exploring ROS 1's communication model and why TCPROS didn't cut it.
Introduction
As the new year rolled into 2014, a group of roboticists gathered around a table in Mountain View, California, debating how to solve a hairy problem in front of them.
They were still in the same city where this journey first began, seven years earlier, in 2007, at Willow Garage. One of Silicon Valley’s most iconic robotics labs, it served as a launchpad for its alumni to found and lead some of the most exciting companies today: Agility, Dusty, Fetch, Zipline, and Simbe, among others.
Led by the team that would later form Open Robotics, Willow Garage set out to lower the barrier to getting started with robots and kick-start a standardized robotics framework. So, they built ROS (Robot Operating System). Only a few years after its launch, it had exploded in popularity. ROS was modular, customizable, and, crucially, open source. Academics, hobbyists, and startups all fervently adopted it.
Through 2014, the growth continued to compound. ROS was far and away the dominant robotics framework, and its user base had graduated beyond small research groups and early adopters. Commercial robotics teams with large fleets and exacting performance requirements were now heavy users. But ROS wasn’t performant enough for their use cases; the cracks were starting to show.
Notably, teams were frustrated with ROS 1’s native wire protocol, TCPROS. It wasn’t built for stringent real-time requirements. It didn’t scale well to large systems. And in some cases, its failure modes created safety concerns.
Open Robotics had commanding ambitions. They wanted ROS to scale to millions of robots across every conceivable vertical. To do that, they knew they needed to rebuild their foundation.
They started over with ROS 2. This time, they focused on delivering the performance commercial teams needed.
This blog is the story of that transformation: why ROS 1’s transport layer wasn’t good enough, what makes DDS better, and why the move to DDS was a bet worth making.
What is Pub/Sub?
Robots are real-time, distributed systems. They’re a messy collection of modules, each doing its own job: one node might handle perception, another localization, a third planning, and a fourth controls. Depending on the system, these modules might run synchronously or independently. Sometimes they’ll share compute. Other times, they’ll each get their own microcontroller.
The point is that robots are complicated, and whatever software framework we use needs to manage this complexity and facilitate clean, efficient communication between modules.
This means that the communication process needs to be fast, reliable, and allow asynchrony.
Fast because latency kills performance. For real-time operations, every millisecond of overhead counts.
Reliable because modules depend on one another (a robot’s control module may rely on data from its perception module). If a message is dropped between the two, the robot may fail to actuate correctly.
Asynchrony because nodes can operate on distinct cadences. Sensors will collect and post data on their own schedule. Modules will consume data based on their own application’s needs.
To meet those needs, ROS adopted a publish/subscribe (pub/sub) model. It’s a messaging paradigm where data producers (publishers) and consumers (subscribers) communicate indirectly by sharing a topic. Publishers write to a topic; subscribers listen to it. This decoupling means that there are no blocking handshakes or synchronous calls like in request/response systems.
The pub/sub model is well-suited for robotics:
Modularity: As we mentioned, a robot consists of multiple nodes that work together, and we want to separate the concerns between them. This makes our system easier to build, test, and debug than a monolithic monstrosity would be. The ability for any number of publishers and subscribers to share a topic (a many-to-many relationship) makes this possible.
Low latency, high throughput: High message rates, like a LiDAR publishing thousands of messages per second, can create bottlenecks if every subscriber connection is handled in a single thread. Without concurrency or a shared transport, the publisher has to serialize and send every message separately, which increases latency and wastes CPU resources. The messaging layer needs to fan data out to many subscribers efficiently.
Asynchrony: Sensors produce data at their own pace. If we tried to use a request/response model like gRPC or REST, we’d constantly hit timeouts and backpressure. In this regard, pub/sub is much more efficient. Its ‘fire and forget’ approach allows modules to consume data at their own pace.
So, ROS 1 used pub/sub. It was the right choice.
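To make the pattern concrete, here is roughly what the two halves of a ROS 1 topic look like in rospy. The node names, topic, and rate below are illustrative placeholders; the point is that the publisher and subscriber share only a topic name and message type, never a direct reference to each other.

```python
# publisher side: posts LiDAR-style data on its own schedule
import rospy
from sensor_msgs.msg import LaserScan

rospy.init_node("lidar_node")
pub = rospy.Publisher("/scan", LaserScan, queue_size=10)

rate = rospy.Rate(40)  # the sensor's own cadence, 40 Hz
while not rospy.is_shutdown():
    pub.publish(LaserScan())  # a real driver would fill in ranges, angles, etc.
    rate.sleep()
```

```python
# subscriber side: consumes the same topic at its own pace
import rospy
from sensor_msgs.msg import LaserScan

def on_scan(msg):
    rospy.loginfo("got a scan with %d ranges", len(msg.ranges))

rospy.init_node("planner_node")
rospy.Subscriber("/scan", LaserScan, on_scan)
rospy.spin()  # block and let the callback fire as messages arrive
```

Neither script knows the other exists; they only agree on the /scan topic and the LaserScan type. That decoupling is what keeps the system modular and asynchronous.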
But having the right messaging pattern is just the start. We still need to move bytes across the network.
That’s where TCPROS came in, handling message delivery under the hood. Here's how it worked, and why it didn’t hold up.
How did TCPROS work?
To move messages between nodes, ROS needed a way to serialize, send, and receive data across a network. This is where the wire protocol operates: it’s the layer responsible for turning these messages into bytes and sending them to their receiver.
ROS 1 implemented a custom protocol, TCPROS. As the name suggests, it used the Transmission Control Protocol (TCP) as its transport layer.
Here’s how it works in practice:
Let’s say camera_node is publishing images to the topic /camera/image_raw, and perception_node subscribes to that topic.
Discovery through the ROS Master
First, the subscriber (perception_node) reaches out to the ROS Master. This is a centralized registry that keeps track of all publishers and subscribers in the system. Every ROS node registers itself with the Master when it launches.
Topic Lookup
The subscriber asks the Master, “Who’s publishing the /camera/image_raw topic?”
Address Exchange
The Master replies with the publisher’s URI (IP address and port).
Direct Connection Setup
The subscriber then opens a TCP socket directly to the publisher. It sends a handshake that includes:
The name of the topic
The message type
A checksum of the message definition to ensure compatibility
If the publisher agrees, it accepts the connection.
Note: This connection is only scoped for these two nodes. If any other nodes want to talk to these nodes, they will have to open a separate TCP socket. All TCP sockets are 1:1 channels.
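To make that handshake concrete, here is a minimal sketch of how the subscriber-side connection header can be encoded, based on the documented TCPROS wire format (illustrative code, not ROS source): the header is a length-prefixed list of key=value fields, each itself length-prefixed.

```python
import struct

def encode_tcpros_header(fields):
    """Encode a TCPROS-style connection header.

    Each field is a 4-byte little-endian length followed by a
    'key=value' string; the whole header gets its own 4-byte length prefix.
    """
    body = b""
    for key, value in fields.items():
        entry = "{}={}".format(key, value).encode("utf-8")
        body += struct.pack("<I", len(entry)) + entry
    return struct.pack("<I", len(body)) + body

# Roughly what perception_node sends when it connects to camera_node's
# TCPROS port (the md5sum value is a placeholder, not the real checksum):
header = encode_tcpros_header({
    "callerid": "/perception_node",
    "topic": "/camera/image_raw",
    "type": "sensor_msgs/Image",
    "md5sum": "<md5 of the sensor_msgs/Image definition>",
})
```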
Message Streaming
From that point on, the publisher serializes and streams messages to the subscriber over that TCP connection. After the connection is established, the two nodes no longer communicate with the ROS Master. The Master just handles discovery; it is not a message broker.
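At the socket level, the stream the subscriber reads is simple framing: each message arrives as a 4-byte little-endian length followed by that many bytes of serialized data. A rough sketch of the read loop (again illustrative, not actual ROS client-library code):

```python
import struct

def read_messages(sock):
    """Yield raw serialized messages from an established TCPROS connection."""
    while True:
        # simplified: assumes the 4-byte length prefix arrives in a single recv
        prefix = sock.recv(4)
        if len(prefix) < 4:
            return  # connection closed
        (size,) = struct.unpack("<I", prefix)
        payload = b""
        while len(payload) < size:
            chunk = sock.recv(size - len(payload))
            if not chunk:
                return
            payload += chunk
        yield payload  # e.g. a serialized sensor_msgs/Image, ready to deserialize
```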
The Implications of TCPROS
TCPROS was simple and pragmatic. Unfortunately, that turned out to be its undoing.
There were no brokers to slow down message routing. No heavy external libraries to make ROS more of a resource hog than it needed to be. And it was relatively easy to debug and reason about. For these reasons, it worked well for hobbyists and researchers.
The trouble revealed itself as the commercial community adopted ROS and ran it on production robots with strict uptime requirements and performance constraints.
No Shared Transport or Multicast
As we discussed, each pub/sub pair gets its own TCP socket. So if five different subscribers want the same message from a publisher, the publisher must push a separate copy of that message down five independent TCP connections (see the sketch after the list below).
This creates performance bottlenecks on three fronts:
Copying and sending the same payload once per subscriber burns unnecessary CPU cycles.
More sockets mean more system overhead, which becomes a problem on memory-constrained edge computers.
Duplicating large messages (like point clouds or images) for each subscriber saturates the network.
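As a toy illustration of that cost (a sketch of the pattern, not ROS internals): with only 1:1 unicast sockets, the publish path degenerates into a loop that pushes its own copy of the payload down every subscriber’s connection, so CPU, memory, and bandwidth all scale linearly with the subscriber count.

```python
import io
import struct

def publish_to_all(msg, subscriber_sockets):
    """Fan a single message out over per-subscriber TCP connections."""
    buf = io.BytesIO()
    msg.serialize(buf)  # genpy-style serialization (illustrative)
    data = buf.getvalue()
    framed = struct.pack("<I", len(data)) + data
    for sock in subscriber_sockets:
        sock.sendall(framed)  # N subscribers -> N unicast copies on the wire
```

With a multicast-capable transport, that loop collapses to a single send that every interested subscriber receives.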
Real-Time Constraints
TCP is designed for in-order, reliable delivery. This is why it’s used for rendering web pages and email, where every packet matters. It’d be quite unpleasant to read an email with every fourth sentence missing!
If a packet gets dropped, TCP will delay every packet behind it. This is called head-of-line blocking (HOL blocking). Until the dropped packet is retransmitted, all other packets are stuck waiting in the queue.
For example, if a LiDAR scan arrives 200ms late due to HOL blocking, your robot might miss its planning window. In most robotic workloads, dropping a frame is better than stalling the pipeline: given the volume of messages being sent, losing an individual packet is something the system can tolerate.
No QoS Controls
TCPROS didn’t support Quality of Service (QoS) customization. You couldn’t prioritize messages. You couldn’t specify delivery deadlines. You couldn’t configure retries, durability, or resource constraints.
Not all data is equally important. We may want to prioritize delivering safety-critical messages (like emergency stop commands) over general debug logs. With TCPROS, they were treated the same.
This made it difficult to build tuned, high-performance systems as teams scaled to dozens of nodes.
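For contrast, here is the kind of per-topic tuning a QoS-aware middleware exposes. This sketch uses the ROS 2 rclpy API, previewing the DDS-backed design covered in the next post; none of these knobs existed in TCPROS.

```python
from rclpy.duration import Duration
from rclpy.qos import (
    DurabilityPolicy,
    HistoryPolicy,
    QoSProfile,
    ReliabilityPolicy,
)

# Safety-critical command: must arrive, and late joiners should get the last value.
estop_qos = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=1,
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
)

# High-rate sensor stream: dropping a frame beats stalling the pipeline.
sensor_qos = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=5,
    reliability=ReliabilityPolicy.BEST_EFFORT,
    deadline=Duration(nanoseconds=100_000_000),  # expect a sample every 100 ms
)
```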
Single Point of Failure: ROS Master
If the ROS Master went down, or if any node lost its connection to the Master, new pub/sub connections couldn’t be made. This created a critical single point of failure, which was unacceptable to commercial customers.
Conclusion
Overall, ROS 1 wasn’t built with production in mind. As more serious robotic teams adopted it, TCPROS’ fragility became a problem. It couldn’t offer the speed, reliability, or configurability their systems required.
Open Robotics knew that they needed to reimagine their transport layer. This time, they searched outward for an off-the-shelf solution.
And that’s what led them to DDS.
In the next blog, we’ll explore how they overhauled the transport layer in ROS 2, why they bet on DDS, and what makes it better suited for modern robotics.
Stay tuned!