Using K3s and ArgoCD for Robotics OTA Deployment
Intro
You’re going for a hike. You wake up early, lace up your boots, and grab a banana as you walk out the door. You drive a ways out, into the thick of the woods, to find the trailhead. You hike for a few grueling hours, and finally, you get to the summit!
You take the perfect picture; the sun is cresting over the ridgeline in just the perfect way. Now this is Instagram worthy! You open the app and try to upload it, but tragedy strikes: the upload has failed. No service! Much to your chagrin, you don't have a network connection. Oh well, no worries. It'll be a few more hours till your superfans (your mom + your other 64 followers) get to see your picture.
No matter, let's hike back down the mountain. But now things go from bad to worse. Without a network connection, you can't access the updated Google Maps information. You're officially lost!
The plight of your hike is the same struggle that robots face every day. When robots are deployed to customer sites, such as factories, logistics centers, farms, and construction zones, networking is often poor. Sometimes, an enterprise customer will even firewall their facility, meaning that a robot can only communicate with a local server.
Over-the-air (OTA) software updates are hard to get right in environments like this.
And networking is just one of the intricacies of OTA:
You need visibility. What packages are running on what robot? What updates failed? What robots are out of sync?
You need security. How can we securely (but also quickly) deliver and download software packages remotely?
You need failsafes. Can we easily roll back an update if it causes a bug?
Robots are nightmarishly complex systems. So much has to work perfectly for things to go right, and even still, there’s a long way the industry has to go to build dependable, robust robotic systems.
Given this undertaking, most teams don’t have the time or bandwidth to stand up a solid OTA pipeline. What makes it more complicated is that there’s no clear standard for how it should be done.
This means that they end up cobbling together a hacky solution that scales poorly and usually comes back to bite them as they grow their fleet!
In this blog, I propose what that standard should be!
We’ll walk through a setup using K3s and ArgoCD, using the principles of GitOps. We’ll briefly discuss its alternatives, its merits, and how the pipeline operates in production.
Let’s go!
The Common Ways Teams Deploy OTA Updates
1. SSH + SCP
What is it?
SSH (Secure Shell) lets you remotely access a robot's terminal from your own workstation. You connect to the robot using its IP address and credentials.
SCP (Secure Copy) lets you transfer files between your workstation and the robot over SSH. You specify the file path and the robot's IP, and it securely copies the file to the destination.
The workflow: manually SSH into each robot in the fleet. Then, use SCP to transfer your binaries or Docker containers. Finally, run systemctl restart to restart your robot's application and apply the update.
Why do teams use it?
This is a dead-simple way to get started with remote updates. It’s easy for any engineer to perform, and it’s instrumental in the early prototyping/testing phase. Plus, once you’re shelled in, it’s easy to debug.
Why does it break down?
Unscalable
This is an inherently manual and unscalable model. If you have a small fleet of robots, it’s not that annoying. Once you get past 10, it becomes time-consuming.
Error prone
The manual nature of this process also means it can be error-prone. By the time you update robot #15, you might forget to copy over the object-detection module. There's no declarative way to ensure that each package gets updated.
Visibility
It’s impossible to see which packages are on each robot in our fleet. Are they all running correctly? Is each robot on the right version?
Rollbacks
If something goes wrong, what then? We can’t easily roll back to a previous version.
Security
Every inbound port that you open via SSH is another attack vector for malicious actors (like a man-in-the-middle attack).
You'll have to monitor and audit this access, including who accessed which device, who made what change, and why.
SSH keys need to be distributed securely and rotated often.
Once someone is shelled into the robot, they can run arbitrary code, including malicious updates, with little to no visibility or alerting.
2. Ansible (or other automation tools)
What is it?
Ansible is an open-source configuration management tool that enables you to automate the setup and updates of your infrastructure. You write YAML “playbooks” that describe the desired state of the robot, such as which packages should be installed, which files should exist, which environment variables should be set, and which services should be running.
When you run a playbook (using ansible-playbook), Ansible SSHs into each robot and executes your instructions step by step. It doesn't require an agent on your device.
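To make that concrete, here's a minimal playbook sketch; the host group, file paths, and service name are all hypothetical stand-ins for your own setup:

# deploy.yml — a minimal sketch; hosts, paths, and service names are hypothetical
- name: Deploy robot application
  hosts: robots                       # inventory group containing each robot's IP
  become: true
  tasks:
    - name: Copy the new application binary onto the robot
      ansible.builtin.copy:
        src: ./build/robot-app
        dest: /opt/robot/robot-app
        mode: "0755"

    - name: Restart the application service to apply the update
      ansible.builtin.systemd:
        name: robot-app
        state: restarted

Running ansible-playbook deploy.yml then executes these tasks over SSH against every host in the robots group.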
Why do teams use it?
Ansible’s power is that it’s declarative. Rather than having to manually copy each package in the terminal, you can specify the state you want your application to be in, and Ansible will handle the rest.
It also helps keep things DRY. You can reuse playbooks across robots or environments with minimal changes. Templating, variables, conditionals, and roles let you write flexible, modular playbooks.
Why does it break down?
It's still SSH-based and, for that reason, faces most of the same practicality and security challenges as SSH/SCP.
Atomicity
Updates are not atomic (all or nothing). If the network connection drops midway through an update, your device could be left in a bad state.
Retries
It’s a push-based system. The central server initiates updates, unaware of the current state of the device. If a robot is offline when the playbook runs, it will miss the update. There’s no way for the device to request or pull updates on its own.
Ansible was originally designed to update thousands of server racks in a data center, not edge devices with inconsistent networking. That’s why it can be brittle for robotics!
3. Docker Compose
What is it?
Docker Compose is a tool for defining and running multi-container applications. You write a docker-compose.yml file that declares each container (aka service) your application needs. Each service can specify its image, volumes, environment variables, and dependencies.
To deploy, you copy the Compose file onto the robot and run docker-compose up -d. This boots up all the containers with a single command.
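For a sense of scale, a robot's whole stack might be declared in one short file. A minimal sketch (the image name, port, and volume path are hypothetical):

# docker-compose.yml — minimal sketch; image, port, and paths are hypothetical
services:
  robot-app:
    image: ghcr.io/my-org/robot-app:v1.5.3
    restart: unless-stopped            # Docker-level restart policy, not a reconciliation loop
    ports:
      - "8080:8080"
    volumes:
      - /etc/robot/config:/config:ro   # read-only config mounted from the host
    environment:
      ROBOT_ID: robot-001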
Why do teams use it?
Compose is the easiest way to do container orchestration. It’s simple to learn, well-documented, and extremely easy to set up.
The Compose file is practically a manifest. It becomes your source of truth for how the app should run: what images to use, what ports to expose, and how containers should interact.
Why does it break down?
No state enforcement
Compose doesn’t monitor your system to ensure it stays in the desired state. If something crashes or a container exits unexpectedly, Compose doesn’t notice.
No self-healing
If a container dies, Compose itself won't bring it back. Docker restart policies can revive a crashed container, but there's no control loop reconciling your system against a desired state. Kubernetes handles this automatically!
Weak rollouts/rollbacks
Compose has no canary rollouts and doesn’t natively support rolling back to a previous version in the case of issues. This won’t fly with a large fleet of production robots.
Compose is great at first, but you’ll eventually want more and more functionality.
Health checks and automatic restarts
Resource limits per container
Declarative state enforcement
Centralized observability and metrics
Rolling updates with version control
You’ll start to duct-tape this together as you scale, until you realize that you should be using Kubernetes!
Wrapping Up
And before you beat me to the punch, you definitely don't want to be building this yourself! The general inclination of most engineers is: "Oh, this is easy! We can spin this up quickly ourselves." Of course, it's never that easy. You'll be able to get a hacky v1, but the real problems will come with scaling and hardening the solution. But that's nothing new; the same pattern happens in data and cloud software applications.
The unfortunate part about scaling software for a robotic fleet is that things are much stickier for physical devices. A fleet of robots has uptime requirements, and switching out processes can require physical intervention: imagine that your robot is deployed to a firewalled site. To update the agent running on it, a technician must physically install the new version.
Ripping and replacing an OTA system is really, really hard! So we need to get this right from day 1.
So, we’ve looked at three of the most common ways teams approach OTA: SSH/SCP, Ansible, and Docker Compose. And we’ve seen where each of them breaks down.
Now, let’s talk about what a better system looks like.
Why Use K3s, ArgoCD, and GitOps
There’s a better way! Let’s talk through (in my opinion) the optimal architecture for your OTA stack.
This setup uses K3s and ArgoCD, powered with the principles of GitOps to deliver a system that is:
Declarative: We describe what the robot should be running, not how to make it run in that way.
Resilient: Can handle network drops and power cycles.
Traceable: Every change is tracked in Git.
Secure: No inbound ports, no unfettered SSH access.
Here’s how it works:
Each robot runs its own K3s cluster. Within that cluster is an ArgoCD agent that monitors a remote Git repository. This Git repo will describe the desired state: what containers to run, config values, and startup instructions.
When the Git repo changes, for example, if an image has a new tag, the agent detects the difference and pulls the update down. The local K3s cluster then performs the necessary updates and synchronizes itself with Git. No need to SSH, SCP, or execute any manual commands.
Our cloud backend has a CI pipeline that builds images, pushes them to our container registry, and commits changes to Git. The cloud also hosts our ArgoCD control plane, which monitors the status of our K3s clusters, giving us fleet-wide observability into software versions and health.
TLDR:
Commit a change in Git.
Agent watches Git server.
Update gets pulled down to the robot.
K3s cluster applies the update.
Let’s dive deeper into each component.
K3s
K3s is a lightweight version of Kubernetes, designed to run on resource-constrained devices like robots or edge devices. It’s fast and compact, bundled into a single binary.
For clarity, Kubernetes is a container orchestration tool that helps deploy, scale, and manage containers. You can think of it as a beefed-up Docker Compose.
Kubernetes monitors your containers. If a container crashes or hangs, Kubernetes automatically restarts it. It uses:
Liveness probes to check if a container is still functioning
Readiness probes to check if a container is ready to serve traffic
Everything is defined in the manifest (YAML file). This declarative approach means you don’t need to write hacky scripts. Just describe what you want, and Kubernetes does the rest.
K3s gives you the power of Kubernetes in a single binary (<100 MB) and uses SQLite by default instead of etcd.
This allows it to run on embedded devices like Jetson or Raspberry Pi.
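If compute is especially tight, K3s can be trimmed further through its config file, whose keys mirror the k3s CLI flags. A minimal sketch, assuming a single-robot cluster that doesn't need the bundled ingress or load balancer:

# /etc/rancher/k3s/config.yaml — a minimal sketch; tune for your own hardware
disable:
  - traefik      # bundled ingress controller
  - servicelb    # bundled service load balancer
write-kubeconfig-mode: "0644"   # keep the kubeconfig readable for local tooling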
ArgoCD and GitOps
ArgoCD:
ArgoCD is a GitOps controller. It automates Kubernetes deployments by syncing your cluster to a Git repo.
In a typical setup, ArgoCD runs as a centralized service in the cloud (the “control plane”) and pushes updates to your clusters when Git changes.
The ArgoCD agent flips that model. It’s a lightweight version of ArgoCD that runs inside each robot’s local K3s cluster. Instead of a push system, it pulls updates from the Git repo and applies them locally.
In this setup, the remote ArgoCD control plane doesn’t push code or directly interact with the cluster. Instead, it’s just used for observability.
GitOps:
As the name suggests, GitOps is the philosophy that Git is the source of truth for your application’s state.
When using Ansible, Compose, or even Kubernetes by itself, you have to execute imperative commands to deploy software (kubectl apply -f deployment.yaml, docker-compose up -d, ansible-playbook deploy.yml). With GitOps, deploys kick off automatically when you push.
By using GitOps, application versioning is more traceable. There's a single source of truth, and it's easy to change: just push to prod. ArgoCD detects the difference and automatically reconciles your local cluster to match.
Note: You could run a traditional Kubernetes (K8s) cluster across your entire fleet, treating each robot as a node in that cluster. In that setup, the ArgoCD control plane would push updates directly to the nodes.
This doesn’t work for two glaring reasons:
It faces all the shortcomings of other push-based systems.
If a robot disconnects from the network, Kubernetes might try to reschedule its pods onto a different node, which could be on a different robot!
In summary, we use K3s, ArgoCD, and GitOps because the stack is declarative, version-controlled, resilient (self-healing), and secure (no inbound ports!).
Now we’ll talk more tactically about how to structure this stack.
Walking Through the Setup
Git Structure
We use a single shared Helm chart to define our robot’s application stack. This includes deployments, services, volumes, etc.
Per-robot configuration (sensor calibrations or location-specific overrides) is stored in external ConfigMaps.
You can manage these overrides with:
Templating systems like Helm values.yaml
Manual overrides via Kustomize patches
Or, more scalably, with dynamic config management via Miru
These ConfigMaps are injected into the application at runtime.
Here’s an example file tree:
/robot-deployment/
  /charts/
    robot-app/          # Shared Helm chart
  /robot-values/
    robot-001.yaml      # Config for robot-001
    robot-002.yaml      # Config for robot-002
Sample deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: robot-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: robot-app
  template:
    metadata:
      labels:
        app: robot-app
    spec:
      containers:
        - name: robot-app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            - name: ROBOT_ID
              value: "{{ .Values.env.ROBOT_ID }}"
            - name: WIFI_SSID
              valueFrom:
                configMapKeyRef:
                  name: robot-config
                  key: wifi_ssid
            - name: CAMERA_OFFSET
              valueFrom:
                configMapKeyRef:
                  name: robot-config
                  key: camera_offset
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 20
Sample configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: robot-config
data:
  wifi_ssid: "{{ .Values.config.wifi_ssid }}"
  camera_offset: "{{ .Values.config.camera_offset }}"
Sample values.yaml
image:
  repository: ghcr.io/my-org/robot-app
  tag: v1.5.3
env:
  ROBOT_ID: robot-001
config:
  wifi_ssid: "Factory_A"
  camera_offset: "0.12"
CI/CD Pipeline
Our pipeline uses Git as the source of truth and the trigger for deployments.
Here’s what happens when we release a new software version:
An engineer pushes a new commit to Git (let's say it bumps the image tag in values.yaml).
The CI pipeline (GitHub Actions, GitLab, Jenkins) then:
Runs the test suite
Builds a new Docker image (robot-app:v1.5.4)
Pushes it to a container registry (Docker Hub, GHCR, ECR)
Commits the updated manifest to Git
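Here's a minimal sketch of that pipeline as a GitHub Actions workflow. The test command, values file path, and bot identity are hypothetical stand-ins; adapt them to your repo:

# .github/workflows/release.yml — a sketch; test command, paths, and bot identity are hypothetical
name: release
on:
  push:
    tags: ["v*"]               # e.g. v1.5.4
jobs:
  build-and-release:
    runs-on: ubuntu-latest
    permissions:
      contents: write          # allow committing the manifest bump
      packages: write          # allow pushing to GHCR
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test         # hypothetical test entry point
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - name: Build and push the image
        run: |
          docker build -t ghcr.io/my-org/robot-app:${GITHUB_REF_NAME} .
          docker push ghcr.io/my-org/robot-app:${GITHUB_REF_NAME}
      - name: Commit the bumped tag back to Git
        run: |
          sed -i "s|^  tag: .*|  tag: ${GITHUB_REF_NAME}|" robot-values/robot-001.yaml
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "release ${GITHUB_REF_NAME}"
          git push origin HEAD:main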
ArgoCD Agent
Each robot runs an ArgoCD agent inside its K3s cluster. The agent periodically polls Git for changes.
It's configured to watch a specific branch. If it sees a diff, such as:
A new commit on main
A bumped image tag in the Helm values (e.g., tag: v1.5.3)
it will pull down the new manifest and apply it to the K3s cluster.
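What the agent watches is itself declared in an Argo CD Application manifest. A sketch is below; the repo URL and namespaces are assumptions, and exactly how per-robot values files are wired in depends on your Argo CD version (older versions restrict Helm value files to the chart directory):

# application.yaml — a sketch; repo URL, namespaces, and values wiring are assumptions
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: robot-001
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/robot-deployment.git
    targetRevision: main                      # the branch the agent watches
    path: charts/robot-app                    # the shared Helm chart
    helm:
      valueFiles:
        - ../../robot-values/robot-001.yaml   # per-robot overrides
  destination:
    server: https://kubernetes.default.svc    # the robot's local K3s cluster
    namespace: robots
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to what Git declares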
K3s Cluster
Once the K3s cluster has the new manifest, it applies the update declaratively. Kubernetes handles this with its built-in controllers.
Here’s what happens:
If a Deployment in the manifest doesn't exist yet, it creates it.
If the Deployment already exists but something changed (like the container image tag), it rolls out a new ReplicaSet.
If a Service, ConfigMap, or Volume changed, it updates those too, and may restart Pods if needed.
If a referenced image isn’t cached locally, it pulls the new image from your container registry.
We mentioned that it may restart a Pod if needed. It will do this when the Pod ‘spec’ changes. It needs to destroy and recreate the Pod to match the desired change.
This could happen because the ports change, environment variables change, or a container image tag changes (among other reasons).
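For instance, a release that only bumps the image tag is a one-line change in Git, but it changes the Pod spec, so the Deployment controller rolls out a new ReplicaSet:

# robot-values/robot-001.yaml — the one-line release diff (sketch)
image:
  repository: ghcr.io/my-org/robot-app
  tag: v1.5.4   # was v1.5.3; this change alone triggers a new ReplicaSet rollout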
Benefits of This Approach
The K3s + ArgoCD + GitOps stack gives us a repeatable, reliable, and scalable way to deploy software to our fleet.
Versioning, traceability, and rollbacks
Every production change goes through Git. This means that we have a central source of truth for our deployments. We can be sure that we have a clear audit trail of what changed, when, and why. And, if things go wrong, we have an easy way to roll back just using Git. We don’t have to build any other fancy infrastructure to support it.
Declarative + Scalable
No more running hacky scripts (or even worse, manually SSHing one by one into each robot). Now, we can use Git to declare our desired state and let ArgoCD handle the rest.
Push a change to the manifest, and let it be pulled to every robot in your fleet. It doesn’t matter if you have five robots or 100.
Resilient to Network Drops and Power Cycles
Other systems (like Ansible) that are push-based and rely on SSH have a gaping problem when it comes to edge devices. We can't assume a steady connection to the device. It's common for a robot to lose power or drop off the network.
In the case of a push-based system, this could mean that some devices (which aren’t connected to the network at the time) won’t receive the update. Moreover, if the network drops while an update is being applied, our device could be left in a bad state.
Security
A mortal sin in the world of edge devices is opening inbound ports to our devices.
With ArgoCD agents, all communication is outbound. The robot polls the Git server.
Plus, because everything runs locally on-device, no remote commands are being executed over SSH. No chance of a man-in-the-middle attack!
Open source and well supported
Kubernetes is used everywhere for container orchestration. ArgoCD is the dominant GitOps tool for Kubernetes. Both tools have been battle-tested for years, from startups to Fortune 500s.
Downsides of This Approach
Kubernetes is Heavy
Our robots have only a limited amount of compute and storage. We've heard war stories from our customers, where the perception and controls teams fight for every last byte of resources. Often, the computer is pushed to its absolute limit to wring the best performance out of the robot.
So, even though K3s is a lighter distribution of K8s, you can still argue that with its container runtime, control plane, and system daemons, it's heavy. K3s typically takes ~100–300 MB of RAM and ~100–200 MB of disk space, depending on your app, logs, and images.
If you're running a Jetson or Raspberry Pi, you should be good to go.
Steep Learning Curve
Kubernetes has a reputation for being hard to master (and for good reason). My co-founder once spent an entire summer wrangling K8s!
Most robotics teams will need to learn these behaviors/skills:
Kubernetes basics: pods, deployments, services, volumes, probes, etc.
Declarative manifests: writing and understanding YAML configs
GitOps workflows: stricter Git practices (branches, PRs, approvals)
CI/CD discipline: everything goes through the pipeline
Team alignment: everyone has to operate within the system
Required Infra Focus
To use this stack to the fullest extent, you'll need to lay down the proper infrastructure around it. Depending on your resources, this may be prohibitive. Expect to set up:
TLS certificates (for secure comms between agents and registries)
Private container registries (with authentication + access control)
Secrets management (Wi-Fi creds, tokens, API keys → Kubernetes Secrets)
Networking edge cases (NAT traversal, firewalls, proxies)
Structured observability (logs, metrics, status sync from ArgoCD)
Conclusion
Building OTA for robots is hard. Doing it well (securely, scalably, and reliably) is even harder. Most teams start with bright eyes and hacky internal tools, but those don't hold up once the fleet grows.
We’ve walked through the most common approaches (SSH + SCP, Ansible, Docker Compose) and why they don’t scale.
The pair of K3s and ArgoCD is the most functional way to deploy OTA software updates. So the next time you’re evaluating your OTA stack, remember this blog!