Building an IoT undercover agent with Rust

June 10, 2024
Perspective
Luís Silva

Revolutionizing the healthcare sector is hard work. You need a sleek, intuitive product for your users. You need the best algorithms running the best models. You need to make the best predictions and generate the best results to show your users.

Now, you just need to set up your hardware in the rooms to run these things and you should be good to go. In the age of connectivity, how hard can it be to manage a couple hundred devices across a country? Right? … Right?

Well let me tell you…

Our AI models may get better and better, but if our systems don’t have all the right libraries installed, nothing works. We may have the smoothest animations in our product, but if our sensors fail to send messages, our users see nothing.

The solution is obvious: we just need a tool that allows us to manage our devices. Should be simple enough. Right? … Right?

💡  Maybe all of these AI alignment issues can be solved by generating malicious CUDA libraries to keep our future AI overlords from knowing what to download from the internet to make their brains work. But until our AI agents can figure out their own substrate, we still have to chew most of their food for them.


When you search for an IoT management tool, you'll find plenty of options. They all promise to handle IoT devices for you as if by magic. However, like any magic trick, it's only impressive if you don't know how it's done. Once you understand the trick, you realize it's just sleight of hand.

We tried a few solutions, but our experience was mixed. Some were Docker-based, which posed specific challenges. Integrating NVIDIA custom code was often tricky, and dealing with large payload sizes became a major issue. These problems made it hard to get the solutions to work smoothly with our own use case. We spent a significant amount of time trying to troubleshoot and adapt these tools to our needs, only to hit roadblocks that consumed valuable time and resources.

Other tools had ridiculous pricing models. It wasn't just that they were expensive (they were), but the way they charged was frustrating. For example, some charged per "tag," a feature that takes minimal effort to implement. Essentially, adding a tag is just one row in a database. This kind of pricing strategy felt unfair and exploitative.

💡  Take into consideration that we had some deployments running on a mobile network, where we pay for every MB. Don’t get me started on the state of mobile network providers for devices at scale. That would be a whole new blog post.

And so this is the state we found ourselves in. A conundrum as old as time: should we work our way around the limitations of these paid solutions or… hear me out… build our own solution?

Ancient engineer pondering if he should build his own solution

I think most of you know where this is headed, so let's take a moment for a bit of context. Most of our back-office products consist of C++ code bit-banged to work for our specific use case. This means we had our fair share of “core dumped” messages and memory usage steadily increasing while our program was running 🙄. So, after we decided to build our own solution, we also decided to give Rust a chance.

This is known as an Engineering Royal Flush.

Engineer excited to go work on Rust

Agent Smith - the undercover agent

We decided to build our software iteratively, leveraging what we were learning along the way. Also known as “we are not really sure what we are doing, but we are going to go forward either way”.

The first step, as with any good project, is to lay down your requirements.

Our agent should be able to run standalone, independent of our current application. A new daemon, always running, always monitoring our devices.

The initial version was a simple program that would just ping back to our servers, nothing fancy, just a simple POST to a public endpoint.

Version 1
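To give you an idea of how little there was to it, here is a minimal sketch of that first heartbeat, assuming the `reqwest` crate (with the blocking and json features) and `serde_json`; the endpoint URL, device identifier, and ping interval are placeholders rather than our real setup:

```rust
// Minimal heartbeat sketch (not our production code). Assumes `reqwest`
// with the "blocking" and "json" features plus `serde_json`; the endpoint,
// device id, and interval below are placeholders.
use std::{thread, time::Duration};

fn main() {
    let client = reqwest::blocking::Client::new();

    loop {
        let payload = serde_json::json!({
            "device_id": "device-0001",      // hypothetical identifier
            "status": "alive",
        });

        // Best effort: log failures and try again on the next tick.
        match client
            .post("https://example.com/api/heartbeat") // placeholder endpoint
            .json(&payload)
            .send()
        {
            Ok(resp) => println!("heartbeat sent: {}", resp.status()),
            Err(err) => eprintln!("heartbeat failed: {err}"),
        }

        thread::sleep(Duration::from_secs(60));
    }
}
```

That was the whole idea: a binary, a loop, and a POST. If the pings stop arriving, we know something is off on that device.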

After getting over the initial fear, and riding on the back of GitHub Copilot, we were confident enough to keep using Rust for this piece of our infrastructure. As with any good project, the next step was to increase complexity.

Until now, our little agent was just a simple binary running on our devices that we side-loaded using another agent that we were using at the time (the one that was so bad it prompted us to embark on this adventure in the first place). However, if we intended to solely use our new agent, we needed to port some of the capabilities that we used from the other tool.

💡  It's always daunting to update tools that you already use. Even when you think you're creating a better, more useful tool, there's a period where the old tool still outperforms the new one. The old tool's features are already integrated, and even if imperfect, the team has adapted their processes to work around these flaws. Fortunately, we are all idealistic geeks. 🤓

We initially opted for a state machine pattern because it best fit our mental model. We wanted the device to go through different states: performing initial checks, checking for remote updates, running external apps, and collecting and posting metrics.

Version 2
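In broad strokes, the version-2 loop looked something like the sketch below; the states match the ones described above, but the names and transition logic are illustrative, not our actual code:

```rust
// Simplified sketch of the version-2 state machine; the states mirror the
// ones described above, but the names and transitions are illustrative.
#[derive(Debug)]
#[allow(dead_code)]
enum AgentState {
    InitialChecks,
    CheckingForUpdates,
    RunningApps,
    CollectingMetrics,
    Failed(String),
}

fn next_state(state: AgentState) -> AgentState {
    match state {
        // e.g. verify disk space, network access, required libraries...
        AgentState::InitialChecks => AgentState::CheckingForUpdates,
        // e.g. ask the server whether a new release should be installed...
        AgentState::CheckingForUpdates => AgentState::RunningApps,
        // e.g. make sure the external applications are up...
        AgentState::RunningApps => AgentState::CollectingMetrics,
        // e.g. gather and POST metrics, then start the cycle again.
        AgentState::CollectingMetrics => AgentState::InitialChecks,
        // Every extra state multiplies the transitions you have to reason about.
        AgentState::Failed(reason) => {
            eprintln!("agent in failure state: {reason}");
            AgentState::InitialChecks
        }
    }
}

fn main() {
    let mut state = AgentState::InitialChecks;
    // Drive the machine through one full cycle.
    for _ in 0..4 {
        state = next_state(state);
    }
    println!("back to {state:?}");
}
```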

Turns out state machines are quite hard to debug: the more states you have, the more transitions you might encounter and the more conditions you have to check. It balloons out of your control really quickly. And especially when you are dealing with remote devices doing their own thing, you need some way to keep track of what they are doing. If they suddenly stop pinging, is it because they are updating? Are they in a failure state? Will they recover? Will they come back?

The biggest difference between our use case and simply having some servers in a cloud provider is that cloud providers offer substantial assistance. They help keep your server online, ensure its accessibility, and aid in monitoring it. They provide a fully operational platform; you just have to decide what to do on top of it. In our fleet, we act as the cloud provider and the servers are our devices. Maintaining all of these devices and keeping them up, running, and online is quite challenging. We've gained a lot of respect for the cloud providers that abstract all of these complexities from us.

The final version

Smith infrastructure

A couple of refactors later, we finally reached a point where we felt very comfortable with the overall results and architecture. To wrap up this long monologue, let me just mention two more things that I think were the biggest improvements in this final version.

Firstly: Actors and the Command Pattern. Both appeared as solutions to some of the problems we encountered when implementing our agent with state machines. This approach encapsulates state and behavior within individual actors, each responsible for its own state transitions and interactions. This makes debugging a lot easier by isolating state changes to specific actors, allowing us to trace and manage transitions with less of a headache.

The Command Pattern further enhances this setup by decoupling the request for an action from the execution of that action. Essentially, commands are sent to actors, which process them independently. This setup ensures each action is explicitly tracked and handled, offering a clear separation of concerns. It also allows for explicit and expected outcomes. If you are interested in going deeper on how to implement actors in Rust, please take a look at this awesome blog post about it.

Example actors
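To make that concrete, here is a minimal sketch of an actor driven by commands over channels, in the spirit of the pattern described above; it assumes the `tokio` runtime (with the macros and rt-multi-thread features), and the `UpdaterActor` name and command variants are made up for illustration:

```rust
// Minimal actor + command sketch using tokio mpsc channels.
// The actor owns its state; the outside world can only send it commands.
use tokio::sync::{mpsc, oneshot};

// Commands are explicit, trackable requests sent to the actor.
enum Command {
    CheckForUpdates,
    // A oneshot channel lets the caller wait for an explicit outcome.
    ReportStatus(oneshot::Sender<String>),
}

struct UpdaterActor {
    receiver: mpsc::Receiver<Command>,
    last_check: Option<std::time::Instant>,
}

impl UpdaterActor {
    async fn run(mut self) {
        // All state transitions happen here, in one place, one command at a time.
        while let Some(cmd) = self.receiver.recv().await {
            match cmd {
                Command::CheckForUpdates => {
                    // ...talk to the update server here...
                    self.last_check = Some(std::time::Instant::now());
                }
                Command::ReportStatus(reply) => {
                    let status = match self.last_check {
                        Some(t) => format!("last update check {:?} ago", t.elapsed()),
                        None => "never checked for updates".to_string(),
                    };
                    let _ = reply.send(status);
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (sender, receiver) = mpsc::channel(16);
    tokio::spawn(UpdaterActor { receiver, last_check: None }.run());

    sender.send(Command::CheckForUpdates).await.unwrap();

    let (reply_tx, reply_rx) = oneshot::channel();
    sender.send(Command::ReportStatus(reply_tx)).await.unwrap();
    println!("{}", reply_rx.await.unwrap());
}
```

The nice part is that the only way to touch the actor’s state is to send it a command, so every transition shows up as an explicit, traceable message.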

Secondly, Debian Packages and Systemd. In this final iteration we tried to leverage existing tools in the Ubuntu/Debian world. Even if you try to build your own tools, you can’t really build everything from scratch; as you can imagine, the features you want to add quickly catch up with the time that you have. We chose to integrate tightly with the system running on our devices, which is a custom Ubuntu image. We ditched our custom update code and adopted deb packages across our fleet. They are easier to install and uninstall, and they manage dependencies for us, enabling a consistent installation process across all devices. This approach makes upgrades and rollbacks a breeze, allowing us to deploy new versions or revert to previous ones seamlessly if issues arise.

Systemd, on the other hand, provides a robust framework for managing our software services. It handles starting, stopping, and monitoring of services, ensuring they are always running as expected. Systemd is a powerful tool that allows for a lot of configuration, for instance defining dependencies between services, setting up automatic restarts on failure, and monitoring the status of each service in real time.
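As an illustration, a systemd unit for an agent like ours could look roughly like this; the service name, binary path, and timings are hypothetical, not our actual unit file:

```ini
# /etc/systemd/system/smith.service -- illustrative example, not our real unit
[Unit]
Description=Smith device agent
# Only start once the network is up.
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/smith
# Bring the agent back automatically if it ever crashes.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

With the deb package shipping and enabling a unit like this, an upgrade on a device boils down to installing the new package and letting systemd restart the service.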

Looking Back

Our agent, Smith, is now on all of our hundreds of devices and counting. It’s responsible for monitoring the devices and ensuring applications are running; updates go through him, and so does remote debugging. All in a couple of lines of safe Rust code. Managing a fleet of remote devices is no small feat, but with the right tools and patterns it can be done, and I would say you can even have fun while doing it.

Is this the end? Is it done now? Probably not.

We still have a couple of new ideas that we want to try out and probably a couple of bugs waiting to be discovered and fixed.

Agent Smith