A common experience when brought in to help with problems in existing embedded systems is discovering that there is no observability. The device is a black box exhibiting incorrect behaviour, with little to nothing to go on beyond that. Perhaps there are a few print statements available on a serial console, if you’re lucky and can access it. In many cases, you’re dealing with in-field devices that you simply have no physical access to.
There is an obvious solution to this, but it requires up-front commitment:
Treat future troubleshooting and debugging ability as a key requirement from the start!
The basics
Logging
First of all, you’ll want to pay some attention to your logging. Once things have gone wrong, your logs will be your first port of call to figure out what’s actually happened. No logs or poor logs generally means you have no idea or at best a poor idea of what’s happened.
This invariably means you will need some manner of log buffer to hold the historical log entries. This may be a RAM buffer, but ideally you’ll want to ensure the log is also persisted. How easily this can be done depends on the hardware available. With plenty of flash, writing log entries straight to flash may be feasible without risking flash wearout. In situations where some flash space is available but writing on every log message would be likely to hit the wearout limit, a hybrid compromise can be used whereby the logs primarily go into a RAM buffer, but get flushed to flash on a flash-page by flash-page basis, thus reducing the wear on the flash. If using this approach, pay special attention to unexpected reboots and if possible hook the relevant interrupts to attempt to flush to flash on a crash.
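To make the hybrid approach a little more concrete, here is a minimal sketch of a RAM staging buffer that only touches flash once a full page has accumulated. Everything named here is an assumption for illustration: the 256-byte page size, the `flash_write_page_fn` driver hook (assumed to handle any erase that is needed), and the `fake_write_page` stand-in that lets you try it on a PC. The page count passed to `log_store_init` must be non-zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAGE_SIZE 256u  /* assumed flash page size; check your part */

/* Hypothetical flash driver hook: writes one full page, returns 0 on success. */
typedef int (*flash_write_page_fn)(uint32_t page_index, const uint8_t *data);

typedef struct {
    uint8_t  page[LOG_PAGE_SIZE];  /* RAM staging buffer, one flash page */
    size_t   used;                 /* bytes currently staged */
    uint32_t next_page;            /* next page to write in the log region */
    uint32_t page_count;           /* pages in the (circular) log region */
    flash_write_page_fn write_page;
} log_store_t;

void log_store_init(log_store_t *ls, uint32_t page_count, flash_write_page_fn fn)
{
    memset(ls, 0, sizeof *ls);
    ls->page_count = page_count;   /* must be > 0 */
    ls->write_page = fn;
}

/* Write out whatever is staged, padding the rest of the page with 0xFF.
   This is also what a crash or reset hook would call as a last resort. */
int log_store_flush(log_store_t *ls)
{
    if (ls->used == 0)
        return 0;
    memset(ls->page + ls->used, 0xFF, LOG_PAGE_SIZE - ls->used);
    int rc = ls->write_page(ls->next_page, ls->page);
    ls->next_page = (ls->next_page + 1u) % ls->page_count;
    ls->used = 0;
    return rc;
}

/* Append bytes to the log; flash is written only when a page fills up. */
int log_store_append(log_store_t *ls, const char *msg, size_t len)
{
    while (len > 0) {
        size_t room = LOG_PAGE_SIZE - ls->used;
        size_t n = len < room ? len : room;
        memcpy(ls->page + ls->used, msg, n);
        ls->used += n;
        msg += n;
        len -= n;
        if (ls->used == LOG_PAGE_SIZE) {
            int rc = log_store_flush(ls);
            if (rc != 0)
                return rc;
        }
    }
    return 0;
}

/* Host-side stand-in for a flash driver, only for trying the sketch on a PC. */
static uint8_t  fake_flash[4][LOG_PAGE_SIZE];
static unsigned fake_writes;

static int fake_write_page(uint32_t page_index, const uint8_t *data)
{
    memcpy(fake_flash[page_index], data, LOG_PAGE_SIZE);
    fake_writes++;
    return 0;
}
```

With this shape, three 100-byte messages cost one flash write rather than three, and the remaining 44 bytes sit in RAM until the next page fills or something calls `log_store_flush`.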
Depending on the complexity of the system, having full logging enabled all the time might not be feasible due to the volume of messages produced. In such cases adding support to dynamically enable and disable different log categories and/or levels will be a great benefit. Ensure the log settings are persisted and loaded early on startup. This will allow you to successively home in on problem areas when troubleshooting without getting overwhelmed by excessive logging.
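One simple way to get per-category control is a small table of thresholds, one per category, checked before anything is formatted. The sketch below is only an illustration: the category names, the level names and the `WARN` default are all made up, and `vprintf` stands in for whatever log sink you actually have.

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { LOG_ERROR, LOG_WARN, LOG_INFO, LOG_DEBUG } log_level_t;

/* Hypothetical categories; use whatever subsystems your device has. */
typedef enum { LOG_CAT_NET, LOG_CAT_SENSOR, LOG_CAT_STORAGE, LOG_CAT_COUNT } log_cat_t;

/* One threshold per category. This small array is the thing to persist
   and to reload early during startup. */
static uint8_t log_threshold[LOG_CAT_COUNT] = { LOG_WARN, LOG_WARN, LOG_WARN };

void log_set_level(log_cat_t cat, log_level_t level)
{
    if ((unsigned)cat < LOG_CAT_COUNT)
        log_threshold[cat] = (uint8_t)level;
}

/* Non-zero if a message of this category and level should be emitted. */
int log_enabled(log_cat_t cat, log_level_t level)
{
    return (unsigned)cat < LOG_CAT_COUNT && (uint8_t)level <= log_threshold[cat];
}

void log_printf(log_cat_t cat, log_level_t level, const char *fmt, ...)
{
    if (!log_enabled(cat, level))
        return;              /* filtered out before any formatting cost */
    va_list ap;
    va_start(ap, fmt);
    vprintf(fmt, ap);        /* stand-in for the real log sink */
    va_end(ap);
}
```

Turning up a single noisy area is then one call, for example `log_set_level(LOG_CAT_NET, LOG_DEBUG)`, which is exactly the kind of thing you want reachable from a command.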
Naturally, logging things is only useful as long as you can actually access the stored logs. Which neatly brings us to the next topic.
Command line interface
You are going to want some manner of command line support. The ability to run commands both to inspect and modify state is golden. A simple space-delimited command parser using a dispatch table will only cost you some twenty lines of code. The first command I tend to implement is “help”, which just walks the command table and prints the entries. It’s exceptionally useful for debugging the parser you’ve just written, and exceptionally helpful a year later when you’ve forgotten all about this project but find yourself needing to troubleshoot a device. A “dump” command is often the second command I add, for those times when you just need to inspect raw memory, be it RAM or flash.
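For a sense of scale, a space-delimited parser with a dispatch table can look roughly like the sketch below. The names (`cli_cmd_t`, `cli_execute`, the `echo` command) are mine for illustration, and `printf` stands in for whatever console output you have. Note that `strtok` modifies the line in place and is not reentrant, which is usually acceptable for a single console but worth knowing.

```c
#include <stdio.h>
#include <string.h>

#define CLI_MAX_ARGS 8

typedef int (*cli_fn)(int argc, char **argv);

typedef struct {
    const char *name;
    const char *help;
    cli_fn      fn;
} cli_cmd_t;

static int cmd_help(int argc, char **argv);
static int cmd_echo(int argc, char **argv);

/* The dispatch table: adding a command means adding one line here. */
static const cli_cmd_t cli_table[] = {
    { "help", "list available commands",  cmd_help },
    { "echo", "print the arguments back", cmd_echo },
};
#define CLI_TABLE_LEN (sizeof cli_table / sizeof cli_table[0])

/* Walk the command table and print every entry. */
static int cmd_help(int argc, char **argv)
{
    (void)argc; (void)argv;
    for (size_t i = 0; i < CLI_TABLE_LEN; i++)
        printf("%-8s %s\n", cli_table[i].name, cli_table[i].help);
    return 0;
}

/* Print the arguments back; returns how many were echoed. */
static int cmd_echo(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s%s", argv[i], i + 1 < argc ? " " : "\n");
    return argc - 1;
}

/* Split the line in place on whitespace and run the matching command.
   Returns the command's result, or -1 for empty or unknown input. */
int cli_execute(char *line)
{
    char *argv[CLI_MAX_ARGS];
    int argc = 0;

    for (char *tok = strtok(line, " \t\r\n");
         tok != NULL && argc < CLI_MAX_ARGS;
         tok = strtok(NULL, " \t\r\n"))
        argv[argc++] = tok;

    if (argc == 0)
        return -1;

    for (size_t i = 0; i < CLI_TABLE_LEN; i++)
        if (strcmp(argv[0], cli_table[i].name) == 0)
            return cli_table[i].fn(argc, argv);

    printf("unknown command: %s\n", argv[0]);
    return -1;
}
```

Because `cli_execute` only needs a mutable line of text, it does not care where that line came from, which is what makes the more creative transports later on possible.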
“Okay”, I hear you say, “but we don’t have an available user interface”. Not to worry. This is where the fun starts. This is where you get to be creative. This is where you get to do clever things that will actually pay off and provide real value. This is where you make the actual development of the device easier for yourself.
“But I don’t/won’t even have a serial console!” I hear you lament. Patience, we’ll get to it, I promise. During development you actually are quite likely to have a serial console available, and it’s an excellent first step for making use of your shiny command line interface. As you’re building the various features, keep adding commands that let you inspect or configure said features, be they GPIOs, or WiFi connectivity or something else altogether. Use the commands for your own debugging as you develop; don’t rely on the crutch of JTAG. While it’s absolutely lovely while it’s available, once disabled or inaccessible it can leave you crippled in terms of system visibility. If instead you’ve grown to only rely on the commands you’ve provided yourself, you’ll not only still have visibility, but you’ll also have built enough coverage over the features by virtue of having used those commands all throughout development.
Not only that, but you also get an interface through which to run automated tests! With a little bit of careful design, your command line functions map cleanly to whatever API/user functionality you have exposed, so it effectively becomes possible to drive the device in isolation, in a representative manner. Whether you use something as classic as expect(1) or integrate with a more modern test framework, you have not only device control but device observability to a level that lends itself well to in-depth testing.
Do keep your security model in mind when you’re hooking up your command line interface. Depending on the device you may need to ensure it is only accessible after suitable authorisation rather than being open for everyone, or only available after having been explicitly enabled via a secure channel. For some products though, a freely accessible command line can be a selling point for the tinkering crowd. The key thing is to be aware, and not break your security model whatever it is.
The creative
So now that you have your debugging/testing interface we can focus on what you can do to make/keep it accessible in the field. Rather than tell you what you could do, let me tell you about some of the things we have done, and let that serve as inspiration.
USB Mass Storage
One product we worked on was effectively a consumer grade data acquisition device with a limited user interface. It exposed only a USB connection, and enumerated as a Mass Storage Device (MSD). We had wanted to have it as a composite device that also exposed a virtual serial port, but due to the lack of OS drivers at the time, that had to be ruled out. So, we got creative!
First of all, we added a couple of hidden virtual files to the MSD that provided read-only access to the full RAM and flash contents. Since we were getting callbacks for all “disk” access anyway, it was fairly easy to intercept requests for certain blocks, and fill the I/O buffer from a different source. This setup meant that we could ask customers to zip up the full folder structure and attach to the support tickets, and it would give us a near complete snapshot of the device.
Why include the flash contents, you ask? Excellent question! Aside from telling us the precise firmware version, it also gave us access to historical console log data. While we didn’t have enough flash to set aside an area for logging, the device supported Over-The-Air (OTA) upgrades, which meant that for most of the time there was an entire partition slot sitting unused. We (very carefully) repurposed it so we could also use it as a circular buffer for log data. So, by getting the full flash contents included in a support ticket we automatically got a good view of the sequence of events that led up to the issue being reported.
We also wrote a small utility that would take the .map file and the RAM dump and process that into an easy-to-read listing of all named variables and their values in the device.
That was all well and good for off-line support, but didn’t help us during more interactive debugging sessions. So, we got a bit more creative. While we had set up the FAT partitions so as to have zero unused blocks, we also didn’t have a need for any sort of Master Boot Record (MBR), which meant we actually did have a few hundred bytes of unused space in the first block.
So, we virtualised that block, and overlaid the console output buffer data there. There was a bit of effort to write a tool that could bypass the OS disk caching so we could read the most recent data all the time, but that too was doable. We also made that block writeable, and intercepted the writes and fed them into our command line interpreter instead. Just like that, we had a functional “serial” console available, despite the device only appearing as a Mass Storage Device!
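The general shape of that trick, reconstructed as a sketch rather than lifted from the product, is shown below. The offsets of the unused region, the callback names and the `storage_read`/`storage_write` stand-ins are all hypothetical, and `last_command` merely records what a real device would pass on to its command line interpreter.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE     512u
#define CONSOLE_LBA    0u     /* the first block, which has the unused bytes */
#define OVERLAY_OFFSET 128u   /* hypothetical start of the unused region */
#define OVERLAY_SIZE   256u   /* hypothetical size of the unused region */

static char   console_out[OVERLAY_SIZE];  /* most recent console output */
static size_t console_len;
static char   last_command[64];           /* stand-in for the CLI input */

/* Stand-ins for the real backing store behind the "disk". */
static void storage_read(uint32_t lba, uint8_t *buf)
{
    (void)lba;
    memset(buf, 0, BLOCK_SIZE);
}

static void storage_write(uint32_t lba, const uint8_t *buf)
{
    (void)lba; (void)buf;
}

/* Append console output, dropping the oldest bytes when space runs out. */
void console_put(const char *s)
{
    size_t n = strlen(s);
    if (n > OVERLAY_SIZE) { s += n - OVERLAY_SIZE; n = OVERLAY_SIZE; }
    if (console_len + n > OVERLAY_SIZE) {
        size_t drop = console_len + n - OVERLAY_SIZE;
        memmove(console_out, console_out + drop, console_len - drop);
        console_len -= drop;
    }
    memcpy(console_out + console_len, s, n);
    console_len += n;
}

/* Read callback: serve the real block, then overlay the console text. */
void msd_read_block(uint32_t lba, uint8_t *buf)
{
    storage_read(lba, buf);
    if (lba == CONSOLE_LBA) {
        memset(buf + OVERLAY_OFFSET, 0, OVERLAY_SIZE);
        memcpy(buf + OVERLAY_OFFSET, console_out, console_len);
    }
}

/* Write callback: writes to the console block become command input
   instead of reaching the backing store. */
void msd_write_block(uint32_t lba, const uint8_t *buf)
{
    if (lba != CONSOLE_LBA) {
        storage_write(lba, buf);
        return;
    }
    const uint8_t *src = buf + OVERLAY_OFFSET;
    size_t n = 0;
    while (n < sizeof last_command - 1 && n < OVERLAY_SIZE &&
           src[n] != '\0' && src[n] != '\n')
        n++;
    memcpy(last_command, src, n);
    last_command[n] = '\0';   /* a real device would hand this to its CLI */
}
```

The host-side tool then only needs uncached reads and writes of that one block, which is where the effort of bypassing the OS disk cache comes in.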
AWS IoT Device Shadow
For another severely resource constrained product which also lacked an accessible console port, we took a different approach. In this case the device was network connected and made use of the AWS IoT Device Shadow feature for managing the device. If you’re not familiar with Device Shadows or similar, they are effectively a cloud-side document containing the most recently reported device data/status, together with any pending requests for changes to said data/status. The pending requests are held until the device is online, creating an asynchronous request buffer in effect. Normally, such change requests are simply in the form of a new desired value for a given property. But we’re not normal, we’re creative! 😉
There is nothing stopping you from having a property called “command” and, when a change request arrives for it, routing its value to your command line interpreter instead. Likewise, there is nothing stopping you from having a “recentlog” property through which you report the last N lines of console output.
Between those two, we have a shadow of a command line (terrible pun intended). Certainly harder to use than many incarnations, but awkward access is a league better than no access!
AWS IoT MQTT
For a slightly cleaner version of the AWS IoT Device Shadow “cleverness”, on a device with a bit more headroom we instead had a single flag in the device shadow indicating whether the command line was active or not. When activated, the device would subscribe to a well-known device-unique MQTT topic, and treat any messages received there as commands to be fed into the command line interpreter. Similarly, while active, any console log output also got directed to a different well-known device-unique MQTT topic. Together, those two topics provided direct access to the full command line functionality of the device.
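As a rough sketch of the gating logic only, the code below keeps the MQTT client behind a small table of function pointers. That `mqtt_ops_t` table, the topic naming scheme and the `fake_*` functions are all assumptions made for illustration; none of them are AWS IoT SDK APIs. On a real device, the message callback for the command topic would pass the payload to the command line interpreter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical thin wrapper around whatever MQTT client is in use.
   Each hook returns 0 on success. */
typedef struct {
    int (*subscribe)(const char *topic);
    int (*unsubscribe)(const char *topic);
    int (*publish)(const char *topic, const char *data, size_t len);
} mqtt_ops_t;

typedef struct {
    const mqtt_ops_t *mqtt;
    bool active;
    char cmd_topic[64];   /* commands arrive here */
    char log_topic[64];   /* console output goes here */
} remote_cli_t;

void remote_cli_init(remote_cli_t *rc, const mqtt_ops_t *mqtt, const char *device_id)
{
    rc->mqtt = mqtt;
    rc->active = false;
    /* Made-up topic scheme; anything device-unique and well-known will do. */
    snprintf(rc->cmd_topic, sizeof rc->cmd_topic, "devices/%s/cli/in", device_id);
    snprintf(rc->log_topic, sizeof rc->log_topic, "devices/%s/cli/out", device_id);
}

/* Call whenever the flag in the device shadow changes. */
void remote_cli_set_active(remote_cli_t *rc, bool want)
{
    if (want == rc->active)
        return;
    if (want) {
        if (rc->mqtt->subscribe(rc->cmd_topic) != 0)
            return;                       /* stay inactive on failure */
    } else {
        rc->mqtt->unsubscribe(rc->cmd_topic);
    }
    rc->active = want;
}

/* Call from the console output path. Nothing is published, and so nothing
   is paid for, unless the remote command line is active. */
bool remote_cli_log(remote_cli_t *rc, const char *line)
{
    if (!rc->active)
        return false;
    return rc->mqtt->publish(rc->log_topic, line, strlen(line)) == 0;
}

/* Host-side fakes so the sketch can be exercised without a broker. */
static int fake_subs, fake_pubs;
static int fake_subscribe(const char *t)   { (void)t; fake_subs++; return 0; }
static int fake_unsubscribe(const char *t) { (void)t; fake_subs--; return 0; }
static int fake_publish(const char *t, const char *d, size_t n)
{
    (void)t; (void)d; (void)n;
    fake_pubs++;
    return 0;
}
static const mqtt_ops_t fake_mqtt = { fake_subscribe, fake_unsubscribe, fake_publish };
```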
Why not always have the command line active, you ask? Because of the cost of always reporting log messages up via MQTT. Most of the time there would be nobody paying attention to it anyway.
One very helpful command we had on this device was the “dumplog” command, which replayed the contents of the circular log buffer, giving us an easy way of looking into the past. Crucially, this command didn’t attempt to dump the whole buffer in one go, but chunked it up and printed it with enough time in between to allow it to make its way out over MQTT.
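The chunking itself is simple enough to sketch. In the version below, which is an illustration and not the original code, the 128-byte chunk size is an assumption and the caller is expected to publish each chunk and then call `dumplog_next` again from a timer tick or similar, so that the pacing lives outside this code.

```c
#include <stddef.h>

#define DUMP_CHUNK 128u  /* assumed to fit comfortably in one publish */

typedef struct {
    const char *buf;        /* the circular log buffer */
    size_t      size;       /* its capacity in bytes */
    size_t      head;       /* index of the next byte to replay */
    size_t      remaining;  /* bytes still to be replayed */
} dumplog_t;

/* Begin a replay of 'count' bytes starting at the oldest byte, 'head'. */
void dumplog_start(dumplog_t *d, const char *buf, size_t size, size_t head, size_t count)
{
    d->buf = buf;
    d->size = size;
    d->head = size ? head % size : 0;
    d->remaining = count < size ? count : size;  /* never more than one lap */
}

/* Copy the next chunk into 'out' (at least DUMP_CHUNK bytes long).
   Returns the number of bytes copied, or 0 once the replay is complete. */
size_t dumplog_next(dumplog_t *d, char *out)
{
    size_t n = d->remaining < DUMP_CHUNK ? d->remaining : DUMP_CHUNK;
    for (size_t i = 0; i < n; i++)
        out[i] = d->buf[(d->head + i) % d->size];  /* handles wrap-around */
    if (n > 0)
        d->head = (d->head + n) % d->size;
    d->remaining -= n;
    return n;
}
```

Keeping the replay state in a small struct means the command can return immediately and leave the rest to the periodic tick, rather than blocking the interpreter while the log trickles out.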
Concluding thoughts
Building command line support into devices provides an excellent debugging aid, and it generally comes with only a small cost that is far outweighed by the time and effort it ends up saving during the lifetime of a product. It is of course not a silver bullet; there are no such things. Together with a fit-for-purpose logging implementation, though, it does get remarkably close, all things considered. So go, be creative!