The rise of multi-MCU systems in embedded design
As time has passed, the complexity of computer systems everywhere has continued to grow, and embedded systems are no exception. Where once a single microcontroller (MCU) may have comprised the entire system, it is now not uncommon to find two or more microcontrollers even in smaller embedded systems. Often this is due to a separation of responsibilities and capabilities, such as adding Bluetooth or WiFi capability to an existing design. Other reasons, such as the need for ultra-low power consumption in a battery-backed device or the real-time constraints of certain data processing, can also necessitate more than one processor. The Powersensor is one example that illustrates all of these. In yet other cases, client requirements call for even more complex systems: a recent product comprised five MCUs!
Navigating the complexity: More MCUs, more challenges
Going from a single MCU to two MCUs in a design does not merely double the complexity, however; it is probably closer to quadrupling it, if not more. Not only does it add a second firmware to develop and maintain, it also adds a requirement for inter-MCU communication, raises the need for firmware distribution within the system as well as for a power/reset control scheme or hierarchy, and increases the overall cognitive load by splitting the code base. Often the two MCUs come from different vendors and use different SDKs and programming styles, adding further complexity.
Weighing the costs: Is a multi-MCU design worth it?
This is not to say that a multi-MCU design is inherently bad — it’s not. But it does come with costs which need to be acknowledged. Where it is possible to avoid expanding into multi-MCU territory, it is often worthwhile to do so. We have had at least one instance where a client brought us in to rescue a badly designed embedded system, and we found that a lot of the bugs in the system could’ve been avoided simply by sticking to a single MCU. By consolidating the logic onto the one MCU we were able to reduce complexity and increase robustness in one stroke. The old adage of “keep it simple” very much still applies.
When faced with a multi-MCU system, there are three core aspects that need to be carefully considered and handled:
- Firmware handling and upgrades
- Power and reset control of all MCUs
- Inter-MCU communication
Firmware handling: The foundation of multi-MCU success
The first of these, firmware handling and upgrades, ties in with the other two aspects, but is in many ways the foundation upon which everything else depends. A seemingly simple approach is to let each MCU manage its own firmware and upgrades as if it were a standalone system. This has some severe drawbacks, however. Firstly, a secondary MCU may not have the network interface needed to download its own updates; this is typically the case when the very reason for the second MCU is to add network connectivity. Secondly, and more importantly, letting each MCU handle its own firmware risks the firmware versions within a unit getting out of sync. This is an unavoidable consequence of the approach: even if the versions can usually be kept in sync, the design still requires implementing whatever logic is needed to handle both forwards and backwards compatibility between versions across the MCUs. This task quickly reaches Herculean proportions, as the number of potential version combinations multiplies with each release and with each additional MCU. The regression testing alone quickly becomes unmanageable.
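One way to quantify the problem is a simple counting argument, assuming any release on one MCU may end up paired with any release on another. If N MCUs each have R_i firmware releases that could be running in the field, the number of version combinations that the compatibility logic (and the regression tests) must cope with is the product of those counts:

```latex
C = \prod_{i=1}^{N} R_i, \qquad R_1 = \dots = R_N = R \;\Rightarrow\; C = R^{N}
```

With R = 10 releases, two MCUs already give 100 combinations and three give 1000, whereas shipping a single bundle for all MCUs keeps the count at 10.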
To avoid this issue, in our designs we endeavour to designate one of the MCUs as the firmware master, which is then responsible for disseminating the firmware to the other MCUs. In all cases this involves downloading a single bundle containing firmware for all MCUs. This way it’s guaranteed that the versions are consistent. In some cases, this is achieved by actually embedding the other firmwares within the firmware for the master MCU. The precise details will depend on the situation and hardware, but the key is to make the firmware download atomic — either firmware for all MCUs is downloaded, or none. Of course, that is only the first half of the problem.
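To illustrate the atomic download, the sketch below verifies a fully downloaded bundle sitting in a staging area before anything is committed. The bundle layout (`bundle_header_t`, `bundle_entry_t`, `BUNDLE_MAGIC`) is hypothetical, invented for this example, and assumes fields are stored in the MCU's native byte order; the checksum is the common reflected CRC-32 (polynomial 0xEDB88320). Only if `bundle_verify()` succeeds for the whole bundle would the master mark it valid, for example by writing a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bundle layout (illustrative, not a real format): a header, a
 * table with one entry per MCU image, then the image payloads. Fields are
 * assumed to be in the MCU's native byte order. */
#define BUNDLE_MAGIC      0x424E444Cu /* "BNDL" */
#define BUNDLE_MAX_IMAGES 8u

typedef struct {
    uint32_t magic;
    uint32_t image_count;
} bundle_header_t;

typedef struct {
    uint32_t target_mcu; /* which MCU this image is for */
    uint32_t offset;     /* payload offset from the start of the bundle */
    uint32_t size;       /* payload size in bytes */
    uint32_t crc32;      /* CRC-32 of the payload */
} bundle_entry_t;

/* Reflected CRC-32 (polynomial 0xEDB88320), bitwise for brevity. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Check every image in a staged bundle. Returns true only if all entries are
 * in bounds and all CRCs match; otherwise the whole bundle is discarded. */
bool bundle_verify(const uint8_t *buf, size_t len)
{
    bundle_header_t hdr;

    assert(buf != NULL);
    if (len < sizeof hdr)
        return false;
    memcpy(&hdr, buf, sizeof hdr);
    if (hdr.magic != BUNDLE_MAGIC || hdr.image_count == 0 ||
        hdr.image_count > BUNDLE_MAX_IMAGES)
        return false;
    size_t table_end = sizeof hdr + hdr.image_count * sizeof(bundle_entry_t);
    if (table_end > len)
        return false;
    for (uint32_t i = 0; i < hdr.image_count; i++) {
        bundle_entry_t e;
        memcpy(&e, buf + sizeof hdr + i * sizeof e, sizeof e);
        /* Bounds check written so it cannot overflow. */
        if (e.offset < table_end || e.offset > len || e.size > len - e.offset)
            return false;
        if (crc32_calc(buf + e.offset, e.size) != e.crc32)
            return false;
    }
    return true;
}
```

The design choice here is that committing is a separate, single step: either every image in the bundle passes its checks and the one flag is written, or nothing changes and a partial or corrupted download is never half-applied.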
The art of firmware distribution in multi-MCU systems
The second half is installing the new firmware on all the MCUs, and this distribution problem within the system can be addressed in a couple of ways. For the master MCU, an A/B redundancy scheme becomes more or less mandatory. Technically it may be possible to do without it when flash space is tight, but it is often far more beneficial to simply use a chip with more flash. The usual benefit, being able to automatically recover from a firmware upgrade that somehow breaks network communication or upgradeability, is well worth it. You may refer to my previous post on this topic for more details.
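A minimal sketch of the A/B slot selection a master's boot loader might perform is shown below. It assumes a hypothetical `boot_meta_t` record kept in a dedicated, power-fail-safe flash or EEPROM area, and a watchdog that resets the MCU if a new image hangs. A newly installed image boots in a trial state and must call `ab_confirm()` once it has verified that it can still reach the network and accept upgrades; if it has not done so within `MAX_TRIAL_BOOTS` boots (an arbitrary value here), the boot loader falls back to the previous slot. All names are illustrative rather than taken from any particular boot loader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical boot metadata, persisted by the caller after every change. */
#define MAX_TRIAL_BOOTS 3u

typedef enum { SLOT_A = 0, SLOT_B = 1 } slot_t;

typedef struct {
    slot_t   active;      /* slot to boot */
    bool     trial;       /* true until the new image confirms itself */
    uint32_t trial_boots; /* boots attempted while still in trial */
} boot_meta_t;

static slot_t other_slot(slot_t s) { return s == SLOT_A ? SLOT_B : SLOT_A; }

/* Called by the boot loader on every boot. Returns the slot to run and updates
 * the metadata, which the caller writes back before jumping to the image. */
slot_t ab_select_slot(boot_meta_t *m)
{
    assert(m != NULL);
    if (m->trial) {
        if (m->trial_boots >= MAX_TRIAL_BOOTS) {
            /* The new image never confirmed itself: revert to the old one. */
            m->active = other_slot(m->active);
            m->trial = false;
            m->trial_boots = 0;
        } else {
            m->trial_boots++;
        }
    }
    return m->active;
}

/* Called by the installer after writing a new image into the inactive slot. */
void ab_stage_update(boot_meta_t *m)
{
    m->active = other_slot(m->active);
    m->trial = true;
    m->trial_boots = 0;
}

/* Called by the running firmware once it has proven it can still be upgraded. */
void ab_confirm(boot_meta_t *m)
{
    m->trial = false;
    m->trial_boots = 0;
}
```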
Next comes the more challenging part. After upgrading the master MCU, how do you guarantee that the other MCU(s) upgrade successfully? Here, the key is to use a _pull_ approach. The other MCU(s) are given boot loaders which, on each and every boot, pull the current firmware from the master MCU. Where possible, running the firmware straight out of RAM makes this especially attractive, as it avoids any possibility of flash erase/write errors. Even with MCUs that require the firmware to be written to flash, the pull approach is the superior solution. In the unlikely event of a flash error, the device must be considered faulty regardless, and having the boot loader hang and refuse to progress is a safe and reasonable error behaviour compared to letting it continue and run corrupted code. There is also a slim chance that rolling back to the previous firmware version remains possible, depending on the specific circumstances of the flash error.
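The pull approach could look roughly like the following on a secondary MCU that runs from flash. The hooks in `pull_ops_t` (asking the master for the image size and CRC, reading chunks, erasing and writing local flash) are hypothetical stand-ins for whatever link protocol and flash driver a given design uses, and a small host-side simulation is included so the logic can be exercised on a PC. Comparing CRCs before writing is a design choice of this sketch to avoid needless flash wear; the master is still consulted on every boot. On `PULL_FLASH_FAULT` the boot loader would hang, and on `PULL_NO_MASTER` it would retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hooks: the link to the master MCU and the local flash driver,
 * passed as function pointers so the logic can be exercised on a PC. */
typedef struct {
    bool (*get_image_info)(uint32_t *size, uint32_t *crc); /* ask the master */
    bool (*read_chunk)(uint32_t offset, uint8_t *dst, uint32_t len);
    uint32_t (*local_crc)(uint32_t size); /* CRC of first `size` bytes of flash */
    bool (*flash_erase)(uint32_t size);
    bool (*flash_write)(uint32_t offset, const uint8_t *src, uint32_t len);
} pull_ops_t;

typedef enum {
    PULL_BOOT_OK,    /* local image matches the master's copy: jump to it */
    PULL_NO_MASTER,  /* master not answering: retry, eventually escalate */
    PULL_FLASH_FAULT /* flash error: hang rather than run corrupted code */
} pull_result_t;

#define CHUNK_SIZE 256u

/* Run on every boot, before jumping to the application. */
pull_result_t pull_firmware(const pull_ops_t *ops)
{
    uint32_t size, crc;
    uint8_t chunk[CHUNK_SIZE];

    assert(ops != NULL);
    if (!ops->get_image_info(&size, &crc))
        return PULL_NO_MASTER;
    if (ops->local_crc(size) == crc)
        return PULL_BOOT_OK; /* already current: skip needless flash wear */
    if (!ops->flash_erase(size))
        return PULL_FLASH_FAULT;
    for (uint32_t off = 0; off < size; off += CHUNK_SIZE) {
        uint32_t n = (size - off < CHUNK_SIZE) ? size - off : CHUNK_SIZE;
        if (!ops->read_chunk(off, chunk, n))
            return PULL_NO_MASTER; /* image is incomplete: never boot it */
        if (!ops->flash_write(off, chunk, n))
            return PULL_FLASH_FAULT;
    }
    /* Read back and verify before trusting what was written. */
    return ops->local_crc(size) == crc ? PULL_BOOT_OK : PULL_FLASH_FAULT;
}

/* ---- Host-side simulation (test double for the link and the flash) ---- */
static uint8_t sim_master[600], sim_flash[1024];
static bool sim_master_up = true, sim_flash_broken = false;

static uint32_t crc32_calc(const uint8_t *p, size_t len) /* reflected CRC-32 */
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
static bool sim_info(uint32_t *size, uint32_t *crc)
{
    if (!sim_master_up)
        return false;
    *size = sizeof sim_master;
    *crc = crc32_calc(sim_master, sizeof sim_master);
    return true;
}
static bool sim_read(uint32_t off, uint8_t *dst, uint32_t len)
{
    if (!sim_master_up || off + len > sizeof sim_master)
        return false;
    memcpy(dst, sim_master + off, len);
    return true;
}
static uint32_t sim_local_crc(uint32_t size) { return crc32_calc(sim_flash, size); }
static bool sim_erase(uint32_t size)
{
    if (sim_flash_broken || size > sizeof sim_flash)
        return false;
    memset(sim_flash, 0xFF, size);
    return true;
}
static bool sim_write(uint32_t off, const uint8_t *src, uint32_t len)
{
    if (sim_flash_broken || off + len > sizeof sim_flash)
        return false;
    memcpy(sim_flash + off, src, len);
    return true;
}
const pull_ops_t sim_ops = { sim_info, sim_read, sim_local_crc, sim_erase, sim_write };
```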
Power and reset control: The master-other MCU relationship
Here is where the tie-in to the second point comes up. For the master MCU to ensure each other MCU gets back into its boot loader and thus pulls the new firmware, it needs to have control over the other MCUs’ reset lines. With such a configuration, the otherwise hairy upgrade procedure becomes simply:
- Download and install master MCU firmware, as if it were a standalone system
- Reset the other MCU(s)
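The master's side of the second step might be as small as the sketch below. It assumes a hypothetical `reset_hw_t` hook that drives each secondary's reset line through the vendor's GPIO API, and an assumed minimum pulse width of 10 ms (the real figure comes from each MCU's datasheet). Calling `reset_secondaries()` from the master's start-up code, once its new image is running, means every secondary restarts into its boot loader and pulls firmware matching the master's bundle. A host-side test double that records the reset-line activity is included.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical board hooks: drive a secondary MCU's reset line and wait. On
 * real hardware these would wrap the vendor's GPIO and timer APIs. */
typedef struct {
    void (*set_reset)(unsigned mcu, bool asserted);
    void (*delay_ms)(uint32_t ms);
} reset_hw_t;

#define RESET_PULSE_MS 10u /* assumed minimum pulse width; check the datasheets */

/* Hold every secondary in reset for the pulse width, then release them all so
 * each one boots into its boot loader and pulls the current firmware. */
void reset_secondaries(const reset_hw_t *hw, unsigned count)
{
    assert(hw != NULL);
    for (unsigned i = 0; i < count; i++)
        hw->set_reset(i, true);
    hw->delay_ms(RESET_PULSE_MS);
    for (unsigned i = 0; i < count; i++)
        hw->set_reset(i, false);
}

/* ---- Host-side test double: records reset-line activity ---- */
static bool sim_line[4];
static unsigned sim_pulses[4];
static uint32_t sim_waited;

static void sim_set_reset(unsigned mcu, bool asserted)
{
    if (sim_line[mcu] && !asserted)
        sim_pulses[mcu]++; /* count completed assert-then-release pulses */
    sim_line[mcu] = asserted;
}
static void sim_delay_ms(uint32_t ms) { sim_waited += ms; }
const reset_hw_t sim_hw = { sim_set_reset, sim_delay_ms };
```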
Of course, the master MCU’s firmware must contain the necessary support to feed the firmware to the requestor(s), but that is far less onerous a requirement than trying to handle backwards and forwards compatibility. While we have established that the master MCU needs to have reset control over the other MCU(s), it may also very well be that the other MCU(s) also require the ability to reset the master MCU. For example, in our designs we generally allow the boot loaders to issue a reset request to the master MCU if it has not responded to the firmware download request in a timely manner. This situation could for example arise if the master MCU has received a faulty upgrade. Of course, as the saying goes, “with great power comes great responsibility”. The ability to reset another MCU, especially the master MCU, is very much a great power that needs to be treated with utmost respect to avoid creating pathological behaviours.
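On the master, feeding the firmware to a requestor can be kept stateless: each request names a target image, an offset and a length, and the master copies out the corresponding bytes. The sketch below assumes the secondary images are embedded as read-only arrays in the master's firmware (one of the options mentioned earlier); `fw_image_t`, `fw_serve_chunk()` and the placeholder image bytes are hypothetical. Because the handler keeps no per-requestor state, a secondary that times out can simply repeat its request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: secondary images embedded in the master's firmware. The bytes
 * below are placeholders standing in for a real build artefact. */
typedef struct {
    const uint8_t *data;
    uint32_t size;
} fw_image_t;

static const uint8_t secondary0_img[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0x01, 0x02 };
static const fw_image_t images[] = {
    { secondary0_img, sizeof secondary0_img },
};
#define IMAGE_COUNT (sizeof images / sizeof images[0])

/* Serve one "read chunk" request from a secondary's boot loader. Copies up to
 * `len` bytes (and at most `out_cap`) starting at `offset` into `out`.
 * Returns the number of bytes copied (0 at end of image), or -1 for an
 * unknown target or an offset past the end of the image. */
int32_t fw_serve_chunk(uint32_t target, uint32_t offset, uint32_t len,
                       uint8_t *out, uint32_t out_cap)
{
    assert(out != NULL);
    if (target >= IMAGE_COUNT)
        return -1;
    const fw_image_t *img = &images[target];
    if (offset > img->size)
        return -1;
    uint32_t n = img->size - offset;
    if (n > len)
        n = len;
    if (n > out_cap)
        n = out_cap;
    memcpy(out, img->data + offset, n);
    return (int32_t)n;
}
```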
One common situation where the master MCU may not respond to firmware requests immediately is when it is busy performing its own upgrade. If the other MCUs were to eagerly reset it in an attempt to get their firmware downloaded, they could very well interfere with the ability to upgrade the device in the first place. Special care and long timeouts will be your friends here. In highly complex scenarios, extra levels of arbitration may be required.
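One way to put the advice about long timeouts into practice in a secondary's boot loader is sketched below: the master must be silent for a long window before a reset is requested, and the window doubles after each unanswered reset, up to a cap. `SILENCE_BEFORE_RESET_MS` (five minutes here) is an assumed value that should exceed the master's worst-case self-upgrade time; `master_watch_t` and `watch_poll()` are hypothetical names. Timestamps are compared with unsigned subtraction so the logic survives a wrap of the millisecond counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SILENCE_BEFORE_RESET_MS (5u * 60u * 1000u) /* assumed; > master upgrade time */
#define MAX_BACKOFF_SHIFT 4u                        /* window grows to at most 16x */

typedef struct {
    uint32_t silent_since_ms; /* when the master last answered (or watch began) */
    uint32_t resets_issued;   /* consecutive resets without a response */
} master_watch_t;

void watch_init(master_watch_t *w, uint32_t now_ms)
{
    assert(w != NULL);
    w->silent_since_ms = now_ms;
    w->resets_issued = 0;
}

/* Call periodically from the boot loader's wait loop. Returns true when the
 * boot loader should pulse the master's reset line. */
bool watch_poll(master_watch_t *w, uint32_t now_ms, bool master_responded)
{
    if (master_responded) {
        w->silent_since_ms = now_ms;
        w->resets_issued = 0;
        return false;
    }
    uint32_t shift = w->resets_issued < MAX_BACKOFF_SHIFT ? w->resets_issued
                                                          : MAX_BACKOFF_SHIFT;
    uint32_t window = SILENCE_BEFORE_RESET_MS << shift; /* exponential backoff */
    if ((uint32_t)(now_ms - w->silent_since_ms) < window)
        return false;
    w->resets_issued++;
    w->silent_since_ms = now_ms; /* give the master a full window to recover */
    return true;
}
```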
Recovery planning: Avoiding the pitfalls of reset control
Beyond the boot loaders' ability to trigger a hardware reset of the master MCU, there is the overall recovery plan to consider. One absolute requirement in a standalone system is that it *must* be possible to hardware-reset every MCU in order to recover it to a known state. Simply implementing a reset command over the communications channel between the MCUs is not enough. To use an anecdote, in one system we were brought in to fix, the reset line to the secondary MCU was not connected, and the serial command processor was sufficiently bug-ridden that it was prone to not responding at all. This was quite the pain to deal with, and we ended up having to fall back to a deliberate stack-smashing attack over the serial line to trigger an invalid memory access, which in turn caused the MCU to reset. To avoid having to resort to such horrible workarounds (satisfying as they may be in the end), design in reset control from the very start, and be explicit about under which circumstances each MCU will be reset, and by whom.
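Being explicit about who may reset whom can be as simple as a single table that every reset request is checked against. The sketch below assumes a hypothetical system with a master and two secondaries (`MCU_RADIO` and `MCU_METER`); the reasons and rules are illustrative, and a real policy would reflect the actual design. Anything not listed is refused, so adding a new reset path becomes a visible change to the table rather than a side effect buried in the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical system: one master and two secondaries. */
typedef enum { MCU_MASTER, MCU_RADIO, MCU_METER, MCU_COUNT } mcu_t;

typedef enum {
    RESET_FOR_UPGRADE,  /* master restarts a secondary so it re-pulls firmware */
    RESET_UNRESPONSIVE, /* peer stopped answering on the inter-MCU link */
    RESET_BOOT_TIMEOUT  /* secondary boot loader got no firmware from master */
} reset_reason_t;

typedef struct {
    mcu_t requester;
    mcu_t target;
    reset_reason_t reason;
} reset_rule_t;

/* The complete, written-down reset policy: who may reset whom, and why. */
static const reset_rule_t reset_policy[] = {
    { MCU_MASTER, MCU_RADIO,  RESET_FOR_UPGRADE  },
    { MCU_MASTER, MCU_METER,  RESET_FOR_UPGRADE  },
    { MCU_MASTER, MCU_RADIO,  RESET_UNRESPONSIVE },
    { MCU_MASTER, MCU_METER,  RESET_UNRESPONSIVE },
    { MCU_RADIO,  MCU_MASTER, RESET_BOOT_TIMEOUT },
    { MCU_METER,  MCU_MASTER, RESET_BOOT_TIMEOUT },
};

/* Every reset request goes through here; anything not listed is refused. */
bool reset_allowed(mcu_t requester, mcu_t target, reset_reason_t reason)
{
    assert(requester < MCU_COUNT && target < MCU_COUNT);
    for (size_t i = 0; i < sizeof reset_policy / sizeof reset_policy[0]; i++) {
        const reset_rule_t *r = &reset_policy[i];
        if (r->requester == requester && r->target == target && r->reason == reason)
            return true;
    }
    return false;
}
```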
Embedded design principles: Simplify to master complexity
Design, especially design of embedded devices, is at its heart all about edge cases, and handling them. Good designs reduce the number of edge cases possible in the first place — an edge case avoided is much better than an edge case managed. Clean domain modelling, the use of state machines, and following good code design and implementation practices all help here. Above all, keep it as simple as possible (but no simpler). Complexity you get for free; simplicity you have to work for.