Improved control architecture and strategy for iodine ion thruster following in-orbit demonstration and system-level radiation testing

This paper describes the general control strategy of the NPT30-I2, the improvements made to the software architecture and hardware following the in-orbit demonstration, and the results of a system-level radiation test campaign. It details the iodine ion thruster control architecture, algorithm, and design approach. In addition, it presents data from the in-orbit demonstration. The results of the system-level radiation tests and the redesign of the system to reduce radiation sensitivity are also discussed.


Introduction
The NPT30-I2 is the world's first iodine-based electric propulsion system successfully operated in space. Two versions with different form factors have been developed, 1U and 1.5U, with a total impulse of 5500 Ns and 9500 Ns respectively. The propulsion system includes all the subsystems required for its operation, such as a power-processing unit, iodine propellant storage and flow control management, thermal management, a radio-frequency (RF) plasma source coupled with a two-grid ion acceleration assembly, and a neutralization subsystem.
The main goal in developing the control strategy for the NPT30-I2 was to design a stand-alone system that requires only high-level commands from the satellite and can respond to events autonomously. This approach facilitates integration in small low-cost satellites. The NPT30-I2 uses commercial off-the-shelf (COTS) components, with the possibility of replacing the microcontroller and CAN transceiver with radiation-hardened versions. However, the current new-space context and the space industry's need for multiple constellations of more than 10,000 satellites push toward the use of COTS.
An in-orbit demonstration of the NPT30-I2-1U has been performed to validate the design concepts, and telemetry data have been collected. In addition, a radiation testing campaign was conducted to detect the main failure points of the NPT30-I2 and to improve the hardware and software. Radiation testing is often required for electronic systems used in space [1]. Traditionally this type of testing is performed at the component level [2][3][4], which can be prohibitively expensive and time-consuming, especially for small satellite platforms. Here we present the results of an alternative approach, system-level testing, and the software and hardware modifications made to improve the propulsion system.

Control strategy of NPT30-I2
The NPT30-I2 is a fully integrated electric propulsion system based on RF gridded thruster technology that uses iodine as a propellant [5]. It uses five microcontrollers to control the system, distributed over one main board (MB) and four dependent subsystems, as shown in Fig. 1. The subsystems are the RF Generator (RFG), based on a proprietary non-PLL architecture, the Grid Supply Unit (GSU) controlling both acceleration grids, the Cathode Supply Unit (CSU) powering the neutralizer, and the Flow Control Unit (FCU) handling iodine flow control to the thruster. The subsystems regulate thruster parameters according to the targets received from the MB and locally measured values. The control of the NPT30-I2 is explained in more detail in [6]. The MB uses a SAMV71 microcontroller, which has the advantage of having a pin-compatible radiation-tolerant counterpart (SAMV71RT). The Microchip V71 family also has a radiation-hardened chip (SAMRH71).
The Main Board communicates with the spacecraft using redundant CAN and/or I2C buses. An internal and independent CAN bus is used to communicate with the four subsystems. In addition, the MB manages all safety algorithms and controls the operation process. To perform all these tasks within their timing constraints, the MB runs the FreeRTOS real-time operating system (RTOS).
One of the tasks controls the state machine, i.e., the firing sequence. A simplified representation of this state machine is shown in Fig. 2. The OBC can send the main trigger commands to start the propulsion: "arm firing" and "start the propulsion system". The NPT30-I2 then performs a built-in self-test, starts the ignition phase where the plasma is ignited, and goes to the preparation and operation statuses. While in operation, multiple regulation algorithms can be activated depending on whether the propulsion requires thrust control, input power feedback loop locks, or other options with distributed priorities.
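The firing sequence can be pictured as a small table-driven state machine. The following is a minimal illustrative sketch, not the flight firmware: the status names follow the text, while the event names, the `ARMED` intermediate state, and the transition table are assumptions made for the example.

```python
from enum import Enum, auto


class State(Enum):
    STANDBY = auto()
    ARMED = auto()          # assumed intermediate state after "arm firing"
    SELF_TEST = auto()
    IGNITION = auto()
    PREPARATION = auto()
    OPERATION = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.STANDBY, "arm_firing"): State.ARMED,
    (State.ARMED, "start_propulsion"): State.SELF_TEST,
    (State.SELF_TEST, "self_test_ok"): State.IGNITION,
    (State.IGNITION, "plasma_ignited"): State.PREPARATION,
    (State.PREPARATION, "ready"): State.OPERATION,
    (State.OPERATION, "stop"): State.STANDBY,
}


def step(state: State, event: str) -> State:
    """Return the next state; an emergency always falls back to standby."""
    if event == "emergency_shutdown":
        return State.STANDBY
    # Events that are not valid in the current state are ignored.
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.STANDBY):
    """Feed a sequence of events to the state machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

In this sketch a "start the propulsion system" command received without a prior "arm firing" leaves the system in standby, which mirrors the two-command trigger described above.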
During the different statuses of the system, safety is ensured by multi-level monitoring. The first level is implemented in the four subsystems. Each subsystem implements hard-coded limits for the parameters it measures. If a value is not within the predefined range, the subsystem sends a signal to the MB, which triggers an emergency shutdown. This subsystem-level monitoring allows fast detection of simple anomalies, such as neutralizer current drops or an excessive tank temperature.
On the MB, a safety task runs in parallel with the state machine. This safety task provides a second level of monitoring. As the MB has access to the telemetry of all subsystems, the safety task can detect more complex "global" faults, such as insufficient levels of beam neutralization, ion focusing anomalies, or out-of-range input voltage.
The control and safety algorithms of the NPT30-I2 have been successfully tested during the in-orbit demonstration (IOD) and multiple commercial flights.
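The two monitoring levels can be sketched as follows. This is only an illustration under stated assumptions: the parameter names, the numeric limits, and the neutralization criterion (`min_ratio`) are hypothetical and are not taken from the NPT30-I2 firmware.

```python
# Hypothetical hard-coded limits; real values are not given in the text.
SUBSYSTEM_LIMITS = {
    "CSU": {"neutralizer_current_a": (0.05, 1.0)},
    "FCU": {"tank_temperature_c": (-20.0, 120.0)},
}


def subsystem_fault(unit: str, telemetry: dict) -> bool:
    """Level 1: a subsystem checks its own measurements against fixed limits."""
    for name, (low, high) in SUBSYSTEM_LIMITS[unit].items():
        if not low <= telemetry[name] <= high:
            return True
    return False


def global_fault(telemetry: dict, min_ratio: float = 0.9) -> bool:
    """Level 2: the MB compares values coming from different subsystems.

    Assumed check: the neutralizer current must cover at least
    `min_ratio` of the ion beam current.
    """
    return telemetry["neutralizer_current_a"] < min_ratio * telemetry["beam_current_a"]


def emergency_shutdown_required(telemetry_by_unit: dict) -> bool:
    """Run the level-1 checks per subsystem, then the level-2 check on merged data."""
    merged = {}
    for unit, telemetry in telemetry_by_unit.items():
        if unit in SUBSYSTEM_LIMITS and subsystem_fault(unit, telemetry):
            return True
        merged.update(telemetry)
    return global_fault(merged)
```

The point of the second level is visible in the last function: a neutralization shortfall can only be detected by combining CSU and GSU telemetry, even when each value is within its own subsystem limits.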

In-orbit demonstration results
During the IOD, the propulsion system was fired several times in different operating modes. Detailed results can be found in [5]. Figure 3 shows the thrust and input power of the NPT30-I2-1U during one space maneuver of the IOD mission. The thrust is obtained from the measured grid currents and voltages, using the indirect thrust measurement method described in [5]. Between t = 0 s and t = 1000 s, the figure shows the self-test and ignition statuses. The plasma is ignited around t = 1000 s and the system then enters the operation state, where thrust is generated.
During some firings, a few recoverable issues were detected, for example a GSU high-voltage controller failure, a stuck internal CAN communication, and an internal I2C failure on analog-to-digital converters (ADCs). Most of these errors are handled directly by the safety task of the NPT30-I2 and can be solved with a soft or hard reset of the faulty controller. When a subsystem fault is detected, the MB restarts the related subsystem and resumes the operation. During the IOD, two minor issues at the MB level were also noticed: one on the external CAN transceiver and the other on the microcontroller. Because of the first issue, the OBC was unable to request and receive telemetry from the NPT30-I2 during part of the firing, but the propulsion system continued to operate safely until it was powered down. The second issue was an unexpected watchdog timer (WDT) reset of the microcontroller, likely triggered by a single event functional interrupt (SEFI). Because of this second event, the propulsion system remained in standby and did not operate during one of the time slots, as the OBC did not resend the main process trigger commands.
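The exact indirect thrust method is the one given in [5] and is not reproduced here. As a rough illustration of how thrust can be estimated from grid current and voltage, the sketch below uses the textbook ideal relation for a gridded ion thruster, T = γ · I_b · sqrt(2 · M · V_b / e). It assumes a beam made only of singly charged atomic iodine ions; the `correction` factor γ (beam divergence, multiple ion species) and the parameter names are assumptions of this example.

```python
from math import sqrt

ELEMENTARY_CHARGE_C = 1.602176634e-19
ATOMIC_MASS_UNIT_KG = 1.66053906660e-27
# Mass of one iodine atom (I+ ions assumed; I2+ ions are ignored here).
IODINE_ATOM_MASS_KG = 126.90447 * ATOMIC_MASS_UNIT_KG


def ideal_thrust_n(beam_current_a, beam_voltage_v, correction=1.0):
    """Ideal thrust in newtons from beam current (A) and beam voltage (V).

    correction: assumed lumped factor (<= 1) for divergence and ion species.
    """
    if beam_current_a < 0 or beam_voltage_v < 0:
        raise ValueError("current and voltage must be non-negative")
    ion_momentum_per_charge = sqrt(
        2.0 * IODINE_ATOM_MASS_KG * beam_voltage_v / ELEMENTARY_CHARGE_C
    )
    return correction * beam_current_a * ion_momentum_per_charge
```

With illustrative inputs of 10 mA and 1000 V (values chosen for the example, not flight data), this relation gives about 0.51 mN before any correction.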

Radiation testing
Traditionally, the electronic components used in space products have been rad-hard. However, with the recent growth of micro- and nanosatellites, the industry increasingly employs COTS devices. The advantages of COTS components over rad-hard ones include lower costs, shorter lead times, and increased performance.
To test the system while keeping costs competitive and the development process fast, system-level testing was performed on all five electronic boards of the NPT30-I2. While giving low radiation hardness assurance [7], the tests produce useful information on functional reliability, self-recovery, and failure modes of the NPT30-I2. The tests also yield reference values for the highest proton beam energy and the maximum total dose that can be sustained without irreversible system failure. Finally, they provide guidance on potential preventive measures to be implemented.
It should be noted that a component-level TID test was performed by Open Source Satellite on the SAMV71 microcontroller. The results showed that the chip can withstand more than 60 krad [8].

Setup
Two radiation tests have been conducted on two separate sets of the electronic assemblies, the single event effects (SEE) test and the total ionizing dose (TID) test, following ESCC basic specifications 25100 [9] and 22900 [10] respectively as much as possible.
The SEE test was performed with a proton beam and a moderator that can provide energies between 40 and 200 MeV at a flux between 10⁵ and 10⁸ protons/s/cm² and a fluence up to 10¹¹ protons/cm². For the TID test, a Co-60 gamma source was used. All subsystems were mounted on a frame facing the beam, as shown in Fig. 5. In the SEE test the subsystems were positioned in a single plane directly exposed to the beam, whereas in the TID test, because of a narrower beam angle, the subsystems were located behind each other.
The load unit in both test setups had to be outside the irradiation beam and as close as possible to the DUT to minimize EMI from the load cables. The two tests were conducted at different facilities with different arrangements. The TID irradiation beam was calibrated to be homogeneous over an area of 10 × 10 cm at a distance of 100 cm from the radioactive source. The irradiation area and distance are determined by the facility, so it was not possible to irradiate all subsystems of the DUT at the same time in a single irradiation plane. Since gamma radiation passes through matter and the attenuation by the FR4 substrate, copper planes, and components is insignificant, the DUT boards were stacked behind each other facing the irradiation beam. The SEE test facility allows all submodules of the DUT to be arranged in a single plane: the DUT is located 175 cm from the beam moderator and the beam is scanned over an area of 30 × 20 cm on the irradiation plane.
The tests were conducted with the boards operating as close as possible to real working conditions. All subsystems were powered continuously unless a critical fault requiring power cycling was observed. Each subsystem was connected to a representative load (e.g., plasma, neutralizer, fuel tank heaters) and ran in its high-power mode for 10 min followed by a 3 min cool-down period, after which the operating cycle was repeated. An exception to this logic is the RFG, which was powered for 5 s followed by 55 s of standby. Figure 6 shows the operating cycle of the different boards. The operation cycling of the test unit was synchronized with the required downtime of the radiation beam to maximize the fluence during high-power modes.
All internal system parameters were logged at a sampling rate of 4 Hz. In addition to the internal parameters, an external data acquisition system located outside the radiation beam monitored the input current of the device under test (DUT), the internal bus supply voltages, and the output voltage and current of each subsystem. This allows the values read by the internal sensors to be compared with the actual values read by the data acquisition system, which helps with the analysis of the radiation test data. While in the TID test the beam irradiates all subsystems simultaneously, in the SEE test the beam cone covers only a portion of the tested unit, and the irradiation was therefore done in steps. During the test, the fluence map was recorded to help the analysis. Figure 7 shows an example of this fluence map, after scanning row 6 at 100 MeV. At this energy level, the beam needs 9 steps to fully irradiate the DUT.

SEE results
Several non-fatal errors were observed before a destructive SEE caused the loss of communication at 200 MeV and a fluence of 10 × 10⁹ p/cm². These errors were noticed from the lowest energy level of 40 MeV and were mostly due to failures of the two CAN communication interfaces: the internal one between the MB and the subsystems, and the external one between the MB and the control station. Examples of some of the errors are shown below. In the following figures, the "beam on" area, in blue, indicates that the proton beam is on.

Communication failure with a subsystem
The loss of communication is detected when a subsystem does not respond to the telemetry requests from the MB. This type of error occurred at energy levels of 70 MeV and higher. An example is shown in Fig. 8, where the CSU and DAQ cathode voltages are measured by the Cathode Supply Unit and the external data acquisition device respectively. In this example, after the communication failure, the cathode is still active, but the target heating current value cannot be updated by the MB. Interrupting the firing and power cycling the CSU recovered the communication.

Main board microcontroller latch-up
At multiple beam energy levels, from 40 to 200 MeV, irradiation of the MB or FCU caused a sudden rapid increase in the internal temperature of the microcontroller, as shown in Fig. 9. This temperature increase is not coupled with a significant current increase. It is due to an internal latch-up of the microcontroller. In Fig. 9, the MB temperature is measured by the microcontroller, and MB VDD is the 3.3 V supply line that powers this microcontroller. When the board is power cycled, this voltage goes to 0 V, then back to 3.3 V.
Apart from the temperature increase, no other errors were observed and the microcontroller was operating normally. A manual power cycle of the affected board was performed when the event occurred to prevent unrecoverable damage.

Sample error
Other observed errors are related to the system parameters monitored and controlled by the subsystems. This type of error includes both single-sample errors on analog voltage readings (Fig. 10) and offset errors (Fig. 11). Because of a single erroneous reading, the proportional-integral-derivative controller changes the reference of the DC/DC converter, which causes the actual screen grid voltage to increase. After a few more samples, the microcontroller reads the correct value. A larger offset in the reading could be more critical, as it propagates and creates a thrust offset. If the offset is high, the MB stops the operation, power cycles the subsystem to recover, and restarts the operation.

TID results
The electronic subsystems of the NPT30-I2 with COTS components reached a total gamma radiation dose of 14.28 krad at a dose rate of 3.33 krad/h before a short-term irreversible loss of communication between the MB and the test control station occurred. After 23 h the test was restarted in its initial configuration and a final dose of 17.82 krad was reached. In both cases, the error occurred in the CAN communication interface.
Other than the loss of communication, no other errors were noticed on the DUT. However, unexpected instabilities in the microcontroller temperature readings were observed during irradiation, as shown in Fig. 12 b), whereas the temperature was stable during the test without the radioactive source, Fig. 12 a). Note that boards using the same microcontroller (MB/FCU and GSU/CSU) show the same behavior.
While the radiation tests give reference values for the radiation levels that can be tolerated by the COTS version of the thruster, the analysis of the observed errors is used to implement preventive measures both in software and hardware.

Improved solutions
Following the IOD and the radiation test campaign, improvements were made to the software and hardware of the NPT30-I2. To reduce the propagation of single event transients (SETs) on analog readings, the number of values averaged for critical sensors has been increased. In addition, the software has been improved so that it can discard wrong readings. This section presents two other major improvements made to the propulsion system: the protections against single event latch-ups (SELs) and single event upsets (SEUs).
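The text states only that more samples are averaged and that wrong readings are discarded; it does not give the rejection rule. A minimal sketch of one possible approach is shown below, where samples far from the median are dropped before averaging. The median-based rule and the `max_deviation` threshold are assumptions of this example.

```python
from statistics import mean, median


def filtered_reading(samples, max_deviation):
    """Average the samples after dropping those far from the median.

    max_deviation: assumed, sensor-specific threshold in the same unit
    as the samples.
    """
    if not samples:
        raise ValueError("no samples")
    centre = median(samples)
    kept = [s for s in samples if abs(s - centre) <= max_deviation]
    # With an even number of widely spread samples, nothing may be kept;
    # fall back to the median in that case.
    return mean(kept) if kept else centre
```

A single corrupted sample, such as the one that disturbed the screen grid voltage regulation during the SEE test, is then excluded from the value passed to the controller instead of shifting its input.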

Single event latch-up protection
The most concerning issue detected during the radiation testing is the SEL inside the MB and FCU microcontrollers. If not addressed quickly enough, an SEL can damage the system. One way to detect an SEL is through the rapid increase of the microcontroller temperature.
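Detection from the temperature rise can be sketched as a rate-of-change check on the most recent samples of the microcontroller's internal temperature sensor. The window length and the rate threshold below are hypothetical; the text gives no numeric criterion.

```python
def sel_suspected(temps_c, sample_period_s, max_rate_c_per_s, window=4):
    """Flag a possible latch-up from a fast rise of the die temperature.

    temps_c: temperature samples in degrees Celsius, oldest first.
    max_rate_c_per_s: assumed threshold; not specified in the text.
    window: assumed number of recent samples used for the slope.
    """
    if len(temps_c) < window:
        return False
    recent = temps_c[-window:]
    rate = (recent[-1] - recent[0]) / ((window - 1) * sample_period_s)
    return rate > max_rate_c_per_s
```

Using a slope over several samples rather than a single difference makes the check less sensitive to one erroneous temperature reading.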
A soft reset is not enough to recover from the effect; the microcontroller must be power cycled. For an FCU failure, there is a straightforward way to clear the SEL, as the subsystem microcontrollers can be power cycled using the dedicated power switches on the MB as soon as the event has been detected. However, in the hardware version of the MB used for the radiation testing, the MB microcontroller cannot power cycle itself, and the system relies on the power source to power cycle the NPT30-I2.
A new version of the MB has been designed to allow the MB to power cycle itself. Instead of always enabling the 3.3 V regulator, the ON/OFF pin is connected to two pulse generators in series, as shown in Fig. 13. The first pulse generator, "A", connected to the "CPU_ALIVE" signal, has a pulse duration set to around 0.1 s. The second, "B", is set to approximately 5 s.
By default, when the program is running, it feeds the "CPU_ALIVE" line with a pulse-width modulation (PWM) signal with a period of 50 ms. This PWM signal keeps retriggering pulse generator A, maintaining its output low, and thus the output of pulse generator B and the ON/OFF pin high. This way, the 3.3 V regulator is enabled.
When the microcontroller detects an SEL, it saves the operation context in the non-volatile memory (NVM) and disables the PWM signal, as shown in Fig. 14.
After 100 ms, pulse generator A times out and its output goes high. This rising edge triggers pulse generator B, which drives its output low. This disables the 3.3 V regulator and turns off the microcontroller. After 5 s, the output of pulse generator B returns to the high state, which turns the microcontroller back on. When the MB restarts, it checks the previous firing and reset context, and resumes firing if needed.
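The timing of this self power-cycle circuit can be checked with a small idealized model. The sketch below uses the durations quoted in the text and ignores propagation delays; the function names are hypothetical, and the model assumes the microcontroller restarts the PWM as soon as power returns.

```python
PWM_PERIOD_S = 0.05   # CPU_ALIVE period from the text
TIMEOUT_A_S = 0.1     # pulse generator A duration
PULSE_B_S = 5.0       # pulse generator B duration


def pwm_keeps_regulator_on(pwm_period_s=PWM_PERIOD_S):
    """A is retriggered before it times out only if the PWM period is shorter."""
    return pwm_period_s < TIMEOUT_A_S


def regulator_enabled(t, last_pwm_edge_t):
    """State of the 3.3 V regulator ON/OFF pin at time t (seconds).

    last_pwm_edge_t: time of the last CPU_ALIVE edge before the PWM stopped.
    """
    a_goes_high = last_pwm_edge_t + TIMEOUT_A_S   # A times out, triggers B
    b_goes_high = a_goes_high + PULSE_B_S         # B pulse ends, power returns
    return not (a_goes_high <= t < b_goes_high)
```

In this model the regulator is off for 5 s starting 100 ms after the last PWM edge. The same sequence occurs if the program hangs and stops toggling the line, which is a property of this model rather than a claim made in the text.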

Bootloader improvement
Data and code are stored within NVMs, directly on the microcontroller's flash memory, which is sensitive to radiation effects [11], especially for the COTS version of the microcontroller. The NPT30-I2 MB and subsystems implement a bootloader that greatly enhances the in-flight debugging capability and allows the system to be completely reprogrammed if the application is corrupted or if the firmware needs to be modified for any reason. The architecture of the MB and subsystem memory is shown in Fig. 15.
When an NPT30-I2 board is powered on, the microcontroller starts in bootloader mode. The bootloader flowchart is shown in Fig. 16. By default, it checks the integrity of the user application by running a secure hash algorithm (SHA) over the application's memory area and comparing the result with the hard-coded value that was set during the firmware upload. If the two hashes are equal, the microcontroller branches into the application. If the hashes do not match, or if the user requested to stay in bootloader mode, the microcontroller remains in the bootloader, where the whole application memory can be reprogrammed.
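The integrity check at boot can be sketched as follows. The text says only that a secure hash algorithm is used; SHA-256 is an assumption of this example, as are the function names and the representation of the application area as a byte string.

```python
import hashlib


def application_is_valid(app_image: bytes, stored_digest: bytes) -> bool:
    """Compare the hash of the application area with the stored reference."""
    return hashlib.sha256(app_image).digest() == stored_digest


def boot_target(app_image: bytes, stored_digest: bytes, stay_in_bootloader: bool) -> str:
    """Decide where to branch after power-on."""
    if stay_in_bootloader or not application_is_valid(app_image, stored_digest):
        return "bootloader"
    return "application"
```

A single flipped bit in the application area changes the digest, so the board stays in the bootloader and can be reprogrammed instead of running corrupted code.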
However, the bootloader itself may be corrupted. This could remove the ability to update the application, or even prevent the microcontroller from branching into the application, making the propulsion system unusable. To decrease the probability of bootloader corruption, its architecture has been restructured, as shown in Fig. 17. The improved bootloader is composed of three levels: two new levels that run first, followed by the previous bootloader, which becomes boot level 3. The first level is a small, non-redundant piece of code that randomly branches into one of the two copies of level 2, 2a or 2b. This first level consists of only a few dozen instructions, occupying 116 bytes of program memory, which greatly reduces the risk of corruption.
The second level of the bootloader is used to build the third level, by running a triple-voting algorithm on the voters a, b, and c, and writing the result to the level 3 result memory area. The second level has more instructions than the first, with approximately 3800 bytes of program memory, but has "same design redundancy" with two identical code sections, 2a and 2b. If a corrupted copy of boot level 2 is selected by the first level and the program gets into a deadlock, the watchdog timer triggers a restart after a few seconds. The first level of the bootloader will eventually select the uncorrupted copy of boot level 2. The safety lies in the fact that the probability of corrupting both copies of boot level 2 is low.
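The random selection combined with the watchdog can be illustrated with a short simulation. It is a simplified model under stated assumptions: a corrupted copy always ends in a deadlock and a watchdog reset, and the two copies are selected with equal probability. The function name and the `max_resets` bound are hypothetical.

```python
import random


def resets_until_boot(copy_ok: dict, rng: random.Random, max_resets: int = 100):
    """Count watchdog resets before an intact level-2 copy is selected.

    copy_ok maps "2a" and "2b" to True when that copy is intact.
    Returns None when no intact copy is found within max_resets.
    """
    for resets in range(max_resets + 1):
        choice = rng.choice(["2a", "2b"])   # level 1: random branch
        if copy_ok[choice]:
            return resets                   # level 2 runs normally
        # Corrupted copy: deadlock, then the watchdog restarts level 1.
    return None
```

Under this model, with one corrupted copy, each boot attempt succeeds with probability 1/2, so the expected cost is a single watchdog reset. Only the corruption of both copies prevents booting.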
After the triple-voting algorithm, the bootloader branches into boot level 3, where it is possible to reprogram the application memory area, as in the previous version of the bootloader. The boot level 3 memory area is much bigger than the first two levels of the bootloader, with around 30,000 bytes of program memory, and is more prone to memory corruption. The triple modular redundancy ensures the integrity of the third level by correcting any single-bit corruption, as shown in Fig. 18.
To corrupt a bit in the boot level 3 result, i.e., the output of the triple-voting algorithm, the same bit must be corrupted in two different voters, which is much less probable.
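Bitwise triple voting can be written in a few lines. The sketch below is a generic two-out-of-three majority vote over three byte strings standing in for voters a, b, and c; it is not the flight implementation, and the function name is hypothetical.

```python
def vote(a: bytes, b: bytes, c: bytes) -> bytes:
    """Bitwise two-out-of-three majority of three equally sized images."""
    if not len(a) == len(b) == len(c):
        raise ValueError("voters must have the same length")
    # A result bit is 1 when at least two of the three voters have it set.
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
```

Different bits flipped in different voters are all corrected, whereas the same bit flipped in two voters survives the vote, which is exactly the failure case described above.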
The flowchart in Fig. 19 summarizes the improved three-level bootloader. The improved bootloader increases the reliability of the system because it limits the risk of deadlock due to program memory corruption. Before the improvement, the bootloader was stored between 0x0 and 0x8000, and any corruption in this memory area could cause the program to hang.
In the new bootloader architecture, this risk is reduced by multiple redundancies. To corrupt the bootloader, a corruption must occur either in the 116 bytes of the first boot level, or in both copies of boot level 2 (2a and 2b), or on exactly the same bit in two voters of the third level. This new three-level bootloader has been extensively tested. Thanks to the new architecture and multiple redundancies, the microcontrollers should be more resilient to radiation and SEUs.

Conclusion
The article describes the radiation test campaign performed on the power processing unit of the world's first iodine ion thruster operated in space. The purpose of this radiation test campaign was not limited to system qualification but rather aimed to highlight the most critical problems and to improve the system with respect to various radiation environments. Some of the detected issues were confirmed by the in-orbit demonstration, such as the single-event functional interrupt and soft restart of the main board microcontroller, or single-event transients on some analog readings. Another particular problem has also been discovered, linked to single-event latch-ups on the main board and flow control unit microcontrollers. The NPT30-I2 hardware and software have consequently been improved to make the propulsion system less sensitive to space radiation and able to detect problems and apply corrective measures. The next step is to perform more extensive tests on several devices rather than a single one, to obtain statistical results in accordance with ESCC basic specifications 25100 and 22900. The test software will also be improved so that it can detect and log all upsets, and not only their consequences.

Nomenclature
ADC = Analog-to-digital converter
CAN = Controller area network
COTS = Commercial off-the-shelf
CSU = Cathode supply unit
DUT = Device under test
FCU = Flow control unit
GSU = Grid supply unit
I2C = Inter-integrated circuit
IOD = In-orbit demonstration
NVM = Non-volatile memory
OBC = On-board computer
PWM = Pulse-width modulation
RF = Radio-frequency
RFG = Radio-frequency generator
RTOS = Real-time operating system
SEE = Single event effect