Dear colleagues,
Update: Because our vendor indicated that the issues were likely the result of a hardware failure, we replaced both online Hardware Security Modules (HSMs) on Friday evening. This resolved most of the issues, but not all. After further investigation, we learned that a memory leak in the Security world software was causing the remaining issue.
We downgraded the software to the stable version on the acceptance environment on Monday afternoon. Once we had confirmed that the downgrade was safe, we performed it on production the next morning. This resolved all issues.
It is worth noting that since Saturday morning, there has been no noticeable downtime of the RPKI Dashboard and publication service.
We are in contact with the vendor about upgrading to a newer and safer version of the Security world software.
Kind regards, Stella Vouteva
On 24 Jun 2022, at 10:32, Stella Vouteva <svouteva@ripe.net> wrote:
Dear colleagues,
Yesterday, we upgraded the Security world software on our RPKI core servers. The upgrade was finished at approximately 08:45 UTC. We tested the upgrade and verified that everything worked before enabling the RPKI Dashboard again.
At approximately 10:50 UTC, we received an alert from our monitoring that showed an error for both our online Hardware Security Modules (HSMs). While we immediately started the investigation of this alert, we also decided to temporarily stop RPKI Core to keep a consistent state. This also meant that we had to temporarily close down the RPKI Dashboard.
At 11:22 UTC, we contacted our vendor, as we had never seen this behaviour before. A consultant from our vendor advised a reboot of the HSMs, which we performed at 11:55 UTC. After the reboot, the HSMs came back online and we enabled RPKI Core and the RPKI Dashboard. It is still unknown whether the upgrade was the direct cause of the errors, as the error was very generic.
While we work on finding the root cause, we still need to reboot systems and HSMs occasionally. Each reboot makes the RPKI Dashboard unavailable for a few minutes, and it takes a bit longer than usual for objects to be published in our repository. As soon as we have more information, we will share it here.
As a result of this outage, we will speed up the process to replace the online HSMs, which we described in our recent RIPE Labs article [0].
Kind regards, Stella Vouteva
[0]: https://labs.ripe.net/author/ties/securing-the-ripe-ncc-trust-anchor/