RPKI Outage Post-Mortem
Summary: Yesterday, on 7 January 2021, an issue with our RPKI software caused an inconsistent certificate to be published from 15:29-16:20 (UTC+1). This may have resulted in outages. We strongly recommend network operators update their Relying Party software to the latest version.

Details: At 15:06 (UTC+1) yesterday, we processed an outgoing transfer of IP resources to another RIR service region. This caused our system to update the corresponding RPKI certificates in our Certificate Authority (CA). Unfortunately, our RPKI software published the updated parent certificate (production CA) ahead of its child certificate (member CA). As a result, in the period immediately after the updated parent was published, the child certificate (updated later) contained resources that were no longer on the updated parent, and the child certificate over-claimed. This was resolved once the child certificate was updated.

Currently we have three separate processes:

* One that updates the resources in the registry in RPKI (every 15 min)
* One that updates the resources of the RIPE production CA (parent of all member CAs) from the registry (1h, takes ~5 min)
* One that updates the resources for member CAs from the registry (1h, takes ~40 min)

If there is an outgoing transfer and the production CA update picks up the change before the member CA update does, the over-claiming situation occurs. The update of the member CA needs to happen at the same time (i.e. same RRDP delta) as, or before, the reduction of the production CA resources. This does not happen the other way around (and so is not an issue with incoming resources).

Some older Relying Parties had applied a strict manifest handling interpretation in their validator software. This meant that they were configured to reject all certificates in the manifest if a single entry was invalid. As a consequence, all RPKI certificates covering RIPE resources were rejected by these validators during this period.
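The ordering problem described above can be sketched in a few lines of Python. This is only an illustration, not the RIPE NCC implementation: the function name `overclaimed` is hypothetical, the resource sets use documentation prefixes rather than the prefixes involved in the incident, and a child prefix is treated as covered only if a single parent prefix contains it (a simplification).

```python
from ipaddress import ip_network


def overclaimed(child, parent):
    """Return the child's prefixes that no single parent prefix covers."""
    return [
        c for c in child
        if not any(c.version == p.version and c.subnet_of(p) for p in parent)
    ]


# Hypothetical resource sets (documentation prefixes, not incident data).
parent_before = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]
child_before = [ip_network("192.0.2.0/25"), ip_network("198.51.100.0/24")]

# An outgoing transfer removes 198.51.100.0/24 from the registry.
parent_after = [ip_network("192.0.2.0/24")]
child_after = [ip_network("192.0.2.0/25")]

# Parent published first while the child is still stale: over-claim.
stale = overclaimed(child_before, parent_after)

# Child re-issued afterwards: consistent again.
fixed = overclaimed(child_after, parent_after)

# Child shrunk before the parent: there is no inconsistent window.
safe_order = overclaimed(child_after, parent_before)

print("stale child over-claims:", stale)
print("after re-issue:", fixed)
print("child shrunk first:", safe_order)
```

The last case shows why shrinking the child first (or in the same RRDP delta) is safe: a child that holds less than its parent allows is never over-claiming.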
Based on our access logs, we estimate that 327 instances of Relying Party software were impacted.

On Monday 11 January, we will implement a fix so that every time a RIPE NCC certificate changes, we will look at all members to see if their certificates are over-claiming and force an immediate re-issue if so. This approach does not give us a 100% bullet-proof fix to the problem, but it reduces the period of over-claiming from an hour to a couple of minutes. We will work on reducing this time to less than a minute, to further reduce the potential for inconsistency. In the longer term, we will work on implementing atomic publishing of data for this type of situation.

In the meantime, we strongly recommend that network operators update their RPKI Relying Party software to the latest version:

* Routinator 0.8.2
* rpki-client 6.8p1
* FORT 1.4.2
* octorpki 1.2.2
* RIPE NCC 3.2-2020.12.10.13.57

Best regards,

Nathalie Trenaman
Routing Security Programme Manager
RIPE NCC
Dear group,

I think the situation is a bit more involved. I'd like to split yesterday's situation into three aspects:

1) RIPE NCC's hosted CA service can indeed attempt to shrink children before shrinking parents. The upcoming January 11th change to improve the ordering of signing events is welcome news, but really only helps CAs hosted by RIPE NCC under the RIPE NCC Trust Anchor.

2) Some versions of routinator, RIPE NCC's validator, and OctoRPKI handled manifests incorrectly, resulting in large VRP drops for those RPs. These vendors released updates about one month ago, which do address some shortcomings. Great! However, the problem of losing visibility of all ROAs under the 'KpSo3VVK5wEHIJnHC2QHVV3d5mk.mft' manifest remains: the transfer of 81.199.64.0/20 from RIPE NCC to ARIN resulted in an outage for two ROAs which continued to exist under the RIPE Trust Anchor.

3) Validators following RFC 6487 consider 'rpki.ripe.net/repository/DEFAULT/UBgr7pqgEMH_0tgE9qp7FL3bkfc.cer' in its *entirety* invalid, thereby also rejecting these two ROAs: AS61317,81.199.112.0/24-24,RIPE + AS61317,81.199.113.0/24-24,RIPE.

This 6487 behavior of course is bad news for users of 81.199.112/23, as the covered prefixes became 'not-found' instead of a steady state 'valid' in the BGP DFZ. Some may say 'whatever', but to me this seems brittle for no good reason; it reduces the RPKI's reliability unnecessarily.

Luckily there is a solution: RFC 8360 describes a more graceful and robust validation procedure in which an overclaiming but otherwise valid CA is 'trimmed' by the constraints derived from its parents, allowing ROAs which are wholly validly covered by the entire chain to not be rejected.

More info: https://www.internetsociety.org/blog/2018/04/new-rfc-8360-rpki-validation-re...
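To make the difference between the two validation strategies concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from any validator: the names `validate_refuse` and `validate_trim` are hypothetical, each ROA is modelled as a single prefix, certificates are reduced to plain prefix lists, and the data uses documentation prefixes that merely mirror the shape of the incident (one transferred block, two ROAs that should have stayed valid).

```python
from ipaddress import ip_network


def covers(resources, prefix):
    """True if a single prefix in `resources` contains `prefix`."""
    return any(
        prefix.version == r.version and prefix.subnet_of(r) for r in resources
    )


def intersect(child, parent):
    """Intersection of two prefix lists (two CIDR blocks nest or are disjoint)."""
    out = []
    for c in child:
        for p in parent:
            if c.version != p.version:
                continue
            if c.subnet_of(p):
                out.append(c)
            elif p.subnet_of(c):
                out.append(p)
    return out


def validate_refuse(parent, child, roa_prefixes):
    """RFC 6487 style: any over-claim invalidates the whole child CA."""
    if not all(covers(parent, c) for c in child):
        return []
    return [r for r in roa_prefixes if covers(child, r)]


def validate_trim(parent, child, roa_prefixes):
    """RFC 8360 style: trim the child to what the parent still holds."""
    effective = intersect(child, parent)
    return [r for r in roa_prefixes if covers(effective, r)]


# Hypothetical data: the parent has already lost 198.51.100.0/24,
# the child certificate is stale and still lists it.
parent = [ip_network("192.0.2.0/24")]
child = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]
roas = [
    ip_network("192.0.2.0/25"),
    ip_network("192.0.2.128/25"),
    ip_network("198.51.100.0/24"),
]

refused = validate_refuse(parent, child, roas)
trimmed = validate_trim(parent, child, roas)
print("refuse:", refused)
print("trim:", trimmed)
```

Under 'refuse' the two unaffected ROAs are lost along with the transferred one; under 'trim' only the ROA for the transferred block drops out.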
RFC 8360 specifies a bunch of new OIDs and a new improved validation policy, however this new goodness *only* is activated when both the signer sets OID 1.3.6.1.5.5.7.14.3 as policy (instead of 1.3.6.1.5.5.7.14.2 which is what RIPE NCC hosted CA sets), AND the validator is able to recognise the new OID. A high barrier!

This smells like a contrived case of IETF politics where some folks didn't want to change an existing default even though it harms internet operations, thus themselves and everyone else... It seems awkward to come up with a great solution, but ... not enable it by default.

I think a solution to expedite deployment of RFC 8360 is for validators to just apply the RFC 8360 validation strategy to objects on which the RFC 6487 policy OIDs are set. A PKIX policy violation in the public interest.

The following rpki-client patch proposal changes the 'refuse' policy to 'trim':
https://marc.info/?l=openbsd-tech&m=161011710120123&w=2

In routinator 'Trim' could always be the strategy rather than 'Refuse'
(https://github.com/NLnetLabs/rpki-rs/blob/7cf083b90f97a383ab44e86995151cc5c5...)

In FORT I also see an 'if ... else ...' which can probably be changed to just always use nid_certPolicyRpkiV2().
https://github.com/NICMx/FORT-validator/blob/f8f97c489dceecda60f8cc7d70991c6...

I didn't inspect RIPE NCC's validator, operators should shut those instances down and use something else.

Kind regards,

Job
participants (2)

* Job Snijders
* Nathalie Trenaman