Re: [routing-wg] Changes to the RRDP repository

28 Oct 2021

Hi Randy,

On 27 Oct 2021, at 19:45, Randy Bush <randy@psg.com> wrote:

>> We aim to keep this simple at an initial stage, closely monitor how
>> the environment behaves
>
> i am deeply interested in how a CA and a PP (and RP and routers) are
> measured and monitored.  in general, i am scared to death of the growing
> deployment of rpki and rov with so little, if any, measurement.
>
> so i would beg/encourage you to publish how you do this and maybe even
> think of making your tools more generally useful.

We have made a significant investment in monitoring and alerting, using
Prometheus. I will introduce the part of our monitoring relevant to the
repository content below (there is more), and we will include an update on
monitoring in our RIPE 83 presentation.

We have metrics for the Certification Authority (CA) system, monitor Relying
Party software instances, and run tools specifically for monitoring. We also run
smoke-tests (via the UI) and an end-to-end test that validates that VRPs for a
ROA created by a user become visible to RPs.

The metrics in the CA system are mostly for liveness (e.g. "job x is running
successfully"), ongoing publication, and error rates. We do test (hosted) CA
creation/deletion in our staging environment - but not in our production
environment because we do not have the two (hosted, delegated) production LIR
accounts required.
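
A liveness metric of the "job x is running successfully" kind could look
roughly like the sketch below. It is only an illustration: the metric name,
the label, the job name, and the one-hour threshold are hypothetical and are
not taken from the actual CA system. It renders one gauge in the Prometheus
text exposition format and shows the staleness test an alert rule would apply.

```python
import time

# Hypothetical metric name; the CA system's real metric names are not given here.
METRIC = "ca_job_last_success_timestamp_seconds"


def render_gauge(job_name: str, last_success: float) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC} Unix time of the last successful run of a job.\n"
        f"# TYPE {METRIC} gauge\n"
        f'{METRIC}{{job_name="{job_name}"}} {last_success}\n'
    )


def is_stale(last_success: float, now: float, max_age_seconds: float) -> bool:
    """True if the job has not succeeded within max_age_seconds."""
    return now - last_success > max_age_seconds


if __name__ == "__main__":
    now = time.time()
    # Assumed example: a job that last succeeded two minutes ago.
    print(render_gauge("publish_objects", now - 120), end="")
    print("stale:", is_stale(now - 120, now, max_age_seconds=3600))
```

Exporting a "last success" timestamp rather than a boolean lets the alert
threshold live in the alerting rules instead of in the application.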

As a liveness check for the publication server instances, we check when the
publication server last received a publish and a withdraw, and when the most
recent notification.xml was written (both through an RP, via the serial, and
directly).
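
The serial-based part of such a check can be sketched as follows. This is a
minimal example under stated assumptions, not the actual monitoring code: it
assumes that comparing the session_id and serial of two successive fetches of
notification.xml is a reasonable progress signal. The element and attribute
names follow RFC 8182; the sample document, its URI, and its hash value are
made up.

```python
import xml.etree.ElementTree as ET

# XML namespace of RRDP documents, as defined in RFC 8182.
RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"

# Made-up sample; a real notification file also lists delta elements.
SAMPLE = """<notification xmlns="http://www.ripe.net/rpki/rrdp" version="1"
    session_id="9df4b597-af9e-4dca-bdda-719cce2c4e28" serial="3">
  <snapshot uri="https://example.net/rrdp/snapshot.xml" hash="AB12"/>
</notification>"""


def parse_notification(xml_text: str) -> tuple[str, int]:
    """Return (session_id, serial) from an RRDP notification file."""
    root = ET.fromstring(xml_text)
    if root.tag != RRDP_NS + "notification":
        raise ValueError(f"unexpected root element: {root.tag}")
    return root.attrib["session_id"], int(root.attrib["serial"])


def has_advanced(previous: tuple[str, int], current: tuple[str, int]) -> bool:
    """True if the repository moved on: a new session or a higher serial."""
    prev_session, prev_serial = previous
    cur_session, cur_serial = current
    return cur_session != prev_session or cur_serial > prev_serial


if __name__ == "__main__":
    state = parse_notification(SAMPLE)
    print(state)
    print(has_advanced(state, (state[0], state[1] + 1)))
```

A monitor would store the previous (session_id, serial) pair and raise an
alert when has_advanced() stays false for longer than an agreed interval.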

Furthermore, we have three types of checks on the content of the repository. For
this, we have two endpoints on the CA system: "hash and filename of all files in
the repository" and "all VRPs".

For the files, using an internal tool, we check that:
  * All files in the CA "filename+hash" endpoint are present in each repository
    (rsync instances, publication server instances, rrdp.ripe.net) after they
    have had time to converge.
  * There are not too many "leftover" files in any of the repository
    instances.
  * No objects in the repository expire within the next ~13.5 hours.
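
The three file checks above can be sketched as set comparisons. This is not
the internal tool itself, only an illustration under assumptions: the
RepoObject type, the compare() function, and the idea that each repository
object carries a single expiry time are hypothetical. The only value taken
from the text is the ~13.5 hour margin.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RepoObject:
    """One object found in a repository instance (hypothetical shape)."""
    filename: str
    sha256: str
    not_after: datetime  # assumed: the time at which the object expires


def compare(ca_files, repo, now, expiry_margin=timedelta(hours=13.5)):
    """Compare the CA's filename -> hash mapping with one repository instance.

    Returns (missing, leftover, expiring) as sorted lists of filenames.
    """
    repo_hashes = {obj.filename: obj.sha256 for obj in repo}
    # Expected by the CA but absent from the repository, or with another hash.
    missing = sorted(n for n, h in ca_files.items() if repo_hashes.get(n) != h)
    # Present in the repository but no longer listed by the CA.
    leftover = sorted(n for n in repo_hashes if n not in ca_files)
    # Expiring within the margin, whether or not the CA still lists them.
    expiring = sorted(o.filename for o in repo if o.not_after - now < expiry_margin)
    return missing, leftover, expiring


if __name__ == "__main__":
    now = datetime(2021, 10, 28, 12, 0, tzinfo=timezone.utc)
    ca = {"a.roa": "h1", "b.cer": "h2"}
    repo = [
        RepoObject("a.roa", "h1", now + timedelta(days=1)),
        RepoObject("old.mft", "h9", now + timedelta(hours=2)),
    ]
    print(compare(ca, repo, now))
```

"Not too many" leftovers would then be a threshold on len(leftover), and the
comparison would only run after the instances have had time to converge.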

Using RP instances, we check that:
  * All VRPs in the CA system show up in the effective VRPs within
    <time_threshold> (using rtrmon).

Because we monitor that no files are mismatching between the CA system and the
repositories, this check implies that the VRPs are visible in all the repository
instances.
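
The VRP check can be illustrated in the same way. The sketch below is not how
rtrmon works internally; it only shows the comparison being described. The VRP
tuple, the first-seen bookkeeping, and the function name are assumptions, and
the threshold is left as a parameter because <time_threshold> is not specified
above.

```python
from typing import NamedTuple


class VRP(NamedTuple):
    """A validated ROA payload: origin AS, prefix, and maximum length."""
    asn: int
    prefix: str
    max_length: int


def overdue_vrps(ca_first_seen, rp_vrps, now, threshold_seconds):
    """Return VRPs the CA has had for longer than the threshold but the RP lacks.

    ca_first_seen maps each VRP in the CA system to the Unix time at which
    the monitor first saw it there (hypothetical bookkeeping).
    """
    return sorted(
        vrp
        for vrp, first_seen in ca_first_seen.items()
        if vrp not in rp_vrps and now - first_seen > threshold_seconds
    )


if __name__ == "__main__":
    old = VRP(64496, "192.0.2.0/24", 24)
    new = VRP(64497, "198.51.100.0/24", 24)
    ca = {old: 0.0, new: 950.0}
    # The RP has neither VRP, but only the old one is past the threshold.
    print(overdue_vrps(ca, set(), now=1000.0, threshold_seconds=600.0))
```

Tracking when each VRP first appeared in the CA keeps freshly created ROAs
from raising alerts before the RPs have had a chance to fetch them.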

Please let us know if there is interest in the tool we use to compare
repositories. We might add that to our roadmap if there is interest.

Kind regards,
Ties

Ties de Kock