Hi Randy,
On 27 Oct 2021, at 19:45, Randy Bush <randy@psg.com> wrote:
We aim to keep this simple at an initial stage, closely monitor how the environment behaves
i am deeply interested in how a CA and a PP (and RP and routers) are measured and monitored. in general, i am scared to death of the growing deployment of rpki and rov with so little, if any, measurement.
so i would beg/encourage you to publish how you do this and maybe even think of making your tools more generally useful.
We have made a significant investment in monitoring and alerting, using Prometheus. I will introduce the part of our monitoring relevant to the repository content below (there is more), and we will include an update on monitoring in our RIPE 83 presentation.

We have metrics for the Certification Authority (CA) system, monitor Relying Party (RP) software instances, and run tools specifically for monitoring. We also run smoke tests (via the UI) and an end-to-end test that validates that VRPs for a ROA created by a user become visible to RPs.

The metrics in the CA system are mostly for liveness (e.g. "job x is running successfully"), ongoing publication, and error rates. We do test (hosted) CA creation/deletion in our staging environment, but not in our production environment, because we do not have the two (hosted, delegated) production LIR accounts required.

As a liveness check for the publication server instances, we check when the publication server received the last withdrawal and publish, and when the most recent notification.xml was written (using an RP via serial and directly).

Furthermore, we have three types of checks on the content of the repository. For this, we have two endpoints on the CA system: "hash and filename of all files in the repository" and "all VRPs".

For the files, using an internal tool, we check that:

* All files in the CA "filename+hash" endpoint are present in each repository (rsync instances, publication server instances, rrdp.ripe.net) after they have had time to converge.
* There are not too many "leftover" files in each of the repository instances.
* No objects are present in the repo that are about to expire within ~13.5 hours.

Using RP instances, we check that:

* All VRPs in the CA system show up in the effective VRPs within <time_threshold> (using rtrmon). Because we monitor that no files are mismatching between the CA system and the repositories, this check implies that the VRPs are visible in all the repository instances.
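For illustration only, the file comparison described above could be sketched roughly as follows in Python. This is not the internal tool mentioned in this mail: the names `RepoDiff`, `compare_repository` and `is_healthy`, the `{filename: hash}` input shape, and the `max_leftover` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RepoDiff:
    missing: set = field(default_factory=set)     # listed by the CA, absent from the repository
    mismatched: set = field(default_factory=set)  # present in both, but the hashes differ
    leftover: set = field(default_factory=set)    # in the repository, not listed by the CA


def compare_repository(ca_files, repo_files):
    """Compare two {filename: hash} mappings (hypothetical input shape)."""
    ca_names = set(ca_files)
    repo_names = set(repo_files)
    return RepoDiff(
        missing=ca_names - repo_names,
        mismatched={n for n in ca_names & repo_names if ca_files[n] != repo_files[n]},
        leftover=repo_names - ca_names,
    )


def is_healthy(diff, max_leftover=10):
    # max_leftover is an assumed threshold for "not too many leftover files".
    return not diff.missing and not diff.mismatched and len(diff.leftover) <= max_leftover


# Made-up example data: one stale file, one missing file, one leftover file.
ca = {"a.roa": "h1", "b.mft": "h2", "c.crl": "h3"}
repo = {"a.roa": "h1", "b.mft": "stale", "old.roa": "h9"}
diff = compare_repository(ca, repo)
print(diff)
```

In practice such a comparison would only be evaluated after the repositories have had time to converge, and would be run once per repository instance.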
Please let us know if there is interest in the tool we use to compare repositories. We might add that to our roadmap if there is interest.

Kind regards,
Ties