Hi Job,
On 13 Jul 2021, at 12:57, Job Snijders via routing-wg <routing-wg@ripe.net> wrote:
Hi,
On Mon, Jul 12, 2021 at 10:23:20AM +0200, Daniel Karrenberg wrote:
Nathalie pointed us to https://www.ripe.net/manage-ips-and-asns/resource-management/rpki/rpki-plann... a while ago. Among other things it says:
“In preparation for the improved RPKI repository architecture, the distributed nature of the RRDP repository is going to be implemented using containers and krill-sync that pulls data from the centralised on-premise repository. This greatly simplifies smooth transitioning between publication servers without any downtime.
NOTE: We are not referring to cloud technologies here, just to our internal deployment technologies.”
The silence here worries me.
What silence?!
Over the last few months there have been quite a few mail threads in this working group about RPKI and RPKI outage incidents, and NCC staff have provided updates during the virtual RIPE meetings in the Routing WG slot.
To me the roadmap seems to reflect the sentiment that reliability is the key objective at this moment in time.
I would like to see some feedback from this group on whether this is what you want to see happening. The RIPE Routing WG is the forum for giving guidance to the RIPE NCC about RPKI. I know other channels exist too, and that is fine. I also know that individuals here seem to be happy with what is happening. However, private channels and conversations are not the way RIPE does this. This group is where the RIPE NCC looks for guidance, and where that guidance gets properly archived and responded to.
To be honest I am not sure what the purpose of krill-sync is.
In May 2021 [1] extensive testing was conducted with the help of the NLNOG RING to see whether krill-sync could be used to power the rsync service, but it turned out there were multiple issues with krill-sync, making it a suboptimal choice. I believe the RIPE NCC ended up deploying a different solution to serve rsync, and my hope is that the recently achieved stability is here to stay, because the current setup seems to work quite nicely.
We are [1] evaluating krill-sync as a tool to build rsync servers that are independent of NFS and can use cached IO. The reason for this is rsync fallback. We see ~139 RPs using the rsync repository (as well as the majority of the NLNOG RING nodes) and >1600 RPs using the RRDP repository [2]. When rsync fallback happens for many RPs, the current infrastructure will likely not scale, even when each RP starts from the last RRDP state. We are evaluating krill-sync because it allows us to build an rsync repository from RRDP and is available as an open-source project.

I recall that while evaluating that krill-sync based environment we found three issues:

* Repository versions need to remain available for two hours _after they last were the current version_ to give slow clients the chance to retrieve them [3] (a sketch of this retention rule follows after the footnotes).
* The modification time of objects needs to be the same (between nodes and between copies for a serial) to prevent additional IOs for RPs.
* There are very slow outliers reading repositories, but keeping versions available for two hours is long enough in practice.

Finding these issues was good: it ensured that they were accounted for in our implementation that writes to NFS. After we reported the relevant issues upstream, they were fixed in krill-sync. The use of the NLNOG RING helped verify the current NFS-based setup, which I agree is working nicely.

Kind regards,

Ties

[1]: https://www.ripe.net/ripe/mail/archives/routing-wg/2021-June/004351.html
[2]: rsync: number of unique IPs reading from /repository yesterday in one hour; hour-to-hour variance is minimal. RRDP: number of unique IPs retrieving notification.xml more than 24 times per day in early July.
[3]: Example: revision 0 is published at 0h00m, revision 1 at 1h59m, and revision 2 at 2h01m (at which point revision 0 is deleted). The files that a client which connected at 1h58m is still reading get deleted out from under it.
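To make the retention rule from the first bullet concrete, here is a minimal sketch in Python. It is purely illustrative rather than code from krill-sync or the RIPE NCC setup; the Serial record, its field names, and the removable() helper are all hypothetical.

from dataclasses import dataclass
from typing import Optional

RETENTION_SECONDS = 2 * 60 * 60  # keep a serial for 2h after it was last current

@dataclass
class Serial:
    number: int
    published_at: float                    # epoch seconds
    superseded_at: Optional[float] = None  # None while still the current version

def removable(serial: Serial, now: float) -> bool:
    # Never delete the current version; otherwise delete only once two
    # hours have passed since it stopped being the current version.
    if serial.superseded_at is None:
        return False
    return now - serial.superseded_at >= RETENTION_SECONDS

# Timeline from footnote [3], in seconds from 0h00m:
rev0 = Serial(number=0, published_at=0, superseded_at=1 * 3600 + 59 * 60)
now = 2 * 3600 + 1 * 60  # 2h01m, when revision 2 is published

print(removable(rev0, now))  # False: revision 0 is kept until 3h59m

Measuring the two hours from publication time instead would allow revision 0 to be removed around 2h00m while a client that connected at 1h58m is still mid-transfer; measuring from the moment a version stops being current avoids exactly the failure described in [3].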