Dear Gert, Hank,

First, our apologies again for the delay in our response. A few of us were taking our summer break and our colleagues didn't want to respond without checking with us first.

To recap, we've outlined our core goals - improve the resilience of our services, become more agile and flexible as an organisation, and focus engineering expertise on our core business. You correctly point out that we haven't really talked about the problems we're trying to solve. Fair point - we're not used to talking about the firefighting that's needed behind the scenes, so we can go over some of this now. A good starting point is that if you take the inverse of the benefits we've listed so far, you find most of the problems we're trying to solve.

1. Improve resilience and availability

We currently host our infrastructure in two data centres in Amsterdam. While they have provided excellent availability so far, users further afield (South America, Oceania, Asia) experience high latency when accessing our services. More importantly, an outage affecting both of these data centres would take all of our services offline. Public cloud providers have many regions available around the globe, allowing us to choose the level of resilience that best fits a particular service - protecting us against multiple hardware failures or natural disasters (remember that we are below sea level here).

2. Become more agile and flexible

We're proud of the stable and highly available services we provide. Here we can credit the expertise and hard work of our engineering staff, but also continuous investment in our infrastructure over time. This has a big footprint - we currently use almost 50 racks across our two data centres. Each piece of hardware has its own lifecycle: procurement, shipping, installation, configuration, patching, upgrading and retirement. With hundreds of servers and pieces of network and storage equipment, this is a continuous operation that takes a lot of time and effort. And hardware maintenance is not even the biggest challenge: our infrastructure doesn't offer much flexibility, and making changes is complex and expensive. It also lacks elasticity, which means we have to estimate demand and over-provision our services to cover any peaks. This makes us less agile, forcing us into long-term commitments and requiring us to pay for a lot of unused or idle resources.

3. Focus engineering expertise on our core business

For each new application or change to our infrastructure, there are many manual steps that require tickets back and forth between separate engineering teams. Getting from idea to reality can take many months, and we can see this impacting our ability to innovate. This is inevitable when attention turns from service excellence to fixing problems and time-consuming, mundane maintenance tasks. We especially don't like this because we often need to react quickly as an organisation, while also being able to experiment with new services in an efficient way. By moving to the cloud, we can build pipelines to deploy code faster, with fewer errors and manual steps, and provide sandbox accounts for engineers to quickly and safely test new technologies. We can also automate security auditing and reporting as much as possible, at all application and infrastructure layers.

There were two good comments on the article recently, from Niall Murphy and Bert Hubert.
We will respond to these soon, but I would like to reference one point Bert makes there, which is essentially "Don't outsource your key capabilities." We completely agree with this (many of us have been reading Bert's article on this topic recently), and it is precisely what we are *not* doing. While it is important to have in-house expertise on all technical layers, some are more important than others. For example, at the physical layer we already use data centre remote hands to replace failed disks, and we generally want to eliminate as much as we can of the repetitive work of unpacking, racking and cabling equipment in the data centre. The resources we save here can be used to double down on the capabilities we want to develop further. We will continue to write our own software, control our deployment pipelines, and configure routers, firewalls, load balancers and storage devices - whether they are physical or virtual, on-premise or in the cloud.

I note Hank's suggestion that we compile a list of outages. I'm reluctant to ask our engineers to spend time on this when I think they'll find we have very resilient services. But past results are not always the best indicator of future performance, and with RPKI especially, I expect that the bar for what we consider acceptable resilience might rise as more and more networks come to rely on it.
> (Also I find "evade the discussion on the list by posting a new lengthy article on labs every few months" not really helpful)
I do want to respond to this point. We sometimes miss a comment or take longer to respond than is acceptable, and this is not something that we take lightly as a company. But I would be disappointed if the community thought we were trying to evade discussion. We are here, we are listening, and we will respond.

With that, it's over to you again - let me know if you feel I've missed anything here.

Regards,

Felipe Victolla Silveira
Chief Operations Officer
RIPE NCC