Re: [atlas]Probe flapping

16 Dec 2010

Intermediate update to keep those interested informed.
I am writing this to keep the engineers free to work
the problem. I do not know the nitty-gritty details, so
this is a general overview.

No conclusions yet.

Architecture:

After registering with the RIPE Atlas network the probes are connected
to "controllers" that handle requests to/from the probes.  The
architecture allows probes to use any controller in the system.  Probes
are currently distributed among controllers according to geographic and
load-balancing heuristics.  We have four controllers at the moment:

1 in Germany on a dedicated server: ronin
1 in the US on a dedicated server: carson
2 in NL on RIPE NCC VMs: caldwell and zelenka

You can see the number of probes associated with each controller and
some other details on 

	https://atlas.ripe.net/statistics

This page is updated hourly.
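
For those who like to see such things spelled out, here is a minimal
sketch in Python of what a "geographic and load balancing" heuristic
could look like.  This is not the actual Atlas code.  The region labels,
the capacities, the probe counts and the function name are all
assumptions made up for illustration, and the strict "same region first,
then least loaded" ordering is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    name: str
    region: str         # hypothetical region label, e.g. "NL"
    capacity: int       # hypothetical maximum number of probes
    probes: int = 0     # probes currently associated
    active: bool = True # False while the controller is in standby


def pick_controller(controllers, probe_region):
    """Pick a controller for a probe, or None if none has room.

    Assumed heuristic: prefer a controller in the probe's own region,
    and among equally preferred controllers take the least loaded one.
    """
    candidates = [c for c in controllers
                  if c.active and c.probes < c.capacity]
    if not candidates:
        return None
    return min(candidates,
               key=lambda c: (c.region != probe_region,
                              c.probes / c.capacity))


# Illustrative numbers only; they do not reflect the real deployment.
controllers = [
    Controller("ronin", "DE", capacity=100, probes=50),
    Controller("carson", "US", capacity=100, probes=10),
    Controller("caldwell", "NL", capacity=100, probes=90),
    Controller("zelenka", "NL", capacity=100, probes=20, active=False),
]
```

With these made-up numbers a probe in NL goes to caldwell, because
zelenka is in standby, and a probe from a region without its own
controller goes to whichever active controller is least loaded.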

What happened:

This morning zelenka was in standby and ronin started disassociating
probes in large numbers.  We do not yet know the root cause of this.  
The most likely cause so far is a connectivity problem but we are
investigating with an open mind. 

The system reacted as designed and the probes dropped by ronin started
to register with caldwell.  Unfortunately caldwell became overloaded by
this both because of its physical limitations and because of an
unfortunate database configuration error. 

Probes associated with carson were not affected.

What we are doing:

We brought up zelenka but, as Murphy dictates, the RIPE NCC firewall
prevented probes from reaching it.  This has been fixed and zelenka is
now picking up probes. 

We are working hard to fix a lot of minor problems uncovered by this and
to get all probes re-connected and their data backlog processed.

What we have learned so far:

We need a larger safety margin in the capacity of the controllers vs the
number of deployed probes.  We will start moving caldwell and zelenka
onto physical machines outside of firewalls and other complications. 

We also need to exercise moving probes among controllers and verify that
the safety margin exists in reality. 
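
To make the idea of a safety margin concrete, here is a rough sketch in
Python of the kind of check I have in mind: if any single controller
drops out, can the remaining ones absorb all probes?  The function name
and all numbers are hypothetical, and a real check would also have to
account for things like database and network limits, not just a probe
count.

```python
def controllers_at_risk(capacity, load):
    """Return the controllers whose loss the others could not absorb.

    capacity: dict mapping controller name -> max probes (assumed figure)
    load:     dict mapping controller name -> probes currently associated
    """
    total_load = sum(load.values())
    at_risk = []
    for name in capacity:
        # Capacity left if this one controller disappears.
        remaining = sum(c for n, c in capacity.items() if n != name)
        if total_load > remaining:
            at_risk.append(name)
    return at_risk


# Illustrative numbers only.
capacity = {"a": 300, "b": 100, "c": 100}
load = {"a": 200, "b": 30, "c": 20}
print(controllers_at_risk(capacity, load))
```

An empty list means the margin exists on paper.  Exercising an actual
move of probes is still needed to verify it in reality.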

Personally I regard all this as normal teething problems in a
distributed computing deployment.  So far the architecture is holding up
well.  Just the implementation has some flaws.  Please bear with us. 

If anyone has suggestions for high quality hosting of controllers in the
RIPE region, please drop me and Robert a private mail. 

Daniel

Daniel Karrenberg