Friday's events on RIPE Atlas
Dear all,

Here is a description of the events that took place last Friday and some 'lessons learned' that we took away from them.

The high-level summary is that an Atlas user was authorised to create an extreme number of measurements involving a large number of probes. This effectively overloaded the back-end machinery in various ways. Even though this event could only happen because of an exception to the resource use limits, we have implemented workarounds and countermeasures to avoid a repetition in the future, and we will be investigating some of the more fundamental issues in the coming period.

More information is available below for those interested.

Observations:

- Problems started shortly before 11am, when one of the Atlas users created a large number of new measurements, each involving all available probes (see the sketch following this message). The user had been given an exceptional amount of credits as part of a special experiment, so the normal limitations on the impact any individual user can have on the system were not active when the measurements were created and activated.

- The results of the newly created measurements put a lot of strain on the measurement scheduler, which triggered our interest. After some investigation the cause of the overload was identified and the related measurements were stopped.

- However, by this time the majority of the results produced up to that moment had already reached our queuing servers, and the consumers were already ingesting them into our Hadoop storage platform.

- At this stage we discovered a capacity problem with the process that consumes the Atlas results, so we doubled the capacity of that component on the fly.

- This exposed the next bottleneck in our platform: the new results accumulated on a very small number of processing nodes. Normally, incoming measurement results are distributed over several storage nodes, so this concentration strongly reduced the rate at which new data could be consumed.

- A third contributing factor was that, in an attempt to curb the growth of the Atlas data, we had migrated the Atlas data sets to a more efficient compression algorithm earlier in the year. This saved us some 40-50% of storage space for the Atlas data, at the expense of some compute power. Under normal circumstances, even at high loads, this compute power is abundantly available on the storage cluster. Under the specific circumstances of last Friday's events, however, it turned out that the change of compression algorithm had increased the processing time of some Hadoop system tasks by up to a factor of 8, which had a direct impact on the data consumption speed.

Immediate actions taken:

- Removed the special privileges of the end user in question.
- Added capacity to the Atlas consumer processes.
- Returned (temporarily) to less efficient compression on the Atlas data sets.

Lessons learned and further planned actions:

- Granting special privileges to some Atlas users needs (even) more attention than it already receives.
- We need to better communicate "best practices" to these power users so they can use their extra allowances responsibly.
- Improved compression of the Atlas data has decreased our storage demands, but also our effective processing capacity. This needs further investigation to find the optimal configuration.
- Investigate possibilities to better spread incoming results over more worker nodes (reduce hotspots).
- Investigate and quantify reasonable boundaries for the scalability of the whole system, to guide the limits applied when granting credits to end users.
Kind regards, Romeo Zwart
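To make the failure mode concrete, here is a minimal sketch of the kind of request involved: a one-off measurement asking for (essentially) all available probes at once. This is an illustration only, assuming the RIPE Atlas v2 REST API; the target, probe count and API key below are hypothetical placeholders, not the researcher's actual parameters.

# Minimal sketch, assuming the RIPE Atlas v2 REST API.
# Target, probe count and API key are hypothetical placeholders.
import requests

ATLAS_API_KEY = "your-api-key-here"  # placeholder

measurement = {
    "definitions": [{
        "type": "ping",
        "af": 4,
        "target": "example.net",  # hypothetical target
        "description": "anycast check (illustrative)",
    }],
    # Area "WW" (worldwide) with a very large "requested" count asks for
    # essentially all connected probes; normally the per-user limits cap
    # this, but those limits had been lifted for the experiment.
    "probes": [{"type": "area", "value": "WW", "requested": 10000}],
    "is_oneoff": True,
}

resp = requests.post(
    "https://atlas.ripe.net/api/v2/measurements/",
    params={"key": ATLAS_API_KEY},
    json=measurement,
)
print(resp.status_code, resp.json())

Repeating many such requests, each fanning out to the whole probe fleet, produced the result volume described above.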
On 9/22/15 5:34 PM, Romeo Zwart wrote:
- We need to better communicate "best practices" to these power users so they can use their extra allowances responsibly.
hi,

who is a power user? maybe an anchor host or an atlas sponsor?

thank you
--
antonio
On 2015-09-22 17:56, Antonio Prado wrote:
On 9/22/15 5:34 PM, Romeo Zwart wrote:
- We need to better communicate "best practices" to these power users so they can use their extra allowances responsibly.
hi,
who is a power user? maybe an anchor host or an atlas sponsor?
thank you -- antonio
Hi,

This was a Dutch researcher whose proposal (DNS anycast checks) was ultimately seen as useful, so we granted the higher-than-usual resource limits.

Cheers,
Robert
On Wed, 23 Sep 2015 09:05:24 +0200 Antonio Prado <thinkofit@gmail.com> wrote:
On 9/23/15 9:01 AM, Randy Bush wrote:
who is a power user?
i don't think it is productive to go down this path
randy, you read my mind
You mean not productive trying to place the blame, or not productive to have shady anonymous "power users" being arbitrarily and behind-the-scenes granted the whole atlas infrastructure as their personal toy to play with? ^^

--
With respect,
Roman
who is a power user?
i don't think it is productive to go down this path
randy, you read my mind
You mean not productive trying to place the blame, or not productive to have shady anonymous "power users" being arbitrarily and behind-the-scenes granted the whole atlas infrastructure as their personal toy to play with? ^^
QED
Hi,

On Wed, Sep 23, 2015 at 12:09:03PM +0500, Roman Mamedov wrote:
You mean not productive trying to place the blame, or not productive to have shady anonymous "power users" being arbitrarily and behind-the-scenes granted the whole atlas infrastructure as their personal toy to play with? ^^
I have always understood that if you need "more resources than your credits allow", you can talk to the Atlas team, explain your needs, and find a solution - so it seems this is what was done, as documented...?

That it blew up shows things obviously did not go as planned, but the general option to have exceptions for tests that "really need all of the probes!" sounds useful to me.

Document the process and criteria better? Maybe. Make a strict check-list of things, with 3 copies on paper required to circulate to all Atlas probe hosts? Certainly not. A balance between "things can be done" and "bureaucracy" must be found.

gert
--
have you enabled IPv6 on something today...?

SpaceNet AG, Joseph-Dollinger-Bogen 14, D-80807 Muenchen
Vorstand: Sebastian v. Bomhard; Aufsichtsratsvors.: A. Grundner-Culemann
HRB: 136055 (AG Muenchen); Tel: +49 (0)89/32356-444; USt-IdNr.: DE813185279
On 2015/09/23 10:22, Gert Doering wrote:
I have always understood that if you need "more resources than your credits allow", you can talk to the Atlas team, explain your needs, and find a solution - so it seems this is what was done, as documented...?
That it blew up shows things obviously did not go as planned, but the general option to have exceptions for tests that "really need all of the probes!" sounds useful to me.
Document the process and criteria better? Maybe.
By nature, researchers want to do things that nobody else is doing. And in many cases they want it all: more data is better. Unfortunately, such an experiment may overload parts of Atlas in unexpected ways.

I take it that documenting the process and criteria better is for the most part something internal to the Atlas team.

Philip
On 23/09/2015 09:22, Gert Doering wrote:
Document the process and criteria better? Maybe.
+ learn lessons
+ move on
+ definitely don't stop doing interesting and exciting stuff just because there was a blip.

Nick
a great post mortem, thanks!  my uneducated thoughts

o disk is cheap; don't compress
o occasionally, large scale measurements in narrow time windows will be useful

randy
On Wed, Sep 23, 2015 at 9:00 AM, Randy Bush <randy@psg.com> wrote:
o disk is cheap; don't compress
You need to update your clichés :-) Disk might be cheap, but I/O certainly isn't.
maybe compress the data on the probes before it is sent, and save to backend storage with the data already compressed?

just a thought

Colin
On 2015-09-23 12:29, Colin Johnston wrote:
maybe compress the data on the probes before it is sent, and save to backend storage with the data already compressed?
just a thought
Colin
It may be interesting to know, although not at all surprising: compressing individual results achieves almost nothing (and has the drawback of having to unpack/repack while the result moves through the pipeline). Compressing small chunks, say dozens of related results (e.g. from the same measurement), achieves 40-50% savings (a lot depends on what kind of result it is), while compressing large blocks (multi-megabyte chunks) can achieve 90-95% in most cases.

It makes sense to compress data once it's at rest, especially if it's not often accessed. Therefore our processing pipeline is constructed such that it applies compression once a large enough batch has been consumed. As Romeo explained, this part was stretched a lot on Friday.

Cheers,
Robert
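To put rough numbers on this, here is a small self-contained sketch (not RIPE Atlas code; the record layout is invented, so the exact ratios will differ from the real data) comparing per-result, small-chunk and large-block compression with Python's zlib:

import json
import zlib

# Invented records mimicking the shape of measurement results:
# many small JSON objects sharing keys and structure.
results = [
    json.dumps({"msm_id": 1234567, "prb_id": i, "type": "ping",
                "avg": 20.0 + (i % 50) / 10.0, "sent": 3, "rcvd": 3})
    for i in range(10000)
]

raw = sum(len(r) for r in results)

# 1) Each result compressed individually: per-record overhead dominates,
#    so this achieves almost nothing.
individual = sum(len(zlib.compress(r.encode())) for r in results)

# 2) Small chunks of related results (dozens at a time).
chunked = sum(
    len(zlib.compress("\n".join(results[i:i + 50]).encode()))
    for i in range(0, len(results), 50)
)

# 3) One large block: the compressor can exploit redundancy across
#    all records.
batched = len(zlib.compress("\n".join(results).encode()))

for name, size in [("individual", individual), ("chunked", chunked),
                   ("batched", batched)]:
    print(f"{name:>10}: {size:8d} bytes ({1 - size / raw:.0%} saved)")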
Hi all,

Quick techie question: how much power is pulled via USB for the probe v2 devices?

Colin
On 2015/09/23 14:10, Colin Johnston wrote:
Quick techie question: how much power is pulled via USB for the probe v2 devices?
Our highly specialized measuring equipment says... 0.22A :-) See attached picture.
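For reference, assuming the nominal USB supply voltage of 5 V, that draw works out to P = V × I = 5 V × 0.22 A ≈ 1.1 W.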
participants (10)

- Antonio Prado
- Colin Johnston
- Gert Doering
- Mark Santcroos
- Nick Hilliard
- Philip Homburg
- Randy Bush
- Robert Kisteleki
- Roman Mamedov
- Romeo Zwart