Erroneous data in measurement 7000
Dear RIPE Team,
I am a Master's student at TU Delft, currently working on a tool that uses RIPE data to reliably detect network outages. While analyzing the measurements, I noticed what appears to be incorrect data being reported.
In particular, when counting the number of disconnected probes per country, I observed that many countries show a major outage on 23 April 2025, where nearly all probes were marked as disconnected for about an hour and a half before recovering. For example, in Spain (ES):
2024-09-10 12:58:10 UTC – 16:37:15 UTC (Duration: 3:39:05, Peak offline probes: 154) Happened across many (40+) countries
2024-09-17 10:37:04 UTC – 12:50:56 UTC (Duration: 2:13:52, Peak offline probes: 129) Happened across many (40+) countries
2025-04-23 09:39:27 UTC – 11:11:59 UTC (Duration: 1:32:32, Peak offline probes: 210) Happened across many (90+) countries
The timing and duration of the April 23 outage are consistent across many countries. I’ve compiled more examples and raw data in this editable online notepad https://shrib.com/#Asa3ZLP3Xe.
It seems that a large portion of probes were marked as disconnected, even though the raw measurement data indicates they were still actively sending and receiving information. It is important to note that such errors can be observed at other times for short periods of less than a minute - 20-05-2025, again across many countries.
I would greatly appreciate your clarification on a few points:
1. What could explain this discrepancy? Is it a measurement issue or something related to RIPE’s internal systems?
2. Are all measurements potentially affected by this? For example, measurement 10310, during an outage in NMK on May 18, some probes showed no responses for a 30-minute window (possibly still timing out). Similarly, during the outage in Portugal on April 28, several probes had no data for multiple hours.
3. How trustworthy are the results from the measurements?
4. How long does it typically take for RIPE to detect and reflect a disconnected probe (or other events) in the API?
5. Different requests to root measurements take the same time, even though one is requesting 30 minutes and another is requesting 4 minutes of data. Is the overhead on RIPE's side the reason for this?
Looking forward to your response!
Best regards,
Filip Dobrev
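
The per-country outage windows listed in the message above (start, end, duration, peak offline probes) could be derived along the following lines. This is only a minimal sketch and not necessarily the method the original poster used: the function name `find_outage_windows`, the `threshold` parameter, and the sample values between the reported start and end times are all hypothetical. Only the start time, end time, and peak of 210 come from the message.

```python
from datetime import datetime, timedelta, timezone

def find_outage_windows(samples, threshold):
    """Return (start, end, duration, peak) for each contiguous run of
    samples whose offline-probe count is at or above `threshold`.

    `samples` is a list of (datetime, offline_count) pairs sorted by time.
    The end of a window is the last sample still at or above the threshold.
    """
    windows = []
    start = last = None
    peak = 0
    for ts, offline in samples:
        if offline >= threshold:
            if start is None:
                start, peak = ts, offline      # a new window opens
            else:
                peak = max(peak, offline)
            last = ts
        elif start is not None:
            windows.append((start, last, last - start, peak))
            start = None                       # the window closes
    if start is not None:                      # window still open at end of data
        windows.append((start, last, last - start, peak))
    return windows

def utc(h, m, s=0):
    # Helper for 23 April 2025 timestamps in UTC.
    return datetime(2025, 4, 23, h, m, s, tzinfo=timezone.utc)

# Illustrative samples for Spain: the endpoints and the peak match the
# message above, the other values are made up for the example.
samples = [
    (utc(9, 30), 3),
    (utc(9, 39, 27), 180),
    (utc(10, 15), 210),
    (utc(11, 11, 59), 150),
    (utc(11, 20), 4),
]

windows = find_outage_windows(samples, threshold=100)
for start, end, duration, peak in windows:
    print(start.isoformat(), end.isoformat(), duration, peak)
```

The threshold would need tuning per country, since the number of probes differs widely between countries.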
The 23rd April outage was a scheduled controller software upgrade: https://status.ripe.net/incidents/6jxc3zrc9nfy
I think the 20th May issues were database corruption on a controller in Ireland: https://status.ripe.net/incidents/36666jcqdjqq
I can't answer the other questions.
Gavin
----- To unsubscribe from this mailing list or change your subscription options, please visit: https://mailman.ripe.net/mailman3/lists/ripe-atlas.ripe.net/ As we have migrated to Mailman 3, you will need to create an account with the email matching your subscription before you can change your settings. More details at: https://www.ripe.net/membership/mail/mailman-3-migration/
RIPE has a page where they disclose incidents. I guess to go forward your tooling could scrape that webpage and discard measurements around these incidents. See https://status.ripe.net/
Regards,
Ernst J. Oud
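
The suggestion to discard data around known incidents can be sketched as a simple time-window filter. This is a minimal illustration under stated assumptions: the incident window is the one reported for Spain earlier in this thread, the event dictionaries with `probe` and `time` keys are a hypothetical format, and the probe IDs are made up. How the window list is obtained from https://status.ripe.net/ is left open here.

```python
from datetime import datetime, timedelta, timezone

# Window observed in this thread for the 23 April 2025 controller upgrade.
# A real tool would maintain this list from the RIPE status page.
INCIDENT_WINDOWS = [
    (datetime(2025, 4, 23, 9, 39, 27, tzinfo=timezone.utc),
     datetime(2025, 4, 23, 11, 11, 59, tzinfo=timezone.utc)),
]

def in_incident(ts, windows=INCIDENT_WINDOWS, margin=timedelta(minutes=5)):
    """True if `ts` falls inside any incident window, widened by `margin`."""
    return any(start - margin <= ts <= end + margin for start, end in windows)

def drop_incident_events(events, windows=INCIDENT_WINDOWS,
                         margin=timedelta(minutes=5)):
    """Keep only the events whose time lies outside all incident windows."""
    return [e for e in events if not in_incident(e["time"], windows, margin)]

# Hypothetical disconnect events (probe IDs and exact times are illustrative).
events = [
    {"probe": 1001, "time": datetime(2025, 4, 23, 9, 40, tzinfo=timezone.utc)},
    {"probe": 1002, "time": datetime(2025, 4, 23, 9, 36, tzinfo=timezone.utc)},
    {"probe": 1003, "time": datetime(2025, 4, 28, 11, 0, tzinfo=timezone.utc)},
]

kept = drop_incident_events(events)
print([e["probe"] for e in kept])
```

The margin is a design choice: disconnections may begin slightly before or end slightly after the published incident times, so a small buffer on both sides avoids counting the edges of an incident as an outage.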
Dear Filip et al,
I would greatly appreciate your clarification on a few points:
1. What could explain this discrepancy? Is it a measurement issue or something related to RIPE’s internal systems?
Probes try to maintain a permanent connection to the infrastructure; both ends (and many middle boxes...) can influence this. We can have maintenance or outage events on the infrastructure side that affect the otherwise stable connections of multiple probes at the same time. The significant ones are usually mentioned on the status page: status.ripe.net. Disconnected probes keep on measuring and report their results later, once they are able to.
2. Are all measurements potentially affected by this? For example, measurement 10310, during an outage in NMK on May 18, some probes showed no responses for a 30-minute window(possibly still timing out). Similarly, during the outage in Portugal on April 28, several probes had no data for multiple hours.
In general, when probes are powered up, they are measuring. As said earlier, reporting may happen later. Spain and Portugal had a well-known power outage on April 28; most probes there were unfortunately really, really down.
3. How trustworthy are the results from the measurements?
All probes run similar code to well-known networking tools such as ping, traceroute and dig (implemented in the firmware running on them). The results can be trusted - but on our scale there are always outliers because of local network conditions, misbehaving middle boxes, perhaps bugs, corner cases or manual interference.
4. How long does it typically take for RIPE to detect and reflect a disconnected probe (or other events) in the API?
During normal operations most events, including connections and disconnections, are handled in near real-time. Result reporting for periodic measurements from probes happens every 60-90 seconds (plus the time it takes for the results to go through the processing pipeline until they are retrievable).
5. Different requests to root measurements take the same time, even though one is requesting 30 minutes and another is requesting 4 minutes of data. Is the overhead on Ripe's side the reason for this?
The built-in measurements run on all probes, so we have to be conscious of the volume of data collected; therefore some of these run at a lower frequency.
I hope this helps,
Robert
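
Since disconnected probes keep measuring and report later, one way to tell an infrastructure-side disconnection from a probe that was really down is to check whether the probe has results timestamped inside the disconnection window. The sketch below is only an illustration: the function name and return labels are hypothetical, and it assumes that a result's timestamp reflects when the measurement was taken rather than when it was reported.

```python
from datetime import datetime, timezone

def classify_disconnect(start, end, result_times):
    """Classify one probe's disconnection window.

    If the probe produced results timestamped inside [start, end], it was
    powered and measuring, so the disconnection was likely on the
    infrastructure side. No results is ambiguous: the probe may have been
    down, or its delayed results may not have arrived yet.
    """
    measured_during = any(start <= t <= end for t in result_times)
    return "infrastructure-side" if measured_during else "possibly-down"

def utc(h, m, s=0):
    # Helper for 23 April 2025 timestamps in UTC.
    return datetime(2025, 4, 23, h, m, s, tzinfo=timezone.utc)

# Disconnection window reported for Spain earlier in this thread.
start, end = utc(9, 39, 27), utc(11, 11, 59)

# Illustrative result timestamps for two hypothetical probes.
still_measuring = [utc(9, 45), utc(10, 30)]
silent = [utc(9, 0), utc(11, 30)]

print(classify_disconnect(start, end, still_measuring))
print(classify_disconnect(start, end, silent))
```

Because results can arrive late, a "possibly-down" verdict is best re-evaluated some time after the window has closed.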
Participants (4):
- Ernst J. Oud
- Filip Dobrev
- Gavin Atkinson
- Robert Kisteleki