USB drive more harmful than helpful?
From both my own (short term) experience and from what's being written on this list, I'm getting the impression that the USB drive may be costing more than it's worth.
In only about 3 months I have experienced multiple probe issues due to USB drives, and there have been multiple threads on this list which suggest that I am far from alone.
I will go so far as to suspect that a substantial number of disconnected and abandoned probes have similar issues, but the hosters may be unwilling or unable to spend the necessary time to investigate and resolve them.
If the main reason for the drive is to cache data during unavailability of the command and control center, this may not be worth the effort.
I would suggest making the drive optional. That may mean fewer data points from and/or fewer UDMs possible on those probes without a functioning drive. But it may also mean a couple thousand probes more connected and therefore available for measurements at all.
I'm not saying that probes should not ask for their drives to be fixed/replaced, but it should not be a requirement for the probe to run. One might give an incentive to hosters to run their probes with functioning drives by giving fewer credits for connected probes without a drive.
Any thoughts?
Michael
--
Sent from a mobile. Please excuse my brevity. // M: +49-163-6866568
On 20/05/2016 15:37, Michael Ionescu wrote:
[...]
Interesting idea to make the USB drive optional. Based on literature:
- https://en.wikipedia.org/wiki/USB_flash_drive#Failures
- https://askleo.com/can_a_usb_thumbdrive_wear_out/ (10,000-100,000 write cycles)
- http://cfgearblog.blogspot.co.il/2011/03/how-long-does-flash-drive-last_22.h... (10,000-1M write cycles)
Has anyone tested how many writes are going on to the ATLAS thumb drive? Perhaps with all the failures within a year of start, perhaps too many writes are taking place?
Regards, Hank
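Whether write wear alone could plausibly kill a stick within a year can be checked with a back-of-the-envelope calculation. The Python sketch below is only an illustration: the cycle count is the low end of the ranges quoted above and 4 GB is the capacity of the SanDisk sticks mentioned elsewhere in this thread, but the daily write volume and the write-amplification factor are assumptions, since no measured figures for the probes have been posted. It also assumes ideal wear levelling, which cheap thumb drives may not provide.

```python
GB = 10**9  # decimal gigabyte, as used on flash drive packaging


def years_until_wearout(capacity_bytes, pe_cycles, bytes_written_per_day,
                        write_amplification=1.0):
    """Idealised lifetime estimate for a flash stick.

    Assumes perfect wear levelling, i.e. every cell is erased equally often.
    write_amplification is the (assumed) ratio of bytes physically written
    to flash versus bytes written by the host.
    """
    if bytes_written_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and amplification must be positive")
    total_host_bytes = capacity_bytes * pe_cycles / write_amplification
    return total_host_bytes / bytes_written_per_day / 365.0


if __name__ == "__main__":
    # ASSUMED inputs: 1 GB/day of host writes and a write amplification of 10.
    estimate = years_until_wearout(4 * GB, 10_000, 1 * GB, 10.0)
    print(f"estimated lifetime: {estimate:.1f} years")
```

With these assumed inputs the estimate is roughly 11 years, so wear alone would not explain failures within a year. A much higher real write rate, or a stick that concentrates writes on a few blocks, would shorten that figure considerably.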
On 2016/05/20 14:57 , Hank Nussbacher wrote:
Has anyone tested how many writes are going on to the ATLAS thumb drive? Perhaps with all the failures within a year of start, perhaps too many writes are taking place?
We have no clear idea why they fail. It seems that time to failure is highly variable.
Hi,
On Fri, May 20, 2016 at 04:10:47PM +0200, Philip Homburg wrote:
We have no clear idea why they fail. It seems that time to failure is highly variable.
Can you correlate tests-until-failure or data-written-until-failure?
One of mine has failed at least two times now, and it could be that people just *love* to run tests from 3320...
My gen 1 probe in 5539 has never had *any* issues.
gert -- have you enabled IPv6 on something today...?
SpaceNet AG Vorstand: Sebastian v. Bomhard Joseph-Dollinger-Bogen 14 Aufsichtsratsvors.: A. Grundner-Culemann D-80807 Muenchen HRB: 136055 (AG Muenchen) Tel: +49 (0)89/32356-444 USt-IdNr.: DE813185279
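One simple way to run the correlation asked for here is a Pearson coefficient between each probe's average write volume and its time to failure. The sketch below is a minimal, hypothetical illustration in Python: the input lists `gb_written_per_day` and `days_to_failure` are assumed names, and no such per-probe data has been published in this thread.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Usage, with hypothetical per-probe lists taken from the archived logs:
#   r = pearson(gb_written_per_day, days_to_failure)
```

A coefficient close to -1 (more writes per day, earlier failure) would point towards wear; a value near 0 would suggest that something other than write volume, such as power, is the dominant factor.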
So I have a few theories.
I have now had 3 different USB sticks fail on me: two SanDisk 4GB SDCZ33 and one cheap generic 8GB replacement.
The power draw of the TP-Link system + USB is probably more than the opportunistic USB ports they get plugged in to can supply. An underpowered probe runs great MOST of the time, but a flash bit write is probably the highest power strain, and flash can get really unhappy with power interrupts, based on this SSD research: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf
I usually use a 500mA or 800mA supply, or tap a nearby router USB port in that range. I suspect the system may demand 1200mA or more.
When most flash sticks get errored out enough, they permanently fail into a read-only mode, or become fully unreadable. Read-only mode can be reset on some models, but it is not recommended by the vendor. At least one of the failed SanDisk units I had was stuck in a read-only mode.
Also, probes may be subjected to ungraceful power down situations, depending on where they are stationed. That can also be a flash drive killer.
I don't think we are hitting the write limits of the sticks. I suspect the units are often in underpowered or ungraceful power-down situations, or the USB flash itself is not responding gracefully to poweroff situations.
I don't suppose RIPE buys enough USB sticks to get to talk to engineers at SanDisk?
I know the newer Raspberry Pi will report when it is in an underpowered situation. Can the TP-Link detect and warn when underpowered? What is the minimum power recommended for TP-Link + USB?
Also, are there any USB sticks that have lower power needs and are more robust in low power IoT situations?
Is anyone trying to post-mortem the failed sticks?
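For a post-mortem on a failed stick plugged into an ordinary Linux machine, one quick check is whether the stick ended up mounted read-only. The sketch below is an illustrative Python example that parses text in the `/proc/mounts` format (device, mount point, filesystem type, options). The `/dev/sd` device prefix is an assumption about how the stick shows up, and nothing in this thread says that the probe firmware exposes this information to hosts.

```python
def read_only_mounts(mounts_text, device_prefix="/dev/sd"):
    """Return (device, mount_point) pairs that are mounted read-only.

    mounts_text is text in /proc/mounts format. device_prefix is an
    ASSUMED filter for USB mass-storage devices; adjust it as needed.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fs_type, options = fields[:4]
        if device.startswith(device_prefix) and "ro" in options.split(","):
            found.append((device, mount_point))
    return found


# Usage on a Linux host:
#   with open("/proc/mounts") as f:
#       print(read_only_mounts(f.read()))
```

A read-only mount is only a hint: it can also be the result of a deliberate mount option, or of the kernel remounting a filesystem after errors, so it does not by itself prove that the stick's controller has locked itself.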
On 20/05/2016 22:08, Phillip Remaker wrote:
When most flash sticks get errored out enough, they permanently fail into a read-only mode, or become fully unreadable. Read-only mode can be reset on some models, but it is not recommended by the vendor. At least one of the failed SanDisk units I had was stuck in a read-only mode.
Also, probes may be subjected to ungraceful power down situations, depending on where they are stationed. That can also be a flash drive killer.
I don't think we are hitting the write limits of the sticks. I suspect the units are often in underpowered or ungraceful power-down situations, or the USB flash itself is not responding gracefully to poweroff situations.
I don't suppose RIPE buys enough USB sticks to get to talk to engineers at SanDisk?
SanDisk R&D is located in Israel: http://www.globes.co.il/en/article-sandisk-acquisition-affects-650-israeli-e...
I could probably arrange a meeting with the technical staff there provided there is a clear document detailing the issue. Maybe RIPE ATLAS technical staff would like to come to a meeting?
Regards, Hank
On 2016/05/21 21:32 , Hank Nussbacher wrote:
On 20/05/2016 22:08, Phillip Remaker wrote:
I don't suppose RIPE buys enough USB sticks to get to talk to engineers at SanDisk?
SanDisk R&D is located in Israel: http://www.globes.co.il/en/article-sandisk-acquisition-affects-650-israeli-e... I could probably arrange a meeting with the technical staff there provided there is a clear document detailing the issue. Maybe RIPE ATLAS technical staff would like to come to a meeting?
We switched from SanDisk to Verbatim because the failure rate of the SanDisk was too high.
Unfortunately, the log files that contain details about the USB sticks are archived in a way that makes them hard to access. But we will try to analyze those logs soon to see what we can get out of them.
Independent of that, it would be nice if we could figure out a way to induce failures (instead of having to wait for probes in the field to end up with a corrupt filesystem) and to figure out what causes those failures.
One thing we currently wonder is if a marginal power supply (as seen from the USB stick) would cause corruption or not.
Philip
On May 20, 2016 9:08:08 PM GMT+02:00, Phillip Remaker <remaker@gmail.com> wrote:
I don't suppose RIPE buys enough USB sticks to get to talk to engineers at SanDisk?
I just had a Verbatim drive, originally supplied with the probe, go read-only, so I would say RIPE is not procuring only SanDisk.
[...]
Has anyone tested how many writes are going on to the ATLAS thumb drive? Perhaps with all the failures within a year of start, perhaps too many writes are taking place?
I know that a very small number of probes is not a valid basis for statistics, but there wasn't a USB drive failure yet for the long-term, always-on probe.
But they are powered with dedicated, stable power sources. Thus I tend to lean more towards the explanation involving level or stability of power, rather than # of writes.
FWIW, Wilfried
Regards, Hank
On 23/05/2016 at 14:41, Wilfried Woeber wrote:
[...]
Has anyone tested how many writes are going on to the ATLAS thumb drive? Perhaps with all the failures within a year of start, perhaps too many writes are taking place?
I know that a very small number of probes is not a valid basis for statistics, but there wasn't a USB drive failure yet for the long-term, always-on probe.
But they are powered with dedicated, stable power sources. Thus I tend to lean more towards the explanation involving level or stability of power, rather than # of writes.
FWIW, Wilfried
Regards, Hank
FWIW, my failed #12033 probe was powered using only 1 USB port from my ISP-provided router. I’ve plugged the replacement one into a 2+1 A dedicated power supply. So while that second one hasn’t been around long enough to be relevant, the first one falls in the low power level issue range.
Bruno
On May 23, 2016, at 8:41 AM, Wilfried Woeber <woeber@cc.univie.ac.at> wrote:
[...]
Has anyone tested how many writes are going on to the ATLAS thumb drive? Perhaps with all the failures within a year of start, perhaps too many writes are taking place?
I know that a very small number of probes is not a valid basis for statistics, but there wasn't a USB drive failure yet for the long-term, always-on probe.
But they are powered with dedicated, stable power sources. Thus I tend to lean more towards the explanation involving level or stability of power, rather than # of writes.
FWIW, Wilfried
Regards, Hank
Stable power, as from a UPS, also isolates the probe from power glitches which may cause rebooting of the probe, thus adding to the write count. Not to mention corruption from power failure during writes.
Opinion: If the device/system/operation is at all important, use of a UPS is effectively mandatory.
James R. Cutler
James.cutler@consultant.com
PGP keys at http://pgp.mit.edu
It doesn't help when we're talking about Atlas probes. I have one probe where the external flash died twice, even though it's placed in a datacenter with UPS-protected power.
On 23.05.16 15:35, "ripe-atlas on behalf of James R Cutler" <ripe-atlas-bounces@ripe.net on behalf of james.cutler@consultant.com> wrote:
Stable power, as from a UPS, also isolates the probe from power glitches which may cause rebooting of the probe, thus adding to the write count. Not to mention corruption from power failure during writes. Opinion: If the device/system/operation is at all important, use of a UPS is effectively mandatory.
Hi,
On Fri, May 20, 2016 at 02:37:44PM +0200, Michael Ionescu wrote:
From both my own (short term) experience and from what's being written on this list, I'm getting the impression that the USB drive may be costing more than it's worth. [..] Any thoughts?
The USB outages and the lack of proper guidance for probe hosts about the problem status and how to get the probes back have been my gripe #1 for a while now.
So, yes, this needs fixing, one way or the other.
gert
+1.
I lost most of the probes this way and I'm not really sure how to recover them: I need to ask for a batch of USB drives or ask all the hosts to remove them. Can't this be handled better with a firmware replacement? I would at least then ask all the hosts to unplug the USB and leave the hosts as is.
Gil
On 2016/05/20 14:37 , Michael Ionescu wrote:
If the main reason for the drive is to cache data during unavailability of the command and control center, this may not be worth the effort.
No, the probe actually runs from the USB stick. The internal 4MB flash is just enough to initialize the USB stick in a secure way. And even that is already tricky.
participants (10)
- Bruno Pagani
- Daniel Suchy
- Gert Doering
- Gil Bahat
- Hank Nussbacher
- James R Cutler
- Michael Ionescu
- Philip Homburg
- Phillip Remaker
- Wilfried Woeber