Robert,
Hi,
Indeed, this is on our list -- but see also below.
On 2011.12.14. 5:37, Greg B - NANOG wrote:
> Hi,
> I see there was a thread started back on September 7, 2011 with
> subject: Email or SMS alert when probe goes offline/online
> this was prior to me joining the mailing list.
>
> I'd like to voice my support for a user-configurable amount of time for the
> Atlas system to send out an email notification that your probe is down (and
> returned to service).
A little background story:
> My probe which I run on my home internet connection was apparently down for
> 3.5 days before I just happened to log in to look at the stats. Considering I
> was at home for much of these 3.5 days, and my Internet connection was
> working, I assume the probe crashed because simply power-cycling it "fixed"
> the problem.
>
> I know that if I had got an email ~15 minutes after the probe went down, my
> probe's downtime would probably have been closer to 30 minutes rather
> than 3.5 days.
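
For illustration only, the kind of user-configurable rule requested above
could be sketched roughly as follows. This is not how Atlas implements or
will implement notifications; the function name, parameters and default are
hypothetical, and only the 15-minute figure is taken from the message above.

```python
from datetime import datetime, timedelta

# Hypothetical default; the 15-minute figure is taken from the quoted request.
DEFAULT_THRESHOLD = timedelta(minutes=15)


def offline_alert_due(last_seen, now, threshold=DEFAULT_THRESHOLD,
                      already_alerted=False):
    """Return True when a 'probe is down' email should be sent.

    last_seen       -- time the probe was last connected (assumed to be known)
    now             -- current time
    threshold       -- user-configurable delay before alerting
    already_alerted -- True if an email was already sent for this outage
    """
    if already_alerted:
        # Send at most one email per outage.
        return False
    return now - last_seen >= threshold


if __name__ == "__main__":
    last_seen = datetime(2011, 12, 10, 8, 0)
    # 10 minutes offline: below the threshold, no email yet.
    print(offline_alert_due(last_seen, datetime(2011, 12, 10, 8, 10)))  # False
    # 20 minutes offline: threshold exceeded, email is due.
    print(offline_alert_due(last_seen, datetime(2011, 12, 10, 8, 20)))  # True
```

A matching "returned to service" email would be sent once the probe
reconnects after an alert has gone out.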
We have identified a particular condition on the probes where the probe
refuses to connect back to our infrastructure after a disconnect (which can
be caused, for example, by a network hiccup anywhere between the probe and
our infrastructure). This particular issue happens in low memory
situations. The probe still does measurements happily, it just cannot
connect to us and send the results in.
After a while, the storage on the probe fills up, so as a best effort the
probe reboots -- which fixes the low memory situation, and then everything is
back to normal again. The punch line: with the current configuration, the
probe's local storage fills up in about 3.5 days...
We're rolling out new firmware (4.280) to address this. So, unless there
are other similar conditions, after upgrading you should no longer see
3.5-day downtimes. Fingers crossed :-)
Regards,
Robert
> Thanks.
>
> -Greg