04-14-2011, 12:47 PM
Bastian Blywis

watchdog

Hello,

I hope my general questions about the watchdog package belong on this list.

1) Is it really the desired behavior that wd_keepalive is started in
/etc/init.d/watchdog when the watchdog daemon is stopped? If the system shall
be kept from rebooting due to terminating the watchdog process, does it not
suffice to close /dev/watchdog as it is documented in the manual page? It
makes sense if the kernel is compiled with CONFIG_WATCHDOG_NOWAYOUT but
otherwise it does not. (The capabilities could be queried with the
WDIOC_GETSUPPORT ioctl AFAIK.)
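
A minimal sketch of how the result of such a capability query could be decoded, assuming the struct watchdog_info layout and flag values from <linux/watchdog.h> and the common _IOR request encoding (x86, ARM; some architectures encode the direction bits differently). The helper names are hypothetical. Note that WDIOF_MAGICCLOSE only says the driver understands the magic close character; this flag alone may not reveal whether nowayout is in effect.

```python
import struct

# Flag values as defined in <linux/watchdog.h>.
WDIOF_SETTIMEOUT = 0x0080
WDIOF_MAGICCLOSE = 0x0100
WDIOF_KEEPALIVEPING = 0x8000

# struct watchdog_info { __u32 options; __u32 firmware_version; __u8 identity[32]; }
INFO_FORMAT = "=II32s"
INFO_SIZE = struct.calcsize(INFO_FORMAT)


def ioc_read(type_char, nr, size):
    """Build an _IOR request number (assumes the generic Linux ioctl encoding)."""
    return (2 << 30) | (size << 16) | (ord(type_char) << 8) | nr


WDIOC_GETSUPPORT = ioc_read("W", 0, INFO_SIZE)


def decode_watchdog_info(raw):
    """Decode the bytes filled in by a WDIOC_GETSUPPORT ioctl (hypothetical helper)."""
    options, firmware, identity = struct.unpack(INFO_FORMAT, raw)
    return {
        "identity": identity.split(b"\0", 1)[0].decode("ascii", "replace"),
        "firmware_version": firmware,
        "set_timeout": bool(options & WDIOF_SETTIMEOUT),
        "magic_close": bool(options & WDIOF_MAGICCLOSE),
        "keepalive_ping": bool(options & WDIOF_KEEPALIVEPING),
    }
```

On a real system the raw bytes would come from fcntl.ioctl() on an open /dev/watchdog with a buffer of INFO_SIZE bytes. Opening the device arms the timer, so this is something to try only on a machine that may reboot.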

From my point of view, when the system administrator explicitly sets
CONFIG_WATCHDOG_NOWAYOUT or provides "nowayout" to the kernel module, he/she
wants the system to reboot if something happens, including an accidental or
intentional stop of the watchdog daemon.

2) The way the watchdog package currently works, it will not always reboot an
unresponsive system. This is related to my comment on bug #499796. For
example, when the system enters rc6 and watchdog is terminated by the init
script, wd_keepalive will seemingly keep the system from rebooting even if the
kernel hangs.

Wouldn't it be better to run the init script (stop watchdog but do not start
wd_keepalive) just before calling reboot or halt? That way, the watchdog
daemon will be able to trigger a reboot until the last moment. Unfortunately,
there are some issues when the monitored event happens (e.g. process is killed
in rc6 or hd is unmounted) more than 60s before the watchdog is terminated.

Regards,

Bastian


--
To UNSUBSCRIBE, email to debian-devel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/201104141447.17091.blywis@zedat.fu-berlin.de
 
04-14-2011, 02:05 PM
Michael Meskes

watchdog

> 1) Is it really the desired behavior that wd_keepalive is started in
> /etc/init.d/watchdog when the watchdog daemon is stopped? If the system shall

Yes.

> be kept from rebooting due to terminating the watchdog process, does it not
> suffice to close /dev/watchdog as it is documented in the manual page? It
> makes sense if the kernel is compiled with CONFIG_WATCHDOG_NOWAYOUT but
> otherwise it does not. (The capabilities could be queried with the
> WDIOC_GETSUPPORT ioctl AFAIK.)

Why? Sorry, I'm not sure I actually understand what you're saying. wd_keepalive
is started to still have basic watchdog functionality without the additional
checks performed by the watchdog daemon.

> From my point of view, when the system administrator explicitly sets
> CONFIG_WATCHDOG_NOWAYOUT or provides "nowayout" to the kernel module, he/she
> wants the system to reboot if something happens, including an accidental or
> intentional stop of the watchdog daemon.

Right, in this case wd_keepalive is not started so that should work.
wd_keepalive is only started if watchdog is stopped by using the init script
which seems to be intentional to me.

> 2) The way the watchdog package currently works, it will not always reboot an
> unresponsive system. This is related to my comment on bug #499796. For
> example, when the system enters rc6 and watchdog is terminated by the init
> script, wd_keepalive will seemingly keep the system from rebooting even if the
> kernel hangs.

No, only if the kernel does not actually hang. In the case you talk about, the
kernel does not hang badly enough to stop executing wd_keepalive, so there is
simply no way to figure out that the system needs a reset. If the kernel really
hangs and stops working, having started wd_keepalive guarantees a reboot if you
have a hardware watchdog.

> Wouldn't it be better to run the init script (stop watchdog but do not start
> wd_keepalive) just before calling reboot or halt? That way, the watchdog
> daemon will be able to trigger a reboot until the last moment. Unfortunately,
> there are some issues when the monitored event happens (e.g. process is killed
> in rc6 or hd is unmounted) more than 60s before the watchdog is terminated.

watchdog has to be stopped before the server it monitors gets stopped or else it
would trigger some sort of action. wd_keepalive then is started to make sure
the system itself stays under supervision.

Michael
--
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


--
Archive: http://lists.debian.org/20110414140535.GA4970@feivel.credativ.lan
 
04-14-2011, 02:31 PM
Bastian Blywis

watchdog

Thanks for the reply.

> Why? Sorry, I'm not sure I actually understand what you're saying.
> wd_keepalive is started to still have basic watchdog functionality without
> the additional checks performed by the watchdog daemon.

Does it actually perform some kind of checks? What I got from the
documentation is that it only writes to /dev/watchdog periodically, regardless
of what happens. Thus "basic watchdog functionality" would only mean that it is
checked if the userspace process is still running.
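
The behaviour described here can be pictured roughly as the loop below. This is a sketch of the idea only, not the actual wd_keepalive source; the function names and the injectable ping/sleep callables are assumptions made for illustration and testing.

```python
import os
import time


def device_ping(fd):
    """Write one byte to an already open watchdog device to reset its timer."""
    os.write(fd, b"\0")


def keepalive(ping, interval, should_stop, sleep=time.sleep):
    """Call ping() every `interval` seconds until should_stop() returns True.

    No system checks are performed: as long as this process gets scheduled,
    the hardware watchdog never fires. Returns the number of pings sent.
    """
    count = 0
    while not should_stop():
        ping()
        count += 1
        sleep(interval)
    return count
```

The only failure such a loop detects is its own death or starvation, which is the sense in which it checks that "the userspace process is still running".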

> No, only if the kernel does not actually hang. In the case you talk about
> the kernel does not hang enough to not execute wd_keepalive anymore, so
> there is simply no way to figure out that the system needs a reset. If the
> kernel really hangs and stops working having started wd_keepalive
> guarantees a reboot if you have a hardware watchdog.

You are right. I did not actually mean that the kernel hangs but that there is
a deadlock like in the other bug report: the kernel waits for the nfs server
to reply but the watchdog does not trigger because at this time the watchdog
daemon has already been stopped and wd_keepalive started. Therefore the event
that was monitored (timestamp of a periodically touched file) did not trigger
a reboot.
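
A rough sketch of the kind of check meant here, i.e. deciding from a file's modification time whether it is still being touched. It is an illustration under stated assumptions, not the watchdog daemon's actual code: the function name, the max_age parameter, and the choice to treat a missing file as stale are all hypothetical.

```python
import os
import time


def file_is_stale(path, max_age, now=None):
    """Return True if `path` was last modified more than `max_age` seconds ago.

    Assumption for this sketch: a file that cannot be stat()ed counts as stale.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return True
    return now - mtime > max_age
```

A monitor built on this would stop pinging the watchdog device once the check returns True. Such a check cannot run any more after the daemon has been replaced by wd_keepalive, which is the situation described above.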

> watchdog has to be stopped before the server it monitors gets stopped or else
> it would trigger some sort of action. wd_keepalive then is started to make
> sure the system itself stays under supervision.

That's what I assumed: prevent an accidental reboot in rc6 or rc0 (and of
course when watchdog is stopped by some other means).


Regards,

Bastian


--
Archive: http://lists.debian.org/201104141631.06427.blywis@zedat.fu-berlin.de
 
04-17-2011, 10:48 AM
Michael Meskes

watchdog

> Does it actually perform some kind of checks? What I got from the

Watchdog itself? Yes, which ones depends on your configuration. wd_keepalive
only triggers the hardware watchdog.

> documentation is that it only writes to /dev/watchdog periodically, regardless
> of what happens. Thus "basic watchdog functionality" would only mean that it is
> checked if the userspace process is still running.

Yes, if it doesn't, the hw watchdog will reset the system.

Michael
--
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at googlemail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


--
Archive: http://lists.debian.org/20110417104843.GA14777@feivel.credativ.lan
 
04-17-2011, 12:27 PM
Bastian Blywis

watchdog

> > Does it actually perform some kind of checks? What I got from the
>
> Watchdog itself? Yes, which ones depends on your configuration.
> wd_keepalive only triggers the hardware watchdog.

No, I meant wd_keepalive and not watchdog.

> > documentation is that it only writes to /dev/watchdog periodically,
> > regardless of what happens. Thus "basic watchdog functionality" would only
> > mean that it is checked if the userspace process is still running.
>
> Yes, if it doesn't, the hw watchdog will reset the system.

Unfortunately, as I mentioned, it seems that in some scenarios wd_keepalive will happily continue to write to /dev/watchdog and keep the system from rebooting although it should.

From my point of view this is not the desired behavior: the watchdog is started as desired by the system administrator, then stopped in rc0 and rc6, and thus the (desired) reboot is prevented if something bad happens.

There are several solutions to this problem:

1) Add a parameter in /etc/default/watchdog, e.g., START_WD_KEEPALIVE (best and easiest solution)
2) Move wd_keepalive to a separate package and let the administrator decide if he/she wants wd_keepalive to be installed and started when watchdog is stopped
3) Add a parameter to wd_keepalive so that it will only keep the system alive for a specific time. For example, when in rc6, a timeout will trigger a hard reset even if this means that some services are not shut down properly. (most complex solution)
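
To make solution 3 more concrete, here is a minimal sketch of a keepalive loop with a time limit. It is not an existing wd_keepalive feature; the function name and the max_duration parameter are hypothetical, and the clock and sleep callables are injectable only so the logic can be tested without waiting.

```python
import time


def bounded_keepalive(ping, interval, max_duration,
                      clock=time.monotonic, sleep=time.sleep):
    """Ping the watchdog every `interval` seconds for at most `max_duration`.

    After the deadline the loop stops pinging, so a hardware watchdog that
    is still armed would reset the system once its own timeout expires.
    Returns the number of pings sent.
    """
    deadline = clock() + max_duration
    count = 0
    while clock() < deadline:
        ping()
        count += 1
        sleep(interval)
    return count
```

With such a limit, a shutdown that deadlocks in rc6 would end in a hard reset after roughly max_duration plus the hardware timeout, at the cost of possibly interrupting a slow but healthy shutdown.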

In the end it boils down to two opinions on how the watchdog and system shall behave:

1) Value a proper shutdown higher than the chance of having an unavailable system
2) Have a system that has high availability but accept that a hard reset might be triggered in rc0 and rc6

If there is consensus that opinion 1 (the current state) is OK, I can understand this and will not complain, but a configuration option would be nice ;-)

Regards,

Bastian
 
