A watchdog timer is a safety feature found in microprocessor-controlled electronics that prevents runaway software from halting a system which experiences a critical or fatal system fault. When it times out, it stops a microprocessor from executing meaningless code, a situation that arises from an electrical or programming error. More specifically, if the software is not being executed properly, it fails to clear the watchdog timer; if the watchdog timer is not cleared for a specified interval, the watchdog timer causes the microprocessor to reboot and execute software from a designated location.

A watchdog timer is most useful for mission-critical systems that must remain in continuous operation without human intervention, or in environments where personnel are not normally present or readily available to intervene should a system failure occur. This hardware-based feature is generally application-specific, and often provides adequate safety in lieu of a redundant system.

Watchdog timers can be used in many programming contexts, not just low-level hardware systems. For example, a thread responsible for downloading a file might hang for reasons beyond the control of the programmer if there is a problem with the network library. One wants one's program to recover with grace and reliability, so one uses a watchdog thread. What follows is an example of a watchdog thread implemented in Java. In normal use one would use an anonymous inner subclass of Watchdog, overriding killed() to perform some cleanup should the watched thread be killed.

Note also that this implementation avoids part of the "multiple clients" problem described by hobyrne below; multiple threads sharing a watchdog can each set a different index in the boolean semaphore. Should any thread fail to set its part of the semaphore, the watchdog will be tripped. It is the responsibility of the programmer to kill the individual "extra" threads as necessary by overriding killed(). Still, it is probably better to have one Watchdog per thread.



/** This class implements a 'watchdog' thread: one which oversees the operation
 * of another thread.  A <code>boolean[]</code> is used as a semaphore between the
 * watchdog thread and the watched thread. Before a specified interval has
 * elapsed, the watched thread indicates through the semaphore that it is
 * functioning correctly.  That is, each element of the boolean array must
 * be set to <tt>true</tt> in turn within <tt>sleep</tt> milliseconds.
 * Otherwise, the watchdog will terminate the watched thread.  Elements of the 
 * semaphore are automatically cleared by the watchdog after they are read,
 * and the watchdog returns to the first element of the semaphore after
 * the last element is checked.
 * <p>
 * The watchdog automatically terminates if the watched thread stops 
 * (i.e. <tt>isAlive()</tt> returns <tt>false</tt>) and is a daemon
 * thread so that it will not prevent the VM from exiting.
 *
 * @author Pyrogenic
 */
public class Watchdog extends Thread {

    Thread watched;
    int sleep;
    boolean[] semaphore;
    int length;

    boolean ok;
    volatile boolean run = true; // volatile: cease() writes this from another thread

    public Watchdog(final Thread watched, 
                    final int sleep, 
                    final boolean[] semaphore) {
        this(null, watched, sleep, semaphore);
    }

    public Watchdog(final ThreadGroup threadGroup, 
                    final Thread watched, 
                    final int sleep,
                    final boolean[] semaphore) {
        super(threadGroup, "Watching " + watched);
        this.watched = watched;
        this.sleep = sleep;
        this.semaphore = semaphore;
        length = semaphore.length;
        setDaemon(true);
    }

    public void run() {
        while (run && watched.isAlive()) {
            ok = true;
            for (int i = 0; i < length; i++) {
                if (!semaphore[i]) {
                    try {
                        sleep(sleep);
                    } catch (InterruptedException e) {
                        // woken early (e.g. by cease()); the run flag is checked just below
                    }
                }
                if (!run) {
                    ok = true;
                    break;
                }
                ok = semaphore[i] && ok;
                semaphore[i] = false;
            }
            if (!ok) {
                trigger();
            }
        }
    }

    /** Invoke this method to trigger the watchdog regardless of the state of the semaphore.
     * Note that the <CODE>warn()</CODE> method is invoked before the
     * watched thread is killed, and the result is still heeded.
     * @see #warn
     */
    public void trigger() {
        if (warn()) {
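            // Note: Thread.stop() is deprecated, and on Java 20 and later it
            // throws UnsupportedOperationException instead of stopping the thread.
            watched.stop();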
            killed();
        }
    }

    /** invoked before the watchdog terminates the thread it is watching.
     * The default implementation returns <CODE>true</CODE>.  Override this
     * method if you might want to cancel the shutdown of the watched thread.
     * <P>
     * In the event that a shutdown is cancelled, the watchdog will go back
     * to sleep, then check the semaphores again after the sleep time.
     *
     * @return  whether to continue shutting down the watched thread.  Return
     *        <CODE>true</CODE> to continue the shutdown.  Return
     *        <CODE>false</CODE> to cancel the shutdown.
     */
    protected boolean warn() {
        return true;
    }

    /** invoked after the watched thread has been killed.  The default
     * implementation does nothing. Subclasses can use this as an
     * opportunity to clean up after the killed thread by closing streams,
     * removing temporary files, etc.
     */
    protected void killed() {
        /* does nothing */
    }

    /** invoke this method to stop the watchdog in a clean manner.  The
     * watched thread will be left alone, and the watchdog will exit.
     */
    public void cease() {
        run = false;
        interrupt();
    }

}

In embedded electronic systems, a simple device which is supposed to ensure fail-safe behaviour. Unfortunately, it is all too often abused.

The basic idea is this: there is a timer. One (or several) components in the system can reset the timer. If the timer ever hits zero (i.e. there is too long a gap between two successive reset signals), the entire circuit is reset.

The idea is that during normal operation of the device, one central subsystem will regularly reset the watchdog timer (also known as petting the watchdog, or kicking the dog). Only a catastrophic error will cause the subsystem to fail to reset the timer (and every catastrophic error that is otherwise untrapped will cause the subsystem to fail to reset the timer). If a timeout ever occurs, then the simplest way to resolve the serious error is to reset the entire system.
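
To make the countdown-and-pet mechanism concrete, here is a minimal software model of it in Java. It is only a sketch: the class name CountdownWatchdog and its methods are made up for illustration and do not correspond to any real chip's registers, and the "reset" is modelled as nothing more than a counter going up.

```java
/** Hypothetical tick-driven model of a hardware watchdog (illustration only). */
class CountdownWatchdog {
    private final int reloadTicks; // timeout period, measured in timer ticks
    private int remaining;         // ticks left before the watchdog fires
    private int resetCount;        // how many times the "system" has been reset

    CountdownWatchdog(int reloadTicks) {
        if (reloadTicks <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        this.reloadTicks = reloadTicks;
        this.remaining = reloadTicks;
    }

    /** Called by healthy software: reload the counter ("pet the watchdog"). */
    void pet() {
        remaining = reloadTicks;
    }

    /** Called once per timer tick. Returns true if this tick forced a reset. */
    boolean tick() {
        remaining--;
        if (remaining > 0) {
            return false;
        }
        resetCount++;            // the timer hit zero: reset the whole system
        remaining = reloadTicks; // after the reset the countdown starts over
        return true;
    }

    int resetCount() {
        return resetCount;
    }
}
```

With a period of three ticks, software that calls pet() at least every two ticks never sees a reset; the first time it goes three ticks without petting, tick() returns true.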

There are several problems with the implementations of watchdog timers. Some implementations do not even have a well-defined period for the timer. This makes it impossible to use it effectively. Other implementations do not provide for an appropriate timer period - some systems may be better off with a fast timer, others may require a slow one. If the programmer is lazy or the implementation of the watchdog is inappropriate, there may be watchdog resets in several subsystems, or in a peripheral subsystem (whose only purpose, sometimes, is to prevent the watchdog timer from resetting). If there are resets in several subsystems, then one subsystem could fail, stop transmitting its own reset signal, and the watchdog would not catch the failure because it's still getting petted by the others. If the resets are put on a peripheral system, then there could be a major failure in the core and the watchdog would not help at all, whereas a failure in that particular (minor) subsystem could incapacitate the entire system.
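
One common way around the "several subsystems each pet the dog" trap is to let no subsystem touch the watchdog directly. Each one sets its own flag instead, and a single task pets the watchdog only if every flag was set since the last check. The sketch below assumes that arrangement; CheckInGate and its method names are hypothetical, and the actual petting of the hardware is left to the caller.

```java
/** Hypothetical gate: allows one pet only after every subsystem has checked in. */
class CheckInGate {
    private final boolean[] checkedIn; // one flag per subsystem

    CheckInGate(int subsystemCount) {
        if (subsystemCount <= 0) {
            throw new IllegalArgumentException("need at least one subsystem");
        }
        checkedIn = new boolean[subsystemCount];
    }

    /** Called by subsystem number id to report that it is still alive. */
    synchronized void checkIn(int id) {
        checkedIn[id] = true;
    }

    /**
     * Called periodically by the single task that owns the watchdog.
     * Returns true only if every subsystem checked in since the last call,
     * and clears all flags so that each one must check in again.
     */
    synchronized boolean pollAndClear() {
        boolean all = true;
        for (int i = 0; i < checkedIn.length; i++) {
            all = all && checkedIn[i];
            checkedIn[i] = false;
        }
        return all;
    }
}
```

The owning task would pet the real watchdog only when pollAndClear() returns true, so a single dead subsystem is enough to let the timer run out.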

Even when the watchdog timer is used correctly, it's a bit of a cop out. There are many types of failure that can occur that will have no effect on the watchdog or the subsystem in charge of the watchdog. And with enough design work, one can predict the sets of circumstances that would cause a watchdog timer timeout, then add to the design to handle these particular circumstances in a more graceful way than crapping out the entire thing. That way, one could prove at design time that there will be little if any benefit to including a watchdog timer, and a very real risk with potentially serious consequences to having one in.

The above sounds harsh - and it's meant to. However, I do not discount the value of watchdog timers altogether. There are times when the extra design work I talk of is simply not worth it. In this wonderful capitalistic society, failure is always an option (provided the cost of failure is less than the cost of avoiding failure).

Pyrogenic's implementation of a watchdog timer is somewhat different from the type I've been ranting about; it's much more limited in scope. When calling a routine that may hang, it's only common sense in good programming practice to add a timer if you don't want your own system to hang. This timer won't overkill the error recovery process by starting the whole thing from scratch again. This kind of watchdog is reasonable, and is a good idea.
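
For this limited "put a timer around a call that may hang" case, the standard java.util.concurrent classes can do the job without a hand-rolled watchdog thread. What follows is a sketch under some assumptions: TimedCall and callWithTimeout are names invented here, the caller is assumed to be happy with a fallback value on timeout, and cancelling only frees the worker if the hung routine responds to interruption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: run a task, but give up waiting after a time limit. */
class TimedCall {
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon worker thread, so a task that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a task that ignores interrupts keeps running
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed; surface its exception as the cause.
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as TimedCall.callWithTimeout(() -> download(url), 30000, null) would then wait at most thirty seconds, where download is a stand-in for whatever routine might hang.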

Hardware watchdog timers are a good thing and a bad thing.

Good things about watchdog timers revolve around the fact that the programmer (i.e., me) is getting screamed at about the fact that the project is missing deadlines. This is because all software projects miss deadlines, except for the ones that get released early and broken. Windows 95 is a good example of this.

So, as a programmer at the sharp end, and knowing that you don't have time to debug everything properly, you say, "ah, nuts to it, I'll just make sure we have a good watchdog to reboot it." This reduces the number of fault tickets that get logged, because if it restarts itself, the customer generally won't bother to call for support.

Bad things about hardware watchdog timers revolve around the fact that every embedded system is allowed to get released with bugs. My satellite TV decoder crashes about once a week and then takes a minute or two to reboot. Sometimes it does it during a programme, which is exceedingly annoying. Sony should be ashamed.

In a perfect world, there'd be no need for watchdog timers.
