Learning the Android Watchdog

Study notes on the Android watchdog

Published 2018-04-05

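As we all know, when an application stays unresponsive beyond a certain time, the system shows an "Application Not Responding" (ANR) dialog so that the app is not left unusable indefinitely. The user can choose to force-close the app, which kills its process.

The ANR mechanism targets applications. For the system process, Android provides the Watchdog mechanism instead: if the system process stays "unresponsive" past the allowed delay, the Watchdog triggers a restart.

When analyzing logs for hang-and-reboot problems, we often see a line like this: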

*** WATCHDOG KILLING SYSTEM PROCESS:

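Taken literally: the watchdog killed the system process. This is the Watchdog mechanism that this article analyzes. The Watchdog registers two kinds of checkers and then checks every 30s for deadlocks and for blocked message queues or thread pools. If a problem persists past the timeout, it kills the system process. The two kinds of checkers are:

  • MonitorChecker: checks whether a core system service has held its lock for too long.
  • HandlerChecker: checks whether the message queue of a Looper created by a core system thread is blocked. In fact, the MonitorChecker is itself a HandlerChecker whose thread is FgThread.

For convenience, "HandlerChecker" in this article excludes the MonitorChecker that runs on FgThread.

We will analyze the Watchdog mechanism from the following angles:

  1. Initialization of the Watchdog, the MonitorChecker, and the HandlerCheckers.
  2. How the Watchdog mechanism works.
  3. How to analyze Watchdog-triggered hang-and-reboot problems.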

Watchdog startup

Watchdog is a thread: it extends Thread, and SystemServer.java obtains the Watchdog object through getInstance.

@SystemServer.java

            final Watchdog watchdog = Watchdog.getInstance();
            watchdog.init(context, mActivityManagerService);

The init method registers a broadcast receiver for ACTION_REBOOT.

    public void init(Context context, ActivityManagerService activity) {

        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);
    }

Initializing HandlerChecker

While the Watchdog is being constructed, it creates the HandlerCheckers. The HandlerChecker constructor looks like this:

        HandlerChecker(Handler handler, String name, long waitMaxMillis) {
            mHandler = handler;
            mName = name;
            mWaitMax = waitMaxMillis;
            mCompleted = true;
        }

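The parameters mean:

  • handler: the Handler being watched.
  • name: a name for the thread the Handler belongs to, so that the log can show which thread was affected after a failure.
  • waitMaxMillis: the longest the message queue may stay blocked. Beyond this, the system process is killed. The default is 60s.

The Watchdog constructor creates HandlerCheckers named foreground thread, main thread, ui thread, i/o thread, and display thread (the FgThread one is the special case that also serves as the MonitorChecker). The default DEFAULT_TIMEOUT is 60s, which means the message queue of each of these threads' Loopers must not stay blocked for more than 60s.

The code is as follows: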

    private Watchdog() {
        super("watchdog");

        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add checker for main thread.  We only do a quick check since there
        // can be UI running on the thread.
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add checker for shared UI thread.
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // And also check IO thread.
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // And the display thread.
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));

        // Initialize monitor for Binder threads.
        addMonitor(new BinderThreadMonitor());
    }

Initializing MonitorChecker

As the code above shows, the Watchdog constructor also adds a BinderThreadMonitor.

    private static final class BinderThreadMonitor implements Watchdog.Monitor {
        @Override
        public void monitor() {
            Binder.blockUntilThreadAvailable();
        }
    }

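Adding checkers from outside

Besides the fixed checkers that Watchdog adds itself, it exposes two methods, addMonitor and addThread, so that other code can register monitors with the MonitorChecker and add new HandlerCheckers. The code is as follows: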

    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            mMonitorChecker.addMonitor(monitor);
        }
    }

    public void addThread(Handler thread, long timeoutMillis) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Threads can't be added once the Watchdog is running");
            }
            final String name = thread.getLooper().getThread().getName();
            mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
        }
    }

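For example, ActivityManagerService adds both a monitor and a handler (the one-argument addThread overload used here falls back to DEFAULT_TIMEOUT):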

public class ActivityManagerService extends IActivityManager.Stub implements Watchdog.Monitor
{
        // ...
        Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);

}
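
The "register before start" contract of addMonitor can be tried out in plain Java, outside the framework. The following is only a minimal sketch: MiniWatchdog and its nested Monitor interface are hypothetical stand-ins, not the real com.android.server.Watchdog, and the thread body just sleeps instead of checking anything.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Watchdog; only models the registration rule. */
final class MiniWatchdog extends Thread {
    /** Mirrors the shape of Watchdog.Monitor: a single monitor() callback. */
    interface Monitor {
        void monitor();
    }

    private final List<Monitor> monitors = new ArrayList<>();

    MiniWatchdog() {
        super("mini-watchdog");
        setDaemon(true); // do not keep the JVM alive for this sketch
    }

    /** Accepts monitors only while the thread has not been started yet. */
    synchronized void addMonitor(Monitor monitor) {
        if (isAlive()) {
            throw new RuntimeException("Monitors can't be added once the watchdog is running");
        }
        monitors.add(monitor);
    }

    synchronized int monitorCount() {
        return monitors.size();
    }

    @Override
    public void run() {
        // Placeholder body: the real loop is discussed later in the article.
        try {
            Thread.sleep(Long.MAX_VALUE);
        } catch (InterruptedException ignored) {
            // interrupted: let the thread end
        }
    }
}
```

Registering before start() succeeds, while a call made after start() throws, which is why services register themselves during system startup.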

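How the Watchdog mechanism works

Once the core system services are running, SystemServer.java calls Watchdog.getInstance().start(), which starts the Watchdog thread and executes its run method. The code is as follows: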

    @Override
    public void run() {
        boolean waitedHalf = false;
        while (true) {
            final ArrayList<HandlerChecker> blockedCheckers;
            final String subject;
            final boolean allowRestart;
            int debuggerWasConnected = 0;
            synchronized (this) {
                long timeout = CHECK_INTERVAL;
                // [a] 
                // Make sure we (re)spin the checkers that have become idle within
                // this wait-and-check interval
                for (int i=0; i<mHandlerCheckers.size(); i++) {
                    HandlerChecker hc = mHandlerCheckers.get(i);
                    hc.scheduleCheckLocked();
                }

                if (debuggerWasConnected > 0) {
                    debuggerWasConnected--;
                }

                // NOTE: We use uptimeMillis() here because we do not want to increment the time we
                // wait while asleep. If the device is asleep then the thing that we are waiting
                // to timeout on is asleep as well and won't have a chance to run, causing a false
                // positive on when to kill things.
                long start = SystemClock.uptimeMillis();
                // [b] wait 30s 
                while (timeout > 0) {
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    try {
                        wait(timeout); // wait time 
                    } catch (InterruptedException e) {
                        Log.wtf(TAG, e);
                    }
                    if (Debug.isDebuggerConnected()) {
                        debuggerWasConnected = 2;
                    }
                    timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
                }
                // [c] evaluate checker state
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // The monitors have returned; reset
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // still waiting but within their configured intervals; back off and recheck
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        // [d] 
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
                    }
                    continue;
                }
                // [e]
                // something is overdue!
                blockedCheckers = getBlockedCheckersLocked();
                subject = describeCheckersLocked(blockedCheckers);
                allowRestart = mAllowRestart;
            }

            // If we got here, that means that the system is most likely hung.
            // First collect stack traces from all threads of the system process.
            // Then kill this process so that the system will restart.
            EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

            ArrayList<Integer> pids = new ArrayList<>();
            pids.add(Process.myPid());
            if (mPhonePid > 0) pids.add(mPhonePid);
            // Pass !waitedHalf so that just in case we somehow wind up here without having
            // dumped the halfway stacks, we properly re-initialize the trace file.
            // [f] print all stack info
            final File stack = ActivityManagerService.dumpStackTraces(
                    !waitedHalf, pids, null, null, getInterestingNativePids());

            // Give some extra time to make sure the stack traces get written.
            // The system's been hanging for a minute, another second or two won't hurt much.
            SystemClock.sleep(2000);

            // Pull our own kernel thread stacks as well if we're configured for that
            if (RECORD_KERNEL_THREADS) {
                dumpKernelStackTraces();
            }

            // Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
            doSysRq('w');
            doSysRq('l');

            // Try to add the error to the dropbox, but assuming that the ActivityManager
            // itself may be deadlocked.  (which has happened, causing this statement to
            // deadlock and the watchdog as a whole to be ineffective)
            Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                    public void run() {
                        mActivity.addErrorToDropBox(
                                "watchdog", null, "system_server", null, null,
                                subject, null, stack, null);
                    }
                };
            dropboxThread.start();
            try {
                dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
            } catch (InterruptedException ignored) {}

            IActivityController controller;
            synchronized (this) {
                controller = mController;
            }
            // [g] report to controller
            if (controller != null) {
                Slog.i(TAG, "Reporting stuck state to activity controller");
                try {
                    Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                    // 1 = keep waiting, -1 = kill system
                    int res = controller.systemNotResponding(subject);
                    if (res >= 0) {
                        Slog.i(TAG, "Activity controller requested to coninue to wait");
                        waitedHalf = false;
                        continue;
                    }
                } catch (RemoteException e) {
                }
            }
            // [h] kill the process
            // Only kill the process if the debugger is not attached.
            if (Debug.isDebuggerConnected()) {
                debuggerWasConnected = 2;
            }
            if (debuggerWasConnected >= 2) {
                Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
            } else if (debuggerWasConnected > 0) {
                Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
            } else if (!allowRestart) {
                Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
            } else {
                Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
                for (int i=0; i<blockedCheckers.size(); i++) {
                    Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                    StackTraceElement[] stackTrace
                            = blockedCheckers.get(i).getThread().getStackTrace();
                    for (StackTraceElement element: stackTrace) {
                        Slog.w(TAG, "    at " + element);
                    }
                }
                Slog.w(TAG, "*** GOODBYE!");
                Process.killProcess(Process.myPid());
                System.exit(10);
            }

            waitedHalf = false;
        }
    }

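Let's go through this function step by step. The Watchdog thread is an infinite loop, so it keeps running for as long as the process lives. The code above is annotated with the markers [a]-[h], which correspond to steps a-h below.

a. First, iterate over all registered HandlerCheckers and call scheduleCheckLocked on each to start a check. Code fragment: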

        public void scheduleCheckLocked() {
            if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
               // ...
                mCompleted = true;
                return;
            }

            if (!mCompleted) {
                // we already have a check in flight, so no need
                return;
            }

            mCompleted = false;
            mCurrentMonitor = null;
            mStartTime = SystemClock.uptimeMillis();
            mHandler.postAtFrontOfQueue(this);
        }
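  • HandlerChecker: for a HandlerChecker whose name is not foreground thread, mMonitors.size() is 0. If mHandler.getLooper().getQueue().isPolling() returns true, the looper is idle and waiting for new messages, so the queue is healthy and the check completes at once. Otherwise the looper is busy and may be stuck, so the checker posts itself with mHandler.postAtFrontOfQueue(this). The post call itself returns immediately, but mCompleted stays false until the thread actually runs the posted message, which never happens if the thread is blocked.
  • MonitorChecker: the MonitorChecker has monitors, so it always posts itself with mHandler.postAtFrontOfQueue(this). Because the message is placed at the front of the queue, its run method executes as soon as FgThread is free to handle it. Code fragment: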
        @Override
        public void run() {
            final int size = mMonitors.size();
            for (int i = 0 ; i < size ; i++) {
                synchronized (Watchdog.this) {
                    mCurrentMonitor = mMonitors.get(i);
                }
                mCurrentMonitor.monitor();
            }

            synchronized (Watchdog.this) {
                mCompleted = true;
                mCurrentMonitor = null;
            }
        }

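The MonitorChecker calls monitor() on each registered monitor. Take the monitor method in ActivityManagerService.java: it does nothing but acquire synchronized (this), so if this is held somewhere else, the call waits here.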

    public void monitor() {
        synchronized (this) { }
    }
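
To see why an empty synchronized block works as a probe, here is a minimal plain-Java sketch. LockProbeDemo, ProbedService, probe, and holdLock are all hypothetical names made up for this illustration. Unlike the real Watchdog, which runs the probe on FgThread and looks at mCompleted later, this sketch runs the probe on a throwaway thread and waits for it with a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeDemo {
    /** Hypothetical service that guards its state with its own intrinsic lock. */
    static final class ProbedService {
        void monitor() {
            synchronized (this) { } // returns only once the lock can be acquired
        }
    }

    /** Returns true if monitor() returned within timeoutMs, false if it is still stuck. */
    static boolean probe(ProbedService service, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Grabs the service lock on another thread; count down the result to release it. */
    static CountDownLatch holdLock(ProbedService service) {
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                held.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // fall through and release the lock
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            held.await(); // make sure the lock is really held before returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return release;
    }
}
```

While another thread holds the lock, probe returns false; after the holder releases it, probe returns true again. That is the same signal the Watchdog reads: a monitor() call that does not return means the service lock is held for too long.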

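BinderThreadMonitor is a special case. The actual check lives in frameworks/native/libs/binder/IPCThreadState.cpp: the number of threads currently executing Binder calls must stay below mMaxThreads. For SystemServer this maximum is 31, defined in SystemServer.java. Code fragment: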

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
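
The same "block until a slot is free" idea can be sketched in Java with a lock and a condition variable. ThreadSlotGate and its methods are hypothetical names and are not part of the Binder API. One deliberate difference: the native function waits without a deadline, while this sketch takes a timeout so that it can be exercised from a single thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical analog of the executing-thread counter checked by blockUntilThreadAvailable. */
final class ThreadSlotGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    private final int maxThreads;
    private int executing; // guarded by lock

    ThreadSlotGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** A pool thread starts handling a call. */
    void beginCall() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    /** A pool thread finished a call; wake up anyone waiting for a free slot. */
    void endCall() {
        lock.lock();
        try {
            executing--;
            slotFreed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Returns true once executing < maxThreads, or false if timeoutMs passes first. */
    boolean awaitFreeSlot(long timeoutMs) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0) {
                    return false;
                }
                nanos = slotFreed.awaitNanos(nanos); // remaining time after this wait
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

With every slot busy the wait does not return in time, which corresponds to the monitor() call of BinderThreadMonitor never completing and the MonitorChecker eventually becoming overdue.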

Back in HandlerChecker's run method: if every mCurrentMonitor.monitor() call returns without getting stuck, the checker sets mCompleted = true and mCurrentMonitor = null.

Step [c] below uses this result.
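
Step a boils down to a heartbeat: post a small task to the watched thread and see whether it ever runs. The sketch below models that with a single-thread ExecutorService standing in for the Looper thread. HeartbeatSketch and its methods are hypothetical names. An executor has no equivalent of postAtFrontOfQueue, so here the heartbeat goes to the back of the queue, which is enough to show the idea.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class HeartbeatSketch {
    private final ExecutorService worker; // stands in for the monitored Looper thread
    private final AtomicBoolean completed = new AtomicBoolean(true);

    HeartbeatSketch(ExecutorService worker) {
        this.worker = worker;
    }

    /** Posts a heartbeat unless one is already in flight, like scheduleCheckLocked. */
    void scheduleCheck() {
        if (!completed.compareAndSet(true, false)) {
            return; // a check is already in flight
        }
        worker.execute(() -> completed.set(true)); // runs only if the worker is not stuck
    }

    boolean isCompleted() {
        return completed.get();
    }

    /** Polls until the heartbeat has run or timeoutMs has passed. */
    boolean awaitCompleted(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!completed.get()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return completed.get();
            }
        }
        return true;
    }
}
```

If the worker is idle, the heartbeat runs almost immediately and the flag flips back to true. If the worker is stuck in a long task, the flag stays false, and the time since the heartbeat was posted is what the later steps compare against the timeout.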

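b. The check interval is 30s, so after starting the checks the Watchdog thread waits for 30s. The loop recomputes the remaining time after every wakeup, so returning early from wait does not shorten the interval.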
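
The wait loop of step b can be reproduced in plain Java. This is a minimal sketch with hypothetical names (FullIntervalWait, waitFullInterval). It uses System.nanoTime() as the clock, whereas the framework code uses SystemClock.uptimeMillis() so that time spent in device sleep is not counted.

```java
final class FullIntervalWait {
    /**
     * Waits on lock until at least intervalMs has elapsed, going back to wait after
     * every early wakeup (notify, spurious wakeup, interrupt). Returns the elapsed ms.
     */
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // The framework loop logs the interruption and keeps waiting;
                    // this sketch keeps waiting without logging.
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMs - elapsedMs; // recompute instead of trusting wait()
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Because the remaining time is derived from the start timestamp on each pass, a notify that arrives after 10ms of a 50ms interval only costs one extra loop iteration.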

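c. Call evaluateCheckerCompletionLocked to compute the overall result. It calls getCompletionStateLocked on each checker to get its completion state and keeps the worst one. Code fragment: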

        public int getCompletionStateLocked() {
            if (mCompleted) {
                return COMPLETED;
            } else {
                long latency = SystemClock.uptimeMillis() - mStartTime;
                if (latency < mWaitMax/2) {
                    return WAITING;
                } else if (latency < mWaitMax) {
                    return WAITED_HALF;
                }
            }
            return OVERDUE;
        }

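The four states and their conditions (with the default 60s timeout) are:

  • COMPLETED: the watched message queue is not blocked and the watched monitors can acquire their locks. As described in step [a], mCompleted is true in this case.
  • WAITING: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 0 and 30s.
  • WAITED_HALF: the message queue has been blocked, or a monitor has been unable to acquire its lock, for between 30 and 60s.
  • OVERDUE: the message queue has been blocked, or a monitor has been unable to acquire its lock, for longer than the default timeout of 60s.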
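
The thresholds above are easy to check with a small standalone function. The sketch below restates getCompletionStateLocked as a pure function and adds a worstOf helper for combining several checkers. It assumes the four states are integer constants ordered from best to worst; the class name, method names, and numeric values are choices made for this illustration.

```java
final class CompletionStateSketch {
    // Assumed ordering: a larger value means a worse state.
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Same branching as getCompletionStateLocked, with the inputs passed in explicitly. */
    static int stateFor(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** Overall state of several checkers: the worst individual state wins. */
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }

    public static void main(String[] args) {
        long waitMax = 60_000L; // the default 60s timeout
        System.out.println(stateFor(false, 10_000L, waitMax)); // 1 (WAITING)
        System.out.println(stateFor(false, 30_000L, waitMax)); // 2 (WAITED_HALF)
        System.out.println(stateFor(false, 60_000L, waitMax)); // 3 (OVERDUE)
        System.out.println(worstOf(COMPLETED, WAITING, WAITED_HALF)); // 2
    }
}
```

Note the boundaries: exactly 30s already counts as WAITED_HALF and exactly 60s already counts as OVERDUE, because both comparisons are strict.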

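d. COMPLETED and WAITING are both within the acceptable range. If the state is WAITED_HALF, the Watchdog calls ActivityManagerService.dumpStackTraces(true, pids, null, null, getInterestingNativePids()) to dump the stack traces of the current process and of the native processes of interest. The native processes are mainly the following: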

    public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
        "/system/bin/audioserver",
        "/system/bin/cameraserver",
        "/system/bin/drmserver",
        "/system/bin/mediadrmserver",
        "/system/bin/mediaserver",
        "/system/bin/sdcard",
        "/system/bin/surfaceflinger",
        "media.extractor", // system/bin/mediaextractor
        "media.codec", // vendor/bin/hw/android.hardware.media.omx@1.0-service
        "com.android.bluetooth",  // Bluetooth service
    };

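e. If the state is OVERDUE, a checker has timed out. getBlockedCheckersLocked collects the checkers that are overdue, and describeCheckersLocked builds a description of where each one is blocked:

  • MonitorChecker overdue: Blocked in monitor + the monitor's name + on + the thread name
  • HandlerChecker overdue: Blocked in handler on + the checker name (for example ui thread) + the thread name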
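
When triaging logs it helps to split such a description back into its parts. The sketch below is a hypothetical helper, not framework code. It assumes each entry has the shape "Blocked in monitor <class> on <checker name> (<thread name>)" or "Blocked in handler on <checker name> (<thread name>)", with several entries joined by ", ". The exact punctuation is an assumption and should be checked against the logs of the Android version in question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WatchdogSubjectParser {
    /** One blocked checker extracted from the description. */
    static final class Blocked {
        final String monitorClass; // null when a handler (message queue) is blocked
        final String checkerName;
        final String threadName;

        Blocked(String monitorClass, String checkerName, String threadName) {
            this.monitorClass = monitorClass;
            this.checkerName = checkerName;
            this.threadName = threadName;
        }
    }

    // Assumed entry shape, see the note above.
    private static final Pattern ENTRY = Pattern.compile(
            "Blocked in (?:monitor (\\S+) on|handler on) (.+?) \\(([^)]+)\\)");

    /** Returns every blocked entry found in the text, in order of appearance. */
    static List<Blocked> parse(String subject) {
        List<Blocked> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(subject);
        while (m.find()) {
            result.add(new Blocked(m.group(1), m.group(2), m.group(3)));
        }
        return result;
    }
}
```

A monitor entry points at a lock that is held for too long, so the next step is to find the holder of that lock in the trace file. A handler entry points at a thread whose current message does not finish, so the next step is to read that thread's own stack.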

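f. Call ActivityManagerService.dumpStackTraces again to dump the call stacks of the current process and of the native processes of interest. If RECORD_KERNEL_THREADS is set, call dumpKernelStackTraces to dump the kernel stacks as well. doSysRq is also called to dump the current kernel and CPU state:

  • doSysRq('w'): dumps tasks that are in uninterruptible (blocked) state.
  • doSysRq('l'): shows a stack backtrace for all active CPUs.

The error is also written to DropBox.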

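g. If an ActivityController is set, the Watchdog reports the stuck state to it. If the controller returns a value >= 0, the Watchdog keeps waiting instead of killing the process.

h. The Watchdog kills its own process. Unless a debugger is (or recently was) attached or restarts are not allowed, it writes WATCHDOG KILLING SYSTEM PROCESS to the log and calls Process.killProcess(Process.myPid()) to kill the system process.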
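
Steps g and h form a small decision table, which the following sketch writes out as a pure function. KillDecisionSketch, Action, and decide are hypothetical names. The sketch assumes the caller has already folded the live Debug.isDebuggerConnected() check into debuggerWasConnected, as the run method does right before this decision.

```java
final class KillDecisionSketch {
    enum Action { KEEP_WAITING, SKIP_DEBUGGER, SKIP_RESTART_DISABLED, KILL }

    /**
     * @param controllerResult value returned by the controller, or null when no
     *                         controller is set or the call failed
     * @param debuggerWasConnected countdown kept by the loop; > 0 means a debugger
     *                             is attached or was attached recently
     * @param allowRestart whether the watchdog is allowed to restart the system
     */
    static Action decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        if (controllerResult != null && controllerResult >= 0) {
            return Action.KEEP_WAITING; // controller asked to continue waiting
        }
        if (debuggerWasConnected > 0) {
            return Action.SKIP_DEBUGGER; // never kill under a debugger
        }
        if (!allowRestart) {
            return Action.SKIP_RESTART_DISABLED;
        }
        return Action.KILL;
    }
}
```

Only the last branch leads to the WATCHDOG KILLING SYSTEM PROCESS line, so a log that contains the "is *not* killing" messages instead means the hang was detected but the process was left running.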

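Summary

1. Watchdog uses HandlerCheckers to detect blocked message queues and the MonitorChecker to detect core system services that hold a lock for too long.
2. A HandlerChecker uses mHandler.getLooper().getQueue().isPolling(), together with a message posted to the front of the queue, to decide whether a thread is stuck. BinderThreadMonitor checks whether the number of executing Binder threads has reached the process maximum. The other monitors check by acquiring synchronized (this).
3. After a timeout, the system dumps a series of diagnostics: stack traces of the current process and the core native processes, kernel thread stack traces, the blocked tasks in the kernel, and backtraces of all CPUs.
4. After a timeout, the Watchdog kills its own process, which causes zygote to restart.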