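As we all know, when an application stays unresponsive beyond a certain time, the system shows an "Application Not Responding" (ANR) dialog so that the app is not left unusable indefinitely. The user can choose to force-close it, which kills the process.
The ANR mechanism targets applications. For system processes that stay unresponsive for a long time, Android provides the Watchdog mechanism: once the timeout is exceeded, Watchdog triggers a restart.
When analyzing logs from hang-and-reboot issues, we often see a line like this: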
*** WATCHDOG KILLING SYSTEM PROCESS:
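Taken literally, it says that the watchdog killed the system process. This is the Watchdog mechanism that this article analyzes. Watchdog registers two kinds of checkers and then checks every 30s for deadlocks and blocked threads; if a checker stays blocked past its timeout, Watchdog kills the system process. The two kinds of checkers are:
- MonitorChecker: checks whether a core system service has held a lock for too long.
- HandlerChecker: checks whether the message queue managed by the Looper of a core system thread is blocked. In fact, MonitorChecker is itself a HandlerChecker whose thread is FgThread.
For convenience, HandlerChecker in this article does not include the MonitorChecker that runs on FgThread.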
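We will analyze the Watchdog mechanism from the following angles:
- Initialization of Watchdog, MonitorChecker, and HandlerChecker.
- How the Watchdog mechanism works.
- How to analyze Watchdog hang-and-reboot issues.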
Watchdog startup
Watchdog is a thread: it extends Thread, and SystemServer.java obtains the Watchdog object through `getInstance`.
@SystemServer.java
final Watchdog watchdog = Watchdog.getInstance();
watchdog.init(context, mActivityManagerService);
The `init` method registers a broadcast receiver for `ACTION_REBOOT`.
public void init(Context context, ActivityManagerService activity) {
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
Initializing HandlerChecker
While Watchdog is being initialized, it creates `HandlerChecker` instances. The constructor is:
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
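The parameters are:
- handler: the Handler to watch.
- name: a name for the thread that the Handler belongs to, mainly so that the thread name can be printed in the log after a failure.
- waitMaxMillis: the maximum time the message queue may stay blocked; beyond this, the system process is killed. The default is 60s.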
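The Watchdog constructor creates HandlerCheckers named `foreground thread`, `main thread`, `ui thread`, `i/o thread`, and `display thread` (the FgThread one being the special case). The default `DEFAULT_TIMEOUT` is 60s, meaning the message queue of each thread's Looper must not stay blocked for more than 60s.
The code is as follows: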
private Watchdog() {
super("watchdog");
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Add checker for shared UI thread.
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// And also check IO thread.
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// And the display thread.
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
}
Initializing MonitorChecker
As the code above shows, Watchdog also adds a `BinderThreadMonitor` during initialization.
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
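Adding checkers externally
Besides the fixed checkers that Watchdog adds itself, Watchdog provides two methods, `addMonitor` and `addThread`, so that other code can add MonitorCheckers and HandlerCheckers. The code is as follows: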
public void addMonitor(Monitor monitor) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Monitors can't be added once the Watchdog is running");
}
mMonitorChecker.addMonitor(monitor);
}
}
public void addThread(Handler thread, long timeoutMillis) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Threads can't be added once the Watchdog is running");
}
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(new HandlerChecker(thread, name, timeoutMillis));
}
}
For example, `ActivityManagerService` registers both a monitor and a handler:
public class ActivityManagerService extends IActivityManager.Stub implements Watchdog.Monitor
{
// ...
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
}
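The registration rule enforced by `addMonitor` and `addThread` (checkers may only be added before the Watchdog thread starts) can be illustrated outside the framework. This is a minimal sketch in plain Java, not the framework class: `MiniWatchdog`, `monitorCount`, and the 200 ms sleep are hypothetical and exist only for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the registration rule; not the framework Watchdog.
class MiniWatchdog extends Thread {
    interface Monitor { void monitor(); }

    private final List<Monitor> monitors = new ArrayList<>();

    MiniWatchdog() { super("mini-watchdog"); }

    // Rejects registration once the thread has been started, like addMonitor above.
    synchronized void addMonitor(Monitor m) {
        if (isAlive()) {
            throw new IllegalStateException(
                    "Monitors can't be added once the watchdog is running");
        }
        monitors.add(m);
    }

    synchronized int monitorCount() { return monitors.size(); }

    @Override
    public void run() {
        try {
            Thread.sleep(200); // stay alive briefly so isAlive() can be observed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        MiniWatchdog wd = new MiniWatchdog();
        wd.addMonitor(() -> { });      // accepted: the thread has not started yet
        wd.start();
        try {
            wd.addMonitor(() -> { });  // rejected: the thread is already running
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why services register themselves while the system is still starting up, before `start()` is called on the Watchdog.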
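How the Watchdog mechanism works
Once the core system services are running, SystemServer.java calls `Watchdog.getInstance().start();`, which starts the Watchdog thread and executes its `run` method. The code is as follows: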
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// [a]
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
// [b] wait 30s
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout); // wait time
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// [c] evaluate checker state
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
// [d]
ActivityManagerService.dumpStackTraces(true, pids, null, null,
getInterestingNativePids());
waitedHalf = true;
}
continue;
}
// [e]
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
// [f] print all stack info
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, getInterestingNativePids());
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
// [g] report to controller
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// [h] kill the process
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
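Next we walk through this function step by step. The Watchdog thread runs an infinite loop, so it keeps executing for as long as the process lives. The markers [a]-[h] added to the code above correspond to steps a-h below.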
a. First, iterate over all of the system's HandlerCheckers and call `scheduleCheckLocked` on each to start a check. Code snippet:
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
// ...
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
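- HandlerChecker: for a HandlerChecker whose name is not `foreground thread`, `mMonitors.size()` is 0. If `mHandler.getLooper().getQueue().isPolling()` returns true, the Looper is idle and waiting for messages, so the queue is healthy and the check completes immediately. Otherwise the Looper is busy handling a message, and the checker is posted with `mHandler.postAtFrontOfQueue(this)`. If the thread is really stuck, the posted checker never gets to run and `mCompleted` stays `false`.
- MonitorChecker: for the monitor checker, `mMonitors.size()` is not 0, so `mHandler.postAtFrontOfQueue(this)` is always called. The message goes to the very front of the queue, so as long as the foreground thread is healthy the `run` method executes right away. Code snippet: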
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
The MonitorChecker calls each monitor's `monitor()` method. Take the `monitor` method in ActivityManagerService.java: it only acquires `synchronized (this)`, so if `this` is held somewhere else, the call waits here.
public void monitor() {
synchronized (this) { }
}
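The effect of an empty synchronized block can be reproduced outside the framework. The following is a minimal sketch in plain Java, not framework code: `LockProbe`, `probe`, `holdLock`, and the timeout values are hypothetical names chosen for illustration. It runs a `monitor()` of the same shape on a helper thread and reports whether it returned within a deadline.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; only the empty synchronized block mirrors the article.
class LockProbe {
    private final Object lock = new Object();

    // Same shape as the monitor() above: returns only once the lock is free.
    void monitor() {
        synchronized (lock) { }
    }

    // Runs monitor() on a helper thread; true if it finished within timeoutMs.
    boolean probe(long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> { monitor(); done.countDown(); }, "probe");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Holds the lock for holdMs on another thread, simulating a stuck service.
    void holdLock(long holdMs) {
        CountDownLatch held = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try { Thread.sleep(holdMs); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        t.setDaemon(true);
        t.start();
        try {
            held.await(); // return only after the lock is really held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        LockProbe p = new LockProbe();
        System.out.println("free lock: " + p.probe(1000));
        p.holdLock(2000);
        System.out.println("held lock: " + p.probe(200));
    }
}
```

The probe does no work of its own; it only measures how long it takes to get the lock, which is what makes it a cheap liveness check.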
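BinderThreadMonitor is a special case. The actual check lives in frameworks/native/libs/binder/IPCThreadState.cpp: the number of threads currently executing Binder calls must stay below `mMaxThreads`. For SystemServer this maximum is 31, defined in SystemServer.java. Code snippet: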
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
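For readers more at home in Java, the same mutex-plus-condition pattern can be sketched as follows. This is an analog for illustration only, not the Binder implementation: `ThreadBudget` and its methods are hypothetical, and the timeout parameter is an addition so the example can finish (the native function waits without a timeout).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Java analog of the mutex/condition pattern; class and method names are hypothetical.
class ThreadBudget {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadBudget(int maxThreads) { this.maxThreads = maxThreads; }

    // A thread starts handling a call.
    void beginWork() {
        lock.lock();
        try { executing++; } finally { lock.unlock(); }
    }

    // A thread finishes a call and wakes up any waiter.
    void endWork() {
        lock.lock();
        try {
            executing--;
            countDecremented.signalAll();
        } finally { lock.unlock(); }
    }

    // Waits while every slot is busy; returns false if timeoutMs elapses first.
    boolean awaitAvailable(long timeoutMs) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0) return false;
                nanos = countDecremented.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally { lock.unlock(); }
    }
}
```

When all slots stay busy, the caller never returns from the wait, which is how exhaustion of the Binder thread pool shows up as a blocked monitor.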
Back in the `run` method of `HandlerChecker`: if every `mCurrentMonitor.monitor();` call returns without blocking, it sets `mCompleted = true;` and `mCurrentMonitor = null;`. Step [c] below uses this result.
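The completed-flag protocol of step [a] can be tried out without Android. The sketch below is a plain-Java stand-in under stated assumptions: `Worker` loosely plays the role of a Looper thread using a `LinkedBlockingDeque`, `Checker` plays the role of a HandlerChecker, and all names are hypothetical rather than framework APIs.

```java
import java.util.concurrent.LinkedBlockingDeque;

// Plain-Java stand-in for a Looper thread plus a HandlerChecker; all names are hypothetical.
class QueueProbeDemo {
    // A worker thread that drains a deque of tasks, loosely playing the role of a Looper.
    static final class Worker extends Thread {
        private final LinkedBlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

        Worker() { super("worker"); setDaemon(true); }

        void post(Runnable r) { queue.addLast(r); }
        void postAtFront(Runnable r) { queue.addFirst(r); }

        @Override public void run() {
            try {
                while (true) queue.takeFirst().run();
            } catch (InterruptedException e) {
                // interrupted: let the worker exit
            }
        }
    }

    // Mirrors the completed-flag protocol: false while a probe is in flight.
    static final class Checker implements Runnable {
        private final Worker worker;
        private boolean completed = true;

        Checker(Worker worker) { this.worker = worker; }

        synchronized void scheduleCheck() {
            if (!completed) return;   // a probe is already in flight
            completed = false;
            worker.postAtFront(this); // jump ahead of any queued tasks
        }

        // Runs on the worker thread; reaching this point proves the queue is moving.
        @Override public synchronized void run() { completed = true; }

        synchronized boolean isCompleted() { return completed; }
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Posting at the front matters: the probe should measure whether the thread is stuck on its current task, not how long the backlog behind it is.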
b. The check interval is 30s, so after starting the checks the Watchdog thread waits for 30s.
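The wait in step [b] is written as a loop because `wait(timeout)` can return early, so the remaining time is recomputed after every wake-up. A minimal sketch of that pattern in plain Java follows; `IntervalWaiter` and `poke` are hypothetical, and `System.nanoTime()` is used as a stand-in for `SystemClock.uptimeMillis()`, which is only available on Android and may behave differently while the device sleeps.

```java
// Plain-Java sketch of the wait loop; System.nanoTime() stands in for SystemClock.uptimeMillis().
class IntervalWaiter {
    private final Object lock = new Object();

    // Blocks for at least intervalMs even if wait() returns early; returns elapsed ms.
    long waitFullInterval(long intervalMs) {
        long start = System.nanoTime();
        long remaining = intervalMs;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the loop in run(), keep waiting; a real caller might log this.
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                remaining = intervalMs - elapsedMs;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Wakes the waiter early, simulating a notify from another thread.
    void poke() {
        synchronized (lock) { lock.notifyAll(); }
    }
}
```

Even if another thread calls `poke()` shortly after the wait begins, `waitFullInterval` goes back to waiting for the remainder of the interval.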
c. Call `evaluateCheckerCompletionLocked` to compute the current result. It calls `getCompletionStateLocked` on each checker to get its completion state. Code snippet:
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
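The four states and their conditions are:
- COMPLETED: the monitored message queue is not blocked and the monitored monitors can acquire their locks normally. As described in step [a], `mCompleted=true` in this case.
- WAITING: the message queue has been blocked, or a monitor has been unable to acquire its lock, for 0-30s.
- WAITED_HALF: the message queue has been blocked, or a monitor has been unable to acquire its lock, for 30-60s.
- OVERDUE: the message queue has been blocked, or a monitor has been unable to acquire its lock, for longer than the default timeout of 60s.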
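The classification above, and the way the states of several checkers combine into one result, can be sketched as follows. This is an illustration rather than the framework code: it assumes the four constants are ordered by severity (0 to 3) and that the overall state is the worst one among all checkers. `CheckerStates`, `classify`, and `worstOf` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of how per-checker states combine; constant values are assumed to be ordered by severity.
class CheckerStates {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Same branching as getCompletionStateLocked(), with the inputs passed in explicitly.
    static int classify(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    // Assumption: the overall state is the worst state among all checkers.
    static int worstOf(List<Integer> states) {
        int state = COMPLETED;
        for (int s : states) state = Math.max(state, s);
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // the 60s default from the article
        List<Integer> states = Arrays.asList(
                classify(true, 0, waitMax),        // a healthy checker
                classify(false, 45_000, waitMax),  // blocked for 45s
                classify(false, 10_000, waitMax)); // blocked for 10s
        System.out.println("overall state: " + worstOf(states));
    }
}
```

Under these assumptions a single slow checker is enough to move the whole Watchdog into `WAITED_HALF` or `OVERDUE`, regardless of how many other checkers are healthy.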
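d. `COMPLETED` and `WAITING` are within the acceptable range. The first time the state is `WAITED_HALF`, Watchdog calls `ActivityManagerService.dumpStackTraces(true, pids, null, null, getInterestingNativePids())` to dump the trace of the current process, along with the traces of the native processes of interest. These are mainly the following processes: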
public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
"/system/bin/audioserver",
"/system/bin/cameraserver",
"/system/bin/drmserver",
"/system/bin/mediadrmserver",
"/system/bin/mediaserver",
"/system/bin/sdcard",
"/system/bin/surfaceflinger",
"media.extractor", // system/bin/mediaextractor
"media.codec", // vendor/bin/hw/android.hardware.media.omx@1.0-service
"com.android.bluetooth", // Bluetooth service
};
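e. If the state is `OVERDUE`, the timeout has expired. Watchdog calls `getBlockedCheckersLocked` to collect the checkers that are overdue and `describeCheckersLocked` to build a description of what is blocked: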
- MonitorChecker overdue: prints `Blocked in monitor` + monitor name + `on` + thread name.
- HandlerChecker overdue: prints `Blocked in handler on` + name (for example `ui thread`) + thread name.
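f. Call `ActivityManagerService.dumpStackTraces` again to dump the stacks of the current process and the native processes of interest. If `RECORD_KERNEL_THREADS` is set, call `dumpKernelStackTraces` to dump the kernel thread stacks. `doSysRq` is also invoked to dump the current kernel and CPU state:
- doSysRq('w'): dumps tasks that are in uninterruptible (blocked) state.
- doSysRq('l'): shows a stack backtrace for all active CPUs.
The error is also written to DropBox.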
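g. If an `ActivityController` has been set, the current information is passed to it. If it returns a value greater than or equal to 0, Watchdog keeps waiting instead of killing the process.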
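h. Watchdog kills its own process: it writes `WATCHDOG KILLING SYSTEM PROCESS` to the log and calls `Process.killProcess(Process.myPid());` to kill the system process. This happens only when no debugger is attached and restart is allowed.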
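The branching in step [h] can be condensed into a small decision function. This is a sketch that follows the if/else chain at the end of `run()`; the `KillDecision` class, the `Outcome` enum, and `decide` are hypothetical names introduced only for illustration.

```java
// Sketch of the decision made in step [h]; the enum and method names are hypothetical.
class KillDecision {
    enum Outcome {
        SKIP_DEBUGGER_ATTACHED,
        SKIP_DEBUGGER_WAS_ATTACHED,
        SKIP_RESTART_DISALLOWED,
        KILL
    }

    // debuggerWasConnected follows the counter in run(): >= 2 means a debugger is
    // attached now, a smaller positive value means one was attached recently.
    static Outcome decide(int debuggerWasConnected, boolean allowRestart) {
        if (debuggerWasConnected >= 2) return Outcome.SKIP_DEBUGGER_ATTACHED;
        if (debuggerWasConnected > 0) return Outcome.SKIP_DEBUGGER_WAS_ATTACHED;
        if (!allowRestart) return Outcome.SKIP_RESTART_DISALLOWED;
        return Outcome.KILL;
    }

    public static void main(String[] args) {
        System.out.println(decide(2, true));
        System.out.println(decide(0, false));
        System.out.println(decide(0, true));
    }
}
```

Only the last branch ends the process; in the other three cases the loop resets `waitedHalf` and starts the next round of checks.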
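Summary
1. Watchdog uses HandlerChecker to monitor whether message queues are blocked, and MonitorChecker to monitor whether core system services hold locks for too long.
2. HandlerChecker uses `mHandler.getLooper().getQueue().isPolling()` and a message posted to the front of the queue to detect a blocked queue. BinderThreadMonitor checks whether the number of busy Binder threads has reached the maximum. The other monitors use `synchronized(this)`.
3. After a timeout, the system prints a series of diagnostics: stack traces of the current process and the core native processes, kernel thread stack traces, the blocked threads in the kernel, and backtraces of all CPUs.
4. After a timeout, Watchdog kills the system process it runs in, which causes zygote to restart.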