ANR分析记录(二)线程死锁

日常Android ANR问题分析记录。

0x01 获取ANR日志

我们都知道当应用发生ANR时,系统会将ANR时的日志信息保存到在/data/anr/traces.txt文件中。可以通过分析traces.txt文件对引发ANR问题的原因进行分析,但是在实际发生问题的时候,因为系统文件访问权限的问题,很难在非root的系统中获取到/data/anr/traces.txt问题。所以这里需要借助另一个工具adb bugreport来获取异常日志。
https://developer.android.com/studio/debug/bug-report?hl=zh-cn

默认情况下,ZIP 文件称为 bugreport-BUILD_ID-DATE.zip,它可能会包含多个文件,但最重要的文件是 bugreport-BUILD_ID-DATE.txt。此文件就是错误报告,它包含系统服务 (dumpsys)、错误日志 (dumpstate) 和系统消息日志 (logcat) 的诊断输出。系统消息包括设备抛出错误时的堆栈轨迹,以及从所有应用中使用 Log 类写入的消息。
ZIP 文件中有一个 version.txt 元数据文件,其中包含 Android 版本号,而且启用 systrace 后,ZIP 文件中还会包含 systrace.txt 文件。Systrace 工具可以获取并显示应用进程和其他 Android 系统进程的执行时间,从而帮助分析应用的性能。
dumpstate 工具会将文件从设备的文件系统复制到 ZIP 文件的 FS 文件夹下,以便您引用它们。例如,设备中的 /dirA/dirB/fileC 文件会在 ZIP 文件中生成 FS/dirA/dirB/fileC 条目。

https://source.android.com/source/read-bug-reports.html?hl=zh-cn

1
$ adb bugreport

image.png

1
$ adb shell ls /bugreports/

image.png
解压bugreport.zip就可以看到系统logcat和dumpsys日志信息,而我们想要重点分析的ant traces日志就只/FS/data/anr目录中。

image.png

0x02 日志分析

2.1 分析traces.txt时需要关注的关键信息

  1. pid为进程id,sysTid=pid,这里主线程的线程号=进程号,prio=5为线程优先级
  2. 当一个线程占有一个锁的时候,会打印-locked<0xxxxxxx>
  3. 当该线程正在等待别的线程释放该锁,会打印waiting to lock <0xxxxxx>
  4. 如果代码中有wait()调用的话,首先是locked,然后会打印waiting on <0xxxxxx>

2.2 日志一

  • 异常信息: - waiting to lock <0x0c8b6ef5> (a java.lang.Class<xxx.DeviceMeta>) held by thread 60
  • 异常说明:主线程等待thread 60释放<0x0c8b6ef5>对象,导致主线程被blocked,发生ANR
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    "main" prio=5 tid=1 Blocked
    | group="main" sCount=1 dsCount=0 flags=1 obj=0x722423a8 self=0x7e6f0a0c00
    | sysTid=1840 nice=-10 cgrp=default sched=0/0 handle=0x7e7074c4f8
    | state=S schedstat=( 4835749094 86478588 880 ) utm=442 stm=41 core=6 HZ=100
    | stack=0x7ffba79000-0x7ffba7b000 stackSize=8192KB
    | held mutexes=
    at xxxxxxxxxxxxxxxxxxxxxxx.DeviceMeta.getInstance(DeviceMeta.java:24)
    - waiting to lock <0x0c8b6ef5> (a java.lang.Class<xxxxxxxxxxxxxxxxxxxxxxx.DeviceMeta>) held by thread 60
    at xxxxxxxxxxxxxxxxxxxxxxx
    at kotlin.SynchronizedLazyImpl.getValue(LazyJVM.kt:74)
    - locked <0x09ab148a> (a kotlin.SynchronizedLazyImpl)
    at xxxxxxxxxxxxxxxxxxxxxxx
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedContinuationKt.resumeCancellableWith(DispatchedContinuation.kt:313)
    at kotlinx.coroutines.intrinsics.CancellableKt.startCoroutineCancellable(Cancellable.kt:26)
    at kotlinx.coroutines.CoroutineStart.invoke(CoroutineStart.kt:109)
    at kotlinx.coroutines.AbstractCoroutine.start(AbstractCoroutine.kt:158)
    at kotlinx.coroutines.BuildersKt__Builders_commonKt.launch(Builders.common.kt:54)
    at kotlinx.coroutines.BuildersKt.launch(unavailable:1)
    at kotlinx.coroutines.BuildersKt__Builders_commonKt.launch$default(Builders.common.kt:47)
    at kotlinx.coroutines.BuildersKt.launch$default(unavailable:1)
    at xxxxxxxxxxxxxxxxxxxxxxx
    at androidx.fragment.app.FragmentManagerImpl.moveToState(FragmentManager.java:1471)
    at androidx.fragment.app.FragmentManagerImpl.moveFragmentToExpectedState(FragmentManager.java:1784)
    at androidx.fragment.app.FragmentManagerImpl.moveToState(FragmentManager.java:1852)
    at androidx.fragment.app.BackStackRecord.executeOps(BackStackRecord.java:802)
    at androidx.fragment.app.FragmentManagerImpl.executeOps(FragmentManager.java:2625)
    at androidx.fragment.app.FragmentManagerImpl.executeOpsTogether(FragmentManager.java:2411)
    at androidx.fragment.app.FragmentManagerImpl.removeRedundantOperationsAndExecute(FragmentManager.java:2366)
    at androidx.fragment.app.FragmentManagerImpl.execPendingActions(FragmentManager.java:2273)
    at androidx.fragment.app.FragmentManagerImpl.dispatchStateChange(FragmentManager.java:3273)
    at androidx.fragment.app.FragmentManagerImpl.dispatchActivityCreated(FragmentManager.java:3229)
    at androidx.fragment.app.FragmentController.dispatchActivityCreated(FragmentController.java:201)
    at androidx.fragment.app.FragmentActivity.onStart(FragmentActivity.java:620)
    at androidx.appcompat.app.AppCompatActivity.onStart(AppCompatActivity.java:178)
    at android.app.Instrumentation.callActivityOnStart(Instrumentation.java:1436)
    at android.app.Activity.performStart(Activity.java:8190)
    at android.app.ActivityThread.handleStartActivity(ActivityThread.java:3566)
    at android.app.servertransaction.TransactionExecutor.performLifecycleSequence(TransactionExecutor.java:221)
    at android.app.servertransaction.TransactionExecutor.cycleToPath(TransactionExecutor.java:201)
    at android.app.servertransaction.TransactionExecutor.executeLifecycleState(TransactionExecutor.java:173)
    at android.app.servertransaction.TransactionExecutor.execute(TransactionExecutor.java:97)
    at android.app.ActivityThread$H.handleMessage(ActivityThread.java:2136)
    at com.didichuxing.doraemonkit.kit.timecounter.instrumentation.ProxyHandlerCallback.handleMessage(ProxyHandlerCallback.java:38)
    at leakcanary.ServiceWatcher$install$3$2.handleMessage(ServiceWatcher.kt:63)
    at android.os.Handler.dispatchMessage(Handler.java:102)
    at android.os.Looper.loop(Looper.java:236)
    at android.app.ActivityThread.main(ActivityThread.java:8060)
    at java.lang.reflect.Method.invoke(Native method)
    at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:656)
    at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:967)

2.3 日志二:

  • 异常信息:pool-1-thread-14,waiting on <0x07f9f1a9>,WebViewChromiumFactoryProvider.getStatics
  • 异常说明,这里tid=60的线程就是pool-1-thread-14,而pool-1-thread-14线程等待<0x07f9f1a9>被释放,涉及到获取默认Webview UA的流程
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
"pool-1-thread-14" prio=5 tid=60 Waiting
| group="main" sCount=1 dsCount=0 flags=1 obj=0x12f465b0 self=0x7d7dec6400
| sysTid=2072 nice=0 cgrp=default sched=0/0 handle=0x7d44671cc0
| state=S schedstat=( 188559697 24197915 209 ) utm=13 stm=5 core=6 HZ=100
| stack=0x7d4456e000-0x7d44570000 stackSize=1043KB
| held mutexes=
at java.lang.Object.wait(Native method)
- waiting on <0x07f9f1a9> (a java.lang.Object)
at java.lang.Object.wait(Object.java:442)
at java.lang.Object.wait(Object.java:568)
at Lw0.b(chromium-TrichromeWebViewGoogle.apk-stable-451513133:22)
at Lw0.g(chromium-TrichromeWebViewGoogle.apk-stable-451513133:3)
- locked <0x07f9f1a9> (a java.lang.Object)
at com.android.webview.chromium.WebViewChromiumFactoryProvider.getStatics(chromium-TrichromeWebViewGoogle.apk-stable-451513133:4)
- locked <0x07f9f1a9> (a java.lang.Object)
at android.webkit.WebSettings.getDefaultUserAgent(WebSettings.java:1355)
at xxxxxxxxxxxxxxxxxxxxxxx.DeviceMeta.initialize(DeviceMeta.java:73)
at xxxxxxxxxxxxxxxxxxxxxxx
- locked <0x0c8b6ef5> (a java.lang.Class<com.hellobike.deviceinfo.DeviceMeta>)
at xxxxxxxxxxxxxxxxxxxxxxx
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:462)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
at java.lang.Thread.run(Thread.java:923)

2.4 日志三

  • 异常信息:DefaultDispatcher-worker-2,<0x07f9f1a9>,WebViewChromiumFactoryProvider.getStatics
  • 异常分析:DefaultDispatcher-worker-2线程等待<0x07f9f1a9>被释放,涉及到获取默认Webview UA的流程
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    "DefaultDispatcher-worker-2" daemon prio=5 tid=88 Waiting
    | group="main" sCount=1 dsCount=0 flags=1 obj=0x12f48e28 self=0x7dcc42b400
    | sysTid=2127 nice=0 cgrp=default sched=0/0 handle=0x7d25aa7cc0
    | state=S schedstat=( 2489276 592551 5 ) utm=0 stm=0 core=1 HZ=100
    | stack=0x7d259a4000-0x7d259a6000 stackSize=1043KB
    | held mutexes=
    at java.lang.Object.wait(Native method)
    - waiting on <0x07f9f1a9> (a java.lang.Object)
    at java.lang.Object.wait(Object.java:442)
    at java.lang.Object.wait(Object.java:568)
    at Lw0.b(chromium-TrichromeWebViewGoogle.apk-stable-451513133:22)
    at Lw0.g(chromium-TrichromeWebViewGoogle.apk-stable-451513133:3)
    - locked <0x07f9f1a9> (a java.lang.Object)
    at com.android.webview.chromium.WebViewChromiumFactoryProvider.getStatics(chromium-TrichromeWebViewGoogle.apk-stable-451513133:4)
    - locked <0x07f9f1a9> (a java.lang.Object)
    at android.webkit.WebSettings.getDefaultUserAgent(WebSettings.java:1355)
    at xxxxxxxxxxxxxxxxxxxxxxx.getDefaultUserAgent(UserAgentUtils.java:29)
    at xxxxxxxxxxxxxxxxxxxxxxx
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:561)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:727)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:667)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:655)

0x03 异常分析

3.1 线索

通过以上的3条traces日志表现,可以大致得到以下线索:

  1. 因为单例的使用,主线程等待pool-1-thread-14释放单例对象中的锁对象
  2. 而pool-1-thread-14线程和DefaultDispatcher-worker-2都在获取WebSettings.getDefaultUserAgent的流程里,并且这两个线程都在等待状态,webview内核中应该有wait()的调用
  3. 按照常理如果子线程创建UA完毕,结束了等待状态后,主线程就能够获取到想要的锁对象了,也就不会存在ANR问题,但是现在既然出现了ANR,说明很可能是因为两个获取UA的流程都不能结束等待状态。看起来就很像是一个死锁问题了

3.2 为什么会出现死锁?

通过查阅网上的资料大致了解一下WebView内核启动的源码,问题就更清晰了。原来在WebView内核的启动流程里,会有一个runnable被提交给主线程,负责在主线程执行startChromiumLocked方法,只有当主线程执行完该方法以后,才会结束等待状态。这也就是为什么上述的DefaultDispatcher-worker-2和pool-1-thread-14两个线程一直处于等待状态的原因。而发生死锁的原因正式因为在此时,主线程想要获取的锁对象正在被tid=60的pool-1-thread-14线程持有,startChromiumLocked方法又得不到真正的执行。这样相互持有依赖的锁对象而又都处于等待状态,这就是标准的死锁问题了。

WebViewFactoryProvider.getStatics()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22


@Override
public Statics getStatics() {
synchronized (mAwInit.getLock()) {
SharedStatics sharedStatics = mAwInit.getStatics();
if (mStaticsAdapter == null) {
mStaticsAdapter = new WebViewChromiumFactoryProvider.Statics() {
@Override
public String findAddress(String addr) {
return sharedStatics.findAddress(addr);
}

@Override
public String getDefaultUserAgent(Context context) {
return sharedStatics.getDefaultUserAgent(context);
}
};
}
}
return mStaticsAdapter;
}

WebViewChromiumAwInit.getStatics()

1
2
3
4
5
6
7
8
9
10
11
public SharedStatics getStatics() {
synchronized (mLock) {
if (mSharedStatics == null) {
// TODO: Optimization potential: most these methods only need the native library
// loaded and initialized, not the entire browser process started.
// See also http://b/7009882
ensureChromiumStartedLocked(true);
}
}
return mSharedStatics;
}

WebViewChromiumAwInit.ensureChromiumStartedLocked(Boolean)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
// This method is not private only because the downstream subclass needs to access it,
// it shouldn't be accessed from anywhere else.
/* package */ void ensureChromiumStartedLocked(boolean onMainThread) {
assert Thread.holdsLock(mLock);

if (mStarted) { // Early-out for the common case.
return;
}

Looper looper = !onMainThread ? Looper.myLooper() : Looper.getMainLooper();
ThreadUtils.setUiThread(looper);

if (ThreadUtils.runningOnUiThread()) {
startChromiumLocked();
return;
}

AwThreadUtils.postToUiThreadLooper(new Runnable() {
@Override
public void run() {
synchronized (mLock) {
startChromiumLocked();
}
}
});
while (!mStarted) {
try {
mLock.wait();
} catch (InterruptedException e) {
// Keep trying... eventually the UI thread will process the task we sent it.
}
}
}

3.3 验证以上的分析,通过demo复现

正如以上分析的那样,可以写出以下的demo代码,就可以100%复现这个问题。解决这个问题的方法也很简单,避免主线程和子线程等待同一个对象锁就可以了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
Object lock = new Object();

private void testUA() {
new Thread(new Runnable() {
@Override
public void run() {
synchronized (lock) {
WebSettings.getDefaultUserAgent(getApplicationContext());
}
}
}).start();

try {
Thread.sleep(30);
} catch (InterruptedException e) {
e.printStackTrace();
}
synchronized (lock) {
WebSettings.getDefaultUserAgent(getApplicationContext());
}
}

3.4 复习以下死锁的必要条件

  1. 产生死锁的必要条件:
    1. 互斥条件:进程要求对所分配的资源进行排它性控制,即在一段时间内某资源仅为一进程所占用。
    2. 请求和保持条件:当进程因请求资源而阻塞时,对已获得的资源保持不放。
    3. 不剥夺条件:进程已获得的资源在未使用完之前,不能剥夺,只能在使用完时由自己释放。
    4. 环路等待条件:在发生死锁时,必然存在一个进程–资源的环形链。
  2. 预防死锁:
    1. 资源一次性分配:一次性分配所有资源,这样就不会再有请求了:(破坏请求条件)
    2. 只要有一个资源得不到分配,也不给这个进程分配其他的资源:(破坏请保持条件)
    3. 可剥夺资源:即当某进程获得了部分资源,但得不到其它资源,则释放已占有的资源(破坏不可剥夺条件)
    4. 资源有序分配法:系统给每类资源赋予一个编号,每一个进程按编号递增的顺序请求资源,释放则相反(破坏环路等待条件)

参考文档

https://ivonhoe.github.io/2018/10/14/android-anr-1/
https://juejin.cn/post/6982896002680225800