ANR分析记录(三)主线程方法耗时

日常Android ANR问题分析记录。

0x01 ANR六看

一看 Trace

关键词:Blocked,waiting to lock,locked
看堆栈:死锁堆栈,业务调用堆栈,IPC堆栈,系统调用堆栈

二看logcat关键字

关键字:Load,CPU,Slow Operation,Kswapd,Mmcqd,Kwork,Lowmemkiller 等等
通过日志中这些关键字判断当前系统是否存在 (CPU,Mem,IO) 资源紧张的情况。

三看系统负载分布

关键字:user,sys,IOWait
通过观察系统负载,则可以进一步明确是 CPU 资源紧张,还是 IO 资源紧张;如果系统负载过高,一定是有某个进程或多个进程引起的。反之系统负载过高又会影响到所有进程调度性能。通过观察 User,Sys 的 CPU 占比,可以进一步发分析当前负载过高是发生在应用空间,还是系统空间,如大量调用逻辑 (如文件读写,内存紧张导致系统不断回收内存等等),知道这些之后,排查方向又会进一步缩小范围。

四看进程 CPU

关键字:CPU usage
image.png

五看 CPU 占比定线程

六看主线程消息调度

0x02 错误分析

2.1 日志分析

测试同学反馈在测试环境上,某些操作后app必现ANR,通过bugreport命令抓取以下ANR trace。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
DALVIK THREADS (226):
"main" prio=5 tid=1 Runnable
| group="main" sCount=0 ucsCount=0 flags=0 obj=0x714acbb0 self=0xb400007416e10800
| sysTid=14827 nice=-10 cgrp=default sched=0/0 handle=0x74c2f26500
| state=R schedstat=( 115021450752 1125156276 11142 ) utm=11377 stm=124 core=7 HZ=100
| stack=0x7fdfbd7000-0x7fdfbd9000 stackSize=8188KB
| held mutexes= "mutator lock"(shared held)
at androidx.viewpager.widget.ViewPager.populate(ViewPager.java:1171)
at androidx.viewpager.widget.ViewPager.populate(ViewPager.java:1092)
at androidx.viewpager.widget.ViewPager.onMeasure(ViewPager.java:1622)
at android.view.View.measure(View.java:25977)
at android.view.ViewGroup.measureChildWithMargins(ViewGroup.java:7073)
at android.widget.FrameLayout.onMeasure(FrameLayout.java:194)
at android.view.View.measure(View.java:25977)
at android.view.ViewGroup.measureChildWithMargins(ViewGroup.java:7073)
at android.widget.FrameLayout.onMeasure(FrameLayout.java:194)
at android.view.View.measure(View.java:25977)
at android.widget.RelativeLayout.measureChildHorizontal(RelativeLayout.java:735)
at android.widget.RelativeLayout.onMeasure(RelativeLayout.java:481)
at android.view.View.measure(View.java:25977)
at android.widget.RelativeLayout.measureChild(RelativeLayout.java:696)
at android.widget.RelativeLayout.onMeasure(RelativeLayout.java:499)
at android.view.View.measure(View.java:25977)
at android.widget.LinearLayout.measureHorizontal(LinearLayout.java:1185)
at android.widget.LinearLayout.onMeasure(LinearLayout.java:723)
at android.view.View.measure(View.java:25977)
at android.view.ViewGroup.measureChildWithMargins(ViewGroup.java:7073)
at android.widget.LinearLayout.measureChildBeforeLayout(LinearLayout.java:1552)
at android.widget.LinearLayout.measureHorizontal(LinearLayout.java:1204)
at android.widget.LinearLayout.onMeasure(LinearLayout.java:723)
at android.view.View.measure(View.java:25977)
at android.widget.RelativeLayout.measureChildHorizontal(RelativeLayout.java:735)
at android.widget.RelativeLayout.onMeasure(RelativeLayout.java:481)
at android.view.View.measure(View.java:25977)
at android.view.ViewGroup.measureChildWithMargins(ViewGroup.java:7073)
at android.widget.LinearLayout.measureChildBeforeLayout(LinearLayout.java:1552)
at android.widget.LinearLayout.measureVertical(LinearLayout.java:842)
at android.widget.LinearLayout.onMeasure(LinearLayout.java:721)

通过trace日志和viewpage源码并不能第一时间定位原因,只能看出并不是简单的锁同步问题导致的ANR。转而通过logcat日志查找更多有用的信息。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Load: 0.6 / 0.62 / 0.48
CPU usage from 24258ms to 13621ms ago (2022-07-12 12:10:12.021 to 2022-07-12 12:10:22.658) with 99% awake:
82% 16394/xxxxxxxxxxxx: 71% user + 11% kernel / faults: 4691 minor 1 major
29% 1115/surfaceflinger: 19% user + 9.9% kernel / faults: 5 minor
16% 1290/system_server: 9.9% user + 6.8% kernel / faults: 3959 minor
13% 1045/vendor.qti.hardware.display.composer-service: 9.3% user + 4% kernel
5.5% 23223/com.heytap.market: 4.2% user + 1.2% kernel / faults: 6350 minor 3 major
5.3% 387/crtc_commit:111: 0% user + 5.3% kernel
3.8% 28079/kworker/u24:6-ufs_pm_qos_0: 0% user + 3.8% kernel
3% 11017/kworker/u24:12-ufs_pm_qos_0: 0% user + 3% kernel
2.9% 296/kgsl_worker_thr: 0% user + 2.9% kernel
2.7% 9946/kworker/u24:3-kverityd: 0% user + 2.7% kernel
2% 2940/com.android.systemui: 1.4% user + 0.5% kernel / faults: 702 minor
.
.
.
24% TOTAL: 15% user + 7% kernel + 0% iowait + 1.6% irq + 0.3% softirq
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
CPU usage from 0ms to 6338ms later (2022-07-12 12:05:52.611 to 2022-07-12 12:05:58.950) with 99% awake:
119% 14827/xxxxxxxxxxxx: 109% user + 10% kernel / faults: 7495 minor
46% 1290/system_server: 21% user + 25% kernel / faults: 5888 minor
20% 627/logd: 16% user + 4.4% kernel / faults: 60205 minor
8.6% 27447/media.codec: 5.5% user + 3.1% kernel / faults: 8918 minor
4.4% 1036/android.hardware.sensors@2.0-service.multihal: 1.8% user + 2.5% kernel / faults: 559 minor 6 major
3.7% 3504/com.android.launcher: 2.6% user + 1.1% kernel / faults: 1874 minor
0.1% 1564/media.swcodec: 0% user + 0% kernel / faults: 4653 minor
2.9% 9945/kworker/u24:1+DIAG_CTRL: 0% user + 2.9% kernel
2.8% 1115/surfaceflinger: 1.1% user + 1.7% kernel / faults: 921 minor
2.6% 9991/kworker/u24:5-adb: 0% user + 2.6% kernel
2.6% 28079/kworker/u24:6-ufs_pm_qos_0: 0% user + 2.6% kernel
2.5% 9946/kworker/u24:3-kverityd: 0% user + 2.5% kernel
2.2% 2940/com.android.systemui: 0.9% user + 1.2% kernel / faults: 239 minor
.
.
.
37% TOTAL: 22% user + 11% kernel + 1% iowait + 1.5% irq + 0.3% softirq

通过以上日志可以看到,

Load: 0.6 / 0.62 / 0.48

ANR 发生前 1 分钟,前 5 分钟,前 15 分钟,系统的整体负载并不高,具体数值代表单位时间等待系统调度的任务数。

CPU usage from 24258ms to 13621ms ago
82% 16394/xxxxxxxxxxxx: 71% user + 11% kernel / faults: 4691 minor 1 major

ANR问题发生之前(CPU usage from XXX to XXX ago),app进程CPU使用率 82%,其中user占比71%,kernel 占比11%。system_server进程CPU使用率16% 。

CPU usage from 0ms to 6338ms later
119% 14827/xxxxxxxxxxxx: 109% user + 10% kernel / faults: 7495 minor

ANR问题发生以后(CPU usage from XXX to XXX later),app进程CPU使用率 119%,其中user占比109%,kernel 占比10%。system_server进程CPU使用率46% 。
通过以上数据可以判断,问题发生时,系统负载并不高,系统调用不频繁,系统IO负载很低。但是app用户进程的CPU资源很紧张,高CPU使用率应该发生在app的用户进程,大概率还是业务代码实现有问题导致的ANR。

2.2 借助Layout Inspector定位问题代码

image.pngimage.png

通过日志大致判断问题发生在app业务代码中,梳理了以下的问题排查思路。

  1. 观察trace堆栈可以发现ANR发生在页面布局绘制流程,这样通过调用堆栈可以获取页面布局的层级关系
  2. 借助AS的Layout Inspector工具分析问题发生的页面就可以找到页面中使用viewpager组件的都是哪些业务模块
  3. 结合以上1,2信息定位出发生ANR问题的详细视图位置,再根据具体的视图实现走查业务代码。
  4. 通过查看业务组件的代码实现最终找到了问题原因,业务为了实现图片banner的轮播效果,通过设置Adapter count和Viewpager mCurItem 为一个超大的整数(Integer.MAX_VALUE)来实现viewpager无限滚动效果。导致Viewpager的populate方法主线程执行超时。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    void populate(int newCurrentItem) {
    ...
    ...
    // Fill 3x the available width or up to the number of offscreen
    // pages requested to either side, whichever is larger.
    // If we have no current item we have no work to do.
    if (curItem != null) {
    for (int pos = mCurItem - 1; pos >= 0; pos--) {
    // 超大整数导致循环执行超时
    }
    }
    ...
    ...
    }
    解决方案也比较简单,针对轮播banner效果重新更换了实现方式。

0x03 总结

大部分的ANR问题主要难点在于发现问题和分析问题发生的原因。一般的分析步骤是通过借助分析日志,从ANR Trace、logcat日志、kernel日志到消息队列日志,逐步将可能出现问题的范围缩小,通过CPU、IO、系统调用情况并结合代码走查定位问题原因。本文的分析过程相对比较简单,但是总体遵循这样的分析路径,最终定位并解决了该问题。

参考文档

https://blog.csdn.net/liuwg1226/article/details/118465301