Dewu Android Crash Governance in Practice

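Contents

I. Introduction

II. DNS Resolution Crash

    1. Background

    2. Problem Analysis

    3. Solution

III. MediaCodec Abnormal-State Crash

    1. Background

    2. Problem Analysis

    3. Solution

IV. bio Crash in a Multithreaded Environment

    1. Background

    2. Problem Analysis

    3. Solution

V. Null-Pointer Crash in Focus Handling on Xiaomi Android 15

    1. Background

    2. Problem Analysis

    3. Solution

VI. Summary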

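I. Introduction

By fixing long-standing gaps in crash reporting (compatibility improvements in the on-device SDK's collection, and a more robust data-consumption pipeline on the crash platform), Dewu's Android crash monitoring has become substantially more complete. Crashes that used to go unreported are now captured, so the reported crash metrics rose accordingly. With the joint effort of the architecture team and the other teams, the crash rate has since dropped from a peak of 2 per 10,000 to the current 1.1 to 1.5 per 10,000. Of the remaining crashes, hard-to-diagnose issues account for about 90%, and crashes caused by system bugs for about 40%. This article briefly walks through how we handled several typical system crashes.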

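II. DNS Resolution Crash

Background

On Android 11 and below, DNS resolution occasionally hits a dangling (wild) pointer that causes a native crash. Android 9 accounts for the largest share.

Stack Trace and Report Trend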

at libcore.io.Linux.android_getaddrinfo(Linux.java)
at libcore.io.BlockGuardOs.android_getaddrinfo(BlockGuardOs.java:172)
at java.net.InetAddress.parseNumericAddressNoThrow(InetAddress.java:1631)
at java.net.Inet6AddressImpl.lookupAllHostAddr(Inet6AddressImpl.java:96)
at java.net.InetAddress.getAllByName(InetAddress.java:1154)


#00 pc 000000000003b938  /system/lib64/libc.so (android_detectaddrtype+1164)
#01 pc 000000000003b454  /system/lib64/libc.so (android_getaddrinfofornet+72)
#02 pc 000000000002b5f4  /system/lib64/libjavacore.so (_ZL25Linux_android_getaddrinfoP7_JNIEnvP8_jobjectP8_jstringS2_i+336)

Problem Analysis

The crash entry point, InetAddress.getAllByName, returns all IP addresses associated with a given host name, resolving it through the system-configured name service. Following the call chain in the source, parseNumericAddressNoThrow wraps its call to Libcore.os.android_getaddrinfo in a try/catch that serves as a fault-tolerance path. Further down, in the C++ source, a GaiException is thrown whenever android_getaddrinfofornet returns a non-zero value.

https://cs.android.com/android/platform/superproject/+/android-9.0.0_r49:libcore/ojluni/src/main/java/java/net/InetAddress.java
 
 static InetAddress parseNumericAddressNoThrow(String address) {
        // Accept IPv6 addresses (only) in square brackets for compatibility.
        if (address.startsWith("[") && address.endsWith("]") && address.indexOf(':') != -1) {
            address = address.substring(1, address.length() - 1);
        }
        StructAddrinfo hints = new StructAddrinfo();
        hints.ai_flags = AI_NUMERICHOST;
        InetAddress[] addresses = null;
        try {
            addresses = Libcore.os.android_getaddrinfo(address, hints, NETID_UNSET);
        } catch (GaiException ignored) {
        }
        return (addresses != null) ? addresses[0] : null;
    }
https://cs.android.com/android/platform/superproject/+/master:libcore/luni/src/main/native/libcore_io_Linux.cpp?q=Linux_android_getaddrinfo&ss=android%2Fplatform%2Fsuperproject


static jobjectArray Linux_android_getaddrinfo(JNIEnv* env, jobject, jstring javaNode,
        jobject javaHints, jint netId) {
    ......
    int rc = android_getaddrinfofornet(node.c_str(), NULL, &hints, netId, 0, &addressList);
    std::unique_ptr<addrinfo, addrinfo_deleter> addressListDeleter(addressList);
    if (rc != 0) {
        throwGaiException(env, "android_getaddrinfo", rc);
        return NULL;
    }
    ......
    return result;
}

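Solution

The idea is to proxy android_getaddrinfofornet, catch the segmentation-fault signal raised while the original function runs, swallow it, and return -1. The non-zero return value is then converted into a Java exception, which lands in the fault-tolerance path of parseNumericAddressNoThrow. We checked with the networking team beforehand and started only after confirming that this flow has no impact on the business.

We first proxied android_getaddrinfofornet with inline-hook, then used ByteDance's ready-made native try/catch utility to swallow the SIGSEGV and return -1. Internally, the utility calls sigsetjmp at the start of the try block to set an anchor and snapshot the current registers, then installs a signal handler and associates it with the current thread. The catch block removes the association between the thread and the signal and runs the fallback code. When the signal is caught, the handler calls siglongjmp to jump to the catch block. If you are interested, try the trimmed-down demo below: save it as mem_err.c and run gcc ./mem_err.c; ./a.out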

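#include <stdio.h>
#include <signal.h>
#include <setjmp.h>


struct sigaction old;
static sigjmp_buf buf;


void SIGSEGV_handler(int sig, siginfo_t *info, void *ucontext) {
    // printf is not async-signal-safe; tolerable only in a demo
    printf("signal handler sig: %d, code: %d\n", sig, info->si_code);
    siglongjmp(buf, -1);
}


int main() {
    // savemask = 1: siglongjmp restores the signal mask, so SIGSEGV is not left blocked
    if (!sigsetjmp(buf, 1)) {
        struct sigaction sa = {0};

        sa.sa_sigaction = SIGSEGV_handler;
        sa.sa_flags = SA_SIGINFO; // required for the three-argument handler
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old);

        printf("try exec\n");
        // trigger a segmentation fault
        volatile int *ptr = NULL;
        *ptr = 1;
        printf("try-block end\n"); // never reached
    } else {
        printf("catch exec\n");
        sigaction(SIGSEGV, &old, NULL);
    }
    printf("main func end\n");
    return 0;
}


// Prints the following (the si_code value depends on the platform):
// try exec
// signal handler sig: 11, code: 2
// catch exec
// main func end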

inline-hook library: https://github.com/bytedance/android-inline-hook

ByteDance native try/catch utility: https://github.com/bytedance/android-inline-hook/blob/main/shadowhook/src/main/cpp/common/bytesig.c
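
Putting the two pieces together, the proxy can be pictured roughly as follows. This is a simplified sketch rather than the code that shipped: `lookup_fn`, `guarded_lookup`, `faulting_lookup` and `healthy_lookup` are hypothetical names, the real android_getaddrinfofornet takes more parameters, the hook registration is omitted, and the fault is simulated with raise(SIGSEGV).

```cpp
#include <setjmp.h>
#include <signal.h>

// Hypothetical, simplified signature; the real function takes more parameters.
using lookup_fn = int (*)(const char *node);

static thread_local sigjmp_buf g_anchor;
static thread_local volatile sig_atomic_t g_guarded = 0;

static void on_segv(int sig) {
    if (g_guarded) {
        siglongjmp(g_anchor, 1);  // jump back to the anchor in guarded_lookup
    }
    // Not inside the guarded region: restore the default action and
    // re-raise so that the crash is still reported.
    signal(sig, SIG_DFL);
    raise(sig);
}

// Calls `original`; if it faults, returns -1 (a non-zero result, which the
// caller treats as a lookup failure).
int guarded_lookup(lookup_fn original, const char *node) {
    struct sigaction sa = {}, old = {};
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    int rc;
    if (sigsetjmp(g_anchor, 1) == 0) {  // 1: save the signal mask for siglongjmp
        g_guarded = 1;
        rc = original(node);
    } else {
        rc = -1;  // reached through siglongjmp from the handler
    }
    g_guarded = 0;
    sigaction(SIGSEGV, &old, nullptr);  // put the previous handler back
    return rc;
}

// Stand-ins for the original function, for demonstration only.
int faulting_lookup(const char *) { raise(SIGSEGV); return 0; }
int healthy_lookup(const char *) { return 0; }
```

Calling guarded_lookup(faulting_lookup, "example.com") yields -1 instead of killing the process, while guarded_lookup(healthy_lookup, "example.com") passes the original result through. Restoring the previous handler on every exit keeps the window in which the signal is intercepted as small as possible.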

III. MediaCodec Abnormal-State Crash

Background

In the Android 11 system libraries, audio/video playback occasionally triggers a SIGABRT crash caused by an abnormal state. The audio/video team reported it as an Android 11 system bug, and we then helped them fix it with a hook.

Stack Trace and Report Trend

#00 pc 0000000000089b1c  /apex/com.android.runtime/lib64/bionic/libc.so (abort+164)
#01 pc 000000000055ed78  /apex/com.android.art/lib64/libart.so (_ZN3art7Runtime5AbortEPKc+2308)
#02 pc 0000000000013978  /system/lib64/libbase.so (_ZZN7android4base10SetAborterEONSt3__18functionIFvPKcEEEEN3$_38__invokeES4_+76)
#03 pc 0000000000006e30  /system/lib64/liblog.so (__android_log_assert+336)
#04 pc 0000000000122074  /system/lib64/libstagefright.so (_ZN7android10MediaCodec37postPendingRepliesAndDeferredMessagesENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEERKNS_2spINS_8AMessageEEE+720)
#05 pc 00000000001215cc  /system/lib64/libstagefright.so (_ZN7android10MediaCodec37postPendingRepliesAndDeferredMessagesENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEi+244)
#06 pc 000000000011c308  /system/lib64/libstagefright.so (_ZN7android10MediaCodec17onMessageReceivedERKNS_2spINS_8AMessageEEE+8752)
#07 pc 0000000000017814  /system/lib64/libstagefright_foundation.so (_ZN7android8AHandler14deliverMessageERKNS_2spINS_8AMessageEEE+84)
#08 pc 000000000001d9cc  /system/lib64/libstagefright_foundation.so (_ZN7android8AMessage7deliverEv+188)
#09 pc 0000000000018b48  /system/lib64/libstagefright_foundation.so (_ZN7android7ALooper4loopEv+572)
#10 pc 0000000000015598  /system/lib64/libutils.so (_ZN7android6Thread11_threadLoopEPv+460)
#11 pc 00000000000a1d6c  /system/lib64/libandroid_runtime.so (_ZN7android14AndroidRuntime15javaThreadShellEPv+144)
#12 pc 0000000000014d94  /system/lib64/libutils.so (_ZN13thread_data_t10trampolineEPKS_+412)
#13 pc 00000000000eba94  /apex/com.android.runtime/lib64/bionic/libc.so (_ZL15__pthread_startPv+64)
#14 pc 000000000008bd80  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)

Problem Analysis

Reading the Android 11 source against the stack trace, together with the message captured with the SIGABRT (postPendingRepliesAndDeferredMessages: mReplyID == null, from kWhatRelease:STOPPING following kWhatError:STOPPING), we located the crash in onMessageReceived while it handles a kWhatRelease message. onMessageReceived receives two messages back to back: first kWhatError:STOPPING, then kWhatRelease:STOPPING. By the time the second one arrives, mReplyID has already been cleared, so the code reaches the null check that aborts.

https://cs.android.com/android/_/android/platform/frameworks/av/+/refs/tags/android-11.0.0_r48:media/libstagefright/MediaCodec.cpp;l=2280;drc=789055bbcb4560b42faf19103b1cda5534e8f9cb;bpv=0;bpt=0

In the Android 12 source, by comparison, the handling of kWhatRelease in the STOPPING state gained a check that mReplyID is non-null before the abort, which avoids the problem.

https://cs.android.com/android/_/android/platform/frameworks/av/+/ca0c3286a4790a4de2d90cb275ae89a9601b805b:media/libstagefright/MediaCodec.cpp;dlc=7327aab894f6c456ea16c95b64134841da8d5737
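
To make the message ordering concrete, here is a toy model of the two behaviours. It is only an illustration of the description above and is not the AOSP MediaCodec code: `ToyCodec`, `Outcome`, `guardNullReply` and `replay` are hypothetical names, and the state machine is reduced to the single STOPPING state.

```cpp
#include <optional>

enum class Msg { kWhatError, kWhatRelease };
enum class Outcome { Replied, Ignored, Abort };

// Toy model only: one state (STOPPING) and one pending reply.
struct ToyCodec {
    std::optional<int> replyID = 1;  // the reply that the pending stop is waiting for
    bool guardNullReply;             // true models the Android 12 behaviour

    explicit ToyCodec(bool guard) : guardNullReply(guard) {}

    Outcome onMessageReceived(Msg msg) {
        if (msg == Msg::kWhatRelease && guardNullReply && !replyID) {
            return Outcome::Ignored;  // nothing left to reply to: skip the assertion
        }
        if (!replyID) {
            return Outcome::Abort;    // models the failed assertion (SIGABRT)
        }
        replyID.reset();              // reply posted, ID cleared
        return Outcome::Replied;
    }
};

// Feeds kWhatError then kWhatRelease and returns the outcome of the second message.
Outcome replay(bool guardNullReply) {
    ToyCodec codec(guardNullReply);
    codec.onMessageReceived(Msg::kWhatError);
    return codec.onMessageReceived(Msg::kWhatRelease);
}
```

In this model replay(false) ends in Outcome::Abort, matching the Android 11 crash, while replay(true) ends in Outcome::Ignored, matching the guarded behaviour described for Android 12.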

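Solution

Android 12's fix implies that swallowing the failure is the expected behaviour when the three conditions above coincide (a kWhatRelease message, the STOPPING state, and an empty mReplyID). The next step was to hook Android 11 so that its logic lines up with Android 12.

  • [First attempt] The first idea was to proxy the relevant function and return early when this scenario is detected. The audio/video team tried it and found it unworkable, for the following reasons:

a. void MediaCodec::postPendingRepliesAndDeferredMessages(std::string origin, status_t err): match origin against the signature string (postPendingRepliesAndDeferredMessages: mReplyID == null, from kWhatRelease:STOPPING following kWhatError:STOPPING). Not viable, because the symbol cannot be found on many devices.

b. void MediaCodec::onMessageReceived(const sp<AMessage> &msg): given the base address of the MediaCodec instance, the mReplyID and mState fields would have to be read through hard-coded offsets, and there is no signature against which to validate them. The risk was too high: if a vendor adds or removes fields, the offsets shift and compatibility breaks on those models.

  • [Pitfalls] Next we tried a protection scheme similar to the DNS fix. We proxy onMessageReceived with inline-hook and set an anchor with setjmp before calling the original function, then proxy __android_log_assert with a PLT hook. When that proxy sees the signature string in the error message, it calls longjmp to return to the anchor in the onMessageReceived proxy, which then simply returns. A trimmed-down demo follows:

PLT-hook library: https://github.com/iqiyi/xHook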

#include <iostream>
#include <cstring>
#include <setjmp.h>
#include <csignal>


static thread_local jmp_buf _buf;
void *origin_onMessageReceived = nullptr;
void *origin__android_log_assert = nullptr;


void _android_log_assert_proxy(const char* cond, const char *tag, const char* fmt, ...) {
    // Simulates __android_log_assert in liblog.so
    std::cout << "__android_log_assert start" << std::endl;
    if (!strncmp(fmt, "postPendingRepliesAndDeferredMessages: mReplyID == null", 55)) {
        longjmp(_buf, -1);
    }
    // Simulates calling origin__android_log_assert, which crashes
    raise(SIGABRT);
}


void onMessageReceived_proxy(void *thiz, void *msg) {
    std::cout << "onMessageReceived_proxy start" << std::endl;
    if (!setjmp(_buf)) {
        // Simulates calling the original onMessageReceived (origin_onMessageReceived), which enters the crash path
        std::cout << "onMessageReceived_proxy 1" << std::endl;
        _android_log_assert_proxy(nullptr, nullptr, "postPendingRepliesAndDeferredMessages: mReplyID == null, from kWhatRelease:STOPPING following kWhatError:STOPPING");
        std::cout << "onMessageReceived_proxy 2" << std::endl; // never reached
    } else {
        // With the protection in place, execution resumes here and returns
        std::cout << "onMessageReceived_proxy 3" << std::endl;
    }
    std::cout << "onMessageReceived_proxy end" << std::endl;
}


int main() {
    std::cout << "main func start" << std::endl;
    /**
     inline-hook: shadowhook_hook_sym_name("libstagefright.so","_ZN7android10MediaCodec17onMessageReceivedERKNS_2spINS_8AMessageEEE",(void *) onMessageReceived_proxy, (void **) &origin_onMessageReceived);
     plt-hook: xh_core_register("libstagefright.so", "__android_log_assert", (void *) (_android_log_assert_proxy), (void **) (&origin__android_log_assert));
     */
    // Simulates a call to _ZN7android10MediaCodec17onMessageReceivedERKNS_2spINS_8AMessageEEE in libstagefright.so
    onMessageReceived_proxy(nullptr, nullptr);
    std::cout << "main func end" << std::endl;
    return 0;
}


/**
Log output
main func start
onMessageReceived_proxy start
onMessageReceived_proxy 1
__android_log_assert start
onMessageReceived_proxy 3
onMessageReceived_proxy end
main func end
*/

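In offline testing the protection behaved as expected, but during the gray release we hit a pitfall: stack-overflow protection (the stack canary check) turned the original crash into a different one. The stack trace was: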

#00 pc 000000000004e40c  /apex/com.android.runtime/lib64/bionic/libc.so (abort+164)
#01 pc 0000000000062730  /apex/com.android.runtime/lib64/bionic/libc.so (__stack_chk_fail+20)
#02 pc 000000000000a768 /data/app/~~JaQm4SU8wxP7T2GaSWxYkQ==/com.shizhuang.duapp-N5RFIB8WurdccMgAVsBang==/lib/arm64/libduhook.so (_ZN25CrashMediaCodecProtection5proxyEPvS0_)
#03 pc 0000000001091c0c  [anon:scudo:primary]

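For background on the stack-overflow protection mechanism, see https://bbs.kanxue.com/thread-221762-1.htm

(CS:APP, 3rd edition, section 3.10.3 "Out-of-Bounds Memory References and Buffer Overflow" covers it in more detail.)

longjmp merely restores the register values and returns again from the anchor, so the only component that could have touched the stack frame along the way was inline-hook. We suspected an incompatibility between it and the setjmp/longjmp mechanism, but inline-hook is implemented largely in assembly and is hard to debug, so the problem puzzled us for quite a while. Material online suggests bypassing the check either by proxying the failure function (__stack_chk_fail) or by building the .so with a flag that stops the compiler from emitting the protection code. Both have too broad an impact, so we adopted neither. Following the same suspicion, we thought of using C++ try/catch for the cross-function jump. The idea is the same as above, with setjmp replaced by try/catch and longjmp replaced by throwing an exception. A trimmed-down demo follows:

Introduction to the C++ exception mechanism: https://baiy.cn/doc/cpp/inside_exception.htm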

#include <iostream>
#include <cstring>
#include <exception>
#include <string>
#include <csignal>


void *origin_onMessageReceived = nullptr;
void *origin__android_log_assert = nullptr;


class MyCustomException : public std::exception {
public:
    explicit MyCustomException(const std::string& message)
            : msg_(message) {}

    virtual const char* what() const noexcept override {
        return msg_.c_str();
    }

private:
    std::string msg_;
};


void _android_log_assert_proxy(const char* cond, const char *tag, const char* fmt, ...) {
    // Simulates __android_log_assert in liblog.so
    std::cout << "__android_log_assert start" << std::endl;
    if (!strncmp(fmt, "postPendingRepliesAndDeferredMessages: mReplyID == null", 55)) {
        throw MyCustomException("postPendingRepliesAndDeferredMessages: mReplyID == null");
    }
    // Simulates calling origin__android_log_assert, which crashes
    raise(SIGABRT);
}


void onMessageReceived_proxy(void *thiz, void *msg) {
    std::cout << "onMessageReceived_proxy start" << std::endl;
    try {
        // Simulates calling the original onMessageReceived (origin_onMessageReceived), which enters the crash path
        std::cout << "onMessageReceived_proxy 1" << std::endl;
        _android_log_assert_proxy(nullptr, nullptr, "postPendingRepliesAndDeferredMessages: mReplyID == null, from kWhatRelease:STOPPING following kWhatError:STOPPING");
        std::cout << "onMessageReceived_proxy 2" << std::endl; // never reached
    } catch (const MyCustomException& e) {
        // With the protection in place, execution resumes here and returns
        std::cout << "onMessageReceived_proxy 3" << std::endl;
    }
    std::cout << "onMessageReceived_proxy end" << std::endl;
}


int main() {
    std::cout << "main func start" << std::endl;
    /**
     inline-hook: shadowhook_hook_sym_name("libstagefright.so","_ZN7android10MediaCodec17onMessageReceivedERKNS_2spINS_8AMessageEEE",(void *) onMessageReceived_proxy, (void **) &origin_onMessageReceived);
     plt-hook: xh_core_register("libstagefright.so", "__android_log_assert", (void *) (_android_log_assert_proxy), (void **) (&origin__android_log_assert));
     */
    // Simulates a call to _ZN7android10MediaCodec17onMessageReceivedERKNS_2spINS_8AMessageEEE in libstagefright.so
    onMessageReceived_proxy(nullptr, nullptr);
    std::cout << "main func end" << std::endl;
    return 0;
}


/**
Log output
main func start
onMessageReceived_proxy start
onMessageReceived_proxy 1
__android_log_assert start
onMessageReceived_proxy 3
onMessageReceived_proxy end
main func end
*/

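After the gray release we found devices that reached the throw in the __android_log_assert proxy but never arrived at the catch block as expected. Instead, the crash moved once more, this time to "terminating with uncaught exception of type". Rather frustrating.

  • [Breakthrough] When a throw executes, the C++ exception mechanism walks up the call stack, checking each function until it finds a catch block of a matching type. The error above means that no matching catch block was found. We noticed that the stack of the new crash contained no frame for the onMessageReceived proxy. Given how inline-hook works (it rewrites the first instructions of the original function to jump to the proxy), suspicion fell on it again. Going through the code once more, we found that a macro was missing at the top of the proxy function: in inline-hook, SHADOWHOOK_STACK_SCOPE is what manages the stack frame, so neither the missing catch block nor the earlier longjmp problem was surprising any more. With the macro added, everything fell into place. After ramping up again, the protection ran as expected and video playback stayed normal once it took effect. Together with the audio/video team, and after several releases, we finally resolved this system bug. Only old app versions still report it sporadically.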
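
We do not reproduce inline-hook's internals here. As a loose illustration of why a scope-style macro at the top of a proxy matters, the sketch below uses a plain C++ scope guard: bookkeeping done on entry is undone on every exit path, including when an exception unwinds through the function. All names (`g_hook_depth`, `HookFrameScope`, `proxy_scoped`, `proxy_manual_pop`, `depth_after`) are hypothetical, and the counter is only a stand-in for whatever per-thread state the hook library keeps.

```cpp
#include <stdexcept>

// Hypothetical per-thread bookkeeping, standing in for the hook library's own state.
static thread_local int g_hook_depth = 0;

// Scope guard: the destructor runs on normal return and during exception unwinding.
class HookFrameScope {
public:
    HookFrameScope() { ++g_hook_depth; }
    ~HookFrameScope() { --g_hook_depth; }
    HookFrameScope(const HookFrameScope &) = delete;
    HookFrameScope &operator=(const HookFrameScope &) = delete;
};

// Simulates the original function reaching the assert proxy, which throws.
static void original_that_throws() {
    throw std::runtime_error("mReplyID == null");
}

// Bookkeeping released only on the normal path: an exception skips the last line.
void proxy_manual_pop() {
    ++g_hook_depth;
    original_that_throws();
    --g_hook_depth;  // never reached when the call above throws
}

// Bookkeeping tied to a scope object: released on every exit path.
void proxy_scoped() {
    HookFrameScope scope;
    original_that_throws();
}

// Runs a proxy, swallows the exception, and reports the leftover depth.
int depth_after(void (*proxy)()) {
    g_hook_depth = 0;
    try {
        proxy();
    } catch (const std::runtime_error &) {
        // swallowed, as the protection logic would do
    }
    return g_hook_depth;
}
```

depth_after(proxy_scoped) comes back as 0, whereas depth_after(proxy_manual_pop) comes back as 1: the manual version leaves stale bookkeeping behind whenever control leaves the proxy by a non-local jump.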

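IV. bio Crash in a Multithreaded Environment

Background

On Android 11, closing a Socket in a multithreaded scenario occasionally hits a dangling pointer and causes a native crash. When several threads close a connection at the same time, one thread has already destroyed the bio context while another is still running close and, along the way, asks how many bytes the bio has yet to write out. That access goes through a dangling pointer and segfaults. Since it was first reported in 2021, this issue has stayed near the top of Dewu's crash list.

Stack Trace and Report Trend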

at com.android.org.conscrypt.NativeCrypto.SSL_pending_written_bytes_in_BIO(Native method)
at com.android.org.conscrypt.NativeSsl$BioWrapper.getPendingWrittenBytes(NativeSsl.java:660)
at com.android.org.conscrypt.ConscryptEngine.pendingOutboundEncryptedBytes(ConscryptEngine.java:566)
at com.android.org.conscrypt.ConscryptEngineSocket.drainOutgoingQueue(ConscryptEngineSocket.java:584)
at com.android.org.conscrypt.ConscryptEngineSocket.close(ConscryptEngineSocket.java:480)
at okhttp3.internal.Util.closeQuietly_aroundBody0(Util.java:1)
at okhttp3.internal.Util$AjcClosure1.run(Util.java:1)
at org.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:3)
at com.shizhuang.duapp.common.aspect.ThirdSdkAspect.t(ThirdSdkAspect.java:1)
at okhttp3.internal.Util.closeQuietly(Util.java:3)
at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:42)
at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.java:1)
at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.java:6)
at okhttp3.internal.connection.Transmitter.newExchange(Transmitter.java:5)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:5)


#00 pc 0000000000064060  /system/lib64/libcrypto.so (bio_ctrl+144)
#01 pc 00000000000615d8  /system/lib64/libcrypto.so (BIO_ctrl_pending+40)
#02 pc 00000000000387dc  /apex/com.android.conscrypt/lib64/libjavacrypto.so (_ZL45NativeCrypto_SSL_pending_written_bytes_in_BIOP7_JNIEnvP7_jclassl+20)

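Problem Analysis

The device distribution shows that every affected device runs Android 11 and that all domestic vendors are represented, so we suspected a bug introduced in Android 11. Comparing the Android 11 and Android 12 source, we found that in Android 12 four methods of com.android.org.conscrypt.NativeSsl$BioWrapper, the class in the crash stack, gained a read-write lock, which pointed to a multithreading problem. Searching the related issues in the Android source and the description of the change that introduced the difference confirmed this conclusion. Further reading showed that all the locked methods of NativeSsl dispatch to native methods in NativeCrypto.java and ultimately call JNI functions in native_crypto.cc. If we could hook the relevant native functions and implement the same read-write lock logic as Android 12 at the native layer, the problem would be solved.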

https://cs.android.com/android/platform/superproject/+/android-12.0.0_r1:external/conscrypt/repackaged/common/src/main/java/com/android/org/conscrypt/NativeSsl.java

https://cs.android.com/android/platform/superproject/+/android-11.0.0_r48:external/conscrypt/repackaged/common/src/main/java/com/android/org/conscrypt/NativeCrypto.java

https://cs.android.com/android/platform/superproject/+/android-11.0.0_r48:external/conscrypt/common/src/jni/main/cpp/conscrypt/native_crypto.cc

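Solution

We use a JNI hook to proxy the functions that gained locks in Android 12. When execution reaches a proxy function, it first dispatches to the Java layer, which obtains the ReadWriteLock instance through reflection and acquires the lock, and then calls the original JNI function through a trampoline. This replicates the locking that Android 12 added. After two releases of gray rollout, the hook has been running stably in production, with no network outages or other crashes caused by it. On versions where the switch is fully on, the number of crashing devices has dropped to 0.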

How JNI hooking works, and the detailed fix: https://blog.dewu-inc.com/article/MTMwNDU?fromType=personal_blog
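
The locking discipline itself can be sketched in a few lines. Note the substitution: the fix described above takes a Java ReadWriteLock obtained through reflection, whereas this sketch models the same idea in C++17 with std::shared_mutex. `ToyBio` and `GuardedBio` are hypothetical names and are not conscrypt types.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>

// Toy stand-in for a native BIO handle; hypothetical, not conscrypt's type.
struct ToyBio {
    std::size_t pending = 0;
};

class GuardedBio {
public:
    explicit GuardedBio(std::size_t pending) : bio_(new ToyBio{pending}) {}
    ~GuardedBio() { close(); }
    GuardedBio(const GuardedBio &) = delete;
    GuardedBio &operator=(const GuardedBio &) = delete;

    // Readers share the lock and re-check the handle before touching it.
    std::size_t pendingWrittenBytes() const {
        std::shared_lock<std::shared_mutex> lock(mu_);
        return bio_ ? bio_->pending : 0;
    }

    // close() takes the exclusive lock, so it cannot free the handle
    // while a reader is still using it.
    void close() {
        std::unique_lock<std::shared_mutex> lock(mu_);
        delete bio_;
        bio_ = nullptr;  // a second close() becomes a no-op
    }

private:
    mutable std::shared_mutex mu_;
    ToyBio *bio_;
};
```

With this arrangement a thread that queries the pending byte count after another thread has closed the handle gets 0 back instead of dereferencing freed memory, and two concurrent close() calls are serialized so that the handle is freed only once.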

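V. Null-Pointer Crash in Focus Handling on Xiaomi Android 15

Background

As the Android 15 public beta rolled out, null-pointer crashes during focus handling gradually increased and rose to the top of the crash list in January.

Stack Trace and Report Trend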

java.lang.NullPointerException: Attempt to invoke virtual method 'android.view.ViewGroup$LayoutParams android.view.View.getLayoutParams()' on a null object reference
at android.view.ViewRootImpl.handleWindowFocusChanged(ViewRootImpl.java:5307)
at android.view.ViewRootImpl.-$$Nest$mhandleWindowFocusChanged(Unknown Source:0)
at android.view.ViewRootImpl$ViewRootHandler.handleMessageImpl(ViewRootImpl.java:7715)
at android.view.ViewRootImpl$ViewRootHandler.handleMessage(ViewRootImpl.java:7611)
at android.os.Handler.dispatchMessage(Handler.java:107)
at android.os.Looper.loopOnce(Looper.java:249)
at android.os.Looper.loop(Looper.java:337)
at android.app.ActivityThread.main(ActivityThread.java:9568)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:593)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:935)

Problem Analysis

Analysis of the AOSP source shows that the crash is triggered when the mView field is null.

https://cs.android.com/android/platform/superproject/main/+/main:frameworks/base/core/java/android/view/ViewRootImpl.java;drc=98e96368cc73432efbacd6fbcf61fe789dcec0ee;l=7243?q=ViewRootImpl

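In the source, mView can be null in two situations:

  • A window-focus-change event fires before setView has been called (only setView assigns a non-null value to mView).
  • setView is called normally, making mView non-null, and some other code later sets it back to null.

The crash point is reached only after a preceding check that mAdded is true. Searching the source, we found that mAdded=true together with mView=null holds only when setView has been called normally and dispatchDetachedFromWindow is called afterwards. The collected logcat logs confirm this. At this point the root cause could essentially be pinned on a timing problem between window teardown and focus-event handling.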

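Solution

Early on, we tried hooking handleWindowFocusChanged to add a guard that skips the rest of the logic when mView is found to be null. For local verification, we repeatedly opened and closed Dialogs on the product-detail page on Android 15 devices to trigger focus gain and loss at high frequency, but could not reproduce the production crash. Given the invasiveness of the hook and the inability to test it locally, we decided not to ship this approach.

The crash logs showed that 100% of the affected devices were Xiaomi/Redmi models, although the brand makes up only 36% of our Android 15 DAU, so we suspected a bug in one of MIUI's customizations on Android 15. After several weeks of communication and joint investigation with Xiaomi's engineering team, Xiaomi fixed the issue in version v2.0.28. Users need to upgrade their ROM to get the fix. MIUI devices on 2.0.28 or later no longer report this crash.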

VI. Summary

Through the work described above, crashes caused by system bugs have dropped significantly. We hope these experiences are helpful to you.

Editor: 武晓燕 (Wu Xiaoyan). Source: 得物技术 (Dewu Tech)