Leveraging LLMs for Generation of Unusual Text Inputs in Mobile App Tests: Experiment Design

:::info
This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zhe Liu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(2) Chunyang Chen, Monash University, Melbourne, Australia;

(3) Junjie Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China & Corresponding author;

(4) Mengzhuo Chen, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(5) Boyu Wu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(6) Zhilin Tian, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(7) Yuekai Huang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(8) Jun Hu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(9) Qing Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China & Corresponding author.

:::

Table of Links

Abstract and Introduction

Motivational Study and Background

Approach

Experiment Design

Results and Analysis

Discussion and Threats to Validity

Related Work

Conclusion and References

4 EXPERIMENT DESIGN

4.1 Research Questions

RQ1: (Bug Detection Performance) How effective is InputBlaster in detecting bugs related to text input widgets?


For RQ1, we first present an overall view of InputBlaster's bug detection performance, and then compare it with commonly used and state-of-the-art baseline approaches.


RQ2: (Ablation Study) What is the contribution of each (sub-)module of InputBlaster to bug detection performance?


For RQ2, we conduct ablation experiments to evaluate the impact of each (sub-)module on performance.


RQ3: (Usefulness Evaluation) How does our proposed InputBlaster work in real-world situations?


For RQ3, we integrate InputBlaster with a GUI testing tool so that it automatically explores the app and detects unseen input-related bugs, and we report the detected bugs to the development teams.

4.2 Experimental Setup

For RQ1 and RQ2, we crawl the 200 most popular open-source apps from F-Droid [3], and only keep the latest ones with at least one update after September 2022 (this ensures the utilized apps do not overlap with the ones in Sec 3.3). We then collect all their issue reports on GitHub and use keywords (e.g., EditText) to filter those related to text input, obtaining 126 issue reports related to 54 apps. We then manually review each issue report and the corresponding mobile app, and filter them according to the following criteria: (1) the app does not constantly crash on the emulator; (2) it can run with all baselines; (3) UIAutomator [65] can obtain the view hierarchy file for context extraction; (4) the bug is related to text input widgets; (5) the bug can be manually reproduced for validation; (6) the app is not used in the motivational study or example dataset construction. Note that we match apps by name to ensure there is no overlap between the datasets. In the end, 31 apps with 36 buggy text inputs remain for further experiments.
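For illustration, the following is a minimal sketch of the keyword-based issue filtering step, assuming issues fetched from the GitHub API as dictionaries with "title" and "body" fields; the keyword list here is an illustrative assumption, not the authors' exact implementation.

```python
def filter_input_related(issues):
    """Keep only issue reports whose title or body mentions text-input widgets."""
    keywords = ("edittext", "text input", "input field")  # assumed keyword list
    kept = []
    for issue in issues:
        text = (issue.get("title", "") + " " + issue.get("body", "")).lower()
        if any(k in text for k in keywords):
            kept.append(issue)
    return kept

# Example usage with a single mocked issue report.
print(filter_input_related([{"title": "Crash when EditText gets emoji", "body": ""}]))
```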


We measure the bug detection rate, i.e., the ratio of successfully triggered crashes over all the experimental crashes (i.e., buggy inputs), which is a widely used metric for evaluating GUI testing [8, 27, 43]. Specifically, with the generated unusual input, we design an automated test script to enter it into the text input widget and automatically run the "submit" operation to check whether a crash occurs. If not, the script navigates back to the GUI page with the input widget if necessary and tries the next generated unusual input. As soon as a crash is triggered for a text input widget, we treat it as a successful bug detection and stop the generation for that widget. Note that our generated unusual input is not necessarily the same as the one provided in the issue report (e.g., -18 vs. -20); as long as a crash is triggered after entering the unusual input, we treat it as a successful crash detection.
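As a rough illustration of such a test script, the sketch below drives the device with plain adb commands, assuming the target text field already has focus and that a crash surfaces as a FATAL EXCEPTION for the app's package in logcat; the widget-locating logic, wait time, and crash check are simplifying assumptions rather than the paper's actual harness.

```python
import subprocess
import time

def adb(*args):
    """Run an adb command and return its stdout."""
    return subprocess.run(["adb", *args], capture_output=True, text=True).stdout

def try_unusual_input(text, package):
    """Type one generated input, press "submit", and check whether the app crashed."""
    adb("logcat", "-c")                      # clear old log entries
    adb("shell", "input", "text", text)      # type the input (spaces/special chars need escaping)
    adb("shell", "input", "keyevent", "66")  # KEYCODE_ENTER as the "submit" operation
    time.sleep(2)                            # give the app time to react (assumed delay)
    log = adb("logcat", "-d", "-s", "AndroidRuntime:E")
    return "FATAL EXCEPTION" in log and package in log
```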


For a fair comparison with other approaches, we employ two experimental settings, i.e., 30 attempts (30 unusual inputs) and 30 minutes. We record the bug detection rate under each setting (denoted as "Bug (%)" in Table 2 to Table 5), and also record the actual number of attempts (denoted as "Attempt (#)") and the actual running time (denoted as "Min (#)") when the crash occurs, to fully understand the performance.
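A minimal sketch of the evaluation loop with these two budgets is shown below; generate_input and try_input are hypothetical callables (e.g., the generation model and the test script above), and the return values correspond to the "Bug (%)", "Attempt (#)" and "Min (#)" columns described above.

```python
import time

def evaluate_widget(generate_input, try_input, max_attempts=30, max_minutes=30):
    """Try generated inputs until a crash, the attempt budget, or the time budget is hit."""
    start = time.time()
    for attempt in range(1, max_attempts + 1):
        if (time.time() - start) / 60 > max_minutes:
            break                                   # time budget exhausted
        if try_input(generate_input()):             # crash triggered
            return True, attempt, (time.time() - start) / 60
    return False, attempt, (time.time() - start) / 60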


For RQ3, we further evaluate the usefulness of InputBlaster in detecting unseen crash bugs related to text input. A total of 131 apps are retained. We run Ape [26] (a commonly used automated GUI testing tool) integrated with InputBlaster to explore the mobile apps and obtain the view hierarchy file of each GUI page.
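For reference, obtaining the view hierarchy of the current GUI page can be done with the standard UIAutomator dump command; the sketch below is a minimal, standalone illustration of that step, with error handling and the integration into Ape's exploration loop omitted.

```python
import subprocess

def dump_view_hierarchy(local_path="window_dump.xml"):
    """Dump the current GUI page's view hierarchy with UIAutomator and pull it to the host."""
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"],
                   check=True)
    subprocess.run(["adb", "pull", "/sdcard/window_dump.xml", local_path], check=True)
    with open(local_path, encoding="utf-8") as f:
        return f.read()  # XML describing all widgets on the current page
```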


We use the same configurations as in the previous experiments. Once a crash related to text input is spotted, we create an issue report describing the bug and report it to the app development team through the issue tracking system or by email.

4.3 Baselines

Since there are hardly any existing approaches for generating unusual inputs for mobile apps, we employ 18 baselines covering various aspects to provide a thorough comparison.


First, we directly utilize ChatGPT [58] as a baseline. We provide the context information of the text input widget (as described in Table 1 P1) and ask it to generate inputs that can make the app crash.
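A hypothetical sketch of how such a baseline prompt could be assembled from the widget context is shown below; the prompt wording, field names, and example values are assumptions for illustration, not the paper's exact prompt.

```python
def build_baseline_prompt(app_name, activity, hint_text, nearby_label):
    """Assemble a crash-seeking prompt from the widget's context information."""
    return (
        f"The app '{app_name}' shows a text input widget on page '{activity}'. "
        f"Its hint text is '{hint_text}' and the nearby label is '{nearby_label}'. "
        "Generate text inputs for this widget that are likely to make the app crash."
    )

# Example usage with made-up widget context.
print(build_baseline_prompt("Notepad", "EditNoteActivity", "Enter amount", "Price"))
```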


Fuzz testing and mutation testing are promising techniques for generating invalid inputs, so we apply several related baselines. Feldt et al. [24] proposed a testing framework called GoldTest, which generates diverse test inputs for mobile apps through designed regular expressions and generation strategies. In 2017, they further proposed an invalid input generation method [55] based on probability distribution (PD) parameters and regular expressions, and we name this baseline PDinvalid. Furthermore, we reuse the idea of traditional random-based fuzzing [13, 41] and develop RandomFuzz for generating inputs for text widgets. In addition, based on the 50 buggy text inputs from the GitHub dataset in Section 3.3.1, we manually design 50 corresponding mutation rules to generate invalid inputs, and name this baseline ruleMutator.
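As a simplified illustration of what a RandomFuzz-style generator might look like, the sketch below mixes random printable strings, oversized inputs, and boundary-style numeric values; the character pools, lengths, and probabilities are illustrative assumptions, not the baseline's actual configuration.

```python
import random
import string

def random_fuzz_input():
    """Return one randomly generated text input for a widget."""
    pools = [
        string.ascii_letters + string.digits,  # ordinary characters
        string.punctuation,                    # special characters
        "\u0000\n\t",                          # control characters
    ]
    if random.random() < 0.3:
        # Numeric extremes near common integer boundaries.
        return str(random.choice([-1, 0, 2**31, -2**63, 10**18]))
    length = random.choice([1, 16, 256, 4096])  # include oversized inputs
    pool = random.choice(pools)
    return "".join(random.choice(pool) for _ in range(length))

print(random_fuzz_input())
```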


Furthermore, we include string analysis methods as baselines, i.e., OSTRICH [15] and Sloth [14]. They aim at generating strings that violate constraints (e.g., string length, concatenation, etc.), which is similar to our task. OSTRICH [15] generates test strings based on heuristic rules, while Sloth [14] exploits succinct alternating finite-state automata as concise symbolic representations of string constraints.


We also include constraint-based methods, i.e., Mobolic [8] and TextExerciser [27], which can generate diversified inputs for testing apps. For example, TextExerciser uses dynamic hints to guide input generation.


We also employ two methods (RNNInput [43] and QTypist [44]) which aim at generating valid inputs to pass the GUI page. In addition, we use automated GUI testing tools, i.e., Stoat [61], Droidbot [39], Ape [26], Fastbot [12], ComboDroid [67], TimeMachine [23], Humanoid [40], and Q-testing [53], which produce inputs randomly or by following rules to keep the app running automatically.


We design a script for each baseline to ensure that it can reach the GUI page with the text input widget, and run all baselines in the same experimental environment (Android x64) to mitigate potential bias.
