Leveraging LLMs for Generation of Unusual Text Inputs in Mobile App Tests: Abstract and Introduction

:::info
This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zhe Liu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(2) Chunyang Chen, Monash University, Melbourne, Australia;

(3) Junjie Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China (Corresponding author);

(4) Mengzhuo Chen, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(5) Boyu Wu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(6) Zhilin Tian, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(7) Yuekai Huang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(8) Jun Hu, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China;

(9) Qing Wang, State Key Laboratory of Intelligent Game, Beijing, China; Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China (Corresponding author).

:::

Table of Links

Abstract and Introduction

Motivational Study and Background

Approach

Experiment Design

Results and Analysis

Discussion and Threats to Validity

Related Work

Conclusion and References

ABSTRACT

Mobile applications have become a ubiquitous part of our daily life, providing users with access to various services and utilities. Text input, as an important interaction channel between users and applications, plays an important role in core functionality such as search queries, authentication, and messaging. However, certain special text inputs (e.g., -18 for font size) can cause an app to crash, so generating diversified unusual inputs to fully test an app is highly desirable. This is nonetheless challenging due to the combination-explosion dilemma, high context sensitivity, and complex constraint relations. This paper proposes InputBlaster, which leverages an LLM to automatically generate unusual text inputs for mobile app crash detection. It formulates the unusual-input generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs under the same mutation rule. In detail, InputBlaster leverages the LLM to produce the test generators together with the mutation rules serving as the reasoning chain, and utilizes the in-context learning schema to demonstrate examples to the LLM for boosting performance. InputBlaster is evaluated on 36 text input widgets with crash bugs involving 31 popular Android apps, and results show that it achieves a 78% bug detection rate, 136% higher than the best baseline. Besides, we integrate it with an automated GUI testing tool and detect 37 unseen crashes in real-world apps from Google Play.

KEYWORDS

Android GUI testing, Large language model, In-context learning

1 INTRODUCTION

Mobile applications (apps) have become an indispensable component of our daily lives, enabling instant access to a myriad of services, information, and communication platforms. The increasing reliance on these applications necessitates a high standard of quality and performance to ensure user satisfaction and maintain a competitive edge in the fast-paced digital landscape. The ubiquity of mobile applications has led to a constant need for rigorous testing and validation to ensure their reliability and resilience against unexpected user inputs.


Text input plays a crucial role in the usability and functionality of mobile applications, serving as a primary means for users to interact with and navigate these digital environments [43, 44]. From search queries and form submissions to instant messaging and content creation, text input is integral to the core functionality of numerous mobile applications across various domains. The seamless handling of text input is essential for delivering a positive user experience, as it directly impacts the ease of use, efficiency, and overall satisfaction of the users.


Given unexpected input, a program might suffer from memory leakage, data corruption, or falling into a dead loop, causing the application to get stuck, crash, or exhibit other serious issues [14, 27, 28, 63]. Even worse, these buggy texts may differ only slightly from normal text, or may themselves be normal text in other contexts, which makes the issue easy to trigger and difficult to spot. There have been numerous news reports about crashes of iOS and Android systems caused by special text inputs [1], which have greatly affected people’s daily lives. For example, in July 2020, a specific character of an Indian language caused iOS devices to crash constantly. It affected a wide range of iOS applications, including iMessage, WhatsApp, and Facebook Messenger [2]; as long as a text input contained the character, these apps would crash.


In this sense, automatically generating unusual inputs to fully test input widgets and uncover bugs is highly desirable. Existing automated GUI testing techniques focus on generating valid text inputs for passing the GUI page and conducting the follow-up page exploration [6, 8, 27, 43, 44, 62, 63]; e.g., QTypist [44] used GPT-3 to generate semantic input text to improve test coverage. They cannot be easily adapted to this task, since unusual inputs can be more diversified and follow different rationales from valid inputs. There are also studies targeting the generation of strings that violate constraints (e.g., string length) with heuristic analysis or finite state automaton techniques [37, 42, 64]. Yet they are designed for specific string functions like concatenation and replacement, and cannot be generalized to this task.


Nevertheless, automatically generating diversified unusual inputs is very challenging. The first challenge is combination explosion. There are numerous input formats, including text, number, date, time, and currency, and innumerable settings, e.g., different character sets, languages, and text lengths, which makes it quite difficult, if not impossible, to enumerate all these variants. The second challenge is context sensitivity. The unusual inputs should closely relate to the context of the input widgets to effectively trigger a bug, e.g., a negative value for font size (as shown in Figure 1), or an extremely large number to violate a widget for a person's height. The third challenge is the constraint relations within and among input widgets. A constraint can be that a widget only accepts pure numbers (without characters), or that the sum of item values must be smaller/larger than the total (as shown in Figure 1), which requires an exact understanding of the related widgets and their constraints so as to generate targeted variations. What makes this even more difficult is that certain constraints only appear when interacting with the app (i.e., dynamic hints displayed in response to incorrect text), and static analysis cannot capture these circumstances.
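To make these challenges concrete, the sketch below (ours, not from the paper) pairs a few hypothetical widget contexts with candidate unusual inputs targeting the corresponding constraint; all keys and values are illustrative assumptions:

```python
# Illustrative mapping from hypothetical widget contexts to candidate
# unusual inputs for each challenge above. These examples are our own
# assumptions for illustration, not taken from the paper.
UNUSUAL_INPUT_CANDIDATES = {
    # Context sensitivity: a font-size widget should reject negatives.
    "font_size": ["-18", "0", "99999999"],
    # Context sensitivity: extreme values for a person's height.
    "height_cm": ["100000000", "-1", "NaN"],
    # Constraint: widget accepts pure numbers, so mix in characters.
    "numbers_only": ["12a", "１２３", "1e5"],
    # Constraint among widgets: item value chosen to exceed the total.
    "item_amount_with_total_100": ["101", "100.01"],
}

for context, inputs in UNUSUAL_INPUT_CANDIDATES.items():
    print(context, "->", inputs)
```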


Large Language Models (LLMs) [10, 17, 58, 66, 70], trained on ultra-large-scale corpora, have exhibited promising performance on a wide range of tasks. ChatGPT [58], developed by OpenAI, is one such LLM with an impressive 175 billion parameters, trained on a vast dataset. Its ability to comprehend and generate text across various domains is a testament to the potential of LLMs to interact with humans as knowledgeable experts. The success of ChatGPT is a clear indication that LLMs can understand human knowledge and provide answers to various questions.


Inspired by the fact that LLMs have made outstanding progress in tasks such as email reply and abstract extraction [10, 16, 35, 68], we propose an approach, InputBlaster [1], to automatically generate unusual text inputs with an LLM that uncover bugs [2] related to text input widgets. Instead of directly generating the unusual inputs with the LLM, which is inefficient, we formulate the unusual-input generation problem as a task of producing a set of test generators (code snippets), each of which can yield a batch of unusual text inputs under the same mutation rule (e.g., inserting special characters into a string), as demonstrated in Figure 4 ⑤.
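For illustration, here is a minimal sketch of what one such test generator might look like, assuming the mutation rule "insert special characters into a string"; the function name and character set are our own placeholders, not the paper's actual output:

```python
import random

# Hypothetical test generator for the mutation rule "insert special
# characters into a string". The character set below is an assumption.
SPECIAL_CHARS = ["\u200b", "'", '"', ";", "%", "\\", "\U0001F600"]

def generate_unusual_inputs(valid_input: str, batch_size: int = 10) -> list:
    """Yield a batch of unusual inputs by inserting a special character
    at a random position of a known-valid input string."""
    batch = []
    for _ in range(batch_size):
        pos = random.randint(0, len(valid_input))
        char = random.choice(SPECIAL_CHARS)
        batch.append(valid_input[:pos] + char + valid_input[pos:])
    return batch

print(generate_unusual_inputs("Alice", batch_size=3))
```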


To achieve this, InputBlaster leverages the LLM to produce the test generators together with the mutation rules, which serve as the reasoning chains for boosting performance. In detail, InputBlaster first leverages the LLM to generate a valid input that can pass the GUI page and serves as the target for the follow-up mutation (Module 1). Based on this valid input, it then leverages the LLM to produce mutation rules and asks the LLM to follow those mutation rules to produce test generators, each of which can yield a batch of unusual text inputs (Module 2). To further boost performance, we utilize the in-context learning schema to demonstrate useful examples from online issue reports and historical running records to the LLM (Module 3).
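A minimal sketch of this three-module loop is shown below, assuming a generic LLM client; `query_llm` is a placeholder, and the prompts are paraphrased assumptions rather than the paper's actual templates:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API client."""
    raise NotImplementedError("plug in an LLM client here")

def input_blaster_round(widget_context: str, examples: list) -> str:
    # Module 1: ask the LLM for a valid input that passes the GUI page
    # and will serve as the target of the follow-up mutation.
    valid_input = query_llm(
        f"Widget context:\n{widget_context}\n"
        "Generate a valid text input that passes validation."
    )

    # Module 3: in-context learning -- prepend useful examples mined
    # from online issue reports and historical running records.
    demos = "\n".join(examples)

    # Module 2: ask the LLM for a mutation rule plus a test generator
    # (a code snippet) that yields a batch of unusual text inputs.
    return query_llm(
        f"Examples of effective mutations:\n{demos}\n"
        f"Valid input: {valid_input}\n"
        "Propose a mutation rule, then write a test generator that "
        "follows it and yields a batch of unusual text inputs."
    )
```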


To evaluate the effectiveness of InputBlaster, we carry out experiments on 36 text input widgets with crash bugs involving 31 popular Android apps from Google Play. Compared with 18 commonly-used and state-of-the-art baselines, InputBlaster achieves a more than 136% boost in bug detection rate over the best baseline, resulting in 78% of bugs being detected. To further understand the role of each module and sub-module of the approach, we conduct ablation experiments that further demonstrate its effectiveness. We also evaluate the usefulness of InputBlaster by integrating it with an automated GUI testing tool and detecting unseen crash bugs in real-world apps from Google Play. Among 131 apps, InputBlaster detects 37 new crash bugs, with 28 of them confirmed and fixed by developers, while the remaining ones are still pending.


The contributions of this paper are as follows:


• We are the first to propose a novel LLM-based approach, InputBlaster, for the automatic generation of unusual text inputs for mobile app testing.


• We conduct the first empirical categorization of the constraint relations within and among text input widgets, which provides clues for the LLM to perform effective mutation and facilitates follow-up studies on this task.


• We carry out effectiveness and usefulness evaluations of InputBlaster, with promising performance that largely outperforms the baselines and 37 newly detected bugs.


[1] Our approach is named InputBlaster because it acts like a blaster that ignites the subsequent production of unusual inputs.


[2] Note that, like existing studies [38, 40, 53], this paper focuses on crash bugs, which usually cause more serious effects and can be automatically observed; we use the terms bug and crash interchangeably.
