Navigating Complex Search Tasks with AI Copilots: Opportunities

:::info
This paper is available on arXiv under CC 4.0 license.

Authors:

(1) Ryen W. White, Microsoft Research, Redmond, WA, USA.

:::

Table of Links

Abstract and Taking Search to Task

AI Copilots

Challenges

Opportunities

The Undiscovered Country and References

4 OPPORTUNITIES

For some time, scholars have argued that the future of information access will involve personal search assistants with advanced capabilities, including natural language input, rich sensing, user/task/world models, and reactive and proactive experiences [57]. Technology is catching up with this vision. Opportunities going forward can be grouped into four areas: (1) Model innovation; (2) Next-generation experiences; (3) Measurement; and (4) Broader implications. The opportunities are summarized in Figure 6. There are likely more opportunities than those listed here, but the long list shown in the figure is a reasonable starting point for the research community.

4.1 Model Innovation

There are many opportunities to better model search situations and to augment and adapt foundation models so that they better align with searchers’ tasks and goals and provide more accurate answers. Copilots can leverage these model enhancements to improve the support that they provide for complex search tasks.


4.1.1 Task modeling. Opportunity: Build richer task models that more fully represent tasks and task contexts. This includes how we infer tasks (e.g., from the textual content of the search process, from user-system interactions, or from other situational and contextual information such as location, time, and application usage) and how we represent those tasks internally (e.g., as a hierarchy (Figure 1) or as a more abstract representation such as semantic vectors, graph embeddings, or Markov models). We also need to be able to estimate key task characteristics, such as task complexity, which, for example, can help search systems route requests to the most appropriate modality. In addition, we need to find ways for copilots to collect more user/world knowledge, both in general and specifically related to the task at hand. A better understanding of the task context will help copilots more accurately model the tasks themselves.
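To make the representation and estimation ideas above more concrete, here is a minimal sketch (in Python) of a hierarchical task model with a toy complexity score that a copilot could consult, for example, when deciding how to handle a request. The class, weights, and score are illustrative assumptions, not a proposed design.

```python
# Hypothetical sketch: a hierarchical task representation plus a toy
# complexity estimate. All names and numbers are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """One node in a task hierarchy (task -> subtasks -> actions)."""
    description: str
    subtasks: List["TaskNode"] = field(default_factory=list)

    def depth(self) -> int:
        return 1 + max((s.depth() for s in self.subtasks), default=0)

    def size(self) -> int:
        return 1 + sum(s.size() for s in self.subtasks)


def estimate_complexity(task: TaskNode) -> float:
    """Toy score combining breadth and depth of the task tree."""
    return 0.5 * task.size() + task.depth()


trip = TaskNode("plan a trip to Paris", [
    TaskNode("book flights"),
    TaskNode("find a hotel", [TaskNode("compare neighborhoods")]),
])
print(estimate_complexity(trip))                          # 4 nodes, depth 3 -> 5.0
print(estimate_complexity(TaskNode("weather in Paris")))  # single node -> 1.5
```

In practice, the hierarchy might be inferred from interaction logs or model output rather than constructed by hand, and the complexity estimate would be learned rather than hand-weighted.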


4.1.2 Alignment. Opportunity: Develop methods to continuously align copilots to tasks/goals/values via feedback, e.g., conversational content (such as searchers expressing gratitude to the copilot in natural language) or explicit feedback on copilot answers via likes and dislikes. Without such alignment, copilot performance will remain static over time. Copilots need application-aligned feedback loops to better understand searcher goals and tasks and to use that feedback to continuously improve answer accuracy and relevance. Beyond research on fine-tuning foundation models from human feedback (e.g., likes/dislikes) [75], we can also build on learnings from research on implicit feedback in IR, including work on improving ranking algorithms via SERP clicks [25] and developing specialized interfaces to capture user feedback [65].
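As a rough illustration of what such an application-aligned feedback loop might collect, the following sketch records explicit likes/dislikes alongside a naive check for gratitude expressed in conversation. The event structure, cue list, and scoring are assumptions made for the example, not a described system.

```python
# Illustrative sketch of an application-aligned feedback loop: explicit
# likes/dislikes plus a naive keyword check for conversational gratitude.
from dataclasses import dataclass, field
from typing import List, Optional

GRATITUDE_CUES = ("thanks", "thank you", "that helped", "perfect")


@dataclass
class FeedbackEvent:
    answer_id: str
    signal: float          # +1 positive, -1 negative, fractional for weak signals
    source: str            # "explicit" or "conversational"


@dataclass
class FeedbackLog:
    events: List[FeedbackEvent] = field(default_factory=list)

    def record_explicit(self, answer_id: str, liked: bool) -> None:
        self.events.append(FeedbackEvent(answer_id, 1.0 if liked else -1.0, "explicit"))

    def record_turn(self, answer_id: str, user_text: str) -> None:
        # Treat expressions of gratitude as weak positive feedback.
        if any(cue in user_text.lower() for cue in GRATITUDE_CUES):
            self.events.append(FeedbackEvent(answer_id, 0.5, "conversational"))

    def reward(self, answer_id: str) -> Optional[float]:
        scores = [e.signal for e in self.events if e.answer_id == answer_id]
        return sum(scores) / len(scores) if scores else None


log = FeedbackLog()
log.record_explicit("a1", liked=True)
log.record_turn("a1", "Thanks, that helped a lot!")
print(log.reward("a1"))  # 0.75 -> could feed a preference or fine-tuning pipeline
```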


4.1.3 Augmentation. Opportunity: Augment copilots with relevant external knowledge and enhanced tools and capabilities. As mentioned earlier, RAG is a common form of knowledge injection for foundation models. However, the relevance models behind the retrieved results are tuned to maximize benefit for human searchers, not for copilot consumption. We need to evaluate whether this difference matters in practice and, if so, develop new ranking criteria that consider the intended consumer of the search results (human or machine). Despite their incredible capabilities, foundation models still have shortcomings that manifest in the copilots that use them. We need to understand these shortcomings through evaluation and find ways to leverage external skills/plugins to address them. Copilots must find and recommend skills per task demands [59], e.g., invoking Wolfram for computational assistance.

We can also integrate tool use directly into tool-augmented models such as Toolformer [40], which can teach themselves to use tools. Models of task context may also be incomplete, and we should invest in ways to better ground copilot responses via context, e.g., richer sensing, context filtering, and dynamic prompting.
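The sketch below illustrates one way a copilot might select a skill per task demands, routing arithmetic to a stand-in computational skill and everything else to the model's own generation. The skill registry, the keyword detector, and the eval-based calculator are hypothetical stand-ins, not an actual plugin API.

```python
# Hypothetical sketch of selecting an external skill based on task demands.
import re
from typing import Callable, Dict


def computational_skill(query: str) -> str:
    # Stand-in for calling an external computation engine (e.g., Wolfram);
    # here we just evaluate the arithmetic expression embedded in the query.
    expr = re.sub(r"[^0-9+\-*/(). ]", "", query).strip()
    return str(eval(expr)) if expr else "no expression found"


def generative_answer(query: str) -> str:
    # Stand-in for the foundation model answering on its own.
    return f"[model-generated answer for: {query}]"


SKILLS: Dict[str, Callable[[str], str]] = {
    "compute": computational_skill,
    "generate": generative_answer,
}


def select_skill(query: str) -> str:
    """Crude task-demand check: arithmetic goes to the compute skill."""
    return "compute" if re.search(r"\d\s*[+\-*/]\s*\d", query) else "generate"


print(SKILLS[select_skill("what is 128 * 46?")]("what is 128 * 46?"))      # -> 5888
print(SKILLS[select_skill("history of the abacus")]("history of the abacus"))
```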


4.1.4 Grounding. Opportunity: Use grounding to reduce hallucinations, build searcher trust, and support content creators. It is in the interests of copilots, searchers, and content creators (and providers and advertisers) to consider the source of the data used in generating answers. Provenance is critical, and copilots should provide links back to relevant sources (preferably with specific details/URLs rather than generalities/domains) to help build user trust, provide attribution for content creators, and drive engagement for content providers and advertisers. It is also important, both for building trust and for supporting learning, that copilots practice faithful reasoning [14] and provide interpretable reasoning traces (e.g., chain-of-thought explanations) alongside their answers. We should also think about how we integrate search within existing experiences (e.g., in other copilots) to ground answers in their context of use and to reach more of the places where people seek those answers.
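A minimal sketch of what provenance-carrying answers could look like is shown below, including a lightweight check that each cited claim actually appears in its supporting snippet. The data structures and the substring check are illustrative assumptions, not a prescribed format.

```python
# Minimal sketch (assumed structure): a grounded answer carrying provenance
# so each claim can be traced to a specific source URL.
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    claim: str        # the answer span being supported
    url: str          # specific page, not just a domain
    snippet: str      # retrieved text the claim is grounded in


@dataclass
class GroundedAnswer:
    text: str
    citations: List[Citation]

    def unsupported_claims(self) -> List[str]:
        """Lightweight verification hook: flag cited claims whose snippet
        does not actually contain them (a crude faithfulness check)."""
        return [c.claim for c in self.citations
                if c.claim.lower() not in c.snippet.lower()]


answer = GroundedAnswer(
    text="The Eiffel Tower is 330 metres tall.",
    citations=[Citation(
        claim="330 metres",
        url="https://en.wikipedia.org/wiki/Eiffel_Tower",
        snippet="The tower is 330 metres (1,083 ft) tall...",
    )],
)
print(answer.unsupported_claims())  # -> [] when the snippet supports the claim
```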


4.1.5 Personalization. Opportunity: Develop personal copilots that understand searchers and their tasks, using personal data privately and securely. Searchers bring their personal tasks to search systems, and copilots will be no different. Here are some example personal prompts that illustrate the types of personal tasks that searchers might expect a copilot to handle: (1) Write an e-mail to my client in my personal style with a description of the quote in the attached doc. (2) Tell me what is important for me to know about the company town hall that I missed. (3) Where should I go for lunch today? These tasks span creation, summarization, and recommendation, and quickly illustrate the wide range of expectations that people may have of their personal copilots. As part of developing such personalized AI support, we need to: (1) Study foundation model capabilities, including their ability to identify task-relevant information in personal data and activity histories and to model user knowledge of the current task and topic, and (2) Develop core technologies, including infinite memory, using relevant long-term activity (in IR, there has been considerable research on relevant areas such as re-finding [52] and personalization [47]); context compression, to fit more context into finite token limits (e.g., using turn-by-turn summarization rather than raw conversational content); and privacy, including mitigations such as differential privacy and federated learning, and research on machine unlearning [7] to intentionally forget irrelevant information over time, including sensitive information that the searcher may have explicitly asked to be removed from the foundation model.
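As a small illustration of the context-compression idea, the sketch below keeps a short per-turn summary (here a placeholder for a model-generated summary) and drops the oldest summaries when a token budget is exceeded. The class, budget, and rough token count are assumptions made for the example.

```python
# Illustrative sketch of context compression: keep one short summary per turn
# instead of the raw conversation, trimmed to fit a token budget.
from collections import deque
from typing import Deque


def summarize_turn(user_text: str, copilot_text: str) -> str:
    # Placeholder: in practice a foundation model would produce this summary.
    return f"user asked about '{user_text[:40]}'; copilot answered briefly"


class CompressedMemory:
    def __init__(self, token_budget: int = 200):
        self.token_budget = token_budget
        self.summaries: Deque[str] = deque()

    @staticmethod
    def _tokens(text: str) -> int:
        return len(text.split())          # rough whitespace token count

    def add_turn(self, user_text: str, copilot_text: str) -> None:
        self.summaries.append(summarize_turn(user_text, copilot_text))
        while sum(self._tokens(s) for s in self.summaries) > self.token_budget:
            self.summaries.popleft()      # forget the oldest turns first

    def context(self) -> str:
        return "\n".join(self.summaries)  # prepended to the next prompt


memory = CompressedMemory(token_budget=30)
memory.add_turn("best hotels near the Louvre", "...")
memory.add_turn("which of those have family rooms", "...")
print(memory.context())
```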


4.1.6 Adaptation. Two main forms of adaptation that we consider here are model specialization and so-called adaptive computation.


• Model specialization. Opportunity: Develop specialized foundation models for search tasks that are controllable and efficient. Large foundation models are generalists with a wide capability surface. Specializing these models for specific tasks and applications discards knowledge that is not needed, making the models more accurate and efficient for the task at hand. Recent advances in this area have yielded strong performance, e.g., the Orca-13B model [34] uses explanation-based tuning (where the model explains the steps used to reach its output and those explanations are used to train a small language model) to outperform state-of-the-art models of a similar size such as Vicuna-13B [13]. Future work could explore guiding specialization via search data, including anonymized large-scale search logs, as well as algorithmic advances in preference modeling and continual learning.


• Adaptive computation. Opportunity: Develop methods to adaptively apply different models per task and application demands. Adaptive computation involves using multiple foundation models (e.g., GPT-4 and a specialized model), each with different inference-time constraints, primarily around speed, capabilities, and cost, and learning which model to apply for a given task. The specialized model can back off to one or more larger models as needed per task demands (a minimal sketch appears at the end of this subsection). The input can be the task plus the constraints of the application scenario under which the model must operate. Human feedback on the output can also be used to improve model performance over time [72].


These adaptation methods will yield more effective and more efficient AI capabilities that copilots can use to help searchers across a range of settings, including in offline settings (e.g., on-device only).
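The following sketch illustrates the adaptive computation pattern referenced above: a cheap specialized model answers first and backs off to a larger model when its confidence is low. The model stand-ins, confidence scores, and threshold are hypothetical.

```python
# Sketch of adaptive computation under stated assumptions: a small specialized
# model handles requests and escalates to a larger model only when needed.
from typing import Callable, Tuple

AnswerWithConfidence = Tuple[str, float]


def specialized_model(query: str) -> AnswerWithConfidence:
    # Stand-in for a small, task-specialized model (fast, cheap).
    if "summarize" in query.lower():
        return ("[concise summary]", 0.9)
    return ("[uncertain draft answer]", 0.3)


def large_model(query: str) -> AnswerWithConfidence:
    # Stand-in for a large general-purpose model (slow, expensive).
    return ("[high-quality answer]", 0.95)


def answer(query: str,
           confidence_floor: float = 0.7,
           backoff: Callable[[str], AnswerWithConfidence] = large_model) -> str:
    text, confidence = specialized_model(query)
    if confidence < confidence_floor:
        text, _ = backoff(query)          # escalate only when needed
    return text


print(answer("summarize this meeting transcript"))       # handled by the small model
print(answer("plan a three-week sabbatical in Japan"))   # backs off to the large model
```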

4.2 Next-Generation Experiences

Advancing models is necessary but not sufficient given the central role that interaction plays in the search process [57]. There are many opportunities to develop new search experiences that capitalize on copilot capabilities while keeping searchers in control.


4.2.1 Search + Copilots. Opportunity: Develop experiences bridging the search and copilot (chat) modalities, offering explanations and suggestions. Given how entrenched and popular traditional search is, it is likely that some form of query-result interaction will remain a core part of how we find information online. Future copilot-enhanced experiences may reflect a more seamless combination of the two modalities in a unified experience. Both Google and Bing are taking a step in that direction by unifying search results and copilot answers in a single interface. Explanations of what each modality and style (e.g., creative, balanced, and precise) is best suited for will help searchers decide which modalities and settings to use, and when. Modality recommendation given the task is also worth exploring: simple tasks may only need traditional search, whereas complex tasks may need copilots. Related to this are opportunities around suggesting a conversation style for the current task, e.g., a fact-finding task or short reply (needing precision) versus generating new content (needing creativity). Search providers could also consider offering a single point of entry and an automatic routing mechanism that directs requests to the appropriate modality given inferences about the underlying task (e.g., from Section 4.1.1) and the suitability of each modality for that task.
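A minimal sketch of such a single entry point with automatic modality routing appears below; the complexity cues and thresholds are illustrative assumptions rather than a tested heuristic.

```python
# Hypothetical routing sketch: one entry point sends a request to the
# traditional result page or to the copilot based on rough complexity cues.

COMPLEX_CUES = ("plan", "compare", "help me decide", "step by step", "why")


def infer_modality(query: str) -> str:
    q = query.lower()
    long_query = len(q.split()) > 6
    conversational = any(cue in q for cue in COMPLEX_CUES) or q.endswith("?")
    if long_query or conversational:
        return "copilot"      # exploratory / multi-step: route to chat
    return "search"           # simple lookup: route to the result page


for query in ["seattle weather", "help me plan a two week trip to Japan"]:
    print(query, "->", infer_modality(query))
# seattle weather -> search
# help me plan a two week trip to Japan -> copilot
```

In a deployed system, this heuristic would be replaced by a learned task model (Section 4.1.1) and refined with feedback about whether the routed modality actually satisfied the searcher.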


4.2.2 Human Learning. Opportunity: Develop copilots that can detect learning tasks and support relevant learning activities. As mentioned earlier, copilots can remove or change human learning opportunities by their automated generation and provision of answers. Learning is a core outcome of information seeking [15, 32, 54]. We need to develop copilots that can detect learning and sensemaking tasks, and support relevant learning activities via copilot experiences that, for example, provide detailed explanations and reasoning, offer links to learning resources (e.g., instructional videos), enable deep engagement with task content (e.g., via relevant sources), and support specifying and attaining learning objectives.


4.2.3 Human Control. Opportunity: Better understand searcher control and develop copilots that preserve it while growing automation. Control is an essential aspect of searcher interaction with copilots. Copilots should consult humans to resolve or codify value tensions. Copilots should be in collaboration mode by default and must only take control with the permission of stakeholders. Experiences that provide searchers with more agency are critical, e.g., adjusting the specificity/diversity of copilot answers to reduce generality and repetition. As mentioned in Section 4.1.4, citations in answers are important. Humans need to be able to verify citation correctness in a lightweight way, ideally without leaving the user experience. We also need a set of user studies to understand the implications of less control over some aspects (e.g., answer generation), more control over other aspects (e.g., macrotask specification), and control over new aspects, such as conversation style and tone.


4.2.4 Completion. Opportunity: Copilots should help searchers complete tasks while keeping searchers in control. We need to both expand the task frontier, by adding or discovering more capabilities of foundation models that can be surfaced through copilots, and deepen task capabilities, so that copilots can help searchers better complete more tasks. We can view skills and plugins as actuators of the digital world, and we should help foundation models fully utilize them. We need to start simple (e.g., reservations), learn and iterate, and increase task complexity as model capabilities improve over time. The standard mode of engagement with copilots is reactive: searchers send requests and the copilot responds. Copilots can also take initiative, with permission, and provide updates (for standing tasks) and proactive suggestions to assist the searcher. Copilots can also help support task planning for complex tasks such as travel or events. AI can already help complete repetitive tasks, e.g., via action transformers trained on digital tools [8], or by creating and applying “tasklets” (user interface scripts) learned from websites [30].
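The sketch below illustrates, under assumed names, how a copilot might handle a standing task: it polls a check function and surfaces an update only if the searcher has granted permission for proactive behavior.

```python
# Sketch (assumed names): a "standing task" the copilot monitors and, with the
# searcher's permission, proactively surfaces updates for.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StandingTask:
    description: str
    check: Callable[[], Optional[str]]   # returns an update, or None
    proactive_allowed: bool = False      # searcher must opt in to proactive mode


def poll(tasks: List[StandingTask]) -> List[str]:
    updates = []
    for task in tasks:
        update = task.check()
        if update and task.proactive_allowed:
            updates.append(f"{task.description}: {update}")
    return updates


def fare_watch() -> Optional[str]:
    # Stand-in for a flight-price plugin call.
    return "fare to Tokyo dropped below $900"


tasks = [StandingTask("watch Tokyo flights", fare_watch, proactive_allowed=True)]
print(poll(tasks))   # surfaced only because the searcher granted permission
```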


Given the centrality of search interaction in the information seeking process, it is important to focus sufficient attention on interaction models and experiences in copilots. In doing so, we must also carefully consider the implications of critical decisions on issues that affect AI in general such as control and automation.

4.3 Measurement

Another important direction is in measuring copilot performance, understanding copilot impact and capabilities, and tracking copilot evolution over time. Many of the challenges and opportunities in this area also affect the evaluation of foundation models in general (e.g., non-determinism, saturated benchmarks, inadequate metrics).


4.3.1 Evaluation. Opportunity: Identify and develop metrics for copilot evaluation, while considering important factors, and find applications of copilot components for IR evaluation. There are many options for copilot metrics, including feedback, engagement, precision-recall, generation quality, answer accuracy, and so on. Given the task focus, metrics should likely target the task holistically (e.g., success, effort, satisfaction). In evaluating search copilots, it is also important to consider: (1) Repeatability: non-determinism can make copilots difficult to evaluate/debug; (2) Interplay between search and copilots (switching, joint task success, etc.); (3) Longer-term effects on user capabilities and productivity; (4) Task characteristics: complexity, etc.; and (5) New benchmarks: copilots are affected by external data, grounding, queries, etc. There are also opportunities to consider applications of copilot components for IR evaluation. Foundation models can predict searcher preferences [50] and assist with relevance judgments [19], including generating explanations for judges. Foundation models can also create powerful searcher simulations that better mimic human behavior and values, expanding on early work on searcher simulations in IR [66].
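As an illustration of foundation models assisting with relevance judgments, here is a sketch of a judgment prompt whose response includes an explanation that could be shown to human judges. The prompt wording and the call_model placeholder are assumptions; no specific model API is implied.

```python
# Illustrative sketch of a foundation model acting as a relevance assessor
# that also explains its judgment. `call_model` is a placeholder, not an API.
from typing import Tuple

JUDGMENT_PROMPT = """\
Query: {query}
Document: {document}
Rate the document's relevance to the query on a 0-3 scale and explain briefly.
Answer as: <score> | <explanation>"""


def call_model(prompt: str) -> str:
    # Placeholder for a foundation-model call; returns a canned response here.
    return "2 | The document partially answers the query but omits key details."


def judge(query: str, document: str) -> Tuple[int, str]:
    response = call_model(JUDGMENT_PROMPT.format(query=query, document=document))
    score, explanation = response.split("|", 1)
    return int(score.strip()), explanation.strip()


score, why = judge("symptoms of vitamin D deficiency", "An overview of vitamins...")
print(score, why)   # the explanation can be reviewed or overridden by human judges
```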


4.3.2 Understanding. Opportunity: Deeply understand copilot capabilities and copilot impact on searchers and on their tasks. We have only scratched the surface in understanding copilots and their effects. A deeper understanding takes a few forms, including: (1) User understanding: covering mental models of copilots and effects of bias (e.g., functional fixedness [17]) on how copilots are adopted and used in search settings. It also covers changes in search behavior and information seeking strategies, including measuring changes in effects across modalities, e.g., search versus copilots and search plus copilots. There are also opportunities to use foundation models to understand search interactions via user studies [12] and to generate intent taxonomies and classify intents from log data [43]; (2) Task understanding: covering the intents and tasks that copilots are used for and most effective for; and (3) Copilot understanding: covering the capabilities and limitations of copilots, e.g., similar to the recent “Sparks of AGI” paper on GPT-4 [10], which examined foundation model capabilities.


Measuring copilot performance is essential to understanding copilots’ utility and improving their performance over time. Copilots do not exist in a vacuum, and we must consider the broader implications of their deployment for complex tasks in search settings.

4.4 Broader Implications

Copilots must function in a complex and dynamic world. There are several opportunities beyond advances in technology and in deepening our understanding of copilot performance and capabilities.


4.4.1 Responsibility. Opportunity: Understand factors affecting reliability, safety, fairness, and inclusion in copilot usage. The broad reach of search engines means that copilots have an obligation to act responsibly. Research is needed to understand and improve answer accuracy via better grounding in more reliable data sources, to develop guardrails, to understand biases in foundation models, prompts, and the data used for grounding, and to understand how well copilots work in different contexts, with different tasks, and with different people/cohorts. Red teaming, user testing, and feedback loops are all needed to surface emerging risks in copilots and the foundation models that underlie them. This also builds on existing work on responsible AI, responsible IR, and FACTS-IR, which has studied biases and harms and ways to mitigate them [36].


4.4.2 Economics. Opportunity: Understand and expand the economic impact of copilots. This includes exploring new business models that copilots will create beyond information finding. Expanding the task frontier from information finding deeper into task completion (e.g., into creation and analysis) creates new business opportunities. It also unlocks new opportunities for advertising, including advertisements that are shown inline with dialog/answers and are contextually relevant to the current conversation. There is also a need to more deeply understand the impact of copilots on content creation and search engine optimization. Content attribution is vital in such scenarios to ensure that content creators (and advertisers and publishers) can still generate returns. We should avoid the so-called “paradox of reuse” [55], where fewer visits to online content lead to less content being created, which in turn leads to worse models over time. Another important aspect of economics is the cost-benefit trade-off, which relates to the work on adaptation (Section 4.1.6). Large model inference is expensive and unnecessary for many applications. This cost will fall with optimization, to which model specialization and adaptive computation can contribute.


4.4.3 Ubiquity. Opportunity: Integrate copilots across applications to model and support complex search tasks. Copilots must co-exist with the other parts of the application ecosystem. Search copilots can be integrated into applications such as Web browsers (offering in-browser chat, editing assistance, and summarization) and productivity applications (offering support in creating documents, emails, presentations, etc.). These copilots can capitalize on application context to do a better job of answering searcher requests. Copilots can also span surfaces/applications through integration with the operating system. This enables richer task modeling and complex task support, since such tasks often involve multiple applications. Critically, we must do this privately and securely to mitigate risks for copilot users.

4.5 Summary

The directions highlighted in this section are just examples of the opportunities afforded by the emergence of generative AI and copilots in search settings. There are other areas for search providers to consider too, such as multilingual copilot experiences (foundation models are powerful and could help with language translation [33, 74]), copilot efficiency (large model inference is expensive and not sustainable at massive scale, so creative solutions are needed [72]), the carbon impact of running foundation models at scale to serve billions of answers for copilots [18], making copilots private by design [70], and government directives (e.g., the recent executive order from U.S. President Biden on AI safety and security [9]) and legislation, among many other opportunities.


[8] https://www.adept.ai/blog/act-1


[9] https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
