Autocompletion for Search Engines

Autocompletion dates back half a century (Longuet-Higgins & Ortony, 1968); it was initially designed to save keystrokes as people type and to help those with physical disabilities type faster. The incomplete user input is the “query prefix”, and suggested ways of extending the prefix into a full query are “query completions”. This feature is essential to modern text editors and search engines.

This blog post summarizes key ideas from the survey paper A Survey of Query Auto Completion in Information Retrieval, recommended by my Search teammate at DoorDash.

Why Autocomplete?

The “why” for autocompletion is straightforward — if done right, it makes search far more efficient (when was the last time you typed a 10-word query in Google?):

  • Users: Saves keystrokes + speeds up typing + discovers relevant search terms not previously thought of 👉 less search friction
  • Search engines: Reduces the probability of typos and ill-formed queries compared to if the user types the full query 👉 better query performance

A Yahoo! paper (Zhang et al., 2015) estimates that autocompletion saves searchers ~50% of their keystrokes. Autocompletion is such an integral part of modern search engines that you may only notice it when it’s missing (like in the screenshot where I searched on Nudie Jeans and wondered, “Ummm, what was the fit called again?”).

Query autocompletion is one type of query reformulation; other query reformulation tasks include query suggestion, query expansion, and query correction. However, the goal and the usage of autocompletion are quite different from the latter three. The latter three usually take place after a query is submitted and aim to improve suboptimal query performance (e.g., the user didn’t have the right “vocabulary” or typed something too specific so the search result page showed few or irrelevant results). By contrast, autocompletion occurs before query submission as users are typing and its main purpose is to save keystrokes and ease the submission process.

Build A Simple Autocompletion Engine

The “million-dollar question” is the “how” — how do we know what the user meant to type without them typing the whole thing? 🔮

Retrieval

Tries (“prefix trees”) are a common data structure for storing associations between a prefix and its query completions. A trie can be pre-built from the past query log and then used at query time to retrieve a list of query completions for a given prefix.
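
To make this concrete, below is a minimal sketch of a frequency-ranked prefix trie in Python. The class names and the toy query log are made up for illustration; a production engine would use an optimized structure (more on that later).

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.count = 0          # how often this node terminates a logged query
        self.is_query = False

class QueryTrie:
    """A toy prefix trie built from a query log (illustrative sketch, not production code)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, query: str):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.is_query = True
        node.count += 1

    def complete(self, prefix: str, k: int = 5):
        """Return up to k completions of `prefix`, ranked by logged frequency."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # DFS to collect all logged queries under this prefix.
        results, stack = [], [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_query:
                results.append((text, cur.count))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return [q for q, _ in sorted(results, key=lambda x: -x[1])[:k]]

# Build from a (toy) query log and retrieve completions.
trie = QueryTrie()
for q in ["cat food", "cat food near me", "cathy smith", "car wash", "cat food"]:
    trie.insert(q)
print(trie.complete("cat"))  # e.g. ['cat food', 'cat food near me', 'cathy smith']
```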

Ranking

Which completions to show in what order can be determined by heuristics-based approaches as well as learning to rank (LTR) models. Most search engines only show a handful of completions — the user’s intended query should appear at the top.

Heuristic-Based

Successful completions are ones that eventually get submitted by the user. Heuristic-based approaches rank each query completion $q_c$ by the probability that it might get submitted, given the prefix $p$, the time $t$, and the user $u$ — $P(q_c | p, t, u)$.

  • Popularity: Some completions appear more frequently than others. Frequency could be from past searches or future searches predicted by time-series models.
    • Short vs. long windows: Search may trend in the short term (e.g., a star’s name trends after a movie release) or have long-term cyclic patterns (e.g., people search “NeurIPS” more in May and December, prior to the submission deadline and the conference date, respectively). Combining both trends leads to better completions than considering either one alone.
  • User: You and I may mean very different things when typing “cat” (e.g., I’m buying “cat food” for my kids 🐱 whereas you’re looking up your aunt “Cathy”) — a user’s past searches, previous queries in the same session, and engagement with query suggestions can be used to personalize their query completions.
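
Putting the heuristics above together, here is a minimal sketch of a scorer that approximates $P(q_c | p, t, u)$ as a weighted mix of long-term popularity, short-term trend, and a toy personalization signal. The weights, helper dictionaries, and function names are made-up placeholders, not anything prescribed by the survey.

```python
import math

# Hypothetical weights; in practice these come from tuning on logs.
W_LONG, W_SHORT, W_USER = 0.5, 0.3, 0.2

def heuristic_score(completion, prefix, user_history, long_term_freq, short_term_freq):
    """Rough stand-in for P(q_c | p, t, u). `prefix` is unused here because the
    candidates are assumed to already match it (they came from the trie lookup)."""
    # Log-scale the counts so one viral query doesn't dominate everything.
    long_score = math.log1p(long_term_freq.get(completion, 0))
    short_score = math.log1p(short_term_freq.get(completion, 0))
    # Personalization: boost completions the user has submitted before.
    user_score = 1.0 if completion in user_history else 0.0
    return W_LONG * long_score + W_SHORT * short_score + W_USER * user_score

def rank_completions(candidates, prefix, user_history, long_term_freq, short_term_freq, k=5):
    return sorted(
        candidates,
        key=lambda q: heuristic_score(q, prefix, user_history, long_term_freq, short_term_freq),
        reverse=True,
    )[:k]

# Usage with made-up counts; candidates would come from the trie lookup for "cat".
print(rank_completions(
    ["cat food", "cathy smith", "cat memes"], "cat",
    user_history={"cat food"},
    long_term_freq={"cat food": 120, "cathy smith": 300, "cat memes": 15},
    short_term_freq={"cat memes": 50},
))
```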

Learning to Rank

In document retrieval (DR), learning to rank (LTR) refers to learning a function $f(q, D)$ that scores a list of documents $D = \{d_1, d_2, \ldots, d_n\}$ given a query $q$. In query autocompletion (QAC), documents are not yet in consideration; instead, we learn a function $f(p, Q)$ that scores a list of queries $Q = \{q_1, q_2, \ldots, q_n\}$ given a prefix $p$.

Labels for QAC LTR are simpler than the usual 5-point relevance scale (“Bad”, “Fair”, “Good”, “Excellent”, “Perfect”) used for DR LTR — users show binary preferences towards query completions by “submitting (1)” vs. “not submitting (0)”. The table below summarizes different features, labels, and evaluation metrics used for DR LTR vs. QAC LTR, two similar tasks each with their own peculiarities.

The table below shows example data for training DR LTR vs. QAC LTR. The format is highly similar between the two. As we can see, each prefix (grouping ID) is observed with multiple query completions, each with a feature vector and a binary label.
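
As a rough sketch of what training on data in this format could look like, the snippet below fits a LambdaRank model with LightGBM’s LGBMRanker, assuming that library is available; the features, numbers, and group sizes are made up purely for illustration.

```python
import numpy as np
import lightgbm as lgb  # assumption: LightGBM is installed; any LTR library would do

# Toy training data: each row is one (prefix, candidate completion) pair with features
# [observed popularity, predicted popularity, semantic similarity, user affinity].
X = np.array([
    [120, 90, 0.8, 1.0],   # prefix "cat": "cat food"       -> submitted
    [300, 40, 0.1, 0.0],   # prefix "cat": "cathy smith"    -> not submitted
    [ 15,  5, 0.3, 0.0],   # prefix "cat": "cat memes"      -> not submitted
    [ 80, 70, 0.9, 0.0],   # prefix "neu": "neurips 2024"   -> submitted
    [ 60, 10, 0.2, 0.0],   # prefix "neu": "neural network" -> not submitted
])
y = np.array([1, 0, 0, 1, 0])   # binary "submitted" labels
group = [3, 2]                  # 3 candidates for the first prefix, 2 for the second

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

# At serving time: score the candidates retrieved from the trie and sort by score.
scores = ranker.predict(X[:3])
print(np.argsort(-scores))  # indices of the three "cat" candidates, best first
```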

Below are some key ranking signals (many shared by heuristic-based approaches):

  • Popularity features: Observed popularity from historical searches; future popularity predicted from recent trend and cyclic behavior.
  • Semantic features: Users might want query completions that are similar (however defined) to submitted queries in the same session.
  • User features: User features such as location, demographics, search history, past reformulation behavior, etc. can affect query completions.

Evaluation

How do we know if the list of query completions returned to users is any good? On a high level, a good list ranks the user’s intended query at the top.

Unlike document retrieval, query autocompletion is mostly concerned with finding a single best query. Also, all else being equal, better suggestions should save more keystrokes. The metrics below capture these requirements (a short sketch computing the first two follows the list):

  • Mean Reciprocal Rank (MRR): On average, how early does the first relevant completion appear? The formula is $\frac{1}{|\mathcal{P}|}\sum_{p \in \mathcal{P}}\frac{1}{\mathrm{rank}_p}$, where $\mathcal{P}$ is the eval prefix set and $\mathrm{rank}_p$ is the rank of the first relevant (i.e., submitted) query for prefix $p$.
  • Success Rate at Top K (SR@K): % of time the intended query is found in top $K$ query completions for a given prefix, estimated from the prefix eval set.
  • pSaved: The probability of using a query completion while submitting a query.
  • eSaved: The normalized amount of keystrokes saved due to query completions.
  • Diversity: Redundant query completions (e.g., “apple pie” and “apple pies”) should be penalized. One example diversity metric is $\alpha$-nDCG, which assigns a higher gain to each additional query with more “new aspects”.
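
Here is the sketch mentioned above: small, self-contained implementations of MRR and SR@K over a toy eval set (the prefixes, suggestions, and intended queries are made up).

```python
def mrr(ranked_lists, intended):
    """Mean Reciprocal Rank over an eval set of prefixes.
    ranked_lists: dict prefix -> ordered list of suggested completions
    intended:     dict prefix -> the query the user actually submitted
    """
    total = 0.0
    for prefix, suggestions in ranked_lists.items():
        target = intended[prefix]
        if target in suggestions:
            total += 1.0 / (suggestions.index(target) + 1)  # ranks are 1-based
        # prefixes whose intended query never shows up contribute 0
    return total / len(ranked_lists)

def success_rate_at_k(ranked_lists, intended, k=3):
    """Fraction of prefixes whose intended query appears in the top k suggestions."""
    hits = sum(1 for p, s in ranked_lists.items() if intended[p] in s[:k])
    return hits / len(ranked_lists)

# Toy eval set (made-up data).
ranked = {"cat": ["cathy smith", "cat food", "cat memes"], "neu": ["neurips 2024", "neural network"]}
truth  = {"cat": "cat food", "neu": "neurips 2024"}
print(mrr(ranked, truth))                # (1/2 + 1/1) / 2 = 0.75
print(success_rate_at_k(ranked, truth))  # 2/2 = 1.0
```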

Performance Improvements

The above is the bare bones of an autocompletion engine, which can be further improved along several dimensions:

  • Computational efficiency: Running BFS/DFS on vanilla tries may be slow, especially given the vast query space for each prefix. We can optimize the trie data structure itself (e.g., Completion Trie, RMQ Trie, Score-Decomposed Trie, etc.) or the search algorithm (e.g., A* search); see the sketch after this list.
  • Error-tolerance: Users may have misspellings in the prefix — spell-checking or fuzzy match is needed to ensure misspelled queries can have completions.
  • Multi-objective: In e-commerce search, after a user submits a query completion, we ultimately want them to convert on a product. A multi-objective model also optimizes for subsequent engagement (e.g., clicks, add to cart, conversions) following query submission. This was described in an Instacart post.
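
For the computational-efficiency point, one common trick behind structures like the Completion Trie is to store at every node the best completion score found anywhere in its subtree, so top-k retrieval can run as a best-first search and stop early instead of exhaustively enumerating the subtree. Below is a minimal sketch of that idea, assuming scores are simply logged query frequencies; the names and data are made up.

```python
import heapq

class Node:
    __slots__ = ("children", "best", "query_score")
    def __init__(self):
        self.children = {}     # char -> Node
        self.best = 0          # best completion score anywhere in this subtree
        self.query_score = 0   # > 0 iff a logged query ends exactly here

def insert(root, query, score):
    root.best = max(root.best, score)
    node = root
    for ch in query:
        node = node.children.setdefault(ch, Node())
        node.best = max(node.best, score)
    node.query_score = max(node.query_score, score)

def top_k(root, prefix, k=5):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    # Heap entries: (-upper_bound, text, node) for subtrees still to expand,
    # or (-score, text, None) for an exact query whose final score is known.
    heap, out = [(-node.best, prefix, node)], []
    while heap and len(out) < k:
        neg_score, text, cur = heapq.heappop(heap)
        if cur is None:            # provably the best remaining completion
            out.append((text, -neg_score))
            continue
        if cur.query_score:        # this node ends a logged query: defer it as a candidate
            heapq.heappush(heap, (-cur.query_score, text, None))
        for ch, child in cur.children.items():
            heapq.heappush(heap, (-child.best, text + ch, child))
    return out

root = Node()
for q, s in [("cat food", 120), ("cathy smith", 300), ("cat memes", 15)]:
    insert(root, q, s)
print(top_k(root, "cat", k=2))   # [('cathy smith', 300), ('cat food', 120)]
```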

Presentation of Completions

Last but not least, autocompletion needs to be presented to users. The canonical way is to display completions as a vertical list below the search bar. If you’re highly confident about the top completion, you may put it directly in the search bar, a method called “ghosting” (Ramachandran & Murthy, 2019). The completed text is usually highlighted so that users can distinguish it from their own input.

Compared to document retrieval, autocompletion has even more limited real estate for showing suggestions and a stricter requirement on the position of the intended query (i.e., even if the intended query is included in the list, it only gets clicked ~13% of the time when ranked below the top 2; Li et al., 2015). In other words, precision matters even more in autocompletion than in document retrieval.

Rather than simply organizing results by submission probability, sometimes it makes sense to organize them by intent (see the graph below) or the nature of the prefix (e.g., if a prefix is a geolocation, completions can be ordered by region).