What:

  • When doing Information Retrieval (IR), you can add new terms to the user’s original query for better results.
  • If a user searches for “car” but the documents only contain “automobile”, a traditional exact-match search will fail.

1. Thesaurus Based:

  • Inject synonyms from a pre-built dictionary.
  • The dictionary can be built manually (accurate but unscalable) or automatically from word co-occurrence statistics.
  • Loses context: synonyms are injected regardless of word sense, so “jaguar” (the car) and “jaguar” (the animal) get the same expansion. (😉)
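The idea above can be sketched in a few lines. The tiny hand-built dictionary here is purely illustrative; a real system might use WordNet or co-occurrence statistics instead.

```python
# Thesaurus-based query expansion: a minimal sketch with a hypothetical
# hand-built synonym dictionary.
THESAURUS = {
    "car": ["automobile", "auto"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query terms plus any synonyms from the thesaurus."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query("cheap car"))
# ['cheap', 'car', 'inexpensive', 'affordable', 'automobile', 'auto']
```

Note the context problem is visible here: every sense of a term gets expanded, whether or not it matches the user's intent.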

2. Relevance Feedback:

Manual:

  • The user submits a query, and we return the top 10 documents from an initial retrieval.
  • Users mark results with positive (👍) or negative (👎) feedback.
  • We use this feedback to compute a new, improved query (classically via the Rocchio algorithm: move the query vector toward the relevant documents and away from the non-relevant ones).
  • We then run this improved query.

Automatic:

  • Exactly the same, except we simply assume the top-ranked documents are relevant (pseudo-relevance feedback).
  • We automatically refine the query using the same methods as before.
  • We run the refined query and surface the improved results to the user, with no extra interaction.
  • Problem? Query drift: if the top-ranked documents were actually irrelevant, the refined query moves even further away from the user’s intent.
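The automatic loop above can be sketched as follows. Both the toy scorer and the expansion heuristic (add the most frequent terms from the top-k documents) are simplified stand-ins for a real retrieval model:

```python
# Pseudo-relevance feedback sketch: assume the top-k first-pass results
# are relevant, then add their most frequent terms to the query.
from collections import Counter

def search(query_terms, corpus, k):
    # Toy first-pass scorer: rank documents by query-term overlap.
    scored = sorted(corpus, key=lambda d: -len(set(query_terms) & set(d.split())))
    return scored[:k]

def prf_expand(query_terms, corpus, k=2, n_new_terms=2):
    top_docs = search(query_terms, corpus, k)        # assumed relevant
    counts = Counter(t for d in top_docs for t in d.split())
    for t in query_terms:                            # don't re-add query terms
        counts.pop(t, None)
    return query_terms + [t for t, _ in counts.most_common(n_new_terms)]

corpus = [
    "car automobile dealership",
    "car repair automobile",
    "banana bread recipe",
]
print(prf_expand(["car"], corpus))
```

Query drift falls out of this directly: if the top-k documents are off-topic, the terms added come from the wrong documents, and the second-pass query is worse than the first.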

Combining With LLMs:

These classical methods are fast, much faster than a neural model like BERT. So we use them to retrieve the top 1000 most promising documents, then let BERT re-rank that short list properly.

Thus a hybrid, two-stage pipeline:

  1. Stage 1 (Retrieval): Use the classic, efficient method (like BM25 with an inverted index) to scan billions of documents and retrieve the top 1000 most promising candidates in milliseconds.

  2. Stage 2 (Re-ranking): Use the powerful, context-aware, slow model (like BERT) to carefully re-rank only those 1000 candidates. This model can understand the deep semantic meaning and produce the final, high-quality top 10 list that the user sees.
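The shape of the pipeline is easy to sketch. Both functions below are hypothetical stubs: `bm25_retrieve` stands in for a real lexical retriever over an inverted index, and `bert_rerank` uses a toy proxy score where a real system would call a cross-encoder.

```python
# Two-stage retrieval pipeline sketch: cheap wide retrieval, then
# expensive narrow re-ranking.
def bm25_retrieve(query, index, k=1000):
    # Stage 1: fast lexical scoring over the whole collection
    # (here a toy overlap count stands in for BM25).
    scored = [(doc, len(set(query.split()) & set(doc.split()))) for doc in index]
    scored.sort(key=lambda pair: -pair[1])
    return [doc for doc, _ in scored[:k]]

def bert_rerank(query, candidates, k=10):
    # Stage 2: the slow model scores ONLY the candidates. A toy proxy
    # (shorter matching docs first) stands in for BERT relevance scores.
    return sorted(candidates, key=len)[:k]

index = ["car automobile sale", "car wash", "gardening tips"]
candidates = bm25_retrieve("car", index, k=2)   # wide, cheap
top = bert_rerank("car", candidates, k=1)       # narrow, expensive
print(top)
# ['car wash']
```

The key design point: the expensive model's cost scales with the candidate list (1000 documents), not the collection (billions), which is what makes the hybrid affordable.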