What:

A data structure that allows for the fast retrieval of information. Ideally, you can search for a term in non-linear time.

How?

  1. Imagine you have a document and you get the Bag of Words Embeddings of it.

  2. Now repeat for all documents. You’ve got something like the below.

  3. Now, transpose the matrix. You’ve now got a bunch of vectors, each corresponding to a word. The index of each corresponds to the amount of times it appeared in document .

So What?

You can now do Boolean Search!

But for bigger documents? Sparse Representation:

  • The problem is that if you’re searching through for a query of length , you’ll have at most words out of a vocabulary of (which is huge).
  • Thus, what if we just store the document where the word appears.
  • This is called a Posting List (it’s essentially a dictionary - exactly like the actual index in a book).
  • Also, so we don’t lose frequency information, we actually store tuples

But now how to we do Search? We can do Linear Merge

Seen in Text Technologies for Data Science (TTDS)