What:
A data structure that allows for the fast retrieval of information. Ideally, you can search for a term in non-linear time.
How?
-
Imagine you have a document and you get the Bag of Words Embeddings of it.
-
Now repeat for all documents. You’ve got something like the below.
-
Now, transpose the matrix. You’ve now got a bunch of vectors, each corresponding to a word. The index of each corresponds to the amount of times it appeared in document .
So What?
You can now do Boolean Search!
But for bigger documents? Sparse Representation:
- The problem is that if you’re searching through for a query of length , you’ll have at most words out of a vocabulary of (which is huge).
- Thus, what if we just store the document where the word appears.
- This is called a Posting List (it’s essentially a dictionary - exactly like the actual index in a book).
- Also, so we don’t lose frequency information, we actually store tuples
But now how to we do Search? We can do Linear Merge