What:
A scoring mechanism for ranked information retrieval. It says:
A term is important to a document if it appears frequently within that document but rarely in the overall collection.
- Term Frequency (TF): Measures how often a term appears in a document. A higher TF suggests the document is about the term.
- Inverse Document Frequency (IDF): Measures how rare (and thus important) a term is across the entire collection.
- Terms that appear in many documents ("the") get low IDF scores and vice versa.
$\text{IDF}(t) = \log \frac{N}{n_t}$, where $N$ is the total number of documents and $n_t$ is the number of documents containing the term $t$.
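The two quantities above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes whitespace tokenisation, TF normalised by document length (raw counts are an equally common variant), a natural-log IDF, and a made-up three-document corpus. The function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(doc_tokens):
    """Relative frequency of each term within a single document (assumed TF variant)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus_tokens):
    """log(N / n_t) for every term t that occurs somewhere in the corpus."""
    n_docs = len(corpus_tokens)
    doc_counts = Counter()
    for tokens in corpus_tokens:
        doc_counts.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / n_t) for term, n_t in doc_counts.items()}


# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

idf = inverse_document_frequency(corpus)
tf = term_frequency(corpus[0])

# TF-IDF score of each term in the first document.
tfidf = {term: tf[term] * idf[term] for term in tf}
```

In this toy corpus "the" occurs in every document, so its IDF (and therefore its TF-IDF score) is zero, while "mat", which occurs in only one document, scores higher than "cat", which occurs in two.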
Vector Space Search on TF-IDF
- Similar to Bag of Words Embeddings, you initialise a zero-filled vector, where each entry corresponds to a word in the vocabulary.
- Instead of filling each entry with the term frequency, you fill it with that word's TF-IDF score (TF multiplied by IDF).
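- To search, you build the same kind of vector for the query, compute the cosine similarity (the cosine of the angle) between the query vector and each document vector, and rank documents by that score.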
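The search procedure in the list above might look like the following sketch. It makes the same assumptions as a typical toy setup: whitespace tokenisation, length-normalised TF, natural-log IDF, and a small invented corpus. All function names are hypothetical, and query words outside the vocabulary are simply ignored.

```python
import math
from collections import Counter


def build_index(docs):
    """Return (vocabulary, idf table, TF-IDF vectors) for tokenised documents."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = [vectorise(doc, vocab, idf) for doc in docs]
    return vocab, idf, vectors


def vectorise(tokens, vocab, idf):
    """Zero-filled vector over the vocabulary, filled in with TF-IDF scores."""
    vec = [0.0] * len(vocab)
    if not tokens:
        return vec
    counts = Counter(tokens)
    for i, term in enumerate(vocab):
        if term in counts:
            vec[i] = (counts[term] / len(tokens)) * idf[term]
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Rank documents by cosine similarity to the query, best match first."""
    vocab, idf, vectors = build_index(docs)
    q_vec = vectorise(query.lower().split(), vocab, idf)
    scores = [(cosine_similarity(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Toy corpus, invented purely for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

ranking = search("cat mat", docs)  # list of (score, document index) pairs
```

Here the first document ranks highest because it contains both query words, the second ranks lower because it shares only "cat", and the third scores zero because its only overlap with the other documents is "the", which has an IDF of zero.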