What:

A scoring mechanism for ranked information retrieval. It says:

A term is important to a document if it appears frequently within that document but rarely in the overall collection.

  • Term Frequency (TF): Measures how often a term appears in a document. A higher TF suggests the document is about the term.
  • Inverse Document Frequency (IDF): Measures how rare (and thus important) a term is across the entire collection.
    • Terms that appear in many documents (e.g. “the”) get low IDF scores; rare terms get high ones.

$\mathrm{IDF}(t) = \log \frac{N}{n_t}$, where $N$ is the total number of documents and $n_t$ is the number of documents containing the term $t$.
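
As a rough illustration of the scoring, here is a minimal Python sketch. It assumes whitespace tokenisation, TF as a term's count divided by the document length, IDF as the natural log of (total documents / documents containing the term), and the TF-IDF score as the product of the two. The function name `tf_idf` and the toy corpus are made up for this example; real libraries often use smoothed variants of these formulas.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: TF-IDF score} dict per document (illustrative only)."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            # TF (relative frequency) times IDF (log of N over document frequency).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog barked", "the cat purred"]
scores = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(1) = 0 and its score vanishes, while "sat" (one document) outscores "cat" (two documents) within the first document.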

Vector Space Search on TF-IDF

  • Similar to Bag of Words embeddings, you initialise a zero-filled vector in which each entry corresponds to a word in the vocabulary.
  • Instead of filling it with the term frequency, you fill it with the TF-IDF score.
  • To search, you compute the cosine similarity (the cosine of the angle) between the query vector and each document vector, then rank documents by that score.
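
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: tokenisation is a whitespace split, TF is the relative frequency, IDF is log(N / document frequency), and query terms outside the vocabulary are ignored. The helper names (`build_index`, `vectorise`, `cosine`, `search`) and the example documents are hypothetical.

```python
import math
from collections import Counter


def vectorise(tokens, position, idf):
    """Fill a zero vector with TF-IDF scores at each term's vocabulary index."""
    vec = [0.0] * len(position)
    for term, count in Counter(tokens).items():
        if term in position:  # assumption: out-of-vocabulary terms are skipped
            vec[position[term]] = (count / len(tokens)) * idf[term]
    return vec


def build_index(docs):
    """Build the vocabulary, IDF table, and one TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    position = {term: i for i, term in enumerate(vocab)}
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log(len(docs) / df[term]) for term in vocab}
    vectors = [vectorise(tokens, position, idf) for tokens in tokenised]
    return position, idf, vectors


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, docs):
    """Return (similarity, document) pairs, most similar first."""
    position, idf, vectors = build_index(docs)
    query_vec = vectorise(query.lower().split(), position, idf)
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(vectors, docs)]
    return sorted(scored, reverse=True)


docs = [
    "the cat sat on the mat",
    "the dog chased the ball",
    "the cat chased the dog",
]
results = search("cat mat", docs)
```

For the query "cat mat", the first document ranks highest because it contains both query terms, including the rare term "mat"; the third document shares only "cat"; the second shares no informative term and scores zero.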