Measures:

  • Effectiveness: How good (i.e., relevant) are the results?
  • Efficiency: How fast is the system?
  • Usability: How easy is the system for a user to interact with?

Unranked Results

  • This case is straightforward: we can use the classic measures (Accuracy, Precision, Recall, F1, Perplexity).
  • Accuracy is useless here: almost every document in a collection is non-relevant, so a system that returns nothing still scores close to 100% accuracy.
  • Precision and recall are at odds:
    • If you retrieve more documents (high recall), you’ll likely retrieve more junk (low precision). If you’re very selective (high precision), you’ll miss a lot of good documents (low recall). (A minimal sketch of these measures follows this list.)
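
A minimal sketch of these set-based measures, assuming both the retrieved results and the relevant documents are given as sets of hypothetical document IDs:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based (unranked) measures: both arguments are sets of document IDs."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: 5 docs retrieved, 4 relevant in total, 3 of them found.
retrieved = {"d1", "d2", "d3", "d7", "d9"}
relevant = {"d1", "d2", "d3", "d4"}
print(precision_recall_f1(retrieved, relevant))  # (0.6, 0.75, ~0.67)
```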

Ranked Results:

  • Average Precision (AP): We calculate precision at every rank where a relevant document is found, and then average all of those scores (see the AP sketch after this list).
    • E.g.: You have 3 relevant docs, and your system ranks them at positions 1, 3, and 5.
    • Precision at rank 1: 1/1 = 1.0
    • Precision at rank 3: 2/3 ≈ 0.67
    • Precision at rank 5: 3/5 = 0.6
    • AP = (1.0 + 0.67 + 0.6) / 3 ≈ 0.76
    • Relevant documents ranked early are heavily rewarded.
  • Normalised Discounted Cumulative Gain (NDCG): This is the most popular measure for web search.
    • Use Case: Used when relevance isn’t just “yes/no” but graded (e.g., 3=Perfect, 2=Good, 1=Fair).
    • How it works (see the NDCG sketch after this list):
      1. Gain (G): The graded relevance score (e.g., 3, 2, 1).
      2. Cumulative Gain (CG): Sum the gains as you go down the list.
      3. Discounted CG (DCG): The gain of docs lower in the ranking is “discounted” (divided by a log of the rank position), because users are less likely to see them.
      4. Normalised DCG (NDCG): The DCG score is divided by the ideal DCG (the score of a perfect ranking) to get a value between 0.0 and 1.0.
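
A minimal sketch of Average Precision, assuming a hypothetical ranked list of document IDs and a set of relevant IDs (the example mirrors the positions 1, 3, and 5 used above):

```python
def average_precision(ranking, relevant):
    """Average the precision computed at each rank where a relevant doc is found."""
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    # Divide by the total number of relevant docs, so missed docs count as zero.
    return sum(precisions) / len(relevant) if relevant else 0.0

# Hypothetical ranking with the 3 relevant docs at positions 1, 3 and 5:
ranking = ["r1", "n1", "r2", "n2", "r3"]
relevant = {"r1", "r2", "r3"}
print(average_precision(ranking, relevant))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```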
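
A minimal sketch of DCG/NDCG, assuming the common log2(rank + 1) discount and hypothetical graded relevance scores listed in rank order:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gains are graded scores in rank order (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """Normalise by the ideal DCG: the same gains sorted into the best possible order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgements for the top 5 results (3=Perfect, 2=Good, 1=Fair, 0=Bad):
gains = [3, 2, 0, 1, 2]
print(ndcg(gains))  # ≈ 0.96; it would be 1.0 for the ideal ordering (3, 2, 2, 1, 0)
```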