What:

  • Finding specific, unstructured material that satisfies an information need from within large collections.

Pipeline:

When preparing to do large-scale Information Retrieval (IR), you first need to index all of the data. Here’s the pipeline:

  1. Decide what data you’re looking for
  2. Acquire those documents
    1. Crawlers, RSS feeds, emails etc.
  3. Store all of those documents in a document store
  4. Transform all of the text (Pre-Processing in IR; see the preprocessing sketch after this list)
    1. Stemming (or lemmatising)
    2. Stopping
  5. Create an inverted index of all of that data
  6. Then do the searching. There are multiple kinds (see the index-and-search sketch after this list):
    1. Boolean Search
    2. Phrase Search
    3. Proximity Search
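
A rough illustration of step 4 (stopping and stemming) in Python. The stopword set and suffix list below are tiny, made-up placeholders; a real system would use a full stopword list and a proper stemmer such as Porter’s.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative subset only
SUFFIXES = ("ing", "ly", "ed", "es", "s")                      # crude, not a real stemmer

def tokenise(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token: str) -> str:
    """Strip one common suffix; a stand-in for a proper stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    """Stopping + stemming, as in step 4 of the pipeline."""
    return [stem(t) for t in tokenise(text) if t not in STOPWORDS]

print(preprocess("The crawlers are indexing all of the documents"))
# ['crawler', 'are', 'index', 'all', 'document']
```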
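
And a minimal sketch of steps 5 and 6: a positional inverted index with Boolean, phrase, and proximity search built on top of it. The document IDs, function names, and toy documents are purely illustrative; documents are assumed to be already-preprocessed token lists.

```python
from collections import defaultdict

def build_index(docs: dict[str, list[str]]) -> dict[str, dict[str, list[int]]]:
    """term -> doc_id -> positions of that term in the document."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

def boolean_and(index, terms):
    """Boolean search: documents containing every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def phrase_search(index, phrase):
    """Phrase search: the terms must appear consecutively, in order."""
    hits = set()
    for doc_id in boolean_and(index, phrase):
        starts = index[phrase[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(phrase))
               for p in starts):
            hits.add(doc_id)
    return hits

def proximity_search(index, t1, t2, window):
    """Proximity search: t1 and t2 within `window` positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, [t1, t2]):
        if any(abs(p1 - p2) <= window
               for p1 in index[t1][doc_id] for p2 in index[t2][doc_id]):
            hits.add(doc_id)
    return hits

docs = {"d1": ["inform", "retriev", "system"],
        "d2": ["retriev", "inform", "quickli"]}
index = build_index(docs)
print(boolean_and(index, ["inform", "retriev"]))       # {'d1', 'd2'}
print(phrase_search(index, ["inform", "retriev"]))     # {'d1'}
print(proximity_search(index, "inform", "system", 2))  # {'d1'}
```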

Problems Presented:

IR needs to be:

  • Effective
    • Find relevant things
  • Efficient
    • Needs to find them quickly

Components:

  • Documents:
    • The unstructured element you’re retrieving (has a UUID)
    • May not even be words at all (e.g. DNA)
  • Queries:
    • Free text that represents the user’s information need.
      • Multiple queries can describe the same information need (e.g. “Current POTUS” and “Donald Trump”)
      • One query can also represent many different information needs (e.g. “Apple”)

Relevance:

  • How do we decide what to show? Based on whether the user will click it? Whether it satisfies the user’s info need? Whether it’s novel?
  • Relevance: relevant items tend to be similar to one another (see the similarity sketch below)
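
One crude way to operationalise “relevant items will be similar” is to score each document by its term overlap with the query, here with Jaccard similarity; real engines use richer models such as TF-IDF or BM25. The data and names below are purely illustrative.

```python
def jaccard(query_terms: set[str], doc_terms: set[str]) -> float:
    """|intersection| / |union| of the two term sets."""
    if not query_terms and not doc_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms | doc_terms)

docs = {"d1": {"inform", "retriev", "system"},
        "d2": {"apple", "fruit"}}
query = {"inform", "retriev"}
ranking = sorted(docs, key=lambda d: jaccard(query, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```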