Lecture 9:

SPEAKER 0
So welcome everyone to today’s lecture on text classification. I’ll be talking about how we can determine the class of a document using various methods: rule-based methods and machine learning-based methods. We’re presuming here that there’s a set of classes, and we have a set of documents as usual. We’re not interested in searching them this time, like in the information retrieval task, and we’re not interested in determining in general what they are about, like in topic modelling; instead we have a specific set of classes. So I’ll talk about the theory of text classification, and there’s a coursework that I’ll announce today as well, which is focused on you practically implementing your own text classifiers. There will be no lecture next week; it’s cancelled because of the strikes. I’ll explain a bit more about this at the end of the lecture, after announcing the coursework. So what are the objectives of this lecture? You’ll learn the basics of text classification: what is it, how can you do it, and how is it evaluated. Those are the main three things. We’ll define the task, I’ll explain a couple of different ways of doing it, and then we’ll define the evaluation metrics for figuring out how well we did. We’re not going to go into depth on any of the specific text classification models; there are dedicated NLP courses for that, and machine learning courses for the learning algorithms. It’s really about giving you a general overview: you’ve got a big data set, you know it has classes, how do you assign a class to each given text? So let’s start with the definition of the task of text classification, and let’s talk about how it’s different from other tasks you’ve seen in the past.
So text classification is the process of classifying documents into a set of predefined categories based on their content. “Predefined categories” is highlighted here because this is different from something like LDA. In LDA we did not have a set of predefined categories; we had a general, vague sense that there’s a number of different topics being discussed in the documents, but all we were given at the beginning was the document collection, the set of documents. The main distinction here is that we’ve got a set number of classes: we know what they are, we know how many there are, and we can probably roughly describe them. And we want to look at the content of the documents to classify them. We’re not looking at metadata, and we’re not really looking at tabular data like what time it was published or anything like that. We’re specifically looking at the content, so the input is text. This could be just an individual sentence, as in sentiment analysis: “The book was great” is a positive sentence. It could be a paragraph, a whole document like a newspaper article, or even an entire book. So the length of the input text we deal with in text classification varies dramatically, but the task is always the same: classify into predefined categories, in general at least two. If we just have one category, then the task becomes: is the text in the category or not? Is that a question you’ve got? No? OK, if you’ve got a question at any point during the lecture, please feel free to put your hand up and interrupt me; that’s completely fine, that’s what we’re here for. So yeah, you might have one class that you’re interested in, in which case we’re dealing with a binary classification problem: is the class present in the text or is it not?
Or you might have multiple classes that you’re interested in. For sentiment analysis, you might be looking at positive, negative, or neutral. For newspaper articles you might be interested in things like sports, politics, comedy, or technology. And you might also have hierarchical relationships between classes: in scientific articles, you could have a class that is biology and one that is sociology, and then within the biology class you have multiple subclasses that correspond to different subfields of biology, and that kind of thing. So what this means fundamentally is that classification, which is also called categorization, is the activity of predicting to which of a predefined, finite set of groups, classes, or categories (we use these terms more or less synonymously) a data item, a text, belongs. It’s really widely used and widely studied in all areas of data science, pattern recognition, statistics, and machine learning, and it can be formulated more formally as the task of generating what we call a hypothesis, also called a classifier or a model, that maps from the domain of data items to the finite set of classes. So it’s like a function: we want to know the function that takes as input the document and gives as output the class, and we want to learn that function from data. That’s the formalisation of this problem. Because we’re learning a function, we have to find some way of representing the text documents we have as vectors, so some way to represent the text numerically, and we have to find some way of representing the classes numerically; that’s the easy part. This means that a lot of what we’ll be talking about today is really about how we can represent text data numerically as vectors.
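As a minimal sketch of the idea that a classifier is just a function from a document representation (a vector) to a class label, here is a toy linear classifier. Everything here, the class names, the feature vectors, and the weights, is made up for illustration; a real classifier would learn the weights from data.

```python
# Sketch: a classifier h maps a document vector to a class label.
# Here h(x) = argmax over classes c of the dot product w_c . x,
# with hypothetical, hand-picked weights (not learned ones).

def make_classifier(weights):
    """Return a function that scores each class and picks the best."""
    def classify(x):
        scores = {c: sum(wi * xi for wi, xi in zip(w, x))
                  for c, w in weights.items()}
        return max(scores, key=scores.get)
    return classify

# Made-up weights for two classes over three features.
h = make_classifier({"sports": [2.0, 0.1, 0.0],
                     "politics": [0.0, 0.2, 1.5]})

print(h([3, 1, 0]))  # a vector with mostly "sports-like" features
```

The point of supervised learning, covered later in the lecture, is precisely to find good values for those weight vectors automatically rather than writing them by hand.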
Representing documents as vectors is useful for classification; it’s also immensely useful for other things like retrieval, and it can also be used for unsupervised learning, so it really overlaps with other things you’ve heard about before. Classification is different from clustering, where you don’t know in advance what the groups are, and often don’t even know how many there are. Classification is also not very useful when you can determine quite easily whether a document is a member of a class, when it’s really easy to tell. For example, predicting whether a natural number is a prime number or not is not a useful thing to be predicting with the sort of methods we’ll be talking about, or predicting whether a text contains the word “queen”. That’s not a classification problem; it doesn’t make sense to throw a classification model at it because you can just use a regular expression or string matching. What we’re really interested in is statistical relationships. We’re interested in probabilities; we want to assign classes probabilistically. We want to be able to say things like: there’s an 80% chance that this document belongs to this class and a 20% chance it belongs to that class, based on the features we’ve observed. So the question is really what kind of features we’re interested in, what kind of things we look for in the document that tell us something about the class the document is in. We tend to deal with textual data in this course because it’s called Text Technologies for Data Science, but fundamentally the algorithms we use for classification can be used with a lot of other input data as well, whether numerical or tabular; it’s often very similar algorithms, it’s just that we talk about text in this course.
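To make the “80% this class, 20% that class” idea concrete, here is one common way of turning per-class scores into probabilities, a softmax. The class names and scores are made-up examples; the lecture doesn’t prescribe this particular function, it’s just a standard choice.

```python
import math

# Sketch: assigning classes probabilistically. Given raw per-class
# scores (e.g. from a linear model), softmax converts them into
# probabilities that sum to 1. The scores below are invented.

def softmax(scores):
    exps = {c: math.exp(s) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

probs = softmax({"finance": 1.4, "sports": 0.0})
print(probs)  # roughly 0.80 for finance, 0.20 for sports
```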
So examples of purely textual data would be things like news articles or emails, individual sentences, or queries in an IR task. Sometimes the data we have is only partly textual, like a web page, which comes with a lot of text content but also a lot of metadata about the structure of the document and where it was created, and things like links between the web page and other web pages, which you might actually want to take into account. So now we suddenly have the task of merging the features from the text content with all the other metadata. So what are some different types of classification task? I’ve already talked about binary classification: this is when you’ve got one class that you’re interested in and you want to predict whether or not your item is in that class, or when your data set can be perfectly partitioned into exactly two classes. So often it’s presence of class A versus absence of class A, or it could be class A versus class B, but the point is you’ve got exactly two possibilities, and you’re classifying the item into exactly these two. Examples of this would be things like spam classification: you go through emails that come into the inbox and you want to identify which of them are spam and which aren’t. Or offensive language classification, a common task in content moderation: if you work for a social media company, you need to identify who’s posting something hateful online, something that violates some policy you’ve got and needs to be taken down or hidden from users. That’s the sort of binary classification task: it either violates the policy or it doesn’t; it’s either offensive or it isn’t.
Or in the context of information retrieval, you could look at things like: are the query results you got relevant or irrelevant? Then we’ve got single-label multi-class classification, which is about situations where you have things that can be classified into one of several possible classes, but only ever one class at a time, so you’re assigning a single label out of multiple classes. An example would be if a newspaper article is definitely either about sports or politics or entertainment: you know it’s going to be about one of the three, and exactly one of the three. Or sentiment analysis, where you know that a tweet is either positive or neutral or negative, but exactly one of the three. And then multi-label multi-class classification is when you actually want to assign possibly multiple labels, or possibly no label at all, out of multiple potential classes. So you could be interested in computer science articles that are to be classified into the ACM classification system, which is a classification system for different subfields of computer science. An article could be about multiple subfields, like information retrieval and databases, or it could be about just one subfield, or, if you have an article that’s not from computer science at all, it could be about none of the subfields. In this case you’ve got a multi-label multi-class classification problem where you assign these classes, usually independently. So often you can transform this multi-label multi-class problem into a series of binary classification problems: for the computer science article, for each of the classes, ask “is the article about databases?” as a binary classification task, yes or no; “is the article about information retrieval?”, yes or no. So you can transform between these different types of problems.
And there are different dimensions along which you might want to classify a text; we also call them axes that are orthogonal to each other, so they’re not necessarily correlated in any way. Topic is by far the most frequent case: you might want to identify whether a newspaper article is about sports or about entertainment. Then there’s sentiment, which I’ve mentioned, useful in market research for trying to understand what customers are saying about a product: are they generally happy or unhappy? Is this a review where someone’s happy, or one where they’re unhappy? Things like language: language identification is a very important task in NLP, often something you do as pre-processing, for example to show relevant results to users who prefer a specific language. By genre, so whether you’re talking about a news article or a blog. Or by author: you might try to find out, for example, which historical figure wrote a manuscript whose authorship is disputed, or which parts of a Beatles song were written by Paul McCartney and which parts by John Lennon. Or usefulness: if you’ve got a data set of product reviews, which of them are good and which are not? Often you ask users to rate whether they found a review useful, and you can take that as the data you’re trying to predict, so now you’re trying to figure out, given the text of the review, whether it’s a good review. Or it could be something like hatefulness, in the example of the social media content moderation system, where, because you’re responsible for removing hateful content, you’re trying to identify which content needs to be removed. So that’s the task.
Now, as with any task, there’s usually a number of different ways of solving it, a number of different ways of approaching it. I think this is a really useful distinction to make: think about the task you’re trying to solve, then think about the different models you can use to solve it. You’ve learned about the information retrieval task: given a query, identify relevant documents. You’ve learned about the task that LDA solves: identifying topics in documents. Now we have the task of classification: identify whether the item belongs to the class or not. So what are the different ways we can address this task? What are the different models and methods we can use to solve it? This is roughly a historical overview, so I’ve ordered the different models and methods. All of them are still used, but I’ve ordered them in the order that they appeared, and this is the order we will cover them in. Rule-based classification models have been around basically as long as there has been digital data, since the 50s and 60s, and they were really the dominant paradigm for any sort of text classification up until roughly the 1990s. Then supervised learning classification emerged with research on neural networks and with more and more data becoming available, and gradually replaced rule-based classification, initially with models like early neural networks, support vector machines, and a bunch of other classification algorithms.
Like random forests, usually with handcrafted features, so the researcher decides what are the different things that the algorithm should in principle pay attention to, and then the learning algorithm solves the task of identifying just how much attention to pay to each of the features. This gradually got replaced for many use cases by word embeddings, starting around 2013, which we’ll also cover, and by now pre-trained language models have become the most common way of doing this. But again, there’s still a place for traditional features and handcrafted feature engineering, and there’s still a place for static word embeddings. Within the area of pre-trained language models we can distinguish between fine-tuning models specifically for classification tasks and zero-shot classification use. So, in the order that these approaches appeared and were invented, let’s go through them one by one and look at how exactly they work. Rule-based classification is the oldest form of text classification, and by now it’s a quite old-fashioned-seeming way of building text classifiers. It was done via knowledge engineering: experts coming together, sitting down, and collecting dictionaries, lexicons of terms that are relevant for specific things, so you could have a dictionary of terms that help you decide what class a document belongs to. You’ve already seen dictionaries for things like sentiment, where we computed scores for documents in the last lecture I gave, based on predefined sets of words, and you can think of dictionary classification as a way of using such a dictionary for classification.
So you could, for example, just set a threshold where you say: if this many of the words I’m interested in appear, if the score is higher than this threshold, then I’m going to say the class is present in the document, and if not enough words of this type appear, then it’s not. Can you think of any reasons why someone might want to do this, even today? And can you think of disadvantages? So what are the advantages and disadvantages of this approach? You don’t need too much compute power; that’s an excellent point. You don’t need much compute power for this because you’re just matching strings. It’s a very cheap, computationally efficient operation to go through a dictionary of a limited, fixed size and a data set of a limited, fixed size, check whether each of the words in the dictionary appears in each of the text items, and sum them up. You can do this in a single pass over the document collection, so it’s very computationally efficient. These days we’re used to having phones that can do things that computers ten years ago weren’t able to do, but still, if you’re working in a situation with maybe embedded systems, and you have a classification task where maybe the accuracy of your classification is not of the utmost importance, maybe you’re just trying to get a very initial sense, it can be incredibly useful. Another example is search engines that have to index trillions of documents across the web. Very often we use computationally efficient methods as a first pass to identify potential candidates, and then we re-rank just the top results with more computationally expensive methods, so we have less data to work with, because often all of the methods I’ll talk about next require far more compute than the rule-based classification models do.
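The threshold idea just described can be sketched in a few lines. The lexicon, the threshold value, and whitespace tokenisation are all made-up simplifications; a real system would use a curated dictionary and proper tokenisation.

```python
# Sketch of a rule-based dictionary classifier: count how many words
# from a hand-built lexicon appear in the text, and say the class is
# present when the count reaches a threshold. One pass, string matching
# only -- which is why this is so computationally cheap.

SPORTS_LEXICON = {"match", "goal", "team", "score", "league", "coach"}

def dictionary_classify(text, lexicon=SPORTS_LEXICON, threshold=2):
    """Return True if at least `threshold` lexicon words occur in text."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in lexicon)
    return hits >= threshold

print(dictionary_classify("the team scored a late goal in the match"))
```

Note that “scored” does not match “score” here: without stemming, exact string matching misses variants, which is one source of the poor recall discussed next.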
So yeah, if you have a huge data set or very limited hardware available, it’s still a good idea to use this, or at least consider using it: evaluate its accuracy and see how much accuracy you lose compared with more modern, potentially better methods, and how much accuracy you actually need. It does mean there’s a bit of setup cost involved initially, because you need to write the dictionary. If the dictionary exists and you can just download it, find it on GitHub somewhere where someone’s created it, that’s great, but sometimes it doesn’t exist, when you have a classification task that’s a bit more niche, a bit more specific. You might have to get a couple of experts together, sit them down in a room, and identify specific words, and this can actually be a relatively costly exercise, so whether or not it’s worth it depends on the situation you’re in. Also, you might get pretty good precision with this approach, but not very good recall, because it’s actually really hard for humans to come up with all the potential words that might be relevant for a class. If I think of all the words that could be relevant to sports, I can probably very quickly write down 100; maybe between us we can come up with 1,000, but in reality there are tens of thousands of words related to sports, including the names of individual athletes, that can tell a text classification system that a text belongs to the class. As a human I can read those words and understand that they belong to the class, so I can recognise them, but I’m not going to be able to write them all down. If you asked me now to write down all the words related to sports, I’d still be sitting here tomorrow, still writing, and I wouldn’t have written a tenth of the words that are relevant.
So being expensive to set up is another disadvantage. Let’s move on now to supervised learning classification. Here the idea is that we’re going to use machine learning models to identify which features are the most relevant, so we’re going to automate this process of manually identifying the words we should be paying attention to. Supervised learning in general is not limited to text data; it’s a generic framework where the idea is that we know the class of some examples, so we have a data set available to us where we know the class, and then we have a whole bunch of examples where we don’t know the class. The point is to predict the class for the documents where we don’t know it, based on the examples where we do, and what we need to do for that is learn the characteristics of the class: in this case, what is it about a text that tells us it’s in the class. So we’re no longer sitting down with a group of experts and trying to write down the words; we’re getting the machine learning algorithm to do that, to identify the words and other features it should look for. The advantage of this is that it’s usually far cheaper to get training examples than to get experts to write down all the words. If you have five experts in a topic and they’ve got a limited amount of time, they can usually pretty easily label lots of examples, so it’s a good use of their time to give them unlabeled examples and ask them to label them. A journalist will tell you fairly easily and fairly reliably whether a newspaper article is in the class sports or the class entertainment, when it would take them far more time to try to write down all the words that relate to sport or entertainment.
Also, when conditions change, like you have a new class, you can keep your existing training data, which is great, and just update it by adding examples of the new class. So this takes us to supervised-learning-based classification, and here’s the general framework. You start with a set of documents: imagine that you have collected a large data set of something like blog posts about products, and you want to know how many of them are positive, how many are negative, and how many are neutral. You want to know which products are described more positively and how that’s changed over time, or you’ve got some other research questions that you’re supposed to be writing a report about. So you start by reading a few, and that’s maybe a relatively small sample of all the documents that exist, or you start by paying someone else to do all this reading. These days a huge industry has emerged just around getting documents annotated: paying crowd workers to annotate documents for you. You can get this done on Amazon Mechanical Turk or a number of other platforms that facilitate this; you pay them money and they distribute the money and the tasks to crowd workers, who then get asked: is this news article sports or entertainment? Is this social media post harmful or hateful?

SPEAKER 1
What if those workers make mistakes?

SPEAKER 0
Excellent question. I’m just repeating the question for the recording: what if the workers make lots of mistakes? This happens all the time. It actually turns out that if you get random people on the internet who get paid for this, they tend to go about it with a very specific mindset. For example, if they get paid per sample they annotate, per text, they have an incentive to do it quickly. So mistakes do tend to happen, and you tend to get more mistakes than if you get actual experts, say a bunch of PhD students in an area, to sit down in a physical room and debate. The typical way to deal with this is to get multiple people to annotate the same text independently and take a majority vote; that’s a common way of doing it. Or you tolerate some level of noise in the data set, because, as it turns out, having thousands of examples annotated with some of them annotated wrong tends to be better than having just a few hundred examples where you’re 100% sure about each one because you’ve manually checked it. These systems do tend to benefit from having huge amounts of data, so to some extent these mistakes can be tolerated. And what we typically do is quantify them: we measure what we call the reliability of the annotators. There are dedicated metrics for measuring this. I’m not going to go into detail here, but there are examples like Cohen’s kappa, Cronbach’s alpha, and Krippendorff’s alpha, lots of metrics that have been developed specifically to measure the reliability of annotations and to quantify how often annotators are wrong. What you need for this is to have the same set of documents annotated by two people, because then you can check how often they agree with each other, and the simplest way of doing this is to measure percent agreement.
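As a small illustration of the agreement measures mentioned here, this sketch computes raw percent agreement alongside Cohen’s kappa, which corrects for the agreement two annotators would reach by chance. The label sequences are made-up examples.

```python
from collections import Counter

# Sketch: reliability of two annotators over the same items.
# percent_agreement = fraction of items they label identically.
# cohens_kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
# expected by chance given each annotator's label frequencies.

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    p_o = percent_agreement(a, b)          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Chance agreement under each annotator's own label distribution.
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

ann1 = ["spam", "ham", "ham", "spam", "ham", "ham"]
ann2 = ["spam", "ham", "spam", "spam", "ham", "ham"]
print(percent_agreement(ann1, ann2), cohens_kappa(ann1, ann2))
```

On these toy labels the raw agreement is 5/6, but kappa is lower (2/3) because much of that agreement is expected by chance given the label frequencies, which is exactly the class-imbalance correction discussed next.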
Percent agreement alone is a bit of a naive measure, though, because of class imbalance: if you’re dealing with spam, maybe 99% of emails are not spam and 1% are spam, so percent agreement can be close to 99% just by chance, even if your annotators never agree on the rare cases that matter. What these metrics essentially do is control for this random variation from class imbalance, and they give you a better sense of how often your annotators actually agree with each other on the cases that matter. Does that answer your question? Cool. So now you’ve got your annotated samples, from reading them yourself or from paying annotators to do it, and the annotators are gone: you’ve got your huge data set of unlabeled samples and your smaller subset of labelled samples. If your big data set is actually small enough for your experts to annotate all of it, there’s no need to do supervised classification. The only reason we ever do supervised classification is that we have a much bigger data set than we can feasibly annotate; otherwise we could just have the experts annotate the whole data set. So in general, whenever you have a data set that’s too big for the experts to annotate, or too expensive to annotate because you haven’t got the budget, we take a small sample, have it annotated by the experts, train on that sample, and get the model that’s been trained on this sample to annotate the big data set. The idea really is: how do we train this model, how do we teach it what to look for? So we have our documents to be classified, and the next steps are to extract some features. To use an actual classifier, we need to represent our documents numerically, so we’ll talk about how we can represent text documents numerically.
You’ve already seen this a bit when building an index, so we’ll go back to it and look at how it works for classification. Then, given the features and our labels from the experts, we can train a classifier, which mathematically is really an optimisation problem. There’s a huge space of potential classifiers, potential functions really, that map from the domain of the data items to the classes, and the goal is to identify which one is best. So we have to have some sort of metric, called the loss, that tells us how good our classifier is on the training data, and the optimisation problem is to identify the function that gives us the lowest loss, the best classification of the training documents. There are lots of different ways of measuring this error, these loss functions, and many different methods for reducing it. If you take a class like machine learning and pattern recognition, you’ll go into more of the details of this; here we’ll more or less treat it as a black box. What we’re interested in is how to represent specifically text data for use in these classifiers. So, you’ve trained your classifier, and the classification model is the output of the training process; now you can use this model to extract features from the unlabeled documents and classify them. That’s the overall process. Now let’s zoom into some of these steps to look at what’s going on there. One step is extracting the features. In general, a feature here is just a single data point that tells us something useful about a document, something like the number of times the word “mortgage” appears in a document, which might be useful for deciding whether it’s a document about finance or not.
When we look at an individual document in this step, the input is just the textual content of the document, and the output is a vector where each entry corresponds to one of the features. When we look at the entire corpus, the input is the entire set of documents and the output is a matrix, where we have a row per document and a column per feature. When we extract features, we have to keep in mind that we’re looking for features that help the classifier separate the classes well. Ideally we’re looking for features that tell us something about whether the document is in the class or not, or which class the document is in. So really what we’re doing is transforming this unstructured text data into a structured representation, into a matrix where hopefully the columns tell us something useful about which class each document is in. Ideally we want documents from different classes to be represented by vectors that are as different as possible, so that the model can identify the differences between them. And in order to help ensure that the classes are separable, our task is to identify the right methods for extracting features from the documents; we convert them into vectors, and then, if we have too many features for the classifier to actually be able to use, we reduce the number of features with a feature selection step, and we decide on some sort of weighting for the features. So the first step here is feature extraction, where we ask what features should differ from one class to another, and the simplest thing we can do is what’s known as a bag-of-words model, where each document is represented separately, and each word in each document is a separate feature.
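A minimal sketch of building that bag-of-words matrix, using documents modelled on the lecture’s example and simple whitespace tokenisation (a real pipeline would tokenise and normalise more carefully):

```python
# Sketch: term-document matrix for a bag-of-words model.
# Rows are documents, columns are vocabulary terms, and each cell
# holds the raw term frequency (count) of that term in that document.

docs = ["he likes to wink he likes to drink",
        "the thing he likes to drink is ink"]

vocab = sorted({w for d in docs for w in d.split()})

def doc_vector(doc, vocab=vocab):
    tokens = doc.split()
    return [tokens.count(term) for term in vocab]

matrix = [doc_vector(d) for d in docs]  # one row per document
print(vocab)
print(matrix)
```

Reading off the column for “he”, the first document’s cell is 2, matching the lecture’s description; the rows are the document vectors and the columns are the term vectors.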
The actual entries in the document vectors could be binary, to indicate that a word is present or not; they could be numbers, like how many times the word is present in the document; or they could be something more interesting like TFIDF, which is higher when the word appears often in our document but lower when it also appears often elsewhere. What all these approaches have in common is that they ignore the relationship between the different words. So whether a word appears in one context, maybe often before another word, is completely irrelevant. I hope this sounds a bit familiar, because you’re already familiar with the idea of representing words and documents as vectors from the indexing lecture in the course. There you had the documents, he likes to wink, he likes to drink, and the thing he likes to drink is ink, and we represented them in the following way. The documents were represented as vectors, where if you look, for example, at the first column, which is for the word he, that word appears twice in the first document, so the number there is a 2. So what we’ve got in the cells is the term frequency, not normalised in any way by document frequency or anything like that. We can think of the rows here as the documents; the rows are document vectors. And that also means the columns are term vectors. So we’re interested in these term vectors, which are essentially vector representations of words. These are the features. We could use this directly as a feature matrix for a classifier, because it’s exactly what we were looking for: a document per row and a feature per column, so we’re just calling these words features now. But we could add in other features, more columns, for things that are maybe not actually textual features, like metadata.
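The term-document matrix described here can be sketched in a few lines of Python. This is a minimal illustration using toy documents modelled on the lecture's "he likes to wink / drink" example (the exact documents are an assumption, not the slide data):

```python
# Toy documents, loosely based on the lecture's example (assumed wording).
docs = [
    "he likes to wink he likes to drink",
    "he likes to drink and drink and drink",
]

# Build the vocabulary: one column per distinct term.
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one column per term; cells hold raw term frequencies.
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]

for row in matrix:
    print(row)

# The column for "he" holds 2 for the first document, because "he"
# appears twice there — exactly the cell described in the lecture.
```

Each row is a document vector and each column is a term vector, so the same structure supports both document-document and term-term comparisons.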
How many replies did a tweet get, or how many upvotes did a comment on Stack Overflow get? The numbers in the cells here are feature weights, and we’ve chosen in this case to use term frequency, but we could use something else, like binary presence or the number of occurrences of a term in a document. And it turns out this is already quite a useful representation because it allows us to do a bunch of other things. For example, we can calculate similarities between two documents. If you take something like cosine similarity between the two document vectors, it tells us something about how the documents are related, because it tells us which documents tend to contain the same words. We can also compute similarity between terms. This tells us something about which terms are similar, in the sense that, in our limited collection of data, they tend to appear in similar ways in documents. So the term he, for example, appears twice in this document, and that’s something it has in common with the word likes. Actually, if you calculated the similarity between pairs of terms here, based on this relatively small sample, you’d find that the term he and the word likes are identical. Of course in real life they’re not actually the same word, they mean very different things, so this only tells us so much about the semantic relatedness of terms. It tells us a bit, because presumably if one word was car and the other was automobile, we’d expect both terms to appear commonly in documents about cars and both to appear infrequently in documents about projectors. So it tells us a little about how words are related, but not a whole lot.
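The cosine similarity mentioned above can be computed directly from two count vectors. A minimal sketch (the vectors here are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two hypothetical document vectors (rows of a term-document matrix).
doc1 = [2, 2, 2, 1, 1, 0]
doc2 = [1, 1, 1, 0, 3, 2]

print(round(cosine(doc1, doc2), 3))
```

Applied to columns instead of rows, the same function compares term vectors; two terms whose columns are scalar multiples of each other come out with cosine 1.0, which is why "he" and "likes" look identical in the lecture's tiny sample.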
Another thing we can do, instead of columns that are individual words, is use word n-grams. So in addition to individual words, we could use things like bigrams or trigrams. These are things like not good: when the words not good appear in that sequence in a document, we set the column to one, and when they don’t appear in sequence, even if the word not appears and the word good appears separately, we set it to zero. What this does is create a feature that tells the algorithm whether the words not and good appear one after the other, and it could then potentially learn things like: when the word good is preceded by the word not, it means the opposite of just the word good. You can see how this immediately becomes a much, much larger number of features; where before we had maybe a few tens of thousands of features, suddenly we’ve got millions. So our vectors also become much more sparse. We already had a couple of zeros here, words that don’t appear in some of the documents. Now imagine we had all possible bigrams, so a lot more columns, he likes, likes to, to drink, and so on; the vast majority of numbers in this matrix would be 0. This is a common feature of this kind of representation: they’re really sparse matrices. Sometimes we use character n-grams instead of word n-grams. In this case character n-grams would be short character sequences, something like h-e or e-l. They have certain advantages and disadvantages compared with word n-grams. For example, if you’re using optical character recognition, so your documents are actually scans of analogue documents and there’s an error rate in the optical character recognition, then you might really want to use these subword tokens or characters.
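The n-gram extraction described above is a short sliding-window operation. A sketch, using a made-up sentence containing the lecture's "not good" example:

```python
def ngrams(tokens, n):
    """All contiguous n-token windows over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the book was not good".split()
bigrams = ngrams(tokens, 2)
print(bigrams)

# ("not", "good") fires as a bigram feature here, whereas a plain
# bag-of-words model would only record "not" and "good" separately.
```

Running the same function with n=3 gives trigrams, which is where the feature count explodes into the millions as the lecture notes.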
That’s because they’re a bit more resilient to those kinds of errors. So the main issue here is that if you add more than just the individual unigram features, the individual words, you often end up with huge feature matrices where you then have to do some level of feature selection. What other kinds of features could be used? What are other things that could tell you interesting things about a text and which class it belongs to? What can you glean from the text that tells you something about the class? Any ideas? TFIDF? Yeah, so TFIDF you can use even with this representation here, with the bag of words representation. What we looked at here was the number of occurrences, the term frequency, but if you divided it by the document frequency, you could use TFIDF in exactly this representation. So you can use TFIDF with unigrams like here, or with word n-grams, with trigrams. I would actually encourage you to try this out in the lab yourself and see: does using TFIDF improve my scores or does it make them worse? Any other ideas what else we could use? The length. Yeah, absolutely. So we can derive metadata features, numerical features that we then add. These become additional columns in this matrix, where instead of wink being how many times the word wink appears in the document, we could define another column for the number of words in the document: he likes to wink, he likes to drink, OK, that’s 8; the second document, that’s also 8, and that kind of thing. Or we could do something like the average length of a word; that could be interesting, maybe also from the perspective of reading ease. What else? Any more ideas? We can look at things like the sentence structure. We can use a part-of-speech tagger, which goes through our entire document collection and looks at things like: is the word bank being used here as a verb or as a noun?
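The TFIDF weighting mentioned here can be computed by hand over a toy corpus. This sketch uses one common variant of the formula, tf × log(N/df); real implementations differ in smoothing and normalisation, and the documents are invented for illustration:

```python
import math

# Tiny hypothetical corpus of pre-tokenised documents.
docs = [
    ["he", "likes", "to", "wink"],
    ["he", "likes", "to", "drink"],
    ["ink", "in", "a", "bottle"],
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)                          # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # document frequency
    return tf * math.log(N / df)

# "he" appears in 2 of 3 documents, so it gets a low weight;
# "wink" appears in only 1, so it gets a higher weight.
print(tfidf("he", docs[0]), tfidf("wink", docs[0]))
```

Swapping these values into the term-document matrix in place of raw counts gives exactly the TFIDF representation the lecture describes.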
Is the word screen being used as a verb or as a noun? We can even represent the documents as syntax trees and then find some way of representing that as a feature. Or we can use the output of things like LDA as features. LDA gave us a topic-document matrix which tells us to what extent each document contains each topic. So document one could be represented as 20% topic one and 30% topic two, and that could be our feature. In web applications we very often look at links: things like which web pages a page links to, and which web pages link to a page. So if you’re classifying a webpage, hyperlinks very often end up being used to derive features. Also which terms are being linked: when someone links to this web page, what terms do people tend to use in the hyperlink text? And there are a number of non-textual features like you suggested: average document, sentence, or word length, or things like what percentage of words start with an uppercase or lowercase letter, or what percentage of the text is links, hashtags, or emojis. So this whole area is never a straightforward thing. You always have to make all these decisions about the design of your features: which features do you extract, how do you extract them, and what preprocessing do you apply? Because we have features based on words, the preprocessing you do matters a lot. If you do case folding, and you turn Really with a capital R, or REALLY in all caps, into all lowercase, then suddenly what was previously 3 different features is now 1 feature. You’ve reduced the feature space, you’ve reduced the number of features, which might be good for computational reasons, but you’ve also lost some information. Do you remove punctuation or do you keep it in? How do you turn it into features? How exactly do you tokenize?
How do you tokenize things like hashtags, for example? A lot of tokenizers that were written without tweets in mind tend to either strip hashtags out entirely or split the symbol off into a separate token. But if you’re looking at Twitter data, you might want to keep the hashtag symbol as part of the hashtag and treat it as a token separate from the bare word. Stopword removal, stemming: all of these are design choices you have to make. All the things you know from previous lectures in IR apply equally to text classification as things to do or not to do; try varying them a bit and think about the design: what’s your text, what’s the thing you’re trying to measure, what makes sense in this context? Other features are things like whether a word starts with a capital letter or is in all caps, so instead of just deciding whether to case fold or not, you can have features like what percentage of the words are in all caps. You can come up with ways of dealing with repeated characters, which are common on social media, like repeated exclamation marks: do you want to take that into account, or do you just want to record that an exclamation mark is present? And you can use dictionary scores as well, scores from lexicons. So there’s a whole bunch of things you can try, and a lot of room for experimentation here. We can end up with a lot of distinct features, something on the order of 10 to the power of 6 or 9, because once we’ve defined, say, bigram and trigram features, there are so many potential combinations of words that can follow one another. So now we have so many features that we often need to reduce the feature space somehow and identify the most relevant features.
Rather than just letting the classification algorithm do that; the learning algorithm might take ages, might never actually finish its task, might not converge. So we often do a step beforehand that reduces the number of features using some heuristic. For example, something that finds the top K representative features for each class, takes the union over all the classes, and cuts off everything that’s not in the top K. So how do we decide what the most important features are? We can use a number of different functions to identify the most relevant features. For example, document frequency: we look at things like what percentage of documents in a class contain a term. Or mutual information, which I was talking about in the last lecture, where we look at how much we learn from the presence or absence of a term about the class it’s in. We could do the mutual information calculations, which are relatively computationally efficient, for all classes and all terms, and then say we’re only interested in the top 1,000 or top 10,000 terms that are most relevant across the classes. This is a common way of doing feature selection in text classification. The same goes for chi-squared, and there’s a whole bunch of other algorithms for doing this. I’m not, by the way, expecting you to remember any of these; definitely don’t learn the slide by heart for the exam, please. I’m just putting it here as an example of the sorts of things people tend to do. As a more general point for the exam: whenever there’s an equation that looks like this, we would give it to you; we wouldn’t expect you to know it by heart.
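The top-K-per-class heuristic described above can be sketched with a deliberately simple scoring function: here, in-class document frequency stands in for mutual information or chi-squared (a simpler proxy, used purely to keep the sketch short). The labelled documents are made up:

```python
# Hypothetical labelled corpus: (text, class) pairs.
docs = [("mortgage rates fall", "finance"),
        ("bank raises mortgage costs", "finance"),
        ("team wins final match", "sport"),
        ("striker scores in final", "sport")]
K = 2

def top_k(cls):
    """Top K terms for a class, scored by in-class document frequency."""
    in_class = [set(text.split()) for text, label in docs if label == cls]
    terms = set().union(*in_class)
    df = {t: sum(t in d for d in in_class) for t in terms}
    # Sort by descending document frequency, break ties alphabetically.
    return sorted(df, key=lambda t: (-df[t], t))[:K]

# Union of per-class top-K lists = the reduced feature set.
selected = set(top_k("finance")) | set(top_k("sport"))
print(selected)
```

Swapping the `df` scoring for mutual information or chi-squared scores gives the selection schemes the lecture actually names; the surrounding top-K-and-union machinery stays the same.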
So things like precision and recall you’d be expected to know, but anything like chi-squared, mutual information, the NGL coefficient, or relevancy scores are the sorts of things where, if we did require you to calculate them, we would give you the equation. Finally, feature weighting is a way of deciding how much importance a feature gets, or rather how we even measure the feature. I had the table earlier with the term frequency; alternatively you can use binary presence or absence of a term, or TFIDF like you mentioned, and there are a number of supervised measures of feature importance as well: different ways of quantifying exactly what role a term plays in a document. And, like I mentioned, you can then use measures like cosine similarity to calculate the similarity between two vectors. Most of what I’ve said so far about extracting features from documents also applies to unsupervised learning from texts, things like document clustering. If you don’t have labels, then instead of training a text classifier, you could use the same features and input them into a clustering algorithm. You could say: I don’t know what the classes are, but I’m going to use the same way of representing the documents with features, and use a k-means or hierarchical clustering algorithm to cluster the documents, to group them into categories of similar documents. Because all most clustering algorithms really need is a way of quantifying similarity between two data points, a way of quantifying how similar two texts are.
So as soon as you’ve completed these steps, the feature extraction, selection, and weighting, you’ve got each document represented as a fixed-length numerical vector. It doesn’t matter how long your document was, your vectors are all the same length, and you can calculate similarity between them, and that’s what you need for classification and clustering. Finally, there are a number of different ways of actually training the classifier for binary classification. These are things like support vector machines, random forests, naive Bayes methods, and logistic regression, and the way we’ll be doing this in the lab is with the scikit-learn package, which gives us a very similar API for each of them, so it’s often just a matter of swapping out one function call. We’re not going to go into the details of how any of these algorithms specifically work. There just isn’t the time for all of them. But if you’re interested in a specific one, I would encourage you to read up about it, because it often tells you a lot about what the parameters mean. Similar to LDA, where we did go into some detail and from looking at the equations we realised, OK, this is what the alpha parameter means. So it’s really worth it, if you’re spending a lot of time tinkering with support vector machines, to read up a bit about how they work, for example to understand better how the hyperparameter C affects the results. And there’s this thing called the no free lunch principle, which means there isn’t really one learning algorithm that outperforms all the others in all contexts. Some work better with high-dimensional data and some work less well; some work better with certain types of text data, some generally work better with text data and worse with numerical data. So it’s often a matter of trying out a few different ones.
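The "swap one function call" workflow with scikit-learn (the lab package named above) can be sketched like this. The tiny corpus is made up, and this assumes scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical labelled documents.
texts = ["mortgage rates fall", "bank cuts loan costs",
         "team wins the final", "striker scores twice"]
labels = ["finance", "finance", "sport", "sport"]

# Bag-of-words feature matrix: one row per document, one column per term.
X = CountVectorizer().fit_transform(texts)

# The same fit/predict API works for each model, so comparing
# classifiers is mostly a matter of swapping the constructor.
for model in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    clf = model.fit(X, labels)
    print(type(model).__name__, clf.predict(X))
```

In practice you would of course evaluate on held-out data rather than the training set; this only shows the interchangeable API.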
Then for multi-class classification, some learning algorithms can do this directly, and for others you need to transform the task into a set of binary classification problems. And when we have multi-label classification, we generally always transform it into a set of binary classification problems. So it really depends on the algorithm, whether it can directly deal with multiple classes or whether you need to transform the data in some way. So, let’s move on, and skip forward a few years. So far we’ve represented documents like this, and we’ll continue to for a while: documents as rows in a matrix, individual terms as columns, and the numbers in the cells were the term frequencies. Or sometimes they’re binary, just a 0 or 1: imagine all the numbers greater than 1 here replaced with 1, and now you’ve got binary presence or absence of the word in the document. Or you could have TFIDF like you suggested. In general these representations are sparse, like I mentioned, most values are zero, and they capture semantics only sort of by accident, because two terms with similar vectors are ones that appear in the same documents, and two documents with similar vectors are ones that have the same terms appearing in them. This is useful to some extent, but it’s not perfect, because it doesn’t capture quite a lot of the nuances of language. It doesn’t capture dependencies between words, things like this word tends to appear before that word, because we’re throwing them all together and mixing them, which is why it’s called a bag of words model. And it doesn’t capture things like two words having really similar meanings, where it’s interchangeable which word you use, or two words from the same domain being somewhat more related to each other.
We can sometimes get at this if we have a large enough data set that we’re calculating these matrices from, but it remains a real limitation. This is the problem that word embeddings can solve. The idea is they look like this: we have many, many dimensions. A typical number would be something like 300; I’ve just put 3 here for each word, but you can imagine there are actually 300 real numbers. They are dense, so all entries are non-zero, and they capture the semantics a bit better, so similar words should have similar vectors. Words that are almost completely interchangeable, where in any sentence where you use one word you could use the other, should have almost the same vector representation. Words that are roughly from a similar domain, for example words about sports, should be represented in a similar area of the vector space. You can even do algebra on them: a typical example is that king should relate to queen the way that man relates to woman. So if you took the vector for king, subtracted the vector for man, and added the vector for woman, what you should end up with is something like the vector for queen. And this is neat because it allows you to do a whole lot of operations that you weren’t previously able to do. So how are these learned? They’re obtained through self-supervised learning, with the goal of learning the relationships between words. The task that we’re optimising for here, that determines the loss function, is predicting a word. So you take a large corpus of documents and you predict one word based on the context words that surround it.
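The king − man + woman ≈ queen analogy can be demonstrated with toy 3-dimensional vectors. These vectors are invented for illustration (real embeddings have ~300 learned dimensions, not hand-written ones):

```python
import math

# Hand-crafted toy "embeddings" — NOT real learned vectors.
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]
def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a*a for a in u)) * math.sqrt(sum(b*b for b in v)))

# king - man + woman should land closest to queen in this toy space.
target = add(sub(vec["king"], vec["man"]), vec["woman"])
nearest = max((w for w in vec if w != "king"), key=lambda w: cos(vec[w], target))
print(nearest)
```

With real pre-trained embeddings the same arithmetic and nearest-neighbour search is what produces the famous analogy results.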
For example, actually here on the slide, you could imagine the word through isn’t there: we just see obtained and then self-supervised learning. What are words that could fit there? It’s probably going to be through, it could be via, it could be from, but it’s not going to be queen, because obtained queen self-supervised learning is not a meaningful sentence. So you predict the centre word given the surrounding context, or the other way around, you predict the context words given the target word. These are the two main ways of doing this. It’s called self-supervised learning because mathematically it’s a bit like supervised learning, in that you have the target, you know what the word is, but you don’t have to have labelled data, and this is great because you don’t need data that’s been labelled by experts. That’s the really costly part: the experts are journalists who’ve gone through your newspaper articles and said, this is sports, this is entertainment. Getting unlabeled data is the relatively easy, cheap part. So you can take the entirety of all books that have ever been digitised, copyright issues aside, or the entirety of all web pages that have ever been created, copyright issues aside, and you could learn your word embeddings based on this, and this is what people have done. Lots of organisations like Google have created word embeddings, like word2vec, that we can download and use, and they’ve been trained on huge data sets, which is great. So they exist based on this huge data, which we call pre-training data. They were trained on the pre-training data without a specific application in mind, without knowing whether I’m going to be using them for sentiment analysis and you’re going to be using them for identifying the topic of a newspaper article.
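The two prediction directions described above (predict the centre word from its context, or the context from the centre word) both start from the same self-supervised training pairs. A sketch of generating skip-gram-style (target, context) pairs, using the slide's phrase and an assumed window size of 1:

```python
# Sentence from the slide text; window size is an illustrative choice.
tokens = "obtained through self-supervised learning".split()
window = 1

pairs = []
for i, target in enumerate(tokens):
    # Every word within `window` positions of the target is a context word.
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs)
# e.g. ("through", "obtained") says the model should learn to predict
# "obtained" as a neighbour of "through" — no human labels needed.
```

Flipping each pair gives the CBOW-style direction (context predicts target); the "labels" fall straight out of the raw text, which is what makes the setup self-supervised.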
So we now have two data sets: this big unlabeled pre-training data set, and then my labelled data set that I’m going to be using for the classification problem. On the big pre-training data set, someone, maybe me, maybe someone else, has trained the pre-trained word embeddings. I download them, and then I can use them to represent my documents, much in the same way as before, but with different vectors that have been trained on a much, much larger data set. This helps in a number of ways, for example with out-of-vocabulary problems. Before, I was relying on my own labelled data, and I had the problem that words that didn’t appear in my data set, I just didn’t know what they meant. And if words appeared very rarely, my metric for similarity didn’t mean much; if a word only appears once, that doesn’t really give you a useful cosine similarity. But now we have these much denser embeddings, so they capture the nuances of language much more richly. And finally, the latest generation is pre-trained language models. These are also obtained through self-supervised learning. The task is kind of similar: predicting the next word, for example, in the sentence the queen of, what’s the next token? It’s more likely to be England than keyboard. Or predicting a masked word, like the something of England, more likely to be Queen than keyboard. This can also be done on huge unlabeled pre-training corpora, and then used to derive word embeddings. In this case, contextualised word embeddings, which tell us something not just about the word, so we don’t just get a vector for the word, but a vector for the word given its surrounding context. So things like the bank of the River Tay will have a different representation than I’m taking my money to the bank.
Which is useful, because now we can distinguish between different meanings of the same word. And this is really the basis of many state-of-the-art text classifiers. It basically turns out that if you train a language model for the task of next-token prediction, or masked language modelling, or a number of related tasks, on a huge data set, a task that’s completely unrelated to the classification task you’re actually interested in, it’s really just predicting the next word, but you just throw a tonne of data and a tonne of compute at it, it actually learns enough about the relationships between words, just from which ones appear together in documents and which ones tend to appear one after the other, that we can use these contextualised word embeddings in our text classification task to predict which class a document belongs to. So the way this pipeline looks now is: we’ve still got our unlabeled documents, like we had with word embeddings, and we train a language model, or more often than not someone else trains it, a group of researchers or a company; they publish it or send it to us and we download it. We use it and we only fine-tune it. So instead of having all the weights initially assigned randomly and training from scratch like we did with the SVMs, we now take this existing model, which has been trained on millions of documents, and we take a few thousand documents and just fine-tune the weights. In the grand scheme of things we’re not changing the weights of the model all that much; we’re just teaching the model a little bit about how to adjust the weights a tiny bit to get the classification just right.
Often the model can already do a bit of the classification task without any fine-tuning, but it’s by adding that little bit of extra data that we get it to be a bit better. And finally, the most recent thing that’s emerged is zero-shot classification, where we completely skip the fine-tuning step; we only have the pre-training step, or perhaps a fine-tuning step on a number of general tasks, but we skip the step of fine-tuning on the specific task we’re trying to solve. We directly use language models for classification without giving training examples. This has really only become feasible since roughly the time ChatGPT was released, that generation of models, or maybe the previous generation like GPT-3. It’s only since then that models have got good enough to do classification without any fine-tuning, and they’re still not always as good: sometimes these models without any fine-tuning are already pretty good, sometimes they’re not; it really depends on your text classification task. And that’s why I was saying there’s still a place for all the things we’ve talked about before; they all have their advantages and disadvantages, and it’s really useful to know about all of them and how they relate to each other. What we typically do, because it’s so easy now, is use this as a baseline for performance comparisons: we want to know how good this quick way of asking an existing language model to classify is. It requires virtually no human effort, and we can automate it using APIs. I’ve put a screenshot here of using it in the UI, but of course typically we’d automate it with the API. And it has obvious downsides: we’re depending, in this case, on a commercial black-box model that we aren’t really able to inspect.
And we can’t really do much customisation or error analysis. But this is of course an active research area. So zero-shot classification, in the same visual language as before, is really just: the pre-trained language model is directly used to classify the documents. There’s another variation of this, few-shot classification, where we actually give a few examples in the prompt. We’re still skipping the task-specific fine-tuning, but we’re giving specific examples alongside the task in the prompt, and often this can be quite a helpful trick. We’re still using this relatively cheap approach; in terms of the actual effort I need to go through to classify a single example, it’s relatively low, so in that sense it’s cheap. It’s certainly not cheap if you want to classify a million examples, because it takes a tonne of compute, and if you’re using a commercial model, it costs a lot of money. But it’s a quick way of sense-checking how good this classifier is compared with something like a fine-tuned one. It turns out that often, if you give a handful of examples in the prompt, it actually helps explain the task to the model. So here’s an example of stance detection. If you just do simple sentiment analysis, which you can do quite easily zero-shot, it might struggle with examples like: renewable energy is amazing, but I don’t have strong feelings about a carbon tax. So we make it clear here that we’re really only interested in the opinion on the carbon tax, not renewable energy in general; we’re specifically interested in the stance on the carbon tax. Here’s an example that supports it, and the example I’ve just read out is neutral. If you give these examples in the prompt, it often works a bit better than the pure zero-shot approach.
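Constructing a few-shot prompt of the kind described above is just string assembly. This sketch paraphrases the lecture's carbon-tax stance example; the exact wording and labels ("supports", "neutral") are assumptions, not the actual slide prompt:

```python
# Hypothetical labelled examples for few-shot stance detection.
examples = [
    ("We urgently need a carbon tax to cut emissions.", "supports"),
    ("Renewable energy is amazing, but I don't have strong feelings "
     "about a carbon tax.", "neutral"),
]

def build_prompt(text):
    """Assemble a task description, the labelled examples, and the query."""
    lines = ["Classify the stance of each text towards a carbon tax."]
    for ex_text, ex_label in examples:
        lines.append(f"Text: {ex_text}\nStance: {ex_label}")
    lines.append(f"Text: {text}\nStance:")   # model completes the last label
    return "\n\n".join(lines)

prompt = build_prompt("A carbon tax would wreck the economy.")
print(prompt)
```

The resulting string would be sent to a language-model API; the in-prompt examples are what distinguishes few-shot from the zero-shot setup, where only the task description and the query are sent.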
So we’re leveraging the model’s ability at in-context learning: learning the task from the prompt, adapting to it from the examples in the prompt. Great, can you think of some reasons when this is an appropriate approach, and some examples when it’s not? When would I, or would I not, want to do this? If you’ve got a lot of classes, it might struggle if it hasn’t got an example for each class. Yeah, so if I have a tonne of classes, I’d probably have to write an extremely long prompt. Absolutely, it might not be feasible. Yes, and this applies to both few-shot and zero-shot classification: certainly with commercial models that I access through an API, I’d be very wary of submitting anything sensitive, because I’m literally giving that data to another company. Anything that’s commercially sensitive to my own company, I’d at least want a contract in place with the provider that governs exactly what they’re allowed to do with the data and what they’re not. The same goes for data like health data, patient records, or, if you’re working in defence, any sort of sensitive data really; you don’t want to hand it over to a commercial enterprise without being absolutely sure exactly what they’re going to do with it. Even if you use models that you can run on premises, like the Llama models that you can actually run locally on your own GPUs, now suddenly you’ve got the issue that you need a tonne of hardware; you need to have the hardware yourself, you need system administrators who take care of it, and so on. And even if you make that massive investment in building that hardware, you still have the issue that it’s expensive to maintain. The amount of compute required by this approach is huge.
So it’s kind of quick and dirty, easy in a way, to just put in one example. If I just want to know how well it works, that’s a useful thing to do: I put in 100 examples and I measure how well it works. But if I actually have a dataset of a billion texts that I want to classify, it would take a long time to do this through an API and be very expensive, and it would also take a lot of compute to do it locally with a model I can run myself. So there’s a trade-off here, where sometimes these models perform reasonably well despite not being fine-tuned for the specific task, but they are very expensive to run at scale. Let’s take a break here and continue in 3 minutes. One more thing I want to say about these different techniques that I’ve discussed here, things like word embeddings and language models used to derive contextualised word embeddings: they can also be used with topic modelling, they can also be used with information retrieval, and there are different ways of doing that. In principle, it’s just about different ways of representing text numerically, so there are different applications for this. For example, if you’re finding you’re not getting very good results from LDA, there are topic models that are based on language models, and that might help give you a more accurate assessment of which topics appear in which document. And hopefully this lecture has given you a sense of why this is the case: because such models can take into account these relationships between words much better than LDA could. 
So, finally now, OK, we’ve got our text classifier. We’ve tried a whole bunch of different models, spending potentially a considerable amount of time implementing and trying out different ones, and we maybe already have an initial sense just from reading examples of how well our classifiers work. But how do we tell systematically which ones work well? We’ll talk about evaluation. We’ll talk about effectiveness, which broadly means things like accuracy, precision, recall, and F1, which you know from the context of information retrieval tasks and which we’ll talk about in the context of classification tasks. But there’s also efficiency. There’s the question of how quick the model is to train: for example, an SVM with a linear kernel is an extremely fast way of learning weights, whereas neural networks tend to be much slower. How quick is it to classify? There are examples like k-nearest neighbours that are typically quite slow at classification. And things like how quick the feature extraction is. This tells us how efficient the approach is, so sometimes we might actually go for a model that’s a bit faster, even though it’s maybe slightly less accurate. And finally, one way to gauge how well we’re doing is baselines. Take the example where a hypothetical naive classifier achieves 90% accuracy just because 90% of documents are in one class: if you have a query and 90% of the documents are irrelevant and 10% relevant, then a classifier that labels every document as irrelevant is completely useless, but gets 90% accuracy. So we really want to calculate some metrics that help us explain what is reasonable to expect and what these different numbers mean. So let’s start with baselines to help put things into context. 
These are the standard methods for creating baselines in text classification to compare your classifier with. The most popular and simplest baselines are things like random classification. If you assign classes randomly, you can then ask: how much better is the classifier doing than random? You’re not actually doing this because you think it’s going to be a useful classifier; you’re just doing this to get a reference number. So if random classification already gives you 80% accuracy, you know the range you’re aiming for is definitely higher than 80, somewhere closer to 100. Whereas if the random classifier, because of the number of classes you’ve got or the way they’re distributed, gives you 2%, then you know the range you’re reasonably looking at for your own work is somewhere between 2 and 100. You can use the majority class baseline, like in the relevant/irrelevant example where 90% of documents were irrelevant: you just assign all examples to the class that appears the most, and then you can ask yourself how much better you’re doing than if you always pick the same output regardless of the input. Or you could use a very simple-to-implement algorithm, like bag of words with a linear kernel SVM. So when you want to compare your own features that you’ve created yourself against something existing, you pick that existing thing to be something really simple. This way you know what the simple approach gets you, and how much better you can do with the effort you’re putting on top of that. And yeah, these days in 2025, often we use LLMs in a zero-shot way, for example GPT-4o, 
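The random and majority-class baselines just described are simple enough to sketch in a few lines of plain Python. The 90/10 relevant/irrelevant split below is the lecture’s own example; the function names are just for illustration:

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_labels):
    # Always predict the class that appears most often in the training data.
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

def random_baseline(classes, test_labels, seed=0):
    # Assign classes uniformly at random, just to anchor the metric's range.
    rng = random.Random(seed)
    preds = [rng.choice(classes) for _ in test_labels]
    return sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)

# The 90% irrelevant / 10% relevant example from the lecture:
train = ["irrelevant"] * 90 + ["relevant"] * 10
test = ["irrelevant"] * 9 + ["relevant"] * 1
print(majority_baseline(train, test))  # 0.9 — high accuracy, useless classifier
```

Neither baseline is a real classifier; they exist only to tell you what score a trivial strategy already achieves on your data.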
as a baseline. And the weird, or slightly paradoxical, thing about using this as a baseline is that often it actually tends to do pretty well, right? It tends to be pretty good at certain types of tasks; some tasks it can’t do well at all, others it can do fairly well. So yeah, some of the best performing models these days are ones that, when you look at just how much effort it takes me right now to do the classification for one sample, are relatively easy and quick to use. Often it’s the cloud-hosted commercial models that perform a bit better than the best open-weights or open-source models. And then there are these considerations that I talked a little bit about before the break, things like the price: how much does it actually cost? Suppose you put this in and you find out it actually works pretty well, OK. It works in 97% of cases, and that’s a pretty good number because my majority class baseline was just 50, so this is pretty good. Do I use it at scale? You have these considerations: can I use it at scale? Am I legally allowed to, because the data that I have is really sensitive or something like that? What’s the price of using it at scale? So there may be situations where, even if this is pretty good, you then might want to treat it almost like a ceiling. Maybe you can’t beat it, but maybe you can approximate it with a model that you create yourself and that you know works reasonably well, but for a fraction of the cost and with better privacy. So we’re now in this slightly paradoxical situation where sometimes this simple baseline is one that actually works pretty well, and then we’re trying to see how close we can get to it, rather than just trying to see whether we can do much better than it. 
To give you an example of how evaluation works with a binary classifier, how do we actually do these evaluations? The balls here represent the examples in the test set, and the colour is the class: it represents the true class that we know from the human annotations. The bins represent the predicted classes, so we want our classifiers to essentially put the balls into the right bins. A classifier is a function for assigning the data items to classes, so here it’s a function that puts our balls into different bins. And accuracy is just asking, basically, how many of the examples we’ve correctly classified. So here we’ve got 10 examples, and of the 5 red ones, 4 have been classified correctly, and all 5 of the green ones have been classified correctly. So we could say the accuracy here is 4 plus 5, divided by 10, so it’s 90%. And now here’s another case, where the accuracy is 70%. Let’s compare this with another system that gives this classification result. So System 2 here: we’re dealing with the same number of balls overall, and it has exactly the same accuracy, which is 70%. But it’s not exactly the same thing, is it? Because what if we care more about the red class? It’s fairly common that we care more about the rare class. Information retrieval: given a query, documents are either relevant or irrelevant. Most documents are going to be irrelevant; we care about the ones that are relevant. Content moderation: tweets are either against the policy, hateful for example, and need to be removed, or they’re not. The vast majority of tweets are going to be OK, a handful are going to be bad, and those are the ones we want to remove, so we care more about the rare class. So in a lot of classification situations, we care more about the rare class than the more common class. So what if the red class is the rare class, but we’ve actually completely misclassified both examples in the rare class? 
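The balls-and-bins accuracy calculation above can be written as a one-liner; the toy labels below reproduce the slide’s counts (5 red and 5 green examples, with one red misclassified):

```python
def accuracy(y_true, y_pred):
    # Fraction of examples where the predicted class matches the true class.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 5 red and 5 green balls; 4 of the reds and all 5 greens classified correctly.
y_true = ["red"] * 5 + ["green"] * 5
y_pred = ["red"] * 4 + ["green"] + ["green"] * 5
print(accuracy(y_true, y_pred))  # (4 + 5) / 10 = 0.9
```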
So arguably this classifier, System 2, is much worse than System 1, and this is an illustration of why accuracy is a pretty poor metric for evaluating your classifiers, especially when you’ve got imbalanced classes, that is, when one class is much more common than the other. So let’s look at precision and recall; that’s maybe a bit better. System 2 actually has a precision of 0 and a recall of 0, whereas System 1 at least gets a precision of 1/3 and a recall of 0.5. So we can now calculate the F1 score; it’s very similar to when we were looking at precision and recall in IR. And of course an F1 score of 0 is rubbish, so we know System 2 isn’t any good. So now let’s look at the multi-class case. We can still compute accuracy in the same way as before, and this is fine when the classes are balanced, but once we have imbalanced classes, we now know it makes a lot more sense to look at precision, recall, and F1 for the classes separately. So in this example, the model is really struggling with one of the classes, the blue class. And you might go and think about what kind of features would be helpful for deciding whether or not a sample is in the blue class, or you might go and ask which language model or large language model has had lots of examples from the blue class in its training data, so it should generally be able to tell what the typical features of the class are. But if you’ve tried a few different models and you just saw the accuracy, you might not necessarily know that. Now we can compute the F1 score for each of the classes: we can measure precision for each of the classes and recall for each of the classes, and then we can also average them. So we can look at things like the macro F1 score, where we take the F1 scores from the three classes, sum them, and divide by 3. 
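Here is a small sketch of per-class precision, recall, and F1. The toy test set below is an assumption chosen so that it reproduces the lecture’s numbers: two rare red examples among ten, both systems at 70% accuracy, System 1 with precision 1/3 and recall 0.5 on red, and System 2 missing red entirely:

```python
def prf1(y_true, y_pred, positive):
    # Precision, recall, and F1 for one class treated as "positive".
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["red", "red"] + ["green"] * 8
# System 1: gets one red right, but also calls two greens red (3 errors, 70% acc).
sys1 = ["red", "green", "red", "red"] + ["green"] * 6
# System 2: same 70% accuracy, but misses both rare reds entirely.
sys2 = ["green", "green", "red"] + ["green"] * 7
print(prf1(y_true, sys1, "red"))  # precision 1/3, recall 0.5, F1 0.4
print(prf1(y_true, sys2, "red"))  # precision 0, recall 0, F1 0
```

Accuracy alone would rank the two systems as equal; the per-class scores expose that System 2 is useless on the class we care about.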
So this is typically a good measure when we have 3 classes and they’re all roughly equally important: we just calculate the F1 score per class and average over the 3 classes. But we might have other situations where we want to weight some of the F1 scores more than others. A typical other way of doing it, which can lead to very different results and very different decisions about which classifier is best, is to weight by the number of samples. With macro F1 we’re giving each class the same amount of room in our evaluation, so if one class is really rare, the performance on this class has a ton of influence on our evaluation metric, and so on which of our models we think is the best. This is fine if we really care about the rare class, which we often do, but sometimes we don’t; maybe all samples are equally important. So weighting the F1 score by the number of samples in each class can be a useful alternative measure, and it’s really a matter of weighing up, because a different evaluation metric will often give you a different model as the best model. Let’s look at what the majority class baseline would look like in this case. Here we can look at our data and say what the most common class is. We assign the red class to all examples and we calculate these values: accuracy is 80%, but macro F1 is only 0.293. And this is just another illustration of how F1 is in some ways a better measure than accuracy. So whenever you see accuracy being used, be sceptical and ask these questions about imbalance. Generally I’d say use macro F1; in binary classification, use it when the two classes are both important, and if not, often we just look at the F1 score of the rare class. When you really want to know how well your model is doing and better understand it, another thing we often do is calculate a confusion matrix to look into more detail. 
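The contrast between macro-averaged and sample-weighted F1 can be sketched as below. The 8/1/1 class distribution is an assumption for illustration (the lecture’s exact dataset may differ slightly, which is why the macro F1 here comes out near, not exactly at, the quoted 0.293):

```python
def class_f1(y_true, y_pred, cls):
    # F1 for a single class.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred, classes):
    # Every class gets equal weight, however rare it is.
    return sum(class_f1(y_true, y_pred, c) for c in classes) / len(classes)

def weighted_f1(y_true, y_pred, classes):
    # Each class's F1 is weighted by its share of the samples.
    n = len(y_true)
    return sum(class_f1(y_true, y_pred, c) * y_true.count(c) / n for c in classes)

# Majority-class baseline on an (assumed) 8 red / 1 green / 1 blue test set:
y_true = ["red"] * 8 + ["green", "blue"]
y_pred = ["red"] * 10                       # always predict the majority class
classes = ["red", "green", "blue"]
print(macro_f1(y_true, y_pred, classes))     # about 0.30, despite 80% accuracy
print(weighted_f1(y_true, y_pred, classes))  # about 0.71 — weighting hides the failure
```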
So these are so-called contingency tables, and they contrast the predicted class versus the actual class. You make a table, and you look at how often the model predicted, for example, green when in reality the class is blue. If there’s one ball here, you say: OK, the actual class is blue, while the predicted class is green. That kind of thing. Typically this can help us identify where the model is really struggling. If, for example, the model is really struggling with identifying the correct class of the blue examples, then it can actually make sense to go back to the training data and annotate more blue examples. Or, if what we’re doing is few-shot prompting with an LLM, modify the prompt to give it a better idea of what the blue class is about, what it represents. Or if it’s zero-shot, move to few-shot, that kind of thing. So looking at this contingency table often gives us an idea of what we should be doing next. Finally, a note that you should know about the training of models and the evaluation. The training, as I mentioned, is mathematically an optimisation problem. You’ve got a loss function that measures how close the predictions on your training data are to the actual labels on the training data, which you typically get from human annotations. So this is a mathematical optimisation problem where you’re minimising this loss function: you basically want to find the best parameters, the best weights for the model, that minimise it. So this is what you do on the training data. Now, you could just get the learning algorithm to optimise the parameters based on the training data and call it a day, but the problem is we tend to want the model to work well on new unseen data. 
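A confusion matrix of the kind just described can be built with a plain counter over (actual, predicted) pairs; the four toy examples below are invented for illustration:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    # Keys are (actual, predicted) pairs; values are counts.
    return Counter(zip(y_true, y_pred))

y_true = ["blue", "blue", "green", "red"]
y_pred = ["green", "blue", "green", "red"]
cm = confusion_matrix(y_true, y_pred)
print(cm[("blue", "green")])  # 1 — one blue ball put into the green bin
```

Rows of this table (one actual class, all predicted classes) show exactly where a model confuses one class for another, which is what guides the "annotate more blue examples" decision.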
We’re not just happy with a model that works well on our training data. The whole point was that we have this huge unlabelled dataset and then a small amount of it that’s been hand-annotated at great expense; it was too expensive to annotate the whole dataset. So we want a model that works well on the unseen data. And if we want a model that works well on unseen data, we need to be able to know what its performance is on unseen data. So we don’t tend to calculate our evaluation metrics, precision, recall, F1, on the training data; they’d look really good, because we’ve just trained the model to predict exactly that data. We calculate the loss on the training data, we minimise it, and that’s our model. But if we then want to know how good our model actually is, we have to use it on unseen data. Now we’ve got a problem, because we’ve just used all of our labelled data for training; we don’t have any more unseen data to evaluate on. So let’s not do that, OK? We don’t use 100% of our labelled data for training; we use something like 80%. We keep 20% of our labelled data, we don’t use it in training, we don’t ever even let the learning algorithm see it, and we keep it only for the purpose of evaluation. We call this the test set. We use it only to test the performance of the trained classifier on unseen data, 20% of the data. So train on 80, test on 20. Sometimes this is all there is to it, but sometimes there’s more. There’s hyperparameter optimisation: most classifiers have some form of hyperparameter to be optimised, like the C parameter in SVMs, or the r and d parameters for nonlinear kernels, or the decision threshold, or whatever. And also we might try different models: we try SVMs versus random forests versus a fine-tuned language model like RoBERTa versus GPT-4o zero-shot. So now we’ve tried tons of different things. 
And every time we’ve trained, we’ve then evaluated on the test data, and we see how well it does on the test data. So we have this long table of 100 different things we’ve tried, and every time we know how well it does on the test data. There’s a risk here again that we overfit to the test data. Just from comparing how well 100 different things work on the test data, there’s a risk that we then say, OK, this one worked best on the test data, let’s roll with that. But it might just be random: we’ve tried out so many different things that we don’t actually know how well it works on unseen data. We’ve kind of used up that data now. So what we actually do is split the data into three parts: something like 80% training data; 10% development data, which serves as the test set for calculating precision, recall, and F1 while we are actively developing the model; and then an actual final test set that we don’t even use for model selection, which we use only once at the very end to test the performance of the final classifier. Because if we were to optimise the hyperparameters, or even the selection of which model to use, on the final test data, that would be cheating: we wouldn’t actually end up with the best model. We want the model that performs really well on unseen data, so in order to do that, we have to keep that final test data stashed away somewhere where the model can’t access it, not look at it, and only use it at the very end. You’ll actually do exactly this in the coursework. So in a minute I’ll talk about coursework 2, and this is exactly the process that it will mirror. The test data is something we won’t release until a few days before the deadline. You can do all the work on the coursework on the training and development data. We’re giving you one dataset; you can split it yourself into training and development data, and you should, because otherwise you’re not going to get good results. 
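The three-way split just described can be sketched in a few lines. The 80/10/10 proportions match the lecture; the function name and the fixed seed are illustrative choices:

```python
import random

def train_dev_test_split(data, seed=0, train_frac=0.8, dev_frac=0.1):
    # Shuffle once, then carve off train / development / final test portions.
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]   # touched only once, at the very end
    return train, dev, test

train, dev, test = train_dev_test_split(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```

The development set is what you evaluate on while iterating over models and hyperparameters; the final test slice stays untouched until the single final evaluation.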
And then the test data gets released a few days before the end, and all you need to do is run your classifier on the test data and submit the report. So, in summary, we’ve seen the text classification task defined, and different variations of the task; different types of text classification, like by sentiment or topic; models and methods for doing text classification, from rule-based to pre-trained language models; and baselines and evaluation. And here are a couple of resources that can help you if you want to learn more. Any questions about this lecture before I move on to the description of coursework 2? No questions? If there are any questions, put them on Piazza. We’re here for you, we’ve got a bunch of TAs, and we will be answering your questions there. So, here’s coursework number 2, which was released a few days ago, and I just thought I’d take this opportunity in the next couple of minutes to give you a bit more detail about it. Coursework 2 has 3 parts. There’s an IR evaluation part built on Walid’s lectures on IR evaluation, and there are a text analysis part and a text classification part that build on my lectures from 2 weeks ago and today, respectively. And it’s due on the 28th of November, that’s a Friday, at noon. The coursework depends on these lectures on evaluation, comparing corpora, and text classification, and also on the lab from my lecture two weeks ago and the lab on text classification that will be released tomorrow. There’s a portion of the coursework that’s not supported by labs, so I’d encourage you to get started on it as soon as possible, and make sure you complete lab 7 on time. So in IR evaluation, you’ll be implementing a simple IR evaluation tool: you’ll get a results file that’s the output of an IR system, you will also get relevance assessments, and you need to compute these different metrics. I’ll upload the slides as well, by the way. You don’t have to actually build the IR tool. 
We assume it already exists; we just want to know how well it works. So your job is figuring out how well it works, helping us evaluate it, and you submit the output that you’ve computed. There’s of course a lot more detail on what you need to do on the course page. You also need to do significance testing between the systems to see if any of them are significantly better than the others. In the second part of the coursework, text analysis, you’ll be given 3 corpora: the Quran, the Old Testament of the Bible, and the New Testament. You’ll find which words are the most distinctive for each one, and you’ll do topic modelling on them to discover topics that are relevant to each of them. And you’ll present your analysis in the report. Please aim for high quality formatting and presentation so that it’ll be easy for the markers to see what you’ve done and what you learned, and you can think about presenting tables, data visualisations, and plots. The goal is really to answer the question: what can you tell us about the different corpora based on the analysis you did comparing them? And finally, the third part of the coursework is based on today’s set of lectures. So you’ll have a classifier that you’ll develop in the lab. I will share with you a Jupyter notebook that gives you the basics of how to build an SVM classifier, and you’ll apply it to the data that we’re giving you, and the goal is to improve it. The goal is essentially to determine which of the three corpora a sentence belongs to, and you’ll be expected to try to improve on the classifier, try a number of other classifiers, and see how much of an improvement you can get. But the main idea here is not getting the best scores. The main idea here is what you try and what you learn from it. 
So it’s more about the amount of work you put into it and trying different things and seeing, OK, this tends to work well, this tends to not work well, so I want to try this next. It’s not just about the overall scores. The marking isn’t just a direct function of the improvement in the classifier that you managed to get. Really, we look at the report: what you did, what you say about why you did it, and how well you discuss the results. There are more details on the marking on the webpage. So the actual deliverables to submit are: the file containing the IR evaluation scores, with a script for checking the format’s correct; a file with the classification results, again with a script for checking the format’s correct; your code; and a report, and 70% of all coursework marks are based on this report. So please make sure it’s well written, clear, and understandable, so we can see what you did and why you did it. You can use any code that you wrote yourself for labs, for example, and you can use any existing implementations; you don’t have to implement things from scratch. You can discuss with friends, but please don’t share code. The only exception is for the IR component, the first part of the coursework: please don’t use any existing libraries or implementations of the evaluation measures. I mean, the evaluation measures in IR are really not the most complicated part to implement. What we obviously don’t expect you to do is implement the code for pre-training your own language model or something like that. So you can totally use existing language models, and we encourage you to experiment. Feel free to experiment with different things for the classification. You’re getting the data; your job is to build something that works well and to be able to explain why you did what you did. 
So we’re really encouraging you to experiment. You’re very welcome to discuss your ideas on Piazza and ask questions about it. Just don’t share the actual code, and where there’s an answer, don’t ask questions that could reveal the answer or indicate your results. There will be penalties if you share any indication of results publicly or if you submit results in an incorrect format, and that’s what we have the format checking scripts for. Quick recap on the extensions policy: for this, just like for coursework 1, our school has two different policies for coursework extensions, Rule 1 and Rule 2, and we follow Rule 1 for this coursework. This means that basically, if you submit late without a valid excuse, we have to take off 5% of the mark. It’s not actually us; we give the ITO the actual marks for how well you did, and they apply this automatically based on when you submit. So if you submit late, you’ll get penalties, unless you have an extension, or unless you have what’s called extra time adjustments, which is something you only get if you’re registered with the disability and learning support service, and which you can combine with a regular extension. So it’s a bit complicated: when exactly you count as late, exactly how much of a penalty you get if you submit late, and what the maximum late submission is. It’s all clearly laid out on these web pages. If it applies to you, please have a look, because we don’t want anyone to end up in a situation where you get a mark of 0. Having said that, life happens. Sometimes unforeseen circumstances arise, like an interview or an emergency situation; that’s what extensions are for. But if there’s something of really great consequence for you personally, like the loss of a family member, 
or a very significant illness that is sudden and unpredictable, or a catastrophic technical failure, that’s the sort of scenario where you need to get in touch with the people behind exceptional circumstances, because they can do even more, like disregarding the penalty for late submission. Unfortunately, as the course team, we don’t have any say in this. We sometimes get requests where someone says to us: this has happened in my life, can I have a longer extension or something like this? It’s not within our control; it’s not something we can grant. So please check these web pages and make sure you follow the right policy for the school. Timeline: the 28th of November is the coursework 2 submission deadline, noon UK time. The lecture next week is cancelled due to strikes; I’ll say a few more words about this in a minute. On the 26th of November we will announce coursework 3, and then in semester two there are no lectures in this course. It’s a year-long course, but all the lectures are in semester one, and you will be working in groups, supported by us, on coursework 3. It’s a very significant portion of the course. And finally, the exam happens at the end of semester two. Any questions about this? Yes. 4 to 6 people per group. And we are asking you to decide who you want to be in a group with. So what we’re going to do, and I’ll announce more details on the 26th, is basically: we’ll have a form where people who have found a group of 4 to 6 people can submit it and say, these are all of our student IDs, we’ll be working in a group. The deadline for this form will probably be roughly just before Christmas. So you can start even now chatting to your friends on the course about who wants to be in a group with you, and we will have two threads on Piazza that are specifically for finding people to be in a group with, because obviously not everyone knows everyone else on the course. 
So there are going to be two threads: one for individuals who want to find a group, and one for existing groups who want to find more members. So if you’ve got 3 or 4 people and you want to have up to 6, you can post on that thread to find more people. And we’ll give you some tips in the coursework 3 announcement as well on how to find the right people for the group. So you don’t have to worry about it yet, but if you do already have a group of 4 to 6 people, you can already discuss it if you want. Yeah, so I mentioned that next week’s lecture is cancelled because of strikes. This also means the content that we would have taught next week in the lecture is not going to appear in the exam. Because we did teach this course last year, we will make a video available to you: if you want to learn more, it’s completely optional, you can watch last year’s recording of the lecture, but we don’t expect you to, and again, it’s not going to be in the exam. And I just want to explain why we’re striking. Essentially, a whole bunch of university workers, including lecturers, librarians, technicians, and administrators, are taking part in industrial action, which means we don’t get any salary for the time period where we’re striking. So classes may be cancelled from Monday to Wednesday next week, and there will be pickets outside different buildings at the university. And the reason this is happening is essentially because there are cuts 
in the university budget that we think are not justified, and they impact you as well. Courses have less budget this year in Informatics for hiring teaching support than they did previously, so there are fewer people available as TAs and fewer people available to us as markers. Workloads are higher on those university staff who still have their jobs, and some of my colleagues are about to lose their jobs. This is something I’m deeply concerned about, and if you want to show your support, you can make sure you don’t come in on Monday, Tuesday or Wednesday next week. You can follow the Student-Staff Solidarity Network on Instagram if you like; they organise a number of events. You can complain to the university about classes being cancelled, of course that’s your right, and tell friends and family what’s happening. And if you happen to be free in 10 minutes, there’s a whole session with more details, an intro to strikes for students, that takes place at 5 p.m. today at 40 George Square. So if you haven’t got any plans right now and you want to know more about it, I’d encourage you to head over there to understand more about why staff are striking. So thank you. Thanks for coming.

Lecture 8:

SPEAKER 0
OK, so you have a question. Uh, it hasn’t been announced last week. OK, so it should be Bjorn announcing it, because we switched it, I think. So at least he should be announcing it. So what we can do, we can actually announce it tomorrow. We release the details tomorrow, OK? But it will be Bjorn presenting more about it, maybe next week. OK, fair enough. So thanks, and I’m sorry for this; it wasn’t planned earlier. OK, so we are back to some more about search engines, after you got a break. I think you started covering corpora last week, which hopefully was another interesting topic. So now we are going to talk, finally, about web search. We have been studying a lot about search engines; now, how does web search happen? But before the lecture, let’s have some points to discuss. The good news: there is no lab this week, so you can relax a little bit. You don’t have to implement anything, maybe for your group project, but not for this one. And the other thing is that coursework feedback will be sent soon, maybe tonight or tomorrow. How do you feel, did you do well? OK, OK, but actually I’ll give you a little teaser here. This is the distribution of the marks, and it’s pretty good. It’s really, really good. Most of you actually got over 70 and only 3 got less than 50, so everyone did well. So just be excited; hopefully you land on this side. Does it look good? OK, yeah, I think some of these would be 100, yeah.

SPEAKER 1
What does the state?

SPEAKER 0
I don't know. The markers know; you will find out. But I think some of you got penalised because of the formatting, so some of you actually had some misformatting, and you got a penalty of -1 or -2 marks, but otherwise, generally, you did well. And I think the majority got over 70%, so that should be fine. OK, so hopefully it will be sent. Actually, they were about to send it this morning. I said wait till I give the lecture and run away, then you can get your marks. OK. So the objective today is to learn about working with massive data, which is the web, and something called PageRank. Has anyone heard about PageRank before? OK, a few of you, that's nice. So we'll learn more about it and about anchor text, which are all things related to web search. So, the web document collection. If I ask you, what are the main features of web search, what is the web collection itself? I think the first thing anyone would say is that it's massive, it's huge. Of course, it's a big collection of large amounts of documents, and we're talking about trillions and trillions of documents. But what else is special about the documents of the web? Any ideas? Can you share your ideas? Yes, that's an excellent point. We have links, which actually make the pages a kind of graph. They are connected, so it's not just documents in isolation like the ones you have been working with; they have links to each other. What else? Yes, informal, yes. You can say there is no design or coordination, so everyone publishes whatever they want in the format they like. There is no specific format. Actually, if you check the articles you have from the Financial Times, or any other collection like a news website, they have a specific structure, the way they do their stuff, but here anyone can do anything. There is no specific structure.
What else? Think about other things, what else is special? Yes.

SPEAKER 2
Esoteric and other things are very common.

SPEAKER 0
So some things are very esoteric and some

SPEAKER 2
are very common things which everyone wants to do.

SPEAKER 0
Yeah, so some of these documents, not everyone will be looking for; some of them will actually be very common. Yes, different kinds of media. It has different kinds of media, not only text. It can be anything: videos, images, whatever they want. What else? And think about this: if you indexed the web, are you done? It keeps getting new stuff, and even the existing stuff keeps being updated. So actually, there are many things specific to the web collection. It's distributed content publishing: everyone is publishing from everywhere. Content includes truth, lies, obsolete information, contradictions. Everything is there. It has news and fake news as well. Unstructured sometimes, or many times semi-structured; sometimes you can find annotated photos or XML, or sometimes it's a database on the web that is fully structured, and you still have to index it. Growing continuously, it's always growing. While I'm saying this sentence, the number of documents has increased already. It's increasing all the time, nonstop. And content is dynamic. So if you index one page, well, many of these pages will be very dynamic, like news websites; you have to keep re-indexing every few seconds because they keep being updated all the time. So it's a very special type of document collection, not the normal one. Dealing with this, you have to think about all these kinds of things, how you can capture them, make it work in a nice way, and make retrieval actually effective. This creates some challenges, like the lack of design and coordination: how would you parse all of this and extract the relevant content and not the noisy stuff? How would you do that? Growing: actually, based on statistics (they stopped publishing this years ago), in 2008 Google was processing around 20 petabytes
per day. A petabyte is 1 million gigabytes; that's a huge amount of data. In 2013 it was 160, and we don't know what the number is now; they don't share it anymore. So imagine the amount of data processing happening every day. And controlling the quality is a big challenge here as well, because how would you deal with spam, fake news, and generated content, now in the era of AI? You can create many, many pages using ChatGPT or Gemini or whatever. How would you make a distinction between those and what is actually high quality and what is low quality? And you have to think about it from the search engine side. One part is the technicalities: how would you collect these documents in the first place? Then how would you store this huge amount in an efficient way and process all of it? And also, how can you filter out the stuff which might look relevant but is not, which is actually fake news or spam and so on? So that big massive amount of data has a lot of challenges. However, at the same time, it creates a big opportunity as well, and we will talk about how massive data can be useful in this case. Also, the concept of relevance, which we have been talking about many times. We always have the problem, if you can remember the discussion we had before: if you're searching for William Shakespeare, are you looking for the author, or the plays by him, or the history of it? If you are searching for Apple or Jaguar, what exactly are you looking for? Relevance is highly subjective, and we know this. And there is the thing about spam and SEO, which we'll talk about more in a while. How would you handle all of this?
But the good thing about web search engines: many people and many pages are there, which creates something we call the wisdom of the crowd, which we can use in a way that helps us automatically find the high quality stuff and get rid of the low quality stuff. We will discuss it more as we move forward, but let's discuss the first point, which is the effect of massive data. We're talking about massive data, really massive data, nothing like what we're doing in this course. I think the best of you, in the 2nd semester, might create a project of maybe 100 million documents. This is still tiny compared to the web; this is nothing. The challenges here, we talked about storage, processing, networking, and many things, but somehow massive data can make stuff easier when it comes to finding relevant documents. So let's take an example and think about it. Imagine that we have two good search engines, both using the same exact techniques: the same retrieval models and everything. However, one web search engine managed to collect a huge amount of data, let's say 10 trillion documents in this case. Another search engine, which is very similar in its functionality, the models and everything, managed to collect more documents. It did a nicer job in collecting and crawling the web, and instead of 10 trillion, it actually collected 40 trillion documents. Assume the first engine, Engine A, had a precision on average of around 40%: precision at 10, for example, is 0.4, 40%. What would be your guess for the second one? What would the precision at 10 be for it? Any guess? Mhm, 40% as well. Who thinks it would be lower than 40%? Who thinks it will be higher than 40%? Why do you think it will be higher? Just think about it. So you're trying to collect everything on the web.
One system managed to collect X, the other one managed to collect 4X. Which one do you think will be better at retrieving good documents for you? There is one answer that probably it will be the same. You mentioned it might be better. Why do you think it will have better precision?

SPEAKER 2
There are more documents to choose from, so. Yes, good. The chances of getting the correct documents is.

SPEAKER 0
OK, that’s a good point. Do you have a different answer or something related to this, um, I guess like, uh, it’s not like just

SPEAKER 3
more documents. It's also, uh, there's a larger sample of high quality documents.

SPEAKER 0
Let's say it's the same quality, the same exact quality; instead of 10 trillion, you have 40 trillion now.

SPEAKER 3
So yeah it’d be like um if the system is searching for like a specific term there’s 4 times more documents OK so you think there will be more relevant

SPEAKER 0
documents, so it's probably going to retrieve more. Do you have a different answer? Yes.

SPEAKER 4
I think with more documents they can, uh, build a better model of the semantics or the what is the distribution of like what documents correspond to the user information.

SPEAKER 0
You can build better statistics based on this. It should be, yes, right, but remember, after giving a larger number of documents, everything else stays the same. But that's still a good answer. So let's analyse it step by step, OK. So we said that we have this search Engine A that managed to collect 10 trillion documents, and it retrieved on average 4 relevant documents in the top 10, something like this. The green ones are relevant, OK. So what can we expect about the distribution of the retrieval scores here? Let's say we're using TF-IDF or BM25. If this is a good search engine, the distribution of the scores for the relevant documents should be higher than the distribution for the non-relevant documents. Is that fair enough? If this is a good search engine, for relevant documents the distribution of the score that you calculated in the coursework for TF-IDF, I'm expecting it to be higher than for the non-relevant ones. So it can be something like this: the relevant ones, the greens, probably have a higher score than the others. So I'm expecting to usually have more relevant documents at the top than later. Now imagine that I multiplied my collection by 4, with the same distribution. I'm expecting to get something like this: every document will be joined by another 3 documents with similar scores. So if I calculated the precision at 10 for this second search engine, which has 4 times more documents, what would the precision at 10 be? 8 out of 10? Who thinks 8? How did you find it's 8? But this is 12, not 10. So what would the precision be? Actually, the precision now would be something like this: if you draw it, I'm expecting to find more relevant documents at the top, just because I have more relevant documents, more documents in general, in my collection. I managed to collect more documents.
I'm expecting to have better performance, as long as I'm doing a good job of retrieving the relevant documents at the top ranks. So an important thing with big data is that it's very useful for having the answer to your query: I get more chance to find the answer to my query and more chance to find more relevant documents. This is why collection size is such a big part of the competition between web search engines. Imagine one of you decided to build a search engine to compete with Google. In this case your most important job is to find as many documents as you can on the web. If you manage to get more documents, probably you are going to do very well. If you have an amazing algorithm but a smaller number of documents, your performance will probably be lower, because there might be relevant documents that you have already missed; you cannot retrieve them anyway. Is it clear? OK. So this brings us to the question: is it big data or clever algorithms? This has been studied for decades in different fields: in information retrieval, in data science in general, in NLP and so on. For web search, a larger index usually beats a better retrieval algorithm. Google's index versus Bing's index, for example. I worked with Microsoft for a few years, and we were always competing to get performance similar to Google's. Whenever we reached Google's level, Google became better. But the biggest challenge was always that we were aware that Google's index is bigger than Bing's, Microsoft's, index. They have more documents, and this challenge was always a problem. Even if you have amazing algorithms, if they have more documents, they have more chance to find more relevant documents. So this is one thing.
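To make this argument concrete, here is a minimal simulation of the thought experiment. It assumes relevant and non-relevant documents draw retrieval scores from two Gaussians (the exact distributions and counts are made up for illustration, not from the lecture); the only thing that changes between the two engines is the collection size, scaled by 4 with the same score distributions.

```python
import random

def precision_at_10(n_rel, n_nonrel, seed=0):
    """Simulate retrieval scores and measure precision at rank 10.

    Relevant documents draw scores from a higher-mean Gaussian than
    non-relevant ones (the 'good search engine' assumption); the
    distributions themselves are illustrative choices.
    """
    rng = random.Random(seed)
    scored = [(rng.gauss(2.0, 1.0), True) for _ in range(n_rel)]
    scored += [(rng.gauss(0.0, 1.0), False) for _ in range(n_nonrel)]
    scored.sort(key=lambda t: t[0], reverse=True)   # rank by score
    return sum(1 for _, rel in scored[:10] if rel) / 10

# Engine A: a smaller collection; Engine B: same quality, 4x the size.
p_a = precision_at_10(n_rel=50, n_nonrel=5000)
p_b = precision_at_10(n_rel=200, n_nonrel=20000)
print(p_a, p_b)
```

Averaged over many random seeds, the 4x collection ends up with a noticeably denser top 10, because a larger collection pushes the score threshold for the top ranks higher, and relevant documents dominate the upper tail.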
Another thing, in other applications: Google machine translation versus IBM machine translation. This was at the time when I was working at IBM, and I can remember our team (I was not part of the machine translation team, but they were my colleagues), and there was this competition called WMT. It happens every year, and for many, many years they competed on who would produce better translations. IBM did an amazing job with very clever algorithms for parsing the text, stemming it, finding connections between words and synonyms and so on, to produce a very good translation. And then Google competed in the competition and achieved the best results, and when everyone shared their algorithms, Google actually did nothing special, just statistical machine translation without much NLP work. And it turned out that their training data was 10 times IBM's training data. So probably you don't have to stem and do this kind of lexical or semantic analysis; as long as you have seen all the examples anyway, you will be able to translate. So big data worked again. And LLMs: at the moment we're talking about LLMs. ChatGPT, for example, is trained on the whole internet, plus more data which we're not aware of. Common Crawl, for example, has around 5 petabytes of data, a huge amount. And we know that when you have larger data to train an LLM, you probably require more parameters to be able to handle all of it, so that you don't compress too much, and in this case you will generally get better performance. We were aware that GPT-3.5 had 175 billion parameters; we don't know what is going on now. And we know that with bigger data, you will get better performance, because you have seen a lot of data. And it started way, way long ago.
There was this very nice experiment that happened around 2002. The smartest-looking question answering system created at that time was IBM Watson. It came out with a big splash; it actually went on Jeopardy, the programme, and beat human beings at answering questions. It was an amazing system. It was, of course, trained in a totally different way from what we're doing now, but it was the smartest system humans had created at that time. And there was this task in TREC about competing in question answering: you have a question and you have to find the right answer. And Microsoft came up with another algorithm that didn't use any heavy NLP, and it beat IBM's system, which was the state of the art at the time. So what happened exactly? Microsoft came and said: for example, you have a question like, who created the character of Scrooge? And you have to find the answer in a set of documents, find where it is, and retrieve it. Normally this required heavy linguistic analysis, a lot of research; it has been studied a lot in TREC. You remember TREC, these evaluation campaigns? They created several tasks for this. IBM at that time was taking the question, trying to do a deep understanding of what exactly it was looking for: a lot of parsing, extracting the information, getting the answer. Microsoft said, I will come and do something different. I will create some patterns from the question. All the NLP I'm going to do is to identify, within a sentence, where the subject, the verb, and the object are. That's it, a simple part-of-speech tagger. And I will create two phrases: the first one starts with the verb, then the object; and the second one starts with the object, then the verb, but in the passive voice.
So for example, for who created the character of Scrooge, I would search for full sentences containing created the character of Scrooge, as is, exact match, and I also create something in the opposite direction, which is the character of Scrooge was created by, and that's it. All I need to do now is search for an exact match of these phrases, and for the first one take whatever appears before the phrase, and for the second one whatever appears after it, without any kind of processing, without understanding what the term is. It can be a number, I don't care, just take it. So I just create the patterns, look for an exact phrase match, and take the term either before the first phrase or after the second phrase. And they had access to the web. They were building search engines, so they had a large number of documents, and they just checked how many times each term appeared before or after these phrases. They just did a very simple count, and they found the most common answer here was Dickens, so they said, you just take Dickens as the answer. And this system actually achieved better results than IBM's at the time. Very simple: exact match, looking for the term appearing before or after a certain phrase, and you're done. And this worked simply because of massive data: they had access to the web, so they could search, find an exact match, and get the answer from a simple count. It's a very naive approach that ignores most of the answer patterns; the answer can be phrased in many different ways, but they ignored that because with a huge amount of data, that is enough. I will find the answer anyway. Is that clear so far? Any questions? OK. So even if it's missing a lot of the answer patterns, who cares? The web is huge; you will find the matches anyway. OK, let's take another thing about web search.
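The rewrite-and-count trick just described can be sketched in a few lines. This is a hypothetical illustration: the tiny corpus below stands in for web-scale text, and the tokenisation and pattern format are simplifying assumptions, not the actual Microsoft system.

```python
from collections import Counter
import re

def pattern_qa(question_patterns, corpus):
    """For each (phrase, slot) pattern, find exact matches in the corpus
    and tally the word just before ('before') or just after ('after')
    the phrase. The most frequent slot-filler wins, with no linguistic
    analysis at all."""
    counts = Counter()
    for phrase, slot in question_patterns:
        for sentence in corpus:
            for m in re.finditer(re.escape(phrase), sentence, re.IGNORECASE):
                if slot == "before":
                    words = sentence[:m.start()].split()
                    candidate = words[-1] if words else None
                else:
                    words = sentence[m.end():].split()
                    candidate = words[0] if words else None
                if candidate:
                    counts[candidate.strip(".,")] += 1
    return counts.most_common(1)[0][0] if counts else None

# Two rewrites of "Who created the character of Scrooge?"
patterns = [
    ("created the character of Scrooge", "before"),        # active voice
    ("the character of Scrooge was created by", "after"),  # passive voice
]
corpus = [
    "Charles Dickens created the character of Scrooge in 1843.",
    "The character of Scrooge was created by Dickens for A Christmas Carol.",
    "Scrooge McDuck is a cartoon character.",
]
print(pattern_qa(patterns, corpus))  # prints 'Dickens'
```

With only three sentences the count is fragile, which is exactly the lecture's point: the approach only works because the real corpus, the web, has so many matches that the simple count becomes reliable.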
So the first thing that we learned about web search: massive data, yes, it has its own challenges with processing, collecting, and indexing documents efficiently, but on the other side, it helps to get good results when we search. The more data, the better results we expect, because there are many relevant documents in there. The second thing: imagine I'm now searching for the word Microsoft on the web, and I have 2 documents here. One is Microsoft.com, and the word Microsoft is mentioned on this page only 5 times, because Microsoft will be talking about different things, probably their products. And another document, a web page, which is a tutorial about Microsoft Word, how to use Microsoft Word, and the word Microsoft has been mentioned there 35 times. OK. So based on the methods we have been using so far, which one should be ranked higher? Is it document one or document two? Two, because it has a higher TF, and since you're searching for the same exact word, Microsoft has the same IDF for both. So based on what we have developed so far, the first document would actually come lower than the second document. But which one should be more relevant, from your understanding? The first one: I'm searching for Microsoft, let me find the website of Microsoft. Which brings us to a feature of the web we have ignored so far, by treating the documents as isolated documents: they are connected. The web pages are connected to each other. Each page has some links going out of it and coming into it, and there is what we call the anchor text: if you find a word linked to another page, this word is the anchor text, the text linking to the other page. So assumption one is that a hyperlink between pages denotes author-perceived relevance.
So if I'm linking my page to page B, I can assume that I think page B is relevant to what I'm talking about. Is that fair enough? If I'm linking something, then probably it's relevant to what I'm saying now; this is why I linked it. The second assumption is that the anchor text, the words that make up the link, describes the target page. It's kind of describing what is in there. So, for example, in my web page I might say I worked before for Microsoft, and I would link the word Microsoft to the Microsoft page. So the word Microsoft describes Microsoft's page. It gives the context, what exactly I think is relevant here. So, the links between pages. Google's take on this is called PageRank, which they came up with; this was everything about Google when they came out. It relies on the uniquely democratic nature of the web. On the web, people link to each other, to what they think is relevant, and if we assume that every link from one page to another is kind of a vote, I'm voting for this page because it's relevant to my content, then there is a kind of democracy here on the web that I can leverage when I build the search engine. So if there is a link from A to B, then it means that A thinks B is worth something, relevant somehow, and based on the wisdom of the crowd, many links to B say that B is a good page. If many pages are linking to page B, then I can assume that this page is somehow important, because many things are linking to it. And it's independent of the content, because I just check how many pages link to me, so it's a measure of the quality of B.
So when you're crawling a lot of pages from the web, if I'd like to save some space, I can find out which pages no one is linking to, and then consider these lower quality than the ones which have many pages linking to them. Is that clear enough? OK, good. Just give me an indication, some feedback; you will get the feedback on coursework one, give me some feedback now. So, use it as a ranking feature, combined with the content. You can think of it as something I use to help me in ranking the documents. However, not all the pages that link to B are equal in importance. Say I have two links to my page: one is coming from CNN, for example, and another one is coming from a blog post. I shouldn't consider the one coming from CNN the same as the one coming from a random blog page. So I should also give some weight to who is linking to me: an important page should have a much higher weight than a page that is less important. PageRank was introduced in 1998, and it has changed everything about web search from then till now, OK? This is actually how Google came out to overtake most of the existing web search engines. So what is going on here: given these two documents, I'm expecting that Microsoft.com would receive a lot of links. First, the tutorial would probably be linking to Microsoft, which shows that this page thinks Microsoft is important, but Microsoft is also receiving a lot of links from different pages. So I'm expecting the PageRank of Microsoft will be way higher than the PageRank of a tutorial page, because the tutorial will be less popular and less linked. OK. So how do we calculate the PageRank in this case? The main analogy here: I have all the pages of the web. Let's assume I have a small web like this one, which has, how many are these, 11 pages? A huge web.
Of course, we're talking about trillions in real life, but let's start with this, and I have all of these pages linking to each other somehow. So how will I start? I think of it as browsing, starting at a given random page: I go to any of these pages at random. Then, once I pick a random page, I check the page I'm on now and its outgoing links; for example, I have 5 outgoing links. I pick one of them at random and go to the next page, and I keep repeating this. So I move from page to page as long as there is a link, and I keep repeating this forever. For example, I can start at page G, and from page G I can see it goes to E and B. So I choose at random and go to E, then E goes to F and D. I go to F at random, and F goes to B and E; I go back to E, say, then back to D this time, and then I go to B, and then I go to C. So I keep jumping between pages. And if I kept doing that, what do you think might be a problem here? Yes, I can reach a dead end, like B and C: B is only linking to C and C is only linking to B, so I will be stuck there forever. So PageRank actually introduces lambda: with probability 1 minus lambda, I stop following links and go to a totally random page again. This usually happens when you're browsing the internet: you go and click on something, then at some point you decide to type a new address and start something new. So simply, with probability lambda I keep jumping between links, and with 1 minus lambda I jump to a totally random page. Otherwise I can get stuck at A, which doesn't go anywhere, or between B and C. OK? So what is PageRank in this case?
It is the probability of being at a given page at a random moment in time. If I keep doing this an infinite number of times, what is the probability that, after a while, I'll be at this page? If I have many links coming to my page, there is more chance I would be at this page; if there are no links, I have less chance, and you can only reach me at random. OK. So let's start calculating this. First of all, we calculate PageRank over multiple iterations. In the first one, I go to a page at random. So what is the probability in this case? Simply, I divide the whole 100% by the number of pages I have on the web. In this case it's 11 pages, so 100 divided by 11, which is 9.1. So I can be at any of these pages with a 9.1% probability, because I would jump to any one at random. And for every page, I calculate the next step. The PageRank at the next step, t plus 1, is 1 minus lambda divided by the number of pages, which is jumping to the page at random, plus lambda times the contributions of the links coming into my page. So the probability of being on this page is: get there at random, or get there from a link coming to my page. When a page y links to x, it contributes a little bit to the PageRank of x, and it spreads its PageRank equally among all its outlinks. So if I have 5 links coming out of my page, I can go to any one of them at random, so I split my PageRank by 5. And we keep iterating until at some point there is no change anymore: the PageRank score of each page doesn't change any further. Let's continue calculating this to make it clearer.
So at the first stage, we said it will be 9.1% for each of these pages. Let's assume lambda here is 0.82. So with 82% probability I follow one of the outlinks, and with 18% I jump to a random page, OK? So let's calculate. At the first stage, the PageRank for all the pages is the same, so B equals C equals D, all of them 9.1%. Let's calculate the PageRank of C at step one, after initialization. It is 0.18 divided by 11, which is jumping to this page at random, plus lambda, which is 0.82, multiplied by the PageRank of the pages linking to it. How many pages link to C? One, which is only B. So it's 0.82 multiplied by the PageRank of B. And what is the PageRank of B at this stage? 9.1%. So simply it will be 0.18 times 9.1 plus 0.82 times 9.1, which is still 9.1%. Let's do the PageRank of B at step 1. The PageRank of B would be 0.18 divided by 11, which is jumping to this page at random, plus 0.82 multiplied by the contributions of all the pages linking to it, and there are a lot. For C: how many links come out of C? Only one, so it contributes its whole PageRank. Then D: how many links come out of D? 2. And how many of those go into B? Only 1, so it contributes half of the PageRank of D. The next one is E: E goes to D, F and B, so one third of the PageRank of E, and we keep going the same way. For F there are only 2 outlinks, so half; for G, half; for H, half; and for I it's half as well. When I calculate this, the PageRank of B comes out at 31%. And I can keep doing so for all the pages.
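The iteration just walked through can be sketched in a few lines of Python. This is an illustrative sketch, not code from the course: the graph below is a small made-up example, not the exact one on the slides, and dead-end pages are handled by spreading their score over all pages, which matches the random-jump idea.

```python
def pagerank(links, lam=0.82, iterations=50):
    """Iterative PageRank as described above:
    PR(x) = (1 - lam)/N + lam * sum(PR(y)/outdegree(y) for y linking to x).
    `links` maps each page to the pages it links out to. A page with no
    outlinks spreads its score to every page (the random jump)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}           # start at random: 1/N each
    for _ in range(iterations):
        nxt = {p: (1.0 - lam) / n for p in pages}
        for y, outs in links.items():
            if not outs:                        # dead end: jump anywhere
                for p in pages:
                    nxt[p] += lam * pr[y] / n
            else:                               # split PR equally over outlinks
                share = lam * pr[y] / len(outs)
                for x in outs:
                    nxt[x] += share
        pr = nxt
    return pr

# A small made-up graph: many pages link to B, B links only to C,
# and A links nowhere.
links = {
    "A": [],
    "B": ["C"],
    "C": ["B"],
    "D": ["B", "E"],
    "E": ["B", "D", "F"],
    "F": ["B", "E"],
}
pr = pagerank(links)
print(sorted(pr, key=pr.get, reverse=True))  # B and C come out on top
```

Note that C ends up with a high score despite having a single inlink, because that inlink comes from the high-PageRank page B, which is exactly the behaviour discussed next.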
So I did that for PageRank at step 1; then I can do it at step 2, step 3. At a certain point, maybe step 5 or 6, I will find no changes anymore; it stays the same. And if I calculate this, after one step, doing it for all the pages, I can see B starts to be 31% and C is 9.1%, and so on for the others. OK. And if I calculate PageRank at step 2 for C, it's the same calculation again, but now B is 31%, so C jumps to 26%. This is why C goes 9.1%, then 9.1%, then 26%, and we keep iterating. So if I asked you: these are the web pages we have, remember them? Which page would have the highest PageRank based on this algorithm? B? Who thinks it's B? Only one. Who thinks it's E? More. OK, so B and E. Who do you think would have the second largest PageRank? We are now between B and E. Sorry, can you repeat? We said B and E probably have the highest. What else could have a high PageRank? C? OK. And who would have the lowest PageRank? A. OK. Let's check; this is the final outcome. Can you see E? It's way less; C actually is very high. Why did this happen? Yes, because it's not just the number of links coming to my page; it's who is linking to my page as well. If I have 100 blog posts linking to my page, and another person has Wikipedia, for example, linking to their page, it's a big difference. So it's important to check who is linking to my page. Some of the things we notice here: pages with no inlinks, for example, like the ones at the bottom. The only chance to be on these pages, because there are no links going to them, is to reach them at random. So it will be 0.18 divided by the number of pages, which is about 1.6%.
So the only chance to be on these pages is to go there at random. We also noticed that same or similar inlinks give similar PageRank: D and F are both linked from E only, so I would expect them to have the same PageRank. And the last important thing is that one inlink from a high-PageRank page is way more important than many links from low-PageRank pages. It's the same comparison between C and E: E has many links coming to it, but most of them are from very unimportant pages, while C has only one link, from B, so it became very, very important in this case. Is that clear so far? OK. So this is the main thing about PageRank. It managed to give some weight to the pages. Now not all the pages are the same: I'm expecting that Microsoft, CNN, Wikipedia will have a higher PageRank compared to my personal web page or a blog post or something else. OK. And it was very, very useful when it came out. Another thing about the links here: we have been talking about the links coming to a page, but what about the text that actually makes this link? You say: you can find this page at the following link. We call this the anchor text. So the anchor text is kind of a description of the page I'm linking to. The good thing is it's usually short and very descriptive, like a query. So for example, what could be linking to IBM? It could be 'International Business Machines', it could be 'IBM', it could be 'Big Blue', which is a supercomputer they created years ago. So I can get some description of what I'm going to see on this page. And the good thing about this: if you're going to reformulate a query, it's kind of human query expansion. Actually, this is very useful for the query expansion we talked about last time. 
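The idea of folding anchor text into the target page's index can be sketched like this. Everything here is a toy illustration: the URLs, the flat `anchor_weight` constant, and the whitespace tokenisation are all assumptions for the sketch (a real engine would weight anchors by the linking page's PageRank, as the lecture goes on to say).

```python
from collections import defaultdict

def build_index(page_text, anchors, anchor_weight=2.0):
    """page_text: {url: text}; anchors: list of (source_url, target_url, anchor_text).
    Returns {url: {term: weight}}. Anchor terms are added to the TARGET page's
    entry, so terms the page never uses itself still become searchable."""
    index = defaultdict(lambda: defaultdict(float))
    # index each page's own content
    for url, text in page_text.items():
        for term in text.lower().split():
            index[url][term] += 1.0
    # fold in anchor text pointing at each page, with its own weight
    for _source, target, text in anchors:
        for term in text.lower().split():
            index[target][term] += anchor_weight
    return index

index = build_index(
    {"ibm.com": "international business machines"},
    [("blog.example", "ibm.com", "Big Blue"),
     ("news.example", "ibm.com", "IBM")],
)
# "blue" now matches ibm.com even though the page never mentions it
```

Making `anchor_weight` depend on the source page (e.g. its PageRank) is exactly the weighting idea described next in the lecture.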
So another method for finding relevant query expansions is to look at all the anchor texts pointing to the same link: where they say the same thing differently, I can treat each of them as a different query expansion, because they are linking to the same thing. And when indexing page content, the good thing now is that if I'm indexing a page, say IBM.com, I can actually include with it all the anchor text that has been used to link to this page, as an expansion to the document itself. So maybe they never mention the word 'Big Blue' on their website, but I learned it from the anchors and added it. So when I'm searching for 'Big Blue', I can find it matching that webpage. And of course, I can give different weights to these expansions. Uh, a lot of people don't like IBM. No, no, go, it's fine, it's fine. So the good thing is I can even add them with different weights: if the link is coming from a big, important page, then I can give a higher weight when adding this expansion compared to other documents. And it has been tested to significantly improve results; it helps a lot. However, PageRank itself and the anchor texts, as useful as they have been demonstrated to be, still have some issues: they can be misused. So let's see the vulnerabilities that can come up. In general, there is a concept called the Hawthorne effect, which is: observation changes behaviour. When this came out, it was a big thing. People said, oh, that's amazing, it's useful, let's use it. But people also took note: oh, this is important. 
Then we should actually game the system to get a better PageRank. Which brings us to the other thing, Goodhart's law: when a measure becomes a target, it ceases to be a good measure. This can lead to misuse. And there is also what we call the Cobra effect, which is a solution that worsens the problem due to other incentives. Just to understand the Cobra effect: has anyone heard about the Cobra effect before? One only, two, OK. So what happened: when Britain was colonising India, they noticed there were a lot of cobras around. The Indians were fine with the cobras and the snakes, but the British didn't like it, so they created a reward: whoever brings us a dead cobra, we'll give them some money. And people thought, oh, this is amazing, we can make money off this. So what happened later? Yes, people started raising more cobras. They created cobra farms. They started raising cobras so they could sell them and get more rewards. And after the British realised that, the rewards stopped. What happened later? The Indians said, what is the point of these farms? So they released all the cobras. So it ended up with more cobras in the country, because the scheme had been misused. And think about PageRank and anchor texts in the same way: they can be misused. So one of the first things is trackback link spam. Simply, you hear about the same thing on social media as 'follow me back': I will follow you and you follow me back, to create more followers for each other. It has been done on the web as well. For example, a blog's 'blogs that link to me' feature pings the linked page, what we call a trackback, and the idea is: if you link to my page, I will link to your page as well. 
So you can find a spam page linking to something genuine, and then getting linked back, to create more links, which is a challenge: this is faking PageRank. Another thing is comment spam: links placed in comments on sites with high PageRank. So I go to CNN, comment on the news, and put a link to my webpage saying 'find more on this webpage', which is very common. This example is from a friend's website discussing computational social science, and people were spamming links there in different ways: you can put the actual link in the comment, or you can put a link in your commenter name itself, so when people click on it, it's more clicks. So you go and comment there and say, OK, now I'm taking some of the PageRank from the page I commented on. This led to something in web pages: a link relation called nofollow, so when crawlers crawl the pages, all the comment sections will not be followed or counted when they calculate PageRank; they don't care about these parts. For example, you can search for some posts on Facebook: you will find the link to the post itself, but not to the comments. Web pages with comment sections usually have this kind of meta-information, rel="nofollow", saying: don't follow any links in this section, don't waste your time, OK? So this is one solution for this. Another thing, remember the cobra farms? This is called link farms. I have a page, and I don't want to start with the very low random-jump PageRank. So I create many virtual machines with many links between each other, and all of them link to my page. So instead of starting at 1.6%, I can start at 2% or 3%. 
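The nofollow convention just described is easy to honour on the crawler side. Here is a minimal sketch using only Python's standard-library HTML parser; the HTML snippet is invented for illustration, and a real crawler would of course do much more.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hrefs from <a> tags, skipping links marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel="nofollow" means: this link carries no ranking endorsement,
        # so don't follow it and don't count it for PageRank
        if "nofollow" in attr.get("rel", "").lower().split():
            return
        if "href" in attr:
            self.links.append(attr["href"])

parser = LinkExtractor()
parser.feed('<a href="/news">news</a>'
            '<a rel="nofollow" href="http://spam.example">click me</a>')
# parser.links keeps only "/news"; the comment-spam link is ignored
```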
It boosts my PageRank a little bit in the beginning. So this is another trick, and there are many algorithms around it. The problem with this: if you start a search engine and start crawling data, keeping following links to collect it, the crawler can get trapped in some of these farms, wasting a lot of resources and time on virtual machines, not real pages. And there is a lot of work on how to detect these farms and not get trapped in them, to calculate the correct ranking. This has been done on social media as well; you can find a nice paper by a colleague here about the 'team follow-back' phenomenon on Twitter, about how people use it on social media. Yes, a question: if a new site is added to this network, wouldn't you have to then recalculate all the PageRank for everything again? You have to. PageRank calculation doesn't stop. Crawling doesn't stop. These are machines running all the time. Even if just one site changes, yes, it continuously happens, because it's not one snapshot; even CNN itself updates its website. So the crawling, the PageRank calculation, the indexing, everything on the web is continuously happening, non-stop. So for some pages, you can actually find their PageRank: there is a website that can tell you the PageRank of a page, it gives you a score. If you have a personal website, you can check how it changes. Probably for non-famous websites, it will update every week or two; for other websites, it will be updating every hour or so. So it keeps happening; the calculation doesn't stop. OK? 
And you will notice this any time you run or maintain a web page: when it gets indexed by Google, for example, at first you can search for something specific and you will never find it, and after a while you start to find it. Oh, now it went into the index. OK? So web search engines continuously process information, non-stop. We'll speak more about this in the next lecture. OK. How else can this be misused? One interesting thing was actually noticed at Google; people started misusing this from 2003. At that time, if you searched for the words 'miserable failure', what was the number one result coming from Google? It was the official biography of George Bush on the White House website. And when it was tracked down — how would searching for 'miserable failure' get you that webpage? — it turned out that someone who didn't like his policies, probably because of the Iraq war, had posted: there is a new project, guys, let's make a lot of links on our pages with the text 'miserable failure' pointing to the biography of George Bush on the White House webpage. People took up the idea, it became a trend, and everyone was linking it on their web pages. So Google collected this and found: the words 'miserable failure', a good answer for them is this webpage. This is what we call anchor text spam, and it kept happening until 2007. For four years, if you searched for 'miserable failure', you would get George Bush in the top results. And it only stopped then because Google decided to re-index everything from scratch, not just update their index. 
And the interesting thing is it didn't happen only on Google. Four or five years ago, if you searched on Twitter for the word 'loser', this was the top result. It happened right after Donald Trump lost to Biden, and I took this snapshot myself. At that time, people started posting about him on Twitter with 'Donald Trump lost', 'loser', so the word got linked to him, and this was the top result. The other thing is high-PageRank websites gaming the algorithm, especially very famous websites. For example, here, for web page C: if B removed its link to C, what would be the PageRank of C? 1.6% — complete destruction for C. So C relies a lot on B, and websites, especially the big ones, know they have this power. And they treat it very carefully, especially when linking to a rival or a competitor, sometimes avoiding giving out links at all, especially when citing a source. For example, CNN citing Reuters. Reuters usually brings the news first in many cases, and CNN will take the news from Reuters and say: according to Reuters, this happened. So they will say 'according to Reuters', but they will not link Reuters, because they know that if they linked Reuters, they would be increasing its PageRank a lot, and they don't want that. Reuters is still a competitor, so you can cite them without linking them. So this is another strategy people use. Sometimes you can also create irrelevant content and link to other pages just to harm them, like the 'miserable failure' case, to reduce the quality of results for your rival as well. For example, here: 'How to transfer money using Revolut'. If you search for this, many times the top answer would be 'How to send money on Revolut, step by step' — but it's actually on Wise. 
So Wise is a competitor of Revolut, but how do they gain traction? They put exactly what people are searching about Revolut on their page, but they advertise themselves. This is related to search engine optimisation, SEO, and we will talk more about it in the second part of the lecture. So this is a way to attract even the people looking for your competitors and bring them to your page. But in reality, PageRank now is not the full story anymore. When it came out, it changed everything; initially, when it was proposed, it was a big hit. However, now it's just one feature among many, many other features. There are many sophisticated features used at the moment: machine-learned ranking, and, as we talked a little bit about last time, word vector representations, where a word is not even treated as just a word, using machine learning, BERT models, LLMs, and so on. So it became very complicated. However, PageRank still continues to be a useful feature, but it's only one feature among hundreds. It's still important, but it's just one among many. So, the summary of this: web data is massive, which is challenging for efficiency, but it's still very useful for effectiveness, and we discussed how. We talked about PageRank, the probability that a random surfer will be at a certain page at some point, and that the more powerful the pages linking to my page, the higher its PageRank. So it's not just the quantity, but also the quality of the pages linking to my page. Anchor text is a nice short description of the link target, and it's very useful, but it can still be misused, and we saw how this has been done, and the link spam that people use to try to game the system, and the different ways to detect and solve it. 
So the resources for this: in Introduction to IR, Chapter 21, Sections 1 and 2, and in IR in Practice, Sections 4.5 and 10.3, that is Chapter 4, Section 5 and Chapter 10, Section 3. And I would recommend that you read the original paper about PageRank, when it was first introduced to the world; it came out in 1999. If you know these names, these are the founders of Google: it was their PhD project, and they took it later to become one of the largest companies in the world. And you can find additional readings here about the story about Microsoft versus Watson — this is the paper — and also how link farms happen on social media, which is a very recent paper from last year. This is all for now, and we will have a break for 5 to 10 minutes, and we will continue afterwards. Do you have any questions? OK, so I will get to the mind teaser in a second. 

SPEAKER 1
So this research is about. And Yeah, it’s not OK. And if you want to, you know.

SPEAKER 0
OK, this is a simple one. Whoever gets the answer, please let me know. OK, so let's continue talking about web search engines. What we're going to cover are some of the basics of web search engines in general, a brief history of web search — it would be nice to see the history of what we are living now; remember, this was all invented in the 1990s, it wasn't there before the internet came — search engine optimisation, which we mentioned quickly last lecture, and web crawling, how this happens in general. So, a brief history. When the internet came out and people started to create web pages and wanted to browse them, it was all keyword based, in the same way we have been studying in the last lectures, using things like BM25 — and actually BM25 came even later than this. One of the earliest search engines at that time was called AltaVista. I don't think anyone here is old enough to know what it is, but it was the main web search engine at that time, and there were also Excite, Infoseek, Lycos, and AOL. AOL was one of the well-known web search engines, and at some point they decided to be kind to the scientific community and said: we will release all the query logs we have. That is, everything people were searching for; they had a huge query log at that time. And the next day, the CEO of the company was fired. Does anyone know why? It turned out that people searched for bizarre stuff: people were searching for their full names, their addresses, their credit card information, and this was all publicly released. So when it was released, people went through it, found a lot of credit card numbers, and started to use them. It was a big problem at the time. So when you talk about users and search engines: people do crazy stuff, so we don't know what to expect. 
So it was released without any revision, and it was a problem at the time. But otherwise it was a nice working web search engine for its day. It used traditional IR techniques; scalability was an issue, but they managed, and the web wasn't that huge at the time anyway. Then paid search came out in the late 1990s, like GoTo, which was later called Overture.com and was eventually acquired by Yahoo. There, your search ranking depended on how much you paid. So it was not just relevance, but also how much you were paying this company to make your product come up. Sometimes it was an auction on keywords: for example, if I make cameras, who would pay more for the word 'camera' to bring their webpage up when people search for it? This was called sponsored search, and there were two types of payment, which are still used now: cost per click, or cost per 1,000 impressions. So there are two ways to do it. One: once a user clicks on your sponsored result, which appears at the top because you paid, you pay — say, 5 pence or 5 cents for each click that opens your page. The other: even if the user didn't open it, you can say, whenever you bring my result into the top 10, I will pay per 1,000 impressions — every 1,000 times it appears to users, regardless of whether they clicked or not, I pay. Currently, people also do this with what we call revenue per 1,000 views, RPM, which is used mainly for people making videos on TikTok or YouTube and so on. You can read more about this in this link. OK, this is a quick history. What else happened? 
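The two payment models can be made concrete with a small worked example. The prices and click-through rate below are made up, chosen only so the two models happen to cost the same advertiser the same amount:

```python
def cpc_cost(clicks, price_per_click):
    """Cost-per-click: pay only when a user actually clicks the ad."""
    return clicks * price_per_click

def cpm_cost(impressions, price_per_1000):
    """Cost-per-mille: pay per 1,000 times the ad is shown, clicks or not."""
    return impressions / 1000 * price_per_1000

# 10,000 impressions with a 2% click-through rate, at 5p per click:
print(cpc_cost(10_000 * 0.02, 0.05))   # 200 clicks -> 10.0 (pounds)
# the same 10,000 impressions at 1 pound per 1,000 impressions:
print(cpm_cost(10_000, 1.00))          # 10.0 (pounds)
```

Which model is cheaper for the advertiser depends entirely on the click-through rate: below 2% in this made-up setting, CPC wins; above it, CPM does.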
Then in 1998, Google came out and introduced what's called link-based ranking, which is PageRank, and it blew away a lot of early search engines. It was very good at getting relevant results compared to anything else and gave a great user experience at the time. Meanwhile, sponsored search based on advertising — Overture, as we talked about — was making around $1 billion per year from paid search results. Google then decided to create what they called 'ads on the side', to include this in the search engine, because it had become very popular and they also wanted a source of revenue. At that point Yahoo realised how important this was and decided to acquire Overture to be able to compete with Google. Yahoo was actually doing well compared to Google in the early 2000s. Around 2005, Google started to gain a lot of market share — probably around when you were born, so you grew up knowing only Google — and dominated in Europe and was a very strong competitor in North America; they were dominating by far. This made Yahoo and Microsoft join efforts and combine their search businesses to compete with Google, but it still didn't lead to much. And since last year we have started to hear about AI systems coming together with search engines. However, so far it's still basic search — you can find an AI summary of the results — but we still haven't seen OpenAI taking over anything, or Perplexity, or anyone. Google is still the dominant one. Just to show you how this happened over time, I like this video, so I will play it and try to describe it a little. So let's set the speed. Uh, come on.

SPEAKER 1
Pause. Yeah.

SPEAKER 0
It’s really annoying how to make it faster. Anyway, I’ll keep pushing it. Yes, working, yeah. No. This is annoying. Sorry, Give me a second to fix it. Oh my god.

SPEAKER 1
It’s your man well. the So you can figure it if she likes it she wants to.

SPEAKER 0
Um. If I played at normal speed, it will take a long time, but I’m wasting the time anyway.

SPEAKER 1
Because you Come on.

SPEAKER 0
OK, what I can do is do it like this. Let's see, is it working now? OK, you can see it, it's happening; this is double speed. So you can see in early 1995 it was these engines you probably haven't heard of, but there was Yahoo at that time. Who has heard of Yahoo? At least you know it. And at that time there were these search engines. AOL came out a little bit and started to gain share, and then AltaVista came out to be one of the most popular ones in the late 1990s. Then AOL came in; you can see it was getting a lot of traction at that time — this is the one whose CEO was fired after releasing the query logs. Then, alongside Yahoo and AOL, Google came; you can see this is in the late 1990s, and it started to compete with them. Microsoft — it was MSN at the time — was competing very well, and the top three were Yahoo, Microsoft, and Google, until 2001 and so on. But Google, a young company with a late start, while Microsoft had been there for a while, started to gain traction. Then at the end of 2002, into 2003, they started to dominate, taking most of the market from Yahoo and MSN. AOL was trying to compete with them, but after the incident when they released the logs, they were done. Baidu is the search engine in China and Yandex is the search engine in Russia. And you can see Google now keeps dominating over time. I was at Microsoft when it was still called MSN — I think I went in 2008 — and we were trying to compete, but you can see they didn't do well and didn't know what to do. Baidu, because it operates only in China with no one competing with it, still had some of the market share. Then MSN was rebranded to Live; they said they were rebranding it to make it better, and still nothing happened. 
Microsoft and Yahoo then decided to work together to get better performance, but nothing. MSN didn't work, Live didn't work, so they rebranded again, to Bing, and tried to do something; they started to gain a little market share, especially when working with Yahoo, but in the end Google was still dominating by far, and they kept dominating over the years, nothing changing. Baidu and Yandex kept their own markets, and this keeps happening. AOL, I'm not sure, left the market around 2010 — has anyone heard of AOL before? Oh, a few of you, that's not bad, that's good. But as a search engine, Google leads by far. And the same with Baidu and Yandex: because they are special markets, one for China and one for Russia, they just keep their own shares, though Google has started to gain share even there. Bing and Yahoo realised they would never win the market, they were just losing it, while Google continued gaining. You can see them trying to find solutions, but still Google dominates by far compared to anything else. Yeah, here you are, still Google and Baidu. I think this one goes up to 2023. So this is the story of the last ten years or so; it didn't change. Google has most of the market share and everything else is much smaller compared to it. I think it has even more market share now than before — this was the situation until 2021. So this is a quick history of what has been happening. If you know Google now, this is its story: it started in 1998, became dominant around 2004, and has continued like that. What is the situation now? Oh, don't run again. This is the current situation. 
Last month, October 2025, this was the market share: Google, by far, at 90% of the market; then comes Bing at 4%; Yandex, in Russia, at 1.8%; Yahoo at 1.4%; and the rest are smaller. And where is OpenAI? Where is Perplexity? Not there at all yet. How this will change in the next few years, I don't know, but so far Google dominates web search by far compared to anything else. OK. How many of you were aware of this full history? Cool, OK, so at least most of you have now learned what has been happening over the years. So when you see Google now: it wasn't Google all the time, just the last 20 years; before that there were other search engines. What made the breakthrough for Google started with PageRank; after that they were leading by far with learning to rank, and the main thing for Google for many years is that they were very good at collecting a large amount of documents from the web, so they have better coverage of the web, and they are now even gaining market share in Russia, where the local share has decreased a lot. So they are doing very well across all languages at the moment. How this will change with the new AI era, we'll see: you now have the AI summary, Gemini is doing well with Google, but the search engine continues to be the main thing. So, in summary, you can see this is a normal Google page from years ago: when you search for something, you get the top results and also sponsored results. So you get the algorithmic search results, which can be web, images, news, and so on, and the sponsored search results. However, this has changed in the last 3 or 4 years: when you search for something, you get the sponsored results at the top, and only after that do you start to see the relevant results. 
Sometimes, when I search for airlines, for example British Airlines, you will get a sponsored result for British Airlines, and after 3 results you get British Airlines again, but not sponsored. If you click on the first one, you make British Airlines pay something to Google; if you click on the second one, they will not pay. So it depends how you like them: if you don't like them, keep clicking the sponsored ones, and make them pay more. Another thing: even when you search for stuff like 'double bed' or something, you get sponsored results at the top, with shopping results coming out as well. And if you search for the same query we showed here, you will now also get this kind of AI summary of the results, produced by what are called RAG systems; we will have a lecture about this at the end. You get the results from the normal search engine using the same techniques we discussed, take the top results, and then summarise them using AI to give a quick summary, so you don't have to open the links. OK. Web search engine basics. So we said it's a connected web. What is the first step to build a web search engine? Crawling: you have to crawl the data first, with what we call a web spider. This is the main thing that allows you to collect data from the internet. You use it to keep collecting as much data as you can, then put it into your indexer and create your indices. For the web, it will not be only one index that you load into memory and you're done; of course, there will be multiple indices, and you have to think about the data centres you're using. Documents which you think people in the US will be looking for more should be closer to the US. 
Documents in Russian should be closer to Russia, and Chinese ones in China, so you have to arrange this kind of networking. And then you have the ad index, with the people who pay to get their results at the top, the sponsored results. So when the user searches for something, you get results from both and you show a combined set of results. This is, simply, how it's done. User needs in web search: why do we search the web? Actually, let me ask you: why do you search the web? What are the purposes of searching the web? Information seeking, it's easier — yes, but what kind of target do you have? Would you like to learn about something, or are you looking for a specific thing? Why do you search, in general? What are the different purposes of searching? Yes, question answering: you need to know specific information about something. Yes, what else?

SPEAKER 5
A specific web page?

SPEAKER 0
You are looking for a specific web page, like British Airlines: I'd like to find this one. It's not an answer to a question; I just need to go to British Airlines. What else? Shopping, excellent. So people have tried to study this, and it was found that roughly 40 to 65% of queries are informational: you're looking to learn about something — question answering, learning about a certain topic, and so on. Another popular one is navigational, like searching for United Airlines: you want a specific website, so you search for it and go there. Another is transactional, which is like shopping: you're looking to download something, to buy something, to access a specific service — I need to transact in some way. And there are some grey areas, like just exploring what is going on, maybe top news and so on. So in general, the majority are informational, but some are just navigational — I'd like to reach a certain website, and I won't memorise all the websites, I just search for it — or transactional, looking to shop or carry out some transaction. And here comes search engine optimisation, which is something that has been around for years: people try to optimise their web pages to get them higher in the top results of the search engine. It's changing a little recently with the AI stuff, but people still do it, because it actually helps — maybe AI is even doing the search engine optimisation now. So if you don't want to pay Google to show your results at the top, are there any alternatives? This is one alternative you can use. Search engine optimisation is tuning your page to be ranked high when people search for a specific thing, and it's an alternative to paying the search engine itself. And it's performed by companies or webmasters. 
Some people will say, I'm an expert in search engine optimisation, to help make your page seen. If you're working, for example, in tourism as a travel agent, you have to make yourself show up when people are searching for it. And recently it has even been done by AI: you can actually ask an AI chatbot, I'm doing this kind of business, I'd like to have something on my page that allows search engines to rank it at the top, and then you can use that as well. Some of these techniques are perfectly fine to use; some people use them in a shady way, so we'll learn more about this. The simplest form of this, especially with first-generation engines, is to focus on improving your TF. So if the top ranking is about Maui resorts, then pages which contain that specific word, Maui, will rank. So what you can do is keep repeating the term in your web page to increase the TF. Of course it would be annoying to the users. So what happens is you put the terms as keywords in a meta tag in the HTML, or you put them in the same colour as the background so they will not be shown. The text is still there, the users don't see it, but somehow you're increasing the term frequency. However, the problem with this is that with the current machine learning methods it will not help a lot, because TF-IDF is just one feature, it's not everything. It might help you in the initial phase, but then we go to the re-ranking. Remember we discussed last time: you use the standard methods to retrieve the top X, maybe 10,000, and then you use re-ranking methods with machine learning to re-rank them, so this page will go down at that stage. So pure word density cannot be trusted as an IR signal, especially recently. There are other examples of manipulation. Remember the example of Wise and Revolut; this is a very good example.
You can create content on your page that is relevant but that will actually bring your competitors' users to your website. So for example, if I have an XYZ hotel in an ABC city, and I'd like to make a website about myself, what I can do is start mentioning stuff like accommodation, hotel, rooms, flat, travel sites, attractions, vacation, holidays in ABC. You put in all the terms you can think of. If you are providing family advice, and someone who has trouble in their family, with their spouse, is looking for advice, you can put in stuff like family, couples, parents, spouses, wife, husband, and keep adding terms that might be relevant. If you are a company selling umbrellas, for example, you can of course include the word umbrellas, but you can also add terms like raining, rainy, wet weather, wet day. This is why, when I search for the weather in Edinburgh and it's raining, I can get an advertisement, or my website will pop up showing that I'm selling umbrellas, because it's raining. I didn't search for umbrellas at all, but I'm suggesting them to the user. So you optimise for the search engine: think about the situations where your result will be important to the user and try to embed those kinds of terms. There is another way of search optimisation which is called cloaking, a black-hat technique. I simply create two versions of my web page. One is the one I show to the user, and the other one, it's like I put on a black hat, is one that only the search engine will see. And there I put whatever I think might be interesting, and for the user I just show something very simple, not that much.
So I'm serving two versions: one for the bot used by the web crawler, so my information will go into the index, and another one to show to the user. It can be seen as a kind of spam, but it's still accepted, because search engines know it's actually OK: they would like these kinds of index terms to be there, to be relevant, so it's still used. Which brings us to another thing about web search engines, which is duplicates. If I have a large collection of documents, when I search for something I can get many, many pages, all relevant, but very close to each other, sharing almost the same information. Showing all of these results, when the top 100 results for the user are identical, is bad. I would like to give the user some kind of diversity. So this brings us to duplicate detection: how search engines find out whether a page is a duplicate or a near duplicate of another page. There is strict duplicate detection, which is exact matching, and it's not that useful on its own, because if even one word changes, the pages are no longer identical, so detecting them would be harder. Which makes search engines focus a lot on near duplicates: how can we detect near duplicates, where these two documents are 80% the same, for example. Maybe two pages actually have the same content, but one of them just has a different last-modified date and everything else is the same. Approximate matching in this case uses a similarity threshold to detect a near duplicate: you can say, if the similarity is over 80%, I will assume them to be near duplicates. And it's not transitive. So for example, if A is a near duplicate of B, and B is a near duplicate of C, it doesn't mean that A and C are near duplicates as well, because if this pair is 80% and this pair is 80%, maybe A and C are actually only 70%, because different parts of the documents are matching, not the same parts.
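The near-duplicate test just described, shingling plus a Jaccard-similarity threshold, can be sketched in a few lines of Python (the 80% threshold and 4-word shingles are just the lecture's illustrative values, not anything a real engine is committed to):

```python
def shingles(text, n=4):
    """Break text into overlapping n-word shingles (word n-grams)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1, doc2, threshold=0.8, n=4):
    """Flag two documents as near duplicates above a similarity threshold."""
    return jaccard(shingles(doc1, n), shingles(doc2, n)) >= threshold
```

Because each pairwise score is computed independently, the relation is not transitive: A and B can pass the threshold, B and C can pass it, and A and C can still fall below it.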
So you have to be careful with this. The very common approach that is used is MinHash. You simply extract shingles, which are overlapping word n-grams, and you compute the Jaccard similarity between two documents; this is very fast, very efficient to calculate. Create some shingles, maybe 4-grams like this, and see what overlaps using Jaccard similarity. If you don't know what Jaccard similarity is, it's the size of the intersection of two sets divided by the size of their union. And from there, if it's over 80%, 60%, whatever threshold you decide, I will consider these as near duplicates, and then I will just mark them, so if the two results come up for the user, I can show one of them, maybe the one with the higher PageRank, OK? The thing is, shingles plus set intersection, computing the exact set intersection of shingles between all pairs of documents, is expensive and intractable. So there is an approximation using a cleverly chosen subset of the shingles: you compute the intersection on certain parts, a sketch, and get the result. So if you have documents A and B, you take the shingle set of A and the shingle set of B, sketch both of them, and calculate the Jaccard similarity on the sketches, so not on the whole thing, just parts of it. MinHash is one of the very popular algorithms for this, and it's super fast. Anyone tried MinHash before? OK, there are already implementations for it, so you can go to Python and find a library for doing MinHash matching, and you can use it; it should be straightforward. Which brings us to the most important initial step to create a web search engine, which is web crawling. It starts with what we call the web spider. The main idea here is that we start from certain web pages that we are aware of.
So we can start, for example, with CNN, Wikipedia, very famous pages, and then give them to what we call a web spider and ask it: can you please use the links in these pages and keep following these links, expanding and expanding as long as you can, until you get everything. Which brings us to the unseen web. At the beginning I might be aware of only 10 or 20 pages, which have a lot of links, and from there the web spider, we call it a web spider because it's like a spider moving in the net, tries to find a link, follow it, find the links there, follow them, and keep doing that. So the URLs are crawled and parsed: you get all the URLs in these pages, and you keep expanding. We call this the URL frontier: you keep collecting all the URLs in these pages and try to crawl them. However, you will end up with a big part of the web that is still unseen; you will not be able to collect everything anyway. Do you know examples of things that search engines will never be able to collect? Mhm. Private data, yes. Your bank account: you can access your bank account on the web when you go to sign in, but anything behind a username and password the crawler will not be able to collect. What else? Yes, paid content. Actually, you'll find that web crawlers often will be able to collect it, even for free, because, for example, if I'm a publisher and I have my stuff under a paywall, I want my stuff to be searchable on Google, so I will give Google access to collect it, so people will search for it and then the user can pay for it. So this content actually will be accessible to the web spider.
So paid content, at least the stuff from publishers, usually will be accessible to the web search engine, not everything of course, but many of these things, because you want the user to find it. What else? Some small independent businesses that are not very

SPEAKER 5
famous, so.

SPEAKER 0
This is true, especially if there are no links to these pages. If you have a page with no links to it whatsoever, then it would be really hard to find. Many examples of this are like your private profile, if your profile on social media is private. You'll actually find this as an option on most social media platforms: make your content accessible to web search engines or not. It's a checkbox you can tick on most of them, and it decides whether your posts on Facebook or TikTok are searchable by search engines or not. Online banking, and the cloud stuff: you have your images, you have your documents on the cloud. Google is not indexing it, even if it's on Google Cloud itself; they will not make it public for everyone. So this is private stuff. And of course there is the dark web, which is another version of the web that has its own browsers, like Tor. It is not indexed by search engines, because probably most of it is illegal, like drug dealers selling things. So there is a big part of the web we will never be able to collect. However, for most of the public stuff, the main purpose of the search engines is to collect all of it. So how does this happen? It begins with a known seed of URLs. One example, Wikipedia, is a very good starting point because it's big, it has many pages, and it has many, many links to different domains. Imagine I crawl all of Wikipedia and find all the links, in the references, everywhere. Then I collect all the links, extract the URLs they point to, and put them in the URL frontier. I have a queue: whenever I find a link, I put it there and keep appending.
I go to these links, collect whatever links are inside, put them in the queue, and keep collecting. I keep fetching the URLs in the queue and repeat. I keep repeating, as long as it goes, nonstop. However, what any crawler must do, it has to be polite. What do I mean by this? I should respect the implicit and explicit politeness considerations. One is: only crawl whatever is allowed, so respect what is called robots.txt. This is a meta file that is available on any website, which tells the search engine which parts of this website are allowed to be crawled. Some of it will be private information; the crawler will not even try to access those URLs, it will just collect the allowed documents. So robots.txt, this meta file that comes with any web page, tells you what is allowed to be crawled. The other thing is to avoid hitting these websites too often. So for example, I would like to collect CNN and be sure it's up to date. How often shall I collect this website? And if it has many sub-pages within this domain, how often shall I do it? Shall I do it a million times per second? I could actually bring the website down. So I try to be respectful and do it every few seconds, every minute, to avoid putting a lot of pressure on the website. The other thing is to be robust, which means being immune to spider traps, remember the link farms and things like this, and other malicious behaviour from webmasters. So be sure that when your web spider goes there, if there is a link farm or something, it is robust and doesn't get trapped in something like this. It should also be capable of distributed operation, designed to run on multiple distributed machines, so you don't do all of that, of course, from one machine.
You have many machines, clusters around the world, data centres, everyone doing its own part, collecting data. And be scalable: designed to increase the crawl rate by adding more machines. The number of machines serving Google now is probably double what was there last year; it keeps growing because the web is expanding and growing as well. And performance and efficiency: permit full use of available processing and network resources. A lot of efficiency engineering goes on here; I don't want to have one of the nodes in my cluster sitting idle, not doing anything. All of them should be working all the time. And this is why these data centres are huge, and the cooling of these systems requires a lot of energy. For example, Microsoft moved some of their data centres a few years ago to Siberia, so that instead of paying for cooling, the cooling happens naturally, and Google is building things in the middle of the sea to generate electricity from the waves and run its machines. The amount of energy used in this stuff is huge. So be sure that everything is being used and nothing is idle for a second. And fetch pages of higher quality first: you have to think about which ones are higher quality to collect first. Starting with Wikipedia is a good idea; starting with a blog post to collect data is a very bad idea. So think about which pages to fetch first. And freshness and continuity: the operation should continue fetching fresh copies of previously fetched pages. Even if I collected this page, I have to keep updating it, but again, I have to be polite, so I don't put a lot of pressure on it by updating it every second; I can update it every few minutes or hours, depending on the importance of the page. So you have to think about these kinds of things.
A lot of engineering is done with this kind of data. And the last one: extensible, adapt to new data formats and protocols, so TCP/IP, different formats, protocols like IPv6, all of this you have to be able to adapt to, to collect all these digital formats coming out. So the basic crawler architecture is something like this. This is the web; you start fetching data from the URL frontier. You fetch the URLs, use DNS to resolve them and get the pages from the web, parse them, and do the content-seen test: check whether I have seen this content before, did I crawl it before or not. If not, then you collect the document and put it in the index, and then use the URL filter, with the robots.txt rules, to be sure you're not collecting something that shouldn't be collected, then do the duplicate URL check, put new URLs in the URL frontier, and keep repeating this forever, nonstop. So the process is: pick a URL from the frontier; fetch the document at that URL; parse the document and extract all the links; check if this content has been seen before, and if not, add it to the index; then for each extracted URL, ensure it passes a certain URL filter test, that it's not violating robots.txt, and check if it's already in the frontier, to be sure you don't have something repeated twice in the URL frontier. Your frontier can include multiple pages from the same host, a host containing a domain and all the subdomains within it. You must avoid trying to fetch all of them at the same time. Be kind, be polite. If you find, for example, that CNN published 10,000 articles today, you don't have to collect all of them at the same time; put some time between them. But you must try to keep all the crawling threads busy, be sure that you're always collecting.
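The crawl loop just described can be sketched as a simplified, single-threaded crawler. Here `fetch_and_parse` and `allowed_by_robots` are hypothetical stand-ins for the real fetching, parsing and robots.txt machinery, and a real crawler would add politeness delays, distribution, and stronger content fingerprints than Python's built-in `hash`:

```python
from collections import deque

def crawl(seed_urls, fetch_and_parse, allowed_by_robots, max_pages=1000):
    """Simplified single-threaded crawl loop over a URL frontier."""
    frontier = deque(seed_urls)          # queue of URLs still to visit
    seen_urls = set(seed_urls)           # duplicate-URL check for the frontier
    seen_content = set()                 # content-seen test
    index = {}                           # url -> document text

    while frontier and len(index) < max_pages:
        url = frontier.popleft()         # pick a URL from the frontier
        if not allowed_by_robots(url):   # explicit politeness: robots.txt
            continue
        text, links = fetch_and_parse(url)
        fingerprint = hash(text)         # crude stand-in for a content fingerprint
        if fingerprint in seen_content:  # skip duplicate content
            continue
        seen_content.add(fingerprint)
        index[url] = text
        for link in links:               # enqueue unseen URLs only
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return index
```

Using a deque as the frontier gives breadth-first expansion from the seeds; real frontiers are prioritised queues that also enforce per-host time gaps.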
OK, I will take a pause for 5 seconds before collecting the next CNN article; in these 5 seconds, go and collect 1,000 other articles from different domains. Don't keep anything idle. Explicit and implicit politeness: we said explicit politeness is a specification from the webmaster on what portion of the site can be crawled, like robots.txt, and the implicit one is, even with no specification, avoid hitting the website too many times at the same time. And this is something you can find in robots.txt: it tells you, for a given user agent, what is disallowed on the site. For example: no robot should visit any URL starting with /yoursite/temp/, except the robot called searchengine, for which nothing is disallowed. So it gives instructions to the web crawler about what to do: the search engine named there can access some things the others cannot. Two main considerations here: politeness again, do not hit a web server too frequently, and priority and freshness, crawl some pages more often than others. Pages whose content changes often, like news, will be crawled every few minutes; pages which don't change a lot, maybe a Wikipedia page, can be updated every week or two. Actually, one of the problems, I'm not sure if you were young at that time, but when Michael Jackson died, it created a problem across many things, because first, there were many people trying to get onto the Wikipedia page to record that he had died and add information about him, so it actually crashed. And the other thing, web search engines could not keep up with these kinds of updates happening on the Wikipedia page. So these are the kinds of things you have to be careful with: when to access a certain website and update it very frequently.
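A robots.txt policy like the lecture's example, no robot may visit URLs under /yoursite/temp/ except the robot called searchengine, can be checked with Python's standard-library parser (the file content and bot name here are just that illustrative example, not a real site's policy):

```python
from urllib import robotparser

# Illustrative policy: everyone is banned from /yoursite/temp/,
# except the robot called "searchengine", which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler runs a check like this before every fetch.
print(rp.can_fetch("*", "http://example.com/yoursite/temp/page.html"))
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))
```

In production you would call `rp.set_url(...)` and `rp.read()` to fetch the site's real robots.txt, and also honour any Crawl-delay directive the file specifies.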
These goals may conflict with each other, of course. Simple priority queues fail: many links out of a page go to its own site, because there are many internal links to the same domain. So I'm following this link, it's on the same website, and I have to be careful that I still keep the politeness. And even if we restrict only one thread to fetch from a host, it can hit it repeatedly. So we have to be careful about these kinds of things and take all of this into consideration. I'm giving you this information, I'm not expecting you to build a web spider for a search engine, but it's good to know. Maybe if you're building a search engine that collects a certain domain, like news from a certain news website, and you would like to make it live, you should consider this. So if you create a search engine, say, I will index CNN, BBC, Reuters, and Agence France-Presse, do that and keep it crawling, but be sure to keep these things in mind: don't keep hitting them a lot. A common heuristic is to insert a time gap between successive requests to the same host, for example proportional to the time taken for the most recent fetch from that host, and to switch between hosts. So if you're crawling from CNN and BBC, maybe alternate: one page from here, one page from there, and keep going back and forth. So, the summary, here it finally comes. We talked about the history of web search engines; I think this lecture was more of a history class, but it's good to know how this has developed over the last 25, maybe 30 years now, changing from something very simple and keyword-based to something that started with PageRank and recently learning to rank and LLMs. Google is dominating by far, even with the huge hype after OpenAI came out with ChatGPT. Still, search engines have their own stuff, using the same techniques we have been discussing.
There are advanced techniques to re-rank results, but the basics about indexing, word matching and everything else are still there, at least for the initial step. We talked about the basics of web search engines and the usage of web search, how it is used in different cases, and search engine optimisation, how it's done and how it's being used, and web crawling using web spiders, how you can get as much of the web as you can and be smart and polite at the same time when collecting data. The resources for this: there are Chapter 19 in Textbook 1 and Chapter 3 in Textbook 2, but I would recommend watching these videos as well. It's a nice way of presenting how Google works; I would recommend that you have a look at them, they give you a nice, easy way of understanding it. Do you have any questions? Yes. OK, that's a good point. I don't know. I think by default it's a convention and a protocol on the internet, so crawlers should respect it anyway, unless you are a hacker, and they would not want to do that. So maybe hackers will do it in a different way, but a search engine will not want to lose its reputation just to get additional information that it shouldn't be accessing anyway. OK, yes, uh, for web crawling, if you start with

SPEAKER 4
a set of, uh, URLs in the dark web as the seed pages, can you index the frontier pages?

SPEAKER 0
Yes. What happened? I’m sorry, what is the question? If you have some URLs in the dark web as

SPEAKER 4
your seed pages, can you index the dark web?

SPEAKER 0
No, it shouldn't work, because the dark web has its own protocols. Even your browser, a normal browser like Chrome or Firefox, will not be able to access it; it has totally different protocols, and much of it is illegal stuff, so the web search engines have their filters. The easiest of them is simply not to access the dark web, so it shouldn't be a problem in most cases at least, yes.

SPEAKER 5
Uh, what happens if there's some URL that is added to, or removed from, robots.txt after you've already crawled the web page?

SPEAKER 0
I don't think it should be a problem, because at least it will try to collect it. If it isn't able to crawl it, then it's done. So maybe they try again; they should have a cutoff at some point: try it once, twice, three times, and if they can't access it, that's it. So it should be fine. Yeah, it will just be removed, it will be updated, yeah. OK, so for most of this work, you will not, of course, implement all of this, but one important thing is that for your group project you will be required to crawl some data. You can have a static data set, you just crawl it once and then you're done, or you can do a nicer thing, which is continuously crawling something and indexing it as you go in your project. So when you're doing something continuous, be sure you are respecting the privacy of the data and collecting only what you are allowed to access. This is one item we mentioned when we discussed the group project requirements: be sure that you're collecting only public data. Actually, I had a project about 4 years ago where the students contacted the NHS and got access to some documents which are not publicly available, and for us to mark the project, they gave us a username and password to be able to access these documents. So this can be an option, as long as you get consent from the data owner. Otherwise, focus on public data, collect it, and be polite when crawling, don't hit it a lot of times, OK? OK, good, thanks a lot. And it will be Bjorn from next week; again, I'm not sure if I'll be back, but you will find me, I'll be on Piazza anyway, OK? Best of luck.

Lecture 7:

SPEAKER 0
Hi and welcome to today's lecture in TTDS on comparing text corpora. So my name is Bjorn Ross. I'm a lecturer here at the University of Edinburgh in the School of Informatics, and I'll be teaching today's lectures and also some of the ones in November. I also teach another course, a postgraduate course on evidence, argument and persuasion in a digital age, and we are teaching, for the first time this year, a first-year undergraduate course in computation and social science. So you won't be taking that, but the next generation of students will. Um, so today I'll be talking about comparing text corpora. I've got two lectures back to back, and please do interrupt me, put your hand up if you've got a question, if you're here in the room, make use of that, that's why we're here. Also let me know if I'm speaking too fast or too slow, or if I'm not explaining something enough. Please do let me know. OK, so I'll be talking today about comparing text corpora, and also we will be releasing lab number 6, and then next week Walid will be teaching a lecture on web search. So the premise of today's lecture is you've got access to some text. And this time you're not searching for something specific in the text. In a lot of the previous lectures you've learned some fundamentals about working with text data, preprocessing text data, but also fundamentals of building a search engine, the idea being you can index this data really efficiently, and then if you've got a specific idea of something you're looking for, like you want to know how many of these texts talk about a specific topic, you can search for that directly. We're going to temporarily forget about this part for now. We don't have a specific thing we're looking for. We don't have a query, OK, so in today's lecture we don't have a query.
We just have a text collection and we want to explore this data collection, and we want to understand: what is it about, what does it say, what does it not say? And if we have two of these collections, how can we compare them? OK, so suppose you're given access to a new data set. You're graduating from university, you're going into your first job, and your first assignment is: look, we've got these two big text data sets. I know you took a course on text technologies and data science; you should be able to work with this. What can you tell me about these two datasets? And all you've got for now is one folder with lots of text files in it, and another folder with lots of text files in it, or maybe they're in some structured format like XML or JSON, but mostly it's really text data and very limited metadata. So you're supposed to compare these two corpora, these two datasets, and you want to understand, you want to quantify: what's the content of these documents, what are they about? And how does the content of the two corpora differ? You might be told something about what they're about. It might be that one folder is all the newspaper articles published in the Financial Times last year, and the other folder is all the newspaper articles published in The Telegraph. Or you might be told one of them is all the transcripts from one Netflix show and the other one is all the transcripts from another Netflix show, or something like that. Or you might not have been told anything about what they contain. What are some of the things that you would do to try and figure out what these huge datasets say? What are the contents? What are some things you could try? Any ideas? Yeah? Good idea, yeah, how would you do that? Yeah. Something like that could work really well.
So you could iterate over all the contents, over all the text files, and you could count the number of words; you could create an index, for example, you can index them and just count, for each word, how often it appears. Yeah, what else? Look at the names of the files. Look at the names of the files, maybe the creation date, any sort of metadata you can find out about the text files, absolutely. What else about the text content itself? What else can you do apart from counting frequent words? Any other ideas? So you could visualize the most frequent words, right, you could create things like word clouds that show very common words in a very big font and rare words in a smaller font. You could do things like actually reading examples, right, you can go through actual examples of texts and build an understanding yourself of what the texts are about, and that will probably give you a much more nuanced understanding of the text than you would get just from calculating the most common words. You could identify which language they are in, either yourself, or if you don't know the language, try and pass it to some system that can tell you which language the texts are in. Calculate things like the average number of words per document, or the average number of words per paragraph, or the average number of paragraphs per document, and these kinds of things tell you a lot, right? So each Twitter post or X post is a maximum of 280 characters. But if you've got reviews on Amazon, then they tend to be much longer. And if you've got entire books, or the subtitles to an entire Netflix show in a single file, that's going to be a lot more text. What else could you do? You can ask an LLM to summarize them. So I actually did this with some data about the 2024 US elections. I gave it two newspaper articles.
Um, I think this was Claude, yeah, this was Claude, by Anthropic, and I asked it: summarize the two newspaper articles and tell me what the differences between them are. And the results were pretty good. I mean, this is something you can actually use. But we're teaching, we're starting here from the ground up, right? So what are some reasons why you might want to do this, and what are some reasons why you might not want to do this? It could hallucinate, exactly; it could make stuff up, and you just don't know whether to trust it. And can we do that at scale, for hundreds or thousands of documents? It's very limited, it's very computationally expensive, so if you're working through an API, they're going to limit how many requests you can make per second, per minute, before you have to start paying. You might have to pay, it might get very expensive. You can run the models locally, but even then you need strong GPUs and things like that. What else? Uh, for small samples it's very quick. Yeah, absolutely, it is quick. Confidential? It might be, we don't know what the data is about. It might be patient health records, and that's the last thing you might want to upload to a service like this, where you're literally transferring data to some server belonging to some other company. Of course, you can get around some of the privacy limitations by setting up your own LLM locally on a compute cluster, but even then you've got issues like hallucinations, and you've got the issue that the answers can often be a bit vague, they can be hard to reproduce: if you ask 5 LLMs, they give you 5 different answers. Or if you ask the same LLM 5 times, it might give you 5 different answers. So what we're going to look at for today, for now, we'll later look into how some of these technologies that underpin LLMs can also be integrated into our workflows, but for now we'll start looking at actual numbers. OK, we want to calculate actual numbers.
We don't want just plain text summaries of what's going on. We want hard numbers that are deterministic and reproducible and specific, so we can say things like, this topic appears 3.4 times as often as this other topic in the text, these sorts of things. So we want to have a systematic approach, an almost scientific approach. We'll think like a scientist, OK? We don't just want a general summary, the way that I might summarize some document collection to you in natural language, but we want some actual hard facts that we can then cite, and we can say in our report, this is how often this appears and this is how often that appears. Here's a table to compare the two. Here's a data visualization. So that's the sort of approach we'll take. So we want to be systematic about how we're analyzing text corpora. Um, and we'll start by talking a bit about how this stuff would have been done traditionally, in the era before there were computers, basically: content analysis. We'll look at word-level differences, exactly things like counting the number of words and calculating differences between two different corpora in things like counts. We'll look at dictionaries and lexicons, which is when you've got an idea of a specific set of words, which we call a dictionary or a lexicon, a specific set of words that we're looking for. And we'll look at topic modeling, which is an approach that we can use when we don't have an idea of a specific set of words that we're looking for. So we really don't know what's going to be in the data, and we let those insights about what's in the data, these topics, emerge from the data. And then we'll briefly talk about annotation and classification, but really that's the topic of the lecture in 2 weeks.
So content analysis is the first part here, and I'll only talk about it briefly as a sort of historical overview, because that's what would have been done traditionally before we were able to automate some of this analysis. So the idea is, given some documents, we want to decide what types of content are present. What are the main themes, what are the main topics, and also which documents specifically contain which topics. So in this traditional, manual process, what we would have done is actually read a subset of the documents (we still do this sometimes, to be fair) and define specific themes and specific topics that we see emerge from the data. Sometimes we might have an idea, before we start looking at the data, of what kind of topics we'll find, and then we can create this kind of taxonomy or hierarchy a priori, before looking at the data. Sometimes we have to let this emerge from the data, and we see, the more documents we read, the more topics we note down, OK, and then maybe this one appears quite often, can we subdivide it into two, and these two, maybe we have to merge. And then we identify a consistent methodology for deciding that this specific document belongs to this topic, a so-called coding methodology. It's called coding, but it's got nothing to do with programming. Um, so you'd have experts reading these documents and agreeing and arguing with each other until they hopefully finally agree on how we can assign a document to a topic. Then read all the documents, label them according to these codes, check how often the experts actually agree with each other (and hopefully agreement is high), potentially settle any disagreements via a third party, maybe a third expert. And finally, we can analyze the resulting annotations.
So if you're going through a big corpus of newspaper articles and you don't know which ones are about sport or politics, you get the experts to read them, they discuss with each other, and yeah, OK, if it mentions specific politicians, it's probably about politics, that kind of thing. And finally you've got your results. Now, can you automate this process I've just described? If so, how? Any ideas? So the answer is of course yes, otherwise we wouldn't be here. Well, you can to an extent. Um, because, for example, you can count things like words, right? If you've decided that if the text contains Keir Starmer, OK, it's going to be about politics, and if the text contains Messi, it's going to be about football, that kind of thing. Should you automate it? Well, that probably depends on what you want to do. Still, these days humans tend to be better than machines at this task, for now, we'll revisit that in a few years. Um, but of course computers are much, much, much faster, especially for the simpler algorithms. And basically, getting actual experts in a room and paying them to read lots of documents is really, really expensive and takes a long, long time. So the average human reading speed is maybe 250 words per minute. So if you assume 1,000 words per document and 50,000 documents, this is going to take the average person over 4 months to read, which is a lot of time. Modern computers can process lots and lots more, so there's a good reason, there's a trade-off here, OK, between accuracy and speed, and we're choosing speed for now, because we're assuming we have huge document collections that are far too big for any human to read. So if you're just comparing two paragraphs, a human is going to be able to tell you much more, especially an expert. Um, but if you're comparing two document collections with 5 million documents each, that's the sort of use case we're talking about here.
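That back-of-the-envelope estimate is easy to verify; the 250 words per minute, 1,000 words per document and 50,000 documents are the assumptions from the example above:

```python
# Back-of-the-envelope estimate of manual reading time.
words_per_minute = 250        # assumed average human reading speed
words_per_document = 1_000    # assumed document length
num_documents = 50_000        # assumed collection size

total_minutes = num_documents * words_per_document / words_per_minute
total_days = total_minutes / 60 / 24  # even reading around the clock

print(f"{total_minutes:,.0f} minutes, about {total_days:.0f} days")
```

Even reading around the clock, that comes to roughly 139 days, so the "over 4 months" figure holds; at a normal working pace it would take far longer.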
So we're going to automate the content analysis, and there are a couple of different ways of doing that. One is word frequency analysis, which is what you suggested, so we literally count the words: how often does each word appear? One is dictionaries, also called lexicons. We might already have an idea what words we're looking for, essentially a list of words we're looking for. We call this list of words a lexicon. Or we might use topic modeling, which is where we don't have that fixed idea. And you can extend these ideas. You can calculate all of this for a single corpus or a single class. I'm going to treat the words corpus and class more or less interchangeably here; they basically refer to document collections. And you can do the same thing for multiple corpora or classes and compare two of them. Using word frequency analysis, we're then talking about word-level differences; using dictionaries and lexicons, we're talking about things like dominance scores that you can calculate. And with topic modeling, you can look at topic-level differences between the two. And we'll see how each of these works in practice. So let's start with the first one, on the top left, the single corpus or class, and we look at word frequency analysis. This is essentially a formalized version of counting which words appear how commonly. So the idea is we use this as a very simple starting point. It's a good first thing to try with any document collection. We can pre-process as usual, like lowercasing and stemming. We can count the words, so which word appears how often. Um, we normalize, maybe by document length, so that we take into account that some documents tend to be much longer than others. So instead of saying, um, that maybe a word appears a total of 5 times in this corpus.
Or 500 times in another corpus, we could say that maybe 0.3% of the words in this corpus are this word, and this way we've normalized by the length of the corpus as a whole. We average it across all documents, and we can then visualize this in a very quick way, so you can see how this is extremely efficient to do, because we only have to iterate over our document collection once, we don't have to do any complicated calculations. Um, and we can visualize it in this way, so, um, this for example is a word cloud of the Wikipedia page of the University of Edinburgh, and it very quickly communicates to you something about the main themes on this page. It doesn't tell you whether it says it's a good university or a bad one, or what the students do, or what they're learning about, but it very quickly gives you a general set of ideas of what this page talks about. So it's a very quick-and-dirty way of visualizing the information in the text corpus. Now, extending this idea to two corpora that we want to compare, we can ask which words best characterize a set of documents, like a corpus or a class. So we're directly comparing two. Um, and what we need to do here, whenever we're comparing two or more, is have a sort of reference corpus, so we have an idea of: this is the thing we're comparing against. If we have two corpora, we might compare them against each other, or we might compare them with a third one. That third one, that reference corpus, can often be something that we think of as representative. So when you want to know what characterizes newspaper articles about sport, you can say your reference corpus is the set of all newspaper articles. Now you're comparing the articles about sport with all articles, to see what is unique about the sport articles. You can generally do a lot more if you have something to compare against.
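A minimal sketch of this normalized word-frequency analysis, with made-up toy corpora; the same function also supports the corpus-versus-reference comparison just described:

```python
from collections import Counter

def word_frequencies(docs):
    """Relative frequency of each (lowercased) word across a corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy target corpus (sport articles) and reference corpus (all articles);
# the texts are invented for illustration.
sport = ["the match was great", "the cup final match"]
all_news = ["the match was great", "the cup final match",
            "the election was close", "the minister spoke"]

target = word_frequencies(sport)
reference = word_frequencies(all_news)

# How much more common is each target word than in the reference corpus?
for word, freq in sorted(target.items(), key=lambda kv: -kv[1]):
    ratio = freq / reference[word]
    print(f"{word}: {freq:.3f} vs {reference[word]:.3f} (ratio {ratio:.2f})")
```

Words whose ratio is well above 1 (here, "match" or "cup") are the ones that characterize the target corpus; function words like "the" come out close to 1.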
Um, yeah, this is, for example, because some words are just common in any text, so if you don't do the preprocessing with removing stop words and things like that, the word cloud will just have words like "the" and "of" and "is" being really common. But what you can do once you've got a reference corpus, instead of having a static list of stop words, is actually ask how much more often this word appears in my document collection of interest, for example the newspaper articles about sports, versus the reference corpus. So now you don't have to have this static list that says exclude the word "is". You can just see, OK, how often does the word "is" appear in my sport articles? OK, it appears in 1% of words. How often does it appear in all articles, my reference collection? 1% of words. OK, not very interesting. Which words appear much more often? That's the interesting part. So let's define a reference corpus, um, and then once you've done that, some methods that you can use to find the word-level differences are mutual information and also chi-squared. Um, these can also be used for feature selection, and I'll talk a bit more later about what that means. So mutual information basically answers the question, how much can I learn about Y by observing X, where X and Y are random variables. Mutual information here is short for expected mutual information; it's also the same as information gain. And, confusingly, it's different from pointwise mutual information, which we're not going to cover here. So within the course and in the exam, we'll use this phrase, mutual information, consistently. It's just that when you see it appear somewhere else, you should ask yourself: are they talking about expected mutual information, which is what we're learning about here, or are they maybe talking about pointwise mutual information, which is a different thing? So yeah, we want to learn about important words in our corpus.
And we do this by using this method that was actually developed for understanding the relationships between two random variables. Now, what do the two use cases have in common? What are the random variables? In our case, does anyone have an idea what X and Y could be? What are the random variables when we're talking about comparing words between two corpora? So basically, X are the words and Y are the classes. So X is a Boolean random variable; it captures whether or not a document contains a term that we're interested in, term t, like the term Messi. And Y is also Boolean, and it captures whether or not a document belongs to the target class. So it captures whether or not the newspaper article we're looking at is a sports article or not. So, um, I'll convert these X and Y to U and C, because then the equation is the same as in the information retrieval textbook. Where, intuitively, when I see that a newspaper article contains the words "and" or "new" a lot, it doesn't tell me much about whether it's about sport or not, um, but if I see that the article contains the word Messi a lot, it does tell me a whole lot about the topic of the article. So, more formally, the presence or absence of the word is considered a random variable X with two possible outcomes, 0 or 1, and the class of a document is considered also a binary random variable with the outcomes 0 or 1, so either it is in the class I'm interested in or not. And mutual information is now a way of quantifying how much I learn about the outcome of random variable X when I can observe the random variable Y, or vice versa. So in information theory we'd say: how much information do I gain about one random variable from observing the other? And this is also how it gets its name, mutual information or information gain, and it's essentially a measure of how different two probability distributions are.
So we can pick apart this equation here, starting from the right. When I say that U = e_t, I'm just saying that the document we're looking at either contains the term t or it doesn't, and when I say C = e_c, it means the document either has class c or it doesn't. So the equation is I(U;C) = sum over e_t in {0,1}, sum over e_c in {0,1} of P(U=e_t, C=e_c) log2 [ P(U=e_t, C=e_c) / (P(U=e_t) P(C=e_c)) ]: we divide the joint probability by the product of the probabilities, we take the logarithm, we multiply by the joint probability, and we sum across the different combinations of U and C. In the case of two Boolean random variables there are four different possibilities, so 00, 01, 10 and 11. And when two events are independent, the joint probability is the same as the product of their probabilities, so the log term becomes zero when the two events are independent, and therefore the entire equation becomes 0. Therefore, mutual information is zero when the two variables are independent, when the term's distribution is the same in the class as it is in the collection of all documents. So for the word "and": if it appears in 2% of sports articles and in 2% of all articles, the mutual information between "and" and sport is zero. Um, yeah, and it's maximal when the presence or absence of the word tells us exactly whether or not the document's in the class. So if the word Messi appears only in sports articles, then we learn immediately, once we see the word Messi, that it's a sports article, so mutual information would be maximized. And this is why words with high mutual information scores are really useful, because they tell us a lot about which class the document's in. Does that make sense? Yeah. I'm a little bit confused on what you mean by a document has a class C. OK, so picture a situation like a newspaper article. Imagine every newspaper article is either a sports article, or it's a politics article, or it's an entertainment article.
There are no other classes beyond these 3; we're looking at a newspaper that has these 3 and no others, and no article is a member of more than 1. That's the sort of scenario we're looking at. So now we have 3 classes: politics, entertainment, sports. So a given article we're looking at is either a sports article or it's not. And that's what the random variable measures. So it's just saying that either this document is in this class or not. Exactly, just like a classification task, exactly. And what we're measuring here is the association between a word and a class. So if we see the word in the document, how much does it tell us about the class of the document? OK, and we can do this with an example, actually. Um, so given the corpus and a term, how would we actually estimate this number, the probability of the term appearing in a random document in the corpus? Um, basically, if you have 200 documents and you can do any sort of analysis you would like, how would you estimate the probability that, if you took a random document out of the corpus, it would contain the word butter? You count, right? You look at which of the 200 documents contain the word butter and which ones don't. So that's your maximum likelihood estimate of the probability of U equaling e_t or C equaling e_c: it's based on counting. Um, and this is why we can use counting to compute mutual information scores. So N11 here is the count of documents that contain the word and are in the class. And N01 is the number of documents that don't contain the word but are in the class. The dot just means summed over both values, 0 and 1, so it's a total. So N.1 is N01 plus N11. Um, and yeah, so that's basically the equation. And if we have these numbers, for example, to make it concrete, we're going to compute the mutual information for one term in one class.
And in this case, um, we're saying there's a term, and the term is the word export. And we have a class, and the class is poultry, in our set of documents. So we're assuming some of the documents are about poultry and some of them are not, just like sports in the newspaper example. We want to know how much the word export tells us about whether the document is about poultry or not. And we've got N11 = 49: 49 documents that are in the class poultry and contain the word export. We've got N01 = 141 documents that are in the class poultry but don't contain the word export. We've got N10 = 27,652 documents that contain the word export but are not in the class poultry, and we've got N00 = 774,106 documents that are neither in the class poultry nor contain the word export. Um, now we've counted all the documents, these are our Ns, and we can also count the total number of documents, N, which here is just the sum of all 4, because there are only these 4 cases, right? So what is the mutual information of poultry and export? You can work this out, please. I'll give you 2 or 3 minutes. And then we can walk through the solution together. Is export the class or the word? Export is the class. Hang on, export is the word. Export is the word, poultry is the class. Yeah. OK, let's move on. So how do we work this out? Remember, N is the total number of documents, so these are document counts, and there are only 4 cases. If you sum these 4, you'll get N. N.1 is our shorthand for N01 plus N11, so it's the sum of the two, and likewise N1., for example, is our shorthand for N10 plus N11, so it's the sum of the two. And if you plug these numbers in, you should have got something like this. Does that look like what you got? So it's exactly the same equation as here, but with the maximum likelihood estimates plugged in, uh, yeah, with the real numbers
from poultry and export. So we've got a number here which doesn't tell us much, OK? There's no obvious interpretation of this number: 0.0001105 is the mutual information between the class poultry and the word export, and it doesn't tell us much by itself. But what we can do now is calculate this for all classes and all words, so for all pairs of classes and words. And then we can sort. So for each class, we can sort the list of words by their mutual information scores with that class, and we find out that in the poultry class, the top words are poultry, meat, chicken, agriculture, avian, broiler. That tells us a lot about what the documents in that class are about. In the UK class, london, uk, british were the top words, that kind of thing. OK, sports class: soccer, cup, match, matches. So the idea is, if you do this for a bunch of different tokens and a bunch of different classes, just this list of top words actually tells you quite a lot about this news data set, yeah. Did you stem the words? In this case, I'm not actually sure, because I didn't do this calculation myself. Um, there's exporters and export. Yes, I think in this case probably there was no stemming. So you can do this with just lowercasing, no stemming. It was obviously lowercased, because UK appears in lowercase. You can do it with stemming, you get slightly different results. There's no one correct way of doing it. If you do stem, obviously you'd want to keep a record of what you stemmed and how, because if you're looking at a stem, it's not always intuitive. But yeah, so in this case it looks like no stemming was done. OK, so here's another option. Chi-squared does something very similar. It's an approach based on statistical hypothesis testing. If you've taken a statistics class before, you'll be familiar with hypothesis testing and also with the chi-squared test as a test of the independence of two events.
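The mutual information calculation from document counts is easy to script. A minimal sketch, using the export/poultry counts from the worked example (which match the example in the Introduction to Information Retrieval textbook):

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information I(U;C) from 2x2 document counts.

    n11: documents containing the term AND in the class
    n10: documents containing the term, NOT in the class
    n01: documents in the class, NOT containing the term
    n00: documents with neither

    Uses I(U;C) = sum over cells of (Nij/N) * log2(N*Nij / (Ni. * N.j)),
    i.e. the maximum likelihood estimates plugged into the MI equation.
    """
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # term present / term absent totals
    n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class totals
    mi = 0.0
    for nij, ni, nj in [(n11, n1_, n_1), (n10, n1_, n_0),
                        (n01, n0_, n_1), (n00, n0_, n_0)]:
        if nij:  # an empty cell contributes 0 (0 * log 0 = 0 by convention)
            mi += nij / n * log2(n * nij / (ni * nj))
    return mi

# Term "export", class "poultry", counts from the lecture example.
print(mutual_information(n11=49, n10=27_652, n01=141, n00=774_106))
# ≈ 0.0001105
```

To reproduce the "top words per class" tables, you would run this for every (word, class) pair and sort each class's words by their score.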
Are we sure this mic is working? Because I can't... there we go. OK, I was scared of giving a whole lecture and then in the recording you can't hear me, or something like that. I think that happened before, so I wanted to avoid the same thing. OK. If you've taken a statistics class before, you know chi-squared is the test of the independence of two events. So the independence of two events is defined as a situation where the probability of event A and event B both happening is equal to the product of their probabilities. An intuitive way of thinking about this is that knowing the outcome of one event doesn't tell us anything about the outcome of the other event. So if you know that I'm rolling a dice: whether or not I roll this dice doesn't tell me anything about what the weather's going to be like tomorrow. That's an example of statistical independence. So, as usual in null hypothesis testing, we don't actually expect this null hypothesis to ever be really true. In the case of the dice and the weather tomorrow it obviously is the case, but in the case of words and document classes there are very often some relationships. What we actually want is a measure of the strength of the relationship. It just turns out that this test, designed to test this null hypothesis, is a really good measure of the strength of evidence against it. So we're really looking for evidence that we can reject the null hypothesis, evidence that the class and the word are not independent of each other. Um, so again we can think of this as using the same kind of variables as before, the two Boolean variables, 0 or 1: does the document belong to the class? Does the word appear in the document? Um, and a simplified way of calculating this is using counts, just like before.
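The same 2x2 counts can be plugged straight into the standard shortcut formula for the chi-squared statistic of a 2x2 contingency table. A minimal sketch, reusing the export/poultry counts from the mutual information example:

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    Shortcut form for 2x2 tables:
        X^2 = N * (N11*N00 - N10*N01)^2 / (N1. * N0. * N.1 * N.0)
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator

# Term "export", class "poultry", same counts as before.
print(chi_squared(n11=49, n10=27_652, n01=141, n00=774_106))
# ≈ 284, far above 3.84 (the 0.05 critical value for 1 degree of
# freedom), so we'd reject the hypothesis that term and class are
# independent
```

As with mutual information, sorting each class's words by this score gives a table of the most class-characteristic words.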
I'm not going to make you do this now, but it's worth trying it out at home to make sure that you can calculate it. It's a very similar procedure, slightly different numbers, um, and you can work this out for the same example data, and the solution is here as well. But do make sure that you're able to do it: that if you're given the formula and you have this table, you're able to calculate it. It is a useful thing to do, and a quick thing for a computer to do, and just like with the mutual information example, with chi-squared we can calculate this table of the top words for a class, which is quite useful. So we've seen two different methods of basically looking at how often different words appear in different corpora, and, uh, yeah. Now we're going to talk about dictionaries and lexicons, and this is the situation where we now actually know what kind of words we're looking for. So we have an idea of what kind of concept we're expecting to be present in the data, and we have an idea of what kind of words people use to represent the concept. We might have an idea, for example, that we're looking for sentiment, OK? We want to see if positive or negative emotions are present in the text. So we look for things like positive words like good, great, or happy, and negative words like terrible or horrible or nasty. And what we can do now is essentially count how often these words appear. But we need to be careful that we choose the right dictionary, so the domain can be important. So if you've got a movie plot in reviews of movies, and it says that the movie plot is unpredictable, that's probably a good thing, but if you're looking at product reviews, and a coffee pot is being described as unpredictable, that's probably a bad thing. So you want to make sure you have the right dictionary for your specific domain, that it makes sense.
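Counting lexicon hits is simple to sketch in code: the per-document score is just dictionary words divided by total words. The tiny positive and negative word lists here are made up for illustration; real lexicons have thousands of entries:

```python
# Tiny illustrative sentiment lexicons (real ones are much larger).
POSITIVE = {"good", "great", "happy", "excellent"}
NEGATIVE = {"terrible", "horrible", "nasty", "bad"}

def lexicon_score(text, lexicon):
    """Fraction of the document's tokens that appear in the lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

review = "the book was great and the ending was excellent"
print(lexicon_score(review, POSITIVE))  # 2 hits out of 9 tokens ≈ 0.22
print(lexicon_score(review, NEGATIVE))  # 0.0
```

Dividing such a score for a target corpus by the same score on a reference corpus then gives a dominance score for the category, the comparison the lecture turns to shortly.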
People use completely different positive words when they're talking on social media than they use in formal writing, these kinds of things. So you can find hundreds of lexicons on all sorts of topics online. You don't generally have to go and define your own lexicon. When you're building a system, like in Coursework 3, and you're looking for a lexicon, very often you can find an existing one that someone has already created. Um, it's been published in the scientific literature, on blogs, or elsewhere. Uh, so one example of this is sentiment analysis lexicons, but there are a ton of other types of lexicon. And yeah, so if we now want to get an actual score per category from this lexicon, one simple way we can calculate such a score is we just divide the number of dictionary words in the document by the total number of words in the document. So we can say, if you've got 1,000 words in the document and 50 of them are in the positive list, then we have a score of 0.05. Again, this can also be used as features, and I'll talk a bit later about what that means. Um, yeah, if you're curious about more advanced approaches to quantifying categories, there's some optional reading here. Um, here's a list of a bunch of dictionaries; there are a lot more, but just to give you an idea of the sort of things that people can find in dictionaries. So LIWC (Linguistic Inquiry and Word Count, pronounced "Luke") is one example, and it's got a whole bunch of dictionaries on various things, so not just positive and negative sentiment. Um, there are dictionaries going all the way back to the 90s. There are things like VADER and SentiWordNet, which are more about sentiment, EmoLex, which is more about other emotions in general, um, the personal values lexicon, yeah, a bunch of lexicons. And, um, here's an example of something that people have done with them. So this is researchers who were looking at rumors. They were looking at rumor tweets.
Some of these rumors were actually true, some of them were false. And what is plotted here are the reactions to the rumor tweets. So the reactions were analyzed using the emotion lexicon EmoLex, and the reactions to true rumors are represented with green marks, and the reactions to false rumors with red ones. And you can see, for each of the different emotions, how they differ between false and true rumors. So, for example, for the surprise emotion, the reactions to false rumors have a much higher value than the reactions to true rumors. So what they were able to do here with this lexicon, with the list of words, essentially the list of words that people use to express surprise, is find out that in this case reactions to false rumors indicate a much higher level of surprise than reactions to true rumors. And the same thing for a number of different categories. For each of these categories here, for disgust, for fear, for anger, there's a list of words, there's a lexicon in the EmoLex dictionary. We can also calculate the dominance score for a category with respect to a corpus when we want to compare two different corpora using dictionaries and lexicons. So this is also a very simple thing we can do when we have a target corpus and a reference corpus, and we want to know how this emotion, for example, in the target corpus compares to the level of that emotion in the reference corpus: we divide one by the other. Um, so here's an example that looked at deception in language. The authors of this research paper looked at street interviews you can find on YouTube, and also TV interviews, where they knew that people were lying. And the reason why they knew that people were lying was because they asked them about made-up things.
So there's this YouTube channel where the interviewer goes around on the street and asks random passersby questions about, for example, films that don't actually exist or events that aren't actually happening. So they might ask people, are you enjoying the Olympics, even though the Olympics aren't actually on right now? It turns out, I mean, some people say, what are you talking about? You know, I'm a fan, I know the Olympics were last year, they're not on for another 3 years. But some people actually say, oh yeah, I'm watching them, they're really great. And this way you know for a fact they're lying, right, because you know they're not happening. Or, what did you think about the new film with, I don't know, Ryan Gosling that came out last week? Some people might say, I don't know what you're talking about, but if someone's saying, yeah, I watched that, it was great, you know they're lying, OK? So now you can check which words they are using when they're lying. And they did the same thing for actual trials, where they looked at what the statements of the witnesses and defendants said. Um, and there they knew who was lying because they knew who was found guilty or not. Uh, so they analyzed the words that people used and looked at the differences between truthful and deceptive statements, and the scores that we see here are from LIWC, which is this very big resource with lots of dictionaries, like a dictionary for metaphors: which words do people use when they use metaphors? Um, the category money: which words do we use to talk about money? And things like that, but also things like family, or you, which is just second-person pronouns. Family is words for relatives and family members. And it turns out some of the results are quite interesting, like:
the people who are making deceptive statements, compared with the people who are making truthful statements, are more likely to use words of certainty, OK? They're more likely to say things like, I'm sure, I'm certain this happened, even though they're actually lying. But this, the researchers think, is essentially a way of covering up the fact that they're lying. Whereas the people who were not lying didn't feel like they needed to use those words, I guess. You can see the scores: a score of 2.28 here means that the words from the certainty category are more than twice as common, they're 2.28 times as common, in the deceptive statements than they are in the truthful statements from trials. So it's a ratio. And it's a nice analysis that can tell you quite a lot about this data and what you can learn from the data. Good. So what if you don't know which words you're looking for, and you want to do something that's a bit more interesting, a bit more sophisticated than counting individual words, like what we did at the beginning? We leave the predefined groups of words behind us again, and we also leave behind this word-level analysis where we just use individual words in isolation. Because in reality, when a document is in the class poultry, it's likely to contain the word poultry, but it's maybe also likely to contain words like duck. So we can summarize that, we can call that a topic, even when we don't actually have ground-truth classes that the documents are assigned to. So the idea here is topic modeling. We want to know what the main topics are in a document collection and which documents contain which topics. Um, so the goals are quite similar to the traditional content analysis I talked about at the very beginning of the lecture:
What are the main themes or topics in the corpus, especially when we don’t have a predefined set of classes, like in the newspaper example? And which documents contain which topics? So, for an example of what you can do with one of these topic models, let’s say we have a news article, Expected soon: first ever photo of a black hole. It would be nice to know the main things that are talked about in the article, because you could use this to tag articles to make them searchable, or to group similar articles together. And the output that we get from the topic model looks like this. It will tell us something like, just over 40% of the document is about astrophysics, between 30 and 40% of the document is about photography, and about 20% of the document is about optimism. And these three are topics that haven’t been defined by us; they emerge from the topic modeling, so they’re not predefined. So you might wonder, why do we have a topic called optimism, is that really a topic? But that’s really not up to us. If there are a lot of words that convey optimism, and they all tend to appear together in documents, then this topic will form naturally in the data when we use topic modeling. And this is why topic modeling can tell us so much about a data set: it can tell us which topics have formed in this data set. So one of the outputs of topic modeling is that, given a document, we can get a distribution of topics in this document. When I say topic, what I really mean is words that tend to appear together in documents. So here’s an example of some topics that were learned from scientific articles, where each of the columns
is its own topic. So when you run the topic model over the corpus you have, you look at the top words for each topic, this is what you get, and hopefully they should start to make sense. So the first one here seems to say, well, words like human and genome and DNA tend to appear together in scientific articles, so it seems to be articles about genetics. The second one, maybe about evolution: species, organisms, this seems to be articles about evolution. And hopefully you can see in each of the four columns that there are some commonalities, and these now enable us to come up with names for the topics. So we get this from the algorithm, we get these lists as output, and I’ll talk in the second lecture today about how we get those, what the algorithm actually does, but this is what we get. And then coming up with a label for the topic, a name for it, is something we have to do ourselves. Here’s another example. We can look at this over time, so we can see how prevalent the topics are over time, if we’ve got timestamped metadata associated with our documents. Here’s a collection of scientific articles from the 1880s to the 2000s. If we identify a topic like this, we have a lot of words, and we decide, OK, they look like words related to theoretical physics, so we label it theoretical physics, and we can now produce a plot like this. Or maybe we have words that are related to lasers, words that are related to relativity, and words that are related to force, and we can look at how often these topics appear in the articles and how this changes over time. So it’s a very effective way of visualizing which topics were common when, and we do this all the time for things like tweet collections. We want to know when people on social media talked about these topics and those topics, and we produce exactly these kinds of visualizations.
So this is one way of thinking about topic modeling: as a way to enable us to create these kinds of visualizations. Another use of topic models is just to reduce the dimensionality of the data we have. So if we’ve got a big count matrix, where P is the number of words, the size of our vocabulary, and N is the number of documents, you can imagine that a count matrix like this is quickly going to be really, really big. And with topic modeling we can reduce this P by N matrix to a K by N matrix, where K is just the number of topics, and this will enable us to work with a much, much smaller matrix and still represent much of the original information. So instead of the original matrix, we work with this new matrix where we have K topics, and for each topic it doesn’t give us a count, it gives us a probability: the probability of each topic for each document. As a data scientist you can now decide what you want to do with this matrix, but generally it’s a much smaller, much more compact representation of the same data than the original N by P matrices. So for now, until the second lecture today, we’re just thinking of topic modeling as a black box. We put a corpus in, and what we get out is a document-topic matrix, which tells us, OK, document 3 is 20% about this topic and 30% about that topic, and a topic-word matrix, which tells us how the topics and the words are connected to each other: this topic assigns this probability to this word and that probability to that word. So which words are part of which topic. This is essentially what we’re getting, and most of the values are non-zero. So even in an article about computer science, it’s not impossible to encounter the word poultry, it’s just much more unlikely than in, well, an article about poultry exports.
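To make the black box concrete, here is a minimal sketch of the two output matrices just described and how you would read them. All the numbers, the vocabulary, and the function name `top_words` are invented for illustration; they are not output of a real model.

```python
# The topic model as a black box: we imagine it has already produced the
# two matrices described above. All numbers here are made up to show the
# shapes and how to read them.

# Document-topic matrix: one row per document, one column per topic.
# Each row is a probability distribution over K topics (sums to 1).
doc_topic = [
    [0.42, 0.35, 0.20, 0.03],   # e.g. the black-hole article
    [0.05, 0.10, 0.05, 0.80],
]

# Topic-word matrix: one row per topic, one column per vocabulary word.
vocab = ["telescope", "photo", "galaxy", "hope", "duck", "farm"]
topic_word = [
    [0.30, 0.10, 0.40, 0.05, 0.05, 0.10],   # "astrophysics"-ish
    [0.10, 0.50, 0.05, 0.05, 0.10, 0.20],   # "photography"-ish
    [0.05, 0.05, 0.05, 0.70, 0.05, 0.10],   # "optimism"-ish
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],   # "poultry"-ish
]

def top_words(topic_row, vocab, n=2):
    """Highest-probability words for one topic -- what we'd use to label it."""
    ranked = sorted(zip(topic_row, vocab), reverse=True)
    return [word for _, word in ranked[:n]]

labels = [top_words(row, vocab) for row in topic_word]
```

Note how labelling a topic is exactly the manual step from the lecture: we look at the top words per row and come up with a name ourselves.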
This slide shows the generative model, which is essentially the story that we imagine we would use to explain how the data is created, and it illustrates the idea of how topic modeling works. So we have our topics, and within each topic, each word has a probability. There’s a topic here that seems to be related somewhat to DNA and genes, and a topic here that seems to be related to evolution. In the topic about evolution, the word life has a 0.02, so 2%, probability, and the word evolve has a 1% probability. And for each topic we have this list of words. Because the vocabulary is very large, the numbers are very low; even a value of 4% is relatively high, because it means every 25th word in this topic is this particular word. But real documents don’t, or very rarely, consist of a single topic; they consist of multiple topics. So you could have an article about seeking life’s bare genetic necessities, and it consists to a certain extent of words from the yellow topic, and to a certain extent of words from the green topic and the pink topic and the blue topic. So essentially, for each document you have a distribution over topics. Maybe in this document the yellow topic is especially prevalent, more so than the pink one, more so than the blue one. And then for each topic you’ve got a distribution over words. So you can think of a document being written literally as: OK, first word, which topic do we sample from? Let’s take the yellow topic. Which word? And every time, you look at the specific probability distribution, first the document-topic probability distribution and then the topic-word probability distribution, to choose which word to write in the document.
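The generative story above, sample a topic for each word position, then sample a word from that topic, can be sketched in a few lines. The topic names, word lists, and all probabilities here are made up for illustration.

```python
import random

# Sketch of the generative story: to "write" a document, for each word
# position first sample a topic from the document's topic distribution,
# then sample a word from that topic's word distribution.
# All distributions below are invented.

random.seed(0)

doc_topics = {"genetics": 0.5, "evolution": 0.3, "data": 0.2}
topic_words = {
    "genetics":  {"dna": 0.4, "gene": 0.4, "life": 0.2},
    "evolution": {"species": 0.5, "evolve": 0.3, "life": 0.2},
    "data":      {"computer": 0.6, "number": 0.4},
}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    items, weights = zip(*dist.items())
    return random.choices(items, weights=weights)[0]

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        topic = sample(doc_topics)                 # which topic for this word?
        words.append(sample(topic_words[topic]))   # which word from that topic?
    return words

doc = generate_document(10)
```

Notice that word order plays no role: shuffling `doc` would leave both distributions untouched, which is exactly the limitation discussed next.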
That’s obviously not how we actually write documents, how the human brain works, but it is essentially how topic models assume documents are written, and it tells us a lot about what the strengths of topic models are and what their weaknesses are. So what are some of the limitations of thinking of documents as having been written like this? What’s wrong with the approach? It doesn’t fit the way we actually write, but in what ways does it not fit? It doesn’t take into account the context, what the previous words were, it’s just word-by-word generation. Yeah, exactly. So it doesn’t take into account, for example, dependencies within text. If I say, that is not, then the next word is most likely going to be an adjective, or maybe a determiner, but it’s probably not going to be a noun, that kind of thing. It doesn’t take that sort of thing into account; it just literally generates one word after the other. You could jumble all the words, put them in a random order, and you’d still have the same document-topic distribution and topic-word distribution. Yeah. It assumes that the words are exchangeable, so sentence structure isn’t modeled at all. That’s probably actually the main limitation. Yeah.

SPEAKER 1
Is that a limitation or a strength? Sorry, is that a limitation or a strength?

SPEAKER 0
Well, I guess it’s both, because a simple model is often also one that is computationally very quick. So in this case, because we’re not taking into account context and structure, topic models tend to be relatively quick to fit to data, so in a way, for some applications it’s a strength, that’s a good point. For others it’s a limitation. And I think it becomes clear once you start working with actual topic models, as you will in the lab, where it starts to become a bit of a limitation. So for example, because we’re not taking into account even word collocations: we just look at how often the word not appears, and which topics the word not is in. But we don’t look at whether it says not good or whether it says not happy. We just look at the probability of the word not individually. So the topic model itself doesn’t tell us anything about which topic the bigram not happy occurs in, and that’s an example where it becomes a limitation. Yeah. Topic models are mostly used for text data, but they can also be applied in other settings. I’ll just gloss over them because this is a course about text technologies, but just so you know, they can be used in bioinformatics, for analyzing computer code, music, network data, and other kinds of data. And there’s a whole bunch of topic modeling methods. Latent Dirichlet allocation is the most well-known one, and it’s the one that I’ll be talking about in the next lecture today. But there are a bunch of other methods, like PLSI and PCA-based methods, non-negative matrix factorization, and deep learning-based topic modeling, which is becoming more and more popular, but LDA is still a very, very popular choice. So this is what the next lecture is going to focus on. First we’ll take a 5 minute break and then I’ll see you back here in 5 minutes. Well, actually, let’s make that 3 minutes. OK, let’s continue.
So now I’ll be talking a bit about how latent Dirichlet allocation actually works. How does the algorithm actually work? How does it come up with the two matrices that I spoke about, the document-topic probabilities and the topic-word probabilities, which are the numbers we’re actually usually interested in? Because then we can do fun stuff with them, right? We can sort the document-topic probabilities and know which documents contain which topics, so which topics are frequent in which document, and if we have timestamp metadata, we can visualize and produce these line charts that I was showing. And the topic-word probabilities tell us which topics contain which words, which we can use to label the topics, to come up with names for them. So how does the algorithm actually work? In order to explain it, I’ll first introduce plate notation. I don’t think this has been covered in TTDS yet; you might know it from other courses, some of you might, but I’ll give a very brief background of how it works for the purpose of the rest of the lecture. So, we can think of a variable here that represents an event that we can observe. What’s something we can observe? We can go and see a basketball match, and when a player shoots the ball, we can see whether they make the basket or not. So that’s something we can observe. We’re representing this with a circle that is filled in with a solid colour, and this could just be a variable like 0 or 1: did this basketball player make the basket or not? And what we can now imagine, what we actually want to know, is some other variable that we can’t directly observe. We’re actually interested in their shooting accuracy, the basketball skill of the player.
So we imagine there’s this other variable in the background, unseen, that is influencing the probability of the player making the basket or not, and it reflects their underlying ability to actually make the basket. We could make this a lot more complex, we could take into account whether they’re having a good day or thinking of something else, but let’s ignore that for now and just assume that every time a player shoots, there’s exactly one variable that influences whether or not they’re going to make the basket, and this is their basketball shooting accuracy, and we can’t observe it directly. If you’ve ever played FIFA or a game like that, there are all these stats and tables you can actually look at, where you see the actual shooting accuracy, and it’s made visible to you as a player. But in real life, if you go and see a football match, you don’t see a number hovering over someone’s head; you just see the football player, and if you know them well and you go to a lot of matches, then you see over time how good they are, and we all know that some players are better and others are worse. So this is what this represents. But it’s not something we can directly observe in the real world, and variables that we can’t directly observe are ones that we’re going to represent here with a white circle. So making the basket is the observable one, and the white circle is the one we can’t directly observe, so we also call it a hidden or a latent variable. Now, to actually compute this accuracy, it’s not enough to just see one attempt of a basketball player shooting a basket. We’d need to observe many of these events, quite a few of them. Did the player make the 1st basket, did they make the 2nd one, did they make the 3rd one?
And then, given all of these events, that would give us some idea of what the actual skill is. So essentially we’d have to observe quite a lot of realizations of this observable random variable to be able to estimate the unobservable one. And we’re going to represent this a bit differently, because it would quickly get tiring to draw all these solid circles, so we’ll draw it like this. We use this notation with a rectangle, with a value N in the bottom right, and this means we’re going to repeat this observation N times. So instead of having a separate circle for each time the player tries to make a basket, we’re going to have this one rectangle with the value N, and you can just think of it as like a for loop that goes over this N times and observes N times whether the player made the basket or not. So we’re essentially sampling from the basketball shooting accuracy N times, and every time we sample, we see whether or not the basket was made by the player, and then based on these N observations we can go back and estimate what the shooting accuracy of the player was. So just to recap the main things about the notation: the white circle here is a hidden or latent variable that’s unobservable, the solid dark circle is an observable variable, the arrow represents that this is a random variable drawn from a probability distribution, and the plate, the rectangle, tells us it’s a repetition, with a number that tells us how many times we’re repeating. So now we can use this notation to explain LDA. We’re no longer talking about basketball, we’re talking about words and documents, and I’ll not show you the full model right away, but build it up from a smaller one. So let’s start with a really simple model and work our way up.
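Before moving on, the basketball example can be sketched as a simulation: the latent variable (the true accuracy) is hidden from the observer, who only sees N repeated 0/1 observations and estimates the accuracy from them. The true accuracy of 0.7 and the sample size are of course made up for illustration.

```python
import random

# Plate-notation intuition as code: the latent variable (shooting
# accuracy, white circle) is never observed directly; we only see n
# repeated observations of the solid-circle variable (basket made or
# not) and estimate the latent one from them.

random.seed(42)

true_accuracy = 0.7   # latent: hidden from the observer
n = 10000             # the N on the plate: repeated observations

# Each shot is a 0/1 draw influenced only by the latent accuracy.
shots = [1 if random.random() < true_accuracy else 0 for _ in range(n)]

# The observer's estimate: fraction of baskets made over n observations.
estimated_accuracy = sum(shots) / n
```

With more observations on the plate (larger n), the estimate gets closer to the hidden value, which is why one shot alone tells us very little.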
So in the simple model, the unigram model, W is now a word and N is the number of words in a document. What we can observe is the individual words, we see the words in the document, so it’s a solid circle here. And we assume that we’re just drawing N times: we have N words in a document, so it’s like repeating the sampling N times, so we have all these different Ws, words, N of them in a document. And actually we have multiple documents in a corpus, so we can have M documents in a corpus. Now let’s add an additional piece of notation, which is this bold W, which is a vector of words, a document. And what we can now say is that the unigram model assumes that this is how documents are written: it assigns this probability to a vector of words, to a document. So this very simple model explains what we see, and with all these models we’re going to be looking at these probabilities. The probability of the document we’re looking at, the probability of the vector of words of this document, is in this case just a product of the probabilities of the individual words. There are no relationships between words being modeled here. No words have different probabilities depending on where in the document they appear, or what kind of topic the document is about, or anything like that. We’re just assuming every single document in the world has been created by just sampling individual words. And that’s just a single word distribution, where some words can be more common and other words less common, but there’s really nothing else governing this. So in the equation, we use N for the number of words in the document, italic w with index n for the individual word in the document, and this bold upright W is the document.
And then the probability of the document is just the product of the probabilities of all the individual words that appear in the document. So you can probably already tell this is a very unrealistic model. Say the word the, the most common word in the English language, makes up 5% of all words. Then a document that is just the the the the the is considered a very high probability document by this model. So it’s not a great model, because we’re actually looking for models that assign high probability to realistic documents and low probability to nonsense documents that don’t make sense. So ideally a good model is one that gives us a high probability for realistic documents. So what is the probability, in the unigram model, of the example sentence my dog barked at another dog? We know something here about the probabilities of the words. We might have come up with them, actually, like in the last lecture, by just counting the words in a big corpus of documents. We might have figured out that in this big corpus the word my is every 10th word and the word dog is every 20th word. So let’s say my has a 10% probability and dog has a 5% probability. And now, given any document, we can calculate a probability for it, so we can see essentially how likely this model thinks this document is. So how would you calculate the probability of the sentence in this model? Multiply? What do you multiply? Probability of words. Yeah, so you look at the probability of my, times the probability of dog, times the probability of barked. You multiply them all, and you get some number that tells you, under this model, what the probability of this sentence is. So, in this scenario, what would be the most likely document possible? Maybe the most likely 3-word document?
My my my, yeah, or my at my, at at at, OK. Yeah, we can see this is not a very realistic model, but hey, it’s a model, OK? It tells us something about what’s common and what’s not. But yeah, using just a few common words in a random order is what this model considers to be likely, and that’s not how we actually speak. So let’s try and make the model a bit more complex. But wait, why do we try and make the model more complex? Why not just use the basic unigram model for everything? Well, because ideally we want to have a model that assigns high probability only to sentences that are actually realistic. We want to accurately describe the data we’ve got: higher probability for real documents and lower probability for noise. How do we achieve that? One thing we can do is take into account that different documents are going to be about different things. And this is where we get the notion of a topic from: different topics appearing in different documents. Because we want to be able to model the fact that the word dog might be quite common in some types of documents, but not common in other types of documents. So we introduce a new variable, Z. Z is an unobserved variable, so we’re using the white circle, and it is the topic for a document. So when we compute the probabilities for the words, then instead of just using the same probability distribution every time we sample, we now have a specific probability distribution just for this topic. So for example, just for science documents, we might know that some words are more probable than others, and for documents about pets, other words are more common. The variable Z models these different probabilities of different words in different topics. So topics could be sports, science, politics, pets, literature.
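The unigram calculation just discussed can be written out directly: the probability of a document is the product of the word probabilities. The probabilities for my and dog are the ones from the example; the others are invented to complete the sentence.

```python
# Unigram model sketch: P(document) = product of individual word
# probabilities. my = 0.1 and dog = 0.05 follow the lecture example;
# barked, at and another are invented.

word_prob = {"my": 0.10, "dog": 0.05, "barked": 0.01,
             "at": 0.04, "another": 0.02}

def unigram_probability(words):
    p = 1.0
    for w in words:
        p *= word_prob[w]   # no context, no order, just the word itself
    return p

p = unigram_probability("my dog barked at another dog".split())

# The flaw in action: "my my my" beats any realistic 3-word sentence,
# because "my" is the most common word in this toy vocabulary.
p_silly = unigram_probability(["my", "my", "my"])
```

So under this model the degenerate document wins, which is exactly why the lecture moves on to topic-conditioned probabilities.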
And now we can calculate the probability of a document. The right side of this equation looks a lot like before, but now we’re still multiplying probabilities of words, except they’re conditioned on a topic. And then we multiply that with the probability of the topic. So the probabilities of some words here might now go down or up depending on the topic, and each topic has a specific probability that we multiply with. And then we sum over all the possible topics, in addition to multiplying over the different words. So if we have a sentence like my dog chased after the bus, how would we calculate the probability of that? To make things simpler, let’s ignore stop words, because often when we do topic modeling we actually exclude them. It’s not that interesting to look at the probabilities of words like my or the and and, because they tend to be very similar between different topics. And we also assume we’ve already learned the model, so we know the numbers, the probabilities; we just want to show how it works. In reality we’ll learn the probabilities, we’ll choose them based on the data. For example, the topic probabilities, how often each topic appears, we’ll fit a model that estimates; but here we’ll just take them as given. So we’ll say the probability of the pets topic is 60% and the probability of the vehicles topic is 40%, and there are only these two topics. And then within the pets topic, the probability of the word cat is 20% and the probability of the word dog is 30%, while within the vehicles topic, the probabilities of these two words are just 1%. So now, how would you calculate the probability of the sentence my dog chased after the bus, ignoring the stop words my, after and the? Any ideas?

SPEAKER 2
We just put in these values, like, for each topic, each category. So for pets we can do 0.6 multiplied by all the probabilities of dog, chased and bus given the topic, and multiply all of this together.

SPEAKER 0
That’s right, yeah, that’s it, and then we do the same for

SPEAKER 2
the vehicles.

SPEAKER 0
That’s right, we’ll add them. So we’re summing over two topics here, and how many words do we have left? My goes away, so dog, chased and bus, 3 words. So we have 2 topics, 3 words. We have 60% for pets and 40% for vehicles, so we add those two terms: 0.6 times the probabilities of the three words under the pets topic, plus 0.4 times the probabilities under the vehicles topic, and that gives us an overall number by multiplying these conditional probabilities. So it’s a bit better, but it’s still not perfect; there are still some disadvantages. So let’s introduce a new thing: a new observed variable D. You can think of this as a kind of lookup table, where each row of the lookup table is indexed by a document ID, D, so we have a way of associating documents with probabilities of topics, because we didn’t have that before, right? We had a general view that there are two topics, pets and vehicles, and 60% of all words in documents in general come from the pets topic and 40% from the vehicles topic. Within each topic we had word distributions that differed, so within the pets topic some words were more common than others, and within the vehicles topic some words were more common than others, and these differed between the two topics. But we didn’t have a way of modeling that different documents are actually about different topics, right? Every document was 60% pets and 40% vehicles. Now we introduce this to the model. So we think of this as a lookup table where document one might be 90% pets and 10% vehicles, and document two might be 70% pets and 30% vehicles. So that’s the difference to before: words in different documents can still come from different topics, but the topic distributions are now different for each document.
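The two-topic calculation just worked through looks like this in code: sum over topics of P(topic) times the product of P(word | topic). The probabilities for cat and dog and the topic probabilities follow the example; the values for chased and bus were not given in the lecture, so they are invented here.

```python
# Mixture-of-topics sketch:
#   P(sentence) = sum over z of P(z) * prod over words of P(w | z)
# cat/dog values and topic probabilities follow the lecture example;
# chased and bus probabilities are made up.

p_topic = {"pets": 0.6, "vehicles": 0.4}
p_word_given_topic = {
    "pets":     {"cat": 0.20, "dog": 0.30, "chased": 0.05, "bus": 0.01},
    "vehicles": {"cat": 0.01, "dog": 0.01, "chased": 0.02, "bus": 0.20},
}

def mixture_probability(words):
    total = 0.0
    for topic, pt in p_topic.items():
        p_words = 1.0
        for w in words:
            p_words *= p_word_given_topic[topic][w]
        total += pt * p_words          # P(z) * prod P(w | z)
    return total

# Stop words (my, after, the) already removed, as in the lecture.
p = mixture_probability(["dog", "chased", "bus"])
```

With these numbers the pets term dominates, which matches the intuition that dog chased pulls the sentence toward the pets topic even though bus is a vehicles word.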
So now, the way we would calculate the joint probability of a document and a word is: each document has a certain probability, for example, if you have 1000 documents, you could just say each is 1 over 1000. And then we look at all the different topics, and for each topic, just like before, we have some probability of the word given the topic, and some probability of the topic given the document. This is new compared to before: the probability of the topic is now conditioned on the document D, so in each document we have different topic probabilities. It’s just a way to represent that mathematically. And this model is called probabilistic latent semantic indexing, which is a very fancy name for what is in effect not that complicated a model. So let’s look at this example, where these are our probabilities for the words given topics, topic 1 and topic 2, and we now have a document, the cat sat down, and so on. If we know that Z is topic one, then these are our probabilities here, and if we know that Z is topic two, then those are our probabilities. And we also now have some probabilities for the documents. So for D1, maybe we have 100 documents, so the probability of the document we’re looking at being D1 is 0.01. And then within D1, the probability of topic one could be 60% and the probability of topic two 40%. So now we can work out the joint probability of the document and the word cat, and it’s again just a matter of filling the numbers into the formula. So now, finally, we make one more assumption and we get to LDA. We make an assumption about how the documents are actually generated. We introduce theta, where instead of having a fixed topic distribution for each document D, we say that this theta, this topic mixture, is for every document a probability distribution over topics.
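Filling the numbers into the PLSI formula looks like this. P(D1) = 0.01 and the 60/40 topic split follow the example; the P(cat | topic) values are invented where the lecture doesn’t state them.

```python
# PLSI sketch: P(d, w) = P(d) * sum over z of P(w | z) * P(z | d).
# P(d1) = 0.01 (one of 100 documents) and the 60/40 topic mixture follow
# the lecture example; the P(cat | z) values are made up.

p_d = 0.01                               # P(d1)
p_z_given_d = {"z1": 0.6, "z2": 0.4}     # topic mixture of document d1
p_w_given_z = {"z1": {"cat": 0.20},
               "z2": {"cat": 0.01}}

p_joint = p_d * sum(p_z_given_d[z] * p_w_given_z[z]["cat"]
                    for z in p_z_given_d)
```

The only difference from the previous model is that the topic probabilities now come from the document’s own row of the lookup table instead of one global distribution.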
This is itself sampled, so we have a probability distribution that’s sampled from another probability distribution. You can think of theta as a matrix with a row for each document and a column for each topic, and it tells us which topics are how common in which document. We also have beta, which is the word distributions within topics, where the probability of the word depends on the topic. And alpha is where we get the name Dirichlet from. So if the document-topic distributions themselves are sampled from a probability distribution, alpha is the parameter of a Dirichlet distribution, the probability distribution that we get the topic distribution theta from. It’s a distribution from which we can essentially sample vectors that sum to one. Every time you roll a die, you get a value of 1 to 6. Every time you sample from the Dirichlet distribution, you get a vector, and all the elements of the vector sum to one. So you can think of this as basically having a bag full of dice, but the dice are not all fair. Each die gives us a probability distribution, right? A fair die gives us a probability distribution of 1/6 for each of the sides. A loaded die, a die that isn’t fair, gives us a probability distribution that’s 6 numbers that sum to 1, but they’re not all the same, they’re not all 1/6. So you can think of this Dirichlet distribution as this bag with all these dice, and every time you take out a die, you get a probability distribution of 6 numbers, so 6 probabilities for the different topics, if there are 6 topics. And every time you draw, you get a slightly different probability distribution, because none of the dice are fair, and each of them is unfair in a slightly different way. Maybe some of the dice really like to give you high numbers.
Numbers 5 or 6, so topics 5 or 6 are very likely; others are very likely to give you very low numbers; and another die almost always gives you a 3, that kind of thing. So yeah, we’re drawing vectors that sum to 1, and we interpret them as probability distributions over the six topics. And depending on how we set the value of alpha, we get different types of probability distribution. If we set a really high value of alpha, then all the values for the six-sided dice will be kind of close to 1/6, so they’ll be fairly close to fair. With a very low alpha, we will get very unfair dice, that maybe almost always roll the same number, that give us very high probabilities for some topics and less so for others. So the dice are the thetas. This is how I like to think of them, as this kind of dice, and it explains why basically every time we sample from the Dirichlet distribution we get a different probability distribution over topics. So we draw M times from the Dirichlet distribution, every time we get a different document-topic distribution, and then we sample from that N times and construct our words. So if you look at the five circles, the only thing that we actually observe is the words. Everything else is unobserved, so everything else we either have to set or infer. Alpha is a parameter we’re just going to set, so when you actually go and do your topic modeling in a Python library, you have to decide what you want that value of alpha to be. And hopefully now you’ve got an intuition why, if you set a relatively low value of alpha, you’ll find that documents tend to be almost entirely in just one topic, and if you set a really high value of alpha, you get documents that are composed of multiple topics with fairly similar probabilities, that kind of thing.
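The bag-of-dice intuition can be tried out directly. One standard way to sample a Dirichlet with only the standard library is to draw K Gamma(alpha, 1) variates and normalize them; the alpha values below are just illustrative choices.

```python
import random

# "Bag of dice" as code: each Dirichlet draw is a vector of K
# probabilities summing to 1. Sampling trick: draw K Gamma(alpha, 1)
# variates and normalize. Low alpha -> very unfair dice (one topic
# dominates); high alpha -> close to uniform (fair dice).

random.seed(1)

def sample_dirichlet(alpha, k):
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

peaked = sample_dirichlet(alpha=0.1, k=6)   # tends to put most mass on one topic
flat   = sample_dirichlet(alpha=50.0, k=6)  # tends to stay near 1/6 each
```

Drawing `peaked` and `flat` a few times with different seeds shows exactly the effect described: low alpha gives documents dominated by one topic, high alpha gives documents spread over many.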
So if we set it lower, we end up with documents that are composed of just a few topics, like the dice that almost always roll the same number. Theta and z and beta are unobserved; we're going to infer them or estimate them from the data that we actually get, so from the documents. We observe the words and we compute these other variables, because all we've really got here is the words. And once we've learned all these parameters, they will give us what we're actually interested in, which is the document topic distributions and the topic word distributions. So the equation here looks complex, but it's not actually that different from the previous one we had for pLSI. We have a probability of theta given alpha, so we have the Dirichlet distribution parameterized by alpha, and then we look at the different words, over all the N words. And for each of the words, essentially we've sampled a topic z_n, the topic of word n, given theta, and then from the topic we sample the word, given the topic and beta. How do we actually find these parameters? In practice it's very hard to do exact inference here for the things we actually care about, the document topic matrix and the topic word matrix. And the intuition you might have gotten from the previous slide is that it really becomes intractable to try and find an exact solution to this. So there's a number of approximate methods to try and come up with solutions here. Gibbs sampling is one of them and variational inference is another, and precisely how this works is out of scope here, because it requires more of a background in statistics than we require in TTDS, but there are courses on things like probabilistic modelling and reasoning where you can learn more about these topics.
So I'm going to finish this lecture by giving you some intuition of how Gibbs sampling works. The goal is to learn our parameters: given a set of documents, we want to learn the topic word probabilities and the document topic probabilities. And remember, the choice of topic for a word and the words themselves are all random variables, and they all depend in some way on one another. Gibbs sampling tends to work in situations where you don't know the distributions of all these random variables, and the goal is essentially to approximate them, but one thing you do know is the conditional distributions, because you can sample from them. So in this case, we don't know which topics any of the words that we observe are actually sampled from, because we can't directly observe the topics; we can only observe the words. We don't know the document topic probabilities, we don't know the topic word probabilities. But one thing we do know, that we can calculate, is the probability that a word w is from topic t, conditional on all the topic assignments of all the other words. If we knew which topic each word is from, if we knew exactly that this word in this document is from that topic and this word is from that topic, then we'd know, for example, how common each word is in each topic. And then we can also estimate, for the word we're looking at, how likely each topic is. This probably sounds a bit vague now, but I'll walk you through an example where hopefully you'll get an intuition of how this works in practice. So, I'm not going to go exactly into why this is the case, but we'll take this equation here as given for now. It shows the probability of word i in a corpus
being assigned to topic j. The probability of the word depends on its topic, and this is captured by the fraction on the left; the probability of the topic depends on the document, and this is captured by the fraction on the right. C^WT here is a matrix that counts the number of times word w is assigned to topic j, and C^DT is a matrix that counts the number of times topic j is assigned to a word in document d. So we keep track of the document topic assignments and the word topic assignments. And what this is saying is: the probability of a specific word being assigned a specific topic is proportional, essentially, to how often that word appears in that topic and to how often that topic appears in that document. So if many words in a document have been assigned a specific topic, that increases the probability that the next word we see in that document will also be from that topic. And if many occurrences of a word have been assigned to a specific topic, then next time we see that word, it's probably also because it's from that topic. And this leads to the following algorithm: we assign each word a topic randomly, keep track of the count matrices, and repeat until convergence. So for every document, for every word, we sample a topic assignment, increment the count matrices based on what we've just assigned, and decrement them based on the assignment we've removed. I'll walk you through an example of this. We have 2 topics, red and blue, and 3 documents: green eggs and ham; ham and green peppers; ham and cheese. We'll start with the random initialization step, so we just randomly assign each word in each document to a topic, because every word
in every document comes from exactly one topic. We don't actually know which one it's from, so we'll just go through the entire data set and randomly assign them. And now, after this random initialization step, we can keep track of our count matrices. So we have a count matrix here for words and topics: how often has a certain word been assigned a certain topic? For example, the word green appears twice in our collection here, and once it's been assigned the blue topic and once the red topic. So we record: once topic 1, once topic 2. The word eggs only appears once, and the one time it appears, it was assigned the blue topic. The word peppers only appears once, and it was assigned the blue topic. So like this, we keep track of the word topic assignments, and we also keep track of the document topic assignments. Here we count which document has how many words from which topic. So document one, which is the first row here, has 2 blue words and 2 red ones. Document 3 has 2 blue words and 1 red one. Does this make sense so far? This is literally just counting, after the random initialization step, how often I've assigned each word to each topic and how often each topic occurs in each document. And now, first of all, I can convert these to probabilities. So before I had counts, and this is the same data represented slightly differently: I've converted them to probabilities, here by making sure that every row sums to one, so before I had 1, 2, 2, and now I've got 0.2, 0.4, 0.4, and here by making sure that every column sums to 1. So these are our probability matrices, not the final values, but the ones from the random initialization, and I'm going to try and approximate the real values. We'll see how over time this becomes a more accurate estimate of the correct values.
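The random-initialization and counting steps can be sketched like this (a minimal sketch with made-up variable names; the stop word "and" is kept, matching the counts in the example):

```python
import numpy as np

rng = np.random.default_rng(1)

docs = [["green", "eggs", "and", "ham"],
        ["ham", "and", "green", "peppers"],
        ["ham", "and", "cheese"]]
n_topics = 2  # "red" and "blue"

vocab = sorted({w for doc in docs for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

# Random initialization: every word token gets a random topic.
z = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]

# Count matrices: C_wt[word, topic] and C_dt[document, topic].
C_wt = np.zeros((len(vocab), n_topics), dtype=int)
C_dt = np.zeros((len(docs), n_topics), dtype=int)
for d, doc in enumerate(docs):
    for n, word in enumerate(doc):
        C_wt[w2i[word], z[d][n]] += 1
        C_dt[d, z[d][n]] += 1

# Normalising each row of C_wt gives the (random, initial) estimate
# of the per-word topic probabilities.
print(C_wt / C_wt.sum(axis=1, keepdims=True))
```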
So now I can go through my words one by one. I select the first word and I remove its assignment, so the word that was previously blue here, the word green, is now black: I've removed the blue assignment and I've updated both matrices. In the word topic counts, green is no longer in topic one, and in the document topic counts I've changed how many blue words appear in document one. And now that I've done this, I can sample again, I can assign a topic. How do I do this? Before, I was just going fifty-fifty for each word. Now I still assign randomly, but not fifty-fifty: I look at what's actually the most likely topic assignment for the word green. We have to look at two things now. One thing is: when the word green appears, does it tend to appear in the blue topic or in the red topic? After I've updated my assignments, the word green only appears in the red topic, never in the blue one. So it seems the probability of it appearing in the blue topic is 0. The other thing we have to look at is: in my document, document one, which topic is how likely? And topic red is twice as likely as topic blue. So in this case what I would actually do is assign the red topic, because the probability of the blue topic is 0. Now, after I've assigned the word green to the red topic, I can update the count matrices. In the word topic matrix, for example, I can update and say the word green now appears twice in the red topic. We move on to the next word, we remove the assignment from this word, we update the two matrices. And now we want to sample again: should we assign the blue topic or the red topic to the word eggs?
So again we look at two things, because the probability is proportional to how often each of the topics appears in this document, and proportional to how often this word is from a certain topic. Let's look at the first thing, which is how often each of the topics appears in document one. In document one, we now actually only have red words, so the probability that eggs is blue is 0. But the probability is also proportional to how often the word itself is from a certain topic. So let's look at the word eggs. Oh, we've got a problem, because blue and red are both 0, so actually the probability of eggs being from either topic is zero. This is an issue. OK, we have to change the algorithm, because this wasn't really meant to happen. What can we do to fix the algorithm? Yeah, exactly, we smooth. We just add a little constant that smooths things a bit. And it's here: instead of just dividing our counts, for example the count of 0 here, by the sum of 3, we add a little bit, which could be 0.01. Now we're no longer getting 0 for both topics; we're always getting a non-zero number, basically, because the numerator is never zero anymore. So we introduce a little value: we call it beta for the word topic matrix and alpha for the document topic matrix, and we just add that tiny number to all of our counts. And now that we've done that, we can look at our formula and it makes sense again, and we can actually sample for the word eggs. We can keep doing this for our entire corpus, and we can iterate over the entire corpus multiple times until, hopefully, our algorithm converges and there are no big changes anymore in the document topic or the word topic distributions.
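Putting the whole loop together: remove a token's assignment, compute the smoothed probability for each topic (proportional to C_wt + beta for the word, times C_dt + alpha for the document), sample, and put the token back. A minimal sketch with hypothetical variable names, where alpha smooths the document-topic counts and beta the word-topic counts:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [["green", "eggs", "and", "ham"],
        ["ham", "and", "green", "peppers"],
        ["ham", "and", "cheese"]]
n_topics, alpha, beta = 2, 0.01, 0.01  # tiny smoothing constants

vocab = sorted({w for doc in docs for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}
W = len(vocab)

# Random initialization and count matrices.
z = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]
C_wt = np.zeros((W, n_topics))
C_dt = np.zeros((len(docs), n_topics))
for d, doc in enumerate(docs):
    for n, word in enumerate(doc):
        C_wt[w2i[word], z[d][n]] += 1
        C_dt[d, z[d][n]] += 1

for sweep in range(200):  # iterate until (hopefully) convergence
    for d, doc in enumerate(docs):
        for n, word in enumerate(doc):
            w, old = w2i[word], z[d][n]
            # Remove this token's current assignment from the counts.
            C_wt[w, old] -= 1
            C_dt[d, old] -= 1
            # Smoothed conditional: word-topic part times document-topic part.
            # (The document-topic denominator is the same for every topic,
            # so it cancels when we normalise.)
            p = (C_wt[w] + beta) / (C_wt.sum(axis=0) + W * beta) * (C_dt[d] + alpha)
            new = int(rng.choice(n_topics, p=p / p.sum()))
            # Put the token back with its (possibly new) topic.
            z[d][n] = new
            C_wt[w, new] += 1
            C_dt[d, new] += 1

# Smoothed topic-word estimates: each column sums to 1.
print((C_wt + beta) / (C_wt.sum(axis=0) + W * beta))
```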
So this table also shows intuitively why, if we set beta to a really high value, so if we were to add 100 here instead of adding 0.01, it would almost drown out the actual counts. It would be 100 versus 102, so the probabilities of assigning a topic would be close to 50% most of the time, and the counts themselves wouldn't actually have much of an influence anymore on our assignments. This is why we would end up with topics that have many, many words almost equally likely. And if we set alpha to a high number, we'd end up with documents that have many, many topics almost equally likely. But if we set alpha and beta to really low numbers, then we end up with assignments where documents are composed of just a few topics and topics are composed of just a few words. So now we run over all the words many times, and eventually, under some conditions, the algorithm is guaranteed to converge to the target distribution, the true probabilities. And we have to note it's a probabilistic algorithm, so there are a couple of random steps: the random initialization and the random assignments of topics to words. There's an element of randomness in practically every step. So if you run the topic modelling on a big corpus, if you run it and you run it again and I run it, we never get exactly the same results. We'll get roughly similar results, roughly similar topics, and within these similar topics, roughly similar words, but because of that randomness there are always slight differences in the results we get. So if you run LDA once on a newspaper corpus and you find the top topic is topic 3, and the top words in topic 3 are words like match, players, football, you can call this the sports topic, call it a day, write up your report, and the next day you talk to your friend who's also taking TTDS,
who's run the same thing, but they actually found that topic 7 was the top topic, and they found different top words, maybe words like Messi and Ronaldo and match. That's actually not unexpected, because it's a probabilistic algorithm. You'd expect roughly similar ideas, like sports being a very common topic, that kind of thing, but which specific words, or which specific documents, that's random. Let's look at some examples of what this looks like in practice, what people have done with topic modelling. Here's a data set where student comments were analyzed, comments on a website called Rate My Professors, which is a website where you can talk about professors, and they considered all the course feedback that one professor got as one document. We're looking at sample words for each topic. So they found that there was a topic within all these documents where words like "perfect" and "helpful" were appearing together, and a human went and labelled this, called this the approachability topic; and they found that words like "understand", "hard", "homework", "clear", "helpful" were appearing together, and they called this the clarity topic. Generally, when you look at the highest probability words for a topic, it should give you a good idea of what the topic is, so you can come up with a name for the topic. And what they then did was use this to compare two corpora: you learn the topic model on all the documents, you average the document topic probabilities over two subsets, two corpora, and you can compare them. Here, what they did was split all the reviews between sociology professors and computer science professors.
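Comparing two corpora this way is just a group-wise average of the rows of theta (a toy sketch; the matrix and the group labels below are made up):

```python
import numpy as np

# Hypothetical document-topic matrix theta: one row per document,
# one column per topic (each row sums to 1).
theta = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.1, 0.7]])
groups = ["sociology", "sociology", "cs", "cs"]

# Average the document-topic probabilities over each subset and compare.
for g in ("sociology", "cs"):
    mask = np.array([x == g for x in groups])
    print(g, theta[mask].mean(axis=0).round(3))
```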
And they found that, for example, for the sociology professors, people were far more likely to comment in their reviews on the quality of the readings or the quality of discussion than they were for the computer science professors, and that students were far more likely to comment on topics like clarity in their reviews of computer science professors than in their reviews of sociology professors. So it seems like this might be much more important for computer science students. And they also did this for different countries: they looked at professors in Canada versus the United States, and they found that, for example, in reviews of Canadian professors, approachability was discussed far more often than in reviews of US professors. So it can also show some cultural differences in what students in different countries look at. Another example is personal values, so what's important to people. This is topic modelling done on people's essays about what's important for them in life and what the main things are that they think about when they make important decisions, and they found very significant differences between countries, genders and ages in what kind of topics people seem to care about. So this can give you some inspiration for what you can use LDA models for. Yeah, just a very quick preview of the next set of lectures I will give. So next week, Walid will be talking about web search, and I'll be back in two weeks talking about classification, which will build directly on this set of lectures; it's a way of answering the question, essentially, what are the specific things that we're interested in. And we'll look at traditional supervised learning, where we annotate representative samples from a corpus, train a classifier and apply it to the test data.
We'll look at transfer learning, where we find another, much bigger but similar data set, train our model on that, and then apply it to the rest of the data. We'll look at what this has to do with topic modelling and how we can use models like that to help us with topic modelling. And after classification, we can look at things like common words and topics and how predicted classes relate to other variables. So yeah, just to summarize: we've seen the traditional process of content analysis as done by experts, calculated word-level differences between corpora, looked at dictionaries and lexica, where we know what kind of words we're looking for, and done topic modelling, where we don't really know what we're looking for and we're letting the topics emerge from the data. And I've given you a very quick preview of what's happening in the next set of lectures by me on classification. If you've got any more questions, I'll be around for a bit, so feel free to come up. If not, I'll see you again in 2 weeks, but Walid will be back next week. Thank you.

Lecture 6:

SPEAKER 0
OK, hi again. So, today we're going to discuss a new topic which has been used for a long time, and is still used in some situations as well, which is query expansion. But before starting, let me ask: how did coursework one go? Was it OK? I can see most of you have already submitted, unless you have extensions. One piece of good news: it's only 1 lecture today, not 2. It's a bit long, so it will be only 1. However, we still have a bit more after the lecture, because after that I will discuss your feedback results. I gave you a survey last week to fill in; only a few of you filled it in, I think, but I will discuss your feedback, and we'll discuss it openly, so if you have any more comments, we can discuss them. So hopefully within an hour or so we'll be finished, maybe an hour and 15 minutes. By the way, next week I won't be around; it will be my colleague Bjorn who will be teaching you next week. I will be coming back the week after, OK, because I'll be away next week. So, is it going well so far? Any comments?

SPEAKER 1
OK.

SPEAKER 0
OK, good. So the objective today is to learn about query expansion, and we will discuss the very well-known query expansion methods that have been developed over the years, the ones from decades ago. At the end of the lecture we will discuss a little bit about the roadmap of what happened in the last, I would say, 6 years in IR, because IR had been moving in one direction and then started to change dramatically, and we will quickly summarize what is happening, because it will be an introduction for the next lectures and what we're going to study as well. And there is a practical part of this lecture, as we have a practical part every time, which is implementing what we call PRF; we will understand what that is in a second. So, query expansion. Remember that the query is a representation of the user's information need, and many times it's suboptimal. Sometimes the user writes a term that doesn't match the document, while actually there may be a better term that would represent their information need. Also, different words have the same meaning, like replacement, replace; we can solve this by stemming, and we have seen that stemming helps with these kinds of things. Different verb forms like go, gone, went: we didn't apply lemmatization, but that can be a solution for these. However, for other cases like car, vehicle, automobile: they are different words but have very close meanings, they are kind of synonyms, and so far, with what we have implemented, these will not match. And also different ways of writing the same thing, like US, USA, the United States, United States of America: the term USA will not match "the United States of America", for example. So how can we solve this problem?
The current techniques we developed, the very basic search engine you built, will not solve this. Stemming and lemmatization can be applied to normalize documents and queries, and it has been shown that stemming is a major improvement, while lemmatization doesn't actually help that much. However, for the other cases, we need something different. One of the solutions for this is query expansion, where I add more words to my query to help me find other documents which might have similar meanings. For example, I'm searching for "car" in my query; then I can implicitly, internally in the system, add the word "vehicle" as well, and the word "automobile" as well. So if a document doesn't contain "car" but contains "vehicle", it can still match my query. This is a very simple, basic idea. What are the methods for this? One is using a thesaurus, simply a dictionary: groups of words which are assigned to be synonyms of each other. A problem with this, of course, is that they group the words on the word level, so they might lose the context. However, in many cases they can still be useful. Is anyone aware of a simple thesaurus or something that has synonyms? Yes, what, UMLS? I'm not aware of this myself, but that's OK. Does it have a word and the corresponding meanings? That's something like this, yeah. And has anyone heard about WordNet before? WordNet is a simple thing like this as well. It's manually built, and if you're using the NLTK tool, it has WordNet, where you can actually say: give me the synonyms for this word. It can be done in many ways. Sometimes it's manually built, like WordNet, but this is of course not scalable, and if you need to update it, it takes a lot of time. Or it can be done using automatic methods.
Like, for example, word co-occurrence or parallel corpus translation, and we'll give examples of each of those. And retrieved-document-based expansion: we will talk about relevance feedback, which is another method for query expansion, where we do the expansion based on the documents we are retrieving. That's relevance feedback and pseudo relevance feedback, or blind relevance feedback, which we will discuss today as another way to do this. And of course another very useful method is query logs, which are hard to build unless you have a big log of users, from which you can learn from sessions. Someone is searching for something, they didn't get the results, then they retype another query and they start to click results. You can learn from that, and if this is repeated by 100,000 users, you can learn that the first query is probably similar to the second. Or maybe from the logs you have a document that has been clicked many times when people search with five different queries; you can assume these five queries are similar. That's another way to do it, but query logs are usually for search engines more than normal systems. OK, let's discuss the automatic methods. One of the simple methods is to find words which might be similar to each other. The very basic idea, which has been used for decades, and which, if you think about it, is very similar to how the current LLMs we are using work, how transformers are built, and word2vec and so on, is simply the co-occurrence of terms. The main idea here is that words co-occurring in the same document, the same paragraph, can be assumed to be somehow related. And from our collection matrix, if you remember, you have terms and documents, and you can find out from there the terms that appear mostly in the same context.
So simply, what you can do here: I have a collection matrix which is maybe 500,000 terms by 10 million documents. If I multiply it by its transpose, I end up with 500,000 terms by 500,000 terms, and this matrix shows how often each pair of terms appears together in the same document. It's a very simple operation, just to find which terms co-occur in the same documents. And from there you can find: oh, these terms usually appear together a lot, or these terms never appear together, and from there you can build something. The good thing about this one is that it's unsupervised, because I don't need to annotate anything. All I have is the collection of documents; I just try to find the terms that appear more in the same documents. The problem is that these are not exactly synonyms. Can you give me an idea why they are not exactly synonyms?
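The transpose trick can be sketched with a toy term-document matrix (the terms and counts here are made up for illustration):

```python
import numpy as np

# Toy binary term-document matrix X: rows are terms, columns are documents.
terms = ["car", "engine", "vehicle", "banana"]
X = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 0]])

# X @ X.T is term-by-term: entry [i, j] counts the documents
# containing both term i and term j.
C = X @ X.T
print(C)  # C[0, 1] == 2: "car" and "engine" co-occur in two documents
```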

SPEAKER 2
If someone searches for a car, for instance, it's more likely that "car" is seen with "fast" or something like that than it is with an actual synonym.

SPEAKER 0
Yes, if you search for a car, in this case you might find that the word "fast" or "engine" appears a lot with it, but it's not a synonym. In some cases this is true. However, if you think about it, for a search engine it might still be useful. Do you know why? Let me give you an example here. This is actually built from a collection; this is very old work. You can find a word and the other words that appear a lot when this word is seen. Like "absolutely" comes with "absurd", "whatsoever", "totally". "Bottomed" comes with "dip", "copper", "drops". And you'll find some like "makeup": it comes with "repellent", "lotion", "glossy", "sunscreen". So if you think about this, they are not exactly synonyms, but they are kind of related. Do you think this would be useful for search or not? Why?

SPEAKER 3
You want to find documents related to this query, and not necessarily exactly matching it.

SPEAKER 0
Exactly. So it doesn't have to be the exact term; something related might still be interesting. However, this is not always the case: sometimes it's useful, sometimes it's not. For example, if I'm searching for Lamine Yamal and it ends up retrieving stuff about Barcelona, it might be useful; but if you're searching for Barcelona as a city and it returns Lamine Yamal, which is about the team, you're not sure if that's exactly what you need or not. But this has been used, and it actually shows some improvements in many cases. And by the way, if you think about the advancements that happened later, the idea of word embeddings, which we will discuss briefly at the end of this lecture and more in the next lectures, this is how they work: we look at the words appearing in the same context. Another method uses parallel corpora. A parallel corpus is when you have a sentence in one language and the corresponding sentence in another language, usually used for training machine translation; or it can simply be a Wikipedia page about a certain topic and the corresponding Wikipedia page in the other language, which would probably have similar content. The idea here is that if I learn an alignment on the word level, and I find a word in language X that can be translated into more than one word in the other language, then I can guess that these other words in the other language that translate to this original word are probably synonyms, or very close to each other, and from there I can learn something. The main disadvantage, or limitation, is that I need a parallel corpus. However, there are a lot of parallel data sets out there; Wikipedia is an example, and you have a lot of data sets used for machine translation. And let me give you an example here.
So imagine that I have two sentences, one in French, one in English, translations of each other, and they are aligned. What I can do first is remove the stop words from both of them, from the English and the French, because in IR we don't care that much about stop words. Then I can try to align the terms based on co-occurrence across many other sentence pairs in the parallel corpus. So I do statistics: oh, this term appears a lot with this one, so I learn about it. And from there I can start to learn that the word "eliminate", which is a stem in this case, can in English map to two different words in French, and that the corresponding French word can map to "remove" or "eliminate" in English. So I can make a very simple assumption here: by back-alignment, "remove" and "eliminate" are probably synonyms in this case, with some weight each. This is a very simple idea; that's what I'm trying to do. So from there I can go back and say that "eliminate" has "remove" as a synonym. And if you apply this with a parallel corpus, and these are experiments I did myself years ago, this is the output you can get. So the word "motor" can have the word "engine". "Weight" is actually interesting: I found the word "weight" as a full word and "wt" as an abbreviation, which is interesting because this was learned automatically. "Travel" with "move", "display", "colour" with its different spellings as well, "dye", "link" with "connect", "board", and so on. So that's another method: I can start from parallel corpora that are mainly used for training machine translation, and use them to learn words that might have the same meaning, because they lead to similar translations in the other language.
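The back-alignment idea can be sketched as simple counting (the alignment pairs below are invented for illustration, not the lecturer's actual data):

```python
from collections import defaultdict

# Hypothetical word-level alignments from a parallel corpus:
# (English word, French word it was aligned to).
alignments = [("eliminate", "eliminer"), ("remove", "eliminer"),
              ("remove", "enlever"), ("eliminate", "supprimer"),
              ("delete", "supprimer"), ("car", "voiture")]

# Group English words by the French word they translate to...
by_french = defaultdict(set)
for en, fr in alignments:
    by_french[fr].add(en)

# ...then English words sharing a French translation become
# candidate synonyms of each other.
candidates = defaultdict(set)
for en_words in by_french.values():
    for w in en_words:
        candidates[w] |= en_words - {w}

print(dict(candidates))
```

In a real system each candidate pair would carry a weight from the alignment statistics rather than a hard set membership.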
So this is simply the main idea. And this is what query expansion using a thesaurus or a dictionary looks like: if I get a query which contains the word "clothes", I can, when querying, go and add the words "fabric", "garment", and "tissue", and search. The user searched for only one term, but I added 3 additional terms, and I'm doing the search. Now, the thing about query expansion using dictionaries: it works for very specific applications, like the medical domain; it's very useful for some specific applications. However, the problem is that many times it fails to improve retrieval; sometimes it actually reduces both precision and recall. Can you think how this might happen? OK, who thinks it might reduce precision? How would this happen? Yeah, exactly. It might substitute, for example, "clothes" with "dye", something that is not what I'm looking for exactly. It retrieves documents to the top which are not actually what I'm looking for, so precision would be low. So how would this reduce recall as well?

SPEAKER 4
Maybe if the original word didn’t get translated back into it, but you still have the original word in the

SPEAKER 0
query. Yes, exactly. So it will not reduce recall in general, but it will reduce recall at, say, 1000. It will keep pushing the real results very, very low, so recall at the top 100 is itself reduced, OK? So we didn't retrieve that many relevant documents in the end. And when it works, it is hard to get consistent performance over all the queries. Even within a specific domain, you might run this and find that the mean average precision for your queries was 0.3 and afterwards it becomes 0.4 — this is great. However, the question is: is it consistent over all the queries? Which brings us to significance testing. Remember from last week that the overall improvement might be OK, but it's not significant for all the queries — some queries actually got worse. So this is something you need to check, especially when using query expansion: the average may be better, but is it actually really better or not? And this calls for a significance test, which is an important test here. And the main reason it fails is usually a lack of context. For example, if I find the word bank — what are the synonyms of a bank? Maybe finance, maybe money, you can think about this stuff. Exactly, which is a very common example. The bank of a river has none of that financial stuff, so expanding out of context will always be a problem. So how can we solve this? With the interesting recent stuff — word embeddings, BERT, and LLMs — we started to notice improvements. I would say with Word2Vec and GloVe, when they first came out, they were still losing the context; it wasn't that great. But once BERT and LLMs started to come out, they take the context into consideration, and the improvements turned out to be very useful. We will discuss this briefly again at the end of this lecture. Which brings us to another method. OK, forget about dictionaries.
A dictionary is a resource I need to have available in order to check whether words are similar to each other or not. What about thinking about a different way — what about using some relevance feedback? So the idea is: let the user give us some feedback about a sample of the retrieved results — what is relevant, what is not relevant — and then we can use it somehow. So imagine in your submitted results for lab 3, or even the coursework, you said these are the top 10 results. Imagine someone went and checked: OK, the 1st is relevant, yes, the 2nd is not, the 3rd is not, the 4th is relevant, the 5th is irrelevant. If I got this information — that documents 1 and 4 are relevant and 2 and 3 are not — I might think of a way to use it as a sample to improve my results. So how shall I do that? The user issues a query, searches for something, and gets some results. Then the user starts to mark some results as relevant or not relevant. After that, the system can use this kind of information to build a better representation of the user's information need — oh, now I got what you're talking about. And the relevance feedback can go through one more iteration: you can search again after the improvement, identify again what is relevant and what is not, and get even better results. From a user perspective, it may be difficult to formulate a good query — sometimes we write a bad query. But it's often much easier to say whether a given document is relevant to what I'm looking for or not. I'm looking for a specific thing, I'm not sure exactly what to write — sometimes we struggle with how to phrase what we're looking for — but when I see something: oh, this is exactly what I'm looking for. So let the system help with that.
Just to give you an example from image search, for simplicity. If someone searches for the word jaguar on image search, they can retrieve results like this: it can be the car, it can be the animal. So imagine that the user can go and tell the system which of these are relevant — these three are relevant: click, this is relevant, relevant, relevant — and then click another button called iterate. Now we can get results like this: oh, the system learned that they are looking for the car, not the animal, and in this case we get better results. This is simply the main idea. So in text search: someone is searching for a new space satellite application, and we use a system with stemming, with BM25 or TF-IDF, whatever we used, and we retrieved these results. The user can say: OK, I'm checking just the headlines, I'm not even opening the documents, but from the headlines I can see these are the ones I'm looking for — stuff related to NASA and the satellites and so on — but not the other stuff. So once I mark which ones are relevant, the system can learn new terms from those documents. These are the initial four words I'm searching with; now I can learn additional terms. The word NASA might be relevant here; EOS, launch, Aster are actually relevant. And from there I can do another search and get better results — hopefully better, because we still don't know. So this is the simple idea. Is the feedback from the user? Yes, it's from the user. You just did the search, but instead of taking the results as they are, you selected which ones are relevant and did another search to get even better results — hopefully better than what you first searched for, because the system learned what the context is exactly.
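The "learn new terms from the judged-relevant documents" step can be sketched like this. The function name, the toy documents, and the use of plain term counts (rather than TF-IDF weighting, which a real system would prefer) are all assumptions for illustration:

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, n_terms=3):
    """Append the most frequent new terms found in the documents the
    user judged relevant. Plain counts keep the sketch short; TF-IDF
    weighting would be the more realistic choice."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(doc)
    original = set(query_terms)
    new_terms = [t for t, _ in counts.most_common() if t not in original]
    return query_terms + new_terms[:n_terms]

# The user searched "space satellite" and marked two results as relevant.
relevant = [
    ["nasa", "satellite", "launch", "eos"],
    ["nasa", "space", "launch", "aster"],
]
print(expand_query(["space", "satellite"], relevant))
# -> ['space', 'satellite', 'nasa', 'launch', 'eos']
```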
So the theoretical basis of this is the theoretically optimal query for a certain information need. Simply put, the optimal query: if I could write a perfect query, it would retrieve exactly what I'm looking for and nothing else. Think about it: if you represent all the documents in some space, it's something like this — these are the relevant documents, these are the non-relevant documents. What I'm trying to do is find the query that would be the closest — using cosine similarity with TF-IDF or BM25, whatever measure you're using — to the relevant documents and as far as possible from the non-relevant ones. So I'm looking for something in this space, and I'm trying to reach this query. However, the challenge is that we don't know the relevant documents. If I knew the relevant documents in the collection, it would be easy, but I don't. Which brings us to one of the very famous algorithms — developed theoretically at first, but which later turned out to be very useful in practice — which is Rocchio's algorithm, developed by a scientist called Rocchio. OK, so the main concept here is the centroid of a set of vectors. Remember that in the vector space model we represent our documents and queries as vectors. So what is the centroid? The centre-of-mass point of a given collection of documents — which is simply summing all the document vectors and dividing by their number. Interestingly, this was introduced in 1963, many, many years ago, but this mathematical concept turned out to be very useful when applied to big data afterwards.
So the main idea here is that we are trying to find the optimal query, which — remember — should be as close as possible to the centroid of the relevant documents and as far as possible from the irrelevant documents. So we take the vectors of all the relevant documents and sum them, then subtract the centroid of the irrelevant documents. For cosine similarity, if you'd like to write it out, it would be something like this. Just to see what the optimal query would be: imagine I have these 5 relevant documents; I take all their vectors, do the summation, and get a vector from there, then subtract the centroid of the irrelevant documents — which will be thousands of documents. Hopefully this will be the optimal query, because it's the closest to the relevant documents and farthest from the irrelevant ones. Then, in practice, how can we do that when we don't know the relevance? The main idea is: we have an initial query, Q0 here — this is what the user submitted. What we are trying to do is find a modified query, QM. I keep the original query as it is, but I add to it the vectors of the relevant documents the user identifies. Imagine the user finds documents 1 and 4 to be relevant: I take these documents, sum their vectors, and add that to my query. And say the other 8 of these 10 retrieved documents are irrelevant: I take those 8 remaining documents and subtract their values. Can you follow this? So for example, I'm searching for jaguar and I identify that I'm looking for the car, so I will find that the word car starts to appear in the relevant documents.
The word model starts to appear in the relevant documents too, so I'm adding these to the query; but I find the words animal and forest in the non-relevant documents, so I'm subtracting those from the query. Is there anything strange about this?

SPEAKER 4
I don’t know.

SPEAKER 0
That's a good point. So I can understand the addition of words, but how do you subtract a word? How do you subtract a word in this case?

SPEAKER 5
So there is subjective.

SPEAKER 0
So what happens is simply this: the weight of the word jaguar would be 1, the weight of the word car would be 1, and the weight of the word forest is -1. So if I find a document that contains the word forest, I don't just give it zero — I actually penalise it, moving it away from what I'm looking for. So what I'm trying to do here is move the new query towards the relevant documents and away from the non-relevant documents. In practice, there are the values alpha, beta, and gamma — remember, these get multiplied into each of the parts: alpha for the original query, beta for the additional stuff we add, and gamma for the negative, irrelevant documents — and they sum to one. The values of beta and gamma, compared to alpha for the original query, are set high when a large number of judged documents is available. Imagine someone judged 100 documents: oh, I'm very confident about what is relevant and what is not, so in this case I can give higher weight to the new terms I'm going to introduce. However, if someone judged only 2 documents — this is relevant, this is not — that's just 2 samples, a very small sample, so I shouldn't rely on it much. In that case I'd keep 0.8 for the original query and maybe 0.1 and 0.1. In practice, the positive feedback, which is the beta part, is far more valuable than the negative feedback. So beta, the weight you give to the new terms, is usually much higher than the negative one. Actually, in practice, to be honest, we often don't use gamma at all — we don't put terms in with negative weight. Just give me more additional terms that are relevant, and that will be useful. Remember the example I was giving: jaguar plus car plus model minus animal minus jungle.
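A minimal sketch of that weighted update in a toy 5-term vector space — the vocabulary, the document vectors, and the 0.8/0.1/0.1 weights are made-up illustration values echoing the example above, not a reference implementation:

```python
import numpy as np

# Toy 5-term vocabulary (made up): [jaguar, car, model, animal, jungle]
q0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])        # original query: "jaguar"
rel = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],      # documents judged relevant
                [0.0, 1.0, 1.0, 0.0, 0.0]])
nonrel = np.array([[1.0, 0.0, 0.0, 1.0, 1.0]])  # document judged non-relevant

alpha, beta, gamma = 0.8, 0.1, 0.1              # small judged sample: trust q0
qm = (alpha * q0
      + beta * rel.mean(axis=0)                 # toward the relevant centroid
      - gamma * nonrel.mean(axis=0))            # away from the non-relevant one
print(qm)   # car and model gain positive weight; animal and jungle go negative

# In practice the negative (gamma) part is often dropped entirely:
qm_practical = np.maximum(qm, 0.0)
```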
It's a strange query, but you can run it — the current implementation you have would allow it; it's not a problem. But in practice you don't actually need the negative terms. Why would that be? What is the problem with adding these negative terms in practice? OK, for effectiveness it might be a bit useful, but what else? Anything about efficiency? Yeah, yeah — with the negatives, we'll have to subtract this

SPEAKER 5
from the entire list of documents every time.

SPEAKER 0
You have to subtract it — but actually, what else? When you have these negative terms, it means I still have to retrieve the postings of those terms and then do the subtraction. So I'm actually adding a lot of processing here: animal and jungle are not there to help my query, but to punish the non-relevant documents, which probably wouldn't match anyway. The documents talking about the Jaguar car and its models will be retrieved at the top; it's unlikely they contain animal or jungle. So why go and retrieve the postings of these two extra terms — processing a much larger query — just to find some irrelevant documents and push them lower in the list? Who cares? Ignore it. Is it clear? OK. And in practice, I will not add the whole vector of the documents. Usually what I do is add maybe 5 to 50 terms. You can experiment with this — and with how many feedback documents to take, maybe 5. And which terms shall I take from a document? Simply use TF-IDF, or BM25, whatever you're using. You don't want the raw most frequent term in a document — that would probably be something generic; who cares? I try to find the words that are most specific to this document, and that's exactly what TF-IDF gives me. I can find: oh, now the word car is important, the word model is important. So TF-IDF is very important here. So, the effect of relevance feedback on the query: you have an initial query like this, and this is the distribution of relevant and non-relevant documents. Imagine I retrieve the documents with the highest scores, the ones closest to my query — it will be something like this.
So the user will go and say: oh, I found 2 of these are relevant, 2 of these are not relevant — whatever is in the circle. So I modify my query, and hopefully the modified query moves closer to more relevant documents. And when I do another search, I get more relevant documents this time. Clear? No response — that's OK. So, the effect of relevance feedback: it can improve both recall and precision. Is it clear why? Precision, I think, is clear. But recall — why recall?

SPEAKER 1
the more Exactly, and also we’ll find other terms which

SPEAKER 0
are relevant that I didn't mention in the beginning — exactly. I start to retrieve other documents that didn't contain my original query terms, but do contain the modified ones. In practice, relevance feedback is most useful for increasing recall in situations where recall is important — I can find more stuff. Empirically, one round of relevance feedback is often much more useful than two rounds: maybe after the user identifies these 5 or 6 documents as relevant, they could do another iteration, find more relevant terms, get closer and search again — it can help, but usually it won't move you that much; the first iteration is the most important one. OK? However, there are some issues here. The first is that long queries are inefficient for a typical IR engine — a high cost for the retrieval system. Why? More postings, more processing, and you keep redoing it, yes — which makes for a longer response time. Also, it is often harder to understand why a particular document was retrieved. If you do this internally when a user searches — terms get added, different documents come back — the user wonders: why is this coming up? Because they haven't seen what happened in the process. However, there is one much bigger issue with this method. Can anyone think what it could be? There is one issue that makes it very impractical to apply. Anyone can guess? Yes — the user is unlikely to just keep…

SPEAKER 3
Yes, that’s an excellent thing.

SPEAKER 0
Users are often reluctant to provide feedback. One thing you have to learn about users in general — which is us, when we sit at a search engine — is that we are very lazy. We make spelling mistakes, we don't write capital letters, we don't provide feedback, and sometimes we don't even scroll down. I won't even say go to the next page — we sometimes don't even scroll down; if the top result isn't what I'm looking for, I type something else. So users are lazy. Imagine asking users: please tell us which result is relevant and which is not, so I can give you better results in the second round. No one would care — they'll just try another query themselves. So in practice it's not practical. This has been studied: it achieves good results in tests, but when you give it to real users, no one provides anything; they just do another search and that's it. So it's a very nice idea, but impractical. From a practicality standpoint, the user revises and resubmits the query — users may prefer to revise and resubmit the query themselves instead of judging documents, opening them and reading what's there. And even so, when users revise their queries, it's very useful for the big search engines for learning query suggestion. When you search for something, Google will tell you: did you mean to search for this? How did they know? Because a million users before you issued this query, didn't find much, then typed another query and started to find results. So Google learned from this with some statistical methods. But the relevance feedback idea is still good. Is there a way to apply relevance feedback without the user's input somehow? Yep — check which websites they click on? You can do that as an implicit signal; that's a good idea, yes. Any other ideas? Yes?

SPEAKER 1
Track. That what.

SPEAKER 6
With the user — for example, in a search engine you can track what they hover over, what they spend time on.

SPEAKER 0
OK, this is exactly that — click-through data. This is very useful for web search engines: what you looked at. But in general, if you'd like to apply this in your project in the second semester, how would you do it? How would you get user logs to find out what users are doing? So one idea is pseudo relevance feedback, or blind relevance feedback. It solves the problem that users hate to provide feedback: the feedback is applied blindly, automatically — it automates the manual part of relevance feedback. So what is the main idea? It's very, very simple. Retrieve a ranked list of results based on the initial query using your retrieval algorithm, then assume the top k to be relevant. Hopefully your system is doing well and matching some terms. So I would just blindly assume, say, the top 3 documents to be relevant. That's it. Then apply Rocchio's relevance feedback algorithm that we discussed earlier, and after that do another search and get new results. In this case, of course, there is no negative feedback, because you assume nothing about non-relevance — there is no gamma here. So: assume the top 1, top 2, top 3 to be relevant — hopefully your system is good enough that at least the top document is relevant — and then apply relevance feedback. What could be an issue with this? Yeah — if the top few documents are not relevant. If the first result for jaguar turns out to be about the animal, then you assume it's the animal, and that's it. So the good thing is that it mostly works — but it can still go horribly wrong if the top documents are not relevant.
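The whole pseudo-relevance-feedback loop can be sketched as below. The toy collection, the naive term-overlap `search` function, and the particular TF-IDF variant are all assumptions for illustration; a real system would rank with its own scoring function such as BM25:

```python
import math
from collections import Counter

def top_tfidf_terms(doc_ids, docs, n_terms):
    """Highest-TF-IDF terms over the concatenation of the chosen documents."""
    N = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    tf = Counter()
    for i in doc_ids:
        tf.update(docs[i])
    score = {t: c * math.log10(N / df[t]) for t, c in tf.items()}
    return sorted(score, key=score.get, reverse=True)[:n_terms]

def prf_search(query, search, docs, k=2, n_terms=3):
    """Blind feedback: assume the top-k results are relevant, add their
    best terms to the query, and search again. No negative (gamma) part."""
    top_k = search(query)[:k]
    extra = [t for t in top_tfidf_terms(top_k, docs, n_terms) if t not in query]
    return search(query + extra)

# Toy collection and a naive term-overlap "search engine" (both assumptions).
docs = {
    1: ["nasa", "satellite", "launch", "eos"],
    2: ["nasa", "space", "satellite"],
    3: ["cooking", "recipe", "launch", "party"],
}
search = lambda q: sorted(docs, key=lambda i: -len(set(q) & set(docs[i])))
print(prf_search(["space", "satellite"], search, docs))
```

Note how the second search can promote documents that share the newly added terms even when they missed some of the original query terms.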
And sometimes you can say: OK, I'll assume the top one to be relevant, do the blind feedback, then do another iteration, and a third. After 2 or 3 iterations you might end up somewhere totally different — on a different topic than the user was searching for. However, in general, in many applications it turned out to be useful. It was proven useful for many IR applications. News search especially: if you're searching for names and entities — say the presidents of the United States, which changes every 4 years, or the prime minister in the UK, which a couple of years ago was changing every few months — you don't actually need to search for the name; the system will implicitly learn it. Social media search: learning hashtags. You can search for something and the system learns which hashtags come with this search and can help you. And web search: implicit feedback like clicks would be useful — if users start clicking on something, you can internally assume those results are relevant, OK? Some domains are more challenging, like patent search, which uses very hard language — it's not easy there. Still, pseudo relevance feedback remains one of the most basic query expansion methods in IR. It is fully unsupervised — you don't need any kind of supervision. It's language independent — it really doesn't care about the language. If you're using a thesaurus or dictionary, you have WordNet for English, but do you have it for Vietnamese? Probably not. This method doesn't care: whatever the language, it just works. And it does not require any language resources — nothing is needed. Just apply it: search, assume the top k to be relevant, take that implicit feedback, and go.
For evaluation, imagine you have a test set like in the evaluation lecture last week: a query set and the relevance judgments for the documents. You can start applying it: I'll try assuming from 1 up to 50 top documents to be relevant — the top 1, or 2, or 5, or 50 — and how many terms shall I add, 5, 10, 15, 50? Try all of these combinations and see which achieve the best results against the query judgments you have. The results of PRF are compared directly to the baseline with no PRF. So I check: if the plain search achieved a mean average precision of 0.3, now I apply pseudo relevance feedback and see what the mean average precision becomes — 0.4 or 0.2, increase or decrease? Which is not cheating, because you're doing it blindly; you're not taking explicit feedback from the user. It's also essential to test whether the improvement is significant, because if you have 100 queries, it might improve 50 and degrade 50, so you don't actually know what is happening. So, to make this clear, let me show you something practical — which is basically the lab you're going to do this week. Let me bring it up on the screen. Remember this — this is from lab 3. This is a query, and these are the top results you retrieved based on your experiments. What I can do now, for each of these documents: I can check what happens if I take the top document for this query and see which terms in that document have the highest TF-IDF. That's all I do.
If I'm taking two or three documents, I append them into one big document and check the highest TF-IDF terms over those documents. So what I can do now — I created a simple script, something like this. It takes the list of documents you want, and finds the highest-TF-IDF terms for them from the collection. For example here, for the query "income tax reduction", if I point it at document 65: what are the terms in document 65 with high TF-IDF? That's simply what it does; I print the top 10. So this is what comes out: I learn terms like spend, labour, conservative, 20 billion. You can see not all of them are useful — some are, some are not great. But this is with only one document; what about taking more? So if I add 3533, the second document, I start to find terms coming up like tax band, property, bill — some useful terms are appearing there. Maybe I add more documents to the feedback: 3562 and 3608, for example. This is what I get: again band, property, bill, household income. So if I took the top 5 of these terms and added them to my query, hopefully I'd get better results next time. That's the first query. Go to the next one: peace in the Middle East. What are the top documents here? If I do this for the top document, 3549 — remember this collection is from the 1990s, so it's interesting to see what's there. You find the word Baker, who was a politician at that time. Israel is coming up, Arab is coming up, the Soviet Union is coming up. That starts to be relevant stuff. Yes — these columns are, I think, the term frequency, the document frequency, and the TF-IDF, OK? So you start to learn: I search for peace in the Middle East, and the system goes directly to Arab and Israel. It learns this directly.
And if I add more documents here, like 305 and 288 — yes? Again Israel comes to the top, peace, Baker, Arab, Israel, Shamir. And Assad — the Syrian president, father of the current one; the family controlled Syria for 50 years. So these kinds of terms come out to be interesting. I'm learning more relevant terms about what I'm searching for. Instead of just searching "peace in the Middle East", I can now reach documents that didn't mention that phrase explicitly but talk about Israel and the Arabs — probably they are about peace in the Middle East. So this is where you start to learn these things. Let's take some more examples. Let me take the fifth one: the computer industry — that's an interesting one. Let's take the documents there. What are the numbers? 3933. Let's take three: 3936 and 3937, OK? You start to find computer system, mainframe, IBM, microprocessor, Unix. That's relevant stuff that would be useful to add to my query. Whether it helps, we'd have to test — but you can see that I blindly assumed the top documents to be relevant, and I learned additional terms that turned out to be useful. Let's take one final one: the stock market in Japan. So I can do 3693, and 3459, and 21, yeah. I start to see some terms like FDBK — I'm not sure what that is — but Sahi, this is probably something related to the stock market in Japan, oil, W R R NT. So you can see some terms coming up that are useful. I can keep adding, by the way, so let me do 3416 and 287. Oops — I didn't filter out the numbers at that time, so it turns out the number 3 has a very high TF-IDF and came out on top. So this is where query drift might happen: if I run my query and add these — adding 4, 3, 2, 1, just numbers, to my query — I've destroyed what I'm searching for.
So you have to be careful, because it doesn't have to work all the time — sometimes it works, sometimes it doesn't. In general it often works well, but you have to be careful. And remember, this is just blind: there is no learning at all. All I'm doing is: get some documents, extract the highest-TF-IDF terms from them, add them to my query, and search. That's it, OK? Any questions? OK, let me get back to the slides. Which brings us to term representation in IR in general. What I've been telling you so far is how information retrieval has worked over decades, not just the last few years. So far, a term is a definite term: a car is a car; it does not equal vehicle, and neither equals hamster. We call this a local representation: each term represents only itself, out of context. So I know that car is not vehicle and not hamster, but I should be able to say that car and vehicle are somehow closer than vehicle and hamster, for example. So how shall we do that? This brings us to the trend of recent years — I'd say the last decade — which we call vector representations of terms. A term, a phrase, or a paragraph can be represented as a vector. But note, this is totally different from the document-as-a-vector we saw before. The document as a vector was a vector of term weights. This one is a vector of embeddings: values that we don't have to interpret individually — they don't carry an inherent meaning; they come from a neural network, from a hidden layer that learned this representation of the term. What we're trying to achieve with this kind of representation is that terms or sentences with closer meanings get a higher dot product — cosine similarity — than terms which are totally different.
This is what we're trying to do; this is the objective. It has been an active research area for a decade now — every year there's a new thing, starting from word embeddings to BERT to LLMs and so on. And ideally, if I can reach an abstraction over all knowledge, even multimodal, I'd like a sentence and an image carrying the same information to be close to each other. So the vector for the word bird would be close to the vector of an image of a bird. Can we do that? This is what people have been trying to develop over the last 10 years. So what we hope for at the end: instead of saying a word is car, a word is vehicle, a word is hamster, I'll have a vector for each of these words, something like this — vehicle and car would be two vectors close to each other, but hamster is further away from both. This is the objective that has been developed over the last 10 years. So let's take a quick look at the last decade — what has been happening. I'll go through this very quickly now; it will motivate the next few lectures, OK? Just a quick summary of what's going on. Traditional IR, as I've been showing you, was the case for a long time. Then from 2014 and 2015, people started to think about word embeddings — the word embeddings era — and the breakthrough came when Word2Vec came out, and GloVe came out. Is anyone aware of what Word2Vec and GloVe are? Did you hear about word embeddings before? OK — if you take any NLP course you will hear about them. However, when these came out, the impact on IR was quite limited; sometimes they didn't improve anything. The main issue was that these embeddings focused on the term itself in its representation, but again, they were missing the context.
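The objective described above — related words getting higher cosine similarity than unrelated ones — can be illustrated with toy vectors. The 3-dimensional "embeddings" here are made-up numbers, not real trained embeddings:

```python
import numpy as np

# Made-up 3-d vectors standing in for learned embeddings.
vecs = {
    "car":     np.array([0.9, 0.8, 0.1]),
    "vehicle": np.array([0.8, 0.9, 0.2]),
    "hamster": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the normalised vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs["car"], vecs["vehicle"]))  # high (close to 1)
print(cosine(vecs["car"], vecs["hamster"]))  # much lower
```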
So query expansion methods, TF-IDF, or BM25 were already achieving good results compared to them; the embeddings didn’t lead to that much improvement. Then in 2018 and 2019 the Transformer revolution arrived: the BERT models, the transformer-based models, which also create embeddings, but taking the context into consideration. The word bank referring to a financial institution, an institution of economics, is different from bank when it comes to the bank of a river, for example. So these models started to actually capture something of the context, and it turned out this had an excellent impact on retrieval. The results started to show significant improvements in many applications, and models like monoBERT and duoBERT came out at that time. Then from 2020 to 2021, people started to think about what’s called dense retrieval and dual encoders, which simply try to retrieve using big models like ColBERT and ANCE: more advanced models that learn from the queries and the documents, with objective functions designed to get better results for retrieval specifically. Of course there is a lot of machine learning in these parts. This started to enable end-to-end neural retrieval using dense embeddings, so something called neural retrieval began to emerge, approximate nearest neighbour search came out in different variants, and the trend at that time was the emergence of vector-based retrieval pipelines. Then something new happens every year; a crazy revolution is happening now. In 2022 to 2023, LLMs came out. ChatGPT came out, which introduced something called retrieval-augmented generation, which simply uses LLMs: what about trying to solve problems like question answering in a better way?
Now I can use retrieval as a means here, then get a summary to give the user a direct answer instead of giving them documents to read. That is simply the RAG systems created at that time. And very recently, just this year and last year, the trend is graph-augmented IR. This is something Microsoft released last year called GraphRAG. RAG simply retrieves some documents and summarises them; GraphRAG says no, I would put all the knowledge I have in a graph, and I would try to answer from the graph directly. It’s still at the research stage. It’s not highly scalable; you cannot apply this on the web, for example, so far, but there is research on it. Who knows what I’ll be teaching next year when I teach this course? Probably something new. But all of this has been developing over the last 10 years. Which brings us to the question: what about what we studied so far? Does it mean it’s obsolete now and no one is using it? The inverted index you have been happily creating, BM25, TF-IDF, the pseudo-relevance feedback we studied today: are they old technology? Can anyone guess the answer? Am I giving you information about outdated stuff?

SPEAKER 7
I think they are still relevant and used today. Even for RAG, BM25 will still be used to get the initial candidates more effectively.

SPEAKER 0
OK, that’s a good answer. So I would say these methods are sometimes still getting the best results, although of course the new models are getting amazing results too. But that doesn’t mean we don’t use them. They are indispensable. It’s impossible to get rid of them because they are super efficient and highly scalable. Their only limitation may be that they are not the best at ranking; they give good rankings, but not the best. So can I apply one of the LLMs to the whole web? They may train on the whole web, but how often do they update? Certainly not every day; it takes months, with a huge amount of money and processing power. So what is really happening now is that these methods are the main solution for the essential first step. You cannot do a search without them. What you created is the basic thing Google would do: when you search for something, it finds, say, 2 million documents matching your query in a second, using this normal inverted index, term matching, and so on. And it knows the top 1000 documents based on BM25, for example. But it will not show this ranking to the user. It takes these top 1000 documents and re-ranks them based on the new advanced models; it would use BERT or LLMs to re-rank them with a better understanding of the meaning, knowing that vehicle and car are actually the same, so it can do better matching. But can it apply those models to the whole web from the beginning? That’s impossible so far, unless something new comes along next year. So what we have been studying is the basis: how to represent the documents in a very efficient way to retrieve results very fast. And once you do that, hold on, I will not show it to the user yet.
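That two-stage flow, a cheap lexical first stage followed by an expensive neural re-ranking of only the top candidates, can be sketched roughly like this. Note that `index.bm25` and `reranker.score` are hypothetical stand-in interfaces, not any particular library’s API:

```python
def rerank_pipeline(query, index, reranker, k_first=1000, k_final=10):
    """Two-stage retrieval: an efficient lexical first stage (e.g. BM25
    over an inverted index) narrows millions of documents down to the
    top k_first, then an expensive neural model re-scores only those.
    `index.bm25` and `reranker.score` are assumed, illustrative interfaces."""
    candidates = index.bm25(query, k=k_first)       # fast, scalable first stage
    rescored = sorted(candidates,
                      key=lambda doc: reranker.score(query, doc),
                      reverse=True)                 # slow but accurate second stage
    return rescored[:k_final]
```

The design point is that the reranker runs `k_first` times per query, never once per document in the collection, which is what keeps the pipeline feasible at web scale.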
I will use these models to re-rank them first. Maybe the document ranked at number 1 will be number 10 now, and what was ranked at number 50 is number 1 now. So it re-ranks the top results and shows them to the user. This is the story of what is happening now. So we cannot live without the basics you have studied so far; these are the fundamentals of search engines. You have the inverted index, you represent the words with postings, and so on and so forth, and then you add additional stuff on top of them to improve the results. In some situations you might get rid of them, I have to be clear about that. If you have a company with maybe 10,000 documents, yes, I can build a GraphRAG system over them and query it directly like this, without any inverted index. But in real situations, with humongous amounts of documents, in the billions and trillions, no, you still need them, because they are super efficient. They were created when there were no GPUs around and they work very efficiently, and you have tested this yourself: you search for something and you get the results almost instantly. This is still what runs today. Is that clear? Any comments? OK. So the summary of this lecture: query expansion adds more terms to the user query to better match the relevant documents. This is the basic idea we have seen. It can be done via dictionaries, which you can build manually or automatically, but these sometimes fail to capture the context. That brought us to something called relevance feedback, a very simple idea: find the top documents, either by asking the user, which turned out to be impractical, or by assuming the top-ranked ones are relevant, and then learn additional terms from them. As we have seen, many times this works, but sometimes it can lead to query drift. And the current methods.
use advanced techniques for better matching, but we still apply them to the top retrieved documents, not the whole collection, because we cannot do that on everything. For resources, you can check Textbook 1, chapter 9, and Textbook 2, chapters 2 and 3. There is also an old paper I published maybe 14 or 15 years ago that tested different methods of query expansion for patent retrieval, evaluating the use of a thesaurus, WordNet, and other resources, and it shows how some of these work and sometimes don’t. It’s an interesting one if you’d like some additional reading. And you have lab 5, where you simply need to test it yourself: what happens when you try to get the top relevant terms from a given document. I have finished this lecture, but I still have the mind teaser, and then in 5 minutes I will show you the feedback you provided. If you’d like to leave, it’s up to you, but I would love for you to stay so I can show you the feedback. It will probably take 10 to 15 minutes at most for everything. So I will show you the mind teaser of today. And by the way, it’s a question that I couldn’t solve myself.
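The blind-feedback loop summarised above, which lab 5 asks you to try yourself, can be sketched as follows. This is a toy illustration with a simplified TF-IDF weighting, not the lab’s required interface, and it assumes the top documents come from the same collection:

```python
import math
from collections import Counter

def top_tfidf_terms(top_docs, all_docs, k=5):
    """Pick the k highest TF-IDF terms from the top-ranked documents.
    Documents are plain whitespace-separated strings here, for simplicity."""
    n = len(all_docs)
    # Document frequency over the whole collection.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.split()))
    # Term frequency within the assumed-relevant top documents.
    tf = Counter()
    for doc in top_docs:
        tf.update(doc.split())
    scores = {t: tf[t] * math.log10(n / df[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def expand_query(query, top_docs, all_docs, k=5):
    """Blind (pseudo-relevance) feedback: append the top TF-IDF terms
    from the top-ranked documents to the original query and search again."""
    extra = [t for t in top_tfidf_terms(top_docs, all_docs, k)
             if t not in query.split()]
    return query + " " + " ".join(extra)
```

Because there is no learning and no user input, everything rests on the assumption that the top documents really are relevant; when they are not, this is exactly where query drift comes from.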

SPEAKER 1
The white one, OK. Thank you. OK, here is the mind teaser of today.

SPEAKER 0
You have probably seen many puzzles like this. So: move one match to make this equation valid, without, of course, using “not equal”. If you get the answer, raise your hand, and after that we will show you the feedback you provided last week.

Lecture 5:

SPEAKER 0
So hi again. This is week 5, and I think many of you have already built a search engine from scratch, as we promised. I have seen a lot of results, so I’m hopeful. So we now have search engines working. But before I start: if you remember, last week I was a bit unhappy with the interaction on Piazza; it took a long time for you to share your results. It was not the same this week. I was happy, because even on Wednesday some people had started to share their results. So this is how I feel now. Oh, it even comes with sound; I didn’t know that, I never tried it with sound. OK. So I’m happy that you have started to share and discuss your results on Piazza. That’s great. It shows that everyone is getting some results and getting excited about it, so this is good. Hopefully most of you have finished, because the coursework submission is coming very soon, so you should have built it, or will in the next couple of days if you haven’t. Some of you discussed small differences in your results, which can always happen because of tokenization decisions, stopping, and stemming; small variations in preprocessing can of course lead to small variations in output. And one small thing: make sure you’re using log base 10, not log base 2, because this can make a big difference. So be sure about this. And the good news: there is no lab this week. I know you are very busy with your coursework and you have been doing a lot of practical work in the last few weeks. So this week there is no practical lab; just focus on finishing your coursework submission. I think you should be dancing now. The other thing: today we have a long first lecture; we might go over an hour, and the second one will be shorter, so that each lecture can focus on one topic.
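The log-base warning is easy to check directly. Assuming a `1 + log(tf)` term-frequency weight, which is one common variant, the two bases give very different numbers:

```python
import math

# One common term-frequency weight, "1 + log(tf)", computed in two bases.
tf = 100
w_log10 = 1 + math.log10(tf)   # 1 + 2 = 3.0
w_log2  = 1 + math.log2(tf)    # 1 + log2(100), roughly 7.64

# The log parts differ by the constant factor log2(10), about 3.32, so
# scores from the two bases are on different scales and cannot be
# compared across implementations that mix them.
print(w_log10, w_log2)
```

If everyone in the class uses the same base, rankings agree; mixing bases is one of the “small variations” that makes submitted results diverge.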
And I’m giving you this QR code; please scan it. You can provide feedback on the course. This is mid-year feedback, so your feedback will be very useful. I posted the link on Piazza, so you can go from there. It should take 1 or 2 minutes at most, not more. Just say whatever you want; it’s anonymized. Don’t worry, I don’t know who you are, so say whatever you want about the course. Yes, the QR code is there if you’d like to do it now. So how is the progress with Coursework 1? Who has finished implementing everything so far? Oh, that’s almost half of you. That’s good. The rest of you, please try to finish it; I hope it’s a good experience. The test collection and the queries will be released tomorrow. I think I will release them late afternoon, because I have an examination; I think I will finish at 3 or 4, so I will post them after that. If I manage to do it today, I will, but probably tomorrow. There are no tricks in this one; it’s very simple. The only difference is that instead of 1000 documents in the collection, it will be 5000. It’s a little bigger, though nothing compared to real life; the small one was just something to get your system working. And there will be a new set of queries, similar to the ones you have seen. But once this dataset is released, it’s a silent period: please don’t share any more discussions about it on Piazza. The silent period lasts until the submission on Sunday, so no questions about the coursework are allowed on Piazza. If there is something urgent you would like to ask, make it private, and we will answer it if we think it’s worth asking. But I can see we’ve answered most of the questions about this already. Before you post a question, have a look at the existing ones, because I think everything is covered on Piazza; many of you asked questions in the past week and we responded to them.
And the main thing: once the data is released tomorrow, you can just index the collection, run your queries, get the results, and submit. Your report should be ready by now; there is no big difference, the outcomes should be the same. The only difference will be the output files of the results; just submit the ones you get with the new collection. If you haven’t finished, you still have the weekend to work on it. You have until midnight on Sunday; that is the deadline to submit your coursework. Any more questions about this before we start the new lecture? OK, good. The objective of this lecture is to learn how to evaluate information retrieval in general, and search engines in particular: the measures. In this lecture specifically we’ll talk about measures such as precision, recall, and the F-measure, plus mean average precision and DCG. There is no lab this week, but you’re going to implement them anyway as part of Coursework 2, so there are still practical parts; this is a course with a lot of implementation, but hopefully it makes you understand better what you’re learning. If you remember the search process we discussed in previous lectures, the user submits the query, then we do the ranking, checking the index, and retrieve results, and there is a part about evaluation here. What we have done so far is the indexing and the ranking; the question now is evaluation. How shall we do it? If you’d like to model information retrieval as an experimental science, you simply have a research question, a hypothesis: if I do this kind of preprocessing, this kind of indexing, I will probably get good results.
So we design an experiment to answer this question and then perform some comparison. You sometimes ask: shall I do stopping or not? Shall I do stemming this way or not? Shall I remove numbers or keep them? I don’t know; you need to evaluate it and see which choice achieves better results. And the key questions: does the experiment answer the question? Are the results significant, or just luck? Because you can say, oh, it’s improving for this query, but another 10 queries might actually get worse. So how shall you know that this is actually better? You should report these results and keep iterating until you find the best system. This is how we see it. For example, does stemming improve the results or not? Is it better to index the documents without stemming or with stemming? You simply apply both, and then you need a way to measure whether the performance is better or worse, not just by checking one or two queries. For example, here is output from your lab 3; many of you already shared results there. These are the retrieved documents for query 1, query 2, and query 3. Imagine that we manually checked each document for each query: you take the query, check all the documents, and it turns out that these are the relevant documents for each query. So the question is: is that good performance or bad? For query 3, probably it’s bad, because you’ve got only one relevant document ranked very low; maybe query 1 is OK, but still, how shall we measure this? We need something to tell us: yes, this is 30%, this is 50%. How shall we know? That is what we’re going to learn about in these two lectures today.
So, about configuring the system itself: you can think about how a system could be configured, like applying stopping or not, which kind of tokenization, which stemming (Porter, or no stemming at all), or maybe n-grams, since we said you can represent a word as character n-grams; would that be better? Or using synonyms to improve retrieval, for example. All of these setups are different, and there should be a way to measure them. The corresponding experiment: run your search for a set of queries with each of these setups and see which one performs better. From the user side, you might wonder whether letting users add weights to the query terms themselves would be a good idea. The corresponding experiment in this case: build two different system interfaces, one that allows them to set weights and one that doesn’t, and see which one users use more and in which cases. All of this is done by experimentation. There are two types of evaluation strategies. One is system-centred studies: given a collection of documents, a set of queries, and knowledge of which documents are relevant to each query, try several variations of the system and measure which setup achieves the best performance. This is a kind of laboratory experiment. Or it can be a user-centred study, which is simply: given several users, I could split the class in two, and two retrieval systems. For example, is Bing better or Google? Is Perplexity better now, or maybe Google’s AI search? Give users the same task, run it on both systems, and they tell me: yeah,
this time I prefer this one, and based on your feedback we can tell which system is better. So it can be done in two different ways. Yes? Is it like ChatGPT, when it gives you two different answers and you have to pick one of them? Like sometimes ChatGPT gives you two different answers to your question and asks which one you prefer. You can think about it in that way, but the main objective here is to find which system is better. So there are two ways. One is to keep changing the system and check against a specific set of documents, producing some numbers; the other is based on users. The first doesn’t involve users; the second does: you ask the users themselves, what do you think, which one is better for you? Of course opinions vary, but if you run it on a large enough number of users, you can tell whether one system is really better than the other, OK? The importance of evaluation: it’s the main thing, if you think about it. Building engines without measuring how good or bad they are makes no sense; you have to find out. The ability to measure differences underlies experimental science in general. How well does a system work? If you have System A and System B, can you set up an experiment for A and say it is better than B? You need a way to answer this question. And is it really better or just by chance? Because it can happen that it works a little better for some queries but worse for others. And under what conditions is it better? Evaluation also drives what to research: it identifies the techniques that work and don’t work. For example, you might expect that stemming will match things it shouldn’t, but after many, many experiments it turns out that stemming is better most of the time, so we just apply it. Stopping:
we lose a lot of words, but in many applications, though not necessarily web search, removing stop words is fine. So all of this is based on experimentation: we need to experiment and determine which setup is better. And there are 3 dimensions of evaluation here. The first is effectiveness, which we will focus on in this lecture: how good are the documents that are returned? You built a system, you’re retrieving some documents, and you need to know whether they are good or bad. This can be system-only, just about the documents, or human plus system, how people prefer it in different situations. The second part, which will not be covered today but which you should always think about, is efficiency: what is the retrieval time, what is the indexing time, and what is the index size? This will be very important when we evaluate your group project in the 2nd semester. You might build an amazing system achieving almost 100%, where whatever I search for, I get exactly what I’m looking for as the top result, but it takes 5 minutes to retrieve the results. No one will be patient enough to wait for that, so it’s important to think about efficiency. Or it retrieves in milliseconds, but for 100 documents I need 2 terabytes of storage; that makes no sense. Or I need 1000 gigabytes of memory. Think about all of this; efficiency is important. And then usability: is it easy for the user to learn the system, and is it flexible for different kinds of users, novice or expert? For example, you may not be old enough, but if you remember, Yahoo at some point was a full page: it had a search box in the middle and a lot of information around it. Then Google came and said, no, no, no, I just provide a search box. That’s it.
Nothing else around it, so it’s very friendly for novice users; nothing advanced, but much more usable. In some situations you have expert users who need advanced search, like the structured search on Google Scholar, for example, where you go and search for specific things. So it depends on the usability requirements as well. This introduces us to the lab setup for experimenting in IR, which is called the Cranfield paradigm. It takes what we do for search, which is simply: you have queries and documents, you apply a search engine, and you get some search results. It adds to this setup an evaluation module which takes two inputs: the search results you achieved, and the relevance judgments, which we will talk about, saying which documents are relevant to the query. It then produces some kind of measure to tell us which system is performing better than the other. So these are the main four components: you need a set of queries, a document collection, relevance judgments for the documents for each of the queries, and then a measure at the end. How do you measure the performance? So, the main things: a collection of documents. It should be representative of the given IR task. For example, imagine that you’re building your search engine, as you are now, with Financial Times articles, and some setup achieves amazing results. I cannot say: I have a system that works for Financial Times articles, so I will build a web search engine using the same setup. That makes no sense, because the web has a different nature. So if you’re going to build a system for a specific task, you have to find a collection that represents that task. And also imagine, OK, I will build something for the Financial Times. Is it representative? Does it represent the Financial Times?
These are 30 or 40 year old documents. Imagine the articles were published now, Financial Times articles from today; would it be the same setup? So if you need a sample, take a sample across all the years and see if it works with all of them. And you have to think about the size, the source, the genre, the topic, all of this, so that you have a sample of a collection you can build your system with. Then you need a sample of information needs, which should be randomized and representative. I have to think about the queries a user would actually search for against this collection. Imagine I gave you queries for the collection I produced for you, which is Financial Times articles, asking, for example, who qualified for the World Cup 2028? Or is it 2027? I don’t know. For the next World Cup, it makes no sense; that information is outside this collection. So I have to think about the topics. If I searched for a cure for cancer, I’m not sure these documents talk about that; they are financial documents. So think about a plausible, random set of information needs that would be searched against this collection; it has to focus on this specific task. Then you need relevance judgments, which are simply assessed by humans: I have 10 queries in this case, and I check which documents are relevant to each of these queries. If I have this information, I can easily build evaluation measures that tell me how well the system is performing. So, this first lecture is a bit longer; it only covers this part, and the 2nd lecture today will cover the remaining measures and will be way shorter. But let’s talk about the evaluation measures themselves. We would like to build an effective measure.
First of all, it should capture some aspect of what the user wants: we need a way to measure whether the user is satisfied with the results or not. And it should be easy to replicate. Don’t say: I have a measure and it achieves 60%, and then when I build something better and want to see whether the 60% becomes 70% or 80%, sorry, we cannot replicate it, it was a one-off. No, you need to think about how it can be replicated: if someone follows your exact procedure and steps, they should achieve the same results. And it should be easily comparable. How do you make two things comparable? The easiest way is to have numbers: 10 versus 20, 30 versus 60. It can still be curves; I can say the performance is some curve like this, but if you’ve got another system whose curve looks like that, is there an easy way to compare them? Sometimes we use curves for analysis, but the optimal way is a single number that tells you the performance. Another important property is that the measure should be predictive for other situations. Say I have a measure of performance, but now I have a new collection, or the collection has expanded, and my queries change over time. Will this measure continue to reflect the performance, or will it stop working so that I need another measure? I have to think about something robust that will stay useful over time. We will start with a very simple set of measures that you may have heard about before, designed as set-based measures: the ones which assume a search engine retrieves a set of results without ranking, like the Boolean search we did in lab 2.
A set of documents: these are the sets of documents, and we need to find out if they are good or bad. In this case there are no fixed sizes; if you remember lab 2, some queries retrieved 3 or 4 documents, others retrieved 50 or 100. So we need a measure that can handle that. We will move through this lecture by examples, OK, so hopefully this will make it clearer. Imagine something like this: I have a query, and I know there are 8 relevant documents to this query in the collection, and I built 7 search engines here, A, B, C, up to G, and each of these systems retrieves a set of documents. For example, A retrieved 10 documents: one relevant, one not relevant, one relevant, and so on, OK? So the question is: how do we measure this? We will talk about precision and recall. Who has heard about precision and recall before? That should make this easy, but we are going to see them from the perspective of search engines and information retrieval, not from classification, which is a bit different. So imagine we have this big document collection, and these are the retrieved documents: I searched and retrieved this set of documents, let’s say 10 documents. When I check which of those are relevant, and which relevant documents I have in my collection in general, it turns out to be this picture. So I have these subsets of the big collection: there are some relevant documents, of which I retrieved some, along with some non-relevant ones. In general, this is what we call it: the red part is the false positives, which are retrieved documents, classified as positive because we thought they were relevant, but actually false because they are not relevant and shouldn’t have been retrieved.
The green is the false negatives: these are relevant documents that I failed to retrieve. I treated them as irrelevant, but they are actually relevant; I just couldn’t retrieve them. The brown part is the true positives: the relevant documents that I managed to retrieve, classified correctly. And the remaining part, the grey, is the true negatives: irrelevant documents that I correctly did not retrieve. That’s it. So precision simply tells us what fraction of the retrieved documents are relevant. OK, so if I ask you what this area would be, can someone describe it in the colours? Excellent: brown divided by red. It’s the intersection between the retrieved and relevant sets divided by all the retrieved documents, so the brown divided by the whole red circle. That is precision. The other measure is recall. Recall tells a slightly different story: what fraction of the relevant documents did I manage to retrieve? So in this case, what would the colours be? Yes.

SPEAKER 1
Brown divided by the relevant, the green part with the brown?

SPEAKER 0
Yes, yeah, exactly. It’s again the intersection between retrieved and relevant, but divided by all the relevant documents, which is the brown divided by the green circle including the brown. So this gives us the two sides of the story. So if I ask you, for this example, what is the precision for System A? 5/10: I retrieved 10 documents and 5 of them are relevant. What is the recall for System A?
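The set definitions just given translate directly into code. A minimal sketch, using made-up document IDs for the System A example (10 retrieved, 5 of them among the 8 relevant):

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant)

# System A from the example: document IDs are invented for illustration.
relevant    = {1, 2, 3, 4, 5, 6, 7, 8}                  # 8 relevant docs
retrieved_a = {1, 2, 3, 4, 5, 11, 12, 13, 14, 15}       # 10 retrieved, 5 relevant

print(precision(retrieved_a, relevant))  # 0.5   (5/10)
print(recall(retrieved_a, relevant))     # 0.625 (5/8)
```

The intersection `retrieved & relevant` is the brown region from the diagram; dividing it by the retrieved set gives precision and by the relevant set gives recall.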

SPEAKER 1
Oh.

SPEAKER 0
I cannot hear you. Yes, 5/8. We know there are 8 relevant documents, and I managed to retrieve only 5. So the precision is 5/10 and the recall is 5/8. What about the second system? Precision 6/12, and recall 6/8, and you can keep doing that for each of them. OK, what about the last one? Precision 4/5, and the recall is 4/8, exactly. So there is always a trade-off between precision and recall. Precision is the ability to retrieve documents that are mostly relevant, and recall is the ability to find all the relevant documents in the corpus. And the problem is: if you design your system to retrieve more documents, there is a higher chance of finding all the relevant documents, which increases the recall. But on the other hand, there is a chance that you also start to retrieve things that are not relevant, which reduces the precision. If someone asks about stemming in this case: what do you think, does stemming increase recall or precision? Who thinks stemming would increase precision? Who thinks it would increase recall? Yes: if you apply stemming, you will increase recall, because if I search for play, it will also match playing, played, and so on, so I retrieve more documents and the recall will increase. But precision might decrease, because I start matching forms that are not exactly what I’m looking for, and some of them might not be relevant. So stemming is a very simple example of the precision-recall trade-off. Is it clear? Good. So the trade-off in general is that when you retrieve a small number of documents, like on this side, you will find that precision is high.
Usually because you’re very strict, so you only get exact matches, but recall will be low. And once you start retrieving more documents, the recall will probably increase, but the precision might decrease. So one system returns mostly relevant documents but misses many useful ones; the other returns most of the relevant documents but might also include a lot of junk, noisy documents. And of course the ideal case is this one: if you manage to achieve 100% precision and 100% recall, retrieving only the relevant documents, all of them, that would of course be ideal. So the problem with precision and recall is that each one measures only one side of the problem. There is actually another measure that captures all of it, the true positives, true negatives, false positives and false negatives as well, which is accuracy. So this can be another measure that measures everything. Who thinks this would be a good measure to use? Who knows what accuracy is? That should be the easiest measure. Who thinks it would be a good measure for information retrieval? OK, who thinks this will not be a good measure for information retrieval? Can someone say why? Yes, because there are a lot of documents on the internet, so if you just don’t retrieve any of them, you’ll get good accuracy. That’s an excellent one, because actually you’re measuring true negatives, the ones which are irrelevant and which you didn’t retrieve, and that’s a lot of documents. So this is how I’m drawing it, but in reality it should be something like this: it’s a really, really small signal you’re looking for. The problem is that the irrelevant documents vastly outnumber the relevant ones. It’s like a needle in a haystack.
So if I use accuracy, for example: I have a collection of 1 million documents, and this is a small collection, and I know there are only 10 relevant documents, and I retrieved 10 documents. If I find that 5 of the retrieved ones are actually relevant and 5 of them are irrelevant, and I measure the accuracy in this case, it will be 99.99%. Because I was very good at not retrieving the irrelevant documents. Who cares? So while accuracy might be an interesting measure for other applications like classification, for IR it’s the worst measure ever. It makes no sense, because it says, oh, you’re great, you didn’t retrieve any of the irrelevant documents. Who cares about this stuff? I don’t care, because there are millions of documents out there. What I really care about is the needle I’m looking for; I need to find those with high precision. So is there another measure that can somehow combine precision and recall? Yes, the F score. The F score is simply the harmonic mean between recall and precision, to try to combine both of them. And the interesting thing about the harmonic mean, compared to the arithmetic mean or the geometric mean, is that it is pulled towards the lower value. The arithmetic mean gives you the average, but with the harmonic mean, if one score is 10 and the other is 5, it will go closer to the smaller one. So it forces you to achieve high values on both of them. And instead of F1 there is actually F beta, where you can say F5 or F0.5, which gives emphasis to one score over the other. So I might focus more on the recall, or more on the precision, depending on the application.
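The harmonic-mean behaviour is easy to check numerically (a sketch; `f_beta` is an invented name for the standard F-measure formula):

```python
def f_beta(p, r, beta=1.0):
    """F-measure of precision p and recall r.
    beta > 1 weights recall more heavily; beta < 1 weights precision more;
    beta = 1 is the plain harmonic mean (F1)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(f_beta(0.75, 0.50))          # 0.6 -- pulled toward the lower value, not 0.625
print(f_beta(0.75, 0.50, beta=0))  # 0.75 -- beta = 0 ignores recall entirely
```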
So F1 means both of them are equal, but if I set beta to 5, I’m giving recall 5 times the importance of precision. OK, so the F scores for these ones. Here you see the precision, recall and F1, and you can see the scores are pulled towards the lower value. So the second one, with 75 and 50, actually goes to 60; it doesn’t come to 62.5, the arithmetic mean. It goes closer to the lower value. This is F1. If you check F5, which gives high importance to the recall over precision, it starts to come closer to the recall. If I use F0.5, it starts to come closer to the precision, giving precision two times the importance of recall. What about if I use F0? What would the scores be? Precision, exactly. This means I’m giving all the importance to the precision and zero importance to recall; it would be exactly the precision values, OK? So this is another way to measure the quality of your retrieval system, but so far it’s all about Boolean search; there is no ranking here. What about when you start talking about ranked documents? So imagine in this case System A and System B. They both retrieve 10 documents, and 5 of them are relevant. So what is the precision for System A? 5/10. And B? 5/10. What is the recall? Imagine there are 8 relevant documents: 5/8 for both of them. F score? Identical for both of them. Do you think A and B are the same quality? No, obviously not. System B looks much better than A. But the problem with the measures so far is that they don’t take rank into account at all. Evaluation of ranked IR should bring ranks into the equation, and the question is how to do that. How do I somehow integrate ranks into the measure? There are different ways to do it, but let’s imagine these documents again, if I asked you, in general.
Imagine these are 7 different systems. Which system would you prefer? This is a personal opinion; there’s no right answer here, by the way. D? G? OK, anyone has other ones? You don’t like any of them? OK, let’s do that. Who likes A the most? Who likes B the most? Who likes C the most? D? E? F, OK. G? Interesting. OK. So there are different preferences, but there is no right answer; it depends on which measure you’re going to use. The thing is that we have to find the right measure for a specific task. Some of you, like those who selected F, would prefer to find more relevant documents; some would prefer documents that come more at the beginning. A very simple, even dumb, solution to use the rank is to calculate precision at a given rank, say precision at 5. In this case, I make a cutoff: I do not consider anything below this rank, and then I see what the precision is. It sounds like a naive solution, but it actually works, and it’s a very common one, by the way. It might average badly, because taking the cutoff at 1 versus 5 versus 10 can give totally different results depending on the system. So let’s say precision at 5: what I do is very simple, I just draw a line, cut off at 5, and exclude everything underneath. I really don’t care what the system achieved after that. And it’s not a bad measure, because for many applications, like a web search engine, only the top 10 results are shown on the first page, and that’s it; you can say precision at 10. Actually, one of the very common ones that you’ll find reported all the time in many system evaluations is precision at 1.
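Precision at a cutoff k, and R-precision, which is the same cutoff taken at the known number of relevant documents R, can be sketched like this (the 0/1 relevance list is an invented example, not one of the systems on the slide):

```python
def precision_at_k(ranked_rels, k):
    """ranked_rels: 0/1 relevance flags in rank order (rank 1 first)."""
    return sum(ranked_rels[:k]) / k

def r_precision(ranked_rels, num_relevant):
    """Precision at rank R, where R is the known number of relevant docs."""
    return precision_at_k(ranked_rels, num_relevant)

run = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # a made-up ranked run
print(precision_at_k(run, 5))  # 3/5 = 0.6
print(r_precision(run, 8))     # 5/8 = 0.625: 5 of the 8 relevant docs in the top 8
```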
People care a lot about the first document: is it relevant or not? So if I ask you, what is the precision at 5 of the first system? 3/5. What would be the second? 0. So in this case, which systems are the best, by the way? D and G. So those who selected D and G are actually precision oriented. But this is just a very simple approach. Someone can ask, why should we consider only the top k documents? What is the optimal number k? For example, here we know there are 8 relevant documents. So if I take a cutoff at 5, and a system retrieves all 8 at the top, I will never be able to reward it over the others. So one solution is called R-precision: for a query with a known number of relevant documents R, R-precision is the precision at rank R. So in the last example, with 8 relevant documents, I take precision at 8. And the main concept here is that I’m comparing to the ideal system, which would retrieve all the relevant documents at the top. That’s it. Is it realistic, by the way? Is it a realistic measure to use most of the time?

SPEAKER 1
There can be many, many documents in a real search engine that are relevant to certain terms, so I don’t think it’s really...

SPEAKER 0
OK, but why, what else? Imagine this is a small, like 1 million document collection. It’s not a web search engine. So, oh yes, because if we search for a

SPEAKER 1
query, there’s like there will be so many, uh, hits basically for each, uh, term in the query.

SPEAKER 0
But it’s not about hits here. We’re talking about relevance. It might have a hit, but it’s, you, you will check it manually and then decide if it’s relevant or not. For a quiz with a large number of documents, we

SPEAKER 1
could have.

SPEAKER 0
The main thing here, yes, you want to say something? Yeah, exactly: how would you know R? It’s not an easy job. We’ll talk and discuss this more in the second lecture, how to find most of the relevant documents, but in the end you will not capture all of them, and it’s different for every query as well. So query 1 has 8, query 2 has 50. Doing that can work in an experimental setup, but in the end it’s not general; if you’d like to generalize it, it becomes harder, OK? But it’s still used and it’s very useful in specific situations. So here I should take the cutoff at 8, ignore everything below, and based on this, which system is better? Yes, so the answer changes this time. So what is the right answer? There is no right answer. It depends on the measure you’re going to use, and selecting the right measure depends on the task itself, OK. But people have been thinking about what the cutoff should be, and it’s actually a very intriguing question. What about user satisfaction? Remember that our objective in search engines, in IR, is to satisfy the users’ needs. So it is assumed that the user needs to find a set of relevant documents at the highest possible rank. So precision is an important measure, because I need to find my relevant documents at the top; that’s what I’m looking for. But the question is where the user would cut off, stop inspecting more documents in the ranked list. At some point the user will not keep reading; at some point they will stop. So we need to find the cutoff here where the user would stop, which is X. But what is the optimal X? When can a user stop? When do you think a user would stop?
To find the answer to their questions, yes, which can happen at any point. So when can a user stop? Remember that we need to satisfy their information need, and the assumption is that they will stop once this information need is satisfied: they read the piece of information that completed the whole picture. They read document one. Yes, it’s relevant, but I still need more information; I’m doing a search, so I read the 2nd, 3rd, 4th, and after 5 documents, oh, now I think I found all I need, I’m done. Another user may stop at the 1st or 2nd document only. So the user will keep looking for relevant documents in the ranked list, read them, then stop once they feel satisfied. So we can assume that the user will stop at a relevant document. I read the first document: relevant, nice, I got 50% of my information need. I went to the second: it wasn’t relevant, it didn’t add anything. The third: not relevant, it didn’t add anything either, and I kept reading. At the 4th one I got the other 50%, I’m done. So I will probably stop at a relevant document. And X can be any rank where a relevant document appears, and we can assume a uniform distribution. Let’s take an example to make it clearer. Imagine I have a query and the collection has 8 relevant documents, and this is what we retrieved, OK? But we’ve got 3 users here. The first user reads the first document. It turned out to be relevant, and they found, oh, that’s exactly what I was looking for, I don’t need anything else. They are satisfied, they are happy. The 2nd user keeps reading: the 1st one is relevant, the 2nd one is not relevant, the 3rd one is irrelevant, and they keep reading until they reach document number 7. And at that point they found, oh, now I’ve got everything I need, so now I’m happy, and they stop.
And a 3rd user is interested to dig deeper into this topic and would like to learn more, so he reads the 1st, the 3rd, the 7th, and keeps going. He reaches the end of the list, but there is still some information missing, so he continues reading, goes to the next page and keeps reading. So if I asked you: for user one, what is the perceived precision for them? 100%, yes; they read the first document and it was relevant, 100%. For user two, what is the precision for them? 4/7. Remember, this is the same query, the same set of documents, the same search engine, but user one perceives the precision as 100% and the second one as 4/7. For the last user, what is the precision for them? Imagine he kept reading until he reached 1000 documents and found all 8 relevant documents: it would be 8 over 1000, close to 0, almost 0, because he found the last of the information near infinity. So instead of taking one cutoff for every user separately, what about calculating the average over all the potential stopping points, which is simply calculating the precision every time I find a relevant document and then taking the mean afterwards? Which brings us to the most common evaluation measure for IR, from its history till these days, and it will continue to be in the future: average precision. Average precision is the main retrieval score for search engines. It simply takes the average of the precisions that might be perceived by users, calculated every time a relevant document is found. So let’s do that. Imagine we have 1 system and 3 queries. The first query has 4 relevant documents, and the system retrieved them in this way. So what is the precision at the first rank?
1. At the 2nd rank? 1 again. There are no relevant documents here, so the user will continue to read. Then they reach the 5th document: what would be the precision in this case? 3/5. And I keep going till I find the next relevant document at rank 9. In this case, what would be the precision? 4/9. And then I can simply say the average precision here is the average of all of these numbers. So now this is the performance for this query. For the 2nd one, what is the 1st precision? 1/3, yes. What is the second one? 2/7. What is the average precision here? No, you missed one piece of information: there’s a 3rd relevant document, but it wasn’t retrieved. So what would be the next precision? It’s effectively 3 over infinity; I couldn’t retrieve it. This is my 3rd relevant document that I couldn’t find, so my precision for it is 0. So when I take the average, I don’t take the average between these two only; I take the average including the stuff I didn’t find. Is that clear? For the third one, another exercise, there are 7 relevant documents. First precision: 1/2. 2nd one: 2/5. 3rd one: 3/8. And to calculate the average precision, I add these 3 numbers and divide by 7, because there are 4 I couldn’t find. OK, is that clear? This is the most important measure for search engines, for information retrieval in general. And this is one system. When I ask what the performance of this system is, I simply calculate not the average precision but the mean average precision: the mean of the average precisions, which is simply the mean of all these three numbers. So the mean average precision, the average performance of this system, is 0.38. And in this case, I take into consideration all the situations that can happen: a user would stop at the 1st rank, or the 3rd, or whatever. It depends on the user. OK. Any questions? Yes.
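The worked queries above can be reproduced with a short function (a sketch; the 0/1 lists encode the queries from the slide as I read them from the transcript):

```python
def average_precision(ranked_rels, num_relevant):
    """Average of precision@rank over every rank holding a relevant doc,
    divided by the total number of relevant docs, so relevant documents
    that were never retrieved contribute 0."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

def mean_average_precision(runs):
    """runs: one (ranked_rels, num_relevant) pair per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

# Query 2: relevant docs retrieved at ranks 3 and 7, a 3rd never retrieved
q2 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(average_precision(q2, 3))  # (1/3 + 2/7 + 0) / 3, about 0.206

# Query 3: relevant docs at ranks 2, 5 and 8; 4 of the 7 never retrieved
q3 = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(average_precision(q3, 7))  # (1/2 + 2/5 + 3/8) / 7, about 0.182
```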
That’s a good question. What is an acceptable one? It depends on the task. For some tasks you’re talking about 0.7, 0.8. For some tasks, if you achieved 0.1, it’s an amazing result. It depends on their task. So, uh, for example, uh, when you do, uh, I was doing my PG on patent search, searching patents, and if you achieved 0.2, this is an amazing result because you people would read a lot of documents in this case. So the number of 11 documents, you need to find all of them. Matching is, is hard. It depends on the task. But in general, the average precision, I mean average precision. This is actually the formula of it. 1 divided by R, the number of relevant documents, and you multiply the precision every time you find the relevant document. And the media represent is simply divided by the number of queries you have. You have 1050, 100. Just take the average across all the queries you find for your system. That’s simply as it is. So It’s kind of a mix between the precision recall in this case. Do you know how precision is clear how this is actually reflecting the recall as well.

SPEAKER 1
We are dividing the.

SPEAKER 0
Exactly, we’re dividing by R in this case the relevant

SPEAKER 1
documents, and we were dividing by... the actual relevant documents which were retrieved. Exactly.

SPEAKER 0
So even if you miss, like in the example where we missed 4 documents out of the 7, it was still divided by 7, which reflects the recall here. So it’s a mix of both. However, the main idea of this score is still to focus on finding the relevant documents as early as possible. Because if you find one at the 1st rank, great, 100% precision counted. If you find it at the 10th, you just get 1/10: too late, you’ve missed a lot of the score. And whenever a query has only one relevant document, this becomes what’s called the mean reciprocal rank: if you found it at the 10th rank, the reciprocal rank will be 1/10, that’s it, just the inverse of the rank. One limitation remains: it only uses whether the document is relevant or not, just binary. But let’s apply it to our systems here. Based on your understanding, which do you think is the best system here, using average precision? Raise your hand. B? C? OK, a couple. F? OK, G. So we get more votes between F and G. Actually, if you check, these are the scores of average precision, and the highest turned out to be D. It missed a lot of relevant documents, but it’s the best system at placing 3 relevant documents in the top 3 ranks. Because it found more relevant documents as early as possible, it got the highest score. Finding relevant documents later will not add a lot. So if another system found all of those later documents, who cares? They add a very small value to the score. So mean average precision is still precision oriented. It has recall in it, but it’s still precision oriented: the main objective is to find the relevant documents early. Actually, G, or actually F.
You can see they actually retrieved a lot of relevant documents, but one missed the first, and the other missed the 1st and 2nd. If you check the results, this is their ranking: they came out 4th and 5th, not even 2nd or 3rd. Because once you miss the one at the top, you are hurt a lot: if I found one at the top, I got 1. If I found it in the second position, I got 0.5; I already lost 0.5. Here I got 0.3. Finding more relevant documents later helps a little bit, but I still missed a lot. So this is really important to understand: the research focus is on finding relevant documents as soon as possible. Is it clear now? Good. So far we have been talking about binary relevance: is it relevant or not? However, some documents might be more relevant than others. A document is relevant, yes, but this one is super relevant. So instead of binary, relevance can sometimes be graded. You can say this is a perfect document, amazing, relevant to what I’m searching for, I don’t think I need anything else. Or it can be an excellent document, it has a lot of the information I need; or a good document, it has much of what I’m looking for; or fair, yeah, on the topic but not exactly what I’m looking for; or just bad, not relevant at all. So if relevance is no longer just 1 or 0, then finding a perfect document should give a higher score than finding an excellent document. There are 2 assumptions here when you talk about graded relevance. First, highly relevant documents are more useful than marginally relevant documents; we try to find the ones which are super relevant. And second, the lower the rank position at which I find a relevant document, the less useful it becomes, because I wanted to see it as early as possible.
Remember, the main concept of precision is finding relevant documents as soon as you can in the ranking. There is another measure that has been used; it’s actually the most commonly used measure for web search engines specifically. MAP is still used, but this one is heavily used as well for web search: the discounted cumulative gain measure, DCG. It uses graded relevance as a measure of usefulness. What does this score mean? Gain is accumulated starting at the top of the ranking and may get reduced, discounted, at lower ranks. Users care more about highly ranked documents, so documents found later in the rank are discounted by 1 over log base 2 of the rank. It’s different from average precision: in AP you take 1 divided by the rank, but here you just take a little bit of a discount, log base 2 of the rank. The discount at rank 4, for example: log2 of 4 is 2, so you get only half the gain. And at rank 8 it’s 1 over 3. So the lower I find a document in the list, the lower the gain I get from it. And DCG is the total gain accumulated over the ranks until you reach a certain cutoff: DCG is the relevance of the first rank as it is, plus the sum of the relevance at every later rank divided by log2 of the rank, so I get some discount. And the relevance here can be graded: 0, 1, 2, 3, 4, whatever you decide. You can use 1, 2, 3, or 1, 2, 10, depending on how many grades you have. Let’s take an example and see how it works. Imagine I have a system which retrieves the following documents. The first document, at rank 1, has a gain of 3: it’s very, very relevant. The second one is good, but not perfect like the first; it has a gain of 2.
The third one is super relevant, and the 4th and 5th are not relevant, and so on. So what would be the discounted gain? The 1st and 2nd are untouched: the 1st is taken as it is, and the 2nd is divided by log2 of 2, which is 1, so nothing happens. Then from the third, I start dividing: 3 divided by log2 of 3, which gives this number. I keep going, so even though I got a gain of 3 at the ninth rank, I will not be adding 3, I’ll actually be adding only 0.95, because I get a big discount for finding it so low. Fair enough? DCG at K: I keep adding these scores. So if I stop at rank 1, I get a gain of 3, great. If I stop at rank 2, I got 3 plus 2, which is 5. At rank 3, I keep adding. At ranks 4 and 5 it stays the same as at rank 3, because I didn’t gain anything: those two documents were not relevant. And this is how I measure my DCG. So if I ask you: this measure is used for measuring the cumulative gain, the performance, of different systems. Is there any problem or issue with this score so far? Think about it. If I ask you, what is the ideal case here? For mean average precision, I know the ideal case: I get a mean average precision of 1. What would be the ideal case in DCG? It’s hard to tell, because it just keeps adding: as you go down the list, you’ll probably find more relevant documents and it keeps adding them. So the value itself doesn’t mean a lot; it just keeps growing. And this brings us to NDCG, normalized DCG. These numbers just keep accumulating: here DCG at 5 is 6.89, and at 10 it’s 9.61. It can be any positive real number. So we need to normalize it somehow, so that comparing two systems gives us a more indicative number. That’s where NDCG comes in.
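The DCG arithmetic just described can be sketched as follows. The gains at ranks 1, 2, 3 and 9 and the two zeros at ranks 4 and 5 are stated in the lecture; the remaining middle ranks are my guess, chosen so the totals reproduce the DCG@5 and DCG@10 values quoted from the slide:

```python
import math

def dcg_at_k(gains, k):
    """Graded gains in rank order. Rank 1 counts in full; every later
    rank r is discounted by log2(r)."""
    total = 0.0
    for rank, g in enumerate(gains[:k], start=1):
        total += g if rank == 1 else g / math.log2(rank)
    return total

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg_at_k(gains, 5), 2))   # 6.89
print(round(dcg_at_k(gains, 10), 2))  # 9.61
```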
It divides the DCG of a system by the DCG of the ideal case. So all NDCG values will always be less than or equal to 1, and I know that if I reach 1, this is the perfect case; if I got 0.5, this is 50% of the ideal case. It now becomes much easier to compare different systems. But what is the ideal case? The ideal case is to have all the most relevant documents at the top, with the less relevant ones coming afterwards. So here, for what we have, the ideal ordering for this set is to rank them like this: all the threes at the top, then all the twos, then the ones, and then all the zeros. So I imagine what the ideal ranking would be and calculate the IDCG, the cumulative gain for that ordering. This is the IDCG: again the discounted gain, so I get 3, 3, and then the next 3 is discounted at rank 3, because even in the ideal case it appears at the 3rd rank. And when I compute IDCG at K and sum them up, these are the best values I can get for this query. So NDCG is simply the DCG divided by the IDCG, which looks like this. So my system was ideal at rank 1; at rank 2 it’s not ideal, it’s actually 83% of the ideal case. At rank 3 it goes up to 87% of the ideal, because I found another super relevant document, and at rank 10 it’s 88% of the ideal case. So it keeps going up and down, comparing my system to the ideal case. Is it clear? OK. I know it’s a long lecture, but we’re reaching the end of this part. So, the summary: IR test collections, we said, have 4 main components: documents, queries, relevance judgments, and IR measures. So far we have only discussed the IR measures; we didn’t discuss how to create the other 3. We’ll discuss that in the second lecture.
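The NDCG normalisation for the same example can be sketched as (a sketch; here the ideal ordering is simply the same gain list sorted descending, which matches the slide, though in general the ideal ranking should be built from all judged documents for the query):

```python
import math

def dcg_at_k(gains, k):
    # Rank 1 counts in full; every later rank r is discounted by log2(r).
    return sum(g if r == 1 else g / math.log2(r)
               for r, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG of the run divided by DCG of the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]   # same example as for DCG
print(round(ndcg_at_k(gains, 1), 2))   # 1.0  -- ideal at rank 1
print(round(ndcg_at_k(gains, 2), 2))   # 0.83 -- 83% of the ideal case
print(round(ndcg_at_k(gains, 3), 2))   # 0.87
```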
And we talked about the different measures: recall, precision and F score. They are not commonly used in IR, to be honest, because you need to include the ranks somehow; most search engines are ranked. Precision at k and R-precision are used sometimes, especially precision at 1: many companies report it, or precision at 10. Mean average precision is the most used IR measure, and NDCG is the most used measure for web search, OK? To read more about this: Introduction to Information Retrieval, Chapter 8, and also Chapter 8 in Search Engines: Information Retrieval in Practice. Both cover this material. Any questions? OK, we will take a break for 5 to 10 minutes, and then we continue. I will give the mind teaser in a second.

SPEAKER 1
Yeah 2 Bye.

SPEAKER 0
So here is the mind teaser of today: what is the value of X? In the meanwhile, if you give up and want the answer, you can still scan this and give your feedback on the course. The more feedback, the more we can improve. We’ll speak after the break. So, shall we continue with information retrieval evaluation? What we’re going to learn about in this lecture is how to create a test collection in the first place: what are the topics that make up the queries and how they differ, what is meant by relevance judgments and pooling, and we’ll also learn about some statistical tests we need to do with search. So remember we said these are the main 4 items of the Cranfield paradigm: a set of queries, a collection of documents, some relevance judgments, and then an evaluation score. We talked about the evaluation scores last time, and this lecture will cover the other three components. It shouldn’t be that much information, just learning how to do it. So the question is, where do test collections come from? In general, for web search, companies run their own studies to assess the performance of their search engines, using measures like the traffic they receive, the user clicks, where users click, and the query logs: what users are searching for, and how they rewrite a query and keep searching until they find something. Sometimes they label results for selected user queries; they can bring people in to label them. However, for other search tasks, someone has to go out and build the test collection, which might be very expensive, or it comes as a byproduct of a larger-scale evaluation. Which brings us to the IR evaluation campaigns that have been created for this specific purpose.
IR evaluation campaigns are something that has been run in the scientific field generally since the early 90s. They create IR test collections: a collection of documents is provided to the research community, together with queries and some relevance judgments, and the community tries to develop the best search engines, to see who can build the best system. This is how a lot of the advances in search engines we see right now happened. Remember that Google was built by two PhD students who were doing this kind of research during their PhD; it was another system being built, and it brought us Google at some point. The most famous campaign is called TREC, the Text REtrieval Conference. It has happened every year since 1992, sponsored by the US government over the years, and every year they come up with different tracks with different test collections: here are the documents, here are some queries, here are the relevance judgments; go and search, and let’s find the best system, let’s develop the search engines. There have been other campaigns that came out too, like CLEF, which has run since 2000 for European languages, so not just English but French, German, Finnish, and so on; NTCIR, which came out to cover Asian languages like Korean, Chinese and Japanese; and FIRE, which came out in late 2008 just for the Indian languages: Hindi, Bengali, Urdu, and so on. So these different collections kept coming out. And at TREC, each year they come up with several tasks: a task to search a set of documents for a given genre or domain, formed every year as a set of tracks. For example, the medical track.
It's searching medical documents, which can help, for example, to search for routine names, disease names, those kinds of things. There's the legal track, or alternatives like the IP track in CLEF, or the patent mining track: searching patents — we release a set of patents and tell people, let's find the best way to search them. And the Microblog track, which has been running for years: how to search social media, searching Twitter at one point, Reddit at another, to achieve the best results. And there are different cross-language retrieval tracks, searching across languages, that have been running over the years. So every year they produce some tracks, and every track is given a collection, and the collection contains many documents. Hundreds of collections have been released, usually with millions of documents, sometimes billions, like the ClueWeb collections, which started to give samples of the web for people to apply their algorithms on. And the typical format in which they release their documents is this one. Have you seen this format before? Did you wonder why your files are in this format? It's not something we created; this is the typical format you will find — even tweets will be put in this format. So you have the start-of-document tag, then the document number, then the text, and so on. The format you have now for the Financial Times documents comes from one of these tracks, from the early 90s. So they produce this material for people to use. That gives us the collection: millions, hundreds of millions, or sometimes billions of documents, released publicly for people to participate, get it, and develop and test their search engines on it. Then, along with this collection, you need topics, or queries, to go with it.
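To make that format concrete, here is a minimal sketch of the classic TREC SGML document layout and one way you might parse it in Python. The document ID and text below are invented for illustration — check your actual collection files for the exact fields they contain.

```python
import re

# A made-up document in the typical TREC SGML layout described above.
sample = """<DOC>
<DOCNO> FT911-3 </DOCNO>
<TEXT>
Contract awards rose sharply this quarter ...
</TEXT>
</DOC>"""

def parse_trec_docs(raw):
    """Extract (docno, text) pairs from a TREC-formatted string."""
    docs = []
    for block in re.findall(r"<DOC>(.*?)</DOC>", raw, re.S):
        docno = re.search(r"<DOCNO>(.*?)</DOCNO>", block, re.S).group(1).strip()
        text = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S).group(1).strip()
        docs.append((docno, text))
    return docs

print(parse_trec_docs(sample))
```

Real collections concatenate thousands of such `<DOC>` blocks in one file, which is why a streaming or regex-based reader like this is the usual first step of indexing.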
Which brings us to the TREC topics. A topic set is a query set provided for each collection, generated by experts, and associated with additional details. The main thing about experts here: for the patent track, for example, they bring queries sampled from the patent office — from people who actually work in the patent office — because these are the types of queries that get searched in that domain. In the medical domain, they bring doctors who will actually do the search; in the legal domain, they bring lawyers. So it's very in-domain material, and they know this is the real-life situation: these are the users who are going to use the system. Web search is different — that's all of us — but the other ones are really dedicated systems built for specific domains, and they bring the experts in those domains to prepare these topics. And a topic is more than a query. It's actually made of three different items. One is the title, which is the query you should search for. The second part is the description: what exactly is meant by this query? And then the narrative, which adds what should be considered relevant for this query — what type of documents you are looking for. Here's an example. The title is "Health and Computer Terminals"; this is the query. The description asks: is it hazardous to the health of individuals to work with computer terminals on a daily basis? And the narrative tells us what the set of documents I'm looking for should contain. So they provide all this information with the queries — which in this case is called a topic — to help you understand what you are looking for. So now we have the collection.
We release the collection, and then we have the set of topics containing the queries we will be searching for — probably between 25 and 100 with every collection. Then comes the question: how shall we know the relevant documents for each of these queries? For each topic, this is what we call the relevance judgments: which documents are relevant for each of the queries. For each topic, a set of relevant documents is required to be known for an effective evaluation. If I'm measuring average precision, I need to know which documents are relevant. There are different ways to obtain them. One is to do exhaustive assessment: I just check the whole collection for each query and see which documents are relevant. But this is impractical — with a collection of 1 million documents, I'm not going to read 1 million documents to find which ones are relevant to each query, especially when there are usually at least 50 queries. What other solutions might you have to find the relevant documents, or as many of the relevant documents as possible? Do you have some ideas? Yeah — compare my answer, for example, with the best, uh,

SPEAKER 1
with the best answer that comes from the best engine.

SPEAKER 0
But to build the engine — when you provide the collection, you don't have engines yet; people are still building them. And even if you built a system, how would you know that this system is getting the relevant documents? Maybe it's missing them. Any other ideas? Because this is a big challenge. One could say you can do random sampling, but random sampling of what part of the collection? If you sample 100 documents out of the million, probably you will not find anything relevant there, because it's a needle in a haystack. Which brings us to a nice, interesting methodology that has been used and turned out to be very effective, called pooling. The idea of pooling is: we can use the systems themselves to do this. Simply, you create as many different search engines as you can with different setups — one with stemming, one without stemming, and so on. This is done by the participants themselves, the people who are going to participate in this challenge task. Each system finds some of the relevant documents, because none of them is perfect; each one will find some of the relevant documents, and if you have enough different systems — many different systems — hopefully each will capture part of the relevant documents. And when you pool all of them together, hopefully I will find most of the relevant documents. So how does it work exactly? First, imagine I released a collection of 10 million documents and a set of 50 queries, or topics. Then I find 50 participants coming — this is what usually happens: teams from the US, teams from the UK, teams from different places around the world trying to build the best search engine for this task. And I ask everyone: please submit your top 1000 retrieved documents for each query.
You build your system; you don't know what is relevant yet, and we don't know either — the organizers of the task don't know what the relevant documents are yet. So we tell them: build your system with whatever setup you have and give us the top 1000 retrieved documents. In your coursework I'm asking you for the top 150; in this task, it says please retrieve the top 1000 documents. Then what happens is they take, from the 50 different participants, the top 100 documents from each participant, and put them into one pool. So 100 multiplied by 50 will be 5000 at most, because some of these will be common — some of the documents might overlap. Then they bring an expert — especially in the medical or legal domain, they bring a domain expert — and the expert checks every one of these documents to see if it's relevant or not. So, let me illustrate on the board. Each participant submits 1000 documents — all of this area — and I take the top 100. Another participant submits 1000, and I take their top 100, and so on for all the participants I have. I take all of these top-100 sets and put them in one pool. If we have 50 participants, some of these documents will be common — because the relevant documents tend to come at the top for everyone — so maybe at the end, after removing the overlap, I end up with around 2000 documents. An expert goes and checks all of these documents and tells us: OK, out of these 2000, maybe 100 are relevant and 1900 are not relevant. So this is what happens. After that, there are many documents that nobody has seen, so I would assume they are not relevant.
If 50 different systems didn't retrieve them, then I can say: you know what, probably they are not relevant, so I will just assume they are non-relevant. Then I start to compute the mean average precision for each of these participants over their top 1000. The question here is: why am I calculating mean average precision over the top 1000, not the top 100? Remember, I only judged each participant's top 100, so why compute the measure over the top 1000? Any idea? I've got these 1000 results; I took the top 100 from each participant, manually checked whether they are relevant, and did that for all of the participants. Now, when I score this participant, I know the relevant documents — say there are 100 relevant documents — and I need to check which of their 1000 are relevant. What is the point of checking the document at rank 600 here, for example? Well, exactly: the document at rank 600 for this participant might be in the top 100 of another participant. So out of this participant's 1000, I only judged 100 myself, but after pooling I judged around 2000 in total, so maybe 500 of their 1000 have actually been judged. So I can find relevant documents at these lower ranks — documents that were not in this participant's top 100 but were in the top 100 of one of the other participants. Is it clear now why I'm doing that? So now I start to get a genuinely comprehensive list of the relevant documents. It takes a large amount of time, of course — checking around 2000 documents for each of 50 queries — but it's much better than doing it for only one system, or for the whole collection. And for pooling to work, it's better to have a larger number of reasonable systems. Of course, if a system retrieves random stuff, it's just a waste of time.
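As a rough sketch of the pooling step just described — the run data and pool depth below are made up for illustration; real TREC pools take the top 100 from each run — the core operation is just a deduplicated union of each participant's top-k results:

```python
# Each "run" is one participant's ranked list of document IDs.
# Pooling takes the top-k from every run, merges and deduplicates them,
# and only that pool is then judged by a human assessor.
def build_pool(runs, k=100):
    """Union of the top-k documents across all submitted runs."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

# Three hypothetical participants, each returning a ranked list.
runs = [
    ["d1", "d7", "d3", "d9"],
    ["d7", "d2", "d1", "d8"],
    ["d5", "d1", "d7", "d4"],
]
pool = build_pool(runs, k=2)   # top 2 from each run for this toy example
print(sorted(pool))            # → ['d1', 'd2', 'd5', 'd7']
```

Note how the overlap (d1 and d7 appear in several runs) shrinks the pool below `k × number_of_runs`, which is exactly why 50 runs of 100 documents each end up as roughly 2000 unique documents rather than 5000.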
The other thing: the systems should be as different as possible, not all doing the same thing. For example, could I do pooling over your coursework submissions? No — you're all doing almost the same thing. There are little bits of variation, but you're using the same stemming, the same stop word list, roughly the same tokenization. It makes no sense. But if I wanted to do pooling — if I wanted to craft a new collection this way — I would ask each of you to apply your own setup on this collection. Some of you would do stemming, some would use a different stemming method, some would use something like WordNet to find potentially relevant synonyms, another would apply stop word removal in a different way — everyone with a different stop word list, and so on. All of this gives different results: some with cosine similarity, some without. Pooling all of that, I can get real variation and hopefully, in this case, find most of the relevant documents. Is that clear? OK. So there are some questions here. One: the judgments cannot possibly be exhaustive. Yes — I will find a lot of relevant documents, but I still haven't checked everything in the collection, so there is no guarantee we'll find all of the relevant documents; maybe all the systems miss some. But the good answer is: it really doesn't matter, because the relative ranking of the systems at the end will remain the same. Say we have systems 1, 2 and 3, and I found that System 3 is number 1, then System 2, then System 1. If I managed to find more relevant documents — if I judged more documents — the ranking would not change much; it ends up almost the same. Maybe the mean average precision was 0.5 here, 0.4 here, and 0.3 here.
Later it becomes 0.52, 0.45, and 0.39 — the scores change, because I found more relevant documents, but the ranking of the systems stays the same: the best continues to be the best, because this method captures most of the relevant documents. OK. The next question: this is only one person's opinion about relevance. In the end, you get one expert to check everything; some tasks actually bring more people to do the judgments. But the question is: what if we used a different person — would the ranking change? It turns out no, it doesn't matter. Even if opinions differ quite a lot in places, in the end the best system continues to be the best. This is interesting, because what you really care about is ranking which system is better. What about the documents below the top 100? You didn't judge all of them — some might already be judged thanks to the other participants, but many are not. So there was another experiment where, for one task, they got all 1000 documents judged — a very extensive effort, just to see whether anything would change if all 1000 were labelled. It turned out it doesn't matter; the ranking still stays the same. Especially if you're using mean average precision: remember, if you find a relevant document at rank 101, the score it adds is very minimal, so what matters is the documents near the top. The next question: we can't possibly use these judgments to evaluate a system that didn't participate at the time — or can we? Imagine that after 5 years, maybe in your group project, you come and say: I created a new system, I'll call it System X, and here is my top 1000, using a very advanced method. Now my top 100 might contain documents that didn't exist in the other participants' top 100. Correct? It can happen.
So the question is: isn't it unfair to this system? It didn't participate at the time, so its top 100 hasn't been extensively checked. What happens for this system? The interesting part is that we can still do it — it's fine, because most of its relevant documents will probably have been captured anyway, even if not all of its top 100 was judged, so the ranking remains useful. Many systems participate maybe 5 years later and report a higher mean average precision than everyone else, and that's fine; it happens. All of this is supported by research — it has been tested extensively to be sure. So pooling turned out to work very efficiently and effectively. With it, you are sampling the relevant documents as comprehensively as possible — not everything, but as comprehensive as possible — and it turned out to be very useful for ranking systems later on. Any questions about this part? Yes — yes, in the end, the collection and the queries have to be the same. You cannot compare two systems across two different query sets, even on the same collection, because those queries may simply be different. Any other questions? OK. The next question is: who decides whether a document is relevant or not? The same document can be seen as relevant by one person but not by another. And sometimes, as we said, it is useful to have one document checked by 2 or 3 people to make sure there is some agreement. So one of the important things is how we can measure agreement between two people's assessments of a given document. One of the very common measures, used in many applications, is Cohen's kappa. Has anyone heard about Cohen's kappa before? Wow, you haven't? Did you take any statistics course before? OK, so this is very common in statistics.
So we use Cohen's kappa to measure the agreement between two people about something. It simply compares the observed probability of agreement — the real agreement on something — with the probability of agreeing by chance, the random agreement. Let's take an example to make it clearer. Imagine we have two judges annotating 50 documents as relevant or not. It turned out that Judge 1 found 30 documents to be relevant and 20 to be non-relevant, while Judge 2 found 25 to be relevant and 25 to be non-relevant. And this is the overlap between them: they agree on 20 being relevant and on 15 being non-relevant, but they disagree on the other 15 documents. So how do we compute Cohen's kappa in this case? First, what is the actual probability of agreement? They agree on the green part, which is 35 out of 50: 70%. This is the observed agreement between them. What is the random agreement — the probability that they agree by chance? It's the probability that Judge 1 says relevant multiplied by the probability that Judge 2 says relevant, plus the probability that both say non-relevant. How do we calculate this? What is the probability of each judge saying relevant? For Judge 1, the relevant documents are 30 out of 50 — that is, (20 + 10)/50 = 0.6. For Judge 2, it's (20 + 5)/50 = 0.5. So 0.6 multiplied by 0.5 gives 0.3. And the probability of each saying non-relevant? Those are the other values: 20/50 = 0.4 and 25/50 = 0.5, giving 0.4 × 0.5 = 0.2.
So the probability of random agreement between them is 0.3 plus 0.2, which is 0.5 — that's how often they'd agree purely by chance. Cohen's kappa then compares the real agreement to the chance agreement: kappa = (P(agree) − P(chance)) / (1 − P(chance)) = (0.7 − 0.5) / (1 − 0.5) = 0.4. What does this value of 0.4 show? It shows how much they actually agree beyond what would be expected by chance. The higher it is, the more they agree; as it approaches zero, the agreement between the two is entirely by chance — like assigning labels randomly. So 0 means pure chance agreement, 1 means total agreement, and it can even be less than 0, which means worse than random — something is really wrong there. Usually, if Cohen's kappa is 0.8 or more, it's considered an excellent agreement. If it's between about 0.67 and 0.8, it's a fair agreement. If it's less than that, it means the task is very challenging or very subjective. So usually you're targeting a kappa over 0.6 to say the annotation is acceptable; otherwise the task is very ambiguous — nothing is clear about relevance here; maybe the description of the topic is not clear. Any questions? OK. There are other ways to calculate agreement across more than two judges, but usually it's done between two judges, or averaged over pairs. Now, how do we do evaluation for web search? It's really hard in this case to get representative queries and a sample of the collection and all of that. Sometimes search engines do have test collections of queries with hand-ranked results — they still do that on the side. But they have other measures too. Recall is really hard to measure for the web. Why? Because it's huge: many, many documents might be relevant.
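Going back to the kappa example above, the whole computation can be sketched in a few lines. The function name `cohens_kappa` and the argument names are just for illustration; the numbers are exactly the worked example with two judges and 50 documents.

```python
# Cohen's kappa from a 2x2 contingency table of two judges' decisions.
def cohens_kappa(both_rel, j1_only, j2_only, both_non):
    n = both_rel + j1_only + j2_only + both_non
    p_agree = (both_rel + both_non) / n      # observed agreement
    p1_rel = (both_rel + j1_only) / n        # P(judge 1 says relevant)
    p2_rel = (both_rel + j2_only) / n        # P(judge 2 says relevant)
    # Chance agreement: both say relevant, or both say non-relevant.
    p_chance = p1_rel * p2_rel + (1 - p1_rel) * (1 - p2_rel)
    return (p_agree - p_chance) / (1 - p_chance)

# Judge 1: 30 relevant; Judge 2: 25 relevant; they agree on 20 relevant
# and 15 non-relevant documents, disagreeing on the other 15.
kappa = cohens_kappa(both_rel=20, j1_only=10, j2_only=5, both_non=15)
print(round(kappa, 2))  # → 0.4 (observed 0.7 vs chance 0.5)
```

The structure makes the lecture's point visible: the 70% raw agreement looks high until you see that 50% agreement was expected by chance anyway.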
So search engines usually use precision at some cutoff k — precision at 10 is very common, and precision at 1 is very common: just check whether the top results are relevant or not; I don't really care about recall here. And they measure the results using NDCG: check the top documents, and if you have graded relevance, NDCG is what you use. There are also non-relevance-based measures — if you don't want to judge the relevance of documents at all, you can look at click-through: how many times do users click on the first result? And it's not even just that: sometimes you look at how they click and how long they stay on the page, because you might click on a result and close it after 5 seconds, which probably means it's not relevant, while other times you stay longer. There was actually a study showing that in some countries — authoritarian countries — people tend to click the first result and accept it anyway, just taking whatever is given to them, while in other countries people are more critical. That was an interesting study at some point, but in the end click-through is still a good measure. Sometimes you do studies of user behaviour in the lab: you bring people in, have them search in front of you, and watch what they do. And sometimes you do A/B testing. What do we mean by A/B testing? Imagine the purpose is to assess a single innovation. Say in my search engine I'm now applying stemming for a new language, for example, and I'd like to see whether it will improve the system or not. You need a large search engine already up and running; you keep the old system, but you take maybe 1% or 5% of the users coming to your search engine and give them the new feature you added.
Like: now I'm doing stemming in a different way. And then you evaluate automatic measures like click-through on the first result and see whether there is any significant difference between the two groups. If, once I added the new feature, many people in this small sample are clicking on the first result more than before, compared to the big group, then I can guess: yeah, probably it's better. But if I see people starting to ignore the first result and going to the 2nd and 3rd, then maybe this one is not the best. Most of the big search engines probably use exactly this: they divert part of their traffic to the new innovation to see how users behave with it. One other question here: is System B really better than System A? Remember, we talked about measures like mean average precision, and we said mean average precision is the performance of a given system averaged over multiple queries. So given the results over a number of queries, B achieved a better score than A. How can we conclude that the ranking of B is really better than A? Let's do this example. Imagine the mean average precision of System A and System B over 7 different queries: System A achieved 0.2 and System B achieved 0.4 — so B is clearly way better than A in the score. And then the same for two other systems, A and B again: 0.2 and 0.4. But let me ask you: do you think System B is better than A in both situations? Let's do it one by one. Who thinks System B is better than System A in the first situation? Raise your hand. OK. Who thinks System B is better than System A in the second situation? Two, three, four. So, let's check the first case at the query level.
If you look at the average precision scores for every query, you will find that System B is doing better than System A on every single query. For the second case, if you check it: do you think System B is still better than System A? The average score is the same as in the first case, but here System B sometimes does dramatically better than A, and sometimes it does worse than A. So imagine this is Google, this is Bing — OK, let's make it more realistic: this is Bing, this is Google. I can see that sometimes Google does better than Bing, but in some situations Google does very poorly and Bing is actually better. So I cannot simply say B is always better than A, and this brings us to what's called a significance test. It's really important here: I need to be sure that when something has a better average score, it is really better, not just by chance. And this is where the null hypothesis comes in. The null hypothesis says there is no relationship between the two observed phenomena — the difference is just random. Rejecting the null hypothesis means there is real meaning in the phenomenon I'm observing. What does that mean in our context? A significance test enables the rejection of the null hypothesis — which would say there is no difference — in favour of the alternative hypothesis: that B is really better than A, not just by chance. The power of the test is the probability that the test will reject the null hypothesis correctly. Increasing the number of queries increases the power of the test: judging on 5 queries is not the same as judging on 50 queries, because you have more evidence. So, simply, we compute the effectiveness measure for every query for both retrieval systems.
In this case, at the query level, that measure will be average precision. Then we compute the test statistic and use it in a significance test. The test statistic is used to compute what we call a p-value, which reflects the probability of seeing this difference if the null hypothesis were true. A small p-value suggests that the null hypothesis may be false — that there really is a difference. So what we care about here is this: the null hypothesis — that there is no difference between A and B, that they are the same — is rejected in favour of the alternative hypothesis — that B is actually more effective than A — only if the p-value is very low, below a given threshold. Usually that's 0.05, i.e. 5%. If the p-value is less than 0.05, it means something real is happening here: System B is really better than A. If it's higher than that, then no — it might be random: just by chance, some queries are better and some are worse. For a one-sided test, the statistic is compared against the distribution of its possible values under the null hypothesis, and if the resulting p-value is less than 0.05, there is meaning behind the observed difference. One of the very famous tests is the t-test. There are different ones, but let's talk about the t-test here. Its assumption is that the differences between the effectiveness values have a normal distribution, which is not always the case, but in many cases the t-test with the normality assumption is accepted. This is how it's calculated — I don't want to go into much detail; you can compute it from these equations, but if you're using Excel or Python, you just call a t-test function on the two vectors and it calculates it for you. Let's do that here. It is not enough to show that System B achieves a better average score than System A — you need to do a statistical test.
So a two-tailed t-test is widely accepted, with a significance level alpha of 0.05. If you get a p-value below 0.05, it means the observation you're seeing has meaning; if it's above that, it means no — you cannot reject the null hypothesis; there is nothing really interesting here. For non-normal distributions there is something called the Wilcoxon signed-rank test. There are different tests, but the t-test is the easiest and most widely accepted. And what is the meaning of significance tests in IR? When a user uses a System B that is significantly better than System A, they will feel the difference — that System B is really better. However, if System B achieves a higher score but is not significantly better, the user will find this system sometimes better and sometimes worse, and the other one sometimes better and sometimes worse: there is no big difference between them, no significant difference. If we calculate this for our two examples: for the first one, the p-value comes out essentially zero, because B is better on every query — so B is significantly better. For the second one, if you calculate it, the p-value is actually 0.3, way higher than 0.05. So what does that mean? In the first case, B is statistically significantly better than A — I can conclude that my system is better than A. In the second case, no: they are statistically indistinguishable, even though one score is double the other. So when you build a system and come and say "this is better than my baseline", that claim is not supported in the second case — it's better than your baseline on some queries but worse on many others, so in practice you don't feel the difference between them. Is that clear? You don't need to go through all the steps of calculating the test by hand; all you need to do is simply use a ready-made function for it.
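As a sketch, here is the paired t-statistic computed by hand on made-up per-query average precision values for the two scenarios above (both sets are invented so that A has MAP 0.2 and B has MAP 0.4). Turning the statistic into an exact p-value additionally needs the t-distribution, e.g. `scipy.stats.ttest_rel`; here we only compare the statistics themselves.

```python
import math

# Paired t-statistic over per-query differences d_i = b_i - a_i:
#   t = mean(d) / (stdev(d) / sqrt(n))
def paired_t(scores_a, scores_b):
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Scenario 1 (hypothetical scores): B beats A on every single query.
a1 = [0.10, 0.20, 0.15, 0.25, 0.20, 0.25, 0.25]   # MAP = 0.2
b1 = [0.30, 0.42, 0.33, 0.45, 0.40, 0.46, 0.44]   # MAP = 0.4
# Scenario 2 (hypothetical scores): same averages, but B wins big on some
# queries and loses on others.
a2 = [0.10, 0.60, 0.05, 0.50, 0.10, 0.05, 0.00]   # MAP = 0.2
b2 = [0.70, 0.10, 0.60, 0.05, 0.70, 0.40, 0.25]   # MAP = 0.4

t1, t2 = paired_t(a1, b1), paired_t(a2, b2)
print(t1, t2)  # t1 is very large, t2 is small
```

With these particular values the consistent improvement gives a huge t (tiny p-value), while the mixed case gives t of roughly 1.1, whose two-tailed p-value on 6 degrees of freedom is close to the 0.3 quoted in the lecture: same MAP gap, very different statistical conclusion.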
There are many functions for it in Python, in Excel, whatever you want. So, the summary of this lecture, which took a bit longer: IR test collections for automatic evaluation. We have a collection of documents. We talked about the set of topics, which contain queries plus descriptions of what they mean, and the recommended minimum number of queries for a collection is 25 — this is based on research. Relevance judgments: we talked about how to create them using pooling, and a large number of diverse systems is required to do pooling properly. Evaluation measure: select the proper measure for what you're looking for — mean average precision is very common, but we talked about others — and it's not always enough to say one average is better than another; what you really care about is the significance test. Web search engines have their own methods for evaluation, which include the logs of the people using them. You can read more about this in Chapter 8, and to read more about pooling, I suggest reading this paper. Any questions? OK, so I'll be releasing the test set tomorrow. Good luck with your coursework — the deadline is on Sunday. OK.

Lecture 4

SPEAKER 0
OK. OK, hi again — welcome to a new lecture of Text Technologies. So how's it going so far? Good? Are you sure? OK. It's getting more intense every week, so get ready. Before starting the lecture, I have some comments about the labs, the coursework and so on. First of all, I was expecting a bit more engagement on Piazza, because currently I don't see many of you sharing your results there. This week I was waiting a long time for anyone to share results, and I had to make a post myself: please share your results for Lab 2. Thankfully, 3 or 4 of you then shared results, and they align, which is good. But I'm encouraging all of you: once you get a lab, try to implement it, get your results, and share them with everyone so you can compare — because I'm not sharing the exact expected output for the lab; I need you to share your results and compare with each other. This is really important within the course. There was an interesting question about phrase search and proximity search when doing stop word removal. Someone asked: if the document contains "course on IR" but the phrase query is "course IR", should a phrase search match it or not? Honestly, it's again up to your design. For the coursework and the lab, we're asking you to remove the stop words, apply stemming, and then do the phrase search — so in this case it should match. In real life, in your group project for example, you can decide differently: even if you do stop word removal, you can decide to remove the stop words but keep the original positions of the terms. However, in our lab and coursework we are telling you: no, remove the stop words and then start counting positions. So in this case, "course" and "IR" will end up next to each other.
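A tiny sketch of that behaviour — the stop word list and tokeniser below are heavily simplified, and the function names are just for illustration — showing why "course on IR" matches the phrase query "course IR" when stop words are removed before positions are assigned:

```python
# Stop words are removed BEFORE term positions are assigned, so terms
# separated only by stop words become adjacent in the positional index.
STOP_WORDS = {"a", "an", "the", "on", "of", "in"}

def preprocess(text):
    """Lowercase, tokenise on whitespace, and drop stop words; each
    remaining token's list index is its position in the index."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def phrase_match(doc_text, phrase):
    """True if the phrase terms occur at consecutive post-removal
    positions in the document."""
    doc, query = preprocess(doc_text), preprocess(phrase)
    return any(doc[i:i + len(query)] == query
               for i in range(len(doc) - len(query) + 1))

print(phrase_match("a course on IR", "course IR"))  # → True
```

If instead you kept the original positions (the design choice mentioned for the group project), "course" and "IR" would sit at positions 1 and 3 and the phrase would not match — both designs are defensible; what matters is that you choose one and document it.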
In your own applications, it's up to you; you decide, OK? This is just for clarification. Remember, you are designing all of this; it's your understanding of the task, and you design the system accordingly. For the coursework, we're trying to make things as simple as possible just to get your hands dirty and run something. But the group project is where we need to see your choices. Again, on the lab discussions, please try to share and discuss your results more. I can see some of you mentioned there is a little bit of variation in the results for some of the queries, and from your discussion it is probably because of the tokenization or the stop word removal: some of you removed a term, some of you kept it, the special terms, and so on. It's all these small variations. I know you might be worried: what about the coursework, which one would be counted? The good thing is that for the coursework we implemented our auto-marker to take account of all of these choices. All the valid choices will be there. So don't worry if you made a small choice that is different from someone else's; the auto-marker will be able to match it, so it shouldn't be a problem, OK? Coursework 1 was announced last week, and I shared the full details on the website. This week we are on Lab 3, which you will be able to implement after this lecture. Once you do it, you can essentially finish your coursework: you can apply it to the existing queries that have been released with the labs, and then write your report. Get everything ready, because the final version, which will be released next week, will just be a different set of queries and documents; in the end the findings should be the same, so you can write your report now.
And the most important thing: next week, on Thursday I think, we will release the test collection and the test queries to be submitted with the coursework. All you need to do is index the new collection and run the queries, that's it, and then submit the files. The report should be ready before that, so it shouldn't be a problem. But once we release the test set, from Thursday next week you are not allowed to ask any questions about the coursework on Piazza anymore. So it's better to start asking now; compare your results on the labs now. Once we release it next week, there will be only Friday, Saturday, and Sunday to submit your coursework, and during these 3 days, please don't ask any questions anymore. This is the time to ask. Nothing will be different; it will just be a new set of queries and documents. So raise all the questions you have and have all the discussions you need now. If there is something urgent after the test set is released on Thursday, you can ask on Piazza privately, OK? But don't come and say, "Actually I'm getting only 5 documents for query 5, is that OK?" That is totally not allowed. Don't ask questions related to the coursework once the collections are released. But for lab 2 or 3, say: these are my full results, do they match everyone else's or not? Take your time during these days, before the test set is released. Another important point for your coursework, and actually for your labs and in general: the search module should be standalone. Don't design your search so that for every query you go and read all the documents, index them, create the index, and then start doing the search. No: the indexing should be separate, done offline, and you save your index in a text file. I know it's not the most efficient; some of you asked about using a binary file for it.
It would be more efficient, but for the sake of the coursework and automatic marking, we need it in a text format. In the search module, all you need to do is load this index file, not read the documents anymore, apply the queries, and tell us what the results are. Is everything clear about this? OK, good.

SPEAKER 1
Yes, so for the saving, does it have to be text? If I save it as a pickle file, is that OK, or does it have to be a text file?

SPEAKER 0
It's in the coursework description; it has the full exact format. Follow the exact format. Honestly, if you think that's not efficient and you'd like to save it in a binary file as well, that's OK, but again, save another text version for us, because this is one of the things you need to submit. An automatic marker will go and read this file and automatically check if it's correct or not, so we still need the text file, OK? OK, good. So this lecture is about ranked information retrieval: learning about TF-IDF, the vector space model, and SMART notation, and you're going to implement TF-IDF, which is part of the coursework. Who has heard about TF-IDF before? I was expecting more, but that's fine; we'll learn more about it today. So far, from last week, we learned how to implement an index and apply search, but it's just boolean retrieval: all we do for a query is say whether a document matches it or not. That's it. This is actually fine for some expert systems. If you learn about how they do patent search in the patent office, they still use boolean queries, by the way, because it's very advanced work: when the examiners get a new patent claiming a new innovation, some 30% of those actually have existing patents already in the field. So they search like an expert system, over all the word combinations: they can search for something like (car OR vehicle) AND (motor OR engine) AND NOT cooler, for example; they can build this kind of complicated query. However, for general purposes it's not good for the majority of users. None of us uses this on a normal basis. Maybe you go on Google Scholar at some point and say this AND this OR this.
Sometimes that happens, but generally when you're doing web search, social search, any kind of search, it's just a free-form query. Most users don't know how to write this kind of boolean query, and the other thing is that boolean queries only tell us whether something matches or not: you just receive the full set without any ranking, and sometimes it will be a very long list. We are not going to read all of these documents to find which one is relevant. Actually, do you know what the most unused web search feature is? Yes: the Next button at the end of the page. Almost no one uses it; a very, very small number of users. If you search and you didn't find your result in the top 3, sometimes you scroll down, not everyone, to see the top 10 results. Most users will just retype their query before going to the next page. So we expect results to be ranked. Ranked retrieval simply means the query can now be free text, and the results will be ranked with respect to this query. Even if you have a large set of matching documents, it can be in the millions, that's OK, because in the end I will show the user the top 10 results; we don't want to overwhelm the user with many results. Even now with RAG systems and LLMs, ChatGPT and Gemini and so on, the system retrieves the top 5 and tries to extract a summary from them, so you don't go and check everything again. So ranking is really important and essential for everything, including the LLMs we are using now in RAG applications. The criterion here is that top-ranked documents are the most likely to be relevant to the user and to satisfy the query they posted, and the score tries to quantify how well a document matches the query. This is what we will try to learn in this lecture: how we can create such a score.
Before, the score was only 0 or 1: 0 doesn't match, 1 matches. Now we should have a real score. Remember our old example from last lecture, "ink wink"? Last time it was "ink AND wink", or "ink wink" as a phrase, and the boolean search would tell you 0 or 1, that's it. But now we can get a ranked list of results when I search with free text like "ink wink", and the score will be a function not just of existence, but of the term frequency, the document frequency, and the length, and usually it can be a score, let's say, from 0 to 1. So the results will not be just 0 or 1; they can be values like these, and from these numbers I can rank the documents according to which one is more relevant. This is our objective. What could be a possible way to match two strings, a document and a query, or any two strings in general? One option is the Jaccard coefficient. Has anyone here heard of the Jaccard coefficient, knows what it is? No one has heard of the Jaccard coefficient before? OK, that's strange. Simply, it tells us what is common between two sets. Imagine we have list A and list B, which can be document 1 and document 2, or a query and a document. What is the overlap between the unique terms they have, divided by everything else? The intersection over the union, very simple. If it's 1, it means they fully match; if it's 0, there is no overlap at all. So remember these two documents from last time: "he likes to wink, he likes to drink", and the other one, "he likes to drink and drink and drink". If you take the union here, all the distinct vocabulary terms appearing in these documents would be these terms, and this is the intersection. So if you measure the Jaccard similarity here, it will be the number of terms appearing in common divided by all the distinct terms across the two documents, which would be 4 over 6 in this case.
OK, so this is one way to calculate the overlap between two documents. Let me ask you: what could be the problem with this kind of measure? Do you think there are any issues with it? It's still OK; many people still use it in many applications, but why might it not be the best thing here? Yes, that's an excellent point: term frequencies are not considered. Whether a document has a term 10 times or 1 time, it doesn't take that into consideration; you just say it appears or not. Yes, it treats all the terms with equal weight.
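The Jaccard coefficient just discussed can be sketched in a few lines. This is an illustrative snippet, not course-provided code; the two documents are the wink/drink examples from the slides.

```python
def jaccard(a, b):
    """Jaccard coefficient: intersection over union of the unique terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

d1 = "he likes to wink he likes to drink".split()
d2 = "he likes to drink and drink and drink".split()
# 4 shared terms (he, likes, to, drink) out of 6 distinct -> 4/6
print(round(jaccard(d1, d2), 3))
```

Note that converting to sets is exactly what throws away the term frequencies, which is the weakness raised in the discussion above.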

SPEAKER 2
So you're treating "this" as being as important as, for example, "universe".

SPEAKER 0
That's excellent. Have you seen my slides? Maybe. But this is an excellent point; this is exactly it: it treats all the terms equally. It's not sensitive to how rare, and therefore how important, a term is in the collection. So in "he likes to drink", for example, shall I give "to" the same weight as "drink", or should they be treated differently? Any other ideas? What else could be an issue? Yes: there is no ordering, but remember we're doing bag of words, so that one is common to all these methods. What else? Another thing, if you think about it: this is a match not between a document and a document, but between a document and a query. The document can be 1000 words; the query is usually 2 or 3 words. The difference in lengths is huge, so taking the intersection of 3 terms against 1000 terms over their union is maybe not the best way to do it. Which brings us to how we should score our documents and our terms. Let's go through an example. I know this might take longer than anyone else would spend explaining this, but I'm trying to make you understand the fundamentals. Imagine we have 5 documents; each of these buckets is a document, and the terms are the balls inside, OK? And someone came and searched for this query. It's not a traffic light, it's a query, OK? So if I asked you: what do you think is the least relevant document here, and what is the most relevant? Let's do this exercise together. Let's start with the least relevant document. What do you think is the least relevant among the 5? Who agrees it's 2? OK, that's basically everyone. OK, the other question: which do you think is the most relevant document here?
Who thinks it's 4? Almost the majority. OK, so we agree that the least relevant document is 2 and the most relevant document is 4. But what if the yellow ball refers to words like "the" and "and"? Let's discuss it, but remember our choices so far: the least relevant document is 2, the most relevant document is 4. Let's take some examples. TF-IDF tries to solve this problem. It stands for term frequency, inverse document frequency, so we have two components. One is the term frequency, which is simply how many times a term appears in a given document. So if a term appears 3 times, like in "he likes to drink and drink and drink", the term frequency of "drink" will be 3. It tells us how many times the term appeared in a document, which stands for: if the term appears more times in this document, probably this document is more relevant to this term. Makes sense: if a term appeared in one document 1 time and in another document 20 times, probably the one with 20 occurrences is more relevant. The other component is document frequency, which is the number of documents that contain this term, regardless of how many times it appears inside each document. Imagine we have 5 documents and a term appears in only 2 of them; the document frequency will be 2. The document frequency indicates something interesting, because if a term appears in many, many documents, it means this term is very common in the language; it's not that important. While if a term appears in only a few documents, then, oh, probably this is an interesting term; it's a rare term, so it should have a higher value. OK, for example, the word "the": it would probably appear in most of the documents we have in the Financial Times collection, probably in all of them.
So does that mean it's an important term? Probably not. Someone asked on Piazza: shall I treat a word like FT as a stop word? I don't know who asked it, but it's a very good question, because FT appears in all the documents. It could be a stop word, yes, but what we said is you don't have to filter it out; stick to the given stop words list. If you'd like to take it out, it's OK; it doesn't matter much, because it appears in all the documents in this collection. Maybe FT in web search is important, because there it doesn't appear everywhere, and it shows you are looking for the Financial Times; but in a collection of Financial Times articles it appears in all the documents, so who cares about it? It's not important; it doesn't make any document special here. All documents are the same with respect to this term. Is it clear so far? OK. And there is a difference between document frequency and collection frequency, because document frequency is not the collection frequency. Document frequency tells me how many documents contain this term, so this number can never be higher than the collection size. If I have 1 million documents in my collection, the document frequency can never be more than 1 million, because it counts how many documents contain the term. However, the collection frequency, which is similar to what we used in the first lab (how many times a term appeared in the Wikipedia abstracts), counts total occurrences: if a term appeared 5 times in one document and 6 times in another, the total would be 11, while the document frequency would be only 2, because it appeared in only two documents. This is an important difference. DF is what is mostly used in IR techniques; collection frequency is still used in other applications.
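The distinction between document frequency and collection frequency can be made concrete with a small sketch; the two-document corpus here is just an assumed toy example, not the course data.

```python
from collections import Counter

docs = [
    "he likes to wink he likes to drink".split(),
    "he likes to drink and drink and drink".split(),
]

# Document frequency: in how many documents does the term appear?
# (Using set(doc) counts each document at most once per term.)
df = Counter(t for doc in docs for t in set(doc))

# Collection frequency: total occurrences of the term across all documents.
cf = Counter(t for doc in docs for t in doc)

# "drink" occurs once in doc 1 and three times in doc 2:
print(df["drink"], cf["drink"])  # 2 4
```

The document frequency can never exceed `len(docs)`, while the collection frequency can grow without bound; that is exactly the difference described above.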
We will discuss that in the second lecture, but for most IR applications, and many other applications as well, document frequency is the important quantity. The inverse document frequency is simply the inverse of this. We said that if a term appears in many documents, it's not important, and if it appears in a small number of documents, it's more important. Taking the inverse means that when the IDF increases, the term is rarer, so it's more important. Here is an example of the difference between document frequency and collection frequency from our usual example: the word "drink" appeared in 5 documents but appeared 7 times in total, so there is a difference. The word "likes" appeared in 5 documents but appeared 6 times, so there's a difference there too. When we say document frequency, it's the number of documents; it can never be higher than the size of my collection. The IDF formula, in its most basic form (there are other variations, but let's use the basic one), is simply the log base 10 of the collection size divided by how many documents the term appears in. The log scale tries to dampen the differences, because if one term appears in 100,000 documents and another in 10,000 documents, I will not say one is 10 times more important than the other; with the log, the rarer one is just twice as important. Let's take an example. Imagine I have a collection of 1 million documents and the word "calpurnia" appears in only one document. What is the log of 1 million divided by 1? It is 6. Another term, "animal", appears in 100 documents. What is its IDF? 1 million divided by 100 is 10,000, and the log of 10,000 is 4.
And the word "the" appears in all the documents, all 1 million, so 1 million divided by 1 million is 1, and the log of 1 is 0. This tells us that the word "the" has no weight; it's useless. Remember when we talked about stop word removal with ranking: it doesn't matter much, because if a term appears a lot, it will have a low weight anyway. Then "under", which appears in 100,000 documents, gets 1, the word "fly" is twice as important as "under", and the word "calpurnia" is 6 times as important as "under". So if I'm searching for "under calpurnia" and I found one document containing the word "calpurnia" a few times and another document containing "under" many times, I would still rate "calpurnia" as far more important than "under". So far, so clear? OK, let's see. TF-IDF term weighting is one of the best-known term weighting schemes in IR, and I would say in many text applications in general. The weight increases as the number of occurrences of a term within a document increases, but it also increases when the term is rare, when it doesn't appear in a lot of documents. When you combine TF and IDF together, this is how you compute it. For the TF part, the weight we're going to use is 1 plus the log of the TF of the term. So if a term appears once, the log of 1 is 0, plus 1 is 1. If in another document the term appears 10 times, it will be the log of 10, which is 1, plus 1, which is 2; so that document starts to get twice the importance. You then multiply by the IDF, the log of the collection size divided by the document frequency. For a query that has multiple terms, say 2 or 3 words, I simply calculate the TF-IDF for each of the terms appearing in the query and then take the summation. OK, clear so far? We will find out. Remember our question? I'll ask you again: based on what we have discussed, which is the least relevant document here?
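The IDF values from the example can be checked with a short sketch; the collection size and document frequencies are the ones assumed in the worked example above.

```python
import math

N = 1_000_000  # collection size assumed in the example

def idf(df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf(tf, df):
    """Combined weight: (1 + log10(tf)) * idf(df); zero if the term is absent."""
    return (1 + math.log10(tf)) * idf(df) if tf > 0 else 0.0

print(idf(1))          # "calpurnia", in 1 document       -> 6.0
print(idf(100))        # "animal", in 100 documents       -> 4.0
print(idf(1_000_000))  # "the", in every document         -> 0.0
```

A term that occurs in every document therefore contributes nothing to the score, which is why stop word removal matters less once you rank with TF-IDF.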
2, yeah; who thinks it's 2? Still, most of you think it's 2. Yes, 2 would be the least relevant of the 5. What is the next one, the next least relevant document? 1, yes? Now I'll ask you: what is the most relevant document here now? Remember, you said it was document 4. Who still thinks it's document 4? What else? Why? Why document 5?

SPEAKER 1
It appears in 2 documents twice. So green is very important and so is red, OK, but this one has more yellow. You got it.

SPEAKER 0
So simply, the most relevant is document 5, because the yellow ball is really not important. It's like searching for "the destructive storm": if I find a document that contains "the" a lot, I don't care, but if I find a document containing "destructive" and "storm" more, that is more important. Now you understand what TF-IDF is and the meaning behind it, OK? So don't get fooled by useless terms appearing many times in a document; what's important here are the rarest terms. For instance, one of the queries I gave you in the collection I released last week contained the word FT. If I find a document that contains FT many times, who cares? It's useless; it appears in all of them, so it's not an important term. Fair enough? Good. Now, if we map this back to the term-document matrix we had before: remember, before we either put 1 or 0, whether a term appeared or not, or sometimes the term frequency, how many times a term appeared in the document. Now what we can put in each cell is the TF-IDF score. So for each term and document, the value would be the TF-IDF, not 1 or 2 or 3 or 1 or 0 as before. And this brings us to what's called the vector space model. It's based on representing things as vectors, as we discussed last time. Now I represent the query as a vector and the document as a vector, and all I need to do is measure how well they align with each other. And the values of these vectors are not just the term frequency or presence or absence; they are the TF-IDF values. One thing I could use is the Euclidean distance, the distance between the document and the query.
However, the problem with that is that the distance will always be high, because the documents are much longer than the queries. The queries are short, the documents are long, so measuring the distance between Q and D1, Q and D2, Q and D3, which is simply the length of the line between the points, will not give us a good indication. What we really care about is not the distance but the angle between the vectors. So we say: regardless of the length of the vectors, because documents are expected to be longer, let's measure the angle between them, which is what we call the cosine similarity: the cosine of the angle between the document vector and the query vector. As long as the angle between them is small, the cosine similarity will be high; the cosine of zero is 1, which means they are fully aligned. To deal with one document being long and another short, we use what we call length normalization. You can normalize a vector by dividing it by its norm: compute the norm of the vector from the values inside it, then divide by this norm to get the normalized version. To give an example, imagine document 1 has 3 terms: the first term appears once, the second 3 times, the third 2 times. I can measure its length, or norm, with this formula, which is the square of all the values, summed, then the root; then I divide the values by this norm, giving something like this. Now suppose I have another document which is 3 times longer but very similar; it's as if this document were appended 3 times.
So instead of 1, 3, and 2, the counts become 3, 9, and 6, and the norm of this one will be about 11. If I divide both vectors by their norms, I get exactly the same values. So now, regardless of the size of the document, I can get a normalized version of it, and from there I can simply measure the cosine similarity, which is the dot product of the normalized versions, something like this. OK. So all you need to do is get the normalized values and then take the dot product, and you have the cosine similarity between two vectors. This is the algorithm; you can read it quickly, but that's the general idea. There are different ways to calculate each of these components. For the term frequency, I can use the natural count, exactly how many times the term appeared, 25 times or whatever; or I can use the log value, which is 1 plus the log of TF; or there is what's called the augmented value, which is another formula; or boolean, which is simply appears or not (if it appears 10 times it's still 1, if it doesn't appear at all it's 0, similar to what you did in boolean search); or the log average, which is something like this. For the document frequency, I can ignore it totally, so it's always 1 and all terms are the same; or I can use the IDF, which is the log value; or I can use what's called probabilistic IDF, which is another formula. For normalization, I can apply normalization to the document and the query, or ignore it totally, or use cosine normalization, and there are different ways to do that. This is what we call the SMART notation: all the different variations of how you calculate the TF-IDF, and all you need to do is decide which ones to use.
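The length-normalization example above (a document, and the same document "appended three times") can be verified with a short sketch:

```python
import math

def normalize(vec):
    """Divide a vector by its Euclidean norm (length normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity: dot product of the two normalized vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

d1 = [1, 3, 2]   # term counts of the short document
d2 = [3, 9, 6]   # the same document repeated three times
print([round(x, 3) for x in normalize(d1)])
print([round(x, 3) for x in normalize(d2)])  # same direction as d1
print(round(cosine(d1, d2), 6))              # 1.0: the angle between them is 0
```

After normalization the two vectors are identical, so the cosine similarity is 1 even though one document is three times longer, which is exactly the point of length normalization.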
What is very common is the following: for the documents, you calculate 1 plus the log of the TF, you don't apply any IDF score, and you apply cosine normalization. For the queries, you again apply 1 plus the log of the TF, you calculate the IDF of each query term, and then you apply cosine normalization and take the dot product. This is a very common way to calculate TF-IDF. To make it even easier for your coursework and lab, what we ask you to do is simply use the log TF on the document side, apply the IDF only once, and not do any kind of normalization; you don't calculate the cosine similarity here. So simply, in other words: if you've got a query of 4 terms, you need to extract all the documents that contain any of these terms; it's like an OR operation in boolean search, finding all the documents that contain at least one of them. Then, for each of these terms, all you need to do is calculate 1 plus the log of the TF of this term in each document, multiplied by the IDF of the term. Then, if the query has 3 or 4 terms, you take the summation. This is the formula you need to implement for your lab and coursework; it's a very simple way to do it, OK? When you calculate this, the scores will not be between 0 and 1, because we are not applying normalization: one document will get 3, another will get 6. That's OK, and then you do the ranking: the one with the highest score is ranked at the top, the one with the lowest score is ranked after that, and if a document doesn't contain any of the terms, its score is zero and it will not appear. Is that clear? OK. So the summary of the steps: represent the query as a weighted TF-IDF vector, represent each document as a weighted TF-IDF vector, then compute the cosine similarity.
That's the general procedure: rank the documents with respect to the query by score, then return the top 10 documents that achieve the highest scores. The retrieval output for a given query is a list of ranked documents. The possible format, and I think this is the format required for your coursework and lab: you put the query ID, then the ID of the most relevant document, the one that achieved the highest score, and then the score it achieved, and so on down the ranking. So you output the ID of the query, the ID of the document, then the score they achieved, sorted by score. This is what we expect from you in your coursework and lab. The resources for this lecture are the Introduction to IR textbook, chapters 6.2 to 6.4, and the IR in Practice textbook, chapter 7; the whole chapter is relevant here. Do you have any questions about this? So lab 3 is the most important thing; this is how you practice. It builds on what you did last week; you will not throw that away, keep it. For a query with 4 terms, you do an OR operation after applying pre-processing, and this is very important: apply pre-processing to the query, stop word removal and stemming. Then, for each of the terms, extract its postings from your index. You have the document frequency saved; you can measure the term frequency from how many times the term appears in a document via its positions, and then you can calculate the formula. Do the summation over the whole query, and you have the score for document X. Then if document X achieved 0.5 and document Y achieved 3.5, document Y is more relevant than document X, and you print them in order, OK? This is the implementation you need to do for this lab.
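The simplified lab formula, score(q, d) = Σ over query terms of (1 + log10 tf) × log10(N / df), can be sketched as below. The tiny positional index, its document names, and the collection size are invented for illustration and are not the coursework data; term frequencies are recovered from the stored positions, as suggested above.

```python
import math

# Hypothetical index: term -> {doc_id: positions of the term in that document}
index = {
    "ink":  {"d1": [2, 7], "d3": [1]},
    "wink": {"d1": [3], "d2": [1, 4, 6]},
}
N = 3  # number of documents in this toy collection

def ranked_search(query_terms, index, N):
    scores = {}
    for term in query_terms:
        postings = index.get(term)
        if not postings:                      # term not in the collection: skip
            continue
        idf = math.log10(N / len(postings))   # df = number of docs in postings
        for doc, positions in postings.items():
            tf = len(positions)               # tf recovered from the positions
            scores[doc] = scores.get(doc, 0.0) + (1 + math.log10(tf)) * idf
    # rank by descending score, as in the required output
    return sorted(scores.items(), key=lambda kv: -kv[1])

for doc, score in ranked_search(["ink", "wink"], index, N):
    print(f"q1,{doc},{score:.4f}")
```

Note the term-at-a-time loop: a query term missing from the index retrieves no postings and simply contributes nothing, and the printed lines follow the query-ID, document-ID, score format described above.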
We will take a break for 5 to 10 minutes, and then we continue with the second lecture, and there will be a mind teaser. You have a question? Yes? So, like, in this

SPEAKER 1
lab, for calculating the score, we'll call a function; can we have any boolean type of queries also with the ranked search?

SPEAKER 0
There is no boolean here, because it's kind of embedded: assume the operation here is OR between all the terms.

SPEAKER 1
I'm asking if they can be mixed. No, no, there is

SPEAKER 0
no mix between ranking and boolean. Boolean is separate from this; it's a totally different type. So no, you will not find this AND/OR format; it will just be free text, OK. Yes, Christian: and in this formula, will you give

SPEAKER 1
a query with a term that does not appear? It

SPEAKER 0
might. In this case, actually, yes, you have to handle these cases. Yeah, that’s a good question: what if I ask for a term that doesn’t exist in the collection? In that case, if your DF is zero, it will raise an exception, so this is something to handle in your code: make sure the term exists first. But the good thing is, all you need to do is go term by term. You try to retrieve the posting list of term XYZ, which doesn’t exist; nothing comes back, so there is nothing to calculate, and you move to the next one. You see? It will be handled automatically. So you keep a loop over all the terms in the query; for each term you extract the posting list, do the calculations, and do the summation. If one of the terms didn’t appear at all, when you try to extract its posting list nothing is retrieved, so you go to the next term directly. Clear enough? Yes, if the term doesn’t exist in the query, of course, there’s nothing to do. But if a term in the query doesn’t exist in the documents, when you retrieve the posting list nothing will be retrieved, so go to the next one; you don’t need to calculate anything. — Yes, so, uh, following up on this question, like, will

SPEAKER 1
we need to retrieve some documents in that case?

SPEAKER 0
No. For example, if I give a query where none of the terms exists anywhere in the collection, it will be an empty list; nothing will be printed. But imagine I give you a query of 10 terms and 5 of them appear: then your scores will be based on the 5 terms that appeared. So you’ll get some results, unless all 10 terms, none of them, are in the collection; then nothing will be retrieved. OK, we’ll take a break for 10 minutes. I will show the mind teaser in a second. OK, so this is the mind teaser of today. It is one of the easiest problems, so hopefully you will get an answer quickly. If you’ve got an answer, raise your hand and shout: what is the answer? OK. Yeah. OK, so I actually got an interesting question in the break about how you can process the query for ranked search. I’ll explain my thinking on how you can do it, but you can still do your own implementation. The main idea: let’s assume I got a query of two words, just a free-form query, and I did the OR operator and found that there are 10 documents that contain at least one of these terms. So what I do now is go one by one. For the first term in the query and the first document that contains at least one of the query terms, I check how many times this term appeared. If this term didn’t appear in this document, skip it and go to the second term of the query. Say I checked it and calculated the TF-IDF, and it turned out to be 2; then the score for this query and document comes out as 2. I go to the second document. I check the first term; it appeared in this document, so I calculate the TF-IDF and it turns out to be 1. And the second term, I found it exists in this document as well, and the TF-IDF score for this term and document is 1 again. So in this case, the score here will be 2. 
So all I need to print at the end is: for query 1, document 1, the score is 2; for query 1, document 2, the score is 2; and keep doing it until I finish all the documents. So you start with a Boolean query as an OR operator, getting all the documents that contain at least one of the terms. Then you calculate the TF-IDF for each single term of the query in each document and do the summation over the document at the end. Is that clear? This is how I recommend you implement it. If you have a different way to implement it, that’s fine, but this is what I’m expecting, OK? And then you will print the score, over all the terms of the query appearing in a document, for each given query. OK? OK, in this part we will continue talking about ranked retrieval. In the last lecture we spoke about the very, very basic ranked retrieval, TF-IDF, and hopefully you understood the notion of it. Yes, there are different implementations of TF-IDF, but in general that is the very basic form. In this part we will learn about probabilistic models, and we’ll talk about one of the most famous formulas, which is BM25. Has anyone heard about BM25 before? One, OK. It’s very famous, but since most of you didn’t know TF-IDF, probably you would not know BM25. And the second topic is another representation of documents, language modelling for IR: how we can actually treat a document as a language model. So remember from the last lecture we have the vector space model and TF-IDF term weighting, which combines the TF and the IDF in this way, and you do the summation over the query terms. What we can notice is that a term appearing more often in a document gets a higher weight, but the first occurrence is the most important one; this is why we take the log. So if I didn’t find the term at all, the weight is 0, but once I find it, I get 1 — and that’s using log base 10. 
If I find the 2nd, 3rd, 4th, 5th occurrence, up until I find 10 of them, only then can I move from 1 to 2. To reach a score of 3, I need to reach 100 occurrences. So the big difference happens when I find the first occurrence; as I add more, I have to find many more to actually get a better score. The rare terms are more important: this is what the IDF does. It tells me that if a term appears in many documents, it’s really not important; even if I keep the stop words and calculate it, they will not get a high weight anyway. But if a term is rare, I start to get some score. There is a little bit of bias toward longer documents here. Do you know why? Why do you guess there is a bias toward longer documents? Exactly. If one document is only 100 words and another document is 10,000 words, and both are relevant, the 10,000-word document will probably have a higher TF for the terms I’m looking for, so it will be ranked higher. So there is a little bit of bias here, because if you are indexing, say, books together with tweets, it’s a bit unfair to the tweets: their term frequencies will always be low, while for books they are expected to be higher. So the question is, can we do something better? This is what we’re trying to see here. So, for information retrieval models, we started with the vector space model, which is very heuristic in nature. There is no notion, so far, of what relevance is, what it means for something to be relevant. However, it still works very well, because we assume that if a term appears more in a document, the document will presumably be more relevant. And any weighting scheme or similarity measure can be used; we saw different variants. Are the components interpretable? Not very interpretable, actually. There is no guide about what to try next; it’s more engineering, so we keep trying. Remember the SMART notation: there are different ways of doing the normalisation, of calculating the TF itself or the IDF itself. 
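The 1, 2, 3 progression described above is the classic 1 + log10(tf) weighting; a one-liner makes the behaviour concrete (assuming this is the variant the speaker means):

```python
import math

def tf_weight(tf):
    # 1 + log10(tf) for tf > 0, else 0: the first occurrence matters most
    return 1 + math.log10(tf) if tf > 0 else 0

print(tf_weight(1))    # 1.0  — first occurrence
print(tf_weight(10))   # 2.0  — need 10 occurrences to move from 1 to 2
print(tf_weight(100))  # 3.0  — and 100 to reach 3
```

So each extra unit of score requires ten times as many occurrences, which is exactly the diminishing-returns effect being described.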
You can keep trying and see which one achieves the better results. Still, it’s very popular. It’s a very basic model, but it’s a very strong baseline, and if you’d like to build a system very quickly, this is something you can do in a few seconds, and it works well. Then there is another model, which is called the probabilistic model for information retrieval. This one tries to use a mathematical formulation of what is relevant and what is not relevant. It explicitly defines probabilities over some variables — relevance, query and document — so we introduce relevance explicitly here, and we are specific about their values, state the assumptions behind each step, and what could be contradictory in those assumptions. The main concept of a probabilistic model is that uncertainty is an inherent part of IR: when I get a document, I’m not sure whether this document is going to be relevant or not. So if you think about how to model something that has uncertainty, probability theory is a good theory to use. And one of the people who pioneered this field, since the 1970s, is a very well-known mathematician, Stephen Robertson. He was a professor, and also actually a scientist at Microsoft; he was working at Microsoft Research in Cambridge. Actually, when I worked at Microsoft, he was the host for my interview at that time. And he created this theory in 1977, long ago — this probabilistic theory for search. He’s originally a mathematician, but he applied it to documents, and in 1977 the collections would be 100 documents, 200 documents; interestingly, though, his theory turns out to work until today. This is actually still what we apply. So this is the theory he has. It’s a little bit 
densely formulated, but let’s try to understand it word by word. He says: if a reference retrieval system’s response to each request is a ranking of the documents in the collection — we’re trying to retrieve a ranked list of documents — in order of decreasing probability of relevance — we rank these documents by decreasing probability of relevance, so the document with the highest probability of being relevant should be at the top — to the user who has submitted the request, where the probabilities are estimated as accurately as possible — that is our objective, of course — on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data. And this is the main basis of the probabilistic approach in IR. By the way, I know it’s maybe not easy to read it in this form, but it has been a big science for years; people have tried to develop this, and there are mathematical derivations for this part. So if you’d like to read a paper with 100 pages of derivations of these formulas, you can read the work by Stephen; it’s really nice. OK, so the formulation of the probabilistic relevance theory here. What we are trying to do is rank documents by probability of relevance: the probability of relevance of the document ranked at number 1 should be higher than the one at rank 2, than the one at rank 3, and so on. And we try to estimate this probability as accurately as possible. Imagine there is a true value for this probability; what we’re trying to do is produce an estimate of this probability that is almost equal to that true value. And we need to estimate this probability based on whatever data is available. 
So currently, what we are working with is simply the query and the document, nothing more. However, in real life, especially in web applications and many others, you can have more inputs. Besides the document itself, which is what we’re working with, there is the session: what has the user been searching for in the last 10 minutes? You can make a better guess when you have this session. The context: are they searching from Edinburgh University or from a cafe? Are they living in the UK, in Scotland, or in the US, or in Africa? This gives you some information. The user profile: do you have some information about the user, their profile, their interests? The more information you have, if you can integrate it into the probability estimation, the better. And the best possible accuracy can be achieved with the data: if we manage to do that, this is a perfect IR system. But the question is, is it really doable? How do we estimate this probability of relevance in a way that gives a very good ranking for the documents? The main concept here is to imagine information retrieval as a classification problem. You have a query, and for this query a document has a probability of being relevant, and there is a large number of documents that are not relevant. The probability that a document is relevant plus the probability that it is not relevant should be equal to one; that covers all the documents we have. And a document is relevant if its probability of being relevant is higher than its probability of being not relevant. It’s a bit of a theoretical discussion, but we’ll see how this can take us to deriving another formula after a while. 
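The classification view of relevance just described can be written compactly. This is a standard rendering of the spoken argument, not a formula from the slides:

```latex
P(R=1 \mid q, d) \;+\; P(R=0 \mid q, d) \;=\; 1,
\qquad
\text{retrieve } d \iff P(R=1 \mid q, d) > P(R=0 \mid q, d)
\iff P(R=1 \mid q, d) > \tfrac{1}{2}.
```

That is, with only two outcomes, "more likely relevant than not" is the same as the relevance probability exceeding one half.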
So what is this probability of relevance, given a document, a query, a session, a user, and so on? The question is: doesn’t relevance depend on the user’s opinion? The user decides whether a document is relevant or not; you decide. So what is this probability, then? The problem is that search algorithms cannot look inside your head. It’s based on the query you wrote, but you might have something different in your head; how shall we know? So relevance depends on factors the algorithm cannot see. However, there are actually some experiments. There was a nice paper a few years ago — 9 years ago, a best paper actually — where they connected users to fMRI to see the reactions in their heads when they find relevant documents versus non-relevant documents. However, this cannot be applied at scale, so, unless — well, who knows what will happen in the future. At the moment, what we actually see is mainly just the query. By the way, if you’re interested, there is — I’m not sure if it’s in the Informatics Forum — a big lab with fMRI and eye-tracking equipment and different stuff. So if you’d like to do some project on this, you might think about using these labs; we have devices to measure these kinds of things, if you have a crazy idea you’d like to try. And different users may disagree on the relevance of the same document. You have the same query, the same document, but if I ask a set of users, all of you, ‘Do you think this document is relevant to this query?’, I’m not expecting that all of you will say yes or all of you will say no. Some of these will come out half and half, or 70–30, and so on. So the true probability for a given document and query — 
this is the proportion, over all unseen users, in different contexts, doing different tasks, who would judge the document relevant for the given query. What we are trying to see is: for this document and this query, if I could ask everyone in the world, ‘Do you think it’s relevant or not?’, in this context, and in a different context, at a different time of year, a different location — if you could measure this, that would be the true probability, but it’s practically impossible to do. If you managed to do it, it would be similar to asking: what is the probability of getting a 6 with a die, given that the outcome is even and not a square? As you put in more restrictions, giving more information about the context, the date, the task, about different things, you actually get a better probability estimate. So what is the probability of getting a 6 with a die in general? 1 divided by 6. What is the probability of a 6 if we know the outcome is even? 1 in 3. And if it’s even and not a square? 1 in 2. So your probability estimate improves every time you get more information. This is what we are trying to do with search here, OK? Which brings us to something Stephen created with his team, which is called Okapi BM25, and it’s based on the probabilistic theory. I will not go into the details; there is a full paper with the derivations, but take it from me that a document d is relevant if its probability of relevance is higher than its probability of non-relevance. It’s an extension of the binary independence model: documents are represented by a vector of binary features indicating term occurrences, and it assumes term independence. So it’s still a bag of words; we are still in the bag-of-words world. And this came out in 1995, when he produced this formula. 
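The die example can be checked by brute-force enumeration; this is just an illustration of how conditioning on more information sharpens the probability:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]

def prob_six(condition):
    # P(outcome == 6 | condition) by enumerating the allowed outcomes
    allowed = [o for o in outcomes if condition(o)]
    return Fraction(sum(1 for o in allowed if o == 6), len(allowed))

print(prob_six(lambda o: True))                              # 1/6 — no information
print(prob_six(lambda o: o % 2 == 0))                        # 1/3 — known to be even
print(prob_six(lambda o: o % 2 == 0 and o not in (1, 4)))    # 1/2 — even and not a square
```

Each extra condition shrinks the set of possible outcomes, so the estimate for the same event gets sharper — the analogy for adding session, context, and profile information to relevance estimation.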
People had been doing search and trying to improve it for years using TF-IDF, with different notations and so on. Then BM25 came out and beat every single system, achieving the best results on every task they tested. And it became a very popular and effective ranking algorithm; until now, in most of the open-source search toolkits, BM25 is the default formula. So what is this amazing formula? It’s not that different from TF-IDF, actually; it’s very close. All they did was add something about the length of the document, a little bit of reformulation, and the IDF part also came to consider something about the collection. There is even a constant in there; the best practice turned out to be 1.5, based on experimentation. What it does is normalise the TF part of the formula based on the length of the document: if a document is very long, it gives a different weighting than when it’s a bit shorter. How does it look? If you used no normalisation, or if the length of the document equals the average length in the collection, you would get the value in black here. But as the document gets shorter, you can see it starts to give higher values for the TF — multiple occurrences mean something in a short document. If the document is very long, like twice the average length, it starts giving lower values for the TF. And the IDF itself comes out a bit lower than the classical IDF implementation. It’s a small variation; honestly, it looks very small, but it makes a big difference. So the probabilistic model in IR focuses on the probability of relevance of the documents. 
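The scoring just described can be sketched as follows. This is one common BM25 formulation, not necessarily the exact variant on the slides: k1 = 1.5 (the constant mentioned) and b = 0.75 are assumed defaults, and the smoothed IDF is one standard choice.

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_dl, df, N, k1=1.5, b=0.75):
    """Score one document for a query with a common BM25 formulation.

    doc_tf: term -> frequency in this document
    df:     term -> document frequency in the collection
    N:      number of documents; avg_dl: average document length
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue                       # missing terms contribute nothing
        # Smoothed IDF variant (kept non-negative)
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        # TF saturates with k1; b controls length normalisation against avg_dl
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_dl))
        score += idf * norm
    return score
```

Longer-than-average documents get their TF discounted, shorter ones get it boosted, and TF saturates instead of growing with the raw count — the three changes over plain TF-IDF the lecture highlights.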
It can be mathematically derived; there is a proof, and I think I will cite it as one of the readings if you’d like to have a look. You don’t have to, but if you’re interested in the maths, it’s nice. There are different ways to apply it; BM25 is the most commonly used formula, and it has been one of the strongest baselines in search so far, when you have only documents and queries and no additional information, OK? The question is, what other models could still be used in IR? Remember, the first one was vectors, getting the similarity between them with TF-IDF and cosine similarity. The second one is this BM25 formula, derived from the probabilistic theory. Let’s try to model it in yet another way, which is the noisy channel model. Think of how the process actually happens. You have some information need — something in your mind you’d like information about — and you have a document collection. What you usually do is think about it, then write down a query that would help you find the relevant documents you’re looking for. That’s the process: think about something, write something, then go and find the relevant documents. But consider how this actually happens. You have an information need, and you think, ‘Oh, the document I’m looking for would probably contain these words.’ So what you try to do is formulate a query that could plausibly be derived from that document. You think, ‘I’m looking for something about climate change,’ so you’re expecting words like climate change, maybe scientists, science — the kinds of words that might be contained in such a document. So what’s actually happening is that you write down a query whose words are likely to appear in the relevant documents. 
Imagine that each of these documents has a kind of language model that it was derived from, and you are generating your query from that model. You think of each document as having a model — the model of the words it was generated from — and you try to generate your query from the same model. The language model approach directly exploits this idea: a document is a good match to a query if the document’s model is likely to generate that query. So I’m reasoning about the document here. The concept is that coming up with good queries means thinking about the words that should appear in a relevant document and using those words to build your query. And the language model approach in IR models exactly this: a document is a good match to a query if the document model is likely to generate the query, which happens if the document contains the query words you’re looking for. So you build a probabilistic language model for each document and then rank your documents by the probability that each model generates the query. So, language models: a language model is a probability distribution over strings. I think everyone knows about language models now — LLMs are a big version of them — but here it’s just about building a language model from a single document, not from a big collection. And a topic, a document, or a query can be represented as a language model: words that tend to occur more will have higher probability in this document compared to other words. The very, very basic language model is the unigram language model, where the terms are simply drawn at random from the document. Imagine a bucket of words and you’d like to draw a word: what is the probability of drawing this word from this document? 
So for example, for this document, imagine I have the query red, yellow, red, blue. What is the probability of drawing a red word from this document? 4/9, yes. The probability of yellow? 2/9. The probability of getting red again? Still 4/9. And blue, which in this case is 3/9. So these are simply the probabilities, and I assume the terms are independent, so all I do now is multiply these probabilities together. Here, the probability of a term is its term frequency divided by the document length — very, very basic. If a word appeared in the document 10 times and the document length is 100, its probability is 10%, or 0.1. So this is a one-state finite automaton: the unigram language model. If I’m looking for a query like ‘frog said that toad likes frog STOP’, which is a very strange phrase, I calculate the probability of each of these words, take the product, and get a score at the end. It might be very tiny, but I do the same for all documents, so I get something at the end for each. So imagine I have two documents, with the model for document 1 and the model for document 2, and these are the probabilities of each of the terms in each document. From here I can calculate the product of all of these, and I can easily see that the probability for document 2 is higher than for document 1, so document 2 is more relevant than document 1. This is the main, very simple idea. OK. The next model, which is a bit more advanced than the unigram language model, does not just treat the words as independent, but actually starts to account for the phrasing. 
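The red/yellow/red/blue calculation above can be reproduced with a tiny sketch. The nine-token document is an assumption chosen to match the counts 4, 2 and 3 from the example:

```python
from fractions import Fraction
from collections import Counter

def unigram_lm(doc_tokens):
    # Maximum-likelihood unigram model: P(t|d) = tf / document length
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {t: Fraction(c, n) for t, c in counts.items()}

doc = ["red"] * 4 + ["yellow"] * 2 + ["blue"] * 3   # 9 tokens, as in the example
lm = unigram_lm(doc)

def query_likelihood(query_tokens, lm):
    # Independence assumption: multiply the per-term probabilities
    p = Fraction(1)
    for t in query_tokens:
        p *= lm.get(t, Fraction(0))                 # unseen term -> probability 0
    return p

print(query_likelihood(["red", "yellow", "red", "blue"], lm))
# 4/9 * 2/9 * 4/9 * 3/9 = 96/6561 = 32/2187
```

Note the `Fraction(0)` for unseen terms: this is exactly the zero-probability problem discussed shortly, which smoothing will fix.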
So I can say the probability of the same sequence red, yellow, red, blue is the probability of observing red, then yellow given red, then red given red and yellow, and so on — which gives unigram, bigram, trigram, four-gram terms. If you can build something like this, it will be more accurate, but of course less efficient. And you don’t have to go to high orders: if you have only a bigram language model — just pairs of words — you can approximate the sequence as red, then yellow given red, then red given yellow, then blue given red, which is the sequence you have. So you can approximate with whatever order of language model you have. So, language modelling in IR: each document is treated as the basis of a language model. Then, given a query, I rank my documents by the probability of a document given the query, which, using Bayes’ formula, is the probability of the query being generated from the document, multiplied by the prior probability of the document, divided by the probability of the query. The interesting thing is that the probability of the query is the same for all documents — what is the probability of the query? I don’t care; it’s the same for every document, so I can ignore this value. And the prior probability of the document itself: I should treat all documents the same. Of course, it can be different for web search; I can use something like PageRank, which we’ll talk about when we cover web search. But assuming, for now, there is no different treatment between documents, all documents are equal. 
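The bigram approximation above can be sketched minimally. The token ordering of the nine-word document is an illustrative assumption (only the counts 4 red, 2 yellow, 3 blue come from the example), and these are plain maximum-likelihood estimates:

```python
from fractions import Fraction
from collections import Counter

doc = ["red", "yellow", "red", "blue", "blue", "red", "yellow", "red", "blue"]

unigrams = Counter(doc)
bigrams = Counter(zip(doc, doc[1:]))

def p_unigram(w):
    return Fraction(unigrams[w], len(doc))

def p_bigram(w, prev):
    # Maximum-likelihood P(w | prev) = count(prev, w) / count(prev)
    if unigrams[prev] == 0:
        return Fraction(0)
    return Fraction(bigrams[(prev, w)], unigrams[prev])

# Bigram approximation of the sequence red, yellow, red, blue:
#   P(red) * P(yellow|red) * P(red|yellow) * P(blue|red)
seq = ["red", "yellow", "red", "blue"]
p = p_unigram(seq[0])
for prev, w in zip(seq, seq[1:]):
    p *= p_bigram(w, prev)
print(p)
```

The chain uses only pairwise conditioning, which is the "approximate with whatever order you have" point: more accurate than unigrams, cheaper than full trigram or four-gram models.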
So I can ignore that factor as well. Then the probability of the document given the query reduces to the probability of the query given the language model of this document: what is the probability of generating this query from this document? So ranking documents by relevance to a query is the same as ranking by the probability of the query being generated from each document. The basic idea: we model the query generation process from each document, then rank the documents by the probability that the query would be observed as a random sample from that document’s model. That is, we rank by the probability of the query being generated from a given document, and this is the formula: the probability of a query given a model is the product, over the query terms, of the probability of each term under that document’s model. Remember, last time it was a summation; this time it becomes a multiplication. We call this the query likelihood model: how likely is the query to be generated from a given document. And the parameter estimation, in the simple unigram model, is the term frequency divided by the document length. So if a term appears 5 times in a document whose length is 10 words, that’s 50%; 5 times in 100 words would be 5%. Notice that the length normalisation which BM25 adds explicitly is kind of embedded automatically here, because as the document length changes, the probability changes anyway. And if a query term appears twice in the query, its probability is simply raised to that power, because I multiply it in each time. So remember we said the probability of red, yellow, red, blue: it is the formula we calculated earlier. 
Red appears twice, so its probability comes in twice, then yellow, then blue. But the question is: do you see any issues so far with this way of modelling search? Do you think this would work better than the other models? Or, let me ask: are there cases where this formula would be problematic? Yes — it doesn’t account for rare words versus very common words, the IDF part. True, it doesn’t count that explicitly, although if a term appears in many documents, things get normalised somehow in the end. But there is actually a bigger issue here. Yes, that’s an excellent point. If I search for something like this, what would the probability be? Zero, yes, because I’m looking for a green term and the green term doesn’t appear. So one of the factors will be 0 over 9, and the whole product will be 0, which is an issue. If we map this to Boolean search and compare with TF-IDF, what is the difference? Remember, with TF-IDF it’s a summation, so we said we effectively do an OR: if a document contains at least one query term, I can still score it and get results. This model is effectively an AND operator: I have to find all the query terms in the document, and if one term is missing, I will not retrieve it at all. Can you see it? There is an implicit AND operator happening here. Usually we want to resolve this issue. However, notice that in web search they do apply a kind of AND operator, by the way. Have you ever seen, when you search for something, Google gives you results and tells you that a document doesn’t contain one of your terms? Usually it tries to find everything that contains all the terms you searched for, but if one term is missing, it will say: you know, I didn’t find a lot of documents, so I’m ignoring this term — 
or do you want me to just get everything? So the default for web search is kind of an AND operator, by the way; they only drop a term if they didn’t find enough results. In web search they can afford this, because there is a huge number of documents, so they can focus on the documents that contain all the terms; but sometimes you will notice it telling you this term doesn’t exist — be aware of this; it mentions it before you even open the link. However, this is sometimes not fair. In the vector space model the score is a summation, so if a term is missing, that’s fine: I add a 0 and nothing changes. But in language models it’s a multiplication, so if one factor is 0, everything goes to 0. Is there a better way to handle terms unseen in a given document? This brings us to what’s called smoothing. The problem is that if a term has zero frequency, I shouldn’t be that harsh. Maybe in some applications I might be harsh, but in others — say you’re searching the Financial Times, 100,000 documents or 1,000 documents — it’s too much: I can miss one term and still want to retrieve something. The solution here is to smooth the term frequencies. Look at the term frequency distribution — it’s like a Zipf distribution, if you noticed, on a large scale: the top term appears this many times, the second one fewer, and at the end many terms do not appear in my document at all. What smoothing says is: take a little bit off the tops of the frequencies of the observed terms, and give the terms I didn’t see some small score. The lowest observed term frequency would of course be 1, but I say: you know what, if I didn’t see a term, I can assume it’s like 0.1 instead of 0. So we do some smoothing here. 
So documents are a sample from a larger language model, and a missing word should not have zero probability of occurring just because it didn’t appear in this document; it might appear somewhere else. The missing term is possible even though it didn’t occur, but it should be no more likely than you would expect by chance in the collection itself, in the whole collection. The document may be a micro-sample, but the term might appear in the whole collection. So smoothing is a technique for estimating the probability of a missing, unseen word. It tries to overcome data sparsity when the term is missing: it lowers, or discounts, the probability estimates for the words I have seen, and takes the leftover probability mass and gives it to the terms I haven’t seen in my document. One of the very famous techniques for this is called the mixture model, which simply says the probability of a term given a document is of course the probability of this term being generated from the document model itself, which is what we have been seeing in the last part, but multiplied by a given constant, maybe 0.9, and then I take the remaining 10% from the probability of seeing this term in the whole collection. So if I didn’t see it in my document, I can still see it in the collection. So the mixture model mixes the probability from a document with the general collection frequency. Remember the collection frequency we talked about in the last lecture, document frequency versus collection frequency? Collection frequency can work here. The model estimates unseen words as 1 minus lambda, multiplied by the probability of this term in the whole collection model, which we call the background language model; then I can give the term some probability. And the estimate for observed words is calculated the same way, using the collection frequency we spoke about last time. It’s not document frequency now; it’s collection frequency.
One of the famous implementations of this is Jelinek-Mercer smoothing, which is calculated in this way, and I start to decide the value of lambda based on what I’m looking for. If lambda is high, I’m giving more weight to the term being observed in my documents, so it tends to retrieve documents that contain all the keywords. If I make it lower, then I can start to be more flexible: if not all the terms exist, that’s fine, still rank the other documents in a good position. And correctly setting this value of lambda is important for good performance; you need trial and error to see which value achieves the best results. The final ranking function will be a multiplication. This is the one we had before, but in this case I keep adding some value from observing the term in the collection. So let’s take a very simple example here. Imagine I have a collection of only 2 documents. Document 1: Jackson was one of the most talented entertainers of all time. Document 2: Michael Jackson anointed himself king of pop. And my query is: Michael Jackson. Let’s assume lambda here to be 0.5. 0.5 means that I’m not that harsh about requiring all the terms of my query to be in the document. So how do we calculate it? The probability of the query given document 1: you can see here that the word Michael didn’t appear in the first document at all; there is no Michael there. The document length is 11, so the first part is 0 over 11, but remember I’m weighting this value by 0.5, and the other part is the collection frequency. The word Michael appeared only once in the whole collection, which is these two documents, and the length of the whole collection is 18 words. So it’s 1 divided by 18, multiplied by 0.5, which is 1 minus lambda here. Then I add the second word.
For the second document, it’s actually 1/7, because Michael appeared in it, and the 1/18 from the collection is still added. For the word Jackson, it’s 1/11 because it appeared once in the first document, and twice in the whole collection, so 2/18. And for the second document, it’s 1/7 and twice in the whole collection. Now I start to get some values. Yes, document 2 scores better than document 1, but I still didn’t get 0 for document 1 just because it missed one term; I get some value for it. A note on the query likelihood model: it has similar effectiveness to BM25. It has been tested experimentally, and it turns out that BM25 still achieves comparable results; actually, in most cases BM25 would be even better. However, with some sophisticated smoothing techniques, the language model approach can beat BM25. There are several alternative smoothing techniques; this is just one example, and there are different ways to do it. And a note on n-gram language models: remember that we have been treating this as a unigram model, just the probability of a term standing alone, without the context around it. It’s a probability distribution over the words in the language based on the probability of occurrence of every word separately, and it models the generation of text as pulling words out of a bucket, each independent of the terms around it. For the n-gram language models, bigram, trigram, and so on, some applications use bigram or trigram language models where the probability depends on the terms observed so far and predicts the next word based on the previous words. However, based on many experiments and many papers, it turns out that generally the unigram language model is still very effective; you don’t even need bigram or trigram language models for these retrieval tasks. And there are 3 possibilities here. One is to calculate the probability of generating the query text from the document model.
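The worked Michael Jackson example above can be reproduced in a few lines. This is a minimal sketch of Jelinek-Mercer smoothing, assuming whitespace tokenisation and lower-cased text (both my assumptions, not stated in the lecture):

```python
from collections import Counter

def jm_score(query, doc_tokens, collection_tokens, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood:
    P(q|d) = prod_t [ lam * tf(t,d)/|d| + (1-lam) * cf(t)/|C| ]
    where cf is the collection frequency and |C| the collection length."""
    tf = Counter(doc_tokens)
    cf = Counter(collection_tokens)
    score = 1.0
    for t in query:
        score *= (lam * tf[t] / len(doc_tokens)
                  + (1 - lam) * cf[t] / len(collection_tokens))
    return score

d1 = "jackson was one of the most talented entertainers of all time".split()
d2 = "michael jackson anointed himself king of pop".split()
collection = d1 + d2          # 11 + 7 = 18 tokens
q = ["michael", "jackson"]

s1 = jm_score(q, d1, collection)  # d1 misses "michael" but still scores > 0
s2 = jm_score(q, d2, collection)  # d2 contains both terms and scores higher
```

With lambda = 0.5, document 2 outscores document 1, but document 1 is not zeroed out for missing one query term, which is the whole point of smoothing.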
Another is the probability of generating the document text from the query language model, which is strange because the query would be like 2 or 3 words. And the third is comparing the language models of both. The general finding from experimentation is that the obvious one works: generate the query from the document model. So, to summarize today’s two lectures: there are 3 ways to model information retrieval. One is the vector space model: measure the angle between the vectors, how the query vector aligns with the document vector. This is what we have been discussing, with different notations and how to implement it, and the very basic formula of this is actually what you’re going to do for your coursework. Then the probabilistic model, which puts some theory behind the notion of relevance: what is the relevance probability of a document given a query, and we discussed that BM25 is one of these formulations that turned out to be very effective. And the last one, the language modelling way for IR: how likely is it that you would observe, or generate, a given query from the model of the document? This is kind of what you do implicitly in your head: you think, I’m looking for something, probably these are the words I’m looking for, I have to find them in the document. For resources, Introduction to IR, chapter 12, covers this nicely, and in IR in Practice it’s sections 2 and 3 of chapter 7. And if you are interested, I would actually recommend the original paper of Okapi BM25, which came out in 1995, and the other paper, on the language modelling approach, which initially introduced this way of doing retrieval and the experiments done with it.
It is actually a nice read, to see how this stuff was developed at the time it was first presented. Do you have any questions? Yes.

SPEAKER 2
Yes, so even after applying the smoothing technique, if your query has one or more words that don’t appear anywhere in the document, the document still gets a score. Is that the intended behaviour? Is that what happens?

SPEAKER 0
Yes, it will not say that this document shouldn’t be returned; it will actually calculate a score for it. Imagine these two words turned out to have a high probability in the collection: these documents might still end up in the top ranks. It would not always push them down; it will actually try to bring these documents to the top ranks, depending on the terms themselves. OK. Any other questions? Yes — what if the word doesn’t show up in the collection at all? Probably you just ignore it in this case, because otherwise nothing would be returned; it simply doesn’t exist. Yes, uh, you said that, um, they went to the

SPEAKER 3
lambda parameter, that I think it was better for it to be larger if we have smaller queries, and smaller if we have larger queries. Can you adjust that dynamically based on what query the user is putting in, or would that be a better idea?

SPEAKER 0
So imagine the extreme case when lambda is one. It means I require all the terms in my query to exist, and this is kind of an AND operator. If you start to give it a lower value and increase the 1 minus lambda on the other side, it means I’m starting to be more flexible: even if 2 or 3 terms don’t exist in my document, I can still give it a higher score and retrieve it. What is the best value? Again, it depends on the application. Actually, even in this paper there is an interesting experiment on this, and you’ll find some papers just about the value of lambda, testing it in different ways and reporting which is better. It shows that for some applications a higher value of lambda is better. Usually a higher value of lambda is better when you have a huge collection. If you have a huge collection of billions and trillions of documents, then if one document is missing one of the terms, I already have many others; but if the collection is small, like the two-document collection we talked about, you don’t want to be that harsh with them. OK? So, so far — let me revisit this quickly. I’m trying to make you implement how search engines work. So far you are doing the very, very basic things. I know some of you are excited to try more stuff, like an extended index, or saving the index in a binary format, or a structured index, and so on. And maybe someone would like to implement BM25 or language models for IR. The first step is to make you build the basic search engine, which is your coursework 1. However, when it comes to the group project, you have plenty of ideas you are going to learn, and you’ve already learned some of them. You’re not implementing, for example, BM25 or language models here.
You might think about using them in your group project. And from the next lectures, as you go on with the course, we will cover even more advanced methods, like learning to rank, for example. You will not be required to implement them at that time, because it takes time and effort, but we would appreciate it a lot if we find some of this stuff implemented in your group project; you want advanced stuff there. So this is the main point: we’re giving you the pointers, getting you to grab the start of the rope of how to build search engines, and we’ll show you all the routes you can take afterwards, which advanced methods you can use later. For now you are doing the basic thing. Is that fair enough? OK. So the lab is based on 1,000 documents. The coursework will be 5,000 documents; when we release it, it will be a little bit higher, but this is nothing. This is not a real system; a real system will never be this small. It will be millions, actually hundreds of millions, sometimes billions. So it’s all just about getting the system up and running and knowing how it goes. Hopefully you will learn these basic skills, develop them over the course of the second semester, and create an interesting group project. OK, so, best of luck with the coursework. You will not be able to submit until we release the test set, but you can start finishing everything, even writing the report. All that remains will be just the final test collection, so you can run on it and submit the result files. Share your results, please, on Piazza, OK? I hope to see something within this week. Best of luck.

Lecture 3

SPEAKER 0
OK. Hi, everyone. Welcome to the 3rd week of Text Technologies. And today we’re going to talk about indexing; we’ll get to it in a minute. But before the lecture: so far, the 4 lectures we had in the last couple of weeks were a kind of warm-up, and now we’re starting to get a bit more serious about building search engines. Lab one — if you noticed, many of you have started to share your results on Piazza. You notice there are some slight differences. They look almost the same, but there are maybe slight differences, which might come from the way you did the tokenization, the stopping, the stemming, and that’s fine, this is normal. In the end you are doing something and you can actually see how it works, how you deal with big data. And there is one small tip about loading big files; I can see a couple of you have started to talk about this. I think one of the mistakes that you make, and it will not work when you go to do your group project, is that when loading a file, you load the whole file into memory and then start processing it. Load it part by part. I usually recommend just line by line, easy, but you can do chunks as well. This file was 500 megabytes; in the future, you might be working on something with 20 gigabytes. Don’t fill your whole memory with a file just to start processing it. So load it chunk by chunk or line by line, process it, and print, or save it in a hash, or whatever you do, then go for the next one. Don’t keep the whole file in memory. This is one small tip to make your stuff work better. Today we are going to have 2 lectures about indexing. I’m expecting that you know some binary; hopefully everyone here knows binary. And I will also announce coursework 1, which depends on the last lecture and this one, in the break. And one important notice: please do the labs on time.
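The file-loading tip above can be sketched like this. A minimal example of streaming a file line by line in Python (the file name and token-counting task are just for illustration):

```python
import os
import tempfile

def count_tokens(path):
    """Stream a file line by line so memory use stays roughly constant,
    no matter how large the file is."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:                 # the file object is a lazy iterator
            total += len(line.split()) # process this line, then discard it
    return total

# Tiny demo with a temporary file standing in for a big collection
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("he likes to wink\nhe likes to drink\n")
count = count_tokens(tmp.name)  # 8 tokens across the two lines
os.remove(tmp.name)
```

The key point is that `for line in f` never materialises the whole file: a 20 GB collection is processed with the same small memory footprint as a 500 MB one.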
Don’t leave it to the end and then, when you go to submit coursework 1, which totally depends on labs 1, 2, and 3, realize you have to do all of them at the same time. Start doing them on time and you’ll be absolutely fine with the coursework, OK. One thing: someone mentioned that in the recording of last week’s 2nd lecture the sound was off, which was a problem; it turned out that I was muted, it seems. There is no solution for this, it’s gone, but what I’m trying to do is find the lecture from last year — we have it, it’s just a matter of uploading it to the system, so I need to find out how that’s done. If you need a recording for the 2nd lecture, it will hopefully be there within the next few days. Do you have any questions before I start? Is it going well so far? Is it fun? Good, OK. More fun to come. OK, so that’s fine. So the lecture objectives are again to learn, but also to implement, what we call boolean search, the inverted index, and the positional index. We will learn more about what we mean by this. So, the indexing process that we started discussing last week is about converting our set of documents — you have a question? OK — converting a set of documents into some bag-of-words form, which we discussed, and then putting them in the index. Last week we started to talk about the kind of transformation process we do, which is pre-processing. We talked about very, very simple processing steps that lead to big improvements in search, like tokenization, case folding, normalization, stopping, all of this stuff. Once we’ve done this, we need to put the results in the index, which is exactly what we’re going to talk about in this lecture and the next one, OK? So remember this example from the last lecture. We took the text and pre-processed it: removed the stop words and applied stemming. We end up with something like this.
And we said this is what we’re going to put in our index to be able to match. And the question is, OK, what is the index then? Simply, the question is how to match terms between queries and documents in sublinear time. It’s not like find or grep as a Linux command; it’s not sequential. We need to find it very quickly. And the main idea is using an index to find these terms immediately; we don’t have to go and scan everything. So does anyone have an idea — have you come across any index in your life? Have you seen an index before in any application in

SPEAKER 1
life?

SPEAKER 0
A hash map, like a dictionary, OK, but have you seen one in the physical world? Have you seen an index? Yeah, in a book, yeah. The very simplest index is the book index, which is something like this. So what is the use of this one? It is simply a small search engine. If I’m looking for a specific term, like “coverage” here, it tells me it’s on page 8. I don’t have to keep reading the whole book to find this word; from the index I find, oh, it’s on page 8, and I jump directly, in sublinear time, to that page. That’s it. That’s an index. Easy. So we are done — now let’s explain how it’s done on the computer side. So, search engines versus find or grep: it’s infeasible to scan a large collection of documents every time to find a specific term. If you grepped for a specific term on Wikipedia, for example the Wikipedia abstracts, it would take a long time to find it, especially if it appears at the end. So the question is how we can find it so quickly. And also finding more complex things: say I’m looking for a document which contains the word UK and also Scotland and also money. Now it’s not just one word, it’s multiple words. They don’t have to appear together as a phrase, but I’m looking for a document that has all three words. This is not feasible or easy, not straightforward, using sequential search. So the book index helps you find a list of relevant pages for specific words. If I’m using a book index and I’m interested in, for example, the economy in the UK, I can look up where the word UK appears, where Scotland appears, and where money appears, and then find the common pages between them. I can guess, oh, this one is relevant. And this is the main point: find it in sublinear time, very quickly. The IR index is very, very similar.
It’s a data structure for fast lookup of terms, and it has some additional optimizations that can be applied to make it even more effective; we will discuss that more here. So it all starts with the very basic representation of a document in the index, or in a collection in general. We start with what are called document vectors. Simply, we represent a document as a vector. The vector here is a document, and each cell inside this vector represents a specific term in this document. The values can be how many times a term appeared in the document, as a bag of words, or maybe just whether it appears or not — is it there or not, that’s it. And if you have a set of documents — if a collection has 100,000 documents and you put all of them as vectors — these 100,000 vectors form a matrix. We call it the collection matrix. So we have a big matrix in which each column represents a document, and we record how many times each term appears. Let’s take a very, very small example. Imagine we have a tiny collection of only 5 documents, as we see here. Document 1: he likes to wink, he likes to drink. Document 2: he likes to drink, and drink, and drink, and so on — just strange words inside. What we can do now is find the unique terms we have in our collection. Let’s take a sample of these words: he, drink, ink, likes, wink, and so on. And then for each of these documents we count how many times each of these terms appears. For example, in the first document, I can see wink appears only once and likes appears twice. So an entry like this tells me the number of occurrences of a term in a document: the word drink here appears 3 times in document 2. And the document vector is simply one of these columns.
So I now see the document as a vector over all the unique terms in my collection. Some of them appeared multiple times, some of them didn’t appear at all, like the 0s here: thing and ink didn’t appear in document 4, and that’s it, very simple. However, this is the simplest index we can think about. What we are more interested in is the inverted index, where instead of representing a document as a vector, we represent the terms themselves as vectors. So the vector itself is the term, and each cell says which document this term appeared in. Simply, it’s the transpose of the matrix. If I just did that, it would look like this. If I load the word he, I can see it appeared in document 1 twice, document 2 once, and so on. The word ink appeared in documents 3, 4, and 5. The word wink appeared in 1, 4, and 5. So this vector tells me where each term appeared in the collection, in which documents. And this is a very, very basic representation for very basic search; it’s the foundation on which anything else in a search engine is built, and it enables what we call boolean search. Boolean search is a functionality that tells us only whether a document matches a specific query or not. It doesn’t tell me if it’s relevant; it just tells me: does it have the terms I’m looking for or not? That’s it. It’s not ranking anything. And it works with logical operators like AND, OR, NOT, so I can take a collection of Shakespeare’s works, for example, and write a boolean query like: I need all the works which contain the word Brutus AND Caesar but NOT Calpurnia. So how can we do that? With grep, for example, it’s not straightforward; here we need to build a structure to enable us to do it.
What is required here is to build a term-document incidence matrix, which tells us which term appears in which document, and then we can apply the boolean operators AND, OR, NOT to find the matching documents. Let’s take an example. Imagine this is the collection matrix, these are the different documents we have, the different works by Shakespeare, and here are the different words. Remember, my query is Brutus AND Caesar AND NOT Calpurnia. What I need to do now is load the vector of each of these terms. This is Brutus, and these are the documents it appeared in: 110100, telling me where it appears. This is the second one, Caesar; I load it. And Calpurnia — I’m looking for NOT, I’m looking for the documents that do not contain this term, so I take the complement of its vector. OK, so now I can easily check the vector values. For Brutus I see 110100 from the top, which I copied here; for Caesar I find which documents it appears in; and NOT Calpurnia; and I apply the boolean AND operation and find the output: document 1 and document 4. So all it does is tell me that these are the two matching documents. In a sublinear operation like this, without scanning the whole collection, I just loaded the vectors of these terms, applied the boolean operators, and found: OK, these are the two documents I’m looking for. OK? Is it clear so far? OK. You’re going to implement this stuff, so if you have a question, ask. The problem with this representation comes with a bigger collection. We have a very, very small collection here, but consider a collection of 1 million documents.
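The incidence-matrix query above can be sketched in a few lines. The 0/1 vectors below are illustrative stand-ins for the Shakespeare example (only Brutus's 110100 is given explicitly in the lecture; the other two rows are my assumption, chosen so the answer comes out as documents 1 and 4):

```python
# Term -> incidence vector over 6 documents (1 = term occurs in that document)
incidence = {
    "brutus":    [1, 1, 0, 1, 0, 0],   # 110100, as in the lecture
    "caesar":    [1, 1, 0, 1, 1, 1],   # illustrative
    "calpurnia": [0, 1, 0, 0, 0, 0],   # illustrative
}

def AND(a, b):
    return [x & y for x, y in zip(a, b)]

def NOT(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = AND(AND(incidence["brutus"], incidence["caesar"]),
             NOT(incidence["calpurnia"]))
matches = [i + 1 for i, bit in enumerate(result) if bit]  # document numbers
```

In a real system these would be machine-word bitwise operations over packed bit vectors, which is what makes the boolean combination so fast.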
That’s not a big collection, just 1 million documents, not that big. But let’s assume it’s 1 million documents, and each document contains on average 1,000 words. Imagine news articles, for example, each with 1,000 words on average. So what would be the size of the matrix? To find the size of the matrix, I need to know how many unique words we have. The total number of words in my collection is 1 million documents multiplied by 1,000 words, which is around 1 billion words. Using Heaps’ law, we can estimate that we end up with around 500,000 unique words in the collection. So what would be the size of my matrix in this case? 500,000 by 1 million, which is huge: 0.5 trillion cells, and all of the entries are 1s or 0s. OK, so that would be a huge matrix. But what is more important: what do you think the values inside this matrix would be? Would we expect to find more ones or more zeros? Why?

SPEAKER 2
Because most words don’t show up often.

SPEAKER 0
Yes, exactly. Also think about it: the document size is 1,000 words, and you have a vocabulary of 500,000 words. Even if this document has 1,000 unique words — no word appears twice in it — then out of the 500,000 cells in its vector, only 1,000 will be ones; everything else will be zeros. Across all documents, at most you get 1 billion ones and everything else is zeros, and that is the extreme case where every term appears only once per document, which is not realistic: you will find some terms appear a lot and some terms appear very few times. For a given term, the vector might look like this: it appears in document 1, then documents 4 and 5, and many, many zeros in between, and this vector is 1 million cells long. And if you think about Zipf’s law, remember that half of these terms appear only once in the whole collection. So half of the term vectors, 250,000 of them, have everything zero except one single cell. So what we end up with is really, really sparse: almost everything is 0, with only a few 1s. This representation is not very efficient: imagine you have to build this huge matrix, but it’s mostly zeros. We need to save space. So what could be a better representation of the document terms? This is still what we call the inverted index, but with a sparse representation. In this case, for each term, we store a list of all the documents that contain this term, and we identify each document by its ID. So instead of using a vector, we are using a list. It would be something like this.
The word Brutus appears in documents 1 and 2, then 4, then 11, then 31, then 173, and all the zeros in between, where the word Brutus didn’t appear, I don’t store at all. So this will not be sparse. If a term appeared in only 4 documents, its list just has 4 entries. If a term appeared in only 1 document, it will be a list with 1 cell. So this is a compact representation: just store the document IDs. Instead of saying how many times it appeared, just say which documents it appeared in. So what do we call this? This is the document number; I store the document number inside the cell. When I find a new document containing my term, I add it to my list, and I call each entry a posting. And the whole thing is not a vector anymore; it’s a posting list, because it’s a list, and it has a different size from one term to another. So we call this a posting list, OK. And all the unique terms I have in my collection, I call my dictionary, which is very similar to the book index you have seen: a term and where it has appeared. This is simply what we’re storing. Processing can be done in different ways, but here’s a quick example. I have this input text. All I need to do is apply tokenization and then normalization — maybe stemming and case folding, stop word removal if needed. Then all I need to do is store, for each of these terms, which documents it appeared in — very straightforward and simple. In more detail: if I have two documents here, I can list the (term, document) pairs, then sort them alphabetically, and from there I can create my postings, saying for each term which documents it appeared in.
So the word he appeared in document 1 and document 2, that appeared in 1 and 2, and so on for the others. Very straightforward and easy; it should be very similar to the processing you did last time. So remember this example: this was the collection matrix we had, with the term vectors. If we’d like to represent this example sparsely using posting lists, it would be something like this: he appeared in these five documents, the word thing appeared only in document 3, and that’s it, OK. Is there — yes, a question. Excellent question. Yes, you are talking about the next slide. OK, so the problem here, as you mentioned: I know that this term appeared in this document, but I don’t know how many times it appeared. It could be once or it could be 100 times; the list never tells me how many times it appeared. For boolean search, this should be enough, but when we talk next week about ranked search, the frequency is important. So how can we solve this? It’s a little bit of restructuring. What we can do with frequency is, instead of having a list of cells where each is just one integer, we can save each as a tuple. So it tells me he appeared in document 1 twice, in document 2 once, in document 3 once, document 4, and so on. So now it’s a tuple: I save the document’s ID and how many times the term appeared in that document. For example, here it tells me the word drink appeared in document 2 three times. Is that clear? Does it solve the problem? OK, so it’s a little bit of restructuring. So, how shall we do query processing in this case? Imagine I got a query which asks for all the documents which contain the word ink AND the word wink. How can I process this? There is one algorithm called linear merge, which is used a lot, but you don’t have to implement it exactly; you can implement it in different ways.
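The tuple-based posting lists just described can be built with a few lines. A minimal sketch of an inverted index mapping each term to a posting list of (document ID, term frequency) pairs, using the lecture's toy documents:

```python
from collections import Counter, defaultdict

docs = {
    1: "he likes to wink he likes to drink",
    2: "he likes to drink and drink and drink",
}

def build_index(docs):
    """Inverted index: term -> posting list of (doc_id, term_frequency).
    Iterating doc ids in sorted order keeps each posting list sorted,
    which the merge algorithms later rely on."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id].split()).items():
            index[term].append((doc_id, tf))
    return dict(index)

index = build_index(docs)
drink_postings = index["drink"]  # [(1, 1), (2, 3)]: once in doc 1, 3 times in doc 2
```

For pure boolean search the frequencies can be dropped, but storing the tuple now means the same index serves ranked retrieval later.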
I will give you some tips on what you can implement for your coursework. But here is how it works in linear merge. For example, I’m looking for documents containing the word ink and the word wink. I have all my dictionary saved in the index, and for each of these words I have the posting list. So, I’m looking for ink and wink; I just load them from my index. So this is the outcome: ink appeared in document 3 once, document 4 once, document 5 once, and wink the same kind of thing. Linear merge works this way: it checks which of the two posting lists currently has the closest document ID, which one appears in an earlier document. Yes, wink appeared in document 1, so I start with this. Then I check: does ink appear in document 1? No. So the result will be 0 for the first term and 1 for the second, because only the second appeared. Then I go to the next posting in that posting list and check: does the other posting list have a document with an ID less than this one? Yes, it does. So I go for document 3, and the result will be: the first term appears in document 3 but the second doesn’t, so it will be 1 and 0. Then I go to the next, document 4. Did I reach 5 yet? Not yet, so it continues: 1 and 0. Then I go to 5. Does it match here and there? Yes, so it’s 1 and 1. But remember that we are looking for an AND operator here. An AND between 0 and 1, or 1 and 0, will always be 0; only 1 AND 1 will be 1. If I change this to OR, what would be the outcome? Oh, all ones. If I came and said change this to ink AND NOT wink, I can still do it; I just apply the Boolean operator and it finds the matches. So this is how it’s done, OK, in a simple way.
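The linear merge walkthrough above can be sketched as a two-pointer walk over two doc-ID-sorted posting lists; a minimal illustration (you are told below that for the coursework plain set operations are enough):

```python
def linear_merge(p1, p2, op="AND"):
    """Walk two sorted posting lists of doc IDs with two pointers,
    scanning each list only once. op is "AND", "OR" or "AND_NOT"."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:               # 1 and 1: in both lists
            if op in ("AND", "OR"):
                out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:              # 1 and 0: only in the first list
            if op in ("OR", "AND_NOT"):
                out.append(p1[i])
            i += 1
        else:                            # 0 and 1: only in the second list
            if op == "OR":
                out.append(p2[j])
            j += 1
    if op in ("OR", "AND_NOT"):          # leftovers from the first list
        out.extend(p1[i:])
    if op == "OR":                       # leftovers from the second list
        out.extend(p2[j:])
    return out
```

With ink in documents 3, 4, 5 and wink in documents 1, 5, the AND merge yields only document 5, exactly as in the walkthrough.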
So now, wherever I get a 1, this is actually my match. Just Boolean: I didn’t rank anything. All I need is to find the documents which contain a specific set of terms. OK. Remember that we asked: OK, this is a bag of words, but what about when we would like to search for a phrase? Imagine I’m looking for a document that contains pink ink, and we have this representation. How shall I find it in this case? Do you have some ideas what the solution could be? Yes, look for documents that have pink and ink and

SPEAKER 1
then specifically find it in the string of that document.

SPEAKER 0
How would you find it in the string of the document, start scanning the document itself? But imagine that you’ve got 1 million matching documents. It will turn into a grep-like scan; it will take a long time. Is there another way to do it? Think about it. That’s a good solution, but it will not be scalable. Any other solutions? Using the same postings, yes, combine pink and ink and make

SPEAKER 2
it one word with a hyphen.

SPEAKER 0
Nice, which is actually using something like bigrams. So instead of saying he likes to wink and drink pink ink, which is a strange phrase, but this is an example, I can put it in this representation: he likes, likes to, to wink, wink and, and drink, drink pink, pink ink. And now I can search the postings: instead of a posting list per single term, it will be per bigram. And I can find that pink ink appeared in this document. OK, this is your solution. Do you have any problem with this solution? Size, excellent. The size would be huge. With combinations of words, instead of having a vocabulary of 500,000, you might end up with 5 million or 50 million. What else? What if it’s a longer n-gram? Here you are. If I’m looking for a phrase that is more than two words, trigrams, 4-grams, 5-grams, 10-grams, shall I keep creating something like this? So the problem here is that it might be fast when you load it and find it quickly, but the size would be huge. Yes.
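The bigram-indexing idea can be shown in a few lines; a toy sketch that also makes the drawback visible, since every adjacent word pair becomes its own dictionary entry:

```python
import re
from collections import defaultdict

def build_bigram_index(docs):
    """Index every adjacent word pair, so a two-word phrase query
    becomes a single dictionary lookup. The vocabulary explodes,
    and longer phrases would need trigram, 4-gram, ... indexes."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        toks = re.findall(r"[a-z0-9]+", text.lower())
        for a, b in zip(toks, toks[1:]):
            postings[f"{a} {b}"].add(doc_id)
    return {bg: sorted(ids) for bg, ids in postings.items()}

bg = build_bigram_index({1: "he likes to wink and drink pink ink"})
```

An 8-word document already produces 7 bigram entries, which is why the positional index discussed next is the better general solution.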

SPEAKER 2
Another.

SPEAKER 0
Another, let me finish the example and then I’ll hear it from you; actually, I’m interested to see. So what about trigram phrases and more than this? And what about even proximity? If I’m looking for something like ink is pink, and I’m not doing stop words removal, I’d like to find these terms appearing close to each other, instead of being just an exact phrase. What could be the solution?

SPEAKER 2
Looking for proximity, store where, store the proximity.

SPEAKER 0
That’s excellent, which is exactly what we call the proximity index. It’s another structure for our index that stores additional pieces of information that help me do these kinds of operations very efficiently. So the term position in this case would be embedded in the inverted index. I store it, and we call it a proximity or positional index; both refer to the same thing. And it enables phrase search and proximity search. But instead of having tuples of the document ID and how many times the term appears, it changes a little bit: it is still a tuple, but it contains a document ID and where this term appeared in it. So remember our example; this is how it would look. It will tell me the word he appeared in document 1, not twice, but at position 2; and it appeared in document 2 at position 1, and in document 3 at position 1, and so on, and drink the same thing. Actually, I’m sorry, this is the old one. The new one will be like this: it tells me he appeared in document 1 at position 1 and document 1 at position 5, which equals appearing twice, and the same for drink, it will be something like this: it shows me that in document 2 it appears at positions 4, 6, and 8. So I stored where each term appeared in my document, and it’s stored in my index. This solves a big problem, because now I know exactly where the term appeared. OK, is that clear? You’re going to implement this, so if it’s not clear, ask now. So for the query processing for phrase and proximity search, we still use linear merge, but we take an additional step. Let’s go to the example. The word ink appeared in document 3 at position 8, in document 4 at position 2, in document 5 at position 8; and pink appeared in document 4 at position 8, and document 5 at position 7. So remember, I’d like to find pink ink. This will be an AND operator by default, because I need both of them to appear.
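The positional index just described, term to (document ID, positions), could be built like this; a small sketch with the same toy tokenizer assumption as before, storing 1-based token positions:

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Returns term -> {doc_id: [positions]}: for each term, which
    documents it appears in and at which (1-based) token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in sorted(docs.items()):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, term in enumerate(tokens, start=1):
            index[term][doc_id].append(pos)
    return {term: dict(docs_) for term, docs_ in index.items()}

pos_index = build_positional_index({1: "he likes to wink", 2: "he drinks ink"})
```

The term frequency is no longer stored explicitly, but it is recoverable as the length of each position list.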
So what I will do, I start with linear merge. Ink appeared in document 3; is there anything there for pink? No. So the result is 1 and 0, nothing. Then I go to the next document: ink appeared in document 4. Does pink appear in document 4? Yes, so that’s great. So the result here will be 1 and 1, but I will not take an action now and say it’s a yes, because I need to double check: what about the positions of these two terms? Is the position of ink minus the position of pink equal to 1? Yes or no? No. So in this case, while they both exist in the same document, it’s not a phrase, because their positions are far from each other, so the answer would be zero. Then I go to the next document, document 5, here and there, and the question is: is the difference between their positions equal to 1? Yes, so the answer here is one, and now I find that document 5 matches my phrase query. OK. And the good thing is that this now enables me to do not just phrase search but also proximity search. I can search for words appearing in the same sentence; they don’t have to come right next to each other, but at least close. I can say I need these words to appear within a window of 5, or within a window of 10, instead of one word appearing at the beginning of the document and the other one 1,000 terms later at the end. This is a very useful feature when doing ranking for documents later on, when we speak about ranking. If I’m searching without phrase search or proximity search, I’m searching for just White House, for example, and I find, say, 1,000 documents that contain White House. But some documents contain white somewhere at the beginning and house at the end, and another document has them right next to each other.
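The phrase check walked through above (AND on the documents, then require position of word 2 minus position of word 1 to equal 1) can be sketched directly over positional postings; a rough illustration, not the prescribed coursework code:

```python
def phrase_search(post1, post2):
    """post1/post2: {doc_id: [positions]} for the first and second
    word of the phrase. A document matches when some position of
    word 2 is exactly one after a position of word 1."""
    matches = []
    for doc_id in sorted(post1.keys() & post2.keys()):   # the AND step
        second = set(post2[doc_id])
        if any(p + 1 in second for p in post1[doc_id]):  # the position check
            matches.append(doc_id)
    return matches

# the lecture's example: pink in doc 4 (pos 8) and doc 5 (pos 7);
# ink in doc 3 (pos 8), doc 4 (pos 2) and doc 5 (pos 8)
result = phrase_search({4: [8], 5: [7]}, {3: [8], 4: [2], 5: [8]})
```

Document 4 survives the AND step but fails the position check (8 then 2), so only document 5 matches the phrase, as in the walkthrough.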
So we start to use this kind of proximity as a score, to give a bigger score to the ones appearing close to each other for ranking, and we’ll talk more about this next week. OK, so this is really simple, but it solves a big problem: now I don’t need to think about bigrams or trigrams or 4-grams. I can easily apply this; I have the positions, I can simply calculate and find out whether it is a phrase or not. Fair enough? Good. So how to do this, a sample implementation. In your coursework and labs you don’t have to implement linear merge, because that works on lists at a low level; the next lecture will even talk about lower-level stuff. Here is how to make it simple for your coursework. First, when you get a query, for each term in the query, regardless of what its operator is, retrieve the posting list for that term. So if I got a query which has 3 terms, retrieve from your dictionary all the posting lists for these 3 terms. If you’ve got an AND operator, then what you need to do is take the intersection between these lists. So if term 1 appeared in documents 1, 4, and 5, and term 2 appeared in documents 4 and 5, then the intersection is 4 and 5. If it’s OR, you do the union of these lists: it will be 1, 4, 5, and everything, just the union operator. If I got NOT, it will be the inverse of the posting list. So if the term appears in documents 1, 4, and 5, then you have to return documents 2, 3, 6, 7, and so on. But this becomes more complicated. Usually, and in the lab and coursework, NOT will only be combined with AND. So what you can do is subtract. If you have term 1 AND NOT term 2, you check where term 2 appeared and subtract it from the first list, and that’s it. Fair enough? Easy. So say term 1 appeared in documents 1, 4, 5, and 6, and term 2 appeared in documents 5, 6, and 7.
So what I need to do is subtract 5 and 6 from the first list, and what remains is the answer. If I got phrase search, it’s simply applying an AND operator, which is the intersection, then checking whether position 2 minus position 1 equals 1 or not. OK. And you have to be careful, because White House is not the same as House White: if position 2 minus position 1 equals -1, this is not the phrase I’m looking for, OK? So you have to be sure which word comes first. If I’m doing proximity search, terms appearing within a specific window, in this case it’s again an AND operator, the intersection, then check that the absolute value of position 2 minus position 1 is less than or equal to the window value. So in this case the order doesn’t matter that much, and all I need to do, if I’m saying within 5, is be sure that the terms appeared within a distance of 5 of each other, OK? Easy, OK. Another thing that I recommend you use. It’s not the most efficient, actually it’s far from being efficient, but it’s an easy start for your coursework and lab: use this kind of representation for your index. Later on, what I hope is that in your group project you use more efficient ways of storing the index itself. But let’s take this example. What you can do is, for each term, put an entry and mention how many documents this term appeared in; it appeared in 5 documents, for example. Then for the first document it appeared in, which positions; for the second document, which positions. For example, if the word Scotland appeared in 25 different documents, it appeared in document 94 at position 112, in document 351 at these multiple positions, and so on.
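The set-based shortcuts recommended above (intersection for AND, union for OR, subtraction for AND NOT, plus the absolute-distance check for proximity) might look like this; a minimal sketch of the recipe, not the required coursework code:

```python
def bool_and(l1, l2):
    # AND: documents in both posting lists
    return sorted(set(l1) & set(l2))

def bool_or(l1, l2):
    # OR: documents in either posting list
    return sorted(set(l1) | set(l2))

def bool_and_not(l1, l2):
    # term1 AND NOT term2: subtract the second list from the first
    return sorted(set(l1) - set(l2))

def within_window(pos1, pos2, k):
    """Proximity: some occurrence of each term lies within k
    positions of the other; absolute value, so order doesn't matter."""
    return any(abs(a - b) <= k for a in pos1 for b in pos2)
```

With term 1 in documents 1, 4, 5, 6 and term 2 in documents 5, 6, 7, AND gives 5 and 6, and AND NOT gives 1 and 4, matching the worked example.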
So this is a very, very simple way of saving it as text, nothing hard, just some simple structure that can help you store your index. And from this you have all the information: for each term, how many documents it appeared in, what these documents are, and what positions you have in each of them. And if you need to know, for example, how many times it appeared in a given document, you can just check the length of the position list you have there: it appeared here twice, or once, or 7 times, and so on. Is it clear? OK, practical. Let’s see this in action, OK? I need to bring it up; I think I have it here. Where is it? You cannot see it, yeah, I need to make the screen mirror.
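A plain-text layout in the spirit of the one described (a `term:document-count` header line, then one indented `doc_id: positions` line per document) could be saved and loaded back roughly like this; the exact coursework format is specified separately, so this is only an illustrative sketch:

```python
def save_index(index, path):
    """index: term -> {doc_id: [positions]}. Writes 'term:df' then
    one tab-indented 'doc_id: p1,p2,...' line per document."""
    with open(path, "w") as f:
        for term in sorted(index):
            f.write(f"{term}:{len(index[term])}\n")
            for doc_id in sorted(index[term]):
                positions = ",".join(map(str, index[term][doc_id]))
                f.write(f"\t{doc_id}: {positions}\n")

def load_index(path):
    """Parses the same layout back into term -> {doc_id: [positions]}."""
    index, term = {}, None
    with open(path) as f:
        for line in f:
            if line.startswith("\t"):                 # a document line
                doc_id, positions = line.strip().split(":")
                index[term][int(doc_id)] = [int(p) for p in positions.split(",")]
            else:                                     # a term header line
                term = line.split(":")[0]
                index[term] = {}
    return index

demo = {"scotland": {94: [112], 351: [4, 6, 8]}}
save_index(demo, "demo.index")
```

The point made below applies here: index once, save to disk, and at search time load only this file, never the raw collection.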

SPEAKER 2
This place.

SPEAKER 0
Uh, mirror it. You see it now, yes? Excellent. OK, let’s have an example here. What do they call it? Mm, collection, yes. I created a collection, which is this one. It has 5 documents. This is a standard format; we’ll see why exactly this format later. It tells me I have a start of a document here, this is the document number, and this is the text of the document, OK, and then this is the end of the document. So it’s kind of an XML format, OK? It’s kind of XML, but this is a very standard format for documents when you’re doing search and indexing. So let’s say this is the format I have, OK? Can you remember this example we have seen in the lecture, yeah? Only 5 documents, nothing big. So I created an indexer. You don’t have to see how it looks; it’s just something to take this and put it in the structure we talked about. I applied very simple pre-processing, not all of it: no stop words removal, no stemming. I just applied case folding and removed the punctuation. In the lab you do the whole thing; I’m just illustrating here. So I take this collection of text, run the indexer, and let me print it. You can see it now; it tells me the word and appeared in two documents: document 2 at positions 5 and 7, document 5 at position 5. Actually, let me double check this. If I go to the collection of texts, you can see it here. It tells me here, let me get back, it appears in documents 2 and 5. There is no and here in this document. Document 2, what is this? It’s not.

SPEAKER 2
Yeah, oh yeah, yeah, document 2, yes, in 5 and

SPEAKER 0
7, which is true, yes, this is true. 1, 2, 3, 4, 5, so yes, it is correct, and nothing here in the other documents. Where is it? Document 5 at position 5. Document 5 at position 5, OK. So it’s simply doing it for me. Now what I can do, I created a simple algorithm for AND, just an AND thing, and it takes the collection index and loads it. So once you process this, print it to a file as an index; this is a file output. And when you do search, you don’t go and process the documents again: you just load the index into memory, parse it, and keep this structure in memory. And now, if I’m looking for the word and, it tells me it appeared in documents 2 and 5. The word drink appeared in all the documents. he AND drink: they appeared in all the documents. and AND drink: it appeared in 2 and 5. Then ink, pink: it just tells me which documents these appeared in. OK, very simple. And this is a small collection. I had another collection, actually, the one you’ll be using in the lab, just to check, which is a sample. These are news articles from the Financial Times in the 1990s, very, very old news articles, but actually interesting. Each has the document number as well; it has been put in this format, with profile, date, headline, text, and so on. When you do the indexing, you just apply it to the headline and text and ignore the other stuff, and it has many documents like this, OK? So you have here the headline, whatever, international company news, something like this; it’s about many financial things and some political stuff as well. So I create the index for this. It’s called, what did they call it, trec.sample.xml. I run the indexer and put the output to trec.index, OK? I save the index now.
Now, if I open this index that I have created.

SPEAKER 2
Open with.

SPEAKER 1
Uh, why can you not open it?

SPEAKER 0
That’s annoying. Let me do it. OK, bring it here. So you can see now, I didn’t do any decent preprocessing, I should do it better, but this is the index: each term, and it starts with some numbers. How do I scroll down more? For example, here: fallacy appeared in only this document, at this position. Fallen appeared in many documents, mostly once; in this document twice, and so on. Falling: I didn’t do stemming; you should be doing stemming, OK? So now if I load this, it’s actually 1,000 documents, and I do the AND search again, but loading in this case the trec index. Now I can search for any word I want. I can search for the word Scotland, for example; it tells me these are the documents it appeared in. If I search for the word England, it appeared in more documents. What about the documents containing Scotland and England together? It’s a smaller number of documents; these are the common ones. You can check if you want, if you have time. Anything: Liverpool, Manchester.

SPEAKER 2
Chester OK, both together, Liverpool.

SPEAKER 1
Manchester, only these three documents.

SPEAKER 0
69351, which is in both, and so on. Very simple. You can see now it’s become very fast, even if this collection is 100,000 or even 10 million documents. Now I can just load it, match, and find out. If I’m doing OR, it will be the union of these lists. If I’m doing AND NOT, I subtract one list from the other, and that’s it, very simple. If I’m doing phrase search, I just do an additional check on the positions of the terms, where exactly they appear. OK, is it easy? Any questions? Yes, how are you storing the indexes?

SPEAKER 4
What? How are you storing the indexes?

SPEAKER 0
How I can use what?

SPEAKER 4
So how, how are you storing the indexes in the memory?

SPEAKER 0
I’m sorry, I cannot hear you. How am I storing the index? Currently, as I said, just in a simple text file, like the one I showed you. Yes, when I’m doing the search, I load this file. One big mistake: don’t say, I will not save it, I will index everything each time, keep it in memory, and use it for the search. No. Save it on disk, and when you are going to do the search, load just the index, not the collection. OK, it doesn’t make sense that every time I need to search, I have to index everything. When we’re doing web search, Google will not go and index everything from scratch each time, nothing like that; it just has the index ready, OK? Yes.

SPEAKER 5
What happens if the query wants to include a stop word or something?

SPEAKER 0
Yeah, I mentioned this many times last week: whatever processing you’re going to do to the documents, you have to apply exactly the same to the query. So if you are doing stemming on the documents, you should apply stemming to the query. If you are doing stop words removal, you should apply stop words removal to the query. This is one of the things you’ll be tested on: the document and the query can look like two different strings, but after the same processing they will be the same. You have to be sure about this. This is a really important thing, because, for example, take the word falling: if you did stemming, in the index it would be fall, and there will be no falling in the index. And if you take the query and search with it as it is, without applying the same stemming, you will not match anything, because it’s not there. You have to apply the same stemming, OK?
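The point above, one pipeline for both documents and queries, can be made concrete like this; the stemmer here is a toy stand-in for a real Porter stemmer, purely for illustration:

```python
import re

def preprocess(text, stop_words, stem):
    """The SAME pipeline must run on documents and on queries:
    case fold, tokenize, remove stop words, then stem."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in stop_words]

# toy stand-in for a real stemmer; use a library Porter stemmer in practice
toy_stem = lambda t: t[:-3] if t.endswith("ing") else t
stops = {"the", "is"}

doc_terms = preprocess("The ink is falling", stops, toy_stem)
query_terms = preprocess("FALLING ink", stops, toy_stem)
```

Because the query went through the identical pipeline, "FALLING" and "falling" both become "fall" and the match succeeds; skipping stemming on the query side alone would match nothing.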

SPEAKER 3
Yes, so I had a question. This is regarding the collection size that keeps increasing. Right now we know the collection, but if, let’s say for Google, the collection size keeps increasing, then how would the updates work?

SPEAKER 0
This is an excellent question. This is why indexing is done offline, OK, and there are many methods; there are big vocabularies, but what you need to do in the end is update the index every so often. When you talk about web search, some documents you will need to update every few days, and some documents you need to update every few minutes, like news websites. It makes no sense to update those every few days; it’s too late by then. You need to find the news as soon as possible, and it takes some time to do that, but the index gets updated regularly. And this is actually one of the things we always appreciate when you have it in your group project: are you applying live indexing or just one-time indexing? I gave you an example in the first lecture about this, the book reviews search: it was live indexing. And 2 years ago they did another one which was really nice, about searching song lyrics, and they were updating it live: any new song that appears, they get it directly and add it to the index. I remember they were testing it the same year a certain song came out and went viral; we searched for it and found it was there in the collection. So this is live indexing; this is something you need to update, of course, OK? Any more questions? Yes.

SPEAKER 6
Uh, the AND operator and the OR operator: is it possible to combine these two operators?

SPEAKER 0
You can combine both, of course. Remember, though, you can’t combine things arbitrarily: imagine I’m saying term 1 AND term 2 OR NOT term 3. Of course, you need some brackets in this case. You just apply one part and get the list that is matching, then take the other part and do the intersection, or union, whatever you want. It’s very simple, OK? It can be a phrase search combined with another thing, like the White House AND Trump: I need to match White House together, get the list that is matching, then find Trump alone, get the matches, and then take the intersection. OK. Yes.

SPEAKER 4
Let’s consider two phrases, falling down and fallen down. Now both falling and fallen have the same stem word.

SPEAKER 0
Yes, yes, someone’s going to search for falling down.

SPEAKER 4
There is a chance that we might give them a document which has fallen down. Yes, yes, and if we’re applying stemming, which will be

SPEAKER 0
the case in our coursework, it will match, OK. However, in general applications, if you need an exact term, quotes might be used: you put quotes around the specific words to say, I need this exact term. Some search engines will actually have two indices, one processed and stemmed, one not, to give you the exact match. Another option, if you think about it: you can do the search and then apply post-processing to check whether the term is the exact full term, but that’s not efficient; you would need another index to make it fast. OK. In the next lecture we talk about even more advanced stuff like wildcard search: if you say words that start with these three letters and then an asterisk, any term, how would you do that? We’ll talk about this in the next lecture. Any more questions? Nothing. OK, so the summary of this lecture: we covered the document vector, the term vector, the inverted index, the collection matrix. We talked about postings and posting lists, the proximity index, which is the proximity or positional index, and query processing. We talked about linear merge, but I told you, use intersection and union for now; it’s easy, you don’t have to do more than that, OK? And the resources for this: you can check both textbooks, Introduction to IR, chapter 1 and section 2.4, and textbook 2, IR in Practice, chapter 5. You should read this stuff, OK? Because I might miss some things; they have more details and more examples about everything, OK? Don’t ignore them. OK, um, we should have a break, but would you like to take a break first and then I talk about coursework 1, or talk about coursework 1 and then take the break? Who would like to talk about the coursework first? Oh, that’s a clear majority. OK, wow. OK, give me a second.

SPEAKER 2
Mhm. Just so. for 30 minutes.

SPEAKER 1
OK, coursework one.

SPEAKER 0
By the way, this is just general guidance about it; you will find the full description published tomorrow, OK? So this is just general guidance, and if you have any questions now, ask, but the full details will be published tomorrow, OK? So what is required is to implement a very simple information retrieval search engine that includes: preprocessing of text, which is tokenization, stopping, and stemming (you should have done that already); a positional inverted index (you should be doing this this week anyway); and search execution that allows Boolean search, which is part of lab 2, phrase search, which is part of this lab, proximity search, part of this lab, and ranked search, where you rank the documents according to which one is probably more relevant, which is next week’s material. OK, so by this week you can already finish 60 or 70% of the coursework, if you’d like to. And there will be a challenging question, which is: what will happen to the results when stopping is not applied? All you need to do is test it, because in the coursework you have clear instructions: apply stopping, apply stemming. The question now is, what if you didn’t apply stopping? What would happen? You need to test it and report your observations. I don’t need the results, just tell me your observations, across all of these different searches: the Boolean search, ranked IR, the speed, the index size. Give me half a page or a page discussing this part in your report, OK, just your observations. It will have 20% of the coursework mark, but I’m not expecting everyone to do it. So if you didn’t do it but finished everything else fine, you’ll get 80%. If you do it, get some results, and discuss them, you can get the full mark, OK? And here in Edinburgh, 70% is great, OK? It’s outstanding for the degree.
But this is just a question if you’d like to get the full mark, which is doable in this coursework. It depends on lectures 4, 5, and 7 (pre-processing, indexing, and ranked IR, which we will take next week) and labs 1, 2, and 3. By implementing lab 3 next week, coursework 1 is done. That’s a gift: I’m giving you 10 marks essentially for free. You get the support, and that’s fine; it’s mainly for us to be sure that you’re doing the work in this course. The deliverables are: your code, ready to run, in Python. That’s what we recommend. If you’d like to use something else, let us know. Someone told me he’s using PHP or something; you can use it, but I need to double check with the markers that they can read it, or you can always go to ChatGPT and say, this is my code in PHP, convert it to Python. That’s OK, guys, it’s fine, it’s easy. Your report: we need a report of 2 to 4 pages, not that much, which describes the modules you implemented. Tell us: I implemented this module for pre-processing which does this and that, another module for the indexing, another one for the search, one for the Boolean search, one for the ranked search. Just let us know. And it’s important to tell us why you implemented something in a specific way, and to include your discussion of the challenging question: if you didn’t apply stopping, what did you notice about the speed, the results, and so on. And what is really, really important is the search results files. We will give you the details tomorrow about the format of the output. It must be in a specific format, because I will give you queries and you will report the results for each one. No one will check it manually; it will be done automatically. So what is required is to be sure that it’s in the right format, because if it’s in the wrong format, it will not process.
That doesn’t mean that if it didn’t process we will just give you a 0. We are a bit kind: we will try to find out what the issue is and fix the format for you, but you will be penalized for this, because it took more work from the markers, OK? So just be sure about the format. The assessment considers your search results, which is automatic marking, and the quality of the report and explanation of the code; it will be a combination of both. Not highly considered at this stage is the speed of the system, unless it’s unreasonably slow; for example, if you process a query and it takes like 10 minutes, that’s too much. But if, instead of taking milliseconds, it takes 2 or 3 seconds, that’s OK, that’s fine. Also the quality of the code: we will not require it to be super commented and documented; this is just a warm-up coursework, but don’t make it rubbish. Make something that’s readable, at least somehow readable. What is allowed, what is not allowed? Allowed: using libraries for the Porter stemmer. OK, it’s OK. Don’t implement the Porter stemmer yourself; just use a library, and I think the different things you can use have already been discussed on Piazza. You can use ready code from Stack Overflow, or even generate it with ChatGPT or Claude or whatever you are using, for small optimizations, like a function for the intersection, for example. It’s OK to get something like that ready-made and use it, but not the whole thing. Don’t say, can you read this link and implement the coursework for me? No, no, no, don’t do that. Discuss some functions with your friends. You can say, how did you do the intersection part? For me it’s a bit challenging. How did you do the union? Do you have something you recommend? It’s OK to discuss things like this, even on Piazza. That’s fine. And use Piazza to ask general questions about the implementation. It’s OK.
What is not allowed, and I will raise a flag about something that happened last week (it’s fine at this stage, but not later): using libraries for tokenization or stopping. You have to implement these yourself. NLTK, for example, has its own stop words list; no, you will be given a specified stop words list, use that one. So you need to use the exact same stop words, and tokenization as well, in the way we are talking about. Also not allowed, of course, is copying code from each other. This is why, from now on, when you talk about your results for labs 2 and 3, don’t share your full code anymore. If someone asks which function you used, just give a general answer, OK, something general. But don’t share the whole code for how you do the indexing, because in this case it’s part of the coursework. So keep your code to yourself and just share the results. And sharing the results is really, really important for now, I mean for the labs, not for the coursework, because for the coursework you will be given a new set of queries and a new collection. Once you get those, don’t share anything about them. Feel free to share everything in these couple of weeks before the deadline, which is about the labs: you can compare your results. Oh, I got 5 relevant documents, you’ve got 6, what is the difference? You can discuss it and see. But once you’ve got the coursework collection and queries, don’t share anything, OK? The timeline: 1st of October, which is today, is the initial announcement; full details will be released tomorrow. 16th of October is the test set release, so you have to implement everything for the coursework and keep testing it with the collection and queries published in the lab.
OK. Test it and discuss it as you like, but the new collection will come out for you, a different set of documents and a different set of queries, just a few days before the deadline. It should take you 2 minutes to take the new collection and queries, process them with your code, and submit the results. As for writing the report, write it earlier; it doesn’t depend on these specific queries, only on the lab results. The deadline: in most courses it would be on Friday; I will just give you more time. This is the day you get the new collection, so you can submit on Thursday or Friday, but even if you didn’t finish, you have the weekend. If you don’t want to finish earlier, you can work during the weekend, and the last minute of Sunday, midnight, will be the final submission time for this coursework, OK? Some notes: it’s only 10% of the mark. The effort is a bit high, because you have to implement a lot of things this week and next week, but remember why it’s only 10%: you have full support throughout. You can discuss it on Piazza, and you can get responses from myself and the demonstrators for any questions. And when you find that the coursework doesn’t specify some detail, then it means it’s flexible; it can be either way, OK? Don’t ask, you didn’t specify, is it this or this? If I didn’t mention it, use any. It’s good practice to build the system from scratch. I know it’s not a big search engine, it’s just a few steps, something decent and small, but once you have built a search engine, it’s a nice feeling to have something working like this. 
And the next coursework, which will be after a month, I think, will mostly not be covered in the labs, and it will carry a higher mark because of what you’re going to implement. Some advice: do labs 1, 2 and 3, because if you did them, coursework 1 is really just a new collection and queries; apply your code and it will be done. Implement carefully. Write efficient and clean code, please, as much as you can, so at least you can track things; don’t give our markers a hard time, especially when your results don’t match and they have to go through your code to see what is going wrong. Don’t give them a nightmare trying to work out what is going on; make it simple, so they can say, oh, this is the problem, OK. Change something in the preprocessing and observe the change, like the stopping part; test and test and test, keep testing yourself. And keep the system you’re going to implement now: it can be a good starting point when you go to your group project, to build something more efficient and much better. Do you have any questions about this? Yes, uh, you said 2 to 4 pages.

SPEAKER 1
Yes, is that like a hard limit?

SPEAKER 0
Yeah, that’s a hard limit. So, 2 to 4 pages: it can be 2, not less than 2, and not more than 4, OK. Yes, if you have space within the 4 pages and you think it will help, that’s fine. But I prefer not to read code in the report; tell me what you actually did, OK? OK, that’s good. So we will have a break, and in the break I will share the brain teaser. You can have a break from now, it’s fine.

SPEAKER 2
July. Yeah. We told her before.

SPEAKER 0
OK, this is the brain teaser of today: math operators. Put operators around these 3 numbers in each row to make the equation hold. There are 11 equations, OK; once you get all 11, raise your hand.

SPEAKER 2
Same, same.

SPEAKER 0
No, it can be any operator. It will, of course, be different from one equation to another. It can be any operator, as long as it doesn’t introduce extra numbers, like root 3 or squaring; otherwise it can be anything, yes. OK, so once you solve all 11, raise your hands. We’ll have a 5-minute break, and I will ask at the end; if nobody gets all of them, we’ll check who got the highest number. OK, enjoy your break. OK, we are back for the 2nd lecture. I’m sure the mic is working, so we don’t have the problem we had last week. The objective in this lecture: no implementation this time, so some relief for you. What is structure? How would you do indexing with structured documents? What is an extent index? What is index compression? Data structures that can be used for efficient processing of the index, and wildcard search and its applications, how it can be useful. You’re not required to implement this as part of your coursework, but it will be highly appreciated if you implement some of this in your group project in the 2nd semester, OK? So still give attention to this. So: documents are not always flat. For example, there are some documents which have interesting metadata, like the example I showed you of Financial Times articles. They have a headline, a title, sometimes an author, a timestamp. So there is some structure there. Yes, the content itself is mostly unstructured, but it has some structure. And sometimes you have tags: these 3 or 4 words in the document are a link; the text is flat, but those words have a special tag, they have a link in them. So how can you deal with this? It could be a hashtag, for example, or a mention; how do you differentiate a word that is a hashtag from one that is not? How can we get this information included in our index? There are different options here. 
One is to neglect it, totally neglect it, depending on the application; that’s up to you. Or you can think: OK, I can create a different index for each field. For example, for news articles I create one index for the headlines only and another index for the body, and when someone searches, I can choose which part to search: are they looking in the headline only, the title only, or everything? Or there is a simpler solution, which is using an extent index. So what is an extent index? This is something we hope to see in your group projects. We create a special term for each element, field, or tag in our documents that allows a search like this. We index all terms in a structured document as plain text, as normal. However, for terms that belong to a specific field or tag, they have a link or they come in the title, we make a special additional entry in the inverted index. And the posting in this case, instead of saying the term appears at position 3 or 4 or 5, says it spans from position 3 to position 10. It’s a window instead of a single position. And it allows multiple overlapping entries, so you can have terms appearing at the same position, because one of them is not a real term, it’s just an extent term. Let’s give an example. Remember the example collection we have been indexing. Imagine that some of these terms are linking to something, they’re links, and I’d like to include the information that this word here has a link. How shall I do that? Simply, I create a new posting for links. For example, I create a new special term, not a real term, a special term called link; if the actual word link appears inside these documents, that will be a different entry. 
This is just a special term, and it tells me there are links in document 3 between positions 1 and 2, in document 4 between 1 and 4, and in document 5 between 7 and 8. It’s not a real term, just a special term. If I have a headline or a title, I can do the same thing. So with this very simple hack, I just added one new posting, and it saved a lot of information that I can use later. You might think, why would I search for something that has to be linked? But actually, in web search for example, when a specific term I’m looking for is linked to another page, it might be weighted differently than a normal term appearing in the document. So this information is worth saving. So, using extents: imagine these documents, which are similar to the ones in your lab. You’re not going to build extents there, but this is something you could use. Say you have headline information, with lecture, and the text: this is lecture 6 of the TTDS course in IR. I apply some kind of stop word removal, remove the stop words, and start to save the positions of each of these terms: term 1, 2, 3, 4, and so on. I just ignored the structure, which part is the headline and which is the body; I kept counting normally. Then someone comes and says: I’m looking for the word lecture, but in the headline. It has to be in the headline. You can do that in Google Scholar, for example: if you’re searching for papers, you can say I need this term in the title only. So what it does: it looks up the posting for lecture, which appears in document 1 at positions 3 and 4, lecture here and lecture there. But for the headline, I now have a special extent-index term for headlines saying that in document 1 it spans positions 1 to 3, in document 2 positions 1 to 5, and so on. 
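The extent postings described above can be sketched with plain Python dictionaries. This is a minimal sketch, not the required coursework format, using the toy numbers from the lecture: "lecture" at positions 3 and 4 of document 1, and document 1's headline spanning positions 1 to 3.

```python
# Minimal sketch of an extent index (toy data from the lecture example).

# Normal postings: term -> list of (doc_id, position)
postings = {"lecture": [(1, 3), (1, 4)]}

# Extent postings: field -> list of (doc_id, start, end) windows
extents = {"headline": [(1, 1, 3), (2, 1, 5)]}

def search_in_field(term, field):
    """AND a term's postings with a field's extent windows."""
    hits = []
    for doc, pos in postings.get(term, []):
        for ext_doc, start, end in extents.get(field, []):
            if doc == ext_doc and start <= pos <= end:
                hits.append((doc, pos))
    return hits

print(search_in_field("lecture", "headline"))  # [(1, 3)]: only the headline mention
```

Only the occurrence at position 3 falls inside the headline window (1, 3), which is exactly the AND of the term posting with the extent posting.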
So when I start to match, I will find that only the first mention of lecture matches, not the second one, because the second one is not inside the headline. So it’s an AND: I applied an AND operator saying it should appear in both postings, in lecture and also in headline. Yes, why not have a different index for each field? This is a good question. Imagine you have something like Google Scholar, OK? You have title, authors, institutions, abstract, introduction, conclusion, appendix. Some papers will have a different structure. How long would it take you to keep creating a separate index for each of these? And what if a term appears in several of them? So the easiest option is to keep one normal index that has everything together, and it tells you that the abstract of document 5 is between word 15 and word 45. It becomes much more efficient, and in the future, if something new appears, like an appendix or figure captions, you can just add the information without creating a whole new index. It becomes much, much more efficient to handle. Yes, this headline posting, you know, becomes

SPEAKER 3
very long; with 1 million documents in the collection, it would be

SPEAKER 0
huge. Yes, it will be huge, definitely, because this extent term is like indexing the word is; it will be everywhere. That is fine, it’s still much more efficient than the alternatives. Which raises a good question: what happens when we have millions of documents? The postings would be huge, correct? You’re playing with 1000 documents, the coursework will be maybe 5000 documents, and this is nothing. Once we have millions, what would we do? Actually, millions is not even a problem now; trillions, OK, now we start to have problems. Which brings us to index compression. Inverted indices are big. Even when there is plenty of disk space, they take a large amount of it, and I need to read the index into memory, which takes a long time. If you know something about computer systems, reading from disk is way, way slower than reading from memory. So the main purpose of index compression is to reduce the space the index takes. The format I showed you last lecture is a terrible format: it’s saved as plain text, nothing efficient at all, not even a binary file. I’m saving document number 3 as a character, which is very inefficient. If you’d like to be more efficient, you start thinking about how to compress the index so it takes less space. So instead of my index taking 10 gigabytes, which I cannot fit in memory, it takes only 1 gigabyte and fits wholly in memory, without needing to access my disk. The large size of the index comes from several things. One is the terms themselves, the dictionary. Another part is the document numbers. The question is: which do you think takes more space? 
The term entries or the document numbers? Sorry? Yeah. Who thinks it’s the terms taking more space? Who thinks it’s the document numbers? More of you say the document numbers, and yes: the dictionary is maybe 500,000 unique terms, but for each of them there is a very long list of document numbers in the postings, and this repeats for every term, so it takes a lot of space. So, is there a way to compress these document numbers so they don’t take that much storage? Let’s get introduced to something called delta encoding, which is a very simple idea. Say you have a large collection; for some term the postings say document 1, 2, 3, and keep going up to document 5 billion or something like that. We’d like to save this in a binary file, one that stores the numbers as integers, the smallest representation possible. If I have only 255 documents, then I can save each document number in one byte; one byte is enough for 255 different documents. But if I’ve got 300, one byte will not be enough; I need 2 bytes, and with 2 bytes I can store document numbers up to 65,000. If my collection is bigger, up to 4 billion, I need 4 bytes; if it turns out to be 5 billion, I need 5 bytes. Every number needs enough bytes to store it: if you’re aware of binary, this number here, 100,000, can’t be stored in one byte; it needs 3 bytes. OK? So the idea is simple: instead of saving the document number itself, what about saving the difference between my number and the document number before me? 
So instead of saying I’m starting with documents 1, 2, 3, I will say document 1, and then the next one is one step away: I will not save 2, I will save 1, one step from the previous one. It’s a list, and I go through the list anyway to find all the matching documents, so I can start with the first document and keep track of how many documents I need to skip to reach the next one. So I say document 1, then 1, which means document 2, then 5, which means document 7, and keep adding while I’m moving. I can start from the beginning and keep adding the deltas. So here: 100,002, then 100,007, I save the delta, 5. And after 100,007 comes 100,008: add 1 document, then add 3 documents, add 7 documents, and so on. So now, instead of saving these numbers in 3 bytes, I can save them in only 1 byte each. Yes, but then you lose the ability to access anywhere

SPEAKER 6
in the list in constant time because you have to go through.

SPEAKER 0
You have to go through them anyway, because if you’re looking for a term like Scotland, you have to load the whole list, find the documents, and do the intersection or union or whatever you want. So you go through them sequentially in all cases, and it’s better to just add one addition operation instead of reading the absolute number directly. Is it clear? Good, you have a question.
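The gap idea just discussed can be sketched in a few lines; this is a minimal illustration, with document numbers in the spirit of the lecture's example.

```python
# Minimal sketch of delta (gap) encoding for a postings list.

def delta_encode(doc_ids):
    """Replace each document number by its gap from the previous one."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def delta_decode(gaps):
    """Walk the gaps, accumulating back to absolute document numbers."""
    docs, total = [], 0
    for g in gaps:
        total += g
        docs.append(total)
    return docs

ids = [1, 2, 7, 100002, 100007, 100008]
print(delta_encode(ids))  # [1, 1, 5, 99995, 5, 1]
assert delta_decode(delta_encode(ids)) == ids
```

Note that most gaps are small even when the absolute numbers are large, which is what the next step (variable-length bytes) exploits.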

SPEAKER 2
Yeah, like for instance, a document appeared in only, uh,

SPEAKER 5
uh, a term only appeared in the first and last document, wouldn’t we still be, wouldn’t we still need I

SPEAKER 0
like these kinds of questions, which actually brings me to the next thing I’m going to talk about. Yes, sometimes the gap will be one document, but sometimes it will span many, many documents until you find the term again. Like this one: I keep saving one byte, then at some point 321 documents don’t contain this term before I find it again, so this delta requires 2 bytes instead of one. Sometimes it can be the first document and the last document, which might require 5 bytes. How shall I handle that? If I now have to save every delta in 2 bytes, that’s double the space; if a difference needs 3 bytes, three times the space. Which brings us to another compression technique to solve this part: v-byte encoding. So, v-byte encoding; let’s go a little lower level, to the binary behind this stuff. The thing is, I would like to save the delta in a variable size every time. If the difference is 5 documents, I save it in 1 byte. If the difference is 300 documents, 2 bytes. If the difference is 65,000, 3 bytes. How shall I do that? Well, 1 byte is 8 bits; hopefully you know binary, so I’m not speaking French for you, or German, or whatever language you don’t speak. Every byte has 8 bits. The main idea is: we use the first bit of each byte to indicate whether I’m terminating the number here, or whether I need to read the next byte to finish the number; the remaining 7 bits are the number itself. To give you an example: if I’d like to save 6, then 6 in binary is 110, but I have a full byte here. So I set the first bit to 1, which means terminate: this byte holds all the information about the delta, just read the 6. 
If it’s 127, that’s seven 1s, and again I set the leftmost bit to 1, which means terminate now. So I’m using 7 bits of the byte for the value, and one bit as an indicator to continue or not. When I have 128, 7 bits are not enough, so one byte will not do. I use two bytes: the first byte starts with 0, meaning continue, and carries 0000001; the second byte starts with 1, meaning terminate, and carries 0000000. So what happens when reading: I find a 0, so I keep reading; I find a 1, then I terminate. I drop those two indicator bits, concatenate the payloads, and the final number is 10000000, which is 128. So one bit per byte just tells me: continue reading, or terminate now. To give you an example, to see if you understand it or not: here is a real sequence of deltas, split into bytes, with the leftmost bit of each byte shown. The first one: does it say terminate or continue? Terminate. And what is the number? 5, amazing. The second one: continue or terminate? Keep reading; this is not the whole value. I continue to the next byte: terminate or continue? Terminate. So I take both payloads and combine them, and the number is 130. And the last one terminates immediately: 7. So the first number is 5, the second is 130, the last is 7. So, simply: I took the delta encoding I have, and now I know I need to jump 5 documents to find my next match, then jump 130 documents, then jump 7 documents to get my document numbers. 
So you can see it went a bit low level, but now I can save my deltas in a very efficient way. I don’t have to use a fixed number of bytes for the deltas: if a delta needs only 1 byte, I use 1; if it needs 2, I use 2; if it needs 5, I use 5. It becomes much, much more efficient. OK, who thinks this is a lot of processing that will waste time? You have to check each byte, decide whether it terminates or continues, and convert it. Do you think this is a lot of processing? Yes, it is some processing. So why are we doing it? Because it’s some processing, but it’s way faster than reading from the disk. The goal is: read less from the disk. I compress the index as much as I can so I can load as much of it as possible into my memory. This is why, last week in the practical, I showed you I was reading the file of the wiki abstracts from the compressed file itself. I don’t uncompress it and then read it; I read it compressed, using zcat, because that is actually faster. You can try it: uncompress and read the file, versus read it compressed and do the processing. It’s usually much faster to read a smaller file from the disk and process it than to read a bigger file with no processing. Fair enough. OK. So, index compression: this is just one example, but there are more complicated compression techniques that lead to more compression, like Elias gamma, for example, and that is for your reading from the textbook. More compression means less storage but more processing. But in general, the rule of thumb is: less I/O plus more processing is usually much more efficient than more I/O plus no processing. It’s usually faster. 
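The "read it compressed" point can be tried directly from Python with the standard gzip module; a small sketch (the filename is invented) that writes a tiny gzipped collection and then streams it back without ever materializing an uncompressed copy on disk:

```python
# Stream a gzipped collection directly, like zcat in the lecture demo.
import gzip

with gzip.open("collection.txt.gz", "wt", encoding="utf-8") as f:
    f.write("doc1 hello world\ndoc2 goodbye\n")

with gzip.open("collection.txt.gz", "rt", encoding="utf-8") as f:
    lines = [line.strip() for line in f]

print(lines)  # ['doc1 hello world', 'doc2 goodbye']
```

On large files the decompression cost is typically outweighed by the reduced disk I/O, which is the rule of thumb above.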
And with newer data structures the problems become less severe, but with really big collections you might need to use this, OK? I’m not expecting you to use compression in your group project, for example, but you might use an efficient data structure to save the index in an efficient way. It would be disappointing to find you saving your group-project index the same way you do it for your coursework; that is the most inefficient way, it’s just for practice, OK? Fair enough. Dictionary data structures, OK. Sometimes I cannot load everything into memory anyway: with trillions of documents, I have to load some part of it, not all of it; it’s simply impossible sometimes. If I cannot do that, the question is what to load into memory when it’s needed, how to reach the term whose postings I want quickly, and what data structure to use for the inverted index in this case. The easiest way, which I recommend you use for your coursework, is hashes, or dictionaries. You have a term, and all its postings in the structure we showed are saved as the value in the hash. Lookup is very fast, just order of 1: when you need a term, you just get it. That’s easy. The problems with it: if you have a small variant, like judgment versus judgement, you will not be able to match it; it has to be a different term. There is no prefix search, so you cannot search as you type; you know, when I’m searching for something, Google starts showing me and even suggesting things. A hash will not allow that, because I have to finish the whole term. And if the vocabulary keeps growing, it will require rehashing, which sometimes is not efficient to do. 
So hashes, or dictionaries: I expect you will probably use them for your coursework. It’s up to you, use whatever you want; they’re fine for smaller collections. But when it comes to big ones, there are other options, like binary search trees. If I have a huge vocabulary and I need to load something, then instead of waiting for the user to press enter, while the user is typing I can start loading things into memory. I can have A to M in one branch and N to Z in the other, and if they type B, I go down that branch, and I keep going while the user is typing, so I know what I’m going to load and I can load it earlier. Once they finish typing, I already have it in memory. The generalization of this is B-trees, which don’t have to be binary; they can have multiple branches. That’s another option. So while you are typing, I can go and load the branches as I move. The good thing about trees is that they solve the prefix problem: if I start writing a, b, I can start getting what I’m looking for. But of course it’s less efficient than hashes, order of log M, and balancing binary trees is sometimes expensive, especially when you get new terms: I now have a lot of terms starting with the letter Z, so how do I balance it over time? Rebalancing can take a large amount of time. However, this is related to an interesting feature in search, which would be highly appreciated if implemented in your group project: wildcard queries. I’m looking for a query that is mon and anything afterwards: all documents containing any term that starts with mon. It can be Monday, it can be monkey, it can be Montreal, money, many words like this. 
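The tree's prefix behaviour can also be mimicked with a sorted vocabulary and binary search, which is O(log M) like the tree; a minimal sketch with an invented vocabulary:

```python
# Prefix lookup over a sorted vocabulary via two binary searches.
import bisect

vocab = sorted(["monday", "money", "monkey", "montreal", "moon", "many"])

def prefix_match(prefix):
    """All vocabulary terms starting with `prefix`."""
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, prefix + "\uffff")  # just past the prefix range
    return vocab[lo:hi]

print(prefix_match("mon"))  # ['monday', 'money', 'monkey', 'montreal']
```

The sorted-array version avoids rebalancing, at the cost of expensive insertion when the vocabulary grows.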
So with binary trees, this can be useful. If I search for mon*, I go down the tree, and once I reach mon, I take everything underneath; all those branches will be matches, correct? But it has some limitations. What if the wildcard comes at the beginning? I need something that ends with mon, like common, for example. How would you match that with binary trees? Do you have a solution?

SPEAKER 2
Yeah, just make a binary tree with all the words reversed.

SPEAKER 0
You make a binary tree in the reverse order of the letters. That’s an excellent solution. Yes, in this case, if I need a word ending with something, I can make a reversed binary tree. Yes.

SPEAKER 5
So like in succession and then uh.

SPEAKER 0
But actually in this case, if you are looking for something ending with mon, in the reversed tree you search for the prefix nom; you cannot traverse the normal tree from the bottom up. So it’s solved by maintaining an additional binary tree with the terms backwards. This is good. But what about something like this, where the wildcard is in the middle of the word? I need something that starts with bro and ends with scent. Yes? It can be done that way: you get the subset from one tree and the subset from the other and do the intersection, but that can be a huge number of terms. It can be a solution, but at query time, when the user is searching, it might take too long; it’s a large amount of processing. Remember, what we are trying to do is find the result as soon as possible while the user is searching; that is our main objective. And it gets more complicated: I might need something like this AND another term, so imagine the amount of processing needed. It would be very expensive. Which brings us to something called permuterm indices, which is transforming wildcard queries so that the wildcard always appears at the end. I will create a binary tree, but I always make sure the wildcard comes at the end, so one tree in one direction is enough. How does this happen? Imagine I have the word hello, just one word. At the end of each word I put a dollar sign to indicate this is the end of the word, and then I index the word with all the rotations around this dollar sign: hello$, then I move the h to the end, ello$h, and so on. And all of these terms refer back to hello. So if I can match o$hell, the index knows it is actually looking at hello. So how do I use this later with a wildcard? 
What I will do is rotate the query until the wildcard comes at the end. I index all of these rotations in the binary tree, and when I get a wildcard query, I keep rotating it until the wildcard is at the end. Let me give you an example. If I’m searching for a specific string X with no wildcard, I just add a dollar sign and look up X$ in the tree; for example, hello becomes hello$, and I can match it directly in the binary tree. If I’m searching for a string with a wildcard at the end, X*, I add the dollar sign, X*$, then keep rotating until the wildcard is at the end: $X*. So if I search for hell followed by anything, I have hell*, add the end-of-word marker to get hell*$, then rotate until the wildcard is last, and I search for everything starting with the dollar sign then hell, that is $hell, which will match $hello; again, it refers to the same term. I know it’s a bit complicated, but I hope you get it: I just keep rotating until the wildcard is at the end. So let me ask you this question. What if I’m searching for a wildcard then a string, *X? What should I be looking for? Add the dollar sign at the end and keep rotating until the wildcard comes at the end; how does it look? Can someone say? No, no, I’m asking about this one. Yes: X, dollar sign, then the wildcard. Excellent. 
It’s X$ then the wildcard, because for something that ends with llo, I know it should end there, so I put the dollar sign at the end, *llo$, and then keep rotating: llo$*. OK, can someone get this one? If I have a string, then a wildcard, then another string, X*Y, how should it look? Yes: Y, dollar sign, X, then the wildcard. Excellent. This is exactly what happens: I’m looking for h, anything, lo; I know the dollar sign goes at the end, h*lo$, then I keep rotating until the star is last, lo$h*, and that will match my query. So whatever the query is, it’s just one binary tree, containing strange new terms which don’t make sense to us but which always lead back to the same original term I’m looking for. This enables searching with wildcards and finding all the matching terms. Is it clear now? Excellent. OK. Of course, the index size will be huge in this case, but remember this is just for the terms; I’m not talking about the documents here, just a structure for the terms themselves. Which brings us to a different representation; maybe permuterm is not the best implementation, and there is another thing that could be used: character n-gram indices. We enumerate all the character-level n-grams, not word-level, occurring in any term, and this will allow us to search for a term. So the initial idea: when I have a wildcard, I search an index of terms, not of documents, find the terms relevant to my wildcard, then use those against the document index afterwards. For example, if I have something like April is the cruelest month, I can put the terms in a bigram representation. 
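The rotation trick just described can be sketched over a tiny invented vocabulary: every rotation of term+"$" points back to the original term, and a query with one "*" is rotated until the star sits at the end, leaving an ordinary prefix lookup.

```python
# Sketch of a permuterm index (invented 3-word vocabulary).

vocab = ["hello", "hell", "help"]

permuterm = {}
for term in vocab:
    s = term + "$"
    for i in range(len(s)):
        permuterm[s[i:] + s[:i]] = term   # each rotation maps back to the term

def wildcard_lookup(query):
    """Find vocabulary terms matching a query containing exactly one '*'."""
    s = query + "$"
    while not s.endswith("*"):
        s = s[1:] + s[0]                  # rotate left until '*' is last
    prefix = s[:-1]                       # drop the star, keep the prefix
    return sorted({t for rot, t in permuterm.items() if rot.startswith(prefix)})

print(wildcard_lookup("hel*"))  # ['hell', 'hello', 'help']
print(wildcard_lookup("h*lo"))  # ['hello']
```

For "hel*" the rotated form is $hel*, so we prefix-match $hel against the rotations; for "h*lo" it is lo$h*, matching only hello's rotation lo$hel. In a real system the rotation dictionary would sit in a tree to make the prefix scan efficient.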
I put a dollar sign at the beginning of each term and one at the end, and then take all the bigrams: $a, ap, pr, ri, il, l$, and so on. The dollar sign is a special symbol just to show the boundaries of the word. I can then maintain a second inverted index from bigrams to the dictionary terms that match each bigram. So I go from character n-grams to terms, and from terms to documents. I'll give you an example to make it clearer, since I know it might not be very clear. If I'm searching for mon*, what do I find? This other index tells me that $m has all the words starting with m as its postings, mo has all the words that contain 'mo', and on has all the words that contain 'on'. This is an index over terms, not over documents: it starts from a small string and gives me all the terms that match it. So when I'm searching for mon*, I start with this index of character bigrams, find all the possible terms, and filter them — it has to be an AND across all three: the term has to start with m and contain 'mo' and contain 'on'. Then I take all the terms I found — which could be monkey, Monday, Montreal — and search for all of them in the normal index of documents with an OR operator, because any of these terms is valid, and then retrieve the documents. So searching for mon*, I query the term index for $m AND mo AND on. Sometimes there will be false positives, like 'moon': it matches the bigrams but not the pattern, so you do additional filtering, but at least most of the job is done. Then step two: you must post-filter the candidates, of course, to remove things like 'moon'.
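The candidate-generation and post-filtering steps just described can be sketched as follows (illustrative names, not the lecture's code):

```python
# Sketch of a character-bigram index over the dictionary of terms.
from collections import defaultdict

def bigrams(term):
    """Bigrams of '$' + term + '$', e.g. 'mon' -> ['$m', 'mo', 'on', 'n$']."""
    t = "$" + term + "$"
    return [t[i:i + 2] for i in range(len(t) - 1)]

def build_kgram_index(dictionary):
    index = defaultdict(set)
    for term in dictionary:
        for bg in bigrams(term):
            index[bg].add(term)
    return index

def wildcard_candidates(index, query):
    """Resolve a trailing-wildcard query like 'mon*' to matching dictionary terms."""
    prefix = query.rstrip("*")
    grams = ["$" + prefix[0]] + [prefix[i:i + 2] for i in range(len(prefix) - 1)]
    candidates = set.intersection(*[index[g] for g in grams])  # AND over postings
    # Step 2: post-filter false positives such as 'moon' ($m, mo, on all match)
    return sorted(t for t in candidates if t.startswith(prefix))

idx = build_kgram_index(["monkey", "monday", "montreal", "moon", "map"])
print(wildcard_candidates(idx, "mon*"))   # ['monday', 'monkey', 'montreal']
```

The surviving terms would then be OR-ed together against the normal document index.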
Then with the surviving terms — monkey, Montreal, money, monastery — you put an OR operator, search for all of these terms, and retrieve the documents. The problem here is that a wildcard can expand into many, many possible terms, and that becomes very expensive. Why will it be very expensive? Which part of this process do you think will take the most time? Step 3? OK, who thinks it's step 1? Step 2? OK, step 3. Why do you think it's step 2 — because it has some filtering, you have to check that the word actually matches the pattern? And who thinks it's step 3? Can you say why you think it's step 3?

SPEAKER 7
Because for all the words we find, we have to load all the occurrences and match the query against each of those words.

SPEAKER 0
Excellent — it is step 3, the right answer, because you have to load the postings for all of these terms. If you end up with 100 terms, you have to load the postings for all of them and keep doing the merge operations among them. The first and second steps can still be handled with a simple regular expression, but the third one actually loads all the candidate terms from the real index of documents and matches them. In fact, if you try Google itself with a very long query, it will tell you it is only searching for part of it, because a very long query takes a huge amount of time: a query of two terms takes roughly double the time of a query of one term, so it's linear, and a long query takes a long time. This is why it's important to think about this. But this kind of index has some interesting applications, like spelling correction. For example, when the user types something and Google asks, 'Did you mean...?' — how is that done? There are different ways, of course, but one of them is using this character n-gram indexing of terms. If I search for some term and it turns out that very few documents contain it — instead of 1 million documents, only 10 — Google will ask whether you are sure you didn't mean something else, while still searching for the one you typed. You have definitely seen this. You can say, no, I'm looking for this specific one, or no, I meant the other term, and it corrects you. They know from how many documents match. The other question is: how do they know which term you were potentially looking for? One of the solutions is to check which terms are closest to this one and have many results — for example, suppose I typed some misspelled term.
There are no documents containing it, because it's clearly a misspelling. Now the intended word could be 'elegant' or 'elephant'. If I use this bigram character index, I can see that these two terms have the most bigram matches with my original term, so I can suggest one or the other to the user. OK. And it actually has other applications for other languages, too. It's very useful for languages that are highly inflectional, like Arabic, where we said stemming helps a lot: if you don't have a stemmer for a language, you can use character n-grams for indexing. For example, take two different inflected forms of a word: if you split them into bigrams, you can still match a large part of the word itself. It also helps with documents that themselves contain misspellings — the document, not the query. And you don't always have to use only bigrams; you can use a combination of bigrams and trigrams. OK? So, to summarize, because I took a long time today: an index can be multilayered — as there was for structured documents, the extent index, with links and structured headlines. And an index does not have to be formed of words; sometimes you can use character n-grams for indexing. I know it looks strange inside the index, but the users will not see it — remember, we said the user never sees the index; it just matches better documents. And usually, if you have things like wildcards or misspellings, you can have two indices: one for finding the potential terms the user might be searching for, via the wildcard or the misspelling, and then you take the candidate terms from it and use the document index to search for them.
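The bigram-overlap idea behind "did you mean" can be sketched as follows (a toy sketch: the vocabulary, threshold, and misspelling are made up, and real engines also combine this with edit distance and query-log statistics):

```python
# Rank dictionary terms by character-bigram overlap (Jaccard) with a misspelling.

def bigrams(term):
    t = "$" + term + "$"
    return {t[i:i + 2] for i in range(len(t) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def suggest(misspelled, dictionary, threshold=0.3):
    """Return dictionary terms whose bigram overlap with the query is high enough."""
    q = bigrams(misspelled)
    scored = [(jaccard(q, bigrams(t)), t) for t in dictionary]
    return [t for score, t in sorted(scored, reverse=True) if score >= threshold]

vocab = ["elegant", "elephant", "month", "card"]
print(suggest("elefant", vocab))   # ['elegant', 'elephant']
```

Both plausible corrections survive the threshold, so either could be offered to the user.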
For resources, you can check Chapter 3 in Introduction to Information Retrieval, Sections 3.1 to 3.4, and Chapter 5, which is part of the same material mentioned earlier, for textbook 2. That's all for today. I'm sorry if it was a lot of information, but as we said, the easy stuff is gone — now we're starting the real stuff. And please do try the lab as soon as possible and share your results. Don't share your code — general ideas about the code are fine, but not your full code — and share the results of lab one: say which documents you retrieved and compare them with each other, OK? And best of luck.

Lecture 2:

SPEAKER 0
OK. Hi, everyone.

SPEAKER 2
Welcome to the second week, for more lectures in Text Technologies for Data Science. We're going to talk a little bit about some laws of text in this lecture, and some pre-processing steps in the next one.

SPEAKER 1
But before the lecture itself — how was Lab Zero? Anyone tried it? Was it trivial? If you found it trivial, that's great. If you found it challenging, again, think twice. There was no support for the lab last week because it's just something for yourself, to see whether you can do it or not.

SPEAKER 2
This week, actually, we start the real labs of the course, so you will be practicing everything we take in this lecture. So whatever you're going to take in this lecture — oh, thank you.

SPEAKER 3
I didn't know that it wasn't on. Oh, sorry — thanks for letting me know. You should stop me; don't be kind, just shout, don't worry.

SPEAKER 1
OK, so this is your lab one. This week's lab is important for everyone, so it's important that you try it as soon as you can. Actually, if you go home and try it today, you can finish it today, so it should be fine. Hopefully it will be fun to try as well. I'll be testing quick things with you in the lectures today, so try to implement it directly after the lectures. I would say during this week would be perfect, before the second lecture — next week's lecture is really, really important.

SPEAKER 0
Ask questions and share results over Piazza — I think everyone is on Piazza now. Once you manage to get some results, share them on Piazza directly, and if you hit challenges, ask questions on Piazza as well; our demonstrators will go and respond when you need any support. The lab session next Tuesday is only for if you tried everything, it's not working, and you need some support in person — in that case, go to the lab. So I hope, if we are successful, we will not find anyone in the lab next Tuesday. OK? That's what we're looking for.

SPEAKER 1
So join Piazza, please, and engage with this lecture. This is another reminder: as we said, the course is not about just knowing how a search engine works — you can read about that, you can watch some videos, and they'd probably be more entertaining than my lectures. What we really care about in this course is gaining some skills, and these are some of the skills we discussed last time: working with large textual data; a few shell commands, so that when you have some textual data you can explore it quickly; getting used to Python and regular expressions, which help you parse text quickly; and teamwork. What we're planning to do today is start gaining skills related to the first three. So hopefully you'll be building some of these skills. You might already have them, which is great — you'll be faster at doing things — but if you don't, it's a good opportunity to learn. So the objective of this lecture is learning about some of the laws of text, like Zipf's law.

SPEAKER 4
Who has heard about Zipf's law before? Wow — I can skip this lecture, then. Benford's law? OK, something new there. Heaps' law? OK. Clumping and contagion in text? Probably not many — OK, so there is still something new in this lecture. And this lecture is a bit practical, so we will be testing whatever we claim in the lecture, and you can try it yourself if you'd like to practice along with me. If you have your laptop with you, you can download this file — you can find the link in lab one on the course page. It's just a copy of the Bible in text form; we'll be using it to demonstrate what we're teaching. You can use Python if you want, or you can use Excel to draw some graphs, or draw the graphs with Python itself. Only if you'd like to try — you don't have to; you'll do it in the lab anyway. I'm just saying, in case you'd like to try while we're speaking now.

SPEAKER 4
OK, let's start: the nature of words. This course is mainly about text, and we can say the basic unit of any text is the word — text is formed of different words. And there are some general characteristics of words, observed in how we use them, that are interestingly very, very consistent across languages and across domains. Whether you're speaking English, Chinese, Arabic, or German, in the end you will have these same characteristics in common. And interestingly, the same holds across domains: whether you're talking about text technology, medicine, or entertainment, it still applies. So across different domains and languages, these laws or characteristics are very commonly seen.

SPEAKER 4
For example, the frequency of words: some words are very, very frequent in a language. Speaking about English, words like 'the', 'of', 'to' — it's very common to see these words in a sentence. Other words are much less frequent, like 'schizophrenia': you might see them, but not commonly — you might spend a year without coming across any of them. And interestingly, in any collection — a collection of news articles, of books, of whatever you have — roughly, most of the time, half of the unique words will appear only once. They appear once and you never see them again in the collection. This is very common across languages as well.

SPEAKER 0
And if we try to plot the frequency of the words against their rank — this is the most frequent word and how many times it appears, the second most frequent word and how many times it appeared — it leads us to something like this: a few words appear a much larger number of times, and most of the words appear very, very few times. And if you plot this on a log scale — log scale for the rank and log scale for the frequency — you get something close to a straight line, going down a slope like this.

SPEAKER 4
So this is simply Zipf's law, which most of you know about. For a given collection of text, rank the unique terms according to their frequency; then multiplying the rank by the probability of seeing the term (or by its frequency) gives us something close to a constant. It's not exactly a constant, but a number that stays within a range — not something that fluctuates a lot. And since the probability is then a function of one over the rank, this is why we see this kind of decay.
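The rank-times-frequency check can be sketched in a few lines (a toy example — the sample sentence is far too small to really show the law, which needs a large collection like the lab's Wikipedia abstracts):

```python
# Rank terms by frequency and compute rank * frequency, which Zipf's law
# says should stay within a narrow range for real text.
from collections import Counter

def rank_times_freq(text):
    counts = Counter(text.lower().split())
    ranked = counts.most_common()                      # [(term, freq)] by frequency
    return [(r, term, freq, r * freq)
            for r, (term, freq) in enumerate(ranked, start=1)]

sample = "the cat and the dog and the bird saw the cat"
for rank, term, freq, product in rank_times_freq(sample)[:3]:
    print(rank, term, freq, product)
```

On the lab's 90-million-word collection, the products for the top terms cluster around a near-constant value instead of swinging by orders of magnitude.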

SPEAKER 4
So this is just an example on a collection — the Wikipedia abstracts you will be using in your lab one. If we ask which words appear most frequently in this big collection — 3.5 million English abstracts, around 90 million words — well, this is not that big; you'll be working with bigger stuff, but it's just a warm-up. What are the most frequent words in this collection? It turned out to be these words, which are the common words we usually see in the language. This is the rank, and this is how many times the word appears, and if you multiply the rank by the frequency, you get something like this. If you notice the numbers here: yes, it's not a constant, but it ranges between roughly 5 and 10 million. You don't find sometimes 5 million, sometimes 100,000, sometimes 100 million — no, it stays within a range, close to a constant, and that's the interesting thing. So this means the most frequent word in any collection will probably appear about twice as many times as the second most frequent word, three times the third most frequent, and so on, and about ten times as often as the tenth word. And interestingly, when we observe this across languages, it is very common — the same thing holds.

SPEAKER 4
And these are the words that are functional to a sentence — you cannot construct a sentence without them. So here's a fun exercise you can try now; it will take 30 seconds. Everyone, try to speak to the person beside you and make a conversation without using any of these words. Can you try now? Find someone beside you and try to speak to each other without using any of the words here. Oh, by the way — speak in English, OK? Don't play smart.

SPEAKER 5
I. I. It will be. Um, yeah.

SPEAKER 4
Anyone managed to do it? Kind of? It's really hard. So this is the point: one of the laws of text in general, of whatever language we speak, is that some terms are so frequent that you cannot even construct a meaningful 30-second conversation without using any of them. And if you speak a different language, try the same thing. So this is the main idea of Zipf's law. Let's do a quick practical, because you've probably heard about Zipf's law, but let's run a test and see it. We'll do a quick experiment using the Bible and also the Wikipedia abstracts: one contains around 800,000 words, the other around 80 million words. Let's see whether the law holds, and this is the size of the text: as files in textual format, one is 4 megabytes, the other around 0.5 gigabytes.

SPEAKER 0
So let me switch here. I'll be using some shell commands — you might want to follow up on these, because it's useful to learn about shell commands in general. So let's try this. One of the very common shell commands you need to learn is cat, which reads a text file from disk. So here I have the collection, and I have the Bible text, and there is also more, which keeps showing me the results in the terminal page by page.

SPEAKER 0
So if I do that, this is simply the text of the Bible. You can see it has verses in each chapter, and this is how it goes. So let's try to count the most frequent words in it. I don't have to write code for this; I can still use shell commands. For example, one thing I need to do is count uppercase and lowercase as the same, so I can lowercase everything. One interesting command is translate, tr: I translate anything that is uppercase to a lowercase letter — that's simply what it does. If I pipe to more, as you can see, everything is now lowercase. OK. What I can do next: I can see there are some punctuation marks here, like dots, commas, and colons, so I translate this stuff too — for example, I translate the comma, semicolon, apostrophe, and dot to a space. If I check how it looks now, this is it. I'm processing text with shell commands; I'm not writing any code. This is something you learn in the course: if you've got a collection, you need to explore it quickly, and you learn these shell commands to do it quickly. OK.

SPEAKER 4
What I can do next is put every word on a new line, so I translate spaces to a newline, which is \n. If I check with more, this is how it looks. Now, to find the most frequent words, I can sort them with the sort command, which sorts alphabetically. Then I can run uniq minus c, which tells me the count of each unique word — how many times it was seen. Then I sort them again by frequency, in this case with minus n minus r, to show the most frequent ones at the top. Before I run this —

SPEAKER 4
This is the Bible. What do you think would be the most frequent words here?

SPEAKER 5
God, mhm.

SPEAKER 1
The, OK.

SPEAKER 6
The same words as Wikipedia? OK.

SPEAKER 1
Any other words? Any others? Yeah — you might find 'Jesus', for example.

SPEAKER 0
'God'? Let's check what happens if I run it.

SPEAKER 1
These are actually the most frequent words: 'the', 'and', 'of', 'to', 'that'. You start to find some words like 'shall', which is used more in the old way of writing, and 'unto' here, and then the most relevant word to this collection, 'Lord', comes a little bit down the list. So it turns out that, yes, the functional words continue to be the same, but of course in a specific domain like this you also start to find some related words. If I scroll with more — anything relevant to the topic here? It's all functional words again. Maybe the word 'ye' is interesting, because we don't use it now, but it's common in that old text. Nothing so far... 'Israel' comes here — something relevant — and 'king'. So far, no 'Jesus'. 'Children' appears before it; only later down the list do you start to find things related to the topic. So most of the frequent words are the function words used to construct a text.
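The whole shell pipeline — lowercase, strip punctuation, split, count, sort by frequency — can also be mirrored in a few lines of Python (a sketch; the punctuation handling is simplified compared to the tr commands above, and the filename in the commented usage is the lecture's Bible file):

```python
# Python equivalent of: tr upper->lower | tr punct->space | tr ' ' '\n'
#                       | sort | uniq -c | sort -nr | head
import re
from collections import Counter

def top_words(text, n=10):
    # lowercase, turn punctuation into spaces, split on whitespace, count
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return Counter(words).most_common(n)

# for term, freq in top_words(open("bible.txt", encoding="utf-8").read()):
#     print(freq, term)
```

Counter.most_common does the sorting-by-frequency step that took two sort invocations in the shell.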

SPEAKER 4
So I created a quick thing — bible text: a simple function just to print this into a file. I just put them in a file; bad naming, but that's OK. I put everything sorted by frequency, and it tells me how many words there are: around 857,000 words, with 13,000 unique ones.

SPEAKER 3
So I go there to... not here, sorry... oh, here you are. I think it's shared here — where is it? Oh man. Slides. OK, this is the file; let me open it with this one. So these are the words, sorted.

SPEAKER 2
OK, let me take these and draw them in an Excel sheet — I won't use anything fancy. You can draw it with Python; please make something decent, but this is just for demonstration. It's a bit small; let me zoom in a little. So if I paste the data here, you can see these are the words and how many times each appeared. Let me move this one here. So this is the rank: this is rank 1, this is the next rank, and I just keep incrementing. OK.

SPEAKER 1
So you can see how many terms appeared once — many, many terms appeared once. I can keep scrolling, scrolling, scrolling: many terms appeared once. And this is very common in any collection, even a very closed domain like the Bible, a religious text like this. So now, if I'd like to plot the frequency versus the rank, I can insert a plot like this — this is how it looks. Let me expand it so you can see it. OK.

SPEAKER 4
Few words appeared a lot of times; most of the words appeared very, very few times. And remember, we claimed that if I use a log scale, I will start to see a straight line. Let's set log scale here, and log scale for the frequencies as well. It's not exactly a straight line, but it's close to a straight line. OK — but this is the Bible, a very closed text. Let's do something bigger.

SPEAKER 0
So in this case I have the Wikipedia abstracts in one file, but it's compressed: I put it in a GZ file, compressed with gzip. There's another nice, interesting shell command you might learn about, gzcat, which reads a compressed file directly — you don't have to uncompress it. This is something you learn over time: you need to save space, and if you keep all your files fully uncompressed, they take a lot of space. So you can always compress them — down to around 10% of the space — and read them directly with things like gzcat. Actually, it can even be faster.
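Reading a compressed file on the fly, the way gzcat does, looks like this in Python (the filename in the commented usage is illustrative):

```python
# Stream lines out of a .gz file without unpacking it on disk.
import gzip

def iter_lines(path):
    # mode "rt" decompresses and decodes on the fly
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

# for line in iter_lines("abstracts_wiki.txt.gz"):   # filename illustrative
#     ...
```

Because it streams, memory use stays flat no matter how large the collection is.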

SPEAKER 1
So let me open — I think it's called abstracts wikipedia, the compressed text. This is how it looks: this is much more text. Now, if I go and check the most frequent terms, remember I'm now processing 80 million words. So I wrote something to count all this and print it to a file — let's call it wiki.text. OK.

SPEAKER 0
It will take a little bit of time. I don't have a GPU on my machine, just a CPU, but it should still be fast, because you need to process text very quickly. What it's doing now is reading the 80 million words, counting how many times every word appeared, and then at the end printing the words sorted by frequency. Here you go — it processed the 80 million words in about 30 seconds. This is what we expect from you: the skills to write code that processes text very quickly.

SPEAKER 2
Now let me check how this file looks. It's called wiki.text — that's what we called it. If I open it, these are the terms we got. Let me copy this and go to Excel. By the way, remember this came from the shell command — how many unique words are we talking about? 1.3 million unique words. That's far more: the other collection had only 13,000.

SPEAKER 1
If I go to the Excel sheet, I'd like to plot it again. Again, let me zoom in so you can see it better. Sorry — when I paste it like this, Excel can't take over a million rows, so it just pastes whatever it can; that's fine, it just takes some time. Done. I'll do the same thing again: remove this column on the side, set the ranks — this is rank 1, this is rank 2 — and then I can calculate it. It takes time because it's processing a large amount — it managed to paste 1 million rows. Let's plot it again and see how it looks. Maybe I'm a liar — maybe we're telling you stuff that doesn't happen in real life. So insert again — where's the insert? — the same graph. It takes time to process because it's a large amount of text. Here you are.

SPEAKER 2
Does it look familiar? Even expanding it takes time... here we are. Let's do the log-log and see how it looks. It takes time to process — come on — and this is an old machine I have. And this is the Wikipedia abstracts, 80 million words, which will show in a second. Here you are: a straight line again. Whatever collection you have, whatever language, try it — it will be the same, a straight line. So: few words appear a very large number of times, most of the words appear far fewer times, and many, many words appear only once — usually about half of them appear only once. OK? So this is simply Zipf's law; this is what it tells us. Let's continue with the slides.

SPEAKER 1
And this is what you will be processing in your lab. Another interesting thing: think about these frequencies — each word and how many times it appeared. If I asked you about the distribution of the first digit of these frequencies — if I took only the first digit of each number and plotted its distribution, from 1 to 9 of course, since you don't write a leading 0 — what do you think it would be? Uniform, exponential decay, or normal? Are you expecting these digits to appear about the same number of times, a uniform distribution? Or one digit appearing more than the others and then decaying? Or actually something more like a normal distribution, where 5 would appear the most? Which one do you think? Yes?

SPEAKER 7
Decay — because the frequencies are dominated by the tail of the distribution: there are many words that appeared just once, a bunch that appeared twice, and they overshadow the rest.

SPEAKER 2
That's a good point. Anyone have a different opinion? Yes?

SPEAKER 5
I agree with them, but I'm not sure it will start with 1. It may start with another digit.

SPEAKER 0
So it will be decaying, but maybe not starting from 1 — in that case it might look more like a normal distribution. What do you think?

SPEAKER 6
OK, let me think. When you did the tokenization and splitting of the Bible, the numbers 1, 2, 3, 4, 5 were there in the frequency list, in that rank order.

SPEAKER 4
Yeah, but I really don't care about the numbers inside the text itself — I care about the frequencies.

SPEAKER 6
I'm just saying that when you did that, the frequencies of the number tokens came out in the right order of magnitude.

SPEAKER 1
OK, any other ideas? Which one do you think? OK, let's actually vote. Who thinks it will be uniform? Raise your hand. One, OK. Who thinks it will be exponential decay starting from 1? Most of you. Who thinks it will be a normal or random distribution? OK, a couple — that's fine.

SPEAKER 0
Actually, this is called Benford’s law.

SPEAKER 4
It’s actually the first digit of a number follows a zip uh kind of a zip flow as well.

SPEAKER 0
Like and this is actually interesting, it applies to text,

SPEAKER 4
but it applies to many other things as well. Like physical constants.

SPEAKER 0
If you get all the physical constants you have, uh, like the pi and different numbers, the first digit, it

SPEAKER 4
will have this actually. Energy bills of all the people living in the UK. Take the first digit, it will be the same. Population number of different countries, it will be the same. And this is simply what it tells you. You will find actually that 1 appears way more than 2 and 2 more than 3, and so on.

SPEAKER 0
And one of the actually, the, the answers for this,

SPEAKER 4
because, for example, with term frequencies you can find many terms, half of which appear once, so you see the number 1 a lot to begin with. But also think about counting up: if you go 1, 2, 3, 4, 5, then you reach 10, and on the way to 20 you will see 1 a lot, because 11, 12, 13, 14 still have a first

SPEAKER 0
digit of 1, until you reach 20, and then

SPEAKER 4
when you reach 100, the first digit stays 1 until you get to 200. So this is why 1 would usually appear

SPEAKER 0
more in this case.

SPEAKER 4
And if you need a proof, Why not?

SPEAKER 0
Let’s try it.

SPEAKER 1
So remember we printed this one, so actually I think

SPEAKER 2
I printed it.

SPEAKER 3
Actually, I called it a.txt.

SPEAKER 2
Yeah, at least there's a frequency. So I can actually cut the text; in this case I'm interested in the 2nd column only, so I use the shell command cut.

SPEAKER 0
Actually, whatever shell commands I'm using now, have a look

SPEAKER 2
on it and try to use it because it will be a good skill to have. I’m interested in the 2nd column, so if I check

SPEAKER 3
this one, it will be only the numbers.

SPEAKER 2
That's great. You don't have to use it the same way I'm doing it, but actually I'm using Perl inline, so

SPEAKER 3
you don't have to do it that way: -p -e, and then in this case I use a regular expression to substitute the first digit plus anything, \d plus, with just the first digit only.

SPEAKER 2
More.

SPEAKER 1
So now I’m just actually getting only the first digit.

SPEAKER 2
You can see it’s here, it’s a twos and ones

SPEAKER 3
and so on. I'm starting to notice it already, but to really check it, I can do sort, then uniq -c, and then sort again in reverse.

SPEAKER 2
And here you can see 1 is appearing almost twice

SPEAKER 1
as often as 2, and then 3, and it's decaying down.
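The shell pipeline just shown (cut the frequency column, keep each number's first digit, then sort | uniq -c | sort -rn) can be sketched in Python. The frequencies below are made up for illustration, not the real Bible counts:

```python
from collections import Counter

def first_digit_counts(frequencies):
    r"""Tally leading digits, mirroring the shell pipeline:
    cut -f2 | perl -pe 's/^(\d)\d*$/\1/' | sort | uniq -c | sort -rn"""
    return Counter(str(f)[0] for f in frequencies if f > 0)

# Hypothetical term frequencies, just to show the shape of the output:
freqs = [64023, 51696, 14294, 107, 13, 2, 2, 1, 1, 1]
print(first_digit_counts(freqs).most_common())
```

On real frequency data, the counts decay from digit 1 downward, as Benford's law predicts.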

SPEAKER 0
It's actually true, and you can prove it yourself with the collections we have as well. This is the second law, just an interesting natural phenomenon about

SPEAKER 1
Benford’s law here, and actually it applies also to text and the frequencies of the text. OK, this is we did the practical.

SPEAKER 0
Heaps' law, the third law of text.

SPEAKER 4
What it says is that, while going through a document or a big collection, the number of new terms you notice starts to reduce over time, because probably you have seen them before. So for a big collection, if you are counting how many terms you have seen so far, N, and how many of them are actually unique, which we call the vocabulary size V, it says there is an equation you can use: the vocabulary size can be nicely estimated as a constant K multiplied by the number of terms you have seen, to the power of another constant, V = K * N^b, and this other constant b is usually between 0.4 and 0.7.
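The equation just described can be written as a tiny Python helper. The constants used here (K = 4, b = 0.6) are illustrative only; the real values have to be fitted per collection:

```python
def heaps(n, k, b):
    """Heaps' law: estimated vocabulary size V = k * n**b for n tokens read."""
    return k * n ** b

# With these made-up constants, a million tokens gives roughly 16,000 unique terms:
print(round(heaps(1_000_000, k=4, b=0.6)))
```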

SPEAKER 0
OK, and it will show something like this.

SPEAKER 4
It goes toward saturation at a certain point.

SPEAKER 0
So it’s, and this is very expected because once you

SPEAKER 4
see, when you read something, the first term, of course

SPEAKER 0
it’s the first time to see it. Second term, probably new.

SPEAKER 4
3rd. 4th, I might have seen this before because the word

SPEAKER 0
that appeared again, so it’s not new.

SPEAKER 4
And as you go to the text, probably I, I’ll start to see more terms I have seen before and less terms that I haven’t seen before.

SPEAKER 0
The interesting thing about it is this is actually how

SPEAKER 4
it would look like with the Wikipedia abstracts. You can see it's slowing down but still keeps increasing. So the question is: this is now 80 million words, the distribution of the Wikipedia abstracts, 80 million words, and it is still growing. I still see a lot of new terms. But if you think about it, OK, once I have read a lot, a lot of things, I should reach the point where there is nothing I haven't seen before. I definitely have seen this stuff before, but it still keeps growing. So the question is, why does it keep growing? Why do I still find new terms? Isn't the language limited in the end? It is English only. We found 1.3 million unique terms in the Wikipedia abstracts, and it keeps growing. Why is this happening? Why do I keep finding new terms coming in?

SPEAKER 0
Think about it and share your thoughts. I will not move until you give me your thoughts, by the way. Yes? Uh, because of the nature of

SPEAKER 8
the, I think, the, the, the corpus that we chose, such as Wikipedia, maybe it has different, uh, genres in different categories, and each of them has new words. But if you were to choose a kind of, uh, a limited corpus, maybe.

SPEAKER 4
What do you think about the Bible?

SPEAKER 0
Would it reach saturation at a certain point?

SPEAKER 8
It would be more saturated than this because the, the nature of Wikipedia has more topics and more journals than limited scope.

SPEAKER 1
OK, but do you think it’s for, for some situation

SPEAKER 0
it would reach full saturation? That you will not see new terms? So why will you keep finding new stuff?

SPEAKER 3
Yes, the natural sparsity of language.

SPEAKER 2
What does it mean? So there’s a lot of words that show up once

SPEAKER 6
and they’re all over the place, but do you think

SPEAKER 1
at some point I will have seen everything, almost everything,

SPEAKER 2
and not keep seeing a lot of new things?

SPEAKER 5
Yes. New words are coined regularly, so you’ll always see new

SPEAKER 7
words in text eventually because they’ll just be words that hadn’t existed before that time.

SPEAKER 1
So new, new words will be invented, for example.

SPEAKER 0
But is it actually to this extent to keep growing

SPEAKER 1
that fast?

SPEAKER 2
Yes, some words are very uncommon.

SPEAKER 7
The chance of having seen every possible word is

SPEAKER 5
low. OK, this is close, but I need the

SPEAKER 0
right answer.

SPEAKER 4
Think about it more. Yes, I think there’s words. Forget about Wikipedia. Actually, if you apply this to any other corpus, it

SPEAKER 1
will keep growing.

SPEAKER 0
Yes. Yeah, uh, I think the more you read, the more

SPEAKER 9
you will definitely find new words because, uh, OK, the, the, the words that you’ve seen before, they will become familiar. You will keep seeing them, but then you will see something else new because other, uh, otherwise, why are we still reading because there’s something like new addition to what

SPEAKER 5
we read, OK.

SPEAKER 4
All the answers I have heard so far are OK, but they have one limitation. You're thinking about words of the language you are speaking. But there are different kinds of words; it doesn't have to be the function words or verbs we are talking about. Like your names. Like your email, like spelling mistakes. Like codes. What is the code of this course?

SPEAKER 0
INF 11045.

SPEAKER 4
These are, for you, a code, but for the system, it's a new word. So it will keep happening. So think about spelling errors, names, emails, codes, all of this stuff; each is a new word. This is why it will always keep growing. And for most collections, this law actually applies; it's just about estimating the correct K and b constants to be able to fit it. Sometimes, when you have a very, very small

SPEAKER 0
collection, like taking just 10 articles

SPEAKER 1
from CNN.

SPEAKER 0
I try to estimate how it would look over the whole CNN archive. That's really small, but when you use a bit more data, it becomes more accurate.

SPEAKER 4
Let's do a practical here to prove the point and actually understand why we are doing so, OK. So, uh, here I created a

SPEAKER 1
little Perl script to

SPEAKER 2
just do something. I called it, what did I call it? I think heaps.

SPEAKER 3
Uh, oh, OK, actually, this is the Bible. I kept the Bible. I created a script; it's called, I think, heaps.pl.

SPEAKER 0
Actually, what it does is very simple.

SPEAKER 4
It keeps counting how many terms I have read so far and how many unique terms I have found so

SPEAKER 1
far.

SPEAKER 0
And it prints it every time I find 100 new terms. So if I run it, you can see, it tells me: I read 333 terms in the Bible, and I found 100 unique words. And until I found 200 new words, I had

SPEAKER 4
to read 933, and, and so on.

SPEAKER 0
So it tells me what is an N, what is a V.

SPEAKER 4
So after 333, the vocabulary is 100, after 933, the

SPEAKER 0
vocabulary is 200, and so on, and I keep going. So remember the total number of unique words was 13,000.
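The heaps.pl script described above can be sketched in Python. A simple \w+ tokenizer stands in here for whatever tokenization the original script used, and the step size is lowered so a toy text produces output:

```python
import re

def heaps_curve(text, step=100):
    """Track (tokens read N, vocabulary size V) while scanning a text,
    recording a point every `step` new unique terms, like the lecture's heaps.pl."""
    seen = set()
    points = []
    n = 0
    for token in re.findall(r"\w+", text.lower()):
        n += 1
        if token not in seen:
            seen.add(token)
            if len(seen) % step == 0:
                points.append((n, len(seen)))
    return points

# Tiny illustrative run:
print(heaps_curve("to be or not to be that is the question", step=2))
```

Run on the Bible with step=100, the first points would correspond to the 333-terms-for-100-words, 933-for-200 pattern mentioned above.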

SPEAKER 1
So I had to read the whole Bible. I found 300,000 terms. So if I did that, let me print it again

SPEAKER 3
to the a.txt file.

SPEAKER 2
OK.

SPEAKER 1
And let me calculate it actually also for the other

SPEAKER 3
one. So I actually run heaps.pl on the Wikipedia abstracts collection and print it. Let me call it b.txt for now. And just keep it calculating as well.

SPEAKER 2
So if I went and opened this file.

SPEAKER 3
I printed it to A, yeah. I just told you, here you are.

SPEAKER 2
These are the numbers. I will take it.

SPEAKER 3
Let me go to an Excel sheet and try to plot it. This is too heavy. Let me actually close it so I don't get stuck. OK, let me open a new one. Again, it's too small. Let me expand it a little bit, make it bigger.

SPEAKER 2
OK, so remember we have N and we have V, and these are the numbers we got. This is N and V every time.

SPEAKER 1
And we said we have K and we have B

SPEAKER 2
constants.

SPEAKER 1
And I try to actually estimate Heaps' law here.

SPEAKER 2
This is actually the estimated one. Let's start with initial values: make K 2, for example, and let's make

SPEAKER 3
b 0.7, for example.

SPEAKER 1
What I can do now. I can try to calculate it.

SPEAKER 2
We set the function. Let me go back to this and see the function, not here.

SPEAKER 0
This is the function: V equals K multiplied by N to the power of

SPEAKER 2
B. So if I went back, I heaps, so in this

SPEAKER 3
case, I can say, I can calculate this based on

SPEAKER 2
N, I’m sorry, based on actually K. Multiplied by n to the power of b.

SPEAKER 3
But actually in this case I need to just make it constant so it doesn’t change when I move the line.

SPEAKER 2
So now you can see if I just calculated K

SPEAKER 0
as just estimated K as 2 and B as 0.7,

SPEAKER 2
I will get something like this. Let me go here and try to actually apply it

SPEAKER 3
for all the terms. Uh, here you are, oops. Go back, run it.

SPEAKER 2
And let me plot this.

SPEAKER 3
Insert. Again Uh, it shouldn’t be like this, yes. Here So the actual number is blue, OK.

SPEAKER 1
The estimated one is uh orange.

SPEAKER 0
So it’s not fitting well. So actually what I can do is try to change

SPEAKER 4
the values to see if it fits better.

SPEAKER 1
So actually make it 0.6.

SPEAKER 2
It went much lower. What about increasing K to 4?

SPEAKER 1
Mm, a bit better; the estimate is actually working.

SPEAKER 2
What about changing it?

SPEAKER 0
Actually, remember, this is a very closed domain, this is

SPEAKER 1
the bible.

SPEAKER 0
So usually when it’s a very closed domain, you probably will go with a lower value of B.

SPEAKER 1
So if I try to make this, for example, 0.55.

SPEAKER 3
Man No, let’s make it 7.

SPEAKER 2
Oops, OK.

SPEAKER 1
Kind of fitting.

SPEAKER 0
It’s kind of fitting.

SPEAKER 4
So now, with this function, it has started to

SPEAKER 0
be actually very, very close to the actual thing.

SPEAKER 1
It’s actually fitting well here.

SPEAKER 4
What I can do, let me try to do it

SPEAKER 0
actually.

SPEAKER 2
How can I duplicate this one? I can actually try to make it for the Wikipedia abstracts as well, so I can copy.

SPEAKER 3
And go back here. Expand it. I didn't do it well. Is there no way to duplicate it? Let me search for how to duplicate a sheet.

SPEAKER 2
Anyway, I can overwrite this one, that’s fine. So you can see now actually what is happening.

SPEAKER 1
What about Wikipedia abstracts?

SPEAKER 0
I printed it to b.txt.

SPEAKER 3
So if I go back here and open b.txt, check; it should be done by now. Here are the numbers. I can copy this and go to the sheet and paste it. These are the values now.

SPEAKER 2
As you can see, it’s totally off because this is

SPEAKER 0
the difference between the Bible and the Wikipedia abstracts.

SPEAKER 2
Actually, you can see the orange was actually the Bible. This is an estimated one based on the Bible.

SPEAKER 1
Actually, the Bible one is growing much less than the Wikipedia one, which grows much faster because it's actually a more open-domain

SPEAKER 2
thing.

SPEAKER 1
So what I can do, but actually let me actually

SPEAKER 3
make this one include the whole thing. Now it's stopping here; I can include the whole thing. Actually, it's better to start from scratch. Come here and take, come on. Yes. Of course, it's much, much larger now and it will take forever. I will do it from the bottom up; it's better. So I go down there, do that, go up, go up more, and insert a new graph. Like this one, here you are. Delete this one, I don't need it anymore. And this formula I need to extend to the rest of the column, so I take Ctrl-C and go here. So there. Now I can see the whole thing, hopefully, on the graph.

SPEAKER 2
So here is a Wikipedia.

SPEAKER 1
thing. It's not fitting at all. So probably what I can do, I can increase the

SPEAKER 2
value of B. So what about making it 0.65 instead?

SPEAKER 3
Coming closer to 0.7. Oh, now it's too high; I can reduce this one to 3. It went lower, so 4.

SPEAKER 2
Oh, probably it’s fitting.

SPEAKER 1
OK, so now I can see now the value of

SPEAKER 4
the Heaps' law constants here. They change between different domains.

SPEAKER 0
One is a very closed domain like the Bible or

SPEAKER 4
a very open domain like the Wikipedia abstracts. Different values, but I can do a good estimation. Why is this useful? The question is why this is useful. So, I downloaded these Wikipedia abstracts when, maybe

SPEAKER 0
it was 7 or 8 years ago. Now we have many more Wikipedia pages, so probably

SPEAKER 4
it’s not 3.5 million anymore, probably it’s 10 million. So if I actually now I’m in a company, imagine

SPEAKER 0
that they hired you after taking TTDS. I told you you are now an expert in search engines. And we are a new journal, and we are

SPEAKER 4
publishing our articles every week, and we have only one

SPEAKER 0
month actually of articles, but we need to estimate the

SPEAKER 4
machines we need for the next 10 years. But we don't actually know how large our vocabulary will be and how many articles we will have. So we can take this one month of articles, 10,000 or 20,000 articles, find the K and b, and actually estimate how this will look when you reach 1 million articles.

SPEAKER 0
So if I go to the end of this one, what can I do now? We have 80 million. What about if it were 100 million?

SPEAKER 2
312, 3.

SPEAKER 1
Now you can get an estimate.

SPEAKER 0
What about if I reach 150 million?

SPEAKER 2
12, 3123.

SPEAKER 1
What about if I reach it, what if I reach

SPEAKER 2
200 million?

SPEAKER 3
I get the, oh, this is something missing here. No, I think it’s great.

SPEAKER 2
So if I went to the graph again, I selected

SPEAKER 3
here, and I went to the end and included this. Here you are. It went up again.

SPEAKER 1
Now I can estimate what my vocabulary would be if

SPEAKER 4
I doubled the number of documents I have in my

SPEAKER 1
collection.

SPEAKER 0
I can actually get a good, very good estimate from

SPEAKER 1
this.

SPEAKER 4
So this is how it would be useful. The interesting thing is that, for each kind of collection, you can get a very good estimate of how the collection will grow over time by having this

SPEAKER 0
estimate about the collection itself.

SPEAKER 4
Is it clear so far? OK?

SPEAKER 0
You will be asked to do this in your lab, so you need to implement all of this. You don't have to use Excel, of course.

SPEAKER 4
Uh, to, I think it might be easier to use

SPEAKER 0
Python and even using some of the fitting functions.

SPEAKER 2
There are fitting functions available, so just find a

SPEAKER 0
fitting function with this formula, and it will give you the value of K and B, OK?
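For the lab, one way to fit K and b without Excel is a least-squares line fit in log-log space, since log V = log K + b log N. This is a minimal stdlib sketch; a library fitting function such as scipy's curve_fit would also work directly on V = K * N^b:

```python
import math

def fit_heaps(points):
    """Estimate Heaps' law constants (k, b) from (N, V) pairs via a
    least-squares line in log-log space: log V = log k + b * log N."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    k = math.exp(my - b * mx)
    return k, b

# Synthetic check: data generated from k=4, b=0.6 should fit back exactly.
data = [(n, 4 * n ** 0.6) for n in (1_000, 10_000, 100_000, 1_000_000)]
k, b = fit_heaps(data)
print(round(k, 2), round(b, 2))  # → 4.0 0.6
```

On real collection data the points won't lie exactly on a line, so the fitted constants are an approximation, just as in the Excel demo.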

SPEAKER 4
So once you do that, share it on the Piazza

SPEAKER 0
and see actually whether you are getting the

SPEAKER 4
same values or it’s a bit different from one to

SPEAKER 0
another, OK? This is just an estimation. So this is actually Heaps' law.

SPEAKER 4
The last law about text is uh that we are

SPEAKER 1
going to discuss. There are many laws, but these are, I think the most important ones for this course.

SPEAKER 0
Clumping and contagion in text.

SPEAKER 4
What we learned from Zipf's law is that most words do not appear that much. Actually, many words appear only once. But the interesting thing is that once we see a word at least once, we can expect to start seeing it again. If there is a document where someone is called Ricardo,

SPEAKER 0
for example, and you have never seen Ricardo before, then

SPEAKER 4
and you find a document talking about Ricardo, there is a big chance that the document will mention Ricardo again, even if you hadn't seen the name in a million documents before,

SPEAKER 0
but now I have seen Ricardo, and potentially I can see the name again.

SPEAKER 4
So we say that words are like a rare contagious disease. You don't see it that much, but when it appears, it starts to spread.

SPEAKER 0
You see it more. It's like rare lightning: you don't see lightning every day, but once you see one strike, you can probably see another one very

SPEAKER 1
soon.

SPEAKER 0
So from the Wikipedia abstracts, here is a very

SPEAKER 4
small experiment you can do and try it. OK, what about the terms that appeared only twice? So I’m looking for, I know that there are many terms that appeared many times, many, some terms appeared only

SPEAKER 0
one time, but there are some terms that appeared only

SPEAKER 4
twice. So if this actually theory is true, it means that I would see these two occurrences of the terms close to each other.

SPEAKER 0
Now I have 80 million words happening in a

SPEAKER 4
text. If there are terms that appeared only twice, I can check where each one appeared. If the first occurrence appeared at the beginning and the second occurrence appeared at the end, it's not that contagious. But if I find them close to each other, probably

SPEAKER 0
yes. When I see something, probably I’m going to see it again.

SPEAKER 4
So if we did this implementation, you can see actually

SPEAKER 0
this is the distance between the two terms, very, very

SPEAKER 4
close, almost around like 1 between them.

SPEAKER 0
They are mostly happening very close together.

SPEAKER 4
Sometimes, of course, it happens like a big distance between

SPEAKER 0
them, but the majority were actually happening between a very

SPEAKER 1
close distance to each other.

SPEAKER 0
Once it’s appeared once, you expect to see it more.
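The experiment just described — for every term that occurs exactly twice in the collection, measure how far apart its two occurrences are — can be sketched on a toy scale (the example sentence is made up):

```python
from collections import defaultdict

def twice_term_gaps(tokens):
    """For every term occurring exactly twice, return the distance in tokens
    between its two occurrences: the clumping/contagion experiment."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)
    return {t: p[1] - p[0] for t, p in positions.items() if len(p) == 2}

# Toy example: "ricardo" occurs twice, close together.
toks = "we met ricardo today and ricardo said hello to everyone he met".split()
print(twice_term_gaps(toks))  # {'met': 10, 'ricardo': 3}
```

On the real Wikipedia abstracts, a histogram of these gaps is heavily skewed toward small distances, which is the clumping effect.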

SPEAKER 4
So the majority of those terms appearing only twice appear close to each other. And this actually gives you an idea about this theory. So let's actually do it in practice. Given a collection of 20 billion terms, what do you think would be the number of unique

SPEAKER 1
terms in such a collection?

SPEAKER 4
It's really hard to say. And this is 20 billion terms, which is huge to process.

SPEAKER 0
I'm not talking about 80 million as in the Wikipedia abstracts. What you can do is take only 1 million

SPEAKER 4
terms out of this, just 1 million terms of the

SPEAKER 0
collection, and then go and use Heaps' law to estimate the K and b.

SPEAKER 4
Imagine I did that and I managed to find that

SPEAKER 0
K is 0.25 and B is 0.7 from this 1

SPEAKER 4
million terms. Now I can easily calculate how many unique terms there would be, because I can just substitute in the equation; instead of 1 million, I substitute 20 billion and see. So if I did that, it

SPEAKER 0
will be around 4 million terms.

SPEAKER 1
I can expect it to be 4 million unique terms.

SPEAKER 4
So the other question, what do you think would be the number of terms appeared once, in this case?

SPEAKER 1
Yes, yes, around 2 million because it’s usually around 50%.
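The back-of-the-envelope estimate above, with K = 0.25 and b = 0.7 fitted from the 1-million-term sample and extrapolated to 20 billion terms, checks out in a few lines:

```python
def heaps_estimate(n, k=0.25, b=0.7):
    """Vocabulary estimate V = k * n**b, using the constants from the lecture's example."""
    return k * n ** b

vocab = heaps_estimate(20_000_000_000)
print(f"{vocab:,.0f} unique terms expected")        # around 4 million
print(f"{vocab / 2:,.0f} of them appearing once")   # roughly half appear only once
```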

SPEAKER 4
So now, for some of the projects you might actually have, like Wikipedia or

SPEAKER 0
the web archive, which is trillions of terms.

SPEAKER 4
You don't have to process everything to find the unique terms. You can get a quick estimate of the machines you need by processing a small portion and estimating what you are going to expect. Is it clear so far? So the summary here: text follows well-known phenomena.

SPEAKER 0
We talked about Zipf's law, Heaps' law, and

SPEAKER 1
contagion in text, and we learned a little bit about

SPEAKER 0
shell commands. I hope you start practicing with them if you are

SPEAKER 4
not aware of them, but it's usually useful to be aware of this. In the lab, I'm giving you 3 collections: these 2 collections, plus the 3rd one. But I would actually be very excited if some,

SPEAKER 0
some of you would think, oh, you know what, I’ve

SPEAKER 4
got a collection in French, I got a collection in German, I got a collection in Arabic, I got a

SPEAKER 0
collection in Chinese, and I applied this, and these are

SPEAKER 4
the results. Share it with us. Let’s see actually how it works.

SPEAKER 0
That would be very useful and entertaining to see.

SPEAKER 4
So the resources for this one, you, this is actually

SPEAKER 0
chapter 4 in the IR in Practice textbook. I would actually recommend watching these entertaining videos from

SPEAKER 1
Vsauce about Zipf's law. It's very interesting. It's not just about text, it's in general, but it's very

SPEAKER 0
interesting. And Benford's law as well, on another channel.

SPEAKER 1
It’s like 10 minutes. It’s very exciting to see and understand how it’s going.

SPEAKER 0
And you can, if you are using Windows and you’re

SPEAKER 4
not actually, if you’re a Windows user, you can see

SPEAKER 2
many Apple users here, but there are some Windows users as well,

SPEAKER 0
or you’re not a Linux or a Mac user, then in this case, you can actually get the shell commands for Windows from this link.

SPEAKER 1
So you can actually use the cat and everything just putting this in your machine. The next lecture we’re going to start getting ready for

SPEAKER 0
indexing, and we'll talk about the pre-processing steps.

SPEAKER 4
Now we will have a break for 5 to 10

SPEAKER 0
minutes, and you are free to go and come and

SPEAKER 1
do whatever you want, but actually remember that there will

SPEAKER 0
be a mind teaser.

SPEAKER 1
I will show it to you now.

SPEAKER 2
So uh just in the break if you’d like to

SPEAKER 3
get your mind working during the break, OK? So let me bring it up. Let me close this. Don't save anything. And let me see.

SPEAKER 4
So this is the first mind teaser. Uh, this is just enjoy the break.

SPEAKER 0
Uh, whoever gets the answer to this, raise your hand. If you're going to try it, you'll get a small

SPEAKER 2
prize, OK? Just for fun, nothing. It’s not related to the course, OK?

SPEAKER 1
So, uh, yeah, have the break for 5 to 10

SPEAKER 0
minutes.

SPEAKER 4
If you got this answer, the answer of this, raise

SPEAKER 1
your hand and let me know. The one who gets it first will get a prize.

Lecture 1

SPEAKER 0
OK. Hi, everyone. Uh, welcome to the first introduction lecture for Text Technologies for Data Science. I'm Waleed Magdi. I'll be the instructor for the next few weeks, and there will be another instructor joining me afterwards. So, um, this course is about text technologies. I will explain more about what that means; the lecture today is just an introduction. So the objective of today's lecture is to learn in general what is to be discussed in the course: what its topic is, what objectives you should get from the course, the requirements you need before taking it, so you might change your mind if you don't have these requirements, what the format will be going on during the course, and the logistics: the location, the labs, how it all works, and so on. And just a note: there are not many technical details today, just a very friendly introduction, so don't expect that to be the case going forward. It's just about getting to know the topic of the course. OK, so the course is called Text Technologies for Data Science. So what is it about? You can actually guess from the name. We're talking here about text, and by text we mainly mean documents, words, terms; we're not talking about images or videos or music. So it's mainly focusing on the textual part. It can involve videos and images, but then it will be the caption or the transcription of the video; we are really focusing on text in this course. As for technologies, we'll be discussing several technologies within this course. The main one is information retrieval, which I will discuss more, what it is exactly, then text classification, text analytics, and we are also adding material about LLMs; everyone is talking about LLMs, so that's part of it, and RAG systems, which we will discuss in more detail later.
And if you'd like to summarize it, it's more about search engine technologies, all the technologies around search engines right now, which is not just the search itself; it includes other things too. So we mentioned the first part is information retrieval, which is a big part of this course. So the question is, what is information retrieval? Can anyone give me an example of an information retrieval system? Anything. Yes, a database? Not exactly; I will explain why it's not databases. But what else? Yeah, a search engine, exactly. Google, for example; this is a very basic example of an information retrieval system, what we call web search engines. But it's not just about Google or Bing or whatever; web search engines are just one type of information retrieval system, and we will learn how they work. It also includes things like speech QA. It was Siri before; now there are a lot of technologies out there where you can ask a question and get some feedback. So this is part of information retrieval systems. It can be social search: searching on Twitter or X or TikTok or whatever. You can search for something, for a hashtag, and you get a feed of results, and you can also see it telling you there are 4 or 5 or 10 new results showing up, which is called information filtering, something that keeps feeding you content. You will also find recommendations, from recommender systems, which are part of information retrieval here. The field itself is old enough that it started in the 1950s, mainly for searching libraries. We'll actually be studying techniques that were developed in the 1950s and 1960s in our course, showing how this started, that the technology began at that time, and it has actually been good enough that it is still used to this day.
Actually, the same science that was developed at that time is still being used now, but it started mainly decades ago, just for libraries, to search for books and find the relevant book you're looking for. And this is from the 1950s. It can also be more advanced search, like legal search. There was a lawsuit between Samsung and Apple around 2010, about one company using the IP of the other. This case took years, because lawyers were searching tons of documents: legal documents, emails, patents, reports, and so on, to find whether one company had breached the other's IP. And this actually cost a lot; we're talking about $10 billion. So this is a search task as well, but it's not like web search; it's something very, very advanced that professionals use. It also goes across languages, because if you check, around 50% of the content on the internet is in English, but there are many, many other languages, and English speakers are only about 25% of users; there are users from all the languages around the world. So how can you get information that doesn't exist in your language but exists in another? How can you retrieve it across languages? This is an important part as well. It even goes to applications beyond text, like Shazam, if you know it: you can hum to find your song. We are not going to do this in our course, but it's part of information retrieval: how you can index these kinds of things and model them in a way that lets you retrieve them later. So if I ask you what IR is: even the normal web search page has a lot of technologies within it. For example, when you start typing a query, you will find some suggestions, which is called query suggestion, or correction, or prediction.
So this is part of the technology. Also, when you get some results, each result will have a small snippet under it, summarizing the main point you're interested in. Actually, if you get the same result from two different queries, the snippet will probably be different for each one. So how do you select the part of the page to show to the user so they can decide whether to click it or not? This is another thing. You will have advertisements appearing alongside. So how can you insert them in a way that still gets relevant results, running in parallel? And you also have categorization, so the results can be the web generally, or news, or images, or videos, or whatever. And recently, in the last couple of years, you can even get an AI-generated summary answer built from the top results. So this is happening, and all of this is part of information retrieval. So if it's all about searching and getting relevant results, let me ask you a question. Who needs IR anymore in the era of ChatGPT? Now, with ChatGPT, when you're searching for something and you need some information, you can ask ChatGPT about it and you'll probably get a good answer. So why do we still need search engines? Can anyone guess whether we actually need them? Yeah, maybe because you want credible sources, and not everything on the internet is credible. That's a good point. You need credible sources, yes.

SPEAKER 1
If we give all our data to these, we’re exposing our private data to them. Uh, so we need our own search engines and retrieval systems so that some data is hidden from these LLMs.

SPEAKER 0
That's an excellent point. Yes, because LLMs are trained on public data, but if you have a company with private data, you will not put it into ChatGPT. However, someone can tell you, OK, but you can train your own LLM in this case, with your own data. So why do you still need search engines? Yes?

SPEAKER 1
I guess sometimes LLMs hallucinate while giving answers. So maybe search engines are typically more factual.

SPEAKER 0
So hallucination, yes: the search engine will give you some results and you read them yourself, but the LLM can hallucinate while generating the answer. But it's getting better. So maybe after a couple of years, when LLMs are amazing or perfect, we won't need search engines. Do we think we still need them? Yes. Excellent point. Yes: new data, new data. So the problem with LLMs is that they are very expensive to train, and the version of ChatGPT you're using at the moment, for example, was probably trained last year; it took a few months to get the model. So if I'm searching for something that happened recently, how would it get the answer? Even with enterprise data, data keeps being generated. So how do you handle that? For example, one hot topic from the last couple of years: if I ask ChatGPT, and this is an answer I tried just yesterday, "What do you know about the Palestine-Israel conflict?", ChatGPT managed to give me an answer directly, because it has been trained on a large amount of data that includes this information. It's about history, so it has probably been covered. But if I ask another question, "How many children have been killed in the last couple of years in Gaza?", this requires recent information. The information is changing every day, so the model cannot get it from its training data. Try this on your own ChatGPT: if you ask it, it will tell you "searching the web".
It understood that it doesn't have this information and that it requires fresh data, so it will go and give you an answer, and the interesting thing is that at the end of the answer you will find the sources: it will tell you, I got these answers from UNICEF, from the UN, and so on. So even though ChatGPT and the other LLMs are amazing at answering many of the questions we have, they will continue to require searching the web. Maybe in the future we will not do the web search ourselves, the LLMs will be using it, but the technology itself will still be needed, because everything changes every day. Many different things are happening around us; how do you get fresh data? LLMs are very expensive to train to hold this information. So this is still relevant and will always be required. Is that clear enough? OK. And it's critical for chatbots. Yes, chatbots are taking over; they are used a lot, and you will probably be using them in your coursework. How are you going to use them? I will allow you to use them a little bit; that's fine. But at the end you will need search engines, because search engines are designed to index data very quickly and retrieve data very quickly, in a fresh way. So the other question I would discuss with you: what is information retrieval? How is it different from "find", for example when you open a PDF and use find? Do you know if there is any difference between the two? If I search for the word "retrieval" in a PDF document, I will get it spotted: where is it? So how is this different?

SPEAKER 2
Find looks for literal strings, while information retrieval is smarter.

SPEAKER 0
That's one point, yes: it looks for an exact string match. For example, if you are looking for "retrieval", it would match "retrieval" but not "retrieving". That's one point. Is there anything else you think is different? Yes?

SPEAKER 1
A PDF is small enough that you can just look through all the words, but the internet is bigger than that.

SPEAKER 0
That's an excellent point, exactly. Searching a PDF is like doing grep: it's just a sequential search. It goes through the PDF reading word by word until it finds the match, which is called word spotting, like grep if you're using Unix commands. For a PDF that's fine. If the PDF is 1,000 pages, it might take some time. If you're searching something like the internet, it will take forever. So this is different. Search engines are about how you can find information very quickly, almost instantaneously, and they will also match the topic itself, not just the exact word. So it's mainly these two differences, and this is the main technology we are going to study in this course. So the textbook definition of information retrieval is: finding material of an unstructured nature that satisfies an information need from within large collections. Let's take it part by part. First, the task here is to find something. I need to find something; even if you're using the search engine within a chatbot, I need to find an answer to my question. Second, the nature of the data is unstructured. It's not a database. In a database, I know the data sits in specific fields with specific characteristics, but this is the internet: documents, enterprise documents, which can be slides, Word docs, reports, many things. How can you take this unstructured form and retrieve data from it? And the target is to satisfy the information need of a specific user. This is why, when we learn how to measure whether a search engine is good or bad, we need to find a way to measure satisfaction: that the user is happy with the answer. It's not just about matching a term; it's about finding what the user thinks is relevant.
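The two differences above can be sketched in a few lines of Python. This is my own toy illustration, not the lecture's code: the tiny collection and the crude suffix-stripping rule are invented (real systems use proper stemmers, such as Porter's), but it shows both the grep-style sequential scan versus an index lookup, and exact matching versus topic-style matching.

```python
# Sketch of the two differences: (1) grep-style sequential scan vs an
# inverted index lookup, and (2) exact string match vs a crude stemmed
# match. The tiny collection and the suffix rule are invented for
# illustration; real systems use proper stemmers like Porter's.
from collections import defaultdict

docs = {
    1: "apple is retrieving patents from samsung",
    2: "new phone released today",
    3: "samsung patents a new display",
}

def crude_stem(word):
    # Strip a few common suffixes so "retrieval" and "retrieving"
    # collapse to the same form ("retriev").
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def sequential_scan(term):
    # grep-like word spotting: read every document, word by word
    return [d for d, text in docs.items() if term in text.split()]

# Build the inverted index once: stemmed term -> doc ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[crude_stem(word)].add(doc_id)

def index_lookup(term):
    # one dictionary lookup instead of scanning the whole collection
    return sorted(index.get(crude_stem(term), set()))

print(sequential_scan("retrieval"))  # exact match: finds nothing
print(index_lookup("retrieval"))     # stemmed index: matches "retrieving"
```

At scale, the index is built once and every query becomes a cheap lookup, which is why web search feels instantaneous while a sequential scan would take forever.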
So a big portion of this course will be about information retrieval. Another thing is text classification: classifying text into different categories. There are many examples of this. For example, when you go to a news website, you will find articles classified into different topics like politics, business, technology; all of these are a kind of classification. Some of this is done manually, but now people just type the article and a system classifies which topic it relates to. Even in social media you will find feeds about news, about entertainment, about sports, and so on, so it's happening everywhere. And of course in social media it has to be done automatically, because you never type a tweet or a TikTok post and then say "this is about politics". No, there are engines inside which classify it automatically. Do you have any other examples of classification? There is something that is used in all of our emails. Do you know what it is? Spam detection, exactly. So, simply, how can you learn that something is spam? This is a very simple classification task: I need to find out whether this email is spam or not. Google tries its best, and sometimes it classifies spam as not spam and not spam as spam, but it's mostly, hopefully, working fine. And of course it can also be hierarchical, so it doesn't have to be yes or no, or multiple flat classes; sometimes it's one class that goes into subcategories, like patents, for example, where you get different types of subcategories within the same class. So the definition of text classification is: the process of classifying documents into predefined categories based on their content.
From the content of the text, I'd like to find which category it is related to. It can be binary: is it spam or not, positive or negative sentiment, for example. It can be a few classes: is it sports or politics or technology, and so on. Or it can be hierarchical: this is the major class, but it goes into subclasses as well. So this is another thing we are going to study in this course. A third thing we're going to study is text analytics. Usually you'll find different categories of things around us, and we'd like to understand how we can compare two things. For example, you'll find the left and the right in politics, you'll find the war between Ukraine and Russia, and there will always be many claims around them. For example, when I was searching left versus right, you'll find claims that the right are those who think about people and are kind, while the left are bad and angry. On the other side you'll find: no, the right are very aggressive, but the left are kind. From a scientific point of view, how can we examine something like this in a scientific way? What I can do is say, OK, let's get all the Conservative politicians and all the Labour or Democrat politicians, take all their statements, and then do some objective analysis: compare them, see what they are focusing on, what is special about each of them from what they are saying. You can do the same for the public: take millions of accounts of people who identify themselves as on the left, millions who identify as on the right, and do the same thing. And then apply some methods so that, from two big different corpora, I can find what is unique about each of these groups.
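A minimal, data-driven sketch of that kind of two-corpus comparison could look like this. The tiny "corpora" below are invented, and the smoothed relative-frequency ratio is just one simple choice of measure (the course will cover principled ones like mutual information and chi-square):

```python
# Sketch: find words that are distinctive of one corpus vs another
# using a smoothed relative-frequency ratio. The two tiny "corpora"
# are invented examples, not real political data.
from collections import Counter

corpus_a = "tax cuts tax growth business business freedom".split()
corpus_b = "workers rights healthcare workers equality tax".split()

def distinctive(corpus, other, top=3):
    ca, cb = Counter(corpus), Counter(other)
    vocab = set(ca) | set(cb)
    # add-one smoothing so words unseen in one corpus don't divide by zero
    score = {
        w: ((ca[w] + 1) / (len(corpus) + len(vocab)))
         / ((cb[w] + 1) / (len(other) + len(vocab)))
        for w in vocab
    }
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:top]]

print(distinctive(corpus_a, corpus_b))  # words more typical of corpus A
```

The same few lines, run over millions of real statements, give an objective picture of what each side talks about, with no hypothesis baked in up front.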
And this is something we're going to learn in this course: how to work with different corpora, and it doesn't always have to be politics. It can be, for example, what's the difference between Shakespeare's works and a recent writer? Are there any differences? How can we extract these kinds of things? This is what we're going to study: how to compare two corpora, how to find what is unique about each, what is common between them, how to deeply analyze this with data-driven approaches, all data-driven, without going in with some specific hypothesis about them. So in this course you will learn how to build a search engine from scratch, from a very low level: which search results to rank at the top, how to do it fast on a massive amount of data. Then how to evaluate your search engine: now you've achieved some results, how shall I know whether these results are good or bad? Then how to work with text in general: if you have two tweets talking about the same topic, how shall I know? And if there are misspellings or morphology problems, how can we fix them? How to classify text into different categories, and how to apply text analytics between different corpora: how to find what is unique about a certain corpus compared to others. And the final part will be about RAG systems. Imagine you have all this knowledge about search engines; how can you integrate it into an LLM so it can extract a summary answer to a question you're asking? So how is this course different from others? There are NLP courses, like ANLP or FNLP. They have some text processing in common, yes, and maybe text laws; if you've heard about Zipf's law and similar, we might discuss that as well, but we are not teaching NLP here.
We are talking about working with massive amounts of textual data and how to handle it. With machine learning courses, yes, the text classification part might overlap a little, but the focus here is not on the machine learning itself; it's more on how to process the text, and of course we go deeper into that. But it does not overlap with the other courses in how to build a search engine: the different retrieval methods, IR evaluation methods, text analysis, processing a large amount of textual data. This is something unique about this course, where we're going to work with large amounts of textual data, plus the RAG systems. You might study LLMs in other courses, but here we look at how to use them in a RAG, retrieval-augmented generation, system. Some terms you will learn from this course, you might be familiar with some and not others: inverted index, vector space model, retrieval models like TF-IDF, BM25, language models, PageRank, learning to rank, mean average precision, mean reciprocal rank, NDCG, mutual information, information gain, chi-square, binary and multi-class classification, and retrieval-augmented generation. If you are not aware of many of these terms, you're going to learn about them during this course. Another important thing about the course: it's highly practical. 70% of the mark is on coursework, and only 30% is on the final exam. We're not going into much technical depth today, but from next week you will be implementing almost everything we study. I would say 50% of what is in the slides, you will be implementing yourself. It's not just "it's fun to know about this"; it's not just about knowing. You will be implementing, OK?
And by week 5, when you submit your first coursework, you will be required to submit a fully functional search engine built from scratch. No libraries, from scratch, OK? How do you take many documents, put them in an index, and then search them and rank them according to a given query, from scratch? It will be fun. You have a practical lab every week, well, almost every week; we'll leave maybe one or two weeks as buffers so you can finish your coursework, but there will always be labs. There are two courseworks that will be mostly coding, a lot of coding, and there will be a group project in the 2nd semester where you take everything you have learned in this course and implement something nice. I will discuss it more later. Prerequisites: I'm expecting you to have some mathematical background, like linear algebra, how to multiply matrices, these kinds of things; I hope you have this. For example, this is something you're going to implement, so just be sure you understand how to read something like this. I'm expecting you to have some programming background: Python, which is very common; everyone should be using it. A little knowledge of regular expressions might be useful for parsing the text itself; it's fine if you don't know them, you can learn, but you should know how to code. I don't mind you using, for example, some of the LLMs or chatbots to help you write a function or something. That's OK, but not the whole thing, OK? So just be careful: don't feed it the coursework and have it implement everything. No, don't do that. And shell commands will be very useful. Even if you're not very familiar with them, you will learn them over the course: using cat, sort, grep, uniq, these kinds of things.
When you're doing text technologies, you need to be very fast with this stuff. When you get a new collection, you shouldn't have to write a big piece of code to understand what is going on. I can write a one-line shell command and get some statistics about the data set I have. We will start this next week; you will see me running some shell commands in front of you. And of course, knowing data structures and software engineering matters, because you will need to implement components and make them speak to each other in an efficient way. I'm expecting you to have at least some of this, and you can build on it over the course, OK? An important note: we do not teach you how to code in this course. I'm expecting you to know how to code, and then to use your coding skills to implement things, OK? It's like English: I'm not teaching you English, I expect that you know English. In the same way, I will not assume you don't know Python; I'm expecting you to have it, and we will be using it. Another requirement in this course is teamwork. The 2nd semester will be a group project, and it's a requirement that you work in a group. So if someone says, "actually, I don't feel comfortable working in a group", find another course. You have to work in a group in this course, because the main point is to train you for industry once you graduate. Coming from some background in industry myself, I would say 95% of the time you will be working in a team, and you need to know how to split the tasks among each other, run it, and create something. So teamwork is really important, and working in a group on the group project in the second semester is a requirement.
We will speak more about this in a couple of slides. Another high-level objective to think about when taking the course is not "I will get a high mark in TTDS". Forget about the mark. What is really important is to think about what you will add to your CV by taking this course. These are the main skills we expect you to be able to add to your CV, like working with large textual data. You can add to your CV: "I did a project that indexed 100 million documents, and we retrieved results in less than a second." That's what I'm expecting you to be able to add. You'll know shell commands and this kind of thing. Some Python programming, which is a requirement; software engineering skills, which will be needed for the big project in the 2nd semester; and how to build a classifier or a text analyzer in a few minutes: you get some data, you'd like some analytics from it, and you should have the skills to build that quickly. And of course teamwork: by working in a group, you will learn skills like project management, time management, task assignment, and system integration. So this is what you will hopefully gain from this course and can add to your CV. Now, the structure of the course. We have 19 lectures, all during the first semester. Two lectures of introduction like this, nothing heavy; then 13 lectures related to IR and RAG, about search engines in general and how the technologies around them work; and 4 lectures about text analytics and text classification. And there will be between 8 and 10 practical labs, all in the first semester as well.
There are no tutorials in this course, so it's mainly the labs. There is some self-reading: every lecture will give you the basic material and then point you, if you need more, to the textbook or some papers. Lots and lots of system implementation, OK, lots of coding, and there will be links to some nice online videos to learn more about general topics. The instructors for this course are myself and Bjorn. Actually, we have both been promoted and I didn't update this slide: I'm a professor now and he's a reader, but I can update it later. I'll be teaching the first 7 weeks, and Bjorn the last 3 weeks. And every year we try to get a guest lecturer. Last year I managed to bring someone who had a startup that raised $56 million using search engines, and he told the class how you can use the things you learn in this course to build a business and a startup, so that would be useful. We'll try to bring him again, or someone else. The lecture format: you have 2 lectures at a time, on the same day, back to back. Questions are allowed anytime: stop me and ask. It's not "ask questions at the end"; you can stop me and ask anytime, so feel free to interrupt me, OK? After lecture one, we will take a 5 to 10 minute break, and during this, feel free to go out and come back; it's up to you. You can discuss the first lecture with the people around you, and before the start of the second lecture you can ask more questions. And in the meanwhile, for fun, from next week we'll be giving a brain teaser during the break, some mathematical problem, to see who's going to answer it first, and there will be some small prizes.
But mainly, you'll have two lectures back to back with a 5 to 10 minute break in the middle. Some lectures are interactive: I'll be speaking with you, discussing, waiting for your answers, so please try to participate. Some lectures will include demos, like running code together and seeing the results; this starts next week. It's just to give you an indication of what the code is doing; you don't have to copy the code behind me, it's mainly demonstrating what we are learning. Any questions so far? Yes? "So we said we will implement everything from scratch, but are we also going to learn the modern technologies regarding information retrieval?" Like what, the modern technologies of information retrieval? Yes, yes, we'll reach that part. We'll see how the basic stuff works, then at the end you will be putting everything together with modern technology integrated, OK? Everything will be covered, but I'm not asking you to build an LLM from scratch, don't worry. In the 2nd semester I will allow you to add that as an extra part if you'd like to build the RAG system. More questions? OK. Labs: we have a lab almost every week. The format works in a particular way, so please pay attention. I give the lectures on Wednesdays. Once I do that, you will go home and find that the lab related to the lecture has just been released. You will be required to implement this lab as soon as you can. If you managed to finish it and got some results, please go and share your results on Piazza. You know what Piazza is, I think everyone is using it; it's the forum we're going to use. If you have any questions about the lab, post them on Piazza, and the lab demonstrators and I will respond to any questions you have.
By the way, last year TTDS had the fastest response time for questions on Piazza: we're talking about 10 minutes on average. We'll try to keep that up: once you have a question, we'll answer you. The main thing about sharing your questions on Piazza is that if you have a question, maybe another 5 or 10 people have the same question; they'll be able to see it and see the answer, so we don't have to keep repeating the same thing. So if you have any questions about the lab after the lectures, ask. You can do the lab on Wednesday if you want, or try it Thursday, Friday, the weekend, Monday. Share the results you got, or ask questions. Now suppose you've tried everything, you asked your questions, and you still have a problem that isn't solved. For that case, we have drop-in labs on Tuesday, the week after, to ask the remaining questions you have. I'm expecting fewer people to show up at these labs, because I expect most of you will be done over the week, will have asked your questions, and will even have got results and shared them with everyone. But imagine you still have a problem: one strange issue, it's not running well, it's very slow, you don't know what the problem is; you asked, got an answer, but there's still a problem. In that case you can show up on Tuesdays. There are two slots, and you can come to either of them and ask the lab demonstrators, who are there to support you. On your timetable it may say there is a session at 12 as well, but we are not going to run that one; it's only 10 to 11 and 11 to 12, OK? Show up to either one of these. But hopefully most of you will be done before Tuesday.
The lab demonstrators for this year are Natalia and Zara, who will be giving you support. Zara will be online on Piazza, and Natalia will be in person on Tuesdays if you have any questions, and I'll be on Piazza as well, answering your questions. But the main thing about labs: imagine you ran the lab, it turned out really nice, and you got some results. Share your results on Piazza as well, because I like it when you share your results; others will share theirs and you can start comparing your outcomes. You are going to implement things, we'll give you a collection, and you'll see how it goes. OK. Lab Zero. We don't have a real lab for today, because I'm not teaching any material yet, just giving you an introduction. However, we created this Lab Zero a couple of years ago because we realized some students struggle a lot in the course because they don't have the required skills. This lab is very simple: read a text file word by word, then print it back. Read a file from disk, word by word, print it back. If you are confident that this is really silly, fine, don't do it. But if you don't know whether it would be challenging or not, try it. If it turns out to be very challenging, my advice is: don't take this course, because later you'll be indexing millions of documents, and if this is challenging, I'm not sure you'll be able to catch up. If it's more like "I'm not that confident working with text in Python, it will be my first time, but I tried it and it's not that hard", then OK, the course will probably be fine for you, even if it's a little challenging later; you can learn along the way. But if this is a challenging task, be honest with yourself.
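For reference, a minimal sketch of the Lab Zero task (read a text file word by word, print it back) might look like this. The demo file name is a placeholder; the snippet writes its own tiny file first so it runs on its own.

```python
# Lab Zero sketch: read a text file word by word, then print the words back.
# "lab0_demo.txt" is a placeholder name; we create it here so the example
# is self-contained.
demo_path = "lab0_demo.txt"
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("the quick brown fox\njumps over the lazy dog\n")

def read_words(path):
    """Return the words of a text file, in order, split on whitespace."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:              # stream line by line (scales to big files)
            words.extend(line.split())
    return words

for word in read_words(demo_path):
    print(word)
```

Streaming line by line, rather than reading the whole file into memory, is the habit that carries over when the file becomes a multi-gigabyte collection.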
The course is all about this: reading many, many documents, processing them, indexing them, putting them into structured form, connecting components, receiving queries, and retrieving results in milliseconds. So if you're not confident with reading and printing a file at the moment, then this course might not be for you, OK? Just be honest with yourself about this. The assessments: the first coursework is 10% of the mark, only 10 points, and it's really straightforward. Yes, you'll be required to build a search engine from scratch, but everything you do in the labs is simply what you will submit as the coursework; it's like giving you the answer. If you've done labs 1, 2, and 3, then coursework 1 is done. This is why it's only 10%: it's just to confirm that you are following, that you're implementing week by week what we're covering, because each week builds on the previous one. If you delay, you'll have a big queue of things to implement, so it's better to keep working weekly. Once you finish lab 3, go and submit your code; that's coursework 1. Coursework 2 will be about evaluation and some text classification and analysis. It will not all be covered by the labs, which is why it's worth 20%. But the main piece is the group project. You will form a team, a group of 4 to 6 or 5 to 6, we'll see depending on the numbers, and use whatever you've learned over semester one to build something nice in the second semester: a search engine that does an interesting search task on a large collection. And it will be assessed, not by reading your code.
We will ask you to submit your code so we can check it, of course, but the main assessment will be: OK, where is the website you created? Where is the link? Our markers will go and test it, search for things, and look at the results. We want to be sure that results come back fast, that they are relevant, and that it doesn't break, and the markers will try to break it. So don't say "we didn't expect that"; we will see whether you handled different test cases so that it doesn't break, OK? And then you'll get a mark on this. Will test cases be provided? No, it's your whole project, your idea; you decide. You can decide to search something in a different language, whatever language you want, different things. I will give you some examples later about the group project. And the remaining 30% of the mark will be on the exam, which will be about theoretical understanding of the course. It's not practical; you'll have done enough practical work during the course. It will just be theoretical understanding of the concepts of text technologies. The largest portion of the mark is on the group project. It's a group of 5 to 6, and you select your own group; we don't assign you to groups, so make friends from now, OK? The task is to design an end-to-end search engine that searches a large collection of documents, with many functionalities. This will be explained more later, but it's mainly building a search engine, from scratch of course. You can use some libraries; we will explain what you can use. The main point is to have a search engine that can do serious search. How will the mark work for a group? Because it's 5 people, maybe some people will be working well.
Maybe some people are very active and doing a lot of the work. So at the end, we will mark the project itself as a project, forgetting about the individuals, and give it a mark out of 100. Maybe this project is super nice, 80%; this one is moderate, 50%, or 30%; we'll see. So you've got the group project mark, which is the same for everyone. Then each person in the group has a weight. How will we know? We will ask each of you to write a paragraph about your contribution to the project. You don't have to contribute to every single component, but you will have to mention which parts you participated in. By default, everyone's weight will be 1: if the project gets 60, you get 60. However, in some special cases, when one member didn't work well, wasn't participating, wasn't collaborating, they might get less than 1. If they didn't participate well, maybe 0.8, so instead of 60 they get 48. If there was a real problem and they weren't helping at all, maybe 0.5. If they weren't responding at all, the weight is 0: even if the group got 80, multiplied by 0 they get 0. Ideally, everyone gets a weight of 1, and this is what we have mostly seen, but in special cases where one member isn't participating and is a problem for their teammates, they will be penalized. OK, fine. Just one example from a couple of years ago: this group created BetterReads, which is an interesting one. They search for books based on the reviews; you search for something like this and you get results recommending books based on what you're looking for. And they were indexing around 12 million books.
The retrieval time was less than a second. And it's not just a fixed collection; they keep adding: if new books show up, they keep collecting new books and reviews every single day, putting them through their index, and they show you the books and the sentiment of the reviews. It was hosted on Google Cloud. So this is what we expect. It was an OK project. Another group actually indexed a collection of 80 million items. This is what we expect from you at the end: with the skills you gain during the course, you should be able to do that. It might sound like, oh wow, that's a lot, but you will be able to do it, don't worry, in a few weeks, hopefully. And actually, we'll provide you with Google credit so you can host your stuff; don't worry, we won't ask you to buy anything. Each student will get a credit, and then your group can join it together and run it for the project. Now, the timeline for this course: we have 2 semesters. This is also the debate: is this a full year or 1 semester? This is how it goes. In the first semester, you will get all the lectures and all the labs. There are no lectures or labs in the second semester; all the learning is in the first semester. The individual courseworks will also be in the first semester, Coursework 1 and 2, where you will implement some stuff, and this is how it should look when you submit your coursework, OK. We'll have one at week 5 and one to be submitted in week 11, OK? By the end of semester 1 you will be done with most of the course. However, in the second semester, we will ask you to work in teams to create a project based on what you learned.
If each of you can spare around 5 hours as a group and sit together and work on it, then by week 9 you will be able to submit the project. What you will submit is a report about what you did and a link. The marking will go through the report and the link itself to see what engine you created. After week 9, there is nothing related to the course anymore; at the end of the semester you will have the exam. OK. So you will have the burden of the labs and the lectures in this semester, but in the second semester you manage your own time and work on this, and time management is really important in the second semester, because you don't see me. Of course, we'll have lab support every week for those who have questions about the project, and you can ask me anytime, but there are no lectures or specific labs to run. It's about managing your own time. Don't come 1 or 2 weeks before the deadline and say, let's build it. It won't come out nice. And we are very selective in marking the projects. If it turns out that your search engine looks like your first coursework with a little bit of improvement, you'll probably get 30%. OK? And we have seen groups who build very nice stuff, honestly, much nicer stuff. So if you manage your time and spend some good time on it, you'll be able to create something nice. Now the logistics: 2 lectures on Wednesday; recordings will be available directly after the lecture, and handouts will be posted on the day or the day before the lectures, so the slides should be available already. The course web page is this one; you probably know about it, but you'll find all the materials there: handouts, labs, coursework details.
And of course on Learn you can find the lecture recordings, the deadlines, and the submission of the courseworks, and of course the link to Piazza as well, which is our main communication platform. This is your new social media. So if you have Facebook, Twitter, TikTok, Snapchat, add Piazza beside them, OK? Enable notifications; if you get a question, try to answer it; if you've got something, share it. This is the main thing, OK? You're the new generation, guys, who use social media, so assume Piazza is social media. I always love responding as soon as I can; sometimes you'll find someone posting something at 3 a.m. and I will respond by 3:05, OK? So let's do that, and let's share your results there, share your questions there. One important thing: before you ask a question, please check that it hasn't been asked before. Use your search engine skills before posting; search and see whether it has been asked before, because you might find the answer already. You can find the link to Piazza on Learn, so please make sure that you have joined. There are some frequently asked questions that I will share with you, like how the project will be managed, and what if one of the members doesn't work. I hope I have answered this. We will talk about this more later in the semester, but I'm expecting that each group will elect a project manager, the one making sure that stuff is done on time and tasks are assigned, making sure everything is done well, and a communication person as well, so if there is any problem, someone can contact us. What if someone is not working? I know it would affect your project, but you have to handle this. This is why it's your responsibility to select your members.
For example, you started with 6 and it turned out 3 are not working. We will mark whatever the other three did, fairly; if it's not great, it will be marked as not great. The 3 who didn't participate will probably get 0s, but the project as a whole, the outcome of 3 people, will probably be less than the outcome of 5. That's your own problem. We check the project regardless of who worked on it, whether 1 person or 10; it's a project, and we give it a mark. Then the individual weight will adjust this. So it's important to make sure you have decent group members, assign the tasks carefully, and see who's going to work on what. Maybe someone says, I would like to work on the interface, for example. That's fine. This is what you usually find: some people will work on the interface, and some students will say, we have created interfaces for desktops and also for mobile phones, it's very fast and nice. Some people even create an app as well, in Flutter, for example. You don't have to do that; this is additional stuff, just if you'd like to improve your mark. Some people will say, I'm responsible for the data collection; someone for the indexing; someone for making the search fast. Many components will be there. The main point is to sit together, take everything you studied, and decide who is going to work on which part. One more thing: if someone says, I'm not solid in programming, should I take this course? Check lab zero. This might give you an indication; if it's challenging, think 2 or 3 times before taking the course. Can I audit the course? Yes, anyone can audit the course. So if you have friends who would like to come and listen without taking the exams or doing the coursework, anyone can join, OK?
It's open for anyone. Now, any additional questions? Yes: the final mark consists of several components; do you need to score at least a certain percentage in each component to pass the course, for example a minimum on the exam? No, we just check the total. So imagine you are doing amazing and you already got 70% before the exam, and you say, oh no, you know what, I'm not going to the exam. It's up to you, OK? Or you go to the exam and gain some more marks; it's up to you. We just compute the full mark and check the whole thing. Yes: can I audit the labs? What do you mean, audit? Uh, yeah, OK, that's fine. But audit Piazza too, because most of the support will be on Piazza. Of course, if you would like to join the lab as well, that's OK.

SPEAKER 1
Yes.

SPEAKER 0
That's a good question, actually: what if you would like to program in something other than Python? I have another programming language, but nothing in Python. So, let me know what the language is. You try to implement the stuff there, see if it works, and we'll find you a marker to assess your work, because honestly I don't care about which language you're using as much as about understanding and implementing it. Python is highly recommended, though, because the support will be in Python; if you ask a question about a different language and one of my demonstrators doesn't know it, I'm not sure I can give you the required support. But it's OK. 4 years ago, I got someone who implemented everything in C, so instead of writing 10 lines of code, he's writing 100, but it's up to you, OK? At the end, if it's running, hopefully everything you implement will be running, and we'll not have a problem; but when stuff starts to get stuck and you need support, you can ask ChatGPT in that case, OK? Any further questions? Yes: how many marks did the example you gave, BetterReads, get? I think they got over 70. Only a few projects get over 70, by the way: those who do something very interesting and check all the boxes. For example, they did live indexing, so it's not just a fixed collection; they actually keep collecting documents all the time. They made a very simple but nice interface. It was super fast. We tried to break it, and it didn't break. For example, I give a query of 500 words. What would you do? You usually design for queries of 2 or 3 words, but we'll go and try it.
OK, let's try it. We'll give it queries in English, Arabic, and Chinese. Would it work? It might not give great results, but it doesn't break. This is the main point: did you build something robust and interesting or not? So maybe one of the group members would be the tester, just testing, testing, testing to be sure everything is fine. But the best approach is: if you're building a component, try to be the tester of a different component, try stuff, and give each other feedback. It's an interesting exercise, and you will appreciate it once you get your first job after graduation. You will see: oh, I lived this before. It was tough at the time, but now I have experienced it. Yes? I don't think everyone has access to the Learn page.

SPEAKER 1
I don't have the Learn page.

SPEAKER 0
You mean Learn? If you are enrolled in the course, you should get access to it automatically. If you enrolled just yesterday, it might take a day, usually within 24 hours. If you didn't get access yet, please contact the ITO, OK? It happens automatically, but I know it sometimes takes 1 or 2 days. Yes? I cannot hear you, I'm sorry. Does this course include training or fine-tuning encoder models? We will tell you about it, and you can do it yourself. That's a good question. We will tell you how this field started and how it has improved over the years, and we will show you the trending stuff, but when it comes to implementing it, it might be a little too advanced for everyone. So we'll tell you about it, and we hope that some of you will include it in your project. You can learn a little more, find a recent paper, and try to use a library for it; we will encourage you to do that, but it will not be required for everyone. You will be implementing only the very basic stuff, the things you cannot pass Text Technologies without knowing. The advanced stuff we will tell you about, and it's up to you later to include it in your project or not. Fair enough. OK. Yes? On Piazza I saw a post about someone who

SPEAKER 1
was looking for teammates, so the person introduced themselves and

SPEAKER 0
said, Oh, people started looking for teammates now.

SPEAKER 1
I was wondering, is that too early?

SPEAKER 0
That's too early. But what happens is, we will probably ask you to start forming a team and submit your team by November, and we will have a functionality on Piazza that allows you to search for teammates. At the end, by January, if some people are left without teammates, they can contact us and we will try to put them in a group. OK, it's too early to discuss it now, but one important piece of advice: diversify the skills. You don't all have to be super software engineers. Maybe someone is better with the interface, the front end; some people are better in the back end; some with setting up the servers; some with data collection. So diversify. It's fine, and it usually leads to a better result. Any more questions? Yes: I want to ask, what's the policy regarding using

SPEAKER 1
GenAI?

SPEAKER 0
So, we have to live with it, OK? For example, while implementing some stuff, you can ask it to implement some functions for you, then read it, understand it, and submit it. I'm OK with that. But the main thing is to understand. Don't run it for the whole thing, probably just for components, and be sure that you understand it; it's not good at implementing a big thing anyway. So you use it to write a function for you, something to sort, something to search, and you get it, read it, and optimize it for your code; that's fine. But remember, you will be marked on this code, so if it did something wrong, you will be penalized for it. OK. So yeah, you should use it. It's a tool, a nice tool. It might help you; for example, in the first lab: write a function for me to read the file and print it back. It should be very straightforward, but there are different ways to do it. Which way did it show you? Is it the most efficient one? Is it the one that works with bigger texts? Would it be sufficient for other languages as well, or just English? The encoding, this kind of tiny stuff. All of this, even if you're using it, it's better to understand before you submit, OK? OK. So we will have a 5 or 10 minute break and then we'll continue with the second lecture. It will be shorter than this one, OK? We'll still continue the introduction, but it's more about the topic itself.
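The encoding point raised above is exactly the kind of "tiny stuff" worth checking in generated code. A minimal sketch of that first-lab exercise (the function name is mine, not part of the lab spec), reading a file and printing it back with an explicit encoding so that non-English text is not garbled:

```python
# Read a text file and print it back.
# Passing encoding="utf-8" explicitly matters for non-English text;
# relying on the platform default encoding can garble it.
def read_and_print(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    print(text)
    return text
```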


SPEAKER 0
So let's start with the second lecture, which is a very basic introductory step, nothing hard, very easy, more discussion, just to get a flavor of what the course is about. It's about defining the main components of a search engine. Because I got this question a lot: this course is mainly about working with a huge amount of textual data and how you process it very efficiently, OK? That is the main point of this course, and this is what we're trying to study, starting with some definitions. The lecture objective is to learn about the main concepts in information retrieval: the document (what is a document?), the information need, the query, the index, and BOW, bag of words, which is one of the most important ideas in IR. So if I'd like to summarize the whole thing in a very abstract way: simply, there will be some user thinking about something, some information they have in mind; they will type a query searching for what they're looking for and submit it to the search engine. The search engine takes this query. The engine has access to trillions of documents; it tries to find exactly what this person is looking for and bring it back as relevant documents, and then hopefully the user will be satisfied. That is the main objective: put a smile on the user's face. That's what we're looking for. OK. Now people are using LLMs a bit more, but it's the same concept again. The only difference is that the user has a query or a question, and will probably ask it not to a search engine but to an LLM like ChatGPT or whatever. And ChatGPT will realize, oh, this is recent stuff, I need to use a search engine, so it passes the query forward to a search engine.
The search engine has access to millions of documents; it finds what is relevant and retrieves it back to ChatGPT: you know, these are the top 10 things I think are relevant. ChatGPT takes this, tries to summarize the answer from the top retrieved documents, and brings it back to the user, so hopefully the user will still be happy. That is simply what is going on. So even if people now use chatbots more, search engines are still in the back, unless of course the search is for something embedded in the knowledge of the LLM itself. So, in the basic form: we have a given query, let's say Q, and we need to find the relevant set of documents. For example, if I search for Donald Trump, I will get some relevant results; my query here is Donald Trump, and my relevant documents will be everything showing on the main page of Google, for example. But there is one important thing here, this part. Can you read it? You cannot, so let's zoom in. This is what I mentioned: it retrieved around 300 million results in 0.79 seconds. It's not because it's using a lot of GPUs; IR was not initially designed to run on GPUs. It runs mainly on normal CPUs, but it has been done with a representation that makes it super efficient. How can we retrieve this huge number of results in less than one second? This is your target for your group project; hopefully you'll learn the skills for it. So there are two objectives we're looking for. One is effectiveness: it managed to retrieve a lot of results, potentially relevant. Of course it's showing me only some of them, but it has the rest in the back end; if I need them, it will continue to bring them up. So it needs to find relevant documents, which is like a needle in a haystack.
They say 300 million, but compared to the internet, we're talking about trillions. So this is really a small piece of information among many, many documents; how can you find it? And this is different from databases, because it's not just: I'm looking for the name field that contains Donald or Donald Trump. No, it's different. It can be in anything: in videos, in images, in websites, in the news, many things. The second important thing is efficiency, super efficiency. I need to find them very quickly, and we're talking about vast quantities of data: hundreds of billions, I think it's now over 1 trillion web pages out there. And it's not only me sitting on Google searching and waiting for just 0.79 seconds. We are talking about around a quarter of a million Google searches per second: 250,000 people clicking search at the same time, and it's achieved super fast for all of them. How can we handle all of this? This is actually one of the things you will be marked on; your project will have two markers, and they will coordinate: let's try to search the engine at the same time and see if it fails or not. Is it ready to handle multiple requests at the same time? And it's not just that; the collection is continuously changing. It's not: OK, I'm indexing this 1 trillion pages and that's it. No, probably 100 billion of them have been updated already. How can you keep the index fresh? And compared to NLP, here we have to be very, very fast. In NLP, yes, I can take a month or two, or a few days, to train my model. Here you cannot; you have to retrieve the stuff very quickly.
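The efficient representation alluded to above is typically an inverted index, which the lecture objectives list as one of the main concepts. A minimal sketch, using made-up toy documents: each term maps to the set of documents containing it, so answering a query is a lookup plus a set intersection rather than a scan of every document.

```python
from collections import defaultdict

# Toy collection (made-up documents for illustration).
docs = {
    1: "donald trump news today",
    2: "edinburgh festival fringe",
    3: "trump speech in edinburgh",
}

# Build an inverted index: term -> set of doc IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Answering a query is a fast lookup plus set intersection,
# instead of scanning every document in the collection.
def search(query):
    postings = [index.get(t, set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("trump edinburgh"))  # → {3}
```

Real engines store sorted posting lists with counts and positions, but this is the core idea behind sub-second retrieval on plain CPUs.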
Yes, at inference time in NLP, in ChatGPT retrieving my answer, the generation tries to be fast, but the training itself is heavy. Here, the question is how you can maintain the system to be super efficient and fast. So the main components: we're talking about mainly 3 components. The documents that we have, which can be web pages; if you're talking about health, scientific papers or patients' records; in a library, books; it can be anything. Queries, which is what I submit to the system. And relevant documents: which part of this big collection is relevant to this query? So let's start with the document. It's important that we agree on some terms and definitions that we will use during the whole course. The document is the element to be retrieved. It's usually unstructured in nature, it usually has a unique ID, and a set of these documents is called our collection. It can be anything: web pages for web search; emails, when you search your email, the document is the email; books, or even not the whole book but a page inside the book, because if you're searching at the page level, then the document is a page, not the whole book. Or it can be at the sentence level, or sometimes at the tweet level if you're on social media. It can be photos, videos, musical pieces, or even code: searching Stack Overflow, for example, it will be a piece of code. It can be answers to questions. It can be a product description or advertisement: if you're searching Amazon or eBay, the document here is a product, not a webpage. It can be in different languages, of course, as we said. It doesn't even have to be words.
For example, if you work in bioinformatics, searching DNA, it's some kind of text, but it's not words, not English, not any language. How can you search and find this stuff? So that is the document: whatever I will be retrieving to the user. It can be anything; it's mainly the element I'm going to retrieve to the user. Then we have the query. We need to distinguish between a query and an information need. The query is defined as free text expressing a user's information need, but the information need is still the focus. Maybe the user expresses it somehow, and later decides, oh, that's not the best expression, and tries to express it better, because the information need is what is inside the user's head. So the same information need can be expressed in multiple queries. For example, there was a hurricane in the US about 3 years ago. I can say: the hurricane in the US. I can say: North Carolina storm, because it happened in North Carolina. Or I can say Florence, because that's what it was called at the time. If I search for any of these 3 terms, I'm looking for exactly the same thing, but it can be expressed in different ways. And sometimes a query can be expressed in the same way but mean different things. For example, if I search for Apple or Jaguar, what exactly am I looking for? The fruit or the product? The car or the animal? So it can differ in both directions. And queries come in different forms. In web search, probably some keywords, some narratives; actually, the new generations are moving more towards writing full questions in web search directly. In image search, it can be keywords or a sample image. For question answering, it can be a question.
For music search, it can be humming, like in Shazam. Sometimes it's filtering or recommendation: Google News will bring you news you are probably interested in; Facebook or Twitter will recommend posts relevant to your interests. You're not searching for anything at that moment, but it recommends posts probably relevant to you. If you know a lot about football, for example, it will give you updated news about what's happening in football, but you didn't search for anything. So your query here is your interest; the system modeled you in a way that keeps retrieving relevant stuff to you. Or it can be scholar search: go to Google Scholar, searching for papers. You can search with the author name, a paper title, or words that appear inside the paper itself. And sometimes it can be advanced: I need something with this word in the title, and it must not contain that word, and so on. We will learn about this when we talk about Boolean search. Then comes relevance, the third component. Is something relevant or not? Does it mean that document D matches query Q, or does it mean that document D is relevant to query Q? Sometimes a document might match, like the example of words that have different meanings, but actually not be relevant. What we are looking for is relevance: is it relevant to what I'm looking for or not? It's a bit tricky because it mostly depends on the user. Will the user like it, will they click on it or not? That can be an indication of relevance. Will it help the user achieve the task they're trying to do, satisfy their information need? And sometimes: is it novel? Because maybe I find 20 things saying the exact same thing to the user.
Should I give them all to the user, or diversify the results a little? Also, what is the topic about? Sometimes D and Q, the document and the query, share similar meanings about the same topic, subject, or issue. This is something you always have to think about when designing your system: how can I find the best match between documents and queries that actually satisfies relevance? Is it on the same topic or not? Some challenges arise here. For example, sometimes it's not very clear: if someone searches for William Shakespeare, are they looking for his biography, maybe from Wikipedia, or mainly for the list of works he created? It's not very clear. And there is ambiguity, like synonymy: if you've been living in Edinburgh for a while, you know what the Edinburgh Festival is, happening in August when it's very crowded and noisy, and you can also call it the Fringe. Those are totally different terms, but they refer to the same thing. And polysemy, like the apple and jaguar examples we mentioned. And sometimes it's very subjective, because if I've got a document for a certain query, the answer might be: yes, it is relevant; no, it is not; or sometimes, yes, it's highly relevant; or, you know what, it's a bit relevant but not that much. How can you handle that? And sometimes, how do you counter search engine optimisation, which we'll talk about later, and spam? Maybe some page has a lot of rubbish terms added just to bring it up in the search engine. How can you avoid this in your system? So: relevant items are similar.
This is the main idea here: usually, if a query is relevant to a document, they are probably sharing the same vocabulary or meaning. The more terms match between my query and the document, the more relevant it probably is. That is the basic idea, which is fairly trivial, and it's not a bad assumption. Sometimes similarity is a string match, finding the same string in the document. Sometimes it's word overlap; it doesn't have to be the exact same form: searching for playing, but the document talks about played. It's not an exact match, but it's the same thing. Or it can be a probability, which is something we will talk about in the 4th week on retrieval models: how we can estimate the probability that a document is relevant to a given query or not. And there is a big difference, we have emphasized this several times, between information retrieval and databases. In databases, what we are retrieving is a highly structured dataset. It's clear: we know this is a name, this is an age, this is an address, and so on. IR is usually unstructured, mostly unstructured. For databases, queries are formally defined: you can say, I'm looking for an age between 18 and 25. It's well defined, like SQL, not ambiguous; you know what you're looking for. Free text is natural language; anyone can write something where it's not very clear exactly what they're looking for. And the results differ too: for databases, you ask for all the entries that have the name Jacques, for example, and you get them. But for information retrieval, we don't know whether a result is actually relevant or not. It can contain the term but not be relevant, or it can lack the term but still be relevant.
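The word-overlap idea above can be sketched as a tiny scoring function. This is a toy illustration of the intuition only, not one of the week-4 retrieval models, and the example query and document are made up:

```python
# Toy word-overlap scorer: count how many distinct query terms
# also appear in the document. More matching terms suggests
# more relevance, which is the basic bag-of-words intuition.
def overlap_score(query, document):
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms)

print(overlap_score("north carolina storm",
                    "Storm Florence hits North Carolina coast"))  # → 3
```

Note this still misses the playing/played case mentioned above; handling that needs stemming, and proper ranking needs the probabilistic retrieval models covered later.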
So the result itself is different. And the interaction with the system: in databases, I search for something, I get the results, I'm done. In search engines in general, it's more interactive. You search for something, you feel the results didn't go well, you're not satisfied, so you go and try writing something different, and hopefully this time you get a better result. It's a trial-and-error process until you find whatever you're looking for. So, how would the retrieval system see the documents in this case? This is something important to know: in IR, we don't see the documents as a sequence of words, which is how we usually read documents. This is what we call the bag-of-words trick. Let me give you an example. Can you read the first sentence here? Do you know what it is talking about?

SPEAKER 1
Yeah, it’s actually this.

SPEAKER 0
It hasn’t been sorted; the words are scrambled. The one in blue is a terrible sentence from an NLP point of view. What does it mean? It’s not English. But you know what it’s talking about; you got it, even though it’s not in order. In the second one, you can find the word ‘French’. Is it talking about France? Who thinks it’s talking about anything related to France? OK, what is it talking about? Yes, french fries. So it can be scrambled in this way, who cares? I have shown you these sentences as bags of words, and this is how search engines see documents. They don’t care about the ordering; they care about what the text is talking about. Even with the terms spread around, I can still understand what they are about, and this is the main point in IR: reordering doesn’t destroy the topic I’m looking for. So individual words are the building blocks, and the bag of words, the composition, gives us the meaning of the topic these words are about. However, some people would argue about this. First, though, note that most search engines use bag of words as the main representation: they treat documents and queries as bags of words. The bag is a multiset with repetitions: how many times each term appeared in the document, and so on. I really don’t care that much about the structure. A retrieval model is then a statistical model that measures how well a given query matches a given document based on this bag of words, and decides what the top results, the most relevant documents, should be. It turns out that bag of words helps a lot here: it gives us a very strong baseline for retrieving relevant documents. Still, there are criticisms, for example: the word meaning is lost without context. But is that actually true? 
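The bag itself, a multiset of term counts, can be sketched with Python’s `collections.Counter`. The sentences are invented for illustration:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as a multiset of lowercased terms with their counts."""
    return Counter(text.lower().split())

# Word order is discarded: a scrambled sentence yields the same bag.
a = bag_of_words("the cat sat on the mat")
b = bag_of_words("on the mat the cat sat")
assert a == b            # same bag despite different ordering
print(a["the"])          # the bag keeps repetition counts, here 2
```

This is exactly why reordering does not destroy the topic: both orderings produce the identical representation.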
Yes, if I don’t have the context, then for a word like ‘jaguar’ I don’t know if it’s about the animal or the company or the car. But bag of words doesn’t really discard context, because you still have the words around it. If I find the word ‘jaguar’ somewhere and the word ‘model’ in a different part of the document, I know it’s about the car. If I find ‘jaguar’ here and ‘forest’ somewhere else, OK, it’s about the animal. So the context is still there; it’s just not in order. Another criticism: it discards the surface form and well-formedness of the sentence. But I really don’t care about that in search engines and information retrieval, and this is a big difference between NLP and IR. NLP cares a lot about the structure of the sentence and how it is formed; for me, what I really care about is what the topic is. There might be another objection: what about negations? One sentence says, ‘No, climate change is real.’ Another says, ‘Climate change is not real.’ They have opposite meanings, correct? With bag of words, the words are just shuffled around and the two sentences look essentially identical. Wouldn’t this be a problem? Who thinks it will be a problem for the search engine? OK. Who thinks it won’t be a problem? If you think it’s a problem, can you tell me why? Yeah, because, as this example shows, different sentences can be the same in terms of bag of words: if we just compare the words, they are the same, but they signify different meanings. Exactly. OK, that’s a good point. But why do you think it’s not a problem?

SPEAKER 1
Because, regardless of the meaning, the document is still relevant, even if the meanings are exactly opposite.

SPEAKER 0
That’s a good point, because actually we are talking about the same topic. If I’m asking whether climate change is real or not, these two documents give opposite answers, but both of them are relevant. So yes, they have totally opposite meanings, but from a search engine’s point of view both are relevant. If I’m searching ‘is climate change real’, it would be amazing to get these two documents. I have a question: what if I want to search specifically for documents that say climate change is not real? If you are looking for that specifically, you can search for an exact phrase, for example using quotes, and we will learn how to implement that; it’s part of your coursework 1. So if you’re searching for a specific sequence of terms, a certain phrase, you can do that. But in the general case, both documents are relevant to climate change being real or not. And this is the main point: we really don’t care about the exact stance, just the topic. What is this text about? When you search something on the web, you’re searching for a topic; set your mindset for this course around that. This is the main point of search engines. Sometimes this doesn’t work as directly for all languages. For Chinese or German, for example, the notion of a word itself is different: you don’t get the same whitespace-delimited words. However, this kind of problem is solved by segmentation or decompounding, so there are workarounds that make it work, which come down to smarter tokenization. We’ll talk next week about how to do tokenization for different languages. So, as a black box, IR takes a query and a set of documents, and you get results: the hits, the relevant documents. 
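The negation discussion can be made concrete: two sentences with the same words collapse to the same bag, and only an exact-phrase check (which is what quoting a query gives you) can tell them apart. A toy sketch, assuming the same crude whitespace tokenization as before; the scrambled variant is artificial, just to show bag equality:

```python
from collections import Counter

def bag(text):
    """Bag-of-words: order is thrown away, only term counts remain."""
    return Counter(text.lower().split())

claim = "climate change is not real"
scrambled = "real climate change is not"   # artificially reordered variant

# Identical bags, even though only one contains the claim as a phrase.
assert bag(claim) == bag(scrambled)

def phrase_match(phrase, text):
    """Exact-phrase check, like putting a query in quotes."""
    return phrase.lower() in text.lower()

assert phrase_match("is not real", claim)
assert not phrase_match("is not real", scrambled)
```

For topical search both variants are equally good hits; phrase matching is the extra machinery you reach for only when the exact wording matters.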
What happens inside is that you take the document and transform it: a transformation function turns it into a bag of words. It can be some richer representation; nowadays we have word embeddings and more advanced things, and we’ll learn more about them later, but the main idea is the bag of words. Then you put it in an index: you store this transformed form in a way that makes it super efficient for storage and retrieval. The query is also transformed into a bag of words, some representation, and you have a comparison function that can match the two very fast and get your results; we will learn how to do that efficiently. The indexing of the documents is what we call the offline part, and the online part is where the user just goes and gets the results directly. The matching function is what we call a retrieval model; in week 4 or 5 we’ll have lectures about different retrieval models and how they retrieve things. So, a system perspective on IR: there is the indexing process, the part where you take all the documents (the web for web search, your emails for Gmail search) and put them in the index. You acquire the data using crawling or feeds or whatever, store the originals if needed (sometimes you don’t have to), then transform the documents into bags of words and index them. Then there is the search part, the online part where the user searches and gets results directly: you assist the user in formulating the query with query suggestion, correction, and so on; you retrieve a set of results and help the user browse or reformulate their query if needed. And you can always log the user’s interactions to learn over time how to make the whole process better. 
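The offline/online split can be sketched with a tiny inverted index: the index is built once (offline), then each query is answered with fast lookups (online). A minimal illustration with invented documents, not production code:

```python
from collections import defaultdict

def tokenize(text):
    """Crude whitespace tokenizer, stand-in for the real transformation step."""
    return text.lower().split()

def build_index(docs):
    """Offline step: map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Online step: return IDs of documents containing every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = [
    "jaguar runs fast in the forest",
    "the new jaguar model",
    "the forest is green",
]
index = build_index(docs)
```

The point of the structure is that `search` never rescans the documents: it only intersects precomputed posting sets, which is what makes the online part fast.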
So, as a block diagram: we have a set of documents, and we need to think about what data we want to use, so think about the source: emails, web pages, images, documents in an enterprise, and so on. Then you do some crawling, collecting these documents somehow, and give each document a unique ID, because this is your unit. On the web it’s simple: the URL is unique. In other places, like your email, it can be some internal ID that Google stores. Then you think about disk space and compression; we’ll talk about this when we get to that part. Then comes the text transformation part, where you think about what transformations will give each term a better representation; the very basic ones are stopping and stemming, and we’ll talk about them more next time. And then a lookup table at the end, which tells me, for each term that can appear in my queries, where I can find it in my collection, so we can get to it very fast. On the search side, you have a user whom you help somehow to formulate their query: it can be nothing at all, or you can give them query suggestions. Then you search the documents, fetch the results you have, present them to the user, and observe how the user behaves with them. Do they click on the first result, or ignore the first three and go to the fourth? Then you learn that those three are not the best results, and you can improve next time. You keep iterating until you get a better-performing system. So this is, in a nutshell, a very abstract view of how things happen in search engines. From next week, we will study each of these components separately. 
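The stopping and stemming mentioned in the transformation step can be sketched like this. The stopword list is tiny and the suffix-stripping ‘stemmer’ is a naive stand-in for a real one (such as Porter’s), just to show the effect:

```python
# Toy text transformation: tokenize, remove stopwords, stem.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}

def crude_stem(term):
    """Naive suffix stripping, a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def transform(text):
    """Tokenize, drop stopwords, stem: the basic indexing transformation."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOPWORDS]

# 'playing' and 'played' collapse to the same term, so a query for one
# matches documents containing the other.
print(transform("playing in the park"))   # ['play', 'park']
```

This is exactly why searching for ‘playing’ can retrieve a document that says ‘played’: both are reduced to the same index term before lookup.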
Next week, I think, we’ll be in the indexing part here; we’ll talk about the text transformation process. We’ll cover that part over a couple of lectures, and then we’ll take each component in turn over the following weeks. So, to summarize this easy, simple introduction lecture, before we start the real science next week: information retrieval is a core technology and a selling point; it’s very fast, which is really important, and it preserves context. It doesn’t have to be the same words in the same sentence order, but the main context is there. Our objective is always to be both effective and efficient, and we’ll keep asking how to make the system more effective and more efficient. We saw the three components, documents, queries, and relevance; the bag-of-words trick; and the search engine architecture, indexing and searching. We will go through each of these from next week. As for resources: besides today’s slides, I mainly recommend that you read chapters 1 and 2 of Information Retrieval in Practice; it covers the basics. The one by Manning is also really amazing stuff and, by the way, very easy to read; it’s not a complicated textbook, it’s a fast read. There is no practical lab this week, so you are free. However, if you are not confident, try Lab 0, just to be sure. If you know that you can easily read a file and print it back, you don’t need to bother with it. And if you have any questions, let me know. Next time we’ll talk about some laws of text and the vector space model, and the skills to learn for next time are how to read text from disk and process it word by word. 
I would also recommend for next week, if you want, watching the video about Zipf’s law in general from Vsauce, if you follow that channel on YouTube. And there are some tools, like regular expressions, that it will be useful to familiarize yourself with. Any questions?

SPEAKER 1
Do we have to learn that?

SPEAKER 0
No, no, you don’t have to. I’m just saying if you’d like to, it’s up to you. OK. OK, so uh thank you and see you next week.