Lecture 4
SPEAKER 0
OK, hi again. Welcome to a new lecture for Text Technologies. So how's it going so far? Good? Are you sure? OK. It's getting more intense every week, so get ready. Before starting the lecture, I have some comments about the labs and coursework. First of all, I was expecting a bit more engagement on Piazza, because currently I don't find a lot of you sharing your results there. This week I had been waiting a long time for anyone to share their results, and I had to make a post: please share your results for Lab 2. Thankfully, 3 or 4 of you shared results and they are aligning, which is good, but I'm encouraging all of you: once you get a lab, try to implement it, find your results, and share them with everyone so you can compare, because I'm not sharing the exact expected output for the lab. I need you to share your results and compare with each other. This is really important within the course. There was an interesting question about phrase search and proximity search when you apply stop word removal. Someone asked: if the document has "course on IR", but the phrase query is "course IR", shall a phrase search match it or not? Honestly, it's again up to your design. For the coursework and the lab, we're asking you to remove the stop words, apply stemming, and then do the phrase search, so in this case it should match. In real life, in your group project, for example, you can decide: even if you do stop word removal, you can keep the original positions of the terms. However, in our existing lab and coursework, we are telling you: no, remove the stop words and then start counting, so "course" and "IR" will come next to each other. In your own application, it's up to you. You decide, OK? This is just for clarification. Remember, you are designing all of this; according to your understanding of the task, you design the system. For the coursework, we're trying to make things as simple as possible, just to get your hands dirty and run something. But in the group project, this is where we need to see your choices. Again, on the lab discussions, please share your results more and discuss them. I can see some of you mentioned there is a little bit of variation on the results for some of the queries, and from your discussion it's probably because of tokenization, or because of how you handled things like apostrophes and special terms: some of you removed them, some of you kept them. All of these are valid variations. I know you might be worried: what about the coursework, which one will be counted? The good thing is that for the coursework we implemented our auto marker to take account of all of these choices. All the valid choices will be accepted. So don't worry if you made a small choice that is different from someone else; the auto marker will be able to match it, so it shouldn't be a problem, OK? Coursework 1 was announced last week, and I shared the full details on the website. This week we have Lab 3, which you will be able to implement after this lecture.
Once you do it, you can basically finish your coursework. Honestly, you can apply it with the existing queries that have been released with the labs, and then you can write your report. Get everything ready, because the final set, which will be released next week, will just be a different set of queries and documents, but in the end the findings should be the same, so you can write your report already. And the main important thing: next week, on Thursday I think, we will release the test collection and the test queries to be submitted with the coursework. All you need to do then is index the new collection and run the queries, that's it, and then submit the files. The report should be ready before that, so it shouldn't be a problem. But once we release the test set on Thursday next week, you are not allowed to ask any questions about the coursework on Piazza anymore. So it's better to start asking now, and compare your results on the labs now. Once we release it, there will be only Friday, Saturday, and Sunday to submit your coursework, and during these 3 days, please don't ask any questions anymore. This is the time to ask, now. Nothing will be different; it will just be a new set of queries and documents. So ask all the questions you have, and have all the discussions you need, now. If there is something urgent after the test set is released on Thursday, OK, you can ask on Piazza privately. But don't come and say, "actually, I'm getting only 5 documents for query 5, is that OK?" That is totally not allowed. Don't ask questions related to the coursework once the collections are released. But for Lab 2 or 3, say "these are my full results, do they match everyone else's or not?" Take your time and enjoy these days before the coursework test set is released. Another important point for your coursework, and actually for your labs and in general: the search module should be standalone. Don't build your search so that for every query you go and read all the documents, index them, create the index, and then start doing the search. No, the indexing should be separate, done offline, and you save your index in a text file. I know some of you asked about using a binary file; it would be more efficient, but for the sake of the coursework and automatic marking, we need it in a text format. In the search module, all you need to do is load this index file, without reading the documents again, apply the queries, and tell us what the results are. Is everything clear about this? OK, good.
SPEAKER 1
Yes, so for the saving, does it have to be text? Can I save it as a pickle file, is that OK, or does it have to be a text file?
SPEAKER 0
It's in the coursework description; it has the full exact format. Follow the exact format. Honestly, if you think that's not efficient and you'd like to also save it in a binary file, that's OK, but again, save another text version for us, because this is one of the things you need to submit. The automatic marker will go and read this one and automatically check whether it's correct or not, so we still need the text file, OK? OK, good. So, this lecture is about ranked information retrieval: learning about TF-IDF, the vector space model, and the SMART notation, and you're going to implement TF-IDF, OK? Which is part of the coursework. Who has heard about TF-IDF before? I was expecting more, but that's fine; we'll learn more about this today. So far, from last week, we learned how to implement an index and apply search, but it's just boolean retrieval: all we do for a query is say whether a document matches the query or not. That's it. However, this is how some expert systems actually work. If you learn about how they do patent search in the patent office, they are still using boolean queries, by the way, because it's very advanced searching; when they get a new patent application claiming a new innovation, a large fraction, something like 30%, turn out to have prior patents already in the field. So they go and search with an expert system over all the word combinations. They can search for something like "car OR vehicle AND motor OR engine AND NOT cooler", for example; they can have this kind of complicated set of queries. However, for general purposes, it's not good for the majority of users. None of us uses this on a normal basis. Maybe you go on Google Scholar at some point and write this AND this OR this. Sometimes it happens, but generally when you're doing web search, social search, any kind of search, it's probably just a free-form query. Most users in general don't know how to write this kind of boolean query, and the other thing is that boolean queries only tell us whether something matches or not; you just retrieve a full set without any ranking, and sometimes it will be a very long list. We will not go and read all of these documents to find which is the relevant one. Actually, do you know what is the most unused web search feature? Yes: "Next", at the end of the page. Mostly no one uses it, only a very small, tiny fraction of users. If you search and you don't find your result in the top 3, sometimes you scroll down, not everyone, to see the top 10 results. Most users will just retype their query before actually going to the next page. So we expect results to be ranked. Ranked retrieval simply means the query can now be free text, and the results will be ranked with respect to this query. Even if you have a large set of matching documents, it can be millions, it's OK, because in the end I will show the user the top 10 results; we don't want to overwhelm the user with many results. Even now with RAG systems and LLMs, ChatGPT, Gemini, and so on, the system goes and retrieves the top 5 and tries to extract a summary from them, so you don't go and check everything yourself. So ranking is really important and essential for everything, including the LLMs that we are using now, of course, for RAG applications.
So the criterion here is that top-ranked documents are the most likely to be relevant to the user, to satisfy the query that they posted, and the score tries to capture how well a document matches the query. This is what we will try to learn in this lecture: how to create such a score. Before, it was only 0 or 1: 0 doesn't match, 1 matches. Now we should have a score. Remember our old example from last lecture, "ink wink". Last time it was "ink AND wink", or "ink wink" as a phrase, and the boolean search will tell you only 0 or 1, that's it. But now we can get a ranked list of results when I search with just free text like "ink wink". And it will be a function not just of existence, but of the term frequency, the document frequency, and the length, and usually it will be a score, let's say from 0 to 1. So the results now will not be 0 or 1; they can be values like these, and from these numbers I can rank which document is more relevant. This is our objective. What could be a possible way to match two strings, a document and a query, or two strings in general? One of them is the Jaccard coefficient. Anyone here heard about the Jaccard coefficient, knows what it is? No one heard about the Jaccard coefficient before? OK, that's strange. Simply, it tells us what is common between two sets. Imagine we have set A and set B, which can be document 1 and document 2, or query 1 and a document. What is the overlap between the unique terms they have, divided by everything else? The intersection over the union, very simple. If it's 1, they fully match. If it's 0, there is no overlap at all. So remember these two documents from last time: "he likes to wink, he likes to drink", and the other one, "he likes to drink and drink and drink". If you take the union here, all the different vocabulary terms appearing in these documents, and this is the intersection, then the Jaccard similarity is how many terms appear in common divided by the size of the union, which would be 4 over 6 in this case. OK, so this is one way to calculate the overlap between two documents. Let me ask you, what could be the problem with this kind of measure? Do you think there are any issues with it? It's still OK; many people still use it in many applications, but why might it not be the best thing here? Yes, that's an excellent point: term frequencies are not considered. Whether a document has a term 10 times or 1 time, it doesn't take this into consideration; you just say it appears or not. Yes, it treats all the terms with equal weight.
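As a rough sketch of the Jaccard coefficient just described, assuming simple whitespace tokenisation and lowercasing (the function name is just for illustration):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: |intersection| / |union| of the unique terms."""
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    union = terms_a | terms_b
    if not union:              # both strings empty: define the similarity as 0
        return 0.0
    return len(terms_a & terms_b) / len(union)

# A short query against a longer document: only term overlap counts,
# term frequency and term rarity are ignored.
print(jaccard("ink wink", "he likes to wink he likes to drink"))   # 1/6
```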
SPEAKER 2
So you're treating "this" as being as important as, for example, "universe".
SPEAKER 0
That's an excellent... have you seen my slides? Maybe? That's an excellent point, and this is exactly it: it treats all the terms equally. So whether a term is rare in the collection, whether a term is important or not, it's not sensitive to this. Take "he likes to drink", for example: shall I give "to" the same weight as "drink", or should they be treated differently? Any other ideas? What else could be an issue? Yes: there is no ordering, but remember we're doing bag of words, so that one is common to all the methods. What else? Another thing, if you think about it: this is a match not between a document and a document, but between a document and a query. The document can be 1000 words; the query is usually 2 or 3 words. The difference in lengths is huge, so taking the intersection between 3 terms and 1000 terms is maybe not the best way to do it. Which brings us to how we should score our documents and terms. So let's go through an example. I know this might take longer than the usual way of explaining it, but I'm trying to make you understand the fundamentals. Imagine that we have 5 documents. Each of these buckets is a document, and the terms are the balls we have here, OK? And someone came and searched for this query. This is my query, OK? It's not a traffic light, it's a query. So if I ask you: what do you think is the least relevant document here, and what do you think is the most relevant one? Let's do this exercise together. Let's start with the least relevant document. What do you think would be the least relevant document among these 5? Who agrees it's 2? OK, that's pretty much everyone. The other question: which do you think is the most relevant document here? Who thinks it's 4? Almost the majority. OK, so we agree that the least relevant document is 2 and the most relevant document is 4. But what if the yellow ball refers to words like "the" and "and"? We'll discuss it, but remember our choices so far: the least relevant document is 2, the most relevant document is 4. Let's take some examples. So, TF-IDF tries to solve this problem. It stands for term frequency, inverse document frequency, so we have two components here. One is the term frequency, which is simply how many times a term appears in a given document. So if a term appears 3 times, like in "he likes to drink and drink and drink", the term frequency of "drink" here will be 3. It tells us how many times the term appeared in a document, the idea being that if the term appears more times in this document, probably this document is more relevant to this term. Makes sense: if a term appears in one document once and in another document 20 times, probably the one with 20 occurrences is more relevant. The other component is document frequency, which is the number of documents that contain this term, regardless of how many times it appears inside each document. Imagine we have 5 documents and this term appears in only 2 of them; the document frequency will be 2.
The document frequency indicates something interesting, because if a term appears in many, many documents, it means this term is very common in the language; it's not that important. While if a term appears in only a few documents, it becomes: oh, probably this is an interesting, rare term, so it should have a higher value. For example, the word "the" would probably appear in most of the documents we have; in the Financial Times collection it probably appears in all of them. So does that mean it's an important term? Probably not. Someone asked a question on Piazza: shall I treat a term like "FT" as a stop word? I don't know who asked it, but it's a very good question, because the point was that it appears in all the documents, so it could be a stop word. But as we said, you don't have to filter it out, because we say stick to the given stop words list. If you'd like to take it out, it's OK; it doesn't matter much, because it appears in all the documents in this collection. Maybe "FT" in web search is important, because it doesn't appear everywhere on the web and it shows you are looking for the Financial Times, but in a collection of articles by the Financial Times it appears in all the documents. Who cares about it? It's not important; it doesn't make any document special here. All documents are the same with respect to this term. Is it clear so far? OK. And there is actually a difference between document frequency and collection frequency, because document frequency is not the collection frequency. Document frequency tells me how many documents contain this term, so this number can never be higher than the collection size. If I have 1 million documents in my collection, the document frequency can never be more than 1 million, because it counts how many documents contain this term, and my collection is only 1 million. The collection frequency, on the other hand, which is similar to the one we used in the first lab, how many times a term appeared in the Wikipedia abstracts, counts the total occurrences: if a term appears 5 times in one document and 6 times in another, the total here would be 11. But the document frequency will be only 2, because it appears in only two documents. This is an important difference. DF is the one mostly used in IR techniques. Collection frequency is still used in other applications, and we will discuss it in the second lecture, but for most IR applications, and many other applications as well, document frequency is the important quantity here. And the inverse document frequency is simply the opposite, the inverse, of this. We said with document frequency that if a term appears in many documents, it's not important, and if it appears in a small number of documents, it's more important. So taking the inverse means that when this IDF value increases, the term is more important: it's rarer, so it's more important. Here is an example of the difference between document frequency and collection frequency from our running example: the word "drink" appears in 5 documents but appears 7 times in total, so there is a difference. The word "likes" appears in 5 documents but appears 6 times, so there's a difference there too.
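To make the document frequency versus collection frequency distinction concrete, here is a tiny sketch on two toy documents (not the lab collection):

```python
from collections import Counter

docs = [
    "he likes to wink he likes to drink",
    "he likes to drink and drink and drink",
]
tokenised = [d.split() for d in docs]

# Collection frequency: total occurrences of each term across all documents.
cf = Counter(term for doc in tokenised for term in doc)

# Document frequency: number of documents containing the term at least once.
df = Counter(term for doc in tokenised for term in set(doc))

print(cf["drink"], df["drink"])   # 4 occurrences in total, but only 2 documents
```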
When we say document frequency, it's a number of documents; it can never be higher than the size of my collection. So the IDF formula: this is a very basic formula for IDF. There are other variations, but let's use the basic one. It's simply the log base 10 of the collection size divided by the number of documents this term appears in. The log scale here tries to dampen the differences, because if one term appears in 100,000 documents and another appears in 10,000 documents, I will not say one is 10 times more important than the other; this one is just twice as important as the other. Let's take an example. Imagine I have a collection of 1 million documents and the word "Calpurnia" appears in only one document. So what is the log of 1 million divided by 1? It is 6. Another term, "animal", appears in 100 documents. What is the IDF in this case? 1 million divided by 100 is 10,000, and the log of 10,000 is 4. And a word like "the" appears in all the documents, all 1 million, so 1 million divided by 1 million is 1, and log of 1 is 0. So this tells us that "the" is a word that has no weight; it's useless. Remember when we talked about stop word removal with ranking: it doesn't matter much, because if a term appears a lot, it will get a low weight anyway. Then "under" appears in 100,000 documents, so its IDF is 1, and the word "fly" is twice as important as "under". The word "Calpurnia" is 6 times as important as "under". So if I'm searching for "under Calpurnia", for example, and I find one document that contains the word "Calpurnia" a few times and another document that contains "under" many times, I would still rate "Calpurnia" as much more important than "under". So far, so clear? OK, let's see. TF-IDF term weighting is one of the best-known term weighting schemes in IR, and I would say in many text applications in general. The weight increases as the number of occurrences of a term within a document increases, but it also increases when the term is rare, when it doesn't appear in a lot of documents. When you combine the TF and the IDF together, this is how you compute it. For the TF, the weight we're going to use is 1 plus the log of the TF of the term. So if a term appears once, log of 1 is 0, plus 1 is 1. If a term appears in another document 10 times, it will be log of 10, which is 1, plus 1, which is 2, so I start to get twice the importance for that document. And this is multiplied by the IDF, the log of the collection size divided by the document frequency. For a query that has multiple terms, a query of 2 or 3 words, I simply calculate the TF-IDF for each of the terms appearing in the query and then do the summation. OK, clear so far? We will find out. Remember our question? I will ask you again: what is the least relevant document here, based on what we have discussed? 2, yeah; who thinks it's 2? Still most of you think it's 2. Yes, 2 would be the least relevant one. What is the next least relevant document? 1, yes? Now I will ask you: what is the most relevant document here? Remember, you said it was document 4. Who thinks it's still document 4? What else? Why? Why document 5?
SPEAKER 1
The yellow one appears in basically every document, so green is very important and so is red; document 4 just has more yellow.
SPEAKER 0
You got it. So simply, the most relevant is document 5, because the yellow ball is really not important. It's like searching for "the destructive storm": if I find a document that contains "the" a lot, I don't care, but if I find a document containing "destructive" and "storm" more often, that is more important. Now you understand what TF-IDF is and the meaning behind it, OK? So don't get fooled by useless terms appearing many times in a document; what's important here are the rarer terms. For example, if one of the queries I gave you in the collection I released last week contained the word "FT": if I find a document that contains "FT" many times, who cares? It's useless; it appears in all of them, so it's not an important term. Fair enough? Good. So now, if we map this to the term-document matrix we had before: remember, before we put whether a term appears or not, 1 or 0, or sometimes the term frequency, how many times a term appears in this document. Now what we can put in each cell is the TF-IDF score: the value for a term appearing in a document would be its TF-IDF, not 1 or 2 or 3 or 0 as before. And this brings us to what's called the vector space model. It's based on representing things as vectors, which we discussed last time. Now I represent the query as a vector and the document as a vector, and all I need to do is measure how well they align with each other. The values of these vectors are not just the term frequency, or presence and absence; they will be the TF-IDF values. One thing I could use is the Euclidean distance, the distance between the document and the query. However, the problem is that if I did that, it would always be large, because documents are much longer than queries: the queries are short, the documents are long. So measuring the distance itself between q and d1, q and d2, q and d3 will not give us a good indication. What we really care about is not the distance but the angle between the vectors. We are saying: regardless of the length of the vectors, because documents are expected to be longer, let's measure the angle between them, which is what we call the cosine similarity, the cosine of the angle between the document vector and the query vector. As long as the angle between them is small, the cosine similarity will be high; the cosine of zero is 1, which means they are fully aligned. However, the problem again is how we measure this when one document is long and another is short. This is where length normalization comes in. You can normalize a vector by dividing by its norm: compute the norm of the vector from the values inside it, and then divide by this norm to get the normalized version of the vector. To give you an example, imagine document one, which has 3 terms: the first term appears once, the second appears 3 times, and the third one appears 2 times.
I can measure its length, or norm, by calculating this formula: the square root of the sum of the squares of all the values. Then I divide the values by this norm, and I get something like this. Now suppose I have another document which is longer, 3 times longer, but very similar; it's as if this document were appended 3 times, so instead of 1, 3, 2 the counts are 3, 9, 6. The norm of this one will be about 11. But if I normalize both of them, I get exactly the same values. So now, regardless of the size of the document, I can get a normalized version of it. And from there, I can simply measure the cosine similarity, which is the dot product of the normalized vectors, which will be something like this. OK. So all you need to do is get the normalized values and then take the dot product, and now you have the cosine similarity between two vectors. This is the algorithm; you can read it quickly, but that's the general idea. There are different ways to calculate these components. For example, for the term frequency, I can use the natural count, how many times the term appeared, 25 times, whatever it is; or I can use the log value, which is 1 plus log of TF; or there is what's called the augmented value, which is another formula; or boolean, which is simply whether it appears or not, so if it appears 10 times it's still 1 and if it doesn't appear at all it's 0, which is similar to what you did in the boolean search; or the log average, something like this. For the document frequency, I can ignore it totally, so it is always 1 and all terms are treated the same; or I can use the IDF, which is the log value we saw; or I can use what's called prob IDF, which is another formula, something like this. For normalization, I can apply normalization to the document and the query, or ignore it totally, or use cosine normalization, and there are other ways to do it. This is what we call the SMART notation: all the different variations of how you calculate the TF-IDF, and all you need to do is decide which combination to use. What is very common is the following: for the documents, you calculate 1 plus the log of the TF, you don't apply any IDF there, and you apply cosine normalization. For the queries, you again apply 1 plus the log of the TF, you calculate the IDF of the term, and then you apply cosine normalization. This is how you can calculate the TF-IDF, and this combination is very common. To make it even easier for your coursework and lab, what we ask you to do is simply take the TF from the document, apply the IDF just once, on the query side, and not apply any normalization; you don't calculate the cosine similarity here. So simply, in other words: if you've got a query of 4 terms, you need to extract all the documents that contain any of these terms. It's like an OR operation from the boolean search: all the documents that contain at least one of them.
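A small sketch of the length normalisation and cosine similarity just described, using the (1, 3, 2) and (3, 9, 6) example vectors:

```python
import math

def normalise(vec):
    """Divide a vector by its Euclidean norm (length normalisation)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(v1, v2):
    """Cosine similarity = dot product of the two normalised vectors."""
    return sum(x * y for x, y in zip(normalise(v1), normalise(v2)))

d1 = [1, 3, 2]   # norm is sqrt(14), about 3.74
d2 = [3, 9, 6]   # three times longer, norm is about 11.2
print(normalise(d1))      # identical to normalise(d2)
print(cosine(d1, d2))     # 1.0 (up to floating point): same direction, perfect match
```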
Then, for each of these terms, all you need to do is calculate 1 plus the log of the TF of this term in each document, multiplied by the IDF. Then, if the query has 3 or 4 terms, you do the summation. This is the formula you need to implement for your lab and coursework; it's a very simple way to do it, OK? When you calculate this, the scores will not be between 0 and 1, because we are not applying normalization here. So one document will get 3, another will get 6; that's OK, and then you do the ranking: the one with the highest score is ranked at the top, the one with the lowest score is ranked after that. If a document doesn't contain any of the terms, its score is zero and it won't appear at all. Is that clear? OK. So the summary of the steps: represent the query as a weighted TF-IDF vector, represent each document as a weighted TF-IDF vector, then compute the cosine similarity (that's the general recipe), rank the documents with respect to the query by score, and return the top 10 documents that achieve the highest scores. The retrieval output for a given query is a list of ranked documents. The possible format, and I think this is the format required for your coursework and the lab: you put the query ID, then the ID of the most relevant document, the one that achieved the highest score, and then the score it achieved, and so on down the ranking. So you output the ID of the query, the ID of the document, then the score, sorted by score. This is what we expect from you in your coursework and lab. The resources for this lecture are the Introduction to Information Retrieval textbook, sections 6.2 to 6.4, and IR in Practice, where the whole of chapter 7 is relevant. Do you have any questions about this? So Lab 3 is the most important thing; this is how you practice. It builds on what you did last week: you don't throw that away, you keep it. You will do an OR operation for a query having, say, 4 terms, after you apply pre-processing, and this is very important: apply pre-processing to the query, stopping and stemming. Then, for each of the terms, extract its postings from your index. You have the document frequency saved, you can measure the term frequency from how many times the term appears in a document using its positions, and you can calculate the formula. Do the summation over the whole query, and then you have the score for document X. If document X achieved 0.5 and document Y achieved 3.5, then document Y is more relevant than document X, and you print them in that order, OK? This is the implementation you need to do for this lab. We will take a break for about 5 to 10 minutes, and then we continue with lecture two. Yes, you have a question?
SPEAKER 1
Yes, in this lab, for calculating the score, we'll call a function for it; can we also have boolean-type queries with the ranked search?
SPEAKER 0
No, there is no boolean here; it's kind of embedded, because we assume the operation here is OR between all the terms.
SPEAKER 1
I'm asking whether they can be mixed.
SPEAKER 0
No, no, there is no mix between ranking and boolean. Boolean is separate from this; it's a totally different query type. So no, you will not find this format with AND and OR; it will be just free text, OK? Yes, Christian?
SPEAKER 1
And in this formula, will you give a query with a term that does not appear?
SPEAKER 0
It might. In that case, yes, you have to handle it. Yeah, that's a good question: what if the query asks for a term that doesn't exist in the collection? If you just compute with a DF of zero, it will throw an exception, so handle that in your code; make sure the term exists first. But the good thing is, before going term by term, you will retrieve the posting list of that term, say XYZ, which doesn't exist. When you try to retrieve the posting list and nothing comes back, there is nothing to calculate, so you move to the next term. You see? So it is handled almost automatically: you loop over all the terms in the query, for each term you extract the posting list, do the calculations, and add to the summation. If one of the terms didn't appear at all, when you try to extract its posting list nothing is retrieved, so you go to the next term directly. Clear enough? Yes: if the term doesn't exist in the query, of course, there is nothing to do; but if a term in the query doesn't exist in the documents, when you retrieve the posting list nothing will be retrieved, so you go to the next one; you don't need to calculate anything. Yes?
SPEAKER 1
Following up on this question, will we need to retrieve some documents in that case?
SPEAKER 0
No. For example, if I give such a query, yes, it will be an empty list; nothing will be printed. Imagine I give you a query where none of the terms exists anywhere in the collection: nothing is retrieved. But if, say, the query has 10 terms and only 5 of them appear in the collection, then your scores will be based on the 5 terms that appeared. You'll get some results unless all 10 terms, none of them, are in the collection; then nothing will be retrieved. OK, we'll take a break for 10 minutes. I will show the mind teaser in a second. OK, so this is the mind teaser of today. It is one of the easiest problems, so hopefully you will get an answer quickly. If you've got an answer, raise your hand and shout: what is the answer? OK. Yeah. OK, so I got some interesting questions during the break. One was about how you should process the query for ranked search. I will just explain my thinking about how you can do it, but you can still do your own implementation. The main idea: let's assume I got a query of two words, just a free-form query, and I did the OR operation and found that there are 10 documents containing at least one of these terms. So what I do now is go one by one. For the first term in the query and the first document that contains at least one of these terms, I check how many times this term appears. If this term doesn't appear in this document, I skip it and go to the second term of the query. Say I checked it and calculated the TF-IDF, and it turned out to be 2; then the score for this query and document comes out as 2. I go to the second document. I check the first term: it appears in this document, so I calculate the TF-IDF and it turns out to be 1. And the second term, I find it also exists in this document, and the TF-IDF score for this term and document is 1 again. So the score here is 2. At the end, all I need to print is: for query 1, document 1, the score is 2; for query 1, document 2, the score is 2; and I keep doing this until I finish all the documents. So you start with the boolean query as an OR operation, getting all the documents that contain at least one of the terms. Then you calculate the TF-IDF for each single query term in each document and do the summation per document at the end. Is that clear? This is how I recommend you implement it. If you have a different way to implement it, that's fine, but this is what I'm expecting, OK? And then you print the score, summed over all the query terms appearing in a document, for the given query. OK? OK, in this hour we will continue talking about ranked retrieval. In the last lecture we spoke about the very basic ranked retrieval, TF-IDF, and hopefully you understood the notion of it. Yes, there are different implementations of TF-IDF, but in general that is the very basic form. In this one we will learn more about probabilistic models, and we'll talk about one of the most famous formulas, which is BM25. Anyone heard about BM25 before? One, OK. It's very famous, but if most of you didn't know TF-IDF, probably you will not know BM25. And the second topic is how you can use another representation of documents, which is language modeling for IR: how we can treat a document as a language model.
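Before moving on, here is a rough sketch of the ranked retrieval procedure just walked through: an implicit OR over the query terms, then the (1 + log10 tf) x log10(N/df) weight summed per document. The toy index layout below is only an illustration; in the lab you would derive the term frequencies from your positional inverted index, and the output must follow the exact coursework format.

```python
import math
from collections import defaultdict

# Toy inverted index: term -> {doc_id: term frequency}.
# In the lab, tf would come from the positions stored in your positional index.
index = {
    "ink":  {"d1": 2, "d3": 1},
    "wink": {"d1": 1},
}
N = 3  # number of documents in this toy collection

def ranked_search(query_terms, index, N):
    """Sum of (1 + log10 tf) * log10(N / df) over the query terms, per document."""
    scores = defaultdict(float)
    for term in query_terms:          # implicit OR over the query terms
        postings = index.get(term)
        if not postings:              # term not in the collection: nothing to add
            continue
        idf = math.log10(N / len(postings))   # df = number of documents containing it
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log10(tf)) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Query terms would already be pre-processed (tokenised, stopped, stemmed).
for doc_id, score in ranked_search(["ink", "wink"], index, N):
    print("q1", doc_id, round(score, 4))   # query ID, doc ID, score, highest first
```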
So, remember from the last lecture, we have the vector space model and TF-IDF term weighting, which combines the TF and the IDF in this way, and you do the summation over the query terms. What we can notice is that a term appearing more often in a document gets a higher weight, and the first occurrence is the most important; this is why we use the log. If I don't find the term, the weight is 0, but once I find the term, I get 1. And since we're using log base 10, to move from a score of 1 to 2 I have to find the term 10 times, and to reach a score of 3, I need 100 occurrences. So the big jump happens when I find the first occurrence; after that, I have to find many more occurrences to get a better score. The rare terms are more important: this is what the IDF does. It tells me that if a term appears in many documents, it's really not important; even if I keep the stop words and compute everything, they will not get a high weight anyway. But if a term is rare, I start to get some score. And there is a little bit of bias toward longer documents in this scheme. Do you know why? Why do you think there is a bias toward longer documents? Exactly: if one document is only 100 words and another document is 10,000 words, and both are relevant, probably the 10,000-word one will have a higher TF for the terms I'm looking for, so it will be ranked higher. So there is a bit of bias here: if you are ranking, in the same collection, books together with tweets, it's a bit unfair for the tweets, because their term frequencies will always be low, while for books they are expected to be higher. So the question is: can we do something better? This is what we're trying to see here. So, information retrieval models: we started with the vector space model, which is very heuristic in nature. There is no notion, so far, of what relevance is, of what it means for something to be relevant. However, it still works very well, because we assume that if a term appears more in a document, the document is presumably more relevant. Any weighting scheme and similarity measure can be used; we saw different ways, and the components are not very interpretable. There is no guide about what to try next; it's more engineering, so we keep trying. Remember the SMART notation: there are different ways to do the normalization, to calculate the TF or the IDF, and you can keep trying and see which one achieves better results. However, it's still very popular. It's a very basic model, it's a very strong baseline, and if you'd like to build a system very quickly, this is something you can do in a few seconds, and it works well. Then there is another model, which is called the probabilistic model for information retrieval. This one tries to use a mathematical formulation of what is relevant and what is not relevant. It explicitly defines probabilities, some variables, for relevance, query, and document, so we introduce relevance explicitly here. And it is specific about what their values are, states the assumptions behind each step, and makes clear what could be contradictory in those assumptions. So for the probabilistic model, the main concept is that uncertainty is an inherent part of IR, because when I get a document, I'm not sure whether this document is going to be relevant or not.
So if you think about how to model something that has uncertainty, probability theory is a good theory to use. One of the people who pioneered this field, since the 1970s, is a very well-known mathematician, Stephen Robertson. He was a professor at the University of Cambridge, but also a scientist at Microsoft; he was working at Microsoft Research in Cambridge. Actually, when I interviewed at Microsoft, he was the host for my interview at that time. He created this theory in 1977, quite long ago, this probabilistic theory for search. He's originally a mathematician, but he applied it to document retrieval, and in 1977 collections would be 100 documents, 200 documents; interestingly, his theory still holds up today. This is actually still what we apply. So this is the theory he stated. It's formulated a little formally, but let's try to understand it word by word. He says: if a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance — we try to rank the documents so that the document with higher probability of relevance is at the top — to the user who submitted the request, where the probabilities are estimated as accurately as possible — this is the objective, of course — on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data. And this is the main basis of the probabilistic approach in IR. I know it's maybe not easy to read it this way, but it has been a big science for years; people have kept developing this, and there are mathematical derivations behind it. So if you'd like to read a paper with 100 pages of derivations of some formulas, you can read the work by Stephen. It's really nice. OK, so the formulation of probabilistic relevance theory here: what we are trying to do is rank documents by probability of relevance, so that the probability of relevance of the document ranked at number 1 is higher than that of the one at rank 2, which in turn is higher than the one at rank 3, and so on. And we try to estimate this probability as accurately as possible. Imagine there is a true value for this probability; what we're trying to do is produce an estimate almost equal to that true value. And we need to estimate this probability based on whatever data is available. Currently, what we are working with is simply the query and the document, nothing more. However, in real life, especially in web applications and many others, you can have more inputs: the document itself, which is what we're working with, but also the session. What is the session? What the user has been searching for in the last 10 minutes; you can make a better guess when you have this session. The context: are they searching from Edinburgh University or from a cafe, are they in the UK, in Scotland, or in the US or in Africa? This gives you some information. The user profile.
Do you have some information about the user, their profile, their interests? You can learn more from that. The more information you have, if you can integrate it into the probability estimation, the better. And the best possible accuracy can be achieved with the data: if we manage to do that, this is a perfect IR system. But the question is, is it really doable? How do we estimate this probability of relevance in a way that gives us a very good ranking for the documents? So the main concept here is to imagine information retrieval as a classification problem. You have a query, and you have a probability that a document is relevant to this query, and a large number of documents that are not relevant. The probability that a document is relevant plus the probability that it is not relevant should equal one; that covers all the documents we have. And a document is considered relevant if its probability of being relevant is higher than its probability of being not relevant. It's a bit of a theoretical discussion, but we'll see how this can take us to derive another formula after a while. So what is this crude probability here, the probability of relevance given a document, a query, a session, a user, and so on? The question is: doesn't relevance in this case depend on user opinion? The user decides whether a document is relevant or not; you decide. So what is this probability? And the problem is that search algorithms cannot look into your head. It's based on the query you wrote, but you might have something different in your head; how shall we know? So relevance depends on factors that the algorithm cannot see. However, there are some experiments: there was a nice paper a few years ago, 9 years ago, which actually got a best paper award, where they connected users to fMRI to see the reactions in their brains when they find relevant documents versus non-relevant documents. However, this cannot be applied at scale, so, who knows what will happen in the future, but at the moment what we mainly see is just the query. By the way, if you're interested, I think in the Informatics Forum there are labs with fMRI and eye-tracking equipment and different devices. So if you'd like to do some project on this, you might think about using these labs; we have some devices to measure these kinds of things if you have a crazy idea you want to try. And different users may disagree on the relevance of the same document. You have the same query, the same document, but if I ask a set of users, all of you, "do you think this document is relevant to this query?", I'm not expecting all of you to say yes or all of you to say no; it will be something like half and half, or 70/30, and so on. So the true probability for a given document and query is the proportion of all unseen users, in different contexts and with different tasks, who would judge this document relevant for the given query. So what we are trying to see is: OK, for this document and this query, if I could ask everyone in the world, do you think it's relevant or not?
In this context, and what about a different context, a different time of the year, a different location? If you could measure all of this, that would be the true probability, but it's practically impossible to do. If you could, it becomes similar to asking: what is the probability of getting a 6 with a die, given that the outcome is even and not a square? As you put in more restrictions, giving more information about the context, the date, the task, different things, you get a better probability estimate. So what is the probability of getting a 6 with a die in general? 1 in 6. What is the probability of a 6 if we know the outcome is even? 1 in 3. And if we also know it's not a square? 1 in 2. So your probability estimate improves every time you get more information. This is what we are trying to do with search here, OK? Which brings us to something Stephen created with his team, which is called BM25, the Okapi BM25 model. It's based on the probabilistic theory. I will not go into the details; there is a full paper with derivations, but take it from me here that a document D is relevant if its probability of relevance is higher than its probability of non-relevance. It's an extension of the binary independence model: documents are represented by a vector of binary features indicating term occurrence, and it assumes term independence, so it's still a bag of words; we are still in the bag-of-words world. And this came out in 1995. He produced this formula, and people had been doing search and trying to improve things for years using TF-IDF with different notations and so on. Then BM25 came out and beat every single system, achieving the best results on everything they tested, on different tasks. It became a very popular and effective ranking algorithm, and until now, in most of the open source search toolkits, BM25 is the default formula. So what is this amazing formula? It's actually not that different from TF-IDF; it's very close to it. What they did was add something about the length of the document, a little bit of reformulation, and the TF-IDF part also takes the collection into account a little. There is even a constant here (usually called k1); the best practice turned out to be 1.5 based on experimentation. What it does is normalize the TF-IDF formula a little based on the length of the document: if a document is very long, it is weighted differently from one that is a bit shorter. How does it look? If you used no normalization, and the length of the document is similar to the average length in the collection, you would get the value in black here. But if the document is short, you can see that the formula gives higher values for the TF: now the TF component gives a higher value because it's a shorter document, so multiple occurrences in a short document mean something. If the document is very long, say twice the average length, the TF component starts from a lower value. And the IDF itself is a bit lower than the classical IDF implementation. These are small variations; you might think they are very small, and honestly they look very small, but they make a big difference.
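The slides carry the exact formula; as a rough sketch, this is the standard Okapi BM25 scoring function as it is usually written, with k1 = 1.5 as mentioned in the lecture and the usual length-normalisation parameter b = 0.75, which is an assumption here rather than something stated above:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, N, k1=1.5, b=0.75):
    """One common variant of Okapi BM25; k1 = 1.5 as mentioned, b = 0.75 assumed."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue
        # IDF variant with +1 inside the log, which avoids negative values
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        # TF is dampened and normalised by document length relative to the average
        tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * tf_part
    return score

# Shorter-than-average documents boost the TF component; longer ones dampen it.
print(bm25_score(["destructive", "storm"],
                 {"destructive": 2, "storm": 1, "the": 7},
                 doc_len=10, avg_doc_len=20,
                 df={"destructive": 50, "storm": 300, "the": 100_000},
                 N=100_000))
```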
So, the probabilistic model in IR focuses on the probability of relevance of the documents. It can be mathematically derived; there is a proof, and I think I will cite it as one of the readings if you'd like to have a look. You don't have to, but if you're interested in the maths, it's nice. There are different ways to apply it; BM25 is the most commonly used formula, and it has been one of the strongest baselines in search so far when you have only documents and queries, no additional information, OK? The question is: what other models could still be used in IR? Remember, the first one was vectors, and getting the similarity between them with TF-IDF and cosine similarity. The second one is this BM25 formulation, which is derived from probabilistic theory. Let's try to model it in a different way, which is the noisy channel model. Think of how the process happens. You have some information need, something in your mind you'd like to get information about, and you have a document collection. What you usually do is think about it, then try to write down a query that would help you find the relevant documents you're looking for. That's the process: think about something, write something, then go and find the relevant documents. But think about how this actually happens: you have an information need, and you think, oh, the document I'm looking for would probably contain these words. So what you try to do is formulate a query that could plausibly have been derived from that document. You think, "I'm looking for something about climate change", so you expect words like climate change, maybe scientists, science, the kinds of words that might be contained in that document. So what's actually happening is that you write down a query using words likely to appear in those documents. Imagine that each of these documents has a kind of language model that it was generated from, and you are trying to generate your query from that same model. The language modeling approach directly exploits this idea: a document is a good match to a query if the document's model is likely to generate that query. So I'm thinking in terms of the document in this case. The concept is: coming up with good queries means thinking about the words that would appear in a relevant document and using those words to build your query. And the language model approach in IR models exactly this: a document is a good match to a query if the document model is likely to generate the query, which happens if the document contains the query words you're looking for. It builds a probabilistic language model for each document and then ranks your documents by the probability that each of these different models generates the query. So, language models: a language model is a probability distribution over strings. I think everyone knows about language models now; LLMs are a big version of this, but here we are just thinking about building a language model from a single document, not from a big collection.
And a topic, in a document or a query, can be represented as a language model: words that tend to occur more often will have higher probability in this document compared to others. The very basic language model is the unigram language model, where terms are randomly drawn from a document. Imagine this bucket again and you'd like to draw a word: what is the probability of drawing this word from this document? So, for example, for this document, imagine I have the query red, yellow, red, blue. What is the probability of drawing a red ball? 4 out of 9, yes. What is the probability of drawing yellow? 2 out of 9. What's the probability of drawing red again? 4 out of 9 again. And then blue, which in this case is 3 out of 9. So these would simply be the probabilities, and I can still assume the terms are independent, so what I do is just multiply all these probabilities. In this case, the probability of a term is the term frequency divided by the document length; very, very basic. If a word appears in the document 10 times and the document length is 100, its probability is 10%, or 0.1. So this is a one-state finite automaton, a unigram language model. If I'm looking for a query like "frog said that toad likes frog STOP", which is a very strange phrase, I calculate the probability of each of these words and then multiply them, and I get a score at the end. It might be very tiny, but I do the same for all documents, so I get something for each one. So imagine I'm now looking at these two documents: I have the model for document 1 and the model for document 2, and this is the probability of each term under each document model. From here I can calculate the product of all of these, and I can easily see that the probability for document 2 is higher than for document 1, so document 2 is more relevant than document 1. This is the main, very simple idea. OK. A stochastic language model that is a bit more advanced than the unigram language model does not just treat the words as independent, but starts to account for the phrasing. So I can say the probability of the sequence red, yellow, red, blue is the probability of observing red, then observing yellow after red, then observing red after yellow and red, and so on, which can be a unigram, bigram, trigram, or four-gram model. If you can build something like this, it will be more accurate, but of course less efficient. And if you only have a unigram model, or a somewhat higher-order model like a bigram model, you can approximate: with a unigram model, each word is independent; with only a bigram language model, just pairs of words, you can say red, then yellow given red, then red given yellow, then blue given red, following the sequence you have. So you can approximate with whatever order of language model you have. So, language modeling in IR: each document is treated as the basis of a language model.
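As a small sketch of the unigram calculation just described (the red/yellow/blue counts are the toy example from the slides; term independence is assumed, so the query probability is just the product):

```python
from math import prod

# Toy document model: ball colour -> count (9 balls in total, as in the example)
doc_counts = {"red": 4, "yellow": 2, "blue": 3}
doc_len = sum(doc_counts.values())

def p_unigram(term):
    """P(term | document) = term frequency / document length."""
    return doc_counts.get(term, 0) / doc_len

query = ["red", "yellow", "red", "blue"]
print([round(p_unigram(t), 3) for t in query])   # 4/9, 2/9, 4/9, 3/9
print(prod(p_unigram(t) for t in query))         # independence: multiply them all
```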
Then, given a query, I try to rank my documents based on the probability of a document given the query, which, using Bayes’ formula, is the probability of the query being generated from the document, multiplied by the probability of the document itself, divided by the probability of the query, if you know Bayes’ formula. The interesting thing is that the probability of the query is the same for all documents. What is the probability of the query? I don’t care; it will be the same for all documents, so I can ignore this value. And the probability of the document itself: for now I treat all documents the same. Of course, it can be different for web search, where I can do something like PageRank, which we’ll talk about when we get to web search, but assuming for now there is no different treatment between documents, all documents are equal, so I can ignore this part of the formula as well. So in this case the probability of the query given the document is just the probability of the query given the language model of this document: what is the probability of generating this query from this document? So ranking documents according to relevance to a query is ranking according to the probability of the query being generated from the document, or of the document being relevant to the given query, which would be the same. The basic idea here is that we attempt to model the query generation process from each document, then we rank the documents by the probability that the query would be observed as a random sample from that document’s model. That is, we rank according to the probability of the query being generated from a given document, and this is the formula: the probability of the query given the model is the multiplication of the probability of each of the query terms under the document model. Remember, last time it was a summation; this time it becomes a multiplication. And we call this the query likelihood model: how likely is the query to be generated from a given document. For the parameter estimation, the simple one, which is the unigram model, is the term frequency divided by the document length. So if I have a term that appeared 5 times in a document whose length is 10 words, it’s 50%. Another one, 5 times in 100 words, will be 5%, which kind of embeds what BM25 does with length normalization; it’s embedded automatically here, because if the document lengths are different, the probability changes anyway. And in this case it’s simply the multiplication of these: if a term of the query appears twice, it will be raised to the power of two, because I multiply it in twice. So remember we said the probability of red, yellow, red, blue: it will be the formula we calculated earlier, red twice, then yellow, then blue. But the question is, do you find any issues so far with this way of modelling the search? Do you think this would work better than the other one? Or, let me ask, would there be cases where this formula would be problematic? Yes: it doesn’t like rare words and very common words. The IDF part, yeah, it doesn’t account for that here, but imagine that a term appears in many different documents; it will at the end be normalized somehow.
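For reference, here is the derivation just described, written out in standard query-likelihood notation (this is my rendering, not copied from the slides):

```latex
P(d \mid q) \;=\; \frac{P(q \mid d)\,P(d)}{P(q)}
\;\propto\; P(q \mid M_d)
\;=\; \prod_{t \in q} \hat{P}(t \mid M_d),
\qquad
\hat{P}(t \mid M_d) \;=\; \frac{tf_{t,d}}{L_d}
```

Here L_d is the document length, and P(q) and P(d) are dropped because, as just discussed, they are treated as the same for every document.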
But there is actually a bigger issue here. Yes, that’s an excellent point. If I search for something like this, what would be the probability? Zero, yes, because I’m looking for a green term and the green term doesn’t appear. So in this case one of the factors will be 0 over 9, and the whole thing becomes 0, which is an issue. If we’d like to compare this to boolean search and to TF-IDF, what is the difference? Remember, when I talked about TF-IDF it’s a summation, so we said it behaves like an OR: if a document contains at least one term, I can still score it and get results. This one behaves like an AND operator: I have to find all the terms in the document, and if one term is missing I will not even retrieve it. Can you see it? So there is an implicit AND operator happening here. Usually we should resolve this issue. However, if you noticed, in web search they kind of apply an AND operator, by the way. Have you ever seen, when you search for something, Google gives you results and tells you this document doesn’t contain this term? Usually it tries to find everything that contains all the terms you searched for, but if one term is missing it will tell you, you know, I didn’t find a lot of documents, so I’m ignoring this term, or do you want me to just get everything? So the default for web search is kind of an AND operator; they only drop a term if they didn’t find enough results. In web search they can afford that, because there is a huge amount of documents, so they can focus on the documents which contain all the terms, but sometimes you will notice it tells you this term doesn’t exist; it mentions it to you before you open the link. However, this is sometimes not fair. So in the vector space model the score is a summation: if a term is missing, it’s fine, I add a 0, it’s nothing. But in language models it’s a multiplication, so if one factor is 0, everything goes to 0. Is there a better way to handle unseen terms in a given document? Which brings us to what’s called smoothing. The problem is that if I have a zero frequency for a term, I shouldn’t be that harsh. Maybe in some applications I might want to be harsh, but in other applications, like the Financial Times collection with 100,000 documents or 1,000 documents, that’s too harsh; I can miss one term and still want to retrieve something. So the solution here is to smooth the term frequencies. If you noticed, the term frequencies follow something like a Zipf distribution, but on a big scale: the top term appears this many times, the second one goes down, and at the end many terms are not there at all; they didn’t appear in this document. What smoothing says is: take a tiny part of the probability from the terms I did see, and add it to the terms I haven’t seen. The lowest-appearing term would of course have a term frequency of 1, but I say, you know what, if I didn’t see a term, I can assume it’s something like 0.1 instead of 0. So we do some smoothing here. So documents are a sample from a larger language model, and a missing word should not have a zero probability of occurring just because it didn’t appear in this document; it might appear somewhere.
And the missing term is possible even though it didn’t occur, but no more likely than you would expect by chance in the collection itself, in the whole collection. It may be missing from this document, but it might appear in the whole collection. So smoothing is a technique for estimating the probability of a missing, unseen word. It tries to overcome the data sparsity when a term is missing: it lowers, or discounts, the probability estimate for the words I did see and gives the leftover probability to the terms I haven’t seen in the document. One of the very famous techniques for this is called a mixture model, which simply says that the probability of a term given a document is, of course, the probability of this term generated from the document model itself, which is what we have been seeing in the last part, but multiplied by a constant lambda, maybe 0.9, and then I take the remaining 0.1 times the probability of seeing this term in the whole collection. So if I didn’t see it in my document, I can still see it in the collection. So the mixture model mixes the probability from the document with the general collection frequency. Remember, we talked in the last lecture about document frequency and collection frequency; now the collection frequency can do some work. It estimates unseen words as 1 minus lambda multiplied by the probability of this term in the whole collection model, which we call the background language model, so I can give the term some probability. And the estimate for observed words is calculated in this way, using the collection frequency which we spoke about last time; it’s not document frequency, it’s collection frequency. One of the famous implementations of this is Jelinek-Mercer smoothing, which is calculated in this way, and I decide the value of lambda based on what I’m looking for. If lambda is high, I’m giving more weight to the term being observed in my documents, so it tends to retrieve documents that contain all the keywords. If I make it lower, I start to be more flexible: if a term doesn’t exist, that’s fine, still rank the other documents in a good position. And correctly setting this value of lambda is important for good performance; you need trial and error to see which one achieves the best results. And the final ranking function will be a multiplication, the one we had before, but in this case I keep adding some value from observing the term in the collection. So let’s go through a very simple example. Imagine I have a collection of only 2 documents. Document 1: “Jackson was one of the most talented entertainers of all time.” And document 2: “Michael Jackson anointed himself king of pop”, what a word, “anointed”. And my query is “Michael Jackson”. Let’s assume lambda here to be 0.5; 0.5 means I’m not that harsh about requiring all the query terms to be in the document. So how to calculate it? The probability of the query given document 1: you can see that the word Michael didn’t appear at all in the first document, there is no Michael there, and the document length is 11, so the document part is 0 over 11, weighted by 0.5, and then I add the other part, which uses the collection frequency.
The word Michael appeared only once in the whole collection, which is these two documents, and the length of the whole collection is 18 words. So that part is 1 divided by 18, and of course I’m multiplying by a half, which is the lambda here. For document 2, the word Michael is 1/7, because it appears there once, and I still add the 1/18 from the collection. For the word Jackson, in the first document it’s 1/11, and it appears twice in the whole collection; for the second document it’s 1/7, and again twice in the whole collection. And now I start to get some values. Yes, document 2 scores better than document 1, but I still didn’t get 0 for document 1 just because it missed one term; I get some value for it. A note on the query likelihood model: it has similar effectiveness to BM25. It has been tested experimentally, and it turns out BM25 still achieves comparable results; in most cases BM25 would actually be a bit better, but with some more sophisticated smoothing techniques the language model approach can be better than BM25. There are several alternative smoothing techniques; this is just one example, and there are different ways to do it. And about n-gram language models: remember that we have been treating this as a unigram model, just the probability of a term standalone, not its context. So it’s a probability distribution over the words in the language, associated with the probability of occurrence of every word separately, and the generation of text consists of pulling words out of a bucket independently of the terms around them. With an n-gram language model, bigram, trigram and so on, some applications use models where the probabilities depend on the terms already observed and predict a word based on the previous words seen. However, based on many experiments and many papers, it turns out that for these retrieval tasks the unigram language model is generally still very effective; you don’t even need bigram or trigram language models. And there are three possibilities here. One is to calculate the probability of generating the query text from the document model. Another is the probability of generating the document text from the query language model, which is strange because the query would be only 2 or 3 words. And the third is comparing the language models of both. The general finding from the experiments is, yeah, the obvious one: just try to generate the query from the document. So, the summary of today’s two lectures: there are three ways to model information retrieval. One is the vector space model, where you measure the angle between vectors, how the query vector aligns with the document vector. This is what we have been discussing, with different notations for how to implement it, and this is what you’re going to do, the very basic formula of it, for your coursework. The second is the probabilistic model, which puts some theory behind the notion of relevance: what is the probability of relevance of a document given a query, and we discussed that BM25 is one of these formulations that turned out to be very effective. And the last one, the language model way of doing IR, is about how likely it is that you would observe or generate a given query from the model of the document.
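Before the readings, here is a rough sketch of the Michael Jackson arithmetic from a few minutes ago (unigram Jelinek-Mercer with lambda = 0.5). It is an illustration only, not the coursework formula or code.

```python
from collections import Counter

docs = {
    1: "jackson was one of the most talented entertainers of all time".split(),
    2: "michael jackson anointed himself king of pop".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(collection), len(collection)

def jm_score(query, doc, lam=0.5):
    counts, doc_len = Counter(doc), len(doc)
    score = 1.0
    for t in query:
        p_doc = counts[t] / doc_len          # P(t | M_d), the document model
        p_coll = coll_counts[t] / coll_len   # P(t | M_c), the background model
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

for doc_id, doc in docs.items():
    print(doc_id, jm_score("michael jackson".split(), doc))
# Document 2 scores higher (about 0.013 vs 0.003), but document 1 is not zeroed out.
```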
Generating the query from the document model is kind of what you are doing implicitly in your head: you’re thinking, I’m looking for something, and these are probably the words I have to find in the document. So, the resources for this: Introduction to IR, chapter 12, covers this nicely, and in the IR in Practice book it’s sections 2 and 3 of chapter 7. And if you are interested, I would recommend having a look at the original Okapi BM25 paper, which came out in 1995, and the other paper which first introduced the language modelling approach for retrieval and the experiments done with it. These are nice readings to see how this stuff was developed at the time it was first presented. Do you have any questions? Yes.
SPEAKER 2
Yes, so even after applying the smoothing technique, if your query has one or more words that don’t appear anywhere in the entire collection, the zero is still there, is that the intended behaviour? Is that right?
SPEAKER 0
Yes, it will not say that this document shouldn’t be retrieved; it will still calculate a score for it. Imagine those two words turn out to have a high probability in the collection; these documents might still end up in the top ranks. It will not always push them down; it will try to bring these documents to the top ranks depending on the terms themselves. OK. Any other questions? Yes, what if the word doesn’t show up in the collection? If the word doesn’t show up in the collection at all, probably you just ignore it in this case, because otherwise it would just make nothing be returned; it doesn’t exist, OK? Yes.
SPEAKER 3
You said that the parameter, lambda I think, is better if it is larger when we have smaller queries and smaller when we have larger queries. Can you adjust it dynamically based on what query the user is putting in, or would that be a better idea?
SPEAKER 0
So imagine the extreme case when lambda is one. It means I require all the terms in my query to exist in the document, which is kind of an AND operator. If you start to give it a lower value and increase the 1 minus lambda on the other side, it means I’m starting to be more flexible: even if some terms don’t exist, 2 or 3 terms missing from the document, I can still give it a reasonable score and retrieve it. What is the best value? Again, it depends on the application. Actually, in this paper there is an interesting experiment on this, and you’ll find papers just about the value of lambda, testing it in different ways and reporting which is better. It shows that for some applications a higher value of lambda is better. Usually a higher value of lambda is better when you have a huge collection: if you have billions and trillions of documents, then if one document is missing one of the terms, I already have many others. But if the collection is small, like the two-document collection we talked about, you don’t want to be that harsh with it. OK? So, I have to revisit this quickly. I’m trying to get you to implement how search engines work. So far you are doing the very, very basic things. I know some of you are excited to try more stuff, like an extended index, or saving the index in a binary format, or a structured index, and so on, and maybe someone would like to implement BM25 or language models for IR. The first experimentation is to get you to build the basic search engine, which is your coursework one. When it comes to the group project, you’ll have plenty of ideas; you’ve already learned some of them. You’re not implementing, for example, BM25 or language models here, but you might think about using them in your group project. And as we go on with the course, we will cover even more advanced methods, like learning to rank, for example. You will not be required to implement them at that point, because it takes time and effort, but we would appreciate it a lot if we find some of this stuff implemented in your group project; that’s where we want to see the advanced stuff. So this is the main point: we’re giving you the pointers, getting you started on how to build search engines, and we’ll show you the routes you can take afterwards, which advanced methods you can use later. For now, you are doing the basic thing. Is that fair enough? OK, so the lab is based on 1,000 documents. The coursework will be 5,000 documents; when we release it, it will be a little bit bigger, but this is nothing, this is not a real system. A real system will be millions, hundreds of millions, sometimes billions of documents. It’s all just about getting the system up and running so you know how it goes. Hopefully you will learn these basic skills, develop them through the course, and in the second semester create an interesting group project. OK, so best of luck with the coursework. You will not be able to submit until we release the test set, but you can start finishing everything, even writing the report. All that remains will be the final collection, the final test set, to run it and submit the result files.
Share your results, please, on Piazza, OK? I hope to see something within this week. Best of luck.
Lecture 3
SPEAKER 0
OK. Hi, everyone. Welcome to the 3rd week of Text Technologies. Today we’re going to talk about indexing; we’ll get to it in a minute. But before the lecture: so far, the lectures we took in the last couple of weeks were a kind of warm-up, and now we’re starting to get a bit more serious about building search engines. Lab one: if you noticed, many of you, thanks, have started to share their results on Piazza. You’ll notice there are some slight differences. They look almost the same, but there are maybe slight differences which might be coming from the way you did the tokenization, the stopping, the stemming, and that’s fine, this is normal. At the end you are doing something, you can see how it’s working, how you deal with big data. And there is one small tip about loading big files; I can see a couple of you started to talk about this. I think one of the mistakes some of you make, and it will not work when you get to your group project, is that when loading a file, you load the whole file into memory and then start processing it. Don’t. Load it part by part. I usually recommend just line by line, easy, but you can do chunks as well; there’s a small sketch of this right after this introduction. This file was only 500 megabytes; in the future you might be working on something of 20 gigabytes. Don’t fill your whole memory with a file before you start processing it. Just load it line by line, process each line, print or save it in a hash or whatever you do, then go to the next one. Don’t keep the whole file in memory. This is one small tip to make your stuff work better. Today we are going to have two lectures about indexing. I’m expecting that you know some binary; hopefully everyone knows binary here. And I will also announce coursework one, which depends on the last lecture and this one, in the break. One important notice: please do the labs on time. Don’t get to the end and realize you’re going to submit coursework 1, which is totally dependent on labs 1, 2 and 3, and you have to do all of them at the same time. Start doing them on time and you’ll be absolutely fine with the coursework, OK? One more thing: someone mentioned that in the recording of the 2nd lecture last week the sound was off, which was a problem; it turned out that I was muted, it seems. There’s no solution for that, it’s gone, but what I’m trying to do is find the lecture from last year, and we have it; it’s just about uploading it to the system, so I need to find out how that’s done. So if you need a recording of the 2nd lecture, it will hopefully be there within the next few days. Do you have any questions before I start? Is it going well so far? Is it fun? Good, OK. More fun to come. So the lecture objectives are, again, to learn but also to implement what we call boolean search, the inverted index, and the positional index, and we will learn what we mean by these. So, the indexing process that we started discussing last week is about converting our set of documents, you have a question? OK, converting a set of documents into some bag-of-words form, which we discussed, and then putting them into the index. So last week we started to talk about the kind of transformation process we are doing, which is the pre-processing.
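(Here is the small sketch mentioned above for the file-loading tip: a minimal line-by-line reader. The path is just an example, assuming a plain-text or XML collection file; nothing here keeps the whole collection in memory at once.)

```python
def stream_lines(path):
    with open(path, encoding="utf-8") as f:
        for line in f:                 # the file object yields one line at a time
            yield line.rstrip("\n")

for line in stream_lines("collections/trec.sample.xml"):
    pass  # tokenise / index this line, then move on; don't accumulate the raw text
```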
We talked about very simple processing steps that lead to big improvements in search, like tokenization, case folding, normalization, stopping, all of this stuff. But once we’ve done this, we need to put the terms into the index, which is exactly what we’re going to talk about in this lecture and the next one, OK? So remember this example from the last lecture. After processing the text, removing the stop words and applying stemming, we end up with something like this, and we said this is what we’re going to put in our index to be able to match. And the question is: OK, what is the index then? Simply, the question is how to match terms between queries and documents in sub-linear time. It’s not like find or grep as a Linux command; it’s not sequential. We need to find it very quickly, and the main thing is using an index to find these terms immediately; we don’t have to go and scan everything. So, does anyone have an idea? Have you come across any index in your life? Have you seen an index before in any application in
SPEAKER 1
life?
SPEAKER 0
A hash map, like a dictionary, OK, but have you seen one in the physical world? Have you seen an index? Yeah, in a book, yeah. The very simple index is the book index, which is something like this. So what is the use of it? It is simply a small search engine. If I’m looking for a specific term, like “coverage” here, it tells me it’s on page 8, so I don’t have to keep reading the whole book to find this word; from the index I can see, oh, it’s on page 8, and jump directly, in sub-linear time, to that page. That’s it. That’s an index. Easy. So we are done; let’s explain how it’s done on the computer side. So, simply, search engines versus find or grep: it’s infeasible to scan a large collection of documents every time to find a specific term. If you grepped for a specific term in the Wikipedia abstracts, for example, it would take a long time until you find it, especially if it appears at the end. So the question is how we can find it so quickly. And also finding more complex things: say I’m looking for a document which contains the word UK and also Scotland and also money. Now it’s not just one word, it’s multiple words. They don’t have to appear together as a phrase, but I’m looking for a document that has all three words. This is not feasible, or at least not easy or straightforward, using sequential search. The book index helps you find a list of relevant pages for specific words. So if I’m using a book index and I’m asking about, for example, the economy in the UK, I can look up where the word UK appears, where Scotland appears, and where money appears, and then find the common page between them, and I can guess, oh, this one is relevant. And this is the main point: find it in sub-linear time, very quickly. An IR index is very, very similar. It’s a data structure for fast lookup of terms, with some additional optimizations that can be applied to make it even more effective, and we will discuss them here. It all starts with the very basic representation of a document in the index, or of a collection in general: what we call document vectors. We simply represent a document as a vector. The vector here is a document, and each cell inside this vector represents a specific term. The values can be how many times a term appeared in the document, as a bag of words, or maybe just whether it appears or not, is it there or not, that’s it. And if you have a set of documents, say a collection of 100,000 documents, and you put all of them as vectors, these 100,000 vectors form a matrix; we call it a collection matrix. So we’ll have a big matrix where each column represents a document, and we can see how many times each term appears. Let’s take a very small, simple example. Imagine we have a tiny collection of only 5 documents, as you see here. Document 1: “he likes to wink, he likes to drink”, and document 2: “he likes to drink, and drink, and drink”, and so on, whatever, just strange words inside. What we can do now is find the unique terms we have in our collection, so let’s take a sample of these words: he, drink, ink, likes, wink, and so on. And then for each of these documents we see how many times each of these terms appears. So what I can do here is something like this.
For example, in the first document I can see wink appeared only once and likes appears twice. So a cell like this tells me the number of occurrences of a term in a document; the word drink here appears 3 times in document 2. And the document vector is simply one of these columns: I see the document as a vector over all the unique terms in my collection. Some of them appear multiple times, some don’t appear at all, like these zeros here, these two terms that didn’t appear in document 4. That’s it, very simple. This is a simple index we can think about. What we are more interested in is the inverted index, where instead of representing a document as a vector, we represent the terms themselves as vectors. So the vector is now the term, and each cell says whether this term appeared in that document. It’s simply the transpose of this matrix. If I do that, it looks like this: if I load the word he, I can see it appeared in document 1 twice, document 2 once, and so on. The word ink appeared in documents 3, 4 and 5. The word wink appeared in 1, 4 and 5. So this vector tells me, for each term, which documents it appeared in across the collection. This is a very basic representation for very basic search; it’s the basic thing you build everything else on for a search engine, and it enables what we call boolean search. Boolean search is a functionality that tells us whether a document matches a specific query or not. It doesn’t tell me if it’s relevant or not; it just tells me: does it have the terms I’m looking for or not? That’s it, it’s not ranking anything. And it goes with logical operators like AND, OR, NOT. So on a collection of Shakespeare novels, for example, I can have a boolean query like: I need all the novels which have the word Brutus and Caesar but not Calpurnia. How can we do that? If you thought about grep, for example, it’s not straightforward, but here we build a structure that enables us to do it. What is required is to build a term-document incidence matrix, which tells us which term appears in which document, and then we can apply the boolean operators AND, OR, NOT to find the matching documents. Let’s take an example. Imagine this is the collection matrix, these are the different documents we have, the different works by Shakespeare, and here are the different words I’m looking for. Remember, my query is Brutus AND Caesar AND NOT Calpurnia. So what I need to do now is load the vector of each of these terms. This is Brutus, and these are the documents it appears in: 110100, I can see where it appears. This is the second one, Caesar, which I load as well. And Calpurnia: I’m looking for NOT, the documents that do not contain this term, so I take the inverse, the complement, of its vector. So now I can easily check the vector values: for Brutus I can see it’s 110100 from the top, which I copy here, for Caesar I can find which documents it appears in, and for NOT Calpurnia I take the complement, and I apply this boolean AND operation and find that the output is document number 1 and document number 4.
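A tiny sketch of that bit-vector operation; the incidence rows here follow the classic Brutus / Caesar / Calpurnia textbook example, so treat the data as illustrative.

```python
# Term -> incidence vector over 6 documents (1 = term occurs in that document).
incidence = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def vec_and(a, b):
    return [x & y for x, y in zip(a, b)]

def vec_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = vec_and(vec_and(incidence["brutus"], incidence["caesar"]),
                 vec_not(incidence["calpurnia"]))
matches = [doc_id for doc_id, bit in enumerate(result, start=1) if bit]
print(matches)   # [1, 4]
```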
So all it does is tell me that these are now the two relevant documents. In a sub-linear operation like this, without scanning the whole collection, I just loaded the vectors of these terms, applied the boolean operators, and found out, OK, these are the two relevant documents I’m looking for. OK? Is it clear so far? You’re going to implement this stuff, so if you have a question, ask. The problem with this representation is scale. We have a very, very small collection here, but consider a collection of 1 million documents. This is not a big collection, just 1 million documents, not that big. And say each document contains on average 1,000 words; imagine news articles, for example, each article 1,000 words on average. So what would be the size of the matrix? To find the size of the matrix I need to know how many unique words we have. The total number of words in my collection is 1 million documents multiplied by 1,000 words, around 1 billion words. And using Heaps’ law, it comes out that we have around 500,000 unique words in the collection. So what would be the size of my matrix in this case? 500,000 by 1 million, which is huge. It’s 0.5 trillion cells, and all the entries are ones or zeros. OK, so that would be a huge matrix. But what is more important, what do you think the values inside this matrix would be? How many ones and zeros are we expecting to find there? More ones or more zeros? Why?
SPEAKER 2
Because most words don’t show up often.
SPEAKER 0
Yes. And also think about it: the document size is 1,000 and you have a vocabulary of 500,000 words. So even if this document had 1,000 unique words, no word appearing twice, then out of the 500,000-long vector only 1,000 cells would be ones and everything else zeros. And even across all documents, we’re talking about at most 1 billion ones and everything else zeros, and that’s the case where every occurrence is a distinct term, which is not the case: some terms appear a lot and some terms appear very few times. So a row will look something like: it appears in document 1, then documents 4 and 5, and many, many zeros in between, and so on, and this row is 1 million cells long. And if you think about Zipf’s law, remember that half of these terms appear only once, so half of the term vectors, 250,000 of them, have everything zero except a single one. So what we end up with is really, really sparse: almost everything is 0, only a few 1s, so this representation is not very efficient. Imagine you have to build this huge matrix, but it’s mostly zeros; we need to save space. So what could be a better representation of the document terms? This is what we call the inverted index as a sparse representation. In this case, for each term we store a list of all the documents that contain this term, and we identify each document by its ID. So instead of using a vector, we are using a list. It looks something like this: the word Brutus appears in documents 1 and 2, then 4, then 11, then 31, then 173, and all the zeros in the middle, where the word Brutus didn’t appear, I simply don’t store. So this will not be sparse. If a term appeared in only 4 documents, its list just has those 4 entries; if a term appeared in only one, it will be a list with one cell. So this is a sparse representation: I just save the document IDs. Instead of saying how many times it appeared, I just say which documents it appeared in. So what do we call this? This is the document number; I store the document number inside the cell, and when I find a new document containing my term, I add it to the list, and I call each of these entries a posting. And the whole thing I don’t call a vector anymore; it’s a posting list, because it’s a list, and it has different sizes from one term to another. So we call this a posting list, OK? And all the unique terms I have in my collection, I call my dictionary. This is very similar to the book index you have seen: a term and where it appeared. That is simply what we’re storing. And how to process it? It can be done in different ways, but here’s a quick example. I have this input text; all I need to do is apply the tokenization and then the normalization, maybe stemming and case folding, and stop words removal if needed.
Then all I need to do is store, for each of these terms, which documents it appeared in; very straightforward and simple. If you’d like more detail: if I have two documents, I can list term and document pairs, which document each term appeared in, then sort them alphabetically, and from there I can create my postings and say, for each term, which documents it appeared in. So this word here appeared in document 1 and document 2, that one appears in 1 and 2, and so on for the others, which documents they appeared in. Very straightforward and easy; it should be very similar to the processing you did last time. So remember this example: this was the collection matrix we had, with the term vectors. If we’d like to represent this example as a sparse representation using posting lists, it would be something like this: “he” appeared in these five documents, this other word appeared only in document 3, and that’s it. OK. Yes, I see a question, an excellent question; you are talking about the next slide. So the problem here, as you mentioned, is that I know this term appeared in this document, but I don’t know how many times it appeared. It could be 100 times, but the list never tells me how many times. For boolean search this is enough, but when we talk next week about ranked search, the frequency is important. So how can we solve this? It’s a little bit of structuring: instead of each cell of the list being just one number, one integer, we can save it as a tuple. So it tells me “he” appeared in document 1 twice, in document 2 once, in document 3 once, in document 4, and so on. Now it’s a tuple: I save the document ID and how many times the term appeared in that document. So, for example, here it tells me the word drink appeared in document 2 three times. Is that clear? Does it solve the problem? OK, so it’s just a little bit of structuring.
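A minimal sketch of building such an inverted index with term frequencies, as a dict of term to {doc_id: count}. The toy pre-processing and data are mine, purely for illustration.

```python
from collections import defaultdict

docs = {
    1: "he likes to wink he likes to drink",
    2: "he likes to drink and drink and drink",
}

index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in text.lower().split():      # in the labs: tokenise, stop, stem here
        index[term][doc_id] += 1

print(dict(index["drink"]))   # {1: 1, 2: 3} -> document ID and how many times it appeared
```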
So how shall we do query processing in this case? Imagine I get a query which says: I need all the documents which contain the word ink and the word wink. How can I process this? There is one algorithm called linear merge, which is used a lot, but you don’t have to implement it; you can implement it in different ways, and I will give you some tips on what you can implement for your coursework. But here’s how linear merge works. I’m looking for documents containing the word ink and the word wink. I have my whole dictionary saved in the index, and for each of these words I have the posting list and how many times it appeared in each document. So I’m looking for ink and wink; I just load them from my index. This is the outcome: ink appeared in document 3 once, document 4 once, document 5 once, and wink similarly. Linear merge works like this: it checks which of the two posting lists has the closest, the smallest, document ID. Yes, wink appeared in document 1, so I start with that. Then I check: does ink appear in document 1? No. So the function gives 0 for the first term, because it didn’t appear, and 1 for the second, because it did. Then I move to the next posting in the posting list and check: does the other posting list have a document with an ID less than this one? Yes, it does. So I go to document 3, and the function is: the first term appears in document 3, but the second doesn’t, so it’s 1 and 0. Then I go to the next, document 4. Did I reach 5 yet? Not yet, so it continues, 1 and 0. Then I go to 5. Does it match here and there? Yes, so the function is 1 and 1. But remember we are looking for an AND operator here, and an AND between 0 and 0, or 1 and 0, or 0 and 1 is always zero; only 1 AND 1 gives 1. If I change this to OR, what would be the outcome? All ones. If I came and said, change this to ink AND NOT wink, I can still do it: I just apply the boolean operator and it finds the matches. So this is how it’s done, in a simple way. Now whatever gets a one is my match. Just boolean; I didn’t rank anything. All I need is to find the documents which contain a specific set of terms. OK. Remember that we said, OK, this is a bag of words; what about when we would like to search for a phrase? Imagine I’m looking for a document that contains “pink ink”, and we have this representation. How shall I find it? Do you have some ideas what the solution could be? Yes: look for documents that have pink and ink and
SPEAKER 1
then specifically find it in the string of that document.
SPEAKER 0
How would you find it in the string of the document? By scanning the document itself? But imagine you got 1 million matching documents; it goes back to the grep thing, it will take a long time. Is there another way to do it? Think about it. That’s a good solution, but it will not be scalable. Any other solutions, using the same postings? Yes: combine pink and ink and make
SPEAKER 2
it one word with a hyphen.
SPEAKER 0
Nice, which is actually using something like bigrams. So instead of “he likes to wink and drink pink ink”, which is a strange phrase, but it’s an example, I can put it in this representation: he likes, likes to, to wink, wink and, and drink, drink pink, pink ink. And now I can search the postings; instead of a posting list for a single term, it will be for a bigram, and I can find that “pink ink” appeared in this document. OK, this is your solution. Do you see any problem with this solution? Size, excellent. The size would be huge: instead of having a vocabulary of 500,000, with combinations of words it might end up 5 million or 50 million. What else? If it’s a longer n-gram? Here you are: if I’m looking for a phrase of more than two words, trigrams, 4-grams, 5-grams, 10-grams, shall I keep creating something like this? So the problem here is that it might be fast to look things up, but the size would be huge. Yes.
SPEAKER 2
Another.
SPEAKER 0
Another one? Let me finish the example and then I’ll hear it from you; I’m interested to see. And what about trigram phrases and longer? And what about proximity? If I’m looking for something like “ink is pink”, say I’m not doing stop words removal, and I’d like to find these terms appearing close to each other rather than as an exact phrase. What could be the solution?
SPEAKER 2
Looking for proximity, storing where the terms appear, their positions?
SPEAKER 0
That’s excellent, and it is exactly what we call the proximity index. It’s another structure for our index that stores an additional piece of information which helps me do these kinds of operations very efficiently: the term position is embedded in the inverted index. I store it, and we call it a proximity or positional index; both refer to the same thing. It enables phrase search and proximity search. Instead of tuples of the document ID and how many times the term appears, it changes a little bit: it’s still a tuple, but it contains the document ID and where in that document this term appeared. So remember our example; this is how it would look. Actually, I’m sorry, this first one is the old representation; the new one will be like this: it tells me that “he” appeared in document 1 at position 1 and in document 1 at position 5, which means it appeared twice, and the same for drink; it shows me that in document 2 it appears at positions 4, 6 and 8. So I have stored, in my index, where each term appeared in each document. This solves a big problem, because now I know exactly where the term appeared. OK, is that clear? You’re going to implement this; if it’s not clear, ask now. So, query processing for phrase and proximity search: we still use linear merge, but we take an additional step. Let’s go to the example. The word ink appeared in document 3 at position 8, in document 4 at position 2, and in document 5 at position 8, and pink appeared in document 4 at position 8 and in document 5 at position 7. Remember, I’d like “pink ink”. This is an AND operator by default, because I need both of them to appear. So I start with linear merge: ink appeared in document 3; is there anything for pink there? No, so the function is 1 and 0, nothing. Then I go to the next document: ink appears in document 4; does pink appear in document 4? Yes, so the function is 1 and 1, but I will not take an action yet and say yes or no, because I need to double-check the positions of these two terms. Is the position of ink minus the position of pink equal to 1? No. So in this case, while they both exist in the same document, it’s not a phrase, because their positions are far from each other, so the answer is zero. Then I go to the next document, document 5, both here and there, and the question is: is the difference between their positions equal to 1? Yes, so the answer is one, and now I find that document 5 matches my phrase query. OK. And the good thing is that this now also enables what’s called proximity search, not just phrase search. I can search for words appearing in the same sentence; they don’t have to be right next to each other, just close. I can say I need these words to appear within a window of 5, or within a window of 10, rather than one word appearing at the beginning of the document and the other 1,000 terms later at the end of the document.
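Here is a rough sketch of that positional phrase check. The postings are the toy numbers from the example, and the structure (term to {doc_id: [positions]}) is just one possible layout, not the required coursework format.

```python
postings = {
    "ink":  {3: [8], 4: [2], 5: [8]},
    "pink": {4: [8], 5: [7]},
}

def phrase_match(first, second, index):
    # Documents where `second` occurs immediately after `first`.
    hits = []
    for doc_id in index[first].keys() & index[second].keys():   # implicit AND
        if any(p2 - p1 == 1
               for p1 in index[first][doc_id]
               for p2 in index[second][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_match("pink", "ink", postings))   # [5]
```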
So this is a very useful feature when we do ranking later on. When we talk about ranking: if I’m searching, even without phrase search or proximity search, for just “White House”, for example, and I find 1,000 documents containing white and house, some documents might have “white” near the beginning, in something like “the white egg”, and “house” at the end, in “the happy house”, while another document has them right next to each other. We can use this kind of proximity as a score, giving a bigger score to the ones where the terms appear close to each other, and we’ll talk more about this next week. OK, so this is really important, and it solves a big problem: now I don’t need to think about bigrams or trigrams or 4-grams. I have the positions; I can simply calculate and find out whether something is a phrase or not. Fair enough? Good. So, how to do this with a simple implementation? You don’t have to implement linear merge in your coursework or labs, because that works on lists at a low level; the next lecture will talk about even lower-level stuff. Here’s how to make it simple for your coursework. First, when you get a query, for each term in the query, regardless of what its operator is, retrieve the posting list for that term. So if I got a query with 3 terms, retrieve from your dictionary the posting lists for these 3 terms. If you’ve got an AND operator, what you need to do is take the intersection of these lists: if term 1 appeared in documents 1, 4 and 5, and term 2 appeared in documents 4 and 5, then the intersection is 4 and 5. If it’s OR, you take the union of the lists: it will be 1, 4 and 5, everything, just a union operator. If I got a NOT, it’s the inverse of the posting list: if the term appears in documents 1, 4 and 5, then you have to return documents 2, 3, 6, 7, and so on, but this becomes more complicated. Usually NOT, in the lab and coursework, will only be combined with AND, so what you can do is subtract: if you have term 1 AND NOT term 2, you check where term 2 appeared and subtract it from the first list, and that’s it. Fair enough? Easy. Say term one appeared in documents 1, 4, 5 and 6, and term two appeared in documents 5, 6 and 7; I subtract 5 and 6 from the first list, and only the remaining documents are left. If I got a phrase search, it’s simply an AND operator, the intersection, and then I check whether position 2 minus position 1 equals 1 or not, OK? And you have to be careful, because “White House” is not the same as “House White”: if position 2 minus position 1 equals -1, this is not the phrase I’m looking for, OK? So you have to be sure you take into account which one comes first. If I’m doing a proximity search, terms appearing within a specific window, it’s again an AND operator, the intersection, then I check that the absolute value of position 2 minus position 1 is less than or equal to the window value. In this case the order doesn’t matter that much; if I say within 5, I just want to be sure the terms appear within a distance of 5 of each other, OK? Easy. Easy, OK.
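A sketch of that query-processing recipe on a positional index of the form term to {doc_id: [positions]}. The helper names and structure are mine, not the required coursework interface.

```python
def docs(index, term):
    return set(index.get(term, {}))

def and_search(index, t1, t2):       # AND: intersection of the two document sets
    return docs(index, t1) & docs(index, t2)

def or_search(index, t1, t2):        # OR: union
    return docs(index, t1) | docs(index, t2)

def and_not_search(index, t1, t2):   # t1 AND NOT t2: subtract the second list
    return docs(index, t1) - docs(index, t2)

def proximity_search(index, t1, t2, window):
    # Documents where the terms occur within `window` positions of each other
    # (order ignored); a phrase "t1 t2" is the ordered case where pos2 - pos1 == 1.
    hits = set()
    for d in docs(index, t1) & docs(index, t2):
        if any(abs(p2 - p1) <= window
               for p1 in index[t1][d] for p2 in index[t2][d]):
            hits.add(d)
    return hits

index = {"ink": {3: [8], 4: [2], 5: [8]}, "pink": {4: [8], 5: [7]}}
print(and_search(index, "pink", "ink"))           # {4, 5}
print(proximity_search(index, "pink", "ink", 5))  # {5}: |8 - 7| <= 5, doc 4 is too far apart
```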
Another thing I recommend you use: it’s not the most efficient, it’s actually far from efficient, but it’s an easy start for your coursework and lab, and it’s this kind of representation for your index. Later on, in your group project, I hope you’ll use more efficient ways of storing the index itself. But take this example: for each term you put an entry and record how many documents the term appeared in, say 5 documents, and then for each document, the document number and the positions it appeared at: the first document and its positions, the second document and its positions, and so on. For example, if the word Scotland appeared in 25 different documents, you’d have: it appeared in document 94 at position 112, in document 351 at these positions, multiple positions, and so on. So this is a very simple way of saving it as text, nothing hard, just some simple structure that helps you store your index. And from this you have all the information: for this term, how many documents it appeared in, which documents, and the positions within each of them. And if you need to know, for example, how many times it appeared in a document, you can just check the length of the list you have there; it’s easy, it appeared here twice, or 7 times, or whatever. Is it clear? OK, practical time: let’s see this in action, OK? I need to set it up; I think I have it here. Where is it? You cannot see it, yeah, I need to make the screen mirror.
SPEAKER 2
This place.
SPEAKER 0
Uh, mirror it. You see it now, yes? Excellent. OK, let’s have an example here. What did I call it? Collection, yes. I created a collection, which is this one; it has 5 documents. This is a standard format; we’ll see exactly why this format later. It tells me there’s a start of a document here, this is the document number, this is the text of the document, and then this is the end of the document. It’s kind of an XML format, OK? It’s kind of XML, but this is a very standard format for documents when you’re preparing them for search and indexing. So let’s say this is the format I have, OK? And this is the example we’ve seen in the lecture, yeah? Only 5 documents, nothing big. So I created an indexer. You don’t have to see how it looks; it’s just something to take this and put it in the structure we talked about. I applied very simple pre-processing, not all of it: no stop words removal, no stemming, I just applied case folding and removed the punctuation. In the lab you do the whole thing; I’m just illustrating. So I take this collection of text, I run the indexer, and let me print it. You can see now it tells me the word “and” appeared in two documents: document 2 at positions 5 and 7, and document 5 at position 5. Actually, let me double-check this. If I go back to the collection text, you can see it here. It tells me it appears in documents 2 and 5. This document, there is no “and” here. Document 2, what is this? It’s not.
SPEAKER 2
Yeah, oh yeah, yeah, document 2, yes, in 5 and
SPEAKER 0
7, which is true, yes, this is true: 1, 2, 3, 4, 5, so yes, it is correct. And the other one: where is it? Document 5 at position 5; document 5 at position 5, OK. So it’s simply doing it for me. Now what I can do is run a simple search, just an AND thing, and I’m loading collection.index for it. So what you do: once you process the collection, print it to a file as an index. This is the file output, and when you do search you don’t go and process the documents again; you just load the index into memory. You have this structure now; just load it, parse it, and keep it in memory in a structure like this. And now I can query it: if I’m looking for the word “and”, it tells me it appeared in documents 2 and 5. The word drink appeared in all the documents. The words he AND drink appeared in all the documents. The words and AND drink appeared in 2 and 5; then ink, ink AND pink. It just tells me which documents they appeared in. OK, very simple. And this is a small collection. I had another collection, actually; this is the one you’ll be using in the lab, which is a sample. Let me open it. These are news articles from the Financial Times in the 1990s, very old news articles, but it’s interesting. It has the document number as well, it has been put in this format, and it has profile, date, headline, text, and so on. When you do the indexing, you just apply it to the headline and text and ignore the other fields, and it has many documents like this, OK? So you have here the headline, whatever, “international company news”, something like this; it’s about a lot of financial stuff and some political stuff as well. So if I create the index for this: it’s called, what did they call it, trec.sample.xml, I run the indexer, and I will write it to trec.index, OK? I save the index now. If I open this index I have created.
SPEAKER 2
Open with.
SPEAKER 1
Uh, why you cannot open it.
SPEAKER 0
That’s annoying. Let me do it. OK, bring it here. So you can see now, I didn’t do any decent pre-processing, I should do it better, but this is where each term starts; there are some numbers first, some terms that are just numbers. Let me scroll down a bit, for example, here: “fallacy” appeared in only this document at this position; “fallen” appeared in many documents, mostly once, in this document twice, and so on; “falling”, I didn’t do stemming, but you should be doing stemming, OK? So now if I load this, it’s 1,000 documents, and I do the AND search again, but loading the trec index this time. Now I can search for any word I want. I can search for the word Scotland, for example, and it tells me these are the documents it appeared in. If I search for the word England, it appeared in more documents. What about the documents containing Scotland and England together? A smaller number of documents; these are the common ones. You can go and check if you want, if you have time. Anything else? Liverpool, Manchester.
SPEAKER 2
Manchester. OK, both together, Liverpool.
SPEAKER 1
Manchester, only these three documents.
SPEAKER 0
69351, which is in both, and so on. Very simple. You can see now it’s become very fast, even if this collection is 100,000 or even 10 million documents. Now I can just load it, match, and find out. If I’m doing OR, it will be the union of these numbers. If I’m doing AND NOT, I will be subtracting one list from the other one, and that’s it, very simple. If I’m doing phrase search, I will just do an additional check on the position of the term, where it exactly appears. OK? Is it easy? Any questions?
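[Editor’s note: a minimal sketch of the Boolean operations just described, assuming the index has already been loaded into a term -> {doc_id: [positions]} dictionary; the function names and the tiny index are hypothetical, not the lecturer’s code.]

```python
def docs_for(index, term):
    # The set of document IDs a term appears in (empty if the term is unseen).
    return set(index.get(term, {}))

def boolean_and(index, t1, t2):
    return docs_for(index, t1) & docs_for(index, t2)   # intersection

def boolean_or(index, t1, t2):
    return docs_for(index, t1) | docs_for(index, t2)   # union

def boolean_and_not(index, t1, t2):
    return docs_for(index, t1) - docs_for(index, t2)   # subtraction

# Example with a tiny hand-made index:
index = {"scotland": {3: [7], 8: [2, 9]},
         "england":  {3: [1], 5: [4], 8: [11]}}
print(boolean_and(index, "scotland", "england"))       # {3, 8}
print(boolean_or(index, "scotland", "england"))        # {3, 5, 8}
print(boolean_and_not(index, "england", "scotland"))   # {5}
```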
SPEAKER 4
Yes — how are you storing the indexes?
SPEAKER 0
How I can use what?
SPEAKER 4
So how, how are you storing the indexes in the memory?
SPEAKER 0
I’m sorry, I cannot hear you. How am I storing the index? Currently, as I said, just in a simple text file like the one I showed you, like this one. Yes, and when I’m doing the search, I load this one — I load this file. One big mistake: don’t say ‘I will not save it, I will just index everything, keep it in memory, and use it for the search.’ No — save it on the disk, and when you are going to do the search, load just the index, not the collection. OK? It doesn’t make sense that every time I need to search, I have to index everything. When we’re doing web search, Google will not go and index everything from scratch — nothing like that. It just has the index, OK? Yes?
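[Editor’s note: a minimal sketch of the ‘load the index, not the collection’ point — a loader for the plain-text layout sketched earlier. The format is an assumption; adapt it to whatever layout you chose.]

```python
def load_index(path):
    # Parse the text index back into: term -> {doc_id: [positions]}.
    index, term = {}, None
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            if not line.startswith("\t"):          # a term line, e.g. "ink:"
                term = line.strip().rstrip(":")
                index[term] = {}
            else:                                   # a posting line, e.g. "\t2: 5,7"
                doc_part, pos_part = line.strip().split(":")
                index[term][int(doc_part)] = [int(p) for p in pos_part.split(",")]
    return index

# At search time you only do this once, then answer all queries from memory:
# index = load_index("collection.index")
```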
SPEAKER 5
What happens if the query actually includes a stop word or something?
SPEAKER 0
Yeah — I mentioned this many times last week. Whatever processing you’re going to do to the documents, you have to apply exactly the same to the query. So if you are doing stemming to the documents, you should apply stemming to the query. If you are doing stop words removal, you should apply stop words removal to the query. This is one of the things you’ll be tested on: when you do the search, it can be two different-looking queries, but after stemming they will be the same. So you have to be sure about this. This is a really important thing because, for example, the word ‘falling’: if you did stemming, it would be ‘fall’ in the index — there will be no ‘falling’ in the index. And if you take the query and search with it as it is, without applying the same stemming, you will not match anything, because it’s not there. You have to apply the same stemming, OK?
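[Editor’s note: a tiny sketch to make this concrete — one shared preprocessing function applied to both documents and queries. The NLTK Porter stemmer is used only for illustration (stemming libraries are allowed); the stop word list here is a placeholder, not the one specified for the coursework.]

```python
from nltk.stem import PorterStemmer   # assumed available; stemming libraries are allowed

stemmer = PorterStemmer()
STOP_WORDS = {"the", "is", "a", "of"}  # placeholder; use the stop list you are given

def preprocess(text):
    # One shared pipeline: case folding, tokenising, stopping, stemming.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The market is falling"))   # ['market', 'fall']
print(preprocess("falling"))                 # ['fall']  -> now it matches the index
```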
SPEAKER 3
Yes. Uh, so, yeah, I had a question. This is regarding the collection size that keeps increasing. So right now we know the collection — let’s say for Google, the collection size keeps increasing — then how would the updates work?
SPEAKER 0
This is an excellent question. Actually, this is why indexing is done offline, OK, and there are many methods. There are big vocabularies, but what you need to do at the end is to update the index every once in a while. When you talk about web search, some documents you will need to update every few days; some documents you need to update every few minutes, like news websites — it makes no sense to update those every few days, it would be too late. You need to find the news as soon as possible, and it takes some time to do that, but the index gets updated every once in a while. And this is one of the things we always appreciate when you have it in your group project: are you applying live indexing or just one-time indexing? I gave you an example in the first lecture about this — the book reviews search; it was live indexing. And 2 years ago they did another one which was really nice, about searching song lyrics, and they were updating it: any song that appears, they get it directly and add it to the index. I can remember they were testing it the same year when this BI2 came out — so we searched it, and it had just come out and gone viral, and we checked it was there in the collection. So this is live indexing. This is something you need to update, of course, OK? Any more questions? Yes?
SPEAKER 6
Uh, the AND operator and, uh, the OR operator. Yeah — is it possible to combine these, these two operators?
SPEAKER 0
You can combine, you can combine both, of course. Remember, you can’t combine infinitely without being careful, because imagine that I’m saying term one AND term two AND NOT term three OR term four — of course, you need some brackets in this case or something. You just apply one part and get the list that is matching, then take the other part and do the intersection or union or whatever you want. It’s very simple, OK? It can be a phrase search combined with another thing, like ‘White House’ AND Trump. So I need to match ‘White House’ together as a phrase, get the list that is matching, then find Trump alone, get the list that is matching, and then find the intersection. OK? Yes?
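[Editor’s note: a minimal sketch of that ‘phrase AND term’ combination, reusing the positional index structure (term -> {doc_id: [positions]}); the function names are assumptions.]

```python
def phrase_docs(index, t1, t2):
    # Documents where t2 appears immediately after t1 (a two-word phrase).
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = set()
    for doc in set(p1) & set(p2):
        if any(pos + 1 in set(p2[doc]) for pos in p1[doc]):
            hits.add(doc)
    return hits

def term_docs(index, term):
    return set(index.get(term, {}))

index = {"white": {1: [3], 4: [10]},
         "house": {1: [4], 2: [2], 4: [25]},
         "trump": {1: [9], 2: [7]}}

# "white house" AND trump: match the phrase first, then intersect with the single term.
print(phrase_docs(index, "white", "house") & term_docs(index, "trump"))   # {1}
```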
SPEAKER 4
Let’s consider two phrases, ‘falling down’ and ‘fallen down’. Now both ‘falling’ and ‘fallen’ have the same stem.
SPEAKER 0
Yes, yes, someone’s going to search for falling down.
SPEAKER 4
There is a chance that we might give them a document which has fallen down. Yes, yes, and if we’re applying stemming, which will be
SPEAKER 0
the case in our coursework, it will match, OK. However, in general applications, if you need the exact form — because quotes actually mean ‘I need this exact term’, OK, like when you put quotes around specific words — some search engines will actually have two indices: one is processed and stemmed, one is not, to give you the exact match. Another option, if you think about it, is to do the search and then apply post-processing to check whether the term is the full term, but it’s not efficient; you have to have another index to make it fast, OK? In the next lecture we talk about even more advanced stuff like wildcard search — if you say words that start with these three letters and then an asterisk, any term, how would you do that? We’ll talk about this in the next lecture directly. Any more questions? Nothing? OK, so the summary of this lecture: we have the document vector, the term vector, the inverted index, the collection matrix. We talked about postings and the positional index, which is the proximity or positional index, and query processing. We talked about the linear merge, but I told you: use intersection and union for now, it’s easy, you don’t have to do that. OK? And the resources for this: you can check both textbooks — Introduction to IR, chapter 1 and Section 2.4, and textbook 2, IR in Practice, chapter 5. You should read this stuff, OK? It has more details than what I covered — I might miss some stuff — more examples, more details about everything, OK? Don’t ignore them. OK, we should have a break, but would you like to take a break first and then I talk about coursework one, or talk about coursework one and then take a break? Who would like to talk about the coursework first? Oh, that’s a clear majority. OK, wow. OK, give me a second.
SPEAKER 2
Mhm. Just so. for 30 minutes.
SPEAKER 1
OK, coursework one.
SPEAKER 0
by the way, this is just general guidance about it — you will find the full description published tomorrow, OK? So this is just general guidance, and if you have any questions now, ask about it, but the full details will be published tomorrow, OK? So what is required is to implement a very simple information retrieval search engine that includes: preprocessing of text, which is tokenization, stopping, and stemming — you should have done that already; a positional inverted index, which you should be doing this week anyway; and a search execution module that allows Boolean search, which is part of this lab, lab 2, phrase search, which is part of this lab, proximity search, part of this lab, and ranked search, where you actually rank the documents according to which one is probably more relevant, which we will take next week. OK, so by this week you can already finish 60 or 70% of the coursework, if you’d like to do that. And there will be a challenging question, which is: what will happen to the results when stopping is not applied? All you need to do is test it, because in the coursework you have a clear instruction: apply stopping, apply stemming. The question now: what if you didn’t apply stopping? What would happen? You need to test it and report your observations. I don’t need the results, just tell me your observations — on all of these different searches, the Boolean search, ranked IR, the speed, the index size. Give me half a page or a page discussing this part in your report, OK, just your observations. It will have 20% of the coursework mark, but I’m not expecting everyone will do it. So if you didn’t do it, you’ll get 80% if you finish everything fine. If you do it, get some results, and discuss them, you can get the full mark, OK? Here in Edinburgh, 70% is great, OK? But this is just a question if you’d like to get the full mark, which is doable in this coursework. It depends on the lectures — lectures 4, 5, and 7: pre-processing, indexing, and ranked IR, which we will take next week — and labs 1, 2, and 3. By implementing lab 3 next week, coursework 1 is done. That’s a gift — I’m giving you 10 marks for free. You get the support and that’s fine, but it’s mainly for us to be sure that you are, uh, doing the work in this course. The deliverables are: your code, ready to run, in Python — that’s what we recommend. If you’d like to use something else, let us know. Someone told me he’s using PHP or something — you can use it, but I need to double check with the markers that we have that they can read it, or you can always go to ChatGPT and say, this is my code in PHP, convert it to Python. That’s OK, guys, it’s fine, it’s easy. Your report: we need a report between 2 to 4 pages, not that much, just 2 to 4 pages, which describes the modules you implemented — tell us, I implemented this module for pre-processing which does this and that, another module for the indexing, another one for the search, one for the Boolean search, one for the ranked search. Just let us know. And it’s important to tell us why you implemented something in a specific way, and to include any discussion about the challenging question — if we didn’t apply stopping, what did you notice about the speed, the results, and so on. And what is really, really important then is the search results files.
We will give you the details tomorrow about the format of the output — a specific format — because I will give you queries and you will tell us what the results are for each one. No one will check it manually; it will be done automatically. So what is required is to be sure that it’s in the right format, because if it’s in the wrong format, it will not process. It doesn’t mean that if it didn’t process we will give you a 0 — we are a bit kind, we will try to find out what the problem is and fix the format for you — but you will be penalized for this, because it took more work from the markers we have. OK, so just be sure about the format. What is assessed: your search results, which is automatic marking, and the quality of the report and the explanation of the code. So it will be a mix of both. Not highly considered at this stage is the speed of the system, unless it’s unreasonably slow — for example, if you process a query and it takes like 10 minutes, that’s too much. But if, instead of taking milliseconds, it’s taking 2 or 3 seconds, that’s fine. And the quality of the code — we will not say that it has to be super commented and documented and stuff; this is just a warm-up coursework. But don’t make it rubbish, just make something that’s at least somehow readable. What is allowed, what is not allowed. Allowed: to use libraries for the Porter stemmer, OK? Don’t implement the Porter stemmer; just use a library, and I think the different things you can use have already been discussed on Piazza. You can use ready code from Stack Overflow, or even generate it with ChatGPT or Claude or whatever thing you are using, for small pieces — like a function for the intersection, for example. It’s OK to get something like that ready and use it, but not the whole thing. Don’t say, ‘can you read this link and implement the coursework for me?’ No, no, no, don’t do that. Discuss some functions with your friends — you can say, how did you do the intersection part? For me it’s a bit challenging. How did you do the union? Do you have something you recommend? It’s OK to discuss something like this, even on Piazza, that’s fine. And use Piazza to ask general questions about the implementation, it’s OK. What is not allowed — and I will actually raise a flag about something that happened last week; it’s fine at this stage but not later — is using libraries for tokenization or stopping. You have to implement these yourself, because NLTK, for example, has its own stop words list — no, you will be given a specified stop words list, use that one. So you need to use the same exact stop words, and the tokenization as well, in the way we are talking about. What is not allowed is, of course, copying code from each other. This is something that is not allowed, and this is why, from now on, when you talk about your results for labs 2 and 3, don’t share your full code. Don’t share your code anymore. If someone asks which function you used for this, just give a general answer, OK, something general. But don’t share the whole code of how you do the indexing, because in this case it’s part of the coursework. So keep your code for yourself and just share the results. And sharing the results is really, really important, OK, for now. And sharing the results — not…
I mean, for the labs, yes — but not for the coursework, because for the coursework you will be given a new set of queries and a new collection. Once you get those, don’t share anything about them. Feel free to share everything in these couple of weeks before the deadline, which is about the labs. You can compare your results: oh, I got 5 relevant documents, you’ve got 6 — what is the difference? You can discuss it and see what the difference is. But once you’ve got the coursework collection and queries, don’t share anything, OK? The timeline: 1st of October, which is today, the initial announcement; full details will be released tomorrow. 16th of October is the test set release, so you have to implement everything for the coursework and keep testing it with the collection and queries published in the lab. Test it, discuss it as you like, but the new collection will come out for you — a different set of documents, a different set of queries — just a few days before the deadline. It should take you 2 minutes to take the new collection and queries, process them with your code, and submit the results. Even the part about writing the report — write it earlier; it doesn’t depend on these specific queries, it can be based on the lab results, it doesn’t matter, OK? The deadline: usually in most courses it would be on Friday. I will just give you more time. Thursday is the day you get the new collection, so you can submit on Thursday or Friday, but even if you didn’t finish it, you have the weekend. If you don’t want to finish earlier, you have the weekend — ruin it for yourself and work during the weekend — and the last minute of Sunday, midnight, will be the last time for submission of this coursework, OK? Some notes. It’s only 10% of the mark. The effort is a bit high, because you have to implement a lot of things last week, this week, and next week, but remember why it’s only 10%: because you have full support through this. You have the support, you have discussed it on Piazza, you can get responses from myself and the demonstrators for any questions. And when you find something in the coursework that doesn’t have a specific detail, then it means it’s flexible — it can be either, OK? Don’t say, ‘you didn’t specify, is it this or this?’ If I didn’t mention it, use either, OK? It’s good practice to build the system from scratch. I know it’s not a big search engine, but it’s just a few steps, something decent and small, and once you have built a search engine, it’s nice — it’s a nice feeling to have something working like this. And the next coursework, which will be after a month, I think, will mostly not be covered in the labs, and that one will have a higher mark because you’re going to implement more. Some advice: do labs 1, 2, and 3, because if you did them, coursework 1 is really just a new collection and queries — apply it and it will be done. Implement carefully. Write efficient and clean code, please, as much as you can — don’t give our markers a hard time, especially when you’ve got results that are not matching and they have to go through your code and see what is going wrong. Don’t give them a nightmare trying to work out what is going on.
Make it simple to see, oh, this is the problem, OK? Change the preprocessing and observe the change — what is going to happen, like with the stopping part. Test and test and test, keep testing yourself, and keep the system you’re going to implement now: it can actually be a good starting point when you go to your group project, to do something more efficient and much, much better. Do you have any questions about this? Yes — uh, you said 2 to 4 pages?
SPEAKER 1
Yes, is that like a hard limit?
SPEAKER 0
Yeah, that’s a hard limit. OK, that’s our limit. So, 2 to 4 pages — it can be 2, not less than 2, and not more than 4, OK? Yes — if you have space within the 4 pages and you would like to add something you think will help, that’s fine. But I prefer not to read code in the report; I’d rather read what you did exactly, OK? OK, that’s good. So we will have a break, and in the break I will share the mind teaser. You can have a break from now, it’s fine.
SPEAKER 2
July. Yeah. We told her before.
SPEAKER 0
OK, this is the mind teaser of today. Put some operators around these 3 numbers in each row to actually achieve the equation. There are 11 equations, OK — once you get the 11, raise your hand.
SPEAKER 2
Same, same.
SPEAKER 0
No, it can be any operator. It can — yeah, of course, it will be different from one row to another. It can be any operator, as long as it doesn’t introduce numbers, like root 3 or squaring — it can be anything, yes. OK, so once you solve the 11, raise your hands. We’ll have a 5 minute break, and I will ask at the end — maybe if nobody gets all of them, we’ll check who gets the highest number. OK, enjoy your break. OK, we are back for the 2nd lecture. I made sure the mic is working, so we don’t have the problem like last week. So the objective in this lecture is to learn — no implementation this time, so some relief for you — what is the structure, how you would do indexing with structured documents, what is an extent index, what is index compression, data structures that could be used for efficient processing of the index, and wildcard search and its applications, how it can be useful. You’re not required to implement this as part of your coursework, but it will be highly, highly appreciated if you implemented some of this stuff in your group project in the 2nd semester, OK? So still give attention to this. So: documents are not always flat. For example, there are some documents which have some interesting metadata, like the one I showed you as an example, the Financial Times articles — it has a headline, it has a title, sometimes it has an author, a timestamp. So there is some kind of structure there. Yes, the content itself is mostly unstructured, but it has some structure. And sometimes you have tags — like these 3 or 4 words in the document are linked; the text is flat, but it has a special tag, it has a link in it. So how can you deal with this stuff? It can be a hashtag, for example, or a mention — how do you make a differentiation between a word that is a hashtag and one that is not? So how can we get this information included in our index? There are different options here. One is to neglect it, totally neglect it, depending on the application — it’s up to you. Sometimes you can think, OK, I can create a different index for each field. So for example, for news articles, I will create an index for the headlines only, and another index for the body, and when someone searches, I can search just that part: are they looking in the headline, the title only, or actually everything? Or a simpler solution, which is using an extent index. So what is the extent index? This is something that we hope to see in your group projects. It’s a special term: we create a special term for each element or field or tag in our documents that would help us to allow a search like this. So we index all terms in a structured document as plain text, as normal. However, for terms that are in a specific field or tag — they have a link, or they come in the title — we do a special additional entry in the inverted index. And the posting in this case, instead of saying that a term appears in position 3 and 4 or 5, can actually say it’s from position 3 to position 10 — it will be a window instead of just a specific position. And it allows multiple overlapping, so you can in this case start to have terms appearing in the same position, because one of them is not a real term, actually — it’s just an extent term. So let’s give an example here. Remember the example we have been indexing, we have seen it. So imagine that some of these terms are linking to something — they’re linked.
So I’d like to include the information that the word here is a link — ‘ink’ actually has a link. How shall I do that? Simply, what I can do is create a new posting for links. How would this happen? For example, I can create a new special term — not a real term, but a special term called LINK. If the actual word ‘link’ appears inside these documents, that would be a different thing; this is just a special term, and it will tell me there are links in document 3 between positions 1 and 2, in document 4 between 1 and 4, and in document 5 between 7 and 8. So it’s not a real term, it’s just a special term. If I have a headline, a title, I can do the same thing. So with this very simple hack, I just added one new posting, and it saved for me a lot of information that I can use later. You might think, who is going to search for something that should be linked? But in some cases, when you do a web search for example, when a specific term I’m looking for is linked to another place, it might actually be weighted differently than a normal term that appears in the document. So this information is important to save. So, using extents in this case — imagine these documents, which would be similar to the ones you have in your lab. You’re not going to do extents there, but this is just something in case you’d like to use it. Say you have a headline field with a lecture title, and text: ‘This is lecture 6 of the TTS course in IR.’ Imagine I apply some kind of stop words removal — I remove the stop words and I start to save the positions of each of these terms. So this is term 1, 2, 3, 4, and so on. I just ignored the structure — ‘this is in the headline and this is in the body’ — no, I kept counting normally. Then someone comes and says, I’m looking for a query, the word ‘lecture’, but in the headline. It has to be in the headline — you can do that in Google Scholar, for example: if you’re searching for papers, you can go and say I need this term in the title only. So what it does: it searches the posting for ‘lecture’, which appears in document 1 in positions 3 and 4 — like lecture here and lecture there — but for the headline, I now have a special extent term for headlines saying that in document 1 it covers positions 1 to 3, in document 2 positions 1 to 5, and so on. So when I start to match, I will find that the match is just this first mention of ‘lecture’, not the second one, because the second one is not inside the headline. So it’s an AND: I applied an AND operator — it should match ‘lecture’ and it should also be inside HEADLINE. Yes? Why not have a different index for each field? This is a good question. Imagine you have something like Google Scholar, OK? You have title, authors, institutions, abstract, introduction, conclusion, appendix. Some papers will have a different structure. How long would it take you to keep creating a separate index for each of these? And what if a term appears in multiple of these fields? So the easiest thing is to keep one normal index that has everything together, and it tells you that the abstract of document 5 is between word 15 and 45. It becomes much more efficient, and in the future, if something new appears — like the appendix, or captions of figures — you can create a new extent term and just add the information without creating a whole new index.
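[Editor’s note: a minimal sketch of the extent-index idea just described, not the lecturer’s implementation. Ordinary terms keep positional postings, a special term like HEADLINE stores (start, end) windows, and a ‘term in headline’ query is an AND between the two; all names and numbers here are assumptions.]

```python
# Ordinary positional postings: term -> {doc_id: [positions]}
postings = {"lecture": {1: [3, 12], 2: [4]}}

# Extent postings for a special term: field -> {doc_id: [(start, end), ...]}
extents = {"HEADLINE": {1: [(1, 3)], 2: [(1, 5)]}}

def in_field(term, field, doc_id):
    # True if at least one occurrence of the term falls inside one of the
    # field's windows in that document (an AND between term and extent term).
    spans = extents.get(field, {}).get(doc_id, [])
    return any(start <= pos <= end
               for pos in postings.get(term, {}).get(doc_id, [])
               for start, end in spans)

# "lecture" restricted to the headline:
print([d for d in postings["lecture"] if in_field("lecture", "HEADLINE", d)])  # [1, 2]
```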
It becomes much, much more efficient in handling them. Yes, this headline, you know, the vector or whatever becomes
SPEAKER 3
very long — if there are 1 million documents in the collection, it would be
SPEAKER 0
huge. Yes, it will be huge, definitely, because this extent term is like indexing the word ‘is’ — it will be everywhere. That is fine, but it’s still much more efficient than the other options. Which is actually a good question: what happens when we have millions of documents? The postings would be huge, correct? You’re playing with 1,000 documents, and the coursework could be 5,000 documents — this is nothing. Once we have millions, what would we do? Actually, millions is not even a problem now. Trillions — OK, now we start to have problems. Which brings us to index compression. So, the inverted indices are big, and the problem is that even with a large disk they take a large amount of space, and I need to read the index into my memory, which takes a long time. You know, if you have some background in computer systems, reading from the disk is way, way slower than reading from memory, so this will take a long amount of time. So the main purpose of index compression is to reduce the space the index takes. The one I showed you earlier is a terrible format: it’s saved as text, nothing efficient at all — it’s not even a binary file or something more efficient. I’m saving document number 3 as a character, as a letter, so it’s very inefficient. If you’d like to make this more efficient, you start thinking about how to compress it, make it take a small space. So instead of my index taking 10 gigabytes, which I cannot fit in my memory, it will take only 1 gigabyte and can fit wholly in my memory without needing to access my disk. So the large size of the index comes from several things. One of these is the terms themselves, the dictionary, and another part is the document numbers. The question is: which do you think takes more space, the term entries or the document numbers? Who thinks it will be the terms taking more space? Who thinks it will be the document numbers? I see more saying the document numbers — yes, because the terms are maybe just 500,000 unique terms, but for each of them there is a very long list of document number postings, and this repeats for every term, so it takes a lot of space. So, is there a way to compress these document numbers so they don’t take that much storage? Let’s get introduced to something called delta encoding, which is a very simple idea. If you have a large collection, then for this term you see document 1, 2, 3, and keep going till document 5 billion or something like this. If we’d like to save this in a binary file — a file that just saves the numbers as integers, the smallest thing that can store each number — then if I have only 255 documents, I can save each document number in one byte; one byte would be enough for 255 different documents. But if I’ve got, say, document 300, one byte will not be enough; I need 2 bytes, and in that case I can store document numbers up to about 65,000. Then if my collection is bigger, reaching up to 4 billion, I will need 4 bytes. If it turns out to be 5 billion, now I need 5 bytes.
Because every time, if you’re aware of binary, it takes more bytes to store a bigger number. For example, this number can’t be stored in one byte; it needs multiple bytes, OK? This is 100,000 — yes, this needs 3 bytes to be stored in, OK. So the idea is simple: instead of saving the document number, what about saving the difference between my number and the document before me? So instead of saying I’m starting with document 1, 2, 3, I will just say document 1, and the next one is document 2 — but I will not save 2, I will save 1; it’s one step from the previous one. It’s a list, and I keep going through the list anyway to find all the matching documents, so I can start with the first document and keep track of the delta — how many documents I need to skip to reach the next one. So I can say document 1, then 1, which means document 2, then 5, which means document 7, and keep adding while I’m moving. So in this case, I can start from the beginning and keep adding the delta between them, something like this. So here: 1, then 100,002, then 100,007 — I say the delta is 5 — and after that it’s 100,008, so add 1 document, then add 3 documents, add 7 documents, and so on. So now, instead of saving these numbers in 3 bytes each, I can save them in only 1 byte.
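[Editor’s note: a minimal sketch of delta (gap) encoding over a sorted postings list — an illustration, not the lecturer’s code; the example document numbers follow the ones used above.]

```python
def to_deltas(doc_ids):
    # Store the first ID as-is, then only the gap to the previous one.
    deltas, prev = [], 0
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return deltas

def from_deltas(deltas):
    # Walk the list, adding up the gaps to recover the original IDs.
    doc_ids, total = [], 0
    for gap in deltas:
        total += gap
        doc_ids.append(total)
    return doc_ids

docs = [1, 100002, 100007, 100008, 100011, 100018]
gaps = to_deltas(docs)
print(gaps)                       # [1, 100001, 5, 1, 3, 7]
print(from_deltas(gaps) == docs)  # True
```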
SPEAKER 6
Yes, but then you lose the ability to access anywhere in the list in constant time, because you have to go through it.
SPEAKER 0
You have to go through them anyway, because if you’re looking for a term like Scotland and the documents it appeared in, you have to load the whole list anyway and go through it to do the intersection or union or whatever you want. So you have to go through them in all cases, and it’s better, instead of reading the number directly, to just add one extra addition operation. Is it clear? Good. You have a question?
SPEAKER 2
Yeah, like for instance, a document appeared in only, uh,
SPEAKER 5
uh, if a term only appeared in the first and the last document, wouldn’t we still need…
SPEAKER 0
like these kind of questions, which actually brings me to the next thing I’m going to talk. Yes, so sometimes it will be one document, but some documents will be, it will exceed many, many documents until you find the next term. Like this one. I keep actually saving one byte, then at some 0.321 documents don’t, don’t contain this term, then I start to find it again. So this one would require 2 bytes instead of one. Sometimes it can be the first document and the last document, which will require actually like 5 bytes. How shall I do that? In this case, instead of saving everything in one byte, now I have to save it in 2 bytes, which will be double the space. Sometimes if I’ve got a difference of 3, it will be 3 times the space. How shall I do that? Which brings us to another compression technique to solve this part, which is V-byte encoding. So V byte encoding to go a little bit lower level to the binary. Of what is behind this stuff. So the thing is, I would like to save the do the difference, the delta in a variable size every time. If it’s the difference is 5 documents, I will save it in 1 byte. If the difference is 300 documents, I’ll save it in 2 bytes. If the difference is 6,465,000, I’ll save it in 3 bytes. How shall I do that? It starts to use actually 1 byte goes to 8 bits. Uh, no, you know, hopefully you know binary, so I’m not speaking French for you or German or, I don’t know what’s the language you’re not speaking. So, Every bite has 8 bits. The main idea is, OK, we will use the first bit in this 1 byte. To indicate I’m terminating the number or actually I need to read the next byte to continue to finish my number, and the next 7 bits will be just the number, just to give you an example. So if I actually like to save 6, I will show it as 6 in binary will be 110, OK, but I have the full byte here. So what I can do, I will say 1 here means keep reading, and when you’re 1, this is actually all the information about the delta. It’s only 6, read 6. If it’s actually 127, it will be 71s, but I will not add 0 at the most with his left bite. I will, I will just make it 1, which means terminate it. Terminate now. So I’m using actually 7 bits of the byte, not the whole byte. One of them is just an indicator to continue or not. When I have 128. In this case, I’m using only 7 bits, so 1 bite will not be enough. So I will add 0 at the beginning. And say And put actually the 1st 1 to 8 will be 1, then 7 zeros, and then add the 1, then another one telling the one which is just saying it’s terminated now. Take the numbers and terminate it. So what would happen, I will keep breathing like this. I found 0, so I keep breathing. I find 1, then I will terminate. So what will happen, I will just leave these two bits indicating shall I terminate or continue, and this will be the final number. which shows 128. So I’m using one of the bits just to tell me, shall I continue reading or terminate it now. To give you an example, to see actually if you understand it or not. This is a real example sequence of the deltas, so let’s split them by bytes. These are actually the most left bit here. So the first one. Does it say, shall I terminate or shall I continue reading? Terminate. What is the number here? 5, amazing. The second one, it tells me to continue or terminate. Keep reading. So actually this is not the whole floor, it’s not 1. So I will continue to the next bite. Shall I terminate or continue? Terminate. So I will take both of them and combine them together, and what would be this number? 
So let’s see — it will be like this. The last byte is a terminate. So what is the first number, you said? 5. The second number? Like this — 130. And the last number? 7. Yes. So simply, I decoded the delta encoding I have: now I know that I need to jump 5 documents to find my next one, then jump 130 documents, and then jump 7 documents to get my next document number. So you can see it went a bit low level, but now I can save my deltas in a very efficient way. I don’t have to have a fixed number of bytes for the deltas: if it requires only 1, I use 1; if it requires 2, I use 2; if it requires 5, I use 5. It becomes much, much more efficient. OK — who thinks this is a lot of processing that will waste time? You have to check the bits, check whether it’s terminate or continue, and convert it. Do you think this is a lot of processing, yeah? Yes, it is processing. So why are we doing that exactly? It’s some processing, but it’s way faster than reading from the disk. What you want is to not read from the disk: I need to compress the index as much as I can, so I can load as much of it as possible into my memory. This is why, last week when we were doing the practical, I showed you that I was reading the file of the wiki abstracts from the compressed file itself — I don’t uncompress it and read it, I read it compressed using zcat — because this is way faster. Actually, you can try it: uncompress and read the file, versus read it compressed and do the processing. Usually it’s much faster to read a smaller file from the disk and do more processing than to read a bigger file with no processing. Fair enough? OK. So, index compression — this is just one sample, but there are even more complicated compression techniques that lead to more compression, like Elias gamma, for example, and this is something for your reading from the textbook. The more compression, the less storage, but the more processing. In general, the rule of thumb is: less I/O plus more processing is usually more efficient than more I/O plus no processing — it’s usually faster. And with new data structures — there are now even better data structures — the problem becomes less severe, but with really big collections you might need to use this, OK? I’m not expecting you to use compression in your group project, for example, but you might use an efficient data structure to allow you to save the index in an efficient way. It will be disappointing if I find you saved your index in the same way you’re doing it for your coursework — that is the most inefficient way to save it, OK, it’s just for practice. Fair enough. Dictionary data structures, OK. Sometimes I cannot load everything into memory anyway — it’s trillions of documents; I have to load some part of it, but not all of it, it’s impossible sometimes. So if I cannot do that, the question is what to load into memory when it’s needed, and how to reach the term whose postings I’d like to get quickly — and what data structure I should use for the inverted index in this case. The easiest way, which I recommend you use for your coursework, is using hashes or dictionaries.
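[Editor’s note: before the dictionary data structures, a minimal sketch of the v-byte scheme walked through above — not the lecturer’s code. It uses the same convention as the lecture: the high bit of a byte is 1 when that byte terminates a number, 0 when you must keep reading, and the other 7 bits carry the value.]

```python
def vbyte_encode(number):
    # Split the number into 7-bit chunks; the last byte gets its high bit set.
    chunks = []
    while True:
        chunks.insert(0, number & 0x7F)   # lowest 7 bits
        number >>= 7
        if number == 0:
            break
    chunks[-1] |= 0x80                    # high bit 1 = terminate here
    return bytes(chunks)

def vbyte_decode(data):
    numbers, current = [], 0
    for b in data:
        current = (current << 7) | (b & 0x7F)   # append 7 payload bits
        if b & 0x80:                            # high bit set -> number complete
            numbers.append(current)
            current = 0
    return numbers

# The gaps from the lecture example: 5, then 130, then 7.
stream = b"".join(vbyte_encode(g) for g in [5, 130, 7])
print(list(stream))            # [133, 1, 130, 135] -> four bytes for three gaps
print(vbyte_decode(stream))    # [5, 130, 7]
```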
So, with a hash or dictionary, you have a term and all its postings, in the structure we showed: just use the term as the key, and all the postings are saved in the value. Lookup with this is very fast, just order of one — when you need a term, you just get it, that’s easy. The problem with it: if you have a small variant, like ‘judgment’ versus ‘judgement’ in this case, you will not be able to match it; it will be a different term. No prefix search, so you cannot search as you type — you know, when I’m searching for something, Google starts to show me and even suggest some stuff; this one will not allow me to do that, because I have to finish the whole term. And if the vocabulary keeps growing, it will require some rehashing, which sometimes is not efficient to do. So hashes or dictionaries — I’m expecting you will probably use them for your coursework. It’s up to you, use whatever you want, and it’s fine with smaller collections. But when it comes to big ones, there are some other options, like, for example, binary search trees. If I have a huge vocabulary and I need to load something, then instead of waiting for the user to press enter, while the user is typing I start to load some stuff into my memory. Like, I can have A to M in one branch and N to Z in the other branch, and when they write B, I go to this branch, and I keep going while the user is typing, so I know what they are going to need and I can load it earlier. Once they finish typing, I already have it loaded in my memory. The more general version of this is B-trees, which don’t have to be binary — they can have multiple branches. This is also another option. So while you are typing, I can go and load the branches as you move. The good thing about trees is that they solve the problem of prefixes: if I start writing A, B, I can start getting what I’m looking for. But, of course, it’s less efficient than hashes — it’s order of log M — and balancing the binary trees is sometimes expensive, especially when you get new terms: I need to rebalance now because I’ve got a lot of terms starting with the letter Z, so how do I balance it over time? Sometimes rebalancing takes a large amount of time. However, this is related to an interesting feature in search, which will be highly appreciated if it is implemented in your group project, which is wildcard queries. I’m looking for a query that is ‘mon’ and anything afterwards — I’m looking for all documents containing any term that starts with MON. In this case, it can be Monday, it can be monkey, it can be Montreal, it can be money, many words like this. So binary trees can actually be useful here, because they help you: now, if I search for MON asterisk, I can go to the binary tree, or whatever tree I have, and once I reach MON I get everything underneath — all of these will be matching, correct? Because now all the branches underneath are relevant. But it has some limitations here. What if the wildcard comes at the beginning? I need something that ends with ‘mon’ — like ‘common’, for example. How would you match that with binary trees? Do you have a solution?
SPEAKER 2
Yeah, just make a binary tree with all the words reversed.
SPEAKER 0
You make a binary tree in the reverse order of the letters. That’s an excellent solution. Yes — in this case, if I need a word ending with something, I can make a reversed binary tree. Yes.
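[Editor’s note: a rough illustration of both directions. This is not a real tree — a sorted vocabulary searched with bisect behaves like the leaf level of one, and a second vocabulary of reversed terms plays the role of the backwards tree just suggested. The vocabulary and names are assumptions.]

```python
from bisect import bisect_left, bisect_right

vocab = sorted(["common", "monday", "money", "monkey", "montreal", "moon", "salmon"])
rev_vocab = sorted(w[::-1] for w in vocab)     # reversed terms, for trailing wildcards

def prefix_matches(sorted_terms, prefix):
    # Everything in [prefix, prefix + '\uffff'), i.e. all terms starting with prefix.
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_right(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

print(prefix_matches(vocab, "mon"))                         # mon* -> monday, money, monkey, montreal
print([w[::-1] for w in prefix_matches(rev_vocab, "nom")])  # *mon -> salmon, common
```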
SPEAKER 5
So like in succession and then uh.
SPEAKER 0
But actually in this case, if you are looking for something that ends with ‘nom’, say, you need to do it in reverse — you have to have that reversed binary tree, you cannot go upwards from the bottom of the normal tree. So maintaining an additional binary tree where the terms are stored backwards solves it; this is good. But what about something like this — a wildcard in the middle of the word? How would you solve something like this: I need something that starts with ‘bro’ and ends with ‘scent’? Yes — it can be done like that, but imagine that you get the subset here and the subset here and do the intersection; that can be a huge amount of terms. It can be a solution, but at running time, when the user is searching, it might take time — it’s a large amount of processing. Remember, all we are trying to do now is, when the user is searching, to find the result as soon as possible; this is our main objective here. And it gets more complicated — I might need something like this AND another term — so imagine the amount of processing needed. It will be very expensive. Which brings us to something called permuterm indices, which is about transforming wildcard queries so that the wildcard always appears at the end. I will create a binary tree, but I always make sure that the wildcard comes at the end, so one tree going in one direction is enough. How does this happen? Imagine I have the word ‘hello’, just one word. What I do: at the end of each word, I put a dollar sign to indicate this is the end of the word, and then each word is indexed with all the rotations around this dollar sign. So: hello$, then I move the h to the end, ello$h, and so on — and all of these entries refer back to ‘hello’. So if I can match o$hell, it will know it’s actually looking for ‘hello’. So how shall I use this later with the wildcard? What I do is rotate the query until the wildcard ends up at the end. I index all of these rotated versions in the binary tree, and when I get a wildcard query, I keep rotating it until the wildcard is at the end. Let me give you an example. If I’m searching for a specific term X — a specific string, no wildcard — I just add a dollar sign and find it in the tree; as an example, hello just becomes hello$, and I can match it in the binary tree. If I’m searching for a string with a wildcard at the end, X*, I add the dollar sign at the end, getting X*$, then I keep rotating till the wildcard is at the end: $X*. So in this case it looks like this: if I search for hell and anything afterwards, I know I need hell, then the wildcard, then the end of the word — hell*$ — and I keep rotating till the wildcard is at the end. So now I search for everything that starts with $hell, which will match this one, and again it refers back to the same term. I know it’s a bit complicated, but I hope you get it: I just keep rotating till I get the wildcard at the end. So let me ask you this question: what if I’m searching for a wildcard, then a string? What would it look like?
What should I be looking for? Add a dollar sign at the end and keep rotating till the wildcard ends up at the end — how would it look? Can someone say? No, no, I’m asking about this one. Yeah — the third one: X, then the dollar sign. Excellent. It’s X, dollar sign, then the wildcard, because what happens here: if I want something that ends with LLO, I know it should end with this, so I put the dollar sign at the end — *llo$ — then I keep rotating. You can see: I take it, rotate it, and it becomes llo$*. OK, can someone get this one: if I have a string, then a wildcard, then another string, how should it look? Yes — Y, dollar sign, X, then the wildcard. Excellent. So this is exactly what happens: I’m looking for h, anything, lo; I know the dollar sign goes at the end — h*lo$ — then I keep rotating and it becomes lo$h*, so it will match my query. So whatever the query is, it’s just one binary tree, containing different entries which don’t make sense to us, but each one always leads back to the same real term — and this enables searching with a wildcard and finding all the terms matching it. Is it clear now? Excellent. OK. Of course, the index size will be huge in this case, but remember, this is just for the terms — I’m not talking about the documents here, just about building something for the terms themselves. Which brings us to a different way of representation. Maybe this is not the best implementation, but there is another thing that could be used, which is character n-gram indices. It enumerates all the n-gram sequences at the character level, not the word level, occurring in any term. And in this case, hopefully, it allows us to search for a term. So the idea: when I have a wildcard, I first search an index of terms, not of documents — just an index for terms — to find which terms are relevant to my wildcard, then use those against the documents afterwards. So for example, if I’ve got something like ‘April is the cruelest month’, I can put it in a bigram representation: I put a dollar sign at the beginning of the term and a dollar sign at the end, and take all the bigrams — $a, ap, pr, ri, il, l$, and so on. This is a special symbol just to show me the boundaries of the word, and I can maintain a second inverted index from bigrams to the dictionary terms that match each bigram. So I go from character n-grams to terms, and from terms to the documents. I’ll give you an example to make it clearer, I know it might not be very clear. If I’m searching for — remember the MON wildcard, mon* — what shall I find? This other index will show me $m with all the words starting with m as its postings, mo with all the words which have ‘mo’ within them, and on with all those that have ‘on’ within them. So this is just an index for terms, not for documents: it starts from a small string and gives me all the terms that contain it. And when I’m searching for MON and a wildcard, I start with this index of character bigrams, finding all the possible terms and filtering the ones that actually match — which is an AND: when I say MON, it has to be an AND between all three of them; it has to start with m, AND contain ‘mo’, AND contain ‘on’.
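[Editor’s note: going back to the permuterm rotations for a moment, here is a minimal sketch of the rotation and query-rewriting rules just described — the helper names and tiny vocabulary are assumptions, and a real system would do a prefix lookup in a tree rather than scanning a dictionary.]

```python
def rotations(term):
    # hello -> hello$, ello$h, llo$he, lo$hel, o$hell, $hello — all map back to 'hello'.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# Permuterm "index": rotation -> original term (a dict stands in for the tree).
vocab = ["hello", "help", "halo"]
permuterm = {rot: w for w in vocab for rot in rotations(w)}

def rewrite(query):
    # Append $, rotate until the single wildcard '*' is at the end; the part
    # before '*' is then the prefix to look up in the tree.
    q = query + "$"
    while not q.endswith("*"):
        q = q[1:] + q[0]
    return q[:-1]

def wildcard_lookup(query):
    prefix = rewrite(query)
    return sorted({w for rot, w in permuterm.items() if rot.startswith(prefix)})

print(rewrite("hel*"))          # $hel
print(rewrite("*llo"))          # llo$
print(rewrite("h*lo"))          # lo$h
print(wildcard_lookup("hel*"))  # ['hello', 'help']
print(wildcard_lookup("h*lo"))  # ['halo', 'hello']
```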
So, back to the bigram lookup: it’s an AND between them. Then I take all the terms I found — which can be monkey, Monday, Montreal — and search for all of them in the normal index of documents with an OR operator, because any of these terms is valid, and then retrieve the documents. So if I’m searching for mon*, step one is searching the initial index of terms, which is $m AND mo AND on. Sometimes it will have some false positives, like ‘moon’ — it matches these bigrams, but it’s still not a match for mon* — so you can do additional filtering, but at least you did most of the job. Then step two: post-filter them, of course, to remove stuff like ‘moon’. Then step three: take the surviving terms, like monkey, Montreal, money, monastery, put an OR operator between all of them, search for all of these terms, and retrieve the documents. So the problem here is that this wildcard can end up with many, many possible terms, and it will be very expensive. Why will it be very expensive? Which part of this process do you think will take the most time? Step 3? OK, who thinks it’s step 1? Step 2? OK, step 3? OK — why do you think it’s step 2? It has some filtering; you have to check that the term actually contains these three letters in order. But who thinks it’s step 3 — can you say why you think it’s step 3?
SPEAKER 7
Because for all the occurrences — however many terms we find — we have to find the matching postings for each of those terms.
SPEAKER 0
Excellent — it’s step 3, the right answer, because you have to load all the postings for all of these terms. If you end up with 100 terms, you have to load all the postings for them and keep doing all the operations among them. The first one and the one in the middle — a simple regular expression can still solve those. But the third one is actually loading, from the real index of documents, all the possible terms and matching them. Actually, if you try it on Google itself and search for a very long query, it will tell you it is only searching for part of it, because when the query is very, very long, it takes a huge amount of time. A query of two terms takes double the time of a query of one term — it’s linear — so a long query will take a long amount of time. This is why it’s important to think about this. But it has some interesting applications, like spelling correction. For example, when the user types something and Google tells you, ‘did you mean this?’ — how do they do that? There are different ways, of course, but one of them is using this character n-gram indexing of terms. If I search for some term and it turns out that the documents containing this term are very few — so in web search, instead of 1 million documents containing it, only 10 — Google will tell you: are you sure you didn’t mean something else? It will still keep searching for the specific one — you have definitely seen this — and you can say, no, I’m looking for this specific one, or, no, I meant the other term and you corrected me. How do they know? It matches only a few documents. The other question: how do they know the other term you are potentially looking for? One of the solutions is that they check which are the closest terms to this one that have many results. For example, if I wrote something misspelled — searching for something like ‘elegunt’ — there are no documents, because it’s clearly a misspelling. Now it could be ‘elegant’ or ‘elephant’. If I use this bigram character indexing, I can see that these two terms have the most bigram matches with my original term, so I can suggest this one or that one to the user, OK? And it actually has other applications for other languages. It’s very useful for languages which are highly inflectional, like Arabic — we said that stemming helps a lot; if you don’t have a stemmer for this language, you can use character n-grams for indexing. For example, searching for these two different words: if you split them into bigrams, you can still match a large part of the word itself. And also for spelling in documents which have misspellings themselves — the document, not the query — it can fix it. And you don’t always have to use only bigrams; you can use a combination of bigrams and trigrams, and it will be fine, OK? So the summary here, because I took a long time this time: the index can be multilayer, like we saw for structured documents with the extent index — having the links, having structured headlines, you can use that. And the index does not have to be formed of words; sometimes we have the character n-grams, and you can use them for indexing. I know it looks strange inside the index, something like this, but the users will not see it — remember, we said the user will not see that, we’ll just match better documents.
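[Editor’s note: a minimal sketch of the character-bigram term index and the two uses just described — wildcard term lookup with post-filtering of false positives like ‘moon’, and spelling suggestion by bigram overlap. The toy vocabulary and function names are assumptions.]

```python
from collections import defaultdict

def bigrams(term):
    # Pad with $ so word boundaries become part of the grams: $m, mo, on, n$ ...
    t = "$" + term + "$"
    return {t[i:i+2] for i in range(len(t) - 1)}

def prefix_grams(prefix):
    # For mon* we want $m, mo, on — but not n$, since the word does not end there.
    t = "$" + prefix
    return {t[i:i+2] for i in range(len(t) - 1)}

vocab = ["monday", "money", "monkey", "montreal", "moon", "elegant", "elephant"]
gram_index = defaultdict(set)                 # bigram -> set of terms containing it
for w in vocab:
    for g in bigrams(w):
        gram_index[g].add(w)

def wildcard_terms(prefix):
    # Step 1: AND the postings of the query's bigrams; step 2: post-filter.
    candidates = set.intersection(*(gram_index[g] for g in prefix_grams(prefix)))
    return sorted(t for t in candidates if t.startswith(prefix))   # drops 'moon'

def suggest(misspelling, k=2):
    # Rank vocabulary terms by bigram overlap with the misspelled query term.
    q = bigrams(misspelling)
    return sorted(vocab, key=lambda t: len(q & bigrams(t)), reverse=True)[:k]

print(wildcard_terms("mon"))   # ['monday', 'money', 'monkey', 'montreal']
print(suggest("elegunt"))      # ['elegant', 'elephant']
```

Step 3 — running the surviving terms as an OR query against the document index — would then reuse the Boolean machinery sketched earlier.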
And usually, if you have stuff like wildcards or misspellings, you can always have two indices: one for finding the potential terms the user might be searching for with a wildcard or a misspelling, and then you take the possible terms from there and use them against the document index to search for this stuff. The resources for this: you can check Chapter 3 in Introduction to IR, Sections 3.1 to 3.4, and Chapter 5, the same one I mentioned earlier, for textbook 2. That’s all for today. I’m sorry if it was a lot of information, but we said the easy stuff has gone — now we’re starting the real stuff. And please do try to apply it as soon as possible and share your results. Don’t share your code — general ideas about the code are fine, but don’t share your full code — and share the results for the lab. Say which documents you retrieved and compare with each other, OK? And best of luck.
Lecture 2:
SPEAKER 0
OK. Hi, everyone.
SPEAKER 2
Welcome to the 2nd week, for more lectures in text technologies for data science. We’re going to talk a little bit about some laws of text in this lecture and some pre-processing steps in the next one.
SPEAKER 1
But let me start before, uh, the lecture itself. How was Lab Zero?
SPEAKER 0
Anyone tried it?
SPEAKER 1
Uh, was it, yeah, trivial, yeah? If you found it trivial, that’s great. Uh, if you found it challenging, again, think twice. There was no support for the lab last week because
SPEAKER 2
it’s just something for yourself to see if you can
SPEAKER 3
do that or not.
SPEAKER 2
Uh, this week, actually, we start to do the real lab,
SPEAKER 1
uh, of the course, so actually it will be, uh,
SPEAKER 0
practicing all what we will take in this lecture.
SPEAKER 1
So whatever you’re going to take in this lecture, oh,
SPEAKER 2
thank you.
SPEAKER 3
I didn’t, I didn’t know that it’s not on. Oh, sorry, again, Thanks for letting me know.
SPEAKER 2
You should stop me, don’t, uh, don’t be kind, just shout, don’t worry.
SPEAKER 1
OK, so this is your lab one.
SPEAKER 0
This week is important to everyone, so it’s important that
SPEAKER 1
you try it as soon as you can. Actually, if you would like to go home and try it today, you can finish it today, so it should be fine. Hopefully it would be fun to try as well. I’ll be testing quick stuff with you in the lectures
SPEAKER 2
today.
SPEAKER 0
Try to implement directly after the lectures.
SPEAKER 1
I would say during this week would be perfect. Before the 2nd lecture is really, next week’s lecture is really, really important.
SPEAKER 0
Ask questions, share results over Piazza, so I think everyone is on Piazza now.
SPEAKER 4
Once you get, if you manage to get it, get some results, share it on Piazza directly. Uh, if you have some challenges, ask questions on Piazza
SPEAKER 0
as well. Our demonstrators will actually go and respond to you when
SPEAKER 2
you need any support.
SPEAKER 0
And the lab next Tuesday is just only if you couldn’t have tried everything and it’s not working and you need some support in person. So in this case, go to the lab.
SPEAKER 4
So I hope if we are successful, we will not find anyone in the lab next week, next Tuesday.
SPEAKER 0
OK? That’s what we’re looking for.
SPEAKER 1
So join Piazza please and get engaging on this lecture. So this is another reminder about we said it’s not
SPEAKER 4
about the course itself or we know how the search
SPEAKER 0
engine works. Yeah, you can read about it. You can see some videos and probably it will be more entertaining than my lectures, but actually what we really
SPEAKER 4
care about in this course is gaining some skills, and
SPEAKER 0
this is actually some of the skills we discussed last time: working with large textual data; a few shell commands that you get used to when you have some textual data and you'd like to explore it quickly, how you can do that quickly; getting used to Python and regular expressions, which will help you to parse text quickly; and teamwork.
SPEAKER 4
Uh, what we are planning to do today, actually, we
SPEAKER 0
start gaining some skills related to the 1st 3 ones.
SPEAKER 1
So hopefully you’ll be building some of these skills.
SPEAKER 0
You might already have them, which is great; you'll be faster in doing stuff. But if you don't, it's a good opportunity to learn about them. So the objective of this one is learning more about some of the laws of text, like Zipf's law.
SPEAKER 4
Who heard about Zipf's law before?
SPEAKER 0
Wow, I can skip this lecture then. Benford's law?
SPEAKER 1
OK, something new there. Heaps' law?
SPEAKER 2
OK. And clumping and contagion in text? Probably not many. OK, so there is still something new in this lecture,
SPEAKER 0
and this, this lecture is a bit practical, so we
SPEAKER 1
will be testing whatever we claim, uh, in the lecture.
SPEAKER 4
And you can try it yourself if you’d like actually to practice with me. If you have your laptop with you, you can go and download this one actually, you can find the link
SPEAKER 0
in the lab one on the course itself.
SPEAKER 1
This is just a copy of the Bible in text, just we’ll be using it to demonstrate what we are teaching. And you can use Python if you want. You can use Excel to draw some graphs, or actually
SPEAKER 2
you can use Python to draw the graphs itself.
SPEAKER 0
Just if you’d like to try, you don’t have to do it. You will do it in the lab anyway. I’m just saying if you’d like to try while we are speaking now.
SPEAKER 4
OK, Let’s start. Word’s nature. So this is about text. This course is about mainly text.
SPEAKER 2
And uh we can say the basic form of any
SPEAKER 4
text is actually the word.
SPEAKER 0
Text is formed of different words.
SPEAKER 4
And there are actually some characteristics of words in general, observed in how we use these words. Interestingly, these characteristics are very, very consistent across languages, across
SPEAKER 0
domains. So if you’re speaking in English, you you’re speaking Chinese, Arabic, German, at the end, you will have these characteristics
SPEAKER 4
still very common between them. And The interesting thing actually, even if different domains, you’re talking about text technology, you’re talking about medicine, you’re talking
SPEAKER 0
about entertainment, it will apply there. So across different domains and languages, these actually laws or characteristics are very common to see there.
SPEAKER 4
For example, the frequency of some words, like some words will be very, very frequent in some languages.
SPEAKER 0
Let’s speak about English now, like the word that of
SPEAKER 4
to, it’s very common to see these words in, in,
SPEAKER 1
in a sentence.
SPEAKER 0
Other words might be less frequent like schizophrenia or baya.
SPEAKER 4
These words, yeah, you might see them, but not that commonly you could see them, you might actually spend a
SPEAKER 1
year without passing by any of these.
SPEAKER 4
Interestingly, any collection, if you have a collection of news articles, a collection of books, a collection of whatever you
SPEAKER 0
have.
SPEAKER 4
Roughly, most of the time, half of the words, unique words, will only appear once. Just appear once and they will never see it again in the collection. And this is very common across languages as well.
SPEAKER 0
And if we try to draw, because now you know
SPEAKER 4
what the law is, the frequency of the words against their rank: so we say this is the most frequent word and
SPEAKER 0
how many times it appears, the second most frequent word and how many times it appeared.
SPEAKER 4
It will lead us to something like this. A few words appear a much, much larger number of times, and most of the words appear a very, very small number of times. And if you draw this on a log scale, log scale for rank and log scale for frequency, you will get something close to a straight line.
SPEAKER 0
So on the log scale, it will be close to
SPEAKER 1
a straight line going down a slope like this.
SPEAKER 4
So this is actually simply Zipf's law, which most of you know about. For a given collection of text, if you rank the unique terms according to their frequency, then multiplying the rank by the probability of seeing the term (or by its frequency) will give us something close to a constant. It's not exactly a constant, but a number that stays within a range.
SPEAKER 0
It’s not something that’s fluctuating a lot.
SPEAKER 4
So, since the probability in this case is roughly a function of 1 over
SPEAKER 0
the rank, this is why we see this kind of sharp, power-law decay.
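As a tiny sketch of the rank-times-frequency check just described (the toy sentence below is only a placeholder; in the lab you would use the real collection counts):

    # Sketch of the Zipf check: rank * frequency (or rank * probability) stays in a narrow range.
    # The toy text is a placeholder; in the lab you would use real collection counts.
    from collections import Counter

    counts = Counter("the cat sat on the mat and the dog sat on the cat".split())
    total = sum(counts.values())
    for rank, (term, freq) in enumerate(counts.most_common(10), start=1):
        print(rank, term, freq, round(rank * freq / total, 3))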
SPEAKER 4
So this is just an example. If we go out on a collection, this is actually a Wikipedia abstracts you will be using in your lab one. If we said actually what are the most frequent words appeared in this big collections, 3.5 million English abstracts, around,
SPEAKER 1
I think 90 million words in this collection.
SPEAKER 0
This is not that big. You’ll be working with bigger stuff, but this is just
SPEAKER 4
for more warm up. 90 million words. What is the frequency of the most frequent words in this collection? It turned out to be these words. And this is actually the common words we usually see in a language. So if you, this is the rank, and this is how many times it appears, if you multiplied them, if
SPEAKER 0
you multiply the rank by the frequency, you will get
SPEAKER 4
something like this. If you notice the numbers here, yes, it's not a constant, but it ranges between roughly 5 and 10 million. You don't find that sometimes it's 5 million, sometimes 100,000, sometimes 100 million. No, it's within a range; it's close to a constant here. So this is the interesting thing. So this means that the most frequent word in any collection will probably appear twice as many times
SPEAKER 0
as the second most frequent word, and three times as many as the third most frequent word, and
SPEAKER 4
so on. And for the tenth one,
SPEAKER 0
the first word
SPEAKER 4
appears about 10 times more than the 10th word. Interestingly, when we observe this across languages, it is very
SPEAKER 0
common. This is the same thing.
SPEAKER 4
And these actually are the words that are very functional
SPEAKER 0
to a sentence, that actually you cannot construct a sentence
SPEAKER 4
without it. So actually a fun exercise that you can try now will take 30 seconds. So each one try to speak to the person beside you. And to make a conversation without using any of these
SPEAKER 0
words. Can you try now?
SPEAKER 4
Yeah, you find someone beside you, try to speak to each other, don’t use any of the words here. Oh, by the way, speak in English, OK? Don’t, don’t play smart.
SPEAKER 5
I. I. It will be. Um, yeah.
SPEAKER 4
Anyone managed to do it? Kind of. It’s really hard. OK, so this is a point that some of the laws, one of the laws of text in general of the language we speak, whatever language it is, there are some terms which are very, very, very frequent that you cannot even construct a meaningful conversation for 30 seconds without
SPEAKER 0
using any of them. And actually, even if you speak a different language,
SPEAKER 4
try to do the same thing. So this is the main thing about Zipf's law here. So let's do a quick practical, because probably you already heard
SPEAKER 0
about Zipflow, but let’s do a test and see this.
SPEAKER 4
So we’ll actually do a quick experiment using Bible and also Wikipedia abstracts. One contains around 800,000 words.
SPEAKER 0
The other one contains 80 million words. Let's see if this actually holds or not, and this is the size of the text.
SPEAKER 1
One file in textual format is 4 megabytes. The other one is around 0.5 GB of text.
SPEAKER 0
So if I went here, And this actually, some shell
SPEAKER 4
commands I’ll be using, I’ll be using some shell commands here. Uh, you might follow up about this, but it would be useful to learn about these shell commands in general.
SPEAKER 0
So let’s try this. Um, one of the very uh common shell commands you
SPEAKER 1
need to learn about is cat, which actually reads a textual file from the disk.
SPEAKER 2
So here I have actually a collection, and I have a Bible
SPEAKER 3
text, and there is also more, which will
SPEAKER 1
keep showing me the results in the terminal, uh, page by page.
SPEAKER 0
So if I did that, so this is simply the Bible, the text of the Bible. You can see it has Verses in each chapter, and
SPEAKER 2
this is how it goes.
SPEAKER 0
So Let’s try to count actually what are the most frequent words here in this one. I don’t have to write code for this. I can still use shell commands.
SPEAKER 4
For example, one of the things I need to do, let me count actually all the uppercase and lowercase the same. So actually what I can do, I can lowercase everything.
SPEAKER 0
So one interesting one is actually translate, which is tr.
SPEAKER 1
I will translate anything that is uppercase into a lowercase letter. This is simply what it does. If I pipe it into more, as you can see now, everything is lowercase. OK.
SPEAKER 0
What I can do more, actually, is, I can see there is some punctuation here, like uh dots and commas
SPEAKER 1
and colons.
SPEAKER 0
What I can do, I also translate this stuff.
SPEAKER 1
So, for example, I translate the comma, semicolon, uh, apostrophe,
SPEAKER 2
and dots with space. If I checked how it would look like, this is how it looks like now.
SPEAKER 1
I’m processing text from the shared comments. I’m not doing anything, but this is something as you
SPEAKER 4
learn in the course, if you’ve got a collection, you need to explore it quickly. You have, you learn about these shared comments to do
SPEAKER 0
it quickly. OK.
SPEAKER 4
What I can do later, actually, let’s actually put every word in a new line so I can translate spaces
SPEAKER 2
with a newline, which is \n.
SPEAKER 3
If I checked more, this is how it would look
SPEAKER 2
like.
SPEAKER 1
Now, if I need to check actually, what are the most frequent words, I can sort them, which is a
SPEAKER 2
sort command.
SPEAKER 1
Which will be sorted alphabetically.
SPEAKER 0
Then what I can do is run uniq on the
SPEAKER 1
words with minus count (uniq -c), which will tell me the count of each of these unique words I have, how
SPEAKER 2
many times it is seen. Then I can sort them again according to frequency, in this case with minus n minus r, to show the most frequent ones at the top.
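A rough Python equivalent of the shell pipeline just described (lowercase, replace punctuation, one word per line, count, sort by frequency); the filename bible.txt is an assumption:

    # Rough Python equivalent of:
    #   cat bible.txt | tr 'A-Z' 'a-z' | tr ',.;:' ' ' | tr ' ' '\n' | sort | uniq -c | sort -nr
    # The filename is an assumption; point it at whatever copy of the collection you have.
    import re
    from collections import Counter

    with open("bible.txt", encoding="utf-8") as f:
        text = f.read().lower()                        # lowercase everything
    words = re.split(r"[\s,.;:'\"]+", text)            # split on whitespace and simple punctuation
    counts = Counter(w for w in words if w)            # count every word
    for word, freq in counts.most_common(20):          # most frequent words first
        print(freq, word)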
SPEAKER 0
Before I run this.
SPEAKER 4
This is the Bible. What do you think would be the most frequent words here?
SPEAKER 5
God, mhm.
SPEAKER 1
The, OK.
SPEAKER 6
Maybe the same words as Wikipedia? OK.
SPEAKER 1
Any other words, any other Yeah, yeah. You might find Jesus, for example.
SPEAKER 0
God, let’s check actually if I did that.
SPEAKER 1
These are actually the most frequent words: the, and, of, to, that.
SPEAKER 4
You’ll start to find some stuff works like shell, which
SPEAKER 0
is because it’s uh it’s used more in the old
SPEAKER 1
way of writing text, onto here, and then actually the
SPEAKER 4
most relevant word to this collection is lowered, comes a
SPEAKER 1
little bit down there.
SPEAKER 4
So actually it turned out that Yes, the functional words continues to be the same, but of course in a specific domain like this, you start to find actually some
SPEAKER 0
related words. If I went more, anything relevant here to the topic, it’s all again functional words.
SPEAKER 1
Maybe the word ye is interesting here because we don’t
SPEAKER 2
use it now, but it’s actually common in that old text.
SPEAKER 1
Nothing so far. Israel comes here; something relevant appears: king.
SPEAKER 4
So far, no Jesus.
SPEAKER 0
Children before here, you start this later down the list.
SPEAKER 4
You start to find some things that are related to the topic. So most of the frequent words are the function words
SPEAKER 1
used to construct a text.
SPEAKER 4
So just actually I created a quick thing.
SPEAKER 3
Bible text.
SPEAKER 0
I created a, a simple function to just to actually
SPEAKER 2
print this in that file.
SPEAKER 3
I just put them in as file. A bad naming, but that’s OK.
SPEAKER 2
Uh, I put actually everything by the frequency, and it
SPEAKER 1
tells me how many words, 857,000, and
SPEAKER 2
about 13,000 unique ones.
SPEAKER 3
So I went there to this 10, not here. Sorry, I can see the. Oh, here you are. So I think it’s shared here as if, where is it? Oh man. Slides. Is that OK, this is a file I have open with, let’s open with this one. So these are the words sorted.
SPEAKER 2
OK, if I took these ones, let me draw by
SPEAKER 3
Excel sheet. I will not use something fancy.
SPEAKER 2
You can draw it by Python, please, make something decent,
SPEAKER 3
but this is just for demonstration. It’s a bit small. Let me zoom in a little bit.
SPEAKER 2
So if I can do here, if I pasted that
SPEAKER 1
here. You can see these are the words and how many times it appeared.
SPEAKER 2
I can let me move this one here. So I can see this is the rank. This is rank 1 and this is.
SPEAKER 3
The next rank, I just keep incrementing. OK.
SPEAKER 1
So you can see, you can see how many terms
SPEAKER 0
appeared once.
SPEAKER 1
Many, many terms appeared once. I can keep scrolling, scrolling, scrolling, scrolling.
SPEAKER 2
Many terms appeared once.
SPEAKER 0
And this is very common to any collection, even a
SPEAKER 4
very closed domain like the Bible, a religious text like
SPEAKER 0
this. So actually if I went there and I started to,
SPEAKER 2
I’d like to actually implement the frequency versus the rank,
SPEAKER 3
I can insert. An extra plot like this, this is how it would look like. Let me expand this so you can see it.
SPEAKER 2
OK.
SPEAKER 4
A few words appeared a lot of times; most of the words appeared very,
SPEAKER 0
very few times. If I switch to log scale, remember, we claimed that if
SPEAKER 1
I did log scale, I will start to see a
SPEAKER 2
straight line. Let’s do log scale here and log scale to the
SPEAKER 3
frequencies as well, which is from here. Log log, log.
SPEAKER 0
It’s not exactly a straight line, but close to a
SPEAKER 1
straight line.
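The same rank-frequency plot done here in Excel can be sketched in Python roughly like this (the filename is an assumption):

    # Sketch of the log-log rank/frequency plot; the filename is an assumption.
    from collections import Counter
    import matplotlib.pyplot as plt

    counts = Counter(open("bible.txt", encoding="utf-8").read().lower().split())
    freqs = sorted(counts.values(), reverse=True)      # frequency of rank 1, 2, 3, ...
    ranks = range(1, len(freqs) + 1)
    plt.loglog(ranks, freqs, marker=".", linestyle="none")
    plt.xlabel("rank (log scale)")
    plt.ylabel("frequency (log scale)")
    plt.title("Close to a straight line, as Zipf's law predicts")
    plt.show()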
SPEAKER 0
OK, but this is the Bible.
SPEAKER 4
This is a very closed-domain text. Let's do something actually bigger.
SPEAKER 0
So in this case, I have the Wikipedia abstracts in
SPEAKER 4
one file, but it’s compressed.
SPEAKER 0
I put it actually in a GZ file. I compress it with GZ.
SPEAKER 4
So there is another nice interesting shell command that you might learn about, which is GZcat. Which would read a file, a compressed file directly.
SPEAKER 0
You don’t have to uncompress it because something that you
SPEAKER 4
will learn over time. You need to save the space, and if you saved all the files like fully uncompressed, it might be taking a lot of space. So you can always compress them, which will take around
SPEAKER 0
10% of the space, and you can read them directly
SPEAKER 4
with stuff like GZCAT and so on.
SPEAKER 0
Actually, it will be even faster.
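In Python, the same trick of reading the compressed collection directly (what gzcat does) can be sketched with the gzip module; the filename is an assumption:

    # Python counterpart of gzcat: stream a gzip-compressed collection without
    # uncompressing it on disk. The filename is an assumption.
    import gzip

    with gzip.open("abstracts_wiki.txt.gz", "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            print(line.rstrip())
            if i >= 4:                 # just peek at the first few lines, like piping into more
                break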
SPEAKER 1
So if I went with, I think it’s called abstracts
SPEAKER 3
Wikipedia, the text that’s compressed.
SPEAKER 1
This is how it looks. This is much more text. If I go and apply the same thing to check the most frequent terms I have, now I'm,
SPEAKER 0
I’m processing 80 million words, remember.
SPEAKER 2
So I created something to create, to count this stuff and print it in, like call it, let’s, I call
SPEAKER 3
it wiki, wiki.text.
SPEAKER 2
OK.
SPEAKER 0
It will take a little bit of time. I don’t have a GPU on my machine. It’s just a CPU, but it should be fast actually, because you need to process the text very quickly and flush then. So what it’s doing now is just reading the 80 million words and trying actually to count every word, how many times it appeared, and then at the end we print these words sorted by frequency. This is just as it’s done. Here you go.
SPEAKER 1
So now it has processed in 30 seconds it processed
SPEAKER 4
80 million words. This is what we expect from you to have this
SPEAKER 0
kind of skills to write a code to actually process text very quickly.
SPEAKER 2
Now, actually, let me check how this file looks like.
SPEAKER 3
It’s called wiki.text.
SPEAKER 2
That’s what we called it.
SPEAKER 3
If I opened it with This one These are the terms, remember we got.
SPEAKER 2
If I copy this one, I went to Excel.
SPEAKER 1
By the way, remember actually this is from the command
SPEAKER 2
shared command. You’re talking about how many unique words?
SPEAKER 0
1.3 million unique words. That's way more. The other one was only 13,000.
SPEAKER 4
This is 1.3 million unique words.
SPEAKER 1
If I went to the Excel sheet, I’d like to
SPEAKER 2
plot it again.
SPEAKER 3
I can, again, zoom in so you can see it better. Sorry. I paste it like this, uh, it’s, the Excel cannot
SPEAKER 1
take over a million, so it just, that’s fine, just
SPEAKER 3
paste whatever you can, just, it will take some time. Done. I will do the same thing again. I will remove the, remove this on the side. I will take this is rank 1, this is rank 2. I’m sorry, rank 0. And then I can calculate it. It takes time because it’s processing a large amount of uh this is 1 million, it managed to paste 1
SPEAKER 2
million. Well let’s plot it again and see how it would look like. Maybe I’m a liar, maybe actually we’re telling you stuff that doesn’t happen in the real uh in in in real life. So if I insert again, where’s the insert?
SPEAKER 3
Same graph It takes time to process because it’s a large amount of text. Here you are.
SPEAKER 2
Does it look familiar? Even expanding it takes time. Here we are. Let’s do the log log and see how it would
SPEAKER 3
look like. The look. It takes time to process. Come on. Here And this is, this is an old machine I
SPEAKER 2
have.
SPEAKER 0
And this is Wikipedia abstracts, 80 million words.
SPEAKER 1
Which would show in a second.
SPEAKER 2
Here you are. Straight line again.
SPEAKER 1
Whatever collection you have, whatever language you have, try it.
SPEAKER 2
It will be the same, straight line.
SPEAKER 0
So a few words appear a large number of times. Most of the words appear a smaller number of times,
SPEAKER 4
and many, many words appear only once.
SPEAKER 0
Usually half of them appeared only once. OK?
SPEAKER 4
So this is simply Zipf's law. This is what it tells us.
SPEAKER 0
Let’s continue with the slides.
SPEAKER 1
And this is actually that you will be processing in your lab. Another interesting thing is When you think about this, this
SPEAKER 4
is actually the frequency of each word, how many times it appeared. If I asked you, what do you think the distribution of the first digit of these frequencies would be? If I took only the first digit of each number, and
SPEAKER 0
I plotted the distribution of this digit, of
SPEAKER 1
course from 1 to 9, because you don't write a leading 0 here.
SPEAKER 0
So what would be the distribution in your, in, in your opinion? Would it be, which one of those?
SPEAKER 4
Would it be uniform or exponential decay or normal? Do you find, are you expecting to find these numbers appearing the same amount of times, so it will be
SPEAKER 0
in uniform distribution, or you’re expecting to find one more
SPEAKER 4
than the others and keep decaying?
SPEAKER 0
Actually, you’ll find more than normal, like 5 would be
SPEAKER 4
more.
SPEAKER 0
Who you think it would be, yes, which one do you think?
SPEAKER 7
decay because the frequencies are dominated by the tail of the distribution. There are still many words that just appeared once, just a bunch of words that appeared twice, and they overshadowed the first white dot.
SPEAKER 2
That’s a good point. Anyone have a different opinion? Yes, I agree with them, but I’m not sure that
SPEAKER 5
it will start with the 1. It may start with the.
SPEAKER 0
So it will be decaying but not one.
SPEAKER 1
So it’s but in this case it might be like
SPEAKER 0
the normal distribution in this case.
SPEAKER 2
What do you think? OK, let me think.
SPEAKER 6
Uh, when you did the tokenization or splitting of the Bible, it was 12345, like the numbers were there in frequency in that rank.
SPEAKER 4
Yeah, but I really don’t, I don’t care about actually
SPEAKER 0
the numbers inside the text itself. I care about the frequencies.
SPEAKER 6
I’m just saying that when you did do that, the number, like the frequency of the numbers were in the correct order and magnitude.
SPEAKER 1
OK, so any other ideas?
SPEAKER 4
Which one do you think? OK, let’s actually make voting here. Which one, which, how of you think it will be uniform? Raise your hand.
SPEAKER 0
1, OK? Who think it will be exponential decay from 1 to, 0, the most of you? Who think it will be normal distribution or random distribution?
SPEAKER 2
OK, a couple, that’s fine.
SPEAKER 0
Actually, this is called Benford’s law.
SPEAKER 4
It’s actually the first digit of a number follows a zip uh kind of a zip flow as well.
SPEAKER 0
Like and this is actually interesting, it applies to text,
SPEAKER 4
but it applies to many other things as well. Like physical constants.
SPEAKER 0
If you get all the physical constants you have, uh, like the pi and different numbers, the first digit, it
SPEAKER 4
will have this actually. Energy bills of all the people living in the UK. Take the first digit, it will be the same. Population number of different countries, it will be the same. And this is simply what it tells you. You will find actually that 1 appears way more than 2 and 2 more than 3, and so on.
SPEAKER 0
And one of the explanations for this,
SPEAKER 4
because, for example, with term frequencies, you can find many terms, half of them appear once, so you find the digit 1 a lot. But also, think about counting up: you go 1, 2, 3, 4, 5, then you reach 10, and until you reach 20 the first
SPEAKER 0
digit stays 1: 11, 12, 13, 14. Then you see 2, and
SPEAKER 4
when you reach 100, the first digit stays 1 until it goes to 200. So this is why 1 would usually appear
SPEAKER 0
more in this case.
SPEAKER 4
And if you need a proof, Why not?
SPEAKER 0
Let’s try it.
SPEAKER 1
So remember we printed this one, so actually I think
SPEAKER 2
I printed it.
SPEAKER 3
It was, actually, I called it a.txt.
SPEAKER 2
Yeah, at least there’s a frequency. So, uh, to come out, so I can actually cut the text, and in this case I’m interested in the 2nd column only, so I use a shared command cut.
SPEAKER 0
Actually, whatever shell commands I'm using now, have a look
SPEAKER 2
on it and try to use it because it will be a good skill to have. I’m interested in the 2nd column, so if I check
SPEAKER 3
this one, it will be only the numbers.
SPEAKER 2
That’s great. I don’t have to use it the same way I’m doing it, but actually I’m using parallel in online, so
SPEAKER 3
you don’t have to do that, minus P minus E, and then in this case, I use a regular suppression to substitute. The first digit. Anything plus Rashd plus. With just the first digit only.
SPEAKER 2
More.
SPEAKER 1
So now I’m just actually getting only the first digit.
SPEAKER 2
You can see it’s here, it’s a twos and ones
SPEAKER 3
and so on. So I’m thinking now I started to notice it, but if I just try to, like to try it, then I can do sort. Unique minus count. And then sort again minus reverse.
SPEAKER 2
And here you can see 1 appears almost twice
SPEAKER 1
as often as 2, then 3, and it keeps decaying down.
SPEAKER 0
It’s actually true, and you can prove it yourself with the collections we have as well. This is the 2nd, just an interesting natural phenomenon about
SPEAKER 1
Benford’s law here, and actually it applies also to text and the frequencies of the text. OK, this is we did the practical.
SPEAKER 0
Heaps' law, the 3rd law of text.
SPEAKER 4
What it says is that, while going through a document or a big collection, the number of new terms you notice starts to reduce over time, because probably you have seen them before. For a big collection, if you just count how many terms you have seen so far and how many of them are unique, which we call N and V, where the unique ones are the vocabulary size, there is this kind of equation you can use: the vocabulary size can be nicely estimated as a constant multiplied by the number of terms you have seen, to the power of another constant, V = K * N^B, and this other constant B is usually between 0.4 and 0.7.
SPEAKER 0
OK, and it will show something like this.
SPEAKER 4
It tends towards saturation at a certain point.
SPEAKER 0
So it’s, and this is very expected because once you
SPEAKER 4
see, when you read something, the first term, of course
SPEAKER 0
it’s the first time to see it. Second term, probably new.
SPEAKER 4
3rd. 4th, I might have seen this before because the word
SPEAKER 0
that appeared again, so it’s not new.
SPEAKER 4
And as you go to the text, probably I, I’ll start to see more terms I have seen before and less terms that I haven’t seen before.
SPEAKER 0
The interesting thing about it is this is actually how
SPEAKER 4
it would look like with the Wikipedia abstracts. You can see it's reducing but still keeps increasing. So the question is: this is now 80 million words, the distribution over the Wikipedia abstracts, and it's still growing. I still see a lot of new terms. But you might think, OK, when I've read a lot, a lot of things, I should reach the point where there is nothing I haven't seen before; I definitely have seen this stuff before. But it still keeps growing. So the question is, why do you think it still keeps growing? Why do I still find new terms? Isn't the language limited? At the end, it is English only. We found 1.3 million unique terms in the Wikipedia abstracts, and it keeps growing. Why is this happening? Why do I keep finding new terms coming in?
SPEAKER 0
Think about it and share me your thoughts. I will not move until you give me your thought, by the way, yes, uh, because of the nature of
SPEAKER 8
the, I think, the, the, the corpus that we chose, such as Wikipedia, maybe it has different, uh, genres in different categories, and each of them has new words. But if you were to choose a kind of, uh, a limited corpus, maybe.
SPEAKER 4
What do you think about the Bible?
SPEAKER 0
Would it reach the situation at a certain point? this is.
SPEAKER 8
It would be more saturated than this because the, the nature of Wikipedia has more topics and more journals than limited scope.
SPEAKER 1
OK, but do you think it’s for, for some situation
SPEAKER 0
it would reach full saturation? You would not see new terms. So why do you keep finding new stuff?
SPEAKER 3
Yes, the natural sparsity of language.
SPEAKER 2
What does it mean? So there’s a lot of words that show up once
SPEAKER 6
and they’re all over the place, but do you think
SPEAKER 1
at some point I will have seen everything, almost everything,
SPEAKER 2
and not keep seeing a lot of new things?
SPEAKER 5
Yes. New words are coined regularly, so you’ll always see new
SPEAKER 7
words in text eventually because they’ll just be words that hadn’t existed before that time.
SPEAKER 1
So new, new words will be invented, for example.
SPEAKER 0
But is it actually to this extent to keep growing
SPEAKER 1
that fast?
SPEAKER 2
Yes, words are very uncommon.
SPEAKER 7
Having seen every possible.
SPEAKER 5
low. In the OK, this is close, but I need the
SPEAKER 0
right answer.
SPEAKER 4
Think about it more. Yes, I think there’s words. Forget about Wikipedia. Actually, if you apply this to any other corpus, it
SPEAKER 1
will keep growing.
SPEAKER 0
Yes. Yeah, uh, I think the more you read, the more
SPEAKER 9
you will definitely find new words because, uh, OK, the, the, the words that you’ve seen before, they will become familiar. You will keep seeing them, but then you will see something else new because other, uh, otherwise, why are we still reading because there’s something like new addition to what
SPEAKER 5
we read, OK.
SPEAKER 4
All the answers I have heard so far are OK, but it has one limitation. You’re thinking about words that this is the language you are speaking. But there are different words that it doesn’t have to be the functional words or verbs we are talking about like your names. Like your email, like spelling mistakes. Like codes. What is the code of this course?
SPEAKER 0
INF 11045.
SPEAKER 4
These are for, for you, it’s a code, but for the system, it’s a new word. So it will keep happening. So think about spelling errors. This will be a new word names, emails, codes, all of this stuff. This is why it will keep always growing. And the accurate, you can do the most actually for a collection, it will be with, it’s, uh, this actually law would apply. It’s just about estimating the correct K and B here, the constants to be able to fit it. Sometimes actually when you have a small very, very small
SPEAKER 0
collection like I will take like 10 articles of the,
SPEAKER 1
from CNN.
SPEAKER 0
I try to estimate uh how it would look like over the CNN archive. No, that’s really small, but when you start to do more, a little bit more, it becomes more accurate.
SPEAKER 4
Let’s do a practical here to prove the point and actually understand why we are doing so, OK. So, uh, here I created a, uh, uh, like a,
SPEAKER 1
a, a, a uh, uh, like a city script to
SPEAKER 2
just Do something I called, I, what did they call it? I think hips.
SPEAKER 3
Uh, oh, OK, actually, this is a Bible. I kept the Bible. I create a script like it’s called, I think heaps. P.
SPEAKER 0
Actually, what it does is very simple.
SPEAKER 4
It keeps counting how many terms I have read so far and how many unique terms I have found so
SPEAKER 1
far.
SPEAKER 0
And it prints it every time I find 100 new terms. So if I run it, you can see, it tells me I read 333 terms of the Bible and I found 100 unique words. And until I found 200 new words, I had
SPEAKER 4
to read 933, and, and so on.
SPEAKER 0
So it tells me what is an N, what is a V.
SPEAKER 4
So after 333, the vocabulary is 100, after 933, the
SPEAKER 0
vocabulary is 200, and so on, and I keep going. So remember the total number of unique words was 13,000.
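The heaps.pl script itself is not shown in the lecture; a minimal Python sketch of the same idea (count terms read, N, and unique terms seen, V, printing a line every 100 new vocabulary items; the filename is an assumption) would be:

    # Minimal sketch of what the heaps script described above does: track N (terms read)
    # and V (unique terms seen), printing a line every 100 new vocabulary items.
    # The filename is an assumption.
    seen = set()
    n = 0
    next_report = 100
    with open("bible.txt", encoding="utf-8") as f:
        for line in f:
            for term in line.lower().split():
                n += 1
                seen.add(term)
                if len(seen) == next_report:
                    print(n, len(seen))        # e.g. 333 100, 933 200, ...
                    next_report += 100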
SPEAKER 1
So I had to read the whole Bible. I found 300,000 terms. So if I did that, let me print it again
SPEAKER 3
to the a.txt file.
SPEAKER 2
OK.
SPEAKER 1
And let me calculate it actually also for the other
SPEAKER 3
one. So if I run it on the Wikipedia abstracts collection with heaps.pl, and print it, let me call it b.txt for now, and just let it keep calculating as well.
SPEAKER 2
So if I went and opened this file.
SPEAKER 3
I printed it to A, yeah. I just told you, here you are.
SPEAKER 2
These are the numbers. I will take it.
SPEAKER 3
Let me go to Excel sheet and try to administrate it. This is too heavy. Let me actually close it so I don’t. Get stuck Don’t see OK, let me open a new one. Again, it’s too small. Let me expand it a little bit, make it bigger.
SPEAKER 2
OK, so remember we have N and we have V, and these are the numbers we got. This is NMV every time.
SPEAKER 1
And we said we have K and we have B
SPEAKER 2
constants.
SPEAKER 1
And I tried to actually estimate something heaps here.
SPEAKER 2
This is actually the estimated one. Let's start with initial values, like K equals 2, and let's make
SPEAKER 3
B 0.7, for example.
SPEAKER 1
What I can do now. I can try to calculate it.
SPEAKER 2
We set the function. Let me go back to this and see the function, not here.
SPEAKER 0
This is the function: V equals K multiplied by N to the power of
SPEAKER 2
B. So if I go back to the heaps sheet, in this
SPEAKER 3
case, I can say, I can calculate this based on
SPEAKER 2
N, I’m sorry, based on actually K. Multiplied by n to the power of b.
SPEAKER 3
But actually in this case I need to just make it constant so it doesn’t change when I move the line.
SPEAKER 2
So now you can see if I just calculated K
SPEAKER 0
as just estimated K as 2 and B as 0.7,
SPEAKER 2
I will get something like this. Let me go here and try to actually apply it
SPEAKER 3
for all the terms. Uh, here you are, oops. Go back, rent it.
SPEAKER 2
And let me plot this.
SPEAKER 3
Insert. Again Uh, it shouldn’t be like this, yes. Here So the actual number is blue, OK.
SPEAKER 1
The estimated one is uh orange.
SPEAKER 0
So it’s not fitting well. So actually what I can do is try to change
SPEAKER 4
the value to see if it’s fixed.
SPEAKER 1
So actually make it 0.6.
SPEAKER 2
It went much lower. What about increasing K to 4?
SPEAKER 1
Mm, a bit estimating, it’s actually working.
SPEAKER 2
What about changing it?
SPEAKER 0
Actually, remember, this is a very closed domain, this is
SPEAKER 1
the bible.
SPEAKER 0
So usually when it’s a very closed domain, you probably will go with a lower value of B.
SPEAKER 1
So if I try to make this, for example, 0.55.
SPEAKER 3
Man No, let’s make it 7.
SPEAKER 2
Oops, OK.
SPEAKER 1
Kind of fitting.
SPEAKER 0
It’s kind of fitting.
SPEAKER 4
So now I can with this function it’s started to
SPEAKER 0
be actually very, very close to the actual thing.
SPEAKER 1
It’s actually fitting well here.
SPEAKER 4
What I can do, let me try to do it
SPEAKER 0
actually.
SPEAKER 2
How can I duplicate this one? I can actually try to make it for the Wikipedia abstracts as well, so I can copy.
SPEAKER 3
And go back here. Expand it. I didn't do it well. There is no way to duplicate it. Let me search for a duplicate option.
SPEAKER 2
Anyway, I can overwrite this one, that’s fine. So you can see now actually what is happening.
SPEAKER 1
What about Wikipedia abstracts?
SPEAKER 0
I printed it to b.txt.
SPEAKER 3
So if I go back here and go to b.txt, check. It should have been done by now. Here are the numbers. I can copy this and go to the sheet and print it. These are the values now.
SPEAKER 2
As you can see, it’s totally off because this is
SPEAKER 0
the difference between the Bible and the Wikipedia abstracts.
SPEAKER 2
Actually, you can see the orange was actually the Bible. This is an estimated one based on the Bible.
SPEAKER 1
Actually it’s growing much less than the Wikipedia. It grows much farther because it’s actually more open domain
SPEAKER 2
thing.
SPEAKER 1
So what I can do, but actually let me actually
SPEAKER 3
make this one include the whole thing. Now it’s stopping here, I can include the whole thing. Actually, it’s better to start from scratch. Come here and take come on. Yes. Of course, it’s much, much, much larger moment and it will take forever. I will do it from down up. It’s better. So if I went down there, I do that. Go up. Go up more and insert a new graph. Like this one, here you are. Delete this one. I don’t need it anymore. And this number I need to go get it to the rest of the thing. I will take control C and I will go here. So there. Now I can see the whole thing now, hopefully on the graph.
SPEAKER 2
So here is a Wikipedia.
SPEAKER 1
Thing it’s not fitting at all. So probably what I can do, I can increase the
SPEAKER 2
value of B. So what about making it 0.65 instead?
SPEAKER 3
Coming closer to 0.7. Oh, when it’s too high, I can reduce this 1 to 3. Went lower to 4.
SPEAKER 2
Oh, probably it’s fitting.
SPEAKER 1
OK, so now I can see now the value of
SPEAKER 4
the Heaps' law constant here. It changes between different domains.
SPEAKER 0
One is a very closed domain like the Bible or
SPEAKER 4
a very open domain like Wikipedia abstract. Different values, but I can do a good estimation why this is useful. The question is why this is useful. So the time I downloaded this Wikipedia abstracts when maybe
SPEAKER 0
it was 7 or 8 years ago. Now we have many more Wikipedia pages, so probably
SPEAKER 4
it's not 3.5 million anymore, probably it's 10 million. So imagine now I'm in a company; imagine
SPEAKER 0
that they hired you after taking TDS. I told you you are now an expert in search engines, and we are a new journal and we are
SPEAKER 4
publishing our articles every week, and we have only one
SPEAKER 0
month actually of articles, but we need to estimate the
SPEAKER 4
machines we need for the next 10 years. But we don’t know actually how many vocabulary we’ll have and how many articles we have. So we can take this one month of articles, 10,000 articles, 20,000 articles, and estimate. Find the K and B and actually estimate how this will be after when you reach 1 million articles.
SPEAKER 0
So if I went in the end of this one, And what I can do now, we have 80 million. What about if it will be 100 million?
SPEAKER 2
312, 3.
SPEAKER 1
Now you can get an estimate.
SPEAKER 0
What about if I reach 150 million?
SPEAKER 2
12, 3123.
SPEAKER 1
What about if I reach it, what if I reach
SPEAKER 2
200 million?
SPEAKER 3
I get the, oh, this is something missing here. No, I think it’s great.
SPEAKER 2
So if I went to the graph again, I selected
SPEAKER 3
here and I went to the end and included. This Here you are. He went up again.
SPEAKER 1
Now I can estimate what my vocabulary would be if
SPEAKER 4
I doubled the number of documents I have in my
SPEAKER 1
collection.
SPEAKER 0
I can actually get a good, very good estimate from
SPEAKER 1
this.
SPEAKER 4
So this is how it would be useful. So actually, the interesting thing is that for each collection nature, you can have a very good estimate about how this collection can grow over time by actually having this
SPEAKER 0
estimate about the collection itself.
SPEAKER 4
Is it clear so far? OK?
SPEAKER 0
You will be asked to do this in your lab, uh, so you need to implement all of this. You don't have to use Excel, of course.
SPEAKER 4
Uh, to, I think it might be easier to use
SPEAKER 0
Python and even using some of the fitting functions.
SPEAKER 2
You can use some fitting functions; just find a
SPEAKER 0
fitting function with this formula, and it will give you the value of K and B, OK?
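A sketch of that fitting-function approach using scipy's curve_fit; the (N, V) pairs below are made-up placeholders standing in for the output of your own counting script:

    # Sketch of fitting Heaps' law V = k * N**b with a standard fitting function.
    # The (N, V) pairs are made-up placeholders; use the numbers your counting script printed.
    import numpy as np
    from scipy.optimize import curve_fit

    def heaps(n, k, b):
        return k * n ** b

    N = np.array([333, 933, 2000, 10000, 50000, 200000, 857000], dtype=float)
    V = np.array([100, 200, 300, 820, 2200, 5300, 13000], dtype=float)

    (k, b), _ = curve_fit(heaps, N, V, p0=[2.0, 0.6])   # initial guess like in the demo
    print("k =", round(k, 2), "b =", round(b, 2))
    print("estimated V at twice the collection size:", int(heaps(2 * N[-1], k, b)))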
SPEAKER 4
So once you do that, share it on the Piazza
SPEAKER 0
and see actually what are, are, are you getting the
SPEAKER 4
same values or it’s a bit different from one to
SPEAKER 0
another, OK? This is just an estimation. So this is actually Heaps' law.
SPEAKER 4
The last law about text is uh that we are
SPEAKER 1
going to discuss. There are many laws, but these are, I think the most important ones for this course.
SPEAKER 0
Clumping and contagion in text.
SPEAKER 4
What we learned from Zipf's law is that most of the words do not appear that much. Actually, many words appear only once. But the interesting thing is that once we see a word at least once, we expect to start seeing it more. If there is a document where someone is called Ricardo,
SPEAKER 0
for example, and you have never seen Ricardo before, then
SPEAKER 4
you find a document talking about Ricardo, there is a big chance that the document will mention Ricardo again, even though you hadn't seen him in a million documents before;
SPEAKER 0
but now I've seen Ricardo once, and potentially I will see him again.
SPEAKER 4
So we say that words are like a rare contagious disease. You don't see it that much, but when it happens, it starts to spread.
SPEAKER 0
You sense it more. It’s like rare independent lightnings. You don’t see lightnings every day, but once you see a lightning, you probably can see another one later, very
SPEAKER 1
soon.
SPEAKER 0
So from the Wikipedia abstracts, there was a very a
SPEAKER 4
small experiment you can do and try it. OK, what about the terms that appeared only twice? So I’m looking for, I know that there are many terms that appeared many times, many, some terms appeared only
SPEAKER 0
one time, but there are some terms that appeared only
SPEAKER 4
twice. So if this actually theory is true, it means that I would see these two occurrences of the terms close to each other.
SPEAKER 0
Now I have 80 million words of running
SPEAKER 4
text. If I if there are terms that appeared only twice, I can check where this term appeared. If they appeared in the beginning and the second occurrence appeared at the end, it’s not that contagious. But if I found it close to each other, probably
SPEAKER 0
yes. When I see something, probably I’m going to see it again.
SPEAKER 4
So if we did this implementation, you can see actually
SPEAKER 0
this is the distance between the two terms, very, very
SPEAKER 4
close, almost around like 1 between them.
SPEAKER 0
They are always happening at the very top.
SPEAKER 4
Sometimes, of course, it happens like a big distance between
SPEAKER 0
them, but the majority were actually happening between a very
SPEAKER 1
close distance to each other.
SPEAKER 0
Once it’s appeared once, you expect to see it more.
SPEAKER 4
So the majority of those terms appearing only twice appear close to each other. And this gives you an idea about this theory. So let's actually do it in practice. Given a collection of 20 billion terms,
SPEAKER 1
terms in such, in such collection?
SPEAKER 4
It’s really hard to say, but if you manage to take to process only, this is 20 billion term, this is huge to process.
SPEAKER 0
I’m not taking about 80 million as Wikipedia abstracts. What you can do is to take only 1 million
SPEAKER 4
term out of this, just 1 million term of the
SPEAKER 0
collection, and they will go and use Benford’s law to estimate the K and B.
SPEAKER 4
Imagine I did that and I managed to find that
SPEAKER 0
K is 0.25 and B is 0.7 from this 1
SPEAKER 4
million terms. Now I can easily calculate how many unique terms there would be, because I can just substitute into the equation: instead of 1 million, I substitute 20 billion and see. So if I do that, it ends up, it
SPEAKER 0
will be around 4 million terms.
SPEAKER 1
I can expect it to be 4 million unique terms.
SPEAKER 4
So the other question, what do you think would be the number of terms appeared once, in this case?
SPEAKER 1
Yes, yes, around 2 million because it’s usually around 50%.
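The 20-billion-term estimate can be checked with the same formula in a couple of lines, using the K and B values mentioned (0.25 and 0.7):

    # Quick check of the estimate just discussed: V = k * N**b with k = 0.25, b = 0.7.
    k, b = 0.25, 0.7
    n = 20_000_000_000
    v = k * n ** b
    print(round(v))          # roughly 4 million unique terms
    print(round(v / 2))      # roughly half of them expected to appear only once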
SPEAKER 4
So now you don’t have to process some of the projects you might actually have in the Wikipedia and actually
SPEAKER 0
the web archive, which is trillions of terms.
SPEAKER 4
You don’t have to process everything to find the unique terms. Actually, you can have a quick estimation about the machine you want by processing a small portion of this and estimating what what are you going to expect. Is it clear so far? So the summary here, that text follows well known phenomena.
SPEAKER 0
We talked about Zipf's law, Heaps' law, and
SPEAKER 1
contagion in text, and we learned a little bit about
SPEAKER 0
shell commands. I hope you start practicing with it if you are
SPEAKER 4
not aware of them, but it's usually useful to be aware of this, and I would recommend: in the lab, I'm giving you 3 collections, these 2 collections and a 3rd one. But I would actually be very excited if some,
SPEAKER 0
some of you would think, oh, you know what, I’ve
SPEAKER 4
got a collection in French, I got a collection in German, I got a collection in Arabic, I got a
SPEAKER 0
collection in Chinese, and I applied this, and these are
SPEAKER 4
the results. Share it with us. Let’s see actually how it works.
SPEAKER 0
That would be very useful and entertaining to see.
SPEAKER 4
So the resources for this one, you, this is actually
SPEAKER 0
chapter 4 in the IR in Practice textbook. I would actually recommend watching these entertaining videos from
SPEAKER 1
Vsauce about Zipf's law. It's very interesting. It's not just about text in general, but it's very
SPEAKER 0
interesting. And Benford's law as well, on another channel.
SPEAKER 1
It’s like 10 minutes. It’s very exciting to see and understand how it’s going.
SPEAKER 0
And you can, if you are using Windows and you’re
SPEAKER 4
not actually, if you’re a Windows user, you can see
SPEAKER 2
many Apple here, but there are some Windows as well,
SPEAKER 0
or you’re not a Linux or a Mac user, then in this case, you can actually get the shell commands for Windows from this link.
SPEAKER 1
So you can actually use the cat and everything just putting this in your machine. The next lecture we’re going to start getting ready for
SPEAKER 0
ending thing and we’ll talk about the pre-processing steps.
SPEAKER 4
Now we will have a break for 5 to 10
SPEAKER 0
minutes, and you are free to go and come and
SPEAKER 1
do whatever you want, but actually remember that there will
SPEAKER 0
be a mind teaser.
SPEAKER 1
I will show it to you now.
SPEAKER 2
So uh just in the break if you’d like to
SPEAKER 3
make your mind working during this period, OK? So let me Bring it up. Let me close this. Don’t save any. And let me see.
SPEAKER 4
So this is the first mind teaser. Uh, this is just enjoy the break.
SPEAKER 0
Uh, whoever gets this, gets the answer of this, raise your hand. If you’re going to try it, you’ll get a small
SPEAKER 2
prize, OK? Just for fun, nothing. It’s not related to the course, OK?
SPEAKER 1
So, uh, yeah, have the break for 5 to 10
SPEAKER 0
minutes.
SPEAKER 4
If you got this answer, the answer of this, raise
SPEAKER 1
your hand and let me know. The one who gets it first will get a prize.
Lecture 1
SPEAKER 0
OK. Hi, everyone. Uh, welcome to the first introduction lecture for text technologies for data science. I’m Waleed Magdi. I’ll be the instructor for the next few weeks, and there will be another one, joining me afterwards. So, um, This course is about text technologies. I will explain more what it is about, and the lectures today is just an introduction. So the objective of this of the lectures today is To know more about the general, what is actually to be discussed in the course, what is the topic of it, what are the objectives we have to get from the course, the requirements that you need to have before taking the course, so you might change your mind if you don’t have these requirements, what would be the format going on during it’s a full year course, and what are the logistics, the location, the labs, how it would work, and so on. And just a note, this is, there is not much technical details today, just about, just a very uh friendly introduction, so, but don’t expect it to be the case uh going on. It’s just about to know more about the topic of, of the course. OK, so the course is called Text Technologies for Data Science. So what is it about? You can actually guess from the name. So we’re talking about here text, and text is mainly we’re talking about documents, words, terms, but we’re not talking about images or videos or music. So it’s mainly focusing on the textual part. You can, it can be videos and images, but it will be the caption or the transcription of the video, but it’s actually worth focusing on text in this course. Technologies we are talking about, we’ll be discussing several technologies within this course. The main one is information retrieval, which I will discuss more. What is it exactly, then text classification. Text analytics and we are adding actually also stuff about LLMs. Everyone is talking about LLMs, that’s actually part of it, and the rack systems, which we will discuss more what is it exactly. And we’re talking about actually if you’d like to summarize it, it’s more about Search engine technologies, all the technologies around search engines right now, which is not just the search, it’s including other stuff. So we mentioned the first part is information retrieval, which is a big part of this course. So the question is, what is information retrieval? Can anyone give me an example about an information retrieval system? Anything. Yes, database, not exactly, actually, I will explain it’s not databases, but what else? Yeah, search engine, exactly. So like Google, for example, this is a very basic example of information retrieval systems. And we call it web actually web search engines, but it’s not just about Google here or Bing or whatever we call it. So this is just one type of information retrieval systems which is web search, search engines, and we will learn how this works. But also it actually includes stuff like speechQA. Now you can actually have, it was actually Siri before, now it’s a lot of uh technologies out there. You can ask a question, you get some feedback. So this is part of the info retrieval systems. It can be actually social search, searching on Twitter or X or TikTok or whatever. You can search for something for a hashtag. You get a feed of results, and also you can see actually it’s telling you there are 4 or 5 or 10 new results showing up, which is called information filtering that keeps feeding you stuff. Also, you will find actually some recommendations which are recommender systems, which is part of information retrieval. Here. 
Actually, the field itself is old enough that it actually started in the 1950s, mainly for searching libraries. Actually, we’ll be studying stuff that has been developed in the 1950s and 1960s in our course, showing how this started, that the the the technology has started at that time, and it has been actually good enough in a way that it has been used till our days now. Actually, the same science that has been developed at that time is still being used now, but it started mainly. Decades ago, just for libraries actually to search for books and find the relevant book you’re looking for. And this is from the 1950s. It can be actually more advanced search, like actually legal search. if actually there was a lawsuit between Samsung and Apple in 2010, around that time, and actually it was actually about someone used the IP of the other company. This case took years because actually lawyers were searching tons of documents, legal documents, emails, patents, like reports, and so on to find if actually someone breached the IP of the other company. So it’s actually sometimes, and this actually costs us we’re talking about $10 billion. So this is a search actually like task as well, but it’s not like the web search. It’s something that is very, very advanced that people use. It goes across languages, because actually if you checked, 50% of the content on the internet is around English, but there are many, many other languages, and the users, actually, the English speakers are only 25%, and there are users from all over the the languages around the world. So how you can actually get information, it doesn’t exist in your language, but can exist in another language, how can you retrieve it across languages? So this is an important part as well. It even goes to other applications other than texts, like, if you know Shazam. It’s actually doing humming to get actually your song. It’s, we are not going to do this in our course, but it’s part of information retrieval, how you can actually index these kinds of things and you can be able to model it in a way that you can retrieve it later. So if I ask you what is IR, even the normal web search page, it has a lot of technologies within it. So for example, you will find when you start typing a query, you will find some suggestions which is called query suggestion or correction or prediction. So this is part of the technology. Also, you can start to find, when you find some results, it will have a small snippet under each result, which is summarizing what is the main point you’re interested in. Actually, if you, if you get the same result from two different queries, probably the snippet will be different to each one. So how you can select the part from the page to show to the user uh so they can decide to click it or not. This is another thing. You will have advertisements appearing with it. So how can you insert it in a way that actually gets relevant results with this, and it’s actually running in a parallel way. And also you have categorization, so you can actually get the results. It can be just on the web generally, or it can be news, it can be images, it can be videos, it can be whatever of this. And recently, actually, it has been popular in the last couple of years. Even you can get a summary of the AI generated answer from the top results. So this is actually happening and all of this is actually part of information retrieval. So if it’s all about searching and getting the relevant results, actually, let me ask you the question. 
Who needs IR anymore in the era of ChatGPT? So now with ChatGPT, when you're searching for something and you need to know some information, you can ask ChatGPT about it and you'll probably get a good answer. So why do we still need search engines? Can anyone guess whether we actually need them? Yeah, maybe because you want credible sources, and not everything on the internet is credible. That's a good point. You need credible sources, yes.
SPEAKER 1
If we give all our data to these, we’re exposing our private data to them. Uh, so we need our own search engines and retrieval systems so that some data is hidden from these LLMs.
SPEAKER 0
That’s an excellent point. Yes, because actually LMs are actually trained on like public data, but if you have a company and have actually some private data. You will not use it in chat GBT. However, someone can tell you, OK, but you can train your own LLM in this case with your own data. So why you still need the search engines, yes, I
SPEAKER 1
guess sometimes LLMs hallucinate while giving the answers, so maybe a search engine is typically more factual.
SPEAKER 0
So hallucination, actually: the search engine will give you some results and you read them yourself, but the LLM can hallucinate while generating the answer. But it's getting better. So why do you still need it? Maybe after a couple of years, when LLMs are amazing or perfect, we won't need search engines. Do we think we still need them? Yes. Excellent point. Yes, new data, new data. So the problem with LLMs is that they are very expensive to train, and the version you're using with ChatGPT at the moment, for example, has probably been trained last year. It took a few months to get the outcome of the model. So if I'm searching for something that happened recently, how would it get the answer? And even with enterprise data, data keeps being generated. So how do you do that? For example, just one hot topic that has been happening in the last couple of years. If I ask ChatGPT, and this is an answer I just tried yesterday, what do you know about the Palestine-Israeli conflict? ChatGPT managed to give me the answer directly, because it has been trained on a large amount of data that includes this information. It's about history, so probably it has been covered. But if I ask another question, which is, how many children have been killed in the last couple of years in Gaza, for example? This requires recent information. The information is changing every day, so it cannot get it from the model alone. When you try this, even on your own ChatGPT, if you search it here, it will tell you it is searching the web. It understood that it doesn't have this information and that it requires fresh data, and it will go and give you an answer. But the interesting thing is that at the end of the answer you will find something like this: the sources. It will tell you, I got these answers from UNICEF, from the UN, and so on, different sources. So even ChatGPT and the other LLMs, which are amazing at giving us answers to many questions we have, will continue to require searching the web. Maybe in the future we will not do the web search ourselves; LLMs will be using it. But the technology itself you will still continue to need, because everything is changing every day, many different things are happening around us, and you need fresh data. LLMs are very expensive to train to have this information. So this is still relevant and will always be required. Is it clear enough? OK. So it's critical for chatbots. Yes, chatbots are taking over. Yeah, they are used a lot, and probably you will be using them in doing your coursework. How are you going to use them? I will allow you to use them a little bit; it's fine. But at the end you will need search engines, because search engines are designed to index data very quickly and retrieve data very quickly, in a fresh way. So the other question I would discuss with you: what is information retrieval, and how is it different from "find", for example, when you open a PDF and find something? Do you know if there is any difference between the two? If I search, for example, for the word retrieval here in a PDF document, it will get spotted: where is it? So how is this different?
SPEAKER 2
This looks for literal strings while information retrieval is wiser.
SPEAKER 0
That’s one point. Yes, it will look for an exact string match, so for example, if you are looking for retrieval, it would match retrieval, but not retrieving. This is one point. Is there anything other than this that you think is different? Yes?
SPEAKER 1
A PDF is small enough, you can just look through all the words, but the internet is bigger than that.
SPEAKER 0
That’s an excellent point, exactly, because searching a PDF is like doing grep, it’s just a sequential search. It goes through the PDF reading word by word until it finds the match, which is called word spotting; if you’re using Unix commands, it’s like grep. So for a PDF it’s fine. If the PDF is 1000 pages, it might take some time. If you’re searching something like the internet, it will take forever. So this is different. Search engines are about how you can find the information very quickly, almost instantaneously, and they also match the topic itself, not just the exact match of the word. So it’s mainly two differences here, and this is the main technology we are going to study in this course. So the textbook definition of information retrieval is: finding material of unstructured nature that satisfies an information need from within large collections. So let’s take it part by part. First, the task here is to find something. I need to find something; even if you’re using the search engine within a chatbot, I need to find an answer for my question. And the nature of the material is unstructured. It’s not a database. A database, I know, has very specific fields with specific characteristics, but this is the internet, documents, enterprise documents, which can be slides versus Word reports, many things. How can you take this unstructured form and retrieve some data from it? And the target here is to satisfy the information need of a specific user. And this is why, when we learn how to measure whether something is good or bad as a search engine, we need to find a way to measure satisfaction, that the user is happy with the answer. It’s not just about matching a term. It’s more about finding what is relevant, what the user thinks is relevant. So this is the big portion of this course, which will be about information retrieval. Another thing is text classification: classifying text into different categories. There are many examples of this. For example, when you go to a news website, you will find articles classified into different topics like politics, business, technology; all of these are a kind of classification. Some of these are done manually, but now people just type the article and a system will classify which topic it is related to. You can find more of this even in social media: there are feeds about news, about entertainment, about sports, and so on, so it’s happening everywhere. And of course in social media it has to be done automatically, because you never type a tweet or a TikTok post and then say this is about politics. No, there are some engines inside which classify it automatically. Do you have any other examples of classification? There is something that is used in all of our emails. Do you know what it is? Spam detection, exactly. So simply, how can you learn that something is spam? This is just a very simple classification task. I need to find whether this email is spam or not. This is a classification task. Google tries to do its best, and sometimes it classifies spam as not spam and not spam as spam, but hopefully it’s mostly working fine.
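Coming back to the grep versus search engine point for a second, here is a minimal sketch in Python of the two approaches, with a made-up three-document toy collection; this is just an illustration of the idea, not the index you will build in the labs: a grep-style sequential scan next to a lookup in a tiny inverted index.

from collections import defaultdict

docs = {
    1: "information retrieval finds relevant documents quickly",
    2: "grep scans a file word by word to spot a string",
    3: "retrieving documents from the web needs an index",
}

def sequential_search(term):
    # grep-style word spotting: read every document word by word
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

def build_index(collection):
    # inverted index: term -> set of document IDs containing it
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

index = build_index(docs)

print(sequential_search("documents"))   # scans everything: [1, 3]
print(sorted(index["documents"]))       # one dictionary lookup: [1, 3]

The scan has to touch every word of every document for every query, while the index is built once offline and each query becomes a single lookup; that is the whole difference in spirit between Find and a search engine.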
And of course classification can also be hierarchical, so it doesn’t have to be yes or no, or multiple classes; sometimes it’s one class that goes down into subcategories, so you get different types of subcategories within the same class. So the definition of text classification is: the process of classifying documents into predefined categories based on their content. From the content of the text, I’d like to find which category it is related to. It can be binary, is it spam or not, positive or negative sentiment, for example; or it can be a few classes, like is it sports or politics or technology and so on; or it can be hierarchical, like this is the major class, but it goes down into subclasses as well. So this is another thing we are going to study within this course. A third thing we’re going to study is text analytics. Usually we’ll find a lot of things, different categories of things around us, and we’d like to understand how we can compare two things. For example, you’ll find the left and right in politics, you’ll find the war between Ukraine and Russia, and there will always be many claims around them. For example, when I was searching for this, searching left versus right, you’ll find claims that the right are those who are thinking about people and they are kind, and the left are bad and angry. On the other side you’ll find no, actually the right is very aggressive, but the left are kind. From a scientific point of view, how can we do something like this in a scientific way? So what I can do, I can say, OK, let’s go and get all the Conservative politicians, all the Labour politicians or Democrat politicians, and then take all their statements and do some objective analysis. I will take them and compare them: what are they focusing on, what is special about each of them from what they are saying. You can go further to the people: take millions of accounts of people who identify themselves on the left, millions of accounts of people who identify themselves on the right, and do the same thing. And then apply some methods so that from two big, different corpora, I can find what is unique about each of these groups. And this is something we’re going to learn in this course, when we have different corpora. It doesn’t have to be always like this; it can be, for example, what’s the difference between Shakespeare’s works and a recent writer right now? Are there any differences? How can we extract these kinds of things? This is what we’re going to study in this course: how to compare two corpora, how you find what is unique about each, what is common between them, how to deeply analyse this with data-driven approaches, all data-driven, without going in with a specific hypothesis about them. So in this course we will learn: how you can build a search engine from scratch, how we learn it from a very low level, which search results to rank at the top, how to do it fast on a massive amount of data; then how to evaluate your search engine: now you are doing something, you achieved some results, how shall I know whether these results are good or bad? How to work with text in general: if you have two tweets talking about the same topic, how shall I know about this? And if there are misspellings or morphology problems, how can we fix them?
How to classify text into different categories, and then how you can apply text analytics between different corpora, how you can find what is unique about a certain document compared to other documents. And then the final part will be about RAG systems. Imagine you have all of this knowledge about search engines; how can you integrate it into an LLM so it can extract a summary answer to a question you’re asking? So how is this course different from others? There are the NLP courses we have, courses like ANLP or FNLP. They have some text processing in common, yes, and maybe text laws; if you have heard about Zipf’s law and other stuff, we might discuss it as well, but we are not talking about NLP here. We are talking about working with massive amounts of textual data and how we can handle it. If you’re talking about machine learning courses, yes, the text classification might overlap a little bit with them, but the focus here is not on the machine learning part itself; it’s focused more on how to process the text itself, and of course on going deeper into this. But it does not overlap with the other courses in how to build a search engine: how you can add the different retrieval methods, IR evaluation methods, text analysis, processing a large amount of textual data. This is something unique about this course, where we’re going to work with large amounts of textual data, and the RAG systems. You might study LLMs in the other courses, but here it’s how we can use them in a RAG system, a retrieval augmented generation system. Some terms you will learn from this course, you might be familiar with some and not the others, but you will learn about all of these during this course: inverted index, vector space model, retrieval models like TF-IDF, BM25, language models, PageRank, learning to rank, mean average precision, mean reciprocal rank, NDCG, mutual information, information gain, chi square, binary and multi-class classification, and retrieval augmented generation. If you are not aware of many of these terms, you’re going to learn about them during this course. Another important thing about the course: it’s highly practical. So 70% of the mark is on coursework, and only 30 percent is on the final exam. We’re not going into much technical depth today, but from next week you will be implementing almost everything we study. I would say 50% of what will be in the slides, you will be implementing yourself. It’s not just about, oh, it’s fun to know about this; no, it’s not just about knowing. You will be implementing, OK? And by week 5, when you submit your first coursework, you will be required to submit a fully functional search engine from scratch. No libraries, running it from scratch, OK? How you would take the many documents, put them in an index, be able to search them, rank them according to a given query, from scratch. OK? It will be fun. You have a practical lab every week, or almost every week; we’ll give you maybe 1 or 2 weeks as buffers so you can finish your coursework, but there will almost always be labs. There are 2 courseworks which will be mostly coding, a lot of coding, and there will be a group project in the 2nd semester where you need to take all that you have learned in this course and implement something nice. I will discuss it more.
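And just to give a flavour of the corpus comparison idea mentioned a moment ago, here is a toy sketch in Python, with two made-up one-sentence corpora and simple add-one smoothing; it is only an illustration of the data-driven idea, not a method you will be marked on.

from collections import Counter

left_corpus = "we must fund public healthcare and protect workers rights".split()
right_corpus = "we must cut taxes and protect free markets and businesses".split()

left_counts, right_counts = Counter(left_corpus), Counter(right_corpus)
vocab = set(left_counts) | set(right_counts)

def distinctive(counts_a, counts_b):
    # ratio of smoothed relative frequencies; add-one smoothing avoids dividing by zero
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {t: ((counts_a[t] + 1) / (total_a + len(vocab))) /
                 ((counts_b[t] + 1) / (total_b + len(vocab))) for t in vocab}
    return sorted(scores, key=scores.get, reverse=True)[:3]

print("more distinctive of corpus A:", distinctive(left_counts, right_counts))
print("more distinctive of corpus B:", distinctive(right_counts, left_counts))

With real corpora you would use the proper measures we will cover later, like mutual information or chi square, but the shape of the computation is the same: count terms in each corpus and ask which ones are disproportionately frequent on one side.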
Prerequisites. I’m expecting you to have some mathematical background, like knowing what linear algebra is, how we can multiply matrices, these kinds of things, so I hope you have this. For example, this is something you’re going to implement, so just be sure that you understand how to read something like this. I’m expecting you to have some programming requirements: Python, which is very common, everyone should be using it. I’m expecting you might have a little bit of knowledge of regular expressions; they might be useful for parsing the text itself, but it’s fine if you don’t, you can learn them, but you should know how to code. This is something. I don’t mind you using, for example, some of the LLMs or chatbots to help you write a function or something. It’s OK, but not the whole thing, OK? So just be kind, don’t ask it to read the coursework and implement it for you. No, don’t do that. And shell commands will be very useful. Even if you’re not very familiar with them, you will learn them over the course, like using cat, sort, grep, uniq, these kinds of things. Doing text technologies, you need to be very fast with this stuff, so when you get a new collection, you don’t have to write a big piece of code to understand what is going on. I can write a little bit of shell command, one line, and get some statistics about the dataset I have. We will start this from next week, you will see it; I will be running some shell commands in front of you. And of course, knowing data structures and software engineering, because you will need to implement some components and make them speak to each other in an efficient way. I’m expecting you have this, or at least something of this, and you can build on it over the course, OK. And an important note: we do not teach you how to code in this course. I’m expecting you to know how to code, and then use your coding skills to implement, OK? Because I’m not teaching you English, for example; I expect you know English, so this is the point. And I will not expect that you don’t know Python; no, I’m expecting you to have it, and then we will be using it. Another thing that is a requirement in this course is teamwork. The 2nd semester will be a group project, and it’s a requirement that you work in a group. So if someone says, oh, but actually I don’t feel comfortable working in a group, find another course. You have to work in a group in this course, because the main point of this is to train you for industry once you graduate. Coming from some background in industry, I would say 95% of the time you will be working in a team, and you need to know how to split the task among each other, run it, and create something. So teamwork is really important, and it is a requirement that you work in a group in the second semester doing your group project. We will speak more about this in a couple of slides. Another high-level objective for you to think about when taking the course is not that I will get a high mark in TTDS. Forget about the mark. What is really important for you is to think about what you will add to your CV by taking this course. And these are the main skills we are looking for you to be able to add to your CV, like working with large textual data.
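By the way, the kind of quick one-line statistics I mentioned with cat, sort, grep, and uniq can also be sketched in Python in a few lines; the file name here is just a placeholder, not a file we provide.

from collections import Counter

# rough equivalent of: cat docs.txt | tr ' ' '\n' | sort | uniq -c | sort -rn | head
with open("docs.txt", encoding="utf-8") as f:    # hypothetical file name
    counts = Counter(f.read().lower().split())

print("total tokens:", sum(counts.values()))
print("unique terms:", len(counts))
print("top 10 terms:", counts.most_common(10))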
So in your CV you can say: I did a project that was indexing 100 million documents, and we were retrieving the results in less than a second. This is what I’m expecting you to add to your CV. You will know shell commands and stuff like this, you will be able to do that. Some Python programming, which is a requirement; software engineering skills, which will be required when doing a big project in the 2nd semester. And how you can build a classifier or a text analyser in a few minutes: you get some data and you’d like to get some analytics about it, and you should learn the skills to build that in a few minutes. So how can we do that, and of course teamwork. By working in a group, you will learn skills like project management, time management, task assignment, and system integration. So this is the stuff you will hopefully gain in this course and can add to your CV. So the structure of the course, getting more into the structure itself. We have 19 lectures during the first semester. OK, it’s all in the first semester. 2 lectures of introduction like this, nothing heavy, just a nice introduction; then there will be 13 lectures related to IR and RAG and these kinds of things, about search engines in general and how the technologies around them work; and there will be 4 lectures which talk about text analytics and text classification. And there will be between 8 and 10 labs. This is what we offer here, practical labs, all in the first semester as well. There are no tutorials in this course, so it’s mainly the labs. There is some self-reading: every lecture will give you the basic stuff, then tell you where to find more, like a textbook or some papers. Lots and lots of system implementation, OK, lots of coding, and there will be some links to some nice online videos to learn more about general stuff. The instructors for this course will be myself and Bjorn. Actually, we have been promoted and I didn’t update this; I’m a professor now and he’s a reader, but I can update it later. So I’ll be teaching the first 7 weeks, and Bjorn will do the last 3 weeks. And we try to get a guest lecture every year. Last year I managed to bring someone who had a startup that raised $56 million using search engines, and he will tell you how you can use the stuff you are learning from this course to make a business and a startup, so that would be useful. We’ll try to bring him again, or someone else. So the lecture format. You have 2 lectures at a time, so we have the 2 lectures on the same day, back to back. Questions are allowed anytime. Stop me and ask. It’s not a case of asking questions at the end; no, you can stop me and ask anytime, and feel free to interrupt me, OK? Stop me, ask whatever you want, and we continue. After lecture one, we will take a 5 to 10 minute break, and during this, feel free to go out and come back, it’s up to you. And you can discuss the first lecture with the people around you, and before the start of the second lecture, you can ask more questions about it. And in the meanwhile, for fun, but not from today, from next week, we’ll be giving a mind teaser during the break, some mathematical problems, to see who’s going to answer it first, and there will be some small prizes for it.
But it’s mainly, you’ll have two lectures back to back with a 5 to 10 minute break in the middle. Some lectures are interactive: I’ll be speaking to you, discussing with you, waiting for your answers, so please try to participate. Some lectures will include demos, like running code together and seeing what the results would be, and this starts from next week, you’ll see it. You don’t have to copy the code behind me; it’s mainly just demonstrating what we are learning. Any questions so far? Yep. So we said we will implement everything from scratch, but are we also going to learn the modern technologies regarding information retrieval? Yes, yes, we’ll reach this part. We’ll see how the basic stuff works, then at the end you will be putting everything together with modern technology integrated, OK, so everything will be covered, but I’m not asking you to build the LLM from scratch. Don’t worry, OK? I will allow you in the 2nd semester to add it as an additional part if you’d like to build such a system, OK. More questions? OK. Labs. So we have a lab almost every week. The format of this, and please pay attention to this, works in a different way. So I give the lectures on Wednesdays. Once I do that, you will go home and you will find that the lab related to the lecture has just been released. You will be required to implement this lab directly, as soon as you can. And if you managed to finish it and got some results, please go and share your results on Piazza. You know what Piazza is, I think everyone is using it; it’s like a forum we’re going to use. If you have any questions about the lab, go and post them on Piazza, and the lab demonstrators and myself will respond to any questions you have. By the way, so far, for the last year, TTDS has had the fastest response time for questions on Piazza; we’re talking about 10 minutes on average. We’ll try to keep this up. So once you have a question, we’ll answer you. The main thing about sharing your questions on Piazza is that if you have a question, maybe another 5 or 10 people have the same question, and they’ll be able to see it and see the answer, so we don’t have to keep doing the same thing many times. So if you have any questions after the lectures about the lab, ask. You can do the lab on Wednesday if you want, but you can also try it Thursday, Friday, over the weekend, or Monday. Share the results you got, or ask questions. Now suppose you have tried everything, you asked the questions, and you still have problems that are not being solved. In this case, we have drop-in labs on the Tuesday of the week after, where you can ask the remaining questions you have, if you still have any. So I’m expecting fewer people to show up at these labs, because I’m expecting most of you will be done with it over the week, will have asked your questions, and will even have got results and shared them with everyone. But imagine you still have a problem: there is one strange problem, it’s not running well, it’s very slow, you don’t know what the problem is, you asked the question, got answered, but there is still a problem. Then you can show up on Tuesdays.
There are 2 time slots; you can show up at either of them and ask your questions to the lab demonstrators we have to give you support. OK, and there are two times, 10 and 11; I think on your timetable it says there is one at 12 as well, but we are not going to do that one, so it’s only 10 to 11 and 11 to 12, OK? Show up to either one of these. But hopefully most of you will be done already before Tuesday. And the lab demonstrators for this year, we have Natalia and Zara, who will be giving you support. Zara will be online on Piazza, and Natalia will be in person on Tuesdays if you have any questions, and I’ll be on Piazza myself as well, answering your questions if you have any. But the main thing about labs: imagine you ran it, it turned out really nice, it ran fine, and you got some results. Share your results on Piazza as well, because I like it when you share your results and others share theirs and you start comparing the outcomes from each of you. You are going to implement some stuff, we’ll give you some collection, and you’ll see how it goes. OK. Lab Zero. We don’t have a lab for today, because I’m not really teaching anything today, I’m just giving you an introduction. However, we created this Lab Zero a couple of years ago because we realised some students struggle a lot in the course because they don’t have the required skills. So this lab is very simple: you read a text file word by word, then print it back. Read a file from the disk, word by word, print it back. If you are confident that this is really trivial, that’s fine, don’t do it. But if you don’t know whether it would be challenging or not, try it. If it turns out to be very challenging, my advice is: don’t take this course, because later you’ll be indexing millions of documents. If this is challenging, I’m not sure you’ll be able to catch up. If it’s more like, yeah, I’m not that confident working with textual stuff in Python, it will be my first time, but you try it and find, oh, that’s not that hard, then OK, that’s fine. In this case the course will probably be fine for you; even if it gets a little bit challenging later, you can learn the stuff. But if this is a challenging task, well, the course is all about this: reading many, many documents, processing them, indexing them, putting them in structured form, connecting components to each other, receiving queries, running them, and retrieving results in milliseconds. So if you’re not confident with reading and printing a file at the moment, then this course might not be for you, OK. Just be honest with yourself about this. The assessments, how it goes. The first coursework is 10% of the mark, only 10 points, and it’s really straightforward. Yes, you will be required to build a search engine from scratch, but simply, what you do in the labs is what you will submit as the coursework. It’s like giving you the answer: if you’re doing labs 1, 2, and 3, then that is Coursework 1, you’re done with it. This is why it carries only 10%; it’s just to confirm that you are following, that you’re moving with us, that you’re implementing week by week what we’re implementing, because each week builds on the previous week. If you delay things, you will have a big queue of stuff you need to implement.
So it’s better to keep doing the work weekly. And once you finish lab 3, go and submit your code; it will be Coursework 1. That’s it. Coursework 2 will be about how to do evaluation and some text classification and analysis stuff. It will not all be covered by the labs, and this is why it carries 20%. But the main point here is the group project. For the group project, you will form a team, a group of 4 to 6 or 5 to 6, we will see, depending on the numbers, to take whatever you have learned over the whole of semester one and build something nice in the second semester: build a search engine that does an interesting search task on a large collection, and show us its performance. And it will be assessed not by reading your code. We will ask you to submit your code to check it, of course, but the main point of the assessment will be: OK, where is the website you created? Where is the link? And our markers will go and test it, search for stuff, and see the results. We would like to be sure that the results come fast, that they are relevant, and that it doesn’t break, and they will try to break it. OK, so don’t say we didn’t expect that; we will see whether you handled different test cases so that it does not break. OK, and then they will give you a mark on this. Will test cases be provided? No, it’s your whole project. It’s your idea, you decide about it. You can decide to search something in a different language, in whatever language you want, different stuff. I will give you some examples later on about the group project. And at the end, you will find that the remaining 30% will be on the exam, and the exam will be more about theoretical understanding of the course. It’s not practical; you will have done enough practical stuff during the course. It will be just theoretical understanding of the concepts of text technologies. The largest portion of the mark is on the group project. It’s a group of 5 to 6, and you select your own group. We don’t assign you to groups; you select your own groups, so make friends from now, OK? And the task is to design an end-to-end search engine that searches a large collection of documents with many functionalities. Of course this will be explained later, but it’s mainly building a search engine, from scratch of course. You can use some libraries, and we will explain what you can use, but the main point is that you will have a search engine that can do a big search. How will the mark be given for a group? Because it’s 5 people, maybe some people will be working well, maybe some people are very active and doing a lot of stuff. At the end, we will check the project itself as a project; we forget about the people and give it a mark out of 100. Maybe this project is super nice, 80%; this one is moderate, 50%, or 30%, we’ll see. So you’ve got the group project mark, which will be the same for everyone. Then you will have a weight for each person in the group. How would we know? We will ask each of you to write a paragraph about your contribution to this project. You don’t have to contribute to every single component, but you will have to mention which parts you participated in. Ideally, or by default, everyone’s weight will be 1, so if the project gets 60, you will get 60.
However, in some special cases, when one member didn’t work well, was not participating, was not collaborating, the weight might be less than one. If they didn’t participate well, maybe 0.8, so instead of getting 60, they will get 48. If there really was a problem and they were not helping at all, maybe they get 0.5. If they were not responding at all, they would get a weight of 0. So if the group got 80 marks, you multiply it by 0 and they get 0. Ideally, everyone gets a weight of one, and this is what we have usually seen, but in some special cases, when one member is not participating well, not collaborating, and was a problem for their teammates, they will be penalised for this. OK. Fine. Just one example from a couple of years ago: this group created BetterReads, which is an interesting thing. They search for books based on the reviews; you search for something and you get results recommending books based on what you’re looking for. They were indexing around 12 million books. The retrieval time was less than a second. And they kept adding to it; it’s not just a fixed collection, they kept collecting new books and reviews every single day and putting them into their index, and they show you the books and the sentiment of the reviews, and it was hosted on Google Cloud. So this is what we expect; it was a good project. Another group actually worked with a collection of 80 million items. So this is what we expect from you at the end, with the skills you will gain during the course; you should be able to do that. It might sound like, oh wow, that’s a lot, but you will be able to do it, don’t worry, in a few weeks, hopefully. And actually, we’ll provide you with Google credit so you can host your stuff. So don’t worry, we’ll not ask you to buy anything; we’ll give you credit. Each student in your group will get a credit, and then you can pool it together and run it for the project. So the timeline for this course: we have 2 semesters. This is always the debate, is this a full year or 1 semester? So this is how it goes. In the first semester, you will get all the lectures and all the labs. There are no lectures or labs in the second semester; it’s just in the first semester. All the learning is in the first semester. The individual courseworks will be in the first semester, Coursework 1 and 2, OK, where you will implement some stuff, and this is how it should look when you submit your coursework, OK. So we’ll have one at week 5 and one to be submitted in week 11, OK? By the end of semester 1 you will be done with most of the course. However, in the second semester, we will ask you to work in teams to create a project based on what you learned. If each of you can spare something like 5 hours, as a group, and sit together and work on something, then by week 9 you will be able to submit the project you have. What you will submit will be a report about what you did and a link, and the marking will go through the report and the link itself to see the engine you created.
And after week 9, there is nothing related to the course anymore; it’s just that at the end of the semester you will have the exam. OK. So you will have the burden of the labs and the lectures in this semester, but in the second semester you manage your own time and work on the project, and time management is really important in the second semester, because you don’t see me. Of course, we’ll have lab support every week for those who have any questions about the project; they will give you support on this. You can ask me anytime, but there are no lectures or specific labs to run. It’s just about managing your own time. Don’t come 1 or 2 weeks before the deadline and say let’s build it. It won’t come out nice. And we are very selective in the marking of the projects. If it ends up that your search engine looks like your first coursework with a little bit of improvement, you’ll probably get 30%. OK? And we have seen groups who built very nice stuff, honestly, much nicer stuff. So if you spend some good time on it and manage your time, you’ll be able to create something nice. So the logistics: 2 lectures on Wednesday, the recording will be available directly after the lecture, and handouts will be posted on the day or the day before the lectures, so the slides should already be available for you to see. The course web page is this one; you probably know about it, but you’ll find all the materials there: handouts, labs, and coursework details will all be there. And of course from Learn you can get the lecture recordings, and the deadlines and the submission of the courseworks will be over there, and of course the link to Piazza as well, which is our main communication platform. This is your new social media. So if you have Facebook, Twitter, TikTok, Snapchat, add Piazza beside them, OK? Enable notifications; if you get a question, try to answer it; if you’ve got something, share it. This is the main thing, OK? You’re the new generation, guys, that uses social media, so treat Piazza as social media, and I always love responding as soon as I can. Sometimes you’ll find someone posting something at 3 a.m. and they get a response by 3:05, OK? So let’s do that, and let’s share your results there, share your questions there. And one important thing: before you ask a question, please have a look to check that it hasn’t been asked before. Use your search engine skills before posting; search and see that it hasn’t been asked before, because you might find the answer already there. And you can find the link to Piazza on Learn already, so please be sure that you have joined. There are some frequently asked questions that I will share with you, like how the project will be managed, and what if one of the members doesn’t work; I hope I have answered this. We will talk about this more later in the semester, but I’m expecting that each group will elect a project manager for the group. He or she is the one making sure that stuff is done on time, tasks are assigned, and everything is done well; and a communication person as well, so if there is any problem, someone can contact us. What if someone is not working? I know it would affect your project, but you have to handle this.
This is why it’s your responsibility to select your members. For example, you started with 6 and it turned out 3 are not working. So we will mark whatever the other three did, fairly; if it’s not great, it will be marked as not great. The 3 who didn’t participate might get 0s, but the project as a whole, the outcome of 3 people, will probably be less than the outcome of 5. It’s your own problem. We’ll check the project regardless of who worked on it, whether it’s 1 person or 10; this is a project, we’ll give it a mark, and then the individual weight will adjust this. So it’s important to be sure that you have decent group members, assign the tasks carefully, and see who’s going to work on what. Maybe someone would say, I would like to work on the interface, for example. It’s fine, it’s OK. Just think about this, and about the data collection. This is what you usually find: some people will work on the interface, and some students will say, we have created this for desktops and also a mobile phone interface, and it’s very fast and nice. Some people would create an app as well, in Flutter for example. You don’t have to do that; this is additional stuff, just if you’d like to improve your mark. Some people will say, I’m responsible for the data collection, someone for the indexing, someone for making the search fast. All of this stuff, many components will be there. The main point is to sit together, look at all that you have studied, and see who is going to work on which part. This is one thing. If someone says, I’m not solid in programming, should I take this course? Check Lab Zero. This might give you an indication; if it’s challenging, think 2 or 3 times before taking the course. Can I audit the course? Yes, anyone can audit the course. So if you have friends who would like to not take the course for credit, not sit the exam or do the coursework, but would like to come and listen, anyone can join, OK? It’s open for anyone. Now, any additional questions? Yes. So the final mark consists of several components; do you have to score at least something in each component to pass the course? For example, 60% in the exam or something like that, since you get a mark for each different component, do you need a specific percentage in each? No, no, we just check the total. So imagine you are doing amazingly and you already got 70%, the full coursework mark, before the exam, and you say, oh, you know what, I’m not going to the exam. It’s up to you, OK? If you go to the exam and gain some more marks, that’s also up to you. So yeah, it’s mainly the total; we check the whole thing. Yes, can I audit the labs? What do you mean by audit? Yeah, OK, that’s fine. Also join Piazza, OK, because I’m telling you that most of the support will be on Piazza. OK, so yeah, but of course if you would like to join the lab as well, it’s OK.
SPEAKER 1
Yes.
SPEAKER 0
That’s a good question, actually. What if you would like to program in something different than Python? I have another programming language, but nothing in Python. Yeah, so, let me know what the language is. You try to implement the stuff there and see if it works, and we’ll find you a marker to assess your work, because honestly I don’t really care about which language you’re using as much as about understanding and implementing it. Python is highly recommended, because the support will be in Python; if you ask a question in a different language and one of my demonstrators doesn’t know it, then I’m not sure I can give you the required support, for example. But it’s OK. 4 years ago I got someone who implemented everything in C, which, yeah, means instead of writing 10 lines of code, he’s writing 100, but it’s up to you, OK? At the end, if it’s running, hopefully everything you implement will be running, so we’ll not have a problem; but the problem is when stuff starts to get stuck and you need support, you can ask ChatGPT in that case, OK? Yeah. Any further questions? Yes, how many marks did they get for the example you gave, BetterReads? BetterReads, I think they got over 70. OK, so only a few projects get over 70, by the way. Those who do something very interesting and tick all the boxes. For example, they did live indexing, so it’s not just a fixed collection and that’s it; no, they kept collecting documents all the time. They did a very simple but nice interface. It was super fast. We tried to break it, and it didn’t break. For example, I give a query of, say, 500 words. What would you do? Usually you submit like 2 or 3 words, and that’s what you would design it for, but we’ll go and try it. OK, let’s try it: we’ll give it queries in English, Arabic, and Chinese. Would it work? It might not give great results, but it doesn’t break. This is the main point: did you build something interesting or not? So maybe one of the group members would be the tester, just testing, testing, testing to be sure everything is fine. But the best thing is, if you’re building a component, try to be the tester of a different component, try things out, and give each other feedback. It’s an interesting exercise. You would appreciate it once you get your first job after graduation. You will see: oh, I lived through this before, OK? It was tough at the time, but now I have experienced it. Yes?
SPEAKER 1
I don’t think everyone has access to the Learn page. I don’t have the Learn page.
SPEAKER 0
You mean Learn? If you are enrolled in the course, you should get access to it automatically. If you enrolled just yesterday, it might take a day, usually within 24 hours. If you didn’t get access yet, please contact the ITO, OK? It happens automatically, but I know it sometimes takes 1 or 2 days. Yes? I cannot hear you, I’m sorry. Does this course include training or fine-tuning of encoder models? We will tell you about it, and you can do it yourself. That’s a good question. So we will tell you how this field started, how it has been done over the years, and how it has improved over the years. And we will show you the trending stuff, but when it comes to implementing it, it might be a little bit too advanced for everyone to implement. So we’ll tell you about it, and we hope that some of you will include it in your project. You can learn a little bit more, find a recent paper, and try to use a library for it; we will encourage you to do that. But it will not be required for everyone. We’ll tell you about it. You will be implementing only the very basic stuff that you cannot pass Text Technologies without knowing. The advanced stuff, we will tell you about it, and it’s up to you later to include it in your project or not. Fair enough. OK. Yes?
SPEAKER 1
On Piazza I saw a post from someone who was looking for teammates, so the person introduced themselves and said so.
SPEAKER 0
Oh, people have started looking for teammates already.
SPEAKER 1
I was wondering, isn’t that too early?
SPEAKER 0
That’s too early. But what happens actually, we will tell you; probably we’ll ask you to start forming a team and submit your team by November, and we will have a functionality on Piazza that allows you to search for teammates. And at the end, by January, if some people are left without teammates, they will contact us and we will try to put them in a group. OK, so it’s too early to discuss it now, but one thing that is important, and that I will advise you about, is to diversify the skills. You don’t all have to be super software engineers. Maybe someone is better with the interface, the front end; some people are better in the back end; some people with setting up the servers, some people with data collection, some with the interfaces, you know. So diversify. It’s fine, and it will usually lead to a better result. Any more questions? Yes?
SPEAKER 1
I want to ask, what’s the policy regarding using generative AI?
SPEAKER 0
So, we have to live with it, OK? For example, while implementing some stuff, you can ask it to implement some functions for you, read it, understand it, and submit it. I’m OK with that. But the main thing is to understand it. Don’t run it for the whole thing; probably for components, and be sure that you understand them; it’s not good at implementing a big thing anyway. So if you’re using it to write a function for you, something to sort, something to search, and you get it, read it, and optimise it for your code, it’s fine. But remember, you will be marked on this code, so if it did something wrong, you will be penalised for it. OK. But yeah, you should use it; it’s a tool, it’s a nice tool. It might actually help you with, say, the first lab: write a function for me to read the file and print it back. It should be very straightforward, but there are different ways to do it. Which way did it show you? Is it the most efficient one? Is it the one that works with bigger texts? Would it be sufficient for other languages as well, or just English? The encoding, these kinds of tiny things. All of this, even if you’re using it, it’s better to understand before you submit, OK? OK. So we will have a 5 to 10 minute break and we’ll continue with the second lecture. It will be shorter than this one, OK? It’s still more introduction, but it’s more about the topic itself.
So let’s start with the 2nd lecture here, which is a very basic introduction, nothing hard, very easy, more discussion, just to get a flavour of what the course is about. So it’s more about defining the main components that will be within a search engine. Because I got this question before: this course is mainly about working with a huge amount of textual data and how you process it very efficiently, OK? This is the main point of this course, and this is what we’re trying to study, with some definitions around it. So the lecture objective here is to learn about the main concepts in information retrieval: the document, what is a document, the information need, the query, the index, and BOW, bag of words, which is one of the most important things about IR. So if I’d like to summarise the whole thing in a very, very abstract way, simply: there will be some user thinking about something, some information need that they have in mind, so they will type a query, searching for what they’re looking for, and submit it to the search engine. The search engine will take this information. The engine has access to trillions of documents; it tries to find exactly what this person is looking for, and brings it back to them as relevant documents, and then hopefully the user will be satisfied. So this is the main objective: put a smile on the user’s face. That’s what we’re looking for. OK. Now people are using LLMs a little bit, but it’s the same concept again. The only difference is that the user has a query or a question and will probably ask it, not to a search engine, but to an LLM like ChatGPT or whatever. And ChatGPT will realise, oh, this is recent stuff, I need to use a search engine, so it will take the query and pass it forward to a search engine. The search engine has access to masses of documents, finds what is relevant, and retrieves it back to ChatGPT, telling it, you know, these are the top 10 things I think are relevant. ChatGPT will take this, try to summarise the answer from the top retrieved ones, and bring it back to the user, so the user hopefully will still be happy. So this is simply what is going on. Even now that people are maybe using chatbots more, search engines are still there in the back, unless of course what you search for is something that is embedded in the knowledge of the LLM itself. So in the basic form, we have a given query, let’s say Q, and we find a relevant set of documents. For example, if I search for Donald Trump, I will get some relevant results here. My query here is Donald Trump, and my relevant documents will be all of the things showing on the main page of Google, for example. But there is one important thing here, which is this part. Can you read it? You cannot read it, so let’s zoom in. This is what I mentioned: it retrieved around 300 million results in 0.79 seconds. It’s not because it’s using a lot of GPUs. IR was not initially designed to run on GPUs; it runs mainly on CPUs, normal CPUs, but it has been done in a way, in a representation, that makes it super efficient. How can we retrieve this huge number of results in less than a second? And this is your target for your group project at the end. Hopefully you’ll learn the skills for this.
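That LLM-plus-search flow can be pictured with a tiny sketch. The two helper functions below are hypothetical placeholders standing in for a real search engine and a real LLM API; they only show the shape of the pipeline, not how either component is actually built.

# A very rough sketch of the flow just described (both helpers are
# made-up placeholders, not a real search engine or LLM API).

def search_engine(query, k=10):
    # would look the query up in an index and return the top-k documents
    return ["a recent report on the topic", "a news article citing official sources"][:k]

def llm_summarise(question, documents):
    # would prompt an LLM with the question plus the retrieved documents
    return f"Answer to '{question}' based on {len(documents)} retrieved sources."

def answer(question):
    retrieved = search_engine(question)          # retrieval step
    return llm_summarise(question, retrieved)    # generation step

print(answer("a question that needs fresh data"))

The point is simply that the LLM sits in front, but the retrieval step in the middle is exactly the search engine technology this course is about.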
So there are two objectives we are looking for here. One is effectiveness. It managed to retrieve a lot of results, potentially relevant. Of course it’s showing me some of them, but it has them in the back end, and if I need them, it will continue to bring them up. So it needs to find relevant documents, which is like a needle in a haystack. They say 300 million, but compared to the internet, we’re talking about trillions, so this is really a small piece of information among many, many documents. How can you find it? And this is different from databases, because it’s not just, I’m looking for the name field that contains Donald or Donald Trump. No, it’s different. It can be in anything: in videos, in images, in websites, in the news, many, many things. And the second important thing is efficiency, super efficiency. I need to find them very quickly, and we’re talking about vast quantities of data, hundreds of billions, I think it’s now over 1 trillion web pages out there. And it’s not only me sitting on Google searching and waiting for just 0.79 seconds. We are talking about around a quarter of a million Google searches per second. There are 250,000 people clicking search at the same time, and it is achieved super fast for all of them. How can we handle all of this? And this is one of the things you will be marked on, because your project will have two markers, and they will coordinate: we’re going to try to search the engine at the same time and see whether it fails or not. Is it ready to handle multiple requests at the same time? And it’s not just about this. The data is continuously changing. It’s not, OK, I’m indexing this 1 trillion pages and that’s it; probably 100 billion of them have been updated already. How can you keep this fresh? And compared to NLP, here you have to be very, very fast. In NLP, yes, I can take a month or two, or a few days, to train my model. Here you cannot; you have to retrieve the stuff very quickly. Yes, at inference time in NLP, in ChatGPT retrieving my answer, the generation tries to be fast, but the training itself is heavy. Here, the question is how you can maintain this to be super efficient and fast. So the main components here, we’re talking about mainly 3 components: the documents that we have, which can be web pages; if you’re talking about health, they can be scientific papers or patients’ records; in a library, books; they can be anything. Queries, which is what I submit to the system. And relevant documents: which of this big collection is relevant to this query? So let’s start with the document. It’s important that we agree on some terms and their definitions; we will use them during the whole course. The document is the element to be retrieved, what I’m going to retrieve. It’s usually of an unstructured nature. Usually it has a unique ID, and once I have a set of these documents, it’s called our collection. So it can be anything. It can be web pages for web search, or emails; when you search your email, you’re searching emails, so the document here is the email. It can be books, or actually not even the whole book, but a page inside the book.
If you’re searching at the page level, then the document here is a page, not the whole book. Or it can be at the sentence level, or sometimes at the tweet level if you’re on social media. It can be photos, videos, musical pieces, or even code; searching Stack Overflow, for example, it will be a piece of code. It can be answers to questions. It can be product descriptions or advertisements if you’re searching Amazon or eBay; that’s a different type of document, the document here is a product, not a webpage. It can be in different languages, of course, as we said. It doesn’t even have to be words: if you work in bioinformatics, searching DNA, it’s some kind of text, but it’s not words, it’s not English, it’s not even any language. So how can you search and find this stuff? This is the document: whatever I will be retrieving to the user, this defines what a document is. It can be anything; it’s mainly the element I’m going to retrieve to the user. Then we have the query. Here we need to make a distinction between a query and an information need. The query is defined as free text to express a user’s information need, but the information need is still the focus. Maybe the user will express it somehow, and later decide, oh, that’s not the best expression, I will try to express it in a better way. Because the information need is what is inside the user’s head, the same information need can be described with multiple queries. For example, there was a hurricane in the US around 3 years ago. I can say the hurricane in the US; I can say North Carolina storm, because it was happening in North Carolina; or I can say Florence, which is what it was called at the time. If I search for any of these 3 terms, I’m looking for exactly the same thing, but it can be expressed in different ways. And sometimes the same expression can mean different things. For example, if I search for Apple or Jaguar, what exactly am I looking for? The fruit or the product? The car or the animal? So it can differ in both directions. And queries come in different forms. In web search, probably some keywords, some narrative, I’m writing something; actually, the new generations are now more inclined to write questions on web search directly. In image search, it can be keywords or a sample image. For question answering, it can be a question. For music search, it can be humming, like in Shazam. Sometimes it’s filtering or recommendation: Google News will bring you some news that you are probably interested in; Facebook or Twitter will recommend some posts that are relevant to you. You’re not searching for anything at the moment, but they recommend posts probably relevant to your interests. If you know a lot about, for example, football, it will give you updated news about what’s happening in football, but you didn’t search for anything. So your query here is your interests: it modelled you in a way that keeps retrieving relevant stuff for you. Or it can be scholar search: go to Google Scholar, searching for papers. You can search with the author name, you can search for a paper title, you can search for words that appeared inside the paper itself.
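So, putting the terms we have so far into a toy picture: a document is just a unique ID plus unstructured text, a collection is a set of such documents, and the same information need can be expressed with several different queries. The data below is made up, purely for illustration.

# Toy illustration of collection, information need, and queries
collection = {
    "d1": "Hurricane Florence hits North Carolina",   # document: unique ID + unstructured text
    "d2": "Apple releases a new phone",
    "d3": "Jaguar spotted in the rainforest",
}

information_need = "news about the storm in the US"   # what is in the user's head
queries = ["hurricane in the US", "North Carolina storm", "Florence"]  # ways to express it

print(information_need, "->", queries)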
Queries can come in different ways, and sometimes they can be advanced: I need something from the title with this weight, and it must not contain this word, and so on. We will learn about this when we talk about Boolean search. Then comes relevance, the 3rd component. It’s mainly: is something relevant or not? Does it mean that document D matches query Q, or does it mean that document D is relevant to query Q? Sometimes it might match, like the example of words that can have different meanings, but actually not be relevant. What we are looking for here is relevance: is it relevant to what I’m looking for or not? It’s a bit tricky, because it mostly depends on the user. Will the user like it, will they click on it or not? This can be an indication of relevance. Will it help the user to achieve the task they’re trying to do, satisfy their information need? And sometimes, is it novel? Because maybe I find 20 things saying exactly the same thing; should I give them all to the user or diversify the results a little bit? Also, relevance depends on what the topic is about. Sometimes D and Q, the document and the query, might share similar meanings about the same topic, subject, or issue. This is something you always have to think about for your task when designing your system: how can I find the best match between documents and queries that satisfies the relevance part, whether it is on the same topic or not? There are some challenges here. For example, it’s sometimes not very clear: if someone is searching for William Shakespeare, are they looking for his biography, maybe from Wikipedia, or mainly for the list of works he created? It’s not very clear. And there is some ambiguity, like synonymy: if you’ve been living in Edinburgh for a while, you know what the Edinburgh Festival is, happening in August when it’s very crowded and noisy, or the Fringe, you can call it the Fringe as well. These are totally different terms, but they’re talking about the same thing. And polysemy, like Apple and Jaguar, as in the example we mentioned. And sometimes it’s very subjective, because for a given document and a certain query, the answer might be yes, it is relevant, or no, it is not relevant, or sometimes yes, it’s highly relevant, or sometimes, you know what, it’s a bit relevant, but not that much. So how can you handle that? And then, how do you counter search engine optimisation, which we’ll talk about later, and spam? Maybe some page will have a lot of rubbish terms, just adding a lot of things to bring it up in the search engine. How can you avoid this when building your system? So, relevant items are similar. This is the main idea here: usually, if a query is relevant to a document, they are probably sharing the same vocabulary or meaning. And that’s what we say: the more match there is between my query and the document, the more terms are matching, probably the more relevant it is. This is the basic idea, which is fairly simple, and it’s not a bad assumption. Sometimes similarity can be a string match, finding the same string in the document.
Sometimes it’s word overlap, which doesn’t have to be exact: searching for “playing” when the document talks about “played” is not an exact match, but it’s essentially the same thing. Or it can be a probability, which is something we’ll cover in week 4 with retrieval models: how we can estimate the probability that something is relevant to a given query or not. And there is a big difference, which we have emphasised several times, between information retrieval and databases. In databases, what we retrieve is highly structured data: it’s clear, and we know this field is a name, this is an age, this is an address, and so on. IR, by contrast, is mostly unstructured. For databases, queries are formally defined: you can say I’m looking for an age between 18 and 25, and that is precisely specified, in SQL or something like it, not ambiguous; you know exactly what you’re looking for. Free text is natural language: anyone can write something where it’s not entirely clear what they’re looking for. The results are different too. For a database you know exactly what you get: all the entries that have the name Jacques, for example. For information retrieval, we don’t actually know whether a result is relevant or not. It can contain the term but not be relevant, and sometimes it doesn’t contain the term but is still relevant. So the result itself is different. And the interaction with the system is different. With a database, I run a query, I get the results, I’m done. Search engines in general are more interactive: you search for something, you feel the results didn’t go well, you’re not satisfied, so you go and try writing something different, and hopefully this time you get a better result. It’s trial and error until you reach whatever you’re looking for. So how does information retrieval see the documents? This is important to know: in IR we don’t see documents as a sequence of words, the way we usually read them. This is what we call the bag-of-words trick. Let me give you an example. Can you read the first sentence here? Can you tell what it’s talking about?
SPEAKER 1
Yeah, it’s actually this.
SPEAKER 0
The words haven’t been put in any order. From an NLP point of view, the one in blue is a terrible sentence. What does it mean? It’s not English. But you know what it’s talking about; you got it, even though it’s not in order. For the second one, you can find the word “French”. Is it talking about France? Who thinks it’s talking about anything related to France? OK, what is it talking about? Yes, french fries. You can guess that, and the words could be ordered any way, who cares? For me, these sentences are seen as a bag of words, and this is how search engines see documents. They don’t care about the ordering; they care about what the documents are talking about. Even with the terms spread around everywhere, I can still understand what they are about, and this is the main point in IR: reordering doesn’t destroy the topic I’m looking for. Individual words are the building blocks, and the bag of words gives us the composition of the meaning, the topic these words are about. There are people who would argue with this, but first: most search engines use bag of words as the main representation. They treat documents and queries as bags of words, and the bag is a set with repetitions: how many times each term appeared in a document, and so on. I don’t care that much about the structure. A retrieval model is then a kind of statistical model that tries to measure how well a given query matches a given document based on this bag of words, and what the top results should be. It turns out that bag of words helps a lot here; it gives us a very strong baseline for getting relevant documents. There are some criticisms of this. For example: word meaning is lost without context. In isolation that’s true: if I don’t have the context for a word like “jaguar”, I don’t know if it’s about the animal or the company or the car. But bag of words doesn’t really discard context, because you still have the other words around it. If I find the word “jaguar” somewhere and the word “model” in a different part of the document, I know it’s about the car. If I find “jaguar” here and “forest” somewhere, OK, it’s about the animal. The context is still there; it’s just not ordered for me. Another criticism: it discards the surface form and the well-formedness of the sentence, which I really don’t care about in search engines and information retrieval. This is a big difference between NLP and IR: NLP cares a lot about how a sentence is structured and formed; what I care about is what the topic is. And there might be another issue: what about negations? One sentence says “No, climate change is real.” Another says “Climate change is not real.” They have opposite meanings, correct? With bag of words, they end up essentially identical. Wouldn’t that be a problem? Who thinks it will be a problem for the search engine? OK. Who thinks it won’t be a problem? If you think it’s a problem, can you tell me why? Yeah, because, just as this example shows, sometimes different sentences can be the same in terms of just bag of words.
If we just compare the words, they would be the same, but they signify different meanings. Exactly. OK, that’s a good point. But why do you think it’s not a problem?
SPEAKER 1
Because, regardless of the meaning, the document is still relevant, even though the meaning may be exactly the opposite.
SPEAKER 0
That’s a good point, because we are talking about the same topic. If I’m asking whether climate change is real or not, these two documents give opposite meanings, but both of them are relevant. So yes, they have totally opposite meanings, but from a search engine point of view both are relevant. If I’m searching “is climate change real”, it would be great to get both of these documents. I have a question: what if I want to search specifically for documents saying that climate change is not real, if I’m looking just for that? In that case you can use quotes, and we will learn how to implement that; it’s part of your coursework one. If you’re searching for a specific sequence of terms, a certain phrase, you can do that. But in general, both of these documents are relevant to climate change being real or not, and this is the main point: we don’t really care about the exact stance, just the topic. What is this about? When you search something on the web, you’re searching for a topic. Set your mindset for this course around that; it is the main point of this search engine business. Sometimes this doesn’t work directly for all languages. For Chinese or German, for example, the notion of a word is different, so you don’t get this for free; however, it’s a problem solved by segmentation or decompounding, so there are workarounds, essentially smarter tokenization. We’ll talk next week about how to do tokenization for different languages.

So, as an IR black box: you have a query, you have a set of documents, and you get results, the hits, the relevant results. What actually happens is that you take each document and apply a transformation function that turns it into a bag of words, or some other representation; nowadays we have word embeddings and more advanced representations, which we’ll learn about later, but the main idea is a bag of words. You put these into an index, arranged so that storage and retrieval are very efficient. The query is also transformed into a bag of words or some representation, and you have a comparison function that can match them very quickly and return your results; we’ll learn how to do that efficiently. The part where you index the documents is the offline part, and the part where the user just goes and gets results directly is the online part, and the matching function is what we call a retrieval model; in week 4 or 5 we’ll look at different retrieval models and how they retrieve things. From the system perspective on IR, we have the indexing process, the part where you take all the documents, the web for web search, your emails for Gmail search, acquire the data using crawling or feeds or whatever, store the originals if needed (sometimes you don’t have to), then transform them into bags of words and index them.
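To make that bag-of-words transformation concrete, here is a minimal sketch in Python. It is not the lecture’s or the coursework’s code: the regex tokenizer and the crude overlap score are illustrative assumptions with no stopping, stemming, or weighting. It only shows that the two climate-change sentences from the discussion above end up with essentially the same bag, and that a topical query matches both of them.

from collections import Counter
import re

def bag_of_words(text):
    # Transform a document into a bag of words: lowercase, keep letter
    # sequences only, and count how many times each term appears.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def overlap_score(query, doc_bag):
    # Crude relevance signal: count query term occurrences that also
    # appear in the document's bag (no stopping, stemming, or weighting).
    query_bag = bag_of_words(query)
    return sum(min(count, doc_bag[term]) for term, count in query_bag.items())

docs = {
    "d1": "No, climate change is real.",
    "d2": "Climate change is not real.",
    "d3": "The jaguar is a big cat that lives in the forest.",
}
bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}

# Word order is discarded, so the two opposite sentences give almost the
# same bag of words ('no' vs 'not' is the only difference here).
print(bags["d1"])
print(bags["d2"])

# A topical query matches d1 and d2 equally well, and d3 barely at all.
query = "is climate change real"
for doc_id in sorted(bags, key=lambda d: overlap_score(query, bags[d]), reverse=True):
    print(doc_id, overlap_score(query, bags[doc_id]))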
Then, in the search part, the online part where the user searches and gets results directly, you assist the user in formulating their query through query suggestion, correction, and so on. You retrieve a set of results for the user and help them browse or reformulate their query if needed. And you can always log the users’ interactions to learn, over time, how to make the process better. As a block diagram: we have a set of documents, and we need to think about what data we want to use, so think about the source: emails, web pages, images, whatever documents an enterprise has, and so on. Then you do some crawling, collecting these documents somehow, and you give each document a unique ID, because this is your element. On the web it’s simple: the URL is unique. In other places, like your email, it can be some internal ID that Google stores. Then you think about disk space and compression; we’ll talk about this when we get to that part. Then comes the text transformation part: you think about what transformations to apply to get a better representation of the terms, the very basic stuff like stopping and stemming, which we’ll talk about more next time. And at the end you build a lookup table that tells you, for each term that can appear in your queries, where you can find it in the collection, so you can get to it very fast. In the search process, you have a user, and you help them somehow to formulate their query; it can be nothing at all, or you can give them query suggestions. Then you search the documents, fetch the results you have, present them to the user, and watch how the user behaves with them. Do they click on the first result, or ignore the first three and go to the fourth? Then you learn that those three were not the best results, and you can improve next time. You keep iterating until you get a better-performing system.

So this is, in a nutshell, a very abstract view of how things happen in search engines. From next week we will study each of these components separately. I think next week we’ll be in the indexing part: we’ll talk about the text transformation process, and only text transformation, so we’ll spend a couple of lectures on how that happens, and then we’ll take the other components in the following weeks. To summarise this easy, simple introduction lecture before we start the real science next week: information retrieval is a core technology and a selling point; it’s very fast, which is really important, and it still captures the context: the match doesn’t have to be the same words in the same sentence order, but the main context is there. The main objective we’re always after is to be effective and efficient, and how to achieve both. We set out three components, documents, queries, and relevance; we talked about the bag-of-words trick; and we talked about the search architecture, indexing and searching. We will go through each of these from next week.
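As a rough illustration of that lookup table, here is a minimal inverted-index sketch in Python. It is an assumption-laden toy, not the coursework pipeline: the regex tokenizer, the tiny stop-word list, and the boolean-AND search are simplifications, and a real system would also apply stemming and store positions. The point is only the shape of the process: build the index once offline, then answer queries by looking terms up instead of re-reading the documents.

from collections import defaultdict
import re

# Toy stop-word list, purely for illustration.
STOP_WORDS = {"the", "is", "a", "in", "of", "to"}

def preprocess(text):
    # Toy text transformation: lowercase, tokenize on letter sequences,
    # drop stop words. A real pipeline would also apply stemming.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_index(docs):
    # Offline part: map each term to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # Online part: look query terms up in the index instead of re-reading
    # the documents; here, return documents containing all terms (AND).
    postings = [index.get(term, set()) for term in preprocess(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": "Climate change is real",
    "d2": "The jaguar lives in the forest",
    "d3": "A new jaguar car model",
}

index = build_index(docs)                # built once, offline
print(search(index, "jaguar forest"))    # {'d2'}
print(search(index, "climate change"))   # {'d1'}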
For resources, of course there are today’s slides, but mainly I recommend that you read chapters 1 and 2 of Information Retrieval in Practice; it goes through the main basic material. The one by Manning is also really good, and it’s very easy to read, by the way; it’s not a complicated textbook, it’s a fast read. And there is no practical lab this week, so you are free. However, if you are not confident, try Lab Zero just to be sure; if you know you can easily read a file and print it back, you don’t need to bother with it. If you have any questions, let me know. Next time we’ll talk about some laws of text and the vector space model, and the skills to pick up for next time are reading text from disk and processing it word by word. For videos, if you want something to watch for next week, I’d recommend the Zipf’s law video from Vsauce, if you follow that channel on YouTube, and there are some tools, like regular expressions, that will be useful to familiarise yourself with. Any questions?
SPEAKER 1
Do we have to learn that?
SPEAKER 0
No, no, you don’t have to. I’m just saying if you’d like to, it’s up to you. OK. OK, so thank you and see you next week.