Aaron is a natural teacher and this course is the best introduction to Postgres I have come across. Lessons are easy to follow and he recommends some great tools for working with Postgres.Joel Drumgoole
Shorten dev cycles with branching and zero-downtime schema migrations.
We've got our search working pretty well, thanks to the websearch_to_tsquery. But I wanna start kind of tuning the ranking a little bit to make the results feel a little more correct to me. The right results are coming back, but they're not being ordered in a way that I, as the human, might expect. But we can change that. The first thing that I wanna show you is I wanna show you this default_text_search_config, mine is set to English, which stands opposed to a different language, but also stands opposed to simple. So if we did... Let's do select title, and then we're gonna do, to_tsvector, and you can pass in a parameter here if you want to override the language. And I'm gonna put in simple, which is one of the options, simple, title, and then we're gonna do the same thing with just the default, which, in our case, is English, but I am gonna make it explicit here. And so if we run that, we'll see that title doesn't exist 'cause we haven't given it a table title does exist, however you must add the table. Okay, so here we go. Looking at... Let's name this as simple, and we'll name this as English, makes that a little bit more clear. Now, you'll see something a little bit different here. "Loved by the light of the moon." We have "by," and "of," and "the" in the simple column. But we've dropped all those inconsequential words in the English column. So right away, one of the important things in tuning the results, is we want to use the correct language. So when you use simple, it leaves all those stop words, those filler words, it leaves all of those in there. Whereas when we use English, it does a better lexeme representation, and removes all of the words that really don't matter, and that is gonna help us both in matching but also in ranking. Let's move on to tuning the ranking function. Let's go ahead and we're just gonna get back down to simple here. So we have tsvector title, and we're gonna bring this... Let's bring this down here. And we're gonna say where tsvector matches. We'll just do to_tsquery, 'cause I'm just gonna do one word here, and I'm just gonna put the word "fight" in there. So if we select title, we see a bunch of "Fight for Freedom," "A Man's Fight," "Fight for Life." Okay, cool. That's good. Let's add our rank up here, ts_rank. We're gonna open this up and I'm gonna make some space. Yeah, let's take all of that. I don't know why it added a space there. Let's take all of that and replace this part with just that, and we'll call this as rank, and then order by rank descending. Now, so we do see the rank, the rank descending fight twice up here, but "Fight Club" is down here. I would think, in my opinion, "Fight Club" should be a little further up, because it's just two words, although these are inconsequential words. So if we scroll down, "Fighting Back" does seem like it should go up above "Battles Without Honor and Humanity: Deadly Fight...." Okay, that's just like, "Fighting Back" should be way further up there. Here's what I wanna try to do. What we can do here is in our ts_rank, we can pass in some optional arguments here, and I'll describe exactly what these are here. But this one, this number one, is going to favor shorter document lengths, which in most cases means it's closer to an exact match. And so if we run this again, "Fighting" comes to the very top, and then "Fight Club" is still pretty close to the top. But "Fighting" comes to the very top, and all of these short titles are at the top, and those longer titles do get penalized down here. So that parameter one that we passed in, that parameter one divides the rank by the log of the document length. And so it's a more of a subtle operator than if we were to pass in the number two. So you can pass in several different options here. One, two, four, eight, 16, and 32. Sound familiar. But one is a nice, smooth operator to take into account the document length. The default is zero, which doesn't consider document length whatsoever, and then as you move up, those options change. I will leave a link to the docs that show you what all of those options are. But if you wanted to give even more weight to shorter documents, you could change it to two. And that simply divides the rank by the document length. And so it's really going to favor the shorter ones up at the top. Now remember that it's dividing the rank, and so better matches should still float up, but taking into account the document length. So it's kind of trying to find that balance. So this is one of those things where you can test it on your data set and see kind of how that balances out for you. But you do have options for controlling this ts_rank. So far everything we've been doing is searching against title, which is fine, but I do wanna take advantage of this plot over here and allow our users to search against that. And we can absolutely do that. So, what I'm going to do here is I'm gonna make a little bit of space. I'm gonna copy that guy and we're gonna concat these vectors and we're gonna say, let's put a vector over title and concat with a space and then a vector over plot, and then compare that to the query "fight." Just 'cause I assume fight is gonna be in a lot of these plots here, and we can, of course, we could rank against something different than we're matching on. That probably makes sense in many use cases, but I don't want to do that here. I kinda wanna these to the same thing. And so I'm gonna come up here and I'm just gonna paste that in. And this is not actually a match operator. We need this as a second parameter. And I'm going to remove this third parameter, because I don't actually want the length of the plot affecting the ranking. So when we were just doing title, I think it makes sense to say the shorter titles should go to the top. They should maybe bump up just a little bit. But comparing the length of the plot really doesn't have any weight in my mind. And so I'm gonna remove this altogether, and by default, that is going to be zero, which means it does not take into account the length of the document at all, and this is the document that we're talking about up here. So if we have done this correctly, we should get some results back. And oh, we're already seeing our first thing. The search takes a lot longer because we're searching over a lot more data, and we see something that is quite interesting to me, which is, now that we are searching over this tsvector on the plot, "Goon: Last of the Enforcers" probably has the word fight in the plot a whole bunch, and it's bumped all the way to the top even though... Goodness, shouldn't "Fighting..." I feel like "Fighting" should be up there. So, what we can do now is, we can set different ranks, we can set different weights for these tsvectors. So we can say, "Hey, let's weight this a little bit higher, let's weight this a little bit lower, and what we're gonna do is we're gonna rank our title vector a little bit higher, give it more power, more precedent, and we're gonna rank our plot vector a little bit lower, and maybe that will balance us back out." So the way that you set these weights is kind of interesting. So let's take this vector and pop up here to the top, and we'll do that from movies, pop over here, select, and that is the vector. That is something that we're used to. But we can wrap this in a set weight tsvector, so we're passing the tsvector into this set weight function. And the way that this works is really interesting, it works the other way around. The way that this works is very interesting because you can pass it a score or rather a letter, a grade of A through D. And when you do this, this is just very clever to me, I don't know why I'm so moved by this, but it's very clever to me that it changes the actual output of this vector such that the ranking function can look for it later. And so we can still with our concatenation in here, so we still have this concatenation going on, and that's still totally fine because we can wrap this in a set weight and pop down here and say, let's set that to A, and let's just bring this up here just for giggles. Let's bring this up here and say, from movies, we'll just do a limit 10 'cause we don't need to see all of them. And you'll see somewhere... Oh, all of the plot has so many words, doesn't it? Okay, there's one, "beanstalk," 4A, 80. So, what we're seeing here is that beanstalk is in the title, and so it is being given a weight of A, 80, meaning it is in the 80th position, it's the 80th lexeme in the plot, but it has not been given a weight. In fact it has been given a weight, but the default weight is a score of D as in dog, and so we have 4A and then 80 at the default. And so this is very clever to me, the format that they've come up with here, such that you can wrap it in a set weight and then concatenate it to other stuff that has a different weight potentially. I don't know why, I love this so much. But let's keep going and let's see how to use it. Let's make some space here. So to keep all of this the same, I'm gonna take our vector and I'm gonna come down here and I'm just gonna put it right there, just 'cause I want to keep this the same. And we are gonna refactor this at some point. So now if we run this, it's gonna take a little bit longer, which we will get to, I promise. But now, boy, this looks really good. So we have fight and fight at the very top, and then "Fighting," "Fist Fight," "Fight Club," "Fight Club - Members Only." And then do we have any that don't have fight in the title? So maybe if we rank this down a little bit and we give this a B rank, and somebody has fight a lot in the plot but not in the title, maybe it will still float up. But I am seeing that title is taking over, which honestly I'm fine with, because when... Yeah, I'm just still looking. They're all saying fight in the title. That is totally fine, because you probably assume that it says fight in the plot too. But when two titles have the the same kind of rank, it is going to take into account that plot, the weight of that plot. Because if that word shows up in the plot a whole bunch for one movie and not for the other, well that's gonna bump that one up more towards the top. Now something cool that we can do is we can affect the rank outside of the ts_rank, because of course the rank is just a float, it's just a number. So we could come in here and we could type in whatever we want, and affect the ranking on our own. So in some cases, depending on what your application is like, you may want to give a little bit more weight to popular movies. So you might say, "This movie has a million comments, or a million votes, or a million ratings, let's move that one up a little bit, because that's probably a popular movie, and frankly that's probably what people are looking for." You can move movies up based on their average rating. So we can say, "Well this movie's quite good, let's give it a little boost in the search results." What we're gonna do is we're gonna give it a little boost if it's an action movie, 'cause we're searching for the word fight. So we have to do something in here. But first, as I like to do, let's break out and see if we can simplify the problem. So we're gonna say select genre from movies, where genre like, and then we're gonna look for action. Hey, we're back to the like there. We're gonna look for action. Okay, so we do have some action genres in there. Short action/crime western sounds interesting. But what I wanna do here is I wanna calculate our bonus. So, case when genre, let's just take, we can save ourselves some typing, case when genre like action, then what are we wanna do? Let's do a 0.1 bonus else, let's give it a zero bonus end as bonus, and now if we look at that, there are all of the bonus genres, which is kinda cool. So, I'm just gonna yoink that, and then we're gonna make the space again and come down here, and this is going to be my bonus right there. So if we do title genre, and we run that again, so now we're searching across the title and the plot, and we're ranking based on title and plot, giving title a little bit more weight, but also giving the action genre a little bit more weight as well. And so you can see the top several are action, but we still get an unknown, and we still get a thriller, and we still get a comedy because those rankings are still multifaceted. It includes a high rank on the title, a little bit on the plot, and then a little bit on the genre itself. At this point, we've got a pretty good search engine honestly. I'm actually kind of... I'm very happy with this. What we don't have is a very performant search engine. To be fair, it's not very performant. To also be fair, we haven't done any performance work yet. So what we're gonna do is we have learned how to use this ts_rank, how to use the set weight, how to muck about with the rank ourselves, but what we need to do next is we need to move this into a more performant pattern, and that's exactly what we're gonna do in the next video.