Mastering Postgres has dramatically improved my understanding of indexes. It has helped me make informed decisions around my queries and schema rather than blindly adding indexes and hoping for the best.Dwight Watson
Shorten dev cycles with branching and zero-downtime schema migrations.
We had a fun little detour with like, and I like, which are useful but not terribly useful for full text search. Now we're gonna look at tsquery, tsvector, tsrank, and these are all the ones that are specifically built for full text search. We did a little bit of this back in the data types videos, but now we're gonna go into it even further.
Let's start by looking at tsvector again. To get a tsvector, we're gonna say to_ tsvector on the title column. Let's go ahead and pull back the title column as well so we can see what we're working with. From movies limit, we'll just limit down to 50 and we can take a look at this. We get this wonky looking format here that you can kind of read through. "Kansas Saloon Smashers," which I'm going to have to watch, breaks down into 'kansa,' 'saloon,' and 'smasher.' Thinking back to the data types videos, you'll recall maybe that these are lexemes, which are basically like word parts, word roots, pieces of the word. It stores the location of where those things show up.
You can see down here, "Hemlock Hoax, the Detective." We have 'detect,' 'hemlock,' and 'hoax.' It does look like it's alphabetized, but we don't have the word the, because the word the is not terribly useful in search. We kind of just poof. We leave that out. Now there's another type, we can do select to_tsquery. What do you wanna search for it? Everybody seems to love "Star Wars." We'll search for "Star Wars" and we get this also not terribly helpful as it stands alone, we get this result, which is just in single quotes, 'star.' What we're gonna do is we're gonna use these two things together and we will get something useful out of it. Let's open this up and we'll say from movies where, I'm just gonna paste this here, we're gonna pull this guy out of the select and bring it down to the where and how are we gonna compare these two? With this special operator, this special matching operator for the vector and query data types.
Now if we run that, hey, star, star, star, star, star. That's pretty good. We're already off to a very good start. Now the next thing we probably wanna do is say 'star wars.' If we say 'star wars,' shoot, we get an error. We get a syntax error in tsquery. That's because the tsquery format is very specific. It's like a domain specific language. It's a DSL unto itself. We do have to write it in such a way that it is parsable so that it can be turned into the right format to compare against that tsvector. There are some functions that can help us get there more easily, which we'll look at soon.
First I wanna show you some of the syntax of the underlying tsquery and then we'll move on to the helper functions after that. The reason that this failed is we didn't specify how these two words should relate to each other. Should they both be there? Does it need to be either one or the other? That's where these operators come in. We wanna say 'star & wars.' The word star and the word wars must be there. We can also say star or wars. We get stuff like "Gang War" and "Lucky Star." We can say that star must be immediately followed by a word like wars. If we run that, we get these 10 records back and they all have star followed immediately by wars, a lot of "Star Wars" movies.
Now we can do logical grouping like this. Here, it doesn't make a difference. It'll just be read straight through 'cause there's only one thing in there. But what you can do is you can say wars or trek. We're in a spot where we have star immediately followed by one of wars or trek. Running that, we see "Star Wars," Wars, "Star Trek," Trek, Trek, Trek, Trek, Wars, Wars, Trek, Trek, and so on and so forth. This right here is equivalent to saying the very next word i.e. one word must be wars or trek. This format with the dash in the middle is syntactic. Sugar for this one in the middle right here. If you were to say 2, that means skip the next word and then look for wars or trek and no such luck there.
However, if we do find one, let's look for generations. We can say generations. Star must be immediately followed by generations. No such luck. Must be two words. There you go. star blank generations. That is that little operator right there. We can also negate words. We can say, yes, I want "Star Trek," but I actually don't want "Star Trek Generations." You can do that with the negating operator. We're gonna say & trek & not generations. If we run that, we get all of the "Star Treks" back but not generations. If we wanted to remove khan as well, we could either say & not K-H-A-N and not khan and khan is gone. Or we might be able to group these and say, not generations or khan and generations and khan are now gone. Pardon the rhyme. You can see that constructing these is a little bit of a pain, very powerful.
You do end up having to potentially take a user query and then do this kind of stuff to it at the application layer before you hand it off to PostgreS. That is not always true. There are some helpers. I wanna show you ranking results real quick 'cause this is the last kind of base function and then we'll move on to making this a little bit more ergonomic in future videos. I've gone ahead and switched this back to & wars or trek, so we have a little bit more data to work with. I wanna add this ranking function. I can use ts_rank up here.
The first thing that we're gonna pass through is the vector. Whatever we're searching on down below, that's what we need to put into the ranking function up above. We can have this to_tsquery up here. That's the second argument. We'll just call this as rank. We can hop down here and say order by rank descending. If we look at that, we see... Okay, great, this is a little bit interesting. A lot of exactly the same, which we can tweak so we can tweak this ranking function and we will as we go on.
I just wanna show you this here first. In my estimation, the reason that these two get ranked so highly is this one includes the search term twice, Star and Wars and then Star and Wars. This one includes the search term once, but then it does include Wars again. My guess is that's why it's getting ranked almost double the one below it, which is Star and Trek, Star and Trek, Star and Wars, Star and Wars and so on and down we go. The highest rank is the most relevant based on the vector and the query that we have given it. We can tune this because once we expand out into searching the title and the plot, which we're gonna do soon, we kind of wanna rank the title a little bit higher.
Not only do we wanna rank the title a little bit higher, we kind of wanna maybe rank exact matches a little bit higher. We will do both of those things. But now we have a a relatively firm footing that we can start with. We have tsvectors, tsqueries, and tsranks. Now we're gonna start looking at fine tuning some of these things.