Window function 😊. CTE with window function 😮. Hierarchical recursive CTE 🤯. Recommended all the way. Can’t wait to see the rest of the videos!M Wildan Zulfikar
Shorten dev cycles with branching and zero-downtime schema migrations.
We haven't talked about performance of this full-text search engine that we're building and I think it's time. I think it's time we talk about performance and what we're gonna do here is instead of calculating these tsvectors on every single query, we're gonna create a generated column and store the vectors in there and then we can run our tsquery against a stored value and we can put an index over that stored value, so everything should speed up very nicely.
I have our vector here in my clipboard. This is the setweight of to_vector(title), concat, space, to_vector(plot). If we run this to make sure we're still good, from movies limit 10, I think we're still okay. We are still okay, so what we're gonna do here is we're gonna alter table movies add column, let's just call it, we can just call it search, and we'll make it a tsvector. We're gonna declare this as a generated column, which, just as a refresher, a generated column is a column that Postgres is in charge of. We're gonna tell it, we're gonna give it a formula, in this case, how to calculate this vector stuff. We're gonna give it a formula, and then Postgres is in charge of populating and updating that column. In fact, we cannot touch it. We'll get an error if we try to change it 'cause it says, actually, this is a generated column. It's like a very specific type of trigger but with less overhead and easier for the human.
The first thing we're gonna do is we're gonna modify our tsvector thing here just a little bit because if we end up having a null in any of these columns, everything comes back as null, which we definitely don't want. Now that we're going to be writing this into, now that we're gonna be making this a little bit more permanent, let's make it a little bit cleaner. We are going to take that out and then we're gonna coalesce the title or an empty string and we'll take this guy out and then coalesce the title or, I'm sorry, the plot here, plot or an empty string. That should give us, yeah, that gives us a little bit safer vector creation there.
That's not quite right yet, and I'll show you why, but let's do alter table movies add, we can call it, what do you wanna call it? Search_vector? Let's call it search_vectors. It is a tsvector, and then this is where we start with the generated column stuff, generated always as and we open these parentheses and we're gonna put our formula in there and then at the end, we say stored. Some other databases support virtual generated columns. Postgres doesn't, which, in this case, it's fine 'cause we do want to store it because that is going to make everything more performant. This is close, but if we run this, we will, I hope, get an error. There you go. Generation expression is not immutable. So remember that anytime we have a generated column, it must be deterministic.
We're looking at this and thinking why is this not deterministic? But the answer is because your configuration could change, which could change the default configuration here. We're just gonna go ahead and hardcode this to English and now it is gonna run, and it is gonna work. What we had to do there is we had to add in that argument that was configuration-driven, but if your configuration changes and you didn't have it hardcoded, then that means this is a mutable function or a mutable formula, and that's just illegal. You can't do that.
Now, what we can do now that we have that column is select * from movies and we can just use that column name in both the filter part and in the rank part. We can say where search_vectors and then we'll do overlaps with, let's do websearch_to_tsquery and we'll just search for "Star Wars" and that's gonna give us some stuff back that I don't super love. Order by, then we can use this computed column or this generated column again, ts_rank of search_vectors and our query, which is going to make everything, we'll do it descending, which is going to make everything a lot cleaner and a lot nicer. Not only do we get a performance improvement, we also get a quality-of-life improvement in that this is now a named column and we can use it both in the filter up here and in the order down here.
The last thing we're gonna do here is we're gonna add a GIN index on top of that computed or generated column. The one that Postgres maintains, we're also gonna put an index on it and that's really gonna make these searches fly. We will leave this here 'cause we're gonna use it in a second and we'll say create index and we'll call it idx_movies_search_gin on movies using gin(search_vectors). If we run that, that is gonna take a second, we'll clear this guy out that has been created and now look at that. We're down to five, seven, nine, five all the way down here. This is what I'm looking at. We're under 10 milliseconds on all of this stuff. If we run an explain, that is because we are using that index. We're using that index scan instead of just having to go row by row and manually compare the vectors to the search query.
We've got it all. We have declared what we wanna search against. We ranked it a little bit using that setweight function and then we calculated those vectors and had Postgres write it into a generated column so that we don't have to update it. Anytime somebody updates the title, that search column will get updated as well and then we put an index on top of it and now we can have this nice reference in our query to a column named search vectors instead of this gnarly big function that we have to make sure we get right both in the filtering and the ranking. In my opinion, we've got a pretty solid search engine going here that can take you a super long way before you have to reach for something more powerful and something more complicated.