Having full text search right in your database can be a wonderful, wonderful thing, because your stack remains very simple. You don't have to pull in Elasticsearch, Meilisearch, Typesense, or anything like that if you can run it all inside of Postgres. It's maybe an 80 or 90% solution, so it covers most people's needs. When you get to the upper end, maybe something like faceted search or highlighting results, that kind of thing, we might need to reach outside of vanilla Postgres.
If you're using one of the hosted providers like Xata, they do offer a way to replicate your Postgres out to Elasticsearch, and you can check the docs on that. That's a pretty awesome solution as well. What we're gonna do here is look at the two types that can aid you in full text search, and then later on we'll have a full module on configuring some of this, searching against it, stop words, booleans, all of that sort of stuff. For now, let's look at these two data types.
The first data type that I wanna show you here is a tsvector. We're gonna say to_tsvector, and then we're gonna put our dear old friend the quick brown fox in there. If you look down here, we get this slightly wonky format, and I'll explain that to the best of my abilities in a second. But let's just check pg_typeof, and you see that the type is a tsvector. That is the proper type that comes out of to_tsvector. Now a tsvector is a sorted list of distinct lexemes. What does any of that mean? It is sorted, and that part we can see: B-D-F-J-L-Q. Distinct, that part is easy too. If we were to put fox in here again, you'll see fox still shows up only once, but its positions change to four and five. Now a lexeme is something that I have historically not known much about, but it's a basic unit of language, almost like an atomic linguistic unit. What we see here is that lazy has been turned into L-A-Z-I. If we were to add laziness, you'd see that lazy and laziness both get merged together into one single lexeme. You don't really need to know all that, I think it's a little bit interesting, but the point is this: to_tsvector takes a chunk of text called a document, that's the actual argument name.
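Roughly, the queries being typed here look like this; the lexeme output shown assumes the english configuration:

select to_tsvector('The quick brown fox jumps over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

select pg_typeof(to_tsvector('The quick brown fox jumps over the lazy dog'));
-- tsvector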
It takes a document and turns it into this vector format, which we can then use to search against using a different data type. That second data type is a tsquery. If we say to_tsquery, and let's just do lazy, and run that, you'll see we get the lexemed version of the word lazy. If we do a pg_typeof, you'll see that this is a proper type of tsquery. The way that we can search is we take our search string, turn it into a tsquery, and then apply that tsquery against a tsvector using the @@ operator. If we do that, you'll see we get back true. If we do brown, we get back true. If we do red, we get back false. If we do laziness like we looked at earlier, that matches as well, because both sides get lexemed, I don't know if that's a real word, but I like saying it, down to L-A-Z-I. Oh, I still have laziness in there from before. Whoops, there you go, still true. That all still applies because both sides get lexemed down to L-A-Z-I.
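The matching queries here look roughly like this:

select to_tsquery('lazy');                -- 'lazi'
select pg_typeof(to_tsquery('lazy'));     -- tsquery

-- tsvector @@ tsquery returns true when the query's lexemes are found in the vector
select to_tsvector('The quick brown fox jumps over the lazy dog') @@ to_tsquery('lazy');  -- true
select to_tsvector('The quick brown fox jumps over the lazy dog') @@ to_tsquery('brown'); -- true
select to_tsvector('The quick brown fox jumps over the lazy dog') @@ to_tsquery('red');   -- false
select to_tsvector('laziness') @@ to_tsquery('lazy');                                     -- true, both stem to 'lazi'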
You can see you do get a little bit of fuzziness straight out of the box: laziness will match lazy, and the other way around. It also doesn't matter what order you put these in. You could do it that way as well, and it will still come back as true. There are so many more operators, there are ands and ors and nots and weights and rankings and all kinds of stuff, but we will save that for later. Now that we've learned about tsvectors and generated columns, I wanna show you one interesting idea for keeping your vectors up to date with the root text, the root document.
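The flipped version typed here is roughly:

-- tsquery @@ tsvector also works; operand order doesn't matter
select to_tsquery('laziness') @@ to_tsvector('The quick brown fox jumps over the lazy dog');  -- true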
Here's a pretty basic table. We've got an id up there, content as text, and search_vector_en as a tsvector for English content. Interestingly, we can do select to_tsvector and pass in a different language here. We could say french, and oui, that's the only French word I know, and that will work just fine. By default, it is going to use the default_text_search_config, which in many cases will be set to english. But if you want to specify it, you can of course pass in english there, in which case you'd likely be saying yes instead of oui. Now this is gonna come back into play. Here we go: we've got content as text and search_vector_en as a tsvector. That's a good start. However, these two things can get out of sync.
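A sketch of the language examples mentioned here:

-- Pass a text search configuration explicitly as the first argument:
select to_tsvector('french', 'oui');
select to_tsvector('english', 'The quick brown fox jumps over the lazy dog');

-- With the one-argument form, Postgres uses this setting instead:
show default_text_search_config;   -- commonly pg_catalog.english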
They are discrete columns that could be updated independently. You could put a trigger on this table, but what did we just learn about that Aaron loves so much? You guessed it: generated columns. Let's say generated always as, open parenthesis, our expression, and then stored, because Postgres does not implement virtual generated columns; we have to do stored, which is fine. So: generated always as to_tsvector, and then we pass in the content. Now watch this. Wah, wah, doesn't work: generation expression is not immutable. Remember when I said that these have to be, well, I didn't say they have to be immutable, I said they have to be deterministic. The values cannot change from run to run. Now why would this change?
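A sketch of the attempt that fails; the table and column names come from this example, and the id column definition is assumed:

create table ts_example (
    id bigint generated always as identity primary key,
    content text,
    -- one-argument to_tsvector depends on default_text_search_config, so it isn't immutable
    search_vector_en tsvector generated always as (to_tsvector(content)) stored
);
-- ERROR:  generation expression is not immutable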
Potentially, you could change your default_text_search_config. Instead we can just throw english in here, and it becomes immutable. Now that is a deterministic generated column. If we do insert into ts_example, just the content, we only need to insert the content now, values of, and I'll just copy our quick brown fox: the quick brown fox jumps over the lazy dog. We'll run that, and then select * from ts_example.
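Pinning the configuration makes the expression immutable; roughly:

create table ts_example (
    id bigint generated always as identity primary key,
    content text,
    search_vector_en tsvector generated always as (to_tsvector('english', content)) stored
);

-- Only content needs to be supplied; the vector column is generated for us
insert into ts_example (content)
values ('The quick brown fox jumps over the lazy dog');

select * from ts_example;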
We have our vectorized version over here. We could say where search_vector_en @@ to_tsquery, and let's search for lazy again. There we go. If we insert this again without lazy, we'll just say over the cat instead, then when we search for lazy we only get the first one back, and if we search for cat, we only get the second one back. Like I said, we'll dive more into full text search as we get further along. I couldn't resist showing you another generated column, but now you understand the tsvector and tsquery data types, and that will come back into play when we get to the full text search module.
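The statements from this last demo look roughly like this:

insert into ts_example (content)
values ('The quick brown fox jumps over the cat');

select * from ts_example where search_vector_en @@ to_tsquery('lazy');  -- matches the first row only
select * from ts_example where search_vector_en @@ to_tsquery('cat');   -- matches the second row only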