Mastering Postgres
Introduction
Introduction to the course
Overview of course structure
Postgres vs. everyone
The psql CLI
Data Types
Introduction to schema
Integers
Numeric
Floating point
Storing money
NaNs and infinity
Casting types
Characters types
Check constraints
Domain types
Charsets and collations
Binary data
UUIDs
Boolean
Enums
Timestamps
Timezones
Dates and times
Advanced Data Types
Intervals
Serial type
Sequences
Identity
Network and mac addresses
JSON
Arrays
Generated columns
Text search types
Bit string
Ranges
Composite types
Nulls
Unique constraints
Exclusion constraints
Foreign key constraints
Indexing
Introduction to indexes
Heaps and CTIDs
B-Tree overview
Primary keys vs. secondary indexes
Primary key types
Where to add indexes
Index selectivity
Composite indexes
Composite range
Combining multiple indexes
Covering indexes
Partial indexes
Index ordering
Ordering nulls in indexes
Advanced Indexing
Functional indexes
Duplicate indexes
Hash indexes
Naming indexes
Understanding Query Plans
Introduction to explain
Explain structure
Scan nodes
Costs and rows
Explain analyze
Generating Results
Introduction to queries
Cross joins
Inner joins
Outer joins
Subqueries
Lateral joins
ROWS FROM
Filling gaps in sequences
Subquery elimination
Combining queries
Set generating functions
Indexing joins
Advanced SQL
Introduction to advanced SQL
Grouping
Grouping sets, rollups, cubes
Window functions
CTEs
CTEs with window functions
Recursive CTE
Hierarchical recursive CTE
Handling nulls
Row value syntax
Views
Materialized views
Removing duplicate rows
Upsert
Returning keyword
COALESCE + generated column
Full Text Search
Introduction to full text search
Searching with LIKE
Vectors, queries, and ranks
Websearch
Ranking
Indexing full text search
Highlighting
Advanced JSON
Intro to JSON
JSON vs JSONB
Validating JSON
Creating JSON objects + arrays
JSON extraction
JSON containment
JSON existence
JSON recordset
Updating JSON
Indexing JSON parts
GIN index
Vectors (pgvector)
Intro to pgvector
Vector embedding columns
Find related articles
Upsert vector embedding
Semantic search
Other operators
Vector indexes
Outro
Thank you
Bonus interviews
Heroku's glory days & Postgres vs the world (with Craig Kerstiens)
Creating a Postgres platform with Monica & Tudor from Xata.io
Bootstrapping an email service provider (with Jesse Hanley)
Locked video

Please purchase the course to watch this video.

Video thumbnail
Full Text Search
Indexing full text search

Full Course

$
349
$399
USD, one-time fee
Really enjoying Aaron Francis' "Mastering Postgres". It's information-dense, very approachable, and well-produced.
Daniel Bachhuber
Daniel Bachhuber

PostgreSQL database platform

Shorten dev cycles with branching and zero-downtime schema migrations.

Test Data

I've made my test data available for you to use and follow along.

Download
or
Use on Xata

Summary

Learn how to improve the performance of a full-text search engine by creating a generated column to store `tsvectors` in Postgres. By adding a GIN index to the generated column, searches become much faster without the need for manual updates. This setup not only streamlines the process but also keeps the search engine clean, efficient, and effective.

Video Transcript

We haven't talked about performance of this full-text search engine that we're building and I think it's time. I think it's time we talk about performance and what we're gonna do here is instead of calculating these tsvectors on every single query, we're gonna create a generated column and store the vectors in there and then we can run our tsquery against a stored value and we can put an index over that stored value, so everything should speed up very nicely.

I have our vector here in my clipboard. This is the setweight of to_vector(title), concat, space, to_vector(plot). If we run this to make sure we're still good, from movies limit 10, I think we're still okay. We are still okay, so what we're gonna do here is we're gonna alter table movies add column, let's just call it, we can just call it search, and we'll make it a tsvector. We're gonna declare this as a generated column, which, just as a refresher, a generated column is a column that Postgres is in charge of. We're gonna tell it, we're gonna give it a formula, in this case, how to calculate this vector stuff. We're gonna give it a formula, and then Postgres is in charge of populating and updating that column. In fact, we cannot touch it. We'll get an error if we try to change it 'cause it says, actually, this is a generated column. It's like a very specific type of trigger but with less overhead and easier for the human.

The first thing we're gonna do is we're gonna modify our tsvector thing here just a little bit because if we end up having a null in any of these columns, everything comes back as null, which we definitely don't want. Now that we're going to be writing this into, now that we're gonna be making this a little bit more permanent, let's make it a little bit cleaner. We are going to take that out and then we're gonna coalesce the title or an empty string and we'll take this guy out and then coalesce the title or, I'm sorry, the plot here, plot or an empty string. That should give us, yeah, that gives us a little bit safer vector creation there.

That's not quite right yet, and I'll show you why, but let's do alter table movies add, we can call it, what do you wanna call it? Search_vector? Let's call it search_vectors. It is a tsvector, and then this is where we start with the generated column stuff, generated always as and we open these parentheses and we're gonna put our formula in there and then at the end, we say stored. Some other databases support virtual generated columns. Postgres doesn't, which, in this case, it's fine 'cause we do want to store it because that is going to make everything more performant. This is close, but if we run this, we will, I hope, get an error. There you go. Generation expression is not immutable. So remember that anytime we have a generated column, it must be deterministic.

We're looking at this and thinking why is this not deterministic? But the answer is because your configuration could change, which could change the default configuration here. We're just gonna go ahead and hardcode this to English and now it is gonna run, and it is gonna work. What we had to do there is we had to add in that argument that was configuration-driven, but if your configuration changes and you didn't have it hardcoded, then that means this is a mutable function or a mutable formula, and that's just illegal. You can't do that.

Now, what we can do now that we have that column is select * from movies and we can just use that column name in both the filter part and in the rank part. We can say where search_vectors and then we'll do overlaps with, let's do websearch_to_tsquery and we'll just search for "Star Wars" and that's gonna give us some stuff back that I don't super love. Order by, then we can use this computed column or this generated column again, ts_rank of search_vectors and our query, which is going to make everything, we'll do it descending, which is going to make everything a lot cleaner and a lot nicer. Not only do we get a performance improvement, we also get a quality-of-life improvement in that this is now a named column and we can use it both in the filter up here and in the order down here.

The last thing we're gonna do here is we're gonna add a GIN index on top of that computed or generated column. The one that Postgres maintains, we're also gonna put an index on it and that's really gonna make these searches fly. We will leave this here 'cause we're gonna use it in a second and we'll say create index and we'll call it idx_movies_search_gin on movies using gin(search_vectors). If we run that, that is gonna take a second, we'll clear this guy out that has been created and now look at that. We're down to five, seven, nine, five all the way down here. This is what I'm looking at. We're under 10 milliseconds on all of this stuff. If we run an explain, that is because we are using that index. We're using that index scan instead of just having to go row by row and manually compare the vectors to the search query.

We've got it all. We have declared what we wanna search against. We ranked it a little bit using that setweight function and then we calculated those vectors and had Postgres write it into a generated column so that we don't have to update it. Anytime somebody updates the title, that search column will get updated as well and then we put an index on top of it and now we can have this nice reference in our query to a column named search vectors instead of this gnarly big function that we have to make sure we get right both in the filtering and the ranking. In my opinion, we've got a pretty solid search engine going here that can take you a super long way before you have to reach for something more powerful and something more complicated.