Mastering Postgres
Introduction
Introduction to the course
Overview of course structure
Postgres vs. everyone
The psql CLI
Data Types
Introduction to schema
Integers
Numeric
Floating point
Storing money
NaNs and infinity
Casting types
Character types
Check constraints
Domain types
Charsets and collations
Binary data
UUIDs
Boolean
Enums
Timestamps
Timezones
Dates and times
Advanced Data Types
Intervals
Serial type
Sequences
Identity
Network and MAC addresses
JSON
Arrays
Generated columns
Text search types
Bit string
Ranges
Composite types
Nulls
Unique constraints
Exclusion constraints
Foreign key constraints
Indexing
Introduction to indexes
Heaps and CTIDs
B-Tree overview
Primary keys vs. secondary indexes
Primary key types
Where to add indexes
Index selectivity
Composite indexes
Composite range
Combining multiple indexes
Covering indexes
Partial indexes
Index ordering
Ordering nulls in indexes
Advanced Indexing
Functional indexes
Duplicate indexes
Hash indexes
Naming indexes
Understanding Query Plans
Introduction to explain
Explain structure
Scan nodes
Costs and rows
Explain analyze
Generating Results
Introduction to queries
Cross joins
Inner joins
Outer joins
Subqueries
Lateral joins
ROWS FROM
Filling gaps in sequences
Subquery elimination
Combining queries
Set generating functions
Indexing joins
Advanced SQL
Introduction to advanced SQL
Grouping
Grouping sets, rollups, cubes
Window functions
CTEs
CTEs with window functions
Recursive CTE
Hierarchical recursive CTE
Handling nulls
Row value syntax
Views
Materialized views
Removing duplicate rows
Upsert
Returning keyword
COALESCE + generated column
Full Text Search
Introduction to full text search
Searching with LIKE
Vectors, queries, and ranks
Websearch
Ranking
Indexing full text search
Highlighting
Advanced JSON
Intro to JSON
JSON vs JSONB
Validating JSON
Creating JSON objects + arrays
JSON extraction
JSON containment
JSON existence
JSON recordset
Updating JSON
Indexing JSON parts
GIN index
Vectors (pgvector)
Intro to pgvector
Vector embedding columns
Find related articles
Upsert vector embedding
Semantic search
Other operators
Vector indexes
Outro
Thank you
Bonus interviews
Heroku's glory days & Postgres vs the world (with Craig Kerstiens)
Creating a Postgres platform with Monica & Tudor from Xata.io
Bootstrapping an email service provider (with Jesse Hanley)
Indexing
Index selectivity

Mastering Postgres has dramatically improved my understanding of indexes. It has helped me make informed decisions around my queries and schema rather than blindly adding indexes and hoping for the best.
Dwight Watson


Test Data

I've made my test data available for you to use and follow along.


Summary

Choosing the right columns for indexing involves understanding both cardinality and selectivity—key factors in determining an index's effectiveness. High selectivity (values close to unique, like a primary key) helps narrow results quickly, but even low-selectivity columns can be useful in skewed datasets. Since Postgres relies on internal statistics to decide when to use an index, running the `ANALYZE` command after significant data changes ensures those decisions are based on up-to-date information.

Video Transcript

When it comes time to create an index, we need to think, is this column a good candidate for indexing?

We talked about how you need to look at your queries, and that's where you derive your candidates for indexing. But now we can actually look at the data and, in some ways, determine whether a column is a good or bad candidate for indexing. Consider this: imagine everyone in our users table had a first name of Aaron. A little bit freaky, but imagine everyone was named Aaron, and then we run select * from users where first_name = 'Aaron'. Who does that index help? It didn't help us narrow anything down at all, because literally everyone in the table is named Aaron. So when it comes to determining whether a column is a good candidate for indexing, we have to talk about two separate terms: cardinality and selectivity.

Cardinality is simply the number of discrete or distinct values in the column, whereas selectivity is a ratio. On cardinality, let's say we have a boolean column that can only hold two values, true or false, so the cardinality is two. There are two distinct values in that column. Is that good or bad? It depends. If there are two rows, that's pretty good: you can get directly to the row you're looking for, because there are two values and there are two rows. However, if there are two values, true and false, and there are a million rows, that index isn't going to narrow things down very much. You could say, show me all the users where is_pro or is_admin equals true, and it may come back with 98% of the table. That's not very helpful. So instead of just looking at cardinality, we also have to look at selectivity, which is the number of distinct values as a fraction of the total number of rows in the table.

We're gonna derive some of this for ourselves, as we like to do, but those are two words that you should stick in your brain somewhere for cardinality and selectivity, because now you'll see those pop up in documentation and maybe ChatGPT or Stack Overflow or whatever, and that can be very, very helpful when you're trying to debug why an index isn't being used. Now, let's derive some of this on our own.

We're gonna start by calculating the cardinality of the birthday column, so we're gonna count the result of select distinct birthday from users. This gives us almost 11,000 distinct birthdays across almost a million people; birthday collisions become very, very common the more people you have. Now, the question is, is that very selective? Well, we can calculate that too: cast the count to a decimal, divide by the total number of rows in the table, and then cast the result down to numeric(7,4). There you go: the selectivity is 0.0111. Is that good or bad? Do we want to be closer to zero, or do we want to be closer to one? I could tell you, or we could figure it out together.
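The two calculations described above can be sketched as SQL like the following, assuming the course's `users` table with a `birthday` column as shown in the transcript:

```sql
-- Cardinality: the number of distinct values in the column
SELECT count(DISTINCT birthday) FROM users;

-- Selectivity: distinct values as a fraction of total rows.
-- The ::decimal cast prevents integer division from truncating to 0,
-- and numeric(7,4) trims the result to four decimal places.
SELECT (count(DISTINCT birthday)::decimal / count(*))::numeric(7,4)
    AS selectivity
FROM users;
```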

What is the absolutely most selective, most perfect column? The primary key. If we tested this, we could write count distinct, but we know the id is distinct, so count id divided by the total number of rows works. There you go: perfect selectivity, absolutely perfect. There is one id per one row. That is always going to be ideal. This tells us that the closer we get to one, the more selective the index is, which means the more rows it filters out and the faster you can get to what you're looking for using just the index. Let's go back to birthday, because I want to show you: if id is perfectly selective and birthday is somewhere in the middle, what is the worst thing we could choose? If we did select * from users limit 10, you'd see we have an is_pro column here as a boolean, and if we swap that into our selectivity query, you would see a much, much, much worse selectivity. In fact, we have to cast it out to numeric(17,14) before we finally start to see some nonzero digits.
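Here are the two extremes as sketched queries, again assuming the `id` primary key and `is_pro` boolean column named in the transcript:

```sql
-- A primary key is perfectly selective: one distinct id per row,
-- so this always comes out to exactly 1.0000
SELECT (count(DISTINCT id)::decimal / count(*))::numeric(7,4)
    AS pk_selectivity
FROM users;

-- A boolean sits at the other extreme: 2 distinct values over
-- ~1,000,000 rows. We need far more precision just to see a
-- nonzero digit, hence numeric(17,14).
SELECT (count(DISTINCT is_pro)::decimal / count(*))::numeric(17,14)
    AS bool_selectivity
FROM users;
```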

You see how poorly selective the is_pro column is, and that's because, as we talked about earlier, there are only two discrete or distinct values that can be in that column. So is indexing is_pro a good idea or a bad idea? We still don't know, because we still don't know what your query pattern is.

There is a use case in which indexing is_pro is a very good idea, and a use case where indexing is_pro is a terrible idea, so let's see if we can figure out which is which. We know already that this column as a whole is not very selective, but there might be specific queries where an index on is_pro can be super helpful, so let's do this. Let's do select count(*) from users, but filter where is_pro is true. If we do that, we see that we have 44,000 pro members across a million rows. Hey, we're getting somewhere. Now plug that count into the selectivity calculation, and if we run it, you'll see, hey, that's even better than birthday. What we've learned is that while a column in its entirety might not be very selective, if the data is quite skewed and the rare value is the one you're looking for, an index can be really helpful. In our case, we have roughly 950,000 users that are not pro and 44,000 that are, so if a common query we're running is show me all the users where a bunch of stuff and is_pro equals true, putting an index on is_pro could be really, really helpful.
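One way to sketch this in SQL: first check how skewed the column is, then index only the rare value. The partial index shown here is an assumption on my part (partial indexes are covered elsewhere in the course), not the exact index the instructor builds:

```sql
-- How skewed is the data? Only ~44,000 of ~1,000,000 rows match,
-- so the predicate "is_pro = true" is itself quite selective.
SELECT count(*) FROM users WHERE is_pro = true;

-- A partial index covers only the rare value, keeping the index
-- small while serving "where ... and is_pro = true" queries.
CREATE INDEX users_is_pro_idx ON users (is_pro)
WHERE is_pro = true;
```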

Unfortunately, you can't just look at the overall selectivity of a column, because it might hide a really good index from you. You might look at it and say, well, the selectivity is nearly zero, and Aaron told me that selectivity is important. It is important! If your data is normally distributed, it's a pretty good indicator. If your data is highly skewed, you're gonna have to dig a little deeper.

The last thing I want to show you is something that we looked at in a previous video, which is select * from users where birthday < '1989-02-14'. If we did that and then threw an explain on the front, we are using the index, but when we flip the comparison around to greater than, we're no longer using the index. Now, calculating cardinality and selectivity for a range condition is a little bit more complicated, but if we just look at the straight number of rows returned, you'll see that the greater-than case gives back 572,000 rows, while the less-than case gives back 417,000 rows.

What we're learning here is that when Postgres decides an index doesn't help it eliminate enough rows, it's just gonna go straight to the table. When we say greater than '1989-02-14', it says, goodness gracious, you're giving me back more than half of the table from this index, I'm just gonna skip it. And in the other case, it says, all right, hey, 417,000, less than half of the table, I'll use the index. Now, half of the table is not a hard and fast rule, but you can see by comparing the two plans that the index didn't assist us in the greater-than case. It's the same kind of idea as cardinality and selectivity, just a little different for ranges. What you've seen here is that the data in the column does matter along with the query that you're issuing, and so the rule of thumb is that you want your index to help you get down to just a few rows as quickly as possible.
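The comparison above can be reproduced by running both plans side by side (the date literal is the one used in the transcript):

```sql
-- Matches ~417,000 of ~1,000,000 rows: selective enough that the
-- planner chooses an index scan on birthday
EXPLAIN SELECT * FROM users WHERE birthday < '1989-02-14';

-- Matches ~572,000 rows, more than half the table: the planner
-- skips the index and does a sequential scan instead
EXPLAIN SELECT * FROM users WHERE birthday > '1989-02-14';
```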

That's why an ID is perfectly selective, because it gets you down to one row, and so in this case, it was few enough rows that the index was useful, and in this case, it was too many rows for the index to be useful, and so it totally skipped it.

Now, Postgres isn't running these queries in real time to determine whether the index is a good candidate for usage or not. It keeps statistics under the hood and refers to those. Those statistics can be updated by running analyze on the table or by the autovacuum, so these stats do get updated automatically, but if you do a massive update, insert, or delete, you might need to update them manually, which we'll look at in a future video.
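A minimal sketch of both halves of that: refreshing the statistics by hand, and peeking at what the planner believes via the built-in `pg_stats` view (the column list in the filter is just the two columns from this lesson):

```sql
-- Refresh the planner's statistics after a bulk load or mass update
ANALYZE users;

-- Inspect Postgres's own estimate of distinct values per column.
-- A negative n_distinct means a fraction of the row count rather
-- than an absolute number.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'users'
  AND attname IN ('birthday', 'is_pro');
```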