When it comes time to create an index, we need to think, is this column a good candidate for indexing?
We talked about how you need to look at your queries, and that's where you will derive your candidates for indexing, but now we can actually look at the data and, in some ways, determine whether a column is a good or bad candidate for indexing. Consider this: imagine that everyone in our users table had a first name of Aaron. A little bit freaky, but everyone is named Aaron, and then we run a query: select * from users where first_name = 'Aaron'. Who does that index help? It didn't help us narrow anything down at all, because literally everyone in the table is named Aaron. So when it comes to determining whether a column is a good candidate for indexing, we have to talk about two separate terms, and those are cardinality and selectivity.
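Written out, that query would look something like this (assuming the column is actually named first_name, which is spoken rather than shown here):

  -- If every row has the same first_name, an index on first_name can't narrow this down.
  SELECT * FROM users WHERE first_name = 'Aaron';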
Cardinality is simply the number of discrete or distinct values in a column, whereas selectivity is a ratio. Take cardinality first: let's say we have a Boolean column, and it can only hold two values, true or false, so the cardinality is two. There are two distinct values in that column. Is that good or bad? I don't know. If there are two rows, that's pretty good; you can get directly to the row you're looking for, because there are two values and two rows. However, if there are two values, true and false, and a million rows, that index isn't going to help you narrow things down very much, because you could say, show me all the users where is_pro or is_admin equals true, and it may come back with 98% of the table. That's not very helpful, and so instead of just looking at cardinality, we also have to look at selectivity, which is: how many distinct values are there as a fraction of the total number of rows in the table?
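As a rough sketch, the two measurements look like this (some_table and some_column are placeholders, not real names from this dataset):

  -- Cardinality: the number of distinct values in the column.
  SELECT count(DISTINCT some_column) FROM some_table;

  -- Selectivity: those distinct values as a fraction of the total row count.
  SELECT count(DISTINCT some_column)::decimal / count(*) FROM some_table;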
We're gonna derive some of this for ourselves, as we like to do, but those are two terms, cardinality and selectivity, that you should stick in your brain somewhere, because now you'll see them pop up in documentation, or maybe in ChatGPT or Stack Overflow or whatever, and that can be very, very helpful when you're trying to debug why an index isn't being used. Now, let's derive some of this on our own.
We're gonna start by calculating the cardinality of the birthday column, so we're gonna do select count(distinct birthday) from users. This gives us almost 11,000 distinct birthdays across almost a million people, which is the birthday paradox in action: birthday collisions get very, very common the more people you have, so we end up with just under 11,000 in here. Now, the question is, is that very selective? Well, we can calculate that too: let's cast the count to a decimal and divide it by the total number of rows in the table, and then cast the result down further, to decimal(7,4). There you go: this column is 0.0111 selective. Is that good or bad? Do we want to be closer to zero, or do we want to be closer to one? I could tell you, or we could figure it out together.
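Written out, the queries from this step look roughly like this:

  -- Cardinality of birthday: how many distinct values does it hold?
  SELECT count(DISTINCT birthday) FROM users;

  -- Selectivity: cast to decimal to avoid integer division, then trim to decimal(7,4).
  SELECT (count(DISTINCT birthday)::decimal / count(*))::decimal(7,4) AS selectivity
  FROM users;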
What is the absolute most selective, most perfect column? The primary key. If we test it by doing count(distinct id) divided by the total number of rows (we could drop the distinct, since a primary key is distinct by definition, but it doesn't hurt), there you go: perfect selectivity, exactly one. There is one id per row, and that is always going to be ideal. This tells us that the closer we get to one, the better the selectivity of the index, which means the more rows it filters out and the faster you can get to what you're looking for using just the index. Now let's go back to birthday, because I want to show you something: if id is perfectly selective and birthday is somewhere in the middle, what is the worst thing we could choose? If we do select * from users limit 10, we can see there's an is_pro column, which is a Boolean, and if we swap is_pro into the selectivity query, you'll see a much, much, much worse selectivity. In fact, we have to widen the cast to decimal(17,14) before we finally start to see some non-zero digits.
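Roughly, those two comparisons look like this:

  -- The primary key: one distinct id per row, so the ratio is exactly 1.
  SELECT (count(DISTINCT id)::decimal / count(*))::decimal(7,4) FROM users;

  -- is_pro: only two distinct values, so the ratio is tiny; widen the cast to see any digits.
  SELECT (count(DISTINCT is_pro)::decimal / count(*))::decimal(17,14) FROM users;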
You can see how poorly selective the is_pro column is, and that's because, as we talked about earlier, there are only two discrete or distinct values that can be in that column. So is indexing is_pro a good idea or a bad idea? We still don't know, because we still don't know what your query pattern is.
There is a use case in which indexing is_pro is a very good idea, and there's a use case in which indexing is_pro is a terrible idea, so let's see if we can figure out which is which. We know already that the column as a whole is not very selective, but there might be specific queries for which an index on is_pro can be super helpful, so let's do this: select count(*) from users where is_pro = true. If we run that, we see that we have about 44,000 pro members out of a million. Hey, we're getting somewhere. Now let's drop that count into the selectivity calculation in place of the distinct count, and if we run it, you'll see, hey, that's even better than birthday. What we've learned is that while a column in its entirety might not be very selective, if the data is quite skewed and the skewed value is the one you're looking for, an index can be really helpful. In our case, we have about 950,000 users that are not pro and 44,000 that are, so if a common query we're running is "show me all the users where a bunch of other stuff and is_pro = true," putting an index on is_pro could be really, really helpful.
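Spelled out, that looks something like this (the exact numbers will vary with the dataset):

  -- How many rows actually match is_pro = true?
  SELECT count(*) FROM users WHERE is_pro = true;  -- roughly 44,000 out of a million

  -- That count as a fraction of the whole table.
  SELECT (
    (SELECT count(*) FROM users WHERE is_pro = true)::decimal / count(*)
  )::decimal(7,4)
  FROM users;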
Unfortunately, you can't just look at the selectivity of the column on its own, because it might hide a really good index from you. You might look at it and say, well, the selectivity is practically zero, and Aaron told me that selectivity is important. It is important! If your data is fairly evenly distributed, it's a pretty good indicator. If your data is highly skewed, you're gonna have to dig a little deeper.
The last thing I want to show you is something we looked at in a previous video: select * from users where birthday < '1989-02-14'. If we run that with explain on the front, we are using the index, but when we flip the comparison around to greater than, we're no longer using the index. Now, calculating cardinality and selectivity for a range condition is a little more complicated, but if we just look at the raw number of rows returned, you'll see that the greater-than case gives us back 572,000 rows, and the less-than case gives us back 417,000 rows.
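For reference, the two statements look something like this, assuming the index on birthday from the earlier video is in place:

  -- The planner chooses the index here (~417,000 matching rows in this dataset).
  EXPLAIN SELECT * FROM users WHERE birthday < '1989-02-14';

  -- Flip the comparison and the planner skips the index (~572,000 matching rows).
  EXPLAIN SELECT * FROM users WHERE birthday > '1989-02-14';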
What we're learning here is that when Postgres decides an index doesn't eliminate enough rows, it's just gonna go straight to the table. When we say birthday is greater than '1989-02-14', it says, goodness gracious, you're asking this index to give back more than half of the table, I'm just gonna skip it. And in the less-than case, it says, all right, 417,000 rows, less than half of the table, I'll use the index. Now, "half of the table" is not a hard and fast rule, but you can see by comparing the two plans that the index didn't assist us in the greater-than case. It's the same kind of idea as cardinality and selectivity, just a little different for a range. What you've seen is that the data in the column matters along with the query that you're issuing, and the rule of thumb is that you want your index to help you get down to just a few rows as quickly as possible.
That's why an id is perfectly selective: it gets you down to one row. In the less-than case, the index narrowed things down to few enough rows that it was useful, and in the greater-than case, it was too many rows for the index to be useful, so Postgres skipped it entirely.
Now, Postgres isn't running these queries in real time to determine whether the index is a good candidate for use or not. It keeps statistics under the hood and refers to those, and those statistics are updated by running ANALYZE on the table or by the autovacuum daemon. So these stats do get updated, but if you do a massive update, insert, or delete, you might need to refresh those stats manually, which we'll look at in a future video.
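As a quick sketch of what that looks like: ANALYZE refreshes the table's statistics, and the pg_stats view is one way to peek at what the planner currently believes about each column.

  -- Refresh planner statistics for the table after a big insert, update, or delete.
  ANALYZE users;

  -- Inspect the stored statistics; n_distinct is the planner's estimate of distinct values
  -- (a negative value means it's expressed as a fraction of the row count).
  SELECT attname, n_distinct, null_frac
  FROM pg_stats
  WHERE tablename = 'users';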