Mastering Postgres has dramatically improved my understanding of indexes. It has helped me make informed decisions around my queries and schema rather than blindly adding indexes and hoping for the best.Dwight Watson
Shorten dev cycles with branching and zero-downtime schema migrations.
We've done a lot of great theoretical learning about indexes, which will serve us well going forward, but we gotta do some practical learning. We gotta get our hands dirty here and actually experiment with some stuff.
Remember back in the schema section, I said you can look at your data, and you can investigate it very carefully, and with your powerful brain, you can figure out what a good schema is. You can figure out what is most representative of reality, what covers the entire range of possible values, and what does Postgres have to offer in terms of data types. You combine all of those things and you have a great schema, not so with indexes, indexing is more art than science.
Building a great schema, pretty scientific, building great indexes, a little bit more of an art, because you cannot look at your data or your schema and derive good indexes out of them. You must look at your access patterns, which is to say, how are you going to query the tables? How are you going to query the data? That's what drives your indexes.
Now you may be thinking, "Indexes are good, "I'm gonna add an index on every column." It's not a great idea, it's not a great idea. It's a nice try, but it's not a great idea, because remember, an index is a separate data structure that maintains a copy of part of your data. If you add an index on every column, you functionally duplicated your data. Not only are your inserts, updates, and deletes going to be slower because all of those, all of those separate data structures need to be maintained, your read performance, your selects are not gonna be as fast as you want, because in all likelihood, it is better to have one composite index over multiple columns instead of having multiple single indexes over single columns.
Unfortunately, you have not found a loophole by adding an index to every column. Now, a rule of thumb you might have heard, which is closer to correct, but not fully correct. A rule of thumb you might have heard is anything that shows up after the where, after the word 'where,' you should have an index on that. That's closer, that is closer. But now that we are professionals at this, we need to have a little more nuance and a little more depth. My charge to you, to us, is going to be consider the entire query. Yes, the where clause, very important, also order and group and join and select, all of those things are very important when it comes to designing an effective indexing strategy.
We're gonna look in this video and the next several videos about how to design an effective indexing strategy. We're gonna start small for the next few. We're gonna be in btrees alone. We're gonna start small and build up, and this is going to be your best way to unlock performance in your database. All right, let's take a look at a few examples.
We're gonna be using this user's table, and I've gone ahead and fleshed it out just a little bit, first name, last name, email, birthday is pro created at and updated at. We have several more columns to work with here, and in fact, we have several more rows as well, almost a million, 989,908, close to a million. I'll make this data available for downloads so you can play along at home, or you can spin it up on Xata if you want Postgres in the cloud to play with it there. I highly recommend that.
Let's keep going. The thing I wanna do here is I wanna see if we can come up with or intuit or prove the situations where an index might be useful. Let's do select * from users where birthday equals, and then let's do 1989-02-14. I will let you guess why that's such a good birthday. If we do that, we have 103 rows here. If we throw an explain in front of it, as we go on, we will cover explain more and more because it can be kind of gnarly and there are lots of options that you can use. Let's just start right here.
You can see we're doing a sequence scan on the user's table. What this means is it's just scanning the user's table, fortunately because postgres is awesome, it's doing a parallel scan and it's saying, "I'm gonna dedicate two workers to this. "Y'all each get your own pages, go off and scan them, "and then we'll gather the results up at the end." That's exactly what happens here. This is just scanning the whole table. This is ideally not what you want to see. You don't wanna see a table scan. What I wanna do here is I wanna do create index bday on users using btree on birthday. We're gonna create an index using btree, which we've talked about on the user's table, and we're just gonna name it bday. If we do that, and there are options for creating indexes, which we will cover later, but that's just a basic btree index. If we do that, you'll see bitmap index scan on bday, and this is the index condition. Then there's some other stuff that we're gonna talk about later. You see that we're now using the index, that is the important part.
Let me show you again what it looks like to not use the index like * from users where let's just do first name equals Aaron, and there you see parallel sequence scan on users versus index scan on bday, which is the name of that index column or the name of that index that we created. We've already proven that an index will help us with a strict equality, which frankly I bet we already knew, so let's keep going. Let's get rid of this guy.
If we were to change this from equals 1989-02-14 to less than you'll see, we're still using the index to figure out who was born before February 14th, 1989. However, if we switch this to after, that's interesting. We have proven that it helps on a strict equality. It helped on the first unbounded range but not the second unbounded range. That is because of index selectivity, which we'll come back to soon. We're gonna keep going. We've proven strict equality, unbounded range.
Let's look at a bounded range. We'll say we wanna find anyone born in the year 1989. If we look at that, you'll see, there we go, bounded range works as well. You can see the index condition that's being checked down here is the entire year of 1989. Now we've proven strict equality, bounded range, unbounded range, with a little caveat on the unbounded range that was due to a selectivity issue which we will cover.
I want to prove two more things, and that's grouping and ordering. Let's leave this explain on, we're gonna change that to order by birthday, and we're gonna drop off those guys. You'll see we're still doing an index scan using bday, which is the name of that index. If you wanted to see what a bad, or rather, let's say an index unassisted sort looks like, it's gonna be quite a bit different, there you go. It's doing the sequence scan on the table with a sort key, and then it's sorting and then it's doing that parallel and then it's having to gather it all back up. Whereas when you order using an index, the values are already in order. It can just read the index from front to back, or in some cases, back to front, and then go pick up the rows in that order out of the heap.
When you don't have an index, it's gotta read the whole heap and then sort it. Wherever you have index or rather wherever you have ordering that might be used pretty often, you definitely want to try to get that to be index assisted. We're gonna look at composite strategies here shortly that can help you with that.
The final one I want to show you is grouping. If we did count * from, let's do count * and birthday from users group by birthday, and let's just see what that looks like. You see here are all of our users grouped by when their birthday is. If we explain on that, you see down here at the very bottom, we have an index scan using bday, group on bday, does it in parallel, gathers it up. It does a bunch of stuff. The important part is we don't see any sequence scan, we see this parallel index scan, which is awesome.
Thank you, Postgres, for doing it in parallel. We see that the index has assisted us on the grouping as well. This was just a crash course on btree indexes. I always like to try to prove things for myself because it's nice to read the documentation, but I like to get my hands dirty and prove what I think is going on is actually going on.
We've discovered strict equality, unbounded ranges, bounded ranges, ordering, and grouping. We haven't looked at joins yet. We have that one outstanding selectivity thing that we need to talk about. We're gonna continue to expand on all of this in the next videos.