Aaron is a natural teacher and this course is the best introduction to Postgres I have come across. Lessons are easy to follow and he recommends some great tools for working with Postgres.Joel Drumgoole
Shorten dev cycles with branching and zero-downtime schema migrations.
We've talked about kind of three different types of indexes or instances where an index would be used. We've talked about the single index over a single column. We've talked about a single index over multiple columns, the composite index, and then we took a look at two single indexes over single columns that were combined together using a bitmap. All three of those situations, the story is basically the same.
We're using an index to find some row addresses and then we're going over to the heap and we're grabbing the rows, potentially rechecking some stuff over there, but then grabbing those rows outta the heap and giving them back to whoever requested them. What if I told you there is a specific, very specific instance where we don't have to make that journey back home to the heap to grab the rows because everything can be satisfied from the index itself.
Very rare, very, very fast, but not something you can totally rely on all the time. What I'm talking about is a covering index and a covering index is where the entire needs of the query. Everything that the query needs, that includes the columns that are being selected, that are being filtered upon, grouped on, ordered. All of that is satisfied from the index alone. Instead of making the journey back home to the heap to grab the rows, it just says, "Oh yeah, I think I have it all right here, here you go." That is a covering index I'm gonna show you how and where you can find these and then we'll talk about one more caveat at the very end.
I've dropped all of the indexes and we're just gonna create an index on first using this shorthand syntax we've been using B-Tree but that is the default. We're just gonna use this shorthand on users first name. If we create that, I wanna show you something interesting. Select * from users where first name equals Aaron. This is nothing new, this is not interesting. I forgot the explain. This is not interesting, it does the index scan on the first name column. However, if I change this to first name and we check that again, we do see something interesting. We see index only scan. That is the first time that we have seen that.
If we switch it back to star, we see a bitmap index scan, rechecking bitmap heap scan. But if we change this to first name index only scan, this is the good stuff, index only scan means it is simply scanning the index and it found it has everything it needs to return to you. Now there's not a lot of data over there in that index. If we were to say first name/last name, we gotta go back to the heap to grab the rest of the row, or in this case, simply a single other column, last name. If we were to add id, it doesn't matter, there's not much in that index except for last name. If we drop index first, let's drop that index and then we were to add first name/last name should have renamed that, but that's okay. If we were to then select first name/last name, we're still using an index only scan. That is because we have now indexed not only the first name, but the last name and everything required by that query is satisfied out of the index.
In this case, the only thing that the query needs is first name and last name 'cause those are the only columns that show up here. And we could do order by last name since that has been unlocked. And yeah, we'll just do order by last name. You see we're still using an index only scan. Now this is of course a very rare situation. A covering index is not a special kind of index. It is a regular index in this case, a composite index, a regular index in a special situation. There might be use cases where a first name/last name composite index is great, but it's not a covering index. There might be use cases where a first name, last name composite index is a covering index. That depends on the query. What query are you writing and can all of the needs be satisfied out of the index?
There's one other trick I want to show you, and that's where you can add a piece of data alongside the index, but don't actually use it for indexing. It's kinda like a little sidecar. It's like, "Hey, index first and last name." But why don't you bring along, let's bring along Id just in case somebody actually needs that. It's a little bit interesting and I think it's pretty cool so I'm gonna show you how to do it. All right, I'm gonna go ahead and drop that. We're gonna rename this multi, since it is multi now. What we could do is we could add this on to the end. This is what you have to do in most other databases. You just have to say, Hey, I want you to, I need that id because there's this really specific query that I wanna use that this index as a covering index, but I don't actually need id to be indexed but please bring it along. You don't have to do that, you don't have to do that in Postgres.
What you can do is you can say, include id. And what this does is this does not include it in that B-Tree structure. It does not include it as a part of your indexing strategy. It doesn't include it as a part of the B-Tree traversal. What it does, is it shoves it down in that leaf node and says, "Hey, once you get to the bottom, using first name/last name to get wherever you need to go, there's a little bit of extra data down here." The extra bit of data is the column id. We can go ahead and create this. Once we do, if we were to format this just a little bit, yeah, we'll format this just a little bit. Then we can look at this. This is still an index only scan. Historically if we added even a single other column, we'd be hosed. Now we can add id, and it's still an index only scan because we said, bring this column along, I'm going to need it, but not in the way that a typical index would need it. I need it for my own purposes, don't worry about it.
What's the drawback, why not just do this all of the time? I mean, probably you need most of the row most of the time, right? This is a good reinforcement to the old adage, select only what you need. Don't just blindly hammer select * if you don't need all of the columns. Because any instance where you could have gotten a covering index by just selecting first name, last name, and id, if you hammer it with a select *, you always got a phone home and get that row out of the table. Why not just include a bunch of columns that we don't actually need in the index, but we just kinda want, for a covering index, you're gonna bloat up your B-Tree, you're gonna bloat up your B-Tree quite a bit if you just shove a bunch of extra data in there, you're basically recreating the table. Don't shove a bunch of... Especially large columns, whether it's text or JSON, whatever types of big blobs you have, don't bloat up your B-Tree with those just because you think you might get a covering index, because there's one more drawback.
That is, with Postgres, the way that Postgres does concurrency is if this table is changing relatively often, it's still going to have to go back and potentially fetch it out of the table anyway. There is a step in here in a covering index situation, there is a step in which Postgres will check, "Hey, I got all of these rows out of the index. However, I don't know if those rows are visible to the current transaction. I don't know if those rows are visible to me right now." What it does is it gets all of those rows, or it gets all of those identifiers out of the index, and it checks what's called a visibility map. If that table is changing often that visibility map is gonna say, "Hey, those rows aren't visible to you," and it's gonna have to go check the heap anyway. For a highly active, highly updated or updated, inserted, deleted table, you might not hit covering indexes hardly ever.
In the situation where the table's not updating that often, and this sort of indexing strategy can work, it is your best three stars or five, whatever the max rating is on an index, this is your best option because you never, ever, ever have to check the table, which is gonna save you a bunch of time. Checking the visibility map does incur some costs, but it is much faster than going back to check the table. Covering indexes are awesome. I wouldn't base your whole strategy around them. It is good to know that they exist and for certain super hot paths, certain very specific queries, they can be a massive performance improvement.