The most comprehensive course on PostgreSQL

Learn how to index JSON data in Postgres using two methods: a functional index and a generated column. The video demonstrates how to create a B-tree index on a specific part of a JSON blob, exploring its benefits and limitations. Additionally, we cover using a generated column for indexing, which ensures data consistency but requires adjustments to how queries are written.

Okay, it's finally time to index some JSON. We have two different approaches that we're going to look at. The first one, which we're going to do in this video, is indexing a piece of a blob. , we're going to dig into the JSON blob, pull a single key out, and index that. We're going to do that in two different ways. Then after that, we will look at indexing the entire blob with a gen index.

This first index that we're going to do is just a traditional B-tree index. We talked a lot about B-trees, hopefully you remember some of it. This is a very traditional index. The drawback is it only works on a part of the JSON document. You get all of the benefits of B-tree indexes. Now when you go to a gen index, it will put an index over the entire document. You do have a little bit more flexibility. There are some drawbacks. We'll talk about that in the next video. Let's create some B-tree indexes. We're going to take a look at this orders JSON again. We're going to look at this terrible, terrible schema. Do not take schema recommendations from this video, because this is terrible schema. The problem is the customer's email is all the way down here, nested inside of the customer, nested inside of this details column. Awful schema. We have a few different options.

The first thing, let's we can come up here. Let's pluck this out. We're going to do, let's do details. Then we're going to do the unquoting actually, nope, we got to go through a key. We're going to do customer first, and then we're going to do the unquoting operator. We're going to pull email out of there that needs to be quoted as well. That should do it. , that does give us that does give us the email. You'll remember, you'll remember all of the ways that we can get at the JSON keys. Here, we've gone into customer. Because we only did one arrow that leaves it as JSON, and this double arrow unquotes it.

Now, what we're going to do is we're going to put an index over that email field. We're going to do it in two different ways. The first is just a straight index, we're going to do a functional index, which means we can create a B-tree index on the result of an expression. The other one is we're going to create a generated column and then put an index on top of that. Why would you want a generated column hanging around if you can just do it with a functional index? That's a great question. That is up to you, the human to decide. Because if you, the human, want to see a top-level column that says email, creating a generated column would be the way to do that. Under the hood, there's functionally no difference between creating a generated column, putting an index on it, and creating a functional index on the results of the expression. Functionally, no difference. It's just a matter of preference and what you and your team want. It is nice to surface to the top level some of these things that can be hidden inside of a JSON blob. Beyond that, I don't really have a lot of opinions. A generated column is kind of the way that I like to go. That's totally up to you.

Let's look at a generated column first. Actually, let's look at a functional index first. Change my mind. We'll do create index. Then I've just got this copied here orders, JSON, table name, and then email, which is nested inside of that blob IDX on orders, JSON. Then here's what we have to do here. This is let me get this back up here. This is normally what you would do, you would open it up, and you would say I want to put it on the email column or whatever. To create that functional index, we're going to open up a second set of parentheses. This tells Postgres, hey, this is actually an expression, and it's not a column name. That's what that second set of parentheses does there. What we're going to do is if we come back out here and we do select, what did we it was details. Let's do from orders, JSON. We get the auto complete. It's details customer and then unquote for email. Is that right? That is right. We're going to take that, and we're going to bring that up here. We're going to paste that in and cast that; we should probably wrap that whole thing in parentheses, and then cast that as a text. So sometimes what I like to do is just take the result, or rather the expression that we're writing in here, copy it out, and then just paste it verbatim down here to make sure that we're getting the results that we want. This looks correct to me, we are getting those emails, and they should be text, I think we're fine there. That gives me assurance that this functional expression is correct. Do we have the right set of parentheses, we sure do. I'm going to put that guy up there. I'm going to put that guy down there. That will make it a little bit more clear.

Okay, so I created that index. Here's the thing, we only have five rows in this table. So, it's not going to use the index. I wrote up a quick little recursive CTE to insert a bunch of data. It's not terribly interesting, but it does fill up the table with 10,000 more rows of kind of nonsense, but it will show us that the index is working. If you're curious, here is the recursive CTE. 10,000 rows using the JSON build object, and it just puts in a status of pending, no customer, nothing, nothing like that. Now, if we do select * from orders JSON, we see we've got so many rows in here, just tons and tons and tons of rows, which is good because we want to see if this index actually works. With only five rows in the table, it's just going to scan the table because that's fastest. We can now say where details, customer, always forget to quote it, customer, unquote, email, text equals, I think it was Charlie, [email protected]. If we run that, we do get back the correct result. Now the question is, look at that. It is using that functional index.

Now the good news is, the good news is we're using that functional index, and we didn't have to change a single thing on the application side. , this gives you the power, if you're the DBA, or you're the person in charge, this gives you the power to index some queries without having to go change the application side. This can be very beneficial when you potentially don't control the application side, or you don't control it as tightly as you want. If there's some third-party library, or some even-like bi tool that's connected to one of your replica databases, and it's issuing these queries like this, you can go in and create a functional index and then leave the query exactly how it was. It's going to work just fine and take advantage of that index. This is the functional index approach, very good, very valuable, absolutely nothing wrong with it, we are going to look at the generated column approach, which puts one step in the middle between the JSON and the index itself.

The first thing I'm going to do is come down here, copy this out. We're going to say, actually, I'll probably want that later. We'll say here, let's do drop index, that index, that one's gone. What we can do is we can say alter table orders JSON, we're going to add column, let's just add it as email. We'll do text. Then we say generated always as something. Then we're going to put stored at the end. The thing that is going to be generated always as is this expression right here. If we run that, we can read this back and see select * from orders JSON. There we go, we have a top-level column named email. Remember that this is not editable because it is a generated column. Postgres is in charge of this, which is exactly what we want, we don't want it to be editable. Because we don't want the JSON in the column to get out of sync. With a generated column, you are 100% guaranteed that it cannot get out of sync, the JSON is the source of truth.

Now the last thing we have to do is we have to create an index. We'll call this orders JSON email on orders JSON, open email. This is just a traditional column. At this point, that's it, we don't do the double parentheses, we don't do any of that. We can run that. If we hop down here, it doesn't change the correctness of the query, no index should ever change the correctness of the query, just the speed. It looks like dang, looks like it didn't work. When you have a generated column, it's always going to be best to reference it by its generated column name, even if you are querying by the exact same underlying expression.

That could be a drawback of the generated column style is that it's going to be better to reference it by its name. You would have to change the querying side to make sure that you're taking advantage of that index. Those are the two different ways to index part of a JSON blob. You can, of course, extend this to any of those extractions that we looked at earlier, you can extract nested, you can extract things from arrays, you can extract anything, and you can pull anything out. In fact, of course, these methods don't only apply to extracting JSON parts, they apply to everything. If you ever run into a situation where you think, it's an expression, I wish we could index expressions, you are in luck. Of course, you can index expressions, as we've just looked here at indexing parts of JSON documents.

Full Course

PostgreSQL database platform

Summary

Video Transcript