The most comprehensive course on PostgreSQL

Common Table Expressions (CTEs) make SQL queries more organized and readable by breaking down complex queries into simpler, reusable parts. Using CTEs can improve performance, especially if the same query is repeated, and help avoid performance issues in Postgres. This approach also enhances collaboration on big data queries, making it easier for teams to work together on complex SQL logic.

We're gonna look at common table expressions, which you'll see written everywhere as CTEs. CTEs are a way to, again, refactor your query. In a previous video we refactored a window definition further down and made it a little bit more readable. A CTE is a great way to refactor your query into distinct parts, which can make it a lot more readable.

We do need to consider some performance things, but we'll look at that in a second. First, I want to show you what a CTE actually looks like. We're gonna start out super simple and that's by declaring a CTE. You do that by using the keyword, 'with', and then you can put in any table name that you want. We'll call this our new_cte as, and then you open a set of parentheses and do something in here and then without ending the query, so do not do that, without ending the query, you can do select * from new_cte. If you run that, it's not gonna work because we haven't written anything in here.

We'll just start by doing select one and now you'll see we have written our first CTE. It's not very impressive, I'm not gonna lie, but what we have done is we've functionally extracted part of our query and we've moved it up a level. Do you remember, do you remember when we were doing... We'll leave that there. Do you remember when we were doing this? Select * from users where email equals [email protected], and we said there might be a deleted Aaron, so we might need to check the user's archive where email equals [email protected]. We said, you know what? You could actually, if you wanted to get this back as one result set, you could do it like this, and that makes it a little bit nicer, and so we can look across both tables at one time. This is what we did back in one of the unioning videos.

Let's go ahead and do select * from users_archive. We'll just limit one. I just want to grab one deleted person's or archived person's email address. You can probably see where this is going. Instead of doing it like this, we could come up here and we could say, we could call this all_users and we'll make that a little nicer. Apparently we need to make that nicer, and there we go. Select * from all_users, and now we have all of our users across users and users_archive. I'm unmoved by this. It is interesting, it is helpful for learning, but I'm unmoved by this because we're still duplicating this in here. Let's not do that.

Let's just break this out down here. We have a nice neat way of searching across both tables. If we were to put Mustafa Baumbach in here and run that, you'll see we do in fact get Mustafa Baumbach, and we can do the same thing here with... Let's put it as the beginning. We'll say false as is_deleted and we'll do true as is_deleted, and there we have that extra helpful column to know where this table, or rather where this row came from, which table it came from. We've wrapped up what is potentially a little bit... It's kind of gnarly, right? We've wrapped up some less than ideal query. We've wrapped it up and we've extracted it into this table called all_users, and then we just get to treat that as a table. We can go a little bit further. You'll have to decide amongst your teammates or maybe just for yourself if you like this idea or not. It is a little bit wild, but let me just show you how to do it just in case you're curious.

Up here, instead of saying all_users, we can just say users. And now it's interesting because users as a table does exist, but we've declared this as the users table now. Coming down here, now, we're just saying select from users and it's a little bit different. We can drop Mustafa out and put me back in. Some people right now are recoiling in horror that we've over overwritten this variable of users as this new thing.

Here's the reason you might, you might, might, might consider this in your actual application. You might have an ORM where you can have all sorts of fancy complex conditions and nice clean affordances on the application side, but then somebody comes along and says, we need you to search across regular users and archived users. And you kind of lose some of that ability. You don't lose that ability if you redefine the users table as a combination of regular users and archived users and name it users and then you let your ORM generate this query.

Just a little note on ORMs, I have zero problems with good ORMs. I have a lot of problems with bad ORMs. I have zero problems with good ORMs because they give you just these beautiful affordances on the application side and a good ORM can encourage good behavior. My biggest caveat on ORMs is it does not absolve you of even a little bit of responsibility for what the query underneath is. If an ORM generates a bad query, that is your responsibility and you should know how to fix that and you should know how to spot that and you should know how to avoid that in your ORM. That's just a little bit of an aside. I love ORMs.

Let's get back to CTEs. We're just gonna go ahead and rename this back to all_users for the people out there that think I'm crazy. Look, hey, we're back on the same page. Okay, so one thing that's interesting here is if we do explain analyze, and let's go ahead and turn costs off because it gets awfully wide, doesn't it? If we turn costs off, we see a couple of things. We see a sequential scan on users_archive because we don't have an index over there, but the filter was pushed down to the users_archive to this other query, and then we see this index scan on email_btree using that same condition. This condition was pushed down from this outer query into each individual query.

What's happening here is Postgres can decide whether to materialize or not materialize a CTE. What materialization is, is that is the creating of basically a temporary table and then it will operate against that temporary table. What's happened in this case is it decided not to materialize it because it's only being referenced once down here in this query, and so it gets no benefit, it gets no benefit from like writing the query, writing it to the table and then just selecting straight from it. That doesn't make a lot of sense. If this CTE was being referenced multiple times, it will materialize, it will materialize that CTE so that it doesn't have to run this query multiple times. Maybe that's good and maybe that's bad. It depends. It depends on your specific query, but you can control it.

If you want to say, I don't want it to materialize no matter what, maybe you understand something about your dataset or some terribly inexpensive function in here and you just want this to run every time it's being referenced down here, you can say not materialized. We see that it's the exact same explain plan because this is by default not materialized because it was referenced one time. You can also say materialized and that's where it gets a little bit worse in this case because look, it has to scan both tables and then it's got to put them together as this CTE and then it has to filter, has to do the CTE scan on CTE all_users, filtering it out. In this case, materializing it, terrible idea. Just a terrible idea. Postgres has gotten a lot smarter about this, I think, since like 10 or 11 or 12 or something. Postgres has gotten a lot smarter about deciding when to do this and when to not. The rule of thumb generally is if it's referenced twice down here, it's going to materialize, and sometimes that is exactly what you are looking for, sometimes it is not, but if you find yourself in a situation where you're disagreeing with Postgres and you need to reach for the big hammer, you can say materialized to force it or not materialized to force it to not be materialized.

I want to show you one more thing on CTEs. You can have more than one per query and they can just like keep referencing the one above it. If we come back here and we say with all_users as, and we're just gonna let Postgres do its thing here, and then we can say aarons as, and we can do a select * from all users, which is the CTE above us, where, and we can take yoink and come up here and do that, and then we can select * from aarons and that totally just works.

You can imagine the organizational capabilities that this gives you. When you have to do just huge gnarly query to build up some result set and then reference it and do another result set and just keep going, CTEs are a great way to extract that logic into something that has a pretty name, has its own part of the query, is very reasonable or very readable and very easy to communicate with your teammates about.

Consider CTEs as a nice organizational mechanism and as a performance improvement in the case where you're writing the same sub-query multiple times in a query, you can extract that to a CTE, Postgres will materialize it, it'll run it one time and then it will reference that temporary table over and over again.

Full Course

PostgreSQL database platform

Summary

Video Transcript