Window function 😊. CTE with window function 😮. Hierarchical recursive CTE 🤯. Recommended all the way. Can’t wait to see the rest of the videos!M Wildan Zulfikar
Shorten dev cycles with branching and zero-downtime schema migrations.
Everything we've done with subqueries so far has been producing a result set that we then use to join or just simply treat as a table, whatever. What we're gonna do now is we're gonna use a subquery as a way to filter a table based on data from a related table. In this instance, we're going to look at users and we only wanna bring back our most valuable users, the users that have the most bookmarks in our other table. We don't need to bring back the data from the bookmarks table. We're not gonna use a join, we just want the users based on the bookmarks table so we're gonna use a subquery.
The scenario we're setting up here is we want to bring back users that have more than 16 bookmarks. You can imagine in your domain something like teams that have more than 20 members, those are maybe some high-value customers that you wanna follow up on, or maybe you want projects that have fewer than one, i.e, zero, projects that have zero tasks in it, because those projects might be abandoned and you might wanna proactively reach out to those customers and say, "Hey, don't you wanna "kind of use our product a little bit?" This is what we're gonna try to do.
As always, I like to start with the simplest thing. That's gonna be how do we find the users? How do we find the count of the bookmarks for a user? Let's start with select * from bookmarks, and then we're gonna group by user_id. We have to come back here and we're gonna say user_id and count *. This is going to get us close-ish, but it's not quite right. You see, the first user has 16, but it's not quite right. What we need to do next is we're gonna introduce this having right here. Having is just like a where except that it operates after a group by.
Now we can say having count greater than 16. And if we do that, we see we get back 19 rows. This is a good start to the query. What do we want to do? We could, of course, we could do an inner join on this and then just select the users.*, that would be fine, but I wanna show you how you can do this with a subquery. We are gonna knock this guy out because we don't need that, we just want the users. We're gonna save a little bit of space here and bring all of these back and let's bring this guy up here. That is as small as I'm comfortable making it.
We want to do select * from users, what do we do now? We can do a subquery where id in and we can open the parentheses and close it like that. Then can we indent? We can, so where id in and then we're selecting single column, which is user_id down here. If we run that, we get our 19 people. These are our 19 high-value premium top-tier bookmark users. We were able to accomplish this by just selecting from the user's table, but then using this subquery as a filter. Now this is not, importantly, this is not generating a list of IDs and then basically shoving it in there like this. It's not doing it like this, 1, 2, 3, 4, 5. It's not running that query, getting a list of IDs, and then putting the IDs in there. That's called query decomposition. You might end up doing that in your interchange between your application and your database.
There are instances where in your application you will run this, you'll get back a list of IDs and then you'll come in here and run a second query in here. That is not what's happening. That's very important that that's not what's happening because that can be not very performant, especially when you're having to make that network round trip, because you issue the query, you get the IDs, you issue the IDs, you get the users back. We don't wanna do that, so what's happening here instead is theoretically it's something called a semi-join. It looks at this and says, "I know exactly what you're trying to do. "I'm going to treat this as one query, "I'm not going to execute it "and then plug in a bunch of values."
This is a semi-join, so it's not a full join, but it is kind of, it kind of looks like a join. There is another type like this and that is called an anti-join. That's where you would do something where id not in that. That would be all of your, all of your low-value users, which is gonna be, yeah, the rest of the table. Semi-joins and anti-joins are optimizations for these subqueries. It's not as if you're putting in literal values. If you wanted to bring the count back, you would then have to switch to adjoin, which let's go ahead and do that now, just for fun, we're gonna start with select * from, and we're gonna open a subquery here. Instead of just user_id, we're gonna select count * as ct for count as, these are our whales, these are our big bookmarkers, and let's do an inner join users on users.id equals whales. What a ridiculous concept, whales.user_id.
If we run that, we see, we do have our users and the count, and you see, it's basically grouped around 17, 18, 19. Nobody has, you know, a super outlying amount. You could change this to, let's say, first name, last name, count, and what do we wanna at the front? Probably users.id. This is our whale report that we can deliver while we're here, order by ct descending. We have our number one bookmark is Euna Mraz. Hopefully you're starting to get a sense of the flexibility of subqueries.
In this one, we eliminated a bunch of data from the user's table based on the bookmarks data. And then we turned around and said, "Actually let's start with the bookmarks table "using a subquery "and then join in the user's table on top of that." I wanna show you one more thing about subqueries for elimination before we finish up here. There does exist another style of subquery here. We did select * from users where id in and we opened up our subquery. I'm gonna show you both styles for the query that we just did. The query that we just did was select *, nope, it was select user_id from bookmarks group by user_id having count greater than 16. I think that was right, yep, that looks like it was right. If we bring this down here, there's another style and that is where exists.
There's a pretty big fundamental difference here, with a where exists, you can reference the outer table. Instead, we can just say select one, it doesn't matter, all we're looking for is the mere presence of a row. We need to say where users.id, which is referencing the outer table where users.id equals user_id. Here's the fundamental difference. This where exists, this right here is going to run for every row in the outer table or in the outer query, I suppose, which can be good because the where exists will short circuit the first time it finds a true value.
The first time as it's going through and it finds that this evaluates to true, it's just gonna say, "Great, I know that one is there, "I'm not gonna continue looking through their bookmarks." It's a lot worse here, it's a lot worse in our use case. Let me show you in a use case where it's actually quite good. Then we will do an explain analyze on these two, and I'll show you that this is actually terrible. It's not so much a matter of where exists is always good and where in is always bad or vice versa. Nope, unfortunately it super depends. We're gonna look at one case where a where exists is quite good and we're gonna say instead of grouping, so we're gonna link the inner and outer queries, we're linking the outer table users to this inner subquery on this part right here and starts with url https. This is that same thing.
We're looking for all users that have bookmarked a secure URL ever. If we run that, you'll see it's a lot. It's the majority of the table, 627,000 users. Now in this case, it's gonna start scanning the user's table and it's gonna say, "All right, for the first user, "let's come in here and run this query." And the very first time you find a hit, we can bail out. Don't keep looking for more bookmarks, because all I'm caring about is the existence of this. That's the case when a where exists would be faster is when the potentially this inner query could continue to go on and accumulate more and more and more results. Actually you just care about the first one. It could be slower because this has to run for every single row.
Let's take a look at the explain analyze while we're just out here having some fun, let's do some explain analyze, costs off. We're gonna do explain analyze, costs off on both. We're gonna look at our first one, the one that did super well. Importantly, what I wanna show you here is I wanna show you the loops. Down here this loop that is scanning over, it's actually an index only scan, which is nice, index only scan over bookmarks_user_id_index on bookmarks. It did that one time. Not only did it do it one time, but it never touched the heap, because remember an index only scan found all of the data it needed in the index alone, which is nice. Then it came up here and did the aggregate and it did a loops one time, which means this was run one time, just one time, very good, very awesome.
Let's look at that one, woo, all right, index only scan on secure URL loops, 989,000 times, 989,000 times. In this case, it's running this 989,000 times, but it's not getting any of the benefit from the where exists short circuit because we're saying group on all of the users' bookmarks. Where the benefit exists of short circuiting, we're saying please don't do that. I need you to actually visit every single bookmark, group it up and count it. It's saying, "Ugh, okay, I can do that, "but I'm gonna have to do that 989,000 times."
Wouldn't you rather do it over the big table once rather than a little group a million times? There are big, big differences for where id in and where exists. There's not like a hard and fast rule, but what you need to remember is the where exists is allowed to reference the outer table and it's going to be run for every row, it's going to be evaluated for every row in the outer query as we've seen here.
One final, final, final caveat, with a where not in, you've gotta be super careful about nulls. If we did select * from users where id not in and then a set of values. If we were to say values, I think you can just do one and two, is that legal? Yeah, you can do that. This is saying show me all the users where the id is not in one or two. If we run that again, that's pretty stable, that's bringing back the entire table. Let's knock out this Josianne upfront. We'll say, "Please remove her." It doesn't look very stable. It's not very fast either, but this is working.
The problem is, so this is only eliminating two out of a million people, that's why it's so slow. The problem is the second that you have a null value in there, nobody comes back, nobody comes back, you're completely hosed, it just simply doesn't work. That's because the comparison for null to all of these things is null. If you have some kind of select subquery in here for a not in, you need to be incredibly cautious that you're not returning nulls, in fact, I would probably use a not exists and still be careful about handling nulls.
But again, the knot exists is operating based on the presence or absence of a row, where the not in is actually looking at the values that you return. If you return a null, you might be hosed. I know that that's a whole lot, that is a whole, whole lot. I know that the runtime here, we've gotta be close to 15 or 20 minutes. I know that that is a whole lot, but it's such a powerful technique using a subquery to include or eliminate rows without bringing that extra data along.
Importantly, remember the difference between where in and where exists and how many times those queries you're gonna run. Sometimes it's awesome, sometimes it's terrible, and you'll have to look at the explains and the run times and your actual data to determine which one it is.