Window function 😊. CTE with window function 😮. Hierarchical recursive CTE 🤯. Recommended all the way. Can’t wait to see the rest of the videos!M Wildan Zulfikar
Shorten dev cycles with branching and zero-downtime schema migrations.
I want to show you how we can combine a window function and a CTE to do something that we did with a lateral join several videos ago. Back when we were doing lateral joins, we did something where we find the top three bookmarks, or I think we did the first three or last three bookmarks, for a particular user. This is a very common query that you'll run into a lot in interview questions. I don't know why. You run into a a lot just out in the real world where you say, "I want to find, for a user, I want to find in many of a thing." I want to find the latest, or the the biggest, or the smallest, or whatever, three relations from that user.
We did it with a lateral join, let's do it with a window function and a CTE just to show you how it's done. Let's refresh ourselves on what bookmarks looks like. We'll do limit 30, and it's just id, user_id, url. Not that interesting. Let's do row_number over, and this is where we're going to do our window definition. We will partition by user_id order by id. Let's start with id ascending. If we run that, we'll see user_id has several bookmarks here. I think they have more than 10. They've got 16. With 16 bookmarks, it goes 1 through 16, and then it jumps to the next partition starting at number 1. If we wanted to get the first three bookmarks for a user, that is relatively easy, we could just say with, what do you want to call this?
With user_book, let's call this ranked_bookmarks. With ranked_bookmarks as, open that, and we'll pop down here and say, select * from ranked_bookmarks where num is less than or equal to 3. I think that should get us where we want to be. There's a user, there's a user, there's a user. Yeah, cool. That totally worked. That worked. We do have a lot of users and a lot of bookmarks. If we want to get the last three bookmarks from a user, we can't just do it outside of the CTE because every user has a different number of bookmarks. We can't say, we can't just assume every user has 10 bookmarks and say, "Show us bookmarks 8, 9, and 10." Because some users have 16, and some have 30, and some have two.
What we're going to do here is we're going to switch around the order of this partition, the order of this window definition. If we just comment this out. Then, we'll comment this out. We'll put a semicolon here and we'll throw a limit 30 on it. If we run this, we should see user_id 1 has 1, 2, 3, and that is the latest bookmark followed by the ones before it. The earliest bookmark, i.e., the smallest ID, comes down here at the end. That is a way that we could get the three most recent bookmarks. We're going to drop out this limit 30, and we're going to throw a limit down here because it is kind of a lot of data. We will clean up after ourselves. We did something, we left a semicolon there. We have user_id 1, 1, 1. We have three IDs in descending order, and then we move on to the next ID. Now, that's pretty cool.
The last thing I want to show you is how you can get the first and the last bookmark for every particular, or every single user, even if they have different numbers of bookmarks. To get the first one, we could do something like, we could do something like where num equals 1, and then we could reverse the order and do where num equals 1 on a reversed order. Instead of that, let's use the lead and the lag functions to create a window, a single window definition that we only have to process once, but we can use lead and lag to get there.
I'm just going to start by throwing all of this away so we have some space. We're just back to this guy right here. If we want to do it with just one single window definition, how are we going to do that? Let's take this over partition user_id order by id. Let's do ascending. We'll order by id ascending and we'll call this over ordered, let's call this user_bookmarks. Then, window user_bookmarks as, and we'll plop in our definition down there, and we'll throw a limit 30 on it. We have now extracted our window definition, which is good.
Now we're going to try to use this same window definition twice. Instead of, we could use row number, I want to use something else. I want to use something else. We're going to use lead(id). We'll call this lead_id. Then we're going to use lag(id) as lag_id. We should see something interesting here. We do in fact see that when we are on the first row of a partition and we look back one, there's nothing there. There's absolutely nothing there, so it becomes null. The lead works the same way, when we're on the very last row of a partition and we're peeking forward one row to grab the ID out of that row, we cannot peek beyond a partition boundary, and it says, "Whoa, whoa, whoa. That's another user. That's not your partition." It just returns NULL.
We can use these two indicators to get our first and our last rows. The way that we can do that is we can simply say is null. If we turn that to an is null, then we see that the first row it's set to true and every row after that it's set to false. We're going to do that down here, true and false. We're going to change these to first_bookmark. We'll say is_first_bookmark, since it's a boolean, I do like that. Then we'll change this to save that -st and just put an la-. Now we have is_last_bookmark, kind of awesome. Then, let's take that away. We're going to, yep, that's mine. And we'll say with user_bookmarks. That's fine. As that. We're going to put that back in there, and indent that. Select * from, this is cool, user_bookmarks where is_first_bookmark is true or is_last_bookmark is true.
If we run that for user_id 1, we have first bookmark, last bookmark. For user_id 2, first, last. So on and so forth. Since we're just having fun here, I want to show you a way that we can, kind of a wacky way that we can take advantage of the way that these null comparisons work to kind of bring is_first and is_last into a single column of is_first or last without doing the or in the outer part of the query.
If we remember, null is not equal to anything. If we do select null = 1, we get back null. If we select null = null, we get back null. Because what we're asking the database is, "Is the secret thing that I'm hiding equal to the value of 1?" And it says, "I don't know, what are you hiding? I have no answer for that. I cannot answer that question." We're going to use that to our advantage because what we want to produce is, let's take you and come up here. We'll just do that and we'll do this. Then we'll limit this down to 30. We'll bring that up and drop that back. Here we go. This is what we're trying to do. We want to get is_first and is_last kind of to come together here.
We could, of course, do it by, you know, we could bring this up here and say, you know, or that I don't want to do that. That's probably the reasonable way to do it. This is an interesting way to do it. What I want to do is I want to say lag(id) =. We have to say lag(id) over user_bookmarks = lead(id) over user_bookmarks. I'm not saying this is a great idea, I'm saying it's terribly interesting. If we're on the first row, lag is going to evaluate to null, which will be compared to a legit value, and this will come back null.
When we are on the second row, these will both be legit values, and it'll be false. When we're on the last row, this will be null, this will be a legit value. It'll come back null. We can just say is null as first_or_last. If we do that, we get the first one is true and true over there. No more trues until we get to the last one. It's true and true over there. So, kind of hideous, but, hey, it's kind of cool too. We can bring this all back and we'll drop this out, and we'll put you in there and we'll make a little space there, and we'll come back up here and just make it look nice and neat for the people. We can drop off this hideous or and just say where is, I should have named that is_first_or_last, where is_first_or_last is true limit 30. We get all of our trues in the far right column with our bouncing of true, true, true, true, true, true, all the way down.
The good thing about this is you have so many options, and while there are some queries that have a very obvious right thing to do, there are several others where several paths can be correct.
Now that you know what some of those paths are, you know where to try to maybe test some out. Now that you know how to read, explain plans, you know how to debug the ones that maybe are potentially slow or at least figure out which one is relatively better. If they're relatively the same in terms of performance, please choose the one you like. It's okay to choose things based on preference. If all other things held equal, they're the exact same, just pick the one you like. You're the one that's going to have to deal with it.
Hopefully this is an interesting technique that may prove useful. Don't forget that lateral joins can get you some of the way there on problems like this.