Aaron is a natural teacher and this course is the best introduction to Postgres I have come across. Lessons are easy to follow and he recommends some great tools for working with Postgres.Joel Drumgoole
Shorten dev cycles with branching and zero-downtime schema migrations.
We're gonna look at an example of how you can use a window function and a CTE to quickly identify and subsequently delete duplicate rows out of a table. Looking at our bookmarks, I'm gonna look at this particular person, because I modified their bookmarks that have several duplicates. Now, this is our fault, right? Because if we knew that we didn't want duplicates, we could've put a unique index across User ID and URL, such that a user would only be able to bookmark a particular URL once. This is the real world though, right? Maybe we didn't do that at the beginning and now we're like, shoot, I can't add a unique index 'cause now I gotta clean it up and now I gotta find all the duplicates. Fear not, dear friends. We are well qualified to do that. How can we do this?
Let's start by just finding the duplicates. Let's do that. So we're gonna select, let's select, well we can do select *, and then we're gonna do select row_number. And what this should tip you off to, is that we are about to enter into window function world. We're gonna do over, and then we have to declare our window function. We are in fact going to partition by, let's partition by user_id and url. If we partition by both of those things, then our partitions are going to be, everything in a single partition is a duplicate now. You'll see here that's row number one, row number one. But once we get down to the duplicates, then we have 1, 2, 3, and 4. Because if we're partitioning on User ID and URL, this becomes a single partition and the row numbered increments up through the partition. When it reaches a new partition, that's when it starts over. We kind of know, well, as the human, we know exactly what our duplicates are. Our duplicates are anything where the row number is greater than one.
Let's go ahead and use a CTE, maybe a couple of CTEs to narrow this down to just the duplicates. I'm gonna change this to is, let's say where it's greater than one as is duplicate. If we run that, we see here are the duplicates right here. And we have a couple of options here. I'm just gonna keep operating on this individual user. If you wanna do it across the entire table, obviously you delete that. For the sake of time, I'm gonna keep it there. We can do this. We can say with duplicates identified as, and we can open that up and drop down here and then come up here and then say comma with duplicates, not with again, you only use with one time. Then you can just do duplicates as select ID from duplicates identified where is duplicate is true. We'll do select * from duplicates. What do we get? We get IDs two, one and 1240.
If we come back to duplicates identified, does that seem right? We have two, one and 1240, but this user, importantly, still has one copy of that bookmark remaining. That is pretty important. When I said you have a few options, I was not lying. You do have a few options. You can do this. You can say delete from bookmarks, where ID in select ID from duplicates. You could absolutely do that. If you wanted, you could also do this. Select ID from duplicates identified where is duplicate is true. That's totally fine too. It kind of is just preference, honestly. There might be a performance benefit of doing it this way because we're not building up that second CTE, so I'm not really partial to either one. We're just gonna run it this way.
What we're doing here is we're saying, all right, let's look in the bookmarks table. We're gonna look at the ID and every ID where it is, is duplicate of true, we're just gonna delete it. If we run that, we'll see that three rows were deleted. Thank goodness. Now if we did select * from duplicates identified, we do not have any duplicates left. We can say is duplicate equals, gosh, is true. Where I put the is in the wrong place, where is duplicate equals true. We have fully deleted the duplicates for that user in one single query without having to pull stuff back, find the duplicates on the application side, or without having to run these queries separately. We did it all in one nice neat query thanks to a CTE extraction and a window function, which calculated the row number over the partition, which we partitioned by user ID and URL, which declared the duplicates.