Mastering Postgres
Introduction
Introduction to the course
Overview of course structure
Postgres vs. everyone
The psql CLI
Data Types
Introduction to schema
Integers
Numeric
Floating point
Storing money
NaNs and infinity
Casting types
Character types
Check constraints
Domain types
Charsets and collations
Binary data
UUIDs
Boolean
Enums
Timestamps
Timezones
Dates and times
Advanced Data Types
Intervals
Serial type
Sequences
Identity
Network and MAC addresses
JSON
Arrays
Generated columns
Text search types
Bit string
Ranges
Composite types
Nulls
Unique constraints
Exclusion constraints
Foreign key constraints
Indexing
Introduction to indexes
Heaps and CTIDs
B-Tree overview
Primary keys vs. secondary indexes
Primary key types
Where to add indexes
Index selectivity
Composite indexes
Composite range
Combining multiple indexes
Covering indexes
Partial indexes
Index ordering
Ordering nulls in indexes
Advanced Indexing
Functional indexes
Duplicate indexes
Hash indexes
Naming indexes
Understanding Query Plans
Introduction to explain
Explain structure
Scan nodes
Costs and rows
Explain analyze
Generating Results
Introduction to queries
Cross joins
Inner joins
Outer joins
Subqueries
Lateral joins
ROWS FROM
Filling gaps in sequences
Subquery elimination
Combining queries
Set generating functions
Indexing joins
Advanced SQL
Introduction to advanced SQL
Grouping
Grouping sets, rollups, cubes
Window functions
CTEs
CTEs with window functions
Recursive CTE
Hierarchical recursive CTE
Handling nulls
Row value syntax
Views
Materialized views
Removing duplicate rows
Upsert
Returning keyword
COALESCE + generated column
Full Text Search
Introduction to full text search
Searching with LIKE
Vectors, queries, and ranks
Websearch
Ranking
Indexing full text search
Highlighting
Advanced JSON
Intro to JSON
JSON vs JSONB
Validating JSON
Creating JSON objects + arrays
JSON extraction
JSON containment
JSON existence
JSON recordset
Updating JSON
Indexing JSON parts
GIN index
Vectors (pgvector)
Intro to pgvector
Vector embedding columns
Find related articles
Upsert vector embedding
Semantic search
Other operators
Vector indexes
Outro
Thank you
Bonus interviews
Heroku's glory days & Postgres vs the world (with Craig Kerstiens)
Creating a Postgres platform with Monica & Tudor from Xata.io
Bootstrapping an email service provider (with Jesse Hanley)
Advanced SQL
Removing duplicate rows


Test Data

I've made my test data available for you to use and follow along.

Download or Use on Xata

Summary

Learn how to use window functions and Common Table Expressions (CTEs) to find and delete duplicate bookmarks in a table. By partitioning the data based on user ID and URL, duplicates are identified where the row number is greater than one. Using a CTE, we efficiently select and delete these duplicates, ensuring each user only has one copy of each bookmark.

Video Transcript

We're gonna look at an example of how you can use a window function and a CTE to quickly identify and subsequently delete duplicate rows out of a table. Looking at our bookmarks, I'm gonna look at this particular person, because I modified their bookmarks so that they have several duplicates. Now, this is our fault, right? Because if we knew that we didn't want duplicates, we could've put a unique index across user ID and URL, such that a user would only be able to bookmark a particular URL once. This is the real world though, right? Maybe we didn't do that at the beginning, and now we're like, shoot, I can't add a unique index 'cause now I gotta clean it up and now I gotta find all the duplicates. Fear not, dear friends. We are well qualified to do that. How can we do this?
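
For reference, the constraint mentioned above might be sketched like this. It assumes the transcript's bookmarks table with user_id and url columns; the constraint name is made up, and adding it would only succeed once the duplicates are cleaned up.

```sql
-- Hypothetical unique constraint across user_id and url on the bookmarks table.
-- Adding this before cleanup would fail, because duplicates already exist.
ALTER TABLE bookmarks
  ADD CONSTRAINT bookmarks_user_id_url_unique UNIQUE (user_id, url);
```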

Let's start by just finding the duplicates. Let's do that. So we're gonna select, well, we can do select *, and then we're gonna do select row_number. And what this should tip you off to is that we are about to enter window function world. We're gonna do over, and then we have to declare our window function. We are in fact going to partition by, let's partition by user_id and url. If we partition by both of those things, then every row within a single partition shares the same user and URL, so anything beyond the first row in a partition is a duplicate. You'll see here that's row number one, row number one. But once we get down to the duplicates, then we have 1, 2, 3, and 4. Because if we're partitioning on user ID and URL, this becomes a single partition, and the row number increments up through the partition. When it reaches a new partition, that's when it starts over. We kind of know, well, as the human, we know exactly what our duplicates are. Our duplicates are anything where the row number is greater than one.
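
Roughly, the query being built up here looks like this. It's a sketch assuming the transcript's bookmarks table; the user id in the filter is a placeholder, and without an ORDER BY in the window, which row within a partition gets numbered one is arbitrary.

```sql
-- Number each row within its (user_id, url) partition; duplicates show up as row_num > 1.
SELECT
  bookmarks.*,
  row_number() OVER (PARTITION BY user_id, url) AS row_num
FROM bookmarks
WHERE user_id = 123;  -- placeholder for the single user inspected in the video
```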

Let's go ahead and use a CTE, maybe a couple of CTEs, to narrow this down to just the duplicates. I'm gonna change this to, let's say, where the row number is greater than one, as is_duplicate. If we run that, we see the duplicates right here. And we have a couple of options here. I'm just gonna keep operating on this individual user. If you wanna do it across the entire table, obviously you remove that user filter. For the sake of time, I'm gonna keep it there. We can do this. We can say with duplicates_identified as, open that up, drop the query down in there, and then come up here and say comma, duplicates, not with again, since you only use with one time. Then you can just do duplicates as select id from duplicates_identified where is_duplicate is true. We'll do select * from duplicates. What do we get? We get IDs two, one, and 1240.
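
Put together, the two CTEs described above might look like this. It's a sketch; the CTE and column names follow the spoken names in the transcript, and the user filter is a placeholder.

```sql
WITH duplicates_identified AS (
  SELECT
    bookmarks.*,
    row_number() OVER (PARTITION BY user_id, url) > 1 AS is_duplicate
  FROM bookmarks
  WHERE user_id = 123  -- placeholder; drop this filter to scan the whole table
),
duplicates AS (
  SELECT id
  FROM duplicates_identified
  WHERE is_duplicate IS TRUE
)
SELECT * FROM duplicates;
```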

If we come back to duplicates_identified, does that seem right? We have two, one, and 1240, but this user, importantly, still has one copy of that bookmark remaining. That is pretty important. When I said you have a few options, I was not lying. You do have a few options. You can do this. You can say delete from bookmarks where id in (select id from duplicates). You could absolutely do that. If you wanted, you could also skip the second CTE and go straight to select id from duplicates_identified where is_duplicate is true. That's totally fine too. It kind of is just preference, honestly. There might be a slight performance benefit to skipping the second CTE, since we're not building it up, but I'm not really partial to either one. We're just gonna run it this way.
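
Both delete options described here might be sketched like this, reusing the CTEs from above. The single-user filter is omitted; add it back inside the first CTE to scope the cleanup to one user.

```sql
-- Option 1: delete using the second CTE.
WITH duplicates_identified AS (
  SELECT id, row_number() OVER (PARTITION BY user_id, url) > 1 AS is_duplicate
  FROM bookmarks
),
duplicates AS (
  SELECT id FROM duplicates_identified WHERE is_duplicate IS TRUE
)
DELETE FROM bookmarks
WHERE id IN (SELECT id FROM duplicates);

-- Option 2: skip the second CTE and filter duplicates_identified directly.
WITH duplicates_identified AS (
  SELECT id, row_number() OVER (PARTITION BY user_id, url) > 1 AS is_duplicate
  FROM bookmarks
)
DELETE FROM bookmarks
WHERE id IN (SELECT id FROM duplicates_identified WHERE is_duplicate IS TRUE);
```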

What we're doing here is we're saying, all right, let's look in the bookmarks table. We're gonna look at the ID, and every ID where is_duplicate is true, we're just gonna delete. If we run that, we'll see that three rows were deleted. Thank goodness. Now if we do select * from duplicates_identified, we do not have any duplicates left. We can say is_duplicate equals, gosh, is true, I put the is in the wrong place, where is_duplicate is true. We have fully deleted the duplicates for that user in one single query, without having to pull rows back and find the duplicates on the application side, and without having to run these queries separately. We did it all in one nice, neat query thanks to a CTE and a window function, which calculated the row number over each partition of user ID and URL, and that is what identified the duplicates.
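
That final check could be sketched as re-running the detection query and confirming it comes back empty, under the same assumptions as the sketches above.

```sql
-- After the delete, this should return zero rows.
WITH duplicates_identified AS (
  SELECT
    bookmarks.*,
    row_number() OVER (PARTITION BY user_id, url) > 1 AS is_duplicate
  FROM bookmarks
)
SELECT * FROM duplicates_identified WHERE is_duplicate IS TRUE;
```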