The course "Mastering Postgres" helped me practice query commands and understand why they're important. Inspiring me to explore the 'why' and 'how' behind each one. Truly a great course!Nakolus
Shorten dev cycles with branching and zero-downtime schema migrations.
We're wrapping up our ways to produce result sets module here, and in this one, instead of putting results side-by-side, which we've done so many times, both in the joins and the rows from and kind of in the subqueries too, I guess, we're gonna put results on top of each other. In this one, we're gonna look at union, intersect, and accept as ways to combine the results of two different queries over-under instead of side-by-side.
The most basic form of a union looks like this, select one, select two union, and that's gonna take the results from the top query and the results from the bottom query and put them over-under, but importantly, you must have the same number of columns, and, in fact, you must have these same data types as well. Normally, it doesn't look like this. Normally, it looks like select * from users where something, something, something, and then a union of select * from users, I didn't copy, select * from users where something something else. Normally, you're gonna run two queries, and you're gonna put 'em together over and under to produce one result set.
In fact, why don't we go ahead and do that, but not both from users, why don't we do it from users and users archive? Imagine this situation. Imagine a situation where your users are able to delete their accounts, but you shuffle them off into this holding table. You put 'em in this holding table for maybe 365 days or whatever. I don't know what GDPR is, don't come at me, but maybe instead of fully deleting them, you kind of soft delete them, but you also move them out of that table to keep your active dataset as slim as possible. Then somebody comes along and says, hey, it'd be really nice to search trashed users along with regular users, and you're like, you told me I could move them to a separate table! Don't worry about it, we have union, so let's show you how this might work.
In this example, we're gonna look for me across both tables, so we're gonna say where email equals aaron.francis@example.com, and in this scenario, we're gonna take whatever filters the user has provided from our fictional UI here, and we're gonna apply it to both queries and then union them together. If we run that, we see that I do exist, and have I been deleted, have I not? I don't know, it's hard to tell, but I exist, and that counts for something. We can tell if a user has been deleted by just selecting a constant up here and a constant down here, and we can just name this, in the first one, as active. We don't need to name it in the second one, and now we know that active is one.
Let's go ahead and make that a little more semantic. We can say active as true, so this user is active. Now, how does this look if we did a select * from user archive limit 1
? So Mustafa Baumbach was deleted, and so we can pull their email address, and if we were to search for Mustafa, we would see that they still show up in the results set, but they are showing up as active false, which is pretty nice. We're able to search across all of the users, even if they are in different tables, which they could be. This is not a pattern that I use very often, but it is a common pattern of shuffling off old data into an archive table and then incurring the penalty of having to union on that whenever you're searching archive data, because you must presume that searching active data is the real hot path, and you wanna keep that table as small as possible, and so you're willing to incur the cost of unioning a users archive table here.
Let's keep going, and let's look at, we're gonna go back to just our super basic example. If we did select one union select one, and we did that, we only get one back, which is interesting. Union all is what you're looking for there. By default, union is going to deduplicate the results, which can be very expensive, so if you don't want that, make sure that you type union all, because in a real-world scenario, what's gonna happen is it has to compare the entirety of the row with every row in the result set below, and it has to do that every time. In our case, it's just comparing a few integers, but imagine a big old row, and it's trying to do some duplicate weed out strategy.
If you know, either by pure logic or just some business process in your brain, if you know that there cannot be duplicates or you want to see the duplicates, make sure that you put a union all, 'cause that's gonna be way, way faster.
There are two more that I wanna show you, and for these, we're gonna use generate series. We're so familiar with generate series at this point. We have generate series one through five, and we're gonna do generate series three through seven as well, and instead of union, we're gonna put intersect in here, and what this is going to do is this is going to give you the intersection of those two result sets.
Now, importantly, you can do intersect all, and that does not perform that duplicate weed out. I know, logically, that there are no duplicates here, and I don't really care, and so in here, I'm gonna put intersect all, because that is faster. You can just keep going. You can just say union all, and then you can throw another generate series from 10 to 15 down here. I don't love this, personally, because the order of operations can get a little hairy, so you can throw parentheses around it, and it makes it a lot more clear, and so it says generate series one through five intersect with this entire thing, which in this case, excludes 10 through 15 because the only intersection is three, four, and five. Or you could say, hey, why don't you generate this intersection first and then do the union all at the end?
In my opinion, if, for whatever reason, you're gonna string multiple of different types together, which I just don't understand, everything is possible, but if you're gonna do that, my recommendation would be throw some parentheses around it.
Let's go on to the last one. The last one is, instead of union, instead of intersect, it's the opposite, it's except. Same rules apply here, except all will prevent duplicates from being weeded out, which is to say it will allow duplicates in the result set, and there will not be any here. What this does is it takes the first one and says, gimme this entire result set except where it overlaps with this result set. In this case, one, two, three, four, five, three, four, five show up there, and so the result is one, two. Union take two result sets, put 'em one on top of the other, intersect, take two result sets, find their overlap, and except is take two result sets and give you the first one except where it overlaps with the second one.
All of them, you can append the all modifier, and that will prevent Postgres from doing those relatively expensive comparisons to weed out the duplicates. If you want the duplicates weeded out, do not shy away from just saying union, except, intersect. All I'm encouraging you to do is be thoughtful about, are there duplicates? Do you want them, or do you want Postgres to weed them out? But regardless, now you know a way to get some results sets, put 'em on top of each other.