Mastering Postgres has dramatically improved my understanding of indexes. It has helped me make informed decisions around my queries and schema rather than blindly adding indexes and hoping for the best.Dwight Watson
Shorten dev cycles with branching and zero-downtime schema migrations.
We're gonna start pretty simple with some grouping. Now, there are some very interesting aggregation functions that we can use. Some of which you probably already know and some of which are extremely esoteric, like bit, and bit or, but of course, I love those. We'll start simple and then we'll build our way up looking at some of these different aggregation functions.
Let me just start by showing you select * from employees. We have two tables we're gonna be working with. We have employees. It's very simple and sales that's similarly simple linked up by employee ID. Let's say that we want to find the amount that each employee has sold. We would just do group by employee ID and get our first error. What this is telling us is that we have decided that we're going to group the rows, but we've not given Postgres any information about how to combine particular values. The only one that we don't need to give any information about is the one that we're actually grouping on because there we're guaranteed that one is going to come back for each value, one row for each value in the table 'cause that is the thing that we are grouping by.
If we did select * from sales, again to remind us of the schema, we have an amount. If we come up here and put amount, we see that same error again, it must appear in the group by which is over here. It's saying it must appear over here or be used in an aggregate function. There are a lot of aggregate functions. The one that you're probably expecting here, and you would be correct is sum. We could say the sum of the amount. What that's gonna do is that's gonna take every row in the table, group it on employee ID and add up the amount. We also have average, and we have min and we have max. And we have any value. Any value, I thought it was any, but it's any value. Sometimes, I do forget things. Any value just picks one at random. Any value for amount doesn't make a lot of sense. Frankly, this doesn't make a lot of sense ever.
You could do something like any value for product and if you have something that you don't want to necessarily aggregate by, but you want to get back a single value for, you could use any value. I don't see this making a lot of sense because you can't rely on it, it literally could be any value. But if you need to use any value, please tell me why. I would love to hear why it does exist. It must be useful to someone. I think what is more useful here is let's make some space.
I think what is probably more useful than any value is let's switch this back to sum and then we can do some cool stuff like array agg. That will give us an array of the items that this person has sold. You can do json agg as well, and that gives you something you're probably more familiar to looking at, which is a json array. You can do jsonb agg, which will give you the jsonb version of that array. We'll go over there. There are in fact more json aggregate functions, but I'm gonna save a lot of the json for the json module itself. We'll go over some more of those aggregate functions in that module.
Let's keep going. I wanna look at another aggregate function. That is the bool and function. The bool and function is quite interesting to me because let's do select * from sales again. If we look at this, we do have the amount, and we do have is returned. The first thing we're gonna do with this bool and function is we're gonna look for employees who have sold only items over $100. Is a 100 good? Yeah. Because there's a 50, there's an 80, there's a 45. We want to find employees that have only sold items over $100. What we can do there is we can say amount, but then pass this in and say over 100. This is gonna return a bool and we'll say as all over 100. What this is going to do, it's going to end all of these Boolean values together. We'll get back the ones that have sold only items that are over $100. We could of course, switch this to bool or and say, well, everyone sold something over 100, so I'm not interested in that. Let's see, over 1,000. Hey, there we go. There we have. Let's say, best salespeople because they have sold something over $1,000 which hey, that's pretty great.
You can also imagine we may be looking for salespeople that have had no returns whatsoever. In this case, we don't need an expression because we have a Boolean column. If we look at the sales table again, we'll see we have an is returned column over here. Instead of putting this expression in here, we can say is returned as any returns. If we run that, we'll see any returns is false for number three, and employee number five, meaning they have had no returns. If you want to filter on this, you can. You could leave that or you could drop it. I'm just gonna drop it out.
What I'm gonna do, is I'm gonna come down here and instead of saying where something, when you have a group by, you have to put it after the group by and you say having. This operates after the group. This is just like a where, but it operates after the group. We can put bool or is returned and we actually want is false. These employees have had no returns, which means they're not maybe very pushy. They're selling people what they want. Hopefully five and three show up. We have five and three down there. Now, kind of cool, kind of useful.
Let's say we want to expand the scope of this report that we're making up as we go. We wanna say, "Great." Now, I know the ones that haven't had any returns, but in fact that's a little too binary. I wanna see how many sales and employee has made, how many were returned and how many were not returned. And before you think, "Wow, shoot." That's three separate queries. In fact, it is not because we can count and then we can filter our counts, which is kind of awesome.
Let's keep going. I'm gonna remove this down here. Yep. I'm gonna get rid of the having. And we'll first start by doing a count. This is going to show us the number of sales that each employee has made. Now, if we want to say, show me the number of sales that were made but not returned, we can do count star and then throw a filter on here and we can say where filter. We'll have a filter and then inside we'll put the where and we can do, where sales.is returned is false. We can say, so if it's false, we'll say as nonreturned sales. If we run that, we'll see a nonreturned sales and account. This guy already, you can tell they sold five and they have one that is returned.
If we wanna prove that out, we can just drop another one in here where sales returned is true as returned sales. We're getting three different counts while we're operating over this result set without having to run three different queries, which is quite nice and quite efficient. You're not limited to just filtering the count function. You can also filter some of these aggregate functions.
Let me introduce a new one, string agg and also we can filter it. Let's find all of the products that were returned for this individual user or employee. We'll say returned products and we'll do a string agg of product and then we'll put in a delimiter of comma and a space. We see, well, there were was only one per employee, so we didn't get to see the comma. But you'll see these are the products that were returned. Let's just switch this to false, so we can see that comma in action. There you go. You see there were four nonreturned and those were laptop, laptop, mouse, and desk lamp.
Before we move on to joining in the employees, I just wanna give you a minor note about count *. Count * counts the number of rows in the result set either the entire result set or the filtered down result set like we've been doing. Whereas count of a column name gives you the number of non null values in that column. Count * is just simply the number of rows and count column name is the number of non null values. And you're not penalized for counting the number of rows. It's not actually count, it's not consuming the entire row. It is optimized to count rows. You don't have to worry about should I use count *, should I use count ID? Use count * because that one is going to be optimized to get back exactly what you want in the fastest way possible. If you need to count the number of non null values, by all means use count column name.
Let's join in the employees table. All right, let's get rid of this. We don't super care about that. And we wanna start bringing in the employee names 'cause employee ID is a little opaque. It's not totally clear. What we're gonna do is we're gonna change things up just a little bit. We're gonna do a join employees. We will give it an alias of e on sales.employee id = e.id. If we run that, we don't see any new information, but we also don't get any errors. That's kind of nice. If we come up here and we say e.first name and we put a space in the middle e.last name as employee name with a comma, we'll see moink-moink. We have to have it in a group by, but we don't actually have to have it in a group by. If we change it down here to e.id, we do then need to remove that employee id. But we can change that to e.id as well. There we have it.
What we've done is we have grouped on that employee table by id and we've concatenated the first and last name together. We switched this employee ID over to e.id because we are grouping on that. We still have our same filters going on over here. The same counts over here. We are joining, and grouping, and counting all together.
In the next video, we're gonna look at even more complex grouping and rolling up with cubes and sets, and all of that kind of stuff so that we can get even better reports, maybe even more usable reports out of our data.