The most comprehensive course on PostgreSQL

Generated columns in Postgres are a great way to keep data tidy and synchronized without the need for complex queries. These formula-based columns automatically calculate their values from other data, saving time and reducing errors. They’re particularly useful for simplifying data models and ensuring consistent data without manual input.

Generated columns gotta be one of my favorite features of any database, Postgres included. They are an escape hatch. They're a get out of jail free card. They're a nice way to tidy up some gross queries. In the process of developing your schema, business requirements will change, databases will probably change, if you keep going long enough, deadlines can be tight, and your knowledge can change. You may look back on a database that's five years old and think, "Boy, what the heck was I doing?" That's okay. That totally happens. And you can adjust the data model as you go.

A nice middle ground or maybe a stop gap can be a generated column. It can paper over some of the gross stuff in your data model. They're also very, very useful for beautiful, pristine data models. Whether or not they're messy, it doesn't really matter. But generated columns can be an escape hatch.

So the question is, what the heck is a generated column? A generated column is in fact a column, but it's basically a reference to another column and then some sort of transformation, some sort of function, some sort of modification. And I'm gonna show you a few here, but you can think of it as maybe a formula-based column in Excel or Google Sheets. You have your one column with all the data, and then next to it you have a formula that references that column. That's all a generated column is.

Traditionally, there are two types, virtual and stored. Postgres only has... I need to put down this finger. Postgres only has stored, so there is no virtual generated column in Postgres, which is a little bit unfortunate, but it's fine, we'll get by. Let me show you a few examples of what these are. Then honestly, throughout the rest of the course, we might pick up generated columns again. So let me show you a few.

Here's a very basic example, a people table that has height in centimeters and height in inches. Now, these two things represent the same fundamental truth, right, a person's height. They're not really discreet pieces of information. They're the same piece of information fundamentally, just represented in different units. Instead of trying to keep these two things in sync, what we're gonna do is we're gonna say this is GENERATED ALWAYS AS, and this might ring a bell 'cause we've done GENERATED ALWAYS AS IDENTITY. But instead of doing that, what we're gonna do is we're gonna do open parentheses here, and then we're gonna say (height_cm / 2.54). And then at the end, you must write STORED. In some other databases, you can write VIRTUAL. And that just becomes kind of like a query macro almost, in that, it's always computed at runtime.

However, in Postgres, you must write STORED, which means it's computed at INSERT and UPDATE and then actually physically written to storage, which is totally fine. If we run that and we were to insert into people, we'll insert (height_cm), and honestly, I don't know how many centimeters tall I am. I don't really know what a centimeter is. I'm gonna insert (height_cm) values (100). Now if we read that back, select * from people, let's see how close I got. So 39 inches. That doesn't sound very tall. select 39/12, that's not very tall. Let's do 200. So if we did 200, so we need to run that, and then run that back, so 78 inches is how many feet? Six feet. That feels a lot better.

Okay, so what we've done is we've inserted a value into the centimeter column, but the inches column was calculated for us based on the centimeter column. Now, interestingly, you cannot muck this up because you do not have control. If we were to say, sure, that's, you know, six feet, which would be... What did we decide? 78-ish inches. You just can't do it. You're not allowed to do it. Which gives me great comfort because I don't want people accidentally inserting data in here if they didn't know that this was a generated column. This is nice. It keeps that data. It keeps that data in perfect sync. Another place, I haven't actually used it to do inches and centimeters, but a place that I have actually done it is create table users. And here we're gonna say, you know, imagine a bunch of stuff and then we say email is text. And then we can say email_domain. And this might be really, really useful because the domain is a part of that email address, right? And so if you wanna search on domain or group on domain, you're gonna have to do some extraction or some left hand wild cards or something like that to search on domain alone. And maybe you block certain domains or maybe you wanna group by corporate domains versus free domains.

Whatever you wanna do, you can extract the domain part by doing something like text GENERATED ALWAYS AS, and then we're gonna put our function in there, which we need to figure out, ALWAYS AS () STORED. And then let's figure out what our actual function is. Uf I did select my email, tryhardstudios. Please, I mean, I guess you can email me, but don't, you know, don't send me any spam or anything. Okay, so that's good there. And then we're gonna use split_part and the delimiter that we're gonna split on is the at sign and then 2 because it is 1 indexed. So if we did 1, it would give us Aaron. I want 2 for tryhardstudios. Then we can just yoink that and come back up here, remove the hard code and say, we'll take that and make that an email. So now, with any luck, we can say insert into users (email) values ('[email protected]'). Gotta do the right table name and select * from users. There we go. So that is actually a pretty useful use case for extracting pieces of text out. And now I have my email domain that I can filter on, I can group on, I can add an index on, all kinds of good stuff.

The restrictions on generated columns are, in fact, many, but you'll probably only run into a few of the restrictions. One is you have to reference the current row. You can't reference other rows, other tables, subqueries, anything like that. The other is your function, or the way in which the generated column is populated must be deterministic. You can't use any volatile functions, which means random, current time, random UUID, anything like that. Given an input, it must always produce the same output.

The other restriction that you might run into is you can't reference another generated column. Unlike in Excel, you can't just have formulas that just reference each other over and over and over, all the way back. It all has to be self-contained. A generated column can't reference another generated column. We're gonna use these... I am not ashamed of generated columns. We're gonna use these over and over and over again because they're super powerful, both for papering over a messy data model and for papering over messy data. It's not always in your control.

Sometimes the data is messy, sometimes the access patterns change and you need to extract a piece. This can be great for extracting parts of JSON, which we'll look at later. But I just wanted to give you an intro to generated columns because we'll be using these throughout the rest of the course.

Full Course

PostgreSQL database platform

Summary

Video Transcript