The course "Mastering Postgres" helped me practice query commands and understand why they're important. Inspiring me to explore the 'why' and 'how' behind each one. Truly a great course!Nakolus
Shorten dev cycles with branching and zero-downtime schema migrations.
A materialized view is almost identical to a view except for the word materialized, which makes a pretty big difference. Thinking back to just a regular view, it's kind of just like a named query. It's a very nice experience. We get to write this complicated query, give it a name, and then reference it as that name from then on.
A materialized view is similar except that, you write that, you know, whatever that complicated query and give it a name, but that data is written to the disc as if it's kind of a cache. You have trade-offs here. One is not better than the other. There are different use cases for each. A regular view is always going to be up to date. Always, always, always. Because it is calculated at runtime. That underlying query is running at runtime.
A materialized view is going to get out of date because that data is written to a temporary table, more or less, written to disc at the time that the view is created and at the time that the view is refreshed, which you can do manually. You create the materialized view and then over time, that data becomes stale, but that might be okay. The reason that might be okay is because that query might be terribly expensive. Maybe it's something like we're rolling up some stats and we want to run it once per day for the previous month, or we want to run it at the beginning of the month for the previous month, something like that. We don't actually need it up to the second, we just need it up to maybe midnight of last night. Having a materialized view fits perfectly there because then our read performance is incredibly good because we're not running that underlying expensive query.
Instead, we're just reading from that materialized view. We're gonna do exactly this on the bookmarks. We're gonna roll 'em up by date and we're gonna create a materialized view out of that. Here's what I want to do. I want to do select * from bookmarks limit 10. I threw a random date in here just as a saved on date. What we're going to do is... This is a very common pattern where you have a bunch of data and you roll up a big part of the historical data and then you keep calculating the live data live. That way we don't have to continue to process super old data, so we can do something like this.
Let's do saved on and we'll do... We'll just say, count(*) from bookmarks limit 10. We'll just say group by saved_on. If we run that, we get a little bit of historical analysis here. You can see, we don't need to be calculating these every single time. This query is not terribly expensive, but many statistical or analytical queries are terribly expensive. You're calculating historical data over and over and over, which honestly it's historical. Why are you doing that? It doesn't make a lot of sense. We can do it like this, let's do from bookmarks where, and then we'll say, saved_on < CURRENT DATE - INTERVAL '1' DAY I'm just, because I like to, I'm gonna wrap that up like that. Now if we check that and we come this way, so it is before 10, 09. Everything from 10, 08 all the way back, we have rolled up.
What I want to do is I want to roll that up into a materialized view. We can do that once a day. We can do that once a day, once, you know, early in the morning when nobody's online. It doesn't super matter. In this case, we're imagining that the read path is very, very hot, and we need to pull all this data for our dashboard over and over and over again and it makes no sense to calculate it that often. I'm gonna show you how we can roll this up and then how we can use the CTE to get all of the data.
Let's make it a view. We're in view land, let's make it a view. All right, let's go back to this. And what we're gonna do is we'll just throw a semicolon there and we're gonna say, create view historical... Let's do bookmarks_rollup. We're rolling up these bookmarks as, and then we're going to say, we're gonna bring this guy, come in here and put it right there. That would be the view.
Let's just go ahead and do that. We're gonna create that. I'll just prove that that worked by doing select * from bookmarks_rollup, bookmarks_rollup, Select * with a semicolon from bookmarks_rollup. There we go, there's our information. But again, if we do the explain here, it's doing all of that, right? Instead of that, we are going to drop view bookmarks_rollup.
We're gonna drop that view and then we're gonna create a materialized view called bookmarks_rollup. Now, if we explain bookmarks_rollup, you'll see it's a sequential scan on that table. We could add an index to the bookmarks rollup, it's not important right now, but we could add an index to a materialized view that's totally legit and oftentimes a good idea. But you see, we still get our data. It's super duper fast. Look at that. We still get our data and it is no longer doing the actual rollup, it's just reading the data out.
Now that we have this, we have this data kind of cached. We can do, we just did this, but we'll do it again. Start from bookmarks_rollup.
Now what if we want today's data as well, right? We have this nice rich data stored away, but on our dashboard, we want some live data. Everybody loves live data. We want to see how many bookmarks are people making today? Because somehow we make money off of bookmarks and we want more bookmarks. We can do that with a regular view on top of the materialized view.
Let's take a look again at our materialized view as bookmarks_rollup. We have that there. I'm just gonna go ahead and write it out to start before we create another view. This is the definition with which we declared bookmarks_rollup. This is the exact same thing. This doesn't make a lot of sense. But what we're going to do is we're going to come down here and we're gonna flip this to a greater than or equals.
We have absolutely everything covered because the roll-up covers less than, and what we're gonna do here in the regular view is gonna cover the greater than or equal. If we run that, we should see, yep, there are the last two days and I'm fine. I'm fine with leaving two days instead of just one day. I like that. That covers over a whole set of timing issues.We're gonna do the last two days live and then the rest of that is going to be historic. Now that we have this, this gives us all of the rollup data that we need. This gives us all of the statistical data that we could possibly want. But 98% of it has been written to that materialized view.
Let's in fact do this. Let's do alter view bookmarks_rollup rename to bookmarks_rollup_historic. Yeah, that's fine. I like putting the table name first bookmarks and then rollup and historic is fine. We're gonna rename that bookmarks_rollup is not a view. Alter materialized view. We're gonna rename it and so now that should fail. We have... There we go, so now we're back to where we need to be. Great.
What I'm gonna do now is say, create bookmarks, we kind of put the word create view bookmarks_rollup as, and we're gonna take that materialized view and union it with this query and so now we can run that and now we can do select * from bookmarks_rollup, and we have all of our data, but we're not calculating all of it. If we look at this, we are doing scan on bookmarks, a bunch of aggregates, gather, bookmarks_rollup, there you go. That's when it finally does the last one, that takes the materialized view and puts it together with all of this stuff that it's doing the grouping and sorting on the actual underlying bookmarks table. The majority of the work has already been done, right? The majority of the work has been done written to disc. Then we take the last few stragglers, calculate them live, put all of that together and wrap it up into a nice view.
The last thing we need to do is we need to periodically refresh our materialized view. We can do that by just saying, refresh materialized view bookmarks_rollup, I think we called it historic or historical. There we go. The view has been refreshed. In this case, it didn't take very long. You can imagine if this query is very expensive, that might take a maybe up to a minute, maybe up to several minutes to refresh that materialized view. You incur that penalty one time, at a time of your choosing, whenever you decide to refresh the materialized view, that's when you incur the penalty. After that the reads are as fast as they can be because it's reading calculated data, it's not running the underlying query.
You can refresh this materialized view concurrently. There are some constraints on that. You do have to have a unique index and the unique index has a few rules around it. But if you refresh it concurrently, it won't lock out concurrent selects on this view. That is an option to you depending on whatever your setup is. Let's do a quick overview of materialized views. They're just like views except that they are written to the disc, instead of being calculated at runtime. They can be very helpful for basically creating a cache, for rolling up some expensive data, for doing a very expensive query, for doing a query that takes a long time but doesn't need live data. That's very important. Because if you need live data you gotta go back to a view or CTE or something like that.
If you are okay with data that is potentially stale and just like the pattern that we did here, you can union it with data that is live, that's a great use case for a materialized view because you get huge read performance improvements while incurring a right performance hit just one time at the time of your choosing, which is very nice.