Mastering Postgres
Introduction
Introduction to the course
Overview of course structure
Postgres vs. everyone
The psql CLI
Data Types
Introduction to schema
Integers
Numeric
Floating point
Storing money
NaNs and infinity
Casting types
Characters types
Check constraints
Domain types
Charsets and collations
Binary data
UUIDs
Boolean
Enums
Timestamps
Timezones
Dates and times
Advanced Data Types
Intervals
Serial type
Sequences
Identity
Network and mac addresses
JSON
Arrays
Generated columns
Text search types
Bit string
Ranges
Composite types
Nulls
Unique constraints
Exclusion constraints
Foreign key constraints
Indexing
Introduction to indexes
Heaps and CTIDs
B-Tree overview
Primary keys vs. secondary indexes
Primary key types
Where to add indexes
Index selectivity
Composite indexes
Composite range
Combining multiple indexes
Covering indexes
Partial indexes
Index ordering
Ordering nulls in indexes
Advanced Indexing
Functional indexes
Duplicate indexes
Hash indexes
Naming indexes
Understanding Query Plans
Introduction to explain
Explain structure
Scan nodes
Costs and rows
Explain analyze
Generating Results
Introduction to queries
Cross joins
Inner joins
Outer joins
Subqueries
Lateral joins
ROWS FROM
Filling gaps in sequences
Subquery elimination
Combining queries
Set generating functions
Indexing joins
Advanced SQL
Introduction to advanced SQL
Grouping
Grouping sets, rollups, cubes
Window functions
CTEs
CTEs with window functions
Recursive CTE
Hierarchical recursive CTE
Handling nulls
Row value syntax
Views
Materialized views
Removing duplicate rows
Upsert
Returning keyword
COALESCE + generated column
Full Text Search
Introduction to full text search
Searching with LIKE
Vectors, queries, and ranks
Websearch
Ranking
Indexing full text search
Highlighting
Advanced JSON
Intro to JSON
JSON vs JSONB
Validating JSON
Creating JSON objects + arrays
JSON extraction
JSON containment
JSON existence
JSON recordset
Updating JSON
Indexing JSON parts
GIN index
Vectors (pgvector)
Intro to pgvector
Vector embedding columns
Find related articles
Upsert vector embedding
Semantic search
Other operators
Vector indexes
Outro
Thank you
Bonus interviews
Heroku's glory days & Postgres vs the world (with Craig Kerstiens)
Creating a Postgres platform with Monica & Tudor from Xata.io
Bootstrapping an email service provider (with Jesse Hanley)
Data Types: Charsets and collations



Test Data

I've made my test data available for you to use and follow along.


Summary

Learn how PostgreSQL handles character sets and collations, with UTF-8 encoding as the default for storing text, supporting a wide range of characters, including emojis. The video explains how collations define rules for character comparisons and how to customize them for specific use cases. While customizing encodings and collations can be helpful, the default settings are often sufficient, especially for English-language applications.

Links

Collation documentation

Video Transcript

When it comes to the character sets and collations of your text content, Postgres sets you up pretty great. You should be fine with the defaults. However, if you ever end up fighting with character sets and collations in a back alley, I do want to arm you with a little bit of knowledge so that hopefully you can find your way out. If we're here on the command line and we run \l, you'll see the list of databases with their encoding and collation. They're all set to UTF-8, in this case en_US.UTF-8.

Let's pause there and consider: what is a character set (or an encoding), and what is a collation? The encoding, which you can also call a character set (you might see it referred to as a charset or char set), is how we go from a character on the screen to bytes written onto the disk. UTF-8 is a multi-byte encoding; each character takes between one and four bytes. It supports accented characters, uppercase, lowercase, the entire range of Unicode, including emojis. UTF-8 is likely what you're looking for. It's extremely versatile: it can store anything, and you won't end up with those weird question-mark boxes when you try to display the emojis back. That's what a character set is: it defines what the legal characters are.
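
One quick way to see the multi-byte nature of UTF-8 for yourself (a small sketch, assuming a UTF-8 server encoding) is to compare character length with byte length:

```sql
-- length() counts characters; octet_length() counts bytes on disk.
-- In UTF-8: 'a' is 1 byte, 'é' is 2 bytes, '😀' is 4 bytes.
SELECT length('aé😀'), octet_length('aé😀');
-- length = 3, octet_length = 7
```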

There are a lot of other character sets (or encodings, as you could call them), and each one defines its own legal characters. It's not always the entire breadth of Unicode. You might run into a situation where, if you're using a different encoding and you try to insert something with an accented character, a foreign-language character, or an emoji, you get an error because it's simply not representable in that encoding. The second thing is the collation. If a character set or encoding defines what the legal characters are, a collation is simply a set of rules that defines how those characters relate to each other. In some cases you might be comparing a lowercase a to an uppercase A, or a lowercase b to an uppercase B, or an e to an é.

Are those the same thing? Well, it depends. It depends on the collation. We'll take a look here in a second to show you what en_US.UTF-8 actually does and how it compares those things. A collation is simply a set of rules that defines how characters relate to each other. When you create your database, you create it with an encoding and a collation. These are the defaults, and I would encourage you to just use those and move on; they're very good. Let me show you a few things about how these compare to each other. Before we do that: what we're looking at here is the server encoding, the encoding of each particular database as it exists on the server. Your client, with which you're connecting to Postgres, also has an encoding. If you run show client_encoding, you'll see our client encoding is in fact UTF-8.

If you change your client encoding, which is possible, Postgres will attempt to convert the server-encoded data to your client encoding. That's not always possible, because the overlap between encodings is not a perfect circle. There are some characters that can be represented in one encoding but not in another. If you're connecting to a database that has some other encoding and you set yours to UTF-8, it will attempt to do that conversion for you.
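
As a rough sketch, inspecting and changing the session's client encoding looks like this (the conversion error only occurs if you actually select a character the target encoding can't represent):

```sql
-- The database's own encoding, and what this session receives text as.
SHOW server_encoding;
SHOW client_encoding;

-- Ask Postgres to convert outgoing text to Latin-1 for this session.
SET client_encoding TO 'LATIN1';

-- Selecting a character Latin-1 can't represent (like an emoji)
-- now raises a "character has no equivalent" conversion error.
```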

Hopefully your server encoding and your client encoding are on the same page and you don't have to do any conversions there. Okay, back here in TablePlus, we're going to compare lowercase a, b, c to uppercase A, B, C, explicitly using the en_US.UTF-8 collation. If we run this, you'll see that those two things are not equal, because the en_US.UTF-8 collation is case sensitive. Just to make sure we're not crazy: little a, b, c is equal to little a, b, c, and even one big character throws us off. We can create our own collation, which is kind of cool.
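
The comparisons being run look roughly like this (a sketch; the exact collation name, here "en_US.utf8", depends on your operating system and Postgres build):

```sql
-- Case-sensitive under the default en_US collation: returns false.
SELECT 'abc' = 'ABC' COLLATE "en_US.utf8";

-- Identical strings: returns true.
SELECT 'abc' = 'abc' COLLATE "en_US.utf8";

-- Even one capital letter breaks equality: returns false.
SELECT 'abc' = 'Abc' COLLATE "en_US.utf8";
```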

I'll show you how. If you're trying to, for example, search a column for an email and your input string is lowercase, there are potentially better ways to do that than creating your own collation. One would be downcasing it on input. If that's not viable, we can create a generated column that is downcased, or we can create a functional index that uses lowercase email. We'll go over all of that. I just want to short-circuit that in case you're thinking, "I'm going to create my own collation and use that for everything." That may be totally fine, but if all you're after is a lowercase, index-assisted lookup, we do have a better method for that, which we'll go over. Just for fun, we're going to create our own collation right here. While this is fun, it might also prove very useful to you in the future, should you need to create your own collation.
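
To make those two alternatives concrete, here's a minimal sketch (the users table and column names are hypothetical, purely for illustration):

```sql
-- Option 1: a generated column that stores the downcased email.
CREATE TABLE users (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email text NOT NULL,
  email_lower text GENERATED ALWAYS AS (lower(email)) STORED
);

-- Option 2: a functional index, no extra column needed.
CREATE INDEX users_email_lower_idx ON users (lower(email));

-- A query using the same expression can be served by that index.
SELECT * FROM users WHERE lower(email) = lower('Someone@Example.com');
```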

The first thing we're going to do is say that the provider is ICU, which I think stands for International Components for Unicode. Your Postgres hopefully will be built with this; if not, you cannot use this provider. This is pretty standard, so you should be okay. The next thing we're going to do is pass through this super esoteric locale string. We're saying: start with en-US, then Unicode, then level 1, which is the base level, the least sensitive level. I'll leave a link to the documentation where all of this is listed out, because this is a pretty magical string that looks like it's just made up; I promise there is a reason behind it. Then deterministic is false, because when you have things that are case insensitive and accent insensitive, it's possible that equal-comparing values could end up in any order. We're going to run this to create the collation, come back here, hop over here, take that guy out, and run that.
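
Put together, the statement being described looks roughly like this (the collation name case_insensitive is just a label; the locale string follows the ICU syntax from the Postgres collation docs):

```sql
-- en-US base locale; "u-ks-level1" sets primary strength, so case
-- (and accent) differences are ignored when comparing.
CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'en-US-u-ks-level1',
  deterministic = false
);

-- Re-running the earlier comparison with the new collation: returns true.
SELECT 'abc' = 'ABC' COLLATE case_insensitive;
```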

You'll see those two things do equal out. So you can create your own collations to meet your own needs. If you're just doing it for case-insensitive searching, you do have the ILIKE operator, which can be good or bad, and may not be as index-assisted as we'd want. We also have generated columns and functional indexes; we'll look at all of that.

Finally, I do want to say: your database has a default collation and a default charset or encoding, but you can also give a specific text column its own collation. If that meets your needs and your use case, it's an option for you. I've found that for most of my work, UTF-8 is fine, and en_US.UTF-8 is fine. I will say: I speak English, and all the apps I've written are in English. It might be different if you're building an app for another language; you might want to change the collation to French or German or Dutch or something like that. That's totally an option you can use.
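
A per-column collation looks like this (a sketch; the table and column names are illustrative, and "de-x-icu" assumes an ICU-enabled build with that collation available):

```sql
-- One column with a German collation; everything else in the
-- database keeps the default.
CREATE TABLE products (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name_de text COLLATE "de-x-icu"
);

-- Sorting on that column uses German comparison rules.
SELECT name_de FROM products ORDER BY name_de;
```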