The most comprehensive course on PostgreSQL

Learn how PostgreSQL handles character sets and collations, with UTF-8 encoding as the default for storing text, supporting a wide range of characters, including emojis. The video explains how collations define rules for character comparisons and how to customize them for specific use cases. While customizing encodings and collations can be helpful, the default settings are often sufficient, especially for English-language applications.

When it comes to character sets collations of your text content in Postgres, they set you up pretty great. You should be fine with the defaults, however, if you end up fighting with character sets collations in a back alley, I do want to arm you with a little bit of knowledge that hopefully, you can find your way out. If we are here on the command line we run \l you'll see the list of databases in coding collation. They all are set to UTF-8, in this case, en US UTF-8.

Let's pause there consider what is a character set or an encoding what is a collation? The encoding, which you can also call a character set you might see it referred to as a charset or charset. That is how we go from a character on the screen to bytes written onto the disk. UTF-8 is a multi byte encoding. It's between one four bytes. It supports accented characters, uppercase, lowercase. It supports the entire range of Unicode it includes emojis. UTF-8 is likely what you are looking for. It is extremely versatile. It can store anything you won't end up with those weird question mark boxes when you try to display the emojis back. That's what a character set is that just defines what the legal characters are.

There are other character sets, there are a lot of other ones. There are other encodings as you could call them, they define what their legal characters are. It's not always the entire breadth of Unicode. You might run into a situation where if you are using a different encoding you try to insert something with an accented character or a foreign language character or an emoji, you might get an error because it's simply not representable in that encoding. The second thing is this collation. If a character set or an encoding defines what the legal characters are, a collation is simply a set of rules that defines how those characters relate to each other. In some cases you might be comparing a lowercase A to an uppercase A or a lowercase B to an uppercase B or an E to an E with an accent on it.

Are those the same thing? Well, it depends. Depends on the collation. We'll take a look here in a second to show you what en US UTF-8 actually does, how it compares those things. A collation is simply a set of rules that defines how the characters are related to each other. When you create your database, you create it with an encoding a collation. These are the defaults I would encourage you to just use those move on. They're very good. Let me show you a few things about how these things compare to each other. Before we do that, what we're looking at here, this is the server encoding this the encoding of each particular database as it exists on the server. Your client, which you are connecting to Postgres, your client also has an encoding. If you look at this show client encoding, you'll see our client encoding is in fact UTF-8.

If you change your client encoding, which is possible, Postgres will attempt to convert the server encoded data to your client encoding. That is not always possible because the overlap of encodings is not a perfect circle. There are some characters that can be represented in one encoding that cannot be represented in another encoding. If you are connecting to a database that has some other encoding you set yours to UTF-8, it will attempt to do that conversion for you.

Hopefully your server encoding your client encoding or on the same page you don't have to do any conversions there. Okay, back here in TablePlus, we're going to compare lowercase A, B, C to uppercase A, B, C, explicitly using the ENS UTF-8 collation. If we run this, you'll see that those two things are not equal in fact, because the UTF-8 en US collation is case sensitive, just to make sure we're not crazy., little, A, B, C is equal to little A, B, C, even one big character throws us off. We can create our own collation it's kind of cool.

I'll show you how. If you're trying to, for example, search a column for an email your input string is lowercase, there are maybe potentially better ways to do that than creating your own collation. One would be downcasing it on input. If that is not viable, we can create a generated column that is downcased, or we could create a functional index that uses lowercase email. We will go over all of that. I just want to short circuit that in case you're thinking I'm going to create my own collation use that for everything. That may be totally fine. If all you're after is a lowercase index assisted lookup, we do have a better method for that, which we'll go over. Just for fun, we're going to create our own collation right here. While this is fun, it might prove very useful to you in the future, should you need to create your own collation.

First thing we're going to do is we're going to say that the provider is ICU, which I think is international components for Unicode. Your Postgres hopefully will be built with this. If not, then you cannot use this provider. This is pretty standard, you should be okay. The next thing we're going to do is we're going to pass through this super esoteric string here. We are saying let's start with en US then Unicode. Then level one is the bassist level, it's the least sensitive level. I will leave a link to the documentation where all of this is listed out because this is a pretty magical string that looks like it's just made up, I promise there is a reason behind it. Then deterministic is false because when you have things that are case insensitive accent insensitive, it's possible that those could end up in any order because the comparison could be totally equal. We're going to run this, we're going to create this collation, come back here, hop over here, take that guy out run that.

You'll see those two things do equal out. You can create your own collations to meet your own needs. If you are just doing it for case insensitive searching, you do have the I LIKE operator, which can be good, can be bad, may not be as index assisted as we want. We also have generated columns functional indexes, we will look at all of that.

Finally, I do want to say you can create, your database has a, a default collation a default charset or encoding. You can create a column, a specific text column with its own collation or own encoding. If that meets your needs, meets your use case, that is an option for you. I have found that most of my work, UTF-8 is fine. En US UTF-8 is fine. I will say I speak English. All the apps I have written are in English. It might be different if you're building an app for a different language. You might want to change the collation to French or German or Dutch or something like that. That is totally an option that you can use.

Full Course

PostgreSQL database platform

Summary

Links

Video Transcript