We've talked about storing strings in the database along with their character set and their collation; there's a lot of information wrapped up in a string. If we strip all that away, we can just store binary data.
We can just store the bits and bytes in the database. Sometimes that's a great idea. A lot of times it's a terrible idea. The column we're going to look at is called the bytea column. It's a variable-size column, and you can store up to one gigabyte of data in it. I would not recommend doing that. When you do store large amounts of data, it gets TOASTed. Remember TOAST, The Oversized-Attribute Storage Technique? It's the thing where Postgres leaves a pointer in the row and breaks the value out into the TOAST table. If you do end up storing a large amount of data in a single bytea column, it's not all going to be in there clogging up your rows. Your rows will still be very compact, which is good for speed, and when you need that large attribute, Postgres goes out to the oversized-attribute storage area. Still, I would not put huge amounts of data in a bytea column. I simply wouldn't. If you need to store files, I would use a file store. If you're using something like Xata or maybe one of the other providers, you can put up to a five-gigabyte file in there, and it transparently moves that out into file storage and uses a CDN.
If you're not using one of the providers that does that for you, what I would do is put the files on a file store, something like Amazon S3 or an S3-compatible store (B2, R2, Tigris, there are a lot of S3-compatible stores), and leave a pointer, i.e., the file name or URL, in your database. That's how I'd store files. One of the great things about storing very small files in the database is consistency. You have strong consistency guarantees, whereas you might delete the file off of S3 but leave the pointer, i.e., the URL, in your database, and then you're kinda hosed because you're out of sync. You would have to invent some of that consistency yourself if you're going to store large files out on S3.
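As a rough sketch, the pointer pattern described above might look like this (the table and column names here are made up for illustration):

```sql
-- The file itself lives on S3 (or an S3-compatible store);
-- the database keeps only a pointer to it.
CREATE TABLE uploads (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  file_name TEXT NOT NULL,
  -- e.g. https://my-bucket.s3.amazonaws.com/uploads/abc123
  file_url TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

The trade-off is exactly the consistency issue mentioned above: the row and the object on S3 can drift apart, and keeping them in sync is on you.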
Again, if you're using Xata or one of the providers, they handle all of that for you. If you're going to store pretty small files, maybe something like small profile photos, you could potentially put those right in the database. But storing files in the database? You would have to convince me. You'd have to convince me that's a good idea.
There are other use cases for storing raw data, and sometimes that comes down to checksums. If you store a checksum, it becomes very fast to compare and look up, because you don't actually care about the character collation or the encoding or anything. You're literally just checking a sequence of bytes. Those are all of my opinions. Storing files in the database has never made a ton of sense to me. It can make sense if the files are very small, and we can store other data in there too.
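A minimal sketch of that checksum use case, with a hypothetical file_checksums table:

```sql
-- Comparing checksums stored as bytea is pure byte equality;
-- no collation or encoding rules are involved.
CREATE TABLE file_checksums (
  file_name TEXT PRIMARY KEY,
  checksum  BYTEA NOT NULL
);

-- Store md5('hello world') as raw bytes
INSERT INTO file_checksums
VALUES ('hello.txt', decode(md5('hello world'), 'hex'));

-- Byte-for-byte equality lookup
SELECT file_name
FROM file_checksums
WHERE checksum = decode(md5('hello world'), 'hex');
```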
Enough of that. Let's take a look at the actual table. Here's the beginning of our table. We've got file_name as TEXT, and data we're just going to make bytea, which is your variable-length binary column. We'll come down here and insert a row: file_name is 'hello.txt', and data is "hello world" represented in hex, which is what that \x prefix is indicating. If we run select * from bytea_example, the client says, "Hey, you've got 11 bytes of binary data. I don't know what to do here." Let's switch back over here. If we run that again, here's the actual data we put in, which is kinda nice. If we do show bytea_output, this is how you can control what the client gives back to you. In our case, bytea_output is set to hex. This is the default, this is the standard, and this is the one you should be using.
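Putting the on-screen steps together, the table and insert look roughly like this ('hello world' in hex is 68656c6c6f20776f726c64):

```sql
CREATE TABLE bytea_example (
  file_name TEXT,
  data BYTEA
);

-- The \x prefix marks hex input: 'hello world' is 11 bytes.
INSERT INTO bytea_example (file_name, data)
VALUES ('hello.txt', '\x68656c6c6f20776f726c64');

SELECT * FROM bytea_example;
-- data: \x68656c6c6f20776f726c64 (with bytea_output = 'hex')
```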
There is an older format you can use, escape, which can be kinda nice when you're a human, which I assume you are. It will show the data to you in something more readable. Remember, not all bytes convert cleanly to characters. You may be storing raw bytes that were never characters in the first place, and escape is going to kinda muck those up; there will be a bunch of actual escaped sequences in there. That can be helpful, but I would recommend sticking to hex and making sure your toolchain works all the way through. We talked at the beginning about doing digests, MD5s, hashes.
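You can flip the output format per session to see the difference, using the bytea_example table from the video:

```sql
SHOW bytea_output;               -- hex (the default)

SET bytea_output = 'escape';
SELECT data FROM bytea_example;  -- printable bytes show as characters;
                                 -- the rest come back as octal escapes

SET bytea_output = 'hex';        -- back to the default
```

Note this only changes how values are displayed to the client; the bytes on disk are the same either way.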
I want to show you that very quickly. Postgres has an md5 function. We can say select md5('hello world'), and there's a good chance some of you have this MD5 memorized. You've seen it before. Pay attention. Switch back to the tab where you're watching me. Pay very close attention: MD5 is not secure. Not secure, not secure, not secure. Do not use MD5 for anything where you need security, cryptographic security, passwords, anything like that. It's super fast and super broken. Super fast is great for digests and hashes where you just need strict equality across a potentially big piece of data: compress it down using MD5, then do strict equality lookups. MD5 is still great for that. If you think you'll get dinged on a security audit for that, even just using it as a hashing function, you could potentially get dinged, and then you would need to switch to something like SHA-256, which also exists. Let's carry on.
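For reference, here is what the two functions return (sha256 takes and returns bytea, and has been built in since Postgres 11):

```sql
SELECT md5('hello world');
-- 5eb63bbbe01eeed093cb22bb8f5acdc3

SELECT sha256('hello world'::bytea);
-- \xb94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9
```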
With that caveat aside, if we did pg_column, nope, pg_typeof of md5, we get text, which is what we saw. However, what if we did this: pg_typeof of sha256? Bytea. A little lovable quirk here. Every database has 'em. md5 comes back as text, sha256 comes back as a bytea. If you wanted to make your MD5 a bytea, you could decode that hex, and then we'll just select that to see what it looks like as binary. That looks close. If we copy that, get rid of that, and come back up here, we are in bytea territory again. That's good. We've taken an MD5 hash and converted it into raw bytes, because at this point we do not care about the characters.
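The type quirk and the decode step from above, spelled out:

```sql
SELECT pg_typeof(md5('hello world'));            -- text
SELECT pg_typeof(sha256('hello world'::bytea));  -- bytea

-- decode() turns the hex string into raw bytes
SELECT decode(md5('hello world'), 'hex');
-- \x5eb63bbbe01eeed093cb22bb8f5acdc3
```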
What we want is a strict equality comparison between two hashes. I don't care what they represent in human terms, just tell me if they're equal or not. pg_column_size, I think this is where I was going the first time: 20 bytes. You may be thinking, well, you're probably not, but you may be thinking, "I thought that would be 16 bytes." There is a little bit of bytea overhead here. If we compare the column size of the bytea with the column size of just the MD5 string, getting rid of the decode of the hex, the bytea is quite a bit smaller. When you lose all that character information and go down to straight bytes, that is pretty good.
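The size comparison looks like this: 16 hash bytes plus a few bytes of bytea overhead, versus 32 hex characters plus overhead.

```sql
SELECT pg_column_size(decode(md5('hello world'), 'hex'));  -- 20
SELECT pg_column_size(md5('hello world'));                 -- 36
```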
One thing that we might note here is that this is a great way to do strict equality lookups. We can do a hash over a large value, store it as potentially a generated column, which we'll get to, which is very interesting, and decode it so it's just bytes. We've done a lot of great work, right? In this case it's kind of all for naught, because Postgres does have a type that might work better here. If we select md5('hello world') and cast it to a uuid, that actually works.
Let's compare these again. You'll see the uuid type has been cast with these dashes in the middle; that is the uuid format, where the MD5 string does not have them. That's a little wonky. If we look at pg_column_size of the uuid versus the MD5, (snaps fingers) that is what we are after. In fact, if we pull this guy down here as well and take a look at that, you'll see the uuid is the smallest possible representation of an MD5 hash. We will take a look at uuids in the next video, or in a future video at least. This is one way you can store a digest or a hash in 16 bytes rather than 36.
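An MD5 hash is 128 bits, exactly the size of a uuid, so the cast works and drops the storage to a fixed 16 bytes:

```sql
SELECT md5('hello world')::uuid;
-- 5eb63bbb-e01e-eed0-93cb-22bb8f5acdc3

SELECT pg_column_size(md5('hello world')::uuid);  -- 16
SELECT pg_column_size(md5('hello world'));        -- 36
```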