Mastering Postgres

In this video, I explain how I use OpenAI to create embeddings from articles I wrote on my website, aaronfrancis.com. By sending the text to OpenAI, I can calculate a representation of the content, store it in a database, and find similar articles to keep readers engaged. This process, demonstrated with PHP and Laravel, showcases how powerful and versatile embeddings can be for various applications, including text and images.

This is going to be a fun one. In this one, we are going to take the text of articles that I have written on my website aaronfrancis.com, very good website. We're going to take the text and we're going to pass it through to OpenAI and we're going to get some embeddings back so we can see a real-world scenario. We can see what these things actually look like, instead of just making it up from scratch. Now, the code that you're going to see me write is in the language PHP in a framework called Laravel. That's totally irrelevant. I happen to love Laravel, I find it to be very productive, but it doesn't matter. You can create these embeddings in any language, any framework, OpenAI, any model, whatever you want to do. This just happens to be the implementation that I'm using. Please, if you don't like PHP, that's fine, but don't tune out.

We are going to do some very interesting stuff here. Here's the code that I have written already. We'll go over it quickly. Article is a model, so we're using an ORM here, and I'm looking for all the ones where I have the Markdown file. I've written some articles that live on external sites and some that live on my site, and I'm just looking for the ones that are on my site because I have the text for that. Then we're going to each over them and do some stuff. The stuff that we're doing is we're just getting the contents of the article. It has some front matter and so I'm going to explode the front matter off. By the way, explode is how you split, in PHP, we, instead of saying string split, we say explode 'cause that's super metal. Then we're going to pass the contents to OpenAI and we're going to say, let's use the text embedding model.

Let me just show you a few things here. Here's the contents of the article. If we clear and we run embed, we'll see, look, there's a bunch of contents. This looks like an article about SQLite, being weekly typed. That's the contents of the article. Now if we switch back here, let's see what the response looks like. We're going to send that text off to OpenAI and say, please create some embeddings out of it, and let's take a look and see what that comes back as a bunch of meta information. That is the thing that I have been telling you about. Look how wonky that is. I do not understand this. I'll be totally honest with you, I do not understand how they do this. It is incredibly cool to me. We're getting distracted on how cool that is. We need to keep moving on Postgres.

Okay, so scrolling to the very top, let's find the structure. It looks like the response and then the embeddings is an array. We want the first embedding, embedding. Let's see if we can remember that. , the response, so if we do response and then embeddings and then we'll take the first one and then we'll do embedding and see if that gets us the actual. This video's going to cost me many, many pennies. We're hitting the open API, OpenAI API several times. , now let's see how many dimensions they give us back. There's a hint. If we count up the dimensions, we'll see we get 1536 dimensions.

Remember in a previous video I created a sample table with four dimensions. This is more realistic. We're going to do 1536 dimensions 'cause that's what OpenAI gives us back. Let's hop back into table plus, create a proper table to hold this stuff. Then let's create embeddings for all of my articles and start doing a little searching on it. We're just going to start with the same schema, but we're going to modify it. ID is still fine. Let's call this slug. We have unique slugs, and this will allow us to do interesting on conflict stuff later, but we're not going to do that right now. Then we have a title of text and then the embedding is 1536. That's how many dimensions we have. That looks good to me. Now let's hop back into the code.

What we're going to do, let's see, we're going to do a few things here. We've already looked at the contents, we've already gotten the embedding. Now what we're going to do is we're actually reading from one database and then we're writing to this demo database. That's kind of an implementation detail. That doesn't super matter here, but it is kind of interesting. We are going to insert into articles, we will do title, slug and embedding values one, two, and three. Then we are going to bind those in. The first one is article title. The slug in the old database is actually article ID. , that is the slug. Then the embedding is a little bit interesting because it needs to be shaped like this, you know, 1, 2, 3, 4, 5, something like that. What we're going to do is we are going to open up those guys, do a little concatenation and then the super metal implode with commas of the embedding. Then just for giggles, this line inserted will add a space article, title. Tthat can give us a little bit of progress tracking. If we come over here and we run embed hopefully, there you go. Look at that. Cool. This gives you a little bit of sense of what I write about a lot of databases and a lot of Laravel. This is all running right now.

What we're doing is we're inserting all of those vectors into the database. Costing me many few cents to do this. It's incredibly cheap to create these embeddings. Now we can take a single article and find which ones are related to that article. , I don't know about y'all, but this is already terribly interesting to me. Let's do select * from articles and we'll read that back. We'll see, we have this just crazy, that makes no sense to me. We've got all of the articles in here.

Now what if we want to find an article that is similar to another article? Let's imagine that the user is on a blog post called Year in Review 2023. That has an ID of 18. What is the query? What is the query that we are going to write probably on the application side that says, hey, the user is on this article, let's feed them more articles so that they keep reading and stay on our website forever. Well, we can absolutely do that with embeddings. Let's do it this way. Let's say from articles where, so now we're trying to find the related ones. We're going to start by saying where ID is not equal to 18. , we don't want to show the article that we're on. That makes no sense at all. Then we can say order by embedding. Then we'll do our L2 operator here. Then what we're going to do is we're going to say select embedding from articles where ID equals 18. What this is going to do is this is going to pull that embedding from the year in review 2023 and it's going to select all of the articles ordering by the similarity to article 18. If we run that, yes, that is what we want to see. The most similar article is year in review 2022. It works.

Let's do a soft one. I write some like hard technical topics and some soft ones. , let's say we want to find articles that are like this one called Reputation as Portable. If we do that, we'll see, look at that. This doesn't warrant a blog post. You're always doing something wrong. An argument for logging off. These are all soft articles in the sense that they're not hard technical topics. , this seems to be working. Let's pick a hard technical one. We'll do select * from articles and let's look for a, let's finding missing orders with a recursive CTE, I do love recursive CTEs. We'll say where it's not nine, but it is similar to nine. Look at that. We get a bunch of database articles. MySQL, MySQL, SQLite, Laravel, Laravel. Yep, this works great. This works great and I freaking love it.

What we've done here is we have calculated the mathematical representation of the text of my articles and then we've plugged it into the database and we've said, given this mathematical array, this array of floating point numbers, calculate the L2 distance between all of the other articles and then order it by the ones that are closest. We've just from scratch in a few minutes created a related articles, similar products, that sort of thing. It doesn't just have to be text, you can do it on images as well, which is kind of wild. You can see how simple and how powerful this is. I want to show you in the next video that on conflict thing that I talked about, which we're going to leverage the uniqueness of the slug to accomplish that.

Full Course

PostgreSQL database platform

Summary

Video Transcript