In this video, we're going to get the extension installed and then create a little example table before we move on to some more complicated stuff. Now, the application that I use to run Postgres locally is called Postgres.app. It's great. It's free somehow. I don't know who makes it, but it's awesome. It comes with pgvector already bundled with it. We just have to create the extension in our database. That part is very easy.
If you need to get pgvector yourself from source, I'll leave a link to the GitHub repo down below. There are maybe 10 different ways to get this thing installed depending on what kind of system you're on. If I had to do it from scratch here, I would probably use Homebrew because I am on a Mac, but fortunately I don't have to because it comes bundled already. Because there are so many different systems out there, the docs are just going to be better for that. I'll just leave a link down below. I will show you how to create the extension and then we're going to start by creating a little sample table. The first thing we're going to do is create extension if not exists and the name of the extension is vector. If you run that again, you'll see it's fine. It doesn't matter. It's just going to say, "Hey, that already exists. What are you doing?"
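In SQL, that looks like this. Note that the extension name is `vector`, even though the project is called pgvector:

```sql
-- Enable pgvector; the IF NOT EXISTS guard makes it safe to run repeatedly.
CREATE EXTENSION IF NOT EXISTS vector;
```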
What we can do is we can create a table called products. I added a little V there for Products Vector. We have just a traditional auto incrementing integer and a name. Then the interesting thing is down here, this is the vector. The first thing you'll notice here is we do pass through some sort of dimensionality. This is the number of floating point dimensions that are inside of the vector. Now depending on who you use as your embedding provider, there are ways to say, "Hey, I want 1024 or 768 or 2048," whatever. You can decide how many dimensions you get back. You do need to declare in Postgres how many dimensions you're planning on putting into this particular column. Because I'm going to hand write the embeddings here, we're going to stick with four.
Traditionally you'd have something more like 768 or 1024. We'll look at that in future videos when we get the providers actually giving us embeddings back. For now, I'm not going to hand write that many. Enough yapping, let's go ahead and create that table. We've got products_v here, and we can insert into products_v. We'll insert a name and an embedding. We'll just say product A as the first name, and then we're going to insert an embedding. Now, this is the thing that you would get back from, we'll just say, OpenAI. Whenever I say OpenAI, understand you could get these embeddings from a lot of different places. This is the thing you would get back from the model, and it's going to look like a bunch of floating point numbers that make no sense at all. We're assured that they certainly mean something. Traditionally, you'll have literally hundreds of them.
For this example, I'm just going to write in some regular numbers. We'll insert product A and then we'll insert product B. Let's give product B a pretty different embedding: 10, 11, 12, and 13. How many dimensions did we declare? Four, so we need to make sure each vector only has four values. We've got 1, 2, 3, 4, and 10, 11, 12, 13. Running that and reading it back with select * from products_v, we now have our products and our embeddings, which in theory were provided by some outside model. These embeddings represent some underlying truth, some underlying reality about these products. Maybe you take the product description and all of the tags and the title, hand that off to a model, and get the embeddings back. Or maybe you take the images and hand those off to a model and get the embeddings back. Basically what we're doing here is saying: take this human-readable thing, text, image, whatever, and turn it into a machine-readable representation. Then, based on this machine-readable representation of the underlying reality, we can find all the things that are similar to it.
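The inserts and the read-back query. pgvector accepts vectors as bracketed string literals:

```sql
-- Hand-written stand-ins for embeddings a model would normally return.
INSERT INTO products_v (name, embedding) VALUES
  ('Product A', '[1, 2, 3, 4]'),
  ('Product B', '[10, 11, 12, 13]');

-- Read it back: two rows, each with its 4-dimensional embedding.
SELECT * FROM products_v;
```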
Now with these products in the database, let's see if we can find similar products. We've got product A and product B. Imagine a scenario where you're given some input from the user and you're doing a sort of search. You turn that input into an embedding, either using a local model or one of the providers. Then, to compare these things, you have a couple of different operators.
The first one that we're going to look at is the L2 operator. This is maybe the most common. It computes distance in Euclidean geometry. It's very complicated, it's math that I don't fully understand. There's an L2 operator, an L1 operator, and a couple of others that we'll look at in a minute. This is maybe the best one for plain nearest-neighbor similarity. We'll get to that in a second. What you do is you say, all right, take the user input, run it through a model, get an embedding back, and let's just start here by pretending that embedding came back exactly the same as one we stored, which is not going to be the case in practice. You see that the L2 distance between this embedding and the embedding for product A is zero. It's the exact same embedding, so of course the distance is zero. In reality it's going to look something more like a slightly different vector. That's where it starts to be more useful, where we say, ah, this input is kind of shaped like this thing that we have stored, or it's closest to these couple of things that we have stored.
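In pgvector, the L2 (Euclidean) distance operator is written `<->`. Comparing a query vector against both stored rows looks like this (the query vector here is just an illustrative hand-written value):

```sql
-- <-> is pgvector's L2 / Euclidean distance operator.
-- '[1, 2, 3, 4]' matches Product A exactly, so its distance is 0;
-- Product B's distance is much larger.
SELECT name, embedding <-> '[1, 2, 3, 4]' AS distance
FROM products_v;
```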
You'll see that this distance for product A is much closer than the distance for product B. Meaning product A is most similar to this vector embedding right here. What you would typically do is you would say select from products order by, and we would say order by embedding. This would be the user's search slash a different product's embedding. So user search we talked about, but a different product's embedding, this can drive a related products feature, right? If you're on the page for product A, you can take this embedding and say, hey, database, give me all the products that are like product A, because I want to show 'em to the user and hope that they buy a bunch more stuff. If they're looking at this, they might be interested in that. This is a way that you can create that feature. That's where we're going to leave our little demo for now.
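Putting that into the order-by form described above. The query vector `'[2, 3, 4, 5]'` is a made-up stand-in for the user's search embedding, or for another product's embedding when building a related-products feature:

```sql
-- Nearest neighbors first: smallest L2 distance sorts to the top.
SELECT name
FROM products_v
ORDER BY embedding <-> '[2, 3, 4, 5]'
LIMIT 5;
```

For a related-products page, you'd plug in the current product's own stored embedding as the query vector and exclude that product's id from the results.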
We'll continue to explore some more advanced use cases, but it is just wild to me that embeddings are a thing. That you can take some portion of reality and represent it as floating point numbers, kind of blows my mind, but it's terribly awesome and so we're going to continue to explore it here.