The course "Mastering Postgres" helped me practice query commands and understand why they're important. Inspiring me to explore the 'why' and 'how' behind each one. Truly a great course!Nakolus
Shorten dev cycles with branching and zero-downtime schema migrations.
We're getting to a point that is outside of my mathematical prowess. Everything we've searched on so far has been using the L2 operator, the L2 comparison function, which is a Euclidean comparison. I'm not a mathematician, so this is where we start to get outside the bounds of my understanding, but I do want to talk to you about the differences between the operators you can use and where you might use one versus the other.
This is the L2 operator that we have been using, <->, and that is probably a totally fine default. There is also the cosine operator, which looks like <=>; we've got that in there as well. There is another operator called the L1, which is a less-than, a plus sign, and a greater-than: <+>. And then there's one more called the inner product, and this guy uses a pound sign (hashtag? number sign? it's been so long): <#>. These are just different ways to compare the underlying embeddings.
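As a rough sketch of what those look like in a query (assuming a hypothetical movies table with a vector embedding column; the three-dimensional query vector is just for illustration):

```sql
-- L2 / Euclidean distance (the one we've been using)
SELECT title FROM movies ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 5;

-- Cosine distance
SELECT title FROM movies ORDER BY embedding <=> '[0.1, 0.2, 0.3]' LIMIT 5;

-- L1 (taxicab) distance
SELECT title FROM movies ORDER BY embedding <+> '[0.1, 0.2, 0.3]' LIMIT 5;

-- Negative inner product (negated so smaller still means closer, like the others)
SELECT title FROM movies ORDER BY embedding <#> '[0.1, 0.2, 0.3]' LIMIT 5;
```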
Now, which one should you choose? Great question: it kind of depends. If you read the docs on pgvector, they say that when the embeddings are normalized, as they are with OpenAI's models, the inner product is going to be the most performant. If you go read some of the deeper technical material from OpenAI, it says it doesn't matter that much, but it recommends the cosine operator when the vectors have been normalized. Some of this depends on your data and some depends on the model you're using: L1 is very resilient to big outliers or anomalies, whereas L2 can be skewed by them. I'm just looking for a right-down-the-middle, 80, 90, 95% use case.
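A small example of what that looks like in practice, with the same hypothetical movies table: pgvector's <=> operator returns cosine distance, so if you want the cosine similarity score you subtract the distance from one.

```sql
-- Cosine distance orders the results; 1 - distance gives the similarity score.
SELECT title,
       1 - (embedding <=> '[0.1, 0.2, 0.3]') AS cosine_similarity
FROM movies
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;
```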
According to OpenAI, cosine similarity is that middle-of-the-road choice; according to pgvector, the inner product is; and L2 is a very well understood, very commonly used distance comparison for vectors. Between the three of those, you'll find one that works very well for your use case. Take into consideration the model and its recommendation for comparing the vectors it gives you, 'cause that's pretty important. When we look at indexing in the next video, we will continue to talk about these operators, 'cause some of these indexes have flags that you can pass based on the operator that you plan to use. Again, this is a little bit above my pay grade in the mathematical sense, so when you're considering which operator to use, you have to test it against your data.
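As a preview, and only as a sketch (the movies table is still hypothetical), this is the kind of thing that's meant by the operator mattering at index time: each pgvector index is built with an operator class that needs to match the operator you plan to query with.

```sql
-- HNSW index built for cosine distance queries (<=>)
CREATE INDEX ON movies USING hnsw (embedding vector_cosine_ops);

-- Other operator classes, depending on the operator you plan to use:
--   vector_l2_ops  for <->
--   vector_ip_ops  for <#>
--   vector_l1_ops  for <+>
```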