Which vector similarity metric should I use?

A survey of the similarity metrics supported by six vector databases: pgvector, Pinecone, Weaviate, Qdrant, Milvus, and Vespa.

One of the first decisions you will face when using a vector database for similarity search is which similarity metric to use.

Short Answer: Cosine Similarity

For information retrieval over text encoded by a sentence transformer, cosine similarity often outperforms the other metrics. It efficiently compares the high-dimensional vectors transformers produce by focusing on direction, which carries the semantic meaning, rather than magnitude. Note, however, that the metric assumes a convex or normalized set, and results may be misleading if the data does not meet this condition.

It is also supported by all of the vector databases I evaluated.
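As a quick illustration, here is a minimal NumPy sketch of cosine similarity; the query and doc vectors are made-up toy values standing in for real sentence-transformer embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; real ones would come from your embedding model.
query = np.array([0.2, 0.8, 0.1])
doc = np.array([0.25, 0.75, 0.05])

print(cosine_similarity(query, doc))  # ~0.995: nearly identical direction
```

If a model already outputs unit-length embeddings, both norms in the denominator are 1 and cosine similarity reduces to a plain dot product, which is one reason some databases recommend the dot product for normalized vectors.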

[Figure: Cosine similarity representation (source: Qdrant)]

Long Answer

It depends on your use case. Below is a breakdown of the metrics each vector database supports, based on each vendor's documentation or blog posts:

Vector Database Support Matrix

| Metric | Description | Support |
| --- | --- | --- |
| Cosine Distance | Measures the cosine of the angle between two vectors. | pgvector, Pinecone, Weaviate, Qdrant, Milvus, Vespa |
| Euclidean Distance (L2) | Straight-line distance between two vectors in a multidimensional space. | pgvector, Pinecone, Qdrant, Milvus, Vespa |
| Inner Product (Dot Product) | Sum of the products of the vectors' corresponding components. | pgvector, Pinecone, Weaviate, Qdrant, Milvus |
| L2-Squared Distance | Squared Euclidean distance between two vectors. | Weaviate |
| Hamming Distance | Number of dimensions at which two vectors differ. | Weaviate, Milvus, Vespa |
| Manhattan Distance | Sum of absolute differences between the vectors' components (L1). | Weaviate |

Detailed Descriptions

Below are detailed descriptions of each metric, along with their relative strengths, weaknesses, and the use cases they suit best.

Cosine Distance

  • Description: Measures the cosine of the angle between two vectors, often used when working with normalized or convex sets (see the sketch after this list).
  • Strengths: Primarily considers the orientation of vectors, making it ideal for high-dimensional spaces such as text comparison where the length of documents may not be significant.
  • Weaknesses: Not suitable when the magnitude of vectors is important, such as when comparing image embeddings based on pixel intensities. May not provide accurate similarity measures if the data does not form a convex set.
  • Use Cases: Document classification, semantic search, recommendation systems, and any other task involving high-dimensional and normalized data.
  • Real World Example: In information retrieval, cosine distance is commonly used to measure the similarity between query and document vectors, disregarding their length and focusing on semantic meaning.
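Note that most vector databases expose this as a distance, conventionally 1 - cosine similarity, so that smaller means more similar. Here is a small sketch of that convention and of the metric's indifference to magnitude, using arbitrary toy vectors:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance as commonly defined: 1 - cosine similarity."""
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - sim)

a = np.array([1.0, 2.0, 3.0])
print(cosine_distance(a, 2.0 * a))                    # ~0.0: scaling doesn't change direction
print(cosine_distance(a, np.array([3.0, 2.0, 1.0])))  # > 0: different direction
```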

Euclidean Distance (L2)

  • Description: Calculates the straight-line distance between two vectors in a multidimensional space (sketched in code after this list).
  • Strengths: Intuitive, simple to calculate, sensitive to both the magnitude and direction of vectors.
  • Weaknesses: May not perform well in high-dimensional spaces due to the "curse of dimensionality".
  • Use Cases: Image recognition, speech recognition, handwriting analysis.
  • Real World Example: In facial recognition systems, the Euclidean distance between feature vectors can determine the identity of a face.
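A minimal NumPy sketch; the two vectors here are toy stand-ins for the feature vectors a face-recognition model would produce:

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Toy feature vectors; a real system would use model-produced embeddings.
face_a = np.array([0.1, 0.9, 0.4])
face_b = np.array([0.2, 0.8, 0.5])
print(euclidean_distance(face_a, face_b))  # small value => likely the same face
```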

Inner Product (Dot Product)

  • Description: Computes the sum of the products of the vectors' corresponding components (see the sketch following this list).
  • Strengths: Fast to compute, reflects both the magnitude and orientation of vectors.
  • Weaknesses: Sensitive to the magnitude of vectors, not just their orientation.
  • Use Cases: Recommendation systems, collaborative filtering, matrix factorization.
  • Real World Example: In recommendation systems, the inner product can be used to determine the similarity between user and item vectors, helping to predict a user's interest in an item.
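Here is a sketch of the recommendation use case with made-up latent factors; in a real matrix-factorization system these vectors would be learned from interaction data:

```python
import numpy as np

user = np.array([0.9, 0.1, 0.5])      # one user's (made-up) latent preferences
items = np.array([
    [0.8, 0.2, 0.4],                  # item embeddings, one per row
    [0.1, 0.9, 0.3],
    [0.7, 0.0, 0.9],
])

scores = items @ user                 # inner product of the user with each item
print(scores)                         # [0.94, 0.33, 1.08]
print("recommend item", int(np.argmax(scores)))  # item 2 scores highest
```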

L2-Squared Distance

  • Description: The squared Euclidean distance between two vectors (sketched below).
  • Strengths: Penalizes large differences between vector components, which can be useful in certain situations.
  • Weaknesses: The square operation may distort distances and is sensitive to outliers.
  • Use Cases: Suitable for problems where large differences in individual dimensions are particularly significant.
  • Real World Example: In image processing, L2-Squared Distance might be used to accentuate differences between images.
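A short sketch of the metric; note that because the square root is monotonic, ranking neighbors by squared L2 yields the same ordering as plain L2 while skipping the sqrt, which is presumably why some engines offer it:

```python
import numpy as np

def l2_squared(a: np.ndarray, b: np.ndarray) -> float:
    """Squared Euclidean distance: no sqrt, so large component gaps dominate."""
    diff = a - b
    return float(np.dot(diff, diff))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(l2_squared(a, b))            # 25.0
print(np.linalg.norm(a - b) ** 2)  # 25.0: same value via L2, then squaring
```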

Hamming Distance

  • Description: Counts the number of dimensions at which two vectors differ (see the sketch after this list).
  • Strengths: Effective for comparing binary or categorical data.
  • Weaknesses: Not suitable for continuous or real-valued data.
  • Use Cases: Error detection and correction, DNA sequence comparison.
  • Real World Example: In genetics, Hamming distance is used to measure the genetic distance between two DNA strands.
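A minimal sketch over a toy pair of DNA strands, represented as arrays of one-character strings:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return int(np.count_nonzero(a != b))

dna_a = np.array(list("GATTACA"))
dna_b = np.array(list("GACTACA"))
print(hamming_distance(dna_a, dna_b))  # 1: the strands differ only at index 2
```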

Manhattan Distance

  • Description: Sums the absolute differences between the vectors' components, measuring distance along axes at right angles (see the sketch after this list).
  • Strengths: More robust to outliers than the Euclidean distance.
  • Weaknesses: Not as intuitive in a geometric sense as the Euclidean distance.
  • Use Cases: Grid-based route planning, high-dimensional or sparse data where robustness to outliers matters.
  • Real World Example: In city planning or logistics, the Manhattan distance is used to calculate the shortest path between two points along a grid of streets.
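A minimal sketch using made-up grid coordinates (blocks east, blocks north):

```python
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute component differences (L1 / taxicab distance)."""
    return float(np.sum(np.abs(a - b)))

start = np.array([1, 2])  # made-up intersections on a street grid
end = np.array([4, 6])
print(manhattan_distance(start, end))  # 7.0: three blocks east, four blocks north
```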

Feedback welcome on Twitter and LinkedIn.
