Hands-on Demo: Performing Similarity Search in Vectors (Bonus Module)

We now shift our focus to a more nuanced aspect of natural language processing and machine learning: Similarity Search in Vector Embeddings. The significance of this topic, especially after understanding RAG, cannot be overstated. RAG, with its unique blend of retrieval and generation, underscores the importance of accurately finding and utilizing relevant information from a vast corpus, which is stored as vector indexes.

This principle is closely tied to similarity search in vector embeddings, a cornerstone in understanding and leveraging the full potential of large language models (LLMs). While the upcoming section dives a bit deeper into this topic, it's worth noting that this is a bonus section for a reason: insights into similarity search and the algorithms at play can enrich your understanding, but if you're primarily focused on implementation by the end of this course, a detailed grasp isn't mandatory.

Similarity Search in Vector Embeddings

By now you know that vector embeddings allow us to represent complex data, such as words or images, in a way that captures their underlying semantics. One of the critical tasks in many applications, from semantic search to recommendation systems to large language models, is determining the similarity between these vector embeddings.

Three of the most commonly used metrics to measure the similarity between vectors are Euclidean Distance, Dot Product Similarity, and Cosine Similarity. Here's a closer look at each:

1. Euclidean Distance

  • Description: Represents the straight-line distance between two vectors in a multi-dimensional space.

  • Formula: The Euclidean distance between two vectors, a and b, is calculated as the square root of the sum of the squared differences between their corresponding components.

  • Considerations: This metric is sensitive to both the magnitudes and the relative location of vectors in space. It's a natural choice when vectors contain information about counts or measurements. For example, it can be used in recommendation systems to measure the absolute difference between embeddings of item purchase frequencies.
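To make this concrete, here is a minimal sketch of the calculation in Python, assuming NumPy and a pair of made-up four-dimensional embedding vectors:

```python
import numpy as np

# Made-up embedding vectors (real embeddings typically have hundreds of dimensions)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Euclidean distance: square root of the sum of squared component differences
euclidean_distance = np.sqrt(np.sum((a - b) ** 2))
print(euclidean_distance)  # 2.0

# Equivalent one-liner using NumPy's norm
print(np.linalg.norm(a - b))  # 2.0
```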

2. Dot Product Similarity

  • Description: Calculates the similarity by adding the products of the vectors' corresponding components.

  • Formula: The dot product between vectors a and b is the sum of the product of their corresponding components.

  • Considerations: This metric considers both the direction and magnitude of vectors. It can be particularly useful in situations where the angle between vectors is of interest, such as in collaborative filtering recommendation systems.
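Here is a matching sketch for the dot product, again assuming NumPy and the same made-up vectors:

```python
import numpy as np

# Made-up embedding vectors
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Dot product: sum of the products of corresponding components
print(np.dot(a, b))      # 1*2 + 2*1 + 3*4 + 4*3 = 28.0

# Scaling a vector scales the score too -- magnitude influences the result
print(np.dot(2 * a, b))  # 56.0
```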

3. Cosine Similarity

  • Description: Measures the cosine of the angle between two vectors, focusing purely on the direction and not on the magnitude.

  • Formula: The cosine similarity between vectors a and b is the dot product of the vectors divided by the product of their magnitudes.

  • Considerations: This metric is not influenced by the magnitude of vectors, making it suitable for tasks like semantic search or document classification, where the direction or angle between vectors is more significant than their length.
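And a corresponding sketch for cosine similarity, under the same assumptions:

```python
import numpy as np

# Made-up embedding vectors
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])

# Cosine similarity: dot product divided by the product of the magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # ~0.933

# Scaling either vector leaves the score unchanged -- only direction matters
scaled = np.dot(10 * a, b) / (np.linalg.norm(10 * a) * np.linalg.norm(b))
print(scaled)  # ~0.933
```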

Comparing and Summarizing the Three Options

You now have embeddings for any pair of examples. The table below summarizes how each metric relates to similarity:

| Similarity Metric | Description | Formula | Correlation with Similarity |
| --- | --- | --- | --- |
| Euclidean Distance | Measures the straight-line distance between two points in space represented by vectors | $\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ | Inversely related (higher distance means lower similarity) |
| Cosine Similarity | Evaluates the cosine of the angle between two vectors, indicating their orientation similarity | $\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$ | Directly related (higher cosine means higher similarity) |
| Dot Product | The product of the magnitudes of two vectors and the cosine of the angle between them | $a_1b_1 + a_2b_2 + \ldots + a_nb_n = \lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert \cos(\theta)$ | Directly related and increases with vector magnitudes |

See Cosine Similarity Search in Action

(Embedded video demo; credits: Microsoft Reactor)

Notice that cosine similarity has found extensive applications in areas such as semantic search and document classification. It provides a robust mechanism to gauge the directional similarity of vectors, which translates to comparing the overall essence or content of documents. Imagine trying to find documents or articles that resonate with a given topic or theme; cosine similarity is your tool of choice.
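As a toy illustration of such a semantic search (using made-up three-dimensional "document embeddings" rather than ones produced by a real embedding model), documents can be ranked by their cosine similarity to a query vector:

```python
import numpy as np

# Made-up embeddings; in practice these would come from an embedding model
documents = {
    "esports tournament recap": np.array([0.9, 0.1, 0.2]),
    "baking sourdough bread":   np.array([0.1, 0.8, 0.3]),
    "gaming league finals":     np.array([0.8, 0.2, 0.1]),
}
query = np.array([0.85, 0.15, 0.15])  # e.g., an embedding of "competitive gaming news"

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank documents by directional similarity to the query
ranked = sorted(documents.items(), key=lambda kv: cosine_sim(query, kv[1]), reverse=True)
for title, vec in ranked:
    print(f"{cosine_sim(query, vec):.3f}  {title}")
```

The two gaming-related documents land at the top while the unrelated one falls to the bottom, which is exactly the behavior a semantic search needs.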

Further, if you've ever used recommendation systems, like those on streaming platforms or e-commerce sites, they might be leveraging cosine similarity. These systems aim to suggest items to users, drawing parallels from their historical behavior and preferences.

However, it's crucial to note that cosine similarity may not always be the best fit. For example, in scenarios where the magnitude or 'size' of vectors carries significance, relying solely on cosine similarity might be misleading. Take, for instance, image embeddings that are formulated based on pixel intensities. Here, merely comparing the direction of vectors might not suffice, and the magnitude becomes critical.
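Here is a small sketch of that caveat with made-up vectors: two embeddings pointing in exactly the same direction receive a perfect cosine similarity, even though their magnitudes, and hence their Euclidean distance, differ substantially:

```python
import numpy as np

# Hypothetical embeddings where magnitude carries meaning (e.g., overall intensity)
dim_image    = np.array([0.1, 0.2, 0.1])
bright_image = np.array([1.0, 2.0, 1.0])  # same direction, 10x the magnitude

cosine = np.dot(dim_image, bright_image) / (
    np.linalg.norm(dim_image) * np.linalg.norm(bright_image)
)
print(cosine)                                    # 1.0 -- "identical" by direction alone
print(np.linalg.norm(dim_image - bright_image))  # ~2.2 -- clearly far apart in magnitude
```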

Case in point: Rethinking cosine similarity based on the choice of regularization techniques

This case in point stems from a recent exploration (published on March 11, 2024) of similarity search in vector embeddings by researchers from Netflix Inc. and Cornell University. Their study reveals that cosine similarity's effectiveness can vary and is influenced by the regularization techniques used when training vector embeddings. Regularization is a fundamental concept: think of it as giving a model a rule that keeps it from "overfitting", i.e., learning the training data by heart, so that it still performs well on new, unseen data.


Specifically, this paper shows that cosine similarity might produce arbitrary outcomes, challenging its consistency in capturing true semantic relevance. In a nutshell, this research invites us to reconsider our reliance on cosine similarity and opens avenues for more robust approaches to measuring similarity in vector embeddings. For a deeper dive into these findings, access the full paper here.