Bonus Resource: Multimodal LLMs and Google Gemini

By now, you have a basic understanding of large language models (LLMs) and generative AI, so let's take a fascinating leap into Google Gemini.

Launched by Google DeepMind in December 2023, Gemini is redefining what LLMs can do. It features two main models: Gemini Pro, on par with GPT-3.5, and the more advanced Gemini Ultra, which Google reports outperforms GPT-4 on several benchmarks. Even more exciting are the Nano versions, designed for mobile devices. Intrigued about their Android applications? Check this out. And for the developers among you, Gemini's API is now available on Kaggle; give it a try.

Curious about what makes multimodal LLMs stand out? Unlike traditional text-only models, they handle a mix of data types. Let's unravel this mystery together!

What are Multimodal Models?

Imagine an AI that understands the world not just through text but through images, audio, and more. That's what multimodal models are all about! An interesting example comes from Google DeepMind's technical report on Gemini (Gemini Team, Google, 2023).

In this example, Gemini showcases its capability for inverse graphics: it deduces the underlying code that could have produced a specific plot. This involves not only reconstructing the visual elements but also applying the mathematical transformations needed to generate the corresponding code accurately.

Exploring various data modalities

  • Audio as Visuals: Imagine audio waveforms transformed into visual representations such as mel spectrograms. This conversion offers a new perspective, making audio data visually interpretable (the short sketch after this list shows this conversion in code).

  • Speech into Text: When we transcribe speech, we're capturing words but also losing out on nuances like the speaker's tone, volume, and pauses. It's a trade-off between capturing the literal and missing the emotional cues.

  • Images in Textual Form: Here's where it gets interesting. An image can be broken into patches, each flattened into a vector and treated as a token in a sequence, much like words in a sentence. It's like translating a visual story into a textual narrative (the sketch after this list also shows this patch-splitting step).

  • Videos – Beyond Moving Images: While it's common to see videos as sequences of images, this overlooks the rich layer of audio that accompanies them. Remember, in platforms like TikTok, sound is not just an add-on; it's a vital part of the experience for a majority of users.

  • Text as Images: Something as simple as photographing text turns it into an image. This is a straightforward but effective way of changing data modalities.

  • Data Tables to Visual Charts: Converting tabular data into charts or graphs transforms dry numbers into engaging visuals, enhancing understanding and insight.

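To make two of these conversions concrete, here is a minimal Python sketch. It assumes the librosa and numpy packages and uses an illustrative audio file name and patch size (none of these come from the course material): it turns an audio clip into a mel spectrogram and splits an image into a flat sequence of patch vectors.

```python
# Minimal sketch: converting non-text modalities into model-friendly arrays.
# Assumes librosa and numpy are installed and a local file "clip.wav" exists;
# the file name and patch size are illustrative only.
import numpy as np
import librosa

# Audio as visuals: load a waveform and turn it into a mel spectrogram,
# a 2-D "image" of frequency content over time.
waveform, sample_rate = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)
log_mel = librosa.power_to_db(mel)           # shape: (80 mel bins, time frames)

# Images as token sequences: split an RGB image into 16x16 patches and
# flatten each patch into a vector, the same preprocessing step that
# Vision Transformers apply before embedding.
image = np.random.rand(224, 224, 3)          # stand-in for a real image
patch = 16
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

print(log_mel.shape, patches.shape)          # e.g. (80, T) and (196, 768)
```

Once the audio is a spectrogram and the image is a sequence of patch vectors, both can be embedded and fed to a transformer in essentially the same way as text tokens.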
Beyond these, think about the potential of other data types. If we could effectively teach models to learn from bitstrings, the foundational elements of digital data, the possibilities would be endless. Imagine a model that could seamlessly learn from any data type!

What about data types like graphs, 3D assets, or even sensory data like smell and touch (haptics)? While we haven't delved deeply into these areas yet, the future of MLLMs in these uncharted territories is both exciting and promising!

Bonus Resources: Recommended if you already know about the Encoder-Decoder Architecture

To delve deeper into the workings of Google Gemini, it's essential to understand its architecture, rooted in the encoder-decoder model. Gemini's design, though not elaborated in detail in their publications so far, appears to draw from DeepMind's Flamingo, which features separate text and vision encoders.
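As an illustration only (this is not Gemini's or Flamingo's actual implementation, and it assumes PyTorch is installed), the sketch below shows the general shape of such a design: separate text and vision encoders produce a fused context that a decoder cross-attends to while generating text. All layer sizes and names are made up for the example.

```python
# Minimal sketch of a multimodal encoder-decoder layout (illustrative only).
import torch
import torch.nn as nn

class TinyMultimodalSeq2Seq(nn.Module):
    def __init__(self, vocab=1000, d=128):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.vision_enc = nn.Linear(768, d)   # stands in for a ViT patch encoder
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, text_ids, image_patches, target_ids):
        text_mem = self.text_enc(self.text_emb(text_ids))      # (B, T, d)
        vision_mem = self.vision_enc(image_patches)             # (B, P, d)
        memory = torch.cat([text_mem, vision_mem], dim=1)       # fused context
        out = self.decoder(self.text_emb(target_ids), memory)   # cross-attends to both modalities
        return self.lm_head(out)

model = TinyMultimodalSeq2Seq()
logits = model(torch.randint(0, 1000, (2, 12)),   # text token ids
               torch.randn(2, 196, 768),          # image patch vectors
               torch.randint(0, 1000, (2, 8)))    # decoder input tokens
print(logits.shape)                               # torch.Size([2, 8, 1000])
```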

Note:

There is a dedicated session on Vision Transformers ahead in this bootcamp. It's included as a bonus resource along with the Transformers Architecture.

Papers for Further Reading

This section draws upon the valuable insights from an informative write-up by Chip Huyen on MLLMs. In the realm of Multimodal Large Language Models (MLLMs), different data types are translated and interchanged, opening up a wide range of possibilities. For a closer look, the following papers are recommended:

  • Wu, S., Fei, H., Qu, L., Ji, W., & Chua, T.-S. (2023). NExT-GPT: Any-to-Any Multimodal LLM. NExT++, School of Computing, National University of Singapore.
  • Gemini Team. (2023). Gemini: A Family of Highly Capable Multimodal Models. Google.
  • Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., & Chen, E. (2023). A Survey on Multimodal Large Language Models. USTC & Tencent YouTu Lab.