Vector Databases for Absolute Beginners | Idan Gabrieli | Skillshare


Vector Databases for Absolute Beginners

Idan Gabrieli, Online Teacher | Cloud, Data, AI


Lessons in This Class

  • 1. S01L01 Welcome V2 (1:34)
  • 2. S02L01 Introduction (0:56)
  • 3. S02L02 AI, ML, DP and Gen AI (6:30)
  • 4. S02L03 Vectors (8:54)
  • 5. S02L04 Vector Embeddings (7:09)
  • 6. S02L05 Embedding Models (6:14)
  • 7. S02L06 Similarity Metrics (5:11)
  • 8. S02L07 Vector Search (5:22)
  • 9. S02L08 Summary (5:54)
  • 10. S03L01 Introduction (1:59)
  • 11. S03L02 Structured and Unstructured Data (6:34)
  • 12. S03L03 Vector DB (4:53)
  • 13. S03L04 Vector Search Workflow (4:31)
  • 14. S03L05 Selecting a Vector Database (5:21)
  • 15. S03L06 Summary (5:08)
  • 16. S04L01 Introduction (0:41)
  • 17. S04L02 Semantic Search (6:59)
  • 18. S04L03 Recommendation Systems (7:02)
  • 19. S04L04 RAG (9:36)
  • 20. S04L05 Anomaly Detection (3:56)
  • 21. S04L06 Visual Search (2:41)
  • 22. S05L01 Let’s Recap (11:24)
  • 23. S05L02 Thank You! (0:53)


9 Students

About This Class

Vectors - Unlock the Secret to AI's Superpower

Ever wondered how AI can recommend the perfect movie, generate stunningly accurate answers, or understand your natural language queries? The magic lies in vectors, embeddings, and vector databases, the backbone of modern semantic search and Generative AI. This course demystifies these cutting-edge concepts, making them accessible to absolute beginners like you!

What You’ll Learn

  • Vectorizing Data: Learn how raw information is transformed into powerful, searchable numerical representations using vectors.

  • Semantic Search: Explore how AI finds the most relevant content based on meaning, not keywords.

  • Vector Databases: Dive into the technology that stores and retrieves vectorized data efficiently for AI applications.

  • Market Applications: Discover how vector databases power cutting-edge solutions in AI-driven fields like recommendation systems, image and video search, and semantic search.

Meet Your Teacher


Idan Gabrieli

Online Teacher | Cloud, Data, AI

Level: Beginner



Transcripts

1. S01L01 Welcome V2: Hi, and welcome to this training on vector databases. My name is Idan and I will be your teacher. We are in the middle of a groundbreaking AI revolution advancing at an incredible pace across almost every industry. Everywhere we look, from healthcare to finance, retail to government agencies, organizations are looking for ways to implement AI to gain a competitive edge. One of the key building blocks of AI technologies is the concept of vectorizing data into vectors. Vectors play a critical role in bridging the gap between unstructured data and AI applications. Vector databases are the core engines for managing and searching vectors at scale, driving many useful applications like semantic search, recommendation systems, and augmenting large language models. In this training, we are going to uncover the secrets of vectors and vector databases step by step. I will do my best to make this course simple, engaging, and enjoyable. Thanks for watching, and I hope to see you inside. 2. S02L01 Introduction: Hi, and welcome to this section. My name is Idan, and I'm excited to start this training with you. In this section, we will lay down the required foundation on key concepts about vectors. We'll start by talking about AI, machine learning, deep learning, and generative AI. The next step will be to define the meaning of a vector from a mathematical perspective, using a very simple definition, and how it is used in data science. Then we'll be able to talk about vector embeddings and how embedding models are used in that process. The last step will be to understand how those generated vectors can be used to perform vector search using similarity metrics. All right. Let's start. 3. S02L02 AI, ML, DP and Gen AI: Hi, and welcome back. Before jumping to the concept of vectors, let's try to understand the big picture and organize some of the key terms related to AI. Starting with the basic question: what is AI?
We use the term so many times, and it means different things to different people. Let's create some common baseline. Artificial intelligence is the practice of getting machines to mimic human intelligence. It is a general purpose technology that can be used for many things, which makes it so popular. AI is the human desire to create a digital brain, a brain that can mimic human intelligence, so machines can perform more complex tasks. Now, what is the meaning of a complex task? Well, this definition is constantly changing. A few years ago, it was a complex task to classify an object in a picture. Now, it is a standard feature in many market applications. It's not exciting anymore. Just recently we started to use generative AI to generate an image or a video clip from a text description. AI entered the world of creativity. Practically speaking, we can see how a growing number of tasks that were performed only by humans are now handled by AI applications. AI keeps taking over more and more tasks that were once possible only for humans. Today, it is still complex to create a fully functional humanoid robot, but probably ten or twenty years from now, it will become a standard technology. There are many AI use cases where this AI brain is a component in a larger application. It is a piece of a much bigger puzzle. Therefore, it is useful to describe it in simpler terms. This AI brain is commonly represented as a simple box with input and output. It is a simple analogy. We insert an input and, using the knowledge stored in that brain, it generates an output, like a magic box. The knowledge stored in that AI box is called a trained model. There are many types of AI boxes, where each AI box can handle a specific type of data like text, image, video, audio, et cetera. How do we create a trained model to be used inside an AI box? This is the knowledge stored in that box.
Well, by using the combination of machine learning algorithms and data. Those ML algorithms can scan massive amounts of data, extract and learn meaningful patterns, and store them in something that is called a model. This process is called training a model. In most cases, it will require substantial computing resources coupled with huge training datasets, all supervised by a highly skilled data science team. Machine learning is the core methodology to create the knowledge in AI boxes. In a growing number of AI cases, the patterns in the data are complex and require tools that can perform a deeper analysis. One subset of machine learning algorithms is called deep learning. Deep learning is based on creating complex artificial neural networks that can store and process highly complex patterns. They manage to take us much deeper into the ocean of data to explore new things, handle more complex patterns, and as a result, improve the ability of machine learning solutions to handle more complex tasks. AI models are heavily based on deep learning. The next step in that evolution is called generative AI. Gen AI was the introduction of highly sophisticated AI models. Those models added the important capability to analyze text as a language, breaking the communication barrier between humans and machines and providing the option to generate new content while simulating human creativity. The famous ChatGPT and many other popular tools are based on generative AI. It is a general purpose technology that can be leveraged in many domains. For consumers, it is a set of new tools that can boost productivity, and I guess it is something that you are already using today. For developers, generative AI is adding amazing new AI models that can be integrated in many software applications for a growing number of use cases. Where are we going from here as a community with those crazy fast breakthroughs in AI? Well, I don't know. It is hard to predict. One thing is quite clear.
AI is here to stay, and it is opening a wide range of new opportunities. We just need to be able to ride that AI wave in a smart way. Do you know how all those AI models can digest, process, and understand the complexity of the outside environment around us, like processing text, an image, a video file, or an audio file? Well, it's all about vectors. Vectors form the foundation of how AI models interact with data, and that's the topic of the next lecture. See you next. 4. S02L03 Vectors: I am sure that you have encountered the term vector at some point in your studies. It is hard to avoid those math sessions. Let's quickly refresh the concept to get up to speed. A vector is a mathematical object characterized by both magnitude and direction. It is typically represented as an arrow pointing from one point to another in space. The arrow has a starting point and an endpoint. The length of the arrow corresponds to the magnitude of the vector, while its orientation in space indicates the direction. As a simple example, let's take a vector in two dimensional space and place the starting point of the vector at the origin (0, 0). The head of the vector ends at a specific point with x1 and y1 values. This simple geometric interpretation of a vector is widely used in physics and mathematics. One of the most useful methods to represent a vector in space is by using the values of the head point. It's like a point in space. This is the end location of the arrow. In a simple two dimensional space, a vector consists of an X component, which is sitting on the horizontal axis, and a Y component sitting on the vertical axis. So instead of visualizing a vector as a line in space, we can represent it mathematically by writing down its components as numbers. For instance, if the X component has a value of eight and the Y component has a value of 12, the vector can be written as the list of numbers eight and 12, or simply (8, 12).
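As a small aside, the two dimensional example above can be sketched in code. This is a minimal illustration, not part of the course material: a vector stored as a plain list of numbers, with its magnitude computed from the Pythagorean formula, generalized to any number of dimensions.

```python
import math

def magnitude(vector):
    # Length of the arrow: square root of the sum of squared components.
    return math.sqrt(sum(c * c for c in vector))

# The 2-D vector from the example: X component 8, Y component 12.
v = [8, 12]
print(magnitude(v))  # ~14.42

# The same function handles a vector of any dimension, e.g. seven dimensions.
print(magnitude([1, 2, 3, 4, 5, 6, 7]))
```

The same list-of-numbers representation works whether the vector has two dimensions or 700, which is exactly why it is the standard way to store vectors in code.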
The first number corresponds to the first dimension, and the second number corresponds to the second dimension. This approach works well for two dimensional vectors. But what happens when we have vectors with more dimensions, like seven or 700? We can't easily draw or visualize such vectors in space. Still, it is a vector, and it has a magnitude and direction. The good news is that the concept remains the same. The vector is represented as a list of dimensional components, such as D1, D2, D3, D4, and so on up to D7 for a seven dimensional vector. The number of dimensions defines the dimensional space of a specific vector. By presenting a vector as a list of numbers, where each number is one dimension, we can handle any vector size. That's it about vectors and how to represent them using a list of numbers. This basic understanding of vectors forms the foundation of many concepts in data science and machine learning. Let's now explore how they are related. In data science, the concept of data points is essential for describing the properties of an object. A data point represents a single observation or instance about a specific object, and it is characterized by a set of features. Here we have some examples of objects and their corresponding data points. For example, for an animal, each data point may include features such as the weight, height, number of legs, color, and maybe other features. For a sensor reading in an IoT device, features may include the timestamp when the device sends a specific sensor reading, the temperature, humidity, pressure, and more. Each data point of an object is described by a collection of features. These features are usually manually selected and engineered by data science practitioners based on use cases. For example, an engineer might decide that the temperature reading from an IoT device is significant and should be included as a feature in each data point generated by that IoT device. Let's take the first object, an animal.
It has several features, such as weight, height, number of legs, and color. There are many more features that can be used to describe an animal, but let's assume that someone selected those specific features for some use case. These features are fixed across all animals in that dataset, meaning every data point is described using the same set of features. If we check and examine one data point for an animal, it may look like this: the weight is 20 kilograms, the number of legs is four, and so on. One of the most effective ways to handle the values of these features for each data point is by using vectors. A vector is basically an ordered list or sequence of numbers that represents a single data point in a multidimensional space. For example, the vector for the animal's features will be structured as follows: we have the weight, height, number of legs, color, and so on, if there are more features. In data science, this is referred to as a feature vector because it holds the features of a specific object. Each number in the vector corresponds to a particular feature or attribute. The first number represents the weight, the second represents the height, and so on. Now, why use vectors? Well, computers are incredibly efficient at processing numbers. By organizing data as numerical values stored in vectors, we can leverage powerful mathematical techniques to analyze and process the data efficiently. These vectors, representing data in multiple dimensions, can also be thought of as arrows pointing in a particular direction with a particular magnitude in space. Each arrow is a specific vector. How is the list of features for a specific object selected? In data science, this is called feature selection. Feature selection is a critical step in the data preparation process, helping to ensure that the dataset can capture the most relevant information for the analysis or model being used. The first option to perform feature selection is manual selection.
Meaning, someone with relevant expertise and knowledge thinks about and considers which features will be relevant for a specific use case. And you know what? It is a practical option for many use cases. However, the world around us is highly complex, with objects and patterns that are not always straightforward. Let's consider our example of a feature vector for an animal object. Should we use the number of legs as a feature or not? Do we need to use other features? Imagine that we need to identify the facial properties of an animal based on images as raw data. This introduces an additional layer of complexity. Should we focus on the overall size of the head or the shape of the nose? Deciding which features are relevant and how to extract them is a challenging task, especially when dealing with complex patterns in high dimensional data. In some cases, it doesn't make sense to try to select them manually. To address this challenge, we turn to the concept of vector embeddings, an automated and powerful method for extracting meaningful features from raw data using the power of AI. This will be the focus of our next lecture. 5. S02L04 Vector Embeddings: Hi, and welcome back. In the previous lecture, we explored how vectors are used in data science to hold a list of numerical values, known as features, that describe specific objects. These feature vectors organize data in a simple, structured manner where each number represents a measurable or categorical attribute, such as weight, height, price, color, or country location. Feature vectors are like the containers being used to carry input data into machine learning models. However, in many real world scenarios, manually deciding which features are necessary to describe patterns and attributes in data is becoming an inefficient option. This challenge is especially amplified when dealing with unstructured data, data that lacks a predefined format, such as text, images, video, audio, or logs.
These types of data require more advanced techniques to extract meaningful features that can better represent the required patterns in the data. Let's consider a group of pictures of dogs. It is unstructured data with no predefined format or framework. Each picture can vary widely in terms of colors, shapes, sizes, and other attributes. It would be almost impossible to manually decide which features can be used to describe the content of the image. We need an automation tool that will help us extract meaningful features from each picture. Fortunately, AI models can perform this task. These models can take unstructured data such as text, image, video, or audio as input, and generate as an output a feature vector, a structured list of numbers. These numbers capture the patterns identified by the model in the input raw data. This process of vectorizing the data is known as vector embedding. Let's define this important concept. Vector embeddings leverage trained AI models to analyze complex patterns in high dimensional unstructured data and transform them into a compact, lower dimensional vector. This is a data transformation box. The output is a feature vector, also called an embedding. Those AI models are called embedding models. Embeddings play a pivotal role in the field of artificial intelligence because they offer a powerful technique to transform unstructured data, like text, images, video, or user preferences, into a more compact and structured numerical representation inside vectors that machines can easily process. The process of vectorizing data also normalizes the data. The number of features that a specific embedding model generates is fixed, meaning every vector produced by the model will be the same size, the same number of dimensions. The numbers generated by vector embeddings represent patterns identified in the high dimensional space of the raw data. This output, known as vectorized data, is the result of the process called vectorizing data.
It bridges the gap between the complex, messy patterns found in the real world and the digital space where everything is represented numerically using numbers. In simple terms, vector embeddings convert unstructured information into a more organized and compact numerical representation that captures the underlying patterns and relationships of the data. The machine learning model used for the vector embedding process is called an embedding model. There are many types of embedding models, each designed to handle different kinds of data. Taking the image as an example, creating a vector embedding of an image is like creating a digital fingerprint of that image. It captures the essence of the image in a numerical form. As a simple example, let's assume we have a list of images. We can take each image, feed it into a relevant embedding model, and the model will produce for each picture a vector embedding with numbers. Let's say, for the sake of simplicity, that our model is generating just ten features for each image. What is the meaning of those numbers inside a specific vector? Well, it is important to understand that the list of numbers inside each vector is not directly related to a specific simple feature like height or color, as we saw when manually selecting the feature vectors. The numbers in a vector embedding don't usually have standalone meaning. We should not try to take a specific number from a vector created using vector embeddings and try to figure out the meaning of that number. It is the combination of all numbers that defines the vector's position in space. Let's see a simplified version of that concept. We can see here two vectors, vector A and vector B. Each vector has a relative position in the embedding space. This position can be used to evaluate similarity between vectors. The proximity of two vectors in the embedding space reflects their similarity. Similar items will have vectors that are closer together, with a smaller distance between them.
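The idea that proximity reflects similarity can be sketched with a few lines of code. Note that the three-dimensional "embeddings" below are made up for illustration; real embedding models output hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two dog pictures and one unrelated picture.
dog_1 = [0.9, 0.1, 0.3]
dog_2 = [0.8, 0.2, 0.35]
other = [0.1, 0.9, 0.0]

print(cosine_similarity(dog_1, dog_2))  # close to 1: similar items
print(cosine_similarity(dog_1, other))  # much lower: dissimilar items
```

The two dog vectors score close to each other, the unrelated vector scores far away, which is exactly the property the lecture describes.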
For example, two pictures of dogs will have vectors that are close to each other. It is an extremely useful property we can utilize for many use cases, like performing a visual search using images, finding outliers, clustering similar objects, and more. As you can imagine, all this magic of vectorizing data is possible using an embedding model. That's the main brain. In that case, how should we select an embedding model? What are the market options in that perspective? That's the topic of the next lecture. 6. S02L05 Embedding Models: Hi, and welcome back. In the previous discussion, we explored the power of using vector embeddings to transform unstructured data into a structured list of numbers, making the data easier to analyze and process. The embedding model is the key component responsible for this transformation. Let's dive into the concept of an embedding model. An embedding model is a machine learning model designed to map data from its original high dimensional space into a more compact, lower dimensional vector space. This lower dimensional representation is called an embedding, and it effectively captures the semantic or structural relationships within the data. Machine learning models are typically trained using algorithms based on deep learning methods like neural networks. During the training phase, these models are exposed to vast amounts of data, allowing them to learn complex patterns and relationships. Once the model is fully trained, it becomes capable of performing vector embeddings, meaning transforming new unstructured data into a vector. The number of features that a specific embedding model generates is fixed, meaning every vector produced by the same model will be the same size. This fixed number of features determines the dimension of the vector. When we talk about a high dimensional space, we refer to models that generate vectors with many features, sometimes as many as 100, 1,000, or even more.
Should we use an embedding model that generates more dimensions or fewer dimensions? Well, there is no simple yes or no answer to that question. Having more dimensions can improve the accuracy of the generated vector, but it comes with trade-offs. Larger vectors require more computing power and more memory resources and can increase the latency of tasks like searching data. It's a balance between accuracy and the computing resources or efficiency required for processing the data. Now, let's say you or your company want to build an AI based application that leverages vector embeddings to transform complex unstructured data. In that case, you have three main options. Option number one: build an embedding model from scratch. Some data scientists may develop custom embedding models tailored to specific tasks. However, this approach is less common because it requires substantial resources and time, something most companies cannot afford. In many industries, the speed of launching new products is critical, making this option less practical for most market use cases. It's probably more feasible for research institutions or large companies with deep pockets that run a skilled data science team. Another, much more practical option is to utilize an out of the box embedding model as a building block. There are many types of embedding models. We can divide them into two main groups. The first group is open source embedding models. There is a growing number of open source models that can be used, and of course, each model is optimized for a specific type of data. There are text embedding models, image embedding models, audio embedding models, and so on. The second group is paid models. Those models are encapsulated as services accessed using APIs: we provide the input data and get the vector as an output.
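To make option two concrete, here is a hedged sketch of the request and response shape a hosted embedding service typically uses. The model name, field names, and response layout below are hypothetical placeholders, not any real provider's API; check your provider's documentation for the actual endpoint and schema.

```python
import json

# Hypothetical request body that would be POSTed to an embedding service.
request_body = json.dumps({
    "model": "example-text-embedder",  # placeholder model name
    "input": ["Vector databases store embeddings."],
})

# Hypothetical response: one fixed-size vector per input string.
response_body = '{"embeddings": [[0.12, -0.03, 0.56, 0.08]]}'
vectors = json.loads(response_body)["embeddings"]

print(len(vectors))     # one vector per input string
print(len(vectors[0]))  # fixed dimension, determined by the model
```

Whatever the provider, the contract is the same: raw data in, fixed-size numeric vectors out.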
As you can imagine, this approach dramatically simplifies the development and integration of AI based applications, and developers can easily leverage powerful models without reinventing the wheel. The third option is to fine tune a pre-trained model. If our data falls within a specific domain, such as medical records, and off the shelf models are insufficient, we can consider fine tuning an existing pre-trained model with additional domain specific data. Like option number one, it will require a skilled team to perform the fine tuning process, but it is much less complex compared to starting the training from scratch. All right. As a quick summary, when building an AI application that leverages embeddings, there are three main options. Option number one: build an embedding model from scratch. Option number two: use pre-trained models, either open source or via paid APIs. And the last option: fine tune a pre-trained model with domain specific data. The selection of the most suitable option is based on the project's specific needs, resources, and timeline. Many developers who would like to focus on their core software applications will go with option number two and use it as a building block, meaning they will try to find off the shelf services or maybe open source embedding models that are already trained and optimized for production. Now, let's say we vectorized all our unstructured data using a selected embedding model and placed it in vectors. How can we group them based on their position in space? How can we measure their similarity? That's the topic of the next lecture. 7. S02L06 Similarity Metrics: We have learned that embedding models are powerful tools used to convert complex unstructured data into compact numerical vectors that capture key features and patterns within the raw data. Each vector created by an embedding model has a fixed number of items, which defines its embedding space.
This space is a high dimensional mathematical space where objects are represented as vectors. In this embedding space, the relationships and similarities between objects are encoded based on their relative positions. The closer two vectors are to each other, the more similar the objects they represent: proximity equals similarity. Instead of trying to find an exact match, meaning two identical vectors, we are expanding our space of options by looking at proximity. This property of proximity between vectors is extremely useful for a variety of applications, such as semantic search, recommendation systems, anomaly detection, and much more. Let's say we have 1,000 images and we have transformed them into 1,000 vectors. Each vector has the same number of features and represents the image in the same embedding space. So what can we do with this vectorized data? We can perform a variety of tasks. For example, we can take a new picture, convert it into a vector using the same embedding model, and then search through the existing 1,000 vectors to find the most similar images. This process is called a vector search. The top matched images can be recommended to the user based on similarity. Here we can see that picture number one and picture number two are very similar to the new picture. Another option is to find groups or clusters. We can analyze the entire set of 1,000 vectors and group them into clusters. Images that share similar patterns will be close to each other in the embedding space. This allows us to identify meaningful groups or clusters within the data. Here we have two very distinct clusters or groups. The underlying principle for these tasks is measuring the similarity between vectors. Let's dive a little bit more into how these similarity metrics are calculated. There are several methods to measure similarity between vectors, and each method is more suitable for a different set of use cases.
Still, the most common method to measure similarity between vectors for general purpose high dimensional data created by embeddings is cosine similarity. Cosine similarity measures the similarity between two vectors based on the angle between them. It focuses on the direction of the vectors rather than their magnitudes. How is this metric calculated? Well, using the following formula: cosine similarity = (A · B) / (‖A‖ × ‖B‖). It may look complex, but it's quite simple. Let's see what we have here, assuming A and B are two vectors. On the upper side, we have A · B, which is the dot product of vectors A and B. It is calculated by simply multiplying the corresponding elements of the two vectors and then summing the results. On the lower side, we need to calculate the magnitude of vector A and the magnitude of vector B, and then multiply the two magnitudes. As a simple example, let's imagine that we have three vectors in space: A, B, and C. We will use this formula to calculate the metric between vector A and the two other vectors. The cosine value between A and B is 0.905, which is very close to the value one, meaning it translates to a small angle between the two vectors. They are almost pointing in the same direction, meaning A and B are similar to each other. What about A and C? We get 0.62, which translates into an angle of 51 degrees. Those vectors are not really pointing in the same direction. Bottom line: vector B is more similar to vector A than vector C. All right, we have a simple, useful metric to measure the similarity between vectors. Let's use it to perform a process called a vector search. See you next. 8. S02L07 Vector Search: In the previous lecture, we saw how to measure a similarity metric between two vectors. In practical use cases, it is required to perform a process called a vector search. It means that on one side, we have a query vector, also called here QV, and on the other side, a long list of vectors to be searched, like vector one, two, three, et cetera.
The vector search process is divided into three steps. In step one, we take the query vector and calculate the cosine similarity metric between that query vector and the first vector in that list, then the second vector, then the third vector, and so on until the last vector. In step number two, we reorder the vectors based on the calculated similarity score, going in descending order. So here, the similarity metric between the query vector and vector number four has the highest value. Next, we have vector number three, then vector number five, and so on. Looking at this simple two dimensional representation, it makes a lot of sense. We can see that the query vector is close to vectors four, three, and five. And finally, as part of step number three, we can pick the top similar vectors based on a threshold value. Here, the selected top vectors are numbers four and three, based on that specific threshold. That's the end to end process. Let's see a simple example. We have the following list of seven sentences. All of them are around the main topic of AI. Let's assume each sentence was converted into a vector using a text embedding model. We can place them in a simple table. The last column represents the vector embedding result for each sentence as a list of numbers. I'm not mentioning the specific numbers; it's just an illustration. We have vector one, vector two, and so on. At some point, I get a new sentence, and it is required to check which sentences in that list are similar to the new sentence I would like to check, meaning they have a similar semantic meaning. Let's use the same simple process of a vector search. As a first step, we take that new sentence and convert it to a vector using the same text embedding model. This vector is now called the query vector. The next step is to calculate the similarity between the query vector and all other vectors. Finally, we sort the table in descending order based on that metric.
As we can see in this table, the nearest vector for the searched sentence is the first one in that list, with the highest score, 0.851, and the next vector, meaning number three, with 0.722. Those two sentences have the most similar semantic meaning to the new sentence that I'm searching for. Another dimension to consider is the threshold range. Is a metric value of 0.65 good enough or not? Should we drop anything below 0.7, for example? Here I decided that the threshold value should be 0.7. Well, there is no one answer to that question. It is based on the use case, the application, and the domain. If this metric is used in a recommendation system for an ecommerce website, then there is more room for flexibility. The threshold value can be lower, like 0.6. It will be just fine if I get a recommendation for a bike gadget because just two days ago I searched for new sport shoes. On the other hand, if the application is about finding relevant medical content, then the threshold value should be much higher, like 0.9 or even more, making sure that the content is more relevant to the search query. Setting the right threshold is a case-by-case decision. We just saw a simple list of less than ten sentences for our vector search. What if we need to handle thousands of vectors, hundreds of thousands, or millions of sentences or images or documents? That's the typical case in real market use cases. We need a place to manage and store all those vectors and an efficient technology to perform a fast vector search. That's the purpose of vector databases, which is the main topic in our next section. See you next. 9. S02L08 Summary: Hi, and welcome to the last lecture of this section. I would like to summarize the key concepts we covered. I'm going to use a mind mapping tool called Xmind to organize the topics. Feel free to download the PDF version as a final version, or the actual mind mapping file, and perform your own adjustments while using the same application. All right?
We started this section by defining the high-level concept of AI and reviewing the evolution of this amazing technology with machine learning, deep learning, and generative AI. AI is an umbrella term for the human desire to create a digital brain to mimic human intelligence. It is a general-purpose technology, which means it is useful for many use cases, and it will be used in many domains. ML, machine learning, is a subfield of AI. It is a group of methods and algorithms to discover and learn patterns from data. Deep learning is a subfield of machine learning for handling highly complex patterns using artificial neural networks. And the last evolution phase is about generative AI. GenAI added the important capability to analyze text as human language and generate new content. AI models are trained using massive amounts of data, and the most useful way to convert the complex data around us for AI models is by using vectors. A vector is a simple mathematical object with magnitude and direction, like an arrow pointing in space. It is easy to present such a vector in two-dimensional space or three-dimensional space. But what about four-dimensional space or ten-dimensional space? It is much simpler to present any vector using a simple coordinate system. We can break any vector into its components, like a list of numbers representing the dimensions, like D1, D2, D3, and so on. In the context of data science, vectors are essential tools for representing data points or data objects as numerical data. A data point is a group of features or properties that describe an object. It is a single observation of an object. A feature vector is a list of all features related to one data point, as numbers. Those features are manually selected for simple use cases as part of the data preparation process, while picking the relevant pieces of information that can describe patterns related to an object.
In manual feature selection, each number corresponds to a specific feature, such as an animal's weight or height, making it possible to describe complex objects in a multidimensional space. However, as the complexity of real-world patterns increases, manually selecting relevant features becomes challenging. Vector embedding is a smart way to automate the process and leverage the power of AI models. Vector embeddings are used to vectorize data, transforming raw unstructured data such as text, images, or audio into a numerical representation: a long list of numbers. Each number in the vector corresponds to a specific dimension. This process normalizes the data into fixed-size vectors that are organized in the same vector space. In this space, the proximity between vectors reflects the similarity between the objects they represent. These vector embeddings are generated by embedding models. Embedding models are machine learning models trained using large datasets and deep learning techniques. A trained embedding model can be used to map data from a high-dimensional space into a more compact, lower-dimensional vector space. There are a couple of options to leverage an embedding model. One option is to build an embedding model from scratch, or use an out-of-the-box embedding model as open source or as a paid API-based service. The last option is to fine-tune a predefined embedding model. We also talked about calculating a similarity metric. As the proximity between two vectors indicates their similarity, it can be used to search for similar data instead of trying to find an exact match. This property is highly useful for many use cases. One of the most common methods to measure similarity between vectors is cosine similarity. Cosine similarity measures the similarity between two vectors based on the angle between them, meaning it focuses on the direction of the vectors. The last topic was about performing a vector search.
Step number one: generate a list of vectors to hold data objects, using the relevant embedding model. In step number two, process a query request by generating a query vector and then calculating similarity metrics. In step number three, sort the vectors using that metric and filter relevant results based on some threshold range. That's a quick summary for this section. In the next lecture, you have a small quiz to test your knowledge. Good luck, and see you again in the next section. 10. S03L01 Introduction: Hi and welcome back. In the previous section, we uncovered the power of vectors, vector embeddings, and similarity metrics used for performing vector search. I guess you got the idea that vectorizing data into a list of numerical values is an essential step in almost every AI-based use case: taking unstructured data and converting the data to vectors to facilitate fast and flexible data search. This is performed by a component called an embedding model. We also saw a couple of examples, such as vectorizing a small number of images or sentences, which is nice. But in practical applications, the number of objects to vectorize can be much larger, like thousands, hundreds of thousands, and even millions. Just think how many products are listed on an ecommerce website like Amazon or eBay. If a recommendation system inside the website uses vectors and each vector represents the landing page of a product, it will be a challenge to manage those millions of vectors. It's all about scaling. Where should we store, manage, and search for those vectors created by embedding models? The perfect solution is to use specialized database technologies called vector databases. Those technologies have the ability to store, index, and search high-dimensional data points, delivering the speed and performance needed to drive artificial intelligence use cases. That's the main topic in this section. Let's start. 11. S03L02 Structured and Unstructured Data: Hi, and welcome back.
As you probably know, there are two main types of data objects: structured and unstructured. Structured data has a specific predefined structure. In many use cases, structured data can be organized in tabular form with rows and columns. It is typically managed in traditional relational databases accessed using SQL queries. Each column stores data types like numbers and strings. This structure is perfect for transactional data. Think about a bank application storing and accessing financial transactions. The data will be stored in rows and columns, like a table. Each row represents a specific transaction, and the columns represent different fields stored for each transaction. When searching for data, it is based on exact keyword matching or a set of criteria, meaning the search query will be compared against the values inside specific columns. For example: query all bank transactions for a person based on an ID number and between specific dates. We'll get a list of rows matching that search query. Traditional databases are great at storing and retrieving structured data. However, they are less flexible when dealing with the semantic and contextual nuances of unstructured data. Unstructured data like images, video files, audio files, text from a blog post, a product description in a PDF file, a question from a chat session, and more. Unstructured data has no specific predefined format. We can't organize this type of data, or the patterns in this data, into tables with rows and columns. It is high-dimensional, complex data. Let's consider a group of images as files. An image is unstructured data. There is no predefined format. There is no specific information that describes the content of each image. Maybe the name of the file can provide some basic information. Those image files will be stored in generic object storage solutions like Amazon S3 or Azure Blob Storage.
It is also possible to manually add metadata information attached to each image file that can help to search and retrieve those images. Metadata like category, creation date, tags, maybe the background color, the type of object inside, et cetera. We can use that extra information about each image to search those image files, but it is limited. This metadata does not capture the full patterns inside each picture. Let's say we would like to get all images similar to a specific image that we have. This image is a picture of a dog. Now, there are many types of dogs. How can we perform that search? Maybe some images were tagged with the keyword dog, or the main color, or the overall dog position. However, there are many types of dogs with different sizes, shapes, and colors. We would like to get images of dogs that are similar to the specific dog in our picture. Any manual metadata added to a picture will be very limited. What about using AI? AI models have the power to represent and capture patterns in unstructured data and translate them into vectors. We can take each image and process it using an image embedding model that will produce a vector representing the complete content inside. So for each image file, we'll also have a vector. The vector is not the image's raw data. It is a list of numbers that represent the patterns identified in that image by the embedding model. We can take our specific search image, translate it into a query vector, and then calculate a similarity score between the query vector and each of the other vectors. The list of vectors with the highest scores will be those similar images that we are looking for. That's great. We have an AI model that helps us identify patterns in unstructured data, and we can use the concept of vector search to find similar vectors. However, what if I get another image to search? I need to perform the same process all over again.
Generate a query vector of that new image, then generate vectors for all other images in my repository, calculate the similarity scores, and filter the relevant vectors. As you may guess, that's not an efficient method. It doesn't make sense to recalculate vectors for all content objects for each search query. It is more efficient to calculate those vectors one time, as a batch process, and store them somewhere. Then, when I get a query about something, all those vectors are ready to be searched. We just need to create a single vector based on the search query. So where should we store and handle those vectors? In production environments, the number of vectors created using vector embedding models can be huge. As a result, it becomes essential to manage those vectors in dedicated storage solutions optimized for vectors, which brings us to the topic of vector databases. Those vector databases are optimized to store, manage, and search any type of vector and any number of vectors. That's the topic of the next lecture. See you next. 12. S03L03 Vector DB: Hi, and welcome. In this lecture, we'll start to explore the key benefits of vector databases. As the name implies, a vector database is a specialized database designed to store, manage, and search vectors at scale. A vector database will provide the following core capabilities. The first one is managing vectors, meaning inserting, updating, or deleting vectors as the objects inside the vector database. Secondly, the ability to associate metadata with each vector, which is important because the vector itself is not the raw data, so it's a way to create the connection between the original raw data and the actual vector. We'll see that later on. The third core capability is the ability to find similar vectors based on a query vector and the metadata information attached to a vector. Now, what kind of vectors can we store and query with a vector database?
Well, it can be any vector, of any type and any size: a vector created by an image embedding model, a vector created by a text embedding model. It can be a vector with ten dimensions or 400 dimensions. From the vector database's perspective, it is just a list of numbers representing vectorized data. It is the perfect tool to handle vectorized data, meaning vectors of unstructured data created by vector embedding models. As you can imagine, this flexibility to store vectors related to completely different data objects makes this type of database highly adaptable to ongoing changes. You can easily add new data types across the application or change search requirements with minimal adjustments to the vector database. The power of vector databases is based on their ability to perform a vector search based on similarity metrics, with the goal of finding the closest data points in a high-dimensional vector space. Let me rephrase that important concept. Querying a vector database is different from querying a traditional database. Instead of searching for precise matches between identical vectors, a vector database uses similarity search to identify vectors that reside in close proximity to the given query vector. This fundamental approach of looking at proximity in space provides tremendous flexibility and efficiency that traditional search cannot match. And that's the main reason why it is popular for AI-based applications. AI-based applications handle highly complex data objects with highly complex patterns. It does not make sense to try to find identical objects. It is a more common scenario to search for similar objects, not identical objects. In many applications, the biggest challenge is to carefully balance speed and accuracy when handling a search query. How long will it take to search and find the right answer to a specific query?
On one side, the amount of data is growing all the time, and on the other side, there is a growing number of applications that are trying to leverage this pile of data. If I need to wait 5 minutes when searching for something in Google Search or ChatGPT, then that's going to be a problem. From that perspective, vector databases, coupled with the vector embeddings stored inside, are designed to provide the required capability to search over massive datasets of unstructured data with low latency, consistent query performance, and great accuracy. When using a vector database, you don't care if the object is an image, a video, or an audio file. Everything is vectorized. Everything is basically a list of numbers, which are vectors, and then the whole process of searching data is much simpler. Those are the key benefits of vector databases. In the next lecture, let's talk about the typical workflow when performing a vector search with a vector database. See you next. 13. S03L04 Vector Search Workflow: Hi, and welcome. In this lecture, I would like to take a vector database as a component and see how it will typically fit into a larger application. It is a generic diagram to understand the key concept. On the left side, we have a content repository that can hold any type of content, like text, images, PDF files, articles, web pages, video files, et cetera. That's the main application repository or data store of unstructured or structured data. On the right side, we have a vector database that is designed to store vector data. But vector data is not something that can just be made up. It is something that is generated via machine learning models as part of the embedding process. Therefore, in the middle of the diagram, we have an embedding model component. It can be a single model or multiple models, depending on the data types we have in the data store, the content repository. There are plenty of different embedding models for a variety of use cases.
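Putting the components of this diagram together, here is a toy end-to-end sketch. The `fake_embed` function is a hypothetical stand-in for a real embedding model, and the object IDs and URLs are made up; a real system would call a trained model and a production vector database:

```python
import math

def fake_embed(text):
    """Hypothetical stand-in for an embedding model: maps text to a tiny
    2-D vector by counting two hand-picked keywords."""
    return [float(text.count("dog")), float(text.count("cat"))]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Content repository: the raw objects (image descriptions as stand-ins).
repository = {
    "img-1": "a dog playing with a dog toy",
    "img-2": "a cat sleeping on a sofa",
}

# Preparation step: embed every object ONCE and store the vector together
# with metadata pointing back to the original object.
vector_db = [
    (fake_embed(text), {"id": obj_id, "url": f"https://repo.example/{obj_id}"})
    for obj_id, text in repository.items()
]

# Query step: embed the query with the SAME model, rank by similarity,
# and return the metadata of the best match.
query_vec = fake_embed("photo of a happy dog")
ranked = sorted(vector_db,
                key=lambda item: cosine_similarity(query_vec, item[0]),
                reverse=True)
best_meta = ranked[0][1]
print(best_meta["id"])  # the app then fetches this object from the repository
```

Note that the query result is metadata, not raw content: the application uses the stored reference to retrieve the original object from the content repository, exactly as described in the diagram.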
The type of contextual information that the embedding captures is a result of the type of model used and the data it was trained on. As a preparation step, we'll take each data object in our content repository and run it one time through the embedding model component to get a vector embedding as an output. Each vector embedding will be stored in the vector database, coupled with additional metadata information about that object. For example, if the object is an image, the metadata might include the image URL, where this image is stored, maybe the image description, and additional image tags, which can be used later on for filtering. It is important to emphasize that the vector embedding itself is not the actual image content. It is a numerical representation of the image's features, like its visual content, colors, textures, and overall theme. Secondly, vector databases are not optimized to store and manage the actual content objects of unstructured data. It is necessary to store the original data separately from the vectorized data. Therefore, we must store, as part of the vector's metadata, a reference to the data store holding the original content objects. That's the preparation step: generating vectors using vector embedding models and storing those vectors in a vector database. Now we are ready to process a specific query from an application, here on the left side, down below. This application will send a search query. For example, a search query can be an image file. This image file will be translated to a query vector using the same vector embedding model used to convert the complete repository during the preparation. This ensures that the query vector and the vector database content are in the same dimensional space, making it possible to measure similarity between the vectors. The next step is performed by the vector database itself. It will take the query vector and compare it with all other vectors stored in the database.
The most similar vectors to the query vector will be ranked in order of similarity. Finally, as part of the query result, the vector DB will typically return the associated metadata of those most similar vectors. The metadata will be used by the application to access the relevant image objects from the original content repository. That's the typical end-to-end workflow when integrating a vector database into an AI-based application. As you can imagine, there are many companies that provide a vector database solution. So in the next lecture, I would like to talk about the key factors to consider when planning to select a specific solution. 14. S03L05 Selecting a Vector Database: Hi, welcome back. In this lecture, I would like to review a couple of key factors to consider while choosing the most suitable vector database option for a specific project. The first dimension to consider is the deployment model. It can be divided into two main options. The first one is a fully managed vector database over the cloud. The vector database is encapsulated as a cloud service that we can use and pay for based on consumption. Someone else is taking care of all the required configuration and infrastructure while running the database. Just keep in mind that those fully managed solutions are associated with a price tag, so make sure that it is aligned with your budget and project requirements. Another thing to check is the supported cloud providers. If your company is mainly running on Amazon AWS, it makes sense to check that this fully managed service is supported in their marketplace. The second deployment option is self-hosted. It can be an open-source or maybe a proprietary solution that we can deploy on our own servers. And there are many great open-source database solutions. Some of them are available as self-hosted open source and also as fully managed cloud services. You can see different popular names right over here. That's about the deployment model.
The next factor to consider is integration. A vector database is one piece in a complete solution. It must be successfully integrated with other components in an end-to-end application. Therefore, it is important to check and verify the compatibility and support with popular machine learning frameworks and tools. So consider the available SDK extensions and APIs for the integrations that you would like to use. The next dimension to consider is how developer friendly the vector database solution is. Many companies claim to be developer friendly, but their website is not aligned with that message. Can you find a solid documentation center in case you need some help? Maybe the solution is technically great but poorly documented, making it hard to find specific information. It is recommended to open the vendor's website and try to find a dedicated section for developers, like a knowledge base, documentation, tutorials, access to a community center, code examples, maybe training sessions. More mature solutions will have the required end-to-end package for developers. Most probably, companies that provide a vector database as a paid solution will put more resources into developer-friendly resources as part of their marketing strategy. The next one is metadata filtering. Do you remember that a vector stored in the vector database will be associated with metadata? It contains extra information about the vector and can be used for filtering that combines the vector search with additional filtering options using that metadata information. From that perspective, it is important to check what data types can be added as metadata information to facilitate the required search. Metadata is commonly added and managed in a JSON structure with key-value pairs. The value inside the key-value pair will be limited to specific data types, like numbers, strings, booleans, or a list of strings. This is something to check.
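As an illustration of JSON-style metadata and a pre-filter on it, here is a small sketch. The field names and values are hypothetical, not taken from any specific vector database:

```python
# Hypothetical metadata record for one stored vector, using JSON-style
# key-value pairs with the typical supported value types.
metadata = {
    "url": "https://repo.example/images/dog-42.jpg",  # string
    "width": 1024,                                    # number
    "is_public": True,                                # boolean
    "tags": ["dog", "outdoor"],                       # list of strings
}

def matches(meta, required_tag=None, public_only=False):
    """Toy metadata filter; real vector databases combine a filter like
    this with the similarity search itself."""
    if public_only and not meta.get("is_public", False):
        return False
    if required_tag is not None and required_tag not in meta.get("tags", []):
        return False
    return True

print(matches(metadata, required_tag="dog", public_only=True))   # True
print(matches(metadata, required_tag="indoor"))                  # False
```

In practice, the filter narrows the candidate set, and the similarity search ranks only the vectors whose metadata passed.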
And the last one is enterprise readiness. The last dimension to consider is the level of alignment to support enterprise applications. Eventually, the developed application will run in production and must follow certain market benchmarks related to security, performance, and uptime. Imagine that the recommendation system on Amazon stops working because the vector database is under maintenance. It will quickly translate into substantial revenue loss. It is important to check if the vector database solution is aligned to enterprise grade, especially if it's a cloud-based solution: checking the supported privacy and security features, declared security certifications, performance metrics, and the supported service level agreements. Those are the key factors to consider when selecting a vector database for a specific project. I guess it will be just fine to start with an open-source solution during development, while minimizing cost, and later on make a decision when moving to production. 15. S03L06 Summary: Hi, and welcome to the last lecture in this section. Let's summarize the key points we have covered so far in this section. Again, it's a mind mapping summary. You can download the final version as a PDF file or the Xmind format if you would like to make your own changes. We started by dividing data into two main categories: structured and unstructured. Structured data has a specific predefined structure. One of the most common formats is using rows and columns, called tabular form. It is a perfect method to handle transactional data, where each row represents a specific transaction and each column represents a specific field in a row. This type of data is commonly stored and managed in traditional relational databases. These databases use SQL queries to retrieve data based on exact keyword matching or specific criteria, making them effective for precise structured queries.
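For contrast with similarity search, an exact-match query against structured data might look like this. The schema and rows are toy values, using Python's built-in sqlite3 module as a stand-in for any relational database:

```python
import sqlite3

# Toy transactions table: each row is one transaction, each column one field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (person_id TEXT, amount REAL, tx_date TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("id-1", 50.0, "2024-01-10"),
     ("id-1", 20.0, "2024-03-05"),
     ("id-2", 99.0, "2024-02-01")],
)

# Exact-match criteria: one person's transactions between specific dates.
rows = conn.execute(
    "SELECT amount FROM transactions "
    "WHERE person_id = ? AND tx_date BETWEEN ? AND ?",
    ("id-1", "2024-01-01", "2024-02-28"),
).fetchall()
print(rows)  # only rows matching the criteria exactly
```

A row either matches the criteria or it doesn't; there is no notion of "close enough," which is exactly the gap that vector search fills for unstructured data.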
However, they fall short when handling unstructured data, such as images, video, audio, and text, which lacks a predefined format and cannot be easily organized into tables or simple groups. To make unstructured data more searchable, metadata like tags or categories can be added, but this approach is limited and does not capture the full complexity of all the patterns in the raw data. AI-powered vector embedding models provide a solution to these limitations by converting unstructured data into numerical vectors that capture patterns and semantic content. Those vectors can be used to facilitate a vector search. When converting a large number of unstructured data objects into vectors using embedding models, it is more efficient to store those vectors in a vector database. Vector databases are optimized for storing, managing, and searching vectors at scale. They are specialized tools designed to handle vectorized data, offering key functionalities such as inserting, updating, deleting, and associating metadata with vectors. Their core feature is the ability to perform similarity search, identifying vectors with close proximity to a query vector within a high-dimensional space. One of the most significant advantages of vector databases is their ability to maintain consistent performance at scale and to perform a search query over high-dimensional data with much lower latency compared to traditional databases. That makes them a perfect match for AI-based applications that must access unstructured data at scale. We also covered the typical workflow of a vector search while using a vector database, exploring how a vector database fits into a larger application ecosystem. Such a system architecture will include the following key components: a content repository that stores the raw unstructured data. It's usually an object storage type of solution.
Then we have an embedding model that generates a vector embedding, a vector database to store those generated vectors, and an application business logic that triggers a search query. There are three main steps in that workflow: vectorizing all data objects as a preparation step, and then handling a query request by translating that query to a query vector and searching the vector database for similar items to be provided as a query result. Finally, we talked about key factors to consider while selecting a vector database. First, the deployment model, which can be a fully managed service or a self-hosted solution; checking the integration capabilities with machine learning frameworks, tools, and APIs to ensure smooth implementation; evaluating the quality of documentation and the overall package to support developers; the ability to add flexible metadata to vectors to facilitate advanced search capabilities. And the last factor to consider is the supported features for an enterprise-grade solution: things related to security, performance, and uptime. That's the summary for this section. You are more than welcome to test your knowledge with the following quiz. In the next section, I would like to cover popular market use cases with vector databases. See you again. 16. S04L01 Introduction: Hi, and welcome back. In the previous sections, we managed to cover the foundations and key principles of vectors: vector dimensions, vector embedding, similarity scores, vector search, and also vector databases. In this section, I would like to use this knowledge and review the most popular use cases that can leverage vector databases. We'll talk about semantic search, recommendation systems, retrieval-augmented generation, anomaly detection, and visual search. It is going to be interesting. Let's start. 17. S04L02 Semantic Search: Hi and welcome. Searching for information while typing some keywords is probably the most popular required feature of many software applications and websites.
As an example, let's consider a website that provides access to a comprehensive library of books. Each book has a product landing page that summarizes the book's content and includes metadata information about the book: title, subtitle, main category, et cetera. A visitor can type some keywords in a search bar, and based on those keywords, a search engine will bring up the most relevant content, meaning a list of books. How does the search engine bring relevant content to the user who is searching for something? Well, there are two main methods. The first one is called keyword search. It is the traditional method of information retrieval that relies on matching specific keywords or phrases. It is mainly based on exact matching of words, like the title of a book, the author's name, tags associated with the book, and more. When a visitor types a phrase like "romantic novels," the search engine looks for exact matches for "romantic" and "novels," two separate words, to see if they appear in the title, the subtitle, or maybe the category metadata. The method is fast, straightforward, and simple to implement. It will work quite well with data objects that have structured metadata. However, it has a couple of limitations. The visitor sometimes does not use the exact keywords in the search query, or the related search terms are not part of the metadata of the items we are trying to search. For example, searching for "love stories" may not match books labeled "romantic fiction." They are not the same words. Another limitation is when the search query is highly complex, like using a full sentence to ask something. Breaking the query into separate keywords to perform the search misses the context and the semantic meaning of the full query sentence. For example, visitors may search for books in many ways. It can be by topic, like saying "machine learning basics"; by intent, like "books to improve leadership skills"; or by concept, like "romantic stories about second chances."
So there is a great limitation when using keyword search. For example, a user types the following search query: "What are some good books to read about artificial intelligence?" A keyword search might return results for books containing the keywords "artificial intelligence" or "AI," but it might miss books that discuss related topics like machine learning, neural networks, generative AI, or data science. All those words have semantically related meanings, but the search engine might miss them. Another example: "I'm looking for a book that explains quantum computing in simple terms." In that case, a keyword search might return results for books with the keyword "quantum computing," but it might miss books that use simple analogies, metaphors, or less technical language to explain complex concepts like quantum computing. If the book's metadata information, like the title, subtitle, or tags, does not specifically mention that a book is for beginners, then the search engine will not be able to find such correlations. All those limitations bring us to the second method for searching data, called semantic search. Semantic search focuses on understanding the meaning behind the user's query and matching it to the related content, even if the exact keywords are not being used. Using semantic search, the search engine will take the complete user query and transform it, using an embedding model, into a query vector. This transformation captures the complete semantic meaning of the text inside the query. Next, it will take the query vector and search for the top similar vectors stored in the vector database. Each vector stored in the vector database represents a book. So instead of trying to find exact keyword matches, the vector database will return a list of similar vectors that have close proximity to the query vector. As you can imagine, it is a much more flexible method, as the user does not have to use exact keywords while searching for something.
Secondly, it can better encapsulate the full semantic meaning of the query using the output of the embedding model. And lastly, it can produce more accurate and relevant results compared to the traditional keyword search approach. Let's review a couple of examples. "Books on leadership for small teams": a semantic search can understand the intent of getting books about managing startups, even if the metadata does not contain the words "small teams." Looking back at the previous example, "What are some good books to read about artificial intelligence?", a semantic search engine would understand the intent behind the query and may return results that are relevant to artificial intelligence as an umbrella topic. For example, it may suggest books on machine learning and deep learning that are closely related to artificial intelligence.

As a quick summary, unlike keyword-based search, semantic search uses the full meaning of the search query by vectorizing the query. It can find relevant results even if the keywords used in the search query are not identical to the keywords that describe the required content. This search method is useful for helping visitors explore and discover content more easily. All this magic works by combining the power of AI models to generate vector embeddings with the power of vector databases to store all those vectors.

18. S04L03 Recommendation Systems: The next popular use case for vector databases is building more effective personal recommendation systems. A recommendation system is a core component in many software applications, helping users better navigate between available options, discover relevant content, and get personalized suggestions. It is used in many use cases, like e-commerce for product recommendations, streaming services for content recommendations, social media platforms for new-connection recommendations, and more. Recommendation systems are everywhere.
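To make the semantic search flow concrete, here is a minimal sketch in Python. The three-dimensional "embeddings" and book titles below are made-up toy values (real embedding models emit hundreds or thousands of dimensions, and a real vector database would use an approximate index instead of a brute-force loop), but the ranking logic is the same: embed the query, score every stored vector by cosine similarity, and return the closest items.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, book_vectors, top_k=3):
    """Rank stored book vectors by cosine similarity to the query vector."""
    scores = [(title, cosine_similarity(query_vec, vec))
              for title, vec in book_vectors.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy 3-dimensional "embeddings" standing in for a real embedding model's output.
books = {
    "Intro to Machine Learning":  [0.9, 0.1, 0.0],
    "Neural Networks Explained":  [0.8, 0.2, 0.1],
    "Romantic Fiction Anthology": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # e.g. the embedded query "good books about AI"
print(semantic_search(query, books, top_k=2))
```

Note how the AI-related books score close to the query even though no keyword matching happens anywhere; proximity in the embedding space is the only signal.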
A straightforward example of a recommendation system is when we are searching for products on an e-commerce website, and the website is trying to figure out which products we are most likely to purchase. It can leverage our personal data coupled with what we type in the search query, what we click on, or the actions we perform. All those signals are being used. Another example of a recommendation system is when we watch a movie or TV show on a streaming service, and at the end of the movie, it tries to recommend other similar content to watch.

Building a fully functional recommendation system in a production environment is probably not a simple task. There are many things to consider, like how to handle large-scale data, which data sources will be used, selecting the relevant algorithms to analyze the data, and more. The idea is to provide a seamless experience to the end user, and it's not simple. Users are expecting to get more relevant, highly focused recommendations. Traditionally, recommendation systems are based on two main methods: collaborative filtering and content-based filtering. Let's break down those two options.

Collaborative filtering is a method based on user interaction data, such as ratings, purchases, clicks, and actions. The recommendation system is constantly collecting all kinds of data, like ratings and feedback about items from end users. For example, a user just gave positive feedback on a specific movie or product. This rating will be stored on that item as part of a large body of collaborative knowledge. Then the recommendation system can go in two directions. The first direction is to find users similar to the target user based on their interactions, and recommend items which are liked by those similar users. Think about the sentence you can see, for example, in a streaming service: "people like you also watched." It means the system is looking at what other similar people are doing, meaning people similar to the target user.
The other direction is to find items similar to the ones the user has interacted with, recommending items that are usually purchased or rated together. It's like "people who bought this item also bought other items." For example, I am browsing specific products on the Amazon website, and the system will recommend items that are frequently visited or purchased by similar people. That's collaborative filtering.

The next method is called content-based filtering. Content-based filtering relies on attributes or metadata, such as title, description, categories, tags, et cetera. Similarity is determined by comparing the features of items, meaning the system is not looking at what other users are doing; it is looking at what the target user is doing. The main focus for generating a recommendation is the user's interactions and the attributes of specific items. For example, a streaming service like Netflix will say: "you watched an action movie, here are more action movies."

Now, there are, of course, more methods being used in recommendation systems. Assuming one or two of them are being used, the next question is how a recommendation system is able to digest and analyze the data to find the required patterns, and then recommend something. As you may guess, the secret sauce for such systems is the combination of AI models and vector databases. The AI models are smart algorithms which are able to take a variety of data types and generate an embedding vector for each item that encapsulates the identified patterns inside. It is the process of vectorizing data. For example, in a streaming service like Netflix, a variety of data can be vectorized using embedding models: content metadata about movies, TV shows, or episodes can be converted into embeddings using a text embedding model. Now all those vectors will be stored in a vector database, where similar items or similar users are clustered together.
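A common way to feed item metadata into a text embedding model is to flatten all the fields into one string first. The sketch below shows only that preparation step, with a made-up movie record and field names; the resulting text would then be passed to whatever embedding model the system uses (that call is not shown here, since it depends on the chosen provider).

```python
def item_to_text(item):
    """Flatten an item's metadata into one text for a text embedding model."""
    return " | ".join([
        item["title"],
        item["category"],
        item["release_date"],
        ", ".join(item["cast"]),
        ", ".join(item["tags"]),
    ])

# A hypothetical catalog entry (all values are illustrative).
movie = {
    "title": "The Heist",
    "category": "Action",
    "release_date": "2021-06-01",
    "cast": ["A. Actor", "B. Star"],
    "tags": ["thriller", "crime"],
}

text = item_to_text(movie)
print(text)
# In production, `text` would be sent to a text embedding model and the
# resulting vector stored in the vector database alongside the item's metadata.
```

Because the whole record becomes a single text, the resulting embedding captures the combined meaning of the title, category, cast, and tags at once, which is exactly what lets similar items cluster together in the vector space.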
The second step, called similarity search, happens when the system should recommend something. It can be triggered when the user types something in the search bar, or just finished watching something, or visits some location on the website. The search query or user action will be translated into a vector that will be used to search for similar items inside the vector database. Finally, the last step is called recommendation generation. Based on the similarity scores calculated by the vector database, the engine can generate personalized recommendations for each user. As a nice example, referring to the content metadata, one practical approach is to take all the metadata information about an item as one big text, feed it into an LLM (a large language model), and get a single vector that represents the semantic meaning of the complete metadata about that specific item: taking a movie title, plus category, plus release date, plus cast, plus tags, putting it all in one long text, and then generating an embedding. It's a great option to leverage those powerful AI models and make the recommendation system much more sophisticated while helping to uncover patterns in different dimensions.

19. S04L04 RAG: Hi, and welcome back. Our next use case is becoming very popular with the amazing rise of generative AI as a general-purpose technology that can be used across many use cases. I think this use case is like the low-hanging fruit of using vector databases for generative AI. Let's start. When describing an AI model, I like to use the metaphor of a simple box. We have an input and an output. The box is doing something, hopefully something useful. As an example, I mentioned during this training AI models that are used for embedding. The input is unstructured data, like an image file, and the output is a vector, a sequence of numbers. The AI model is doing some magic inside and performing that transformation. We have an input and an output, and we don't care how this is done inside the box.
One type of AI model that is extremely popular is the large language model, or in short, LLM. Large language models take text as input, called a prompt, and generate text as output, called a completion. Popular chat-based services like Google Gemini, ChatGPT, and Microsoft Copilot are based on LLMs, and the results are amazing. These LLMs managed to bridge the communication gap between humans and machines. Suddenly, machines can understand and generate human language and are able to answer questions on a variety of topics. Can anyone train and create such models? Well, it's a good question. Those LLMs are trained on a massive, unbelievable amount of public data using state-of-the-art, highly expensive hardware technologies. It means that in many practical applications, LLMs are building blocks; it is easier to use an LLM that was trained by someone else.

Can those LLMs fully fulfill all market use cases? Well, no, not everything is perfect in the world of LLMs. Those models have three main disadvantages. The first one is that they are trained up to a certain point in time. Any data created after that time is not part of the knowledge stored in the model. This leads to some strange situations in which the model produces unrealistic answers. It is always possible to retrain models with more recent data, but it's not a simple, quick task, and data is something that is constantly created. The second major disadvantage is that the data that was used to train the model is public data. This means that the model was not trained using private data. There are countless organizations that are storing private data. Think about a business company with data about customers, sales transactions, marketing data, and more; that data will not be available in an off-the-shelf LLM. The last disadvantage is that those popular LLMs are generic models, with the target of handling a variety of tasks.
They are trained with a wide range of data, providing them with broad knowledge on many topics, but with gaps in specific areas. It's not possible to train a single model with all data about all topics. Therefore, the model's knowledge is not optimized to handle domain-specific tasks. One option is to try to find domain-specific LLMs, like LLMs that are mainly trained on, for example, legal data or medical data. It's a good option to consider. All those disadvantages lead to the conclusion that an LLM is one part of the puzzle. If we would like to leverage the full potential of LLMs, they should not be used in isolation. We need some ecosystem built around LLMs to make the most of those amazing models.

What is the solution to handle or bridge that knowledge gap of off-the-shelf LLMs? Well, let's augment an LLM with additional extra data stored in external data sources. That's the essence of retrieval-augmented generation, or in short, RAG. Retrieval-augmented generation is an AI framework for combining the power of pre-trained large language models with the ability to retrieve additional relevant information from external knowledge data sources, and add that contextual information directly into the prompts feeding the LLMs. This approach allows us to bring in specialized knowledge as additional context without the need for extensively retraining those models.

Let's take, for example, a customer support chatbot that we see on many websites. We type a question as a prompt, and the chatbot tries to figure out what information we will need to answer that question. Assuming our AI-based chatbot is using an out-of-the-box LLM, it can leverage the knowledge stored in that LLM. However, in most cases, it's not enough. A generic LLM will not know how to answer any question that involves specific information related to the company's knowledge base. For example, here, the user is getting a specific error number while trying to install software on Windows version 11.
The chatbot must be connected somehow to the website's private knowledge base. Let's assume the knowledge base is a large collection of many articles, including tutorials, user guides, training materials, product descriptions, and so on. How can we make that connection? That's the RAG framework. As part of the preparation, all knowledge-base items inside that website will be vectorized: taking each article, each image, each video, each audio file, each product description, and processing it with an embedding model to get a vector (of course, a group of many vectors overall). All those vectors will be stored in the vector database.

Now let's describe the step-by-step scenario. Step number one: a visitor types a question as a text prompt inside the chatbot. Step number two is called query processing: converting the user query into a vector embedding, capturing the meaning of the question. The input is the original query and the output is a vector. Step number three is performing a similarity search. That's where we leverage the vector database. A vector database is optimized to efficiently find the best-matching vectors, so in our case, the vector database is queried to find the most similar vectors to the query vector. Let's assume the most similar vector was returned by the vector database, including the metadata of that vector. This vector represents maybe a specific article or a specific section inside the knowledge base that can answer the question. It can be, for example, a specific section in a product user guide explaining how to perform something. Step number four: the chatbot application can take the vector metadata, returned as part of the query result, and retrieve the extra data from the private data sources, like a specific product installation guide. That's the extra data. Step number five is called contextual enhancement.
The extra text data from the best selected article will be coupled together with the original user query and sent together as a single input prompt to the LLM. Okay, we would like to leverage the power of the LLM; we just add extra information. This step enhances the user prompt with additional contextual data extracted in real time from a knowledge base, providing additional context to the LLM. And the last step, step number six, is called response generation: the LLM generates a response based on the original query and the additional context extracted from the private data sources. As you can imagine, this RAG framework helps AI-based applications leverage, on the fly, additional knowledge bases and data sources coupled with commercial off-the-shelf LLMs that are used as building blocks. It is basically a cheaper option compared to trying to retrain such models. This end-to-end process is performed in real time. Vector databases play a key role in the RAG framework: they provide the required search infrastructure to rapidly retrieve that extra data, this contextual information, based on the user prompts.

20. S04L05 Anomaly Detection: Hi and welcome. Our next use case for vector databases is anomaly detection. Anomaly detection, also called outlier detection, is the identification of observations, events, patterns, or data points that deviate from what is usual, standard, or expected, making them inconsistent with the rest of the dataset. It is used across many domains, like fraud detection for financial transactions while analyzing user behavior or payment patterns. It can be used to identify cybersecurity events, identifying dangerous activities like unauthorized access or maybe data breaches. It can be used for predictive maintenance, performing military intelligence analysis or business intelligence analysis, and much more. It is a very popular use case in many practical applications that need to find patterns in data.
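The six RAG steps described above can be sketched end to end in a few lines of Python. Everything here is a toy stand-in: the `embed` function is a trivial character-frequency vector (a real system would call an embedding model), the `vector_db` dict stands in for a real vector database, the knowledge-base articles are invented, and the final LLM call is only indicated by a comment. The point is the shape of the flow, not the components.

```python
import math

def embed(text):
    """Toy stand-in for an embedding model: a 26-dim character-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Preparation: vectorize the (invented) knowledge-base articles.
knowledge_base = {
    "kb-101": "How to fix installation error 1603 on Windows 11",
    "kb-205": "Resetting your account password",
}
vector_db = {doc_id: embed(text) for doc_id, text in knowledge_base.items()}

def rag_answer(user_query):
    # Steps 1-2: query processing - embed the user's question.
    query_vec = embed(user_query)
    # Step 3: similarity search against the stored vectors.
    best_id = max(vector_db, key=lambda d: cosine(query_vec, vector_db[d]))
    # Step 4: retrieve the raw article behind the best-matching vector.
    context = knowledge_base[best_id]
    # Step 5: contextual enhancement - build the augmented prompt.
    prompt = f"Context: {context}\n\nQuestion: {user_query}"
    # Step 6: response generation - `prompt` would now be sent to an LLM.
    return best_id, prompt

doc_id, prompt = rag_answer("I get error 1603 installing on Windows 11")
print(doc_id)
```

Swapping the toy `embed` for a real embedding model and the dict for a real vector database is what turns this sketch into the production flow the lecture describes.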
However, efficiently identifying these anomalies in high-dimensional data is a complex task. It is not always simple to identify anomalies. In many cases, it requires scanning large amounts of unstructured data and trying to find the relevant patterns inside. Another challenge is finding small but important patterns, in a very noisy environment, that can signal something very important. Think about a small financial transaction performed by some organization on a specific date, going to a specific destination. With the right tool, we can mark that small transaction, together with other small signals, as a strong indication of something much bigger. As you may guess, it makes a lot of sense to leverage AI models that will vectorize the data: taking high-dimensional data like images, articles, or social media posts, vectorizing it using embedding models, and letting a vector database search for anomalies. In that case, instead of looking for similar items, it will search for vectors that are less similar to other vectors, which can be a strong signal for finding potential outliers or unusual data points.

Going back to the example of financial transactions for fraud prevention, all transactions performed by a person can be vectorized. When a new activity is performed on the same bank account, this event, together with all available historical events, will be vectorized. For example, I just logged into my bank account from another location. If I'm going abroad, for example, two or three times a month, then this piece of information might not be strong enough to indicate an anomaly. But if I just tried to withdraw 5K from an ATM, and that is something that has never happened before, then it may increase the distance of that event's vector from the regular, normal transactions that I'm doing in my account. It's all about using those embedding models to automatically find those patterns.
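The "less similar to everything else" idea can be sketched directly: score each vector by its average cosine similarity to all the others, and flag the one with the lowest score as the most anomalous. The four transaction embeddings below are made-up toy values standing in for the output of a real embedding model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_anomalous(vectors):
    """Return the id whose vector is least similar, on average, to the rest."""
    avg_sim = {}
    for vid, vec in vectors.items():
        others = [cosine(vec, other)
                  for oid, other in vectors.items() if oid != vid]
        avg_sim[vid] = sum(others) / len(others)
    return min(avg_sim, key=avg_sim.get)

# Toy transaction embeddings: three routine events and one unusual one.
transactions = {
    "tx1": [0.90, 0.10, 0.00],
    "tx2": [0.80, 0.20, 0.10],
    "tx3": [0.85, 0.15, 0.05],
    "tx4": [0.00, 0.10, 0.95],  # e.g. a large ATM withdrawal abroad
}
print(most_anomalous(transactions))
```

In a real deployment the pairwise loop would be replaced by the vector database's indexed nearest-neighbor queries, but the decision rule stays the same: an event far from its account's normal cluster is the outlier worth investigating.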
So when combined with vector databases, anomaly detection becomes a powerful and efficient way to handle high-dimensional data in real time across diverse use cases.

21. S04L06 Visual Search: Hi, and welcome back. The last use case I would like to discuss is visual search. It is a use case we saw as an example in the previous section when I presented the concept of vector databases. Now I would like to frame it as another use case. One popular option for organizing and then fetching images in a large repository is adding metadata about the content of each item, like object type, color, action, category, and more. This metadata can be used to match a keyword search query with the relevant content items. However, there are use cases where the search query is not a keyword search: the user is providing an image as input, like the following image of a dog. It's a visual search. Think about a mobile app like Google Lens, where you can take a picture of something and it can provide you with similar images. It is becoming a popular use case for e-commerce websites, where you can upload an image of a product for product discovery or style matching for a particular outfit. Medical images can be analyzed to identify potential health conditions or diseases. In real estate, it can be used to upload a photo to find similar properties for sale or rent. In manufacturing, it can be used to identify spare parts or tools from images to simplify procurement processes. And more; we can probably find hundreds of use cases for visual search. As you may guess by now, vector databases are extremely useful for visual search. There are dedicated AI models that are used to extract patterns from visual elements, capturing the essence of the content, including shapes, colors, textures, movement patterns, sequences of objects, objects inside the image, and more. The extracted vectors can then be stored in a vector database along with metadata about the image.
So each vector is associated with the corresponding raw data. When a user submits a query image, it is also converted into a vector embedding; then the vector database will perform a similarity search to find the most similar images. That's about the top five most popular use cases for vector databases. In the next and last section, we will summarize the complete training. See you there.

22. S05L01 Let's Recap: Hi and welcome to our last section. At this point, I would like to quickly recap the key terms and topics we covered until this point. We started by defining AI, artificial intelligence, as the human desire to create a digital brain and mimic human intelligence, so machines can perform more and more complex tasks. Those AI brains are commonly used as components in larger applications, and therefore it is useful to present them using a simple analogy of a box with input and output. Machine learning algorithms, coupled with a massive amount of data, are used to train those AI brains. This process is called training, and the output is a trained model. To be able to digest and handle more complex patterns, deep learning methods are used with artificial neural networks that are inspired by the human brain. Finally, generative AI added the important capability to analyze text as a language and to generate creative content. Next, we learned the basic concept of a vector with magnitude and direction. That's the visual representation of a two-dimensional vector. When dealing with high-dimensional vectors, like seven dimensions or 700 dimensions, it makes more sense to present them using a list of numbers. Those numbers represent the endpoint of the vector arrow in space. Then we talked about how objects are described in data science using a list of features. Those features can be placed in a feature vector, each data point of a specific object in one feature vector.
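As a small illustration of the feature-vector idea, here is a sketch with hand-picked features and invented numbers (the features and values are purely illustrative, not from any real dataset): each dog becomes one vector, and similar dogs end up close together in the feature space, which a simple Euclidean distance makes visible.

```python
import math

# A manually chosen feature vector: [weight_kg, height_cm, ear_length_cm].
dog_a = [30.0, 55.0, 10.0]  # e.g. a Labrador
dog_b = [32.0, 58.0, 11.0]  # a similar large dog
dog_c = [4.0, 20.0, 6.0]    # e.g. a Chihuahua

def euclidean(a, b):
    """Straight-line distance between two points in the feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Similar objects sit close together; dissimilar ones sit far apart.
print(euclidean(dog_a, dog_b))  # small distance
print(euclidean(dog_a, dog_c))  # much larger distance
```

With unstructured data like images, an embedding model produces these feature vectors automatically instead of a human choosing the features, but the geometric intuition of "close means similar" is exactly the same.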
In simple cases, it is possible to manually select a group of features describing an object type based on expertise and knowledge about a topic. However, when dealing with unstructured data, it is more efficient and more practical to automate that process by using the power of AI models. This process is called vector embedding: taking unstructured data and converting it into a feature vector as a list of numbers. The feature vector is not a replacement for the raw data; this vector captures the required patterns in the input raw data, as identified by the AI model. There are many types of AI models for a variety of use cases. The models that are used for embeddings are specifically called embedding models, and there are three options to consider when implementing a solution with an embedding model. Option number one: building a model from scratch. Option number two: using an out-of-the-box solution, which can be open source or maybe a cloud-based service accessed via APIs. Option number three: fine-tuning an existing model with domain-specific data.

One of the key properties of those embedding vectors is their relative position in the embedding space. The proximity of two vectors indicates their similarity. That's something that is extremely useful for performing a vector search, comparing a query vector to all other vectors, or for clustering, trying to find meaningful groups. The most popular method to measure similarity between vectors is the cosine similarity metric. This metric measures the angle between two vectors, indicating whether they are pointing in similar directions or not. When performing a vector search, we measure this metric between the query vector and all other vectors, sort the results in descending order, and then filter the top similar vectors. Next, we talked about vector databases, comparing the two main data types: structured and unstructured data. Structured data has a predefined structure, like a tabular form of rows and columns.
This type of data is commonly managed by traditional relational databases. Data search is based on exact keyword matching or using a set of criteria. On the other side, we have unstructured data, without a predefined structure. It is usually complex data, and the content of this data is commonly described using metadata. Searching for data using metadata is a reasonable solution for simple cases; however, it's quite limited. We mentioned the challenge of finding similar images, like similar dogs, based on metadata. The solution for this challenge is to utilize AI models called embedding models. In this example, we take all images and translate them into vectors, including the query image, and then perform a vector search. In practical market use cases, the number of vectors generated by embedding models can easily scale up very fast, and in that case, it is useful to consider using a dedicated vector database solution to better handle those vectors. I mentioned the typical vector search workflow while using a vector database: starting with the preparation step of vectorizing all data in the content repository and saving all vectors, coupled with metadata information, inside the vector database. In case the application is triggering a query request, the query will be vectorized and then searched inside the vector database. Finally, the most similar items will be returned as an answer to the application.

When selecting a vector database solution, we should consider a couple of things. What is the preferred deployment option? Some companies would like to reduce overhead, focusing on their core application, so they may select fully managed database solutions, while others that are looking for much more control may consider self-hosted solutions. We should check the integration options, like supported extensions and APIs.
We should also try to evaluate how good the documentation is, in case we need some help, and check the supported metadata filtering options, which are extremely useful while working with vector databases. In case we select a fully managed database, it is important to check how well the service is aligned with enterprise requirements related to security, performance, and SLAs (service level agreements).

The third part of this training was dedicated to reviewing the top market use cases for vector databases. The first use case was semantic search. There are two main methods to perform a search based on a text query. The first one is called keyword search, where we take the user search query, break it into keywords, and then try to find exact matches. It's a fast and simple search method; however, it has many limitations. By breaking the query into small keywords, we are missing the full semantic meaning of a search sentence. That's where semantic search bridges the gap. By vectorizing the search query into a vector, it is possible to capture the full semantic meaning of the query. Secondly, the search is much more flexible and accurate: instead of looking for exact matches, we are looking for similar items. The next use case was about recommendation systems. That's a core component in many applications to help users discover content and get more personalized suggestions. Recommendation systems use two main options: collaborative filtering and content-based filtering. The first one is based on user interaction data, all kinds of things that users are doing, and the second one is based on attributes of the items themselves, all kinds of metadata attached to different items. I mentioned a simplified architecture of a recommendation system where all those types of data are vectorized by a group of embedding models, and then the application can trigger a request to get relevant recommendations. The next use case was about the RAG framework.
Large language models are becoming very popular building blocks in many market applications. However, they have a couple of limitations. Therefore, one of the most popular options to increase their effectiveness is augmenting the input query for an LLM with extra data. That's where embedding models and vector databases come to help. The next use case was anomaly detection, used in many market use cases like finance, cybersecurity, healthcare, and many more. Here we can utilize embedding models and vectors to find outliers by looking at vectors that are not similar to other vectors. The last use case was about visual search: taking an object like an image or video, vectorizing the data, and then searching for similar items by the identified patterns inside, and not by using a text query. Again, vectors and vector databases are a great match for performing a visual search. That's the end of the story of our training.

23. S05L02 Thank You!: Amazing. You reached the last lecture, and that's great. I hope that you enjoyed the training and learned some interesting things along the way. You can download the mind-mapping summary files and consider visiting again to refresh your knowledge. It will be awesome if you could spend two minutes and rate the course inside the platform while sharing your personal experience. Each review is important. I also suggest sharing your achievement on social media like LinkedIn; you're more than welcome to tag my name, Idan Gabrieli. It will add another layer to your profile. That's it. Thanks again for joining, and I hope to see you again in other training courses. Bye bye and good luck.