Transcripts
1. S01L01 Welcome V2: Hi, and welcome to this
training on vector databases. My name is Idan and I
will be your teacher. We are in the middle of a groundbreaking AI revolution advancing at an incredible pace across almost every industry. Everywhere we look from
healthcare to finance, retail to government agencies, organizations are looking
for ways to implement AI to gain a competitive edge. One of the key
building blocks of AI technologies is the concept of vectorizing
data into vectors. Vectors play a critical
role in bridging the gap between unstructured
data and AI applications. Vector databases are
the core engines for managing and searching
vectors at scale, driving many useful applications
like semantic search, recommendation systems, and augmenting large
language models. In this training, we
are going to uncover the secrets of vectors and
vector databases step by step. I will do my best
to make this course simple, engaging, and enjoyable. Thanks for watching and I
hope to see you inside.
2. S02L01 Introduction: Hi, and welcome to this section. My name is Idan, and I'm excited to start
this training with you. In this section,
we will lay down the required foundation on
key concepts about vectors. We'll start by talking about AI, machine learning, deep
learning, and generative AI. Next step will be to define
the meaning of a vector from a mathematical perspective
using a very simple definition and how it is used
in data science. Then we'll be able to talk
about vector embeddings and how embedding models are
used in that process. And the last step will be to understand how those
generated vectors can be used to perform vector search using
similarity metrics. All right. Let's start.
3. S02L02 AI, ML, DL and Gen AI: Hi, and welcome back. Before jumping to the
concept of vectors, let's try to understand
the big picture and organize some of the key
terms related to AI. Starting with the basic
question, what is AI? We use the term so many times, and it means different
things to different people. Let's create some
common baseline. Artificial intelligence
is the practice of getting machines to
mimic human intelligence. It is a general
purpose technology that can be used
for many things, which makes it so popular. AI is the human desire to
create a digital brain, a brain that can mimic
human intelligence, so machines can perform
more complex tasks. Now, what is the meaning
of a complex task? Well, this definition
is constantly changing. A few years ago, it was a complex task to classify
an object in a picture. Now, it is a standard feature in many market applications. It's not exciting anymore. Just recently we started
to use generative AI to generate an image or a video
clip from text description. AI entered the world
of creativity. Practically speaking, we can
see how a growing number of different tasks that
were performed only by humans are now handled
by AI applications. AI keeps eating more
and more tasks that were once only
possible for humans. Today, it is still
complex to create a fully functional
humanoid robot, but probably ten,
20 years from now, it will become a
standard technology. There are many AIU cases where this AI brain is a component
in a larger application. It is a piece of a
much bigger puzzle. Therefore, it is useful to describe it in some
more simple terms. This AI brain is commonly represented as a simple
box with input and output. It is a simple analogy. We insert an input and using the knowledge stored
in that brain, it generates an output
like a magic box. The knowledge stored in that AI box is called
a trained model. There are many types
of AI boxes where each AI box can handle a
specific type of data like text, image, video, audio, et cetera. How do we create a trained model to be used inside an AI box? This is the knowledge
stored in that box. Well, by using the combination of machine learning
algorithms and data, those ML algorithms can scan
massive amounts of data, extract and learn
meaningful patterns, and store them in something
that is called a model. This process is called
training a model. In most cases, it will require substantial computing
resources coupled with huge training datasets, all supervised by a highly
skilled data science team. Machine learning is the
core methodology to create the knowledge
in AI boxes. In a growing number of AI cases, the patterns in the data are complex and require tools that can perform a deeper analysis. One subset group of machine learning algorithms
is called deep learning. Deep learning is
based on creating complex artificial
neural networks that can store and process
highly complex patterns. They managed to
take us much deeper into the ocean of data
to explore new things, handle more complex
patterns, and as a result, improve the ability of machine learning solutions to
handle more complex tasks. AI models are heavily
based on deep learning. The next step in that evolution
is called generative AI. GenAI introduced highly sophisticated AI models. Those models added the
important capability to analyze text as a language, breaking the
communication barrier between humans and machines and providing the
options to generate new content while simulating
human creativity. The famous CHAT GPT and many popular other tools
are based on generative AI. It is a general
purpose technology that can be leveraged
in many domains. For consumers, it is a set of new tools that can
boost productivity, and I guess it is something that you are
already using today. For developer,
generative AI is adding amazing new AI
models that can be integrated in many
software applications for a growing number of use cases. Where are we going from
here as a community with those crazy fast
breakthroughs in AI? Well, I don't know. It is hard to predict. One thing is quite clear. AI is here to stay, and it is opening a wide
range of new opportunities. We just need to be able to ride that AI wave in a smart way. Do you know how all those
AI models can digest, process and understand
the complexity of the outside
environment around us, like processing text, an image, a video file, or an audio? Well, it's all about vectors. Vectors form the foundation of how AI models
interact with data, and that's the topic of the
next lecture. See you next.
4. S02L03 Vectors: I am sure that you
have encountered the term vector at some
point in your studies. It is hard to avoid
those math sessions. Let's quickly refresh the
concept to get up to speed. A vector is a
mathematical object characterized by both
magnitude and direction. It is typically represented as an arrow pointing from one
point to another in space. The arrow has a starting
point and an endpoint. The length of the arrow
corresponds to the magnitude of the vector while its orientation in space indicates
the direction. As a simple example,
let's take a vector in two dimensional space and place the starting point
of the vector at (0,0). The head of the vector ends at a specific point with x1
and y1 values. This simple geometric
interpretation of a vector is widely used
in physics and mathematics. One of the most useful
methods to represent a vector in space is by using the values of the head point. It's like a point in space. This is the end
location of the arrow. In a simple two
dimensional space, a vector consists
of an X component, which is sitting on the
horizontal axis and a Y component sitting
on the vertical axis. So instead of visualizing a
vector as a line in space, we can represent it
mathematically by writing down its
components as numbers. For instance, if the X
component has a value of eight and the Y component
has a value of 12, the vector can be written as the list of numbers 8 and 12. The first number corresponds
to the first dimension, and the second number corresponds
to the second dimension. This approach works well for
two dimensional vectors. But what happens when we have vectors with more dimensions
like seven or 700? We can't easily draw or
visualize such vectors in space. Still, it is a vector, and it has a magnitude
and direction. The good news is that the
concept remains the same. The vector is represented as a list of dimensional
components, such as d1, d2, d3, d4, up to d7 for a seven-dimensional vector. The number of dimensions defines the dimensional
space of a specific vector. By presenting a vector as a list of numbers where each
number is one dimension, we can handle any vector size. That's it about vectors and how to present them using
a list of numbers. This basic understanding
of vectors forms the foundation of many concepts in data science
and machine learning. Let's now explore how
they are related. In data science, the concept of data points is essential for describing the
properties of an object. A data point represents a single observation or instance
about a specific object, and it is characterized
by a set of features. Here we have some example of objects and their
corresponding data points. For example, an animal, each data point may include
features such as the weight, height, number of legs, color, and maybe other features. And sensor reading in an IoT
device features may include the timestamp when the device send a specific sensor reading, the temperature, humidity,
pressure, and more. Each data point of an object is described by
a collection of features. These features are usually
manually selected and engineered by data
science practitioners based on use cases. For example, an engineer might decide that the
temperature reading from an IoT device is significant
and should be included as a feature in each data point generated by that IoT device. Let's take the first
object, an animal. It has several features such as weight, height, number of legs, and color. There are
many more features that can be used to
describe an animal. But let's assume that
someone selected those specific features
for some use case. These features are fixed across all animals in that dataset, meaning every data point is described using the
same set of features. If we check and examine one
data point for an animal, it may look like this: the
weight is 20 kilograms, the number of legs is four, and so on. One of the most
effective ways to handle the values of these features for each data point
is by using vectors. A vector is basically an
ordered list or sequence of numbers that represents
a single data point in a multidimensional space. For example, the vector for the animal's features will
be structured as follows. We have the weight, height, number of legs, color, and so on, if there
are more features. In data science,
this is referred to as a feature vector because it holds the features
of a specific object. Each number in the
vector corresponds to a particular feature
or an attribute. The first number
represents the weight, the second represents
the height, and so on. Now, why use vectors? Well, computers are incredibly efficient at processing numbers. By organizing data as numerical
values stored in vectors, we can leverage powerful
mathematical techniques to analyze and process
the data efficiently. These vectors, representing data in multiple dimensions, can also be thought of as arrows pointing in a
particular direction and magnitude in space. Each arrow is a specific vector. How is the list of features for a specific object selected? In data science, it is
called feature selection. Feature selection is
a critical step in the data preparation process
helping to ensure that the dataset can capture the most relevant information for the analysis or
model being used. The first option to perform feature selection is
manual selection, meaning someone with relevant
expertise and knowledge is thinking and considering which features will be relevant
for a specific use case. And you know what? It is a practical option
for many use cases. However, the world around
us is highly complex with objects and patterns that are not always
straightforward. Let's consider our example of a feature vector for
an animal object. Should we use the number of
legs as a feature or not? Do we need to use
other features? Imagine that we need to identify the facial properties of an animal based on
images as raw data. This introduces an additional
layer of complexity. Should we focus on
the overall size of the head or the
shape of the nose? Deciding which
features are relevant and how to extract them
is a challenging task, especially when dealing with complex patterns in
high dimensional data. In some cases, it doesn't make sense to manually
try to select them. Address this
challenge, we turn to the concept of
vector embeddings, an automated and powerful
method for extracting meaningful features from raw
data using the power of AI. This will be the focus
of our next lecture.
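As a side note for readers who like to follow along in code, the feature vector idea above can be sketched in a few lines of Python. The feature names and values here are hypothetical, chosen only to mirror the animal example from this lecture:

```python
# A toy feature vector for an animal data point (hypothetical features and values).
# Order matters: position 0 is always the weight, position 1 the height, and so on.
feature_names = ["weight_kg", "height_cm", "num_legs", "color_code"]

animal_a = [20.0, 45.0, 4, 2]  # one data point described by 4 features
animal_b = [4.5, 25.0, 4, 7]   # another data point, same 4 dimensions

# Every data point in the dataset uses the same fixed set of features,
# so every feature vector has the same number of dimensions.
assert len(animal_a) == len(animal_b) == len(feature_names)

# Each number in the vector corresponds to one named feature.
for name, value in zip(feature_names, animal_a):
    print(f"{name}: {value}")
```

The key property is the fixed, ordered layout: the meaning of each position is agreed upfront, which is exactly what manual feature selection produces.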
5. S02L04 Vector Embeddings: Hi, and welcome back. In
the previous lecture, we explored how
vectors are used in data science to hold a
list of numerical values, known as features that
describe specific objects. These feature vectors
organize data in a simple, structured manner
where each number
represents a measurable or
categorical attribute, such as weight, height, price, color, or country location. Feature vectors are
world scenarios, manually deciding which features are necessary to describe patterns and attributes in data is becoming
an inefficient option. This challenge is especially amplified when dealing
with unstructured data, data that lacks a
predefined format, such as text, images,
video, audio, or logs. These types of data require more advanced
techniques to extract meaningful features
that can better represent the required
patterns in the data. Let's consider a group
of pictures of dogs. It is unstructured data with no predefined format
or framework. Each picture can vary
widely in terms of colors, shapes, sizes, and
other attributes. It will be almost impossible
to manually decide which features can be used to describe the
content of the image. We need an automation
tool that will help us to extract meaningful features
from each picture. Fortunately, AI models
can perform this task. These models can take unstructured data
such as text, image, video or audio as input, and generate as an
output a feature vector, a structured list of numbers. These numbers
capture the patterns identified by the model
in the input raw data. This process of vectorizing the data is known as
vector embeddings. Let's define this
important concept. Vector embeddings
leverage trained AI models to analyze
complex patterns in high dimensional
unstructured data and transform them into a compact,
lower dimensional vector. This is a data
transformation box. The output is a feature vector
also called embeddings. Those AI models are
called embedding models. Embeddings play a pivotal
role in the field of artificial intelligence
because they offer a powerful technique to transform unstructured
data like text, images, video, or
user preferences into a more compact and structured
numerical representation inside vectors that machines
can easily process. The process of vectorizing
data also normalizes the data. The number of features that a specific embedding
model generates is fixed, meaning every vector produced by the model will
be the same size, the same number of dimensions. The numbers generated
by vector embeddings represent patterns identified in the high dimensional
space of the raw data. This output, known as vectorized data is the result of the process called
vectorizing data. It's bridge the gap
between the complex, messy patterns found
in the real world and the digital space
where everything is represented numerically
using numbers. In simple terms,
vector embeddings convert unstructured
information into a more organized and compact numerical
representation that captures the underlying patterns and relationships of the data. The machine learning
model used for the vector embedding process is called an embedding model. There are many types
of embedding models, each designed to handle
different kinds of data. Taking the image as an example, creating a vector embedding
of an image is like creating a digital
fingerprint of that image. It captures the essence of the
image in a numerical form. As a simple example, let's assume we have a list of images. We can take each image, feed it into a relevant
embedding model, and the model will produce for each picture a vector
embedding with numbers. Let's say for the sake of
simplicity that our model is generating just ten
features for each image. What is the meaning
of those numbers inside a specific vector? Well, it is important to understand that the
list of numbers inside each vector is not
directly related to a specific simple feature
like height or color, as we saw with manually
selected feature vectors. The numbers in a vector embedding don't usually
have standalone meaning. We should not try to take a specific number
from a vector created using vector embeddings and try to figure out the
meaning of that number. It is the combination of all numbers that defines the
vector's position in space. Let's see a simplified
version of that concept. We can see here two vectors, vector A and vector B. Each vector has a
relative position in the embedding space. This position can be used to evaluate similarity
between vectors. The proximity of two vectors in the embedding space
reflects their similarity. Similar items will have vectors that are closer together, with a smaller distance between them. For example, two pictures of dogs will have vectors that
are close to each other. It is an extremely useful
property we can utilize for many use cases like performing a visual
search using images, finding outliers, clustering
similar objects, and more. As you can imagine,
all this magic of vectorizing data is possible
using an embedding model. That's the main brain. In that case, how should we
select an embedding model? What are the market options
in that perspective? That's the topic of
the next lecture.
6. S02L05 Embedding Models: Hi, and welcome back. In
the previous discussion, we explored the power of
using vector embeddings to transform unstructured data into structured lists of numbers, making the data easier
to analyze and process. The embedding model is the key component responsible
for this transformation. Let's dive into the concept
of an embedding model. An embedding model is a machine learning
model designed to map data from its original
high dimensional space into a more compact, lower dimensional vector space. This lower dimensional
representation is called an embedding, and it effectively captures the semantic or structural
relationships within the data. Machine learning models are
typically trained using algorithms based on deep learning methods
like neural networks. During the training phase, these models are exposed
to vast amounts of data, allowing them to learn complex patterns
and relationships. Once the model is fully trained, it becomes capable of
performing vector embeddings, meaning transforming new
unstructured data into a vector. The number of features that a specific embedding
model generates is fixed, meaning every vector produced by the same model will
be the same size. This fixed number of features determines the
dimension of the vector. When we talk about a
high dimensional space, we refer to models that generate vectors
with many features, sometimes as many as 100
or 1,000 or even more. Should we use an embedding
model that generates more features with more
dimensions or fewer dimensions? Well, there is no simple yes or no answer
for that question. Having more dimensions can improve the accuracy of
the generated vector, but it comes with trade offs. Larger vectors require
more computing power and more memory resources, and can increase the latency of tasks
like searching data. It's a balance between
the accuracy and computing resources
or efficiency required for
processing the data. Now let's say you or your
company wants to build an AI based application
that leverages vector embeddings to transform
complex unstructured data. In that case, you have
three main options. Option number one, build an
embedding model from scratch. Some data scientists may develop custom embedding models tailored to a specific
task. However, this approach is less
common because it requires substantial
resources and time, something most companies
cannot afford. In many industries, the speed of launching new
products is critical, making this option less practical for most
market use cases. It's probably more feasible
for research institutions or large companies with deep pockets that are running
a skilled data science team. Another much more
practical option is to utilize an out of the box embedding model
as a building block. There are many types
of embedding models. We can divide them
into two main groups. The first group is about open
source embedding models. There are a growing number of open source models
that can be used, and of course, each model is optimized for a
specific type of data. There are text
embedding models, image embedding models, audio embedding
models, and so on. The second group is paid models. Those models are
encapsulated as services,
accessed using APIs: we can provide the input data and get
the vector as an output. As you can imagine, this
approach dramatically simplifies the development and integration of AI
based applications, and developers can easily leverage powerful models
without reinventing the wheel. The third option is to fine
tune a pre trained model if our data falls within
a specific domain, such as medical records and off the shelf models
are insufficient. In that case, we can
consider fine tuning an existing pre trained model with additional
domain specific data. Like option number one, it will require a skilled team to perform the fine
tuning process, but it is much less complex compared to starting
the training from scratch like
option number one. All right. As a quick summary, when building an AI application
that leverages embeddings, there are three main options. Option number one, build an
embedding model from scratch. Option number two, use
pre trained models as open source or
using paid APIs. And the last option is to fine-tune a pre-trained model with
domain specific data. The selection of the most
suitable option is based on the project's specific needs,
resources, and timeline. Many developers who
would like to focus on their core software
applications will go with option number two and
use it as a building block, meaning, try to find off
the shelf services or maybe open source
embedding models that are already trained and
optimized for production. Now let's say we vectorized all our unstructured data using a selected embedding model
and place them in vectors. How can we group them based
on their position in space? How can we measure
their similarity? That's the topic of
the next lecture.
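The accuracy-versus-resources trade-off mentioned above can be made concrete with a quick back-of-the-envelope calculation. The vector counts and dimension sizes below are example values, and the math assumes vectors stored as plain 32-bit floats:

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage needed for a flat collection of vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value

one_million = 1_000_000
small = index_size_bytes(one_million, 384)    # a lower-dimensional embedding
large = index_size_bytes(one_million, 1536)   # a higher-dimensional embedding

print(f"384 dims:  {small / 1024**3:.2f} GiB")   # ~1.43 GiB
print(f"1536 dims: {large / 1024**3:.2f} GiB")   # ~5.72 GiB
```

Quadrupling the dimensions quadruples the memory for the same number of vectors, and every similarity computation touches proportionally more numbers, which is why higher-dimensional models cost more in latency and resources.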
7. S02L06 Similarity Metrics: We have learned that
embedding models are powerful tools
used to convert complex unstructured data into compact numerical vectors that capture key features and
patterns within the raw data. Each vector created by
an embedding model has a fixed number of dimensions, which
defines its embedding space. This space is a high
dimensional mathematical space where objects are
represented as vectors. In this embedding space, the relationships and
similarities between objects are encoded based
on their relative positions. The closer two vectors
are to each other, the more similar the
objects they represent; proximity equals similarity. Instead of trying to
find an exact match, meaning two identical vectors, we are expanding our space of options by looking
at proximity. This property of proximity
between vectors is extremely useful for a variety of applications such
as semantic search, recommendation system, anomaly
detection, and much more. Let's say we have 1,000 images and we have transformed
them into 1,000 vectors. Each vector has
the same number of features and
represents the image in the same embedding space. So what can we do with
this vectorized data? We can perform a
variety of tasks. For example, we can take a new picture, convert
it into a vector using the same embedding model
and then search through the existing 1,000 vectors to
find the most similar images. This process is called
a vector search. The top matched images can be recommended to the
user based on similarity. Here we can see that picture number one and
picture number two are very similar to
the new picture. Another option is to
find groups or clusters. We can analyze the entire set of 1,000 vectors and group
them into clusters. Images that share
similar patterns will be close to each other in
the embedding space. This allows us to identify meaningful groups or
clusters within the data. Here we have two very
distinct clusters or groups. The underlying principle for this task is measuring the
similarity between vectors. Let's dive a little bit more into how these similarity
metrics are calculated. There are several methods to measure similarity
between vectors, and each method is more suitable for different
set of use cases. Still, the most common method to measure similarity
between vectors for general purpose
high dimensional data created by embeddings is
the cosine similarity. Cosine similarity
measures the similarity between two vectors based
on the angle between them. It focuses on the direction of the vectors rather
than their magnitudes. How is this metric calculated? Well, using the
following formula: cos(θ) = (A · B) / (‖A‖ × ‖B‖). It may look complex, but it's quite simple. Let's see what we have here. Assuming A and B
are two vectors. On the upper side,
we have A dot B, which is the dot product
of vectors A and B. It is calculated by
simply multiplying the corresponding elements of the two vectors and then
summing the results. On the lower side, we need to
calculate the magnitude of vector A and vector B and then multiply the outcome
of each magnitude. As a simple example,
let's imagine that we have three
vectors in space, A, B, and C. We will use this
formula to calculate this metric between vector A
and the two other vectors. The cosine value between
A and B is 0.905, which is very close
to the value one. Meaning it will be translated to a small angle between
the two vectors. They are almost pointing
to the same direction, meaning A and B are
similar to each other. What about A and C? We are getting 0.62, that is translated into
an angle of 51 degrees. Those vectors are not really pointing to
the same direction. Bottom line, vector
B is more similar to vector A than vector
C. All right, we have a simple useful metric to measure the similarity
between vectors. Let's use it to perform a process called a vector
search. See you next.
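The calculation described in this lecture (the dot product on the upper side, the product of the magnitudes on the lower side) can be written directly in Python. The vectors A, B, and C below are made-up sample values, not the ones from the lecture's illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (A . B) / (|A| * |B|)"""
    dot = sum(x * y for x, y in zip(a, b))        # upper side: dot product
    mag_a = math.sqrt(sum(x * x for x in a))      # magnitude of vector A
    mag_b = math.sqrt(sum(y * y for y in b))      # magnitude of vector B
    return dot / (mag_a * mag_b)

A = [2.0, 4.0]
B = [3.0, 5.0]
C = [4.0, 1.0]

print(round(cosine_similarity(A, B), 3))  # close to 1: nearly the same direction
print(round(cosine_similarity(A, C), 3))  # smaller value: directions diverge
```

Because the result depends only on the angle, scaling a vector (doubling every component, for instance) leaves its cosine similarity to other vectors unchanged.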
8. S02L07 Vector Search: In the previous lecture, we saw how to measure a similarity metric
between two vectors. In practical use cases, it is required to perform a process called
a vector search. It means that on one side, we have a query vector, also called here QV, and on the other
side, a long list of vectors to be searched, like a vector one,
two, three, et cetera. The vector search process is
divided into three steps. In step one, we'll take the
query vector and calculate a cosine similarity
metric between that query vector and
the first vector in that list, then the second vector and the third vector, until
the last vector. In step number two,
we will reorder the vectors based on the
calculated similarity score, going in descending order. So here, the similarity
metric between the query vector and vector number four has
the highest value. Next, we have vector number three and then vector
number five and more. While looking at this simple two dimensional
representation, it makes a lot of sense. We can see that the query vector is close to vector
four, three, and five. And finally, as part
of step number three, we can pick the top
similar vectors based on a threshold value. Here, the selected
top vectors are number four and three based
on that specific threshold. That's the end to end process. Let's see a simple example. We have the following list
of seven sentences. All of them are around
the main topic of AI. Let's assume each sentence was converted into a vector using
a text embedding model. We can place them
in a simple table. The last column represents
the vector embedding result for each sentence
as a list of numbers. I'm not mentioning
the specific numbers; it's just an illustration. We have vector one,
vector two, and so on. At some point, I
got a new sentence. And it is required to check which sentences in that list are similar to the new sentence
I would like to check, meaning they have similar
semantic meaning. Let's use the same simple
process of a vector search. As a first step, we'll
take that new sentence and convert it to a vector using the same text
embedding model. This vector is now
called the query vector. The next step will be to
calculate the similarity between the query vector
and all other vectors. Finally, we sort the table in
descending order based on that metric. As we can see in this table, the nearest vector for the searched sentence is
the first one in that list, which has the highest score, 0.851, and the next vector, meaning number three, with 0.722. Those two sentences have the most similar
semantic meaning to the new sentence
that I'm searching for. Another dimension to consider
is the threshold range. Is a metric value of
0.65 good enough or not? Should we drop anything
below 0.7, for example? Here I decided that the
threshold value should be 0.7. Well, there is no one
answer to that question. It is based on the use case, the application and the domain. If this metric is used in a recommendation system
for an ecommerce website, then there is more
room for flexibility. The threshold value
can be lower like 0.6. It will be just fine if
I get recommendation for a bike gadget because just two days ago I searched
for new sport shoes. On the other hand, if the
application is about finding relevant
medical content, then the threshold
value should be much higher like 0.9 or even more, making sure that
the content is more relevant to the search query. Setting the right threshold
is a case by case decision. We just saw a simple list of less than ten sentences
for our vector search. What if we need to handle
thousands of vectors, hundreds of thousands
or millions of sentences or
images or documents. That's the typical case
in real market use cases. We need a place to
manage and store all those vectors, and an efficient technology to
perform a fast vector search. That's the purpose of
a vector database, which is the main topic of our next section. See you next.
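The three steps of this lecture (score every stored vector against the query, sort descending, cut at a threshold) can be sketched as a small Python function. The stored vectors, the query vector, and the 0.7 threshold are illustrative values:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def vector_search(query: list[float], vectors: list[list[float]], threshold: float = 0.7):
    # Step 1: score every stored vector against the query vector.
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    # Step 2: reorder by similarity score, descending.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: keep only the vectors at or above the threshold.
    return [(i, score) for i, score in scored if score >= threshold]

stored = [
    [1.0, 0.0],   # vector 0
    [0.9, 0.5],   # vector 1
    [0.0, 1.0],   # vector 2
    [-1.0, 0.2],  # vector 3
]
qv = [1.0, 0.1]  # query vector

for idx, score in vector_search(qv, stored):
    print(f"vector {idx}: {score:.3f}")
```

This brute-force loop compares the query against every stored vector, which is fine for a handful of items but is exactly what stops scaling to millions of vectors; that scaling problem is what the next section on vector databases addresses.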
9. S02L08 Summary: Hi, and welcome to the last
lecture of this section. I would like to summarize
the key concepts we covered. I'm going to use a mind
mapping tool called X Mind to organize the topics. Feel free to download the PDF version as
a final version or the actual mind mapping
file and perform your own adjustments while
using the same application. All right? We started this section by defining
the high level concept of AI and reviewing the evolution of this amazing technology
with machine learning, deep learning, and
generative AI. AI is an umbrella term for the human desire to create a digital brain to mimic
human intelligence. It is a general
purpose technology, which means it is useful
for many use cases, and it will be used
in many domains. ML, or machine learning,
is a subfield of AI. It is a group of methods and algorithms to discover and
learn patterns from data. Deep learning is a sub field
of machine learning for handling highly complex patterns using artificial
neural networks. And the last evolution phase
is about generative AI. GenAI added the important
capability to analyze text as human language
and generate new content. AI models are trained using massive amounts of data
and the most useful way to convert complex data around us for AI models is
by using vectors. A vector is a simple
mathematical object with magnitude and direction like an arrow pointing in space. It is easy to present
such a vector in two dimensional space or
three dimensional space. But what about four
dimensional space or ten dimensional space? It is much simpler to present any vector using a simple
coordinate system. We can break any vector into its components, a list of numbers representing the dimensions, like D1, D2, D3, and so on. In the context of data science, vectors are essential tools for representing data points or data objects as
numerical data. A data point is a
group of features or properties that
describe an object. It is a single
observation of an object. A feature vector is a list of all features related to
one data point as numbers. Those features are manually selected for simple
use cases as part of the data preparation
process while picking the relevant pieces of information that can describe patterns related to an object. In a manual feature selection, each number corresponds
to a specific feature, such as an animal's weight or height, making it possible to describe complex objects in a multidimensional space. However, as the complexity of real world patterns increases, manually selecting relevant
features becomes challenging. Vector embedding is a
smart way to automate the process and leverage
the power of AI models. Vector embeddings are
used to vectorize data, transform raw unstructured
data such as text, images, or audio into
numerical representation. A long list of numbers. Each number in the vector corresponds to a
specific dimension. This process normalizes the data into fixed-size vectors that are organized in the same vector space. In this space, the
proximity between vectors reflects the similarity between the objects they represent. These vector embeddings are generated by embedding models. Embedding models are
machine learning models trained using large datasets
and deep learning techniques. A trained embedding model
can be used to map data from high dimensional space into more compact lower
dimensional vector space. There are a couple of options to leverage an embedding model. One option is to build
an embedding model from scratch or use an out of the box embedding
model as open source or paid API based services. The last option is to fine tune a predefined embedding model. We also talked about calculating
a similarity metric. As the proximity
between two vectors indicates their similarity, it can be used to
search similar data instead of trying to
find exact matches. This property is highly
useful for many use cases. One of the most common
methods to measure similarity between vectors
is the cosine similarity. Cosine similarity
measures the similarity between two vectors based
on the angle between them. Meaning it is focusing on the
direction of the vectors. The last topic was about
performing a vector search. Step number one, generate
a list of vectors to hold data objects using the
relevant embedding model. In step number two, process a
query request by generating a query vector and then
calculating similarity metrics. In step number three,
sort vectors using that metric and filter
relevant results based on some threshold range. That's a quick summary
for this section. In the next lecture, you have a small quiz
to test your knowledge. Good luck, and see you
again in the next section.
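The three search steps summarized above can be sketched in a few lines of Python. The toy_embed function is a made-up stand-in for a real embedding model; it just counts a few hand-picked words:

```python
import math

def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: counts a few
    # hand-picked words so each sentence becomes a small feature vector.
    words = text.lower().split()
    vocab = ["dog", "cat", "car", "road"]
    return [float(words.count(w)) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Step 1: generate a vector for every data object.
sentences = ["a dog and a cat", "a car on the road", "my dog likes my cat"]
vectors = [toy_embed(s) for s in sentences]

# Step 2: turn the query into a query vector and score every stored vector.
query_vec = toy_embed("cute dog")
scores = [cosine(query_vec, v) for v in vectors]

# Step 3: sort by score and keep results above a threshold.
ranked = sorted(zip(scores, sentences), reverse=True)
results = [s for score, s in ranked if score >= 0.5]
print(results)
```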
10. S03L01 Introduction: Hi and welcome back. In the previous section, we uncovered the
power of vectors, vector embeddings and
similarity metrics used for performing
vector search. I guess you got the idea that vectorizing data into a list of numbers is
an essential step in almost every AI
based use case, taking unstructured data
and converting the data to vectors to facilitate fast
and flexible data search. This is performed by a component called
an embedding model. We also saw a couple
of examples such as vectorizing a small number of images or sentences,
which is nice. But in practical application, the number of objects to vectorize can be much
larger like thousands, hundreds of thousands,
and even millions. Just think how many products are listed on an ecommerce
website like Amazon or eBay. If a recommendation system inside the website uses vectors and each vector represents the landing page of each product, it will be a challenge to manage those millions of vectors. It's all about scaling. Where should we store, manage, and search for those vectors
created by embedding models? The perfect solution is to use specialized database
technologies called vector databases. Those technologies
have the ability to store, index, and search high
dimensional data points, delivering the speed
and performance needed to drive artificial
intelligence use cases. That's the main topic in
this section. Let's start.
11. S03L02 Structured and Unstructured Data: Hi, and welcome back.
As you probably know, there are two main types of data objects, structured
and unstructured. Structured data has a specific
predefined structure. In many use cases
structured data can be organized in tabular form
with rows and columns. It is typically managed in traditional relational databases accessed using SQL queries. Each column stores data types
like numbers and strings. This structure is perfect
for transactional data. Think about a bank application, storing and accessing
financial transactions. The data will be stored in
rows and columns like a table. Each row represents a
specific transaction, and the columns represent different fields stored
for each transaction. When searching for data, it is based on exact
keyword matching or using a set of criteria, meaning the search
query will be compared against the values
inside specific columns. For example, query all
bank transactions for a person based on ID number and between
specific dates. We'll get a list of rows
matching that search query. Traditional databases are great at storing and retrieving
structured data. However, they are less
flexible when dealing with the semantic and
contextual nuance of unstructured data. Unstructured data like images, video files, audio files, text from a blog post, a product description
in a PDF file, a question from a chat
session, and more. Unstructured data has no
specific predefined format. We can't organize
this type of data or the patterns in this data into tables with
rows and columns. It is high dimensional
complex data. Let's consider a group
of images as files. An image is unstructured data. There is no predefined format. There is no specific information that describes the
content of each image. Maybe the name of the file can provide some
basic information. Those image files
will be stored in generic object storage solutions like Amazon S3 or
Azure Blob storage. It is also possible to manually
add metadata information attached to each image file that can help to search and
retrieve those images. Metadata like category, creation date, tags, maybe the
background color, a type of object
inside, et cetera. We can use that
extra information about each image to
search those image files, but it is limited. This metadata does not capture the full patterns
inside each picture. Let's say, we would like to get all images similar to a
specific image that we have. This image is a
picture of a dog. Now, there are many
types of dogs. How can we perform that search? Maybe some images
were tagged with the keyword dog or the main colour or the
overall dog position. However, there are many
types of dogs with different sizes,
shapes, and colors. We would like to get
images of dogs that are similar to the specific
dog in our picture. Any manual metadata added to a picture will
be very limited. What about using AI? AI models have the power
to represent and capture patterns in unstructured data and translate them into vectors. We can take each image
and process it using an image embedding
model that will produce a vector representing the
complete content inside. So for each image file, we'll also have a vector. The vector is not
the image raw data. It is a list of
numbers that represent the identified patterns in that image by the
embedding model. We can take our
specific search image, translate it into
a query vector, and then calculate
a similarity score between the query vector
and each of the other vectors. The list of vectors with the highest scores will be those similar
images that we are looking for. That's great. We have an AI model
that helps us to identify patterns in
unstructured data, and we can use the concept of vector search to find
similar vectors. However, what if I get
another image to search? I need to perform the same
process all over again. Generate a query vector of
that new image and then generate vectors for all other
images in my repository, calculate the similarity scores and filter the relevant vectors. As you may guess, that's
not an efficient method. It doesn't make sense to
calculate vectors for all content objects all over
again for each search query. It is more efficient to
calculate those vectors as a batch process one time
and store them somewhere. And when I get a query
about something, all those vectors are
ready to be searched. We just need to create
a single vector based on the search query. So where should we store
and handle those vectors? In production environments,
the number of vectors created using vector
embedding models can be huge. As a result, it is becoming essential to manage
those vectors in dedicated storage
solutions optimized for vectors, which brings us to the
topic of vector databases. Those vector databases
are optimized to store, manage, and search any type of vector and any
number of vectors. That's the topic of the
next lecture. See you next.
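As a rough illustration of this batch-then-search idea, here is a minimal Python sketch. The embeddings are hard-coded fakes standing in for a real image embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Hypothetical embedding call; a real system would invoke an image
# embedding model here. We fake it with fixed vectors per object id.
FAKE_EMBEDDINGS = {
    "dog1.jpg": [0.9, 0.1], "dog2.jpg": [0.8, 0.3], "car1.jpg": [0.1, 0.9],
}

def embed(object_id):
    return FAKE_EMBEDDINGS[object_id]

# One-time batch step: embed every object in the repository and store it.
store = {obj: embed(obj) for obj in FAKE_EMBEDDINGS}

def search(query_vec, top_k=2):
    # Only the query is embedded per request; stored vectors are reused.
    scored = sorted(store, key=lambda o: cosine(query_vec, store[o]), reverse=True)
    return scored[:top_k]

print(search([1.0, 0.0]))
```

The repository vectors are computed once up front, so each new query only costs one embedding call plus the similarity comparisons.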
12. S03L03 Vector DB: Hi, and welcome.
In this lecture, we'll start to explore the key benefits of
vector databases. As the name implies, a vector database is a
specialized database designed to store, manage, and search vectors at scale. A vector database will provide the following core capabilities. The first one is managing
vectors, meaning insert, update, or delete vectors as objects inside
the vector database. Secondly, the
ability to associate meta data with each vector, something which is important because the vector itself
is not the raw data, so it's an option to
create the connection between the original raw
data and the actual vector. We'll see that later on. The third core capability
is the ability to find similar vectors based on a query vector and the metadata information
attached to a vector. Now, what kind of vectors can we store and query with
a vector database? Well, it can be any vector, any vector type and any size, a vector created by an
image embedding model, a vector created by a
text embedding model. It can be a vector with ten dimensions or 400 dimensions. From the vector
database perspective, it is just a list of numbers representing
vectorized data. It is the perfect tool to
handle vectorized data, meaning vectors of
unstructured data created by vector
embedding models. As you can imagine,
this flexibility to store vectors related to completely different data
objects makes this type of database highly adaptable
to ongoing changes. We can easily add new data types
across the application or change search requirements with minimal adjustment to
this vector database. The power of vector
databases is based on their ability to perform
a vector search based on similarity metrics
with the goal of finding the closest data points in a high dimensional
vector space. Let me rephrase that
important concept. Querying a vector database is different than querying
a traditional database. Instead of searching for precise matches between
identical vectors, a vector database uses similarity search
to identify vectors that reside in close proximity
to the given query vector. This fundamental approach of looking at proximity in space provides tremendous
flexibility and efficiency that traditional
search cannot match. And that's the main reason why it is popular for AI
based applications. AI based applications
are handling highly complex data objects
with highly complex patterns. It does not make sense to try
to find identical objects. It is a more common
scenario to search for similar objects and
not identical objects. In many applications, the biggest
challenge is to carefully balance speed and accuracy
when handling a search query. How long will it take to search and find the right answer to a specific query? On one side, the amount of data is growing all the time, and on the other side, there is a growing number of applications that are trying to leverage
this pile of data. If I need to wait 5
minutes when searching for something in Google
Search or HAGPT, then that's going
to be a problem. In that perspective,
vector databases, coupled with the vector embeddings stored inside, are designed to provide the required
capability to search over massive datasets of unstructured
data with low latency, consistent query performance, and great accuracy. When using a vector database, you don't care if the object is an image, a video, or an audio file. Everything is vectorized. Everything is basically a list of numbers, which are vectors, and then the whole process of searching data is much simpler. Those are the key benefits
of vector databases. In the next lecture, let's talk about the typical
workflow when performing a vector search with a vector
database. See you next.
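To make these core capabilities concrete, here is a minimal in-memory sketch in Python. It is a toy, not a real vector database, which would add indexing, persistence, and scale:

```python
import math

class TinyVectorDB:
    """Toy sketch of the three core capabilities described above:
    managing vectors, attaching metadata, and similarity queries."""

    def __init__(self):
        self._records = {}  # id -> (vector, metadata)

    def upsert(self, vec_id, vector, metadata=None):
        # Insert or update a vector together with its metadata.
        self._records[vec_id] = (vector, metadata or {})

    def delete(self, vec_id):
        self._records.pop(vec_id, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nb = math.sqrt(sum(x * x for x in b)) or 1.0
        return dot / (na * nb)

    def query(self, query_vector, top_k=3):
        # Return the closest vectors with their metadata, best first.
        scored = [(self._cosine(query_vector, v), vec_id, meta)
                  for vec_id, (v, meta) in self._records.items()]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]

db = TinyVectorDB()
db.upsert("a", [1.0, 0.0], {"url": "s3://bucket/a.jpg"})  # hypothetical URL
db.upsert("b", [0.0, 1.0], {"url": "s3://bucket/b.jpg"})
top = db.query([0.9, 0.1], top_k=1)
print(top[0][1], top[0][2])
```

Note that only metadata (here a made-up storage URL) travels with the vector; the raw image itself lives elsewhere.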
13. S03L04 Vector Search Workflow: Hi, and welcome.
In this lecture, I would like to take a vector
database as a component and see how it will typically fit in a larger application. It is a generic diagram to
understand the key concept. On the left side, we have a content repository that can be any type of
content like a text, images, PDF files,
articles of web pages, video file, et cetera. That's the main
application repository or data store of unstructured
data or structured data. On the right side, we
have a vector database that is designed to
store vector data. But vector data is not something that
can be just made up. It is something that
is generated via machine learning models as
part of the embedding process. Therefore, in the
middle of the diagram, we have an embedding
model component. It can be a single model or
multiple models depending on the data types we have in the data store, the
content repository. There are plenty of
different embedding models for a variety of use cases. This type of contextual
information that the embedding captures
is a result of the type of model used and
the data it was trained on. As a preparation step, we'll take each data object
in our content repository and run it one time through the embedding model component to get a vector embedding
as an output. Each vector embedding
will be stored in the vector database coupled with additional metadata
information about that object. For example, if the
object is an image, the metadata might
include the image URL, where this image is stored. Maybe the image description, additional image tags, which can be used later
on for filtering. It is important to
emphasize that the vector embedding itself is not
the actual image content. It is a numerical representation
of the image features, like its visual content, colors, textures,
and overall theme. Secondly, vector databases are
not optimized to store and manage the actual content
objects of unstructured data. It is necessary to store the original data separately
from the vectorized data. Therefore, we must store
as part of the vector's metadata a reference to the data store holding the
original content objects. That's the preparation step: generating vectors using vector embedding models and storing those vectors
in a vector database. Now we are ready to
process a specific query from an application here on
the left side down below. This application will
send a search query. For example, a search query
can be an image file. This image file will be translated to a
query vector using the same vector embedding
model used to convert the complete repository
during the preparation. This ensures that
the query vector and the vector database content are in the same
dimensional space, making it possible to measure similarity
between the vectors. Next step is performed by
the vector database itself. It will take the query
vector and compare it with all other vectors
stored in the database. The most similar vectors to the search vector will be ranked in their
order of similarity. Finally, as part of
the query result, the vector DB will
typically return the associated metadata of
those most similar vectors. The metadata will be used by
the application to access the relevant image objects from the original
content repository. That's the typical end to end
workflow when integrating a vector database in an
AI based application. As you can imagine,
there are many companies that provide a vector
database solution. So in the next lecture, I would like to talk
about key factors to consider when planning to
select a specific solution.
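Here is a minimal Python sketch of that end-to-end workflow. The content repository and the embedding model are both faked for illustration:

```python
import math

# Hypothetical pieces of a larger application: a content repository of
# images and a fake embedding model (a real one would be a trained network).
content_repository = {
    "img001": "raw bytes of a beach photo",
    "img002": "raw bytes of a mountain photo",
}
fake_model = {"img001": [1.0, 0.0], "img002": [0.0, 1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) or 1.0) *
                  (math.sqrt(sum(x * x for x in b)) or 1.0))

# Preparation step: embed each object once and store the vector together
# with metadata that references the original object in the repository.
vector_db = [(fake_model[obj_id], {"source_id": obj_id})
             for obj_id in content_repository]

def handle_query(query_vector):
    # The vector database returns metadata of the most similar vector;
    # the application uses it to fetch the original content.
    best_vec, best_meta = max(vector_db, key=lambda r: cosine(query_vector, r[0]))
    return content_repository[best_meta["source_id"]]

print(handle_query([0.9, 0.2]))
```

The key design point sketched here is the separation of concerns: the vector database holds vectors plus references, while the original unstructured objects stay in the content repository.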
14. S03L05 Selecting a Vector Database: Hi, welcome back.
In this lecture, I would like to
review a couple of key factors to consider while choosing the most
suitable vector database option for a specific project. The first dimension to consider
is the deployment model. It can be divided into
two main options. The first one is a
fully managed vector database in the cloud. The vector database
is encapsulated as a cloud service that we can use and pay based
on consumption. Someone else is taking care of all the required configuration and infrastructure while
running the database. Just keep in mind that those fully managed solutions are associated with a price tag, so make sure that it is aligned with your budget
and project requirements. Another thing to check is the
supported Cloud providers. If your company is mainly
running on Amazon AWS, it makes sense to check that this fully managed service is supported in
their marketplace. The second deployment
option is self hosted. It can be an open source or maybe a proprietary
solution that we can deploy on our servers. And there are many great open
source database solutions. Some of them are
available as self hosted open source and also as fully
managed Cloud services. You can see different popular
names right over here. That's about the
deployment model. The next factor to consider
is about integration. A vector database is one
piece in a complete solution. It must be successfully
integrated with other components in an
end to end application. Therefore, it is important
to check and verify the compatibility
and support with popular machine learning
frameworks and tools. So consider the
available SDK extensions and APIs for integration
that you would like to use. The next dimension to consider is how developer friendly the vector database solution we would like to use is. Many companies are claiming
to be developer friendly, but their website is not
aligned with that message. Can you find a solid
documentation center in case you need some help? Maybe the solution
is technically great but poorly documented, making it hard to find
specific information. It is recommended to open the
vendor website and try to find a dedicated section for developers like a
knowledge base, documentation, tutorials,
access to a community center, code examples, maybe
training sessions. More mature solutions will have the required end to end
package for developers. Most probably companies
that are providing a vector database as a
paid solution will put more resources into providing
friendly resources as part of their
marketing strategy. The next one is about
metadata filtering. Do you remember that
a vector stored in the vector database will be
associated with metadata? It contains extra information
could be used for performing filtering that
combines the vector search and additional filtering options using that metadata information. From that perspective,
it is important to check what data types can be added as
metadata information to facilitate the
required search. Metadata is commonly added and managed in a JSON structure
with key value pairs. The value inside the key value pair will be limited to specific data types, like numbers, strings, booleans, or a list of strings. This is something to check. And the last one is
enterprise ready. The last dimension to
consider is the level of alignment to support
enterprise applications. Eventually, the
developed application will run in production and must follow certain market benchmarks related to security,
performance, and uptime. Imagine that the recommendation
system on Amazon stops working because the vector database
is under maintenance. It will quickly translate to
substantial revenue loss. It is important to check if the vector database solution is aligned to enterprise grade, especially if it's a
cloud based solution, checking the
supported privacy and security features, declared security certifications, performance metrics, and
level agreements. Those are the key
factors to consider when selecting a vector database
for a specific project. I guess it will be just
fine to start with an open source solution during development while minimizing
cost and later on, make a decision when
moving to production.
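As a small illustration of the metadata filtering factor discussed above, here is a Python sketch combining a metadata predicate with a similarity ranking. The records and fields are made up for the example:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) or 1.0) *
                  (math.sqrt(sum(x * x for x in b)) or 1.0))

# Vectors with JSON-style metadata: numbers, strings, booleans, lists.
records = [
    ([1.0, 0.0], {"category": "shoes", "price": 80,
                  "in_stock": True, "tags": ["sport"]}),
    ([0.9, 0.1], {"category": "shoes", "price": 200,
                  "in_stock": False, "tags": ["luxury"]}),
    ([0.0, 1.0], {"category": "books", "price": 15,
                  "in_stock": True, "tags": ["ai"]}),
]

def filtered_search(query_vec, metadata_filter, top_k=5):
    # Apply the metadata predicate first, then rank survivors by similarity.
    candidates = [(v, m) for v, m in records if metadata_filter(m)]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r[0]), reverse=True)
    return [m for _, m in ranked[:top_k]]

hits = filtered_search([1.0, 0.0],
                       lambda m: m["category"] == "shoes" and m["in_stock"])
print(hits)
```

Real vector databases expose this kind of filter through their query API rather than a Python lambda, but the combination of a metadata condition plus similarity ranking is the same idea.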
15. S03L06 Summary: Hi, and welcome to the last
lecture in this section. Let's summarize the key points we have covered so
far in this section. Again, it's a mind
mapping summary. You can download the final
version as a PDF file or the X mind format if you would like to
make your own changes. We started by dividing data into two main categories,
structured and unstructured. Structured data has a specific
predefined structure. One of the most common
formats is using rows and columns
called tabular form. It is a perfect method to handle transactional data where
each row represents a specific transaction
and each column represents a specific
field in a row. This type of data is
commonly stored and managed in traditional
relational databases. These databases use SQL
queries to retrieve data based on exact keyword matching
or specific criteria, making them effective for
precise structured queries. However, they fall short when
handling unstructured data, such as images, video, audio, and text which lacks a
predefined format and cannot be easily organized
into tables or simple groups. To make unstructured data more searchable, metadata like tags or
categories can be added, but this approach is
limited and does not capture the full complexity of all patterns in the raw data. AI powered vector
embedding models provide a solution to
these limitations by converting
unstructured data into numerical vectors that capture patterns and semantic content. Those vectors can be used to
facilitate a vector search. When converting a large number of unstructured data objects into vectors using embedding models, it will be more efficient to store those vectors
in a vector database. Vector databases are
optimized for storing, managing, and searching
vectors at scale. They are specialized tools designed to handle
vectorized data, offering key functionalities
such as inserting, updating, deleting and associating
metadata with vectors. Their core feature is the ability to perform
similarity search, identifying vectors
with close proximity to a query vector within a
high dimensional space. One of the most
significant advantages of vector databases is their
ability to maintain consistent performance at scale and to perform a
search query over high dimensional data with a much lower latency compared
to traditional databases. It makes them a
perfect match for AI based applications that must access unstructured
data at scale. We also covered the
typical workflow of a vector search while
using a vector database, exploring how a vector database fits into a larger
application ecosystem. Such system architecture will include the following
key components. A content repository that stores the raw unstructured data. It's usually an object
storage type of solution. Then we have an embedding model that generates a
vector embedding, a vector database to store
those generated vectors, and the application business logic that triggers a search query. There are three main steps in that workflow: vectorizing all data objects as a preparation step and then handling a query request by translating that
query to a query vector and searching the
vector database for similar items to be
provided as a query result. Finally, we talked
about key factors to consider while
selecting a vector database. First, the deployment
model can be a fully managed service or
a self hosted solution, checking the integration
capabilities with machine learning
frameworks, tools, and APIs to ensure smooth implementation; evaluating the quality of documentation and the overall package to support developers; the ability to add
flexible metadata to vectors to facilitate
advanced search capabilities. And the last factor
to consider is the supported features for an enterprise grade solution. Things related to security,
performance, and uptime. That's the summary
for this section. You are more than
welcome to test your knowledge with
the following quiz. In the next section, I would like to cover popular market use cases with vector databases.
See you again.
16. S04L01 Introduction: Hi, and welcome back. In
the previous sections, we managed to cover the foundation and key
principles of vectors: vector dimensions, vector embeddings, similarity scores, vector search, and also vector databases. In this section, I would like to use this knowledge and review the most popular use cases that can leverage vector databases. We'll talk about semantic
search, recommendation systems, retrieval augmented generation, anomaly detection
and image search. It is going to be
interesting. Let's start.
17. S04L02 Semantic Search: Hi and welcome. Searching for information while
typing some keywords is probably the most
popular required feature of many software
applications and websites. As an example, let's
consider a website that provides access to a
comprehensive library of books. Each book has a product landing page that summarizes the book content and includes metadata information about the title, subtitle, main
category, et cetera. A visitor can type some
keywords in a search bar, and based on those keywords, a search engine will bring up
the most relevant content, meaning a list of books. How does the search engine bring relevant content to the user that is
searching something? Well, there are
two main methods. The first one is
called keyword search. It is the traditional method of information retrieval
that relies on matching specific
keywords or phrases. It is mainly based on exact matching of words
like the title of a book, the author name, tags associated with the book, and more. When a visitor types a
keyword like romantic novels, the search engine looks for books with exact matches for romantic and novels, two separate words. If they appear in the title or the subtitles or maybe
the category metadata, the method is fast,
straightforward, and simple for implementation. It will work quite well with data objects that have
structured metadata. However, it has a
couple of limitations. The visitor sometimes is not
using the exact keywords in the search query or the
related search terms are not part of the metadata of the items we are
trying to search. For example, searching
for love stories may not match books labeled
romantic fiction. Okay, it's not the same words. Another limitation is when the search query
is highly complex, like using a sentence
to ask something. Breaking the query into separated keywords to
perform the search is missing the context and the semantic meaning of
the full query sentence. For example, visitors may
search for books in many ways. It can be by topics like saying
machine learning basics, by intent, like books to
improve leadership skills. By concept, romantic stories
about second chances. So there is a great limitation while using keyword search. For example, a user is typing
the following search query. What are some good books to read about artificial
intelligence? A keyword search might
return results for books containing the keywords artificial intelligence or AI, but it might miss books that discuss related topics
like machine learning, neural networks, generative
AI, or data science. All those words will have
semantic related meaning, but the search engine
might miss them. Another example, I'm
looking for a book that explains quantum computing
in simple terms. In that case, a
keyword search might return results for books with the keyword
quantum computing, but it might miss books
that use simple analogies, metaphors or less
technical language to explain complex concepts
like quantum computing. If the book metadata
information, like the title, subtitle, or tags are not specifically mentioning that
a book is for beginners, then the search
engine will not be able to find such correlations. All those limitations
bring us to the second method for searching data called semantic search. Semantic search focus
on understanding the meaning behind
the user query and matching it to
the related content. Even if the exact keywords
are not being used. Using semantic search, the
search engines will take the complete user
query and transform it using an embedding
model into a query vector. This transformations captures the complete
semantic meaning of the text inside the query. Next, it will take the
query vector and search for the top similar vectors stored
in the vector database. Each vector stored in the vector database
represents a book. So instead of trying
to find exact keywords matching the vector
database will return a list of similar vectors that have close proximity with
the required vector. As you can imagine, it is a
much more flexible method as the user does not have to use exact keywords while
searching something. Secondly, it can
better encapsulate the full semantic meaning of the query using the output
of the embedding model. And lastly, it can produce more accurate and
relevant results compared to the traditional
keyword search approach. Let's review a
couple of examples. Books on leadership
for small teams. A semantic search can understand the intend of getting
books about managing startups even if the Demeta data does not contain the
words small teams. Looking back on the
previous examples, what are some good books to read about artificial
intelligence? A semantic search engine would understand the intent
behind the query and may return results
that are relevant to artificial intelligence
as an umbrella topic. For example, it may suggest
books on machine learning and deep learning that are closely related to
artificial intelligence. As a quick summary, unlike
keyword based search, semantic search uses
the full meaning of the search query by
vectorizing the query. It can find relevant results
even if the keywords used in the search query are
not identical to the keywords that describe
the required content. This search method
is useful to help visitors to explore and
discover content more easily. All this magic works by combining the
power of AI models to generate vector
embeddings with the power of using the vector databases
to store all those vectors.
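To make the contrast concrete, here is a toy Python sketch. The keyword search demands exact word matches, while the hand-crafted fake embeddings stand in for a real model that places "love stories" near "romantic fiction":

```python
import math

books = ["romantic fiction classics", "machine learning basics",
         "quantum computing explained"]

def keyword_search(query, titles):
    # Traditional method: every query word must appear verbatim in the title.
    words = query.lower().split()
    return [t for t in titles if all(w in t.lower().split() for w in words)]

# Hypothetical embeddings: a real embedding model would place "love stories"
# near "romantic fiction" because it captures meaning, not exact words.
fake_embeddings = {
    "romantic fiction classics": [0.9, 0.1, 0.0],
    "machine learning basics": [0.1, 0.9, 0.1],
    "quantum computing explained": [0.0, 0.1, 0.9],
    "love stories": [0.85, 0.15, 0.05],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((math.sqrt(sum(x * x for x in a)) or 1.0) *
                  (math.sqrt(sum(x * x for x in b)) or 1.0))

def semantic_search(query, titles, top_k=1):
    q = fake_embeddings[query]
    return sorted(titles, key=lambda t: cosine(q, fake_embeddings[t]),
                  reverse=True)[:top_k]

print(keyword_search("love stories", books))   # exact words are missing
print(semantic_search("love stories", books))  # meaning still matches
```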
18. S04L03 Recommendation Systems: The next popular use case
for using vector databases is for building more effective personal recommendation systems. A recommendation system is a core component in many
software applications, helping users to better navigate between
available options, discover relevant content, and offer personalized
suggestions. It is used in many use cases like ecommerce for
product recommendation. Streaming services for
content recommendation, social media platform for new connections
recommendation, and more. Recommendation systems
are everywhere. A straightforward example
of a recommendation system is when we are searching for products on an
e-commerce website, and the website is
trying to figure out which products we are most
likely to purchase. It can leverage our personal
data coupled with what we type in the search
query, what we click on, and other
actions we perform. All those signals
are used. Another example of
a recommendation system is when we watch a movie or TV show
on a streaming service, and at the end of the movie, it will try to recommend other similar content to watch. Building a fully functional
recommendation system in a production environment is not a simple task. There are many
things to consider, like how to handle
large-scale data, which data sources will be used, selecting the
relevant algorithms to analyze the data, and more. The goal is to provide a seamless experience to the end
user, and that's not simple. Users expect to get more relevant, highly focused
recommendations. Traditionally,
recommendation systems are based on two main methods: collaborative filtering and
content-based filtering. Let's break down those two options. Collaborative
filtering is a method based on user
interaction data, such as ratings,
purchases, clicks, and other actions. The
recommendation system is constantly collecting all kinds of data, like ratings and feedback about
items from end users. For example, a user just gave positive feedback on a specific movie
or product. This rating will be stored on that item as part of a large
body of collaborative knowledge. Then the recommendation
system can go in two directions. The first direction is to find users similar to the
target user based on their interactions and recommend items that are liked
by those similar users. Think about the sentence
you can see, for example, in a streaming service: "people like you also watched." It means the system is looking at what other similar
people are doing, people similar to
the target user. The other direction
is to find items similar to the ones the
user has interacted with, recommending items that are usually purchased
or rated together. It's like "people who bought this item also
bought these other items." For example, I am on the Amazon website viewing a
specific product, and the system will
recommend items that are frequently viewed or purchased alongside it by
similar people. That's about
collaborative filtering. The next method is called
content-based filtering. Content-based filtering relies
on item attributes or metadata, such as title, description, categories, tags, et cetera. Similarity is determined by comparing the features of items, meaning the system
is not looking at what other users are doing. It is looking at what the
target user is doing. The main focus for generating a
recommendation is the target user's interactions and the attributes of
specific items. For example, in a streaming
service like Netflix: you watched
an action movie, so here are more action movies. Now, there are, of course, more methods being used
in recommendation systems. Assuming one or two
of them are being used, the next question is how a recommendation system
is able to digest and analyze the data to find
the required patterns and then recommend something. As you may guess, the
secret sauce for such systems is
the combination of AI models and
vector databases. The AI models are smart algorithms that are able to take a
variety of data types and generate an
embedding vector for each item that encapsulates the identified patterns inside. That is the process of
vectorizing data. For example, in a streaming
service like Netflix, a variety of data can be vectorized using
embedding models: content metadata,
meaning metadata about movies, TV shows, or episodes,
can be converted into embeddings using a
text embedding model. Now all those vectors will be
stored in a vector database, where similar items or similar users are
clustered together. The second step, called similarity search, happens when the system should
recommend something. It can be triggered
when the user types something in
the search box, just finishes watching something, or visits some
location on the website. The search query or user
action will be translated into a vector that will
be used to search for similar items inside
the vector database. Finally, the last step is called recommendation
generation. Based on the similarity scores calculated by the
vector database, the engine can generate personalized recommendations
for each user. As a nice example, referring to the content metadata, it is possible to take all the metadata information about
an item as one big text, feed it into a text embedding model, and get a single vector that represents the semantic meaning of the complete metadata
information about that specific item: taking a movie title, plus category,
plus release date, plus cast, plus
tags, putting it all in one long text, and then
generating an embedding. It's a great option to leverage those powerful AI
models and make the recommendation
system much more sophisticated while helping to uncover patterns in
different dimensions.
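As a rough illustration of that metadata-to-vector idea, here is a minimal sketch. The catalog, the movie names, and the bag-of-words `toy_embed` function are all invented for illustration; a production system would call a real text embedding model and query a vector database instead:

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity over two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# Step 1: flatten each item's metadata into one long text, then embed it.
catalog = {
    "Movie A": {"title": "Steel Fist", "category": "action", "cast": "J. Doe"},
    "Movie B": {"title": "Night Chase", "category": "action", "cast": "A. Roe"},
    "Movie C": {"title": "Quiet Garden", "category": "drama", "cast": "B. Poe"},
}
vectors = {
    name: toy_embed(" ".join(str(v) for v in meta.values()))
    for name, meta in catalog.items()
}

# Step 2: the user just watched "Movie A"; recommend the closest other item.
watched = "Movie A"
candidates = {n: v for n, v in vectors.items() if n != watched}
best = max(candidates, key=lambda n: cosine(vectors[watched], candidates[n]))
print("Because you watched", watched, "->", best)
```

The other action movie wins purely because its metadata text overlaps the watched item's metadata, which is exactly the clustering effect the embedding space gives us at scale.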
19. S04L04 RAG: Hi, and welcome back.
Our next use case is becoming very popular with the amazing rise of
generative AI as a general-purpose technology that can be used
across many use cases. I think this use case is like
the low-hanging fruit of using vector databases for
generative AI. Let's start. When describing an AI model, I like to use the
metaphor of a simple box. We have an input and an output. The box is doing something,
hopefully something useful. As an example, I mentioned
during this training the AI models that are
used for embedding. The input is
unstructured data, like an image file, and the
output is a vector, a sequence of numbers. The AI model is doing some magic inside and performing
that transformation. We have an input and an output, and we don't care how this
is done inside the box. One type of AI model that is extremely popular is the
large language model, or in short, LLM. Large language models
get text as input, called
a prompt, and generate text as output,
called a completion. Popular chat-based services
like Google Gemini, ChatGPT, and Microsoft
Copilot are based on LLMs. And the results are amazing. These LLMs managed to bridge the communication gap
between humans and machines. Suddenly, machines can
understand and generate human language and answer questions on
a variety of topics. Can anyone train and create such models? Well,
it's a good question. Those LLMs are trained on a massive, unbelievable amount of public data using
state-of-the-art, highly expensive
hardware technologies. It means that in many
practical applications, LLMs are building blocks. It is easier to use an LLM that was trained
by someone else. Can those LLMs fully fulfill
all market use cases? Well, no, not everything is
perfect in the world of LLMs. Those models have three
main disadvantages. The first one is that they are trained up to a
certain point in time. Any data created after that time is not part of the knowledge stored
in the model. This leads to strange
situations in which the model produces
unrealistic answers. It is always possible to retrain models with
more recent data, but it's not a
simple, quick task. Data is constantly being created. The second major disadvantage
is that the data used to train
the model is public data. This means that the model was not trained
on private data. There are countless
organizations storing private data. Think about a business
with data about customers, sales transactions,
marketing data, and more. That data will not be available in an
off-the-shelf LLM. The last disadvantage is
that those popular LLMs are generic models, targeted at handling
a variety of tasks. They are trained with
a wide range of data, providing them with broad
knowledge on many topics but with gaps in specific areas. It's not possible to train a single model with all
data about all topics. Therefore, the
model's knowledge is not optimized to handle domain-specific tasks. One option is to try to find domain-specific LLMs, like LLMs that are
mainly trained on, for example, legal
data or medical data. It's a good option to consider. All those disadvantages lead to the conclusion that an LLM
is one part of the puzzle. If we would like to leverage
the full potential of LLMs, they should not be
used in isolation. We need an ecosystem
built around LLMs to make the most of
those amazing models. What is the solution
to handle or bridge that knowledge gap of
off-the-shelf LLMs? Well, let's augment an LLM with additional extra data
stored in external data sources. That's the essence of retrieval-augmented
generation, or in short, RAG. Retrieval-augmented generation
is an AI framework for combining the power of pre-trained large language
models with the ability to retrieve additional
relevant information from external knowledge
data sources and add that contextual information directly into the prompts
feeding the LLMs. This approach allows us to
bring in specialized knowledge as additional context without the need for extensively
retraining those models. Let's take, for example, a customer support chatbot that we get
on many websites. We type a question as
a prompt, and the chatbot is trying to figure out
what information it will need to
answer that question. Assuming our AI-based chatbot is using an out-of-the-box LLM, it can leverage the knowledge
stored in that LLM. However, in most cases, it's not enough. A generic LLM will not
know how to answer any question that involves specific information related to the company knowledge base. For example, here, the user is getting a specific
error number while trying to install software
on Windows 11. The chatbot must somehow be connected to the website's
private knowledge base. Let's assume the
knowledge base is a large collection
of many articles, including tutorials,
user guides, training materials, product
descriptions, and so on. How can we make that connection? That's the RAG framework. As part of the preparation, all knowledge base items inside that website will be vectorized:
taking each article, each image, each video, each audio file, each
product description, processing it with an embedding
model, and getting a vector, so, of course, a group
of many vectors. All those vectors will be
stored in the vector database. Now let's describe the
step-by-step scenario. Step number one: a
visitor types a question as a text prompt
inside the chatbot. Step number two is
called query processing: converting the user query
into a vector embedding, capturing the meaning
of the question. The input is the original query, and the output is a vector. Step number three is performing a similarity search. That's where we leverage
the vector database. A vector database is optimized to efficiently find the
best-matching vectors. So in our case, the
vector database is queried to find the
most similar vectors to the query vector. Let's assume the
most similar vector was returned by the
vector database, including the metadata
of that vector. This vector represents
maybe a specific article or a specific section inside the knowledge base that
can answer that question. It can be, for example, a specific section in a product user guide explaining
how to perform something. Step number four: the
chatbot application can take the vector metadata as
part of the query result and retrieve the extra data from the private data sources, like a specific product
installation guide. That's the extra data. Step number five is called
contextual enhancement. The extra text data from the best selected
article will be coupled together with
the original user query and sent as a single
input prompt to the LLM. Okay, we would like to
leverage the power of the LLM; we just add
extra information. This step enhances
the user prompt with additional contextual
data extracted in real time from
a knowledge base, providing additional
context to the LLM. And the last step is called
response generation. Step number six: the LLM
generates a response based on the original query and the additional context extracted from the private data sources. As you can imagine,
this RAG framework helps AI-based
applications leverage, on the fly, additional knowledge bases and data sources coupled with commercial off-the-shelf LLMs that are used
as building blocks. It is basically a cheaper option compared to trying to
retrain such models. This end-to-end process is
performed in real time. Vector databases play a key role in the RAG framework. They provide the
required search infrastructure to rapidly retrieve
that extra data, this contextual information,
based on the user prompts.
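The six steps above can be condensed into a small, self-contained sketch. The two knowledge-base articles, the tiny vocabulary inside `toy_embed`, and the final print standing in for the LLM call are all illustrative assumptions, not a real RAG stack:

```python
import math

def toy_embed(text):
    """Stand-in for a real embedding model: counts of a tiny fixed vocabulary."""
    words = text.lower().split()
    vocab = ["error", "install", "windows", "refund", "password"]
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Preparation: vectorize every knowledge-base article, keeping its metadata.
knowledge_base = [
    {"id": 1, "text": "How to fix an install error on Windows 11."},
    {"id": 2, "text": "How to request a refund for a purchase."},
]
index = [(toy_embed(a["text"]), a) for a in knowledge_base]

# Steps 1-2: the visitor's question is converted into a query vector.
question = "I get an error while I install on Windows 11"
query_vec = toy_embed(question)

# Steps 3-4: similarity search returns the best article and its metadata.
best_article = max(index, key=lambda pair: cosine(query_vec, pair[0]))[1]

# Step 5: contextual enhancement - couple the article with the question.
prompt = f"Context:\n{best_article['text']}\n\nQuestion: {question}\nAnswer:"

# Step 6: the augmented prompt would now be sent to an LLM (placeholder).
print(prompt)
```

Notice that the LLM itself never changes; only the prompt does, which is why RAG is so much cheaper than retraining the model.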
20. S04L05 Anomaly Detection: Hi and welcome.
Our next use case for vector databases is
anomaly detection. Anomaly detection, also called outlier detection, is the identification of
observations, events, patterns, or data points that
deviate from what is usual, standard, or expected, making them inconsistent with
the rest of the dataset. It is used across many
domains, like fraud detection for financial transactions, analyzing user behavior
or payment patterns. It can be used to identify
cybersecurity events and dangerous
activities like unauthorized access or
maybe data breaches. It can be used for
predictive maintenance, military
intelligence analysis, business intelligence
analysis, and much more. It is a very popular use case in many practical applications that need to find patterns in data. However, efficiently
identifying these anomalies in high dimensional
data is a complex task. It is not always simple
to identify anomalies. In many cases, it requires
scanning large amounts of unstructured data and trying to find the
relevant patterns inside. Another challenge is
related to finding small but important patterns in a very noisy environment that can signal something
very important. Think about a small
financial transaction performed by some organization
on a specific date, going to a specific destination. With the right tool, we can mark that small
transaction, together with other small signals, as a strong indication of
something much bigger. As you may guess, it makes
a lot of sense to leverage AI models that
will vectorize the data: taking high-dimensional
data like images, articles, or social media posts, vectorizing it
using embedding models, and letting a vector database
search for such anomalies. In that case, instead of
looking for similar items, it will search for vectors that are less similar
to other vectors, which can be a strong
signal for finding potential outliers or
unusual data points. Going back to the example of financial transactions
for fraud prevention, all transactions performed by
a person can be vectorized. When a new activity
is performed on
the same bank account, this event, along with all available historical
events, will be vectorized. For example, I just logged into my bank account from
another location. If I travel abroad,
for example, two or three times a month, then this piece of
information might not be strong enough to
indicate an anomaly. But if I then try to
withdraw 5K from an ATM, and that is something that
has never happened before, then it may increase the
distance of that event's vector from the regular,
normal transactions I'm doing in my account. It's all about using
those embedding models to automatically
find those patterns. So when combined with
vector databases, anomaly detection becomes a
powerful and efficient way to handle high-dimensional data in real time across
diverse use cases.
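Here is a minimal sketch of that distance-based idea. The two-dimensional "event embeddings" and the 3x-baseline threshold are invented for illustration; a real system would work with high-dimensional vectors from embedding models and a carefully tuned threshold:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-dimensional "event embeddings" for one bank account
# (a real system would get high-dimensional vectors from an embedding model).
history = [
    [10.0, 1.0],   # small purchase, home location
    [12.0, 1.0],
    [9.0, 1.2],
    [11.0, 0.9],
]
new_event = [500.0, 8.0]  # large ATM withdrawal, unusual location

def anomaly_score(event, past):
    """Average distance from an event to a set of past events."""
    return sum(euclidean(event, p) for p in past) / len(past)

# Baseline: the worst score any normal event gets against its peers.
baseline = max(anomaly_score(p, [q for q in history if q is not p])
               for p in history)
score = anomaly_score(new_event, history)
print(f"score={score:.1f}, baseline={baseline:.1f}, "
      f"anomaly={score > 3 * baseline}")
```

Instead of asking "which vectors are closest?", the query is inverted: a vector that sits far from everything else is the signal we are looking for.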
21. S04L06 Visual Search: Hi, and welcome back. The last use case I would like to
discuss is visual search. It is a use case we
saw as an example in the previous section when I presented the concept
of vector databases. Now I would like to frame
it as another use case. One popular option
for organizing and then fetching images in a large repository is by adding metadata about the
content of each item, like object type, color, action, category, and more. This metadata information
can be used to match a keyword search query with
the relevant content items. However, there are use cases where the search query
is not a keyword search. The user is providing
an image as an input, like the following
image of a dog. It's a visual search. Think about a mobile
app like Google Lens, where you can take a
picture of something and it can provide you
with similar images. It is becoming a
popular use case for e-commerce websites, where you can upload an image of a product
for product discovery or to perform style matching
for a particular outfit. It can be used for analyzing medical images to identify potential health
conditions or diseases. In real estate, it can be used
to upload a photo to find similar properties
for sale or rent. In manufacturing, it can be used to identify
spare parts or tools from images to simplify
procurement processes. And more; we can probably find hundreds of use cases
for visual search. As you may guess by now, vector databases are extremely
useful for visual search. There are dedicated AI models that are used to extract
patterns from visual elements, capturing the essence of the content, including
shape, color, textures, movement patterns,
sequences of objects, objects inside the
image, and more. The extracted vectors can
then be stored in a vector database along with
metadata about the image, so each vector is associated with the
corresponding raw data. When a user submits
a query image, it is also converted
into a vector embedding, and then the vector
database will perform a similarity search to find
the most similar images. That's about the top five
most popular use cases for vector databases. In the next and last section, we will summarize the complete
training. See you next.
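As a closing illustration of the visual search flow described above, here is a minimal sketch. The file names, the pixel lists, and the intensity-histogram "embedding" are invented stand-ins for a real vision embedding model, but the index-and-match mechanics are the same:

```python
import math

def histogram_embedding(pixels, bins=4):
    """Toy visual embedding: a normalized intensity histogram.
    Real systems use deep vision models, but the idea is the same:
    turn an image into a fixed-length feature vector."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Tiny fake grayscale "images" (flat lists of pixel intensities 0-255).
gallery = {
    "dark_dog.png": [20, 30, 25, 40, 35, 22, 28, 33],
    "bright_beach.png": [220, 240, 230, 250, 210, 245, 235, 225],
    "dim_cat.png": [35, 45, 80, 90, 38, 42, 100, 47],
}
index = {name: histogram_embedding(px) for name, px in gallery.items()}

# A query image is embedded the same way, then matched against the index.
query = histogram_embedding([25, 32, 28, 41, 36, 24, 30, 34])
best = max(index, key=lambda name: cosine(query, index[name]))
print("Most similar image:", best)
```

The query image never needs a caption or any metadata: it is matched purely by the patterns extracted into its vector.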
22. S05L01 Let’s Recap: Hi and welcome to
our last section. At this point, I would
like to quickly recap the key terms and topics we
have covered so far. We started by defining AI,
artificial intelligence, as the human desire to create a digital brain and mimic
human intelligence so machines can perform more
and more complex tasks. Those AI brains are commonly used as components in
larger applications, and therefore, it is useful
to present them using the simple analogy of a box
with input and output. Machine learning
algorithms, coupled with a massive amount of data, are used to train those AI brains. This process is called training, and the output is
a trained model. To be able to digest and
handle more complex patterns, deep learning methods
are used with artificial neural networks that are inspired by the human brain. Finally, generative AI added
the important capability to analyze text as a language and to generate
creative content. Next, we learned
the basic concept of a vector, with
magnitude and direction. An arrow is the visual representation of a two-dimensional vector. When dealing with high-dimensional vectors, like seven dimensions
or 700 dimensions, it makes more sense to represent them using a list of numbers. Those numbers
represent the endpoint of the vector arrow in space. Then we talked about
how objects are described in data science
using a list of features. Those features can be
placed in a feature vector, each data point of a specific object in
one feature vector. In simple cases, it is possible to manually
select a group of features describing
an object type based on expertise and
knowledge about a topic. However, when dealing
with unstructured data, it is more efficient and
more practical to automate that process by using
the power of AI models. This process is called
vector embeddings, taking unstructured
data and converting it to a feature vector
as a list of numbers. The feature vector is not a
replacement for the raw data. This vector will capture
the required patterns in the input raw data
identified by the AI model. There are many types of AI models for a
variety of use cases. The models that are
used for embeddings are specifically called
embedding models. And there are three
options to consider when implementing a solution
with an embedding model. Option number one, building
a model from scratch. Option number two, using an out of the box solution that can be an open source or maybe cloud based
service using APIs. Option number three, fine tune an existing model with
domain specific data. One of the key properties of those embedding vectors is their relative position
in the embedding space. The proximity of two vectors
indicates their similarity. That's something that
is extremely useful for performing a
vector search while comparing a query vector to all other vectors
or by clustering, trying to find
meaningful groups. The most popular method
to measure similarity between vectors is the
cosine similarity metric. This metric measures the
angle between two vectors, indicating whether they are pointing in similar
directions or not. When performing a vector search, we measure this metric between the query vector and
all other vectors, sort the results in
descending order, and then keep the
top similar vectors. Next, we talked about
vector databases, comparing the two
main data types, structured and
unstructured data. Structured data has a
predefined structure like a tabular form
of rows and columns. This type of data
is commonly managed by traditional
relational databases. Data search is based on exact keywod matching or
using a set of criteria. The other side, we have unstructured data without
a predefined structure. It is usually complex
data and the content of this data is commonly
described using metadata. Searching for data using metadata is a reasonable
solution for simple cases. However, it's quite limited. We mentioned the
challenge of finding similar images like similar
dogs based on metadata. The solution for
this challenge is to utilize AI models called
embedding models. In this example, we will take all images and translate
them into vectors, including the query image and then perform
a vector search. In practical market use cases, the number of
vectors generated by embedding models can
easily scale up very fast. And in that case, it is
useful to consider using a dedicated vector
database solution to better handle those vectors. I mentioned the typical
vector search workflow while using a vector database, starting the preparation step
of vectorizing all data in the content
repository and saving all vectors coupled with metadata information inside
the vector database. In case the application down below is triggering
a quer request the query will be vectorized and then searched inside
the vector database. Finally, the most similar items will be returned as an
answer to the application. When selecting a vector
database solution, we should consider
a couple of things. What is the preferred
deployment option? Some companies would
like to reduce overhead, focusing on their
core application, so they may select fully
managed database solutions, while others that are looking for much more control may consider self-hosted solutions. We should check the
integration options, like supported
extensions and APIs, try to evaluate how good the documentation
center is in case we need some help, and also check the supported metadata
filtering options, which are extremely useful while working with
vector databases. In case we select a
fully managed database, it is important to check
how well the service is aligned with enterprise
requirements related to security, performance, and SLAs,
service level agreements. The third part of
this training was dedicated to reviewing the top market use cases
for vector databases. The first use case
was semantic search. There are two main
methods to perform a search based on a text query. The first one is called
keyword search, where we take the
user's search query, break it into keywords, and then try to find
exact matches. It's a fast and simple
search method. However, it has
many limitations. By breaking the query
into small keywords, we are missing the
full semantic meaning of a search sentence. That's where semantic
search bridges that gap. By vectorizing the search
query into a vector, it is possible to capture the full semantic
meaning of the query. Secondly, the search is much
more flexible and accurate. Instead of looking
for exact matches, we are looking for
similar items. The next use case was about
recommendation systems. That's a core component in
many applications, helping users discover content and get more personalized
suggestions. Recommendation systems
use two main options: collaborative filtering and
content-based filtering. The first one is based
on user interaction data, all kinds of things
that users are doing, and the second one is based on attributes of the items themselves, all kinds of metadata
attached to different items. I mentioned a
simplified architecture of a recommendation system where all those types of data are vectorized by a group
of embedding models, and then the
application can trigger a request to get relevant
recommendations. The next use case was
about the RAG framework. Large language
models are becoming very popular building blocks
in many market applications. However, they have a
couple of limitations. Therefore, one of the
most popular options to increase their
effectiveness is augmenting the input query
for an LLM with extra data. That's where embedding models and vector databases
come to help. The next use case was anomaly detection, used in many market use
cases like finance, cybersecurity, healthcare,
and many more. Here we can utilize embedding models and
the resulting vectors to find outliers by looking at vectors that are not
similar to other vectors. The last use case was
about visual search: taking an object like
an image or video, vectorizing the data, and then
searching for similar items by the identified patterns inside and not by
using a text query. Again, vectors and
vector databases are a great match for
performing a visual search. That's the end-to-end
story of our training.
23. S05L02 Thank You!: Amazing. You reached the last
lecture, and that's great. I hope that you
enjoyed the training and learned some interesting
things along the way. You can download the mind
mapping summary files and consider visiting them again
to refresh your knowledge. It will be awesome if you
could spend two minutes rating the course inside the platform while sharing
your personal experience. Each review is important. I also suggest sharing your achievement on social
media like LinkedIn; you're more than welcome to
tag my name, IdanGabrill. It will add another
layer to your profile. That's it. Thanks
again for joining, and I hope to see you again
at other training courses. Bye bye and good luck.