Transcripts
1. Intro: Welcome to our course on evaluating large
language model outputs. As AI and natural
language processing increasingly
influence technology, a deep understanding
of evaluating large language models is crucial for any
modern developer. We'll guide you through
foundational evaluation methods, advanced techniques
using tools like automatic metrics and
Auto side by side, and ethical considerations
in AI development. This course emphasizes
practical applications, integrating human judgment
with automatic methods, and prepares you for
future trends in AI evaluation across
various mediums. Hi, I'm Professor Reza with more than ten
years of teaching experience in the field of computer science and
artificial intelligence. In pursuing my PhD, I have collaborated
with the MIT Media Lab, Carnegie Mellon University's HCII, Harvard University, and the University of
California San Diego. I have published in prestigious
venues such as IEEE, Springer Nature, and ACM CHI. My work has been featured
by multiple news outlets, including The Next
Web and CBS News. This course is ideal
for you if you have an interest in
learning the skills to assess the outputs of
LLMs effectively in order to enhance your
business strategies and personal innovation. The learning objectives
for this course are to understand the strengths and challenges of LLM
evaluation tools, discover some of Vertex AI's
model evaluation services, optimize model selection to suit your application, and
prepare for the future by understanding how evolving
evaluation tools and services can impact
the development and deployment of
large language models. To be successful in this course, you should have a
basic knowledge of machine learning concepts, including model
evaluation metrics, and a familiarity with LLMs
and their applications. This course is provided
to you in three lessons. Lesson one, basics of large language model
evaluation methods. Lesson two, LLM
evaluation on Vertex AI, and Lesson three, the future of generative AI
evaluation models. By the end of this course, you will gain a thorough
understanding of evaluating the outputs of LLMs. You'll learn how to assess
the effectiveness and accuracy of LLM-generated
content across various domains. Having these skills
will help you evaluate the quality of
different AI models. You'll be able to select the
right one for your needs. This will enable you
to design, develop, and implement effective and ethically responsible
applications for personal, professional and
business purposes. So let's get started and
explore how evaluating LLM outputs can enhance the reliability and
effectiveness of AI solutions.
2. L1V1 Introduction to LLMs and their evaluation methods: In this video, we'll
explore the concept of large language models,
or LLMs for short. Imagine an AI system so
advanced it can write stories, answer complex questions,
and even hold conversations. Isn't this fascinating? Understanding how these models
work and how to evaluate their output is crucial as these technologies are
reshaping our daily lives. By the end of this video, you'll understand how large
language models differ from traditional NLP or natural
language processing models. We're going to compare them
in scale and complexity. We'll also discuss
the importance of reliable evaluation methods and the potential consequences of improper evaluation on
real world applications. Large language models or LLMs are a big step forward in
artificial intelligence. These models learn from
huge amounts of text data, which allows them to understand and create human-like language. It's almost like they can think in a similar
way to humans. LLMs can handle much
more complex tasks compared to simpler
language models. They can carry out
conversations, summarize long pieces of text, and even create
original content. They do it all with an
impressive level of fluency and accuracy that
wasn't possible before. The real power of LLMs comes
from their depth and scale. Unlike traditional
NLP models that work with limited data
and predefined rules and focus on specific tasks, LLMs are trained on
massive, diverse datasets. These datasets contain
billions of words. This allows LLMs to
understand the nuances of language better and handle
various tasks effectively. LLMs use advanced deep
learning techniques like the transformer architecture
to learn patterns autonomously without being
programmed for specific tasks. By building a deep understanding of language directly from data, LLMs can go far beyond
the capabilities of earlier models that rely on simpler techniques
and structured input. The advanced capabilities
of LLMs allow them to perform a variety of language
tasks simultaneously, from translating languages to
generating creative writing. They can adapt to
different contexts and generate coherent,
relevant responses. This sets them apart from earlier NLP technologies, which typically handle
shorter, isolated text. Another key difference is that LLMs' large neural
networks enable them to maintain context over long conversations
or documents. This was pretty challenging for previous traditional NLP models. Now let's see why it's important to evaluate
the output of LLMs. It is important to evaluate the outputs because
these models are being used more
and more in areas where getting the right
information really matters. Areas such as healthcare, law, customer service,
news and education. In these fields, it's crucial that the
outputs are accurate, fair, and appropriate to maintain trust and make
these tools useful. Good evaluations help keep the information reliable
by checking that LLMs understand the
input correctly and ensuring the responses
are correct and relevant. They also protect against the negative effects
of incorrect outputs, such as spreading wrong or
misleading information, a.k.a. fake news. Another reason that evaluating
LLMs is crucial is that the outputs of these models reflect the biases in the
data they were trained on. We want to make sure that we are following ethical standards. LLMs can amplify biases
from the data that we train them on, and that can lead to unfair or prejudiced outputs. Good evaluations can identify
and mitigate these biases, ensuring fairness and preventing
further discrimination. Through evaluation, we can
also check if responses are appropriate and aligned
with societal norms, especially in public
interactions. Regular evaluations
improve these models and they encourage
ethical use of artificial intelligence
and they help build public trust in
interactive technologies. So in conclusion, in this video, we went over the basics of large language models
and how they're different from
traditional NLP models. We also talked about
how important it is to evaluate them and
we learned that making sure LLM outputs
are accurate and ethical is key to ensuring they work well across
different applications.
3. L1V2 - Benefits and Challenges of LLM Evaluation Methods: In this video, we will
explore the steps involved in evaluating
large language models. Imagine your news agency needs the best AI to generate
article summaries. How do you choose the right one? We'll guide you through defining goals,
selecting methods, choosing datasets,
and interpreting results all through a
real world scenario. By the end of this video, you will understand the
steps and challenges associated with each step of evaluating large
language models. Imagine you work at a news
agency that wants to utilize LLMs to generate
one line summaries for its news articles. To successfully incorporate
LLMs in this way, you are tasked with evaluating multiple models to determine
the most appropriate one. At first glance, evaluating LLMs may seem straightforward, more or less similar to evaluating a
traditional AI model. First, you define
evaluation goals. Then you choose
evaluation methods. The third step is to select
appropriate datasets, and finally, you analyze
and interpret the results. So let's break down
each of these steps. In the first step, you
want to ask questions like what specific task do
you want the LLM to perform? You also want to see what metrics are more
important to you, overall fluency, coherence, factual accuracy,
or anything else. In the second step, you need to pick the
evaluation method. You can pick from
different methods like task specific metrics, research benchmarks,
LLM based evaluations, and human evaluations based
on your evaluation goals. As for selecting the
appropriate dataset, you want to define
a golden dataset that aligns with your
evaluation goals and metrics. One good place to look is
the benchmark datasets that are specifically designed
for evaluation of LLMs. For analyzing and
interpreting the results, you want to combine both quantitative and
qualitative results to provide comprehensive
insights for your evaluation. Make sure to note the
strengths and weaknesses of each evaluation method and provide justification
for your conclusion. So hopefully this sounds like a good approach for
evaluating LLMs too. However, there are several challenges
in this process, especially when it comes to the evaluation of the
outputs of these LLMs. The first challenge is in
defining the evaluation goals. In our example, defining
evaluation goals for LLMs in tasks
like summarizing news articles is
challenging due to the subjective nature of what
constitutes a good summary. It's difficult to rely
on a limited number of metrics to assess the
quality of an output. Also, in choosing the
evaluation methods, there are time and
resource constraints. It's going to be
computationally expensive and time consuming to try
multiple evaluation methods. Also, new evaluation methods are being introduced
very frequently and that makes it
difficult to decide which method is the
best for our use case. In selecting
appropriate datasets, the size and quality of available datasets
can pose challenges. In predictive models,
we know that large datasets with minimal noise
lead to better performance. But in the world of
generative models, we still aren't sure what size and quality of dataset are best. And finally, in analyzing
and interpreting results, challenges in
explainability can arise, especially when dealing with
newer evaluation methods. We still don't have a
standard way of interpreting the results or assessing the reliability of these
evaluation methods. In conclusion, this
video has covered the essential steps
and challenges involved in evaluating
large language models. We looked at these
evaluations for tasks like summarizing
news articles. We've explored how to define
clear evaluation goals, choose the right
evaluation methods, select appropriate datasets, and effectively
interpret the results. Each step presents
distinct challenges that must be
carefully managed to ensure the successful
integration of large language models into
real world applications.
4. L1V3 - LLM Evaluation on Vertex AI: In this video, we will
explore the tools that Vertex AI offers for evaluating large
language model outputs. Imagine evaluating
AI models with tools that highlight
accuracy and fairness. These tools give you
the ability to uncover hidden biases and compare model performances
side by side. We'll also explore some
insights into making your AI models not only
effective but also ethical. By the end of this video, you'll know how to
effectively use Vertex AI to evaluate the output of large
language models. As we mentioned earlier,
in this course, we are going to use Google
Cloud as an example of a platform that provides
tools for LLM evaluation. Google's Vertex AI
can help you evaluate the entire life cycle of a large language model
from beginning to end. In Vertex AI, you can
prototype, customize, evaluate, and deploy models for many different tasks and
across different modalities. However, for the
purpose of this course, we will focus only on the evaluation capabilities
that Vertex AI provides. Some of the capabilities
available in Vertex AI to help streamline the
evaluation process include automatic metrics, which use reference data to compute
task-specific metrics; Auto side by side, which mimics human
evaluation by comparing the performance of two models
with an arbiter model; and safety bias,
which highlights the model's biases against
a certain identity group. In lesson two, we
will dive deeper into automatic metrics and
auto side by side. In lesson three, we're also going to briefly
cover safety bias. For now, let's go over each of these three
evaluation methods. Automatic metrics in
AI evaluation are quantitative measures used to assess the performance
of models, especially in tasks like text generation or
machine translation. They are typically fast and efficient, and can be part of a standardized
method used across academia and industry for
comparing different models. Some of the most common
automatic metrics include BLEU, or Bilingual
Evaluation Understudy, which measures how many
words and phrases in a machine-generated translation match a reference translation. We also have ROUGE, or Recall-Oriented Understudy
for Gisting Evaluation, which is another metric
used for evaluating text summarization by counting the overlapping
units such as n-grams, word sequences, and
word pairs between the computer-generated summary and a set of
reference summaries. There is also Auto side by side, which is a tool used for automatic side-by-side
evaluation of AI models, particularly
generative AI models in the Vertex AI Model Registry. This tool allows comparison of the performance
of different models, providing insights
into which model performs better and under
what circumstances. Auto side by side
aims to deliver consistent performance
metrics that align with human evaluations, but offers the advantages
of being faster, more cost effective, and
available on demand. Last but not least, Vertex AI also provides safety
bias evaluation. This evaluation checks
a model's outputs for biases against identity
groups such as gender. This analysis aims to ensure that the output of
the LLM does not perpetuate harmful
stereotypes or unfair treatment
toward any group. In conclusion, Google
Cloud's Vertex AI provides comprehensive tools for evaluating large
language models, focusing on performance
metrics and safety bias. Automatic metrics like BLEU
and ROUGE offer standardized, quick, and efficient ways to assess model outputs
against reference data. Auto side by side compares
two models side by side, mimicking human
judgment, but with the benefits of speed
and cost efficiency. Additionally, safety
bias checks for fairness across
identity groups such as gender to ensure
that LLMs do not reinforce harmful stereotypes
or discrimination. This holistic approach
to evaluation allows developers to refine LLMs, aligning them with
ethical standards and societal expectations
for responsible AI.
5. L2V1 - Automatic Metrics: In this video, we will have a
look at automatic metrics and understand their role in evaluating large
language models. Imagine a developer struggling with the performance
of their AI model. They spend hours testing and tweaking the
model without having any clear feedback on the effect of their tweaks
on the outcome of the LLM. I believe most of
you who are taking this course already know how
frustrating that can be. But what if I told you there
are tools that provide us precise performance data and highlight the exact
areas for improvement. By the end of this video, you'll understand the
various automatic metrics used in LLM evaluation, why they are used, and
how they can guide the refinement of model
performance for different tasks. These tasks can be classification, summarization, text generation,
or anything like that. So automatic metrics provide a fast and cost effective way to evaluate your
model's performance using a range of task
specific metrics. This approach assesses
models based on input prompt and
output response pairs, allowing you to quickly
gauge their effectiveness. Automatic metrics are
a standard methodology widely used in academic research and many open benchmarks. They utilize commonly
accepted metrics for several general AI tasks, making results comparable across different studies and platforms. The evaluation process
involves feeding an evaluation dataset into the model to generate
prediction results. These results are
then assessed using the selected evaluation
metrics to measure the model's performance on
the specific task at hand. By leveraging automatic metrics, you can efficiently evaluate your model's capabilities
and identify areas for improvement without the need for extensive
manual review. At the moment, the
models available on Vertex AI include base and tuned versions of PaLM
text-bison and Gemini. Supported tasks include
classification, summarization, question
answering, and text generation. There's at least one metric
for each of these tasks. Each task has its own specific
metrics. Micro-F1 and macro-F1 scores gauge overall classification
precision and recall. Per-class F1 assesses
them per category. ROUGE-L evaluates summary
closeness to a reference, while exact match scores
question answering accuracy. BLEU measures the precision of generated text against
a human-written reference. Using automatic metrics is
pretty straightforward. First, we prepare the evaluation dataset with input-output pairs. Then we upload the dataset to
Google Cloud Storage. Finally, we perform the
model evaluation by using the Vertex AI Python
library to submit the job. In the next video, I'll
walk you through a demo on how to do
each of these steps. But for now, let's
quickly review it. For the dataset,
you must provide the prompt with
instructions and context, as well as a ground truth, which will be used together with the generated
answers to calculate metrics related to
the selected task. It's a good idea to offer at least ten examples similar to how the
application will be used. When you have
prepared your dataset and uploaded it to
Google Cloud Storage, you can use the template that Vertex AI provides for the model
evaluation pipeline. The parameters for running the evaluation pipeline include the location of the
evaluation dataset, the task that is going
to be performed, and the model that should
be utilized for the task. With those parameters, you can then run the model
evaluation pipeline job. We will see a demo of running an evaluation task
in the next video. In conclusion, in this video, we went over automatic
metrics in Vertex AI, an efficient and
standardized approach for evaluating LLMs. We've explored the
supported models and tasks, understood the application
of each evaluation metric, and outlined the process of preparing and running
an evaluation pipeline. Through these metrics, you can objectively measure and refine your model's performance
in order to make sure it meets the demands
of real world applications.
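
As a rough illustration of the classification and question answering metrics named in this video, the sketch below computes micro-F1, macro-F1, per-class F1, and exact match in plain Python. It is not the Vertex AI implementation. The function names, the labels, and the normalization used for exact match (trimming whitespace and lowercasing) are assumptions made for this example.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    per_class = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    # Macro-F1 averages the per-class scores, so every class counts equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro-F1 pools the counts first, so every example counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class


def exact_match(references, predictions):
    """Fraction of predictions equal to their reference after trim and lowercase."""
    hits = sum(r.strip().lower() == p.strip().lower()
               for r, p in zip(references, predictions))
    return hits / len(references)


y_true = ["sports", "sports", "politics", "tech"]
y_pred = ["sports", "politics", "politics", "tech"]
micro, macro, per_class = f1_scores(y_true, y_pred)
print(micro, macro, per_class)
print(exact_match(["Paris", "42"], ["paris ", "41"]))
```

The difference between the two averages matters most when classes are imbalanced: micro-F1 is dominated by the frequent classes, while macro-F1 gives a rare class the same weight as a common one.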
6. L2V2 - Automatic Metrics Demo: In this video,
we're going to walk through a live
demonstration of using the rapid evaluation SDK to evaluate the output of Gemini, an
LLM developed by Google. Through this demo,
you'll see firsthand how to apply automatic
metrics to assess your model's output
and understand the strength and weakness
of different AI models. By the end of this video, you'll know exactly how to use the rapid evaluation SDK to
assess the output of an LLM. We'll cover loading
your dataset, initiating model evaluation, applying automatic
metrics, and interpreting the results to gain insight into your
model's performance. Let's get to the demo. The
link to this tutorial is provided so you can run
the evaluation yourself. In this demo, we'll be
going over how to use the rapid evaluation tool to analyze the
performance of an LLM. This demo uses a Google Colab notebook to guide you through using
the rapid evaluation. We will first begin by preparing the necessary components
to run this tool. First, we'll create a
Google Cloud account. During account creation, you will be prompted for your
Gmail address and password. Once you've created the account, you'll see a greeting
screen similar to this. Open the menu tab to the
left and select Billing. From there, you will
have to enable billing. You have to put a credit
or debit card in order to activate billing.
But don't worry. There will be $300 worth of credit provided to
anyone in the beginning. So you don't have to spend any money for running this demo. Afterwards, you
open the menu tab again and select APIs and
services on the screen. You'll click on the
dropdown that says Library and search for
Vertex AI API. You will then click on Enable
to enable the API for use. Lastly, you'll
create a project in Google Cloud on this dropdown
menu here on the top left. Click on that and you
select new project. From there, Google will guide you on creating
the first project. After you've created
the first project, you will see that there is a unique ID associated
with the project. Make sure to save the ID as this is required for the
evaluation task. Now we are ready to
go to the setup. Begin by running the
first cell down here. This will
install the package needed to run the rapid evaluation. Note that you may
have to restart the kernel for the
package to be recognized. Next, we will run this
cell to authenticate. Use the project ID you've seen previously and paste it into
the project ID variable. As for location, this demo
will be using us-central1. You can look up the supported locations
for this variable. You will receive a pop up window indicating that you have
to log into Google. Here you can login using
your Google Cloud account. It will then ask
you to grant access to certain features, which you
will allow and continue. You should end up with a page indicating that you
have successfully authenticated to
Google Cloud, and then you may proceed
back to the notebook. Then we will set up Google
Cloud project information and initialize the Vertex
AI SDK using the project ID. After you set up your
project ID and location, run the cell, which
will initialize the Vertex AI SDK to be used. Next, we'll import the
necessary libraries. Run the cell to get all of
the necessary libraries. Note that the main
libraries are listed below, which are the ones
processing the information. Next, run the library setting cell and the
helper functions. Note that these cells are for formatting information
and adjusting the setting for warnings and logs as well as
performance adjustment. We're now ready to run
the evaluation job. Before that, let's go over the requirements needed
to run this evaluation. First, we need the data
that is being evaluated. To properly format the data
for the evaluation task, we'll create the
pandas data frame using arrays of data
stored in a dictionary. In the dictionary, you
can have an instruction, a context, a reference, a prediction, and a response. Each index value corresponds to the other array at
the same index value. For example, index zero
and the response array corresponds to the other
array index zero, and so on. In this demo, we'll be
using two rows of data. Insert these data as an
array into a dictionary, which is to be converted
to a pandas data frame. Next, we'll decide what metrics to choose for evaluating
the responses. The responses are measured by various automatic metrics that the rapid evaluation
tool provides. Here we can see all
the possible metrics in the central column, along with the type
of measurements in the left and the required data
frame input on the right. For example, coherence measures the model's ability to produce a clear and sound response. Fulfillment measures how well
the model has answered and completed the given instructions with a predetermined prediction, and BLEU and ROUGE compare
the similarity between the given reference and the response in terms of words. You may look into these metrics on your own if
you're interested. After selecting the metrics you want to measure, input each of the metric names
into the array shown here. You will also insert the
evaluation dataset into the required dataset argument and provide a name
for the experiment. In the last segment of the cell, we run the actual
evaluation task. Upon running the cell, you should see that an
experiment has been created. Clicking on the view
Experiment button will redirect to Google Cloud, where you will be able to view the status of the
evaluation pipeline. The time it takes for the
evaluation task depends on the number of metrics as more metrics take more
time to complete. In conclusion, we've seen
how the rapid evaluation SDK facilitates the assessment
of generative AI models, providing an efficient
way to analyze model performance through
automatic metrics. This approach helps identify
strengths and weaknesses, ensuring your model meets the expected standards for
real world applications.
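
The dictionary-of-arrays layout described in this demo can be sketched in plain Python as follows. This is an illustration only: the helper `to_rows` and the placeholder texts are hypothetical and not part of the rapid evaluation SDK, and the column names simply follow the ones mentioned in the video. The helper checks that every column has the same length, which is what lets index zero in one array line up with index zero in the others.

```python
def to_rows(columns):
    """Turn a dict of equal-length lists into a list of per-example dicts."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    n_rows = next(iter(lengths.values()), 0)
    return [
        {name: values[i] for name, values in columns.items()}
        for i in range(n_rows)
    ]


# Two rows of placeholder data, one list per column (hypothetical content).
eval_data = {
    "instruction": ["Summarize the article in one sentence."] * 2,
    "context": ["Article text A ...", "Article text B ..."],
    "reference": ["Reference summary A.", "Reference summary B."],
    "response": ["Model summary A.", "Model summary B."],
}

rows = to_rows(eval_data)
for row in rows:
    print(row["reference"], "<->", row["response"])
```

In the notebook itself, a dictionary shaped like `eval_data` is what gets converted to a pandas data frame, for example with `pandas.DataFrame(eval_data)`, before it is passed to the evaluation task.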
7. L2V3 - AutoSxS: In this video, we're taking a close look at
Auto side by side, a comparative evaluation tool
for large language models. Imagine working on an
AI project where you need to choose the best
model for summarization. Without clear comparisons, it feels like
guessing in the dark. The good news is that with
Auto side by side, you're able to perform side-by-side evaluations of the outputs
of two different models. By the end of this video, you'll be able to understand
how Auto side by side works, the role of the autorater, and how to use it to
compare model outputs. You'll gain insights into
evaluating LLMs with a clear understanding of what makes one model's response
better than the other. Auto side by side is an evaluation tool that
compares two LLMs side by side. It utilizes an autorater, or a judging model, to determine the better
response to a prompt. Using this tool, you can
assess the performance of any generative AI model for summarization and question
answering use cases. Auto side by side also provides explanations and certainty
scores for each decision. At the core of
Auto side by side sits the autorater, which makes this comparative
evaluation possible. The autorater is an LLM specifically designed
to assess the quality of responses generated by other models when given an
original inference prompt. Auto side by side can evaluate any model with pre-generated
predictions and can auto-generate responses
for any model in the Vertex AI Model Registry that supports batch prediction. Currently, it can evaluate
the performance of models on summarization and
question answering tasks. For each side by
side evaluation, auto side by side employs
predefined evaluation criteria. For example, some criteria
for summarization include how well the model
follows prompt instructions? How grounded is the response in the inference context
and instructions? How well does the model
capture key details in the summarization and how
concise is the response itself? Using auto side by side is
pretty straightforward. First, we prepare a dataset
of prompts, contexts, and corresponding
generated responses, only if input prompts required. Then we store the
evaluation dataset to Google Clouds of storage
or a Big Query table. And then we perform the model evaluation by running the evaluation
pipeline job. In the next video, you will see a demo of autoste
by side in action, comparing Gemini Pro with another LLM for a
summarization task. But before that, let me explain how each of
these steps work. Auto site by site accepts a
single evaluation dataset. The dataset must include
at least one example, but for proper evaluation task, something around 400 to 600
examples are recommended. Each unique example has a unique ID and includes
content and responses. We also can add an
additional column for taking human preferences
into account as well. Next, we must set parameters for performing
the model evaluation. For example, in a model evaluation
without human preference, the parameters might specify
the evaluation dataset, columns to use, the
task, for example, summarization or question
answering, and autorater prompt parameters like the inference context
and instructions. We must also provide
the columns containing predefined predictions to calculate the
evaluation metric. After defining our parameters, we can initiate an
evaluation pipeline job using a Google
provided template. The parameter values are passed in to configure
the pipeline job. Auto side by side
utilizes the Vertex AI Python SDK to
get this job done. After successfully completing an Auto side by side evaluation, you can view the
evaluation results. Auto side by side generates three primary types of
evaluation results: a judgments table,
aggregate metrics, and alignment metrics if human
preferences are provided. The judgments table indicates
the superior response, and each choice is accompanied
by a confidence score, which is a value from 0 to 1. The Auto side by side
judgments include an explanation of each
of the autorater's choices. Auto side by side can generate and compare multiple outputs for a given task to
select the response judged as better based on
criteria like coherence, logical flow, and
capturing the key points. For example, when choosing between response
A and response B, the autorater might explain that while both provide
good summaries, response B does a slightly
better job of capturing the overall story in a more coherent and
organized manner, compared to the more
statistics focused response A. Auto side by side also
provides aggregate metrics. These win rate metrics are derived from
the judgments table as the percentage of times the autorater preferred one
model over the other. These metrics help to quickly identify
the superior model. Also, as I mentioned earlier, auto side by side allows for the validation of judgments
with human preference. This means providing
additional information and parameters within the side by side evaluation
pipeline is possible. In order to do so,
in the dataset, a column must be added
for human preference. We also need to define human preference column
within the parameters. The rest of the process
remains the same. Including human
preference results in additional metrics for
human preference alignment. The output includes all
the regular metrics, but it also includes a
human preference win rate along with the
autorater win rate and a Cohen's Kappa score, which denotes the level of agreement between the
autorater and the human rater. Again, this is a value from 0 to 1, with zero being random choice and
one being perfect agreement. In conclusion, Auto Side by Side stands out as an
innovative tool in Vertex AI for evaluating and comparing the performance
of generative AI models. We've seen how it brings precision to the
evaluation process with side by side comparisons and detailed
explanation features. It streamlines the assessment
of LLMs, ensuring that the best-performing model can be identified based on
task specific criteria.
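
The aggregate metrics described in this video can be illustrated with a small Python sketch that derives a win rate and a Cohen's Kappa score from two lists of choices. This is a sketch under assumptions, not the AutoSxS pipeline itself: the function names, the "A"/"B" labels, and the eight example judgments are invented for illustration. The kappa formula is the standard one, observed agreement corrected for the agreement expected by chance. In general that formula can also give negative values when two raters agree less often than chance would predict.

```python
from collections import Counter


def win_rate(choices, model="A"):
    """Share of examples on which the given model's response was preferred."""
    return sum(choice == model for choice in choices) / len(choices)


def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / (n * n)
    if expected == 1:
        return 1.0  # both raters always chose the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical preferences for eight prompts (which response was judged better).
autorater = ["A", "B", "B", "A", "B", "B", "A", "B"]
human = ["A", "B", "A", "A", "B", "B", "B", "B"]

print("autorater win rate for model B:", win_rate(autorater, "B"))
print("human win rate for model B:", win_rate(human, "B"))
print("Cohen's kappa:", cohens_kappa(autorater, human))
```

Here both raters prefer model B on five of eight prompts, yet they disagree on two individual prompts, which is why the kappa score is well below one even though the win rates are identical.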
8. L2V4 - AutoSxS Demo: In this video, we
will demonstrate how to use Auto
side by side within Vertex AI to evaluate the Gemini model
against another LLM. This practical guide will show you each step in setting up and running an evaluation using the tools provided by
Google Cloud Platform. By the end of this video, you'll understand how to navigate the Auto side
by side tool, set up your evaluation datasets, and interpret the results from the Auto side by side
comparative analysis. This will equip you with
the skills to effectively assess the performance
of generative AI models. Now let's get to the demo. The link to this tutorial is provided so you can run
the evaluation yourself. In this demo, we will be going over how to
use Auto side by side to evaluate and compare the performance of
large language models. To begin, we will first install the following package by
running this command. We will be using this package to call the API from Google Cloud. After you run the command, make sure that you
restart the runtime in order to use the
newly installed package. A cell has been provided for the user to restart the runtime. After successfully
running the cell, you will receive a pop
up indicating that the kernel has died and
will automatically restart. Now let us set up the
necessary components. We will first create a
Google Cloud account. During account creation, it will prompt you for your
Gmail address and password. Once you have
created the account, you will be greeted with
a screen similar to this. Open the menu tab on the
left and select Billing. From there, you will
have to enable billing. You will have to enter a credit or debit card in
order to activate billing, but there will be $300 worth of credit provided to you,
so don't worry about it. Afterwards, you will
open the menu tab again and select
APIs and Services. Click on Library and
search for Vertex AI API. You will then click on Enable to enable
the API to be used. Next, you will create a
project in Google Cloud. Click on the drop-down menu on the top left and
select New Project. From there, Google will guide you through creating
the first project. Lastly, open the menu tab
again and select IAM and Admin. You will see the newly
created project. Click Grant Access and,
in the Principal field, enter the name of the principal of your created project. Then,
in the Roles drop-down, type object into the filter. Here, you will see
the option for Environment and Storage
Object Administrator. Add this to the
principal and save. This is how it should look,
with the principal having the Storage Object Admin role.
Now we are ready to go. Since we are working on
Vertex AI Workbench, you do not need to perform
any additional steps. To begin, we will
set the project ID. You can find the project ID
by going back to the project drop-down and finding the
column that displays the ID. In this case, this is
the ID for the project. Run the cell after you have changed the ID to
your project ID. Next, we will set the region. In this demo, the region
is set to us-central1. Now, run the cell block. Now we will generate
a random UUID. This will be used to
uniquely identify the project and avoid
potential name collisions. We will now use the UUID to create a unique bucket URI name. Now we will move on to
setting up the process. We will first import the libraries and
define our constants. We will also define our helpers. Next, we will initialize the Vertex AI SDK by
providing our project ID, region, and our bucket URI. As we have defined
in our constants, we will be comparing a Gemini
model with another LLM on a dataset, one producing response A
and the other response B. Each row of the data contains
an ID, a document to summarize, and the
two versions of the response to the
document. We can get a peek
at this by using Pandas to read the
JSON and format it. Next, we will run the
model evaluation job. Here are the parameters
required by the pipeline. First is the evaluation dataset, which
indicates the data location. Next are the ID columns, which distinguish unique evaluation examples;
these are the ID and document
fields in this case. Next is the task. The task we are evaluating
is summarization. Then there are the autorater
prompt parameters, which are used to
configure the autorater's behavior for the task, such as setting the context
and instructions. You will then need to provide response column A and response column B, with the names of the columns containing predefined predictions, in order to calculate the
evaluation metrics. In this case, it is
response A and response B. After we define the model
evaluation parameters, we can now run the model
evaluation pipeline job with this given template using
the Vertex AI Python SDK. Let this run, as it can take a while for the
pipeline to finish. You can click on the link to see the pipeline in action in
Google Cloud Platform. This is what your
pipeline looks like. After the pipeline
run has completed, you can use the code segment below to view the judgment of each response and how it compares according
to the autorater. It offers information
such as explanations of the preferences and the confidence
score of the autorater. Next, we can also show the aggregated metrics using
the code segments below. This is rather
useful to determine which model is better in the
context of the given task. The autorater also supports human preference data to validate
the autorater evaluation. We will now use the other URI, which includes an additional
human preference column. In the pipeline
parameters, we will now include the human preference
column and perform the same pipeline run
with the new column of data. We can now obtain the
human-aligned aggregated metrics. Again, this is what the pipeline looks like in Google Cloud. Using the code segments below, we obtain the performance
of the Auto Side by Side autorater measured against
human preferences. Lastly, we will clean up
the Google Cloud resources. We can run the cell below, and it will clean up all of the resources we used
in this project. In conclusion, this
demo has illustrated the practical applications
of Auto Side by Side in evaluating the
Gemini model on Vertex AI. We've navigated through
the setup process, demonstrated how to
configure and run the evaluation, and interpreted
the comparative results. This hands-on
approach ensures you can effectively
leverage Auto Side by Side to assess and enhance the performance of
generative AI models, which in turn helps you make your AI solutions more
robust and reliable.
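The setup steps from the demo, generating a unique identifier, deriving a bucket URI from it, and collecting the pipeline parameters, can be sketched in plain Python. This is a hedged illustration rather than the notebook's actual code: the helper names, the placeholder project ID, the dataset path, the column names, and the exact parameter keys are assumptions based on what the transcript describes, and they should be checked against the AutoSxS pipeline documentation before use. The sketch stops short of calling the Vertex AI SDK, which in the demo receives a dictionary like this one when the pipeline job is created.

```python
import uuid

# Placeholder values; replace with your own project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"


def unique_bucket_uri(project_id):
    """Build a bucket URI with a random suffix to avoid name collisions."""
    suffix = uuid.uuid4().hex[:8]
    return f"gs://{project_id}-autosxs-{suffix}"


def build_autosxs_parameters(evaluation_dataset, human_preference_column=None):
    """Collect the evaluation settings described in the demo (key names are assumptions)."""
    parameters = {
        "evaluation_dataset": evaluation_dataset,  # Where the evaluation data lives.
        "id_columns": ["id", "document"],  # Columns that identify unique examples.
        "task": "summarization",  # The task being evaluated.
        "autorater_prompt_parameters": {  # Context and instructions for the autorater.
            "inference_context": {"column": "document"},
            "inference_instruction": {"template": "Summarize the following text: "},
        },
        "response_column_a": "response_a",  # Predefined predictions of model A.
        "response_column_b": "response_b",  # Predefined predictions of model B.
    }
    if human_preference_column is not None:
        # Optional column used to check the autorater against human raters.
        parameters["human_preference_column"] = human_preference_column
    return parameters


bucket_uri = unique_bucket_uri(PROJECT_ID)
# Hypothetical dataset location inside the bucket.
parameters = build_autosxs_parameters(f"{bucket_uri}/evaluation_data.jsonl")
print(bucket_uri)
print(sorted(parameters))
```

Keeping the parameters in one function makes the second run in the demo a one-argument change: passing a human preference column name adds the extra key while everything else stays the same.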
9. L3V1 - Text based Evaluation Models part1: In this video, we will explore foundational text based
evaluation models for LLMs such as METEOR and perplexity alongside
fairness evaluation metrics. Did you know that biased AI
models can negatively impact applications in critical areas like loan approvals
and hiring decisions? By using METEOR and perplexity together with fairness metrics, you can mitigate the
risks of these biases by ensuring your models are both
high performing and fair. By the end of this video, you will understand how different
evaluation metrics like METEOR and perplexity work
and why they are important. You will also learn about the significance of fairness
metrics in ensuring that AI applications treat all demographic
groups equitably. METEOR, or Metric for Evaluation of Translation
with Explicit ORdering, improves upon earlier
metrics like BLEU by considering synonyms,
paraphrasing, and stemming. It assesses translation
quality based on literal accuracy, fluency, and intent, making
it valuable for applications requiring nuanced
language understanding. Let's consider a
practical example to understand how METEOR works. Imagine we have two translations
of the English phrase, the quick brown fox
jumps over the lazy dog. METEOR would score translation A higher than translation B. Although both translations
convey similar meanings, translation A maintains a more accurate and
fluent structure with appropriate synonym usage, leaps for jumps and
fast for quick. METEOR evaluates
these translations by analyzing word order, synonymy, and the overall
semantic similarity to the reference text. This emphasizes the
translation's fluency and comprehensibility. Perplexity is another
measurement used to evaluate language models by
assessing how well a model can predict
a sample of text. It is based on the
probability distribution, the model assigns to a
sequence of words with lower values indicating that the model predicts the
sequence more accurately. Perplexity essentially
quantifies the model's uncertainty
about its predictions. It provides a gauge
of its effectiveness in language understanding
and generation tasks. Let's look at an example. Consider a model tasked with predicting the next
word in the sentence, the cat sat on the... Suppose our model predicts four possible
completions, mat, window, car, and moon, with respective probabilities
of 0.5, 0.2, 0.2, and 0.1. The perplexity of the model
for this prediction can be calculated by taking the inverse of the probability
of the correct word, mat in this case, that is, the probability raised
to the power of minus one. Here, perplexity would be 1 / 0.5 = 2, indicating relatively
low uncertainty. Lower perplexity
values demonstrate the model's confidence and
accuracy in its predictions, suggesting a better
understanding of the context, the cat
sat on the mat. We also have fairness
evaluation metrics, which are critical tools
used to assess whether AI models perform equitably across different
demographic groups. These metrics help
identify biases in model predictions that could disadvantage certain
groups based on gender, race, age, or other factors. It can be done by evaluating
differences in error rates, positive prediction proportions, and other performance
indicators. For example, consider a
loan approval AI model that uses personal data to
predict creditworthiness. To assess fairness,
we could analyze the following. One, difference in positive proportions
in predicted labels. If 40% of applicants from
group A, for example, male applicants, are
predicted as creditworthy compared to only 20% from
group B, in this example, female applicants, this
metric would highlight a potential bias in
model predictions favoring group A. Two,
recall difference. If the model identifies 90% of actual creditworthy
individuals in group A, but only 70% in group B, the recall difference
metric would indicate the model is less
effective for group B, potentially leading
to unfair treatment. Three, specificity difference. Examining how well the model avoids false positives
across groups, we might find that it
incorrectly labels non-creditworthy individuals as creditworthy at different
rates between groups, which could affect the fairness of the decision making process. In conclusion, this video has demonstrated the crucial
roles that both performance and fairness evaluation
metrics play in the development and deployment
of language models. We've seen how metrics
like METEOR and perplexity help ensure that models perform optimally while
fairness metrics address biases to promote equity and
trust in AI technologies.
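The two numeric examples in this video, the perplexity of a single prediction and the group-wise fairness differences, can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the confusion-matrix counts for the two groups are invented so that they match the percentages quoted in the video (40% versus 20% predicted creditworthy, 90% versus 70% recall). Perplexity is computed with the standard definition, the exponential of the average negative log-probability, which reduces to the inverse probability when there is only one predicted word.

```python
import math


def perplexity(correct_token_probabilities):
    """Exponential of the average negative log-probability of the correct tokens."""
    if not correct_token_probabilities:
        raise ValueError("At least one probability is required.")
    total_log = sum(math.log(p) for p in correct_token_probabilities)
    return math.exp(-total_log / len(correct_token_probabilities))


def group_rates(counts):
    """Positive prediction rate, recall, and specificity from confusion counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    return {
        "positive_rate": (tp + fp) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def fairness_differences(group_a, group_b):
    """Group A minus group B for each of the three rates."""
    rates_a, rates_b = group_rates(group_a), group_rates(group_b)
    return {f"{name}_difference": rates_a[name] - rates_b[name] for name in rates_a}


# "The cat sat on the ..." with probability 0.5 assigned to the correct word, mat.
print(f"Perplexity: {perplexity([0.5]):.2f}")

# Invented counts per 100 applicants, chosen to match the percentages in the video.
group_a = {"tp": 36, "fp": 4, "tn": 56, "fn": 4}
group_b = {"tp": 14, "fp": 6, "tn": 74, "fn": 6}
for name, value in fairness_differences(group_a, group_b).items():
    print(f"{name}: {value:+.3f}")
```

A difference near zero suggests the model treats the two groups similarly on that measure, while a large positive or negative value flags a gap worth investigating.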
10. L3V2 - Text based Evaluation Models part2: In this video, we'll
expand our exploration of text based evaluation
models for LLMs, focusing on diversity metrics
and zero shot evaluation. Most likely, you have
noticed that a lot of times AI generated
content lacks diversity, which makes it less engaging
or boring for users. By applying diversity metrics, you can ensure your AI generates varied and
interesting responses. We also cover zero
shot evaluation, which will further test your model's adaptability to
new and unforeseen tasks. By the end of this video, you will be able to understand the importance and
application of diversity metrics in generating varied and creative outputs. Additionally, you'll learn how
zero shot evaluation helps gauge an LLM's ability to adapt to tasks it hasn't
been explicitly trained for. Diversity metrics
evaluate the range and uniqueness of responses
generated by a language model. These metrics are particularly important for
applications requiring creative or varied outputs such as content generation
or dialog systems. By measuring aspects
like lexical richness, variation in sentence structure, and the novelty of concepts
introduced in responses, diversity metrics ensure that the model's outputs are not just accurate but also engaging and reflective of a wide
array of perspectives. Let's imagine a scenario. Suppose you have an AI
model that is tasked with generating story ideas based on a single prompt, a
day at the beach. Suppose the model generates
the following responses. In evaluating these responses
using diversity metrics, we would look for
variety in themes, characters involved, and
activities described. Response B would score
highly on diversity for offering multiple subplots
and varied interactions. While response C
would score lower due to its redundancy
with response A. Response D introduces
a novel element, which enhances its score for
introducing unique content. These metrics help in assessing the creativity and appeal
of the model's outputs, ensuring they provide fresh and engaging
content for users. Now let's look at
zero shot evaluation. Zero shot evaluation measures
a model's ability to handle tasks it has not been
explicitly trained for. This metric is
critical for assessing the generalization capabilities
of large language models. It reveals how well a model can apply learned knowledge to new contexts or problem types without additional fine
tuning or training. It demonstrates the
model's adaptability and flexibility across
various applications. Let's look at an example. Consider a language
model trained predominantly on
English literary text. If it is presented with a task in a completely
different domain, such as generating
technical descriptions for new software applications, zero shot evaluation
would assess how well the model performs
this task straightaway. Let's look at this example. We can see that
even though this model had no prior training on
software descriptions, it generates a coherent
and relevant description. It demonstrates good
zero shot capability. This ability to generalize
from literature to technical writing without
any specific training showcases the model's robustness and utility in real
world scenarios where training data may not always be comprehensive
for every possible task. In conclusion, we discussed
how diversity metrics and zero shot evaluation play crucial roles in
evaluating LLMs. Diversity metrics help ensure the generated content meets
the creative demands of real world applications while zero shot evaluation assesses the adaptability of these
models to new tasks, showcasing their robustness and utility in various scenarios.
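One simple way to put a number on the lexical side of diversity is the distinct-n ratio: the share of unique n-grams among all n-grams in a set of responses. The video does not name a specific metric, so this is a swapped-in illustration under that assumption, and the story ideas below are invented stand-ins for the responses shown on screen.

```python
def distinct_n(responses, n=1):
    """Ratio of unique n-grams to total n-grams across all responses."""
    ngrams = []
    for text in responses:
        tokens = text.lower().split()
        # Collect every run of n consecutive tokens in this response.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Invented story ideas for the prompt "a day at the beach".
varied_ideas = [
    "a family builds a sandcastle",
    "two surfers rescue a stranded dolphin",
]
redundant_ideas = [
    "a family builds a sandcastle",
    "a family builds a big sandcastle",
]

print(f"Varied set:    {distinct_n(varied_ideas):.2f}")
print(f"Redundant set: {distinct_n(redundant_ideas):.2f}")
```

The redundant set scores lower because most of its words repeat. A word-level ratio like this only captures lexical richness, so variety in themes or characters still calls for human judgment or a semantic comparison.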
11. L3V3 - Evaluation of non text Generative AI Models: In this video, we'll talk
about how to evaluate AI models that create
images, sounds and videos. Imagine watching an AI generated
movie where scenes look choppy or the sound feels
off. It would be frustrating. Let's explore how to evaluate these models to make sure that the content they generate is smooth, realistic, and engaging. By the end of this video, you'll know how to spot the important ways
experts evaluate image, sound, and video AI models. You'll become familiar with
the skills to examine and evaluate the media that these generative AI
models generate. Evaluating AI image
generation models involves both subjective
and objective methods. Subjective evaluations are
based on human judgment of factors like visual
appeal and emotional impact. Objective evaluations
in contrast, use specialized tools to measure aspects such
as image resolution, color accuracy, and
the presence of visual glitches or flaws
known as artifacts. Consider an AI generated
image of a landscape. To evaluate it, we might use a pixel based metric like PSNR, which stands for peak
signal to noise ratio, to assess image clarity and
sharpness objectively. At the same time, we could conduct a survey where participants
rate the image on realism, beauty, and emotional resonance to gather subjective data. This comprehensive
evaluation helps determine the overall success of the image generation model in creating visually appealing
and accurate images. Now let's move on to sound. Evaluating AI sound
generation models means looking closely
at the quality, accuracy, and emotional effect
of the sounds they create. You can use objective measurements
like spectral flatness and zero crossing rate to technically assess
the sound quality. It's also important to gather subjective feedback
from listeners on how real and emotionally engaging the AI generated
sounds seem to people. Imagine evaluating a piece of AI generated music intended
to evoke relaxation. Objective analysis could measure the consistency of
tempo and the clarity of sound using tools like a loudness meter or
a spectrum analyzer. For subjective evaluation,
a listener group could rate the music on its
soothing qualities and emotional effects. Things like that can
provide insight into the music's effectiveness in achieving its intended
emotional goal. How about videos? When evaluating AI video
generation models, you need to look at
two main things, the visual quality
of the video and how well the frames flow
together over time, which is also called
temporal coherence. To measure visual quality, you can use metrics like PSNR
that we talked about. This metric checks the sharpness and amount of detail
in the video. There's another metric
that is called SSIM, which stands for structural
similarity index. This metric looks at the detail and compares the AI video to a
reference video. To evaluate the
temporal coherence, you want to see how
smoothly the video frames transition
from one to the next. This helps ensure
that the motion in the video looks
natural and logical. Another important
thing to assess is contextual relevance. Does the video content actually match up with the
intended story or scene? The AI generated video should accurately reflect what it is
supposed to be showing. For example, consider evaluating an AI generated video that
depicts a diver in the ocean. Objective metrics would analyze the video's resolution
and frame to frame consistency to
ensure smoothness in motion and clarity
in visual details. Subjectively, viewers
could assess how well the video captures the
essence of the setting, considering the elements like the realism of the ocean waves, the natural movement
of the diver, and the overall ambience. This combined evaluation helps determine if the video
generation model effectively replicates a realistic and engaging
diving experience. In conclusion, evaluating non text generative
AI models for images, sounds, and videos
is essential to advance AI in creative and
practical applications. By combining objective
measurements with subjective human feedback, we get a comprehensive view
of an AI model's capabilities. This approach ensures
the AI generated content is technically sound and
resonates with people, which is crucial for developing useful and appealing
generative AI applications.
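Two of the objective measurements named in this video, PSNR for images and video frames and zero crossing rate for audio, are short enough to sketch directly. The code below uses the standard formulas on plain Python lists; the tiny pixel and audio samples are invented, and treating an image as a flat list of 8-bit pixel values is a simplifying assumption. Note that PSNR always compares a generated image against a reference image.

```python
import math


def psnr(reference_pixels, generated_pixels, max_value=255):
    """Peak signal-to-noise ratio in decibels between two equally sized images."""
    if len(reference_pixels) != len(generated_pixels) or not reference_pixels:
        raise ValueError("Images must be non-empty and have the same number of pixels.")
    squared_errors = [(r - g) ** 2 for r, g in zip(reference_pixels, generated_pixels)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # Identical images have no noise.
    return 10 * math.log10(max_value ** 2 / mse)


def zero_crossing_rate(samples):
    """Share of consecutive sample pairs where the signal changes sign."""
    if len(samples) < 2:
        return 0.0
    pairs = zip(samples, samples[1:])
    crossings = sum(1 for previous, current in pairs if (previous < 0) != (current < 0))
    return crossings / (len(samples) - 1)


# Invented grayscale pixel values for a reference and a generated image.
reference = [52, 55, 61, 59]
generated = [54, 55, 60, 58]
print(f"PSNR: {psnr(reference, generated):.2f} dB")

# Invented audio samples: a rapidly alternating signal and a steady one.
print(f"ZCR (alternating): {zero_crossing_rate([1, -1, 1, -1]):.2f}")
print(f"ZCR (steady):      {zero_crossing_rate([1, 2, 3]):.2f}")
```

Higher PSNR means the generated image is closer to the reference, and a high zero crossing rate usually points to noisy or high-frequency audio. Both numbers complement the subjective ratings described above and do not replace them.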
12. L3V4 - Final Notes Importance of Human Evaluation: In this video, we'll summarize
our course and emphasize the critical importance of human evaluation in assessing
generative AI models. Have you ever wondered why some AI generated content is
misleading or inaccurate? We'll dive into what
generative AI does well, where it goes wrong, and why human oversight is necessary to catch and correct
these mistakes and ensure that the outputs of these models are useful
and trustworthy. By the end of this video, you'll understand the
limitations of generative AI, especially its
tendency to produce false information
or hallucinations. We'll discuss why recognizing the flaws is key for using AI effectively and ensuring it gives reliable and
useful results. Generative AI can do
a lot of tasks well, but it also has some
big weaknesses. One major problem is that it can generate false information
or hallucinations. This means the model outputs wrong or made up information. These models often don't know the limits of
their own knowledge, which is why it's so important to evaluate
them carefully. To use generative
AI effectively, we need to understand
its limitations. This means being aware that the model can make mistakes and coming up with ways to reduce these problems when
using it in real life. Since we need to recognize and address the limitations
of generative AI, we introduce a useful
tool called the IVO test, which stands for immediately
validate outputs. It's a simple but
effective way to check if a generative
AI model is reliable. A model passes the IVO
test if users can easily and quickly check that the output is correct
and meets their needs. This way, even users
who aren't experts can effectively use and validate
content created by AI. To implement the IVO test, users evaluate the
AI generated output by comparing it with
reliable resources, a method known as
post grounding. This lets users check that the information is accurate by looking at established facts. This makes sure the AI's output is not only relevant
but also reliable. This step is key for applications where accuracy
is super important. It allows users to use
tools with confidence. Let's say an AI model is made to summarize
scientific articles. To use the IVO test, users can interact with the AI generated summary
in a special app. If they want to check a specific part of the
summary, they can click on it. The app then shows them the matching section in
the original article. This feature makes it easy for users to compare the
summary with the source, making sure the AI's output accurately reflects
the original content. This method builds trust in the AI and helps
users understand better by connecting the
AI generated content back to its reliable sources. By having humans
oversee AI systems, we can make sure they're not just evaluated for performance, but also for
fairness and ethics. This approach helps stop
the spread of biases and ensures AI is developed in a way that respects
human values. So in conclusion, we discussed
the importance of having humans evaluate
generative AI models along with automated methods. By combining human insights with the efficiency
of algorithms, we can assess aspects
like creativity, context, and ethics that
computers might miss. This approach not only makes evaluations more
accurate and reliable, but also ensures
AI is developed in line with our values and
expectations as a society.
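The summary-checking app described in the IVO test example needs some way to find the part of the source article that matches a clicked summary sentence. The video does not say how this is done, so the sketch below is purely hypothetical: it scores each source passage by word overlap (Jaccard similarity) with the summary sentence and returns the best match. The function names and the sample article text are invented for illustration.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_matching_passage(summary_sentence, source_passages):
    """Return (index, score) of the passage with the highest word overlap."""
    summary_tokens = tokenize(summary_sentence)
    best_index, best_score = -1, 0.0
    for index, passage in enumerate(source_passages):
        passage_tokens = tokenize(passage)
        union = summary_tokens | passage_tokens
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(summary_tokens & passage_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Invented passages from a hypothetical scientific article.
article_passages = [
    "We recruited 120 participants and measured their sleep for four weeks.",
    "Participants who exercised daily fell asleep faster than the control group.",
    "Future work should examine long term effects.",
]
summary_sentence = "Daily exercise helped participants fall asleep faster."

index, score = best_matching_passage(summary_sentence, article_passages)
print(f"Best match: passage {index} (overlap {score:.2f})")
print(article_passages[index])
```

Showing the matched passage next to the summary sentence lets a non-expert user judge in seconds whether the summary is supported by the source, which is the point of the IVO test. A production tool would likely need a more robust matching method than simple word overlap.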
13. Outro: Great job. You've done it. You've completed evaluating
large language model outputs. I'm not here just
to say goodbye. I want you to take a moment and celebrate your achievement
throughout this course. Together, we've
explored new concepts, faced challenging tasks,
and grown significantly. Look back and see what you know now that you didn't know at
the beginning of the course. Your commitment has led
to significant progress, and you should be proud
of this accomplishment. This course is only one step in your ongoing
learning journey. The concepts you've
learned here will serve as a foundation
for your future growth. Make sure you keep applying these skills and
maintain your curiosity. To continue your journey, I recommend the following. First, revisit
course materials to refresh your memory
on the contents. Second, make sure
that you engage with your peers in
the community forums. Third, make sure you take on new challenging projects
to keep your skills sharp. Thank you for being a
part of this course on evaluating LLM outputs. Your engagement means a lot
to me and our entire team. As our course concludes, your journey is just beginning. I'm looking forward to hearing
about what you thought of this course and what you are planning to achieve
in the future. Keep pushing forward, stay curious and enjoy
the journey ahead. Congratulations again, and hopefully I'll see you
in a different course. Signing off, Professor Reza.