Evaluating Generative Models: Methods, Metrics & Tools

Reza Moradinezhad, AI Scientist

Get unlimited access to every class

Taught by industry leaders & working professionals

Topics include illustration, design, photography, and more

Lessons in This Class

1. Intro (3:25)
2. L1V1 Introduction to LLMs and their evaluation methods (5:46)
3. L1V2 - Benefits and Challenges of LLM Evaluation Methods (5:11)
4. L1V3 LLM - Evaluation on Vertex AI (5:11)
5. L2V1 - Automatic Metrics (4:59)
6. L2V2 - Automatic Metrics Demo (7:46)
7. L2V3 - AutoSxS (7:37)
8. L2V4 - AutoSxS Demo (8:29)
9. L3V1 - Text based Evaluation Models part1 (6:07)
10. L3V2 - Text based Evaluation Models part2 (4:42)
11. L3V3 - Evaluation of non text Generative AI Models (5:28)
12. L3V4 - Final Notes Importance of Human Evaluation (4:18)
13. Outro (1:48)

About This Class

In this course, you will master advanced evaluation techniques for Large Language Models (LLMs) using tools like Automatic Metrics and AutoSxS. These evaluation methods are critical for optimizing AI models and ensuring their effectiveness in real-world applications. By taking this course, you will receive valuable knowledge and practical skills, including:

Hands-on experience with Google Cloud’s Vertex AI to evaluate LLMs using powerful, industry-standard evaluation tools.
Learn to use Automatic Metrics to assess model output quality for tasks like text generation, summarization, and question answering.
Master AutoSxS to compare multiple models side by side, gaining deeper insights into model performance and selecting the best-suited models for your tasks.
Apply evaluation techniques to improve AI applications across various industries, such as healthcare, finance, and customer service.
Understand fairness evaluation metrics to ensure that AI models produce equitable and unbiased outcomes, addressing critical challenges in AI decision-making.
Prepare for future AI trends by learning about evolving evaluation tools and services in the context of generative AI.
Optimize your model selection and deployment strategies, enhancing AI solution performance, efficiency, and fairness.

By the end of this course, you will have the ability to:

Evaluate LLMs effectively to optimize their performance.
Make data-driven decisions for selecting the best models for your applications.
Ensure fairness in AI systems, mitigating biases and improving outcomes.
Stay ahead of AI evaluation trends to future-proof your skills in a rapidly evolving field.

Whether you're an AI product manager, data scientist, or AI ethicist, this course provides the tools and knowledge to excel in evaluating and improving AI models for impactful real-world applications.

Meet Your Teacher

Reza Moradinezhad

AI Scientist

Teacher

Hello, I'm Reza.

I am passionate about designing trustworthy and effective interaction techniques for Human-AI collaboration. I am an Assistant Teaching Professor at Drexel University College of Computing and Informatics (CCI), teaching both undergraduate and graduate level courses. I am also an AI Scientist at TulipAI, leading teams of young students, pushing the mission of empowering media creators through ethical and responsible use of Generative AI.

I received my PhD in Computer Science from Drexel CCI. My PhD dissertation focused on how humans build trust toward Embodied Virtual Agents (EVAs). I have collaborated with MIT Media Lab, CMU HCII, Harvard University, and UCSD, publishing and presenting in venues such as Springer Nature, ACM CHI, and ACM C&C. I have been re...

Related Skills

AI & Innovation, AI for Development, AI Tools, Development, Programming Languages, Python, Development Tools

Level: Intermediate

Hands-on Class Project

Do a model evaluation using Automatic Metrics:
https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/intro_to_gen_ai_evaluation_service_sdk.ipynb

Do a model evaluation with AutoSxS:
https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/legacy/evaluate_gemini_with_autosxs.ipynb

Transcripts

1. Intro: Welcome to our course on evaluating the outputs of large language models. As AI and natural language processing increasingly influence technology, a deep understanding of how to evaluate large language models is crucial for any modern developer. We'll guide you through foundational evaluation methods, advanced techniques using tools like Automatic Metrics and Auto Side by Side (AutoSxS), and ethical considerations in AI development. This course emphasizes practical applications, integrating human judgment with automatic methods, and prepares you for future trends in AI evaluation across various mediums. Hi, I'm Professor Reza, with more than ten years of teaching experience in the field of computer science and artificial intelligence. In pursuing my PhD, I collaborated with MIT Media Lab, Carnegie Mellon University's HCII, Harvard University, and University of California San Diego. I have published in prestigious venues such as IEEE, Springer Nature, and ACM CHI. My work has been featured by multiple news outlets, including The Next Web and CBS News. This course is ideal for you if you are interested in learning how to assess the outputs of LLMs effectively in order to enhance your business strategies and personal innovation. The learning objectives for this course are to understand the strengths and challenges of LLM evaluation tools, discover some of Vertex AI's model evaluation services, optimize model selection to suit your application, and prepare for the future by understanding how evolving evaluation tools and services can impact the development and deployment of large language models. To be successful in this course, you should have a basic knowledge of machine learning concepts, including model evaluation metrics, and a familiarity with LLMs and their applications. This course is provided to you in three lessons. Lesson one, basics of large language model evaluation methods.
Lesson two, LLM evaluation on Vertex AI, and Lesson three, the future of generative AI evaluation models. By the end of this course, you will gain a thorough understanding of evaluating the outputs of LLMs. You'll learn how to assess the effectiveness and accuracy of LLM-generated content across various domains. Knowing these skills will help you evaluate the quality of different AI models, so you'll be able to select the right one for your needs. This will enable you to design, develop, and implement effective and ethically responsible applications for personal, professional, and business purposes. So let's get started and explore how evaluating LLM outputs can enhance the reliability and effectiveness of AI solutions.

2. L1V1 Introduction to LLMs and their evaluation methods: In this video, we'll explore the concept of large language models, or LLMs for short. Imagine an AI system so advanced it can write stories, answer complex questions, and even hold conversations. Isn't this fascinating? Understanding how these models work and how to evaluate their output is crucial, as these technologies are reshaping our daily lives. By the end of this video, you'll understand how large language models differ from traditional NLP, or natural language processing, models. We're going to compare them in scale and complexity. We'll also discuss the importance of reliable evaluation methods and the potential consequences of improper evaluation on real-world applications. Large language models, or LLMs, are a big step forward in artificial intelligence. These models learn from huge amounts of text data, which allows them to understand and create human-like language. It's almost like they can think in a similar way to humans. LLMs can handle much more complex tasks compared to simpler language models. They can carry out conversations, summarize long pieces of text, and even create original content. They do it all with an impressive level of fluency and accuracy that wasn't possible before.
The real power of LLMs comes from their depth and scale. Unlike traditional NLP models that work with limited data and predefined rules and focus on specific tasks, LLMs are trained on massive, diverse datasets. These datasets contain billions of words. This allows LLMs to understand the nuances of language better and handle various tasks effectively. LLMs use advanced deep learning techniques like the transformer architecture to learn patterns autonomously without being programmed for specific tasks. By building a deep understanding of language directly from data, LLMs can go far beyond the capabilities of earlier models that rely on simpler techniques and structured input. The advanced capabilities of LLMs allow them to perform a variety of language tasks, from translating language to generating creative writing. They can adapt to different contexts and generate coherent, relevant responses. This sets them apart from earlier NLP technologies, which typically handle shorter, isolated text. Another key difference is that LLMs' large neural networks enable them to maintain context over long conversations or documents. This was pretty challenging for previous, traditional NLP models. Now let's see why it's important to evaluate the output of LLMs. It is important to evaluate the outputs because these models are being used more and more in areas where getting the right information really matters: areas such as healthcare, law, customer service, news, and education. In these fields, it's crucial that the outputs are accurate, fair, and appropriate to maintain trust and make these tools useful. Good evaluations help keep the information reliable by checking that LLMs understand the input correctly and ensuring the responses are correct and relevant. They also protect against the negative effects of incorrect outputs, such as spreading wrong or misleading information, a.k.a. fake news.
Another reason that evaluating LLMs is crucial is that the output of these models reflects the bias in the data they were trained on. We want to make sure that we are following ethical standards. LLMs can amplify biases from the data that we train them on, and that can lead to unfair or prejudiced outputs. Good evaluations can identify and mitigate these biases, ensuring fairness and preventing further discrimination. Through evaluation, we can also check if responses are appropriate and aligned with societal norms, especially in public interactions. Regular evaluations improve these models, encourage ethical use of artificial intelligence, and help build public trust in interactive technologies. So in conclusion, in this video, we went over the basics of large language models and how they're different from traditional NLP models. We also talked about how important it is to evaluate them, and we learned that making sure LLM outputs are accurate and ethical is key to ensuring they work well across different applications.

3. L1V2 - Benefits and Challenges of LLM Evaluation Methods: In this video, we will explore the steps involved in evaluating large language models. Imagine your news agency needs the best AI to generate article summaries. How do you choose the right one? We'll guide you through defining goals, selecting methods, choosing datasets, and interpreting results, all through a real-world scenario. By the end of this video, you will understand the steps involved in evaluating large language models and the challenges associated with each step. Imagine you work at a news agency that wants to utilize LLMs to generate one-line summaries for its news articles. To successfully incorporate LLMs in this way, you are tasked with evaluating multiple models to determine the most appropriate one. At first glance, evaluating LLMs may seem straightforward, more or less similar to evaluating a traditional AI model. First, you define evaluation goals.
Then you choose evaluation methods. The third step is to select appropriate datasets, and finally, you analyze and interpret the results. So let's break down each of these steps. In the first step, you want to ask questions like: what specific task do you want the LLM to perform? You also want to decide which metrics are more important to you: overall fluency, coherence, factual accuracy, or anything else. In the second step, you need to pick the evaluation method. You can pick from different methods like task-specific metrics, research benchmarks, LLM-based evaluations, and human evaluations, based on your evaluation goals. As for selecting the appropriate dataset, you want to define a golden dataset that aligns with your evaluation goals and metrics. One good place to look is the benchmark datasets that are specifically designed for evaluation of LLMs. For analyzing and interpreting the results, you want to combine both quantitative and qualitative results to provide comprehensive insights for your evaluation. Make sure to note the strengths and weaknesses of each evaluation method and provide justification for your conclusion. So hopefully this sounds like a good approach for evaluating LLMs too. However, there are several challenges in this process, especially when it comes to the evaluation of the outputs of these LLMs. The first challenge is in defining the evaluation goals. In our example, defining evaluation goals for LLMs in tasks like summarizing news articles is challenging due to the subjective nature of what constitutes a good summary. It's difficult to rely on a limited number of metrics to assess the quality of an output. Also, in choosing the evaluation methods, there are time and resource constraints. It's going to be computationally expensive and time consuming to try multiple evaluation methods. Also, new evaluation methods are being introduced very frequently, and that makes it difficult to decide which method is the best for our use case.
In selecting appropriate datasets, the size and quality of available datasets can pose challenges. In predictive models, we know that large datasets with minimal noise lead to better performance. But in the world of generative models, we still aren't sure what size and quality of dataset are best. And finally, in analyzing and interpreting results, challenges in explainability can arise, especially when dealing with newer evaluation methods. We still don't have a standard way of interpreting the results or assessing the reliability of these evaluation methods. In conclusion, this video has covered the essential steps and challenges involved in evaluating large language models. We looked at these evaluations for tasks like summarizing news articles. We've explored how to define clear evaluation goals, choose the right evaluation methods, select appropriate datasets, and effectively interpret the results. Each step presents distinct challenges that must be carefully managed to ensure the successful integration of large language models into real-world applications.

4. L1V3 LLM - Evaluation on Vertex AI: In this video, we will explore the tools that Vertex AI offers for evaluating the outputs of large language models. Imagine evaluating AI models with tools that highlight accuracy and fairness. These tools give you the ability to uncover hidden biases and compare model performances side by side. We'll also explore some insights into making your AI models not only effective but also ethical. By the end of this video, you'll know how to effectively use Vertex AI to evaluate the output of large language models. As we mentioned earlier, in this course, we are going to use Google Cloud as an example of a platform that provides tools for LLM evaluation. Google's Vertex AI can help you across the entire life cycle of a large language model, from beginning to end.
In Vertex AI, you can prototype, customize, evaluate, and deploy models for many different tasks and across different modalities. However, for the purpose of this course, we will focus only on the evaluation capabilities that Vertex AI provides. Some of the capabilities available in Vertex AI to help streamline the evaluation process include Automatic Metrics, which uses reference data to compute task-specific metrics; Auto Side by Side, which mimics human evaluation by comparing the performance of two models with an arbiter model; and safety bias, which highlights the model's biases against a certain identity group. In lesson two, we will dive deeper into Automatic Metrics and Auto Side by Side. In lesson three, we're also going to briefly cover safety bias. For now, let's go over each of these three evaluation methods. Automatic metrics in AI evaluation are quantitative measures used to assess the performance of models, especially in tasks like text generation or machine translation. They are typically fast and efficient, and can be part of a standardized method used across academia and industry for comparing different models. Some of the most common automatic metrics include BLEU, or Bilingual Evaluation Understudy, which measures how many words and phrases in a machine-generated translation match a reference translation. We also have ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, which is another metric, used for evaluating text summarization by counting the overlapping units, such as n-grams, word sequences, and word pairs, between the computer-generated summary and a set of reference summaries. There is also Auto Side by Side, which is a tool used for automatic side-by-side evaluation of AI models, particularly generative AI models in the Vertex AI model registry. This tool allows comparison of the performance of different models, providing insights into which model performs better and under what circumstances.
Auto Side by Side aims to deliver consistent performance metrics that align with human evaluations, but offers the advantages of being faster, more cost effective, and available on demand. Last but not least, Vertex AI also provides safety bias evaluation. This evaluation checks a model's outputs for biases against identity groups, such as gender. This analysis aims to ensure that the output of the LLM does not perpetuate harmful stereotypes or unfair treatment toward any group. In conclusion, Google Cloud's Vertex AI provides comprehensive tools for evaluating large language models, focusing on performance metrics and safety bias. Automatic metrics like BLEU and ROUGE offer standardized, quick, and efficient ways to assess model outputs against reference data. Auto Side by Side compares two models side by side, mimicking human judgment, but with the benefits of speed and cost efficiency. Additionally, safety bias evaluation checks for fairness across identity groups, such as gender, to ensure that LLMs do not reinforce harmful stereotypes or discrimination. This holistic approach to evaluation allows developers to refine LLMs, aligning them with ethical standards and societal expectations for responsible AI.

5. L2V1 - Automatic Metrics: In this video, we will have a look at automatic metrics and understand their role in evaluating large language models. Imagine a developer struggling with the performance of their AI model. They spend hours testing and tweaking the model without having any clear feedback on the effect of their tweaks on the outcome of the LLM. I believe most of you who are taking this course already know how frustrating that can be. But what if I told you there are tools that provide us with precise performance data and highlight the exact areas for improvement? By the end of this video, you'll understand the various automatic metrics used in LLM evaluation, why they are used, and how they can guide the refinement of model performance for different tasks.
These tasks can be classification, summarization, text generation, or anything like that. So automatic metrics provide a fast and cost-effective way to evaluate your model's performance using a range of task-specific metrics. This approach assesses models based on input prompt and output response pairs, allowing you to quickly gauge their effectiveness. Automatic metrics are a standard methodology widely used in academic research and many open benchmarks. They utilize commonly accepted metrics for several general AI tasks, making results comparable across different studies and platforms. The evaluation process involves feeding an evaluation dataset into the model to generate prediction results. These results are then assessed using the selected evaluation metrics to measure the model's performance on the specific task at hand. By leveraging automatic metrics, you can efficiently evaluate your model's capabilities and identify areas for improvement without the need for extensive manual review. At the moment, the models available on Vertex AI include base and tuned versions of PaLM text-bison and Gemini. Supported tasks include classification, summarization, question answering, and text generation. There's at least one metric for each of these tasks, and each task has specific metrics to ensure accuracy. Micro-F1 and macro-F1 scores gauge overall classification precision and recall. Per-class F1 assesses them for each category. ROUGE-L evaluates a summary's closeness to a reference, while exact match scores question answering accuracy. BLEU measures the precision of text generation against a human standard. Using automatic metrics is pretty straightforward. First, we prepare the evaluation dataset with input-output pairs. Then we upload the dataset to Google Cloud Storage. Finally, we perform the model evaluation by using the Vertex AI Python library to submit the job. In the next video, I'll walk you through a demo on how to do each of these steps.
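To make the task-specific metrics mentioned above more concrete, here is a minimal sketch in plain Python of exact match (for question answering) and micro-, macro-, and per-class F1 (for single-label classification). It is an illustration written for this transcript under simple assumptions (case-insensitive string comparison, one label per example), not the implementation Vertex AI uses; the function names `exact_match` and `f1_scores` are hypothetical.

```python
from collections import Counter


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference (trimmed, case-insensitive)."""
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def f1_scores(predictions, references):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label classification."""
    labels = sorted(set(predictions) | set(references))
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predictions, references):
        if pred == ref:
            tp[pred] += 1      # correct prediction for this class
        else:
            fp[pred] += 1      # predicted class was wrong
            fn[ref] += 1       # true class was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores, weighting every class equally.
    macro = sum(per_class.values()) / len(labels)
    return micro, macro, per_class


# Made-up labels, used only to exercise the functions.
preds = ["pos", "neg", "pos", "neg"]
refs = ["pos", "pos", "pos", "neg"]
micro, macro, per_class = f1_scores(preds, refs)
print(micro, macro, per_class)
print(exact_match(["Paris ", "rome"], ["paris", "Milan"]))
```

Micro-F1 is dominated by frequent classes, while macro-F1 treats rare and frequent classes alike, which is why both are usually reported together.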
But for now, let's quickly review it. For the dataset, you must provide the prompt with instructions and context, as well as a ground truth, which will be used together with the generated answers to calculate metrics related to the selected task. It's a good idea to offer at least ten examples similar to how the application will be used. When you have prepared your dataset and uploaded it to Google Cloud Storage, Vertex AI has a template for the model evaluation pipeline. The parameters for running the evaluation pipeline include the location of the evaluation dataset, the task that is going to be performed, and the model that should be utilized for the task. With those parameters, you can then run the model evaluation pipeline job. We will see a demo of running an evaluation task in the next video. In conclusion, in this video, we went over automatic metrics in Vertex AI, an efficient and standardized approach for evaluating LLMs. We've explored the supported models and tasks, understood the application of each evaluation metric, and outlined the process of preparing and running an evaluation pipeline. Through these metrics, you can objectively measure and refine your model's performance in order to make sure it meets the demands of real-world applications.

6. L2V2 - Automatic Metrics Demo: In this video, we're going to walk through a live demonstration of using the rapid evaluation SDK to evaluate the output of Gemini, an LLM developed by Google. Through this demo, you'll see firsthand how to apply automatic metrics to assess your model's output and understand the strengths and weaknesses of different AI models. By the end of this video, you'll know exactly how to use the rapid evaluation SDK to assess the output of an LLM. We'll cover loading your dataset, initiating model evaluation, applying automatic metrics, and interpreting the results to gain insight into your model's performance. Let's get to the demo.
The link to this tutorial is provided so you can run the evaluation yourself. In this demo, we'll be going over how to use the rapid evaluation tool to analyze the performance of an LLM. This demo will be using a Google Colab notebook to guide you on using the rapid evaluation. We will first begin by preparing the necessary components to run this tool. First, we'll create a Google Cloud account. During account creation, it will prompt you for your Gmail address and password. Once you've created the account, you'll see a greeting screen similar to this. Open the menu tab on the left and select Billing. From there, you will have to enable billing. You have to enter a credit or debit card in order to activate billing. But don't worry. There will be $300 worth of credit provided to anyone in the beginning, so you don't have to spend any money for running this demo. Afterwards, you open the menu tab again and select APIs and Services. On the screen, you'll click on the dropdown that says Library and search for Vertex AI API. You will then click on Enable to enable the API to be used. Lastly, you'll create a project in Google Cloud using this dropdown menu here on the top left. Click on that and select New Project. From there, Google will guide you on creating the first project. After you've created the first project, you will see that there is a unique ID associated with the project. Make sure to save the ID, as this is required for the evaluation task. Now we are ready to go to the setup. Begin by running the first cell down here, which installs the package needed to run the rapid evaluation. Note that you may have to restart the kernel for the package to be recognized. Next, we will run this cell to authenticate. Use the project ID you've seen previously and paste it into the project ID variable. As for location, this demo will be using us-central1. You can look up the supported locations for this variable.
You will receive a pop-up window indicating that you have to log into Google. Here you can log in using your Google Cloud account. It will then ask you to grant access to certain features, which you will allow before continuing. You should end up with a page indicating that you have successfully authenticated to Google Cloud, and then you may proceed back to the notebook. Then we will set up Google Cloud project information and initialize the Vertex AI SDK using the project ID. After you set up your project ID and location, run the cell, which will initialize the Vertex AI SDK to be used. Next, we'll import the necessary libraries. Run the cell to get all of the necessary libraries. Note that the main libraries are listed below, which are the ones processing the information. Next, run the library settings cell and the helper functions. Note that these cells are for formatting information and adjusting the settings for warnings and logs, as well as performance adjustment. We're now ready to run the evaluation job. Before that, let's go over the requirements needed to run this evaluation. First, we need the data that is being evaluated. To properly format the data for the evaluation task, we'll create a pandas data frame using arrays of data stored in a dictionary. In the dictionary, you can have an instruction, a context, a reference, a prediction, and a response. Each index value corresponds to the same index value in the other arrays. For example, index zero in the response array corresponds to index zero in the other arrays, and so on. In this demo, we'll be using two rows of data. Insert these data as arrays into a dictionary, which is to be converted to a pandas data frame. Next, we'll decide what metrics to choose for evaluating the responses. The responses are measured by various automatic metrics that the rapid evaluation tool provides.
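The index alignment described above can be sketched in plain Python. This is only an illustration: the texts are made-up placeholders rather than the rows used in the demo, and `to_rows` is a hypothetical helper. Passing the same dictionary to `pandas.DataFrame` would produce the tabular form that the demo works with.

```python
# Column-oriented data: each key is a column, each list holds one value per example.
# Index 0 of every list belongs to the first example, index 1 to the second.
eval_data = {
    "instruction": [
        "Summarize the text in one sentence.",
        "Summarize the text in one sentence.",
    ],
    "context": [
        "The city council voted on Monday to extend library opening hours.",
        "A local bakery won a regional award for its sourdough bread.",
    ],
    "reference": [
        "The council extended library hours.",
        "A bakery won an award for its sourdough.",
    ],
    "response": [
        "Library hours will be extended after a council vote.",
        "The bakery received a regional sourdough award.",
    ],
}


def to_rows(columns):
    """Turn a dict of equal-length lists into a list of per-example dicts."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # A misaligned column would silently pair a response with the wrong reference.
        raise ValueError(f"Columns have different lengths: {lengths}")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = to_rows(eval_data)
for row in rows:
    print(row["reference"], "|", row["response"])
```

Checking the column lengths up front is a cheap safeguard, because every metric that compares a response with its reference depends on the two lists lining up.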
Here we can see all the possible metrics in the central column, along with the type of measurement on the left and the required data frame input on the right. For example, coherence measures the model's ability to produce a clear and sound response. Fulfillment measures how well the model has answered and completed the given instructions with a predetermined prediction, and BLEU and ROUGE compare the similarity between the given reference prediction and the response in terms of words. You may look into these metrics on your own if you're interested. After selecting the metrics you want to measure, input each of the metric names into the array shown here. You will also insert the evaluation dataset into the required dataset argument and provide a name for the experiment. In the last segment of the cell, we run the actual evaluation task. Upon running the cell, you should see that an experiment has been created. Clicking on the View Experiment button will redirect to Google Cloud, where you will be able to view the status of the evaluation pipeline. The time it takes for the evaluation task depends on the number of metrics, as more metrics take more time to complete. In conclusion, we've seen how the rapid evaluation SDK facilitates the assessment of generative AI models, providing an efficient way to analyze model performance through automatic metrics. This approach helps identify strengths and weaknesses, ensuring your model meets the expected standards for real-world applications.

7. L2V3 - AutoSxS: In this video, we're taking a close look at Auto Side by Side, a comparative evaluation tool for large language models. Imagine working on an AI project where you need to choose the best model for summarization. Without clear comparisons, it feels like guessing in the dark. The good news is that with Auto Side by Side, you're able to perform side-by-side evaluations of the outputs of two different models.
By the end of this video, you'll be able to understand how Auto Side by Side works, the role of the autorater, and how to use it to compare model outputs. You'll gain insights into evaluating LLMs with a clear understanding of what makes one model's response better than the other. Auto Side by Side is an evaluation tool that compares two LLMs side by side. It utilizes an autorater, or a judging model, to determine the better response to a prompt. Using this tool, you can assess the performance of any generative AI model for summarization and question answering use cases. Auto Side by Side also provides explanations and certainty scores for each decision. At the core of Auto Side by Side sits the autorater, which makes this comparative evaluation possible. The autorater is an LLM specifically designed to assess the quality of responses generated by other models when given an original inference prompt. Auto Side by Side can evaluate any model with pregenerated predictions and can auto-generate responses for any model in the Vertex AI model registry that supports batch prediction. Currently, it can evaluate the performance of models on summarization and question answering tasks. For each side-by-side evaluation, Auto Side by Side employs predefined evaluation criteria. For example, some criteria for summarization include: How well does the model follow prompt instructions? How grounded is the response in the inference context and instructions? How well does the model capture key details in the summarization? And how concise is the response itself? Using Auto Side by Side is pretty straightforward. First, we prepare a dataset of prompts, contexts, and corresponding generated responses, with input prompts only if required. Then we store the evaluation dataset in Google Cloud Storage or a BigQuery table. And then we perform the model evaluation by running the evaluation pipeline job.
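As a rough sketch of the first of these steps, the snippet below writes a small evaluation dataset with pregenerated responses from two models to a local JSON Lines file, which could then be uploaded to Cloud Storage. The column names (`id`, `document`, `response_a`, `response_b`), the file name, and the example texts are all assumptions made for illustration, as is the choice of JSON Lines as the file format; the column names actually expected are whatever you pass to the pipeline parameters. The upload and the pipeline submission are not shown.

```python
import json
import os
import tempfile

# Hypothetical examples: one record per prompt, with a response from each model.
examples = [
    {
        "id": "ex-001",
        "document": "The city council voted on Monday to extend library opening hours.",
        "response_a": "The council extended library hours.",
        "response_b": "On Monday the council voted to keep libraries open longer.",
    },
    {
        "id": "ex-002",
        "document": "A local bakery won a regional award for its sourdough bread.",
        "response_a": "A bakery won a prize.",
        "response_b": "A local bakery received a regional award for its sourdough.",
    },
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Every example needs its own identifier so judgments can be traced back to it.
ids = [example["id"] for example in examples]
if len(ids) != len(set(ids)):
    raise ValueError("Example IDs must be unique")

path = os.path.join(tempfile.gettempdir(), "autosxs_eval_dataset.jsonl")
write_jsonl(examples, path)
print(f"Wrote {len(read_jsonl(path))} examples to {path}")
```

Reading the file back before uploading is a simple way to catch malformed records early, since a failed pipeline job costs far more time than a local check.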
In the next video, you will see a demo of Auto Side by Side in action, comparing Gemini Pro with another LLM for a summarization task. But before that, let me explain how each of these steps works. Auto Side by Side accepts a single evaluation dataset. The dataset must include at least one example, but for a proper evaluation task, around 400 to 600 examples are recommended. Each example has a unique ID and includes content and responses. We can also add an additional column for taking human preferences into account. Next, we must set parameters for performing the model evaluation. For example, in a model evaluation without human preference, the parameters might specify the evaluation dataset, the columns to use, the task (for example, summarization or question answering), and autorater prompt parameters like the inference context and instructions. We must also provide the columns containing predefined predictions to calculate the evaluation metrics. After defining our parameters, we can initiate an evaluation pipeline job using a Google-provided template. The parameter values are passed in to configure the pipeline job. Auto Side by Side utilizes the Vertex AI Python SDK to get this job done. After successfully completing an Auto Side by Side evaluation, you can view the evaluation results. Auto Side by Side generates three primary types of evaluation results: a judgments table, aggregated metrics, and alignment metrics if human preferences are provided. The judgments table indicates the superior response, and each choice is accompanied by a confidence score, which is a value between 0 and 1. The Auto Side by Side judgments include an explanation of each of the autorater's choices. Auto Side by Side can generate and compare multiple outputs for a given task to select the response judged as better based on criteria like coherence, logical flow, and capturing the key points.
For example, when choosing between response A and response B, the autorater might explain that while both provide good summaries, response B does a slightly better job of capturing the overall story in a more coherent and organized manner, compared to the more statistics focused response A. Auto side by side also provides aggregate metrics. These win rate metrics are derived from the judgments table as the percentage of times the autorater preferred one model over the other. These metrics help to quickly identify the superior model. Also, as I mentioned earlier, Auto side by side allows for the validation of judgments with human preferences. This means it is possible to provide additional information and parameters within the side by side evaluation pipeline. To do so, a column must be added to the dataset for human preference. We also need to define the human preference column within the parameters. The rest of the process remains the same. Including human preferences results in additional metrics for human preference alignment. The output includes all the regular metrics, but it also includes a human preference win rate along with the autorater win rate and a Cohen's kappa score, which denotes the level of agreement between the autorater and the human rater. Again, this is typically a value between 0 and 1, with zero meaning agreement no better than random choice and one being perfect agreement (it can dip below zero when agreement is worse than chance). In conclusion, Auto side by side stands out as an innovative tool in Vertex AI for evaluating and comparing the performance of generative AI models. We've seen how it brings precision to the evaluation process with side by side comparisons and detailed explanation features. It streamlines the assessment of LLMs, ensuring that the best performing model can be identified based on task specific criteria.

8. L2V4 - AutoSxS Demo: In this video, we will demonstrate how to use Auto side by side within Vertex AI to evaluate the Gemini model against another LLM.
This practical guide will show you each step in setting up and running an evaluation using the tools provided by Google Cloud Platform. By the end of this video, you'll understand how to navigate the Auto side by side tool, set up your evaluation datasets, and interpret the results from the Auto side by side comparative analysis. This will equip you with the skills to effectively assess the performance of generative AI models. Now let's get to the demo. The link to this tutorial is provided so you can run the evaluation yourself. In this demo, we will be going over how to use Auto side by side to evaluate and compare the performance of large language models. To begin, we will first install the following package by running this command. We will be using this package to call the API from Google Cloud. After you run the command, make sure that you restart the runtime in order to use the newly installed package. A cell has been provided for you to restart the runtime. After successfully running the cell, you will receive a pop up indicating that the kernel has died and will automatically restart. Now let us set up the necessary components. We will first create a Google Cloud account. During account creation, you will be prompted for your Gmail address and password. Once you have created the account, you will be greeted with a screen similar to this. Open the menu tab on the left and select Billing. From there, you will have to enable billing. You will have to enter a credit or debit card in order to activate billing, but there will be $300 worth of credit provided to you, so don't worry about it. Afterwards, open the menu tab again and select APIs and Services. Click on Library and search for Vertex AI API. Then click on Enable to enable the API to be used. Next, you will create a project in Google Cloud. Click on the drop down menu on the top left and select New Project. From there, Google will guide you through creating the first project.
Lastly, open the menu tab again and select IAM and Admin. You will see the newly created project. Click Grant Access, enter the name of the principal of your created project in the principal field, and in the roles dropdown, search for "object" in the filter. Here, you will see the option for Environment and Storage Object Administrator. Add this to the principal and save. This is how it should look, with the principal having the storage object admin role. Now we are ready to go. Since we are working on Vertex AI Workbench, you do not need to perform any additional steps. To begin, we will set the project ID. You can find the project ID by going back to the project drop down and finding the column where it displays ID. In this case, this is the ID for the project. Run the cell after you have changed the ID to your project ID. Next, we will set the region. In this demo, the region is set to us-central1. Now, run the cell block. Now we will generate a random UUID. This will be used to uniquely identify the project and avoid potential name collisions. We will now use the UUID to create a unique bucket URI name. Now we will move on to setting up the process. We will first import the libraries and define our constants. We will also define our helpers. Next, we will initialize the Vertex AI SDK by providing our project ID, region, and our bucket URI. As we have defined in our constants, we will be comparing a Gemini model with another LLM, one producing response A and the other response B. Each row of the data contains an ID, a document to summarize, and the two versions of the response to the document. We can get a peek at this by using Pandas to read the JSON and format it. Next, we will run the model evaluation job. Here are the parameters required by the pipeline. The evaluation dataset indicates the data location. The ID columns distinguish unique evaluation examples; these are the ID and document fields in this case. Next is the task.
The task we are evaluating is summarization. And there are the autorater prompt parameters, which are used to configure the behavior of the autorater for the task, such as setting the context and instructions. You will then need to provide response column A and response column B with the names of the columns containing predefined predictions in order to calculate the evaluation metrics. In this case, they are response A and response B. After we define the model evaluation parameters, we can now run the model evaluation pipeline job with the given template using the Vertex AI Python SDK. Let this run, as it can take a while for the pipeline to finish. You can click on the link to see the pipeline in action in Google Cloud Platform. This is what your pipeline looks like. After the pipeline run has completed, you can use the code segment below to view the judgment of each response and how it compares according to the autorater. It offers information such as explanations of the preferences and the confidence score of the autorater. Next, we can also show the aggregated metrics using the code segments below. This is rather useful for determining which model is better in the context of the given task. Auto side by side also supports human preferences to validate the autorater's evaluation. We will now use the other URI, which includes an additional human preference column. In the pipeline parameters, we will now include the human preference column and perform the same pipeline run with the new column of data. We can now obtain the human aligned aggregated metrics. Again, this is what the pipeline looks like in Google Cloud. Using the code segments below, we obtain the performance of the Auto side by side autorater measured against human preferences. Lastly, we will clean up the Google Cloud resources. We can run the cell below, and it will clean up all of the resources we used in this project.
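To make the aggregated and human alignment metrics from this demo more concrete, here is a small sketch that computes a win rate and Cohen's kappa by hand. The judgment lists, the "A"/"B" labels, and the helper names are all made up for illustration; they are not the pipeline's actual output schema, and the pipeline may compute its metrics differently.

```python
from collections import Counter


def win_rate(choices, model):
    """Fraction of examples in which `model` was the preferred response."""
    return sum(1 for c in choices if c == model) / len(choices)


def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    # Observed agreement: how often both raters picked the same response.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Expected agreement if each rater chose independently at their own rates.
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    labels = set(counts_1) | set(counts_2)
    expected = sum((counts_1[l] / n) * (counts_2[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters always gave the same single label
    return (observed - expected) / (1 - expected)


# Made-up preferences for eight examples.
autorater = ["A", "B", "B", "A", "B", "B", "A", "B"]
human = ["A", "B", "A", "A", "B", "B", "B", "B"]

autorater_win_rate_b = win_rate(autorater, "B")
human_win_rate_b = win_rate(human, "B")
kappa = cohens_kappa(autorater, human)
```

Here both raters prefer model B in five of eight examples, but they only agree on six examples, which is why the kappa lands well below one.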
In conclusion, this demo has illustrated the practical application of Auto side by side in evaluating the Gemini model on Vertex AI. We've navigated through the setup process, demonstrated how to configure and run the evaluation, and interpreted the comparative results. This hands on approach ensures you can effectively leverage Auto side by side to assess and enhance the performance of generative AI models, which in turn helps you make your AI solutions more robust and reliable.

9. L3V1 - Text based Evaluation Models part1: In this video, we will explore foundational text based evaluation models for LLMs, such as METEOR and perplexity, alongside fairness evaluation metrics. Did you know that biased AI models can negatively impact applications in critical areas like loan approvals and hiring decisions? By combining METEOR and perplexity with fairness metrics, you can mitigate the risks of these biases by ensuring your models are both high performing and fair. By the end of this video, you will understand how different evaluation metrics like METEOR and perplexity work and why they are important. You will also learn about the significance of fairness metrics in ensuring that AI applications treat all demographic groups equitably. METEOR, or Metric for Evaluation of Translation with Explicit Ordering, improves upon earlier metrics like BLEU by considering synonyms, paraphrasing, and stemming. It assesses translation quality based on literal accuracy, fluency, and intent, making it valuable for applications requiring nuanced language understanding. Let's consider a practical example to understand how METEOR works. Imagine we have two translations of the English phrase, the quick brown fox jumps over the lazy dog. METEOR would score translation A higher than translation B. Although both translations convey similar meanings, translation A maintains a more accurate and fluent structure with appropriate synonym usage, leaps for jumps and fast for quick.
METEOR evaluates these translations by analyzing word order, synonymy, and the overall semantic similarity to the reference text. This emphasizes the translation's fluency and comprehensibility. Perplexity is another measurement used to evaluate language models by assessing how well a model can predict a sample of text. It is based on the probability distribution the model assigns to a sequence of words, with lower values indicating that the model predicts the sequence more accurately. Perplexity essentially quantifies the model's uncertainty about its predictions. It provides a gauge of its effectiveness in language understanding and generation tasks. Let's look at an example. Consider a model tasked with predicting the next word in the sentence, the cat sat on the. Suppose our model predicts four possible completions, mat, window, car, and moon, with respective probabilities of 0.5, 0.2, 0.2, and 0.1. The perplexity of the model for this prediction can be calculated by taking the inverse of the probability of the correct word, mat in this case, that is, the probability raised to the power of minus one. Here, perplexity would be 1 divided by 0.5, which is two, indicating relatively low uncertainty. Lower perplexity values demonstrate the model's confidence and accuracy in its predictions, suggesting a better understanding of the context, the cat sat on the mat. We also have fairness evaluation metrics, which are critical tools used to assess whether AI models perform equitably across different demographic groups. These metrics help identify biases in model predictions that could disadvantage certain groups based on gender, race, age, or other factors. This can be done by evaluating differences in error rates, positive prediction proportions, and other performance indicators. For example, consider a loan approval AI model that uses personal data to predict creditworthiness. To assess fairness, we could analyze the following. One, difference in positive proportions in predicted labels.
If 40% of applicants from group A, for example, male applicants, are predicted as creditworthy compared to only 20% from group B, in this example, female applicants, this metric would highlight a potential bias in model predictions favoring group A. Two, recall difference. If the model identifies 90% of actual creditworthy individuals in group A, but only 70% in group B, the recall difference metric would indicate the model is less effective for group B, potentially leading to unfair treatment. Three, specificity difference. Examining how well the model avoids false positives across groups, we might find that it incorrectly labels non-creditworthy individuals as creditworthy at different rates between groups, which could affect the fairness of the decision making process. In conclusion, this video has demonstrated the crucial roles that both performance and fairness evaluation metrics play in the development and deployment of language models. We've seen how metrics like METEOR and perplexity help ensure that models perform optimally, while fairness metrics address biases to promote equity and trust in AI technologies.

10. L3V2 - Text based Evaluation Models part2: In this video, we'll expand our exploration of text based evaluation models for LLMs, focusing on diversity metrics and zero shot evaluation. Most likely, you have noticed that AI generated content often lacks diversity, which makes it less engaging or even boring for users. By applying diversity metrics, you can check whether your AI generates varied and interesting responses. We also cover zero shot evaluation, which will further test your model's adaptability to new and unforeseen tasks. By the end of this video, you will be able to understand the importance and application of diversity metrics in generating varied and creative outputs. Additionally, you'll learn how zero shot evaluation helps gauge an LLM's ability to adapt to tasks it hasn't been explicitly trained for.
Diversity metrics evaluate the range and uniqueness of responses generated by a language model. These metrics are particularly important for applications requiring creative or varied outputs, such as content generation or dialog systems. By measuring aspects like lexical richness, variation in sentence structure, and the novelty of concepts introduced in responses, diversity metrics help ensure that the model's outputs are not just accurate but also engaging and reflective of a wide array of perspectives. Let's imagine a scenario. Say you have an AI model that is tasked with generating story ideas based on a single prompt, a day at the beach. Suppose the model generates the following responses. In evaluating these responses using diversity metrics, we would look for variety in themes, characters involved, and activities described. Response B would score highly on diversity for offering multiple subplots and varied interactions, while response C would score lower due to its redundancy with response A. Response D introduces a novel element, which enhances its score for introducing unique content. These metrics help in assessing the creativity and appeal of the model's outputs, ensuring they provide fresh and engaging content for users. Now let's look at zero shot evaluation. Zero shot evaluation measures a model's ability to handle tasks it has not been explicitly trained for. This metric is critical for assessing the generalization capabilities of large language models. It reveals how well a model can apply learned knowledge to new contexts or problem types without additional fine tuning or training. It demonstrates the model's adaptability and flexibility across various applications. Let's look at an example. Consider a language model trained predominantly on English literary text. Suppose it is presented with a task in a completely different domain, such as generating technical descriptions for new software applications.
Zero shot evaluation would assess how well the model performs this task straightaway. Let's look at this example. We can see that even though this model had no prior training on software descriptions, it generates a coherent and relevant description. It demonstrates good zero shot capability. This ability to generalize from literature to technical writing without any specific training showcases the model's robustness and utility in real world scenarios where training data may not always be comprehensive for every possible task. In conclusion, we discussed how diversity metrics and zero shot evaluation play crucial roles in evaluating LLMs. Diversity metrics help ensure the generated content meets the creative demands of real world applications, while zero shot evaluation assesses the adaptability of these models to new tasks, showcasing their robustness and utility in various scenarios.

11. L3V3 - Evaluation of non text Generative AI Models: In this video, we'll talk about how to evaluate AI models that create images, sounds, and videos. Imagine watching an AI generated movie where scenes look choppy or the sound feels off. It would be frustrating. Let's explore how to evaluate these models to make sure that the content they generate is smooth, realistic, and engaging. By the end of this video, you'll know how to spot the important ways experts evaluate image, sound, and video AI models. You'll become familiar with the skills to examine and evaluate the media that these generative AI models generate. Evaluating AI image generation models involves both subjective and objective methods. Subjective evaluations are based on human judgment of factors like visual appeal and emotional impact. Objective evaluations, in contrast, use specialized tools to measure aspects such as image resolution, color accuracy, and the presence of visual glitches or flaws known as artifacts. Consider an AI generated image of a landscape.
To evaluate it, we might use a pixel based metric like PSNR, which stands for peak signal to noise ratio, to assess image clarity and sharpness objectively. At the same time, we conduct a survey where participants rate the image on realism, beauty, and emotional resonance to gather subjective data. This comprehensive evaluation helps determine the overall success of the image generation model in creating visually appealing and accurate images. Now let's move on to sound. Evaluating AI sound generation models means looking closely at the quality, accuracy, and emotional effect of the sounds they create. You can use objective measurements like spectral flatness and zero crossing rate to technically assess the sound quality. It's also important to gather subjective feedback from listeners on how real and emotionally engaging the AI generated sounds seem to people. Imagine evaluating a piece of AI generated music intended to evoke relaxation. Objective analysis could measure the consistency of tempo and the clarity of sound using tools like a loudness meter or a spectrum analyzer. For subjective evaluation, a listener group could rate the music on its soothing qualities and emotional effects. Feedback like that can provide insight into the music's effectiveness in achieving its intended emotional goal. How about videos? When evaluating AI video generation models, you need to look at two main things: the visual quality of the video and how well the frames flow together over time, which is also called temporal coherence. To measure visual quality, you can use metrics like PSNR, which we talked about. This metric checks the sharpness and amount of detail in the video. There's another metric called SSIM, which stands for structural similarity index. This metric looks at the detail and compares the AI video to a reference video. To evaluate the temporal coherence, you want to see how smoothly the video frames transition from one to the next.
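As a rough illustration of the objective side, here is a minimal sketch of PSNR on tiny grayscale images stored as nested lists. Note that PSNR is computed against a reference image, so it needs something to compare the generated image to. The sample pixel values are made up, and the `mean_frame_difference` helper is only a crude, assumed proxy for temporal coherence, not a standard metric.

```python
import math


def mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += (pixel_a - pixel_b) ** 2
            count += 1
    return total / count


def psnr(reference, generated, max_value=255.0):
    """Peak signal to noise ratio in decibels; higher means closer to the reference."""
    error = mse(reference, generated)
    if error == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / error)


def mean_frame_difference(frames):
    """Average MSE between consecutive frames (assumed proxy: lower is smoother)."""
    diffs = [mse(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)


# Made-up 2x2 grayscale images with pixel values in 0..255.
reference = [[0, 128], [255, 64]]
generated = [[0, 130], [250, 64]]

image_error = mse(reference, generated)
image_psnr = psnr(reference, generated)
smoothness = mean_frame_difference([reference, generated, generated])
```

In practice you would run this kind of comparison with an image library on full resolution frames rather than on hand written lists.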
Smooth transitions help ensure that the motion in the video looks natural and logical. Another important thing to assess is contextual relevance. Does the video content actually match up with the intended story or scene? The AI generated video should accurately reflect what it is supposed to be showing. For example, consider evaluating an AI generated video that depicts a diver in the ocean. Objective metrics would analyze the video's resolution and frame to frame consistency to ensure smoothness in motion and clarity in visual details. Subjectively, viewers could assess how well the video captures the essence of the setting, considering elements like the realism of the ocean waves, the natural movement of the diver, and the overall ambience. This combined evaluation helps determine if the video generation model effectively replicates a realistic and engaging diving experience. In conclusion, evaluating non text generative AI models for images, sounds, and videos is essential to advance AI in creative and practical applications. By combining objective measurements with subjective human feedback, we get a comprehensive view of an AI model's capabilities. This approach ensures the AI generated content is technically sound and resonates with people, which is crucial for developing useful and appealing generative AI applications.

12. L3V4 - Final Notes Importance of Human Evaluation: In this video, we'll summarize our course and emphasize the critical importance of human evaluation in assessing generative AI models. Have you ever wondered why some AI generated content is misleading or inaccurate? We'll dive into what generative AI does well, where it goes wrong, and why human oversight is necessary to catch and correct these mistakes, to ensure the outputs of these models are useful and trustworthy. By the end of this video, you'll understand the limitations of generative AI, especially its tendency to produce false information or hallucinations.
We'll discuss why recognizing the flaws is key for using AI effectively and ensuring it gives reliable and useful results. Generative AI can do a lot of tasks well, but it also has some big weaknesses. One major problem is that it can generate false information or hallucinations. This means the model outputs wrong or made up information. These models often don't know the limits of their own knowledge, which is why it's so important to evaluate them carefully. To use generative AI effectively, we need to understand its limitations. This means being aware that the model can make mistakes and coming up with ways to reduce these problems when using it in real life. Since we need to recognize and address the limitations of generative AI, we introduce a useful tool called the IVO test, which stands for immediately validate outputs. It's a simple but effective way to check if a generative AI model is reliable. A model passes the IVO test if users can easily and quickly check that the output is correct and meets their needs. This way, even users who aren't experts can effectively use and validate content created by AI. To implement the IVO test, users evaluate the AI generated output by comparing it with reliable resources, a method known as post grounding. This lets users check that the information is accurate by looking at established facts. This makes sure the AI's output is not only relevant but also reliable. This step is key for applications where accuracy is super important. It allows users to use tools with confidence. Let's say an AI model is made to summarize scientific articles. To use the IVO test, users can interact with the AI generated summary in a special app. If they want to check a specific part of the summary, they can click on it. The app then shows them the matching section in the original article. This feature makes it easy for users to compare the summary with the source, making sure the AI's output accurately reflects the original content. 
This method builds trust in the AI and helps users understand better by connecting the AI generated content back to its reliable sources. By having humans oversee AI systems, we can make sure they're not just evaluated for performance, but also for fairness and ethics. This approach helps stop the spread of biases and ensures AI is developed in a way that respects human values. So in conclusion, we discussed the importance of having humans evaluate generative AI models along with automated methods. By combining human insights with the efficiency of algorithms, we can assess aspects like creativity, context, and ethics that computers might miss. This approach not only makes evaluations more accurate and reliable, but also ensures AI is developed in line with our values and expectations as a society.

13. Outro: Great job. You've done it. You've completed evaluating large language model outputs. I'm not here just to say goodbye. I want you to take a moment and celebrate your achievement throughout this course. Together, we've explored new concepts, faced challenging tasks, and grown significantly. Look back and see what you know now that you didn't know at the beginning of the course. Your commitment has led to significant progress, and you should be proud of this accomplishment. This course is only one step in your ongoing learning journey. The concepts you've learned here will serve as a foundation for your future growth. Make sure you keep applying these skills and maintain your curiosity. To continue your journey, I recommend the following. First, revisit the course materials to refresh your memory on the contents. Second, make sure that you engage with your peers in the community forums. Third, make sure you take on new challenging projects to keep your skills sharp. Thank you for being a part of this course on evaluating LLM outputs. Your engagement means a lot to me and our entire team. As our course concludes, your journey is just beginning.
I'm looking forward to hearing about what you thought of this course and what you are planning to achieve in the future. Keep pushing forward, stay curious and enjoy the journey ahead. Congratulations again, and hopefully I'll see you in a different course. Signing off, Professor Reza.
