Artificial Intelligence for Beginners: Build a Custom ChatGPT in 3 Steps | Alvin Wan | Skillshare

Artificial Intelligence for Beginners: Build a Custom ChatGPT in 3 Steps

Alvin Wan, Research Scientist

Lessons in This Class

  1. Introduction (1:14)
  2. Scope Project: Text to Text (5:52)
  3. Scope Project: Capabilities (5:57)
  4. Scope Project: Task & Metric (5:32)
  5. Evaluate Models: Test the Best (6:22)
  6. Evaluate Models: Optimize Cost (3:40)
  7. Evaluate Models: Open-Source (7:03)
  8. Demo: Run Open-Source Model (4:32)
  9. Refine Quality: Engineer Inputs (4:25)
  10. Refine Quality: Constrain Outputs (9:18)
  11. Conclusion (2:00)

983 students, 5 projects

About This Class

Are you excited about using AI in your product but intimidated by all the jargon and complexity? The multitude of random third-party products? Let’s cut through all the fluff: Let me show you how to customize an AI from scratch — and in 3 quick steps.

This course is designed for entrepreneurs, developers, product managers, and curious learners who want to build powerful, custom AI applications—without expensive APIs or third-party subscriptions. You'll discover the fundamental steps to creating practical, tailored AI solutions using entirely free, open-source tools. And most importantly, we'll improve your LLM's capabilities without any training. 

What You'll Learn

Here’s what we’ll do—fast:

  • Pick a real, text-in/text-out problem you care about.
  • Define the input, the output, and a quick metric so we can judge quality.
  • Try the top model in chat, then the API for repeatable results and cost checks.
  • Step down to the cheapest model that still passes your metric.
  • Make it reliable with prompt structure and constrained outputs (so it always returns the format your app expects).
  • Spin up a small open-source model on free GPUs and wire it up, so you’re not blocked by budget.

No expensive training is required and we go a step deeper than just "prompt ChatGPT".

Who is this class for?

No prior AI or coding experience required—just curiosity and a desire to learn. Whether you're a beginner or have some technical background, you'll leave with actionable knowledge and the confidence to start building tailored AI solutions. With that said, you’ll get more from the course if you first take my Artificial Intelligence for Beginners: How to Learn Machine Learning and Artificial Intelligence for Beginners: How ChatGPT works.

Resources Required:

  • Computer (Windows, macOS, or Linux)
  • Google Chrome browser
  • Google account
  • No paid APIs or third-party services

Ready to Dive Deeper?

  • Interested in advanced machine learning? Try my Computer Vision 101 (Applied ML) course.
  • Want to code? Check out Coding 101 (Python) or OOP 101 (Python).
  • Interested in data? Explore SQL 101 (Database Design) or Data 101 (Analytics).

Meet Your Teacher

Alvin Wan

Research Scientist

Top Teacher

Hi, I'm Alvin. I was formerly a computer science lecturer at UC Berkeley, where I served on various course staffs for 5 years. I'm now a research scientist at a large tech company, working on cutting edge AI. I've got courses to get you started -- not just to teach the basics, but also to get you excited to learn more. For more, see my Guide to Coding or YouTube.

Welcoming Guest Teacher Derek! I was formerly an instructor for the largest computer science course at UC Berkeley, where I taught for several years and won the Distinguished GSI (graduate student instructor) award. I am now a software engineer working on experimentation platforms at a large tech company. 4.45 / 5.00 average rating (943 reviews) at UC Berkeley. For more, see my Skillshare or Webs... See full profile

Level: Intermediate

Transcripts

1. Introduction: Training your own model is slow, expensive, and overkill for most projects. The real problem lies in knowing where to start and how to get reliable results without burning time or money. Hi, I'm Alvin, a research scientist at a large tech company. I've taught over 60,000 students on Skillshare, and I got my PhD in AI from UC Berkeley. In this class, I'll give you a simple, repeatable playbook for building a custom ChatGPT for your use case. We'll do this in three steps. Step number one, scope your project: pin down how to use AI most effectively for your use case. Step number two, evaluate models: I'll show you how to navigate the large swaths of proprietary and open-source models. Step number three, refine quality: ensure that the model returns exactly what you need every single time. By the end of this course, you'll have a three-step process to use for any AI-based idea. This course does assume that you've taken my AI for Beginners: How ChatGPT Works course. But beyond that, no technical background is required. All you need is a laptop, Internet, and an hour of time. Let's get started.
2. Scope Project: Text to Text: Welcome to AI for Beginners: Build a Custom ChatGPT in 3 Steps. Before we begin, let's review some terminology. I'm going to vastly oversimplify these definitions to convey the main idea. An LLM, or large language model, is a model that generates text. You give it text as input, and the LLM produces text as output. AI, or artificial intelligence, is generally a product that uses an LLM. ChatGPT is specifically OpenAI's AI product. So AI versus ChatGPT is like tissues versus Kleenex. Tissues are the general product, and Kleenex is a specific trademarked product name. If you found that confusing, that's okay. I go over this in more detail in AI for Beginners: How ChatGPT Works, which you can find at this link. Since you're already subscribed, you'll get this course at no extra cost.
I also highly recommend taking my AI for Beginners: How to Learn Machine Learning, where I discuss a framework for picking up AI and ML. You can find that course at this link. There's no need to write down these links, as you'll find all of the links in this lesson at this URL. Now let's build your custom ChatGPT in three steps. These steps are for everyone to take, even if you're a professional, even if you have a massive budget. Anyone building a custom AI needs these three steps. Step one, scope your project: determine what your AI will do and how you'll judge its capabilities. Step two, evaluate models: there are guidelines for what types of models to use when. Step three, refine quality: there's no need for training, and we'll show you two methods of improving quality. Now let's start with step one, scope your project. The first step here is to narrow your focus to text-to-text tasks, meaning you give the AI text as input and get text as output. Let me explain why. First, from experience, we know that AI can take in much more than text. We can give AI images, PDFs, websites, videos, and more. It can also output more than text. AI can produce images, code, graphs, audio, and much, much more. That's a lot of different media. However, remember what we said before. AI is a product that uses an LLM, and an LLM is a model that takes in text as input and generates text as output. So back to our diagram. Let's zoom in on the AI product and look at the underlying LLM. Notice that the LLM only takes in text. All other inputs are first converted into text. Knowing this, the ideal case is to give the AI text directly and skip that conversion process entirely. This is because text conveys the most information per word. Let me explain by example. Let's say there's a webpage with results from a recent soccer game. How many words are on this page? I see one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve. So a total of 12 words.
I could give the AI this text directly, a mere 12 words, or I could take a screenshot of that webpage and give that image to AI. The AI would then translate that image into a whopping 2,030 visual words. That's roughly 170 times more words for the same amount of information. So in conclusion, text is more efficient at communicating information than images of text. As a result, for simplicity, stick with text-in, text-out tasks for your project. Here's another example. Say you want to ask ChatGPT for feedback on your slides. You have several options for download formats. Don't download the slide deck as a PDF, though, and definitely don't download it as images. Instead, download the slides in plain text format. This plain text format contains just the raw text for all of my slides without any extra formatting or unnecessary information. Now I can upload this file and ask ChatGPT to provide feedback on my slides. Luckily, my slides are just provided as raw text, making it efficient for ChatGPT to process. In short, if you want to feed a file to ChatGPT, export that file in plain text format where possible. Let's do one more example. Let's say I want to ask ChatGPT about a blog post I'm reading. Don't screenshot the webpage. Instead, directly provide ChatGPT with the URL of the webpage, and you'll find that ChatGPT will actually access that webpage for us, summarizing that blog post. We can check that ChatGPT accessed the right URL by clicking on Sources. On the right-hand side, we'll see the original webpage listed as a citation. This means ChatGPT was able to access the raw text on the webpage and, like before, ingest information efficiently. Success. In short, give webpage URLs directly so that ChatGPT can access the webpage and extract its contents as raw text on its own. To summarize this lesson in one takeaway, focus on text-to-text tasks. This means feeding in plain text data wherever possible for our project. And this now concludes the first substep of the first step.
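The soccer-page comparison above comes down to one division. Here is a minimal sketch of that arithmetic, using the two figures quoted in the lesson as transcribed (12 words of plain text, 2,030 visual words for the screenshot). The function name `cost_ratio` is only illustrative and is not part of the course materials.

```python
# Figures quoted in the lesson (as transcribed): 12 words of plain text
# versus 2,030 "visual words" for a screenshot of the same webpage.
TEXT_WORDS = 12
IMAGE_WORDS = 2030


def cost_ratio(image_units: int, text_units: int) -> float:
    """Return how many times more units the image input uses than the text input."""
    if text_units <= 0:
        raise ValueError("text_units must be positive")
    return image_units / text_units


ratio = cost_ratio(IMAGE_WORDS, TEXT_WORDS)
print(f"The screenshot uses about {ratio:.0f}x as many words as the plain text.")
```

With these two figures the ratio comes out just under 170, which is why plain text is the cheaper way to hand the same information to the model.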
We've scoped our project down by focusing on text-to-text tasks. In the next lesson, we'll talk about some text-to-text tasks that AI is best suited for. And if you'd like to get a copy of these slides and see additional resources, you can access this URL.
3. Scope Project: Capabilities: In this lesson, we'll continue discussing how to scope your project. To recap, we're currently on step one of a three-step process. My goal in this first step is to help you narrow your project scope down. In the previous lesson, we narrowed your focus down to text-to-text tasks. In this lesson, I'll further narrow your focus down to specific text-to-text capabilities. First, our focus is on building a reliable and accurate customized AI for production. So then we might ask, what can AI do reliably and accurately enough for production? There are many such tasks, but I'll focus on only three of them. Our first category is summarization. For example, Amazon uses AI to summarize product reviews in a customers section. Slack uses AI to summarize channels, and Notion uses AI to summarize Zoom meetings for you. There are many commercial examples of summarization alone. Now, let's try this ourselves. Let's say you run a bakery with hundreds of reviews. You'd like to gain insights from those reviews to help your bakery grow. In your browser, go to chatgpt.com. You'll see a page like this one. Let's paste in some reviews and ask ChatGPT to summarize those reviews. Here's our summary. ChatGPT says that opinions on the almond croissant are mixed, which seems like a fair assessment. You can see the reviews yourself at this URL to judge whether or not you'd agree. So in short, AI is pretty adept at summarization. AI is also production ready for structuring data. Structuring data is a fancy way of saying that AI can translate globs of text into nicely organized tables of data. At a surface level, that sounds cool. But let me emphasize, cool is an understatement.
Being able to take unstructured data and extract clean, structured data is actually an extremely useful AI superpower. Let me explain by example. Let's go back to our bakery. From our 671 reviews, we'd like to know what our customers think of our baked goods. For example, did customers like the almond croissant? In fact, let's first ask, which customers even tried the almond croissant? Ideally, we'd have a nicely organized table of data like this: a list of true for those that liked it and false for those that didn't. We don't have such a table. We have globs and globs of reviews like this. Now we need to go through and read each review, then determine which reviews even mention almond croissants. There are many ways to mention almond croissants. The review could say almond croissant, croissant with almond, or almond pastry. Then you might say, let's just look for the word almond. But that would include other croissants, too, because the bakery also sells chocolate almond croissants and blueberry almond croissants. So in short, detecting almond croissant mentions is hard, and that's because unstructured text is really flexible. Luckily, though, AI is perfect for this. Let me show you. Go to chatgpt.com. Go to this URL to copy the prompt. Then paste the prompt here. The prompt gives ChatGPT a list of bakery reviews and asks it to identify almond croissant lovers. Hit Run, and you get a nice table of results. Instead of reading through reviews, we can easily see which customers tried the almond croissant. There are three such reviewers, and of those three, two liked it. As a result, we conclude that 67% of customers that tried the almond croissant liked it. We can easily analyze large amounts of text this way, all because AI can convert unstructured text into structured formats. And now you should be able to see how valuable structuring data is. It's a superpower for analyzing data. The third task that AI is production ready for is coding in small steps.
There are many success stories in this space. Cursor enhances developer productivity by offering an AI-based code editor, which can write and run code for you. Lovable allows anyone to generate apps from scratch without writing any code themselves. However, for your first project, don't ask AI to build entire code bases. Treat AI like a junior engineer: give it simple tasks and individual steps. AI is very good at generating snippets of starter code, at finding and fixing small bugs, and at adding a feature that involves only a few files. Let's see this in action. Go to chatgpt.com. Let's ask ChatGPT to plot a pie chart of almond croissant lovers using the data from before. As requested, ChatGPT begins writing and running code, which now gives us a pie chart. ChatGPT also gives us the code for the plot, which we could run ourselves. In short, AI is very capable of writing code in small steps. In summary, AI is production ready for summarizing, structuring data, and coding. Summarization is built into products at Amazon, Slack, and Notion. Structuring corporate information into a searchable format is dominated by Glean and Intercom. Coding is dominated by Cursor and Lovable. Of course, AI can be applied in many other ways. But here, we focus on capabilities that have already found commercial success. Now you've seen capabilities that custom ChatGPTs can accomplish with production-grade quality. These capabilities have already been proven commercially, and repeatedly so, which makes them reasonable to rely on for our first project. You can access this URL for all of the prompts I used, links to the full example conversations, and more resources.
4. Scope Project: Task & Metric: In the last lesson, we saw general capabilities that AI is best at. Now, let's translate general capabilities into a specific project. For the first part, we narrowed your focus down to the general text-to-text task.
In the last lesson, we further narrowed down your focus to three specific capabilities: summarizing, structuring data, and coding. In this lesson, we'll finally define your project. For your project, you first need to define the task. What will AI do for you? Then define what the LLM will take in as input. Will it take in reviews, emails, essays? And finally, define what text the LLM will produce as output. Will it produce summaries, tables, bullet points? Let's see an example. Say I'm looking for a place to eat. Yummy Cafe has a low rating, and I want to know why. My first reaction is to have AI summarize the reviews. Let's fill out the project template. Our task is to summarize reviews. Our input will be reviews, and our output will be a summary. That doesn't seem like a helpful definition. So let's be more specific. On the restaurant's webpage, I scroll through reviews with really low ratings. One person confused this place with its sister restaurant. Someone else complained about the order that dishes were served in, and the last review did not like the wait staff's attitude. These are all valid concerns, but I personally only care about the food. So I want to filter out non-food reviews. Knowing this is my goal, let's now translate this into a concrete project. Back to our template. I would like to detect non-food critiques. The input is a restaurant review, and the output is true or false: whether or not the review deducts stars for non-food reasons. Now we're in business. This is how specific your project definition should be. However, we have one more component, the metric. At a high level, the metric tells you if the AI's outputs are good. To start, your first metric will be, do I find the AI's outputs reasonable? Assess by hand first. Here's a review that complains about the order dishes were served in. We ask the AI, does this review deduct stars for non-food reasons? The AI makes a prediction: yes. Now, do I find yes reasonable? I do.
The review is deducting stars for service, not food. The AI passes this first check. Here's another review that complains about the fish's freshness. We ask the AI, does this review deduct stars for non-food reasons? The AI makes a prediction: no. Do I find no reasonable? I do. This review deducted stars precisely for food reasons. The AI passes this second check. Here's a review that shows confusion about the restaurant's name. We ask the AI, does this review deduct stars for non-food reasons? The AI makes a prediction: no. Do I find no reasonable? This time, I don't. The review deducted stars for the restaurant's name, not the food. The AI fails this third check. Here is the AI scorecard. We find the outputs reasonable two out of three times. So in conclusion, your first assessment is that the AI's outputs are mostly reasonable. However, the best metrics will quantify quality. So let's be more specific. Back to our scorecard, we found the AI correct two out of three times. So we say that the AI achieved 67% accuracy. Now your metric is the percentage of reviews that the AI classifies correctly. Said more succinctly, our metric is model accuracy on reviews that we've labeled as true or false: true if the review deducts stars for non-food reasons. Now we can take any AI and compute its accuracy on our task. Here is now our full project definition. We've added the metric: accuracy, meaning how accurately the AI predicts true or false compared with your own labels on restaurant reviews. Let's do another example. Say you're organizing a hackathon, and after the online submission portal closes at the deadline, dozens of people email late entries to you. Your general goal is to organize late entries sent over email. More concretely, your task is to detect if an email is a hackathon entry. The input is an email. The output is whether or not the email is actually a hackathon submission. The metric is also accuracy: how many emails are correctly classified as entries or not.
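The accuracy metric used in both examples can be written in a few lines. The sketch below recreates the three-review restaurant scorecard from this lesson: the labels are my own judgment of each review (true means stars were deducted for non-food reasons) and the predictions are what the AI answered. The function name `accuracy` and the variable names are illustrative assumptions, not code from the course.

```python
def accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Fraction of predictions that match the hand-assigned labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Scorecard from the lesson, in order:
# 1. dish-order complaint  -> label True,  AI said True  (pass)
# 2. fish freshness        -> label False, AI said False (pass)
# 3. restaurant-name mixup -> label True,  AI said False (fail)
labels = [True, False, True]
predictions = [True, False, False]

score = accuracy(predictions, labels)
print(f"Accuracy: {score:.0%}")  # prints "Accuracy: 67%"
```

The same function works unchanged for the hackathon task: label each email by hand as an entry or not, collect the AI's true or false answers, and compare.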
Now using AI, we can generate a table of data, and voilà. It's now much easier to determine which late submissions to accept at scale, without having to read all 100 emails to find the hackathon entries. In summary, define the task, the input to your AI, the output of your AI, and the metric, which determines how good your model's outputs are. That concludes this lesson. In this lesson, we finally defined your project after scoping down, and that additionally concludes the first of three steps. You've now scoped your project and we're ready to get started on it. You can access this URL for a copy of these slides and additional resources.
5. Evaluate Models: Test the Best: Welcome back. In previous lessons, we narrowed down our focus to text-to-text tasks, then to specific capabilities, then defined the task and metric. This completed our first step, scoping the project. For our second step, we'll evaluate models to pick the best model for our use case. To pick a model, you'll evaluate multiple models on their ability to complete your task. We'll start with the very best options because we want to know: can any AI even complete our task? If the best AI can't solve your task, then less intelligent AIs certainly won't either. So, start with the most intelligent AI. According to OpenAI's website, if you're on the Free or Plus plans, GPT-5 Thinking is your most intelligent model. If you're on the Pro plan, GPT-5 Pro is your most intelligent model. Let's start our testing via chatbots, a simple interface for a quick test. Go to chatgpt.com. You'll see a screen like this. In the top left, expand the dropdown. From this dropdown, you'll see Fast, the fastest but relatively least intelligent model. Thinking, a more intelligent model: use this if you're on the Free or Plus plans. And finally, Pro, the most intelligent model: use this if you're on the Pro plan. I'll use GPT-5 Thinking so that everyone can reproduce my results. We've now selected an LLM.
So what should we ask ChatGPT? Let's use our project from the last lesson. Recall, our goal is to filter out non-food critiques. The inputs are restaurant reviews, and the outputs are true or false: true if stars are deducted for non-food reasons. So for our prompt, I'll write: given restaurant reviews, output true if a review deducts stars for non-food reasons. Then I include three reviews. The first review complains about service, the second complains about flavor, and the third likes the food but felt service was slow. We expect the AI to return true, because the first review deducts stars for service; false, because the second review deducts stars precisely for food; and false, because the last review does not deduct stars at all. Let's try this prompt now. First, go to this URL and copy the prompt. Then go back to chatgpt.com. Paste in the prompt. Hit Run. ChatGPT thinks for a few seconds, then gives its response, which is correct: true, false, false. As a result, we conclude that AI can complete our task correctly. Specifically, GPT-5 completes our task. However, testing the chatbot is not enough because chatbots are not deterministic. This means they don't return the same results every time. For example, let's ask ChatGPT for a dad joke. ChatGPT makes a joke about scarecrows. Let's do this again with the exact same prompt. ChatGPT makes a joke about the alphabet. Do this one more time. ChatGPT jokes about skeletons. So the exact same prompt yielded different responses every time. This illustrates our earlier point that chatbots are not deterministic. As a result, we need to switch from testing the chatbot to testing the API. The API by default has much less randomness injected. So if AI completes the task once, we can be confident that AI will complete the task repeatedly. Go to platform.openai.com. Your webpage will look like this. Make sure to click on Login in the top right if you haven't already. Then click on Dashboard in the top right. Our page will now look like this.
Click Create in the center of the page. Your screen should now look like this. In the bottom right, click Auto Clear. This ensures that each time we hit Submit, we mimic the same behavior as sending separate API calls. Now ask for a dad joke, but before you hit Enter, note that this specific example can cost up to two tenths of a cent. Once you hit Run, ChatGPT makes a joke about the alphabet. Let's ask for a dad joke again. Make sure Auto Clear in the bottom right is selected, and submit. ChatGPT makes the exact same joke about the alphabet with a minor punctuation change. We try again and get the exact same alphabet joke. The results are all almost identical. This is much better than the chatbot, which returned completely different responses each time. So we can conclude that API calls are mostly deterministic, definitely much more so than the chatbot. So let's retry our example project prompt via the API. Remember, this is our prompt from before. We expect AI to return true, false, false. Go to this URL to copy the example project prompt again. Then back on platform.openai.com, paste that prompt in. Before you hit Run, note that this example will cost four tenths of a cent. The GPT-5 API first returns thinking tokens for a few seconds, then returns the final, correct answer: true, false, false. Let's try one more time. And again, GPT-5 returns the correct answer: true, false, false. And one more time. And again, the correct answer: true, false, false. So as before, GPT-5 correctly completes our task. Even more importantly, GPT-5 does so repeatedly. This is a big win. It means our project can now repeatedly complete our task using GPT-5's API. In summary, test the best, most intelligent AI first: first via chatbots for simplicity, then via the API for reproducibility. That concludes our testing via the API, and with it, we're finished testing the best AI. In the next lesson, we'll optimize cost by testing faster, cheaper models.
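The "run it three times and compare" check from this lesson can be automated. The sketch below is one possible way to do that and makes several assumptions: `repeat_check`, `normalize`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in that always returns the expected answer so the example runs without an API key. In a real test you would replace it with a function that sends the prompt to whichever API you are evaluating.

```python
from collections import Counter
from typing import Callable


def normalize(answer: str) -> str:
    """Lowercase and drop spaces and punctuation other than commas,
    so minor formatting differences do not count as different answers."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch == ",")


def repeat_check(ask: Callable[[str], str], prompt: str,
                 expected: str, runs: int = 3) -> dict:
    """Send the same prompt several times and report how consistent the answers are."""
    answers = [normalize(ask(prompt)) for _ in range(runs)]
    counts = Counter(answers)
    return {
        "all_identical": len(counts) == 1,
        "all_correct": all(a == normalize(expected) for a in answers),
        "distinct_answers": len(counts),
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real API call (assumption: always answers correctly)."""
    return "True, False, False"


prompt = ("Given restaurant reviews, output true if a review "
          "deducts stars for non-food reasons.")
result = repeat_check(fake_model, prompt, expected="true, false, false")
print(result)
```

Passing the model in as a plain function keeps the check independent of any one provider, so the same loop can be reused when stepping down to cheaper models.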
In case you're wondering how I computed the cost of an API call, I'll explain that and include more resources on the course website at this URL.
6. Evaluate Models: Optimize Cost: Welcome back. As a recap, you're in step two of three. But so far, you've only tested the best AI. We tested the best AI via chat, then tested via the API. Now it's time to optimize cost and find the cheapest model that can still complete our task repeatedly. Recall from before that GPT-5 completed our example project task successfully and repeatedly, and it cost about four tenths of a cent. I'll include all cost calculations on our course website if you're interested. Here's GPT-5's cost for our example, along with its costs per token, according to OpenAI's API pricing page. Let's try the next cheapest model on our list, GPT-5 Mini. From your model selector, select GPT-5 Mini. Your screen should now look like mine. Paste in the example project prompt, and GPT-5 Mini responds true, false, false, correctly completing the task. Let's look at the cost. This example cost seven hundredths of a cent. Once we add that to our table, we can see that we've reduced the cost of completing our task by five times. Let's now test our cheapest model, GPT-5 Nano. From the model selector, pick GPT-5 Nano. Your screen should now look like mine. Paste in the example project prompt, and once again, we get true, false, false, the correct answer. Look at the cost again. The cost of our example is now three hundredths of a cent. In our table, we can see that we've now reduced the cost of completing our task by another two times. But let's reduce cost even further. So far, we've kept reasoning at the default medium level. We want to now set reasoning to minimal instead. Let me show you how to do that. Back on this webpage, next to the model name, click on the settings icon. From the menu, select Reasoning effort. This will give you another dropdown. Select Minimal, which will then look like this.
Click outside the pop-up. Then paste in your example project prompt. And crazy enough, GPT-5 Nano without reasoning still responds true, false, false, completing the task perfectly. Look at the cost now. The cost of our example is now four thousandths of a cent, giving us this final table of results. We further reduced cost by another seven times. That's a crazy cost reduction. Given these results, we declare GPT-5 Nano without reasoning our winner. That completes the three parts in step two: test via chatbot, test via the API, and finally, find the cheapest model that still works. This produced the cheapest OpenAI model that could successfully solve our task: GPT-5 Nano. In summary, test the best, most intelligent AI first: first via chatbots for simplicity, then via the API for reproducibility, then test iteratively cheaper, faster AI. We've now optimized cost successfully, reducing costs by over 88 times, and that's it for evaluating models. This mostly concludes step two. We're done with proprietary models, but now we need to repeat this process for another vast category of options: open-source models. As usual, you can find all the calculations, prompts, and more resources at this URL.
7. Evaluate Models: Open-Source: Let's explore one more category of models, open-source models. We're currently on the second of three steps, where we're evaluating models to determine which one to use. Luckily, there's a standardized hub for all open-source models at huggingface.co. Visit this URL, and you'll see a ton of options. It's overwhelming. Here are a few heuristics, though, to navigate this large repository of models. Rule number one, use instruction-tuned models. Usually, instruction-tuned models have "instruct" in the name somewhere. The reason for this rule is that LLMs are trained in three stages. In the first stage, called pre-training, LLMs simply predict the next word. In the second stage, called SFT or supervised fine-tuning, models learn to answer questions.
In the last stage, called Reinforcement Learning with Human Feedback or RLHF, models learn to align answers with human preferences: being helpful, honest, and harmless. Technically speaking, only the second stage is called instruction tuning, but practically speaking, instruction-tuned open-source models have gone through both SFT and RLHF. In any case, you don't want a model that's only gone through pre-training, because those models autocomplete questions rather than answer them. So in summary, rule number one, use instruction-tuned models. For rule number two, use small models. This is obviously not a hard-and-fast rule, but I strongly recommend it. Now, let's say you ignore me. What if you want to use massive open-source models, the biggest and the best? Well, let's see which open-source model is the best. Here's the Chatbot Arena. Gemini, the OpenAI models, and Grok 4 are all proprietary. So the highest-ranking open-source model is Kimi K2. Let's go to Kimi K2's website. According to this webpage, Kimi K2 is 1 trillion parameters. So realistically, we would need 16 H100 GPUs to run this model. First, that's expensive. 16 H100 GPUs cost $64 for just 1 hour of usage. Second, it's complicated to run inference for a model this big. All this to say, if you want to use state-of-the-art open-source models, you should use APIs serving open-source models, because their prices benefit from economies of scale. It's cheaper for you to call an API than for you to run these massive models yourself. For details, you can see this blog post, which I'll link on our course website. To use state-of-the-art open-source models, let's follow the same three substeps: test via chatbot, test via API, and optimize cost. Starting from the first substep, let's test via chatbot. Go to gpt-oss.com. Remember, this was our prompt from before. We expect AI to return true, false, false. First, go to this URL and copy the prompt. Then you'll see a webpage like this.
I chose to continue with visible reasoning, but you can pick either option. Neither option changes the quality of the AI's responses. Paste in your prompt and hit Run. Then you'll see true, false, false, the correct answer. On the left-hand sidebar, now click gpt-oss 20 billion, the smaller of the two open-source models from OpenAI. Then in the top right, click on New chat. Your screen will then look like this. Paste in the prompt once more and hit Run. Then once again, you'll see true, false, false, the correct answer. We've seen that the best of open-source AI can complete our task. Now let's retest via API. On huggingface.co, many open-source models have a section in the bottom right where you can send one-off messages to the LLM to test it. This is technically not free either, but you can send requests without adding billing information. Paste in your prompt, and DeepSeek R1 passes. It outputs true, false, false. That means open-source AI like DeepSeek can also solve our task. DeepSeek R1 would have cost a third of a cent to run this example. Like before, let's now find the cheapest model that can complete our task. Let's try the next cheapest model, DeepSeek R1 Llama 7 billion. This model also gets the answer correct: true, false, false. DeepSeek Llama cost a sixth of a cent, so we've now lowered our cost by half. Now let's try the next cheapest, DeepSeek V3. Technically, DeepSeek V3 answered true, false, false correctly. It just did so with a lot of extra text. Unfortunately, if DeepSeek outputs a different format each time, our project code would not be able to parse the LLM output reliably. Luckily, we have a way to fix this, which I will show you later. But for now, DeepSeek V3 doesn't pass. It has the intelligence but not the formatting. Let's try the next cheapest on our list, Llama 3 70 billion. Hugging Face unfortunately keeps giving me errors, so I found another website. I don't necessarily endorse it, but any website will do. 
And it looks like Llama 3 70 billion answers correctly with true, false, false. So Llama 3 70 billion passes. Let's try the next cheapest model on our list, DeepSeek's Qwen. DeepSeek R1's tiniest distilled model just spits out an infinitely long string of tags. So this is a hard failure. Qwen failed catastrophically, as we saw. And with that said, we've now completed our table of results. We can relist all prices in cents, then compare OpenAI pricing with open-source pricing to compare intelligence per dollar. Here are the winners of each category. Notice the difference in cost. Granted, I could have tested smaller open-source models that are cheaper than 70 billion but more capable than Qwen, but any smaller, and they can run on your own machine for free, so there's no point in testing their APIs. Based on our previous results, the OpenAI API is 15 times cheaper for this specific task of detecting non-food critiques correctly. Generally, for the cutting-edge class of models, proprietary models outperform open-source models on intelligence per dollar. So if we're considering open source, let's step away from the cutting edge and stick to small open-source models. These models can actually be really economical to run on your own. So in summary, I recommend using small open-source models, at least for now. For the large state-of-the-art open-source models, we've completed our three substeps. This completes evaluation for large open-source models. In the next lesson, we'll look at small open-source models, small enough to run for free. As usual, you can find all links, notebooks, and other resources on the course website. 8. Demo: Run Open-Source Model: Let's now run open-source models for free. We're currently on the second of three steps. We're now trying small open-source models. For now, let's treat Llama 3.1 8 billion as our default. This is a tiny language model that should fit on your laptop. 
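As a rough sanity check on the sizing claims so far (a 1 trillion parameter model needing 16 H100 GPUs, versus an 8 billion parameter model that is small enough to run yourself), here is a back-of-the-envelope sketch. It counts the memory for the weights only. The bytes-per-parameter values and the 80 GB of memory per H100 are assumptions of this sketch, not figures from the course, and real deployments also need room for activations and the KV cache.

```python
import math

H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(
    num_params: float,
    bytes_per_param: float,
    gpu_memory_gb: float = H100_MEMORY_GB,
) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    # Assumption: the 1 trillion parameter model is stored at 1 byte per parameter.
    print(weight_memory_gb(1e12, 1), min_gpus_for_weights(1e12, 1))
    # Assumption: the 8 billion parameter model is stored at 2 bytes per parameter.
    print(weight_memory_gb(8e9, 2), min_gpus_for_weights(8e9, 2))
```

Under these assumptions, the 1 trillion parameter model needs about 1,000 GB for weights alone, so at least 13 GPUs before any working memory, which is in the same ballpark as the 16 quoted earlier. The 8 billion parameter model needs about 16 GB, which is why it is a candidate for a single device.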
Of course, there are many other open-source models. But before I discuss those, the next few slides will refer to a few terms like GPU, node, RAM, and precision. If you're unfamiliar with those terms, you can safely ignore my explanation and just focus on the takeaway. I'll also include definitions on the course website at this URL. Now, back to rule number two. We've previously said to use small models, but let's be more precise about what small means. By small, I mean single-GPU models, which are far simpler to run inference on than multi-GPU and especially multi-node models. Now that we've narrowed down to single-GPU open-source models, which open-source models fit on a single GPU? Well, that depends on two factors: the RAM your GPU has and the precision of your model. I'll put the full explanation on the course website. But for now, we'll say that LLMs with 16 billion or fewer parameters can fit on a single GPU. Let's use this to find an open-source model to use. Back on Hugging Face's website, on the left-hand side, filter by model size. We want 16 billion or fewer parameters, but 12 billion is the closest value we can use. Then in the top right, sort models by likes. After both modifications, you'll see a page like this. Looking at the number of downloads, we quickly see a winner: Llama 3.1 8 billion Instruct, which had over 9 million downloads in the last month. Let's use this model. For our open-source experiments, we'll be using an 8 billion parameter model from the Llama family of LLMs. This is one of the most downloaded models over the last month on Hugging Face. And more importantly, we've determined that this model fits on the GPUs we're using. Go to this URL to open the starter code that I've written for you. You should see a notebook like this one. Now at the top of the file, click Run All. You might see an error. If you see an error like this, then you'll have to run each step manually. To run manually, move your cursor so that it hovers over step one. 
A Run button will appear. Click on that Run button. After this, even though step one is still running, you can immediately hover over step two and click on its Run button. Then you should see dotted lines like this, indicating step two is queued up. Whether you hit Run All or ran the first steps manually, your notebook is now installing prerequisites and downloading weights. This will take five to 10 minutes, so during that time, continue watching this video walkthrough. After a few minutes, step one will finish. If you see a green checkmark, that means step one completed successfully. After a few more minutes, step two should finish too. Again, you should see a green checkmark indicating success. Now, whether or not you've already run step three, click Run next to step three. After a few seconds, we'll see Llama's outputs discussing the meaning of life. You should now see the prompt. Change the prompt to whatever you desire and hit the Run button one more time. Treat this as a personal chatbot. You can ask or say whatever you might normally say to ChatGPT. For example, we can have Llama explain some math to us or have Llama tell us about our favorite online learning platform. This is effectively your very own AI running in the cloud. No one but you is using this dedicated LLM. Now it's time to test Llama on our example project. Hit Run in the last cell here to test Llama's ability to identify non-food critiques. Unfortunately, Llama gets it wrong, saying both the first and third reviews are true, meaning that they are both non-food critiques. In other words, Llama predicts true, false, true instead of true, false, false. We'll fix this in the next lesson without training the model. That's it. We've now run our very first open-source LLM, and we've finally finished the second of three steps. As I mentioned before, our next step is to refine quality. In particular, we'll improve this open-source LLM's ability to complete our example project. 9. 
Refine Quality: Engineer Inputs: In this lesson, we're going to improve our LLM's capabilities by changing our inputs. Generally, this is called prompt engineering. We're now in the third of three steps, refining quality. There are two substeps here, the first of which is to engineer our inputs. If you haven't already, go to this URL to open the starter code that I've written for you. If you have your notebook from the last lesson still open, use the same notebook. If you're opening this notebook for the first time, click Run All at the top of the file. The first two steps will take about five to 10 minutes, so you can continue watching this walkthrough in the meantime. Recall from our previous lesson that Llama 8 billion failed to complete our task. The output's format and accuracy were both off. Our first approach to fix this is to be specific. It sounds like a silly tip, but just specifying the format is enough. For example: "Provide three lines of output. For each line, denote true or false. No extra text, no formatting." I've added this to your notebook already under a cell titled Tip one. In your notebook, scroll down to the structure inputs demo section, like pictured here. Hover over the cell that says Tip one, then click on the Run button that appears. After a few seconds, you should see the following output: true, false, true. The outputs are still incorrect, since we expect true, false, false, but we fixed the format by just adding a few instructions to the prompt. Our next approach is to provide examples. Let's add three examples of reviews: "Five out of five, the chicken was salty, but good." "Three out of five, the marinara was too sour." "Three out of five, the server wasn't patient." And finally, let's add the desired outputs. The first two reviews involve food critiques, and the last review involves a non-food critique. So we expect false, false, true. I've added this to your notebook already in a cell titled Tip two. 
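To make the first two tips concrete, here is a minimal sketch of how a prompt with format instructions and worked examples could be assembled. This is not the notebook's actual code: the function name `build_prompt`, the section labels, and the task sentence in the usage example are all hypothetical. Only the format instructions and the three example reviews come from the lesson.

```python
# Tip one: be specific about the output format (wording taken from the lesson).
FORMAT_INSTRUCTIONS = (
    "Provide three lines of output. For each line, denote true or false. "
    "No extra text, no formatting."
)

# Tip two: worked examples from the lesson, as (review, is_non_food_critique).
EXAMPLES = [
    ("5/5 The chicken was salty, but good.", False),
    ("3/5 The marinara was too sour.", False),
    ("3/5 The server wasn't patient.", True),
]


def build_prompt(task: str, reviews: list[str]) -> str:
    """Combine the task, format instructions, examples, and new reviews."""
    example_inputs = "\n".join(review for review, _ in EXAMPLES)
    # Labels are rendered as lowercase true/false, one per line.
    example_outputs = "\n".join(str(label).lower() for _, label in EXAMPLES)
    new_inputs = "\n".join(reviews)
    return (
        f"{task}\n{FORMAT_INSTRUCTIONS}\n\n"
        f"Example reviews:\n{example_inputs}\n"
        f"Example output:\n{example_outputs}\n\n"
        f"Reviews:\n{new_inputs}\n"
        "Output:"
    )


if __name__ == "__main__":
    # Hypothetical task sentence and placeholder reviews, for illustration only.
    print(build_prompt(
        "For each review, say whether it contains a non-food critique.",
        ["review one", "review two", "review three"],
    ))
```

Keeping the instructions and examples in named constants makes it easy to measure how much each tip adds to the prompt length, which matters for cost.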
Scroll down and hover your mouse over this cell titled Tip two. Click on the Run button that appears, and after a few seconds, you'll see true, false, true. Unfortunately, the outputs are still wrong, but we have one more trick. For the third approach, we will ask for chain of thought. In short, ask Llama to show its work. In our prompt to the model, we'll simply ask it to first reason step by step. Then we specify the output format very precisely. As usual, I've added this to your notebook already. Scroll down, hover over this cell titled Tip three, and click on the Run button that appears. After a few seconds, Llama finally predicts correctly. We see the block of thoughts we asked for and the correct outputs: true, false, false. So we can now say that Llama 3.1 8 billion can complete our task successfully. To recap, we applied three tips: be specific, provide examples, and ask for chain of thought. These three together enabled our Llama model to successfully identify non-food critiques. Now, what if we use this new and improved prompt on our models from before? It looks like DeepSeek V3 now successfully answers the question with true, false, false, but Qwen 1.5 billion still produces meaningless garbage. As a result, our open-source results now look like this. At a high level, when comparing OpenAI's pricing to open-source pricing, OpenAI is still offering a much higher intelligence per dollar, although we now have a working Llama 8 billion that can reliably produce correct answers on free-tier hardware, even cheaper than all of the above. In summary, be specific, provide examples, and ask for chain of thought. That's it for the first substep, engineering our model's inputs. We've successfully improved Llama's ability to complete our example task. In the next lesson, we'll make a final set of improvements to make our open-source model run robustly on large amounts of data. As usual, you can find all prompts and the starter code I used at this URL. 10. 
Refine Quality: Constrain Outputs: In this lesson, we're going to improve our LLM's capabilities by structuring our outputs. A word of warning: this lesson features a lot of code. If you're uncomfortable with code, just focus on the takeaways that I discuss instead of the code itself. This is our third of three steps. In the last lesson, we engineered our inputs to improve the quality of the model's responses. In this lesson, we'll improve quality instead by constraining our outputs. If you haven't already, go to this URL to open the starter code that I've written for you. If you have your notebook from the last lesson still open, use the same notebook. If you're opening this notebook for the first time, click Run All at the top of the file. The first two steps will take about five to 10 minutes, so you can continue watching this walkthrough in the meantime. Recall from our open-source demo that Llama 8 billion failed to complete our task. The output's format and accuracy were both off. Let's reproduce this result again. In your notebook, scroll down to the structure outputs demo. Hover over example A, and click Run. You'll get this blob of text. It's both wrong and incorrectly formatted. We want a list of true or false values. To fix this, in the previous lesson, we applied three tips: be specific, provide examples, and ask for chain of thought. After applying these tips, Llama produced the correct outputs: true, false, false. However, correctness came at a cost. The original prompt took 288 characters, while the new prompt takes 1,312 characters. That's a whopping 4.5 times longer input. So can we improve output format and correctness without increasing the number of input tokens? The answer is, of course, yes. Let me explain. Here's the LLM. It takes text as input and produces text as output. In this case, our input is "Are bananas fruits?" And unfortunately, the LLM outputs "are." That's not even a valid response to the question. 
To fix this, we force the LLM to output only yes or no. That way, even if the output is wrong, at least the output is valid. To do this, we need to modify the step at the end: how LLMs translate outputs into words. Let's zoom in. The LLM actually first outputs a list of numbers. These numbers are probabilities that correspond to certain words. In our example, the first probability is the likelihood of "yes," the second the likelihood of "no," the third the likelihood of "are," and the last the likelihood of "is." The highest probability is 60%, and the corresponding word is "are." So we finally output "are." This is how the LLM predicts normally, but our goal is to output only yes or no. So let's make some changes. First, disregard all other words and consider only "yes" or "no." We now take the higher of those two probabilities, 10%, which corresponds to "yes." Finally, we output "yes." And with that, we've successfully constrained our LLM to output only yes or no, taking the highest-probability valid word. In this example, we only wanted yes or no, so we used the probabilities of these two words and simply picked the more probable of the two. Now our LLM is forced to output yes or no, and more importantly, we did this without changing the number of input tokens. Navigate back to your notebook and scroll down to step zero, setup. Mouse over the cell and click Run. You won't see any output for this step. You're now set up. For our first example, we will force true or false. Equivalently, we can say that we'll force a boolean. A boolean is a value that is either true or false. For example one, we ask the LLM "Are bananas fruits?" and force the output to be true or false. Now, hover over example one and click Run. This will output a single boolean: true. This is correct. Bananas are fruits. For our next example, we'll force three booleans instead of just one. Now, we check if bananas, almonds, and potatoes are fruits. Move your mouse over example two and hit Run. Now you'll see the outputs true, false, false. This is correct. 
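The yes-or-no walkthrough above can be sketched in a few lines. This is a toy illustration, not the notebook's code or any library's API: the function name `pick_word` is hypothetical, and only the 60% for "are" and the 10% for "yes" come from the lesson. The other two probabilities are made up so that the numbers sum to one.

```python
# Toy next-word probabilities for the prompt "Are bananas fruits?".
# "are" = 0.60 and "yes" = 0.10 come from the lesson; "no" and "is" are assumed.
NEXT_WORD_PROBS = {"yes": 0.10, "no": 0.05, "are": 0.60, "is": 0.25}


def pick_word(probs, allowed=None):
    """Return the most probable word, optionally restricted to `allowed`."""
    if allowed is None:
        candidates = probs  # normal decoding: every word is a candidate
    else:
        # Constrained decoding: disregard every word that is not allowed.
        candidates = {w: p for w, p in probs.items() if w in allowed}
    if not candidates:
        raise ValueError("no allowed word has a probability")
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    print(pick_word(NEXT_WORD_PROBS))                  # unconstrained
    print(pick_word(NEXT_WORD_PROBS, {"yes", "no"}))   # constrained to yes/no
```

With these numbers, the unconstrained pick is "are" and the constrained pick is "yes", matching the walkthrough. Real constrained-decoding tools apply the same idea to tokens at every generation step rather than to whole words once, but the principle of masking out invalid options is the same.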
Now let's see if we can use this to constrain the output format for our non-food critiques. Hover over example B, then click Run. This outputs three booleans. The output format is valid. Unfortunately, the output is still wrong, though. It should be true, false, false, not true, false, true. So let's keep going. For our third example, we will force the model to reason step by step. For this step, we'll prompt the model to reason first, then force the model to output text between think tags. You can choose whatever format you want for thinking. This is just the format I chose. Like before, the prompt is to check whether bananas, almonds, and potatoes are fruits. Move your mouse over example three, then click Run. This will take some time to run, about 45 seconds or so. Then you'll see the final output with the correct reasoning and, even better, the correct outputs: true, false, false. Now, let's apply the same structured outputs to our example project for non-food critiques. Hover over example C and click Run. You'll see the reasoning as well as the correct answer at the end. By structuring our outputs, we've improved the quality of our model and forced the outputs to match a specific format. That's a win-win. We've also drastically shortened prompt length. The original prompt was 288 characters. In the last lesson, we needed 1,312 characters. In this lesson, we only needed 386 characters, along with structured outputs, to achieve the correct answer. With prompt engineering, we needed a 4.5 times longer prompt to produce the right outputs. With structured outputs, we have a much shorter prompt, just 1.3 times longer, and we're guaranteed the output will always be valid. This isn't to say that one is better than the other. I want to emphasize that prompt engineering and constraining outputs will most often be used together to maximize model quality. Now, let's try constraining outputs for a proprietary model as well, namely OpenAI's models. 
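For orientation before the walkthrough, a JSON schema is simply a description of the shape the output must take. The sketch below shows what a schema for a list of booleans could look like, plus a small standard-library check of a reply against that shape. The schema, the `answers` field name, and the `parse_answers` helper are all hypothetical; the course's actual schema is the one provided at the course URL and may differ.

```python
import json

# Hypothetical schema: an object with one field, "answers", holding booleans.
HYPOTHETICAL_SCHEMA = {
    "type": "object",
    "properties": {
        "answers": {"type": "array", "items": {"type": "boolean"}},
    },
    "required": ["answers"],
    "additionalProperties": False,
}


def parse_answers(raw: str, expected_count: int = 3) -> list[bool]:
    """Parse a JSON reply and check it has the shape described above."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or set(data) != {"answers"}:
        raise ValueError("expected an object with exactly one key: 'answers'")
    answers = data["answers"]
    if not isinstance(answers, list) or len(answers) != expected_count:
        raise ValueError(f"expected a list of {expected_count} items")
    if not all(isinstance(a, bool) for a in answers):
        raise ValueError("every answer must be true or false")
    return answers


if __name__ == "__main__":
    print(parse_answers('{"answers": [true, false, false]}'))
```

When the provider enforces the schema during generation, a check like this should never fail. It is still a cheap safeguard in project code that consumes the output.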
Go to platform.openai.com. Your webpage will look like this. Make sure to click Log in in the top right if you haven't already. Then click on Dashboard in the top right. Click Create in the center of the page, and you should see a page like this. Click on the model picker. From the dropdown, select GPT-5 nano, and your screen should now look like this. In the bottom right, make sure to select Auto-clear. Then next to the model name, click on the Settings icon. This will pop up a settings menu. Click on the text format dropdown, and from the dropdown, select JSON Schema so you can set up constrained outputs. Make sure it's JSON Schema and not JSON Object. This will open up a dialog like this one. From this URL, copy the constrained output JSON schema. Then paste the JSON schema in here. Scroll to the bottom of the dialog and click on Save. We'll see the JSON schema reflected in the menu here. Then click outside of the menu. From this URL, copy the example project prompt from before. Paste the example project prompt here and hit Run. And now we have the output in JSON format. Try whatever random changes you want to the prompt; the API will always return outputs in this format. And with that, you've successfully constrained outputs for ChatGPT. In summary, structuring outputs forces the output to be valid. We can force yes or no, we can force numbers, and we can force any particular format we want, for both open-source and proprietary models. We've discussed how it works, a simple example to identify fruits, and an improvement for our detection of non-food critiques. That's now the end of our third step. We've seen two ways to improve our model's quality without training of any kind, and this concludes our lesson on structuring outputs. In fact, that concludes our third of three steps. And now you've finished just the start of your journey building with AI. 
These are the first steps that everyone should take for a project involving AI. You can access this URL for a copy of the starter code, other prompts, and more resources. In the next lesson, we'll wrap up the class. 11. Conclusion: You've made it to the very end of the course. Let's close out with a quick recap of what you've learned. First, scope your project. To do this, we broke this down into three substeps. Narrow down your focus to the general text-to-text task. Then narrow your focus down to the established, commercially proven text-to-text capabilities: summarization, structuring data, and coding. Finally, define your inputs, outputs, your task, and your metric. As an example, this was the project description for detecting non-food critiques. For our second step, evaluate models so we can pick one to use. To do this, we again have three substeps. At a high level, start from the best, most intelligent AI. Test models via chatbot, the simplest, most user-friendly interface, then test models via an API so that your results are more repeatable and reproducible. Finally, optimize for cost by testing cheaper and faster models to see if they complete your task successfully. Once we've picked a model, we then refine quality. To do this, we had two substeps. Start improving quality by refining your inputs. We call this prompt engineering. Then constrain the outputs. This can further improve the reliability and quality of your model. And that completes our three-step process, which you can now use to build any project involving AI. This is just the start, but it's a solid foundation for understanding how well AI can accomplish your task. Now, using the above process as a guide, apply this to your own project and post the result in the course's Project tab. Feel free to use one of my example projects or create one of your own. I'm very excited to see what you create. If you'd like to hear more about follow-up classes, follow me on Skillshare and check out my other courses. 
Congratulations on making it to the very end of the course.