Transcripts
1. Introduction: Training your own model is slow, expensive and overkill
for most projects. The real problem lies in
knowing where to start, how to get reliable results without burning time or money. Hi, I'm Alvin, a
research scientist at a large tech company. I've taught over 60,000
students on Skillshare, and I got my PhD in
AI from UC Berkeley. In this class, I'll give you a simple repeatable
playbook for building a custom ChatGPT
for your use case. We'll do this in three steps. Step number one,
scope your project. Pin down how to use AI most effectively
for your use case. Step number two,
evaluate models. I'll show you how to
navigate the large swaths of proprietary and
open source models. Step number three,
refine quality. Ensure that the model returns exactly what you need
every single time. By the end of this
course, you'll have a three-step process to
use for any AI-based idea. This course does
assume that you've taken my course AI for Beginners: How ChatGPT Works. But beyond that, no technical
background is required. All you need is a laptop, Internet, and an hour of
time. Let's get started.
2. Scope Project: Text to Text: Welcome to AI for Beginners: Build a Custom
ChatGPT in Three Steps. Before we begin, let's
review some terminology. I'm going to vastly oversimplify these definitions to
convey the main idea. An LLM, or large language model, is a model that generates text. You give it text as input, and the LLM produces
text as output. AI, or artificial intelligence, is generally a product
that uses an LLM. ChatGPT is specifically
OpenAI's AI product. So AI versus ChatGPT is like
tissues versus Kleenex. Tissues are the general product, and Kleenex is a specific
trademark product name. If you found that
confusing, that's okay. I go over this in more
detail in AI for Beginners: How ChatGPT Works, which
you can find at this link. Since you're already subscribed, you'll get this course
at no extra cost. I also highly recommend taking my AI for Beginners: Tools
to Learn Machine Learning, where I discuss a framework
for picking up AI and ML. You can find that
course at this link. There's no need to
write down these links, as you'll find all
of the links in this lesson at this URL. Now let's build your custom
ChatGPT in three steps. These steps are for
everyone to take, even if you're a professional, even if you have
a massive budget. Anyone building a custom AI
needs these three steps. Step one, scope your project, determine what your AI will do and how you'll judge
its capabilities. Step two, evaluate models. There are guidelines
for what types of models to use when. Step three, refine quality. There's no need for training, and we'll show you two
methods of improving quality. Now let's start with step
one, scope your project. The first step here is to narrow your focus to text-to-text
tasks, meaning you give
the AI text as input and get text as output.
Let me explain why. First, from experience,
we know that AI can take in much
more than text. We can give AI images, PDFs, websites,
videos, and more. It can also output
more than text. AI can produce images, code, graphs, audio, and
much, much more. That's a lot of different media. However, remember
what we said before. AI is a product
that uses an LLM, and an LLM is a
model that takes in text as input and generates
text as output. So back to our diagram.
Let's zoom in on the AI product and look
at the underlying LLM. Notice that the LLM
only takes in text. All other inputs are first
converted into text. Knowing this, the
ideal case is to give the AI text directly and skip that conversion
process entirely. This is because text conveys the most
information per word. Let me explain by example. Let's say there's a webpage with results from a
recent soccer game. How many words are on this page? I see one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve. So a total of 12 words. I could give the AI
this text directly, a mere 12 words, or I could take a screenshot of that webpage and give
that image to the AI. The AI would then
translate that image into a whopping 2,030 visual words. That's roughly 170 times more words for the same amount
of information. So in conclusion, text
is more efficient at communicating information
than images of text. As a result, for
simplicity, stick with text-in,
text-out tasks for your project. Here's another example. Say you want to ask ChatGPT
for feedback on your slides. You have several options
for download formats. Don't download the
slide deck as a PDF, though, and definitely
don't download as images. Instead, download the slides
in plain text format. This plain text format contains
just the raw text for all of my slides without any extra formatting or
unnecessary information. Now I can upload this file and ask ChatGPT to provide
feedback on my slides. Luckily, my slides are
just provided as raw text, making it efficient for
ChatGPT to process. In short, if you want to
feed a file to ChatGPT, export that file in plain
text format where possible. Let's do one more example. Let's say I want to ask ChatGPT about a blog post I'm reading. Don't screenshot the webpage. Instead, directly provide ChatGPT with the URL of the webpage, and you'll find that
ChatGPT will actually access that webpage for us,
summarizing that blog post. We can check that ChatGPT accessed the right URL by
clicking on sources. And on the right
hand side, we'll see the original webpage
listed as a citation. This means ChatGPT
was able to access the raw text on the
webpage and, like before, ingest information
efficiently. Success! In short, give webpage URLs
directly so that ChatGPT can access the webpage and extract its contents as raw
text on its own. To summarize this
lesson in one takeaway, focus on text to text tasks. This means feeding
in plain text data wherever possible
for our project. And this now concludes the first substep
of the first step. We've scoped our project down by focusing on text to text tasks. In the next lesson,
we'll talk about some text to text tasks
that AI is best suited for. And if you'd like
to get a copy of these slides and see
additional resources, you can access this URL.
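To make the soccer-page comparison concrete, here is a minimal sketch of the arithmetic. The two counts are the figures quoted in this lesson; the variable names are just illustrative.

```python
# Figures quoted in the lesson for the same soccer result.
text_words = 12             # words when the page is given as plain text
image_visual_words = 2030   # "visual words" when the page is given as a screenshot

# How many times more words the screenshot needs for the same information.
ratio = image_visual_words / text_words
print(f"The screenshot needs about {ratio:.0f}x as many words as the plain text.")
```

The exact quotient is a little over 169, which is where the "roughly 170 times" comparison comes from.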
3. Scope Project: Capabilities: In this lesson, we'll continue discussing how to
scope your project. Now, to recap, we're currently on step one of
a three step process. My goal in this first
step is to help you narrow your
project scope down. In the previous
lesson, we narrowed your focus down to
text to text tasks. And in this lesson, I'll
further narrow your focus down to specific text
to text capabilities. First, our focus is on building a reliable and accurate
customized AI for production. So then we might
ask, what can AI do reliably and accurately
enough for production? There are many such tasks, but I'll focus only
on three of them. Our first category
is summarization. For example, Amazon uses AI to summarize product reviews
in a "Customers say" section.
Slack uses AI to
summarize channels, and Notion uses AI to summarize
Zoom meetings for you. There are many
commercial examples of summarization alone. Now, let's try this ourselves. Let's say you run a bakery
with hundreds of reviews. You'd like to gain insights from those reviews to help
your bakery grow. In your browser,
go to chatgpt.com. You'll see a page like this one. Let's paste in some
reviews and ask ChatGPT to summarize
those reviews. Here's our summary. ChatGPT says that opinions on the almond
croissant are mixed, which seems like a
fair assessment. You can see the
reviews yourself at this URL to judge whether
or not you'd agree. So in short, AI is pretty
adept at summarization. AI is also production ready
for structuring data. Structuring data is a fancy
way of saying that AI can translate globs of text into nicely organized tables of data. And at a surface level,
that sounds cool. But let me emphasize, cool
is an understatement. Being able to take
unstructured data and extract clean structured data is actually an extremely
useful AI superpower. Let me explain by example. Let's go back to our bakery. From our 671 reviews, we'd like to know
what our customers think of our baked goods. For example, did customers
like the almond croissant? In fact, let's first ask, which customers even tried
the almond croissant? Ideally, we'd have a
nicely organized table of data like this, a list of true for
those that liked it and false for
those that didn't. But we don't have such a table. We have globs and globs
of reviews like this. Now we need to go through
and read each review, then determine which reviews even mentioned
almond croissants. There are many ways to
mention almond croissants. The review could say
"almond croissant," "croissant with almond," or "almond pastry." Then you might say, let's just
look for the word almond. But that would include
other croissants, too, because the bakery also sells chocolate almond croissants and blueberry almond croissants. So in short, detecting almond
croissant mentions is hard, and that's because unstructured
text is really flexible. Luckily, though, AI
is perfect for this. Let me show you. Go
to chatgpt.com. Go to this URL to
copy the prompt. Then paste the prompt here. The prompt gives
ChatGPT a list of bakery reviews and asks it to identify almond
croissant lovers. Hit run, and you get a
nice table of results. Instead of reading
through reviews, we can easily see which customers tried the
almond croissant. There are three such reviewers, and of those three,
two liked it. As a result, we conclude that 67% of customers that tried
the almond croissant liked it. And with that, we can easily analyze large amounts
of text this way, all because AI can convert unstructured text into
structured formats. And now you should be able to see how valuable
structuring data is. It's a superpower
for analyzing data. The third task that
AI is production ready for is coding
in small steps. There are many success
stories in this space. Cursor enhances
developer productivity by offering an AI
based code editor, which can write and
run code for you. Lovable allows anyone
to generate apps from scratch without writing
any code themselves. However, for your first project, don't ask AI to build
entire code bases. Treat AI like a junior engineer: give it simple tasks and
individual steps. AI is very good at generating
snippets of starter code, at finding and
fixing small bugs, and at adding a feature
that involves only a few files. Let's
see this in action. Go to chatgpt.com. Let's ask ChatGPT to
plot a pie chart of almond croissant lovers using
the data from before. As requested, ChatGPT begins
writing and running code, which now gives us a pie chart. ChatGPT also gives us
the code for the plot, which we could run ourselves. In short, AI is very capable of writing
code in small steps. In summary, AI is production
ready for summarizing, structuring data, and coding. Summarization is
built into products at Amazon, Slack, and Notion. Structuring corporate
information into a searchable format is dominated
by Glean and Intercom. Coding is dominated by
Cursor and Lovable. Of course, AI can be applied in many
other ways. But here, we focus on capabilities that have already
found commercial success. Now you've seen capabilities
that custom ChatGPTs can accomplish with
production grade quality. We've now discussed several
AI capabilities that have already been proven
commercially and repeatedly so, making these
capabilities reasonable to rely on for our
first project. You can access this URL for
all of the prompts I used, links to the full example conversations,
and more resources.
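Once reviews are structured into a table, the almond croissant analysis from this lesson reduces to a few lines of code. Here is a minimal sketch, assuming a hypothetical table with one row per reviewer; the reviewer names and column names are made up for illustration, and only the counts (three tried it, two liked it) come from the lesson.

```python
# Hypothetical structured output: one row per review.
# "liked" is None when the reviewer never tried the almond croissant.
rows = [
    {"reviewer": "A", "tried": True, "liked": True},
    {"reviewer": "B", "tried": True, "liked": True},
    {"reviewer": "C", "tried": True, "liked": False},
    {"reviewer": "D", "tried": False, "liked": None},
]

# Keep only reviewers who tried the almond croissant, then those who liked it.
tried = [row for row in rows if row["tried"]]
liked = [row for row in tried if row["liked"]]

share = len(liked) / len(tried)
print(f"{len(liked)} of {len(tried)} reviewers who tried it liked it ({share:.0%})")
```

With two of three reviewers liking the croissant, the share rounds to 67%, matching the figure in the lesson.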
4. Scope Project: Task & Metric: In the last lesson, we saw general capabilities
that AI is best at. Now, let's translate
general capabilities into a specific project. In the first part, we narrowed your focus down to
text-to-text tasks. In the last lesson, we
further narrowed down your focus to three
specific capabilities, summarizing, structuring
data, and coding. In this lesson, we'll
finally define your project. For your project, you first
need to define the task. What will AI do for you? Then define what the LLM
will take in as input. Will it take in reviews, emails, essays? And finally, define what text the LLM
will produce as output. Will it produce summaries,
tables, bullet points? Let's see an example. Say I'm
looking for a place to eat. Yummy Cafe has a low rating, and I want to know why. My first reaction is to have
AI summarize my reviews. Let's fill out the
project template. Our task is to
summarize reviews. Our input will be reviews, and our output
will be a summary. That doesn't seem like
a helpful definition. So let's be more specific. On the restaurant's
webpage, I scroll through reviews with
really low ratings. One person confused this place with its sister restaurant. Someone else
complained about the order that dishes
were served in, and the last review did not like the wait
staff's attitude. These are all valid concerns, but I personally only
care about the food. So I want to filter
out non-food reviews. Knowing this is my
goal, let's now translate this into
a concrete project. Back to our template: I would like to detect non-food critiques. The input is a
restaurant review, and the output is true or false:
whether or not the
review deducts stars for non-food reasons. Now
we're in business. This is how specific your
project definition should be. However, we have one more
component, the metric. At a high level,
the metric tells you if AI's outputs are good. To start, your first
metric will be, do I find the AI's outputs reasonable? Assess
by hand first. Here's a review that complains about the order dishes
were served in. We ask the AI, does this review deduct stars
for non-food reasons? The AI makes a prediction: yes. Now, do I find "yes" reasonable? I do. The review is deducting stars for
service, not food. The AI passes this first check. Here's another review
that complains about the fish's freshness. We ask the AI, does this review deduct stars
for non-food reasons? The AI makes a prediction: no. Do I find "no" reasonable? I do. This review deducted stars precisely
for food reasons. The AI passes this second check. Here's a review that shows confusion about
the restaurant's name. We ask the AI, does this review deduct
stars for non-food reasons? The AI makes a prediction: no. Do I find "no" reasonable? This time, I don't. The review deducted stars for the restaurant's
name, not the food. The AI fails this third check. Here is the AI scorecard. We find the outputs reasonable
two out of three times. So in conclusion, your first assessment is that the AI's outputs are
mostly reasonable. However, the best metrics
will quantify quality. So let's be more specific. Back to our scorecard, we found the AI correct two
out of three times. So, we say that the AI
achieved 67% accuracy. Now your metric is
the percentage of reviews that the AI
classifies correctly. Said more succinctly,
our metric is model accuracy on reviews that we've labeled
as true or false:
true if the review deducts stars for non-food reasons. Now we can take any AI and compute its
accuracy on our task. Here is now our full
project definition. We've added the
metric, accuracy: how accurately the AI
predicts true or false, compared with your own labels
on restaurant reviews. Let's do another example. Say you're organizing
a hackathon, and after the online submission portal closes at the deadline, dozens of people email
late entries to you. Your general goal is to organize late entries
sent over email. More concretely, your task is to detect if an email
is a hackathon entry. The input is an email. The output is whether or not the email is actually a
hackathon submission. The metric is also accuracy: how many emails are correctly classified as entries or not. Now using AI, we can generate
a table of data, and voilà. It's now much
easier to determine which late submissions
to accept at scale without having to read all 100 emails to find
the hackathon entries. In summary, define the
task, the input to your AI, the output of your
AI, and the metric, which determines how good
your model's outputs are. That concludes this lesson. In this lesson, we finally defined your project
after scoping down and that additionally concludes the first
of three steps. You've now scoped your project and we're ready to
get started on it. You can access this URL for a copy of these slides
and additional resources.
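The accuracy metric from this lesson can be written down in a few lines. This is a minimal sketch, not the course's own code: the function name and the lists are illustrative, and the three labels mirror the three reviews we checked by hand (order of dishes, fish freshness, restaurant name confusion).

```python
def accuracy(predictions, labels):
    """Fraction of reviews where the AI's prediction matches our own label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# True means "this review deducts stars for non-food reasons".
labels = [True, False, True]        # our own labels for the three reviews
predictions = [True, False, False]  # the AI's answers: yes, no, no

print(f"Accuracy: {accuracy(predictions, labels):.0%}")
```

The AI matches our labels on two of three reviews, so this prints an accuracy of 67%, the same scorecard as in the lesson.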
5. Evaluate Models: Test the Best: Welcome back. In
previous lessons, we narrowed down our focus
to text-to-text tasks,
then specific capabilities,
then defined the task and metric. This completed our first
step, scoping the project. For our second step, we'll evaluate models to pick the best model
for our use case. To pick a model, you'll evaluate multiple models on their
ability to complete your task. We'll start with the very best options because we want to know: can any AI even
complete our task? If the best AI can't
solve your task, then less intelligent AIs
certainly won't either. So, start with the
most intelligent AI. According to OpenAI's website, if you're on the
Free or Plus plans, GPT-5 Thinking is your
most intelligent model. If you're on the Pro plan, GPT-5 Pro is your
most intelligent model. Let's start our
testing via chatbots, a simple interface
for a quick test. Go to chatgpt.com. You'll see a screen like this. In the top left,
expand the dropdown. From this dropdown, you'll see Fast, the fastest but relatively
least intelligent model.
Thinking, a more
intelligent model. Use this if you're on
the Free or Plus plans.
And finally, Pro, the
most intelligent model.
Use this if you're
on the Pro plan. I'll use GPT-5 Thinking so that everyone can
reproduce my results. We've now selected an LLM. So what should we ask ChatGPT? Let's use our project
from the last lesson. Recall, our goal is to filter
out non food critiques. The inputs are
restaurant reviews, and the outputs
are true or false, true if stars are deducted
for non-food reasons. So for our prompt, I'll write: given restaurant
reviews, output true if a review deducts stars
for non-food reasons. Then I include three reviews. The first review
complains about service, the second complains about flavor, and the third likes the food but felt service was slow. We expect the AI to then return true, because the first review
deducts stars for service; false, because the second
review deducts stars precisely for food; and false, because the last review does not
deduct stars at all. Let's try this prompt now. First, go to this URL
and copy the prompt. Then go back to chatgpt.com. Paste in the prompt. Hit run. ChatGPT thinks for a few seconds, then gives its response, which is correct:
true, false, false. As a result, we conclude that AI can complete our
task correctly. Specifically, GPT-5
completes our task. However, testing the chatbot is not enough because chatbots
are not deterministic. This means they don't return
the same results every time. For example, let's ask
ChatGPT for a dad joke.
ChatGPT makes a joke
about scarecrows. Let's do this again,
with the exact same prompt.
ChatGPT makes a joke
about the alphabet. Do this one more time.
ChatGPT jokes about skeletons.
So the exact same prompt yielded different
responses every time. This shows our earlier point that chatbots are
not deterministic. As a result, we need to switch from testing the chatbot
to testing the API. The API by default has much
less randomness injected. So if AI completes
the task once, we can be confident that AI will complete the task repeatedly. Go to platform.openai.com. Your webpage will
look like this. Make sure to click on Login in the top right if
you haven't already. Then click on dashboard
in the top right. Our page will now
look like this. Click create in the
center of the page. Your screen should
now look like this. In the bottom right,
click Auto Clear. This ensures that each
time we hit Submit, we mimic the same behavior as
sending separate API calls. Now ask for a dad joke, but before you hit Enter, note that this specific example can cost up to two
tenths of a cent. Once you hit Run, ChatGPT makes
a joke about the alphabet. Let's ask for a dad joke again. Make sure Auto Clear
in the bottom right is selected and submit. ChatGPT makes the exact
same joke about the alphabet with a minor
punctuation change. And we try again and get the
exact same alphabet joke. The results are all
almost identical. This is much better
than the chatbot, which returned completely
different responses each time. So we can conclude that API calls are mostly
deterministic, definitely much more
so than the chatbot. So let's retry our example
project prompt via the API. Remember, this is our
prompt from before. We expect AI to return
true, false, false. Go to this URL to copy the
example project prompt again. Then back on platform.openai.com, paste
that prompt in. Before you hit Run, note that this example will cost
four tenths of a cent. The GPT-5 API first returns thinking
tokens for a few seconds, then finally returns the final, correct answer:
true, false, false. Let's try one more time. And again, GPT-5 returns the correct answer,
true, false, false. And one more time. And again, the correct answer,
true, false, false. So as before, GPT-5
correctly completes our task. Even more importantly, GPT-5
does so repeatedly. This is a big win. It means our project
can now repeatedly complete our task
using GPT-5's API. In summary, test the best, most intelligent AI first. First via chatbots
for simplicity, then via the API for
reproducibility, and that concludes our
testing via the API. Generally, we're actually
finished testing the best AI. In the next lesson,
we'll optimize cost by testing faster,
cheaper models. In case you're wondering
how I computed the cost of an API call, I'll explain that and include more resources on the
course website at this URL.
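The "send the same prompt three times and compare" check from this lesson can be sketched in code. This is only an illustration under stated assumptions: `repeatability`, `stub_api`, and `stub_chatbot` are hypothetical names, and the two stubs stand in for real model calls so the sketch runs without an API key. In practice you would pass in a function that sends the prompt to your model and returns its reply.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def repeatability(ask: Callable[[str], str], prompt: str, runs: int = 3) -> float:
    """Share of runs that returned the single most common response."""
    responses = [ask(prompt).strip().lower() for _ in range(runs)]
    top_count = Counter(responses).most_common(1)[0][1]
    return top_count / runs


# Stand-in for an API-style model that answers the same way every time.
def stub_api(prompt: str) -> str:
    return "true, false, false"


# Stand-in for a chatbot that tells a different joke on each call.
_jokes = cycle(["scarecrow joke", "alphabet joke", "skeleton joke"])

def stub_chatbot(prompt: str) -> str:
    return next(_jokes)


api_score = repeatability(stub_api, "example project prompt")
chatbot_score = repeatability(stub_chatbot, "tell me a dad joke")
print(api_score, chatbot_score)
```

A score of 1.0 means every run agreed; the joke-telling stub scores one third because all three responses differ.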
6. Evaluate Models: Optimize Cost: Welcome back. As a recap, you're in step two of three. But so far, you've only
tested the best AI. We tested the best AI via chat, then tested via API. Now it's time to
optimize cost and find the cheapest model
that can still complete our task repeatedly. Recall from before that
GPT-5 completed our example project task
successfully and repeatedly, and it cost about
four tenths of a cent. I'll include all
cost calculations on our course website
if you're interested. Here's GPT-5's
cost for our example, along with its costs per token, according to OpenAI's
API pricing page. Let's try the next
cheapest model on our list, GPT-5 mini. From your model selector, select GPT-5 mini. Your screen should
now look like mine. Paste in the example
project prompt, and GPT-5 mini responds
true, false, false, correctly completing the task. Let's look at the
cost. This example cost seven hundredths of a cent. And once we add
that to our table, we can see that we've
reduced the cost of completing our
task by five times. Let's now test our cheapest
model, GPT-5 nano. From the model selector, pick GPT-5 nano. Your screen should
now look like mine. Paste in the example project
prompt, and once again, we get true, false, false, the correct answer.
Look at the cost again. The cost of our example is now three hundredths of a cent. And in our table, we can
see that we've now reduced the cost of completing our
task by another two times. But let's reduce
cost even further. So far, we've kept reasoning
at the default medium level. We want to now set reasoning
to minimal instead. Let me show you how to do that. Back on this webpage,
next to the model name, click on the settings icon. From the menu, select
reasoning effort. This will give you
another drop down. Select minimal, which
will then look like this. Click outside the pop up. Then paste in your
example project prompt. And crazy enough, GPT-5
nano without reasoning still responds
true, false, false, completing the task perfectly.
Look at the cost now. The cost of our example
is now four thousandths of a cent, giving us this final
table of results. We further reduced cost
by another seven times. That's a crazy cost reduction. Given these results, we declare GPT-5 nano without
reasoning our winner. Now that completes the
three parts in step two: test via chatbot,
test via the API, and finally, find the cheapest
model that still works. This produced the cheapest
OpenAI model that could successfully solve our
task, GPT-5 nano. In summary, test the best, most intelligent AI first. First via chatbots
for simplicity, then via the API for
reproducibility, then test iteratively
cheaper, faster AI. We've now optimized
cost successfully, reducing costs by over 88 times, and that's it for
evaluating models. This concludes step two mostly. We're done with
proprietary models, but now we need to
repeat this process for another vast category of
options: open source models. As usual, you can find
all the calculations, prompts, and more
resources at this URL.
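The "find the cheapest model that still works" step can be sketched as a small table and a selection rule. The costs below are the rounded per-example figures as I read them from this lesson, in cents, so treat them as approximations; the exact calculations are on the course website. The tuple layout and variable names are illustrative only.

```python
# (model and reasoning setting, approximate cost per example in cents, passed our task?)
results = [
    ("GPT-5, default reasoning", 0.4, True),
    ("GPT-5 mini, default reasoning", 0.07, True),
    ("GPT-5 nano, default reasoning", 0.03, True),
    ("GPT-5 nano, minimal reasoning", 0.004, True),
]

# Only models that answered true, false, false are candidates.
passing = [row for row in results if row[2]]

# Among the candidates, pick the one with the lowest cost per example.
winner_name, winner_cost, _ = min(passing, key=lambda row: row[1])
print(f"Cheapest passing model: {winner_name} at about {winner_cost} cents per example")
```

If a cheaper model had failed the task, setting its last field to False would drop it from the candidates, and the next cheapest passing model would win instead.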
7. Evaluate Models: Open-Source: Let's explore one more category of models, open source models. We're currently on the
second of three steps where we're evaluating models to
determine which one to use. Luckily, there's a
standardized hub for all open source models
at huggingface.co. Visit this URL, and you'll see a ton of
options. It's overwhelming. Here are a few
heuristics, though, to navigate this large
repository of models. Rule number one, use
instruction tuned models. Usually, instruction
tuned models have instruct in
the name somewhere. The reason for this rule is that LLMs are trained
in three stages. In the first stage
called pre training, LLMs simply predict
the next word. In the second stage, called SFT
or supervised fine-tuning,
models learn to
answer questions. In the last stage, called reinforcement learning from
human feedback or RLHF, models learn to align answers
with human preferences:
being helpful,
honest, and harmless. Technically speaking,
only the second stage is called instruction tuning, but practically speaking,
instruction-tuned open source models have gone
through both SFT and RLHF. In any case, you don't want a model that's only gone
through pre-training because those models autocomplete questions
rather than answer them. So in summary, rule number one, use instruction-tuned models. For rule number two,
use small models. This is obviously not
a hard and fast rule, but I strongly recommend it. Now, let's say you ignore me. What if you want to use
massive open source models, the biggest and the best? Well, let's see which open
source model is the best. Here's the Chatbot Arena. Gemini, OpenAI models, and Grok
4 are all proprietary. So the highest ranking open
source model is Kimi K2. Let's go to Kimi K2's website. According to this webpage, Kimi K2 has 1 trillion parameters. So realistically, we would need 16 H100 GPUs to run this model. First, that's expensive. 16 H100 GPUs cost $64
for just 1 hour of usage. Second, it's complicated to run inference for
a model this big. All this to say, if you want to use state-of-the-art
open source models, you should use APIs
serving open source models because their prices benefit
from economies of scale. It's cheaper for you to call an API than for you to run
these massive models yourself. For details, you can
see this blog post, which I'll link on
our course website. To use state of the art
open source models, let's follow the
same three substeps. Test via chatbot, test via
API, and optimize cost. Starting from the first substep, let's test via chatbot. Go to gpt-oss.com. Remember, this was our
prompt from before. We expect AI to return
true, false, false. First, go to this URL
and copy the prompt. Then you'll see a
web page like this. I chose to continue
with visible reasoning, but you can pick either option. Neither option changes the
quality of the AI's responses. Paste in your prompt and hit Run. Then you'll see true, false,
false, the correct answer. On the left-hand sidebar, now click gpt-oss 20 billion, the smaller of the two open
source models from OpenAI. Then in the top right, click on New chat. Your screen will
then look like this. Paste in the prompt
once more and hit Run. Then once again, you'll see true, false, false, the correct answer. We've seen that the best
of open source AI can complete our task. Now let's retest via API. On huggingface.co, many open source models
have a section in the bottom right where
you can send one off messages to the
LLM to test it. This is also not
free technically, but you can send requests without adding
billing information. Paste in your prompt, and
DeepSeek R1 passes. It outputs true, false, false. That means open source AI like DeepSeek can also
solve our task. DeepSeek R1 would have cost a third of a cent to
run this example. Like before, let's now find the cheapest model
that can complete our task. Let's try the next
cheapest model, DeepSeek R1 Llama 7 billion. This model also gets the answer correct: true, false, false. DeepSeek Llama cost
a sixth of a cent,
so we've now lowered
our cost by half. Now let's try the next
cheapest, DeepSeek V3. Technically, DeepSeek V3 answered true,
false, false correctly. It just did so with
a lot of extra text. Unfortunately, if DeepSeek outputs different
formats each time, our project code
would not be able to parse LLM output reliably. Luckily, we have a
way to fix this, which I will show you later. But for now, DeepSeek
V3 doesn't pass. It has the intelligence
but not the formatting. Let's try the next cheapest on our list, Llama 3 70 billion. Hugging Face unfortunately
keeps giving me errors.
So I found a random
other website. I don't necessarily endorse it,
but any website will do. And it looks like
Llama 3 70 billion answers correctly with
true, false, false. So Llama 3 70 billion passes. Let's try the next
cheapest model on our list, DeepSeek's Qwen. DeepSeek R1's tiniest
distilled model just spits out an infinitely
long string of tags. So this is a hard failure. Qwen failed
catastrophically, as we saw. And with that said, we've now completed our table of results. We can relist all
prices in cents, then compare OpenAI
pricing with open source pricing to compare
intelligence per dollar. Here are the winners
of each category. Notice the difference in cost. Granted, I could have tested smaller open source
models that are cheaper than 70 billion but
more capable than Qwen, but any smaller, and they can be run on your own
machine for free, so there's no point in
testing their APIs. The OpenAI API is
15 times cheaper for this specific task of detecting non-food critiques correctly, based on our previous results. Generally, for the cutting-edge
class of models, proprietary models outperform
open source models on intelligence per dollar. So if we're considering
open source, let's step away
from cutting edge and stick to small
open source models. These models can actually be really economical
to run on your own. So in summary, I recommend using small open source models, at least for now. For the large state-of-the-art
open source models, we've completed our
three substeps. This completes evaluation for
large open source models. In the next lesson, we'll look at small open
source models, small enough to run for free. As usual, you can
find all links, notebooks, and other resources
on the course website.
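To see why a correct answer buried in extra text still counts as a failure, here is a minimal sketch of the kind of strict parsing that project code might do. This is not the fix promised for a later lesson; `parse_flags` is a hypothetical helper, and the chatty response below is a made-up example of extra text around the right answer.

```python
import re


def parse_flags(output: str, expected: int):
    """Return a list of booleans if the output is exactly `expected`
    true/false values separated by commas or whitespace; otherwise None."""
    tokens = re.split(r"[,\s]+", output.strip().lower())
    if len(tokens) != expected:
        return None
    if any(token not in ("true", "false") for token in tokens):
        return None
    return [token == "true" for token in tokens]


clean = parse_flags("true, false, false", expected=3)
chatty = parse_flags("Sure! The answers are true, false, false.", expected=3)
print(clean)
print(chatty)
```

The clean response parses into three booleans, while the chatty one returns None even though it contains the right answer, which is the sense in which a model can have the intelligence but not the formatting.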
8. Demo: Run Open-Source Model: Let's now run open
source models for free. We're currently on the
second of three steps. We're now trying small
open source models. For now, let's treat Llama
3.2 8 billion as our default. This is a tiny language model that should fit on your laptop. Of course, there are many
other open source models. But before I discuss those, the next few slides will refer
to a few terms like GPU, node, RAM, and precision. If you're unfamiliar
with those terms, you can safely ignore my explanation and just
focus on the takeaway. I'll also include definitions on the course
website at this URL. Now, back to rule number two, we've previously said
to use small models. But let's be more precise
about what small means. By small, I mean
single GPU models, which is way way simpler than multi GPU and especially
multi node for inference. Now that we've narrowed down to single GPU open source models, which open source models
fit on a single GPU? Well, that depends
on two factors, the RAM your GPU has and the
precision of your model. I'll put the full explanation
on the course website. But for now, we'll
say that LLMs with 16 billion or fewer parameters
can fit on a single GPU. Let's use this to find an
open source model to use. Back on Hugging Face's website, on the left hand side,
filter by model size. We want 16 billion
or fewer parameters, but 12 billion is the
closest value we can use. Then, in the top right,
sort models by likes. After both modifications,
you'll see a page like this. Looking at the
number of downloads, we quickly see a winner:
Llama 3.1 8 billion Instruct, which had over 9
million downloads in the last month. Let's
use this model. For our open source experiments, we'll be using an 8
billion parameter model from the Llama family of LLMs. This is one of the
most downloaded models over the last month
from Hugging Face. And more importantly, we've determined that this model
fits on the GPUs we're using. Go to this URL to open the starter code that
I've written for you. You should see a
notebook like this one. Now at the top of the file, click Run All. You
might see an error. If you see an error like this, then you'll have to run
each step manually. To run manually, move your cursor so that it
hovers over step one. A Run button will appear. Click on that Run button. After this, even though
step one is still running, you can immediately
hover over step two and click on its Run button. Then you should
see dotted lines like this indicating step
two is queued up. Whether you hit run all or
ran the first steps manually, your notebook is now installing prerequisites and
downloading weights. This will take five
to 10 minutes. So during that time, continue watching this video
walkthrough. After a few minutes,
step one will finish. If you see a green checkmark, that means step one
completed successfully. After a few more minutes, step two should finish too. Again, you should see a green checkmark
indicating success. Now, whether or not you've
already run step three, next to step three, click Run. After a few seconds, we'll see Llama's outputs
discussing the meaning of life. You should now see the prompt. You can now
change the prompt to whatever you desire and hit
the Run button one more time. Treat this as a
personal chat bot. You can ask or say whatever you might normally say to ChatGPT. For example, we can have
Llama explain some math to us or have Llama tell us about our favorite online
learning platform. This is effectively your very own AI running in the cloud. No one but you is using
this dedicated LLM. Now it's time to test Llama
on our example project. Hit Run in the last
cell here to test Llama's ability to identify
non food critiques. But unfortunately,
Llama gets it wrong, saying both the first
and third reviews are true, meaning that they are both
non food critiques. In other words,
Llama predicts true, false, true instead
of true, false, false. We'll fix this in
the next lesson without training the
model. That's it. We've now run our very
first open source LLM, and now we've finally finished
the second of three steps. As I mentioned
before, our next step is to refine quality. In particular, we'll
improve this open source LLM's ability to complete
our example project.
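The rule of thumb from this lesson, that whether a model fits on a single GPU depends on the GPU's RAM and the model's precision, can be sketched with simple arithmetic: the weights alone take roughly the parameter count times the bytes per parameter. The Python sketch below is an approximation that ignores activations and other runtime overhead, and the 40 GB GPU size is an assumed example value, not a figure from this course.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to hold the model weights, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


def fits_on_gpu(num_params, bits_per_param, gpu_ram_gb):
    """Rough check: do the weights alone fit in the GPU's RAM?"""
    return weight_memory_gb(num_params, bits_per_param) <= gpu_ram_gb


GPU_RAM_GB = 40  # assumed example GPU size

for name, params in [("8 billion", 8e9), ("16 billion", 16e9), ("70 billion", 70e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        verdict = "fits" if fits_on_gpu(params, bits, GPU_RAM_GB) else "does not fit"
        print(f"{name} parameters at {bits}-bit: ~{gb:.0f} GB, {verdict} in {GPU_RAM_GB} GB")
```

Lower precision means fewer bytes per parameter, which is why the same model can fit on a smaller GPU when its precision is reduced.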
9. Refine Quality: Engineer Inputs: In this lesson, we're
going to improve our LLM's capabilities
by changing our inputs. Generally, this is called
prompt engineering. We're now in the
third of three steps, and we're now refining quality. There are two substeps here. The first of which is
to engineer our inputs. If you haven't already, go to this URL to open the starter code that
I've written for you. If you have your
notebook from the last lesson still open, use the same notebook. If you're opening this
notebook for the first time, at the top of the
file, click Run All. The first two steps will take
about five to 10 minutes, so you can continue watching this walkthrough
in the meantime. Recall from our previous lesson, Llama 8 billion failed
to complete our task. The output's format and
accuracy were both off. Our first approach to fix
this is to be specific. Sounds like a silly tip, but just specifying the
format is enough. For example: "Provide three
lines of output. For each line, denote true or false. No extra text, no formatting." I've added this to your notebook already under a cell
titled tip one. In your notebook, scroll down to the structure inputs demo
section, like pictured here. Hover over the cell
that says tip one, then click on the Run
button that appears. After a few seconds,
you should see the following output,
true, false, true. The outputs are still incorrect. We expect true, false, false, but we fixed the format by just adding a few instructions
to the prompt. Our next approach is
to provide examples. Let's now add
three examples of reviews. Five out of five, the
chicken was salty, but good. Three out of five, the marinara was too sour. Three out of five, the
server wasn't patient. And finally, let's add
the desired outputs. The first two reviews
involve food critiques, and the last review involves
a non food critique. So we expect false, false, true. I've added this to your notebook already in a cell
titled tip two. Scroll down and hover your mouse over this cell titled Tip two. Click on the Run
button that appears, and after a few seconds, you'll see true, false, true. Unfortunately, the
outputs are still wrong, but we have one more trick. For the third approach, we will ask for
chain of thought. In short, ask Llama
to show its work. In our prompt to the model, we'll simply ask, "First,
reason step by step." Then we specify the output
format very precisely. As usual, I've added this
to your notebook already. Scroll down, hover over
this cell titled Tip three, and click on the Run
button that appears. After a few seconds, Llama
finally predicts correctly. We see the block of
thoughts we asked for and the correct outputs,
true, false, false. So we can now say that Llama 3.1 8 billion can complete
our task successfully. Recap, we applied three tips. Be specific, provide examples and ask for chain of thought. These three together
enabled our Llama model to successfully identify
non food critiques. Now, what if we use this new and improved prompt on our
models from before? Looks like DeepSeek V3 now successfully answers the
question with true, false, false, but Qwen 1.5 billion still
produces meaningless garbage. As a result, our
open source results now look like this. At a high level, when comparing OpenAI's pricing to
open source pricing, OpenAI is still offering a much higher
intelligence per dollar, although we now have a working
Llama 8 billion that can reliably produce correct
answers on free tier hardware, even cheaper than
all of the above. In summary, be specific, provide examples and ask
for chain of thought. That's it for the first substep: engineering our model's inputs. So in summary,
we've successfully improved Llama's ability to
complete our example task. In the next lesson, we'll make a final set of
improvements to make our open source model run robustly on large
amounts of data. As usual, you can
find all prompts and the starter code I
used at this URL.
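The three tips from this lesson can be combined in a small prompt-building helper. This is a sketch, not the notebook's actual code: the function name `build_prompt`, the wording of the task description, and the two reviews being classified are assumptions for illustration, while the three example reviews and their expected labels are the ones described in this lesson.

```python
def build_prompt(reviews, examples, chain_of_thought=True):
    """Assemble a prompt that is specific, includes examples, and optionally asks for reasoning."""
    lines = ["For each review, decide whether it is a non food critique."]

    # Tip 2: provide examples, each paired with the desired output.
    lines.append("Examples:")
    for review, label in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Answer: {str(label).lower()}")

    lines.append("Reviews to classify:")
    for i, review in enumerate(reviews, start=1):
        lines.append(f"{i}. {review}")

    # Tip 3: ask for chain of thought before the final answer.
    if chain_of_thought:
        lines.append("First, reason step by step.")

    # Tip 1: be specific about the output format.
    lines.append(
        f"Then provide {len(reviews)} lines of output. For each line, "
        "denote true or false. No extra text, no formatting."
    )
    return "\n".join(lines)


examples = [
    ("5/5 The chicken was salty, but good.", False),
    ("3/5 The marinara was too sour.", False),
    ("3/5 The server wasn't patient.", True),
]
# Hypothetical reviews to classify; replace with your own data.
reviews = ["2/5 The parking lot was full.", "4/5 The soup was bland."]
print(build_prompt(reviews, examples))
```

Keeping each tip as its own block in the helper makes it easy to switch one off and see which tip actually changes the model's answers.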
10. Refine Quality: Constrain Outputs: In this lesson, we're
going to improve our LLM's capabilities by
structuring our outputs. A word of warning. This lesson
features a lot of code. If you're uncomfortable
with code, just focus on the
takeaways that I discuss instead of
the code itself. This is our third
of three steps. In the last lesson,
we engineered our inputs to improve the quality of the
model's responses. In this lesson, we'll
improve quality instead by constraining
our outputs. If you haven't already, go to this URL to open the starter code that
I've written for you. If you have your notebook from the last lesson still open, use the same notebook. If you're opening this
notebook for the first time, at the top of the
file, click Run All. The first two steps will take
about five to 10 minutes, so you can continue watching this walkthrough
in the meantime. Recall from our
open source demo, Llama 8 billion failed
to complete our task. The output's format and
accuracy were both off. Let's reproduce
this result again. In your notebook, scroll down to the structure outputs demo. Hover over example
A, and click Run. You'll get this blob of text. It's both wrong and
incorrectly formatted. We want a list of true or false. To fix this, in the
previous lesson, we applied three tips. Be specific, provide examples and ask for chain of thought. After applying these tips, Llama produced the correct
outputs, true, false, false. However, correctness
came at a cost. The original prompt
took 288 characters. However, the new prompt
takes 1,312 characters. That's a whopping 4.5
times longer input. That's a lot of inputs. So can we improve
output format and correctness without increasing the number
of input tokens? And the answer is, of course, yes, let me explain.
Here's the LLM. It takes text as input and
produces text as output. In this case, our input
is "Are bananas fruits?" And unfortunately,
the LLM outputs "are." That's not even a valid
response to the question. To fix this, we force the LLM
to output only yes or no. That way, even if
the output is wrong, at least the output is valid. To do this, we need to
modify this step at the end, how LLMs translate
outputs into words. Let's zoom in. The LLM actually first outputs
a list of numbers. These numbers are
actually probabilities that correspond
to certain words. In our example, the
first probability is the likelihood of "yes," the second, the
likelihood of "no," the third of "are," and the last of "is." The highest probability is 60%, and the corresponding
word is "are." So we finally output "are." This is how the
LLM predicts normally, but our goal is to
output only yes or no. So let's make some changes. First, disregard
all other words. Consider only yes or no. And we now take the
higher probability, 10%, which corresponds to yes. Finally, we output yes. And with that, we've
successfully constrained our LLM to output
only yes or no, taking the highest
probability valid word. In this example, we
only wanted yes or no. So we use the probabilities of these two words and simply picked the more
probable of the two. Now our LLM is forced
to output yes or no, and more importantly, we did this without changing the
number of input tokens. Navigate back to
your notebook and scroll down to step zero setup. Mouse over the cell
and click Run. You won't see any output for
this step. You're now set up. For our first example, we will force true or false. Equivalently, we can say
that we'll force a boolean. A boolean is a true or false. For example one,
we ask the LLM, "Are bananas fruits?" and force the
output to be true or false. Now, hover over example
one and click Run. This will output a
single boolean true. This is correct.
Bananas are fruits. For our next example, we'll force three booleans
instead of just one. Now, we check if bananas, almonds, and
potatoes are fruits. Move your mouse over
example two and hit Run. Now you'll see the outputs
true, false, false. This is correct. Now let's
see if we can use this to constrain the output format
for our non food critiques. Hover over example
B, then click Run. And this outputs
three booleans. The output format is valid. Unfortunately, the output
is still wrong, though. It should be true, false,
false, not true, false, true. So let's keep going.
For our third example, we will force the model
to reason step by step. For this step, we'll prompt
the model to reason first. Then force the model to output
text between think tags. You can choose whatever
format you want for thinking. This is just the format I chose. Like before, the prompt is
to check whether bananas, almonds, and
potatoes are fruits. Move your mouse over example
three, then click Run. This will take some time to
run about 45 seconds or so. Then you'll see the final output with the correct reasoning and even better the correct
outputs true, false, false. Now, let's apply the
same structured outputs to our example project
for non food critiques. Hover over example
C and click Run. You'll see the reasoning as well as the correct
answer at the end. Now by structuring our outputs, we've improved the quality
of our model and forced the outputs to match a specific
format. That's a win win. We've also drastically
shortened prompt length. The original prompt
was 288 characters. From the last lesson, we
needed 1,312 characters. From this lesson, we only
needed 386 characters, along with structured outputs to achieve the correct answer. With prompt engineering,
we needed a 4.5 times longer prompt to
produce the right outputs. With structured outputs, we
have a much shorter prompt, just 1.3 times longer. And we're guaranteed the
output will always be valid. This isn't to say that one
is better than the other, but I want to emphasize that prompt engineering and
constraining outputs will actually be used together most often to maximize
model quality. Now, let's try constraining outputs for a proprietary model, namely OpenAI's models, as well. Go to platform.openai.com. Your webpage will
look like this. Make sure to click on Log in in the top right if you
haven't already. Then click on dashboard
in the top right. Click Create in the
center of the page, and you should see
a page like this. Click on the model Picker. From the drop down,
select GPT-5 nano and your screen should
now look like this. In the bottom right, make
sure to select Auto clear. Then next to the model name, click on the Settings icon. This will pop up
a settings menu. Click on the text format
Dropdown, and from the dropdown, select
JSON Schema so you can set up
constrained outputs. Make sure it's JSON schema
and not JSON Object. This will open up a
dialog like this one. Paste in the following, which you'll actually
get from this URL. Copy the constrained
output JSON schema. Then paste in the
JSON schema here. Scroll to the bottom of the
dialog and click on Save. We'll see the JSON schema
reflected in the menu here. Then click outside of the menu. From this URL, copy the project example
prompt from before. Paste the example project
prompt here and hit Run. And now we have the
output in JSON format. Try whatever random changes
you want to the prompt; the API will always return
outputs in this format. And with that, you've successfully constrained
outputs for ChatGPT. In summary, structuring outputs forces the output to be valid. We can force yes or no,
we can force numbers, we can force any particular
format we want for both open source and
proprietary models. We've discussed how it works, a simple example to
identify fruits, and an improvement for our detection
of non food critiques. That's now the end
of our third step. We've seen two ways to improve our model's quality without
training of any kind, and this concludes our lesson
on structuring outputs. And in fact, that concludes
our third of three steps. And now you finish just the start of your
journey building with AI. These are the first
steps that everyone should take for a
project involving AI. You can access this URL for
a copy of the starter code, other prompts, and
more resources. In the next lesson,
we'll wrap up the class.
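As a rough illustration of what a constrained output JSON schema for this project could look like, the Python sketch below defines a schema with one boolean per review and checks a response against it by hand. The field names `review_1` through `review_3` and the helper `parse_labels` are assumptions made for this example. The schema actually used in this lesson is the one provided at the course URL.

```python
import json

# Hypothetical schema: one boolean per review, no other fields allowed.
SCHEMA = {
    "type": "object",
    "properties": {
        "review_1": {"type": "boolean"},
        "review_2": {"type": "boolean"},
        "review_3": {"type": "boolean"},
    },
    "required": ["review_1", "review_2", "review_3"],
    "additionalProperties": False,
}


def parse_labels(response_text, schema=SCHEMA):
    """Parse a JSON response and check it against the schema's required boolean fields."""
    data = json.loads(response_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(schema["required"]):
        raise ValueError(f"unexpected fields: {sorted(data)}")
    for key in schema["required"]:
        if not isinstance(data[key], bool):
            raise ValueError(f"{key} must be true or false")
    return [data[key] for key in schema["required"]]


print(json.dumps(SCHEMA, indent=2))  # the schema as JSON text
print(parse_labels('{"review_1": true, "review_2": false, "review_3": false}'))
```

With constrained outputs the model should never produce a response that fails this check, but a small validator like this is still a cheap safety net when you process large amounts of data.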
11. Conclusion: You've made it to the
very end of the course. Let's close out with a quick
recap of what you've learned. First, scope your project. To do this, we broke this
down to three substeps. Narrow down your focus to the
general text to text task. Then narrow your focus down to the established
commercially proven text to text capabilities, summarization, structuring
data, and coding. Finally, define your inputs, outputs, your task,
and your metric. As an example, this was the project description for
detecting non food critiques. For our second step, evaluate models, so we
can pick one to use. To do this, we again
have three substeps. At a high level, start from the best, most intelligent AI. Test models via chat bot, the simplest, most user
friendly interfaces, then test models
via an API so that your results are more
repeatable and reproducible. Finally, optimize
for cost by testing cheaper and faster models to see if they complete
your task successfully. Once we've picked a model, we then refine quality. To do this, we had two substeps. Start improving quality
by refining your inputs. We call this prompt engineering. Then constrain the outputs. This can further improve reliability and
quality of your model. And that completes our
three step process, which you can now use to build
any project involving AI. This is just the start, but
it's a solid foundation to understanding how well AI
can accomplish your task. Now, using the above
process as a guide, apply this to your
own project and post the result in the
course's Project tab. Feel free to use one of my example projects or
create one of your own. I'm very excited to
see what you create. If you'd like to hear more
about follow up classes, follow me on Skillshare and
check out my other courses. Congratulations on making it to the very end of the course.