Transcripts
1. Introduction: At some point you
may have heard about GPT, maybe ChatGPT, GPT-4, you may have
heard Microsoft call it a spark of artificial
general intelligence. There's a lot to digest. Let me help you do just that. Hi, I'm [inaudible], a research scientist at a large company. I've been conducting research
in AI for six years, previously at materiality labs and Tesla Autopilot, and received my PhD in AI at UC Berkeley. In particular, I
studied how to make neural networks run really fast. In this course, I hope to use my background to break down cutting-edge technology into digestible,
intuitive concepts. Throughout, I'll
use an illustration-first approach to conveying intuition. No walls of text,
super complex diagrams or even a smidgen of math. My goal is to help you
build the foundations for understanding ChatGPT and
its related technologies. The material in this course
assumes you've taken my free AI beginners course, linked in the description. Beyond that, no technical
background is required. Whether you're an
engineer, a designer, or anyone else curious to learn, this course is made for
you. Let's get started.
2. Why understand ChatGPT: Here's why you should
understand ChatGPT. We'll start with the benefits. The biggest benefit to
taking this course is to understand discussions
around the topic. There are tons of random related terms all across the web: transformers, large language models, ChatGPT, GPT this, GPT that. Our goal is to know how these different terms are all related. We'll cut through the marketing clutter and jump straight to technical understanding; the jargon doesn't need to be intimidating. With this knowledge,
you can then understand the latest
innovations in the area. What does it mean to make
a transformer faster? What even is a transformer? To summarize, here
are two benefits to understanding how ChatGPT works. Knowing the terminology to hold discussions on the
topic, and knowing how to read and understand the news as you learn more about the topic. To be clear, we won't exhaustively cover all
terms or all topics. However, this course gives you a foundation for learning more; basically, we'll
cover the intuition and the big ideas
behind the technology. A big part of this is
knowing the terminology, so that will be our focus. Let's now jump straight
into the content. Here's an overview of ChatGPT
and related concepts. First, ChatGPT is a product. It's specifically the name
of OpenAI's product. Second, large language models are the technology more broadly: a technology that can take in text input and generate high-quality, natural-sounding text as output. This is just like
Kleenex and tissues. Kleenex is a specific brand; tissues are the
generic product name. In this case, ChatGPT is a
specific trademarked brand. Large language models are
the general technology. As a result, these
large language models, or LLMs for short, are the focus of our course. Finally, transformers are the building blocks of
large language models. We'll be focusing on these building blocks moving forward. Said broadly, our
goal is to dissect the intuition behind
large language models. I'm going to present a simplified
version of these models that conveys the key
ideas for why they work. No need for big
complicated diagrams or unnecessary math equations. We'll have fairly
straightforward diagrams that stick to the main points. Here's a brief introduction
to large language models. In our AI masterclass, we discussed compartmentalizing
our ML knowledge into four categories: data, model, objective, and algorithm. Data describes the
inputs and outputs, what we learn from
and what we predict. Model describes how
to make predictions. Objective describes the goal, what the model is
optimizing for. Finally, algorithm describes
how the model learns. We haven't discussed the algorithm much, and we'll again skip it this time around. For large language models, the input is text. This can be from websites, books, online forums, and more. The model transforms this text, using the input text to generate output text. The specific objective for a model is to predict the next word given the previous words. We'll break down this process later in the course and, like before, we'll skip over the algorithm.
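If you like seeing this concretely, here's a minimal sketch in Python of what "predict the next word given the previous words" looks like as data. The sentence is made up purely for illustration, and the model itself isn't shown; the point is just what the inputs and the prediction target are.

```python
# A made-up training sentence, split into words.
text = "the cat sat on the mat"
words = text.split()

# Each training example pairs all the previous words with the next word.
for i in range(1, len(words)):
    previous_words = words[:i]
    next_word = words[i]
    print(previous_words, "->", next_word)

# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> on
# ...and so on. A large language model learns to make exactly this kind of
# prediction over enormous amounts of text.
```

That's the entire objective: given everything so far, guess what comes next.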
To recap, we've discussed the benefits of understanding ChatGPT in the context of fast-moving news and discussions. We've also briefly
introduced ChatGPT, the product versus
the broader class of technology, large
language models. That's it for the overview. For a copy of these slides
and more resources, make sure to check out
the course website. This is it for why you
should understand ChatGPT. Let's hop right into the
technology so you're well equipped to understand
the barrage of information out there.
3. How to “compute” words?: In this lesson, we'll discuss
how to "compute" words. Here's an example
of what I mean. Take the following equation. We have king - man + woman. What do you expect this to equal? Pause the video
here if you want to take a second to think. One reasonable output
would be queen. This is what we'd expect if
we could apply math to words. However, we can't
really do this. There's no such thing as
adding words together, so we need to convert words to numbers which we can
add and subtract. Let's use a specific example. In this case, we want to translate from
French into English. We have French in purple
on the left, which translates into "I love you"
in green on the right. To simplify this example, we'll focus on just translating
one word at a time first. In this case, we focus on
translating Je into I. To do this, we need to
convert Je into numbers. Fortunately for us,
there already exists a dictionary of sorts that
maps words into numbers. That mapping is
called word2vec. Word2vec defines a mapping
from words to vectors. Vectors are simply
collections of numbers. From now on, we'll refer to a collection of numbers as a vector. Here's an example: here we have the French word Je. Word2vec maps this word to the vector (0.8, 0.1, 0.1), which is illustrated below with some boxes in purple. We can also have the English word I. Word2vec maps this to (0, 0.1, 0.9). Finally, we have You. Word2vec maps this to (0.1, 0, 0.9). Note that this mapping is just an example. In reality, word2vec uses 300-dimensional vectors, meaning each word corresponds to 300 numbers, way too many to show here.
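If you're comfortable reading a little Python, here's a rough sketch of such a mapping. The three-number vectors for Je, I, and You are the toy values from this lesson; the vectors for king, man, woman, and queen are made up so that the king - man + woman question from the start of this lesson works out. Real word2vec vectors are learned from data and have 300 numbers each.

```python
# Toy word-to-vector mappings. The Je / I / You numbers are the example values
# from this lesson; the king / man / woman / queen numbers are invented so the
# arithmetic below works out. Real word2vec vectors have 300 dimensions.
word2vec = {
    "Je":    [0.8, 0.1, 0.1],
    "I":     [0.0, 0.1, 0.9],
    "You":   [0.1, 0.0, 0.9],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.9, 0.9, 0.1],
    "king":  [0.9, 0.1, 0.9],
    "queen": [0.9, 0.9, 0.9],
}

def nearest_word(vector):
    """Return the word whose vector is closest (smallest squared distance)."""
    def distance(word):
        return sum((a - b) ** 2 for a, b in zip(vector, word2vec[word]))
    return min(word2vec, key=distance)

# "Math on words" is really math on their vectors: king - man + woman.
king, man, woman = word2vec["king"], word2vec["man"], word2vec["woman"]
result = [k - m + w for k, m, w in zip(king, man, woman)]

print(result)                # [0.9, 0.9, 0.9], up to floating-point rounding
print(nearest_word(result))  # queen
```

That nearest-vector lookup at the end is also how we'll turn vectors back into words later in this lesson.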
Let's now use these vectors in our translation example. First, we translate Je into the corresponding word2vec vector, shown on the left in purple. Some mysterious computation in the middle is performed, then we get another
vector in green. This vector in green then corresponds to the
English word I. Now, let's discuss what goes in that box with
a question mark. How are these vectors
transformed? Here's how. That box's goal is to run
meaningful computation. Here's what I mean. This was the example
we had previously, where king minus man plus
woman equals queen. Here's how we can
actually add and subtract numbers or add and
subtract words even. Start with the words
on the left-hand side. We have king, man
and woman. Translate each word into its
corresponding word2vec vector. This gives us three vectors. Then starting with the
king vector on the left, subtract the man vector, add the woman vector. Doing this gives us a
new vector on the right, and that resulting vector happens to correspond to queen. So this is what we
really mean when we "perform math on words." In reality, we're
performing math on the vectors these
words correspond to. So now we can abbreviate the entire process by just
writing this equation. This equation by the
way is a real result. If you looked up the
word2vec mappings, you would actually be able
to compute this equation. Addition and subtraction in this vector space
has real meaning. So knowing that, we can now
fill in our mystery box. Given the input to
translate into English, we subtract the French vector
and add the English vector. More generally to accommodate any task we can
represent any addition, multiplication,
subtraction, etc. This small graph in
the center represents something called a
multilayer perceptron. We won't dive into this much. Just think of this small
graph in the center of our figure as any set of adds, multiplies, subtracts and more. This allows us to
represent any word to word translation task
with this architecture. Now, notice that our
pipeline ends with a vector. We need to convert that
vector back into a word. So for our last goal here, we want to convert
numbers back into words. Here's our pipeline from before. We ended up with some vector. To convert back into a word, we'll find the closest
vector that maps to a word. In this case, our closest
vector maps to the word I and with that we've
finished our pipeline. Let's now recap what we
did from start to finish. First, we converted
words into vectors, then we transform those vectors. Finally, we transformed
vectors back into words. Here is our final diagram. On the left we converted
the word Je into a vector, then we perform some computation in the middle with that vector. This outputted another
vector in green and we then looked up
the closest vector that corresponded to a word. That finally led
us to the word I, and that completes our pipeline, translating from one French
word into one English word. Pause here, if you'd
like to copy down this figure or take a
moment to digest or recap. So we've converted one
word into another word, however we ultimately
want to convert many input words into many output words as we
show here in our example, that will be our next lesson. For a copy of these slides
and more resources, make sure to check out
the course website. That's it for running
computation on words. Now you understand the basics of how large language models run computation on inputs
to produce outputs. In the next lesson, we'll learn how large language
models take in multiple input words and
generate multiple output words.
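Before moving on, here's that one-word pipeline as a short Python sketch. The vectors are the toy numbers from earlier in this lesson, and the "mystery box" is just a made-up fixed shift; in a real model, that middle step is learned computation.

```python
# Step 1: convert the input word into a vector (toy numbers from this lesson).
french_vectors  = {"Je": [0.8, 0.1, 0.1]}
english_vectors = {"I": [0.0, 0.1, 0.9], "You": [0.1, 0.0, 0.9]}

vector = french_vectors["Je"]

# Step 2: the "mystery box". Here it is only a made-up fixed shift that nudges
# the French vector toward the English ones; a real model learns this step.
offset = [-0.8, 0.0, 0.8]
vector = [v + o for v, o in zip(vector, offset)]

# Step 3: convert the vector back into a word by finding the closest one.
def nearest_word(v, table):
    def distance(word):
        return sum((a - b) ** 2 for a, b in zip(v, table[word]))
    return min(table, key=distance)

print(nearest_word(vector, english_vectors))  # I
```

Word in, vector in, some computation, vector out, word out: that's the whole single-word story.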
4. What is a "transformer"?: In this lesson, we'll cover
what a transformer is. The transformer
allows us to take in multiple input words and
generate multiple output words. This is what we've
explained so far. We've converted one French
word into one English word. Now, we want to convert multiple French words into
multiple English words. To do this, we'll
modify our goal. Instead of translating
from one word to another, we'll translate from
the previous word and predict the next word. To do this, we'll change
this diagram into this one. Now, our model takes
in the French phrase and the previous word
shown below in italics. With both of these inputs, the model predicts
the next word, shown in italics on the right. The purple text, in
this case, the French, is what we call the prompt to distinguish between the
two types of inputs. Let's now run this
diagram on our inputs. To generate the first word, we pass in the prompt and
a magical start word. We've denoted the start of sequence word as a start
quotation mark here, for simplicity, in reality, the start token is
some unreadable tag. That magical start token, along with the prompt, then produces the first word I. To generate the next word, we again use the prompt
in purple on top. On the bottom, we now feed
in both previous words, the start-up sequence
represented by a quote and the previous word I. Now we predict next word, love. We do this again. We feed in the prompt on top. On the bottom we feed
in all previous words, the start of sequence
quote and I and love. All of these inputs produce
you. One last time. We feed in the prompt on top. On the bottom, we feed in
all the previous words, the start of sequence then
I, then love, then you. All of these inputs produce one output word,
end of sequence. This end of sequence
is denoted as an end quote in our
diagram and that's it. Once we see the end of
sequence word, we are done. We have now generated a sequence of multiple
output words. We call this process
autoregressive decoding. Autoregressive decoding
predicts the next word one at a time until we reach
the end of sequence word. Here's an overview of
the entire process. This was the generic version
of our diagram from before. On top, we have our
prompt in purple. Below we have all
previous words. These inputs then pass through the mystery box to
produce the next word. Now, we fill in the mystery box. We'll fill in this box using the same process we did before. First convert all
words into numbers. We feed in every prompt, every word in our
prompt one-by-one. We also feed in the
start-up sequence word, which in this case is again
denoted by the start quote. All of these inputs are first
converted into vectors. Somehow our mystery
box then outputs a vector which corresponds
to the next word I. Next, we need some way to
incorporate "context." Effectively, our previous word, which is the start quote, needs contexts from
the prompt to be translated into the
correct first word. Here, I'm using the term
context very vaguely. Somehow automagically,
we need to incorporate information from the
purple prompt into the green vector representing
the previous word. In this case, we'll
incorporate context by simply taking a weighted
sum of all the vectors. This produces just
one final vector which we feed into
the mystery box. Next, we add computation, just like we did in
the previous lesson. We replaced that mystery box
with any number of adds, multiplies, subtracts, etc. This is represented
by a small graph in the center of our figure. Like before, this graph formerly represents a
multi-layer perceptron. But we won't need to
know the details of the perceptron to
understand what's going on. We've now successfully
converted our prompt and the start of sequence
into the first word I. Do the same thing we did before. Predict the next word one at a time from all the
previous tokens. This is the exact same process. Next, we take the prompt,
the start of sequence, and the previous word
I, taken altogether, this produces the
next word, love. We continue this
process iteratively. Finally, we get the end of sequence word as output
and we stop here. This now completes our pipeline. We can now take in
multiple words as input and predict multiple
words as output. We've added two new
concepts in this lesson. We predict the next word one at a time until we reach the
end of sequence word. We also add context by incorporating the prompt
into the previous words. We added context in this case by simply using a weighted sum. This was our final
diagram from before. On the far left, we convert the purple prompt into vectors. We also convert the previous
words in green into vectors. We then incorporate context from the prompt into
the previous words by taking a weighted sum. This weighted sum
is then fed into a multi-layer perceptron
to perform computation. That computation produces
a vector, and like before, we find the nearest vector
that corresponds to a word, which in this case is the
end of sequence word. This now completes our
pipeline for converting multiple input words into
multiple output words. There is one detail
we've left out, which is how this
weighted sum is computed. So that you have another term in your pocket, this weighted sum, which adds context, is more formally called
self-attention. Stay tuned as I plan to
either add a new lesson to this course or to release a new mini-course
describing how this works. For now, you understand the
entire multi-word pipeline. For a copy of these slides
and more resources, make sure to check out
the course website. That concludes our introduction
to the transformer. You now understand the intuition for how a transformer works and how to generally produce output texts from an input text. Note that we've
left out a number of details in the architecture, but this is a minimal
representation of the key ideas and a good starting
point for learning more about large
language models.
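To tie the whole lesson together, here's the multi-word pipeline as one short Python sketch: convert words to vectors, mix in context with a weighted sum, run a small computation, pick the nearest word, and repeat until the end-of-sequence word. Every number is made up, the fixed weighting rule is only a stand-in for self-attention, and the tiny "computation" stands in for the multilayer perceptron; a real transformer learns all of these pieces from data.

```python
# Toy vocabulary: each word gets a simple five-number vector (made up).
vocab = {
    "<start>": [1, 0, 0, 0, 0],
    "I":       [0, 1, 0, 0, 0],
    "love":    [0, 0, 1, 0, 0],
    "you":     [0, 0, 0, 1, 0],
    "<end>":   [0, 0, 0, 0, 1],
}
# Two made-up vectors standing in for the words of the French prompt.
prompt_vectors = [[0.2] * 5, [0.2] * 5]

def nearest_word(vector):
    """Convert a vector back into a word by finding the closest vocabulary vector."""
    def distance(word):
        return sum((a - b) ** 2 for a, b in zip(vector, vocab[word]))
    return min(vocab, key=distance)

def next_word(previous_words):
    # 1. Convert the previous words into vectors.
    previous_vectors = [vocab[w] for w in previous_words]

    # 2. Add context with a weighted sum. This made-up rule gives the most
    #    recent word most of the weight and splits the rest evenly over the
    #    prompt and the older words; choosing these weights well is what
    #    self-attention does in a real transformer.
    others = prompt_vectors + previous_vectors[:-1]
    weights = [0.4 / len(others)] * len(others) + [0.6]
    vectors = others + [previous_vectors[-1]]
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(5)]

    # 3. A tiny made-up computation (just re-arranging the numbers), standing
    #    in for the multilayer perceptron.
    transformed = [context[-1]] + context[:-1]

    # 4. Convert the resulting vector back into a word.
    return nearest_word(transformed)

# Autoregressive decoding: predict one word at a time until we hit "<end>".
words = ["<start>"]
while words[-1] != "<end>":
    words.append(next_word(words))

print(words[1:-1])  # ['I', 'love', 'you']
```

The exact numbers don't matter; the shape of the process does: vectors in, a context-weighted mix, some computation, nearest word out, repeated until the end-of-sequence word.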
5. Conclusion: Congratulations on making it
to the end of the course. You've now covered
the fundamentals of large language models, effectively how ChatGPT works. We've discussed a number
of different terms: word2vec, which maps words into a vector space where addition, subtraction, and so on are meaningful; autoregressive decoding, which is how transformers produce multiple output words by generating one word at a time from all the previous words; transformers, the building blocks for large language models; and large language models themselves, the general technology, as opposed to the specific brand and product, ChatGPT. You now have the tools to
understand conversations about the field and a
high-level intuition for how large
language models work. Remember that there
are more resources and a copy of the slides
on the course website. If you have any questions, feel free to leave them
in the discussion section. I'll also leave links with more information in the
course description. If you'd like to
learn more about AI, data science or programming, make sure to check
out my courses on my Skillshare profile. Congratulations once
more on making it to the end of the course
and until next time.