Transcripts
1. Introduction: Hi and welcome to this course, Music Generation with MusicLM. My name is David Armendariz. What is this class about? There is rapid growth in AI development, especially notable in generative AI, and music generation is part of generative AI. There's a new Google model called MusicLM. Its launch date was January 2023, and we're going to focus on exploring MusicLM's capabilities via AI Test Kitchen.
What you will learn: learn what MusicLM is, learn what MusicLM is capable of, and test out MusicLM.
About me: I'm a software engineer and mathematician. I'm a data science student, an AI enthusiast and a music lover. I hope you enjoy this course.
2. What is MusicLM: In this lecture, we're going to learn what Google's MusicLM is. MusicLM is revolutionizing text-to-music generation. It was presented by Agostinelli et al. in a paper from 2023, so it's very recent. What it's capable of is generating high-fidelity music from text descriptions.
The technical details: it's based on another model called AudioLM, and it's capable of producing several minutes of music at 24 kilohertz. Right now there are other AI tools like ChatGPT, but they are not able to generate music as of now, December 2023.
They also released a public dataset called MusicCaps. The purpose of releasing this dataset is to aid in model development and research extension, so other people can help Google enhance this model. Its descriptions were manually created by professional musicians. You can also use this dataset to train your own model. We're not going to learn how to do that, because it takes a lot of AI knowledge.
They also focused a lot on responsible development, in particular on preventing misuse of creative content. What does this mean? They adopted methods from a paper by Carlini et al. to ensure uniqueness in generated music compared to the training data. That means the generated music is not going to be similar to the training data they used for MusicLM.
Now, MusicLM has a website
that we're going to see right now to see some examples of what it's capable of. If we go to that website, we're going to see the paper, which you can read on arXiv. You can see the dataset that I talked about, which is MusicCaps. On the website, you can see all of the examples that MusicLM is capable of generating.
Let's see. We have audio generation from rich captions. The caption here is: the main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls. Let's hear the example for this main soundtrack of an arcade game. You can actually think about it and feel as if you are playing a game from the '90s.
There's this other example: a fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. It induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable. That's pretty interesting. Let's see what this sounds like. Yeah, that's very specific, and I think it did a good job trying to transmit that experience to the user. Let's see some other
examples. Long generation. Well, you could see here that these sounds were only 30 seconds, but it can generate up to 5 minutes. Let's see, for example, relaxing jazz. Okay, so these are 5 minutes of relaxing jazz. As you can see, I was testing at different times whether it sounded like the same thing just repeated over and over, and that's not the case. It's actually different at different times. It's able to generate long sounds like that.
Then, this is my favorite feature out of all the examples that we have here: the story mode. The audio is generated by providing a sequence of text prompts. These influence how the model continues the semantic tokens derived from the previous caption. I don't know why I like this a lot, but you can actually have a song generated from a story. For example: time to meditate, time to wake up, time to run, time to give 100%. Or: electronic song played in a video game, meditation song played next to a river, fire, and fireworks.
Let's listen. As you can see, the song was like a video game until second 15. It says 15 here, but I actually looked and it was more like 19, but that's okay. And then from there, it changed the tonality to something more relaxed, and it was actually like meditation next to a river. After that, it was not like fire. I didn't feel like it was fire, but more like some voices that were trying to be put into the song. That happens a lot. I've been experimenting with this, and sometimes it tries to put in voices. They are voices that don't actually say anything, so don't expect this to generate lyrics, but they are voices that it tries to put in there. I think that was the case in this fire prompt. I don't know if you felt it as well.
I also like this combination here, because it reminds me of Bohemian Rhapsody, the song by Queen. Let's listen to this one as well. Well, again, this is a clear example of the AI trying to put voices into the song. That will happen. I don't know if it happens most of the time, but I've seen it very frequently. These voices are not intelligible. They are just gibberish, because they don't say anything, but you can hear them. Then there's this text
and melody conditioning, where you can add a melody that will be fixed throughout the song. Then we can start changing the song itself while maintaining this melody. For example, let's see Bella Ciao and Jingle Bells whistling, with a guitar solo as a constant, or a piano solo. As you can see, the piano solo and the guitar solo were constants. The text prompt said: first put Bella Ciao, Jingle Bells, and then some whistling. Okay, it's basically the constant.
Then we have this one, which I think is also very interesting: painting caption conditioning. We have the painting title and author, The Persistence of Memory by Salvador Dalí. This is the image, just as a reference, from Wikipedia. And we have the painting description. Basically, this is something that models like ChatGPT are able to do: you can now upload an image and it will give you a description of the painting, and then you can generate the audio. Let's see what The Scream sounds like. Okay. I'm going to be honest, I didn't expect this painting to sound like that. It sounds like, I don't know, a Pink Floyd song. Then we have like audio
generation from tags: 10 seconds of instruments. For example, the cello. Let's see the flute. That sounded a little bit like the Titanic song. We have genres. For example, let's see British blues. That's more common, I guess. Yeah, that sounds like blues.
Musician experience level. I don't know why you would want to put a beginner piano player into a song, but let's see how that sounds. Definitely sounds like me. And a crazy fast professional piano player. Yeah, that sounds like a fast, professional piano player.
And places. This is also one I like a lot. I'm going to play the example of the gym, because it generates a really good example. Yeah, that is definitely better music than what they play in my gym. I guess you could use this to play some music there.
Epochs. You can also use epochs, like, for example, a club in the '80s. Let's see how that sounds. Yeah, that definitely sounds like a club in the '80s. Well, I was not born in that epoch, but I've heard songs from the '80s, of course, and that sounds like something they would play in a club in the '80s.
Let's also see this feature of MusicLM, which is generation diversity. This means that it can generate multiple examples for the same prompt, as we are also going to see in AI Test Kitchen. For the same text prompt, let's see, we have this prompt saying motivational music for sports. That's one example, and another example would be this. Okay, yeah, they are different examples for the same text prompt.
These are all of the examples of what MusicLM is capable of. I'm going to say that not all of these features are available in AI Test Kitchen. In fact, as of now, we can only test audio generation from text. Let's test that in the next lecture. I hope you like it.
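To put the 24 kHz figure from this lecture in perspective, here is a small sketch of how many audio samples a clip of a given length contains. It is only arithmetic and not part of MusicLM; the helper name `num_samples` is hypothetical, and the code assumes single-channel (mono) audio.

```python
SAMPLE_RATE_HZ = 24_000  # sampling rate mentioned in the lecture (24 kHz)


def num_samples(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the number of mono samples in a clip of the given duration."""
    return int(round(duration_seconds * sample_rate_hz))


# The website examples are about 30 seconds long; long generation goes up to 5 minutes.
print(num_samples(30))      # 720000
print(num_samples(5 * 60))  # 7200000
```

So a five-minute track means the model has to produce ten times as many samples as a 30-second one.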
3. Trying out MusicLM: Now we're going to actually test MusicLM. The only way, as of now, December 2023, is to test it via this website, aitestkitchen.withgoogle.com. You can only sign in with Google. This website is also only available in certain countries: the US, Kenya, New Zealand, and Australia. But you can easily use a VPN, like I am, in order to test this website.
If you click on this dropdown and go to music, then you will have a text box in order to put the prompt. Here, you will have the generated songs. You also have the Settings button. Okay, this Settings button has three settings.
The first one is a seed. This is a random number that you can put in here. After you put in your prompt, you can put in your random number, or it's automatically generated for you. You can click on this button here to lock that seed. That means that, given a prompt and given this seed, you will be able to generate basically the same output. Because remember, generative AI can be very random. If you want to avoid that randomness, then you can lock the seed and use the same prompt. There's also a parameter called temperature, which we don't have here, that makes the output more consistent.
Also, we have this track length. Remember that we could generate up to 5 minutes, but this only allows us to generate up to 70 seconds. I guess that's because a lot of people might be using this tool, and generating a five-minute song takes more computing resources. They are offering this website for free, and we don't want to use all of their computing resources for free.
We also have looping, which is a feature that stitches the beginning and end of your track to make your music endless. Remember that example where we had the arcade game, which needed to be endless. Well, this means that when the song ends, the ending is going to be similar to the beginning of the track. That's very useful for things like background sounds for video games. Those are the settings
that we have here. We have the I'm Feeling Lucky button. Let's see what happens if I click here. Ambient, soft sounding music I can study to. This is going to generate some music. And this is another example. So as you can see, it generated two examples here. We also saw in the example outputs that it could generate multiple examples for the same prompt. In this text box we have the chips. We can hover over these words and generate different things.
I'm going to start over and generate my own track. I like bachata a lot. I'm going to say: a modern bachata, it has to be slow first, then fast, and then slow again. It has to be danceable, a little romantic. Okay, let's see what this generates for me. Again, it's identifying what things I can change or vary. So yeah, I like this, but I think the beat of the bachata is being overlapped, maybe by the romantic part. Let's get rid of this. Maybe we're putting too many constraints on this prompt, so let's generate this again. I like this a lot more. Let's see the other example it gave. Yeah, I like this one better. I think I can dance to this.
Well, you now have a tool to generate your own songs, given a prompt. I hope you like this video. See you in the next lecture.
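The seed setting described in this lecture can be illustrated with a toy sketch. This is not how MusicLM works internally; `fake_generate` is a hypothetical stand-in that picks random note numbers, and it only shows the general idea that the same prompt plus the same locked seed gives the same output.

```python
import random


def fake_generate(prompt, seed, length=8):
    """Toy stand-in for a generative model (hypothetical, not MusicLM).

    The random generator is seeded from the prompt and the seed, so the
    same prompt with the same seed always yields the same notes.
    """
    rng = random.Random(f"{prompt}|{seed}")
    # Pick MIDI-style note numbers between 60 and 72 inclusive.
    return [rng.randint(60, 72) for _ in range(length)]


prompt = "a modern bachata, slow first, then fast"
first = fake_generate(prompt, seed=42)
second = fake_generate(prompt, seed=42)
print(first == second)  # True: locked seed and same prompt repeat the output
```

Changing either the prompt or the seed reseeds the generator, which is why an unlocked seed gives a different track on each run.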
4. Trying out TextFX: We are again in AI Test Kitchen. There's another tool here called TextFX, which supercharges your writing process with AI-powered language tools made in collaboration with Lupe Fiasco. If I launch this tool, we have all of these ten tools. This is something that can also be done with ChatGPT. It's not something very innovative like MusicLM, but it can still be useful for people who want ideas out of this AI tool.
For example, Acronym creates a phrase using the letters of a given word. For example, if I type the word hamburger. Let's see what this returns. Here we have a parameter called temperature. I think I told you about this last lesson. If you set temperature to zero, then the output is going to be less random. It's going to be almost consistent 100% of the time. If you set temperature equal to one, it's going to be something random every time you run this. 0.7 is a decent default. Many AI models use 0.7 as a default.
Let's run this for hamburger. It's: happy animals made by great humans, eat really good burgers; or: have a meal body really good; or: having a meal, being energized, getting rid of bad moods and joint relaxation. I think this could be used by some restaurant that sells hamburgers. This can be their logo
or something like that. It's very creative.
Alliteration: find words in a category that start with a chosen letter. For example, fast food that starts with H. I guess it will find hamburger. Hamburger. Hard shell tacos. Yeah, it was pretty obvious it was going to give me hamburger.
We have Chain: build a sequence of words where each word relates to the last one. Again, let's put hamburger and see what happens. Hamburger, bread, sandwich, meat, steak, grill, fire, heat. It went from this word to heat by making a sequence of words, where each word was related to the last one. Hamburger, bread, basket, grocery store, cashier, customer, bill. It went from hamburger to bill. Hamburger, bread, dough, flour, bakery, shop, store. These are all related words. This is very useful for rap lyrics, I guess. In the example they give you, Lupe Fiasco, in this video that you can watch, is a rapper, a lyric writer. He uses this tool a lot to generate lyrics, also using human intelligence.
Explode: create phrases that sound similar to a given word. Again, hamburger. Let's see what this does. Hamburger: A fat big who eats hamburger. A sandwich with ham and a burger. Hamburger, a type of sandwich with ham and cheese.
Okay, Fuse: find similarities between unrelated things. Let's see hamburger and the moon. Let's see what the similarities between these two concepts are. Both a hamburger and the moon are round and can be eaten with a fork and knife. The moon can be eaten? Both a hamburger and the moon are round and often associated with food. Both a hamburger and the moon can be associated with roundness and fullness: a hamburger with its round shape and the moon with its full face. Yeah, I guess you can
be very poetic with this tool. Here, POV, let's see. Let's talk about fast food. This evaluates a topic through different points of view. Fast food is a cheap and convenient way to feed a family. Fast food is a convenient way to get a quick meal. Fast food is a delicious and convenient way to eat.
Now Scene: generate sensory details about a scene. Again, eating a hamburger in a hotel. I don't know what this is going to generate. A dry, overcooked burger patty. A hamburger that is so dry it cracks when you bite into it. A sticky plastic bun. That's what the AI imagines when you are eating a hamburger in a hotel.
Simile: you give it a thing or a concept, and it's going to create a simile about that thing or concept. Let's see hamburger. A hamburger is like a pizza that lost its way in life. Why? A hamburger is like a pizza with a hat on. A pizza? Well, you can think about it like that. Yeah, it's like a pizza with a hat on. A hamburger is like a pizza: it has a bun, meat and cheese, and it's delicious. I guess the AI is right.
Next, make a scene more creative. Imagine a person eating a hamburger in a hotel. Let's see what the AI imagines. A person eating a hamburger in a hotel that's floating in the middle of a lake. A person eating a hamburger in a hotel that is located on the moon. A person eating a hamburger in a hotel that is made out of gingerbread. These are fictional, unexpected things.
Unfold: identify words and phrases that contain a given word. Hamburger again. Hamburger says: back of the hand, bowl of confusion, bowl of jarneauess. This is a little bit more unexpected.
Anyways, this tool, TextFX, can help you if you are a professional writer by giving you ideas for the lyrics of the song you just made. It's something that you can also do with ChatGPT, but this gives you a nice UI to do all of these things.
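The temperature parameter mentioned in this lecture can be sketched with a common technique, softmax sampling with temperature. This is a minimal illustration and not TextFX's actual implementation; the function names, the candidate words and their scores are all made up for the example, and treating a temperature of zero as "always pick the top word" is an assumption.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_word(words, scores, temperature, rng):
    """Pick one word; temperature 0 is treated as always taking the top score."""
    if temperature == 0:
        return words[scores.index(max(scores))]
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(words, weights=probs, k=1)[0]


words = ["hamburger", "hot dog", "hash browns"]  # made-up candidates
scores = [2.0, 1.0, 0.5]                         # made-up model scores
rng = random.Random(0)

print(sample_word(words, scores, 0, rng))      # hamburger
print(softmax_with_temperature(scores, 0.1))   # almost all weight on the first word
print(softmax_with_temperature(scores, 1.0))   # weight spread across all three
```

With a low temperature almost all of the probability sits on the highest-scoring word, so repeated runs look the same; with a temperature of one the other words get picked regularly, which is the extra randomness described above.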
5. What is Stable Audio: We have some alternatives to MusicLM, and I'm going to talk about Stable Audio. First of all, generating music is not an easy task from a technical point of view. Stable Audio was developed by the same people who created Stable Diffusion, so they have experience doing this kind of thing. It uses the Stable Audio AudioSparx v1.0 model. They are working on a new model, version 1.1, which I think is going to be more powerful. In the free version, you can generate up to 45 seconds of a song.
Let's take a look at this website, which is stableaudio.com. You can create a free account and then go to the generate section here. As you can see, we have up to 20 songs per month. If you go to the pricing, you're going to see the free version: 20 monthly track generations, you can generate up to 45 seconds, and the license is for non-commercial use. If you're a professional, you pay $12 a month and you can generate up to 500 of these tracks. The tracks can be up to 90 seconds, and they can be used commercially. If you're an enterprise, then you have to get in touch with these people so that they can set your price. That's the pricing section.
The user guide tells you, first of all, some examples
of what this can do. As we saw on the Google website, you can explore all of these examples by yourself. Use Stable Audio to generate full musical audio, encompassing a range of instruments. Include as much detail as you can. As you can tell, the more details you put into the prompt, the better the result. You can also ask for individual stems, sound effects, et cetera.
I like that they are more explicit under the interface guide. This is the interface they're describing. For example, steps: it tells you the number of generation steps used to create your audio track. A higher step count means greater processing, and this can increase the quality of your audio slightly. They have found that 50 is the sweet spot.
Number of results: you can generate a maximum of five at a time. But if you put four, this will cost you four tracks when generating. So be careful with that, because if you put five for every prompt, then in the free version you will only be able to run four prompts a month.
The seed: I already told you what the seed is. By default, this input is set to random, but you can put any number here. By using the same prompt and the same seed, you're going to have consistent outputs. The prompt strength controls how closely the model attempts to guide the audio to your text prompt.
They have a blog post for the model they are using, the one that I told you about, AudioSparx v1.0, if you are interested in the technical details. Here we also see the licensing scheme. As a free user, you can use the Stable Audio sample in your own music, but only as a paid user can you use it commercially. You can't train AI models on the generated audio, because that goes against their terms of service.
They have, I guess, a better user guide on how to use this. In the next lecture, we're going to test out Stable Audio to see if it generates better results.
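The warning about the number of results can be checked with a bit of arithmetic. This sketch uses the figures quoted in this lecture (20 tracks a month on the free plan, 500 on the professional plan, at most five results per prompt); the function name `prompts_available` is hypothetical and this is not an official Stable Audio calculator.

```python
FREE_TRACKS_PER_MONTH = 20   # free plan quota quoted in the lecture
PRO_TRACKS_PER_MONTH = 500   # professional plan quota quoted in the lecture
MAX_RESULTS_PER_PROMPT = 5   # maximum number of results per generation


def prompts_available(monthly_tracks, results_per_prompt):
    """How many prompts fit in the monthly quota if each result costs one track."""
    if not 1 <= results_per_prompt <= MAX_RESULTS_PER_PROMPT:
        raise ValueError("results_per_prompt must be between 1 and 5")
    return monthly_tracks // results_per_prompt


print(prompts_available(FREE_TRACKS_PER_MONTH, 5))  # 4
print(prompts_available(FREE_TRACKS_PER_MONTH, 1))  # 20
print(prompts_available(PRO_TRACKS_PER_MONTH, 5))   # 100
```

Asking for one result per prompt stretches the free quota to 20 prompts, at the cost of having fewer variations to choose from each time.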
6. Trying out Stable Audio: Okay, so let's take a look and test Stable Audio. I'm going to put in my same prompt: modern bachata, it has to be slow first, then fast, and then slow again. It has to be danceable. I didn't copy and paste it, so I have to write it once again. Let's generate a soundtrack with this description. Also, you have the guide here if you want to use the user guide. Let's see: modern bachata. It takes a little bit longer, I guess, but we have to wait.
Okay, it got generated. No, this doesn't sound like a bachata at all. Let's see what happens if I change modern to sensual. But this is not a bachata. That makes me think Google's MusicLM is better, maybe because they have more training data. I don't know, but let's give it a chance. Maybe Stable Audio wasn't trained on these genres. Maybe it was trained on, I don't know, rock, pop, or some other kinds of music. No, this doesn't sound like a bachata at all.
Let's see, by modifying the prompt to a typical bachata. I'm going to set the prompt strength to 100%. Let's see if modifying the prompt like this generates a better result. No, no, no, no. We have seen that Stable Audio is failing at generating a bachata. But again, you can try it with different genres. Maybe it generates better rock, I don't know.
7. Conclusion: What is the conclusion here? You can now write your own music with MusicLM. MusicLM, which was developed by Google Research, is designed to create music based on textual input. This model is capable of producing extended periods of high-quality music that adhere to the provided text instructions. To experiment with MusicLM, one can register for AI Test Kitchen as of December 2023. However, for those interested solely in sample outputs, visiting the Google Research website is an alternative option.
We also tried Stable Audio, but we saw that MusicLM was better at generating bachata. I'm saying bachata here because that's the only genre we generated. You need to try other kinds of music, because maybe it's better at generating rock, I don't know. But I am a bachata lover. I love to listen to bachata, and I was disappointed by Stable Audio's outputs. MusicLM was way superior to Stable Audio.
Don't forget to follow me on social media. You can join my Discord channel, you can follow me in Scra and you can subscribe to my Jet Channel. I hope you enjoyed this course. See you in the next course.