Transcripts
1. Intro: Hey there, it's Max, and welcome to your SQL mission. In this mission, you're going to learn how to use SQL to process and analyze data directly on the database using SQL queries. By the end of this mission, you'll be able to write SQL queries that look like this, and you'll also learn what relational databases are. Throughout the course, you're going to first start off with your setup. Then you'll go into some basics about tables, databases, and queries. You're going to learn about arithmetic, dates and times, and manipulating data, and then you'll also go into more complex things like joins, unions, and window functions. So I'm excited for you to get started.
2. Relational Databases Intro: Hey there, it's Max, and welcome. In this lesson we're going to learn about relational databases, just to make sure that everyone's on the same level, that everyone has the same knowledge about relational databases, and that we're all familiar with the key terms, so that when we actually start using and learning SQL, you understand exactly what I'm talking about when you hear any of the necessary terms. First of all, databases in a relational model are split up into tables. A database is going to be made up of several tables, but let's start with the table first. For example, we can have a table like we see on the right here. The first thing that we notice is that the table has a name, which you can see above it. Now, this is just the way that I'm going to show it; usually the name is represented somewhere else, but just to make sure that everything is linked here, we'll have the name on top. In this case our name is user_info, with an underscore. You'll notice something about the naming, which is that there are two general naming conventions: one of them uses an underscore, and the other one uses capital letters. You can see the capital-letters example in the UserID, which is a column in our user info table. These two options are just different naming conventions, and they're there to help you, the user, or maybe also the database administrator, or anyone else who needs to use this database, read everything more easily, since it's much easier to read words that are separated by an underscore, or words where each new word starts with a capital letter, than words where everything is all lowercase or all uppercase. You'll probably see one of these two naming conventions being used when you're working with an SQL database, for example in a company, or when you get one in a technical interview or something like that. Now, as you've probably noticed and are probably already aware, tables are made out of rows and columns, and it's the same way in a relational database. We can see here that our user info table has four different columns, called user ID, name, joined, and email. It also has different rows; we can see we've got three rows in this case. We'll talk about that more in a second, but this is going to be the base structure: our database is made up of different tables, each table has a name, and each table is also made up of columns and rows. Now, each table in a relational database also has something called a primary key. The primary key is the column that uniquely identifies each row. In this case, our primary key is going to be the user ID, which means every single row needs to have a unique user ID. This primary key can arise naturally: for example, in our user info table, the user ID is going to be unique by design, because every user gets a unique ID, and so the user ID is a great column to use as the primary key. But in other cases, the primary key may be an additional column, because there is no column in the table that truly results in a unique identification. Either way, a primary key in a table is important, because that's what's going to allow us to identify each record in our database individually.
And so it's important that we have a column for the primary key, because that way we have a unique identifier for each row in our table. Now, something else that's important about the primary key: there can only ever be one per table. You can't have two primary keys in a table. So there's only ever one primary key, and it allows us to uniquely identify each row. Now, we've been using the terms rows and columns, but in SQL the columns are usually going to be called attributes and the rows are going to be called records. You can think of the columns as being attributes of each of our records, records being rows. For example, if we look at the first row, or the first record, it has the attributes: a user ID of 1, the name Alice, joined on the 11th of May, and the email a@abc.com. This is the kind of terminology you'll likely hear when you're looking at tables: records and attributes rather than rows and columns. Now, as I mentioned earlier, a database is usually going to have several tables rather than just one. For example, our database may be about users, storing user information as well as information about the events that users are doing. So we could have a table called user info, which we had before, but we could also have another table called event log, and another table called event info. These tables are all going to be related in some way, and usually the best way to define these relations is through something called a foreign key. For example, in our event log we have two foreign keys. What foreign keys are is columns in one table that reference the primary key in another table. So let's take a look at the event log table. Here we have two foreign keys: one of them is the user ID, which references the ID column in the user info table, and the other is the event ID, which references the ID column in the event info table. In that way, we can structure the relations between our different tables, since otherwise it would be inefficient for us to store all of this data in the same table. That's why we split the data up into these different tables; otherwise we'd get a really big table. Imagine if, in our event log, for every single event we also stored all of the information about the user as well as the information about the event itself. That would mean every single record has a lot of duplicates, because the user information is not going to change and the information about the event is not going to change. A better and more efficient way to structure this is to keep in the event log only what is relevant to the event log, the part that's going to change. Information about the user, for example, can be stored in the user info table, and all that we need in the event log is who made the event. If we want to find out more about the user, then we have our foreign key, the user ID in the event log table, which references the primary key, the ID, in the user info table. And so there we get a direct lookup, essentially: for each record, if we want to, we can get more information about the user, and the same thing goes for the events.
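We haven't written any SQL yet (table creation comes up properly in the schema lesson), but just as a rough preview of how a primary key and a foreign key map to SQL, here is a minimal Postgres sketch; the table and column names are illustrative assumptions, not the course's exact schema:

CREATE TABLE user_info (
    id    INTEGER PRIMARY KEY,  -- the primary key: uniquely identifies each record
    name  VARCHAR(25),
    email TEXT
);

CREATE TABLE event_log (
    event_key SERIAL PRIMARY KEY,                -- the event log's own primary key
    user_id   INTEGER REFERENCES user_info (id)  -- foreign key: points at user_info's primary key
);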
So in that way, we can use foreign keys to make references between tables and link them together. Now, we talked about the primary key putting a constraint on each table, which means that the value in the primary key needs to be unique across every row, unique across every record. But you can actually also have additional constraints. For example, another constraint that we may want is to say: OK, the email field in the user info table also needs to be unique. Now, it's not going to uniquely identify the person, but we want it to be unique across each person. So we can have a primary key, for example the user ID, and then we have another field, for example the email, and we also want to make sure that the email column itself is unique. So we can put additional constraints on a table to say, for example, the email needs to be unique. Or maybe you have a platform where people can log in across different regions, and you can say that the name, or the username, that a person wants to use needs to be unique within each region. Then the combination of the name and the region needs to be unique: people can have the same username in different regions, but in each region every username is only allowed to appear once. And so in that way, you can have additional constraints on tables that go beyond the primary key, constraints that also require uniqueness or that result in more conditions being put on the table. Now, this isn't going to replace the primary key; these are constraints in addition to the requirement that each value in the primary key be unique across every single record. These constraints can of course be good, because that way, if someone else wants to create an account but the email has already been used for whatever reason, we can throw them an error message saying that the email is already in use, and we're not going to create records where several people have the same email. Because if we then want to send out an email to our users and several different people are using the same email for whatever reason, that's obviously going to cause problems. And so if there are any additional constraints, whether through the business logic or just your understanding of how all of this data should be organized, those constraints can be put onto the individual tables so that none of these problems happen. Now, over time, even if we're splitting our data into these different tables, our tables can still get very large. For example, our event log may grow very rapidly if we have a lot of users using our platform. And so something that you may hear about is something called a partition. A partition is a way that we can split a table into smaller chunks, and in that way we can store them in different locations. This can make storage more efficient, but it can also make querying more efficient. Usually you make a partition by a column that is often used for filtering. For example, a very common way to partition data is by date, because oftentimes, if you're doing a query, for example you want to see all the active users within the last week, you're going to be filtering by date.
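As a hedged illustration (this syntax hasn't been introduced in the course yet, and the column names are assumptions), uniqueness constraints like the ones just described could be declared in Postgres roughly like this:

CREATE TABLE user_info (
    user_id INTEGER PRIMARY KEY,
    email   TEXT UNIQUE,   -- no two records may share an email
    name    TEXT,
    region  TEXT,
    UNIQUE (name, region)  -- the name only has to be unique within each region
);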
And so if your table is partitioned, or rather split up into different segments, then you don't need to go through that much data; you only need to touch the partitions that actually contain the appropriate data. And so in that way, it's going to be more efficient for storage, because you don't need to store one large table in one place, but it's actually also going to be more efficient for querying, because you then only need to go through a smaller portion of the table to get the results that you're looking for. Now, the cool thing about partitions is that they don't really affect you as the querier. The partitions are taken care of by the database itself: it will handle all of the storage for you, and it will make things more efficient when you're filtering by the columns that are used for the partitions. But if you want to get all the data in the table, you still can. Even though the table is partitioned, you can still query the whole table, which may be stored across different locations. So even though partitions are going to be splitting up our tables to make things more efficient, and they can make our queries more efficient if we're using the appropriate filters, they're not going to have any negative effect on you; you're not even really going to notice them, other than the positive effects that you get from them. And so partitions are extremely useful, because they make things so much more efficient, yet you don't really have to interact with them in any way except to use them to your advantage. Now, there are also other ways that relational databases, or tables, can be designed to make querying more efficient, and that's by setting indexes, or setting additional indexes. So, for example, let's say we often query by events. We may want to set up an additional index on the event ID, and in that way, our database will actually keep track of which record has which event ID. Now, indexing is going to require more storage, so obviously you don't want to index every single thing, because then the database needs to keep track of all that. But indexing on the right columns can make querying much more efficient, because now you have a quick lookup: rather than having to search through the whole table to find all of the records, the database can instead reference this lookup structure, and it will point to the records you want. So indexing is usually done on the columns that are very often used for filtering, and it can make queries much more efficient. Now, you can still do all the normal things without filtering on the indexed columns. Indexing is just another way for the database designers to make querying the tables even more efficient. So if you know which of the columns are indexed, then when you want to do your filtering, it's best to start filtering by those columns. That way, the subsequent steps of your query will need to go through far fewer records, because you've already reduced the size of your table by performing that filtering. And so indexing, just like partitioning, is something that can basically only affect us positively as queriers, because we don't need to use it.
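Purely as a sketch of what this looks like in Postgres (the names here are made up for illustration), an index and a date-based partition could be declared roughly like this:

-- Index the column we often filter by, so lookups don't scan the whole table.
CREATE INDEX idx_event_log_event_id ON event_log (event_id);

-- A table partitioned by date: queries filtering on event_time only
-- have to scan the partitions that cover the requested date range.
CREATE TABLE event_log_partitioned (
    event_key  BIGSERIAL,
    event_time TIMESTAMP NOT NULL
) PARTITION BY RANGE (event_time);

CREATE TABLE event_log_2024 PARTITION OF event_log_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');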
But if we use indexes and partitions of tables properly, then they can also really improve the performance of our queries. Now, something else that you may hear about is table relations. There are different ways that tables can be related to other tables. One of those ways is a one-to-one relation. A one-to-one relation means that one record in one table corresponds to exactly one record in another table, and that record in the other table corresponds to only one record in the original table. So if two tables are related by a one-to-one relationship, then one row in each table corresponds to one and only one row in the other table. Next, we have a one-to-many relationship. One-to-many means that one row in one table can correspond to many rows, or multiple rows at least, in another table, while each of those rows in the other table corresponds to only one row in the original table. So, for example, our user info and our event log are related by a one-to-many relationship: one row in our user info can correspond to many rows in the event log, since a user can make multiple events. We can have multiple records that all have the same user ID, but each of those records references only one row, one record, in our user info table. And the event info and the event log have the same relationship. Finally, we can also have a many-to-many relationship, which means that one record in one table can correspond to multiple records in another table, and one record in that other table can also correspond to multiple records in the original table.
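One practical note the lesson doesn't spell out: a many-to-many relationship is usually realized through a linking table, which is exactly the role the event log plays between users and events here. A hypothetical sketch, assuming user_info and event_info already exist:

-- Users and events are many-to-many: a user can take part in many events,
-- and an event can involve many users. Each event_log row pairs one of each.
CREATE TABLE event_log (
    event_key SERIAL PRIMARY KEY,
    user_id   INTEGER REFERENCES user_info (id),   -- many log rows per user
    event_id  INTEGER REFERENCES event_info (id)   -- many log rows per event
);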
3. Data Types and Schemas: All right. Now that we understand a bit more about relational databases, how they're built up, how they work, and what they generally look like, in this lesson I want to go over data types and schemas. Let's start off by talking about data types first. I'm just going to keep one of the tables that we had in our previous lesson, just so that we have a visual to refer back to. If we go into this table and we just look at some of the values, we can see that there are different types of data represented here. For example, the ID column looks like it carries integers, or numbers at least. The name column looks like it carries text, the joined column looks like it carries dates or some other time information, and then the email is again text. When we're defining a table, rather than allowing
just every column to take on any value
that it wants, it can make a lot of sense
for us to define the type of data that we're expecting
to see in each column. Now this is useful because
from an outside perspective, you can quickly see
what data type should be in each column and what
you're expecting to see there. It's also good because
different data types require different
amounts of storage. You can optimize storage by choosing the appropriate data type, so that you don't take up storage space that doesn't need to be used. This can also really
help with things like performance because if you're using the correct data types, then actually, in most cases, the queries are going
to run faster than if you're just
using a general type for everything because the
operations are much more well defined for what
exactly you're trying to do. And it's also nice because
when you want to put in new data or when
you're looking at the table and you want
to extract data from it, you know what to expect. If
there's some mismatch, then you can quickly compare
that or you're going to get a message from
the database saying, you've tried to enter
text here, for example, but this field
requires an integer or this field expects to have a date or a time or
something like that. In that way, your databases are going to likely stay
more consistent and be less prone to errors
because there needs to be this validation check to
make sure that whatever is going in is of the
appropriate type. There are, of course, as you can see, different advantages, both from the user perspective as well as from the database itself and from an
efficiency standpoint, both in terms of memory, as well as in terms of performance and speed
of our queries. Let's take a look at some of the data types that
are available to us, starting with the
numerical data types. One of the very common
ones that you're going to see is the integer. There are also different variations of the integer: you can have a smallint, or small integer, and a bigint, or big integer. The difference here is the amount of space allocated to each value. In the integer case we have four bytes, with the smallint we have two bytes, and with the bigint we have eight bytes. We can see already here that we can make choices that will affect performance. For example, if we have a database that's storing test scores, it would make sense to use a small integer, because test scores most likely can't go above 100. Whereas if you want to store huge numbers, then it may make sense to use a big integer. Otherwise, your numbers are going to be cut off at some point and you
can't get any larger. Depending on what you're
expecting from the data, you can actually
make the appropriate choices to make sure that you're not unnecessarily
using extra memory. Then there are, of
course, also other types. There's another
one called numeric; another variation of this would be decimal. That just means we can have as
much precision as we want to. This data type doesn't have a set number of bytes allocated
to it and it's varying. This, of course,
makes it much more flexible from the
user perspective, but also a little
bit less efficient from the database
perspective because it can't allocate specific
amounts of data for each data type or for each value that's going
to come into each record. Now, we also have something called real and
double precision. Both of these are used to
store decimal numbers, but the real type gives us roughly six decimal digits of precision, whereas double precision gives about 15. Again, depending on how much precision you need, you can either use real, if beyond six decimal digits it really doesn't matter anymore, or, if you really care about
those extra decimal places, then you can go into double
precision, for example, to make sure that you maintain
all of that information. Then another numerical data type that we can set is called serial. Again, here we have the option of using either smallserial or bigserial or just the normal serial. What serial is, is an auto-incrementing counter. This is really nice if we want to have an additional column for a primary key, but that column isn't specifically set; it would just be a number counter that keeps incrementing. The serial actually auto-increments on its own. For example, if we want to have a primary key that just keeps track of the row number, or some unique identifier, then we can give it the type serial, because it will auto-increment and we don't actually need to pass any additional value to it. It will just make sure it keeps incrementing from the previous value that it had before. We also have the options, of course, of the smallserial and the bigserial versus the normal serial, again just depending on how much data we're expecting to have in our database.
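A small sketch of what that looks like in practice; the table and column names here are just made up for illustration:

-- The id column auto-increments, so we never supply it ourselves.
CREATE TABLE events (
    id   SERIAL PRIMARY KEY,
    name TEXT
);

INSERT INTO events (name) VALUES ('signup');  -- automatically gets id 1
INSERT INTO events (name) VALUES ('login');   -- automatically gets id 2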
Now, the next data type category is text-type data. There's a data type called text, which is similar
to the numeric or the decimal data type
that we had before, which just allows
variable amounts of data. It's not a fixed amount, but it can vary again based
on what the input is. But we also have the option to use, for example, char or character, where we actually define how many characters we want to allocate. What this does is, if we say, for example, char(25), that means we're allocating space for 25 characters in every single record for that column. If we don't use all 25 characters, then we're going to have some padding; otherwise, we're going to be cut off at 25. So there again, we can set data size restrictions based on what we're expecting. Then we also have the option of using varchar, or varying character, which just means that we can go up to this many characters. Again, for all of these, we can store text data, but the amount of memory that we're using for each value that we're putting into the column is going to vary depending on which data type we choose here.
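To make the difference concrete, here's a hypothetical sketch (not from the course):

CREATE TABLE text_demo (
    code     CHAR(5),      -- always padded out to exactly 5 characters
    username VARCHAR(25),  -- up to 25 characters, no padding
    bio      TEXT          -- variable length, no fixed limit
);

-- Inserting 'abc' into code stores 'abc  ' (padded with spaces);
-- inserting a 30-character username would raise an error.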
Now, there are some additional data types, specific to dates and times. We have timestamp options: plain timestamp, but also timestamptz, which can carry additional time zone information (we have this time zone option in other places too). If we do choose to use the timestamp with time zone, that's also going to increase the amount of storage we need for each element in there. Then we can have
a date data type, which is smaller than the timestamp because
now we only need to worry about the date and not extra information
about the hours, the minutes or seconds or even milliseconds if that's something
that we're tracking. Whether to use date, then, again depends on how much accuracy we want in our time information. Are we dealing with dates
and is date accuracy enough, or do we need specific
timestamps, for example. Likewise, we can also deal with the time data type, which, alongside the date type, gives us a way to store the time of day. Here again, we have the
option to also use time zone. But if we use time zone, then we're again
going to require more data storage because again, we need to store that extra
time zone information. Now, usually, if you're
not using time zone, the default time zone to use
is going to be UTC time. Most of the time
to save storage, you can use the time or the
timestamp without a time zone and just make sure
that all the data that you're putting in
is in UTC time. Finally, there's also another cool data type called an interval, which, as the name suggests, represents intervals between different times. This can be really useful because it allows us to do time-based calculations: for example, looking back a week, going a week forward, going to tomorrow, or looking within the last hour. The interval is another data type that's very cool to use, as you'll see when we get to the dates-and-times section on writing SQL queries; it can also be really nice in queries themselves.
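For a quick taste, a hedged sketch of the kind of interval arithmetic Postgres supports:

-- Subtract an interval from the current timestamp to look back a week.
SELECT now() - INTERVAL '7 days';

-- Intervals also show up in filters, e.g. events from the last hour:
-- WHERE event_time > now() - INTERVAL '1 hour'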
Now, there are some additional data types that we have. One of them is the boolean. This is a very small data type, and all that we're storing here is true-or-false information; it can take on just two values, true or false. We have money, which can be used for currency, and we have bytea, which can be used for storing byte strings. We have the enumerated type, which is a custom data type that we can create, where we can assign a series of elements, a series of values, and give an ordering to them. For example, we can use an enumerated type to enumerate the different months of the year. In that way, we'll be able to compare the ordering of the months: using an enumerated data type, we can say that April comes after March, whereas if we just stored that as text, we wouldn't be able to do that. So the enumerated type is a custom, user-defined data type that we can create, where we can give a specific ordering to values. An extension of this would then just be general user-defined data types, where we can actually define our own data types and then use those in our tables.
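A hedged sketch of how such an enumerated type might be declared (the type name is made up):

-- Values compare according to their declared order, so 'apr' > 'mar'.
CREATE TYPE month_name AS ENUM (
    'jan', 'feb', 'mar', 'apr', 'may', 'jun',
    'jul', 'aug', 'sep', 'oct', 'nov', 'dec'
);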
Now, as Postgres develops, or generally as relational databases develop, there are also other custom data types that are introduced, which may not be natively part of the system initially. Some examples are JSON, XML, arrays, text search types, UUIDs, geometric data types, and network data types. All of these are basically
extension data types that come in because it
can make it very useful to be able to store
this type of data because it's becoming more common to have these data types appear. Being able to put
that into a table in a relational database
can be very nice. But again, these are custom data types that are extensions. As time grows on, it's
likely that there will also be more custom data types
that are available to use, depending on just what the situation requires. As you can see, there
are a large variety of data types that are
actually available to us. But generally, the data types that you're going
to see are going to focus more on the
basic data types. Just being aware of what data types are available
and also what they represent can be
really nice because when you're going
through and you're trying to understand a table, you'll just recognize
this, and also, if you maybe have to make a decision about creating a new table, you'll understand what options
are available to you and maybe even
be able to think a little bit about how can I optimize the efficiency
by making sure I use the right data type and also allocating the right
amount of data storage for it. Now, the other thing
that I want to talk about is schemas. Now, in general, when
you refer to schemas, that usually means that we
have a schema which represents a collection of tables as well as information about
each of the tables. But it's also common to
use the word schema to define the properties and
structures of a table itself. For example, let's look at this actual SQL query in the bottom left, but don't worry too much about the query itself. Let's just look at the elements inside. What we can see here is we have an ID, which is an integer type (this can also be shortened to int). We can see that it's the primary key, and we can see that it's also not allowed to take on null values. What we're doing here is just defining our user info table on the right-hand side. We can see here that our ID column is defined to carry integer data, it's defined to be the primary key, and it's also not allowed to carry null values. The next column that we have
is going to be the name. Here we've decided to use the varchar data type, and in this case we've given it 25 characters, which means we assume that our names are not going to be longer than 25, because that's the amount of storage we've allocated to each element that we're going to put in with each record. Joined here is going to be a date data type, because if we look at the column, we're actually just storing date information; we're not storing additional time information. To make storage a little bit more efficient, we don't need to store anything about the time, just the date. If we did want to store additional information about the time, of course, we couldn't use the date data type, because it wouldn't carry enough information. Finally, we have the email field. Here we decided to use the text data type, because emails can have very varying lengths. If a user signs up with a really long email or a really short email, it's hard to predict just how long users can make their emails. Now, we could make some assumptions about it, but of course we don't want a user entering their email and then our database saying we don't have enough storage allocated to store the full email. So here we decided to use the text data type. You could also make this a more educated decision: if you already have data about a bunch of users and their emails, you can see just how long these emails get, maybe leave some extra padding room, and then use something like a varchar data type instead. What you can see in the bottom left here is basically the schema definition of our table. We can see what each column is and what data type it carries. From this query alone, we know our table has four columns, the ID, the name, the joined, and the email, and we also know the data type of each column, which means we also know what values to expect from each column, as well as, if we want to insert values, what values should be placed into each column.
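The on-screen query itself isn't reproduced in the transcript, but based on this description it presumably looks something like the following sketch:

-- Reconstruction of the user_info schema as described (names assumed).
CREATE TABLE user_info (
    id     INT PRIMARY KEY NOT NULL,  -- int is shorthand for integer
    name   VARCHAR(25),               -- up to 25 characters
    joined DATE,                      -- date only, no time of day
    email  TEXT                       -- variable length, no fixed limit
);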
Now let's take a look at another table, specifically our event log table. Here the schema is going to be a little bit different. We have the event key, which is going to be a serial data type, because the event key is just going to increment with each event. If we expect a lot of data because we have a very big platform, maybe we even want to use bigserial here. The event key is going to be a primary key, and again, it's not allowed to be null. Then we're going to have the event ID column, which is going to be an integer. You can see here, also comparing to the user ID column below, that we can use both int and integer; int is just a shorthand for integer. Again, we're saying the event ID is not allowed to be null. Then we have the time, which in this case is actually a timestamp. We can see we've got time information in a timestamp-like format, and so we want to make sure that we use that type. And then we have the user ID, which is again going to be an integer, and again, we're going to say it's not allowed to be null. Now, what we're also doing here is defining our two foreign keys, since our event log has two foreign keys. We can see the user ID in our event log references the ID column in the user info table, as we can see just from the link here, and the event ID in the event log table references the ID column in the event info table. So from the query in the bottom left, which is also the schema definition of our table, we understand that our event log table again has four columns: the event key, the event ID, the time, and the user ID. We also know the data type of each, we know what the primary key is, and we also understand the foreign key links for our user ID column as well as the event ID, and which columns they reference in the other tables.
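Again, the exact on-screen query isn't in the transcript; a plausible reconstruction based on the description would be:

-- Reconstruction of the event_log schema as described (names assumed).
CREATE TABLE event_log (
    event_key SERIAL PRIMARY KEY,                         -- auto-incrementing key
    event_id  INT NOT NULL REFERENCES event_info (id),    -- foreign key to event_info
    time      TIMESTAMP,
    user_id   INTEGER NOT NULL REFERENCES user_info (id)  -- foreign key to user_info
);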
So yeah, this is how we can define the schemas of the individual tables. This is also useful when you're reading the information about a new table, maybe one that someone just passed to you: you can look at the schema definition of the table to understand what columns are available, what data types they have and how they're stored, and what references there are. Again, don't worry too much about the query itself, because that's what we're going to learn very soon. Just be aware that this is how we can define the schemas of our tables, and this is also what the schema definitions for our tables mean. That way we can properly define our table as a whole, as well as properly assign data types, which is great for the database itself, making memory storage efficient and queries faster. But it's also great from a user perspective, because we understand what data is supposed to go in, what data we can expect, as well as what links there may be between different tables and how those tables are linked.
4. [Mac] Setting Up a Local PG Server: Now we're going to start looking at how we can get Postgres running on our machine so that we can actually start using it. Now, if you're in a
company or something, it's going to be
very likely that your database is going
to be online somewhere, so you'll have
connection details. But here we're going to
learn how we can set up a local server
to connect to for databases so that
we can just follow along and keep everything
on our local machine. Then if you're working
with a different database, then all you're going
to have to change is just the connections
that we'll look at in the following videos. There's not much else to that, depending on which database
you're going to use, you're then going to need to connect to a different source. But now let's look at how we can actually get this running
on our local machine. We're going to go to postgresql.org. Then once we're here,
we're going to hit on the download button here to
go to the download section. Now, depending on your
operating system, you'll want to make sure you choose the right
package for that. I'm on Mac, so I'm going
to click on MacOS. And then I'm going
to scroll down here to the Postgres.app. This is a very nice, user-friendly, visual, interactive
app that we can use. I'm going to click
on here, which is going to take
me to this page. Then I'm going to
head to downloads, and then just click the
download button here, and then that's going
to start your download, which you can then use to get the file that we need to
perform the installation. Once the download has finished, I'm just going to open up the resulting file, take this Postgres icon and, as is Mac standard, drag it into the Applications folder. Then, once it's done copying, I can open up my Applications folder, and we can see here it's just installing. Then to open it, all I have to do is just double-click on it. Then it's going to open up Postgres for us, which we can use to host our local server. Click Open here too if this is the first time you're opening it. Then here we go. This is what your visual
client is going to look like. You're also going to
have a little icon that comes with it up here. Then if we want to
start a local server, we just hit Start here, and then you can see we
now have a running server. By default, you're just going to have three databases
on your machine: one with the username you set up for your computer, one called postgres, and another one called template one. These are three different
databases that are set up for us on our machine
now, that we can access. The important thing is that we have to make sure that
if we're using one of these databases that our server
here is actually running. Because if we click stop, then we can't connect
to it anymore. If you're having
connection problems, just make sure to check that your server is actually running and that the database name you're using is correct. In the next video, I'll actually show you how we can connect
to these databases. Now, you can also manage
this from the icon up here. You can see we can also just
stop our server up here. If we want, we can also fully quit it from the menu bar, as well as quitting it from our application. But yeah, I recommend that you just leave this on while you're going through the course, just to make sure that you always have access to the database.
5. [Windows] Local Postgres Server: Now let's take a look at how we can install our Postgres server on Windows. Again, we're going to head to postgresql.org here. Then we're going to go
to the downloads tab, which we can currently
access either up here or just by clicking
on the button over here. Then we're going to
click on Windows here. Then, up here, we're going to download the installer, the Windows installer for version 12, which is what we want here currently. Then just save this and let it download. Now, once that's downloaded, you're going to want to open it up. I'm going to run it here as an administrator and confirm. Then once our
installer is open, we can just go ahead and work through the setup here, so we're going to click Next. Choose where to install it. We're just going to leave everything ticked
here, click next. Again, just leave everything
in the default position. Now we need to provide a password for our database, and it's important that you remember this. We can see here our user is going to be postgres. I'm just going to use the password postgres as well. Of course, you can set it to whatever else you want, but I'm going to have my password here be the same as the username, just for simplicity, since this is a local thing anyway. Then go ahead and click Next. Leave the default
port here as 5432, and go ahead and click next. Again, I'm going to
just leave this be the default here and click next and then next here and then just start the
installation process. Once the setup is done, you'll reach the completion screen. It'll ask you if you want to launch Stack Builder right now. You don't have to; you can untick this. If you left it ticked and hit Finish out of reflex, then you can also just click Cancel or close the Stack Builder application that opens. We don't need to use that.
You can just hit finish here. Now to go into our admin panel, we're going to go into
our search function. We're going to
search for pgAdmin. Here we have pgAdmin 4. Open this. Once that's open, it's going to open in a browser, as we can see here. Now, it's going to ask us to set a master password for pgAdmin. I'm just going to set a password here. This doesn't have
to be the same as you set during the installation, but make sure you
keep note of both of the passwords that are used
here. Then you can hit Okay. Then we can go for
example, into the browser. Now, if we want to look at the Postgres server, which is what we set up during the Postgres installation, we're going to need to use the password that we set during the installation process: not the master password that we just created here, but the password that we used when we were going through the installation. Remember, I used the password postgres here, since we're connecting
with the postgres user. I'm going to put in the
password here that I set during the installation
process that we did or that we went
through a couple of minutes ago, hit Okay. And then we can see here, now I basically have access to the Postgres database in the server here. You can also see, just going into the tab here, that we have the option, for example, to open it in a new window, which is what we have here. This is also how you can get it into your taskbar, and here we actually
have then the database, which is also what we'll
explore a little bit later. Later, whenever we want to connect to our Postgres database, to the local Postgres server that we have, make sure that you use the same username and password that you set during the installation. We'll see later, when we look at DataGrip for example, that in the Mac case we actually don't need to provide a username and password, because we didn't set one. In the Windows case, again, for the DataGrip example that we'll go through soon, make sure you use the user postgres, like we did here, to connect to the server, and then make sure you use the password that you set during the installation process: not the master password that we set here, but the installation password, which is the same one that we also entered here when we got prompted trying to go into the server. Again, the username is going to be postgres, and the password is going to be the one that you set during the installation process.
6. [Mac] Postico Setup: In this video, let's learn how we can actually connect to our database and start writing SQL queries against it. If you're on a Mac, there is a cool interface that you can use to connect, called Postico. This is just the download page that we were on last time. If you're in the introduction tab and you scroll down, then you can see here a link to Postico. I've already opened it in a new tab here. Then you can just download this and let it install locally. Then if you want to open
it, then we can head here and this is what
you're going to see. Currently, we're not connected to any database. But let's just go
through the setup. This is going to be
the same depending on the other platforms
that we'll also look at. But let's just go
through it here too. We can give a nickname
to our database. This is just for us to understand
what we're looking at. Here we can say, for example, my first DB or if you want to give it
a more specific name, of course, you can host here is going to
be our local host. We want to keep that. Our port
is going to be port 5432. It's important that
we also keep that. If we head back to the Postgres app, this is also going to be running on port 5432; it's important that we keep these the same. The user we can also keep the same. There's no password required, because we're running it on localhost. The default database that we can connect to is just going to be the name of the user, which we have here. If we go to our Postgres app, we can also see that
that database exists. But we can also see that
we have different options, for example, the
Postgres one here, as well as the template one. You can connect to any of these. I'm going to connect
to template one. If you don't put anything here, it's just going to use the default values that it has in here. Once we're ready, we
can just click Done. Now that we have the
database connection setup, we can just click Connect here, and then we're going to get this new window open for
us, where we can see that we can go into the SQL query editor, and here we can then
write our queries. We can also see by
the menu bar up here, which database we're
connected with. Right now we can see we're
connected with template one. But if we actually just click back on my first DB, which is the nickname that
we've given to our database, we see that the
server that we're connected to actually
has three databases. If we want, we can also
connect to the different ones. We can see here now we're
in my user name database, or we can go out and we
can go into template one or we can go out and we can go into the
Postgres database. You can see for each of
these, we can then also open the query editor and depending on which
database we're in. Well, that's going to be the one that we're
connected with and that we're
writing queries to. If we want to create
a new database in our server, we can also do that
by just clicking on the database button here, and then we can give
our database a name if we want to or to delete it, we can just click Delete here, and then we just type in drop database and that
will delete that database. If we want to go back to the option where we actually connect it into this database, we can go into file,
show favorites window, and then that's going to have the original window where we connected into our
database room. Want to set up any
additional connections, then you can create
a new favorite, recommend just disconnecting
from this one first. Currently, we're
on the fee trial. If you want to create
new connections here, then you can create
a new favorite and then it's going to well, then you can set up additional
connection options. You can have one, for example, on your local side that you
want to play around with. If you have one at work
or somewhere else, then you can also
connect to these here. As you can see, when
we're downloading postco, you're going to by default
at the trial version. The trial version is
free, indefinitely. It just has a reduced
number of features. But for any intents
and purposes, it will serve just fine. But of course, if you do want to activate, then
of course you can. But just make sure that
you check out all the different or that you at least check out also
the other video. I'm going to talk
about a data grip, just so that you
see the different user interfaces that
you can deal with or the different
connection services or applications for
connection that you can use so that you decide
which one is the best for you, which one you like the most. Then I recommend just
sticking with the one that you enjoy the most you feel
most comfortable with. Yeah,
7. Datagrip Setup: Now, the next tool that I'm going to show you for connecting to databases, and doing all of that cool stuff with writing queries and everything, is going to be DataGrip from JetBrains. This is also the one that we're going to use throughout the course. You can get it by going to jetbrains.com, then going to Tools, and here going to DataGrip. With DataGrip, you're going to get a free 30-day trial, which should be fine to go through the course as well as do some additional practice. But if you do enjoy it, it does have a paid version. As we also saw with Postico and other GUIs, there are, of course, other
free ones that you can use, depending on how
much you like it or also if you've played
around with other ones, you can then opt to go with either DataGrip, or you can, of course, also choose
any other system. It doesn't really
matter in the very end. All of this is just going to be connecting to our
databases and stuff. It's just whichever one you
feel most comfortable with. So we're going to start off
with a free 30 day trial. If you want to follow along
with the course exactly, then you can of course
get data grip here. Then once your trial expires, you can either decide
if you want to continue with data
grip or if you want to use a different GI
that has free services, for example, postcode that we saw offers lifetime free trial, and there you can just
continue to use it for free. That's totally up to you. But go ahead and download data grip. Once that's done, just follow
the setup and open it up. Now, once you've downloaded and opened DataGrip, this is what it's going to look like when you open it. In case you don't have this
window here on the side bar. You can get that very
easily by going into view, then tool windows and
clicking on database here. Then that's going
to just open or close the window on
the left hand side. To connect to a database here, I'm just going to right
click new data source. I'm going to select
postgress QL here. This is going to allow us to set up a Postgres data source. I'm going to click
on this, and that's going to bring up this
connection window here. If you don't see postgress when you let me
close this again. If you don't see
postgress up here, then it's likely just
going to be down here. Once you've selected
them before, then they're going
to be moved up to for you here so that
you have easier access. It's probably not going
to be up here for you if you've never used
data group before, in which case, it would
be down here somewhere. Anyways, go ahead
and open that up, and that's going to give us our Postgres connection window. Here again, we can give a nickname, and then we have our connection options. Our host is still going to be localhost, and our port is still going to be port 5432. We don't need a username or password to connect to the local database. Then, for the database that we want to connect to, we can choose between, of course, postgres again, or, if we just open up our Postgres app, we also have template one, as well as the username of whatever computer you're using. We can then just click OK, and that's going to connect us to a database, and
we can see here, we also get a console with it where we can actually
write our SQL queries in. If we want, we can also explore the database by opening it up, and then we can see the structure here. There's already a database set up in it by default, but there are no tables in it yet. We'll learn about
this more a little bit later once we get
into all of this. But this is how
we can connect to the Postgres database
using DataGrip. In case you accidentally ever
close your console here, you can just easily re open
it by right-clicking on here and then Open Query Console. Now, something that's cool about DataGrip is that we can actually have several
database connections open. For example, I'm going to add another new
data source here, and I'm going to add another Postgres data source; now we're also going to connect to template one. Our settings here are still going to stay the same: still on localhost, still on port 5432. Our nickname here is just going to be filled in by default, and then I can click OK. Now we can see that
we're connecting to another database here
on our local system, which is this one here. Here we can also see that we can go in and explore it. The Postgres database by default actually has some more things
set up in it. But again, we'll
learn about that more once we actually
get into the query. This is just more to make sure
we have the proper setup. You can also see here
now I have two consoles. One that's connected
to our postgres. Again, this is going
to be our nickname. The other one that's for
the template one database. Of course, if you close either or any of these
or all of these, you can just go back on here, right click open query console, and also here we can
open query console. You just want to make sure that whichever query
console you're using, you want to make sure that
it's for the right database in case you have
multiple databases open. If you feel like your toolbar tab here is getting too full, you can also just click on the database again, right-click, and Remove; that's going to remove the configuration for the database. One thing to note about that is that the configuration is then also gone. If you want to reconnect to it, you're actually going to have to go through the steps again of creating a new data source, connecting to it, and putting in the configuration details. But if you want to remove it, then that's how you can do that.
8. CLI Connection: Now, the other option that you can take, and that you may sometimes have to take, is using the command line interface. This just means using the terminal or something like that to connect. Sometimes you may
want to use this. Although personally, I prefer
the graphical interfaces. Other times you may
be forced to do this. If you're, for
example, connected to a remote machine and you have to work on the database over there, and the only option you have is using a
command line interface. Either way, you can also use
the command line interface. For us, connecting to it is
going to be easy because we have our Postgres server here. We can just connect or double
click on one of these, and then that's going to open up the connection in
a terminal here, and then we can write just
our SQL commands in here. These are the different options
that are available to us. We're going to go with
data grip in this course. But of course, you can
choose whatever you want to. You can, of course, also switch between the different
applications. At the end of the day,
it's not going to matter super much
which one you go with because they all provide
the same means to an end, which is connecting
to our database and letting us
write our queries. Whichever one you feel most comfortable with is the one
that you should go with, because at the end of the
day, they're all going to basically allow us
to do the same thing.
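For reference (this isn't shown in the video): if you ever need to open a CLI connection yourself, rather than through the Postgres app shortcut, the standard Postgres client is psql, and a typical invocation from the terminal looks something like this, with the user and database adjusted to your own setup:

psql -h localhost -p 5432 -U postgres -d template1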
9. Creating and Dropping Databases: Now that we know how to connect to a data source, let's go ahead and
learn how we can create a database inside
of our data source. I'm going to be connecting
to the template one, just because there
are no databases or anything inside of here,
it's completely empty. I'd recommend that you
do the same just to have a completely
fresh data source. For you, it may even be that when you click this drop-down arrow, you won't see anything, because usually these folders are only created if you create and then remove a database afterwards. So if you do the drop-down, it's possible that nothing actually comes up. Now, in case you're
in the configuration here, it may, for example, tell you that the database is not found or that you can't connect to it; for example, if I try to connect to template two here. Before I actually apply, I can also just test the connection, and if I try this, it's not going to work, because the database doesn't exist. If you do want to connect to a different database, one option, of course, is always just connecting to the postgres one, which should always be included and created on every system. The other ones are usually going to be created on other systems too, but there can be cases, differing from system to system, where a database may not be created by default. To get around that, we can just connect to postgres, for example; then we can either
use this directly, or we can go into
our query console, and this is also
what we're going to learn in the coming lesson, we're going to write
create database, and we're going to create, for example, template two, if that's the one that
we want to connect to. Then we can run this and we can see here that it
was just updating, and we're going to go back
into our properties here. Now if we try to connect to template two and
we test our connection. We see here now it's
working properly. This is just because
in the video, I'm going to be
using the template one database in case it
doesn't exist for you, you can do this to create
it and then connect to it. But of course, alternatively
you can also just use the default
Postgres database to. But let's go ahead and create a database because remember, when we learned about
the relational model, we have tables that
are within a database. Before we can create tables, we first have to have a database to put these tables into. So how can we create a database? Well, here I'm going to have the query console open. You'll notice that this
is the console for my template one data
source connection, as you can see by the name here. In case you have multiple
data sources open, make sure that you're
using the right console for the right data source. Also, check in your Postgres app that your local Postgres server is running. If it's not, make sure you click Start. Okay, so how can we create a database in our
template one data source? To do this, we're
going to use or we're going to type into our
query console here, create database, and then we're going
to type in here the name of the database, which we'll fill in in a second. Then we're going to end
it with a semi colon. There are some important
things to note here about general SQL
query structure. First of all, all SQL
commands are usually capitalized by the user just to make the
query more readable. Also, it's customary to end SQL commands
with a semicolon. You don't need to do this,
but this indicates that the query is over and you can actually chain multiple
queries together, whenever you finish a query, you can then put a
semicolon here to basically say, this
query is done. Let's create our database, and let's call it ourdb. If we run this, then we can see we get a console log here. If we go into databases, now we have ourdb created. This is the database name
that we've given it. We can see here we get some extra folders that
come with it too. Yeah. This is how we can go
about creating a database. What about if we want
to remove a database. Well we have to do then is
change the create to drop. Then if we run this, you'll notice that now our
database is gone. We can see that it's been
removed from our data source. To show you that the
capitalization is not necessary and neither
is the semicolon. Let's go ahead and repeat the same query that we
did in the beginning. But this time using
everything is lower case, as well as not using the
semicolon at the end. If we run this, and here we go, we can open our database again and everything is
back to the way we had it. You can see you can
still write SQL queries with everything being lower case and without the semicolon. However, it's usually
good practice to stick with the
semicolon at the end, as well as to write all SQL
commands in capital letters. Because in that way, it's much
easier to read the query. It's easy to distinguish what
is part of an SQL query, and what is maybe a name that occurs in a table in
our database that has a similar or identical name to something that you would
have in an SQL query. And of course, we can also create a second database. Let's create another database here called our_second_database, using some underscores in here to make the name more readable. If we go ahead and run this, then you'll see here now we have two databases in our template one data source. You can see we can go
into each of these, and each of these is
a separate database. Of course, we can remove
these again like this, and then we can also remove our first database by using
the appropriate name. Then again, calling the
drop database statement.
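Just to recap this lesson in one place, here is roughly what the statements look like written out, using the example names we picked (yours may of course differ):

CREATE DATABASE ourdb;
CREATE DATABASE our_second_database;

-- Removing them again is the same statement with CREATE swapped for DROP:
DROP DATABASE our_second_database;
DROP DATABASE ourdb;

As we saw, lowercase keywords and a missing semicolon would also work, but the capitalized, semicolon-terminated form is the more readable convention.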
10. Creating Schemas: All right. Now that we know how to create and
drop databases, let's take a look at how we
can create and drop schemas, since that's where our tables are going to go. From our last lesson, we were able to create ourdb here. But sometimes you may actually run into an error. For example, if I type in here drop database, and I want to drop ourdb, it's going to give me an
error because I'm already connected to the
currently open database. Because I'm connected to it, I can't actually drop it. So one thing that you can do to change this
would for example, go into a different
option that we have here. We see we have ourdb, which is going to be
in our template one, but we also have different
options to go into. The other thing that
we can do if we really want to drop
a database is, we can create a new database, for example our_db. We can create that, and then we have to change our connection, which we can do by going into here and clicking into our_db. Now, I guess, these
names are not particularly useful when
you're saying it out loud, but now we have two databases. One of them has an
underscore in it, which is the one that we're
currently connected to, as we can see here in
the top right corner. The other one does not
have an underscore in it, and it's not the one
that we're connected to in our current session here, because again,
that's not the one we have up here in
our top right corner. This here is going to
be data grip specific, but it's likely that you'll have similar UI indications if you're using different
SQL editors. If we want to, we
can, for example, now drop ourdb, since we're now connected to the database that has
an underscore in it. If we want to drop the database that we were
connected to before, we can now call drop database with the name, in this case the one without the underscore, and run it. Now we can see that the
database has been removed. One thing that you'll notice is if we go into the schema here, we actually can't
see any schemas. However, if we're looking
closely at our database symbol here, we actually see that we're connected to our_db, which is the database, and public, which is the schema. If this is the case, you'll
likely get this notification, which I'm also
seeing up here from data grip where we can just
click on introspect schema. Now if we go to our schema, we can see it's
available to us. We actually have the
public schema available. But how can we create a new schema? Well, to do that, we
type create schema, and then we just
type in the name just like we did
for the database. For example, here,
this one we can call our schema, and
we can run this. For me, I've already created it, so I'm getting an error
that already exists. Let's see how we can drop it. We can type drop
schema, our schema. Then if we run this, we can see how the
schema gets removed. Then if we want,
we can create it again by typing create schema, and then it will create
another schema for us. Now of course, the schema name doesn't have
to be like this. It can be whatever name
is most appropriate. If we want, we can
drop this again. It's usually okay to just stick with the public schema. There's
nothing wrong with it. If you already have
a public schema which should be
there by default, then you can also
just stick with a public schema and
not worry too much. But if you do specifically
need or want to create extra schemas to
then group tables into, then of course, you now know how and you have the
option to do it.
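As a quick sketch, the schema statements from this lesson look like this (our_schema is just the example name we used):

-- The public schema is there by default; extra schemas are optional.
CREATE SCHEMA our_schema;
DROP SCHEMA our_schema;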
11. Creating Tables: Now let's learn
how we can create tables in our
database schemas. To do this, the syntax
we're going to use is going to be create this time table, and we're going to
have our table name. For example, here we can use the user info like we also
saw in the previous lessons. Now, there's some extra stuff
that we need to do here. We want to open and
close parentheses here, and now we can put
in the actual names of the columns in our table as well as what data
type they carry. To make all of this look a little bit nicer, I'm going to spread this
over several lines. I'm also again going
to put a semi colon at the end here just to make sure I don't forget it later. Even though as
you've seen before, it works without it, it's just good practice
to have it there. Let's have our first column, the ID, which is going
to be just an integer. Here, let's have this
our primary key. Now to create a next column, we're going to put a comma
here to separate this. Then our next column, let's
have this be the name. Let's have this be a varchar with a limit of 25 characters. The next one we can have be joined, and let's have this be a date. We can add their email if we want to, and this one can, for example, be a text field. If we run this, we
can see here now our user info table has been created inside of
the public schema. We can see here
also if we go in, we also have information
about the primary key, as well as the columns that
are inside or the attributes, as well as the data types to each column reach
attribute that we have. How can we remove a table? Well, that's going
to be much easier. We're just going to
type drop table. Then we're going to put
in here the table name. For example, user info, semicolon, hit run. There we go. Don't forget to click run and then just wait, and you can see here now our
table has been removed. What if we have
multiple schemas? Let's create a second schema, and let's call this
one our schema again. Then let's run this. Now
we have two schemas. I'm just going to go
back for a bit here so that we get our code
that we had beforehand. What happens now if we run our create table statement?
Well, let's try it out. If we run our create
table statement, we see here now we're putting
it into the public schema, which is also the one that
we're connected to here. There are different options
that we have available. One of them is we can change
our connection up here. For example, connecting
to our schema, instead of the public one, and we can see here now our
connection has been changed. Here we can again create
the user info table. This time, it's going to
be created in our schema. Sometimes you may not have the luxury of switching
around like this. Sometimes it may be different. In other cases, if you
need to specify a schema, you can also add it by putting the Schema name in
front of the actual table. Let's create another table
here called user Info two. Let me fix this typo here. You can see even though I'm
connected to our schema, which is the schema here. I've created a user info two table in the public
schema as we can see here. Of course, if I remove
this prefix here, and I rerun this,
then it's going to be created in the schema that we're currently
connected to. You can see if you're dealing
with multiple schemas, you can then create either switching to them and creating
tables directly in them, or if you want, you can
reference the schema beforehand to then create
tables in there. So now that we're connected to our schema, let's go ahead and switch back to the public one by going into our_db, going into public, and selecting it. What happens now if we want to drop schema and we want to drop the new
schema that we've created? Well, now we're
going to run into an issue because we have
tables created inside. One thing that we can
do is we can go on and we can remove
every table inside, and then we can
delete the schema, or, as DataGrip suggests to us, we can use something that's
called cascade at the end. We can type cascade here. Then it will basically just
perform this operation, but it will continue
and it will do the appropriate cleanup
also inside in this case. If we run this now, not only is our schema
going to be removed, but also all the tables
contained inside. Now to clean this up even more. Let's also go ahead and drop the user info
two table so that we don't have weird
table names that have all of the same columns
that are conflicting. Let's go ahead and
drop this table two. Let's also take a look at how we can create the other
two tables that we saw in the lesson just so that we can also look
at foreign keys. Here we can again
have create table, and let's create our
event info table, event underscore info. Here we're going to have
two columns inside. One of them is going
to be ID, which again can be an integer, and let's make this
our primary key. Then we're also going
to have the name, which we can have a
text or a varchar, or whatever we like, whatever we think
is appropriate. Now, one thing I want to do is I'm going to make these names capital instead of lower case just to show you
what happens here. If we run this and we go
into our event info table, we can see here
that even though we created them with
uppercase names, when we go into the table, we still only have
lower case names here. That's also why
it's so nice to use these underscores because even if these conversions happen, it still makes it easy to read. Let's go ahead and also
create the event log table. Here we're going to
have our event key, which is going to be serial so that it's auto incrementing. Then let's just keep our two IDs here and not worry too much
about the other stuff. We're going to have an event ID, which is going to be an integer. Then we're also going
to have a user ID, which is also going
to be an integer. Then if we want, we can
also add, for example, the time column to it, or we
can call it upper case time. This we can make a timestamp, let's make this upper case. We're not going to
add time zones on here; if we're using this, we can just use UTC, and that's going to save some storage space for us. Now that we have
our four columns, how can we reference
the foreign keys? To do this, we're going
to go on a new line, and we're going to
say, foreign key, and you may already remember this from the
lessons beforehand. We're going to open
and close parentheses. We're going to put
in here the name that we want to reference
from our table, which is going to
be the event ID. We're going to say references, and then we're going to put in here our table name, for example event info. Then here we're going to put the column name that
we want to reference. In this case, the ID. Our event ID is going to
be a foreign key that references the ID column
in the event info table, and doing the same thing
for the user info, we're going to create
the foreign key, and it's going to
be our user ID. It's going to say references
the user info table, here it's going to
reference the ID column. If we go ahead and run this, and let's go ahead and
take a look at our tables. Now we also have the
event log table. We can see here, we've
got the primary key, but we also have foreign
key references in here.
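Putting this lesson together, the three tables we just built look roughly like this. This is a sketch using the column names and types we chose along the way; adjust them to whatever is appropriate for you:

CREATE TABLE user_info (
    id     INTEGER PRIMARY KEY,
    name   VARCHAR(25),
    joined DATE,
    email  TEXT
);

CREATE TABLE event_info (
    id   INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE event_log (
    event_key SERIAL PRIMARY KEY,   -- serial, so it auto increments
    event_id  INTEGER,
    user_id   INTEGER,
    time      TIMESTAMP,            -- no time zone stored; we just use UTC
    FOREIGN KEY (event_id) REFERENCES event_info (id),
    FOREIGN KEY (user_id) REFERENCES user_info (id)
);

-- And if a schema still has tables inside, dropping it needs cascade:
DROP SCHEMA our_schema CASCADE;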
12. Altering Tables: Now, it's likely that over time your database
requirements may change. You may want to alter tables, you may want to
add stuff to them, remove them, or even
change stuff inside. To do this, we can use
the alter table command. We're then going to put the name of the table that
we want to alter, and then what exactly we
want to alter about it. Let's see some examples. Let's say that we want to also add region to our user info. We say, we have their
ID, their name, when they joined, and their email, but we don't actually know what region they're in, what part of the world they're in. We're going to say
we're going to alter our user info table. In this case, we want
to add a column. We're going to type
add. Here we're going to have the column name, which is going to be region, and then we want to have
the column data type. This we can say,
for example, again, let's have this a varchar, maybe of size 20 or whatever
we deem appropriate. Then, of course, we also
want to make sure to put a semi colon at the end here
just for good practice. Then if we run this and just keeping a look at
the user info table, we see now we have the
region column added to it. Let's say at some
point, our table is getting too big
and we decide, the region column is
actually not that important. Then we can use instead of
the command drop column. Now we're just going to put the column name that
we want to drop. We want to drop
the region column. You can click or run here, and then that's going to remove it from our user info table. Let's say at some point, we realize that our ID
column is getting too big, the integers in here
are getting too big, and we need to modify our column so that we can take care of larger IDs because our user base is
growing very large. What we can do to
do that is we can create or we can
say alter table, we want to alter the
user info table. Now we want to alter column. We want to alter the ID column, we want to change type
of the ID column to, for example, a bigint. We can run this. Then
you can see here our type from integer is now going to be updated
to a bigint type. If we want, we can
also revert this. You'll notice here I can use the lower case as well as
the upper case version. This way, I can modify or alter columns that are
already available. Now, there's a lot more
alterations that you can do that are really
situation dependent. Depending on what
situation you're in, maybe you want to update your database
or you need to update your database and you want to
alter or modify something. The easiest thing to do is just do a quick Google search and ask, how do I turn this into this, and then it'll be the exact
same syntax that we had here. Either it will be
something that you've already seen, for example, altering column types or
adding or removing columns, or something very
similar, for example, modifying a column or
changing its constraints.
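In sketch form, the alterations from this lesson, with the same example names:

ALTER TABLE user_info ADD region VARCHAR(20);        -- add a column
ALTER TABLE user_info DROP COLUMN region;            -- remove it again
ALTER TABLE user_info ALTER COLUMN id TYPE BIGINT;   -- widen the ID type

For anything else, the pattern stays the same: ALTER TABLE, the table name, and then what you want changed about it.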
13. Enumerated Types: So let's take a
look at how we can create enumerated data types because sometimes
we want to have, or there naturally is, an ordering to things, but it's hard to represent that to a computer. So an easy example of this is let's take
the days of the week, and let's say there's
a specific ordering, and we start with Monday
being the earliest day, and then we can go Tuesday,
Wednesday, Thursday, Friday, Saturday, Sunday, or
maybe you want to start with Sunday and go to Saturday or whatever order
you want to use. But maybe you want to say
something like I only want to look for cases that
happen after Wednesday. So between Wednesday and
Saturday or something like that. Using that ordering
is difficult because it's not inherent to
the words itself. We know it because we understand the meaning of days of the week, but a computer, of course,
doesn't know this. One thing that we can do in SQL is create an
enumerated data type, and in there we can then specify the hierarchical ordering between these
different instances. For example, to do this, we can create a type, and we're going to have
our type name here. This one we can
call, for example, the day of week. Then we're going to
say here AS ENUM, which stands for enumerated, open and close
parentheses and then have a semicolon at the end
here just for good practice. Now we can put in here the ordering of the different types. For example, let's say
we first have Monday, then we have Tuesday. Then we have Wednesday and then let's continue this on the next line to make
it more readable. Then we're going
to have Thursday. Then we're going to have
Friday Saturday and Sunday. Now we've created a new
type called day of week, which is an
enumerated data type, and the ordering between
them is as we see it here. What we can do, for example, is we can now use this type
to create another table. Let's write create table. Then here, let's just call this test table. That way we don't have to worry about a name. Let's use this table to keep track of, for
example attendance. Although I guess a better name now would have been attendance, but whatever. Let's
leave it as this. What we can keep track
of here, for example, is a student ID, Let's just have
this be an integer. Then we're also going to, for example, we can use
our enumerator data type, so we can call this
here to be weekday, which is going to use our
day of week data type, and then we still
need a primary key. Let's just call this some key, and this is going to be serial, and it's going to
be our primary key. If we run this, right here, it's
actually giving us the option of which
statements we want to run. If we, for example, only run the create table statement, we're going to get an error because we haven't created the enumerated type yet. And this is probably because we have the green box here which indicates to us that we're going to run
this query section here. Let's run this again
and this time, selecting the whole statement. Then if we click Run or
selecting this and running it, we can see now it works because it's created
our type for us. We can see here if we
go into our test table, we now have a custom data
type called day of week, which as we can see here,
we created up here. The important thing here
is also that we have the semicolon at the
end here because otherwise it would have tried to chain these commands together. In this way, now we have
different SQL queries, and it's important to just
separate different SQL queries with semicolons to make sure they don't get chained together, but happen sequentially. Even if some queries or commands may happen to run fine without it, other times you're going to run into issues, so it's always good practice to end your queries with a semicolon. In that way, the query editor will know that this is
the end of the query. Then the next thing
that will come will be the start
of a new query.
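Here is the whole thing from this lesson in one place, as a sketch; note the semicolon separating the two statements:

CREATE TYPE day_of_week AS ENUM (
    'Monday', 'Tuesday', 'Wednesday',
    'Thursday', 'Friday', 'Saturday', 'Sunday'
);

CREATE TABLE test_table (
    student_id INTEGER,
    weekday    day_of_week,        -- our custom enumerated type
    some_key   SERIAL PRIMARY KEY
);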
14. Inserting Values: So now that we know how to
create and remove tables, as well as also create our
custom enumerated types and create tables with them. Let's learn how we can
insert data into our tables. To do that, we're going
to use insert into. Then we're going to have
here the table name. For example, let's
use our test table, and then optionally, and we'll get back
to this in a second. We can put the
column order here, but let's skip this for now. Then we're going to use
the keyword values. Then we're going
to open and close parentheses, and again, for good practice,
putting the semicolon at the end of the query here. Now we're going to
put in the values that we want to insert. Now by default, the
values are going to be in the order that we've
also put them here. For example, we have a
student ID and a weekday. Because our key here
is actually a serial, we don't need to
provide a value for it. Let's say our student ID
is going to be one and the weekday value is
going to be Monday. Then we can run this and
then it's going to insert. It's run the whole statement. It's going to insert
into our test table, the values one and Monday.
How can we see this? Well, we can just
double click on our table here and that's going to open it
up in a new tab. Here we can see we've
got the student ID one. We've got the weekday as Monday, and we also have our key here, which was auto created for us. Now we can also specify the
column order here if we want. For example, maybe we want
to insert the weekday first, then afterwards, we want
to insert the student ID. So that way, we have to
change the order here. Let's say for Tuesday, we again have student one. But this time, notice that the ordering of the
columns is different. We've chosen or we've chosen our custom ordering that we want to use to insert
into this table. But it's important that we
follow this custom ordering. Otherwise, the table is going to assume we use the
default ordering, and that's also how it's
going to insert the values. One more important thing
that I also want to point out is that
it's important here that we use single
quotation marks that we don't use
double quotation marks. I'll explain why in a second. Let's go ahead and run this. We're going to run
the whole query. It's going to insert
into our table here. Let me close it and reopen it. It's going to insert
into our table. Again, we have our
student ID one. We have the weekday
now is Tuesday, which is exactly what
we provided here. Now the reason we
don't want to use double quotation
marks is because double quotation marks are
used for column names. Sometimes, we may have column names that may clash
with something else in SQL. To specify that
something is a column, we can also put double
quotation marks around it. For example, let's
use Wednesday here. Let's here use the
double quotation marks to specify that
these are columns. Again, this is going to
be for student ID one. Again, because this is how we're specifying the column
order right now. If we go ahead and run that, and currently it's
throwing me an error that my spelling of Wednesday is incorrect. Let's take a look at our day of week data type and, I see, there was an extra space here when we created the value. Well, let's just leave it as it is for now. We just need to put an extra space in front. Here, of course, you can
then also see what happens, and this is actually what
I was going to cover next. But you can see what happens
if you put in here a value into an enumerated
data type that doesn't belong to the
data type itself. Using now the
appropriate formating, let's go ahead and run
this one more time, and then take a
look at our table. Okay. Take a look
at a table here. If we want, we can
close and reopen it. Something else that we
can do is we can just hit the refresh button here and that's going to refresh our
table as we can see here, we can see that when we
go back to our console, using the double
quotation marks here. Again, let's specify the column. Whereas using a single
quotation mark, let's specify the value
that we want to insert. If we put in here double
quotation marks, for example, let's use Thursday here, but in this case, we're going to use
double quotation marks. We can already see by the color coding that it's going to fail. And let's close this. But let's try it anyway. Let's go ahead and run this and run so we can see
that it says here, the column Thursday
does not exist. Again, double quotation
marks are four columns. Single quotation marks
are for text strings. If we go ahead and run this now, inserting it here and
just reloading our page, we can see here now again, we've inserted our data here. What happens if
we want to update our type that we're using since we've now messed up
our type a little bit. Now what we can do
for this is just taking a look at our
types again here. The syntax that we're going
to use, we're going to say, alter type, and we're going
to alter our day of week. We're going to say rename value, and we're going to rename the ' Wednesday' with the extra space to 'Wednesday'. Semicolon here, run, and let's take a look at our type, and actually let's
go ahead and refresh it. So we can see here now our
type has been updated. I notice the same
mistake here on Sunday. Let's go ahead and do
that also for Sunday. We're going to rename the mistyped value to 'Sunday'. Let's go ahead and run this. Go back into our types here, reloading, and we see here Sunday has
now also been fixed. If we take a look at
our database refresh, we can actually see
that our database value has also been
appropriately updated. Now we've seen how we can
insert data into our database. Also, how we can alter
the type if we want to. Of course, another option is if you're just starting fresh, if you have a completely
clean database, you can also just
remove the table and just create the type from scratch and then continue
on with the correct type. Of course, there are
different options for solving the same problem.
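To summarize the inserts from this lesson as a sketch:

-- Default column order; some_key is serial, so we skip it:
INSERT INTO test_table VALUES (1, 'Monday');

-- Our own column order, values in single quotes:
INSERT INTO test_table (weekday, student_id) VALUES ('Tuesday', 1);

-- Fixing the enum value that was created with a stray leading space:
ALTER TYPE day_of_week RENAME VALUE ' Wednesday' TO 'Wednesday';

Remember: single quotation marks for text values, double quotation marks for column names.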
15. Running an SQL Script: Now, as you can
see, inserting data into our tables can be
a little bit tedious, especially because in SQL, we have to insert them
one value at a time. That's usually why data is inserted into SQL tables from the back-end side, where things are set up with data pipelines, and whenever the necessary data comes in or passes through, it's then uploaded into the database. Using the insert into statements is not particularly efficient if you as a user have to do it, because you have to write it all out. It's much more efficient to do that as part of the code in your data pipelines
so that you can then just upload the
data into the database. So most of the time, you're not actually going to
be inserting data. You're going to be using it for querying, to either explore your data, analyze it, or export it to run your own custom analysis, or do other stuff with it. Be aware of that. Now there are some ways where we can
insert data quicker. For example, if we go to our
event info, and let's say, insert into event info, and we want to insert the values and just give an ID of one and a name, for example login. We can run this, or if we want to actually put in multiple values at once, for example, two here is going to be click. Then as you can see
the insert values, we can just separate
by commas and then each row or
each record that we want to insert here is
within its own parentheses. We can, for example,
insert two values. Let's take a look at
our event info table. Currently, it's empty. If we run this and then take a look and refresh our table, we can see we've inserted
two values at once. But even this is not
particularly efficient. Again, most of the time
basically all of the time, this is going to be
done from the back end side as your data passes through the
data pipelines because all of this
can be automated, and that's extremely
great because then your database keeps
getting populated with data as your data pipelines are processing the data
or moving it around. Now, for us, what we're going
to do is we're going to use an open database that we
can play around with. Before we do that,
let's just go ahead and drop all of our tables. We're going to say drop, and let's stick with the
upper case syntax here. We're going to drop
the event info table. Let's go ahead and execute this. In this case, because we've linked our event log to the event info with a foreign key (if you remember how we created the event log, we have a link here to the event info with the foreign key), we're also going to use the cascade statement to do this continuous series of deletions wherever these references are. Running this.
event info table for us. Then we can do the same thing
for the user info, again, cascading the event because we have it still linked
in the event log, and then we can
delete the event log. Here we no longer need to
cascade anything because it's no longer linked to anything. Let's go ahead and remove that. And then we can also
remove the test table. Let's go ahead and remove that. Then we can also close our extra tab here because
it's no longer relevant. We're going to go
back into data grip, and let's just clear
our query editor here to our query console. To get our new data, we're going to go
to this getup page here for the chinook database. Include a link to this
in the materials tab, so you can reference
it from there too. To get this data,
this is going to be an SQL query that we can run, you can just hit download here, and either it's going
to download free directly or as it did for me, it's going to open it in a
second, if I clicked it. It's going to open it
here in a new tab and you can see here all of this is
actually SQL statements. Again, if you have an SQL database integrated into your company, most of the time the data is still not going to be inserted using SQL statements, but rather the data pipelines are going to be connected directly to the SQL tables. But anyway, here all of
this is going to be done for us through
using SQL queries. If you want, you can
also take a look at this to see what
it's actually doing. But if you want to
then use this query, you're going to
right click here. Okay. You're going
to right click, and then you can click saves. Then I've saved it
on the desktop, as you can see already. Just make sure that you take the dot TXT extension away so that we can
use it in a second. And then just hit Save. You can, of course, also save
it wherever you want. I've already done that.
To run this SQL query or the series of SQL
queries that we've now downloaded and add
them into our database. We're going to right
click on our schema here. Then we're going to have
here run SQL script. We're going to choose
this SQL script here. We're going to hit open, and then it's going
to run it for us. We can just wait
until it's finished, which would only take
a couple of seconds. We can see here now
that it's done, it's created new tables for us. We can take a look at each if we want and just
explore them a bit, or you can directly
view them here too. Now we have data that
we can use that's actually much more than we could have inserted
ourselves manually. If you want to reopen the query editor, which I closed by accident, again, we know how to do that. This is how we can then
also run SQL scripts, if we want, which in this case, just lets us create a
practice or lets us use a practice database
that we can then use for extra querying. Again, of course, if
you're going to be working with a company
database or something, the database is already
going to be filled in, and so when you're querying it, you're just going to focus on
either doing your analysis or finding the
necessary data that you can take out to do
further stuff with. Since the uploading
is mainly going to happen in the data pipelines because it's going to be
much more efficient to update all of that automatically as part
of the pipelines.
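For reference, the multi-row insert and the cleanup we did in this lesson look something like this, as a sketch:

-- Several records in one statement, each row in its own parentheses:
INSERT INTO event_info VALUES
    (1, 'login'),
    (2, 'click');

-- Cleanup; cascade also removes the foreign key references pointing here:
DROP TABLE event_info CASCADE;
DROP TABLE user_info CASCADE;
DROP TABLE event_log;
DROP TABLE test_table;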
16. Reading Data: All right. So now that we have our tables available to us, let's now learn how we can actually get data
from our tables. Let's go ahead and just take a look at the first table here, the one on albums called album. If we just take a look here, we see we have several
different columns, several different attributes,
we've got the album ID. We've got the title, as
well as also an artist ID. We have different albums here that are made by
different artists, and each album, of course, also has its own unique ID. How can we get data from here? How can we read the data from this table
using an SQL query? We're going to go
back to our console, making sure that
we're connected to our correct data source here
using that correct console. Now to get data from our table, we're going to say, select. Then we're going to say star,
which means everything. Select all the columns. Then we want to select
from the album. Now, if I hit Enter here
and let it auto complete, you're actually going to see we have quotation
marks around here. We'll talk about
these and we'll also talk about this star here, as well as how we can
modify these things. But first of all, let's run this and see the
results that we get. If we run this, we can see here, these are the results that
we get from our query. We can see we've selected
three different columns. We've got the album
ID, the title, as well as the artist
ID, and we can see here, we can scroll down and selected a bunch of
data or in this case, actually all of our data
from our album table here. Let's first take a look at these quotation marks because a lot of times in SQL queries, you're actually not going to have quotation marks
around your table names. Where does this come
from? Well, if we look at the SQL
script that we ran, we can actually see here the create table
statement that they used, put quotation marks
around the table name, which means we now have to refer to it using quotation marks. If I took the
quotation marks away, you can already see
here it turns red, but we can try to run it anyway. We're going to get
an error because it says the table does not exist. How can we change this because it's not very nice to always have to put quotation marks around it. To update this, we're
again going to use the alter table statement, and we're going to
alter the album, again, putting it in quotation marks here, and then
we're going to say rename to, and now we're just going to call it album
without the quotation marks. Let's go ahead and run this. Now, let's go ahead and write
the same select statement, select star from album. But this time, we're not going to put quotation
marks around here. If we run this, You can see this time we don't
have an issue because we've now renamed
our table to go from the quotation marks to a table name without
quotation marks, which is just a lot
more convenient because we don't have to always put quotation marks
around everything. In some cases, you may
encounter this and that's just because the create
table statement, as we can see here, had quotation marks in the
original create table statement. Let's also talk about
this star here. What we put here is
the columns that we want to select from our table. In this case, we put star, it means select
every single column. But sometimes we don't want to select every single column. What if we just want to select a subset of columns? Let's say we only care about
the album ID and the title. We don't want to
extract the artist. Well, what we can do is we can say we're going to
select the album ID, and we're also going
to select the title. From the album. Now again, you'll see here I need to put
quotation marks around, and again, the auto
fill here from Data Grip has done that for me. Again, that's just because if we go into our SQL query here, then we can see that
the table names were also created
using quotation marks. If we go back, we
can see here again, we need to use quotation marks
around our table names too. If we run this, we
can see here now our updated results no longer
contains the artist ID, which we have here. But instead, we have the album
ID as well as the title. Now we can also, of course, change the ordering here. Instead of having
the album ID first, we can maybe instead
read the title, and then the album ID,
so if we run this, then we can see
here the output of the order of our columns is
also going to be flipped. After the select statement here, we select which columns we want. If we want to select all
columns, we put a star. If we want to select
specific columns, we have to put their name here. Create table statement
does not use quotation marks in
defining the column names, then we also don't
have to use them here. But in this case, they do, which means we also have to
use them within our code. Now, sometimes when you're
just exploring a database or maybe you're writing
a query and you want to see the
results of the query, but you don't want
everything returned because returning
all of the data, especially, so there's a lot of data in your database
is a lot of work. There's an extra statement
that we can put at the end of our query called limit. This limits the number of rows that are returned to us, that will actually show up. For example, if I put limit five, this means we're only going
to see five rows here. It's going to be the
first five rows. If we run this, we
can see here now we only have the first five
rows returned to us. If I do limit two, it's only going to show the
first two rows. If I do limit 50, it's going to give us
the first 50 rows. In that way, you can also manage just how many
values are returned, which you can either use
inside of queries if you want, for some specific custom filtering, or you can also use it when you're testing your queries and you don't want all of the values returned, but you just want to make sure that it's working properly, and so you only want a couple of values; you want to limit
your results to a couple of rows so that you
can take a look at them, make sure it's working
properly or maybe spot some mistakes,
fix your query, try it again, and then
until it's working, and then you can also again, remove the limit statement, if you want to run
it on the full data set or rather if you want
all of the records returned.
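As a sketch, the reads from this lesson. The quoted camel-case column names are an assumption based on how the chinook script creates things, so check yours if they differ:

-- The script created the table as "Album", so we renamed it:
ALTER TABLE "Album" RENAME TO album;

-- Everything, but only the first five rows:
SELECT * FROM album LIMIT 5;

-- Just a subset of columns, in the order we want them back:
SELECT "Title", "AlbumId" FROM album;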
17. Creating Tables From Query Result: So in the previous lesson, we learned about how we can read data from our tables
within our database. But now let's take a look at
how we can actually create new tables from the results because sometimes maybe
you're writing a query, and then you want to
save those results in a separate table so that you
can either quickly access it for visualization
because you've done some data processing or
maybe use it somewhere else. And so there may be, of course, situations where you want to store the results
of your query in just a separate table for further use down the
line. How can we do that? To do that, we're going
to put first here, create table, and then we're going to
have the table name. Oh. Create table, and we're
going to have the table name. Maybe we call this
album results, and then we're going to say as, and then we're going
to put parentheses around our select
statement like this. What that means is
we're going to create a table and we're going to
use the results from here. Now you'll notice we're
actually not defining any of the columns or we're not defining also any of
their data types. All of that's going
to be inferred from the results
of our query here. If we run this, Then just keeping an eye on
the left hand side here. We can see now we have
our album results table, which is as we can see, just the results of our query here and it's
going to be identical to our album table because this is again what we
have returned from here. We can also take a look
inside if we want. We can just take a
look inside here, we can see the
types are inferred. One thing that we do notice is that there's no primary
key defined here. The primary key, as well as potential foreign key references, are not carried over into
our new table here. But we can still
use the results of our queries and then save
them into separate tables so that we can then later
on use those results for whatever it is
specifically that we'd like. Now, as you've seen me
do several times now, I've started expanding
our queries over several lines just to
make it easier to read. Sometimes it's totally okay to run this whole query as one, but in other times
this may not work or it may not be optimal because it will
become really really long. You can, of course, split up your query over
several lines, which will also make
it easier to read. Usually, you want to split
it on specific key points. For example, if you're selecting
from different columns, usually what you want to do
is you also want to have the from statement
on a new line, and we can do some
proper ordering here. We want to create a table and
have the table name here, and then we can put
our query here. If you want to
select for example, specific columns
like the album ID. Then on the next line here, we can have for example, the title so that
this way becomes much more readable and it doesn't extend very long
into the horizontal. What happens if
we run this query again? Running the whole thing, this time we're going to get an error because our table
already exists. What we can do is we can amend our create
table statement here, and we can say, create
table if not exists. Now this works for the create table statement that uses the result of
the select statement, but this also works for our general create table statement. Remember when we had
our create table and then we provided
the table name, and we define each column as well as each
type, as well as primary keys and
foreign key references as well as maybe
other constraints. Also there we can use
create table if not exists. The nice thing about this
is that this will only run the statement or will
only try to create the table if it
doesn't exist already. Then it's not going
to have this problem of clashing with an
already existing table. Now, one more thing
I also want to go through is how we can
access table names. Let's remove this again, and we're just going
to select everything. Whoops, let's go back here. Currently we're doing select
everything from album, and it's inferring
which table we want because we're already in the database as
well as in the schema. But sometimes we may have either several databases
or several schemas. How can we specifically
reference the exact table that we want if we have several databases here
or several schema. We can also provide
the full path to our database or
rather to our table. We can say we want to go from our_db, here we want to choose the public schema, and here we want to
choose the album. If we click Run on this now, you can see it
still fully works. But in this case,
we're referencing the full path to our table. Now this is, of course, useful. Again, if we have several
different schemas or several different
databases contained inside, then we can always make
sure if table names are repetitive that we're
referencing the right tables. I can also do the same thing. For example, let's remove our album results table
that we've created. We're going to say drop table, and we're going
to drop the album results table using the
full reference path. We could also use it just
with this component here, but we can also use the
full reference path. If we run this, and we can watch our table
now gets removed, and we can also see that our
data grip updates it and basically says, this
is not defined. This is an unknown reference. We can see we can use both absolute paths as
well as relative paths, and a lot of this is just
ease of implementation. In our case, for example, when
we have just one database, one schema with a
bunch of tables in it, it's going to be much
easier to just reference the table directly rather
than putting the full path. But as your database grows
or you get more schemas or maybe the company already has several different databases
or several different schemas. Using these absolute
paths can be great because it makes
sure that you're referencing the right tables.
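Sketching this lesson's statements in one place (album_results is just the example name we picked, and the quoted column names are assumed from the chinook script):

-- Store a query result as a new table; column types are inferred:
CREATE TABLE IF NOT EXISTS album_results AS (
    SELECT
        "AlbumId",
        "Title"
    FROM album
);

-- The full path works too, as long as we're connected to our_db:
DROP TABLE our_db.public.album_results;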
18. Aliasing and Ordering: Sometimes when we run queries, or when we're trying to take out specific columns or referencing specific tables, the names can get very long. So one thing that I want to look at now is going to be using aliases for names so that we can shorten the names that we're using in our queries. So let's take a look at the employee table. I've already opened it here. We can scroll through the different columns available to us. And let's just take out three columns here. Let's take out the first name, the last name, and let's also take out their email. So we're going to say select first name, then we're going to select last name, and we're also going to select the email from the employee table. All right. And so we'll just put our semicolon here. And let's again limit our results. But let's also put some structure into here. So our from statement is going to be on a new line, our limit is going to be on a new line, and each of our columns is also going to be on a new line. This will just make it more readable as the query maybe gets larger. And let's also, of course, put a number here when we're limiting our statement. So if we run this, we can see here these are our results. And you'll notice that in this case we still have to put the quotation marks around the table name, because I haven't updated the table name here, and our columns also still require quotation marks. Now, when columns are created without quotation marks, we can still use quotation marks around the column names, but it's not necessary. In this case, it's actually necessary for us because everything was created with the quotation marks. And so if we remember, if we take these away, we're going to run into an issue, because our table was created with quotation marks around it. Nevertheless, using quotation marks around column names is actually pretty nice, because it's much easier to see which ones are column names and which ones are other things. So keeping quotation marks around column names isn't actually too bad, and the same goes for table names, but it does get a little bit tedious to type them all the time. So if you don't want to do that, you can of course, like we saw in the previous lessons, just alter the table name here, and then you don't have to use the quotation marks around it anymore. All right, so how can we add some aliases to our columns? For example, let's say that we want our output column to not be called first name, but rather we want to give it something else. For example, let's say we just want the output column here to be first. And we want our last name here to be called last. And we want the email to stay as email, so we're not going to change that. So if we run this, we can see here now our output is first and last, and our email has stayed the same. So we can see here we can assign aliases to our column names, and that's actually going to change them in the resulting responses that we get. And later on, as the queries get bigger, we can then also reference our columns using the aliases that we assign them here, rather than using the full name. But it's not only for columns that we can assign aliases, we can also do it for tables. So to assign an alias for a table, we'll just put a space here, and then we would just put the alias that we want to use. For example, if we want a shorthand for employee, maybe we use emp. And that means we can also now use emp to reference employee.
So one place where you can use that is, for example, when you're dealing with multiple tables in the same query, and maybe some of the columns are the same across several different tables. You want to make sure that you're referencing the right column from the right table. So just like we saw when we were selecting from a table and we gave the full path to the table, we can do the same thing for the columns. So for example, we can then reference our table in here, and we can say emp dot first name, emp dot last name, and emp dot email. And so if we run this, we see our results are not going to change, but the way that we're writing our query does, because we're now referencing things differently. So we can see here, this is how we give aliases to our column names using the as keyword. But if we want to give an alias to a table name, we just put a space after the table name in the from statement here, and then we put the alias that we want to give to the table name. And this just lets us use this alias throughout our query, so that we're referencing this table rather than having to write out the full table name. Of course, in this case it may seem a little bit unnecessary, just because we can already extract the columns like this. But when you have several different tables in the same query, it can be really nice to make sure that, when there are identical columns across the different tables, you're referencing the right column from the right table. And so this is also how we can use the shorthand for the table that we're using, for example, to access the column attributes inside, to make sure that we're accessing the column from this specific table, since this is a shorthand for this table here. Now another thing that I also want to go over is ordering results. So right now, the way that our results are being returned is just the way that they were inserted. We can see we have the order here, and that's also exactly how they're going to be returned to us here. But sometimes we actually want to order things, either ordering as part of the query and then doing something else later on, or just having the final results ordered. So to do that, we're going to use the order by statement, which we'll add here. In the order by, we can now put in the column names or the column positions; we'll look at both, and then we can order by those. So let's say we first want to order by the first name and then by the last name, and then we want to return our results. And we want to order in ascending order, so that whatever is lowest will be first and then we'll go in increasing order. So to order first by the first name, we can put in here first, which is going to reference, in this case, the column alias that we used up here. Then we're going to order by last. And now we also want to say that both of these should be ordered in ascending order, which is actually going to be the default. But let's run it like this first, and then we'll specifically put the ascending order. So we're running this query, and we can see we are now ordering in ascending first name order first, and then in case there's a tie on the first name, which in this case it doesn't look like there is, it would then use the last name for ordering. To check this, we can also go into our table, which is actually pretty small as you can see, and we can also just click the ordering by first name here.
And we'll see we're going to get the same ordering as we get over here. So again, the order by statement lists the columns we want to order by. First we're going to order by the first column here, which is our alias for the first name column. In case there's a tie, we're then going to use this last column, which is our alias for the last name column. In this case, we don't have any ties, so it doesn't really do anything. And then after that, there's no specific ordering that we're using, so in case there are more ties, it will just default back to the natural order of the data within our table. Now, to specify explicitly that we want to use ascending order, we can put ASC here, and we can put ASC here. So we can run this again. Now our results are not going to change, because it also defaults to ASC. But you can see that after every column that we have in this order by clause, we can also specify the ordering that we want to use, either ascending or descending. So if we want to order first by the first name in descending order, and then in case of ties use the last name in ascending order, we can specify descending like this, and then we can leave the last name as ascending like this. So if we run this, now our results are going to change, because we're using the first name first, and we're ordering in descending order, which means the highest first; for alphabetical ordering, larger means closer to Z. And then in case of ties, of which currently there are none, we're going to use the last name. So in this case again, the last name is not being used, but the ordering is going to go first by first name, descending, so highest first and lowest at the very bottom, and in case of ties, it would sort those rows that tie on the first name by the last name, but here using ascending order. And so as you can see, we can specify the specific ordering that we want to use for every single column that we want to order by. And then we can also order by, for example, the email column here, if we want to. We can actually also order by columns that we're not selecting, as long as they're in the original table in this from statement. So even if we don't select them out for visualization, we can actually still order by them, because they're in the original table. So we can order here, for example, let's take the birth date. We can order by birth date, and it's going to default to ascending order. So we can run this. Now, we can see that we don't actually have the birth date column in our results, so we can't really verify this, but it's still ordering by the birth date, even if we're not using it in our final select statement here. Finally, another syntax option we have is what we put instead of the column names or aliases. For example, here we have the column name, and previously we used the column alias first; we could have just as well used FirstName, the true column name here, and you can see both of these would work. We can also use first like we did, putting this in descending order now, and then using last, and let's keep this in ascending order. So here we're using the aliases, or we can use the names, and let's just run this again to show it. But something else that we can also do is use the number of the column, meaning the position in which it appears in our select statement. So let's say we want to order first by the first name here, and then second by our last name.
We can actually also use the one, which means we're first going to order by the first column in our select statement, and then two means the second column in our select statement. So this is again just another shorthand that we can use in case column names get too long, so we can write our code neater, cleaner, and quicker. We can run this, and we can see we're going to get the same results, because nothing changes. We can have a one here, which is going to reference the first column that we're selecting here; we can put the alias here, which is the alias that we've assigned to this column, so we can run this and see; or we can also put the full column name. All of these are equivalent. Of course, using the alias, in my opinion, is kind of the easiest, because that's why we're assigning aliases. We can of course also use the full column name or the number version of it, depending on your preference. The numbers, I think, are a nice shorthand, but sometimes they can be confusing, because then you have to look back up and see what is the first column and what is the second column. And in case you're selecting something like 10 columns, you actually have to start counting to make sure that you're referencing the right column. So they can be nice for shorthand, but of course they all have their trade-offs. Personally, I prefer the alias the most, but really it's up to you, whatever lets you make the most readable queries. Now, as we can also see here by the ordering, we can chain on additional order bys. So for example, we can say, okay, we want to order first by the first column here, then by the second column here, since we ordered by those two, and then we want to order, for example, by the birth date, which isn't even included in our select columns here, and we can put this here, and let's say here we want to order descending. And then we want to maybe order by the email, so we can put a three here if we want, or we can put email, whichever we want. So we can see we can chain different orderings together, just separating everything by commas. The ordering it's going to use first is this one; on a tie, it's going to use this; on a tie for both of these, it's going to use this; on a tie for all three of these, it's going to use this; and so on. And so you can see, for each column, we can also specify whether we want descending or ascending order.
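Putting the aliasing and ordering together, the final query from this lesson looks roughly like this. The quoted camel-case table and column names are assumptions based on how the chinook script creates things, so adjust them to match your own database:

SELECT
    emp."FirstName" AS first,
    emp."LastName"  AS last,
    emp."Email"
FROM "Employee" emp
ORDER BY
    1 DESC,               -- same as: first DESC, or emp."FirstName" DESC
    2 ASC,                -- same as: last ASC
    emp."BirthDate" DESC, -- not selected, but still usable for ordering
    3 ASC                 -- the email column
LIMIT 5;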
19. Conditional Filtering Direct Comparison: Now, we've just seen how we can read data
from our database. But of course, reading the full database
or cutting it off after a certain
number or limit of rows is really not ideal. We'd ideally like to
do more with SQL than just read all of the
data from our database. The first thing that we're
going to learn to do next is filtering. Let's take a look at
this track table here. If we scroll through
it, we can see we have different columns here, we've got the track
ID, the track name, the album ID, the media type ID, genre ID, the composers, the time in milliseconds
of the song, the byte size of the song, as well as the price for
the individual songs. Let's try to take this
table and let's try to find all of the songs
composed by AC/DC. To do that, we're going to say select, and we're going
to say just star, from the track table,
hit enter here. Again, we have the
quotation marks. Again, because that's how
the table was created with the quotation marks
around it and then let's just go ahead and limit our
results first to five rows just in case so that we can do checking before
we return the whole table. Now, to filter specifically for something or to do
filtering in general, we're going to have an extra
statement here called where. Here we can say, what exactly
we want to filter for. For example, we
want to currently filter for where composer, and we want to find where the
composer is equal to AC/DC. How do we do this comparison? Well, strings, or text in general, are going to be put into single quotations like we have here. If we go to our database here, we can see that AC/DC is written in all capitals with a forward slash. So we're going to say where composer is equal to 'AC/DC'. Notice here, we have
a single equal sign. In case you know some other
programming language: there, it's often customary to use double equal signs, but in SQL we have a single equal sign. Let's go ahead and
run this query. We can see here all of our returned results are composed by AC/DC. Then if we want, we can
also turn off this limit here and get the full list of rows or the full
list of records. We can see every single record here fulfills this
AC/DC criterion.
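Written out in full, the query looks roughly like this, a sketch assuming the quoted Chinook-style table and column names used in this course:

    SELECT *
    FROM "track"
    WHERE "Composer" = 'AC/DC'
    LIMIT 5;  -- drop the LIMIT to get every matching record

Note the single equals sign and the single quotes around the text value.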
Now we can also do other types of comparisons. For example, something
else that we can do is less than or equal to. Now, of course, it's not
going to make a lot of sense to compare less
than or equal to with a name because then we're
doing weird text sorting, which is going to be mainly
based on alphabetical stuff. It's not going to make
that much sense here. Let's use a different column. Use the milliseconds
column here. Let's say, where
the milliseconds is less than or equal to, and let's pick 200,000 milliseconds, 200 seconds. We're selecting all the columns from the track table, keeping the records that fulfill this criterion, where the milliseconds column is
less than or equal to 200,000. Let's go ahead and limit
our results again. So we're going to run this. Now we can see, if we
scroll to the right here, all of our tracks have a time of at most 200,000 milliseconds, at most 200 seconds. Again, we can take away the limit statement and run this on the full table to get all the values returned, and we can see all of them here have times of at most 200,000 milliseconds. We can also do a greater
than or equal to, to get everything that's at least 200,000. Running this, we get everything that has the milliseconds at or above 200,000 milliseconds, as we can see here. We can also just take
the equals away and just say greater than, or we can also do just less than, and so we can see we can also do specific
filtering like that. The main difference, of course, using less than or
less than or equal to, is that the equal-to also includes
equal to 200,000, whereas the less than
would only include those instances that
are less than 200,000. For example, here we'd
have 199,999. That would be the highest value we could include, assuming that everything is an integer here, because anything above that would be equal to 200,000, and that would only be included in the less
than or equal case, or when we check for equality, or when we check for
greater than or equal.
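As a quick sketch of the comparison operators from this lesson, all on the same assumed "track" table:

    SELECT * FROM "track" WHERE "Milliseconds" <= 200000;  -- at most 200 seconds
    SELECT * FROM "track" WHERE "Milliseconds" >= 200000;  -- at least 200 seconds
    SELECT * FROM "track" WHERE "Milliseconds" <  200000;  -- strictly under the boundary
    SELECT * FROM "track" WHERE "Milliseconds" >  200000;  -- strictly over the boundary

The versions with the equals sign include rows that sit exactly on the boundary; the strict versions exclude them.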
20. Conditional Filtering Group Comparison: So we've just seen how we can do some basic filtering by looking for equality, or looking for less than or greater than, or less than or equal to or greater than or equal to. But what about when we want to find, for example here on the composer column, all of the records where Angus Young is one of the composers? Currently we can't do that, because if we look for where composer is equal to 'Angus Young', it's not going to return those records, since the value is not exactly equal to it; there are also other components in there. And so currently we can't use equality to find all those records, but there are other things that we can do. So let's take a look at that now. First of all, let's write our select statement again: select all the columns from the track table, and let's limit our results for now. And again, here we're going to have the where statement. Now we want to find where Angus Young appears in the composer column. What we can do is use the LIKE keyword. So we're going to say where the composer is LIKE, and then we have some text in here, so we can say 'Angus Young'. But written like this, we'd still be looking for exact matches. There are some special characters that we can put in that basically give us some room to expand. If we use a percent at the end, that means we can have anything after this point. And a percent in front means we can also have anything in front, any number of characters. But the important thing is that somewhere within, we match exactly Angus, space, Young, like this. If we took away the percent in front, the row would have to start with Angus space Young, and after that we could have anything. And if we only had the percent in the front, then the value would basically have to end with Angus space Young. So to be safe, we can use it like this, with a percent on both sides, which means there can be anything in front and anything behind, but somewhere within our record, we need to see exactly this piece of text here. So let's go ahead and run this. And we can see all of the rows here contain Angus Young. Let's take away our limit statement and rerun this, and scrolling through our results, we can see now we have more results, and they all contain Angus Young. In fact, in all of these cases, Angus Young is the first part of the entry, so we could technically also just take away the first percent symbol here, and we would still get the same result, because nothing is ever in front, at least in this case. But we couldn't be sure of that beforehand, and if we want to stay safe, we can still leave it here, just in case later on a record is entered where Angus Young is not the first name inside. But what happens if we take away this last percent symbol here? Well, now we're not going to get anything returned, because none of the records end with Angus Young, and we no longer leave the option to have other characters afterwards. So in this case we need to make sure we leave room for other characters afterwards, because while the Angus Young string is contained inside, there are also other values contained within the same element here. Now, there's also a different wildcard character that we can use.
Instead of using the percent, which basically leaves room for any number of characters, we can instead put something like an underscore. The underscore means that there can be any one character here. So for example, let's replace this y with an underscore. And if we run this, we see we're still going to get the same result, because we're looking for Angus, space, and then this underscore here means that any one character can be here; not two, it has to be exactly one. And in this case, we have the y here, and then we have the o, u, n and g. If we took away the space here, it would no longer work, because now we no longer have exactly one free character there. But we could put a percent symbol there instead, which means that any number of characters can be in between; we just need to have Angus at some point, and then the rest afterwards. So you can see there are different options for us. One of them is the underscore, which gives us the option of exactly one free character, and then we have the percents, which give us the option of having as many characters there as we like. Now, while we're doing all of this, the capitalization is actually very important. If we changed this to be lowercase, it would no longer work, because we no longer have exactly this match. What we can do, though, is use a different keyword called ILIKE, which is insensitive to cases. In this case, it doesn't care if there are uppercase or lowercase letters; it just looks at whether we have this text match without considering cases, it is case insensitive. And so we can see in this case it works again, because we still have the text here, but this time, using the ILIKE, we're matching case-insensitively; we're ignoring whether everything is upper or lowercase and just getting the match where the text is the same. And you can see we can capitalize the G here, for example, and we'll still get the same match, because the matching that we're doing from here to here is case insensitive. Now, in some cases we may not be looking for text matches like this, but rather for a certain set of values. For example, let's take a look back at our tracks here. Maybe something that we want is all of the tracks where the album ID is either one or four. So we're looking for two cases: we want all the tracks that are from album ID 1 or album ID 4. So how can we do that? Well, first of all, we're going to change our column here again. We're going to say where album ID, and again, we have to use the quotation marks here, because our create table defined the columns with quotation marks. Usually, like I said, the quotation marks are optional, but in this case, because our table was defined with the quotation marks, we have to use them. And now we're looking for the cases where the album ID is IN, and then the album IDs that we're looking for are 1 and 4. So we're saying we're looking for all the columns from the track table, and all of the records where the album ID column has a value in one or four. So if we run this, we can see here now all of the records that we have returned have album ID either one or four. And we can also do, for example, one, two and four. And so as you can see, we can expand this list as much as we like.
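Here's a rough sketch of the pattern matching and group comparison we've built up so far, on the same assumed "track" columns:

    SELECT * FROM "track" WHERE "Composer" LIKE '%Angus Young%';   -- anywhere in the text, case-sensitive
    SELECT * FROM "track" WHERE "Composer" LIKE 'Angus Youn_%';    -- _ stands in for exactly one character
    SELECT * FROM "track" WHERE "Composer" ILIKE '%angus young%';  -- case-insensitive match
    SELECT * FROM "track" WHERE "AlbumId" IN (1, 4);               -- value is any one of the listed ones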
But then there are also some cases where we maybe want to pick a range of numbers. So for example, let's say we want to find all the tracks with album ID one to five. We could do 1, 2, 3, 4, 5, like this, and run it, which will give us our results. Now, this works, but it's not optimal, because we have to type all of these out. It still works because we only have five numbers here, but imagine we had 500 or even more, or imagine we were looking for date ranges; none of that would really work anymore. And there's a better way that we can do this. One of those ways we already saw beforehand, which is using greater than or equals and less than or equals and then chaining conditions together, which is what we'll learn soon. But there's actually another keyword that we can use called BETWEEN. So we can say we're looking for the cases where the album ID is between 1, no parentheses here, and 5. So again, we're going to select all the columns from the track table, and we're going to filter for the records, the rows here, that have an album ID that's between 1 and 5. So running this, we can see all of our values here are between 1 and 5, inclusive on both ends. And now, as you may have already guessed, perhaps because of the suggestion that we got from our SQL editor here, we can also do NOT BETWEEN, which means that we're only going to get the records where the album ID is not between 1 and 5. So if we run this, we see everything here has values not between 1 and 5. And this NOT keyword also works for the other things that we looked at, for example NOT LIKE, and then here we can put AC/DC: any place where the composer is not like this text here, which is going to give us all of the records where the composer is not this AC/DC string.
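And the range and negated versions, sketched out the same way:

    SELECT * FROM "track" WHERE "AlbumId" BETWEEN 1 AND 5;      -- inclusive on both ends
    SELECT * FROM "track" WHERE "AlbumId" NOT BETWEEN 1 AND 5;
    SELECT * FROM "track" WHERE "Composer" NOT LIKE '%AC/DC%';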
21. Joining Conditionals: So now we already know a good amount of filtering. But sometimes even the filtering that we currently know is not enough. Sometimes we need to do more; sometimes we need to join specific filtering conditions together to really get the results that we want. So how can we do that? Well, let's work at it with an example. Let's try to find all of the records here where our composer is equal to AC/DC and the song length is greater than or equal to 250,000 milliseconds. So we're going to say: selecting all columns from the track table, we're going to filter for the rows where the composer, and now we can either say equal to or we can use the LIKE here, but we're going to use equal to because we want exact equality. So we're going to say where the composer is equal to 'AC/DC' and the milliseconds column is greater than or equal to 250,000 milliseconds. So if we run this, we can see now our results are all of the songs by AC/DC in this database where the milliseconds is greater than or equal to 250,000. We can of course also switch this around and look for less than or equal to, which in this case only gives us one record. So we can see we can use the AND keyword here to do some filtering. But let's make this even more complex. Let's say we want to find all of the cases where the track's milliseconds is greater than 250,000, but we only want the songs where the bytes is less than 10 million; that's kind of the order of magnitude that we're working with here. So we can add, if we want, another condition here, say AND where bytes is less than or equal to ten million. And so we can run this, and then we can see we get these specific tracks returned. Alternatively, instead of chaining with AND, we can also use OR statements. So for example, let's start off with all of the songs that are either made by AC/DC or are at least 250,000 milliseconds long. The OR statement here means we're looking to match either of these conditions. So let's go ahead and run this. And we can see here we either have AC/DC, in which case, as we can see here in this row, the milliseconds column can be less than 250,000, or, in the cases where the composer is not AC/DC, the milliseconds is at least equal to or greater than 250,000. So we can see here we have an or statement. Now, what we can also do that's cool is we can continue chaining conditions using parentheses. So for example, let's look for all of the songs where the composer is either AC/DC, or the track is longer than 250,000 milliseconds but its bytes is also less than or equal to 10 million. So we want either AC/DC, or at least 250,000 milliseconds and less than 10 million bytes: either of those two conditions, but the second one has two parts that both need to hold. So what we can do is use parentheses here, and then we can say, and bytes is less than or equal to 10 million. And so what's going to happen is this parenthesized statement is going to be evaluated as one unit, and it's going to be compared to this one. So either we have this one, or we have the result here where both of these conditions need to be fulfilled.
So you can see there's a lot you can do with filtering using the AND and the OR, as well as using parentheses to decide which conditions need to be joined together. Now, currently we just used things like equal-to or greater-than-or-equal-to or less-than-or-equal-to, but this also works just as well if we put here, for example, the LIKE, or if we put an IN or a range; all of it will work just the same. We can use the other keywords that we learned for where statements here, and also continue chaining conditions using the ORs and the ANDs, and putting parentheses around conditions that we want to group together as one unit. And we can also continue nesting parentheses. So for example, in here we can put another set of parentheses and do some more stuff, for example: and bytes is less than 10 million, or, let's think of something, or composer is like, and then actually let's move this to a new line to make this cleaner, and the composer is like Angus Young. Notice that we're using the LIKE here, not the ILIKE, so we need to use the correct casing. So in this case, and let's make sure we close the parentheses too, we're looking for either where the composer is like AC/DC, or where the milliseconds is greater than or equal to 250,000 and either our bytes is less than 10 million or the composer is like Angus Young. So let's run this to see the results. And what we're doing here, let's go backwards: first of all, we have the condition out here, so everything that's AC/DC is going to be included. But then what we have here is, okay, we're either looking for less than 10 million bytes, or the composer needs to have Angus Young in it. And if either of these two is true, we also need the milliseconds to be greater than or equal to 250,000. So for example, we can see in this record here, we have the bytes being greater than 10 million, but our composer is like Angus Young, so this part of the condition is fulfilled, which makes this or statement true; so this one is true. Now we need this one and this one to be true, which means that our milliseconds needs to be greater than or equal to 250,000, which is what we have here. So this one is true, and this one is true. Now we check: either the composer is like AC/DC, which is false, or this whole segment here is true, which it is in this case. So we can see we can build pretty complicated logic using the filtering here, by chaining together our different operations, the LIKE, the equals, the ILIKE, the IN, the BETWEEN, as well as using the ANDs and the ORs, and using parentheses to group together things that we want grouped together.
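The final nested condition from this lesson looks roughly like this when written out; this is a sketch reconstructed from the walkthrough, using the numbers from the video:

    SELECT *
    FROM "track"
    WHERE "Composer" LIKE '%AC/DC%'
       OR ( "Milliseconds" >= 250000
            AND ( "Bytes" < 10000000
                  OR "Composer" LIKE '%Angus Young%' ) );

The innermost parentheses are evaluated first, then the AND, and the outer OR last.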
22. Negating Conditionals: All right, so now we've seen a bunch of filtering where conditions are true. But in some cases we want conditions to not be true. Now, we've kind of seen this already, but let's take another, slightly more detailed look at it. So going back to our track table here, let's first of all just look for all of the cases where the composer is not equal to AC/DC. What we can do here is just select all of our columns from the track table, and where the composer, and now, to say not equals, we can use an exclamation mark followed by an equals sign, and then here we're going to put the text 'AC/DC'. So this is going to give us all of the records where the composer is not equal to AC/DC; as we can see here, none of these records contain AC/DC. And because we have text here, we could also say NOT LIKE 'AC/DC', which means all of the cases where it's not like this text here, which is going to give us the same result in this case. Now we can also do other things, for example NOT BETWEEN, which we've kind of seen. In this case, between doesn't make much sense for the composer column, since it's all text, so let's use the album ID. So we want to find all of the records where the album ID is not between 4 and 6. That's going to give us all of the records where the album ID is not between 4 and 6, so not four, not five, not six, and everything else we're going to get. Or we can also do a NOT IN here, and then put our parentheses, which means that our album ID is not in (4, 6), meaning it's not equal to 4 or 6. So if we run this, we're now also going to get album ID 5, because we're not using the between anymore; now we're specifically excluding the values listed in here. So we can see that we can use the not-equals, like this, to do negation for the equality case, and we can use the NOT keyword for some of the other, more complex where methods that we have. Now you may be wondering, well, what about the opposite of greater than or equal to? How can we negate that? Well, we already have a statement for that, which is the less-than. Anything that's not greater than or equal to is going to be less than, and everything that's not less than is going to be greater than or equal to, and the same thing vice versa. Now, one more thing I want to add here, going back to some of our previous results for the between: in some cases we may actually want to order our results too. So let's look at the results from this query, where our album ID is not between 4 and 6, and let's order everything by our album ID here. We've already learned the ORDER BY statement, but there actually is a specific order to the locations where we place these SQL statements. First the SELECT, and we have the FROM; in this case we're going to have the WHERE, and the ORDER BY is going to come afterwards. So for example, we want to order by album ID in descending order, which means highest first. If we run this, you can see we start with the highest album ID and go towards the lowest. Now, it's important that this ORDER BY comes after the where statement, because if we put it before the where statement, then we're going to get an error, as we can see if we run the whole thing, because it doesn't belong there; that's not the correct order of the SQL syntax. It has to come afterwards. Now also, same thing with the limit statement.
The limit statement has to come at the end. Everything that we've learned for the ORDER BY statement still holds true; we can still do additional ordering by adding commas here. But it's important that the order we put these clauses in is this specific order. It starts with the SELECT and the FROM, then the WHERE statement, then the ORDER BY, and then we have the LIMIT, because if it's in another order, it's not going to work.
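Putting the clause order together in one sketch:

    SELECT *
    FROM "track"
    WHERE "AlbumId" NOT BETWEEN 4 AND 6
    ORDER BY "AlbumId" DESC
    LIMIT 10;

SELECT, FROM, WHERE, ORDER BY, LIMIT: any other arrangement of these clauses is a syntax error.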
23. Filtering Null Values: Now there's one more special case of filtering that we haven't looked at yet. So let's take a look at our track table again here. Some of these instances may have missing values, which are represented here as null values. Currently, there's no real way that we know of to check for this yet, so let's take a look at that now. First of all, select all of our columns from our track table, filtering for the records where the composer, and now we want to find all of the cases where the composer is null here. To do that, we actually have to write IS NULL, and then a semicolon just to finish off our SQL query here, and then hit run. And so now we get all the cases where the value here in the composer column is null. Now, the important thing is that we don't use the equals here, because that is not correct; we want to make sure we use IS NULL. That's the correct statement to use. Now, for negation, it's going to be pretty simple, and you maybe already saw this in the code suggestion: all you have to do is write IS NOT NULL, and that's going to give us all the cases where the composer is not null. And this again is going to work like any other condition that we've had here. In this case, we just have a special keyword to use when there are no values contained inside, but everything else that we've learned about the where statement still holds true.
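In sketch form, the two null checks from this lesson:

    SELECT * FROM "track" WHERE "Composer" IS NULL;
    SELECT * FROM "track" WHERE "Composer" IS NOT NULL;
    -- note: WHERE "Composer" = NULL is not the way to test for missing values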
24. Creating New Columns and Arithmetics: So in this lesson, let's take a look at how we can do some operations on our columns. Let's open up our track table again here. The first thing that we should do is try to convert this bytes column instead to kilobytes. To do that, we're going to divide everything here by 1024. So how can we do that? Well, going back to our console, we're going to write select, and we'll leave star here as a template for now, but we'll come back and change that in a second. We select star from track, and let's also just limit our results to five for now, just as good practice; our table is not too big here, but if we had millions of rows or something, it'd be good to test on and only return a small segment of those rows. All right, so let's pick out, for now, just the bytes column, and let's also convert it to kilobytes. So the first column that we want to take out is going to be bytes, and the second column that we want to take out is again the bytes column, but here we want to divide it by 1024 to make it into kilobytes. To do that, we can actually just use the slash here for division and put in the number 1024, and that's going to divide the values in this column by 1024. And then we can use an alias to give it a better name, so we can save this here as kilobytes, for example. So let's go ahead and run this. And we can see, here we have the bytes, and here we have the same amount, but in kilobytes instead. Now of course, the division operation is not the only thing we can do. We can also do other things, so let's create some new columns for those. Let's multiply our bytes, multiply it by two, for example, and here we can save this as multiplied bytes, just to be able to differentiate them. We can do subtractions; for example, we can take our bytes and subtract something off of it, and here we can call this sub bytes, for subtracted bytes. And then let's also just do the other one, which is going to be an addition, and this is going to be our added bytes. I'm just going to put some spaces here to make all the formatting nice. So if we run this, you can see there are a lot of different simple arithmetic operations that we can apply to our columns; we kind of have all the default operations available to us. We can also do other cool things: rather than just using fixed numerical values, we can also use other columns. For example, let's try to make a new column that's the number of bytes per millisecond of song length. So we're going to divide the values in the bytes column by those in the milliseconds column, and then we get the number of bytes per millisecond for the track. To do that, we're going to take our bytes column and we're going to divide it by the milliseconds column, and we're going to give this an alias, bytes per millisecond. And again, we can run this, and we can see here now we have different values for the number of bytes used to store the song per millisecond of song length. And we can now also do other cool things here. For example, we can order by our bytes per millisecond if we want. So maybe the first thing that we should do is just take away this limit statement here, and let's run that one more time to get the full results. And let's also take out the track name so that we actually see what each track is.
So we're going to remove these extra columns now and just take out the track name instead; we don't need to provide any extra additions or anything here, because these multiplications and subtractions don't really mean much, but as you've seen, we can do them. All right, so we've got our bytes column; if we run this, we've got our kilobytes, we've got the name of the song, and also the bytes per millisecond. So let's go ahead and order our results by the bytes per millisecond. We're going to order by, and now we have different options. For example, we can order by 4, maybe the quickest way, and it's kind of easy to see here, because it's still easy to count the number of columns that we have. We can also order by the bytes-per-millisecond alias if we want, and we can see here we get the same results. Or we can actually repeat the operation and order by bytes divided by milliseconds, like this. This is probably by far the least ideal; it's much better to use the actual column alias that you assigned up here. But just to show you, we can do this, and we can also do it in descending order, of course. So there we go. Now we can see there are different ways for us to do these sorts of simple arithmetic operations, both with literal values as well as between different columns. Now, let's do something else, because we can also make this more complex by adding in things like parentheses. We saw that when we were joining conditionals in the where, we had the option of using ANDs and ORs as well as parentheses. We can do the same thing here. So let's just take something: let's take our track ID, and let's add onto it the album ID multiplied by the media type ID. So here, for example, we're going to take our track ID, and then we're going to add on our album ID, and that times the media type ID. Now, this doesn't really mean that much, but it's just to show you that we can do all this. So first of all, here we've got a simple addition and multiplication, and this we can just call custom column for now. And we'll go ahead and run this, and we can see here we get the results of whatever that is. But if we use parentheses, we can actually do our plus operation first, and then do our multiplication operation. So just like we did for our conditionals, where we could use parentheses, we can do the exact same thing here. And we can see that now this parenthesized part is going to be evaluated as one unit, and then the result of it is going to be multiplied. So as you can see, we can also do modifications to our columns here: doing column divisions, dividing one column by another, multiplying columns by each other, adding columns onto each other, and also using parentheses to dictate the order of operations that we want to use.
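Here's a condensed sketch of the arithmetic from this lesson, on the same assumed "track" columns; the custom column is just for illustration:

    SELECT "Name",
           "Bytes",
           "Bytes" / 1024                          AS kilobytes,
           "Bytes" / "Milliseconds"                AS bytes_per_ms,   -- column divided by column
           ("TrackId" + "AlbumId") * "MediaTypeId" AS custom_column   -- parentheses force the addition first
    FROM "track"
    ORDER BY bytes_per_ms DESC;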
25. Datatype Conversions: Now, previously we saw that we can do, for example, divisions here, and we did that by dividing our bytes column by 1024 to convert it to kilobytes, as we see here. But you probably realized, or at least wondered: dividing all of our bytes values by 1024, it's very unlikely that we're always going to get perfect integer values here. And you're absolutely right, it's not very likely; a lot of these are actually floating point numbers, real or decimal numbers. But they've been converted to integers, because we have an integer here and an integer here, so we're doing integer division. So in this lesson we're going to look at how we can change our datatypes to be different types. Now, in this case it's going to be pretty easy, because one thing that we can do is change our 1024 to be 1024.0, and that's going to change it from an integer to a decimal number. So if we run this, we can see here now we actually have decimals included in our divisions, and the results here obviously have decimal places, whereas beforehand they were just rounded towards the nearest integer. But there are also different ways that we can do this, since we're not always just going to be using hard-coded numbers like this; oftentimes we want to refer to specific columns, or we want to change the types of columns, and there this trick is not possible. For example, we can't do the same thing to the bytes column; we can't just put a .0 here, and you can see that if we run this, we're going to get an error, because this is not going to work. So there are other ways that we can approach doing the same thing here, and this is going to be called typecasting. Let's try this on the bytes column first. We can cast: we write CAST, then our bytes, and then we put in here the keyword AS, and then we put the datatype. And so now we can cast our column and change its datatype. Currently it's by default an integer, which we can also see if we open this up and look at the bytes; we can see it's an integer here. So we can cast it, for example, to be a real datatype. If we run this, in this case our results are not going to change, just because of the way that they're displayed, but our values here are actually now a real type. So let's apply the same thing down below and actually see some results. Let's cast our 1024 number instead to be a real number. So we're going to cast 1024 AS real, and this here is now going to be a real number down here. So if we run this, you can see here we again have floating point division, we again have decimal places. And the datatypes that we can put here are the datatypes that we learned about in the intro lesson; you can cast to any of the datatypes that are available to us. So we can do the casting like this. And as you can see, we can cast individual values, and we can also cast full columns. For example, like we saw above, we can cast our bytes to be a real number datatype, and if we run this, we can see these are the results that we're going to get. All we're doing here is changing the datatype. So one way to do this is using the CAST statement here. But another way to do this, and I'll show this first on our 1024, but it works everywhere else too, is to cast using a different syntax: we're going to put two colons.
And then after the two colons we can put the datatype that we want to cast to. Running it like this, you can see both of these are equivalent. So there are different ways that we can cast. One of them is what we just saw, which is the CAST ... AS; the other one is using these double colons, which allows us to cast the value in front to the new datatype. We can also apply this, for example, to our bytes here; but let's convert this one to text, because now we're going to have an issue, since we're trying to divide text by a numerical value, and that operation is not defined. So we can see we can cast the individual values that we put in here, which of course is not going to be as useful, but we can also cast our full columns to different datatypes. And that can be really nice for different operations, or also if the datatype in the column is the wrong one. For example, if you're storing a numerical value as a string, then trying to do divisions won't work. Let's take this one and convert it to text, and then if we also convert this other one to text, then this is going to be a problem for us, because we can't divide text by text. So in those cases, what we can do is cast our text values to integers, for example. We can do the same thing here; we want to have this on the outside and cast our bytes here to an integer. And so in some cases, when the datatypes are messed up in our database, for example, or also when we want to do datatype conversions for different reasons, we can use casting, with these two syntaxes: one of them is the double colon, and one of them is the CAST ... AS datatype syntax. And that way we also have more flexibility dealing with datatypes, as well as shifting between different datatypes.
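Both casting syntaxes side by side, as a sketch; the double-colon form is PostgreSQL-style, which is what this course's editor uses:

    SELECT "Bytes" / 1024               AS kb_integer,  -- integer division, decimals lost
           "Bytes" / CAST(1024 AS real) AS kb_cast,     -- CAST ... AS syntax
           "Bytes" / 1024::real         AS kb_colons,   -- double-colon shorthand
           CAST("Bytes" AS real)        AS bytes_real   -- casting a whole column
    FROM "track"
    LIMIT 5;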
26. Logic Statements: Now, in some cases we maybe also want to include logic statements in our SQL queries. So let's look at a simple example of how we can do that. Let's take our track table here again, and let's again go through the composers. Now we're going to make a new column that basically says yes or no depending on whether the composer is AC/DC. So let's take a look at that. Let's go ahead and first of all just select everything from our track table and limit our results to five. I kind of like to have this skeleton code here, because after that you can come back and change it. So the first thing that we're going to do is choose the composer column, just so that we have the true value to compare to. And now we want to make a new column, and we can call it, for example, is AC/DC, which is going to be yes when our composer here is AC/DC, and no otherwise. So how can we do that? We can use the case statement that's a part of SQL. So we can say: CASE WHEN composer equals 'AC/DC' THEN 'yes' ELSE 'no', and then we're going to save that as is AC/DC. So let's take a look at this full statement here. The CASE keyword indicates that we're going to start a case statement here. Then we have the WHEN keyword, which says: when this condition holds, then we do the following; else, we have this other value. Now, there's one more thing that we're missing: at the end of every case statement, once we're done with all of our cases, we need to put END here. So it reads: case when composer is equal to 'AC/DC', and we saw this comparison already when we were looking at the filtering, specifically the where statements, then the value is going to be yes; otherwise, the value is going to be no; end, to indicate that this is the end of our case statement. So let's run this. And we can see here we have two columns. The first one is our composer, and currently we're limiting our results, so let's go ahead and take away this limit statement and go through some more results. You can see everywhere it's not equal to AC/DC we have no, and when it is equal to AC/DC, we have yes, as we can see here. All right, cool. We can also, of course, split this up over several lines to make it easier to read, so that we don't run on so long. We can do something like this, for example, and it's still going to give us the same results. Now, we can also do more things. For example, rather than having just one WHEN statement, we can also have several. So let's say we want to have a separate check: when composer is like, and now we're going to use the LIKE statement that we also saw before, and here we're going to look for Angus Young anywhere in there, then we're going to say maybe; else, no. So let's run this again. And we can see that in the cases where Angus Young is inside, we have maybe; in the cases where neither Angus Young nor AC/DC is inside, we have no; and otherwise, when AC/DC is the composer, we have yes. Now, the important thing to also be aware of here is that we have a specific order: first we check this statement, then we check this statement, and this here is going to be our fallback value. So depending on what order you put these in, your results may or may not change, depending on exactly what comparisons you're using. Now, in this case, we used the equals and the LIKE statement.
But of course, we've also seen other statements that we can use for comparison when we were going through the filtering. Maybe to clean this up a little bit more, I'm going to put the WHEN and the THEN statements on the same lines, or I can also split each of these over separate lines; just something that makes it intuitive to read, but also something that keeps our SQL queries from running too far in the horizontal direction, which also doesn't make them very readable. So we can see we have a lot of different options using the case statements. And we can also do other things. For example, we don't need to put a text value here: we can have here, for example, a 1; here, we can have maybe a 0.5; and otherwise we can put 0. And we can run this and see the results. So we're not just limited to putting only text. However, the datatypes that we choose do need to be compatible with each other. For example, something that's not going to work is if we put the text maybe in here, because now it doesn't know which datatype to choose. You can see here it's trying to use an integer, but then when it encounters the text, it just becomes confused. So just make sure the types that you're using are compatible with each other. As you can see here, all of these are going to be real-like numbers, so all of that works fine, whereas we were using text beforehand. And of course, we can continue to chain these conditional statements if we want, depending on other things that we're looking for, and we can again use the comparison operators that we also saw earlier. So what's cool about the case statements is that now we have a way to add logic into our SQL queries. And again, these are going to be creating new columns for us, and these new columns contain values that are specifically made up of the case statements that we've defined here. Again, the important syntax is: we have a CASE to indicate we're starting the case statement; then a WHEN to indicate that we're starting a comparison or conditional; then we have the actual conditional that we're looking for. If this conditional is true, then, with the THEN keyword, we take on this value. If this conditional is not true, we move on to the next WHEN statement, if there is one, look at its conditional, and if that one is true, we take on its value; and we continue on through every single conditional that we have, if we have even more WHEN and THEN statements down here. And if all of these fail, then we revert to the ELSE statement: when all of the above fail, we take on this value. And then we end the statement and save our column, giving it this alias, as we can see here.
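The full case statement from this lesson, written out as one sketch:

    SELECT "Composer",
           CASE
             WHEN "Composer" = 'AC/DC'            THEN 'yes'
             WHEN "Composer" LIKE '%Angus Young%' THEN 'maybe'
             ELSE 'no'
           END AS is_acdc
    FROM "track";

The WHEN branches are checked top to bottom, the first match wins, and the ELSE is the fallback; don't forget the END.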
27. Manipulating Strings: So in this lesson, let's take a look at how we can do some text trimming. Let's take a look at some of our columns, and again kind of default back to the composer column, which we've been using a lot in this track table here. Let's try to take out the A at the beginning of every record here. To do that, we're going to select, just writing our skeleton, everything from track, and let's just limit our results to five for now. So again, we want the composer column, just to compare to, to make sure that everything is working properly. And now what we want to do is take the composer column, but remove, if there is one, the A at the beginning. So how can we do that? Well, there's a very cool SQL function that we can use called LTRIM. What we can do here is put the text, or our column, in here, and then we also put the characters that we want to trim away from the left-hand side. So for example, if we want to trim away the A, then we're going to put the 'A' here, and then we can give it an alias, for example trimmed column. All right, so let's run this. And now we can see, okay, we've got our composer, and in our trimmed column here, in all the cases where there is an A at the front, we have the A trimmed away. So let's take our limit statement away and look at the full results. We can see any time there is, for example, Angus Young, the A is taken away; when there's AC/DC, we can see the A is taken away; and in other cases here, again, the A is taken away. So we can see the LTRIM looks for this character here on the left side, and if it exists there, it's going to take it away. Now, we can extend this more: if we put in here, for example, multiple characters, like AC, then of course the results are going to change, because Angus Young, for example, does not contain a C, but the A, as we can see here, is still trimmed away. So there's a specific set of characters that we can trim away from the left-hand side. The important thing to note here is that the trimming that we're doing is case sensitive. If we put a lowercase a here, then nothing is going to happen, because again, we're doing things case-sensitively; we need to remember that what we're using here is going to be case-sensitive. Now, we also of course have the option to do a right trim, RTRIM. Let's take away, for example, the n, so that the n at the end of Johnson will be taken away. And so if we run this, we can see that wherever there is an n at the very end, we now have it taken away. Just scrolling through this, it looks like Johnson is the main one, and you can see the results here: any case where there is an n at the end, it's now taken away. In case we don't want to specify whether it should be the left or the right, and we just want to remove extra characters on either side, we can do that by just using TRIM, which means that if it finds this character on either side, it's going to take it away. If we run this, our results are still going to be mostly what we had before, also because we're case-sensitive here; in this case, it's unlikely that any of our composers are going to start with a lowercase n, but we still have the option of taking it away from both sides. And what we can actually also do is chain operations.
For example, we can run the TRIM first, which looks for the n on either side and trims it away, and then, if we want, we can run an LTRIM on the result. So for example, let's say we trim away the n, and then what we want to trim away next is the A on the left-hand side. So what we're going to do is first run the statement in here, and then we get the resulting column, which is, for example, what we had here, Angus Young, and Johnson without the ending n, and then we run the LTRIM on that result, this time looking to trim away the capital A on the left-hand side. So being able to chain these operations together is of course very nice and very useful. In some cases you're going to do this, for example, if your data values are kind of messed up or in a weird format, and there are maybe extra spaces or extra padding around some text values, and you just want to get rid of all of this junk that's not supposed to be there, to make sure that your data is in a more standard format. So most of the time you'll be using this to do some sort of processing on your column, in case the data has some weird padding around it, or you just want to take away some unnecessary stuff that shouldn't be there, so that your values are actually cleaner. Now, two other cool things that we can also do to process our text a little bit: we can convert it either to uppercase or lowercase. This can be really nice, because a lot of times we're dealing with case sensitivity, and at some points we don't really care if Angus Young is written with a capital A, or a lowercase a, or all capitals, or all lowercase; sometimes we don't really want to deal with these case sensitivity issues. So a good solution to this is just converting everything to either uppercase or lowercase. To do that, we can use the UPPER keyword, and let's just put in here our composer column again, and give this the alias, putting a space here just to make it more readable, upper composer. And then let's also convert to lowercase, which we can do using the LOWER keyword, again on the composer column, and this we can have here as lower composer. So running this, we can see here now we have the trim statement which we had before, the original column, everything converted to uppercase, and everything converted to lowercase. So here we can, for example, also try, after we've converted everything to lowercase, trimming away any of the s's; of course, let's make sure that we're using a lowercase s here, since everything in here is going to be lowercase. That should take away, if we just take a look here, the s at the front here, and there are also rows ending in s, so the s at the end should also be taken away there. So let's go ahead and run this, and we can see in this case both s's are taken away. Again, this can be really useful when there's maybe some padding or some other things around your data and you want to do some processing on your strings, just being able to take away some of these characters. Now, we can combine this, for example, with what we learned about casting. Let's say that we want to take away the .99 at the end of every unit price here. What we can do is just say: all right, let's take our unit price, and we're going to convert this to text.
And then on that, we're going to trim from the right-hand side the .99, since these are now going to be text values. And then if we want, we don't have to, but if we want, we can convert this back to an integer. So remember, we can use the double colon, or we can also use the CAST ... AS statement, and here we can do the cast. And to show that this is now an integer, we can add, for example, plus one, and then this we can save as trimmed unit price. So we can run this, and just taking a look at the right-hand side here, what we did is: we converted our unit price, this column here, to text; then we stripped away the .99 on the right-hand side, which works because these are now all text values; then we cast this back to an integer and added one to it. So we can see that even in cases where we're not dealing with text data, we can still use the other parts of SQL that we know and combine these operations together. For example, here we can use some nice casting to do this cool operation, converting to text and then back to integer, and then also being able to use, for example, the integer addition here. And of course, depending on your situation and your needs, you can get creative with this; you can also do this inside case statements, for example, or in other places. So again, a lot of this you can use for just processing your data, to make sure that you can get it into the right format that you need it in, or to do this type of formatting and change things around based on your needs. And because this works on text, you can cast anything to text, as we've seen here, perform what are technically string operations on values that aren't inherently text, and then just convert them back later if you want to. So this can be really nice for general data processing, to make sure that you get the data in the actual format that you want to have it in.
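A condensed sketch of the string processing from this lesson; the trimmed unit price line assumes a numeric "UnitPrice" column, as in this course's table:

    SELECT "Composer",
           LTRIM("Composer", 'A')  AS trimmed_composer,   -- strip A from the left only
           RTRIM("Composer", 'n')  AS rtrimmed_composer,  -- strip n from the right only
           UPPER("Composer")       AS upper_composer,
           LOWER("Composer")       AS lower_composer,
           CAST(RTRIM("UnitPrice"::text, '.99') AS integer) + 1 AS trimmed_unit_price
    FROM "track";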
28. String Positional Information: Now, in some cases we may want to know additional information about the sizes or components of our strings. So let's take a look at that now. There are different ways that we can use these functions; we'll just use them right now for their numerical value, but we can use them, for one, on the analytical side, to understand what types of values we're dealing with, and we can also, for example, chain them into case statements to do specific conditional operations. So let's just write our skeleton first: select everything from track, limiting our results to five here. And what I want to do is understand just how long some names can get, given our current database. So let's take our names here; I want to know just how long our track names can get, because maybe we've assigned a size and we want to know if the size of the varchar, for example, that we've assigned to our name column here is appropriate: are we getting close to reaching the upper limit, or is that varchar size way too large? Or maybe there's some other analytical purpose that we want to know the length for. To do that, we can use this function called LENGTH. We put inside it the value that we're looking at, in this case the name column, and then we give it an alias; we can call this name length to make it descriptive. So let's go ahead and run this. And now we can see this is going to give us how many characters are inside of our string here: how many individual characters, including the spaces, including the parentheses, including any other special character; how long, in this case, the text that we have is. Now, maybe what we want is to find the largest one, so we want to see this in descending order at least. So we're going to order by the name length in descending order. If we run this, we can see, all right, so the longest name in this case has 123 characters in it, and we can see we've got three over 100, we've got some around 90, and so on. So you can see here the scale of the things we're dealing with. We can also include this in a case statement if we want. So we can say, remembering to put commas between separate columns: CASE WHEN the length, let's keep this capital, of the name column is greater than 50, THEN we take on the value 50 plus. And for the next statement we can have: when the length of the name column is greater than 40, then we take on 41 to 50. And I'm not going to write all of this out; instead I'm just going to copy it several times and then use some indentation. So, when the length is greater than 30, we say 31 to 40; this one here is going to be 21 to 30; this one is going to be 11 to 20; otherwise, we're going to say less than or equal to 10, ending our case statement. And then here we can alias this, for example, as name length bucket. So let's go ahead and run this. And now we can see, okay, now we have an even quicker overview of what category, for example, each name falls into. So we can see that we can use this function in different positions: we can use it directly, but of course we can also use it inside case statements or inside other things, whatever it is that we're looking for. Now, another cool thing that we can do is find the positions of characters, if they exist, inside of a string.
And we can do this by using the POSITION statement. In here we say the substring that we're looking for, IN the main string that we're searching. So for example, let's look for the first occurrence of the letter A in the composer column. But let's convert the column to uppercase, because we don't really want to differentiate between lowercase and uppercase a's. So we're just looking for the first position of the letter A in the composer column, where everything is converted to uppercase, and then we can give it an alias, as a position. So let's run this, and it would also be good if we print out the composer column next to it, so that we can actually compare the results. We can see that if there is an A inside, it gives us a position. Now, this position starts counting at one. So counting 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, we can see here that the first time the A appears is at position 12 in our string. In case it's 0, we can see it doesn't show up at all. And in other cases here, for example, we can see it occurs at position 1, and so on. That way, we can look for a specific character, or we can look for substrings. For example, we can look for Angus, wherever this fully appears. Now, for a lot of rows this is going to be 0 until we find an instance with Angus Young in it, so it's probably best to do this by adding a where statement: where composer, and I'll just use ILIKE, is like Angus Young. Since we're using the ILIKE, we don't actually need the capitalization here; I'm just going to write it anyway, because it kind of feels natural to. So we can see here, okay, now we can use the filtering just to do some quicker checking. And we can see, all right, so the first time we see Angus, the full string Angus, it's at position one, and with this filtering we don't see it starting anywhere else. We can also look for something else, for example the first time we see Young. And then again, just to make it easier, let's filter specifically for all the rows where there's Young contained inside, and again we're using the ILIKE, so we're doing case-insensitive filtering here. And here we can see, for example, okay, so the first position is seven: 1, 2, 3, 4, 5, 6, 7, so the substring Young starts at position seven in here. And so we can see this function can also be used to identify specific starting positions, which can be really nice. And as we can see here, the UPPER is also getting its use. So a lot of these functions can be chained together, each providing a piece of the puzzle, which we can then use together with other functions to do some data processing or other components in our SQL queries.
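Sketching the two functions from this lesson together, on the same assumed columns:

    SELECT "Composer",
           LENGTH("Name")                         AS name_length,
           POSITION('A' IN UPPER("Composer"))     AS a_position,     -- 0 means not found; counting starts at 1
           POSITION('YOUNG' IN UPPER("Composer")) AS young_position
    FROM "track"
    WHERE "Composer" ILIKE '%young%'
    ORDER BY name_length DESC;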
29. 27StringSubsetsAndConcatinationsHB: All right, so we've learned a lot about strings now and how we can use different functions towards gaining information from the strings that we have in our columns. But let's go ahead and take a look at some more cool things that we can do that will specifically help us with processing our data. So let's go ahead and look at our track table again. And the first thing that I want to look at is doing string concatenation, which means we take two strings, or we can even take more than two, and we join them together. So for example, let's start off with joining the name and the composer, so that we get a new column whose value is the name of the song, and then we say the composer afterwards. For example, that would look like 'For Those About To Rock (We Salute You)', and then after that we would have the name of the composer. So to do that, let's just go ahead and first write our skeleton SQL. It's going to be the track table, and let's also just go ahead and limit our results to five. All right, so we're going to use the name and the composer, so let's go ahead and just extract those too, so that we can of course compare our results. Now, what we want to do is take the name column, and to it, we want to add on the composer column at the end. The way that we can do this string concatenation is by just using these double vertical lines. So we're going to put our first string here, which in this case is going to reference the column name, and we're going to put our second string here, which in this case is going to reference the column composer. What it's going to do is take whatever is here, and then onto the end of it, it's going to join whatever is here. So let's give this a name; we'll call this concat_columns, for concatenated columns. And let's just go ahead and run this and see what it looks like. All right, so we have our initial name column here, then we have our composer, and now we have our result here, which is literally just the name, and you can see that as soon as it ends, our composers here start. We can also see that in cases where we have null values, the final result, as we can see here, is actually also going to be a null value. Now of course, the format that we have this data in now is not particularly useful. So let's go ahead and make it a little bit nicer to look at by adding a space after the name, then the word 'by', and then the composer name. To do that, we're just going to put in here the string: space, 'by', space. And then we can just add two more vertical lines. So now we're doing several different concatenations: first we're taking the name column, then to the end of that we're adding space-'by'-space, and then we're doing another concatenation where, to the end of this, we're now adding the composer column. So if we go ahead and run this, we can see now, okay, the format is a bit better: we have the name, then we have 'by', and then we have here the composers themselves. Now we can do even more here. Let's take this all the way: let's start off with the word 'Track', then we're going to put the track ID, then we're going to give the name of the song, and then we're going to give the composer. So before all of this, we're going to say 'Track'.
And now we want to have the track ID. So how can we do that? Well, what's cool about this is that, as long as we're in between strings here, we can actually also put in integer values. So we can just directly reference the track ID column, and the string concatenation will actually just take care of the conversion for us. And to make this formatting a bit nicer, I'm going to put a colon and a space here. So if we run this, we can see here now we have 'Track', then we have the ID, and then we have our name column and so on. Let's go ahead and just add another space here, just for some formatting, and run that again. And so we can see, even though our track ID is actually an integer — which we can also see when we just look at it here — the string concatenation takes care of the conversion for us. But of course, if it weren't taken care of, we would also know how to do the conversion ourselves using casting, for example by just casting it to a text value. So in case it wasn't taken care of for us, we could run it like this, and of course that would handle it. Or we could cast it to another data type that we find more appropriate. But as we can see, the string concatenation actually takes care of all of this for us. Now, usually this is in the opposite direction of where we would like to go: usually we may have data like this, and we would like to get it into this format. But nevertheless, seeing string concatenation is really nice, because at some point you do want to put it to good use. In this case our database is very clean, of course, but there may be times when you have — for example, let's take a look at the invoice table — there may be times when you have the date and the time in two separate columns, and you just want to join them together and use them as a full date value. So then you could do string concatenation to get date plus time, and then you can do a cast to a datetime, for example, and from there directly have the datetime in one column. So depending on how your data is formatted, sometimes the columns that you'll be using are not in the ideal format that you would actually like them to be in, either for further analysis down the line, or just some processing that you want to do, or also just for having a cleaner export by having fewer columns. And so string concatenation can be really nice, because as we can see, we're able to join together different columns, as well as adding in custom strings of our own, all joined together. And notice again here that the custom text that we have is within single quotation marks, and any reference to columns here is in double quotation marks. We need the double quotation marks because, in the create table, the column names were defined with double quotation marks. If our table were not defined with double quotation marks, then we could also just reference the columns directly, like this. But we can never use double quotation marks for strings: those are reserved only for column names, and, as we saw, also table names in this case, depending on how the create table statement was written. All right, so now we've looked at joining things together. Let's go in the opposite direction and look at reducing things. I'm going to take this away, so that we don't have a very long output down here anymore. And the next thing that we're going to look at is finding substrings of a string that we have.
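Before moving on to substrings, here's a sketch of the concatenation query we just built, again assuming unquoted snake_case identifiers:

    SELECT
        name,
        composer,
        'Track ' || track_id || ': ' || name || ' by ' || composer AS concat_columns
    FROM track
    LIMIT 5;

Note that the integer track_id is converted for us, and that any row with a NULL composer produces a NULL result, as we saw above.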
So let's take a look at just the simplest implementation first. To find a substring, we're going to use this substring function: substring, and then here we can open and close parentheses. The first thing that we want to put in is the text, the string that we want to take the substring from; for example, let's take the composer column. Then we want to put in where we want to start. So we can say, for example, from position one. In SQL, the first position is always going to be position one, in case, from another programming language, you may be used to the first position being referred to as position zero. But in SQL, the first position is position one. And if you don't know another programming language, then of course it's very intuitive that the first position would be referred to as position one. Either way: from one, and the first character here is going to be at position one. And now we want to say how many characters we want to extract. Let's say we want to extract the first five, so we're going to say from 1 for 5. What this means is we're going to take the substring — a smaller part of the string of whatever is put in here — starting at position one and going for five characters. And let's give this an alias; we can call this sub_column. Let's go ahead and run this. And so we can see here, we start off with Angus, for example: we start at position one and go for 1, 2, 3, 4, 5, as you can see here. Let's look at the next one, F. Baltes: we start at position 1 and go 1, 2, 3, 4, 5. And so you can see this is also exactly what we have here. Now there's actually another way that we can write the syntax: you don't actually need to write 'from' and 'for', you can also use commas as separation instead. So for example, we can just put in here the string, then a comma, then the starting point, and then how many characters we want to extract. If we go ahead and run this, we can see we get the exact same result. And in fact, we don't even need the final 'for' value; it's not even necessary, and we can take it out, at which point we'll just start taking all of the string from the starting position onwards. So we can see here, in this case, we start at position 1, so we take everything. If we start at position five, for example, then we're only going to continue taking onwards from position five: 1, 2, 3, 4, 5 — from this character onwards, we take everything, as we can see here. Now, inside this function, we can also use the results of other functions that we've learned previously. For example, let's try to extract the first five characters after we encounter the first space. To do that, first of all, we have to find the position of the first space, and we can do that using the position function that we learned before. So here we're going to find the position, and first of all, we're going to put in here the composer, and we want to find the position of the first space. And then we're going to extract the next five characters. But since we want to start after the first space, rather than at the first space, we also need to add a one to the result here, since the position of the first space would make us start exactly on the space, but we want to start one after the space. So we're going to add one here. So let's go ahead and run this.
So again, what we're doing is finding a substring in the composer column, so every row is going to have a different value. We're going to start at the position of the first space plus one — so one after the first space — and then we're going to extract five characters. If we run this, we can see here these are the results that we get. And let's take away our limit statement and see how it looks for everything, specifically for cases where there is no space, like in AC/DC here. You can see that in these cases, we're just starting at the initial position: since there is no space in the composer, the position of the space is going to give us 0. If we run this part separately and save it as space_position, we can see that in this composer name, for example, there is no space, so this gives us 0, and the plus one just has us start at character one. That's why, for example, for AC/DC we just take the first five characters, which is actually going to give us the full composer name, because the space position here is 0 and so we're starting at position one, since we're doing the plus one. Now, we can make this even more complicated if we want to, by again using the same functions that we've used before. And in doing this, let's actually split this up over several lines, to not have it go too far into the horizontal. What we're going to do is find the first word after the first space. So for example, we want to find everything between the first and the second space: in this case it's going to be 'Young,' or here it's going to be 'Baltes,'. So how can we do that? Well, now we have to make this component here smart, we have to make it dynamic. First, we want to find everything after the first space, and then we want to take only everything up to the next space. So we're going to start off with finding everything after the first space, which we've actually already done here. So we can just take this, copy it, and put it in here, then close off our parentheses, since we want to take everything after the first space.
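As a checkpoint, here's a sketch of the 'first five characters after the first space' query we just ran (snake_case names assumed):

    SELECT
        composer,
        POSITION(' ' IN composer) AS space_position,
        SUBSTRING(composer FROM POSITION(' ' IN composer) + 1 FOR 5) AS sub_column
    FROM track;

When there is no space, POSITION returns 0, so the plus one starts us at character one, exactly as described above.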
Now, we've actually already done this, because exactly this part of our query here is extracting — well, in this case we're extracting the first five characters, but we can just as easily remove this, and now we're extracting everything after the first space. We want to take that as input. So what we're going to do is, rather than using our composer column here — or rather, the string corresponding to the current row of the composer column — we're going to instead use the result of this. We just want to make sure that all of our parentheses are closed off properly. So our first element in here — and let's again just adjust our spacing to make this more readable — the first value inside here is the result of this statement, and this is going to give us everything after the first space. So in this case, for example, we're going to have 'Young' and so on, because we're not limiting the characters. Now, the next thing we want to do is find everything up to the next space. Because we know that we're starting after the first space, our starting character here can just be one, since starting directly after the first space means we always start at the next word. But now we need to find the end position, which is going to change based on the length of the name: for example, Young has five characters, Baltes has six, and in each of these cases we actually have a comma behind it. So how far we want to go to reach the second space is again going to change. So what we can do is say: all right, we basically just want to go up to the position of the next space. But rather than going up to the position of the next space in composer — which would just be the position of the first space — we again need to use the result of this statement here. So we put this in here, and let's put it on a new line so that everything is easier to read. All right, so let's go ahead and give this a run and take a look at our result. What we can see is that we get exactly what we wanted: 'Young,' and 'Baltes,' and so on. So let's walk through this one more time, because this structure is kind of intricate; there are a lot of nested components here. Our first value, the string that we're taking the substring from, is actually the result of taking the substring of the composer column starting after the position of the first space, which is what we did earlier. So for example, for our first row here, it would just be 'Young' and everything after that, since we're not limiting how far we go. The next thing we do, using this result, is start at position 1, since we know that we're starting after the first space. And we want to go up to the position of the next space. To properly get the position of the next space, we again need to use the result here, so that we're searching within 'Young' and so on. We need to do it this way because, if we just used the original composer column, it would just give us the position of the first space in the full column. But we don't want that, because we're already using a substring of this column. So we want to make sure that, when we're using the position function here, we're using the same substring.
So again, we're going to find the position of the first space in the result of the substring statement that we've written up here, which is the same one that we use down here. Now actually, there's one subtlety here, which is that we have a trailing space in addition to the trailing commas that we have in some cases. We could try to say: okay, we're going to go up to the position minus one to get rid of the trailing characters. But the problem is that in some cases there aren't any spaces in our substring anymore; for example, here, our result is going to be 0. So we're already going for zero length, which is why, for example, here for AC/DC we have nothing, we're taking no characters. And if we put in a negative value, you can see that in this case Postgres is complaining, because we're not allowed to put in negative values here: we can't take out negative amounts of characters. But we already know the function that we can use to solve this problem: namely, we're going to use the trim function. And here I'm just going to use the right trim, rtrim, because we specifically want to trim off the right side. What we want to do is trim off the trailing comma and also the trailing space. We're going to end our function here and just put this on the same line again, to keep everything consistent. And let's go ahead and run this. So now we can see we got rid of the trailing comma, as well as the trailing space that is actually hidden. And if we just had the comma here, our results actually wouldn't change, because we have that trailing space first, which we can't see in this case because of how the output looks. So just be aware that in this case we would have a trailing space and a trailing comma. Now, as you can hopefully see, even though some of the functions in Postgres by themselves may seem a little niche — for example, why would we need to know the position of a character; there must not be very many use cases — the cool thing is that you can use all of these functions together, and through that you can actually write some pretty cool processing logic. And you can see we can nest results here and use the result of one substring as the entry point for another substring. So there's a lot of cool stuff that we can do by chaining different functions together, or using their results, to do some really nice data processing — even in this case, where the task was only to pull out the first value after the first space, or rather, between the first space and the second space. So we can see here that there's some really nice stuff we can do by just using several of these different functions together, so that we get some unique and custom processing where everything is kind of just adapting to the input values that we have, and we don't need to hard-code any specific lengths or starting points or anything like that. As we can see here, these two values that we have returned are of very different lengths, but that's totally fine, because our query takes care of everything: it doesn't really care how long each of these is, because that's all taken care of automatically inside.
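Putting the whole thing together, here's a sketch of the final nested query from this lesson, with the same snake_case assumption:

    SELECT
        composer,
        RTRIM(
            SUBSTRING(
                -- everything after the first space
                SUBSTRING(composer FROM POSITION(' ' IN composer) + 1)
                FROM 1
                -- up to the next space (0 characters if there is none)
                FOR POSITION(' ' IN SUBSTRING(composer FROM POSITION(' ' IN composer) + 1))
            ),
            ', '  -- trim the trailing comma and the hidden trailing space
        ) AS second_word
    FROM track;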
30. 28StringReplacementsHB: Now, we've already taken a pretty good look at a lot of stuff that we can do with strings in SQL. But there is one more cool and useful thing that I want to show you, which is replacing values within strings. So let's first of all just write our skeleton code here so that we have something. In this case, we're actually going to use the invoice table, because I want to use this date column here. So we're going to select from this invoice table, and let's just limit our results to five again first. What I want to do is replace this space here with a T instead. Now, there are different ways that different programs, and sometimes different people, may format dates. So sometimes, something that you'll see is a T here, or you'll see some slashes. Now, SQL has a very specific format it wants, which you can also see here: year, month, day, with dashes between them; then you have the space, and then you have hours, minutes, seconds; and then here you have the milliseconds component. You can see that the time subfields are separated by colons, and the milliseconds are separated by a dot. We can also look into our invoice table, and we can see here that the invoice date in this case is actually a timestamp; the value that we see is kind of made friendly for us, it's the nicer representation of this timestamp value. And we can actually use this result directly. But because different programs, and sometimes even different people, format dates differently, even though this is the SQL format that you want, you may not always get it: sometimes, if the data is just uploaded from the result of a stringification of a timestamp, the way that that program puts the timestamp or date into a string may not be the actual format that you need. So let's just look at some string replacements, even though we already have the value that we want here. Using string replacements, if the general format is otherwise the same, you can already get some really nice values and really just get to the format that you need. So to do that, let's just change this invoice date here to instead have a capital T in between, which is a common thing that you might see at some point. To do that, we're going to select the invoice date — we're going to select it so that we have something to compare to. And now, to do the actual replacement, we're going to use the replace function, and let's keep this uppercase. The syntax that we want to use here is: first, we put the full string that we're going to use, which in this case is just going to be the result of the invoice date; then we put the value that we want to replace — in this case, let's replace the space; and then we put the value that we want to replace it with — for example, let's replace it with the capital T. And then let's give this an alias too, as modified_date. Now there's actually one more subtlety that we need to take care of, which is that our invoice date is a timestamp internally. When we're looking at it, it's converted for us into the text representation of this timestamp, which of course makes it much easier to read: it's much easier to read the actual date with year, month, day, hours, minutes, seconds, rather than looking at just a bunch of numbers which represent a time.
So what we need to do to properly do this is also convert this timestamp to a text format. Let's go ahead and do the conversion here. We could also do the conversion internally — actually, let's not do that here, because I want to show you the result of this too — but we need to have it in here, because otherwise these are actually timestamp values. They are timestamps, but we need to convert them into a text representation of a timestamp, which, when you do it in SQL, looks like this; otherwise we won't be able to use the replacement properly. So if we go ahead and run this, we can see here the result that we get when converting to text format — which is basically everything except the part after the dot, which in this case is not shown just because it's not present — and then we have the space here replaced by the T. So this is the SQL representation that we see when we have timestamp values and don't do any other conversion; this is how they're shown to the user in this case. And if we cast this to text, we can see here the resulting cast, which is the exact same value, because we don't have any extra millisecond component here. But it's still nice to be aware of this subtle difference that you're going to get when converting a timestamp column to a text data type, just because it slightly changes how the output looks, and that can also affect how we're going to use these values. So sometimes we may get date values or datetime values which have this T in between, which in this case we can actually chain replacements on. And we're going to chain, because let's just treat this as though we're getting this value as input. So we're going to chain our operations here — in reality, we would just be taking this modified date column — and then let's remove this T again. So we're going to replace, from the result of this replacement statement here, which is what we see here — as if this were a true input and we want to take the T away, since that puts it in the SQL format that we need in order to use it as a date, time, or timestamp. So we're going to replace away the T, and we're going to replace it with a space — let's put this on a new line for now — and let's also give the column a name here again. So if we go ahead and run this, we can see we're getting the exact same value back. But again, what we've done is we've replaced the space here with the T, and then we've replaced that T away again. And sometimes you'll just see cases where you have datetimes that look like this. Now, there are some SQL databases which can deal with this format; for example, Postgres is one of them, where we can actually still take this and convert it to a timestamp type. And if we run this whole thing, we can see here we're converting it back to a timestamp. But generally, the correct SQL format wants a space in here, because not every SQL database can deal with these T values like this, and so in some cases the statement may actually crash. So just be aware that there's kind of an ideal SQL format that you want, which is year, month, day, space, hours, colon, minutes, colon, seconds.
And the reason that you want it like this is because, even if these values are strings, if you sort them, they're still going to be sorted in the right order. That's also why it's important that we have this format of year, month, day, and not day, month, year, or month, day, year, or something like that. Having this very specific format makes everything very standard. So in this case, we can see that in Postgres the capital T is not a problem, but this is not true for all databases, and SQL is kind of variable like that. Of course, if you end up somewhere where maybe you use several different SQL databases, the same statement may work in one of them and crash in another, because they approach things slightly differently. So just be aware of that: ideally, you want to go for this clean format. And of course, you can use all of the replacements and castings, as well as the substrings and everything else that we've learned — all of the processing stuff — to modify the columns and put them into the proper format so that you can do the type conversion. But just be aware that you may sometimes encounter date columns in improper formats. Another one that is commonly seen is, for example, having slashes instead of dashes. Let's convert this one to a timestamp so we can actually see the result. In this case, as we can see, the timestamp casting worked again, but other SQL databases may not be so great at adapting to these different formats. So just be aware that we can use these simple replacement methods, as well as the other string methods that we used for processing or manipulating data beforehand, to get our data into the proper format.
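Here's a sketch of the replacement round trip from this lesson, assuming an invoice table with an invoice_date timestamp column:

    SELECT
        invoice_date,
        REPLACE(CAST(invoice_date AS TEXT), ' ', 'T') AS modified_date,
        CAST(
            REPLACE(REPLACE(CAST(invoice_date AS TEXT), ' ', 'T'), 'T', ' ')
            AS TIMESTAMP
        ) AS back_to_timestamp  -- chained replace, then cast back
    FROM invoice
    LIMIT 5;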
31. 29DateAndTimeHB: Now, date and time information, like we also have here in the invoice date column, is likely to come up in one form or another very often in your SQL tables, because it's really nice to keep track of when things occurred. So let's take a look at dates and times and how we can use them to extract some information, just because this is going to be such a common field to see. Now, what's really nice is that SQL is really good at dealing with date and time information, and that's great because it gives us a lot of flexibility. So let's just go ahead and first write a skeleton select statement from our invoice table, limiting our results to 5. And again, we already saw this before, but let's just go ahead and select the invoice date column and run this. We can see here the text representation that gets returned to us in our table, because the value itself is a timestamp. And we also saw that if we cast this to a text value, we just get it without the millisecond component, in this case because it's actually not included. But let's say that we're only interested in using the date; we're not particularly interested in using the time. Let's say we want to run an analysis where we want to understand when things happened on a day-by-day basis alone, and we're not super interested in the time component here. What we can do, using our invoice date, is just cast it to a date. So let's go ahead and do that, and maybe give this an alias too, as invoice_day. We can run this, and we can see here our initial output, and next to it how we very easily just cut out the date, and now we can do stuff with the date. We can also, for example, take out just the time. Maybe we're only interested in doing an analysis looking at specific times of day, and we're not super interested in the date here, but rather we want to understand fluctuations with time of day or something like that. So we can see that as soon as we have it in the proper timestamp format, the proper datetime format, we can very easily extract the information that we need out of it. And we can also use this for filtering. For example, let's take away our limit statement first, so that we get the full result. Let's say we're interested in analyzing some behavior based on time of day in the week that starts from the first of the first, 2009, and goes until the sixth of the first. What we can do, in our where clause, is say: where invoice date is between — and here we take the string '2009-01-01', which we can cast to a date; we don't need these parentheses here — and '2009-01-06', which again we're going to convert to a date. So if we go ahead and run this, we can see that very easily we can do date filtering, and we can also do datetime-specific filtering, because SQL inherently understands date and time formats, which of course is really useful because date and time values come up so often.
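A sketch of the date extraction and filtering shown here, with the same snake_case assumption:

    SELECT
        invoice_date,
        CAST(invoice_date AS DATE) AS invoice_day,
        CAST(invoice_date AS TIME) AS invoice_time
    FROM invoice
    WHERE invoice_date BETWEEN CAST('2009-01-01' AS DATE)
                           AND CAST('2009-01-06' AS DATE);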
More generally, you usually care about what happened most recently, not as much about what happened five years in the past. Of course, you may want to be able to access that information, but most of the time when you're trying to access information or do an analysis, you're most interested in the most recent data. And with SQL we can very easily do this by just using the inherent date values, the inherent time values, as well as the timestamp values, since they're all compatible with each other. Now, there's also another cool thing, which is this function called now. What that does is it basically gives the current date and time value. So let's go ahead and run this, and we can see that the result here is going to be the same in every row, because it only returns one value: it gives the current date and time in UTC. And we can also use this again to extract values from it. For example, to get the current date, we can just take now and, from its result, extract the date, and we can save this as our current date. So our current date column here, for example, is just going to be what we see here: the current date. Now, we can also use this for filtering. For example, we can say: all right, let's find all the places where the invoice date is greater than or equal to — or rather, in this case it will make more sense to say where the date component of the invoice date is equal to today's date, which we can do like this. Now in this case, we're not going to get any values returned, of course, because the data here is old and there's nothing that comes close to today's date, as we can see. But we can still use dynamic filtering like this, and just have our queries adapt based on the date, so that as things change, our results still stay updated, because we're using this internal now function, which gives us the current date and time, from which we can then extract, for example, the date and get date information. And of course, because SQL inherently understands and makes proper use of date and time values, we can do all sorts of comparisons, look at ranges, and all sorts of cool stuff.
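And a sketch of the dynamic filtering with NOW(); on this old dataset it returns no rows, which is exactly the point made above:

    SELECT
        invoice_date,
        CAST(NOW() AS DATE) AS today
    FROM invoice
    WHERE CAST(invoice_date AS DATE) = CAST(NOW() AS DATE);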
32. 30DateTimeIntervalsHB: Now, in addition to just being able to extract date and time values and do comparisons and such, SQL also has a really cool option of letting us use time intervals. So let's take a look at that. Let's just write our skeleton select statement here, selecting from our invoice table, limiting our results to five, and go ahead and run this. Now, let's first try to find the time difference between today's current date and the invoice date here. So let's extract the invoice date so that we have a comparison value, and then we're going to use our now function to give us the current date and time. And now, to get an interval, all we have to do is subtract off the invoice date, which is a timestamp. Because both of these are timestamp-compatible, we can take our now value, which also gives us a timestamp, and subtract from it the date of the invoice. And here we can call this, for example, time_since_invoice. If we run this, we can see that this is our invoice date, and this here is the time difference since the invoice. So for example, we're seeing that in this case it's been about 4,200 days since these invoices happened, depending on the exact date. Now, we can also convert these to date values, because currently we're also getting information about hours, minutes, and seconds. We can convert these to dates if we want, in which case we're just going to directly get day differences, as we can see here. So this is one way that intervals show up: we can get interval differences between different times. But of course it doesn't stop there; we can also use intervals directly. So let's take a look at that. Let's take our invoice date column and just add some time onto it. Maybe we don't want to show the data in UTC: we have it internally stored in UTC, but when we take it out, we want to shift it to a separate time zone. For example, let's go to UTC plus 1. So how can we add an hour onto our invoice date here? Well, it's actually really easy, because all we have to do is write plus here, and now we have this interval option in SQL. With interval, we just put in here a string of the amount of time that we want to add on; for example, we want to add on one hour. And here we can save this as, for example, add_one_hour. So if we run this — and let's just take this other column away for now, it's kind of in the way — we can see this is our original invoice date, and this is our invoice date with one hour added onto it. So what's nice is that even if we have our datetime values stored internally in UTC, as you can see, we can still do all sorts of time zone conversions once we take them out, to make them appropriate to whatever we want to show. Of course, our options aren't limited to just adding on an hour; we can really do anything that we'd like. But first, let's rename our column: here we can say add_time. We can add one second, we can add one minute, we can add one month, we can add one year, or we can even add multiple years. If you want to add multiple years, you can either write '2 year' or you can also put an S on it, as we can see; the results themselves don't change. So it's nice, because we can use this interval value to have a time interval that we specify through the string here. Now, what we can also do is, for example, say we want to add on two years and one month.
Run this, and we can see that everything adapts accordingly. So the interval statement here is cool, and it's cool to be able to use it in the select statement, but it's even cooler to be able to use it inside of our filtering. So let's take a look at that. Let's take away our statement here and get rid of our limit for now, and also just get rid of this here and make it what we're going to filter by. So I want to add 15 years onto our invoice date, and now I want to filter by where our invoice date, plus the interval — let's make INTERVAL all capital, just to indicate again that it's an SQL keyword — of 15 years, is greater than the current date and time. If we run this, we can see that in this case we're going to get all values returned, because if we add 15 years onto the invoice dates, we're going to be in 2024 and beyond, and so in this case everything is going to be past the current date and time. Now, that's one way to do it. Another way to do filtering is, let's say we wanted to find all of the invoices in the last week. What we can do here is use intervals on the other side. First of all, let's take this away, just to use the columns as we would find them in regular tables. So let's say our invoice table was still being updated, and we want to find all of the invoices within the last week. We can say: all right, let's take the current date and time, and let's subtract off the interval of seven days. So we're looking for everything within the last seven days. If we run this, in this case we're not going to get anything returned, because all of this happened forever ago. We can also first cast the current date and time to a date, to make sure that the time information doesn't come in, and then take off the seven days; in this case, our results are of course still not going to change, because this data is so many years old. But you can see, just imagining that our invoice table kept getting continuously updated, we can have our queries adapt to the current time, or we can find specific time ranges. In that way, we can do some really nice filtering to look at, for example, the things that happened in the last seven days, or the last 10 days, or the last 24 hours, or the last five hours. We can have really custom time intervals to use for filtering. And that's really nice, because most of the time when we're using data that has some sort of timestamp attached to it, we're not going to be using all of the data across all time; most of the time we're only interested in looking at specific time ranges. And so we can see that by having this option to use intervals — either to make comparisons to the current date and time or the current date, or to define time intervals using this interval statement here — we can make really custom select statements that give us data for exactly the timeframes that we're interested in.
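Here's a sketch covering both uses of intervals from this lesson, assuming the same invoice table:

    SELECT
        invoice_date,
        invoice_date + INTERVAL '1 hour' AS add_one_hour,
        invoice_date + INTERVAL '2 years 1 month' AS add_time
    FROM invoice
    -- only rows from the last seven days (nothing, on this old dataset);
    -- casting NOW() to a date first anchors the window at midnight
    WHERE invoice_date >= CAST(NOW() AS DATE) - INTERVAL '7 days';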
33. 31CommentsHB: Now, another very important thing about writing queries, and also about writing code in general, is leaving comments. When you're reading something and it becomes complex, it's really nice to have comments or annotations that explain what you're doing, what you're thinking, or what a part is for, so that when you come back to it later, or when other people try to read it, they understand the thought process and what is going on. Sometimes, if you write a more complex query or more complex code and then just leave it for a couple of months, when you come back you'll be trying to figure out what exactly you did at each point, why this is here, why it has this format, and why you maybe did this special trick. Having left a comment is a really nice way to guide people through it, so that they can understand your thought process, your way of thinking, and why, for example, a specific step has been done, or whatever else you may want to write inside. So let's take a look at how we can write comments. We're just going to use a simple query statement here, from our invoice table, limiting our results to five, just because it's a base skeleton statement to have. So let's go ahead and leave some comments — some very basic comments, but comments nonetheless. The first way that we can leave comments is by using a double dash; everything that we write after this dash here is going to be a comment. Let's put a space here so that it becomes more readable. For example, here we can say 'selecting columns'. And if we run this, you can see we still get everything returned, and even if we remove the space here, it all still works. The comments are not part of the execution, but we can leave them here. These don't extend onto the next lines, as we can see, because this part of the SQL statement is still being executed. However, if we hadn't split our SQL statement over several lines, then everything after this on the same line would be commented out, in which case we'd have a problem. So that's another reason it's nice to have all of the stuff split over several lines. Now, in this case, as you may have seen, DataGrip actually does some smart commenting, where if you have a comment and you add a new line somewhere, it automatically adds this double dash in front, which in most cases is actually really nice — though in this case, when we were trying to go back to our query, it of course wasn't what we were trying to do. That's a thing that's specific to DataGrip and not to Postgres. Anyway, we can use comment statements like this — and let's just run this again to get rid of the warning — we can use comment statements like this to annotate our code, our queries. So we can say here, for example, 'selecting from invoice table'. Of course, in this case our comments are kind of trivial and not particularly useful. But as queries get more involved, maybe you leave a comment at the beginning of a more complex part of the query, saying this part does this. You don't need to annotate every single line; it's enough to say that the following block is responsible for this. Or, when you have specific logic somewhere in there — where you're using, for example, position statements to find specific substrings — you can say that our data format is like this, which is why we can use this assumption to find the position of this next value.
You know, just explaining your thought process, so that when people come to see it, or when you come back to it a couple of months later because maybe something needs to change, you can still understand the reasoning, and you don't need to break down the queries and reconstruct the thought process that you came up with a couple of months back. So yeah, we can write single-line comments like this, and we can also write them on separate lines; for example, 'new line comment' — this is going to be a comment that occurs on a new line. Again, all of these comments that start with a double dash only span single lines. We can also have multi-line comments. Now, one way that we've kind of seen is that we can just extend our comments onto multiple lines by adding more commented lines. But this is not very efficient, because if you want to write over several lines, you don't want to have all of these dashes in there. What we can do instead is use a format that starts with forward slash, star — that starts the multi-line comment — and we end the comment with star, forward slash. Then we can take away the dashes here, because they're no longer necessary. And now we have a multi-line comment: everything between this forward slash star and this star forward slash is going to be commented out. And as comments go, we can see that if we run this, our results are of course not going to change. What we can even do is comment in between the statements; for example, we can put a comment here — about to select from the table — and then end our comment again. We can run this, and it's still going to work, because none of this is going to be part of the actual final query that gets executed. All of the comments that we're leaving here — and comments in general, if you're using them somewhere else in code — are there to make the code more readable, to explain your thought process, to explain the logic. Again, the point of comments is not to comment every single line, especially not with trivial comments like we've written here: 'selecting our columns', 'selecting from the invoice table'. Rather, if you're getting into a part that's not obvious and self-explanatory, that may cause some confusion, it's good to just leave a comment to explain it. This comment can be short or long, depending on how much explanation is necessary to cover what is going on. That way, someone who may be new to it, who sees the query for the first time — or you, coming back a month, two months, six months, a year later to the same query — can still very quickly understand the essence of the query, understand what's going on in each part, and not get lost in a specific part and have to spend time trying to figure out: okay, so what exactly is this part of the query doing here, why is it like this, and why are there all these special conditions? So in that way, having comments can make everyone's life a lot easier.
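A minimal sketch of both comment styles:

    -- a single-line comment: everything after the double dash is ignored
    SELECT *  -- selecting columns
    FROM invoice
    /* a multi-line comment:
       everything between these markers is ignored,
       even in the middle of a statement */
    LIMIT 5;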
34. 32AggregationFunctionsHB: Now, what's really cool about SQL is that queries can go way beyond just selecting specific columns and rows and doing operations on rows, which is already very cool. We also have options to do all sorts of aggregations. So let's take a look at some of this now. Let's go into our track table again, and let's specifically look at this milliseconds column and try to get some aggregate information about it. For example, what if we want to find the average duration of a song, or the longest duration, or the shortest? Or what if we just want some information like how many values are in this column? So let's take a look at some of these. We're going to write our skeleton select statement first, selecting from the track table and limiting our results to 5, just to see the table we're selecting from. All right, so we're going to first try to find the average milliseconds duration. To do that, we can use a function denoted AVG, for average. If we just put the milliseconds column inside it, then we're going to get one value returned, which is going to be the average duration. We know that it's the duration because that's what the milliseconds column denotes; strictly, it's just the average of the milliseconds column. So we can give this a better alias: average_of_milliseconds. And this is going to be the average value of our milliseconds column. We can also do other things. For example, we can find the longest time: we can take the maximum of the milliseconds column, which we can see here. This is going to be the longest time that we have in this milliseconds column, and this is our value here. We can also find the shortest time by taking the minimum — updating our alias again, here we're going to have the minimum — and this is the shortest time that we have among our track values for the time in milliseconds. We can also use another aggregation function, where we sum up all of the values in a column. For example, we can sum up everything in the milliseconds column, which is going to be the total time of all tracks. So summing up all of the values in this milliseconds column here is going to give us this result, as we can see. Now, there's another cool function that we have, which is the count. With count, we put a column in here, and it's going to count the number of occurrences that we have; it's just going to give us a count of the number of values. So for example, we can put in here the milliseconds column, and it's going to tell us the number of instances in milliseconds. If we run this, we can see that in this case we have 3,503 different instances, so 3,503 values in here. These are not unique values; these are just different instances. Our output here was cut off at 500 rows just because it doesn't want to show too much, but we can see from the count that we have more values available: in fact, we have 3,503 rows where this value is defined. Now, the reason I specifically say the number of instances in milliseconds is because, when there are null values, this affects the count. So let's take a look at a column where we know there are null values, which is the composer column.
So let's count the number of instances that we have in the composer column, as the number of instances in composer, and run this. Looking at the results here, we see we have 2,525 rows that do not contain null values, whereas for the milliseconds, we have 3,503 rows which do not contain null values. What about if we just want to get the total number of rows, without having to worry about whether a column may or may not have null values inside? Well, there are two common ways to do that. One of them is just putting a star inside here — this is going to be the total number of rows — and the star is basically just going to let us count everything. So we can see here, if we look at it, we have 3,503 rows in total. We could also put the integer one in here, and it's just going to count a one for every single row, and so we can see we're going to get the same result returned here. So we can use the count function on columns, in which case we're going to count the number of rows where we have non-null values, which, as we can see, differs between the different columns — for example, in the composer column, where we do have null values, this number is lower. Or we can put either a star or the number one inside here, and that's going to just count for us the total number of rows that we have, as we can see here. Now, we can of course still do a bunch of other cool stuff to our columns too. For example, let's first use some simple arithmetic and turn our milliseconds into minutes — fractional minutes. The first thing that we're going to do is just divide our milliseconds by a thousand, which is going to give us the song length in seconds, and then we're going to divide that by 60, which is going to give us the song length in minutes. So let's go ahead and run this. If we scroll to the left, we can see here the average song length — in this case it's going to be in minutes — is just a little over six minutes. So 6.05, and note that this is not six minutes and five seconds: the value after the decimal point goes between 0 and 1, so it's a fraction of a minute. So we can see that we can still do arithmetic on our columns inside, and then use the aggregations on that; we're just calling the function on the final result that we have inside here.
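A sketch of the aggregations from this lesson in one query (snake_case names assumed; the counts quoted in the comments are the ones we saw above):

    SELECT
        AVG(milliseconds) / 1000 / 60 AS avg_minutes,   -- fractional minutes
        MAX(milliseconds) AS longest,
        MIN(milliseconds) AS shortest,
        SUM(milliseconds) AS total_time,
        COUNT(milliseconds) AS ms_instances,    -- 3,503: no NULLs in this column
        COUNT(composer) AS composer_instances,  -- 2,525: NULLs are not counted
        COUNT(*) AS total_rows                  -- 3,503: every row
    FROM track;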
35. 33GroupingAndFilteringAggregationsHB: So we just saw aggregations, and aggregations are cool. But what's even cooler is aggregations by different groups. What that means is, let's say we want to find the number of songs that each of the composers wrote. Currently, what we'd have to do is use the count function, with which we can count the number of rows that we have, and then we'd have to filter for each composer, for each different composer, separately. But SQL can actually go way beyond that. What we can do is create groupings by each different entry that we have in here, and then we can get aggregations for each of these different entries. Now, this is going to be specific to using aggregations, and there are some specific requirements, but let's take a look. Let's do the first task we talked about, which is finding the number of tracks that we have for each different composer here. Starting off with our skeleton statement, selecting all columns from the track table and limiting our results to five, with a semicolon at the end here for good practice. And we can see, all right, these are the first five results of our table. Okay, so now, first of all, let's count the number of rows that we have. This is just going to be our number of rows, which you can see here is 3,503. Okay, but how can we group this by the different composers that we have? Well, there's a cool statement that we can use, called GROUP BY. Here we can put the column or columns that we want to group by; for example, let's group by the composer column. Now if we run this, we can see that we're getting different count values — though actually, right now we don't know what each count refers to. That's okay: we can just also select our composer column, so that we show the composer column and the result of the count. And we can see here: okay, so when the composer is John Dolan, we have a count of one, and when the composer is this one, we have a count of one, and so on. So you can see we get the different values for the different entries that we have in composers. If we want, we can remove the limit statement and see the full result for all the different composers. You can see that for some composers we have more entries, and for a lot of them we just have one entry; it really depends on the composer, as we can see here. Now of course, we can also use our ORDER BY statement: let's give our count an alias, as number_of_tracks, and let's order by number_of_tracks in descending order, so that we get the highest count first. Rerunning this and scrolling back up: okay, the highest count is actually the rows where we have null values. After that we have Steve Harris, U2, Jagger/Richards, and so on. So you can see now, for each composer that we have, we also know how many tracks we have for them in our table. Now, we can also add some filtering to this, and there are two different types of filtering. Let's go with the basic one first and just take out all of the cases where the composer is null. So where are we going to put this? We remember from earlier that the order in which we have statements is actually important in an SQL query, and we need to make sure that everything is in the right order. The place that this WHERE goes is between the FROM statement and the GROUP BY.
So we're going to say where composer is not null. Now let's run this, and we can see the null case has gone away, but none of the other values here have changed, because we're grouping by our composers here. Now, it's important again that we have the correct ordering, because if we take this statement and put it here, for example, then it's not going to work. So we need to make sure that the ordering of the statements in our SQL query is correct. All right, so we've grouped by our composers and we've filtered out the cases where the composer is null, and so now, for each composer, we have the number of tracks that we have for them in our database. So what happens if we try to select a column here that we haven't grouped by? For example, what happens if we also try to select the name column, which contains the name of every track? If we run this, we're going to get an error, because as soon as we use this group by statement and these aggregations, we can't have columns in here which are not an aggregation themselves or are not used in the group by statement. To be able to show the name column, we'd actually have to group by composer and name. Now, if we run this, we'll see this is not particularly useful, because it should be that we mostly have one record per group, and you can see in some cases here we have more. But really, what we're expecting here is that the numbers are going to be small, and there's really not much practicality to grouping by composer and name; we can just see that it works. And the way that this grouping works is: first, we group by composer, and then, within each composer, we group by the name of the song. That way, if we have two composers who've written a song with the exact same name, the two composers are not going to be grouped together; rather, first we group by composer, and then within each composer group, we group by name. And then we have our aggregation by our grouping here. If we switched the ordering, then in the cases where two composers have written the same song name, we would see that reflected here. In this case, we have some of these instances again, because we remember that we have several records where the composer is de Paris and where the actual name is Rothschild. But we could also get additional results, because if we took away this composer column here, then two songs that have the same name would be grouped into the same segment. So that's an important thing to be aware of, and let's take this out again: the ordering of the grouping that we're doing has an effect on the final result. The grouping works by taking the first column, which makes the first group, and then, within this group, the next column makes the second group, and so on. So, for example, let's try to make use of this in a more practical way, and let's try to find the number of tracks that we have for each composer, for each album. We see here we have an album ID column, so let's first group by composer, and then let's group by album ID, and let's also show the album ID here. And so what we have now is, for each composer and for each specific album, the number of tracks that we have for this composer and this album.
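As a sketch, the grouped and filtered query at this point might look like the following (snake_case names assumed, as before):

    SELECT composer, COUNT(1) AS number_of_tracks
    FROM track
    WHERE composer IS NOT NULL        -- row-level filter: sits between FROM and GROUP BY
    GROUP BY composer
    ORDER BY number_of_tracks DESC;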
And we can actually get different combinations: we can see here we have the same composer, but two different albums. In these cases, these are album IDs, and in the album table we have the mapping of the album ID to the actual name of the album itself, because it's good practice to use integers in the other places, so that if the title changes or something, you don't have to update all the other tables, just the one place where the information is stored. So we can see here we have the album IDs, which each represent a specific album. In this case, we have a composer for whom we have two albums: for album 55 by this composer, we have 20 tracks, and for album 54 by this composer, we have 17 tracks. Now, of course, we don't just need to do the count aggregation here; we can also do the other ones that we saw previously. For example, let's get the average milliseconds here, as song length, and then we're going to have another column which gives us the average song length of each album for each composer. So we're going to group by composer first, then we're going to group by album ID, and for each composer and album combination, we're then going to have the number of tracks as well as the average length of the songs. Let's remove this typo here. And so we can see, this is what we have when the composer is J.C. Fogerty and the album ID is 55: we have 20 tracks for it in our database, with an average song length of 224,000 milliseconds in this case. Notice that we also didn't need to add the milliseconds column into the group by statement, because it's part of an aggregation. So as long as we have aggregations here, that's fine; the only columns that we're allowed to select are either the columns that are included in the group by statement, or the results of aggregations. Otherwise, as we also saw earlier, we're going to get an error. So what happens now if we want to do some extra filtering? For example, let's say we want to get all of the results, but we only want to see the results where we have at least 15 tracks. If we tried to add this into our where clause and say, and where the number of tracks is greater than or equal to 15, we're going to have a problem, because the aggregation has not been performed yet, and so we cannot filter on it yet. There's actually a different way that we can filter by aggregation results, though. The way that we do that is: after the GROUP BY statement and before the ORDER BY, we can put HAVING, and here we can put the aggregation that we want to filter by. For example, we can say having the count of one greater than or equal to 15, which means having the count from this aggregation here be at least 15. Let's run this to see the result. You can see, if we scroll down, we now have no results that are beneath 15 in the number of tracks. So we can see that the where statement allows us to do filtering on a row-by-row basis, before we do any aggregation in our query. If we want to filter by aggregation results at the same time as we're doing the aggregation itself, then we can't put this into the where statement; we have to put it into this new statement called HAVING. Now, here again, we can also chain filters. For example, we can say having a count greater than or equal to 15 and an average milliseconds duration of greater than or equal to 200,000.
Before, we had 13 rows, and if we run this now, we have nine rows. We can see here, in all of these cases, the number of tracks is at least 15 and the average song length is at least 200,000 milliseconds. So we can still do additional filtering, using the same rules that we saw with the where statement. It's just that, because we're doing aggregation, the filtering on aggregation results has to be put into this special HAVING statement and cannot be put into the where statement here.
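Putting the pieces of this lesson together, a sketch of the final query might look like this (snake_case names assumed):

    SELECT composer,
           album_id,
           COUNT(1)          AS number_of_tracks,
           AVG(milliseconds) AS avg_song_length
    FROM track
    WHERE composer IS NOT NULL          -- row-by-row filtering, before grouping
    GROUP BY composer, album_id         -- first group by composer, then by album within each composer
    HAVING COUNT(1) >= 15
       AND AVG(milliseconds) >= 200000  -- chained filters on the aggregation results
    ORDER BY number_of_tracks DESC;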
36. 34JoinsHB: So previously we've learned how to do some more cool aggregations, using the group by statements as well as using having for aggregation filtering. But one of the somewhat annoying problems that we saw is that the album ID here is represented as an integer, rather than as a name or a title. That's of course due to the nature of how relational databases are built up, and it's also very good practice, because it doesn't make sense to repeat the same name hundreds or even thousands of times, depending on how big the table gets, in this album ID column here. It's much better to have an integer that represents the album title. That way, if the album title gets changed, for example the capitalization gets changed or something like that, we don't have to run the update through everything else. Also, using integers is more space efficient than having the full text values here. So you'll see this very often: we have different tables for different things, as we also learned about in the relational structure. For example, every album is just kept track of by itself and has a specific ID. Each artist here is referenced by its ID, but each ID actually represents a specific artist, as we can see here. This is a really nice way to split up data, to make sure that no table grows too big, or that we have unnecessarily large repeating information in one table that could be made more efficient. For example, imagine if we kept track of the artist's name and the album title instead of the album ID. In this case, we don't keep the artist ID in this table, but you understand my point: keeping track of the name rather than the ID is not as efficient, because integer values take up less space than repeating, for example, the title 'For Those About To Rock We Salute You' hundreds or thousands of times, depending on how many entries we have in our different tables. And again, it's also good practice, because if we want to make a change to a title, for example because we noticed that we made a typo when we put it in, then rather than having to make that change in every single table that has this title value, we can just change it in, in this case, the album table, and then any album that's referenced by this ID will just find the right title, rather than us having to go through every single table, see if the title is there, and update it if it is. In this way, everything sits in a nice structure that is all still connected, but we're doing the connection in a much smarter way, which makes everything much less error prone and is also more efficient. But of course, now we encounter these problems like we see here, where we have the album ID, but really, at this point, we'd like to see the album name. So how can we do this? First of all, what I'm going to do is comment this out, because I'd like to get back to it later so that we can implement the solution here. But first, I'd like to look at a simpler use case. What we're going to do is write a join statement, where we can join two tables together, and we're going to use our artist table as well as the album table. So we're going to take the album table here.
And for each album, we're going to join on the artist table. Now, we know that the artist ID value here refers to the artist key here, so we know, for example, when we see artist ID one here, this corresponds to the name AC/DC. The way that we can join these tables together is using an SQL statement called join. So let's first write our kind of skeleton SQL statement: we're going to select everything from, and we'll start off just with, the album table, which we have here. Remember, in this case we don't need the quotation marks around it, because we made that change way back when we actually renamed it. And let's just limit our results to five for now, and go ahead and run this. All right, so now we have our album table here, but let's go ahead and join on the artists. To do that, we're going to have our select statement, to define the columns, then the from statement, and now, afterwards, we're going to have a join statement. So this is going to be the first table, and this is going to be the second table that we're going to join onto it. We want to join on this table here, the artist table. But now we need to specify what to join on, because right now there's no way for it to know on which columns these tables should be matched. Now, of course, you may say: okay, well, the title here is artist ID, and if we go into the artist table, the column here is also called artist ID. But remember that in some cases this may just be called ID, because it is the ID and it's implied that it's the artist ID, while in another table it may be called the artist ID, to make it more explicit that this is referring to the ID of the artist, because we can have other IDs, for example the album ID. So we need to specify what these two tables are to be joined on, and we do that using the ON statement. So we say on, well, that was autocompleted by DataGrip, but let's do this properly: on the artist ID in the album table, which we have here, being equal to the artist ID that we find inside of the artist table. Now, this statement was also just autocompleted for us by DataGrip, and if we actually look at our table formats, we can see the reason that it knows this: the artist ID in the album table has a foreign key that links to the artist ID in the artist table; it's right here. So we can see that there are these foreign key links, which are really nice because they specify how these columns relate the different tables, and we can actually use that in the ON statement here. So let's just run this so that we can actually see the results. We have our album ID and we have our title; these come from the album table here. But now we have two more columns, called artist ID and another artist ID, and that's because we have two artist IDs: one from the album table and one from the artist table. In this case, it's actually showing us that this is the artist ID from the album table, and this is the artist ID from the artist table.
And we can see here, of course, they're going to match, because we've joined these tables with the condition that the artist ID in the album table is equal to the artist ID in the artist table. So we can see that the joined values here match. And now we have the name column, which comes from the artist table here. So we've taken two tables, the album table and the artist table, and we've kind of stuck them together side by side, where this artist ID column is equal, which we can see here. Now, at this point, it would be really good to start using table aliases. For example, let's give this table the alias art, and this one we can give, for example, alb. So now, rather than saying album dot artist ID, because we're using table aliases, we're going to say alb dot artist ID, and here we're going to say art. So this is the first thing we can specify, but we also probably don't want to select all the columns. Maybe we want the album ID, but the album ID comes from the album table, so it's good to specify that we want the album ID from the album table. Let's say we want the title, also from the album table. And let's say we want one of the artist IDs, but not both; we can pick and choose whether we want the one from the album table or the one from the artist table. It doesn't make a difference, because they're going to be the same, since we joined on them, but it's just nice to pick one so that we don't have duplicate values here. And we also want the name of the artist itself, which comes from the artist table. So we can run this, and we can see now everything is much cleaner, specifically because we don't have the artist ID duplicates. It's good to use the table aliases so that we're referring to the proper tables here. Of course, we don't need to use table aliases; we can also use the full table names, but it's nicer to use aliases because it keeps things shorter. And it's good to specify exactly which columns you want to select, because a lot of times, especially when tables get bigger, there are going to be a lot of columns that you probably don't want to use, and so it's better to specify exactly which columns you want. And you don't just have this for joining. For example, let's take a look: our track table here has a bunch of columns, so if we join it onto another table that's about as big, then of course we're just going to have a bunch of columns, and we're not going to use many of them. So, and let's start closing some of these because they're also taking up a lot of space, it's good practice to select out only the columns that you actually want, unless you want to use all of them. But even then, you don't want to have duplicate columns; having the artist ID twice is just a waste of space. So it's good practice to specify which ones you want to pick out. So we can see here, this is the statement that we used to join the album and the artist table together, and this join was done on the condition that the artist ID in the album table is equal to the artist ID in the artist table. And of course, we can take away our limit statement here, and that's going to give us the full result for all of these cases.
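As a sketch, the join we just built, with aliases and explicit columns (the alb and art aliases and the snake_case names are assumptions):

    SELECT alb.album_id,
           alb.title,
           alb.artist_id,   -- pick just one of the two matching artist ID columns
           art.name
    FROM album AS alb
    JOIN artist AS art
        ON alb.artist_id = art.artist_id;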
And now, something that's important to know is that what's happening here is actually called an inner join. We can spell out INNER JOIN if we want, although if we just use the join statement by itself, the inner join is implied. What the inner join means is that when any of these values are null, it's not going to join on them. So, for example, if we had a case here where the artist ID is null, then that row is going to be dropped. Or, if there's an artist ID here which is not represented in the artist table, then those values are not going to be included. So what an inner join means is that we're only getting the results where these values have a match here, and where these values have a match here. Now, there are other joins that we can do where, if we don't have an artist ID in here, or we have an artist ID but it's not represented here, we keep that information. But if we're using the inner join, which is implied when we use the simple join statement, then we're only looking for the intersection of these two: we're only looking at where this value and this value exist on both sides. And so if one of the values doesn't exist on either side, we're not going to have that row. This is important to know, because if a value is present somewhere but not present somewhere else, those rows are going to be removed. So just be aware that this here is what is called an inner join. All right, so now that we know how the join works, let's use this on our previous statement that we commented out earlier. Let's rerun this. What we want to do is replace this album ID column with the album name instead. So how can we modify our statement here to pull out the album name? What we're going to do is a join, specifically an inner join, but that's implied when we write the join statement here. Onto this track table here, we're going to join the album table, this one. I'm going to use the shorthand alb here, and here I'm going to use just the shorthand tr. And I'm actually going to specify each of the column references now, to make sure that there are no confusions or mistakes, in case, for example, the album table also has a column called composer, because then there's going to be a conflict: do I choose the composer from the track table or from the album table? So now I'm going to make sure to specify each of the column references that I have, so that there's no confusion or errors. All of these refer to the track table, because now we're bringing in the album table. So onto the track table, we're going to do an inner join with the album table, and we're going to join on, using the shorthand for the track table, the album ID being equal to the album ID of the album table. And now, rather than selecting the album ID here, I want to instead select the album title from the album table. So if we run this, we still have our album ID reference down here. In this case, we can actually do the grouping by the album title, because we're performing the join first and then performing the grouping and aggregation operation. So you can see: first we perform the join, then we perform the grouping and aggregation. So rather than grouping by the album ID, now we're going to group by the album title.
It's going to give us the same result, of course, because the ID just refers to a title. So let's run that one more time. And now we can see our ID has been replaced by the title, which came from the album table here. We got that because, for every row in the track table, we joined the album ID here onto the album ID here, and we updated our select statement to instead extract the title from the album table, which our track table and our album table were joined together on. So remember, we joined our two tables on the album ID columns. In this case, the album ID was called album ID in both tables, but in some cases, for example in the album table, it would make sense to just call it ID, because it would be implied that the ID represents the album ID. In that case, we would just reference the ID column of the album table. Here that's not how it's named, but I'm just saying that these two column names do not always have to be equal. In this case, they happen to be equal, but there are also many cases where they won't be, because the primary key would just be called ID, rather than having the table name or something else that specifies which ID exactly it is. So we can see joins are very useful because of the way that relational databases are built up, which is really useful and efficient because it allows us to separate information and keep track of different things separately, to make sure that not everything is affecting everything else and that our storage is efficient. And then, when we do updates, these updates only need to be applied in one place, and when we use joins, they're basically propagated everywhere, because we're using IDs everywhere else. So a lot of this is very good practice, which leads to good performance, good storage, and also much easier updating. And then we can use the join statements, because we have all of these relations between our different tables, to fill in the proper information, for example replacing the album ID here with the title of the album instead. And we can see that none of our other statements really changed, except that, because we're no longer taking out the album ID, we have to use the album title in the grouping instead, since the album ID is no longer in the select statement here. And the other thing that we did was adding the aliases for our tables, to make sure that there's no confusion, for example when we have duplicate column names, and that we're referring to the correct column from a specific table. So, for example, here we're referring to the composer column from the track table, because the composer column could also exist in the album table; to make sure we don't get that conflict, we specify which table we want to choose from. Just in case you were confused by what just happened: if you double-click here, it takes away this navigation pane on the left, and if you double-click again, it brings it back, just as a side note. But yeah, as you can see, joins are really nice, because they allow us to join together all this information that is intentionally split up because of the structure of relational databases.
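Putting it together, a sketch of the updated aggregation query with the album title joined in (snake_case names assumed):

    SELECT tr.composer,
           alb.title,
           COUNT(1)             AS number_of_tracks,
           AVG(tr.milliseconds) AS avg_song_length
    FROM track AS tr
    JOIN album AS alb                  -- a plain JOIN implies an INNER JOIN
        ON tr.album_id = alb.album_id  -- the join happens first, then the grouping
    WHERE tr.composer IS NOT NULL
    GROUP BY tr.composer, alb.title    -- group by the title now, instead of the album ID
    ORDER BY number_of_tracks DESC;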
37. 35LeftRightFullJoinsHB: All right, so now we've learned about joins, specifically inner joins, where we look at the intersection between two tables. But in this case, we're limited to the intersection, and there may be some cases where we have data in one place and there may not be a corresponding value somewhere else. For example, let's take a look at this invoice line table. The invoice line table just has purchase information: which invoice it was, what track was purchased, the price, as well as the quantity purchased. And we can see we have a track ID reference here, and if we just look at that, although you can probably already tell from the name, we can see the track ID is a foreign key reference to the track ID in the track table. So if we open up the track table, we can see here we've got the different track IDs. Let's count how many tracks we have in the track table first. We're just going to select the count of one from our track table, and we're not going to do anything else. We can see here we've got 3,503 rows, and since each row has a unique track ID, because the track ID is the primary key, we can also deduce that we have 3,503 unique tracks. But just because we have that many tracks, it doesn't mean that every single track was purchased; maybe it was, maybe it wasn't. Let's find out. So if we look at the invoice line table, how can we count how many different tracks were purchased? Currently we have our aggregation methods, but we don't yet have a specific tool to find out exactly how many unique tracks were purchased here, since some tracks can be purchased more than once, as different people can purchase the same track. So how can we find this out? Well, what we can do is make a slight modification to our count statement here: inside it, we can put the keyword DISTINCT, which is going to look for distinct occurrences. In this case, though, we're not allowed to use the number one anymore; we have to make a specific reference to a column. So we're going to reference the track ID column inside the invoice line table, and we're going to update our table from track to the invoice line table. So let's go over this statement. We're doing the count aggregation, but rather than just counting all of the occurrences, we want to know the number of unique occurrences in the track ID column. We want to know how many different track IDs are present inside of this track ID column, and that's what the DISTINCT inside here allows us to do: it allows us to count distinct occurrences of separate items. So again, it's important that we actually reference the column where we want to count the unique occurrences. If we run this, we can see that in the invoice line table, in the track ID column, we have 1,984 different, or rather unique, tracks. That doesn't mean that each track only shows up once; it's possible that it showed up more than once. That's not the information we're trying to get. We just know that 1,984 unique track IDs show up in this track ID column inside of the invoice line table.
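As a sketch, the two counts from this part of the lesson (snake_case names assumed):

    SELECT COUNT(1) AS number_of_tracks     -- 3,503 rows in the track table
    FROM track;

    SELECT COUNT(DISTINCT track_id) AS unique_tracks_purchased   -- 1,984 unique track IDs
    FROM invoice_line;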
So what does this mean and why is this important? Well, if we want to join our track and our invoice line tables, that means that some of our tracks, in fact over a thousand, even over fifteen hundred, don't have any invoices for them, which means that data would be lost if we're looking at the intersection. But sometimes we don't want to lose it. Sometimes, if there's no invoice for a track, we still want to keep it there and just have it say that there's no data for this one. And we can't do that with an inner join, because the inner join looks at the intersection. So let's write a query where we take our track table here, and to every track we attach the information that we have here from our invoice line table. Now, for a track we may have more than one invoice, which means we're going to get some duplicate rows, where the information about the track itself is duplicated, but the information about each invoice is separate. To do that, we're going to say select, and let's just start off with select everything from the track table, and let's give this the shorthand tr. We're now going to do a left join, and we're going to join it on the invoice line table, and let's give this the shorthand il. Then we're going to join on the track ID of the track table, and we can already see DataGrip is suggesting it to us because of the foreign key reference: the track ID of the track table being equal to the track ID of the invoice line table. So again, we want to match where this track ID here is equal to the track ID that we have here. And if we have more than one occurrence, for example for our first track here, then we're going to have the track table data duplicated. So we're going to have two rows here, but the invoice data that's attached onto it is going to be one row for every separate invoice that we have for it. So we're not going to have fully duplicate rows: parts of it, for example the part from our track table here, can be duplicated, but then come the corresponding invoices. If there's more than one invoice, for example if there are three invoices for the first track here, 'For Those About To Rock We Salute You', then we're going to have three rows for this track. The track information is going to stay the same, but each row is going to have information about a separate invoice, and all of this will be available to us for every single record that we have in our now joined table. So let's go ahead and just run this to see the result. As we can see, we again have the information, when there are duplicate columns, about which table each column comes from. We've got the information from the track table, and then, if we scroll to the right, we also have the information from the invoice line table, which we can see here. So yeah, these are going to be our tables, now joined together. But the important thing is, let's take a look at the number of track IDs from our track column that show up here. We're going to count distinct, and we're going to focus on the track ID from the track table, since not all tracks show up in our invoice line table here, but all of the tracks are kept track of in our track table. So let's look at the unique number of track IDs that we have in this joined table, focusing on the track ID column that came from the track table.
Running this, we can see we've got 3,503 unique tracks, which is exactly what we would expect. But not all of those tracks actually have invoice data. So let's just select everything again here, but let's focus on finding the cases where we don't have invoice data, to see what those look like. We're going to say where the invoice line table's track ID, and in the cases where we don't have data, these values are going to be null, so we're going to look for where one of the columns from our invoice line table is null. We can use the track ID; we could also use any other column from the invoice line table, because if we have a track that has no corresponding data in the table that we're joining on, then the resulting values for that join are just going to be null. So if we run this, you can see here we've got, for example, track ID 7, and if we look, we have no invoice data that corresponds to it. This is what our left join does: it takes the first whole table, this whole table here, and it joins on another table with this condition. But if there's no corresponding instance in our second table, then we still keep the information from this first table, or from this left table, as you can think of it. Because there's no corresponding information from the other table, though, the values that we have in here are going to be null, since there's nothing to add on. So how does this differ from the inner join that we had before? Well, if we take away the LEFT here and run this again, we can see our results are going to be empty: there's no single row where there's a null value inside of this invoice line data. And that's because, when we're doing an inner join, we're looking at the intersection, whereas when we're doing a left join, we're keeping all of the information from the first table, and if information from the second table exists, we're adding that on. In the cases where this information does not exist, for example where the track doesn't have an invoice in the invoice line table, then, because we still have the columns in the resulting joined table, these values are just going to be null, because there's nothing to join on. So we can see that's what the left join does: we don't toss out the rows that don't have corresponding data in the other table. Sometimes that's good; sometimes it's also not good, because it means your resulting tables are going to be much larger. If there's no corresponding data, then with the intersection those rows would be tossed out and the resulting table would be smaller, whereas now we're holding onto them. So, depending on what you need: if you still want to keep information that may not have corresponding information in the other table, then you want to use a left join. Another alternative is a right join, which we'll see in a second, or a full join, also called a full outer join. But if you only want to focus on the intersection, then you want to make sure that you use an inner join, which you can also shorthand by just writing JOIN.
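A sketch of the left join and the null check from this lesson (snake_case names assumed):

    SELECT *
    FROM track AS tr
    LEFT JOIN invoice_line AS il
        ON tr.track_id = il.track_id
    WHERE il.track_id IS NULL;   -- only tracks with no matching invoice line; drop this line to keep all rows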
So, I mentioned two other types of joins. We have the left join, so let's run the whole thing with the left join again: the left join focuses on keeping every single value from here and joining on the data from this table here, and in the cases where there's no match from the second table, we still keep the data from the first table; the corresponding values from the second table are just going to be null. Now, there's also something called a right join, and with a right join, the difference is that, rather than having this table form the base and keep the null values, it's going to be this table instead. So the right join and the left join are just a matter of perspective, because a right join is basically a left join with the table order flipped. And so, you know, it's easier to just always go with a left join and order your tables accordingly, so that you're always using the first table as the base to join onto. Because with a left join, if we check for values here, for example in the track column, and if we run this, we can see there are no instances where we have a track ID that's present in the invoice line table but not present in the track table; there are no such instances. So if we're doing a left join with the invoice line table as the base, then this result here is going to be empty. But if we did a right join, using our current syntax, then we'd get the result that we had before, because now we're also keeping the information from the track table, and there are going to be some tracks that are kept track of in our track table that do not have a corresponding invoice in the invoice line table. So we can see a left and a right join are basically the same thing; it's just the order that we're mentioning the tables in that's different. You can use a left join instead of a right join and just flip the order of the tables, and that's usually the better way to go about it, just to make sure that you're keeping a consistent way of thinking. But of course, you always have the option of doing either. Now, there's also another type of join. What if we want to keep all of the information from both tables? What if we have some data in our invoice line table that doesn't have corresponding values in the track table, and we have some data in the track table that doesn't have corresponding values in the invoice line table, and we want to keep all of it? In the cases where we have data in the invoice line table but not in the track table, we want the values for the columns that come from the track table to be null, and in the cases where we have information about the track but no corresponding value in the invoice line table, we want the values in the columns that correspond to the invoice line table to be null and the values for the track to still be there, so that we have the intersection as well as the information from every single table. This is called a full join, or also a full outer join; they're the same thing. So again, what this does is: we're looking at the intersection, as well as all of the values in both tables that don't have corresponding values in the other one. Whereas with a left join, we're looking at the intersection plus all of the values in the first table that don't show up in the second table, and with a right join, we're looking at the intersection plus all of the values in the second table that don't show up in the first table. So yeah, these are the different types of joins that are available to us.
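As a sketch of the three variants side by side; note that RIGHT and FULL OUTER JOIN support depends on your database engine (some engines, such as older SQLite versions, only offer LEFT JOIN, in which case you flip the table order instead):

    -- keeps all tracks; invoice line columns are null where nothing matches
    SELECT * FROM track AS tr LEFT JOIN invoice_line AS il ON tr.track_id = il.track_id;

    -- the same result with the table roles flipped
    SELECT * FROM invoice_line AS il RIGHT JOIN track AS tr ON tr.track_id = il.track_id;

    -- keeps unmatched rows from both sides
    SELECT * FROM track AS tr FULL OUTER JOIN invoice_line AS il ON tr.track_id = il.track_id;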
Remember, if you're increasing your join scope, basically going from an inner join to a left or right join to a full outer join, you're going to keep more data, including the data that may not be matched in the other table, but that also means your resulting tables are going to be larger. Now, usually when we're doing joins, the sizes of our tables are only going to increase, because, for example, a single instance in the track table can have several corresponding invoices in the invoice line table. So if we have a hundred tracks and each track has three invoices, our table is going to grow, because we need to keep track of every single one of those invoices too. So just think about that: even if you start off with a small table, your table sizes may grow, because you're looking for all of these matches. So again, use joins based on your needs, but don't just use left, right, or full outer joins if you're not going to use the unmatched data, because the resulting tables are going to be big. It's better to only look at the intersection if you're actually only interested in the cases where the columns that come from each table are both not null, basically where you have all of the information from the track table that has corresponding information in the invoice line table. If you're only focusing on the intersection, just make sure you use the plain join, which reduces your table size, is more efficient, and is also going to lead to better performance. But, last but not least, if you do have instances where you want to be able to observe information even when there may not be corresponding values in the other table, you always have the option of using left or right joins, as well as full joins, or full outer joins as their full name goes.
38. 36SelfJoinsHB: Now, we've seen the different sorts of joins: inner, left, right, and full outer joins, where we can join the information from one table onto the information that we have in another table. But actually, there's nothing about the join statement that says you can't join a table onto itself. We can do that: we can do a self join, where we take the information from a table and we join it onto itself. So that's what we're going to look at now. And we're also going to look at how we can join on multiple conditions, since before, we were always just matching a single condition, for example that the track ID in the invoice line table has to be equal to the track ID in the track table. But we can also use multiple conditions, like we did in the filtering, for example. So let's take a look at that. First of all, let's just select everything from our invoice line table, giving it the shorthand il. And then we're going to join, again, on the invoice line table, and let's just give this the shorthand ila, just invoice line and then an a for the additional copy. So we're going to join on, first of all, the invoice IDs being the same. In this case, I want to look at one invoice and see what other purchases were made in that invoice, and we'll see exactly why we do that later on in this video. So let's just join on where the invoice IDs here are the same. We can see we can have different invoice line IDs, but all of these different items are, in this case, part of the same invoice. For example, in this invoice, the person purchased two different tracks, and so we can see we have two different tracks represented here. So let's join on the invoice ID of the first invoice line table being equal to the invoice ID of our second invoice line table here, which is the same invoice line table. But let's also add another condition, where we say that we want the track ID here to be bigger than the track ID in our first table. That way, we can make sure that we don't have a bunch of cross matches, but that we have at least some sort of ordering; again, we'll see the application here in a second. So I'll take our invoice line track ID, and we want the track ID of the additional table to be greater than the track ID of the initial table. Now, the ordering here doesn't actually matter much, because everything is kind of symmetrical, but it just helps with visualizing this in your head: the second one is the one that you want to see as the reference, and this one here you can think about as the base. And you can see the simplicity of adding an additional condition onto our join statement: we just put an AND here. So we're saying on this condition, and then we can use ANDs to chain on additional conditions. We can also use ORs if we want. So, just like we saw in the where cases, the same sort of logic applies if we want more conditional logic here for how exactly we want to join these tables together. Again, this is going to be basically the same as what we saw in the filtering cases, except now we're doing it in the ON statement that belongs to the join statement that we have above. So let's put our semicolon here and run this. In the results that we get from this, we can see we've got all of the same columns, once for each table: we've got the invoice line ID from the first table here and all the other corresponding values.
And then we have the same columns from the second table, the invoice line a table, as we called it: again the invoice line ID, the invoice ID, the track ID, and the unit price. But we can also see that here, the track ID from the second table that we joined on is four, and the track ID of the first one is two, as per our second condition here: the track ID of the second table has to be greater than the track ID of the first table. So why would we want to do this? What can we use this for? Well, there are certain times when this information is really useful. For example, let's say we want to figure out which tracks are often bought together, or, if someone buys one track, which other tracks they might also like. What we can do is give some of these columns an alias so that we can better reference them; we also don't need all of the data here. What we want to select is the track ID from our invoice line table, the first one here, which we can call the base track, the kind of baseline track that's purchased. And we also want the track ID information from the second table, and this can be the additional track; we also need the AS keyword here. So if we take a look at this now: we have this track was purchased and this track was also purchased with it, this track was purchased and this track was also purchased with it, and so on. So now we have information about one track that was bought, as well as another track that was bought with that same track. So let's say we want to figure out which tracks are often bought with track ID six. If we go to the track table here, just doing this lookup manually: okay, track ID six corresponds to 'Put The Finger On You', that track. So let's say we want to know, if someone buys this track, what other tracks they may like; or, just more generally, what other tracks have been purchased in the same kind of invoice. Because if someone purchases this track and they also purchase another track, it's likely that they're going to like the other track too. So if someone wants to buy this track, maybe the first thing that we can think about is: what other tracks did people who've already purchased this track also purchase? Because someone who purchases this track may also be interested in the things that other people have purchased with it. And we can use this result here to very quickly get that information using a group by statement. We can say, all right, let's group by our base track, and afterwards we also want to group by our additional track. And now we also want to use an aggregation: we want to count the number of occurrences, and we can call this one paired purchases, maybe. And then let's order everything by the paired purchases in descending order, to get a table output here of: when this track was bought, and when another track was bought with it, how often were these tracks bought together? So let's go ahead and run this. And we can see here, for example, when track 1412 was bought, there were two times where track 1424 was also bought with it. We can see that in this case the paired purchase counts are pretty low, but that's also a limitation of just how many tracks we have available, as well as how many invoices we have in our data, and as the number of invoices in our invoice line table here grows,
that means we'll have more purchases, more information about purchases, and we'll get more information about tracks that are often bought together, so we can expect these numbers here to also grow. Of course, we can also sort in the other order to see: okay, well, since our maximum value is two, the only other option we have is a value of one. But you can see the approach that we can take with this: we can join a table onto itself, and usually we want to use additional conditions on these joins. Now, we don't only get to use additional conditions on self joins; we of course also have the option of using additional conditions when we're doing other types of joins. It's just that when we're doing self joins, we very likely want to use additional conditions, or at least do some extra filtering, because if you're joining all of the information of a table onto itself, again, your table size is going to grow a lot. So, to make sure that you're reducing this and really only picking out the information that you need, you probably want to put additional conditions into this ON statement here, which defines the join that's going on above. And so in this case, for example, we were able to use the self join to quickly get at least some superficial information about: if one track was purchased, what other tracks were also purchased, and how often were these tracks purchased together?
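A sketch of the self join from this lesson (snake_case names assumed; some engines want the full expressions, rather than the aliases, repeated in the GROUP BY):

    SELECT il.track_id  AS base_track,
           ila.track_id AS additional_track,
           COUNT(1)     AS paired_purchases
    FROM invoice_line AS il
    JOIN invoice_line AS ila                -- the same table, joined onto itself
        ON il.invoice_id = ila.invoice_id   -- items from the same invoice
       AND ila.track_id > il.track_id       -- chained condition: avoids self-pairs and mirrored duplicates
    GROUP BY base_track, additional_track
    ORDER BY paired_purchases DESC;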
39. 37UnionsHB: So we've seen joins now, but there are also other cases where we may want to bring information from tables together, where we don't want to join them horizontally but rather stack them vertically. This may show up, for example, if you have several tables that keep track of the same thing, but the information is split up over several tables, just to keep any individual table from growing too large. Or you have different tables, each of them representing data from an individual day, and you want to bring all of this information together into one table so that you can actually run a query on the full data. So how can you do this? Well, first of all, let's create our two subset tables so that we can join them together. We're going to use the create table statement, and we're going to call this table track part one, and we're going to create it as the result of the following query: we're going to select everything from the track table, but only where the track ID is less than or equal to 2000. Let's put a semicolon here for good practice, go ahead and run this, and take a look at our database now. Going into our schemas and into the tables here, we've got our new table, track part one, which contains all of the information from our track table, but only the first two thousand tracks. Then let's create a second table, track part two, with all of the information from the track table where the track ID is greater than 2000. So we're basically splitting our track table here, which we remember from earlier contains about 3,500 rows, into two smaller tables, track part one and track part two. As we can see, if we just open this one here, we start at track ID 2001. So how can we join this information back together? Because maybe we want the result of all of our tracks, because we need it for something. Maybe, if we want to do a join on all of our tracks, we first need to be able to bring our different tables together, so that we actually have all of the information about all of the tracks that we have, rather than using one smaller table that only contains a subset of all the data. To do this, we can first just write our kind of skeleton select statement: we're going to select everything, and let's just start with track part one. Onto this, we now want to add on the information from track part two. To do that, we're going to use a union statement here, and then we're going to have another select statement, where we're again just going to select everything, this time from track part two. So if we run this, we're now running the whole thing, putting a semicolon here also for good practice, and we're going to have all of our information from the two tables. We're selecting everything from track part one, we're selecting everything from track part two, and this union statement here means that these two results are going to be joined together, vertically stacked. So rather than the kind of horizontal joins that we saw when we used the join statements, with a union we take the results from here and just add onto the bottom the results that we have from here. And in this way, we can use the union statements to join together multiple tables, and we can even extend this.
In this case, we only have two tables, but let's just pretend that we have another table, so we can do another union. Let's say we want to select everything again from track part two, and we can just run this again, and we can see that it still runs. So if we had multiple tables here, we could just do multiple union statements, where we're selecting the information that we want from the appropriate table and then using the union statement here to stack it all on top of each other, so that the final result is the final table that we actually want, containing all of the data joined together. So again, in the cases where your data may be split over several different tables, you can use the union statements to bring the information from all the tables together, so that you have it all in one place and, when you're running your query, you can run it on all of the data from the separate tables, stacked vertically on top of each other, rather than having to query each subtable individually and then trying to join the results later on. Now, if we want to get rid of the two tables again, remember, all we have to do is say drop table. So we're going to drop our track part one and run this, and then we're also going to drop our track part two and run this. Now we again have a kind of cleaned-up database, because we don't actually need our track table split into two separate tables; we just used them for the union example. And it's also good to clean up afterwards, so that we don't have a bunch of clutter in our database here.
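A sketch of the whole flow from this lesson (snake_case names assumed). One detail worth knowing that isn't covered above: plain UNION also removes fully duplicate rows, while UNION ALL keeps every row, so UNION ALL is the literal 'stack everything' variant:

    CREATE TABLE track_part_1 AS
    SELECT * FROM track WHERE track_id <= 2000;

    CREATE TABLE track_part_2 AS
    SELECT * FROM track WHERE track_id > 2000;

    SELECT * FROM track_part_1
    UNION                          -- stacks the second result under the first
    SELECT * FROM track_part_2;

    DROP TABLE track_part_1;       -- clean up the example tables again
    DROP TABLE track_part_2;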
40. Subqueries: All right. In this lesson, we're going to go
over something really cool which really
provides you with a lot of flexibility in terms of the queries that you
write because it allows you to basically nest
together different results. What we're going to
learn about in this lesson is something
called sub queries. Now, here's the idea. Let's, for example, go back to our invoice line table. We can now write an SQL query
where we just, for example, group by the invoice ID, and then we can just count
the number of occurrences. We can get a result
that tells us how many purchases were
made with each invoice. Let's go ahead and just
write that query real quick. We're going to just have our skeleton SQL statement first. We're going to select from
the invoice line table, and let's just limit results to five for now until we're
confident about our query. From here, we're going to
select the invoice ID. Now we also want to
run an aggregation, and we just want to count the number of items that we have for each invoice ID, and to do that, we
also need a group by statement and we need to
group by the invoice ID. If we run this and let's
give this an alias as number of items. Running this, we can see here: invoice ID one, for example, has two items purchased in it, invoice ID two has four items, and invoice ID three has six items. We can remove this
and take a look. Here we can see what our data looks like. Of course, we can also order it, for example by the number of items, maybe in ascending order. We can see here, all of these invoices have one item purchased, and if we go further down, we then have the two-item invoices, and so on. We have these results now. But what if we now want to know how many invoices had
two-item purchases? Basically, we've run
this aggregation, and now we want to
use the result of this aggregation or maybe
we've run another query, and we want to use the result of that query as our starting
point for the next query. The problem is, in this case, we can't really use, and there's a typo here, by the way, we can't really use this column, because this column only exists after the query is finished. We can't really do
anything else with it. We can't run any extra
aggregation by the number of items because the column
itself doesn't exist yet. It only exists after
we run this query. So how can we then use the results of this
and basically continue on? Because sometimes queries get more complicated and you want to run one part of the
processing and on that result, you then want to continue
on to do something else. Maybe you want to do some processing; on the result, you then want to do some aggregation; on that result, you then want to do some additional aggregation, but now based on a different reference
frame because now you have different columns that maybe represent
aggregated results. Now, to do this, we can use
something called subqueries. A subquery is basically a way for us to say: let's use the result from a query rather
than a specific table. What a subquery looks like is this: we have a SELECT statement, and then we have a FROM statement, just like before. But instead
of having a table here, we instead have a query. Then at the end of
this, we also have to add a name for the subquery. For example, we can just
call this sub for subquery. We can recognize
the same format. We have a SELECT statement, then we have the FROM. But instead of having
a table name here, we instead have a full query. What we're selecting
from here is the result of this
whole query here. Then we're also giving
this an alias and we actually have to give
our subqueries an alias. Otherwise, we're going
to have problems. We're giving it an alias, calling it sub so that we
can reference it later on. If we run this, we're going to get the same result because all that we're
doing right now is just selecting all of the columns from the result of our subquery. But of course, we can
now do extra stuff. Now, for example, we can
access the number of items. We can also still access the invoice ID
because both of these were returned from our
query inside here, from our sub query. Let's now group by the
number of items and count how many invoices basically fall into each
of these categories. So we're going to now
show the number of items and we don't need to do this if we
only have one subquery, but it's good practice
to reference with the subquery that
we're taking this from because if we have
multiple subqueries, then we can have
duplicate column names. Just like when we're doing joins and we have to reference, or would prefer to reference, the appropriate table to make sure that
there's no confusion. Similarly, when we're
using subqueries or generally when we have
aliases, for example, also just for tables,
it's good to use those names to make sure we're referencing the
appropriate values. We're selecting the number
of items from our subquery. Now we also want to run
an aggregation again. We can just do the count
of one. At the bottom here, we're now going to group by the number of items, and let's order by here again. Here we want to order by, and maybe we should give this an alias first. This here, we can say, is the number of invoices. Now let's order by the number of invoices in ascending order. It doesn't really make
sense to have the order by statement in here
because in this case, we're not using the
result of the order by. Just to make things
more efficient, we can just take the
order by statement out because we're not
using it in any way, and we're just changing
the order later on. We can take it out
here because we're not using it. In some cases, you may actually be using the order; in that case, you can keep it in the subquery. But in this case, we're not
going to take it out. I'm also going to
add some indentation here just to show that all of this is part of the same
subquery, to make it easier to see. If we run this now, and let's go ahead
and scroll up, so we can see, we have 59 invoices where there were
nine items bought each. 59 invoices where there were
four items bought each. Actually, we have
a lot of 59 cases probably because this is
the practice data set. Then we have 117 invoices, where there were two
items bought each. We can see now we can
use subqueries to use the result of a query as the starting point
for the next query. Sub queries are really
nice because they allow us to get past those problems where we just have some results and we want to continue using them, but we can't access the columns, because they don't exist yet; they're only
created after the query is run. Therefore, there are many cases that you're going to
encounter where you're going to want to use a sub query
so that you can continue on using the results
from the previous query. Now, in this case,
even though we return the invoice ID
in our statement here, we actually don't make
use of it up here. Just to keep things clean, we can even just
remove it because we're not really using
it for anything. In this case, we can just
remove the invoice ID to keep our subquery
clean because again, we don't need to make use of it, so there's no point
in returning it because it's not going
to add much value. Because all that we're using
is the number of items. And now that we have
the number of items, which is the result
of an aggregation, we can now use that resulting
column, and for example, run an aggregation again on it, now grouping by the values
in this column instead.
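As a sketch of where this lesson's query ends up (assuming the table and column are named invoice_line and invoice_id, with the aliases we picked along the way), it could look something like this:

```sql
-- Outer query: how many invoices fall into each item-count group.
SELECT sub.number_of_items,
       COUNT(1) AS number_of_invoices
FROM (
    -- Inner query: items per invoice. invoice_id is grouped by
    -- but no longer returned, since we don't use it outside.
    SELECT COUNT(1) AS number_of_items
    FROM invoice_line
    GROUP BY invoice_id
) AS sub
GROUP BY sub.number_of_items
ORDER BY number_of_invoices ASC;
```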
41. Window Functions and Aliases: Now, we've already seen that we can do a lot with SQL, but there are some things that are still out of reach. One of these things, for example, is taking aggregations a step further. Currently, when we're running aggregations, we mainly focus on getting aggregated results over the whole table. But sometimes we don't want aggregated results over the whole table; rather, we'd like to keep all rows and yet still be able to have intermediate aggregate results, or aggregate results that are running averages, or anything like that. So let's go ahead and take a look at how we can solve some of these problems. First of all, let's just write our skeleton SQL statement here, and we're going to use the track table. Let's go ahead and select everything from the track table and, first of all, just limit our results. If we run this, we can see exactly what we expect. So what if, for example, we want to see a running aggregation over the different albums that we have here, and we want to know what the running time is for each track as we're going through the album? Or what if we want to keep the information about the rows but still have access to the aggregate value? How can we do that? Currently we can't. So let's answer the second question first: we want to keep information about all of the rows, but we also want to have access to the aggregate value. What we're going to do is just select everything so that we keep this information. But now we also want to be able to access a running sum: basically, we want to sum over the milliseconds column here so that we're keeping track of the time of each song as we're going through the album. But if we just do this, then we have to use a GROUP BY statement. There's another way that we can expand on the statement, though, namely using window functions. What we can do is write OVER here, then a space and open and close parentheses, and inside of this statement, we can then create partitions by writing PARTITION BY. Let's just finish the statement first and then go back and look at it: let's partition by the album ID. So what this means is we're still selecting all the columns, and we're going to do a sum over the milliseconds column, but we're not using the GROUP BY statement; instead, there's another option available to us, which is this OVER statement here. Inside of this OVER statement, we can then create partitions, which is similar to what we would do in the GROUP BY clause; here we're just defining how we want to split our data. And rather than getting aggregate results returned for each group and having to shape the output based on what we have in the GROUP BY, we can still take all of our columns and maintain all of our records, but we get an extra column which has either a running aggregation or just an aggregation based on the group. Now, the way that this aggregation works depends on which rows are going to be selected. In this case, if we're just using the PARTITION BY, what's going to happen is we're going to split all of our data into different partitions based on the album ID, and we're going to run this aggregation, in this case, over each partition.
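As a sketch of what we have at this point (the snake_case names track, album_id, and milliseconds are my assumption about the schema), the query looks something like this:

```sql
-- Every track row is kept; the window function adds one extra
-- value: the sum of milliseconds across that row's album_id
-- partition, instead of collapsing rows like GROUP BY would.
SELECT *,
       SUM(milliseconds) OVER (PARTITION BY album_id)
FROM track;
```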
So if we just give this an alias, we can call it album length, album duration, or whatever you want. Let's just run this and take a look at the output, and let's also take away our limit statement. Running this, going down to our output, and moving all the way to the right, we have here our album length. We can see, if we scroll down, that this is going to differ based on what album ID we have, so we have a unique album length for each album ID. But the interesting thing, and the way that this differs from the GROUP BY statement, is that we still maintain all of our individual rows. We still have all the information about the individual rows, and we also have access to the aggregate value, and we can see that it differs across the different partitions here. We have our album ID 1, and you can see that in this case all of these values are the same. We've got our album ID 2 here, where we just have this one value. We've got album ID 3 here, where again the values are going to be the same, and so on for 4 and beyond. So now we have the aggregate, or rather the sum, over this milliseconds column for each album ID: splitting everything by the different album IDs and summing over the milliseconds. But because we're using this OVER statement with partitions, rather than using the GROUP BY statement, we now have access to the individual rows too. Of course, this is great when you also need to have access to this other data. But other times the GROUP BY statement may be better, because you're going to get a much smaller table, and you're mainly going to be getting the results that you need if you don't need access to the other data. So it depends on your use case and what exactly you need from these aggregation results. Now, currently our aggregate value here is the same across all rows of a partition, but even that doesn't need to stay like this. The way that this aggregation works depends on how we're splitting our data here. If we're just using the PARTITION BY statement, then for each partition, all of the rows are included. But we can actually also add another clause in here, namely the ORDER BY. This is the same ORDER BY that we've seen before, but internally, for each split that we're doing, in this case by the album ID, we can also order the results within each split. Let's order, for example, by the name column. What happens in this case, if we run it, is that we're actually going to get different values here, even within the same album ID. And what you can see from here is that we're actually getting a running total. So in this case we're getting a running total: up to this song, this is how much time, then for both of these songs, and then up here it's all three of these songs together, here it's all four of the songs. And then once we reach a new album, the counter resets. The reason why this is different is because when we're using the ORDER BY clause, rather than all the rows for each partition being considered together, we only consider the rows that go up to and including the value that we're ordering by. Now, if there are duplicates, all of those duplicates are going to be considered. So if we have, for example, the names of the second and the third song here being the same, then the aggregate value that we're going to get is going to be the same across the two, because we're ordering by the name and there's a duplicate in it.
And it's also going to be equal to the running total over the first three rows, basically, because the second and third rows are duplicates of each other, and so the value that we get for both of them would be the one that covers all three. So those are just some things to note about how this behaves. As we can see now, we have a different way to approach aggregation: one of them is the GROUP BY, and the other one is using this OVER statement, where we can partition by different columns. If we're only using the PARTITION BY, then the aggregate values that we get are going to be per partition. But if we add an ORDER BY clause too, then our aggregation is no longer going to be over each whole partition, but rather it's going to be a running aggregation based on the order that we've defined here. Now, what if we want to do some additional aggregations? For example, we have the sum, but maybe we also want to do a running count; that's an additional aggregation that we can do. So we can also do, for example, just a COUNT here. And maybe we also want to keep track of the average song length too, so we also want to do an average over the milliseconds column. So here we've got our album song count, and here we've got our average album length. If we go ahead and run this again, we can now scroll over a bit more and look at the results. Again, for each of these we're going to have a running aggregation, because we're using the ORDER BY clause: here we're going to have a running total, or running sum, of the milliseconds; here we're going to have a running count; and here we're going to have a running average over the milliseconds. In the first row, we only consider one value, which is why we have that value here; in the next row, it's going to be the average of the first two, which is this result here; then the average of the first three, and so on. So you can see we can use the different aggregation functions and still use them over these different windows. But in this case, it's not ideal to always repeat the window definition. Ideally, we'd have a shorthand for this so that we're not just repeating it, because repeating it makes it harder to read, but also, if you want to change something, you have to go through and change it in every single place. And we can actually create aliases for these windows. To do that, we're going to go down here and define a window: we write the WINDOW keyword, then we give it a name, so we can say album_split here, for example, then the AS keyword, and then we can take the definition of this window and put it down here. Then, instead, we can use this alias that we've now created for this window up here, and we can put it in here too, and we can also put it in here. Again, this has the advantage that, on the one hand, it becomes easier to read, but on the other hand, it's also much cleaner to write, because if we want to make a change to our window, we can just make the change once and it's propagated to all of the different aggregations that we're doing here, all the different uses.
Whereas if we just copy-pasted everything, then every time we make a change, we have to make that change in every single column, which of course is not as nice. So let's run this, and we can see we're obviously going to get the same results, because nothing else has changed; we've just introduced an alias. But now let's do something else: for example, let's try to partition by album ID, but also partition by the genre ID as a second layer of partitioning. Now, it's very likely that each album is going to have a single genre ID, but maybe there are some cases where an album has different tracks that have different genres. It's possible; not very likely in many cases, but it is possible. So in those cases where there is more than one genre, we're going to have an extra split going on, and if there's only one genre inside, then even if we're splitting by the genre ID, our results are not going to change. Just like in the GROUP BY statement, we can add additional columns here: it doesn't just have to be one column that we're partitioning by, we can actually partition by several. The important thing to keep in mind is that first we do this partition, then we do this partition, and if we had more, we would do those afterwards. So the genre ID would only be split within each album ID group; just an important thing to keep in mind. Now we can rerun this, and you can see that because we're now using an alias, this change is automatically applied everywhere we use the alias, whereas if we had just used the copy-pasted values, we would have had to make this change three times, which on the one hand is annoying, but on the other hand, you can also forget to make the change somewhere, and then your results are going to be wrong, mainly because you didn't apply the change everywhere; those are common mistakes that happen. So in this way, you can speed things up, make things easier to read, and worry less about having these types of errors. Now, in terms of where this WINDOW definition goes in our full SQL format: if we have a WHERE statement, the window comes afterwards, but it also goes before the ORDER BY. So let's say we add a WHERE; we're not really going to do any filtering, we're just going to say where the track ID is greater than or equal to one. And then down here we're going to have our ORDER BY, and let's just order everything by album ID, which it likely already is, but we can just go ahead and do it again and say ascending order here. This is where the window alias definition needs to go. If we take it and move it somewhere else, say up here, and run this, we can see we're going to get a syntax error, because again, we need to have things in a specific order, and it doesn't belong there. So we need to take it back here, and then if we run it again, everything is fine. Now, we can see that we used ascending here, and of course, if we're using the ORDER BY clause inside the window, the same thing still applies: we can define ascending here, or we can also use descending. And just like we could add multiple things in the partition, we can also add multiple things in the ORDER BY to potentially resolve ties. So, for example, we can first order by the name, and then maybe we want to order by something else, for example, the bytes.
In those cases, and here we can now do name ascending and bytes descending, we need to supply the ordering for each column that we're using; otherwise, it's just going to be assumed ascending. So we can see we have a lot of flexibility with this too. But now, if there is a tie in the name column, rather than it being used in the aggregation as basically a duplicate value, it's instead going to look to the second ordering, which is going to be the bytes column, and we'll only have a duplicate counted in the aggregation somewhere if the name and the bytes are both the same. Otherwise, we're going to have a proper ordering going on. So we can run this again, and we can just see our results here in the table below.
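To tie the lesson together, here's a sketch of roughly where the query ends up (the window name album_split is my assumption, and the WINDOW clause requires a database that supports it, such as PostgreSQL or a recent SQLite):

```sql
SELECT *,
       SUM(milliseconds) OVER album_split AS album_length,
       COUNT(*)          OVER album_split AS album_song_count,
       AVG(milliseconds) OVER album_split AS avg_album_length
FROM track
WHERE track_id >= 1
-- One named window reused by all three aggregations; changing
-- the definition here propagates to every use of the alias.
WINDOW album_split AS (
    PARTITION BY album_id, genre_id
    ORDER BY name ASC, bytes DESC
)
ORDER BY album_id ASC;
```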
42. Row Numbers and Ranks: Now, we've just seen how we can use window functions to do aggregations over the individual partitions that we define. But with window functions, we can not only do aggregations, we can also do other things, such as counting row numbers, and this can also be really useful. So let's take a look at how to do that now. For this, we're going to use the invoice line table, and what we're going to do is a somewhat silly kind of split: we're going to partition everything by the unit price, so that we're going to get some duplicate values, and then we can see how these different functions are affected by that. Let's go ahead and write our select skeleton: we're going to select everything from the invoice line table, just putting a semicolon here. All right, so we're going to select everything, and now we're going to have our window functions here. The first thing that we can maybe look at is the row number, so just keeping track of the row number that we're on in each partition. To do that, instead of putting an aggregation function here, we can instead put in the row number function, which doesn't take an input. Then we're going to have our standard window function format that we also saw before, where we write OVER, and here we define what we want to partition by. So let's partition by the unit price, and let's also do some ordering: let's order by the track ID. This here we can just alias as the row number. Now, for this, we're definitely going to get some duplicate values, because of course there are likely going to be some duplicate track ID purchases. So we're partitioning by the unit price here, and ordering by the track ID within each partition. In this case, it's likely that everything has a unit price of $0.99, and so we just have one large partition that has this ordering. And in this case, we can see, just looking at the row numbers, we still have incrementing row numbers even though we have duplicate values here. Now, if we further define our ordering by not just the track ID but also the invoice ID, then our results here are not really going to change, just because not much else has changed. So we can still keep track of individual rows here, and we can still do further ordering. But there are also different ways that we can count row numbers. In this case, we're giving a row number to every single row that we have, regardless of whether there's a duplicate value or not. Let's take this second ORDER BY away again. Sometimes we don't want that; sometimes we want duplicate values to have the same row number. To do that, we can use a function here called rank. Again, it doesn't take any input, and within each partition, we're going to give a rank to each entry that we have. But the rank is different in that duplicate values, as we can see here, are assigned the same row number. It's probably better if we update this alias to rank. But rank is again just a form of row number, except that duplicate values are going to have the same rank. You can see here this is 1, 2, and this would be 3, but it's a duplicate, because when we're ordering by the track ID, these two rows have the same value. So within this partition with a unit price of 0.99, these two rows are duplicates, because again, they have the same track ID, and so in this case, they get the same rank.
And we can see, once we move on, we then go to 4, 5, 6, 7, 8 and so on until we reach another duplicate, and then those get the same rank. But then we continue counting from where we would have been if we were using the row numbers instead, basically keeping track of the number of duplicates. Now, this can be really nice, because sometimes you don't want to assign different row numbers when there are duplicate instances, and so the rank can be a very useful thing to use. But there are also other cases where you don't want to be skipping numbers: you're not as interested in the absolute row number as in the ordering that things go in, and you just want consistent consecutive numbering. What you can use for this is something called dense rank, or the dense rank function. If we run this, the change that we have is that rather than, for example, having a 4 here because we have two 2s, we're just going to continue counting on from the next consecutive number. It no longer matters how many duplicates we have; we're not going to skip that many numbers, but rather we're just going to continue counting from the next number. Now again, this is specific to each partition that we have. In this case, using the unit price allowed us to create a partition where we could be sure that there are some duplicates for the track IDs. But of course, if you have multiple partitions, then your rank, or in this case the dense rank, or also the row numbers, are going to be specific to each partition; they're not going to go over the whole table. That's just an important thing to keep in mind: your rank, your dense rank, and even your row number are going to reset, and they're only going to be defined within each partition or split that you have here. Now, if you want to define them over the whole table, and maybe you don't even have a column that has the same value everywhere, what you can do is just take the PARTITION BY statement away and leave only the ORDER BY statement. That way, our window function is still going to run, but we're now not partitioning by anything anymore, and so we move from having the rank or the dense rank or the row number specific to each partition to having it for the whole dataset, because we're no longer creating a partition.
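Here's a sketch comparing the three numbering functions side by side (again assuming snake_case names like invoice_line, unit_price, and track_id):

```sql
-- ROW_NUMBER always increments; RANK gives ties the same number
-- and then skips ahead by the number of duplicates; DENSE_RANK
-- gives ties the same number and continues consecutively.
-- Dropping the PARTITION BY would number over the whole table.
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY unit_price ORDER BY track_id) AS row_num,
       RANK()       OVER (PARTITION BY unit_price ORDER BY track_id) AS rank_num,
       DENSE_RANK() OVER (PARTITION BY unit_price ORDER BY track_id) AS dense_rank_num
FROM invoice_line;
```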
43. Using Other Rows with Window Functions: All right, so we've gotten a look at window functions, and we saw some cool stuff that we could do with them, specifically doing running aggregations or aggregations based on the partitions without sacrificing any of the information that we have in the individual rows. Now, at some point, it's likely that you'll want to access information from other rows. For example, maybe you want to know about two consecutive purchases, or when one purchase happened and when the next purchase happened, or something like that. Currently, the way that we're going about using data, each row is treated individually, and right now there's no way for us to access information from other rows. So let's take a look at that now. We're going to use the invoice table for this, so let's just go ahead and write our skeleton SQL statement: we're going to say select everything from the invoice table, and let's just go ahead and limit our results to five for now. Go ahead and run this; this is going to be our result here. Now let's try to create a modified row, so that for every row that we have, we have the invoice date column, and we also have something called the next date, which right now is just going to be the date of the next invoice, regardless of whether it's from the same customer or anything like that. Let's just have a new column that gives us the date of the next invoice. So how can we do that? Well, the way that we can do that is, again, using our window functions. What we're going to use here is something called lead. What we do with LEAD is we put in two parameters: the first one is the column we want the information from, so in this case we want to use the invoice date, and the other parameter is how far ahead we want to go. For example, if we want to take the value from the next row, then we're going to use the LEAD on the invoice date, and we're going to go one forward. At the current row, what the LEAD is going to do is allow us to access the invoice date value from one row further. If we put two, then from this row we would be looking two rows further. So that's how we can define this here: how many rows do we want to look ahead? Let's stick with one. Now, to complete this, we also need to define the window that we want to go over, again using the window function syntax. But as we also saw previously, we don't actually need to define a partition if we don't want to; if we just want to use the full table, which we're going to do for now, we can just use the ORDER BY clause, and let's just order by invoice date. So everything is going to be ordered by invoice date, and for each row we're going to have the next invoice date. Let's go ahead and give this column an alias, calling it next invoice date. If we go ahead and run this and take a look at our output: here we have the next invoice date, and here we have the current invoice date. We can see that this value is just equal to what we have for this value here; we're always just taking the next value from the invoice date column and having it in our current row. Now, we can of course also do more with this: we can do some of the arithmetic that we also learned about before.
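As a sketch of the query so far (assuming the invoice table has an invoice_date column under that name):

```sql
-- For each row, pull the invoice_date from one row ahead when
-- the whole table is ordered by invoice_date; no PARTITION BY,
-- so the window spans the entire table.
SELECT invoice_date,
       LEAD(invoice_date, 1) OVER (ORDER BY invoice_date) AS next_invoice_date
FROM invoice;
```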
For example, if we want to get the time difference to the next invoice, let's just take our statement here, copy all of it, put it down here, and update the column alias to say time to next invoice. What we're going to do is use this LEAD function to get the next value, and then from it, we can just subtract the current invoice date. If we go ahead and run this, we get an extra column here, and now we can see what the time difference to the next invoice is. In this case, we have a one-day time difference, which we can also see here. And let's actually just select only the invoice date column here, so that we can compare everything side by side. Running this again: here we've got the current invoice date, and the next invoice date, which comes from using this LEAD function on the invoice date column; we're taking the next value from the invoice date column, which is here. We can see that the time difference is one day between this value and this value; here again, the time difference is one day between this one and this one; here the time difference is three days between this one and this one. So we're just using the value from the next row. And of course, we can continue to use our arithmetic here. It's just important to note the order in which we're using the arithmetic: we're doing it after we've defined our window here, because when we're using window functions like the LEAD, we need to make sure the OVER keyword comes right after it. So we have to keep all of this together as a grouped statement, and then afterwards we can do, for example, the minus as we have it here. Now again, we don't just need to look at the next row; we can also go two rows forward by just changing this to a two. If we go ahead and run this, you can see that now we're basically skipping ahead two rows: this value here is from the row two rows further. So this one comes here, this one two rows further, and this one goes here. You can see that it's not just limited to one row; we can define the jump that we want to take. Now, what if we want to go backwards? What if, instead of looking at the next row, or the row after that, or whatever else, we want to look at the previous row? What we can do here is, instead of using the LEAD, use the function called LAG. Here we can then update our alias and say previous invoice date, and we can do the same thing using the LAG instead of the LEAD. Let's do just the previous row rather than two rows previous. We're going to get the same thing, except in this case we're looking at the row behind rather than the next row. Here, we also need to update this alias to time since previous invoice. Now, there's going to be something interesting here, a case that we haven't encountered with the lead: we'll see when we run this that the first value here is going to be null, and that's because there's nothing before the first row. The same thing happens with the LEAD when we reach the final row, or if we have maybe a two or a three here.
If we reach the second-to-last or the third-to-last row, or whatever number we have here, then for that row and every row after it, we're going to have null values in these columns, just because those values do not exist, right? If this is our first row, there is no previous row to the first row, so these values are going to be null. Similarly, when we get to our final row, there's no row after the final row, and so these values need to be null. So that's just something that comes with it and something to be aware of. But again, here we can use the LAG function together with this window that we've defined here to get access to values in the previous rows. And of course, we don't just need to use the invoice date; we can use whatever other column we want. We just define what column we want to access the previous value from, how many rows we want to go back or forward, depending on whether we're using the lag or the lead, and then also the window that we want to use. In this case, because we didn't define a partition, we're just using the whole dataset, but we can also do this on a partition basis. So, for example, we could try to partition by customer ID. We can set this up: we say PARTITION BY, and then we use the customer ID, and then we do the same thing over here. At this point, because we're already using the window twice and we had to do the copy-pasting, it's probably good to define an alias. We're going to quickly write WINDOW here and define our window; we don't need a long name, so we'll just say w AS, put the definition here, and then go over w instead, where we don't have to deal with these copy-paste issues. So let's go ahead and run this one more time. You can see we've now just created an alias, which we learned about earlier. So we're still going back to the previous row, but in this case we've created a partition by customer ID. Let's also go ahead and select our customer ID so that we can see that information too, and run this. Okay, so here we have all of the invoices for customer ID one. I've also changed the column order here just to make things easier to compare: the customer ID as the first column, then the invoice date, as well as the previous invoice date and the time since. So you've got customer ID one, their purchases, and at basically every point, the date of the previous purchase and what the time difference to it was. If we remove our limit statement, then we of course get this information for the other customer partitions too. You can see that, depending on how many purchases a customer has, if a customer only has one purchase, then all we would see for them, for example, would just be a null. In this case, these customers all have multiple purchases, but of course, if we're creating partitions, it's not unlikely that some of our partitions may only contain one row. But the important thing to also see here is that when we're using the partitions, again because of the way that window functions work, we're computing this over every window. And so we can see that even though there's probably a date in the table that happened before this one, because this is specific to the customer ID 8 group, we have no value here, since this was the date of the first purchase of customer ID 8.
So again, it's the same principle with partitions that we also saw with the aggregations when we first learned about window functions: our window function, or rather the function that we're applying over this window, is again going to be specific to each partition, and it's also going to be ordered by the statement that we have in here.
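Putting the pieces of this lesson together, a sketch of the final query might look like this (the window name w matches what we used; the direct date subtraction is an assumption that works in databases like PostgreSQL, while others need a date-difference function):

```sql
SELECT customer_id,
       invoice_date,
       LAG(invoice_date, 1) OVER w AS previous_invoice_date,
       -- NULL on each customer's first purchase, since there is
       -- no previous row inside that partition.
       invoice_date - LAG(invoice_date, 1) OVER w AS time_since_previous_invoice
FROM invoice
WINDOW w AS (PARTITION BY customer_id ORDER BY invoice_date);
```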
44. NTiles: Now, in some cases, we may want to split our data into different buckets of equal sizes, so that maybe we can do some sort of block or bucket analysis, or whatever other application we're trying to use it for. So how would we be able to use SQL to already create some buckets for us that we can use, or rather put a bucket number on each of our rows and basically split them into buckets? Again, we can do that using window functions, and let's again use the invoice table here. We're writing our skeleton SQL statement, just selecting everything from the invoice table, and let's go ahead and limit the results to five at first. Go ahead and run this, just so that we have a result already. All right, so let's split our data into equally sized groups. Again, we're going to use a window function here. The function we're going to use is called NTILE, which allows us to define how many equally sized buckets we want to split our data into, or rather assign our data to. For example, if we say NTILE of five, that means we're going to put it into five equally sized buckets, or as equal as possible. Let's take a look at this output; actually, let's make the window partition by customer ID, to make sure that we can actually see the differences. Since we're using a window function here, we're going to have to use the OVER keyword. Now we're going to partition by the customer ID, and let's also once again order by the invoice date. Let's give this an alias, say bucket number, and let's also take away our limit statement here, and go ahead and run this. All right, so scrolling all the way to the right, where our final column is going to be, we have here our bucket number. And let's actually just take a reduced number of columns here, since we're not going to use the rest anyway: we'll take the customer ID and the invoice date, because those are what we're partitioning by and ordering by, and run that one more time. So we've got customer ID one here, and we can see that their purchases have basically been split into buckets as equal as possible. We're not able to split everything perfectly evenly, but we can see here the first two have been grouped into bucket 1, the next two into bucket 2, and then buckets 3, 4, and 5 get one purchase each. So it's trying to be as equal as possible, but there is no perfectly even distribution that we can get here. Then, going on to the next partition, for customer ID 2, we again have the splitting up into the different buckets. In this case, we actually have the same number of purchases for the customer with ID 1 and the customer with ID 2. Just because we're using a practice dataset, it looks like all the customers actually have a very similar number of purchases, so all the bucket distributions that we're getting here are going to be the same across the different customer IDs, again just because each customer has a very similar, if not identical, number of purchases. But we can just play around with our bucket counts here. For example, if we use seven, which is the number of purchases that each customer here actually has, we can see that our buckets are now going to hold exactly one purchase each. If we instead use three, we're going to expect three purchases in the first bucket and two in each of the other ones.
So we can use the NTILE here to define how many buckets of ideally, or at least approximately, equal size we want to split our data into, and it's going to assign those bucket numbers, with the first buckets getting the extra values when it can't fill everything up evenly, and the bottom ones having one less. It's trying to split our data into equal buckets based on the partition that we've defined here, and also based on the order that we've defined within our window here.
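A sketch of the final form of this query (assuming the snake_case names customer_id and invoice_date):

```sql
-- Assign each customer's invoices, ordered by date, to three
-- buckets that are as equal in size as possible; the earlier
-- buckets receive the extra rows when the split isn't even.
SELECT customer_id,
       invoice_date,
       NTILE(3) OVER (PARTITION BY customer_id ORDER BY invoice_date) AS bucket_number
FROM invoice;
```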
45. Understanding Query Performance: Now, if you're really getting into SQL, maybe at some point you also want to understand more about how your queries are planned or structured, what approach is taken for executing them, and maybe what optimization steps you can take. To look at this, let's run a small sample query on the track table, and let's just run an aggregation to count the number of tracks in each album. We're going to run a select, and we're going to select the album ID, and then we also want to count one as number of tracks; let's just have this alias capitalized to make it easier to read. We're selecting from the track table, and then we still need to group by the album ID. So again, we're selecting from the track table, we're grouping by the album ID, and we're just counting the number of tracks that each album has. If we go ahead and run this, these are the results that we're going to get: this is the album ID, and this is the number of tracks that it has. But now, if we want to understand its performance, it's going to be a little bit difficult. SQL actually has a statement available for us called EXPLAIN, and if we run this, instead of giving us the output itself, it's going to give us the plan of the query. Now, I'm not going to go into much detail on the query plan, because the query plan itself can become pretty complex, and if you really do want to work on these optimizations, then you're probably going to have to do some extra research, specifically into understanding how different aggregations work, but also into how the database that you're using is set up, because different databases actually perform aggregations, and queries in general, differently. When we're writing queries, we're just telling the database what we want done, and the database itself finds the most optimal way to get these values. So this may differ from database to database, based on just how the database works: how it reads and writes, how it plans, how the values are indexed, and all of this stuff. So again, this is getting deeper into how the database itself is set up, which is why we're not going to dive into it. But nevertheless, it's important to know that this option is available to you. You can use the EXPLAIN statement to get an understanding of the query plan, that is, how the database plans to execute the query to get you the results that you're looking for, and then you can also understand how many rows are being processed, or how many columns are being used, in each component. Or you can do something like EXPLAIN ANALYZE, which also tells you the computation time, or rather the execution time, of how long the execution actually took, so that you can dive deeper into understanding each of these components. That way, maybe you can find where the query takes the longest and work on optimizing that, or maybe you find that the approach that you're taking is optimal in one database but actually suboptimal in another because of the way that it was set up. In this way, you're able to get really deep into your queries and perform some big optimizations by understanding the structure and the approach that your database has defined for executing the query to get you the results that you've asked for.
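As a sketch (EXPLAIN ANALYZE is PostgreSQL-style syntax; other databases expose the plan differently, e.g. SQLite uses EXPLAIN QUERY PLAN):

```sql
-- Returns the query plan instead of the query results.
EXPLAIN
SELECT album_id, COUNT(1) AS NumberTracks
FROM track
GROUP BY album_id;

-- Also executes the query and reports actual execution timings.
EXPLAIN ANALYZE
SELECT album_id, COUNT(1) AS NumberTracks
FROM track
GROUP BY album_id;
```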