
Elasticsearch 7 and the Elastic Stack: Hands On

Frank Kane, Founder of Sundog Education, ex-Amazon

98 Lessons (8h 33m)
    • 1. Intro: Installing and Understanding Elasticsearch

      0:44
    • 2. Installing Elasticsearch [Step by Step]

      17:35
    • 3. Please follow me on Skillshare!

      0:16
    • 4. Intro to HTTP and RESTful API's

      11:48
    • 5. Elasticsearch Basics: Logical Concepts

      1:58
    • 6. Elasticsearch Overview

      5:44
    • 7. Term Frequency / Inverse Document Frequency (TF/IDF)

      3:48
    • 8. Using Elasticsearch

      3:59
    • 9. What's New in Elasticsearch 7

      3:41
    • 10. How Elasticsearch Scales

      7:27
    • 11. Quiz: Elasticsearch Concepts and Architecture

      4:08
    • 12. Section 1 Wrapup

      0:30
    • 13. Intro: Mapping and Indexing Data

      0:36
    • 14. Connecting to your Cluster

      7:03
    • 15. Introducing the MovieLens Data Set

      3:53
    • 16. Analyzers

      8:27
    • 17. Import a Single Movie via JSON / REST

      10:25
    • 18. Insert Many Movies at Once with the Bulk API

      5:29
    • 19. Updating Data in Elasticsearch

      6:28
    • 20. Deleting Data in Elasticsearch

      2:15
    • 21. [Exercise] Insert, Update and Delete a Movie

      4:14
    • 22. Dealing with Concurrency

      10:20
    • 23. Using Analyzers and Tokenizers

      10:47
    • 24. Data Modeling and Parent/Child Relationships, Part 1

      5:24
    • 25. Data Modeling and Parent/Child Relationships, Part 2

      7:00
    • 26. Section 2 Wrapup

      0:23
    • 27. Intro: Searching with Elasticsearch

      0:29
    • 28. "Query Lite" interface

      8:05
    • 29. JSON Search In-Depth

      10:13
    • 30. Phrase Matching

      6:21
    • 31. [Exercise] Querying in Different Ways

      4:25
    • 32. Pagination

      6:18
    • 33. Sorting

      7:54
    • 34. More with Filters

      3:34
    • 35. [Exercise] Using Filters

      2:39
    • 36. Fuzzy Queries

      6:05
    • 37. Partial Matching

      5:30
    • 38. Query-time Search As You Type

      4:00
    • 39. N-Grams, Part 1

      5:16
    • 40. N-Grams, Part 2

      8:11
    • 41. Section 3 Wrapup

      0:20
    • 42. Intro: Importing Data

      0:50
    • 43. Importing Data with a Script

      8:16
    • 44. Importing with Client Libraries

      6:35
    • 45. [Exercise] Importing with a Script

      3:55
    • 46. Introducing Logstash

      4:50
    • 47. Installing Logstash

      8:57
    • 48. Running Logstash

      5:11
    • 49. Logstash and MySQL, Part 1

      7:55
    • 50. Logstash and MySQL, Part 2

      7:47
    • 51. Logstash and S3

      7:55
    • 52. Elasticsearch and Kafka, Part 1

      5:58
    • 53. Elasticsearch and Kafka, Part 2

      6:45
    • 54. Elasticsearch and Apache Spark, Part 1

      8:20
    • 55. Elasticsearch and Apache Spark, Part 2

      5:58
    • 56. [Exercise] Importing Data with Spark

      8:48
    • 57. Section 4 Wrapup

      0:36
    • 58. Intro: Aggregation

      0:59
    • 59. Aggregations, Buckets, and Metrics

      10:13
    • 60. Histograms

      7:39
    • 61. Time Series

      6:03
    • 62. [Exercise] Generating Histogram Data

      4:21
    • 63. Nested Aggregations, Part 1

      6:03
    • 64. Nested Aggregations, Part 2

      8:45
    • 65. Section 5 Wrapup

      0:23
    • 66. Intro: Using Kibana

      0:20
    • 67. Installing Kibana

      4:37
    • 68. Playing with Kibana

      10:06
    • 69. [Exercise] Log analysis with Kibana

      3:19
    • 70. Section 6 Wrapup

      0:21
    • 71. Intro: Analyzing Log Data with the Elastic Stack

      0:31
    • 72. FileBeat and the Elastic Stack Architecture

      7:33
    • 73. X-Pack Security

      3:10
    • 74. Installing FileBeat

      5:58
    • 75. Analyzing Logs with Kibana Dashboards

      9:52
    • 76. [Exercise] Log analysis with Kibana

      5:25
    • 77. Section 7 Wrapup

      0:31
    • 78. Intro: Elasticsearch Operations and SQL Support

      0:39
    • 79. Choosing the Right Number of Shards

      5:09
    • 80. Adding Indices as a Scaling Strategy

      4:02
    • 81. Index Alias Rotation

      3:52
    • 82. Index Lifecycle Management

      2:09
    • 83. Choosing your Cluster's Hardware

      3:17
    • 84. Heap Sizing

      3:14
    • 85. Monitoring

      6:25
    • 86. Elasticsearch SQL

      5:30
    • 87. Failover in Action, Part 1

      7:12
    • 88. Failover in Action, Part 2

      8:46
    • 89. Snapshots

      9:51
    • 90. Rolling Restarts

      6:39
    • 91. Section 8 Wrapup

      0:29
    • 92. Intro: Elasticsearch in the Cloud

      0:58
    • 93. Amazon Elasticsearch Service, Part 1

      7:20
    • 94. Amazon Elasticsearch Service, Part 2

      5:32
    • 95. The Elastic Cloud

      9:48
    • 96. Section 9 Wrapup

      0:11
    • 97. Wrapping Up

      3:54
    • 98. Let's Stay in Touch

      0:46

About This Class

New for 2019! Elasticsearch 7 is a powerful tool not only for powering search on big websites, but also for analyzing big data sets in a matter of milliseconds! It's an increasingly popular technology, and a valuable skill to have in today's job market. This comprehensive course covers it all, from installation to operations, with over 90 lectures including 8 hours of video.

We'll cover setting up search indices on an Elasticsearch 7 cluster (if you need Elasticsearch 5 or 6 - we have other courses on that), and querying that data in many different ways. Fuzzy searches, partial matches, search-as-you-type, pagination, sorting - you name it. And it's not just theory, every lesson has hands-on examples where you'll practice each skill using a virtual machine running Elasticsearch on your own PC.

We'll explore what's new in Elasticsearch 7 - including index lifecycle management, the deprecation of types and type mappings, and a hands-on activity with Elasticsearch SQL. We've also added much more depth on managing security with the Elastic Stack, and how backpressure works with Beats.

We cover, in depth, the often-overlooked problem of importing data into an Elasticsearch index. Whether it's via raw RESTful queries, scripts using Elasticsearch APIs, or integration with other "big data" systems like Spark and Kafka - you'll see many ways to load large, existing data sets into Elasticsearch at scale. We'll also stream data into Elasticsearch using Logstash and Filebeat - commonly referred to as the "ELK Stack" (Elasticsearch / Logstash / Kibana) or the "Elastic Stack".

Elasticsearch isn't just for search anymore - it has powerful aggregation capabilities for structured data. We'll bucket and analyze data using Elasticsearch, and visualize it using the Elastic Stack's web UI, Kibana.

You'll learn how to manage operations on your Elastic Stack, using X-Pack to monitor your cluster's health, and how to perform operational tasks like scaling up your cluster, and doing rolling restarts. We'll also spin up Elasticsearch clusters in the cloud using Amazon Elasticsearch Service and the Elastic Cloud.

Elasticsearch is positioning itself to be a much faster alternative to Hadoop, Spark, and Flink for many common data analysis requirements. It's an important tool to understand, and it's easy to use! Dive in with me and I'll show you what it's all about.

Transcripts

1. Intro: Installing and Understanding Elasticsearch: Let's dive right in. In the real world you'll probably be running Elasticsearch on a cluster of Linux machines, so we'll be using Linux in this course - Ubuntu in particular. Now, if you don't have an Ubuntu system handy, that's totally OK. I'm going to walk you through setting up a virtual machine on your Windows or Mac PC that lets you run Ubuntu inside your existing operating system. It's actually really easy to do. Once we've got an Ubuntu machine up and running, we'll install Elasticsearch, and just for fun we'll create a search index of the complete works of William Shakespeare and mess around with it. After that, we'll take a step back and talk about Elasticsearch and its architecture at a high level, so you have all the basics you need for later sections of this course. Roll up your sleeves and let's get to work.

2. Installing Elasticsearch [Step by Step]: Let's dive in and get Elasticsearch installed on your home PC so you can follow along in this course if you'd like to. Elasticsearch is going to be running on an Ubuntu Linux system for this course, but if you don't already have an Ubuntu system sitting around, that's okay. What we're going to show you is how to install VirtualBox on your Mac or Windows PC, which will let you run Ubuntu right on your own desktop within a little virtual environment. Once we have Ubuntu installed inside VirtualBox, we'll install Elasticsearch on it, and after that we'll load the complete works of William Shakespeare into Elasticsearch and see if we can successfully search it. That's a lot to do in one lecture, but I'll get you through it. Let's talk briefly about system requirements: pretty much any PC should be able to handle this; you don't need a ton of resources for Elasticsearch. If you do run into trouble, however, make sure that virtualization is enabled in the BIOS settings on your PC, and specifically make sure that Hyper-V virtualization is off if that's an option in your BIOS - but just try following along first; you shouldn't need to tinker with your BIOS settings unless you run into problems. Also, be aware that the antivirus program called Avast is known to conflict with VirtualBox, so if you're going to be using Avast antivirus, you'll need to switch to a different one or turn it off while using this course. With that, let's dive in and get this set up. If you're running on Windows or a Mac PC, you need to install a virtual environment to run Ubuntu within first, and to do that we're going to use VirtualBox. If you're already on Ubuntu you don't need to do this, but for those of you on Windows or Macs, you'll need to do this step first. Head on over to virtualbox.org and click on the big friendly download button. It is free software, and I am on Windows, so I'm going to go ahead and download the Windows version of the binary. Once the installer for your operating system is downloaded, go ahead and run it. There's nothing special about it, really - just go through the steps it walks you through. Next, choose where you want to install it; all the defaults are A-OK. It will interrupt your network connection while installing, so make sure you're OK with that and go ahead and install. Give it the permissions it needs and off it goes. That was easy. So let's go ahead and hit Finish here, and VirtualBox is sitting there waiting for us to add some virtual machines to it.
So next we need to actually download an operating system to run within our virtual environment. For that, head on over to ubuntu.com. From the Ubuntu home page, just look for the download link; we want Ubuntu Server, and we're looking for the 18.04 long-term support version. Go ahead and wait for that to download - this will take longer because it's a much bigger download. We'll come back when that's done. So the image for the Ubuntu installation media has downloaded successfully. Just make sure you know where it went on your PC, and we're going to go back to VirtualBox and set it up. From the VirtualBox manager, let's go to the Machine menu and say New to add a new virtual machine. We'll give it a name: Elasticsearch. I'm going to change the machine folder to another drive that has more space on it - my C drive is almost full - so make sure that you're using a hard drive that has plenty of space for this. For me, my E drive has the most room, and you'll need to create a folder in there to put this in. I already have a VirtualBox VMs folder, so let's go ahead and create a new folder to put this stuff in; we'll call it elasticsearch7 and select that folder. Linux is the correct type, but we want the version to be Ubuntu (64-bit). There we go, hit Next, and I'm going to allocate about half of my physical memory to this virtual machine - for me, that's going to be eight gigabytes, 8192 megabytes. We will go ahead and create that virtual hard disk for it; we're going to need about 20 gigabytes for that, so again, make sure we're putting it someplace that has plenty of room. The defaults are fine there, dynamically allocated is fine, and we are going to select a different home for this - make sure it is on a drive that has plenty of space. And I'm going to increase this from 10 gigabytes to 20 gigabytes, because 10 just isn't enough; it doesn't have to be exact. All right, so we've got things set up. Let's go ahead and kick that off by hitting the big friendly green Start button. We're now going to select that ISO image that we downloaded: click on the folder icon here, navigate to your downloads, and select the image of the Ubuntu 18.04.2 server, and off it goes. All right, after a couple of minutes of booting up, we're into the installation menu here. Go in and select your language - for me, that's English - and my keyboard layout is also English (US). Just hit Enter to accept those Done selections there, and hit Enter again. We want to install Ubuntu, and the defaults should be fine for the network configuration. We do not have a proxy server, so again hit Enter to skip past that. The default mirror address is fine, hit Enter, and we will use the entire virtual disk. Remember, this is the virtual disk we're setting up - there's no risk of actually corrupting our main operating system disk here. So hit Enter again, and one more time, and everything looks fine, so hit Enter again to accept the Done selection there. Now, here we need to use the down arrow to select Continue and say yes, I'm sure I want to do this, and hit Enter. Type in your name - let's say your name is student - and your server's name, whatever you want; es7 sounds good to me. Your username is student, and pick a password; again, use whatever you want here. When you're done, hit Tab to select the Done option here and hit Enter. We do want the OpenSSH server, so go ahead and hit the space bar to select that, and hit Tab to move on.
We'll install the software that we need by hand, so I'm going to hit Tab to go to the Done option there and hit Enter again, and now it's off doing its thing, installing. So we have to wait for this to finish. Several minutes later, that initial installation is done and it's asking us to reboot now, so go ahead and hit Enter. Don't worry about those failed messages; they're perfectly OK. Hit Enter again and that will reboot, hopefully into our brand spanking new Ubuntu environment, after a minute or so. So it looks like it's finished booting up. Just hit Enter to get a login prompt here, and we'll log in as the user named student that we set up, with the password that we also set up during installation. And we're in. Cool. All right, so we have Ubuntu up and running now; we just need to install Elasticsearch itself on our new system. Now, if you'd like to follow along from written instructions from this point, you can head over to my website at sundog-education.com/elasticsearch and you'll find written steps there of what we're about to do, or if you prefer to follow along with the video, you can do that as well. So here we go. Now that we're logged in, let's start off by telling Ubuntu where to find the Elasticsearch binaries. So we're going to type in: wget -qO - (that's an O, not a zero, then a dash with a space after it) https://artifacts.elastic.co/GPG-KEY-elasticsearch. Make sure you pay attention to capitalization and spaces and everything here; one wrong keystroke and it won't work. All right. Now we're going to put in a pipe symbol - that's a shift-backslash - then a space, then sudo apt-key add - (a space, then a dash). So again, double-check everything: make sure you've got all the spaces right, all the dashes where they should be, everything uppercase that should be, and that's an O and not a zero. It will ask you to type in your password again. All right, step one is done. Next step: sudo apt-get install apt-transport-https, which should look like that. Next we'll say echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" - 7.x because this is Elasticsearch 7 - and then another pipe, to sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list. All right, and finally sudo apt-get update && sudo apt-get install elasticsearch, which should actually go out and install Elasticsearch. That appears to have worked. So now that we've installed Elasticsearch, we need to configure it. To do that we'll say sudo vi /etc/elasticsearch/elasticsearch.yml, and we need to make the following changes. Go ahead and use the arrow keys to move down to where it says node.name, move over to the end of the line, and hit the I key to enter insert mode in the vi editor, then backspace to get rid of the comment there, naming our node node-1 - very creative - and we'll keep scrolling down. Next, we're going to look for the network.host setting. There it is; go ahead and uncomment that and change it to 0.0.0.0, just to make sure that everything works fine in our virtual environment.
Next, we're going to go down to discovery and uncomment discovery.seed_hosts, and change that from host1 and host2 to just "127.0.0.1" inside quotation marks, just like that. And finally we'll go to cluster.initial_master_nodes, uncomment that as well, and set it to just node-1, because we only have one node in our little virtual cluster here. All right, that should do the job. Let's go ahead and hit the Escape key to get out of insert mode, then type in :wq to write and quit. Now we're ready to actually start Elasticsearch up. So let's say sudo /bin/systemctl daemon-reload. Next, we'll say sudo /bin/systemctl enable elasticsearch.service, and finally sudo /bin/systemctl start elasticsearch.service. This will make sure that Elasticsearch boots up automatically when we start our machine in the future. It generally takes a minute or two for Elasticsearch to actually start up successfully. We can test whether it's up and running yet by doing the following: curl -XGET 127.0.0.1:9200. Right now we're getting a connection refused, because it hasn't started up yet, so I'm just going to try this again in another minute or two, and once it actually gives me back a successful response we'll know that we're ready to move forward. All right, after a couple of minutes I actually got back this response instead of a connection refused message. So once you see this, you know you're ready to keep on going - you should get this default response back that ends with "You Know, for Search." All right, so now that we have Elasticsearch running, we need some data to search. Let's go and download the complete works of William Shakespeare and import that, so type in the following to get it: wget http://media.sundog-soft.com/es7/shakes-mapping.json. This just defines the schema of the data that we're about to install. Now that we've downloaded that data type mapping, let's go ahead and submit it to Elasticsearch, thusly: curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/shakespeare (that'll be the name of our index) --data-binary @shakes-mapping.json. That has submitted the data type mapping into Elasticsearch, so it knows how to interpret the data we're about to give it. Let's go ahead and download the actual works of William Shakespeare with wget http://media.sundog-soft.com/es7/shakespeare_7.0.json - that's everything Shakespeare has ever written, in JSON format. Let's go ahead and submit that to our index: curl -H "Content-Type: application/json" -XPOST '127.0.0.1:9200/shakespeare/_bulk' --data-binary @shakespeare_7.0.json. We'll talk about what this is all doing later on; right now I just want to get you up and running and doing something cool. So it's going to go ahead and chew on the entire works of William Shakespeare and index it into Elasticsearch. That will, of course, take a few minutes, so we'll come back when that's done. All right.
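For reference, here are the commands dictated in this lecture, reconstructed as a shell session. This is a sketch based on the narration; the repository and download URLs and the config values are the ones spoken above and may have changed since the course was recorded.

```bash
# Add Elastic's signing key and APT repository, then install Elasticsearch 7.x
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install apt-transport-https
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
sudo apt-get update && sudo apt-get install elasticsearch

# Edit the config (sudo vi /etc/elasticsearch/elasticsearch.yml) and set:
#   node.name: node-1
#   network.host: 0.0.0.0
#   discovery.seed_hosts: ["127.0.0.1"]
#   cluster.initial_master_nodes: ["node-1"]

# Start Elasticsearch now and on every boot
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
sudo /bin/systemctl start elasticsearch.service
curl -XGET 127.0.0.1:9200   # repeat until you get JSON back instead of "connection refused"

# Download and import the Shakespeare mapping and data
wget http://media.sundog-soft.com/es7/shakes-mapping.json
curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/shakespeare --data-binary @shakes-mapping.json
wget http://media.sundog-soft.com/es7/shakespeare_7.0.json
curl -H "Content-Type: application/json" -XPOST '127.0.0.1:9200/shakespeare/_bulk' --data-binary @shakespeare_7.0.json
```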
It took about 15 minutes for all that data to get indexed, but compared to the amount of time it probably took William Shakespeare to write all of it, I guess that's nothing, right? Let's hit Enter just to get a nice, clean prompt back here, and let's get some payoff from all this work. We've done a lot so far today: we've actually installed an Ubuntu system running in a virtual environment on your own PC, installed Elasticsearch from scratch, and imported and indexed the entire works of William Shakespeare. So let's try and actually search that data now and do something with it. Let's issue the following command to search for "to be or not to be" and see what play that came from. I think you might know the answer, but let's just see that it works. Type in curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/shakespeare/_search?pretty', then -d and another single quote (the complete command is written out below). So basically, what we're saying so far is: I'm going to issue a JSON request to our Elasticsearch server that's running on 127.0.0.1, against the Shakespeare index, and I'm going to issue a search query and get the results back nicely formatted. Hit Enter, and we're going to start off the body of our request with an open curly bracket; Enter; quote, query, quote, colon, open curly bracket; Enter; quote, match_phrase, quote, colon, and another curly bracket; quote, text_entry, quote, colon, quote, to be or not to be, quote. You see what's going on here: basically we're sending a query to Elasticsearch to match the phrase containing the text "to be or not to be." Then we'll close off those curly brackets - one, two, and three of them - and a final single quote to close that off, and let's see what we get back. Hey, it worked. So cool. You can see here that "to be or not to be" came back from the play named Hamlet, the speaker was Hamlet, and the full line there was "To be, or not to be, that is the question." And apparently Elasticsearch has chosen "to be" during this lecture. We have actually successfully set it up from scratch on your own little Ubuntu system. And now that we have Elasticsearch running, we can start to learn more about how it works, start experimenting with it, and do more and more stuff with it. So keep on going, guys; it's about to get interesting. If you're done for now, however, the way to safely shut this down is to go to the Machine menu of your virtual machine here and say ACPI Shutdown. That will send a shutdown signal to the virtual machine and cleanly shut things down, and then, when it's done, you're free to close the VirtualBox manager as well.

3. Please follow me on Skillshare!: The world of big data and machine learning is immense and constantly evolving. If you haven't already, be sure to hit the Follow button next to my name on the main page of this course. That way you'll be sure to get announcements of my future courses, and news as this industry continues to change.

4. Intro to HTTP and RESTful API's: So before we can talk about Elasticsearch, we need to talk about REST and RESTful APIs. The reason is that Elasticsearch is built on top of a RESTful interface, and that means that to communicate with Elasticsearch, you need to communicate with it through HTTP requests that adhere to a REST interface. So let's talk about what that means.
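For reference, here is that "to be or not to be" search from the end of lecture 2, reconstructed from the spoken walkthrough as one complete request; the next few minutes break down what each piece of it means.

```bash
# Search the shakespeare index for an exact phrase and pretty-print the results.
curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/shakespeare/_search?pretty' -d '
{
  "query": {
    "match_phrase": {
      "text_entry": "to be or not to be"
    }
  }
}'
```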
So let's talk about HTTP requests at a higher level here. Whenever you request a web page from your browser, what's going on is that your web browser is sending an HTTP request to a web server somewhere, requesting the contents of the web page that you want to look at, and Elasticsearch works the same way. So instead of talking to a web server, you're talking to an Elasticsearch server, but it's the same exact protocol. Now, an HTTP request contains a bunch of different stuff - more than you might think. One is the method, and that's basically the verb of the request: what you're asking the server to do. In the case of actually getting a web page back from a web server, you'd be sending a GET request, saying I want to get information back from the server; I'm not going to change anything or add any information on the server, I just want to get information back from it. You might also have a POST verb, which means that you want to either insert or replace data that's stored on the server, or PUT, which means to always create new data on the server. Or you can even send a DELETE verb, which means to remove information from the server. You normally won't be doing that from a web browser, but from an Elasticsearch client it's totally a valid thing to do. It also includes a protocol - specifically, what version of HTTP are you sending this request in? It might be HTTP/1.1, for example. You will be sending that request to a specific host, of course; so if you're requesting a web page from our website, that might be sundog-education.com. And the URL is basically what resource you are requesting from that server - what you want that server to do. In the case, again, of a web server, that might be the path to the web page that you want on that host. There's also a request body you can send along. You don't normally see that with a web page request, but you can send extra data along, in whatever structured format you want, to the server within the body of that request as well. And finally, there are headers associated with each request that contain metadata about the request itself - for example, information about the client itself (that would be in the user agent for a web browser), what format the body is in (that might be in the content type), stuff like that. So let's look at a concrete example, getting back to the example of a browser wanting to display a website. This is what an HTTP request for that might contain: in that example we're sending a GET verb to our web server, and we're requesting the resource /index.html from the server, meaning we want to get the home page. We'll say that we're sending this in the HTTP/1.1 protocol, and we're sending it to a specific host - that's our website, sundog-education.com. In this example, there is no body being sent across, because all the information we need to fulfill this request has already been specified, and there will be a whole slew of headers being sent along as well that contain information about the browser itself, what types of content and languages it can accept back from the server, information about caching, cookies that might be associated with this site, things like that. So a bunch of information about you is being sent around the Internet whenever you request a web page. Fortunately, with Elasticsearch, our use of headers is pretty minimal. So with that, let's talk about RESTful APIs.
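To tie that anatomy to something concrete, here's a rough sketch of issuing that same kind of request with curl; the -v flag prints the request line and headers curl actually sends, so you can see the method, path, protocol, and Host header described above. The URL simply mirrors the lecture's example, and the exact headers will differ from a real browser's.

```bash
# Roughly equivalent to the browser example: a GET for /index.html on sundog-education.com.
# curl builds the request line (GET /index.html HTTP/1.1), the Host header, and its own
# User-Agent and Accept headers; -v echoes them so we can inspect what was sent.
curl -v http://sundog-education.com/index.html > /dev/null
```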
Now that we understand HTTP requests, the really pragmatic, practical definition of a RESTful API is simply that you're using HTTP requests to communicate with your web service of some sort. So because we're communicating with Elasticsearch using HTTP requests and responses, that means we're basically using a RESTful API. Now, there's more to it than that - we'll get to that - but at a very simple level, that's all it means. It sounds fancy, but that's really it. So, for example, if I want to get information back from my Elasticsearch cluster - like search results, for example, I'm actually conducting a search - I would send a GET verb along with that request, saying I want to get this information from Elasticsearch. If I'm going to insert information into it, I would send a PUT request instead, and the information that I'm inserting would be within the request body. And if I want to actually remove information from my Elasticsearch index, I would send a DELETE request to get rid of it. But like I said, there's more to REST than that, so let's get into the more computer-science-y aspects of it. REST stands for Representational State Transfer, and it has six guiding constraints. And, well, to be honest, these aren't really constraints - not all of them; some of them are a little bit fuzzy, and we'll talk about that. Obviously, it needs to be a client-server architecture we're dealing with. The concept of sending requests and responses from clients to servers doesn't really make sense unless we're talking about a client-server architecture, and that is what Elasticsearch offers: we have an Elasticsearch server, or maybe even a whole cluster of servers, and several clients that are interacting with that server. It must be stateless. That means every request and response must be self-contained; you can't assume there's any memory on the client or the server of the sequence of events that have happened. So you have to make sure that all the information you need to fulfill a request is contained within the request itself, and you're not keeping track of state between different requests. Cacheability - this is more of a fuzzy one. It doesn't mean that your responses need to be cached on the client; it just means the system allows for that. So maybe your responses include information about whether or not that information can be cached. Again, not really a requirement, but it's on this list of REST constraints. Layered system - again, not a requirement, but it just means that when you talk to, for example, sundog-education.com, that doesn't mean you're talking to a specific individual server. That request might get routed behind the scenes to one of an entire fleet of servers, so you can't assume that your request is going to a specific physical host. And again, this is why statelessness is important, because one host might not know what's going on in the other, necessarily - they might not be talking to each other, really. Another sort of fuzzy constraint is code on demand, and this just means that you have the capability of sending code across as a payload in your responses. So, for example, a server might send back JavaScript code as part of its response body that could then inform the client of how to actually process that data. We're not actually going to be doing that with Elasticsearch, obviously, but REST says you can do that if you want to. And finally, it demands a uniform interface, and what that means is a pretty long topic.
But at a fundamental level, it just means that the data you're sending along is of some structured nature that is predictable, and you can deal with changes to it in a structured way. So at a high level, that's all it is. With that out of the way, why are we talking about REST at all here? Well, the reason is that we're going to do this whole course just talking about the HTTP requests and responses themselves. By dealing with that very low level of how the RESTful API of Elasticsearch itself works, we can avoid getting mired in the details of how any specific language or system might be interacting with Elasticsearch. Pretty much any language out there - Java, JavaScript, Python, whatever you want to use - is going to have some way of sending HTTP requests. So it really doesn't matter what language you're using. What matters more in understanding how to use Elasticsearch is how to construct these requests and how to interpret the responses that come back from it. The mechanics of how you send that request and get the response back are trivial, right? Any language can do that; if you're a Java developer, you can go look up how to do that. So we're not going to get mired in the details of how to write a Java client for Elasticsearch. Instead, what we're going to teach you in this course is how to construct HTTP requests and parse the responses you get back from Elasticsearch in a meaningful way, and by doing that, you'll be able to transfer this knowledge to any language and any system you want very easily. Some languages may have a dedicated client library for Elasticsearch that provides sort of a higher-level wrapper over the actual HTTP requests and responses, but they'll generally be a pretty thin wrapper, so you still need to understand what's going on under the hood to use Elasticsearch successfully. A lot of people get confused about that in this course, but there's a very good reason why we're focusing on the actual HTTP requests and responses and not the details of how to do it from a specific language. All of Elasticsearch's documentation is done in the same style, and the books you can find about Elasticsearch follow the same idea. There's a good reason for that. So the way we're going to interact with Elasticsearch in this course is just using the curl command on a command line. Instead of using any specific programming language or client library, we're just going to use curl, which is a Linux command for sending HTTP requests right from the command line. We're just going to bash out curl commands to send requests on the fly to our server and see what responses come back from them. The structure of a curl command looks like this: basically, it's curl -H, followed by any headers you need to send - and for Elasticsearch that will always be a Content-Type of application/json, meaning that whatever's in the body is going to be interpreted as JSON format. It will always be that, and in fact we will show you a little bit of a hack for how to make that header get specified automatically for you in curl, to save you some typing. That will be followed by the URL, which contains both the host that you're sending this request to - in this course that will usually be the local host, 127.0.0.1 - followed by any information the server will need to actually fulfill that request: you know, what index do I want to talk to, what data type, what sort of command am I asking it to do?
And finally, we will pass -d and then the actual message body within quotes. That will be JSON-formatted data with additional information the server needs to figure out what to give back to you, or what to insert into Elasticsearch. Let's look at some concrete examples to make that more real. In this first one at the top here, we're basically querying the Shakespeare index for the phrase "to be or not to be." So let's take a closer look at that curl command and what's in it. Again, we're saying curl -H "Content-Type: application/json" - that's sending an HTTP header that says that the data in the body is going to be in JSON format. -XGET means that we're using the GET method, or the GET verb, depending on your terminology, meaning that we just want to retrieve information back from Elasticsearch; we're not asking it to change anything. And the URL, as you can see, includes the host we're talking to - in this case 127.0.0.1, which is the local loopback address for your local host. Elasticsearch runs on port 9200 by default. That's followed by the index name, which is shakespeare, and then followed by _search, meaning that we want to process a search query as part of this request. The ?pretty is a query line parameter that means we want the results back in a nicely formatted, human-readable form, because we're going to be looking at it on the command line. And finally we have the request body itself, which is specified after a -d and enclosed between single quotes. If you've never seen JSON before, this is what it looks like: it's just a structured data format where each level is contained within curly brackets - so it's always contained by curly brackets at the top level - and then we're saying we have a query level, and within those brackets we have a match_phrase command that matches the text_entry "to be or not to be." So that is how you would construct a real search query in Elasticsearch using nothing but an HTTP request. In another example here, we're going to be inserting data. In this one, we're using a PUT verb, again to 127.0.0.1 on port 9200. This time we're talking to an index called movies and a data type called movie, and it's using a unique identifier for this new entry, 109487. Under movie ID 109487, we're including the following information in the message body: the genre is actually a list of genres - and in JSON that will be a comma-delimited list enclosed in square brackets - so this particular movie is in both the IMAX and Sci-Fi categories; its title is Interstellar, and it came out in the year 2014 (this request is written out in full below). So that's what some real HTTP requests look like when you're dealing with Elasticsearch. Now that you know what to expect and how we're actually going to use Elasticsearch and communicate with it, we can talk more about how Elasticsearch works and what it's all about. We'll do that next.

5. Elasticsearch Basics: Logical Concepts: So before we start playing with our shiny new Elasticsearch server, let's go over some basics of Elasticsearch. First we'll understand the concepts of how it works, what it's all about, and how it's architected, and when we're done with that, we'll have a quick little quiz to reinforce what you learned. After that, we'll start messing around with it. So there are two main logical concepts behind Elasticsearch. The first is the document.
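As a concrete example of a document, here is the movie insert just described, written out in full. This is a sketch based on the lecture's description of the slide; note that, as the What's New lecture explains, Elasticsearch 7 deprecates named types, so the generic _doc type is used here in place of a "movie" type.

```bash
# Index a single movie document with an explicit ID of 109487 into a "movies" index.
curl -H "Content-Type: application/json" -XPUT '127.0.0.1:9200/movies/_doc/109487' -d '
{
  "genre": ["IMAX", "Sci-Fi"],
  "title": "Interstellar",
  "year": 2014
}'
```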
So if you're used to thinking of things in terms of databases, a document is a lot like a row in a database that represents a given entity - something that you're searching for. And remember, in Elasticsearch it's not just about text; any structured data can work. Elasticsearch works on top of JSON-formatted data. If you're not familiar with JSON, it's basically just a way of encoding structured data that may contain strings or numbers or dates or what have you, in a way that you can cleanly transmit across the web, and you'll see a ton of examples of it throughout the course, so it'll make more sense later on. Now, every document has a unique ID, and you can either explicitly assign a unique ID to it yourself or allow Elasticsearch to assign one for you. The second concept is the index. An index is the highest-level entity that you can query against in Elasticsearch, and it contains a collection of documents. So again, bringing this back to the analogy of a database, you can think of an index as a database table and a document as a row in that table. The schema that defines the data types in your documents also belongs to the index. You can only have one type of document within a single index in Elasticsearch. So if you're used to the world of databases, you'll find Elasticsearch has similar concepts: think of your cluster as a database, its indices as tables, and documents as rows in those tables. It's just different terminology. But as you'll soon see, even though the concepts are similar, how Elasticsearch works under the hood is very different from a traditional database.

6. Elasticsearch Overview: Let's start off with sort of a 30,000-foot view of the Elastic Stack, the components within it, and how they fit together. Elasticsearch is just one piece of this system. It started off as basically a scalable version of the Lucene open source search framework; it just added the ability to horizontally scale Lucene indices. We'll talk about shards in Elasticsearch - each shard in Elasticsearch is just a single Lucene inverted index of documents, so every shard is an actual Lucene instance of its own. However, Elasticsearch has evolved to be much more than just Lucene spread out across a cluster. It can be used for much more than full-text search now: it can handle structured data and aggregate data very quickly. So it's not just for searching; it can handle structured data of any type, and you'll see it's often used for things like aggregating logs. And what's really cool is that it's often a much faster solution than things like Hadoop or Spark or Flink. They're actually building new things into Elasticsearch all the time, like graph visualization and machine learning, that make Elasticsearch a competitor for things like Hadoop and Spark and Flink - only it can give you an answer in milliseconds instead of in hours. So for the right sorts of use cases, Elasticsearch can be a very powerful tool, and not just for search. So let's zoom in and see what Elasticsearch is really about at a low level. It's really just about handling JSON requests. We're not talking about pretty UIs or graphical interfaces when we're talking about Elasticsearch itself; we're talking about a server that can process JSON requests and give you back JSON data, and it's up to you to actually do something useful with that.
So, for example, we're using curl here to issue a REST request with a GET verb for a given index called tags, and we're just searching everything that's in it. You can see the results come back in JSON format here, and it's up to you to parse all this. For example, we did get one result here: the movie Swimming to Cambodia, with a given user ID and a tag of Cambodia. So if this is part of a tags index that we're searching, this is what a result might actually look like. Just to make it real, that's the sort of output you can expect from Elasticsearch itself. But there's more to it than just Elasticsearch. There's also Kibana, which sits on top of Elasticsearch, and that's what gives you a pretty web UI. So if you're not building your own application or web application on top of Elasticsearch, Kibana can be used just for searching and visualizing what's in your search index graphically. It can do very complex aggregations of data, it can graph your data, it can create charts, and it's often used to do things like log analysis. So if you're familiar with things like Google Analytics, the combination of Elasticsearch and Kibana can be used as a way to roll your own Google Analytics at a very large scale. Let's zoom in and take a look at what it might look like. Here's an actual screenshot from Kibana looking at some real log data. You can see there are multiple dashboards you can look at that are built into Kibana, and this lets you visualize things like: where are the hits on my website coming from, what are the error response codes and how do they break down, what's my distribution of URLs - whatever you can dream up. So there are a lot of specialized dashboards for certain kinds of data, and it kind of brings home the point that Elasticsearch is not just for searching text anymore. You can actually use it for aggregating things like Apache access logs, which is what this view in Kibana does. But you can also use Kibana for pretty much anything else you want. Later on, this course will use it to visualize the complete works of William Shakespeare, for example, and you can see how it can also be used for text data as well. It's a very flexible tool and a very powerful UI. We also have something called Logstash and the Beats framework, and these are ways of actually publishing data into Elasticsearch in real time, in a streaming format. So if you have, for example, a collection of web server logs coming in that you just want to feed into your search index over time automatically, Filebeat can just sit on your web servers and look for new log files, parse them out, structure them in the way that Elasticsearch wants, and then feed them into your Elasticsearch cluster as they come in. Logstash does much the same thing; it can also be used to push data around between your servers and Elasticsearch, but often it's used as sort of an intermediate step: you have a very lightweight Filebeat client that would sit on your web servers, and Logstash would accept those events, collect them, and pool them up for feeding into Elasticsearch over time. But it's not just made for log files, and it's not just made for Elasticsearch and web servers, either.
These are all very general-purpose systems that allow you to tie different systems together and publish data to wherever it needs to go - which might be Elasticsearch, might be something else, but it's all part of the Elastic Stack still. They can also collect data from things like Amazon S3 or Kafka or pretty much anything else you can imagine, including databases, and we'll look at all of those examples later in this course. Finally, another piece of the Elastic Stack is called X-Pack. This is actually a paid add-on offered by elastic.co, and it offers things like security, alerting, monitoring, and reporting - features like that. It also contains some of the more advanced features that are just starting to make it into Elasticsearch now, such as machine learning and graph exploration. So you can see that with X-Pack, Elasticsearch starts to become a real competitor to much more complex and heavyweight systems like Flink and Spark. But that's another piece of the Elastic Stack when we talk about this larger ecosystem, and you can see here that there are free parts of X-Pack, like the monitoring framework, that let you quickly visualize what's going on with your cluster: what's my CPU utilization, system load, how much memory do I have available, things like that. So when things start to go wrong with your cluster, this is a very useful tool to have for understanding its health. So that's it at a high level - the Elastic Stack. Obviously Elasticsearch can still be used for powering search on a website, you know, like Wikipedia or something, but with these components it can be used for so much more. It's actually a larger framework for publishing data from any source you can imagine and visualizing it as well, through things like Kibana, and it also has operational capabilities through X-Pack. So that is the Elastic Stack at a high level. Let's dive in more to Elasticsearch itself and learn more about how it works.

7. Term Frequency / Inverse Document Frequency (TF/IDF): Now, of course, indices aren't quite that simple. An index is actually what's called an inverted index, and this is basically the mechanism by which pretty much all search engines work. As an example, imagine I have a couple of documents in my index that contain text data. Let's say I have one document that contains "Space, the final frontier. These are the voyages," and maybe I have another document that says "He's bad, he's number one, he's a space cowboy with a laser gun." And if you understand what both of those are references to, then you and I have a lot in common. Now, an inverted index wouldn't store those strings directly; instead, it sort of flips it on its head. A search engine such as Elasticsearch actually splits each document up into its individual search terms - in this example, we'll just split it up on each word, and we'll convert them to lowercase just to normalize things. Then what it does is map each search term to the documents those search terms occur within. So in this example, the word "space" actually occurs in both documents, so the inverted index would indicate that the word "space" occurs in both documents one and two. The word "the" also appears in both documents, so that will also map to both documents one and two. And the word "final" only appears in the first document, so the inverted index would map the word "final" as a search term to document one.
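Sketched out, that inverted index would look something like this (purely an illustration of the concept, not Elasticsearch's actual storage format):

```
space    -> documents 1, 2
the      -> documents 1, 2
final    -> document 1
frontier -> document 1
cowboy   -> document 2
```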
Now, it's a little bit more complicated than that in practice - in reality it actually stores not only what document a term is in, but also the position within the document that it's in - but at a high conceptual level, this is the basic idea. An inverted index is what you're actually getting with a search index: it maps the things you're searching for to the documents those things live within. And, of course, it's not even quite that simple. How do I actually deal with the concept of relevance? Let's take, for example, the word "the." How do I deal with that? The word "the" is going to be a very common word in every single document, so how do I make sure that only documents where "the" is a special word are the ones I get back if I actually search for the word "the"? Well, that's where TF-IDF comes in; that stands for Term Frequency times Inverse Document Frequency. It's a very fancy-sounding term, but it's actually a very simple concept, so let's break it down. Term frequency is just how often a given search term appears within a given document. So if the word "space" occurs very frequently in a given document, it would have a high term frequency; the same applies if the word "the" appears frequently in the document - it would also have a high term frequency. Now, document frequency is just how often a term appears across all of the documents in your entire index. So here's where things get interesting: the word "space" probably doesn't occur very often across the entire index, so it would have a low document frequency. However, the word "the" does appear in all documents pretty frequently, so it would have a very high document frequency. If we divide term frequency by document frequency - which is the same as multiplying by the inverse document frequency, mathematically - we get a measure of relevance: we see how special this term is to the document. It measures not only how often this term occurs within the document, but how that compares to how often the term occurs in documents across the entire index. (As a rough illustration: if "space" makes up 5% of a document's words but appears in only 1% of the documents in the index, it scores far higher for that document than "the," which might also make up 5% of the words but appears in essentially 100% of documents.) So with that example, the word "space" in an article about space would rank very highly, but the word "the" wouldn't necessarily rank very highly at all - that's a common term found in every other document as well. And this is the basic idea of how search engines work: if you're searching for a given term, it will try to give you back results in the order of their relevancy, and relevancy is loosely based, at least in part, on the concept of TF-IDF. It's not really that complicated.

8. Using Elasticsearch: So how do you actually use an index in Elasticsearch? Well, there are three ways we can talk about. One is the RESTful API. Now, if you're not familiar with the concept of REST queries, let me explain it at a very high level. It's just like how you request a web page from a web server with the web browser on your desktop. When you're requesting a web page in your browser - Chrome, or whatever you use - what's happening is that your browser is sending a REST request to a web server somewhere, and every REST request has a verb, like GET or PUT or POST, and some sort of body that specifies what it is you want to get back. So, for example, if you're looking for a web page, you would send a REST query with a GET verb, and that GET would request the specific URL you want to retrieve from that web server.
Now, Elasticsearch works exactly the same way, over the same HTTP protocol that web servers work across, so this makes it very easy to talk to Elasticsearch from different systems. For example, if you were searching for something in Elasticsearch, you would issue a GET request through a REST API over HTTP, and the body of that GET request would contain the information about what it is you want to retrieve, in JSON format. We'll see examples of this later on. The beautiful thing about this is that if you have a language or an API or a tool or an environment that can handle HTTP requests - just like talking to the web normally - then it can also handle Elasticsearch. You don't need anything beyond that: if you understand how to structure the JSON requests for Elasticsearch, then any language that can talk HTTP can talk to an Elasticsearch server, and most of this course is going to focus on doing it that way, just so you understand how things work at a lower level and what Elasticsearch is capable of under the hood. But you don't always have to do it the hard way. If you're accessing Elasticsearch from some application you're writing, like a web server or web application, often there will be a client API that provides a level of abstraction on top of those REST queries. So instead of trying to figure out how to construct the right JSON format for the type of search you want, or for inserting the kind of data you want, there are a lot of client APIs out there that can make it easier for you. They have specialized APIs for searching for things and putting things into the index without getting into the nitty-gritty of constructing the actual request itself. So whether you're using Python or Ruby or Perl or C++ or Java, there is an API out there that you can just use. Now, in this course we're going to focus on using the RESTful APIs and not these higher-level clients. I don't want to single out one language as the only language we use in this course; if I were to go through this whole course using only the Java client, it would be useless to people coding in JavaScript or Python, for example. But all of the different clients in every language boil down to REST calls in the end, so if you understand the underlying HTTP requests these clients generate, you can understand any of the client APIs, and you'll be able to move more easily from one language to another, too. So please don't get upset that I'm not going to teach you how to use Elasticsearch from Java or any other specific language - the lower-level information I'm giving you will make it easy to use the Java client API, or the API for any other language. Finally, there are even higher-level tools that can be used for analytics, and the one we'll look at in this course is called Kibana. It's part of the larger Elastic Stack, and it is a web-based graphical UI that allows you to interact with your indices and explore them without writing any code at all. So it's really more of a visual analysis tool that you can unleash upon pretty much anyone in your organization. So, in order of low-level to higher-level APIs: there are RESTful queries you can issue from whatever language you want, you can use client APIs to make things a little bit easier, or you can just use web-based UIs to get the information you need as well. So those are the basic concepts of how Elasticsearch is structured and how you interface with it.
With that under our belt, we can talk more about how it works under the hood and how its architecture works.

9. What's New in Elasticsearch 7: If you're already familiar with Elasticsearch and just looking to get up to speed on what's new in Elasticsearch 7, here's an overview of the main changes. Elasticsearch tends to roll out big new features even within minor releases, so this isn't everything that's new since Elasticsearch 6.0, necessarily, but a lot of features introduced within the 6.x run have been declared production-ready with Elasticsearch 7. For a while now, Elasticsearch has been in a long process of deprecating the concept of types. It used to be that in addition to documents and indices, there was also a type that allowed you to associate different schemas with documents within the same index. Conceptually, they found this to be a bad idea, as it made people think types work the same as a database table, when in reality they behave differently. You'll find that some APIs that used to take a type name now use a generic type called _doc instead, and others just omit the type parameter entirely. Configuration files and plugins that used to require types to be specified no longer do. This is really the most pervasive change to Elasticsearch, and it's what required us to re-record this entire course when Elasticsearch 7 came out. The change I'm most excited about personally is the official release of SQL support in Elasticsearch. We've added a lecture and a hands-on activity for this later in the course so you can get familiar with it, but it really couldn't be much easier: you can now query your Elasticsearch index using the same SQL syntax you probably already know. SQL seems to be the lingua franca that's tying together all sorts of big data technologies, and Elasticsearch is falling in line with that. There have been a lot of changes to the default configuration settings for Elasticsearch, especially as they relate to the default number of shards - which is now one instead of five - and how replication works in a production setting, though you really should be tuning these values yourself anyhow. They've updated to the latest version of Lucene under the hood; really, Elasticsearch is just a distributed version of Lucene with various layers added on top of it. Some add-ons that used to be installed separately from Elasticsearch are now included with it by default. They seem to be moving away from the concept of a separate package of add-ons to the open source version of Elasticsearch, which is called X-Pack, toward making those add-ons open source themselves, but only enabling certain features if you purchase an enterprise license for Elasticsearch. You no longer need to install a specific version of Java before installing Elasticsearch, as it now comes with one of its own; however, other components of the Elastic Stack, such as Kibana, will still need an external JDK. Replication across clusters is now possible in Elasticsearch 7. There's also a new feature called Index Lifecycle Management, or ILM. This is pretty cool stuff: it automates the progression of your data through a lifecycle, through hot, warm, cold, and deletion phases. This can be really useful for automatically moving log data into read-only, less expensive storage over time and ultimately deleting it once it's no longer required to be retained. We'll cover this in more depth later in the course.
There's also a new Java client for Elasticsearch that they're quite proud of. If you're a Java developer, you'll definitely want to check it out. But we're going to keep things at a lower level within this course and interact with Elasticsearch via the underlying RESTful queries. Not only does that keep what you're learning language agnostic, it will help you to understand what's going on at a lower level, even if you choose to later use a higher-level API such as the HLRC. There are also countless performance improvements and small breaking changes, all of which I'll refer you to the documentation for. There are a few other smaller features that have rolled out as well, but these are the ones I'm most excited about. 10. How Elasticsearch Scales: Let's talk about Elasticsearch architecture and how it actually scales itself out to run on an entire cluster of computers that can scale up as needed. Elasticsearch's main scaling trick is that an index is split into what we call shards, and every shard is basically a self-contained instance of Lucene in and of itself. The idea is that if you have a cluster of computers, you can spread these shards out across multiple different machines. As you need more capacity, you can just throw more machines into your cluster and add more shards to that entire index so that it can spread that load out more efficiently. So the way it works is, once you actually talk to a given server on your cluster, Elasticsearch figures out what document you're actually interested in. It can hash that to a particular shard ID; there's some mathematical function that can very quickly figure out which shard owns the given document, and it can redirect you to the appropriate shard on your cluster very quickly. So that's the basic idea: we just distribute our index among many different shards, and different shards can live on different computers within your cluster. Let's talk about the concept of primary and replica shards. This is how Elasticsearch maintains resiliency to failure. One big problem that you have when you have a cluster of computers is that those computers can fail sometimes, and you need to deal with that. So let's look at this example. We have an index that has two primary shards and two replicas. In this example, we're going to have three nodes, and a node is basically an installation of Elasticsearch. Usually you'll see one node installed per physical server in your cluster. You can actually run more than one node per server if you want, but it would be a little bit weird to do that. The design is such that if any given node in your cluster goes down, you won't even see it as an end user; it can handle that failure. So let's take a look at what's going on here. In this example, I have two primary shards. Those are basically the primary copies of my index data, and that's where write requests are going to be routed initially. That data will then be replicated to the replica shards, which can also handle read requests whenever we want. So let's take a look at how this is set up. Elasticsearch figures this all out for you automatically; it's a big part of what Elasticsearch provides on top of Lucene. It manages this redundancy for you. So if I say I want an index with two primaries and two replicas, it's going to set things up like this if you gave it three different nodes. So let's look at an example of failover here.
Let's say that node 1 were to fail for some reason; maybe it had a disk failure, or the power supply burned out, or something like that. So in this case, we're going to lose primary shard 1 and replica shard 0. But it's not a big deal, because we have a replica of shard 1 sitting on node 2 and another replica sitting on node 3. So what would happen if node 1 just suddenly went away? Elasticsearch would figure that out, and it would elect one of the replicas on node 2 or node 3 to be the new primary, and since we have those replicas sitting there, everything will be fine. We can just keep on accepting new data, and we can keep on servicing read requests, because we're now down to one primary and one replica, and that should be able to get us by until we can restore the capacity that we lost with node number 1. Similarly, let's say node number 3 goes away. In that example, we lost our primary shard 0, but it's okay, because we had replicas sitting on node 1 and node 2, and Elasticsearch can just basically promote one of those replicas to be the new primary. And it can get by until we can restore the capacity that we lost. So you can see, using a scheme like this, we have a very fault-tolerant system. In fact, we could lose multiple nodes. I mean, node 2 is just serving replica shards at this point, so we could, in fact, even tolerate node 1 and node 2 going away at the same time, in which case we'd be left with a primary, on node 3, for both of the shards that we care about. So it's pretty clever how that all works. Now, there are some things to note here. First of all, it's a good idea to have an odd number of nodes for this sort of resiliency that we're talking about. Also, the idea is that you would just round-robin your requests, as an application, among all the different nodes in your cluster; that would spread out the load of that initial traffic. Maybe your application manages distributing those requests across different nodes, or maybe you have some sort of load balancer device that does that for you. Let's talk a little bit more about what exactly happens when you write new data or read data from your cluster. So let's say you're indexing a new document into Elasticsearch; that's going to be a write request. Now, when you do that, whatever node you talk to will say, okay, here's where the primary shard lives for this document you're trying to index, and I'm going to redirect you to where that primary shard lives. So you'll go write that data, indexing it into the primary shard on whatever node it lives on, and then that will automatically get replicated to any replicas for that shard. Now, when you read, that's a little bit quicker. You just get routed to the primary shard or to any replica of that shard, so that can spread out the load of reads even more efficiently. So the more replicas you have, the more you're actually increasing your read capacity for the entire cluster. It's only the write capacity that's going to be bottlenecked by the number of primary shards that you have. An unfortunate thing is that you cannot change the number of primary shards in your index later on; you need to define that right when you're creating your index up front. And here, by the way, is what the syntax for that would look like. Through a REST request, we would specify a PUT verb with the index name, followed by a settings structure in JSON that defines the number of primary shards and the number of replicas.
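Here is a rough sketch of that request, pieced together from the description above; the movies index name is just an example, and the shard and replica counts match the scenario discussed next.

curl -H "Content-Type: application/json" -XPUT '127.0.0.1:9200/movies' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'

One handy way to keep the shard math straight: the total number of shards works out to number_of_shards times (1 + number_of_replicas), so this example yields 3 x 2 = 6 shards.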
Now, this isn't as bad as it sounds, because a lot of applications of Elasticsearch are very read heavy. You know, if you're actually powering a search index on a big website like Wikipedia, you're going to get a lot more read requests from the world than you're going to have index requests for new documents, so it's not quite as bad as it sounds for a lot of applications. Oftentimes you can just add more replicas to your cluster later on to add more read capacity; it's adding more write capacity that gets a little bit hairy. It's not the end of the world, though. If you do need to add more write capacity, you can always re-index your data into a new index and copy it over if you need to. But you do want to plan ahead and make sure that you have enough primary shards up front to handle any growth that you might reasonably expect in the near future. We'll talk about how to plan for that more toward the end of the course. By the way, just as a refresher, let's also talk about what actually goes on with this particular PUT request where we're defining the number of shards. In this example, we're saying we want three primary shards and one replica. How many shards do we actually end up with here? Well, the answer is actually six. We're saying that we want three primary shards and one replica of each of those primary shards. So you see how that adds up: we have three primaries times one replica per primary, which is three total replicas, plus the three original primaries, which gives us six. If we had two replicas, we would end up with nine total shards: three primaries, and then a total of six replicas, to give us two replica shards for each primary shard. So that's how that math works out. It can be a little bit confusing sometimes, but that's the idea. Anyway, that's the general idea of how Elasticsearch scales and how its architecture works. The important concepts here are primary and replica shards, and how Elasticsearch will automatically distribute those shards across different nodes that live on different servers in your cluster to provide resiliency against the failure of any given node. It's pretty cool stuff. 11. Quiz: Elasticsearch Concepts and Architecture: As they say in my kids' schools, it's time to show what you know. It's quiz time. But don't worry too much; I just want to make sure you were awake during these past few lectures. First question: the schema of your documents, that is, the definition of what sort of information is stored within your document. Is that defined by the index itself, the cluster's settings, or the document itself? Where is the information stored as to the actual schema, the data types of the information that document contains? The answer is the index, or, technically, the document type associated with the index. Remember, the index is kind of like a database table, and you can only have one type of document in a given index. It defines the individual fields a document contains and what data types they are. So again, going back to the example of an Apache log entry, the type associated with that index might define things like the URL that was requested, or the status code, and the request time, and the referring URL, and things like that. Or, if we're storing something like Wikipedia entries, that might just include things like the text of the article itself, the author of the article, the title of the article, things like that. It's all defined by the type mapping inside of the document's index.
And again, we define type mappings in Elasticsearch when we're setting up our indices. Question two: what purpose do inverted indices serve in a search engine? This isn't just specific to Elasticsearch; it's general to search engines. Does an inverted index allow you to search phrases in reverse order? Do they quickly map search terms to the documents they reside within? Or are they a load balancing mechanism for balancing search requests across your entire cluster? What do you think the answer is? If you said the second one, you're right: they quickly map search terms to documents. Remember, an inverted index simply maps specific search query terms to the documents that they live in. So as you index documents, an inverted index is actually created, where it splits those documents into their search terms, and it serves as a very quick lookup of where to find those search terms in any given document. Next question: if I have an index configured for five primary shards and three replicas, how many shards would I have in total? It's a little bit tricky; think about that for a little bit. Is the answer 8, 15, or 20? I can think of ways of doing the math where you could get any of those answers, but the correct answer is... well, I'll let you think about it. No cheating. The answer is 20 shards. So the way it works is that I have five primary shards, so I start with five shards, and then I want three replicas of each shard as well. Three times five is 15, so I end up with five primaries and 15 replica shards, for a total of 20 shards in this particular example. Remember how this works, because they can add up fast. And remember, a given node can actually contain many different shards; Elasticsearch will distribute shards among nodes in whatever way makes sense in your cluster automatically. Just because I have 20 shards does not mean that I need to have 20 machines in my cluster; we can have many shards on a given node. Next question: Elasticsearch is built only for full-text search of documents, true or false? Well, you've got a 50/50 shot on this one, but if you've been paying attention at all, you know the answer is false. Elasticsearch can index any sort of structured data with any kind of mapping you can dream up. So it's not just for full-text search anymore. It's not just for searching encyclopedias or websites and blogs; Elasticsearch can also be used for searching, and even aggregating and visualizing, numerical data or time-based data or whatever you can dream up. And increasingly, it's being used as a tool, for example, for aggregating web logs from web servers and building a system that can compete with Google Analytics and things like that. So Elasticsearch: it's not just for search anymore. How did you do on the quiz? Hopefully you did pretty well there. If not, go back and review those first few lectures, because these are important concepts that will provide the underpinning for all the stuff we're about to be doing going forward. But with all this conceptual stuff under our belts, we can now roll up our sleeves and do some more hands-on activity. So let's get busy. 12. Section 1 Wrapup: We have accomplished a lot together in a short period of time. We've installed an Ubuntu virtual machine and gotten Elasticsearch up and running on it, and we've covered the basics of what Elasticsearch is for and how it works. You've already learned some valuable information, so congratulations on reaching this point.
But we've only scratched the surface of what Elasticsearch can do. In the remaining sections of this course, we're going to focus on hands-on activities to show you how to import, search, and analyze your big data with Elasticsearch in many different ways. Keep on going; I've got a lot left to show you. 13. Intro: Mapping and Indexing Data: This next section is about mapping and indexing your data. To make things interesting, we'll import a real-world data set of movies and movie ratings. I'll cover the basics of what we're doing in slides so you'll have them for a reference, but you're going to follow along with me and actually import this data into Elasticsearch yourself. Then we'll go over how to insert, update, and delete individual documents in your index, and some of the complexities that can arise when doing this on a busy cluster. We'll also cover the different ways Elasticsearch can analyze and tokenize your text data, and talk about data modeling when you're dealing with relational, structured data in Elasticsearch. There's a lot to cover, so let's get started. 14. Connecting to your Cluster: So let's go over how to actually connect to your cluster now that we have it set up and running on a virtual machine on your desktop. Now, if you're running on an Ubuntu desktop machine that you had lying around, obviously you already know how to connect to that: you just sign in and go to town. But earlier, back in lecture one, when we set things up in a virtual machine, if you did that, we just signed right into the console on that machine once we started it up. And that's not how you would do it in the real world. Typically, you would connect to your cluster, or a machine in your cluster, using SSH from some remote location. That's how we're going to do it here as well within this course. So let me show you how to actually connect to your Elasticsearch server using SSH on your own PC. The first thing we need to do is open up port 22 on our virtual machine so we can communicate with it via the secure shell, or SSH, protocol, and while we're at it, we'll open up the other ports Elasticsearch will need as well. Then we need to install some sort of a terminal application on our desktop that we can use to communicate with our virtual machine. Now, if you're on macOS, you already have a terminal application, and you can just use it: just type in ssh 127.0.0.1, and that will connect you to the local machine that you're running there within your virtual image. But if you're on Windows, Windows does not come with a built-in terminal application, so we'll have to go install something called Putty in order to do that. That has a nice little side benefit too: it has better support for things like copying and pasting data from your Windows clipboard, as opposed to logging directly into the machine from the virtual machine console. Finally, I'll show you how to actually connect to your so-called cluster, which is really just a single Elasticsearch host running within a virtual machine on your desktop. And we'll practice that, because we're going to have to do it every time we log in and try to do something on your cluster in this course. So let's dive in and see how it all works. Let's start by opening up the ports that we need to communicate with our Elasticsearch server from outside of our virtual environment. To do that, go ahead and open up the Oracle VM VirtualBox application that we installed earlier, select the Elasticsearch virtual machine we set up, and click on Settings from here.
Go to Network, and then open up the Advanced tab here, and we're going to click on the Port Forwarding button. From here, hit the plus sign, and we'll start by adding in a rule for SSH traffic so we can actually connect to this through a terminal. Just give it a name like ssh; the protocol is TCP, and hit tab to go between the fields here. Host IP will be 127.0.0.1, and SSH runs on port 22, so we'll go ahead and open that up. We'll skip past the guest IP and set our guest port to 22 as well. Note that if you're doing this on macOS, you'll want to use port 2222 or something else for the host port; there might be a conflict with port 22 on your system on macOS, so keep that in mind. We'll click the plus sign again, and we'll also open up a port for Elasticsearch itself, in case we want to communicate with it from outside of the virtual environment. That will also be on 127.0.0.1, and the port for Elasticsearch is 9200. Finally, we'll open up a port for the Kibana UI, because we'll want to connect to that as well: 127.0.0.1 again, and that one runs on port 5601. Just like that. So once that all looks good, hit OK, and OK again, and let's go ahead and start up our Elasticsearch virtual machine so we can practice connecting to it. But first, you need a terminal application. As we said before, on macOS or Linux you're all set; you have a built-in terminal application already that you can use from the command line. You just open up a terminal prompt, type in ssh 127.0.0.1, and then you can log into your virtual machine, or the machine that's actually running on the host itself, for that matter. But on Windows, we have no such luck. We need to install a terminal application, and the one I'm using is called Putty. So if you do need to download that, head on over to putty.org and click on the friendly download link. Assuming you're on a 64-bit version of Windows, you'll want the 64-bit installer. Just go ahead and download that and run it; I won't waste your time by showing you how to do that. I'm pretty sure you know how to install software. Once that's set up, just open up Putty, and what you need to do is type in the host name, which is 127.0.0.1. The default port of 22 is correct, and to make life easier, you can give it a name like Elasticsearch and save that session so you can easily open it up later. Now, I'm still waiting for the virtual host to spin up here, so I'm just going to wait until that's done loading. So I did have to give it a few minutes to boot up fully. If you're not sure whether it's ready to log into yet, just hit the enter key from the main VirtualBox window here, and you should see a login prompt if it's ready for you to sign in. Once you get to that point, just go back to Putty; we already have 127.0.0.1 on port 22 selected here for SSH. Just hit the Open button, and that should connect us. The first time you connect, it will say, hey, are you sure about this? I haven't seen this machine before. Yes, it's okay; click the Yes button. We'll log in as student, with the password that you set up during initialization, and we're in. Cool. So now we're connected to our virtual host through an SSH client outside of it, and this is how you'd do it in the real world, right? You probably have a server running somewhere else on your network, or in the cloud somewhere, and you're connecting to it from your PC at your desk through an SSH client. So if you're on Windows, you're good to go.
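If you'd rather see the whole connection as a single terminal command on macOS or Linux (which is what we'll walk through next), it would look roughly like this; the 2222 host port is only an assumption for the case where you had to remap away from port 22:

ssh student@127.0.0.1 -p 2222

If you kept the default port 22 mapping, you can drop the -p option entirely.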
At this point, you can do whatever you want within this client, and when you're done, just type exit to get out of it. Now, if you're on macOS, you will have to use a terminal prompt instead of Putty. I don't have a Mac system handy, but I've opened up the Windows command prompt here, which is at least similar in spirit. You can just use ssh and then the user name, which would be student@127.0.0.1, and then -p and whatever port number you're using. Again, if you do run into trouble on macOS, it's probably due to a port conflict on port 22, so you'll probably want to go back to the port mapping in VirtualBox and use a different port, like 2222, and this is where you would specify that. After that, just type in the password and we're in; same exact deal. Hit enter to get a prompt. It's a little bit sluggish on Windows, but that's why we use Putty. But on macOS, this is how it would work. Just type exit to get out, and you're done. All right. So now you have the ability to connect to your virtual machine from an SSH client, be it Putty on Windows or a terminal using the ssh command on macOS. With that in hand, remember how to do this, because we're going to come back and do it before we start every exercise going forward. And let's start to do some more interesting work coming up next. 15. Introducing the MovieLens Data Set: Let's also introduce you to the free MovieLens data set that we'll be using throughout this course, and get you familiar with what it's all about. Let's go ahead and download it, take a look at it, and figure it out, because it's going to be an important thing that we work with throughout the entire course. MovieLens is just a free data set of movie ratings, kind of like when you're rating movies on Netflix. If you go to movielens.org, you'll see they have a website where you can rate a bunch of movies and get back movie recommendations, and over many, many years they've built up a very large database of user ratings and movies as well. So we can use this as a way to play around with not only textual data, such as movie titles, but also structured data like movie ratings and the properties of the movies themselves, like the release year and genres and things like that. We're going to be using this data a lot throughout the course, because it's fun stuff to work with. So let's just go take a look at what this data looks like and familiarize ourselves with it. We'll head on over to grouplens.org and dive right in. GroupLens is the repository for all the movie ratings data sets that they offer up for free for research purposes. There should be a datasets link here, and from here you can download different sizes of data to experiment with. Now, since we're just doing this for educational purposes on a single machine, we don't want to get too much data; we're not doing big data here quite yet. So let's scroll down to the recommended-for-education-and-development section here, because we're doing education, and we want the small MovieLens latest data set. This is basically 100,000 ratings sampled from their existing data, and you can just go ahead and download it by clicking on ml-latest-small.zip. So let's go ahead and do that. While we're waiting for that to come down, click on the README.html and explore what's in here.
This gives me more information about the format of the data and the files that are included. If we scroll down, you can see more information here. Basically, we have the concept of user IDs in this data; to keep it anonymous, every user is assigned a numeric identifier, and movies are also assigned numeric identifiers, so that we can cut down on the amount of data we have to pass around. There's a separate file for mapping those movie IDs to movie names, namely movies.csv, and these are the individual files that are part of the download. ratings.csv will be the data set of 100,000 ratings, and every row in that file will contain a user ID, the movie ID that user rated, the actual rating, and a timestamp of when that rating occurred. There's also tags data that might come in handy later: just various textual terms associated with each movie. There's also a file that maps movie IDs to their genres and the movie title, and that will definitely come in handy later in the course as well. We're not going to be using links.csv, but if you did want to tie this data back into IMDb pages or what have you, that's a way to do that. Let's go ahead and explore the data that came down. I'm going to open that up; it's sitting in my Downloads folder. I'll just right-click on it to extract it. All we're doing is exploring it here; we're actually going to load this up directly into our virtual machine later on, so don't worry about where this is quite yet. And there are all the files that we talked about. So let's open up the movies.csv file, for example. If you have Excel or anything else that's already configured to display CSV (comma-separated values) files, you should be able to see it here. And you can see, for example, that movie ID 1 corresponds to the movie Toy Story, and Toy Story has been assigned the genres Adventure, Animation, Children, Comedy, and Fantasy, so you can see that a movie can have multiple genres. So this is going to be some interesting, fun data to play around with in Elasticsearch. We can actually index this data and start searching it and do things like, you know, find all the action movies, and who knows what else; there are a lot of creative things that we're going to dream up to do with this information. I just want you to get familiar with the nature of this data before we start playing with it, and we're going to start doing that next. 16. Analyzers: So now that we've seen some data that we would like to import into Elasticsearch, let's talk about mappings, which tell Elasticsearch how to store that information. What is a mapping? Well, you can think of a mapping in Elasticsearch as a schema definition. It's telling Elasticsearch what format to store your data in, how to index it, and how to analyze it. Elasticsearch usually has reasonable defaults and will infer the right thing to do from the nature of your data. More often than not, it can figure out whether you're trying to store strings or floating point numbers or integers or whatnot, but sometimes you need to give it a little bit of a hint. Here's an example where we're going to import some MovieLens data, and we want the release date to be explicitly interpreted as a date type field. So here's what a JSON request would look like to do that.
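Here's a sketch of that request, pieced together from the description that follows; treat it as illustrative, since the movies index name and the year field just come from the MovieLens example.

# The Content-Type header is required as of Elasticsearch 6; the slide omits it,
# but at the command line it would look something like this:
curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/movies -d '
{
  "mappings": {
    "properties": {
      "year": { "type": "date" }
    }
  }
}'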
Now, we're going to be using the curl command throughout this course. Remember, that's just a simple Linux command that allows you to send an HTTP request to the server that's running Elasticsearch. In this case, that's going to be our localhost running within our virtual machine. The syntax of the curl command is that you need to specify the content header using the -H option, so we're going to send along, as part of the HTTP request header, Content-Type: application/json, which tells the server that we're sending JSON-formatted data. That's a new requirement as of Elasticsearch 6. I'm not actually showing it on this slide, however, because I'm about to show you a shortcut that will save you from having to type that every time. In Elasticsearch 5 and earlier you didn't have to type that at all, but in Elasticsearch 6 and up you do. Anyway, once we have our curl command kicked off, we'll send in -X PUT, which is the verb for the HTTP request, the HTTP action if you will; we're going to do a PUT to put information into our index. The actual address we're talking to is 127.0.0.1, which is the loopback address for the localhost, and :9200 indicates that we're running on port 9200. Then we say /movies, which is the name of the index that we're manipulating, and after that, in curl, we'll say -d and then a single quote, which says that we're going to send along the data as the request body. Those single quotes enclose the JSON data that's actually the body of the message we're sending into Elasticsearch. That JSON body contains a mappings section, which specifies, for this index, what the properties of the listed fields should be. That "year": {"type": "date"} is what's telling Elasticsearch that I want it to explicitly interpret the year field, when it comes in, as a date type and not just a string full of numbers and dashes. Mappings can do a lot more than that, though. Mappings can define field types like we talked about; other field types besides date include strings, bytes, short integers, integers, long integers, floating point numbers, double-precision floating point numbers, and Boolean values as well. Those are all types that Elasticsearch recognizes and can handle. You can also specify, as part of your mapping, whether or not you want a field to be indexed for full-text search. For example, you might say "index": "not_analyzed" if you don't want that information to be part of full-text search, if you just want it to be ancillary information that gets tagged along in the results instead. More interesting are the field analyzers, and analyzers have multiple different things they can do. They can have character filters: for example, you can remove HTML encoding or convert ampersands to the word and. You can do tokenizing with tokenizers; that's basically specifying how you split strings, based on whitespace or punctuation or non-letter characters. There are also token filters, where you can do things like lowercase all the information for search purposes, do stemming, apply synonyms, and remove stop words; we'll talk about that more in a bit. There are many different analyzers to choose from. For example, the standard analyzer splits on word boundaries; the simple analyzer splits on anything that isn't a letter and also lowercases for you; and there's a whitespace analyzer that just splits on whitespace but doesn't convert to lowercase. You can also specify the language that you want to use for this index.
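Before we go deeper, here's a rough sketch of where those pieces end up in practice. This is my own illustrative example rather than one from the course slides, and the index and analyzer names are made up, but the analysis settings structure, the standard tokenizer, the lowercase and stop token filters, and the _analyze API are all standard Elasticsearch features.

# Define a custom analyzer on a hypothetical index: a standard tokenizer,
# plus token filters for lowercasing and removing English stop words.
curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/analyzer_demo -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'

# Try it out with the _analyze API to see how a string gets broken into terms.
curl -H "Content-Type: application/json" -XGET '127.0.0.1:9200/analyzer_demo/_analyze?pretty' -d '
{
  "analyzer": "my_analyzer",
  "text": "The Box and the Boxes"
}'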
Specifying a language gives you language-specific stop words and stemming rules that get imported as a result of what language you're using. Let's go through that in a little more depth, shall we? Basically, there are three things an analyzer can do. One is character filters: an analyzer can remove HTML encoding and do things like convert ampersands to the word and, for example. The idea here is that if you apply the same analyzer to both your search query and the information that's indexed, you can get rid of all those discrepancies that might exist between the two. So if I apply the same analyzer that maps ampersands to the word and, on my search query and on the data that I'm indexing, that means it doesn't matter whether I'm searching for an ampersand or the word and: I'll still get ampersand results back from my inverted index. We also have tokenizers, which actually split your strings up in certain ways. How do I know what makes up a word? How are search terms broken up? Your choice of tokenizer determines that. You can, for example, split strings up just based on whitespace, on punctuation, or on anything that's not a letter. There are also language-specific ways of doing tokenization as well. That's how you specify how a string is broken up into the search terms that get indexed. You can also do token filtering. Again, if you want your searches to be case insensitive, which you usually do, you might want your token filter to lowercase everything. If you apply the same analyzer and the same token filter to your search queries and to what's indexed, you can get rid of any case sensitivity, since everything is lowercase in both what you're searching for and what you've stored to search against. You can also handle stemming: for example, if you want the words box and boxes and boxing and boxed to all match a search for the term box, you can do stemming to normalize all of those different variants of a given word to the same root stem. Synonyms are also a language-specific thing. If you want a search for the word big to also turn up documents that contain the word large, your token filter can normalize all of those to a given synonym to make sure those work as well. You can also specify stop words. So if you don't want to waste space and store things like the and and and a, little words like that that don't really convey a lot of information, you