Transcripts
1. C# performance tips: Let me ask you a question. Would you like to become a C# performance architect? Okay, I have to admit, I made that word up. But in my book, a C# performance architect is a senior developer who writes high-performance C# code. So, a senior developer who is acutely aware of code optimization techniques and who writes code for games, for data analysis, for real-time data acquisition, for all these cool environments where fast code is essential. So would you like to become a C# performance architect? Would you like to join the club? Then this is the course for you.

In this course, I will teach you a ton of optimization hacks. We're going to look at basic optimizations: the low-hanging fruit, the easy tweaks that will make your code run up to 1000 times faster. We'll look at intermediate optimizations: fairly advanced refactorings of your code that will give you a small performance improvement. And we're going to look at some really advanced optimizations: we're going to look at using pointers in C# and writing directly into memory, which in some scenarios is faster than using the standard .NET classes. We will look at the foundation of the .NET runtime, so I will give you a crash course in the stack and the heap: how value types and reference types are stored on the stack and the heap, and how data moves between the stack and the heap when we're running code. I will teach you intermediate language, so I will actually take code compiled into intermediate language, show that intermediate language to you, and explain what all the instructions are doing. By the end of the course, you will be able to read intermediate language, you will become aware of how the C# compiler compiles and stores code in intermediate language, and you will know how certain IL instructions can slow down your code.

So this is a fairly large course. It contains lots and lots of lectures, there are quizzes to test your knowledge, and you can download the source code that I've been using for benchmarking. So would you like to become a C# performance architect? Then this is the course for you. I've created this course for medium-level or senior C# developers who want to learn how to write fast and efficient C# code, to get their career ready for cool subjects like game development or real-time data acquisition. Thank you for listening. I look forward to welcoming you in the course.
2. Introduction to code optimization: So let's talk code optimization. What is code optimization? Well, code optimization in general implies modifying an IT system to make it work more efficiently, or use fewer resources, or be more robust. For example, a computer program may be optimized so that it will execute faster, or use less memory or disk storage, or be more responsive in terms of the user interface.

We optimize code to avoid unacceptable slowdowns when our code is in use. Imagine clicking a button labeled "show report" and then having to wait 10 minutes before anything shows up. An acceptable user interface delay is two seconds, so we need to find a way to generate that report in only two seconds. We also optimize to make our code scalable. I've personally experienced websites that run fine with 10 simultaneous users but completely break down when 50 users happen to visit the website all at the same time. There is even a name for it: it's called the Slashdot effect, named for when the popular website Slashdot features a website on its front page, which then promptly crashes because millions of visitors click the hyperlink and overload the web server. A sound knowledge of code optimization is going to help you to avoid these potential disasters.

So in this course, we're going to look at how to make a program execute faster. We have a number of strategies at our disposal to achieve this goal. We can rewrite algorithms to achieve the same result with less code, we can avoid unnecessary instructions, we can make sure that the libraries we use are specifically designed for the task at hand, we can reduce memory consumption as much as possible, and we can avoid scenarios where the code is blocking, waiting for a slow resource. The performance optimizations in this course all fall into one or more of these five categories.

But before we get started, a final word of warning. There is a famous quote by Michael Jackson (and I mean the British computer scientist Michael Jackson, not the other guy). The quote goes like this: "The first rule of program optimization is: don't do it. The second rule of program optimization, which is for experts only, is: don't do it yet." The famous computer scientist Donald Knuth had a similar quote: "Premature optimization is the root of all evil."

The thinking behind these quotes is that your code will completely go to hell if you try to focus on performance optimization right from the start. While you are still working on your program, your code is fluid and evolves in unexpected directions. It is impossible to reliably predict in advance where the performance bottlenecks will be. Here's an example. Near the end of this course, you will learn that pointers can speed up your code, but only if you use them in a very specific way. So imagine refactoring your entire program to use pointers, and then, a couple of weeks later, you need to refactor your algorithm, and suddenly the pointer optimizations no longer work. So now you're stuck with unsafe, hard-to-read code that is actually slower than well-written, clean and safe code that doesn't use pointers, and you're probably going to have to undo everything. Another example: you spend weeks squeezing every possible ounce of performance out of a library function, and eventually you manage to optimize the function to run in less than one millisecond. Congratulations! Then your program evolves again, and by the time you go into production, your method is called 99% of the time right after a database call that takes 300 milliseconds to complete. So all your optimizations were for nothing.

So the recommended course of action is to write simple, clear and modular code, and leave the optimization until the very end, when you reliably know where the performance bottlenecks are going to be. However, there are exceptions to this rule. I will leave you with the full version of Donald Knuth's quote: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." So if you can clearly see the 3% right from the start, feel free to begin optimizing right away.
3. What is stack memory?: Stack memory, or simply the stack, is a block of memory that is used for calling methods and storing local variables. So let me draw that here on the blackboard. I'll draw the stack as a vertical column, like this. Now, when I start running some code, the stack will initially be empty. But when my code calls a method, this happens: the method parameters, the return address, and all local variables of the method are put on the stack. This whole block of data is called a stack frame. So what happens to the stack when my code calls another method from inside this method, like this? The same thing happens again: the method parameters, return address, and local variables of the new method are put on the stack, right on top of the previous stack frame. So this is why it's called a stack: information is stacked on top of one another.

What happens when my code encounters a return statement? As you probably know, a return statement ends a method and returns back to the calling code. On the stack, this is what happens: the entire stack frame that corresponded to the method is removed. But you might be thinking, what happened to all the local variables that were stored in the stack frame? Well, they all go out of scope, which is just a fancy way of saying that they are destroyed. So this is an important fact to remember: the moment you return out of a method, all local variables of that method go out of scope and are destroyed. If I continue my program and also return out of the first method, we are back to where we started, with an empty stack.

You might be wondering what happens if a method calls another method, which calls another method, which calls another method, a thousand times. Well, the stack would quickly fill up with stack frames until there is no more room. Eventually the stack will be completely full, and the .NET Framework throws a StackOverflowException. If you see this error message, it means you have an infinite sequence of method calls somewhere in your code.

Let's take a look at some code. I wrote a simple program to draw a square on the screen. You see that I have a DrawSquare method that calls a DrawLine method four times, to draw the four sides of a square. I will put a breakpoint inside the DrawLine method and then run my code. Watch this. Now, at this point in my code, the stack will look like this: my first call to DrawSquare is here at the bottom of the stack, with its four parameters, return address, and local variables. Next is the call into DrawLine, with again four parameters and a return address, and in this case no local variables, because DrawLine doesn't have any. In Visual Studio, you can take a look at the stack by opening the Call Stack view, which is here. You can see the call into DrawSquare and then into DrawLine. So this window shows you which stack frames are stored on the stack right now.

As a final demonstration, let me show you a StackOverflowException. Let me modify my code like this. I've modified my code so that now DrawLine calls into DrawLine, which then calls into DrawLine, which calls into DrawLine... you get the picture. This sequence will never end. When I run this program, it will create an infinite sequence of DrawLine methods calling themselves. Let's run the program and see what happens. And there you have it. The stack is limited in size. If I have too many methods calling other methods, eventually the stack will be completely full and the .NET Framework throws a StackOverflowException.

So what have we learned? The .NET Framework uses the stack to track method calls. Every time you call a method, all of its method parameters, the return address, and all local variables are placed on the stack. This entire block of memory is called a stack frame. When you return out of a method, the top stack frame is removed, and all local variables go out of scope at that point and are destroyed. If you have an infinite sequence of methods calling other methods, the stack will fill up completely, and .NET will throw a StackOverflowException.
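The DrawSquare program itself isn't reproduced in the transcript, so here is a minimal sketch of what it might have looked like. The method names match the lecture; the parameter names and the console output standing in for actual drawing are assumptions.

```csharp
using System;

class Program
{
    // Draws a square by calling DrawLine four times. Each call pushes a new
    // stack frame holding the four parameters and the return address.
    static void DrawSquare(int x, int y, int width, int height)
    {
        DrawLine(x, y, x + width, y);                    // top
        DrawLine(x + width, y, x + width, y + height);   // right
        DrawLine(x + width, y + height, x, y + height);  // bottom
        DrawLine(x, y + height, x, y);                   // left
    }

    static void DrawLine(int x1, int y1, int x2, int y2)
    {
        Console.WriteLine($"Line from ({x1},{y1}) to ({x2},{y2})");

        // Uncomment the next line to reproduce the demonstration: DrawLine
        // calls itself forever, each call adds another stack frame, and the
        // runtime eventually throws a StackOverflowException.
        // DrawLine(x1, y1, x2, y2);
    }

    static void Main()
    {
        DrawSquare(10, 10, 100, 100);
    }
}
```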
4. What is heap memory?: The other type of memory inside a computer is called heap memory, or simply the heap. Let's take a look at the following line of code. Now, any time the keyword new appears in a line, you are creating an object on the heap. This is a fundamental rule: in .NET, objects are always created on the heap. They never go on the stack.

So I've made some changes to my DrawSquare program. Let's take a look. Previously, the DrawLine method expected four integer parameters to draw a line. But now I have a DrawPolygon method, which expects an array of Line objects and draws everything in one go. The DrawSquare method sets up a Line array with four objects corresponding to the four sides of the square and then calls DrawPolygon to draw everything in one go. Now remember, in my old DrawSquare program, I put a breakpoint inside the DrawLine method, and when I ran my code, the stack looked like this. But now I've made a lot of changes to the program, so what are the stack and the heap going to look like? Okay, so imagine I put a breakpoint inside the DrawPolygon method and run my program. The stack and the heap would then look like this. The parameter called lines exists on the stack because it's a parameter, but I initialized it with the new keyword, so the array itself is created on the heap. The variable on the stack refers to an object on the heap. You can also see that the lines array has four elements, and that each element refers to a Line object elsewhere on the heap.

Now, if you're thinking this all looks a lot more complicated than the previous code, then you're on the right track, and this brings me to an important point. This code, which uses a Line array, is slightly slower than my previous code, which used integers for everything. And this is because all the extra references have to be resolved: from the lines parameter on the stack to the array instance on the heap, and from an array element on the heap to a Line object instance elsewhere on the heap. So the main takeaway, for now, is: code that uses integers is slightly faster than code that uses objects.

So now, what happens when the DrawPolygon method finishes? The stack frame is removed, and the lines parameter goes out of scope and is destroyed. But here is something you might not have expected: the array and the Line objects on the heap continue to exist. Now, this is an interesting situation. The lines parameter is out of scope, but the objects on the heap are still there. We say that the objects are dereferenced, because the variable or parameter that referred to them went out of scope. Dereferenced objects continue to exist and are not destroyed immediately. So here is another important takeaway: the .NET Framework will always postpone cleaning up dereferenced objects. This is because cleaning up the heap takes a lot of time, and by postponing it for as long as possible, your code will actually run faster. But eventually the framework will have to clean the heap, or we would quickly run out of memory. This cleaning process is called garbage collection, and it happens periodically in the background. When the framework starts garbage collecting, it identifies all objects on the heap which are no longer referenced by any variable, parameter, or object in your code, and it deallocates each of them. We will get back to garbage collection in later lectures, and I will give you some tips on how to avoid performance problems in your code due to garbage collection.

So what have we learned? Every time you use the new keyword in your code, you are creating an object on the heap. The variable itself may live on the stack, but it will refer to an object on the heap. The new DrawSquare program that uses a Line array is slightly slower than the old program that used integers, and the reason for this is that all the extra object references need to be processed. When parameters and local variables on the stack go out of scope, the corresponding objects on the heap are not destroyed; they continue to exist in a dereferenced state. The .NET Framework postpones cleaning up dereferenced objects on the heap for as long as possible, for performance reasons. But eventually the framework will start a process called garbage collection, which deallocates all the dereferenced objects on the heap.
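The modified program isn't listed in the transcript either, so here is a hedged reconstruction of the Line class and the DrawPolygon method described above; the field and parameter names are assumptions.

```csharp
using System;

// A reconstruction of the Line data container from the lecture: a simple
// class with two sets of coordinates (field names are assumptions).
class Line
{
    public int X1, Y1, X2, Y2;

    public Line(int x1, int y1, int x2, int y2)
    {
        X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }
}

class Program
{
    // The lines parameter lives on the stack, but the array it refers to
    // and every Line instance inside it live on the heap.
    static void DrawPolygon(Line[] lines)
    {
        foreach (Line line in lines)
            Console.WriteLine($"Line from ({line.X1},{line.Y1}) to ({line.X2},{line.Y2})");
    }

    static void DrawSquare(int x, int y, int size)
    {
        // Five heap allocations in total: the array object itself plus the
        // four Line objects it refers to.
        Line[] lines =
        {
            new Line(x, y, x + size, y),
            new Line(x + size, y, x + size, y + size),
            new Line(x + size, y + size, x, y + size),
            new Line(x, y + size, x, y)
        };
        DrawPolygon(lines);
        // After DrawPolygon returns, 'lines' goes out of scope here, but the
        // array and the Line objects stay on the heap in a dereferenced
        // state until the garbage collector cleans them up.
    }

    static void Main()
    {
        DrawSquare(10, 10, 100);
    }
}
```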
5. What are value types?: In the previous lectures, we learned about the stack and the heap. The knowledge you gained will help you going forward when we look at variables, how they are stored in memory by the .NET Framework, and how this affects the performance of your code.

Now, previously, when I talked about the stack, I showed some code with a DrawLine method that used four integer parameters. Let's take a closer look at an integer in the .NET Framework. The integer type is part of a special class of types called value types. But what is a value type? A value type is a type of variable where the type and the value of the variable are stored together. So if I have a local integer variable with a value of 1234, the integer type and its value will be stored together, like this.

So let's take a look at the stack again. Previously, when I talked about the stack, I mentioned all the types of data that are stored on the stack: method parameters, the return address of a method, and local variables. So if I have a local variable of type integer with the value 1234, it would be stored on the stack like this. You see that the type and the value are stored together on the stack. Now keep this in mind, because in the next lecture I'm going to talk about reference types, which are stored differently.

So you might be wondering which types in the .NET Framework are actually value types. Well, here's the complete list: all numeric types are value types, including all of the integer, floating point, and decimal types. Booleans, enumerations, and structures are also value types. Anything else in the .NET Framework is a reference type, which I will discuss shortly.

When people try to explain the difference between a value type and a reference type, you often hear the following explanation: a value type is a type that exists on the stack. However, this is wrong. Value types can exist both on the stack and on the heap. Let me demonstrate. Previously, when I talked about the heap, I mentioned what kind of data is stored on the heap. Do you remember what it was? It was objects: instances created with the new keyword in C#. So imagine that I create an object on the heap using the new keyword in my code, and this object has one integer field containing the value 1234. This integer will be stored like this. So you see, I now have a value type on the heap. Value types can exist both on the stack and on the heap. The defining property of value types is not where they are stored, but that the value is stored together with the type.

So let me finish by showing you two important additional features of value types. Let's say I have a method in my code with two variables A and B. Both are integers. The variable A contains the value 1234, and the variable B is zero. Now what happens when I assign A to B? Check this out: the value of A is copied into B. This is an important feature of value types: they are assigned by value, which means their value is copied over. So now what happens if I compare A and B? They are two different variables that just happen to contain the same value. So how will the .NET Framework interpret the situation? Well, like this: the framework considers these two variables to be equal. This is the second important feature of value types: they are compared by value, which means two different variables holding the same value are considered equal.

So what have we learned? Value types store their value directly, together with the type. Value types can exist on the stack and on the heap. Value types are assigned by value, meaning the value is copied over. Value types are compared by value: two variables with the same value are considered equal.
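Here is a minimal, self-contained illustration of the two value-type features just described, assignment by value and comparison by value; the variable names match the lecture, the rest is a sketch.

```csharp
using System;

class Program
{
    static void Main()
    {
        int a = 1234;    // type and value stored together
        int b = 0;

        b = a;           // assigned by value: the value 1234 is copied into b
        Console.WriteLine(a == b);   // True: compared by value

        b = 0;           // changing b afterwards does not affect a
        Console.WriteLine(a);        // still prints 1234
    }
}
```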
6. What are reference types?: In the previous lecture, I talked about value types and briefly mentioned their counterpart, the reference type. But what is a reference type? Well, a reference type is a type of variable that refers to a value stored on the heap.

Now, previously, when I talked about the heap, I showed my modified DrawSquare program that had a DrawPolygon method, if you remember, with a Line array parameter. So DrawPolygon expected an array of Line objects. Let's take a closer look at that Line object. Just to refresh your memory, here is my code again. You can see the definition of the Line class here. It's a simple data container with two sets of coordinates.

So imagine I have a method with a local variable of type Line. What would that look like in memory? Well, like this. You can see that the variable itself is on the stack, but it refers to a Line object on the heap. But can a reference type variable also exist on the heap? Yeah, sure. All I need to do is create an object on the heap using the new keyword and have that object contain one field of type Line. The memory will then look like this. You see that I now have a reference type variable on the heap, and it refers to a Line object which is also stored on the heap, elsewhere. So to summarize: reference types can exist on the stack and on the heap, just like value types, but they will always refer to a value on the heap.

Let me finish by showing you two important additional features of reference types. Let's say I have a method in my code with two variables A and B. Both variables are Lines. The variable A refers to a Line instance on the heap, and the variable B is set to null. Now what happens when I assign A to B? Check this out: the reference of A is copied into B. This is an important feature of reference types: they are assigned by reference, which means that the reference is copied over. You end up with two variables referring to the same object instance on the heap. So now what happens when I compare A and B? They are two different variables that refer to the same object on the heap. How will the .NET Framework interpret this situation? Well, like this: the framework considers these two variables to be equal. But wait, what about this scenario: two reference type variables pointing to two separate objects on the heap, where both objects contain identical data. How will the .NET Framework interpret this situation? Take a guess. The framework considers these two variables to be not equal. So this is another important feature of reference types: they are compared by reference, which means two different variables referring to the same object are considered equal, but two different variables referring to two separate but identical objects are considered not equal.

So what have we learned? Reference types can be set to the null value. Reference types store a reference to their value, and this value is always stored on the heap. Reference types can exist on the stack and on the heap, but their value is always stored on the heap. Reference types are assigned by reference, meaning the reference is copied over. Reference types are compared by reference: two variables referring to the same object are considered equal, and two variables referring to separate but identical objects are considered not equal.
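And here is the matching illustration for reference types, again a small sketch rather than the lecture's exact listing; the Line class is reduced to bare fields.

```csharp
using System;

class Line
{
    public int X1, Y1, X2, Y2;
}

class Program
{
    static void Main()
    {
        Line a = new Line { X2 = 100 };   // a refers to a Line on the heap
        Line b = null;                    // reference types can be null

        b = a;             // assigned by reference: b now refers to the
                           // same Line object as a
        b.X2 = 50;
        Console.WriteLine(a.X2);       // 50: the change is visible through a
        Console.WriteLine(a == b);     // True: same object, considered equal

        Line c = new Line { X2 = 50 }; // separate but identical object
        Console.WriteLine(a == c);     // False: compared by reference
    }
}
```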
7. What is boxing and unboxing?: In this lecture, I'm going to show you a mystery. Take a look at this code. This is a very simple program. I start with the variable A containing the value 1234. Then I declare a second variable B of type object, and I assign A to B. In C#, all types inherit from object, including integers, so you can put basically everything into an object type variable.

But wait: integers are value types, and objects are reference types. So in memory, my variables are stored like this. Here is my integer variable A with its value of 1234. And here is my object variable B. But B is a reference type, and we have learned in the previous lecture that reference types always refer to a value on the heap. Here, this is not possible, because A is a local variable, so it exists on the stack, and it's a value type, so its value is also stored on the stack. There is simply no way for B to refer to A, because both variables reside in different types of memory. A reference type can never refer to stack memory, so this program can never work, right? Well, this one does. I go back to Xamarin Studio and run the program. Here we go. Cool, it actually works. But how is that possible?

Now, this is weird. Based on what we learned in previous lectures, this program should not work. And yet it does. How is that possible? To find out, let me decompile this program into intermediate language. By examining the intermediate language instructions, we might find a clue. And here it is. Look at this intermediate language instruction called box. Can you guess what it does? Here, I'll draw it on the blackboard. Here is the memory layout with the two variables A and B. Now the box instruction does this. So to make the program work, the .NET Framework actually copies the integer value from the stack, packs the value into an object, and places this object on the heap, so the variable B can then refer to it. This whole process is called boxing. Boxing happens behind the scenes every time you have a variable, parameter, field, or property of type object and you assign a value type to it. Boxing is nice because it kind of blurs the line between value types and reference types. But boxing can also be a pain, because it introduces extra overhead in your code.

Now you might be wondering if there is a corresponding process called unboxing. Yep, there is. Here is my program again. Let's take a look at the final line. I declare a variable C of type integer and assign the object value to it using a typecast. Another bit of magic, actually, because C exists on the stack and B refers to an object on the heap. In the intermediate language, the corresponding instruction is here. It's called unbox. Let me go back to the blackboard and draw the unboxing process. We start from the boxed situation, with the integer on the heap. Then this happens: unboxing unpacks the integer on the heap and copies the value back into the variable C on the stack. Unboxing happens behind the scenes every time you have an object value and you cast it to a value type.

Boxing and unboxing can seriously affect the performance of your code, so make sure you avoid them as much as possible in mission-critical sections. I'll share some tips and tricks in later lectures on how to do this.

What have we learned? Boxing is the process of taking value types on the stack, packing them into objects, and placing these objects on the heap. Boxing happens whenever you assign a value type to a variable, parameter, field, or property of type object. Unboxing is the reverse process: objects on the heap are unpacked, and the value types inside are copied back to the stack. Unboxing happens whenever you have an object value and you cast it to a value type. Boxing and unboxing negatively affect the performance of your code and should be avoided in mission-critical sections.
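The mystery program is easy to reconstruct from the description; here is a sketch, with the box and unbox steps called out in comments.

```csharp
using System;

class Program
{
    static void Main()
    {
        int a = 1234;

        // Boxing: the value of a is copied from the stack, packed into an
        // object, and placed on the heap; b then refers to that heap object.
        object b = a;

        // Unboxing: the typecast unpacks the object on the heap and copies
        // the integer value back onto the stack, into c.
        int c = (int)b;

        Console.WriteLine(c);   // 1234
    }
}
```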
8. What are immutable strings?: In this lecture, we're going to take a look at the string class in .NET. Now, what do you think a string is: a value type or a reference type? Let's find out. I wrote some code to test what a string actually is. Check it out. I start by declaring a string variable A and initializing it to the value "abc". Then I declare a second string variable B and assign A to B. In the next line, I add a single character to the end of string B, and finally I write both strings to the console.

Now, if strings are reference types, the stack and heap would look like this: I would have two variables A and B, both pointing to the same string object on the heap. Then, if I modify string variable B, the string in variable A would also be modified, because both variables refer to the same string. However, if strings are value types, the stack and heap would look like this: I would have two variables A and B containing separate strings. When I modify string variable B, the string in variable A is not affected. Which scenario is right, do you think? Let's find out by running my code. Here we go. And there you have it: strings are obviously value types, right?

Well, no. Strings are actually reference types. In memory, things look like this: here are the two variables A and B, both referring to the same string on the heap. But something special happens when I modify string variable B. Instead of modifying the string on the heap directly, this happens. We say that strings are immutable objects in .NET. Any modification to a string results in a new string being created, and the original string is left untouched. Because of this, strings behave as if they are value types.

So what are the benefits of having immutable strings? Thread safety: immutable strings are thread safe, because they cannot be modified. Memory savings: identical strings can safely be merged together; this is called interning. Fast assignment: to copy a string, all you need to do is copy the reference, instead of having to copy all the characters over one by one. And fast comparison: interned strings can be compared by comparing the reference, instead of having to compare all the characters one by one.

So let's take a look at the intermediate code to see what's going on behind the scenes. I place a breakpoint on the final line of my program, start the program, and then switch to intermediate code. The first line is here: a string with content "abc" is loaded with the load string instruction and stored in variable A with a store location instruction. Then A is assigned to B by simply loading the reference in variable A with a load location and storing it to variable B with a store location: a super-fast string assignment by reference, exactly what we would expect for an immutable type. The magic happens when I modify the string in variable B. Behind the scenes, the framework calls the String.Concat method. Let me look that method up using the assembly browser in Xamarin Studio. Here it is. You can see that the Concat method creates a new string here; it then copies both string arguments into the new string and returns a reference to the new string here. So there you have it: instead of modifying the string in variable B, the code creates an entirely new string, copies everything over into it, and stores a reference to the new string in variable B. The old string in B is left on the heap in a dereferenced state, waiting to be garbage collected. Every method in the string class that modifies the string in any way has this behavior: instead of directly modifying the string, it creates an entirely new string and places the modifications in there. The original string is left untouched.

So what have we learned? Strings are reference types, and immutable. Because of this, strings behave as if they are value types: they are assigned and compared by value. Immutable strings are thread safe. Immutable strings are fast, because they can be assigned and compared by reference. Immutable strings save memory, because identical strings can be merged together in memory.
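Here is a small sketch of the string experiment described above, plus a ReferenceEquals check showing interning at work; the interning check is my own addition, not part of the lecture.

```csharp
using System;

class Program
{
    static void Main()
    {
        string a = "abc";
        string b = a;      // copies the reference, not the characters

        b += "d";          // String.Concat builds a brand-new string on the
                           // heap; the original "abc" is left untouched

        Console.WriteLine(a);   // abc
        Console.WriteLine(b);   // abcd

        // Identical string literals are interned: both variables end up
        // referring to the same object on the heap.
        string x = "abc";
        Console.WriteLine(ReferenceEquals(a, x));   // True
    }
}
```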
9. A crash course in intermediate language: In this lecture, I am going to talk about intermediate language. So what exactly is intermediate language? Well, let me start by showing you a classic compiler, in this case a C or C++ compiler. At the top of the blackboard, you see a simple piece of code: a for loop that adds an integer to a variable called result. A C or C++ compiler would take this fragment of source code and compile it to machine language, which is basically just a bunch of numbers stored in memory and executed by the CPU in your computer.

But a .NET compiler does things differently. Let's take a look at the same piece of code written in C#. A C# compiler will first compile the source code to a special intermediate language called, well, you guessed it: Common Intermediate Language, CIL, or simply IL. The CIL code is stored in a DLL or EXE file. When the code is run, another compiler kicks in. This compiler is called a JIT compiler. JIT stands for just-in-time: the compiler runs just in time, at the last possible moment before the code needs to run. The JIT compiler takes the intermediate language and compiles it to machine language.

So why this complicated two-step compilation process? Well, because it actually has a number of important advantages. First, the CIL code can be optimized for the platform it's running on, so differences between, for example, AMD and Intel CPUs can be fully exploited by the JIT compiler to create fast and optimized machine language. Second, the CIL code is fully portable, as it isn't bound to a particular hardware platform. The code can run on Windows, on Linux, on Apple computers. In fact, I created this entire course on an Apple MacBook Pro, and all the code examples you are going to see are running natively on OS X. But you could take these exact same executables that I'm running, copy them to Linux or Windows, and run them there. It would work. Another useful advantage of CIL code is that it can be verified before running, to make sure that the code does not perform any dangerous operations. And finally, CIL code can be annotated with metadata, for example serialization instructions telling external code exactly how to convert your classes to XML and back.

Are there any downsides to using CIL code? Well, yeah. Using a two-step compilation is slightly less efficient than compiling directly from source code to machine language, so compiled .NET code is slightly slower than directly compiled C or C++ code. But compilers have gotten really good in the last couple of years, and so this difference has become microscopically small. Today it is almost impossible to hand-code machine language that is more efficient than what the C# compiler produces.

So let's look at intermediate language in more detail. How does the language actually work? Intermediate language is based on three concepts. The first is the sequence of instructions, shown on the right of the blackboard. Intermediate language programs contain ordered sets of instructions that are executed in sequence, just like a C# program. Second, the local variables in an intermediate language program are stored in special slots called locations. There are a number of locations available; I have drawn four on the blackboard, on the left. Finally, there is an evaluation stack. A stack is a collection of values with two basic operations: a push (called a load in IL), which adds a value to the stack and pushes everything else down by one level, and a pop (called a store in IL), which removes the top value from the stack and moves all other values up one level. There are three groups of instructions in intermediate language: instructions that push data on the evaluation stack, instructions that perform operations on the stack, and instructions that pop values from the stack.

So let's take a look at a very simple program. Take a look at the following two lines of code: I initialize an integer with the value 456 and then add one to that value. How would this program look in intermediate language? Well, this program consists of only four intermediate language instructions. The first instruction is load constant 1, which loads the four-byte signed integer value 1 on the evaluation stack. Note that there is only one local variable in my program, called i, and so it gets stored in location zero, with the initial value of 456. The next instruction is load location 0, which loads the value of the variable i in location zero on the evaluation stack. So now we have two numbers on the stack: 1 and 456. The third instruction is add, which adds the top two numbers on the stack together. The result, 457, goes on the stack and replaces the original two numbers. So now we have a single number on the stack: 457. The final instruction is store location 0, which pops the top value off the evaluation stack and stores it in location zero, which corresponds to the variable i.

Here are a couple of other instructions that you might encounter when you're looking at compiled C# code. The box and unbox instructions do exactly what you might expect: they box and unbox value types on the heap. The bne instruction stands for "branch not equal": it jumps to a different location in the code if the top two numbers on the evaluation stack are not equal. Call and call virtual call static and non-static class members. Load element and store element load and store elements in one-dimensional arrays. New array and new object create a new array and a new object, respectively. Return returns from a method, and throw throws an exception.

So you might have thought that intermediate language is very complex, but it's actually quite simple. The language is not that difficult to understand, and in upcoming lectures we're going to look at compiled C# code a lot, and analyze how the compiler translates C# source code into intermediate language and what the performance implications are.

Let's take a look at a simple program. Take a look at my code here. I start with the number two and then use this for loop to calculate the first 16 powers of two by repeatedly multiplying the number by two. So how does this look in intermediate language? Well, all I have to do is set a breakpoint at the end of the program here, run my program, and then switch to disassembly view, like this. We are now looking at the compiled intermediate language code, annotated with lines from the original source code so that we can easily find the compiled code for a given line of C# code.

So let's start with the declaration of the number variable. You see a load constant 2 instruction here, which pushes the number two on the stack, and then a store location 0, which stores the number in location zero. So we know that location zero corresponds to the number variable. Then comes the for loop. Now pay close attention to the address labels here, because the code is shown out of order to make it more easily readable. The first step is to initialize the variable i to zero, so we have a load constant 0 and a store location 1 instruction here. So now we know that the variable i is stored in location one. Then the code jumps to location 14. There we have a load location 1, which is the variable i, a load constant 16, which loads the number 16, and then a blt instruction. Now blt stands for "branch less than", and it will jump to location A if the variable i is less than 16. Location A contains the main body of the loop: it loads the number variable on the stack, then it loads the number two on the stack, it multiplies both numbers on the stack together, and it stores the result back into the number variable. Because the code is out of order, the next instruction after location F is actually here, at location 10. We're back where the variable i is incremented and tested whether it is below 16. Once the variable i reaches the value 16, we fall through the blt instruction here. The next location after location 17 is 1C, and the return instruction returns out of the method.

And there you have it: a five-line C# program compiles to only 17 intermediate language instructions, and the compiled code was quite easy to read. So congratulations: you are now able to read compiled C# code. Being able to read and interpret intermediate language code is a very important skill, which will help you a lot when you're optimizing the performance of your code.

So what have we learned? C# code compiles to intermediate language code, which is then compiled again by a just-in-time compiler to machine language. JIT compilation can optimize code for the local hardware. Intermediate language code is portable and can run on multiple platforms like Windows, Linux, and Mac. It can be verified for correctness before running, and the code can be annotated with extra metadata, for example to guide serialization. Intermediate language uses locations to store local variables, and uses an evaluation stack to perform operations on data. Intermediate language has built-in support for creating objects, calling methods, and accessing fields. Intermediate language has built-in support for creating and manipulating one-dimensional arrays.
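The transcript doesn't include the source listing, so here is a minimal reconstruction of the powers-of-two program, with a rough IL sequence sketched in comments. The exact instructions and address labels depend on the compiler and build settings, so treat the comments as an approximation of the disassembly shown in the lecture.

```csharp
using System;

class Program
{
    static void Main()
    {
        int number = 2;               // ldc.i4.2   push the constant 2
                                      // stloc.0    store it in location 0

        for (int i = 0; i < 16; i++)  // ldc.i4.0 / stloc.1        initialize i
        {                             // ldloc.1 / ldc.i4.s 16 / blt   loop test
            number = number * 2;      // ldloc.0 / ldc.i4.2 / mul / stloc.0
        }                             // ldloc.1 / ldc.i4.1 / add / stloc.1

        Console.WriteLine(number);    // my addition, so 'number' is used
    }
}
```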
10. Tip #1: prevent boxing and unboxing: Previously, I told you that boxing and unboxing negatively affect the performance of your code, and you should avoid them whenever possible, especially in mission-critical sections. But how bad is the performance overhead of boxing and unboxing? Is it worth the worry? Let's find out.

I wrote a program to measure the difference in performance between a method that uses only integers and a method that does the exact same thing using an object variable. Here's the code. I have two methods, MeasureA and MeasureB. The first one, MeasureA, takes an integer and adds the value one to it a million times. I use a Stopwatch class to accurately measure the number of milliseconds it takes to run the loop. The second method, MeasureB, does the exact same thing but now uses an object type variable instead of an integer. Everything else is the same: one million repetitions, and I'm adding the number one during each loop iteration. If I scroll down to the main method, you see that I start by calling both measurement methods and discarding the results. I do this because there might be a startup delay in my code that could skew the measurement results. So to eliminate that delay, I run the complete test once, discard the results, and then run the test again. Finally, I display the results in milliseconds on the console for both methods. I'm assuming that method A will be faster, so I also display how many times method B is slower than method A.

Now, I already told you boxing and unboxing introduce a significant performance overhead in your code. So let's find out how much overhead I'm talking about. Let me run the program now. There you have it: method A, which uses an integer variable, takes 10 milliseconds. Method B, which uses an object variable, takes 55 milliseconds. Method B is 5.5 times slower than method A.

Let's take a look at the intermediate code. I'll start with the integer addition in method A. If I set two breakpoints, run, and then switch to the disassembly view... here we go. You see that the intermediate code is annotated with C# source code, so it's really easy to locate the line we want, in this case the addition in method A. Here it is. Now, before I explain what these instructions do, keep in mind that intermediate code uses a special evaluation stack to perform calculations. Numbers and variables are loaded onto this stack using load instructions, starting with "ld", and results are stored back into variables with store instructions, starting with "st". The addition in my code is going to be a sequence of load, add, and store instructions in intermediate code. So here are the four instructions. The first one, load location 1, loads the first local variable, which happens to be the variable A, onto the evaluation stack. Then comes the load constant 1 instruction, which loads the number one onto the evaluation stack. The add instruction adds the top two numbers on the stack together, in this case the value of A and the number one, and it replaces these two numbers with the result of the addition. Finally, the store location 1 instruction stores the addition result into the variable A.

Now let's look at the same line in method B, using the object variable. Here it is again. We start with load location 1, which loads the object variable A onto the evaluation stack. Now keep in mind that an object variable is a reference type, so we now have a reference to an object on the heap on the evaluation stack. So the next instruction has to be an unbox instruction, because the object on the heap needs to be unpacked and the value inside needs to be copied onto the evaluation stack. Then comes the familiar load constant 1, which loads the number one, and then the add instruction to add the two values together. Now, A is an object variable, so the result of the addition needs to be packed into an object and placed on the heap. So the next instruction has to be box, which does exactly that and places a reference to this new object on the evaluation stack. Finally, we have store location 1, which stores the new object reference in variable A.

So method A and method B contain pretty much the same code: a load location, a load constant, then an add, and finally a store location. But method B needs an additional unbox and box instruction, because we are performing an integer addition on a reference type variable, and the .NET Framework needs to move data between the stack and the heap to make this work. So the difference in performance is entirely due to these two instructions: box and unbox.

Avoiding boxing and unboxing in your own code is easy: simply avoid the object type. But did you know that the .NET Framework is full of classes that use objects or object arrays for internal storage? For example, pretty much the entire System.Collections namespace uses object arrays internally, so stay away from those classes in mission-critical code. Another popular class is the DataTable in System.Data. DataTables store the value of each row in an object array. Even typed DataTables are not safe to use, because they use type casts on top of that same object array. So to keep your code fast, consider using the generic collection classes in the System.Collections.Generic namespace, or, if you know the number of elements in advance, consider using a simple one-dimensional typed array. For DataTables there isn't a simple alternative. If you can get away with it, refactor your code to use a DataReader, because they are much faster than a DataTable. But remember, a DataTable is the result of a database operation, which might take many, many milliseconds to complete. There's not much point optimizing the data processing that comes after, unless you are doing millions of operations on the same DataTable object.

So what have we learned? Casting object variables to value types introduces an unbox instruction in intermediate code. Storing value types in object variables introduces a box instruction in intermediate code. Code with box and unbox runs up to five times slower than the same code without these two instructions. You should avoid casting to and from object in mission-critical code. You should avoid using non-generic collection classes in mission-critical code. And you should avoid using DataTables in mission-critical code, but only if you perform many operations on the same DataTable object.
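The benchmark itself isn't reproduced in the transcript, so here is a hedged sketch of what it might look like. The method names MeasureA and MeasureB come from the lecture; the repetition constant and the ratio printout are assumptions.

```csharp
using System;
using System.Diagnostics;

class Program
{
    const int Repetitions = 1000000;

    // Adds 1 a million times using a plain int: no boxing involved.
    static long MeasureA()
    {
        var stopwatch = Stopwatch.StartNew();
        int a = 0;
        for (int i = 0; i < Repetitions; i++)
            a = a + 1;
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    // The same work with an object variable: every iteration unboxes the
    // current value and boxes the result into a new heap object.
    static long MeasureB()
    {
        var stopwatch = Stopwatch.StartNew();
        object a = 0;
        for (int i = 0; i < Repetitions; i++)
            a = (int)a + 1;
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    static void Main()
    {
        // Warm-up run: call both methods once and discard the results so
        // that JIT and startup overhead don't skew the measurements.
        MeasureA();
        MeasureB();

        long a = MeasureA();
        long b = MeasureB();
        Console.WriteLine($"Method A (int):    {a} ms");
        Console.WriteLine($"Method B (object): {b} ms");
        Console.WriteLine($"B is {(double)b / Math.Max(a, 1):F1} times slower");
    }
}
```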
11. Tip #2: add strings together efficiently: In this lecture, I want to take a closer look at string concatenation, or how to add strings together. There are two ways of adding strings together. The first is the one we probably all use all over the place in our code: we add two string variables together using the plus operator. Let's take a look at some code. I've made an application to measure the performance of adding strings together. I start with an empty string here, and then in this loop I add a single character to the string 10,000 times. Now, if I run the program, you can see the performance here in the output window, in milliseconds.

But there's another way to add strings together. The .NET Framework provides a special class called StringBuilder for, well, building strings. If I write the same code using a StringBuilder, starting with an empty string and adding a single character 10,000 times, the code looks like this. Pretty much the same, except I'm using the Append method instead of the plus operator, but the effect is the same: adding strings together. Now let me uncomment the measurement code for this second method and run the program again, so we can compare the results. Check it out: string concatenation using the plus operator takes 242 milliseconds, but the same code using a StringBuilder takes only one millisecond. The StringBuilder is 242 times faster. Now I have to make a little disclaimer: when you repeat this experiment on your own computer, you will see faster times, probably in the range of 150 to 200 milliseconds. The reason for my slower performance is that I'm running a screen recording program in the background right now, and it's eating up CPU cycles and allocating available memory.

Remember the lecture about immutable strings? I showed that the string class is an immutable class, meaning that any modification to the string creates a new string. This makes the string class behave like a value type, even though it is actually a reference type. So what's going on in memory during the loop when I'm adding to the string? Here on the blackboard are the first three iterations of the loop. You can see that this is a super-inefficient process. Every time I add a character to the string, the entire string is copied over to a new string on the heap. My code is doing 10,000 memory copy operations behind the scenes. Another big disadvantage of this code is that it leaves a trail of dereferenced string objects on the heap. For 10,000 iterations, the code leaves 9,999 dead string objects behind. The garbage collector is going to have a lot of work cleaning that all up.

Now contrast this with a StringBuilder. A StringBuilder uses a character array of a certain default length and simply writes text into the array. So the first three iterations look like this. Much better, don't you think? Each string addition writes one character into the array. This is a lot more efficient, and when we're done with the string, we simply dereference the StringBuilder, and the garbage collector only needs to clean up a single object. So here is the main takeaway of this lecture: if you're adding strings together in your code, always use a StringBuilder. Try to avoid adding strings using the plus operator.

I did my previous measurements by performing 10,000 additions. But how does the StringBuilder stack up when I only do one or two additions? Let's find out. I modified my code like this: I have two loops now. The inner loop only does a limited number of string additions, 2 to 19. The outer loop repeats this experiment 10,000 times and adds up the total time in milliseconds. I output the number of additions, 2 to 19, and for each addition count, I display the total number of milliseconds for regular strings and for the StringBuilder. Let me run the program. Here we go. There: for up to four additions, regular strings are actually faster than StringBuilders. Above four additions, the StringBuilder is faster.

Here is the output of the program, plotted as a line graph. The blue line in the graph is the performance of regular string additions. You can see that regular strings outperform StringBuilders up to four additions. At five or more additions, the StringBuilder class becomes more efficient, and the more additions you do, the larger the StringBuilder's performance lead becomes. Now, the reason for this behavior is that the StringBuilder class has some overhead in setting up its internal character array, keeping track of the string length and its internal buffer capacity, and expanding this buffer when needed. The code that uses regular strings just performs a sequence of memory copy operations to implement the additions. These are actually more efficient than the StringBuilder, but only up to a point, and that point is exactly four additions.

So how should you do string concatenation? Well, if you have four additions or fewer, there's nothing wrong with using regular string variables and the plus operator. This will actually give you the best performance. If you're going to add more than four strings together, or you don't know the number of additions in advance, use a StringBuilder instead. Once you get up to thousands of additions, or in this case exactly 10,000 additions, the StringBuilder is actually more than 240 times faster than regular string additions. So in that scenario, always, always use the StringBuilder.
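Here is a compact version of the two measurements described above; a sketch, not the lecture's exact listing.

```csharp
using System;
using System.Diagnostics;
using System.Text;

class Program
{
    const int Additions = 10000;

    static void Main()
    {
        // Plus operator: every += copies the whole string into a brand-new
        // heap object and leaves the old one behind for the garbage collector.
        var stopwatch = Stopwatch.StartNew();
        string s = "";
        for (int i = 0; i < Additions; i++)
            s += "a";
        stopwatch.Stop();
        Console.WriteLine($"string:        {stopwatch.ElapsedMilliseconds} ms");

        // StringBuilder: every Append writes one character into an internal
        // character array; only one object needs cleaning up afterwards.
        stopwatch.Restart();
        var builder = new StringBuilder();
        for (int i = 0; i < Additions; i++)
            builder.Append('a');
        string result = builder.ToString();
        stopwatch.Stop();
        Console.WriteLine($"StringBuilder: {stopwatch.ElapsedMilliseconds} ms");

        Console.WriteLine($"Lengths: {s.Length} and {result.Length}");
    }
}
```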
12. Tip #3: use the correct list class: If we want to store a collection of values, there are many possibilities in the .NET Framework. We have the collection classes in the System.Collections namespace, we have the generic collections in the System.Collections.Generic namespace, and we have regular typed arrays. So which approach is the fastest? Let's find out.

Let me switch over to my code. I will start by measuring the performance of an ArrayList. I have a loop here that adds one million integers to the ArrayList. Down here is the second method that also adds one million integers, but here it uses a generic List of type int to store the values. Let's run the code and compare the performance. Here we go. And the results are in: the ArrayList code takes 254 milliseconds, whereas the generic List only takes 42 milliseconds. The generic List is six times faster.

To understand the reason for this, let's take a look at the memory layout of an ArrayList. After three additions, the stack and the heap look like this. Now, if you remember the lecture about boxing and unboxing, I mentioned that many collections in the System.Collections namespace use object arrays for internal storage. The ArrayList is no exception. So each element in the ArrayList is a reference to a boxed integer, which is elsewhere on the heap. And we've seen that boxing and unboxing introduce a significant performance overhead in your code. Now compare this with the memory layout of a generic List. Again, the list is on the heap, but now the integers are stored directly inside the list elements. And this is because a generic List of type int uses a native integer array for internal storage, and not an object array. This eliminates all the object references from the list elements and also removes the need for boxing. So the main takeaway is: always use generic lists and collection classes, and avoid the classes in the System.Collections namespace for mission-critical code.

Now let's try a third alternative: a native integer array, pre-initialized to one million elements. I loop over each array element and set the values one by one. Let me compare this code against the other two. Here we go. Now, this is a lot faster. The native array only takes 11 milliseconds to run, whereas the generic List needs 52 milliseconds. The native array is almost five times faster than the generic List. The reason for this excellent performance is that the intermediate language that the C# compiler compiles to has native support for arrays. Let's take a look at my compiled code. I'm going to set a breakpoint, rerun the program, and then switch to disassembly view. Here we go, and here is the assignment. As you can see, the reference to the array is first loaded on the evaluation stack. Then the array offset of the element that we want to write to is loaded on the stack. Finally, the value that has to be assigned to the element is added to the stack. And then comes a set element instruction, which writes the value into the specified array element. So we have a dedicated instruction, called set element, for writing values into arrays. No wonder the code using an array is fast.

Now, you might be thinking that I'm cheating, because the ArrayList and the generic List classes use an internal array and continuously resize this array when it overflows, whereas my integer array was already pre-initialized to one million elements, so it never overflows. So let me modify my code to remove that overhead. I will go to the line here where the ArrayList is initialized, and I put in a default capacity of one million elements. Then I go down to this line, and I do the same for the generic List. Now let me run the code again. You see, the generic List now takes only 31 milliseconds to run, which is 40% faster than the previous run. The ArrayList takes 187 milliseconds, which is an improvement of only 26%. The native array takes 12 milliseconds and is still the fastest option, 2.5 times faster than the generic List. The built-in support for arrays in the intermediate language ensures that they will always outperform collection classes. The generic List and ArrayList simply cannot compete with that raw performance.

Here is a graph with all the performance results. The ArrayList spends most of its time boxing integer data. Pre-sizing the ArrayList to the correct number of elements boosts the performance by 26%, but it's still not a very good result. The generic List is a lot faster, because it can store the integers directly in the list elements on the heap without needing an extra boxed reference, and pre-sizing the list to the correct number of elements boosts performance by 40%. The native array remains the fastest option: even with pre-sized lists, the array is still 2.5 times faster than the generic List.

So when should you use arrays? Well, if you have mission-critical code and the number of elements is known in advance, use an array. If you have mission-critical code and the number of elements is not known in advance, use a generic List. Avoid the classes in System.Collections: they use boxing and are superseded by the generic classes in System.Collections.Generic.
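Here is a condensed sketch of the three measurements in this lecture, with all three containers pre-sized to one million elements as in the final run; exact timings will vary by machine.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;

class Program
{
    const int Count = 1000000;

    static void Main()
    {
        // ArrayList stores object references, so every Add boxes the int.
        var stopwatch = Stopwatch.StartNew();
        var arrayList = new ArrayList(Count);     // pre-sized, no regrowing
        for (int i = 0; i < Count; i++)
            arrayList.Add(i);
        Console.WriteLine($"ArrayList: {stopwatch.ElapsedMilliseconds} ms");

        // List<int> stores the integers directly in an internal int array.
        stopwatch.Restart();
        var list = new List<int>(Count);          // pre-sized as well
        for (int i = 0; i < Count; i++)
            list.Add(i);
        Console.WriteLine($"List<int>: {stopwatch.ElapsedMilliseconds} ms");

        // A native array assignment compiles to the dedicated set element
        // IL instruction, which is why it is the fastest of the three.
        stopwatch.Restart();
        int[] array = new int[Count];
        for (int i = 0; i < Count; i++)
            array[i] = i;
        Console.WriteLine($"int[]:     {stopwatch.ElapsedMilliseconds} ms");
    }
}
```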
13. Tip #4: use the correct array type: In the previous lecture, we looked at generic and non-generic collections and compared them to native arrays. It turned out that native arrays are the fastest, but generic collections have the advantage that they automatically grow as you put more elements into them. So if the number of elements is known in advance, an array is the fastest option. But the .NET Framework actually supports three types of arrays. The first type is the one-dimensional array that we've seen in the previous lecture. Anything declared with square array brackets in C# is a one-dimensional array, for example the integer array shown here. You declare the array like this, and you access elements like this. The second type is a multi-dimensional array. For example, a two-dimensional array is declared like this, and you access elements like this. The third type is a jagged array. This is simply an array of arrays. You declare a jagged array like this, then create a new array instance for each top-level element like this, and then you access an element like this. It's called a jagged array because each top-level element can have a different number of sub-level elements. A two-by-two jagged array can have a jagged right-hand side like this. So how do these arrays compare? Well, let's go to Xamarin Studio and find out. I've written a program that initializes three arrays, each containing one million elements. Here is the one-dimensional array with 1000 times 1000 elements, and this loop assigns a value to each element. And here is the two-dimensional array, now initialized to 1000 by 1000 elements. I have two nested loops to go through each row and column, and I assign a value to each array element. The third method uses a jagged array. The disadvantage of jagged arrays is that you have to initialize the entire outer array with array instances; that code is here. Let's find out which array is the fastest. Ready? Here we go. This probably wasn't a surprise. The one-dimensional array is the fastest, with 12 milliseconds. Then comes the jagged array with 16 milliseconds, and finally the two-dimensional array with 24 milliseconds. Now, you'll probably remember from the previous lecture that the intermediate language has native support for one-dimensional arrays. This is why the one-dimensional array is fastest. The jagged array comes in second: a jagged array is simply a one-dimensional array nested inside another one-dimensional array, so you still get the speed benefits of the built-in intermediate language support. Perhaps surprisingly, the two-dimensional array is the slowest, and this is because a two-dimensional array in .NET is just a class without any special runtime support. Here, let me show you my compiled code. If I set a breakpoint, rerun the program, switch to disassembly view, and look up the correct line here: here is the one-dimensional array. You can see the familiar set-element instruction that writes the value into the array element. And here is the jagged array: first a load-element instruction to load the reference to the inner array, and then a set-element instruction to write a value into that inner array. But now look at the two-dimensional array. It's simply a call to a static Set method of the Array class: one method call for every element access, plus whatever implementation is inside that Set method. That's a lot more code to execute than the one-dimensional and jagged arrays.
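For reference, here is what the three declarations look like in code. A sketch with my own dimensions and values:

    // One-dimensional array: has native IL support (load/set-element instructions).
    int[] oneDim = new int[1000000];
    oneDim[5] = 42;

    // Two-dimensional array: element access compiles to a call
    // to the Array class's Get/Set methods, not a dedicated instruction.
    int[,] twoDim = new int[1000, 1000];
    twoDim[3, 4] = 42;

    // Jagged array: an array of arrays; every inner array must be created separately.
    int[][] jagged = new int[1000][];
    for (int row = 0; row < jagged.Length; row++)
        jagged[row] = new int[1000];
    jagged[3][4] = 42;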
Now you've just seen that a one-dimensional array is faster than a two-dimensional array, so we can speed up a two-dimensional array by using a technique called array flattening. Take a look at this three-by-three array of integers. That's a total of nine elements that I can lay out like this; I have color-coded each row of elements. Now I can also pack all of this data into a one-dimensional array like this, and I can access an element given the row and column like this. So let's try it in code. I have a program here that sets up a 1000 by 1000 integer array and then loops through all the elements, accessing them one by one. Here is the same code, but with a one-dimensional array of one million elements. I still have the two nested loops to access all rows and all columns, but now I use the translation formula to flatten the row and column into a one-dimensional index. Which code do you think is faster? It's hard to tell, actually. We know the one-dimensional array will be faster, but now we also have the overhead of the extra multiplication to translate the row and column into a flattened index. Will the extra multiplication offset the overhead of the two-dimensional array? Let's find out. I am running the program now. And here we are. The one-dimensional array is still the fastest option, with 15 milliseconds compared to the two-dimensional array with 22 milliseconds. The flattened array is 1.5 times faster. Here is a graph with all performance results. The two-dimensional array is two times slower than the one-dimensional array. Even in the flattening test, when the one-dimensional array had the extra overhead of having to do a multiplication, it is still 1.5 times slower. This suggests that flattening two-dimensional arrays is a good idea. Perhaps surprisingly, the jagged array has quite good performance too. It's only 1.3 times slower than the one-dimensional array, and when compared to the flattening results, they are almost equal: 15 milliseconds for the flattened one-dimensional array and 16 milliseconds for the jagged array. The difference in performance is only one millisecond, which is 6%. So how should you use arrays? If you only have one dimension of data, use a one-dimensional array for the best performance. If you have two or more dimensions of data, consider flattening the array. If this is not possible, consider using a jagged array. If there is no other option, use a multi-dimensional array.
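Here is the flattening technique condensed into a sketch; the constant names are mine:

    const int Rows = 1000, Cols = 1000;
    int[] flat = new int[Rows * Cols];       // one-dimensional backing storage

    for (int row = 0; row < Rows; row++)
        for (int col = 0; col < Cols; col++)
            flat[row * Cols + col] = 42;     // translate (row, col) into a flat index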
14. Tip #5: avoid throwing exceptions: In this lecture, I want to take a look at the responsible use of exceptions. So what are exceptions? The .NET Framework uses exceptions for handling errors, and this is how that works: when something goes wrong in a block of code, that code can throw an exception using the throw statement. The method immediately aborts and the stack is unwound, which means the framework keeps executing implicit return statements, jumping out of nested method calls and going higher and higher up the call stack until it reaches code with a try-catch block. The framework jumps into the first catch block that matches the thrown exception and starts running the code there. You can capture the exception by putting a parameter in the catch expression. That sure sounds like a lot of overhead, doesn't it? How big do you think the performance overhead will be for throwing and catching an exception? Well, let's find out. I've written a program that repeats a very simple operation, in this case incrementing an integer, one million times. Here is the first measurement method that simply increments an integer variable, and here is the second measurement method: the same code, but after incrementing the variable, the code throws an InvalidOperationException. Now, the throw statement is inside a try-catch block, so there is no need for the .NET Framework to start unwinding the stack in search of a try-catch block; we are already inside one, so execution jumps directly to the catch statement, which in this case does absolutely nothing. So this code will measure the performance overhead of a single throw and catch. There is no additional overhead because the stack needs to be unwound, and there is also no overhead from exception handling code, because in this case the catch block is empty. Ready? Here we go, I'm running the program. Did you expect that? The overhead of throwing an exception is massive. Incrementing an integer variable a million times takes only six milliseconds, but throwing an exception a million times takes an additional 6.9 seconds. Not milliseconds: seconds. The code with exception handling is a staggering 1150 times slower. This leads to the main takeaway of this section: never use exceptions in mission-critical code. Fair enough, you might say: I will simply avoid using the throw statement in my code, and we're all set, right? Well, no. I also need to make sure that the libraries and base classes that I use do not throw exceptions either. The reality is that many class methods use exceptions to signal a non-critical condition to the caller. If I want to avoid those exceptions, I cannot call these methods directly, but will have to check my input data thoroughly for errors. Let's look at a simple use case. I'm going to run a program that converts strings to integers. Take a look at this code. I have a method here called PrepareList. It uses a StringBuilder to assemble a five-digit number using a random number generator. When the string is ready, it is stored in this generic list. The code is repeated a million times, so eventually I will have a million strings to parse. The measurement methods are down here. The first one loops through all the million strings and uses the integer Parse method to convert each string to a number. Now, if anything goes wrong during parsing, the Parse method throws a FormatException. I catch the exception here and simply suppress the error.
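As an aside, here is the raw throw/catch experiment from the start of this lecture, condensed into a sketch; the variable names and loop count are mine:

    int counter = 0;
    for (int i = 0; i < 1000000; i++)
    {
        try
        {
            counter++;                              // the actual work: one increment
            throw new InvalidOperationException();  // one throw/catch per iteration
        }
        catch (InvalidOperationException)
        {
            // empty handler: no unwinding, no handling code,
            // so we measure the pure throw/catch overhead
        }
    }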
The second measurement method does the same: it loops through all the strings and parses them one by one, but now using the TryParse method instead of Parse. TryParse will attempt to parse the string and simply do nothing when the parsing fails; it never throws an exception, so I don't need a try-catch block here. You can see the difference in syntax: TryParse returns a Boolean value indicating if the parsing was successful, and it has an output parameter to store the result. Compare that to the Parse method, which simply returns the parsing result directly and has no parameter or argument to indicate whether the parsing was successful. Okay, back to the PrepareList method. Take a look at the code. Do you see anything strange? At the end of my character array, I have an extra letter X. The random number generator here generates a number between zero and 10, so 10% of the time it will add the letter X to the string instead of a valid digit. If I put a breakpoint just beyond the method call, run the program, and then inspect the contents of the generic list, you can see that a couple of numbers are invalid because they contain an X instead of a digit. The Parse and TryParse methods will both fail on these numbers, but only the Parse method will throw an exception when this happens. So based on what we already know about exceptions, namely that they are really slow, we expect the TryParse method to perform a lot better than the Parse method. How big do you think the difference will be? Let's find out. I'm going to run the program and measure the difference in performance. Here we go. And here are the results. The TryParse method takes 618 milliseconds to run; the Parse method takes 3727 milliseconds. So Parse is six times slower than TryParse because of the extra exception handling. So we've seen that data type conversions can be problematic in mission-critical code, due to the fact that many conversion methods throw exceptions when the input data is invalid. Are there any other common situations where exceptions are thrown? Yep, there are. I'm going to show you one more. So here is another program, and in this program I am doing the opposite of the previous program: I am converting integers back to strings. I have a PrepareList method here as well, but now it fills a generic integer list with one million random integers. And down here is the first measurement method, which loops over each integer in the list, and for each number it uses a lookup table to convert that number to a string. If you look up here, you'll see that I have pre-defined an integer-to-string dictionary to act as the lookup table. By indexing the lookup table with an integer, I can quickly find the string that corresponds to that integer. So back to the measurement code: I use the lookup table to find the string and store it in the variable s. If anything goes wrong, this catch statement is going to catch the exception and suppress the error. Now let's take a look at the second measurement method. It does the exact same thing, but it has an extra check here: I test if the integer is actually in the dictionary before I use it to look up the corresponding string. This extra check avoids an exception, and so I don't need a try-catch block. Okay, back to the PrepareList method. Do you see the invalid data? It's very simple: I have 10 entries in the dictionary for the digits from 0 to 9, but the random number generator generates numbers between zero and 10, so in 10% of the cases I will have the number 10 in the list — a number the lookup table has no entry for.
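Before we run it, here is the Parse versus TryParse comparison from the first program as code. A sketch assuming a using directive for System.Collections.Generic; the test data and names are mine:

    // A small test set; "123X5" fails to parse, like the lecture's X digits.
    var numbers = new List<string> { "12345", "67890", "123X5" };
    int total = 0;

    // Slow: Parse throws a FormatException on invalid input,
    // which has to be caught and suppressed.
    foreach (string s in numbers)
    {
        try { total += int.Parse(s); }
        catch (FormatException) { /* skip the invalid entry */ }
    }

    // Fast: TryParse signals failure through its bool return value
    // and delivers the result via an out parameter — no exception thrown.
    foreach (string s in numbers)
    {
        int value;
        if (int.TryParse(s, out value))
            total += value;
    }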
If I put a breakpoint just beyond the method call, run the program, and then inspect the contents of the generic list, you can see that occasionally I have a 10 in the list. The dictionary lookup will fail for this number, and only the MeasureA method will throw an exception when this happens. So based on what we know about exceptions and their performance overhead, we expect MeasureA to be slower than MeasureB. Let's verify that. I'm going to run the program and measure the difference in performance. Here we go, and here are the results. MeasureA takes 1180 milliseconds to run; MeasureB only takes 167 milliseconds. So MeasureA is seven times slower than MeasureB because of the extra exception handling. Here are all measurement results combined in a single graph. Down here are Parse and TryParse, with the Parse method taking 3727 milliseconds and TryParse only needing 618 milliseconds because it doesn't need to throw an exception when the input data is invalid. That is a drop in run time of over 80%. Then the second test with the dictionary lookups: performing a lookup and handling exceptions takes 1180 milliseconds, but if we check the key first and then perform the lookup, we avoid exceptions altogether and the run time drops to only 167 milliseconds. Again, a drop in run time of more than 80%. So what are the best practices for using exceptions? Well, if you use exceptions in a mission-critical loop, then only use them to indicate a fatal condition that requires you to abort the loop entirely. Do not use exceptions for non-fatal conditions that you handle by simply moving on to the next loop iteration. Do not put try-catch blocks in deeply nested code or in low-level API functions, because they will slow down your code; put the exception handling as close to the main program as possible. Never use a generic catch-Exception statement, which catches all exceptions, because it will also catch non-fatal conditions, and it won't be immediately obvious why your code is slow. If you're writing an API, do not use exceptions for non-critical boundary conditions like converting invalid data or failing a lookup operation. Consider the try-parse pattern for that, with a Boolean return value indicating success and an out parameter for returning data. And whatever you do: never, ever use exceptions to control the flow of your program.
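To make the lookup advice concrete, here is the dictionary example as a sketch, with my own names and data; note that TryGetValue is exactly the try-parse pattern applied to a dictionary:

    var lookup = new Dictionary<int, string> { { 0, "zero" }, { 1, "one" } };
    int number = 10;   // not present in the dictionary
    string s;

    // Exception-based: the indexer throws KeyNotFoundException for unknown keys.
    try { s = lookup[number]; }
    catch (KeyNotFoundException) { s = null; }

    // Check first: ContainsKey avoids the exception entirely.
    if (lookup.ContainsKey(number))
        s = lookup[number];

    // Best of both: one call, a bool for success, an out parameter for the value.
    lookup.TryGetValue(number, out s);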
15. Tip #6: use for instead of foreach: This lecture is going to be all about optimizing loops. A common piece of advice you often hear on speeding up loops is that you're supposed to use a for statement instead of foreach to loop over a collection. But is this true? And if so, how big is the difference in performance between the two? Let's start by looking at the mechanics of the for and foreach statements. I'll begin with for. When using a for loop to access elements in a collection, we usually start with a loop that counts from zero to the number of elements in the collection minus one. Then we use an indexer expression to access each element in turn. A for loop is only possible if you have direct access to any element in the collection, usually via the square-brackets indexer notation in C#. The ArrayList, generic list, and array classes all support indexing. Another way to access elements in a collection is by using an enumerator. Behind the scenes, the foreach statement starts by creating a new enumerator object. Enumerators have only three members: a Current property that contains the current collection element value, a MoveNext method to move to the next element, and a Reset method to jump back to the start of the collection. So the foreach statement sets up a loop that keeps calling MoveNext until it reaches the end of the collection. During each loop iteration, the Current property contains the current element value. So which statement is faster? It's hard to tell, actually. The for statement requires an indexer, and the cost of being able to directly access any element may be high. But on the other hand, foreach needs to set up an enumerator object and then call the MoveNext method in each loop iteration. So it all depends on which is faster: the indexer or the MoveNext method. Here are the pros and cons of each approach. The for statement is potentially the fastest because of its simplicity, but it requires a collection with an indexer, and collections do not know in advance in which order the elements will be accessed through the indexer, so all values have to be loaded in memory first. Foreach is more complex because it uses an enumerator behind the scenes, and it requires a call to the MoveNext method to advance to the next element. But the advantage of enumeration is that it works on any collection. Another advantage is that the current value is calculated on demand, so the entire collection does not have to be in memory to be enumerated. Advice on the Internet about for and foreach is mixed: some say always use for, but others say the performance improvement is not worth it. So let me show you a program that measures the performance of both statements for a variety of collections. I've written a program that sets up three collections of 10 million integers: an ArrayList, a generic integer list, and a regular integer array. The lists are initialized in this method here, PrepareLists. The method generates 10 million random numbers between zero and 255 and adds them to all three lists. Then come the measurement methods down here; there are six of them. I'll start with the first one. MeasureA1 loops through all 10 million elements of the ArrayList with a simple for loop and uses the built-in indexer of the ArrayList class to access the elements. The next method, MeasureA2, also loops through all elements of the ArrayList, but now using a foreach statement instead of the for statement.
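As a reminder of what that means, here is roughly what the compiler turns a foreach loop into. A simplified sketch: the real expansion also disposes the enumerator where required, and the names here are mine:

    var list = new List<int> { 1, 2, 3 };

    // Roughly equivalent to: foreach (int n in list) { Console.WriteLine(n); }
    var enumerator = list.GetEnumerator();   // List<int>.Enumerator is a struct
    while (enumerator.MoveNext())            // advance to the next element
    {
        int n = enumerator.Current;          // read the current element
        Console.WriteLine(n);
    }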
So behind the scenes, foreach will create an enumerator object that steps sequentially through each element in the collection. The next two methods are MeasureB1 and MeasureB2. They do the exact same thing, but with the generic integer list instead of the ArrayList. And finally, I have the methods MeasureC1 and MeasureC2. Again, they loop through all 10 million integers, but now they use the integer array. So you see, all three collection classes support both indexing and enumeration. Which will be fastest? Let's find out. I am running the program now, so let's wait for the results. And here are the results. Using for on an array takes 98 milliseconds; using foreach on an array takes 103 milliseconds. A tiny difference. But using for and foreach on the generic list takes respectively 308 and 501 milliseconds, and we get the biggest difference in performance when using for and foreach on an ArrayList: respectively 302 and 872 milliseconds. You see that for arrays the difference is very small, and optimizing a foreach into a for is probably not worth the effort. But for the generic list and the ArrayList, the difference is quite large. Why is that? Let's take a look at the intermediate code. I'll start with the MeasureC1 method, which uses a simple for loop to access every element in an integer array. So here is the for loop. These three intermediate language instructions set the variable i to zero. Then we jump to location 1F, where we compare the variable i to the value of 10 million. If i is less than 10 million, we continue here and execute the loop body. You can see the familiar load-element instruction here to access an array element. Finally, the code continues here, where we add one to the variable i and check again if it is less than 10 million. Now let's look at the same code in the MeasureC2 method. It's almost the same: setting an index variable to zero, jumping to location 24, comparing the index to the length of the array, and continuing if it is less. But look at this bit here, just before executing the loop body. We have four instructions: load location 2, load location 3, load element, and store location 1. These four instructions retrieve the current array element and store it in location 1, which corresponds to the variable i. So the difference between a for and a foreach statement for regular arrays is only these four intermediate language instructions, and this is why the measured performance difference between the two is so small. The compiler doesn't actually create an enumerator object using Current and MoveNext; instead, it emits a slightly modified for loop to implement the enumeration. Now let's look at the generic list in the MeasureB1 and MeasureB2 methods. The MeasureB1 method is very straightforward. First, we get the familiar set of instructions for the for loop here. Then, when accessing the list elements, the code uses a virtual call to an indexer called get_Item. The return value is an integer, which gets stored in location 2. But MeasureB2 is very different. The foreach statement starts by calling the GetEnumerator method on the list object. The list enumerator is actually a struct and not a class, and it gets stored in location 2. The code then calls MoveNext on the enumerator and continues if the result is true.
Next, it gets the current element value by calling the property get_Current and stores the result in location 1. Then the loop body gets executed, and then we're back at MoveNext. This code is a lot more complex than simply accessing list elements through the indexer, and hence it runs slower. Now let's look at the ArrayList in the methods MeasureA1 and MeasureA2. There is nothing special in MeasureA1: a regular for loop and a virtual call to the get_Item indexer. But look at this bit here: the get_Item indexer returns an object, not an integer. So before we can store the result, it has to be unboxed with this unbox instruction here, and we have already learned that boxing and unboxing introduces a significant performance overhead in your code. Now here is MeasureA2. This code looks very similar to the generic list code in MeasureB2, but look at the differences. First, the ArrayList enumerator is a class and not a struct, which means that the call to get_Current is a virtual call instead of a regular call. This might seem like a trivial detail, but the callvirt instruction is in fact slower than the call instruction. Secondly, get_Current returns an object and not an integer, so the result needs to be unboxed again before it can be stored, which happens here with the unbox instruction. So here is an important takeaway: non-generic enumerators always return the current value as an object. If you use them to enumerate over value types, they will unbox in the background. Always use generic enumerators if possible. Here are all the measurement results combined in a graph. Using for and foreach on an array takes respectively 98 and 103 milliseconds, using for and foreach on a generic list takes respectively 308 and 501 milliseconds, and using for and foreach on an ArrayList takes respectively 302 and 872 milliseconds. We have seen that there is very little difference between a for and a foreach when enumerating a regular array. The reason for this is that the compiler optimizes array enumeration by emitting a slightly modified for loop, instead of going through the hassle of creating an enumerator object. With generic lists, we see a clear difference between for and foreach. The enumerator is a struct, which is a value type, and so it can store value types like integers efficiently, and yet enumeration is still 1.6 times slower than using a for loop. So here it makes perfect sense to optimize a foreach loop by rewriting it into a for loop. And finally, when looking at ArrayLists, we've seen that there is a lot of unboxing going on behind the scenes. The indexer returns an object, which needs to be cast to an integer, and the enumerator is a class and not a struct, and it returns the current value as an object, which again needs to be cast to an integer. Enumeration on an ArrayList has a huge penalty: it is 2.8 times slower than using a for loop. So this is how you choose between for and foreach. If you're using an array, don't bother rewriting a foreach into a for; the difference in performance is simply not worth it. When using a generic list, a for statement is 1.6 times faster than a foreach statement, so rewriting foreach into a for loop is definitely worth it. And for ArrayLists, the improvement is even bigger: a for statement on an ArrayList is 2.8 times faster.
But if you think about all the boxing and unboxing going on in the background, you might actually want to consider rewriting the code to use a generic list instead of an ArrayList. And finally, if you use foreach and you are enumerating a collection of value types, always make sure that you are using the generic enumerator, which comes from the IEnumerable&lt;T&gt; interface, and not the non-generic enumerator, which comes from the IEnumerable interface. The reason for this is that the non-generic enumerator will always return the current value as an object, so your compiled code will contain lots of box and unbox instructions. Using a generic enumerator avoids this.
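To illustrate that last point, here is a small sketch, assuming using directives for System.Collections and System.Collections.Generic:

    var list = new List<int> { 1, 2, 3 };

    // Generic enumerator: Current is typed as int — no boxing.
    IEnumerator<int> fast = ((IEnumerable<int>)list).GetEnumerator();
    while (fast.MoveNext())
    {
        int n = fast.Current;        // the value is copied directly
    }

    // Non-generic enumerator: Current is typed as object,
    // so every integer is boxed on the heap before you can read it.
    IEnumerator slow = ((IEnumerable)list).GetEnumerator();
    while (slow.MoveNext())
    {
        int n = (int)slow.Current;   // a box is allocated, then unboxed again
    }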
16. How does the garbage collector work?: This lecture is by request of one of my students, who asked me to look into garbage collection. Thanks for the request — I hope you enjoy this lecture. If anyone else has any special requests on what topic you'd like me to address, send me a message and I will work it into my roadmap for future lectures. If you remember, in the second lecture of this course, the one on heap memory in the fundamentals section, we saw what happens when reference type variables go out of scope. The variables, which exist on the stack, are destroyed, and the corresponding objects on the heap are dereferenced. Dereferenced objects continue to exist and are not destroyed immediately. I briefly mentioned that a separate process called the garbage collector periodically cleans up these objects. So in this lecture, we're going to take a closer look at the garbage collector. What is the garbage collector, and how does it work? Let's start with a very simple program. This program has only one method with one local variable, an object array. The array is initialized with five objects, and these five objects also reside on the heap, adjacent to the array itself. Now, to make things more interesting, let's remove array elements two and three by setting these array elements to null. The corresponding objects number two and three still exist on the heap, but now they are dereferenced: there are no references to these objects from anywhere in the code. What happens when the garbage collector kicks in? The .NET garbage collector is a mark-and-sweep collector, which means there are two distinct stages of garbage collecting: a mark stage and a sweep stage. During the mark stage, the garbage collector marks all the live objects on the heap. So in this example, that would be the array itself and the objects 0, 1, and 4. The objects 2 and 3 are skipped, because there are no references to these objects from anywhere in the code. Next comes the sweep stage. All objects which have not been marked in the previous stage are dereferenced, and in this stage they are deallocated from the heap. So in this example, the objects 2 and 3 have not been marked, and so they are deallocated. This leaves a kind of hole on the heap. The .NET garbage collector performs one additional step after the sweep, which is called compacting. In the compact stage, all holes on the heap are removed. So in this example, the object 4 is moved up to fill the hole that object 2 left behind. The mark-and-sweep garbage collector is very good at locating each and every dereferenced object on the heap and removing it, but it also has a big drawback: it is very slow. During the mark stage, the garbage collector has to inspect each object in memory to determine if it is live or dereferenced. If there are many thousands of objects on the heap, your program will actually freeze for a while as the garbage collector inspects each and every object. The process is also very inefficient, because long-living objects on the heap are checked and rechecked during each cycle; after all, they could be dereferenced at any time. So a very long-living object might get checked hundreds of times while it is still alive. The solution to this problem is called a generational garbage collector. The .NET garbage collector is generational, and it has three generations, which you can visualize as three separate heaps.
All new allocations go into the first generational heap, called generation 0. So if we revisit the test program with the five-element array with elements two and three set to null, then the memory layout would look like this. Everything is the same as before, but now all objects are residing in generation 0; generations 1 and 2 are empty. The first collection cycle does a mark and sweep on all objects, and everything that survives the sweep moves to generation 1. So after one cycle, the memory layout looks like this: the array and objects 0, 1, and 4 have survived the sweep and are now in generation 1. Now imagine that the program continues at this point, and it puts a new object 5 in array element two. All new allocations go into generation 0, so the memory layout would look like this. As you see, this is an interesting situation: the array resides in generation 1, but its elements are in generations 0 and 1. This is perfectly valid. Now the garbage collector kicks in again for a second cycle. All generation 1 objects move to generation 2, and the new object in generation 0 moves to generation 1. If the program continues and puts a new object 6 in array element three, it would again go into generation 0. We now have an array in generation 2 referring to objects in generations 0, 1, and 2 — again, perfectly valid. So you might be wondering at this point: why is all this happening? Why have these three generations? Well, the answer is very simple. Generations help to limit the number of objects in generation 0. Every collection cycle completely clears generation 0 of all objects, so in the next cycle, the garbage collector only has to inspect new objects that were created after the last cycle. Of course, the problem isn't going away; the garbage collector simply moved the objects somewhere else. But here's the key: generations 1 and 2 are collected very infrequently. The garbage collector assumes that anything that reaches generation 2 must be a long-living object that does not need to be checked very often. So this solves two problems. First, it reduces the number of objects in generation 0, so the garbage collector has less work to do. Second, long-living objects that survive into generation 2 are not checked very often, which is exactly what we want. The generational garbage collector is a beautiful high-performance algorithm, but it has an important drawback: it is inefficient at processing large, long-living objects. Consider: a large and long-living object will get allocated in generation 0. It survives the first cycle, and the heap gets compacted, which potentially moves the object in memory. Then it moves to generation 1. It gets compacted and moves to generation 2. All in all, that is two compactions and two moves, so a total of four memory copy operations for a single object before it arrives in generation 2 and the garbage collector ignores it for a while. If the object is very large, these four copy operations per object can introduce a significant performance overhead. So the solution to this problem is to have two separate heaps: one for small objects and one for large objects. The design looks something like this. In .NET there are two heaps: the small object heap, which works with the three generations we discussed before, and the large object heap. The special thing about the large object heap is that it does not use generations.
In fact, it has only a single generation, which is synchronized with generation 2 of the small object heap. So when the garbage collector processes generation 2 of the small object heap, it also runs through the entire large object heap. Another interesting fact about the large object heap is that it does not compact during the sweep cycle. It simply merges free memory blocks together, but it does not do any compaction to optimize the total amount of free space. You might be wondering what determines whether an object is small or large. Well, the size threshold is at 85 kilobytes. Any object of 85 kilobytes or larger goes directly to the large object heap; any object smaller than this limit goes into the small object heap. Having these two separate heaps solves the problem of large, long-living objects. They no longer need to be copied four times before they end up in generation 2; instead, they go directly into the large object heap, which is only processed together with generation 2 and never compacted. And there you have it: the .NET garbage collector is a generational garbage collector that uses a mark-sweep-compact cycle, with separate heaps for large objects and small objects. If you think about it, the .NET garbage collector makes some very specific assumptions about object sizes and lifetimes. First, it assumes objects will either be short-lived or long-lived. All short-lived objects should be allocated, used, and discarded in a single collection cycle. Any object that slips through the cracks, so to speak, is caught in generation 1 in the next cycle. So any object that survives two collection cycles ends up in generation 2 and must be a long-living object. Also, any object over 85 kilobytes in size is always considered to be a long-living object. And looking at the collection frequency of the various generations, it is clear that the garbage collector assumes that the overwhelming majority of objects will be short-lived. So I can sum up my memory optimization advice in a single sentence: do not go against these assumptions. So what have we learned? The garbage collector uses a mark, sweep, and compact cycle. The garbage collector has two separate heaps for large and small objects: the large object heap and the small object heap. The small object heap uses three generations; all new objects are allocated in generation 0 and progress towards generation 2. The large object heap has a single generation, which is processed together with generation 2 of the small object heap. Also, the large object heap does not compact the heap to optimize free memory. The garbage collector makes two important assumptions about object sizes and lifetimes: one, 90% of all objects smaller than 85 kilobytes must be short-lived; two, all objects larger than 85 kilobytes must be long-lived.
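You can actually observe the 85-kilobyte threshold from code. A sketch — note that GC.GetGeneration reporting generation 2 for fresh large-object-heap allocations is an implementation detail of the desktop runtime and may vary between runtime versions:

    byte[] small = new byte[84000];   // under the threshold: small object heap
    byte[] large = new byte[86000];   // over the threshold: large object heap

    Console.WriteLine(GC.GetGeneration(small));  // typically prints 0 (generation 0)
    Console.WriteLine(GC.GetGeneration(large));  // typically prints 2 — the large
                                                 // object heap is collected together
                                                 // with generation 2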
17. Tip #7: optimize for garbage collection: Welcome to part two of the lecture series on fast garbage collection. In this lecture, we're going to look at several performance optimizations that make the garbage collector run as fast as possible. But first, let's recap what we learned about the garbage collector in the previous lecture. The .NET garbage collector uses a mark, sweep, and compact cycle to clean up dereferenced objects from the heap. It uses two separate heaps for large and small objects: the large object heap and the small object heap. The small object heap uses three generations; all new objects are allocated in generation 0 and progress towards generation 2. Generation 0 is collected very frequently, generations 1 and 2 much less so. Generations help to limit the number of objects in generation 0. Every collection cycle completely clears generation 0 of all objects; in the next cycle, the garbage collector only has to inspect new objects that were created after the last cycle. The first memory-based performance optimization that we're going to look at is to simply limit the number of objects that we create in generation 0. The fewer objects are created, the less work the garbage collector has to do to clean up the heap. There are two strategies that you can follow to limit the number of objects on the heap. The first is to make sure your code does not create any redundant objects anywhere, and the second is to allocate, use, and discard your objects as fast as possible, so that they are ready to be deallocated by the garbage collector in the next cycle. If you wait too long between allocating, using, and discarding your objects, you run the risk of them ending up in generations 1 or 2. So for short-lived objects, you want your code to be as tight as possible. Let's take a look at some examples. Here is a code fragment that loops 10,000 times and builds up a string, using a StringBuilder with a call to the Append method. Can you see the problem with this code? I'll give you 10 seconds to think. Here's the solution: the problem is the string concatenation inside the Append call. You'll remember that strings are immutable, and so the ToString call and the concatenation both create extra string objects on the heap in every loop iteration. The code at the bottom avoids this problem by simply calling Append twice. The difference is 40,000 fewer string objects on the heap. That's a big improvement. So here's another example; see if you can spot the problem. I'll give you 10 seconds again. And here is the solution: if you store integers in an ArrayList, the integers get boxed on the heap. The generic list avoids this problem by using an internal integer array instead of an object array. A simple modification to the code results in 20,000 fewer boxed integer objects on the heap. Okay, one more example. A small static object gets initialized, then lots of other code runs first, and finally the object actually gets used. What's wrong with this picture? I'll give you 10 seconds. And here is the answer: the object is small, but the gap between allocation and use is very large, so there's a big chance the object ends up in generation 1 or 2 before it gets used. The code at the bottom avoids this problem by first making the object non-static, then allocating it just before use, and finally setting the object reference to null right after use, to signal to the garbage collector that we're done and that the object is ready to be collected.
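Here are the first and third fixes as code. A sketch assuming a using directive for System.Text; ReportHelper is a hypothetical class standing in for the lecture's small object:

    var sb = new StringBuilder();
    for (int i = 0; i < 10000; i++)
    {
        // Bad: the concatenation and ToString create temporary
        // string objects on the heap in every single iteration.
        // sb.Append("item " + i.ToString());

        // Better: separate Append calls, no intermediate strings.
        sb.Append("item ");
        sb.Append(i);
    }

    // Allocate just before use, and release the reference right after use:
    var helper = new ReportHelper();   // hypothetical small, short-lived object
    helper.Run();
    helper = null;                     // signal the GC that we are done with it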
If you don't like having null assignments all over your code, you can also wrap the bottom code in a method, and have the object reference go out of scope when you exit the method. That's my favorite solution. The next optimization you can perform is to fine-tune the lifetime of your objects. The garbage collector assumes that almost all small objects will be short-lived and all large objects will be long-lived, so we should avoid the opposite: small long-lived objects or large short-lived objects. It's instructive to view these combinations on a graph. If I plot the object lifetime horizontally and the object size vertically, I get the following chart. The bottom-left and top-right quadrants are where you want to be. These combinations of object sizes and lifetimes match exactly with the assumptions of the garbage collector. The top-left and bottom-right quadrants are at odds with the assumptions of the garbage collector. If your code contains lots of objects in these quadrants, you are effectively working against the assumptions of the garbage collector, and the performance of your code will probably suffer as a result. So what can we do to get into the correct quadrants? Let's start with object lifetime. To refactor large, short-lived objects, we need to increase the object lifetime. There is a very simple strategy for this, called object pooling. The idea is that instead of discarding an object and allocating a new one, you reuse the existing object. Because the same object is being used over and over, it effectively becomes a long-living object. This is a very popular strategy for optimizing the large object heap. So let's look at an example. Here is a fragment of code that allocates a large ArrayList and then uses it twice, discarding and reallocating the list between uses. How would you improve this code? I'll give you 10 seconds to think about it. And here is the solution: instead of discarding and reallocating the list, you wipe it clean with a call to the Clear method and then reuse the list for the second method call. In the new code, the same ArrayList object gets used over and over; its lifetime is increased, and the object effectively becomes long-living. This change improves performance and reduces the chance that the large object heap becomes fragmented. Now let's look at the inverse problem: we have small, long-lived objects that we must refactor into short-lived objects. How would that work? Here's an example. This code fills an ArrayList with 10,000 Pair objects, and each Pair contains two integers. So what's wrong with this code? I'll give you 10 seconds to think about it. The problem is that the ArrayList is a large object, so it goes onto the large object heap, which is assumed to be long-living. But the list is filled with tiny Pair objects, 10,000 of them. All these objects go onto the small object heap, into generation 0. Because the ArrayList keeps a reference to each item, all these Pairs will never be dereferenced, and they will eventually all move into generation 2. The solution is another popular refactoring strategy: instead of having one list with two integers in each list element, we break the list apart into two separate integer arrays. Because an integer is a value type, it will be stored inline with the array. So now we only have two large arrays on the large object heap and absolutely nothing in generation 0. Problem solved.
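Here are both refactorings as a sketch; FillAndProcess is a hypothetical workload method, and the sizes are mine:

    // Object pooling: wipe the large list clean and reuse it,
    // instead of discarding and reallocating it between uses.
    var list = new ArrayList(100000);
    FillAndProcess(list);   // first use
    list.Clear();           // reuse: the same object now becomes long-living
    FillAndProcess(list);   // second use, no reallocation

    // Splitting pairs: two int arrays store the values inline,
    // so no small Pair objects end up in generation 0.
    int[] firstValues  = new int[10000];
    int[] secondValues = new int[10000];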
The third optimization you can perform is to fine-tune the size of your objects. The garbage collector assumes that almost all small objects will be short-lived and all large objects will be long-lived. So if we have the opposite in our code, small long-lived objects or large short-lived objects, we need to refactor the size of these objects to get them back into the correct chart quadrants. Let's start with a large, short-lived object. To reduce the size of this object, there are two strategies: one, split the object apart into sub-objects, each smaller than 85 kilobytes; or two, reduce the memory footprint of the object. Here's an example of the second strategy. A loop fills a buffer with 32,000 bytes. Can you see what's wrong with this code? I'll give you 10 seconds. And here's the answer: the loop fills the buffer with bytes, but the buffer is defined as an array of integers. The buffer holds 32,000 items, and since an integer is four bytes in size, this adds up to 128,000 bytes. This is above the large object threshold, and therefore this buffer goes directly onto the large object heap and gets collected in generation 2. The solution is to refactor the buffer as a byte buffer. Now the memory footprint of the buffer is exactly 32,000 bytes, which is smaller than the large object threshold, and so it gets stored on the small object heap in generation 0, just like we want. Now let's look at the inverse problem: we have a small, long-lived object that we must refactor into a large, long-lived object. How would that work? The solution is to either enlarge the memory footprint of the object, or to merge several objects together to create a bigger object that can go on the large object heap. So here is the final example of this lecture. This code declares a static ArrayList and then, somewhere halfway into the program, starts to use it. What's wrong with this code? I'll give you 10 seconds. Here's the answer: it's clear that the object is intended to be a long-living object, because it is declared as static. If we also know that the list will eventually contain at least 85 kilobytes of data, then it is better to initialize the list to this size. This ensures that the list goes directly onto the large object heap. Because if you do not initialize the list, it gets the default capacity, which off the top of my head is 16 kilobytes. So then the list goes into the small object heap in generation 0 and eventually moves to generation 2, after potentially having undergone four memory copy operations. By initializing the list to the correct size right away, you avoid the migration from generation 0 to generation 2 entirely. It might seem strange that you can optimize code by making objects bigger, but that's exactly what we're doing here, and sometimes it really works. So what have we learned? To optimize your code in such a way that the garbage collector runs as fast as possible, you need to follow these strategies. First, limit the number of objects that you create. Second, allocate, use, and discard small objects as fast as possible. Third, reuse all large objects. You want to work with the garbage collector and not against it, so you must ensure that all objects in your code are either small and short-lived or large and long-lived. If you happen to have objects that are either large and short-lived or small and long-lived, you might want to refactor them.
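Here are the two size refactorings as a sketch, with my own names and sizes:

    // 32,000 ints = 128,000 bytes: over the 85 KB threshold,
    // so this buffer lands on the large object heap.
    int[] intBuffer = new int[32000];

    // 32,000 bytes stays well under the threshold:
    // small object heap, generation 0, as intended for short-lived data.
    byte[] byteBuffer = new byte[32000];

    // For a deliberately long-lived (e.g. static) list, pre-size it past the
    // threshold so it is allocated on the large object heap right away.
    var cache = new List<byte>(100000);  // internal byte[100000], about 100 KB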
For large, short-lived objects, you can either increase the lifetime or decrease the size of the object. For small, long-lived objects, you can either decrease the lifetime or increase the size. All these changes will benefit the performance of your code.
18. Tip #8: use fast delegates: In this lecture, I'm going to take a closer look at delegates. Now, if you're not familiar with the concept of delegates: a delegate is nothing more than a type that wraps a method call. So where a regular type defines what kind of data we can store (say, a string, an integer, a date, etcetera), a delegate defines what kind of method we can call: what the method parameters are, and what the return type is. To define a delegate in C#, I use the delegate keyword, like this. This specific example defines a method call that expects two integer input parameters, one integer output parameter, and a void return type. Now, this declaration does not define any variables yet. All we have at this point is a new type definition, called AddDelegate, that describes a particular method signature. To create a new delegate variable, I need to do this. This sets up a new variable of type AddDelegate. So how do I assign a value to this variable? Well, first we need to have a method somewhere in our code that exactly matches the signature we set up earlier: two integer input parameters, one integer output parameter, and a void return type. Something like this. Now that I have the method implementation, I can assign the delegate variable like this, and then I can call it like this. Now, what you've seen so far is called a unicast delegate. This is a delegate that I initialized with one single method implementation, and when I invoke the delegate, it calls that single method. But delegates are much more powerful than this. I can also set up a multicast delegate. This is a delegate that can call more than one method in a single go. Let's say I have two Add methods in my code, called Add1 and Add2. I can then set up the delegate like this. Note the use of the += operator in the third line. This adds an extra method to an existing delegate and effectively sets up a multicast delegate. Any delegate can be made multicast by simply adding more methods to it. There is no limit to the number of methods you can add, but please don't go overboard. Again, I invoke the delegate like this, and this will call every Add method in sequence, in the same order as I added them to the delegate. So this got me wondering: is there a performance difference between a unicast and a multicast delegate? And what would the penalty be of using a delegate instead of a regular method call? Let's find out. To test the performance of delegates, I wrote the following code. I start here with the familiar delegate for adding two numbers together. Down here are two implementations of the Add method, which are completely identical and match the delegate signature. Then I run three tests. In my first test, I call the Add method directly; this will provide a baseline. The second test sets up a unicast delegate and calls it twice. This will tell us what the overhead is of using the delegate instead of directly calling the Add method. And finally, I set up a multicast delegate, assign two methods to it, and then invoke that delegate once. This will compare the performance of multicast and unicast delegates. And here are the results. There is a 9% performance overhead when using a delegate instead of calling the handler directly. That's not too bad. But using a multicast delegate is more than twice as slow as calling a unicast delegate twice. Whoa.
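For reference, here is the complete unicast and multicast setup as compilable code. A sketch: the method names Add1 and Add2 are my own:

    using System;

    class Program
    {
        // The delegate type from the lecture:
        // two int inputs, one int output parameter, void return type.
        delegate void AddDelegate(int a, int b, out int result);

        static void Add1(int a, int b, out int result) { result = a + b; }
        static void Add2(int a, int b, out int result) { result = a + b; }

        static void Main()
        {
            int sum;

            // Unicast delegate: wraps a single method.
            AddDelegate add = Add1;
            add(3, 4, out sum);          // calls Add1

            // Multicast delegate: += chains a second method onto the variable.
            add += Add2;
            add(3, 4, out sum);          // calls Add1, then Add2, in that order
            Console.WriteLine(sum);
        }
    }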
The reason for this lies in how delegates are implemented. Behind the scenes, a delegate is an instance of the MulticastDelegate class. This class is optimized for unicast delegates: it uses a Method and a Target property to directly call a single method. But for multicast delegates, the class uses an invocation list, an internal list that holds references to each method, and they are called in sequence. The overhead of stepping through the invocation list is what causes this large difference in performance. Finally, let me offer a word of advice. I sometimes see code like this. This line of code defines a delegate variable and then initializes it with a dummy method that does nothing. The advantage of pre-initializing the delegate variable is that we no longer have to check if the variable is null. So instead of doing this, we can now simply do this. Regardless of whether the delegate has been assigned, the code will always work: the delegate will either call the dummy method and do nothing, or call the dummy method and the assigned method in sequence. But please don't do this. You've seen in the performance measurements that multicast delegates are much slower than unicast delegates. So the simple act of pre-initializing the delegate variable actually makes the code more than twice as slow. So here is my advice on fast delegates in C#. The performance overhead of using a delegate is only 9%. In most cases, this is perfectly acceptable, so feel free to use delegates anywhere in your code where they are convenient. But if you must have the highest possible performance, then remove all delegates from mission-critical code sections and grab that extra 9% of performance. And always avoid using multicast delegates in mission-critical code, because they are more than twice as slow as unicast delegates.
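Continuing the sketch from above, this is the alternative to the dummy pre-initialization: keep the delegate null and check before invoking, so every call stays a fast unicast call:

    // Fast: a plain null check instead of a dummy multicast target.
    if (add != null)
        add(3, 4, out sum);

    // In C# 6 and later, the null-conditional operator shortens this further,
    // e.g. for an event handler: handler?.Invoke(this, EventArgs.Empty);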
19. Tip #9: build a fast class factory: In this lecture, I am going to see if I can speed up a very common code construct: the class factory. Now, you might be wondering: what exactly is a class factory? Simply put, a class factory is a special class that constructs other classes on demand, based on some kind of external configuration data. You often see class factories being used when accessing databases. In code, we simply want to access the database, and we don't really care if the data happens to be stored in a Microsoft SQL Server database, an Oracle database, or a MySQL database. That is an implementation detail that the code should not have to worry about. So in code, you might see something like this. In this example, the class ConnectionFactory is a class factory that knows how to create a database connection object for the sales data. So if there is a setting somewhere in a configuration file that specifies that all sales data is stored in an Oracle database, then the class factory can use this information to return a System.Data.OracleClient.OracleConnection instance. You can see how class factories simplify the code enormously. You no longer have to worry about where certain data is stored; the class factory knows, and will automatically return the correct connection object for each request. Class factories have another big advantage: they allow us to move data around at runtime. If, at a certain point, the sales data migrates to a MySQL database, all we need to do is update the configuration file. On the next data request, the class factory will read the modified configuration data and return a MySQL connection object instead. To build a class factory, you need to complete this fragment of code. The input is a string which contains the type of the class to be instantiated, and the desired output is an actual instance of that class. Now, this is actually a difficult problem to solve. The closest you can get at compile time is something like this. This performs really well, because the compiler can compile every possible scenario we might encounter at runtime. But the disadvantage is that we have to anticipate every possible use of the class factory beforehand. For example, it's completely impossible to move the sales data to a MySQL database if we haven't anticipated the use of MySQL with a corresponding case statement beforehand. So this code is not very useful in a class factory, but it does have the best possible performance. I will add the code to my measurements later on to serve as a benchmark value. Okay, so a switch statement will not suffice. What else do we have? Another possibility is to use reflection. The System.Reflection namespace has a very nice class called Activator that we can use to construct any kind of object we like, based on the name of its type. So I can rewrite the previous method to use reflection instead, and it will look something like this. Now, this is going to work perfectly in all scenarios. I can throw in any type name I like, and if the type exists, the Activator class will construct the corresponding class instance. It's perfect. However, this code has a big problem: reflection is really slow. We will see how slow in a moment when I run the benchmark, but trust me when I say this is going to introduce a massive performance setback in your code.
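Here are both variants as a sketch; the method names are mine, and StringBuilder stands in for the connection classes from the lecture's example:

    // Compile-time switch: the fastest option, but every supported type
    // must be anticipated when the code is written.
    static object CreateBySwitch(string typeName)
    {
        switch (typeName)
        {
            case "System.Text.StringBuilder":
                return new System.Text.StringBuilder();
            default:
                throw new ArgumentException("Unsupported type: " + typeName);
        }
    }

    // Reflection: handles any type name at runtime, but is very slow.
    static object CreateByReflection(string typeName)
    {
        Type type = Type.GetType(typeName);     // resolve the type by name
        return Activator.CreateInstance(type);  // invoke its default constructor
    }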
This does not always have to be a problem. We can assume that database connections are not created all that often. Furthermore, opening a database connection is a much slower operation than constructing the class, so a slowdown in construction is barely going to be noticeable to the end user. Because of this, you often see the Activator class popping up in C# code. In non-critical code, there's absolutely nothing wrong with using a little reflection to suit your needs. But what about critical code, where top performance is crucial? Fortunately, there is another solution. I am going to show you a mind-blowing trick for constructing any kind of class at runtime, with a performance that is comparable to the switch statement. Check this out. First, let's get back to the basic problem: we need to complete this method. This is a generic method that can construct any type, but a collection of specific methods would also be fine. So let's say I need to construct a class called MyClassType1. Then I would need the following method in common intermediate language, and the body of the method is going to look like this. So here is the gist of the trick: I am going to dynamically create a delegate at runtime, write these two intermediate language instructions into it, and then call the delegate to create the class instance. Here's how it works. My class factory first receives the type name string, which describes the type of the class instance to create. The first thing the class factory does is check in a dictionary whether the corresponding delegate has already been created. If so, the factory retrieves the delegate from the dictionary and calls it directly to create the class instance. If not, the factory creates a dynamic method and writes the new-object and return instructions into it. It then wraps the method in a delegate and stores the delegate in the dictionary, in case we need it again later. Finally, it calls the delegate to create the class instance. Creating a dynamic method delegate is really slow, even slower than calling the Activator class. But here is the thing: once we have a delegate, we can keep calling it over and over again to create more class instances. And calling a delegate is much, much faster than using the Activator class. So there will only be a delay when creating the first object of a given type; any subsequent call is going to be super fast. Okay, so let's take a look at my benchmarking program for this lecture. You can see in this constant declaration up here that I am going to instantiate one million objects. The first method, MeasureA, loops one million times and uses the static switch statement to construct the objects. You can see here that my program creates StringBuilder instances and doesn't really support anything else. So the switch statement is a bit silly, because it only supports a single class and cannot be modified in any way after compilation. But we'll keep it in here to serve as a benchmark value; it signifies the best possible performance when creating one million objects. Next up is MeasureB, which uses the Activator class to create the StringBuilders. Really simple and clean code, and it's a shame that it's not going to perform very well. You're going to see in a moment how slow reflection actually is. Finally, here is MeasureC. It uses a class factory method called GetClassCreator. This class factory returns a class creator, which you can see is actually a delegate for a method without any parameters that returns an object. The next line creates the StringBuilder by simply calling the delegate.
So all the magic happens in the class factory. The first thing the class factory does is try to retrieve the delegate from the dictionary. If the dictionary does not yet contain the delegate, the factory moves on and gets the constructor info for the StringBuilder. It then creates a dynamic method and writes the newobj and ret instructions into the method. Finally, it wraps the method in a delegate and stores the delegate in the dictionary. And finally, down here, the main program method calls all three measure methods and displays the run time in milliseconds. Okay, let's see how these three techniques measure up. I am going to run the program. Check this out. And here are the results: the switch statement ran in 131 milliseconds, the Activator class took 11,377 milliseconds, and the dynamic method took 673 milliseconds. Here are the results in a graph. So the switch statement constructed one million StringBuilders in 131 milliseconds. This corresponds to 7,633 objects created per millisecond. But the Activator class did much, much worse. It took a whopping 11,377 milliseconds to construct the StringBuilders, which corresponds to only 87 objects per millisecond. The Activator class is 86 times slower than the switch statement. And finally, here are the dynamic method results. The dynamic method delegate constructed one million StringBuilders in 673 milliseconds. This corresponds to 1,485 objects per millisecond. So the dynamic method is five times slower than the switch statement, but it's almost 17 times faster than the Activator class. Okay, so what have we learned? A class factory is a class that constructs other class instances on demand, using external configuration information. A common way to implement class factories is with the Activator class, but the Activator class is 86 times slower than static compiled code. A much better solution is to use dynamic method delegates. Dynamic method delegates are only five times slower than static compiled code, and 17 times faster than the Activator class. In other words, if you replace the Activator class with dynamic method delegates in your class factory, you are going to speed up your code by a factor of 17.
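And for completeness, using the hypothetical factory sketched above would look something like this:

    ClassCreator create = FastClassFactory.GetClassCreator("System.Text.StringBuilder");
    object first = create();    // pays the one-time cost of emitting the delegate
    object second = create();   // subsequent calls approach compiled-code speed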
20. Are arrays on the stack worth the trouble?: In the last couple of lectures, we have looked at collection classes, including the ArrayList, the generic List, and several types of native arrays: one-dimensional, two-dimensional, and jagged arrays. There's one thing that all these collections have in common: they all live on the heap. But what if I told you that it is actually possible in C# to allocate an array of integers directly on the stack? Having a collection of integers on the stack has its advantages. If you look at the heap, the heap is managed memory. This means that the garbage collector occasionally moves blocks of memory around to optimize free space. But the stack works differently. Memory on the stack is allocated when your code enters a method and deallocated when you return from the method, and allocated memory is never moved around. This has an important advantage: if we could somehow get a pointer to stack memory, it would remain valid until the stack memory is deallocated. But getting a pointer to heap memory is much more difficult, because the garbage collector can move the memory somewhere else at any given time. To obtain a pointer to heap memory, we have to fix a block of memory in place on the heap, telling the garbage collector: you cannot touch this block of memory until I'm done with it. All this memory management creates a performance overhead, so it's not unreasonable to expect that pointer operations on the stack will be faster than the same pointer operations on the heap. So let's look at some code. Again, here is the program I used previously to compare the performance of the ArrayList, the generic List, and the native array, but I've added another test method at the bottom: MeasureD. Let's take a look. I start by declaring an unsafe code block. In C#, pointer operations are only possible inside unsafe blocks, so this is a mandatory step. Then I use the keyword stackalloc to allocate an array of 10,000 integers directly on the stack. The return value of this operation is an integer pointer that points directly into stack memory. Then come two nested loops. The outer loop repeats the experiment 10,000 times. The inner loop runs through all 10,000 elements of the array and assigns each and every one of them. You see that even though list is an integer pointer, the syntax to access an element is exactly the same as if list were an integer array: the square-bracket array syntax still works here. Up here is the MeasureC method. This method performs the exact same operation, but it uses a regular integer array on the heap, so we can compare performance. Let's run the code. Here we go. The regular integer array takes 808 milliseconds to run, but the stack array only needs 764 milliseconds. The stack array is 5.4% faster. Let's take a look behind the scenes and see what's going on in intermediate language. I will switch the environment over to debug mode, set a breakpoint, run the code again, and then switch to disassembly view so that we can see the intermediate language instructions. And here is the code for the regular integer array assignment. You see that this is the familiar arrangement of three load-location instructions and one set-element. The list variable is stored in variable location one and is loaded onto the evaluation stack. Then the variable i is loaded twice, once as the array index and once as the value to be set.
Finally, the set-element instruction stores the value at the given index. Now let's look at the pointer operation in managed code. If I scroll down, here is the same assignment in the MeasureD method. First, the list pointer is loaded onto the evaluation stack. Then the variable i and the number four are loaded, multiplied together, and added to the list value. This makes sense if you consider that an integer is four bytes in memory: the code is simply calculating the memory address to write to, which is list plus i multiplied by four. The next instruction is load location three, which loads the value to be written onto the evaluation stack. And finally, we have a store-indirect instruction, and this instruction writes the value directly into memory. So, perhaps surprisingly, these seven pointer instructions actually run faster than the four array instructions in the MeasureC method. The difference is because the set-element instruction writes into managed memory, whereas the store-indirect instruction writes directly to a fixed memory address, without having to take any kind of memory management into account. Finally, let me run the program again, but this time with all measurement methods enabled. I am going to compare the ArrayList, the generic List, the native integer array, and the integer array allocated on the stack. Here we go. And here are the results. Let me show you a graph of the results, so they're easy to compare. The ArrayList takes 9,314 milliseconds to run. The generic List is a lot faster, at 2,704 milliseconds. The integer array is even faster, at only 745 milliseconds. But the stack array beats everyone, with a run time of only 713 milliseconds. The difference in performance between the native integer array and the array on the stack is 4%. This is a lot less than the difference between, say, the integer array and the generic List, which is 72%. So if you optimize some code by replacing a generic List with a native array, which already gives you a 70% performance boost, is it worthwhile to go even further and turn the array into a stack-based array, just for an extra gain of only 4 to 5%? The answer is: probably not. The difference in performance is so small that it could easily be a measurement fluke. To test that scenario, I made a couple of modifications to my program. Let's take a look. I simplified the program to only look at the regular array and the stack-based array. Each measurement method expects a parameter that indicates the number of array elements to allocate. The measurement is repeated 5,000 times, and each method returns the total run time in milliseconds. Down here is the main program method, with a loop that repeats the experiment for different numbers of array elements. So what are the results? I actually prepared the results in advance and put them in a graph. So here is the performance for both types of arrays, for array sizes from 0 to 49 elements. Pretty inconclusive, wouldn't you say? But maybe the performance difference becomes significant at larger array sizes, so let's test that too. Here are the results again, but now for array sizes between 1,000 and 100,000 elements. Again inconclusive: there is no clear winner. So when should you use stack-based arrays? The answer is: almost never, and certainly not for performance optimization. The improvements are tiny, only a few percent, and quite often the stack-based array is actually slower than a regular heap-based array.
Also, the stack is limited in size: a stack-based array can only hold about one megabyte of data by default, the size of the default .NET thread stack, before you run out of stack space. To give you an idea, I tried allocating 10 million integers on the stack, and I got a stack overflow exception. So you might be wondering: why is the stackalloc keyword even available? Well, the answer is: to provide an interface to other programming languages that allocate arrays on the stack. When you are writing C# code that has to interface with the Windows API, or with libraries written in C or C++, the stackalloc keyword lets you set up a memory buffer as a kind of gateway between those languages and the .NET managed memory. So in conclusion: only use stackalloc for interfacing with unmanaged code, and do not use it to optimize an array, because it's probably not worth it.
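To make the stackalloc construct from this lecture concrete, here is a minimal sketch. It needs unsafe code enabled in the project settings, and the sizes and names are just illustrative:

    static class StackAllocDemo
    {
        static void MeasureD()
        {
            unsafe                                  // pointer operations require an unsafe context
            {
                int* list = stackalloc int[10000];  // lives on the stack, freed when the method returns
                for (int r = 0; r < 10000; r++)     // outer loop: repeat the experiment
                {
                    for (int i = 0; i < 10000; i++) // inner loop: assign every element
                        list[i] = i;                // square-bracket syntax works on pointers too
                }
            }
        }

        // The interop pattern this lecture recommends: a stack buffer handed to
        // unmanaged code. FillBuffer is a hypothetical native function, shown
        // only to illustrate the shape of the call.
        //
        //     [DllImport("somenativelib.dll")]
        //     static extern unsafe int FillBuffer(byte* buffer, int length);
        //
        //     unsafe
        //     {
        //         byte* buffer = stackalloc byte[256];
        //         int written = FillBuffer(buffer, 256);
        //     }
    }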
21. How do I use pointers in C#?: In this lecture, I'm going to talk about pointers. Now, pointers are not used very much, but the C# language does support them. By default, pointer operations are forbidden, because they are potentially dangerous: if you use pointers in your code, the .NET runtime can no longer guarantee that memory will never be corrupted. And this is why they are forbidden by default. However, pointer operations become available if you go into your project properties and enable unsafe code. Then you can use the following declarations and statements in your code. The unsafe keyword enables pointer operations. You can use the keyword to mark blocks of code that use pointers, or you can put the keyword in a method declaration to make that entire method unsafe. When you declare a variable by adding an asterisk at the end of the type, you declare a pointer of that given type. So in this case, I'm declaring a byte pointer. The fixed keyword pins an object in memory, telling the garbage collector not to move the object around on the heap until we are done with it. Because the object is fixed, you can then use the ampersand operator to get the memory address of the object. So in this example, I am fixing an object in the variable q on the heap, then I obtain its address with the ampersand operator and assign that memory address to the byte pointer in the variable p. The asterisk operator dereferences a pointer, which means to access the data at that memory location. So in this example, I am reading a single byte at the current address of the pointer. You can also treat pointers like arrays: the indexer operation reads data at an offset to the current pointer address. In this example, I'm reading a single byte at the memory location two bytes ahead of the current address of the pointer. Or you can change the memory address of the pointer by simply adding to or subtracting from it. In this example, I'm updating the current memory address of the pointer by advancing it three bytes ahead. Pointer support was added to the .NET runtime for three reasons: one, to aid interoperability with unmanaged code; two, to support a C or C++ to .NET compiler; and three, to make it easy to port pointer-based algorithms to C#. So should we use pointers in C#, or just stick to arrays? Well, let's find out. The classic example for using pointers in C# is when doing image manipulation. Images which are loaded into memory are managed by the operating system. The image data is in unmanaged memory, not on the .NET heap. So there are only two ways to interact with the image data: by using the high-level GetPixel and SetPixel methods, or by obtaining a pointer to the unmanaged memory. I've written a small program that loads an image into memory and then performs a grayscale transformation on it. Let's take a look. I'll start with the main method down here. My test image is a picture of 1970s model Lena Söderberg. This image is the gold standard for testing image processing algorithms. I load the image into memory, then call the MeasureA method to perform the grayscale conversion using the GetPixel and SetPixel methods, and then write the result back to disk. Then I do the whole thing again, but this time I call the MeasureB method, which uses pointer operations to perform the grayscale conversion. Up here are the conversion methods.
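For reference, the two conversion methods might look roughly like this: a condensed sketch assuming System.Drawing, a 24-bits-per-pixel image, and the standard luminance weights, so the lecture's actual code may differ in detail:

    using System;
    using System.Drawing;
    using System.Drawing.Imaging;

    static class GrayscaleDemo
    {
        // MeasureA: the safe but slow route via GetPixel and SetPixel.
        static void ConvertWithGetSetPixel(Bitmap image)
        {
            for (int y = 0; y < image.Height; y++)
            {
                for (int x = 0; x < image.Width; x++)
                {
                    Color c = image.GetPixel(x, y);
                    int gray = (int)(0.299 * c.R + 0.587 * c.G + 0.114 * c.B);
                    image.SetPixel(x, y, Color.FromArgb(gray, gray, gray));
                }
            }
        }

        // MeasureB: direct pointer access to the unmanaged image data.
        static unsafe void ConvertWithPointers(Bitmap image)
        {
            BitmapData data = image.LockBits(
                new Rectangle(0, 0, image.Width, image.Height),
                ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb);

            byte* p = (byte*)data.Scan0.ToPointer();
            int count = image.Width * image.Height;

            // Assumes no row padding (data.Stride == width * 3), which holds for
            // the 512-pixel-wide test image; general code must honor the stride.
            for (int i = 0; i < count; i++)
            {
                // Format24bppRgb stores each pixel as blue, green, red.
                byte gray = (byte)(0.299 * p[2] + 0.587 * p[1] + 0.114 * p[0]);
                *p++ = gray;    // blue
                *p++ = gray;    // green
                *p++ = gray;    // red
            }

            image.UnlockBits(data);
        }
    }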
MeasureA receives the image as a parameter and loops through each pixel using these two for loops. It gets the color of the pixel with the GetPixel method. The formula here performs the conversion to grayscale. Then I create a new color value using the grayscale variable, and modify the current pixel by calling SetPixel. Now compare that with MeasureB. I start with a call to LockBits, which requests access to the raw image data. Then, in this unsafe code block, I obtain a byte pointer to the image data by calling the ToPointer method. The grayscale conversion formula is here. Note that I use an indexer to obtain the red, green, and blue values at the current pointer address and one and two bytes ahead of the current address. This next block of code writes the grayscale value into the red, green, and blue pixel values by repeatedly writing to the current pointer address and advancing the pointer by one byte. And finally, I unlock the image data and return the run time in milliseconds. So which method is going to be faster, do you think? Let me run the program and find out. Now, as I said before, I am running a screen capture program right now, and this program puts pressure on the CPU and available free memory, and this distorts the performance values. So I actually prepared this section in advance and put the undistorted performance values in a graph. Here are the results. I ran the test three times. The run time for performing a grayscale conversion using GetPixel and SetPixel is between 146 and 184 milliseconds. But when using pointers, the run time drops to only seven milliseconds for all three tests. The pointer operation is an overwhelming 96% faster than the GetPixel and SetPixel method. Finally, let me show you the results. I will open a file browser and select the original image, and the image looks like this. The output of MeasureA and MeasureB is here. As you can see, it is a perfect grayscale conversion, and the outputs of both methods are identical. So we've looked at image manipulation in C# and discovered that pointers give a huge 96% performance boost. Is it acceptable to use pointers in this scenario? The answer is yes. The main takeaway for this lecture is: always use pointers for image manipulation in .NET. You don't really have a choice, because the image data is in unmanaged memory, and the alternative GetPixel and SetPixel methods are way too slow. I can generalize this advice into the following guidelines. First, if you are interacting with data in unmanaged memory, try to use high-level get and set methods to read and write the data. This will avoid having to use the unsafe keyword and mark your assembly as unsafe. If this route is not acceptable, because you need to read and write a large amount of data and the overhead of the individual get and set methods becomes too large, then obtain a pointer to the data instead and modify the memory directly. For image manipulation, always use pointers, because the image data will always be too large for any other kind of high-level access. It is perfectly accepted practice for image manipulation libraries written in C# to be unsafe.
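And to wrap up, here is the pointer syntax from the start of this lecture gathered into one compact, purely illustrative fragment:

    static unsafe void PointerSyntaxDemo()
    {
        byte[] q = new byte[] { 10, 20, 30, 40 };

        fixed (byte* p = &q[0])       // pin q so the garbage collector cannot move it
        {
            byte first = *p;          // dereference: read the byte at the pointer address
            byte third = p[2];        // indexing: read the byte two bytes ahead
            byte* moved = p + 3;      // pointer arithmetic: advance three bytes
            byte fourth = *moved;
        }
    }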
22. Tip #10: speed up code with pointers: In the previous lecture, we looked at using pointers for image manipulation in C#. Now, image manipulation qualifies as a scenario where we are interacting with unmanaged code: the operating system loads images into memory on its own unmanaged heap, and the only way we can get at the image data is by requesting a byte pointer. So for this specific scenario, sure, we can use pointers. But what about other use cases? Is it useful to implement data structures like linked lists and binary trees with pointers, or should we just stick to arrays? To find out, we need to accurately compare the performance of array operations and pointer operations to see if there is any difference between them. So let's imagine for a moment that my entire operating system is actually written in C#. I am running this code on a MacBook Pro, and I'm going to pretend that the OS X operating system is actually written in C#. So when I load an image into memory, the image data is immediately available as a managed byte array. I have written a second program based on this assumption. Take a look at this. Let's start with the first measurement method, MeasureA. My byte array is up here, and I will pretend that this is the image data loaded into this array. I loop through the array and convert each pixel into grayscale, using the familiar formula. So this code uses traditional array indexing to convert each pixel. Now here is the same code rewritten using pointers. I use the fixed keyword to pin the array in memory, and then I get the address of the first element in the array with the ampersand operator, and I assign that address to the byte pointer in the variable p. The rest of the code is exactly the same as the array-based code. I loop over every pixel with this variable i, and use pointer indexing to read and write each byte value. And finally, here is another pointer-based method. But instead of using indexing to access each pixel, I simply advance the pointer itself. Each loop iteration only reads the red, green, and blue values and modifies them into grayscale. Then I advance the pointer by three bytes to access the next pixel. So there is still some indexing in this method, but only by zero, one, or two bytes. The main method down here calls all three methods for varying image sizes. I measure the performance for simulated images of 512 up to 4096 pixels wide, in steps of 128. Now, as I mentioned before, the screen capture software distorts my measurements, so I prepared the results in advance. Here are the results. As you can see, there is not much difference between MeasureA and MeasureB: using a byte array, or a byte pointer with indexing, has more or less equal performance. But look at MeasureC: consistently faster than the other two, by 25% on average. So why is MeasureC so much faster? To find out, we need to take a look at the intermediate code. So let me make sure the project is in debug mode. I'm going to put a breakpoint in the main method, then run the program and switch to disassembly view. So here we go. I'll start with the MeasureA method that uses a regular managed byte array. The code that modifies the pixels is here. You can see it's the familiar three load instructions and then a store-element instruction to modify the array element. The image array is in location zero, the index variable i is in location two, and the value to be written is in location three. We have seen this many times by now.
This line of code modifies the next color value. It's almost the same code, but the variable i is incremented by one, with these three instructions: load location two, load constant one, and add. And the third line of code is almost the same, but now the variable i is incremented by two. And finally, over here, the variable i is incremented by three, with these four instructions: load location, load constant, add, and store location. So how does this code look when we're using pointers? Let me scroll down to the MeasureB method and find the corresponding lines of code. First, the pointer p is loaded with this load location two instruction. Then the variable i is loaded with this load location three instruction, and both are added together to obtain the memory address to write to. Finally, the value to be written is loaded with this load location four instruction. The final store-indirect instruction performs the actual write to memory. You see that this sequence of instructions is almost the same, except that there is one extra add instruction to calculate the memory address, and that the value is written with a store-indirect instruction instead of a store-element instruction. The next two lines are almost the same, except for the extra add instruction to either add one or two to the variable i. Now let's take a look at MeasureC. Writing to memory is just these three instructions: load location three to load the pointer, load location five to load the grayscale value, and store-indirect to write to memory. The next lines have these two extra instructions: load constant one and add to increment the pointer by one, and load constant two and add to increment the pointer by two. Finally, the pointer is incremented by three, with this code here, which is identical to the code for incrementing the variable i in the other two methods. So why is this code faster? Well, to start with, this code has fewer intermediate language instructions than the other two methods. MeasureB needed five add operations to modify the array, where this code only needs two add operations, and MeasureA needs three load-location instructions before calling store-element, where this code only needs two instructions. Also, the pointer additions are much smaller. In MeasureB, we have to add the variable i to the pointer p to obtain the memory address, and at the end of the loop, i grows to very large values, up to 48 million. Compare this to method C, where the pointer is only incremented by one, two, or three bytes. Small-number additions are much more efficient than large-number additions, because CPUs have specialized instructions for incrementing a value by one. So what can we say in conclusion? When is using pointers a valid choice? We've seen that managed arrays and indexed pointer operations perform equally in .NET. But for very large index values, it is much more efficient to increment the pointer and read and write data close to the pointer's current memory address. And so this is the main takeaway for this lecture. Normally, you should not use pointers in C#, because the performance gains are simply not worth it. Indexed pointers and regular one-dimensional arrays have equal performance, so to avoid having to use unsafe code, you should pick arrays. However, in specific scenarios, a pointer-based algorithm can outperform an array-based algorithm by up to 25%. For these scenarios, the following conditions need to be met.
First, you are reading and writing a large block of data, at least 10 megabytes in size or larger. Second, the algorithm scans each byte sequentially, or jumps through the data in a predictable manner. And third, during each loop iteration, only data very close to the current pointer address needs to be read or written. If your algorithm adheres to these conditions, and you don't mind producing an unsafe assembly, feel free to use pointers and pick up the extra 25% in performance.
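To close out, the three measurement methods from this lecture might look roughly like this: a sketch over a simulated image buffer, with illustrative names and sizes, and the same standard luminance weights as before:

    static class PixelBufferDemo
    {
        static byte[] pixels = new byte[512 * 512 * 3];   // simulated 24bpp image data

        // MeasureA: plain array indexing.
        static void MeasureA()
        {
            for (int i = 0; i < pixels.Length; i += 3)
            {
                byte gray = (byte)(0.299 * pixels[i + 2] + 0.587 * pixels[i + 1] + 0.114 * pixels[i]);
                pixels[i] = gray;
                pixels[i + 1] = gray;
                pixels[i + 2] = gray;
            }
        }

        // MeasureB: a pointer with a large index; performs like the array.
        static unsafe void MeasureB()
        {
            fixed (byte* p = &pixels[0])
            {
                for (int i = 0; i < pixels.Length; i += 3)
                {
                    byte gray = (byte)(0.299 * p[i + 2] + 0.587 * p[i + 1] + 0.114 * p[i]);
                    p[i] = gray;
                    p[i + 1] = gray;
                    p[i + 2] = gray;
                }
            }
        }

        // MeasureC: advance the pointer itself; indexes never exceed two.
        static unsafe void MeasureC()
        {
            fixed (byte* start = &pixels[0])
            {
                byte* p = start;
                byte* end = start + pixels.Length;
                while (p < end)
                {
                    byte gray = (byte)(0.299 * p[2] + 0.587 * p[1] + 0.114 * p[0]);
                    p[0] = gray;
                    p[1] = gray;
                    p[2] = gray;
                    p += 3;    // small, constant increments are what makes this fast
                }
            }
        }
    }

The only difference between the three is how each pixel is addressed, which is exactly the difference the intermediate language analysis above explains.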
23. Course recap: Congratulations! You have completed the entire course. You are now a certified C# code optimizer. I have shown you about 25 different kinds of code optimizations, how to use them in your own code, and which specific performance improvements they will give you. The optimizations were spread out over three sections. We had the basic optimizations, the low-hanging fruit, where a simple edit of your code could result in a massive 100-fold performance improvement. We had the intermediate optimizations, which were more complex code rewrites that offer only a small to medium-sized improvement. And we had the hardcore advanced optimizations, where we used several exotic features of the .NET Framework to boost performance. These optimizations have given you a rich toolbox of knowledge and ideas that you can use when writing your own code, or when you are collaborating in a development team, especially if you're working on mission-critical code where top performance is crucial. And if you discover some interesting new performance techniques of your own, please share them in the course discussion forum, for us all to enjoy. I hope you enjoyed the course and learned some useful new techniques that you can apply in your software development career. Now go and build great things!