Writing production-ready ETL pipelines in Python using Pandas | Jan Schwarzlose | Skillshare

Writing production-ready ETL pipelines in Python using Pandas

Jan Schwarzlose

Lessons in This Class

78 Lessons (6h 47m)
    • 1. 1. Introduction

      3:03
    • 2. 2. Task Description

      4:01
    • 3. 3. Production Environment

      2:03
    • 4. 4. Task Steps

      3:49
    • 5. 5. Why to use a virtual environment

      4:02
    • 6. 6. Setting up a virtual environment

      6:09
    • 7. 7. AWS setup

      6:53
    • 8. 8. Understanding the source data

      10:10
    • 9. 9. Quick and Dirty: Read multiple files

      12:27
    • 10. 10. Quick and Dirty: Transformations

      15:48
    • 11. 11. Quick and Dirty: Argument Date

      9:57
    • 12. 12. Quick and Dirty: Save to S3

      8:41
    • 13. 13. Quick and Dirty: Code Improvements

      8:27
    • 14. 14. Why a code design is needed?

      2:42
    • 15. 15. Functional vs Object Oriented Programming

      6:29
    • 16. 16. Why Software Testing?

      4:31
    • 17. 17. Quick and Dirty to Functions: Architecture Design

      0:57
    • 18. 18. Quick and Dirty to Functions: Restructure Part 1

      15:38
    • 19. 19. Quick and Dirty to Functions: Restructure Part 2

      12:32
    • 20. 20. Restructure get_objects Intro

      1:50
    • 21. 21. Restructure get_objects Implementation

      11:35
    • 22. 22. Design Principles OOP

      4:22
    • 23. 23. More Requirements - Configuration, Meta Data, Logging, Exceptions, Entrypoint

      11:30
    • 24. 24. Meta Data: return_date_list Quick and Dirty

      17:50
    • 25. 25. Meta Data: return_date_list Function

      14:17
    • 26. 26. Meta Data: update_meta_file

      12:12
    • 27. 27. Code Design - Class design, methods, attributes, arguments

      13:41
    • 28. 28. Comparison Functional Programming and OOP

      1:04
    • 29. 29. Setting up Git Repository

      4:53
    • 30. 30. Setting up Python Project - Folder Structure

      4:40
    • 31. 31. Installation Visual Studio Code

      2:31
    • 32. 32. Setting up class frame - Task Description

      1:54
    • 33. 33. Setting up class frame - Solution S3

      11:57
    • 34. 34. Setting up class frame - Solution meta_process

      1:00
    • 35. 35. Setting up class frame - Solution constants

      1:00
    • 36. 36. Setting up class frame - Solution custom_exceptions

      0:23
    • 37. 37. Setting up class frame - Solution xetra_transformer

      2:54
    • 38. 38. Setting up class frame - Solution run

      0:34
    • 39. 39. Logging in Python - Intro

      1:28
    • 40. 40. Logging in Python - Implementation

      12:50
    • 41. 41. Create Pythonpath

      6:00
    • 42. 42. Python Clean Coding

      3:34
    • 43. 43. list_files_in_prefix - Thoughts

      4:11
    • 44. 44. list_files_in_prefix - Implementation

      0:34
    • 45. 45. list_files_in_prefix - Linting Intro

      1:15
    • 46. 46. list_files_in_prefix - Pylint

      4:46
    • 47. 47. list_files_in_prefix - Unit Testing Intro

      2:58
    • 48. 48. list_files_in_prefix - Unit Test Specification

      1:04
    • 49. 49. list_files_in_prefix - Unit Test Implementation 1

      14:49
    • 50. 50. list_files_in_prefix - Unit Test Implementation 2

      14:30
    • 51. 51. Task Description - Writing Methods

      2:02
    • 52. 52. Solution - read_csv_to_df - Implementation

      2:01
    • 53. 53. Solution - read_csv_to_df - Unit Test Implementation

      3:38
    • 54. 54. Solution - write_df_to_s3 - Implementation

      2:01
    • 55. 55. Solution - write_df_to_s3 - Unit Test Implementation

      1:48
    • 56. 56. Solution - update_meta_file - Implementation

      0:24
    • 57. 57. Solution - update_meta_file - Unit Test Implementation

      1:48
    • 58. 58. Solution - return_date_list - Implementation

      0:24
    • 59. 59. Solution - return_date_list - Unit Test Implementation

      1:48
    • 60. 60. Solution - extract - Implementation

      0:34
    • 61. 61. Solution - extract - Unit Test Implementation

      2:20
    • 62. 62. Solution - transform_report1 - Implementation

      0:34
    • 63. 63. Solution - transform_report1 - Unit Test Implementation

      2:20
    • 64. 64. Solution - load - Implementation

      0:34
    • 65. 65. Solution - load - Unit Test Implementation

      1:48
    • 66. 66. Solution - etl_report1 - Implementation

      0:34
    • 67. 67. Solution - etl_report1 - Unit Test Implementation

      1:48
    • 68. 68. Integration Tests - Intro

      0:28
    • 69. 69. Integration Tests - Test Specification

      1:04
    • 70. 70. Integration Tests - Implementation

      12:09
    • 71. 71. Entrypoint run - Implementation

      12:21
    • 72. 72. Dependency Management - Intro

      7:34
    • 73. 73. pipenv Implementation

      4:12
    • 74. 74. Profiling and Timing - Intro

      2:05
    • 75. 75. Mem-Profiler

      3:07
    • 76. 76. Dockerization

      4:54
    • 77. 77. Run in Production Environment

      4:31
    • 78. 78. Summary

      1:32

19 Students

-- Projects

About This Class

This course shows each step of writing an ETL pipeline in Python from scratch to production, using the necessary tools such as Python 3.9, Jupyter Notebook, Git and GitHub, Visual Studio Code, Docker and Docker Hub, and the Python packages pandas, boto3, pyyaml, awscli, jupyter, pylint, moto, coverage and the memory-profiler.

Two different approaches to coding in the data engineering field will be introduced and applied - functional and object-oriented programming.

Best practices in developing Python code will be introduced and applied:

  • design principles

  • clean coding

  • virtual environments

  • project/folder setup

  • configuration

  • logging

  • exception handling

  • linting

  • dependency management

  • performance tuning with profiling

  • unit testing

  • integration testing

  • dockerization

What is the goal of this course?

In the course we are going to use the Xetra dataset. Xetra stands for Exchange Electronic Trading, and it is the trading platform of the Deutsche Börse Group. The dataset is derived in near real time on a minute-by-minute basis from Deutsche Börse's trading system and saved in an AWS S3 bucket available to the public for free.

The ETL Pipeline we are going to create will extract the Xetra dataset from the AWS S3 source bucket on a scheduled basis, create a report using transformations and load the transformed data to another AWS S3 target bucket.

The pipeline will be written so that it can be deployed easily to almost any production environment that can handle containerized applications. The production environment we are going to write the ETL pipeline for consists of a GitHub code repository, a Docker Hub image repository, an execution platform such as Kubernetes, and an orchestration tool such as the container-native Kubernetes workflow engine Argo Workflows or Apache Airflow.

So what can you expect in the course?

You will get primarily practical, interactive lessons in which you code and implement the pipeline yourself, plus theory lessons where needed. Furthermore, you will get the Python code for each lesson in the course material, the whole project on GitHub, and a ready-to-use Docker image with the application code on Docker Hub.

PowerPoint slides are available for download for each theory lesson, along with useful links for each topic and step where you can find more information and dive deeper.

Meet Your Teacher

There are so many cool tools out there - especially in the small, large and big data area. One life is not enough to know every tool and be proficient with all of them. But even with a small, well-chosen toolset, you can implement great projects with real value.

In 2012 I graduated as a mechatronics engineer. Programming, especially in the embedded area, was an important part of my education. During my first years as an engineer, I discovered more and more my passion for Python, especially for small, large and big data.

After a few hobby projects, I took the step to work professionally in this area in 2016. I have now been working for years as a data engineer, involved in great projects.

I like to pass this knowledge on through courses in data engineering and data science.

Class Ratings

Expectations Met?
  • Exceeded!
    0%
  • Yes
    0%
  • Somewhat
    0%
  • Not really
    0%

Transcripts

1. 1. Introduction: Welcome to the course Writing production-ready ETL pipelines in Python using Pandas. My name is Jan Schwarzlose, and I will guide you through the course. I have several years of professional experience as a data engineer, after having dealt with Python and the processing of small to large amounts of data as a hobby for a long time. Over the years, I have been involved in numerous industrial projects for small, large, and big data. These were in the classic data warehouse area with relational databases, as well as in the data engineering environment with data lakes and modern big data tools. Like this, I could gain profound knowledge and project experience with Python and Pandas. The goal of this course is to write an ETL pipeline in Python and Pandas that will be ready to be used in a production environment. We are going to implement best practices for coding in Python and executing robust ETL jobs. The production environment we are going to use consists of a GitHub code repository, a Docker Hub image repository, an execution platform such as Kubernetes, and an orchestration tool such as the container-native Kubernetes workflow engine Argo Workflows, or Apache Airflow. The ETL pipeline we are going to create will extract data from one AWS S3 bucket as source, apply transformations, and load the transformed data to another AWS S3 bucket as target. In this course, you will learn each step to write an ETL pipeline in Python from scratch to production. You will learn how to use the necessary tools and packages such as Python 3.9, Jupyter Notebook, Git and GitHub, Visual Studio Code, Docker and Docker Hub, and the Python packages pandas, boto3, pyyaml, awscli, jupyter, pylint, moto, coverage, and the memory-profiler. You will also be able to apply functional and object-oriented programming and understand the two different approaches in the data engineering field. Furthermore, you will learn how to do a proper code design and how to use a meta file for job control, and you are going to learn best practices in developing Python code such as design principles, clean coding, virtual environments, project and folder setup, configuration, logging, exception handling, linting, dependency management, performance tuning with profiling, unit testing, integration testing, and dockerization. So I wish you a lot of fun and success with my course. 2. 2. Task Description: Hello, let's talk about the task we are going to solve in this course. Here we can see a public dataset on GitHub of Deutsche Börse, the German marketplace organizer for trading in shares and other securities. There are actually two datasets, the Xetra and the Eurex data. As our source, we take the Xetra dataset. Xetra stands for Exchange Electronic Trading and is the trading platform of the Deutsche Börse Group. This dataset is derived in near real time on a minute-by-minute basis from Deutsche Börse's trading system and saved in an AWS S3 bucket available to the public for free. The link to the GitHub repository is in the course material, and as you can see here, you find detailed information about the dataset, the naming convention and more. Let's see a sample of the dataset to get a first impression. The first column lists the ISIN, the International Securities Identification Number. Each entry of an ISIN shows basic information such as the security type and security ID, and, on a per-minute basis, the start, max, min and end price, the traded volume and the number of trades.
And now let's imagine our company, a client or ourselves want to get a report that looks like this. Here we see an aggregation of the ISINs on a daily basis, and what we want to know are the opening, closing, minimum and maximum price, the daily traded volume, and the change of the current day's closing price compared to the previous trading day's closing price. Our task now is to create a production-ready Python data job that extracts the source Xetra dataset from the Xetra S3 bucket since the last run of the job and saves the report we just saw on the last slide to a target AWS S3 bucket. And as you can see here, a scheduler should be able to run the job on a weekly basis. Let's imagine we are in calendar week 1. The scheduler triggers our Python data job, and it extracts data from the source and saves the report to the target bucket. One week later, in calendar week 2, the new data in the source that was not processed yet should be extracted and the report extended, and so on the following week, again and again, every week just the same. Let's point out some requirements that are important for us as the developers of the Python application. The format of the target files should be Parquet. Parquet is an efficient, column-oriented data storage format of the Apache Hadoop ecosystem. It is built to support very efficient compression and encoding schemes. The first date that should be extracted will be given as input, so the first time the job runs, all dates from the input date until today's date should be extracted and the report built. For all subsequent runs, there should be an auto-detection of which files were already processed and which were not. The whole Python job should be configurable in such a way that if, for example, a column name changes in the source or in the target report, the code does not have to be touched, but just a configuration file edited. And the job should be written so that it is production ready. What production ready means, we will learn during the course. 3. 3. Production Environment: Hello. Before even thinking about the code, we should have a good understanding of how the production environment looks. I just take a typical production environment, and we assume that this is where we are going to deploy our application. We will have a code repository where we commit our production code, typically a GitHub repository. There we keep our Python code and a configuration file for configuring our Python code. The code repository will also contain a Dockerfile. Using this Dockerfile, we will create a Docker image that contains our Python code, I call it the application code, and all dependencies we need in order to execute our code. This Docker image will be pushed to an image repository; here in this course, we are going to use Docker Hub. From this image repository, the image can be pulled by our orchestration tool. It could be, for example, Apache Airflow or Argo Workflows for Kubernetes. The orchestration tool will create a Docker container on an execution platform, as I just call it. This could be, for example, a Kubernetes platform. This Docker container now contains the application code, since it was part of the image we built before. And when spinning up the Docker container, the config file should be mounted and some secrets, such as AWS credentials, should be added, for example as environment variables. Finally, the orchestration tool will trigger our application according to the required schedule.
So the necessary data will be extracted from the Xetra S3 bucket, processed, and the report saved to the target S3 bucket. Now we have a good understanding of the production environment, so we can make a plan for which steps should be taken to create a working application. 4. 4. Task Steps: Hello. Now I would like to introduce the steps we are going to take in this course to achieve our goal of a production-ready data pipeline with Python and Pandas. At first, I will explain why it is important and best practice to use virtual environments for a Python project, and we are going to set up a virtual environment where we will work whenever we develop something. Afterwards, we will set up a user in your AWS account to get programmatic access using credentials. Then we will use Jupyter Notebook to access the Xetra source data and get a feeling for the data. Jupyter Notebook will be used as well to create a first script-like, quick-and-dirty solution of our data pipeline. This will give us a good understanding of how to solve the task and what steps are needed. Having the code in mind, we will discuss two common approaches to structuring the code, functional and object-oriented programming, and I will explain why testing is important and introduce the different ways of testing. After this, we are going to restructure our quick-and-dirty solution using the functional approach. Then we will discuss object-oriented design principles and all the missing requirements we did not care about in the quick-and-dirty solution, such as configuration, logging and metadata. Once at this point, we know everything to do a proper object-oriented code design. Why a code design is important, you will also get an idea of during the course. Having our code design, we can finish setting up our development environment: creating a GitHub repo and cloning it to our local machine, creating a Python project with a good folder structure, and setting up an IDE where we can code. I will use Visual Studio Code in this course, but basically you can use any IDE you are comfortable with. Then it will be time to implement a basic frame of the classes we want to create, according to our code design. Since we have a design, the coding part itself should be quite straightforward. Once the classes are there, we can add logging functionality according to Python's best practice. Before we code the functionality, we will discuss what clean code in Python is. And then, for each piece of our code, we are going to implement the functionality itself, execute linting in order to clean our code, and create and execute unit and integration tests with a high test coverage. Once the coding part is done, we care about how to best manage dependencies using pipenv, and I will also show how to tune performance using profiling and timing. The last step before our code can be used in production is to create a Dockerfile, use the Dockerfile to create a Docker image, and push this image to Docker Hub, from where any orchestration tool can use this image to run our data pipeline. And finally, I am going to show you how our data pipeline can be executed simulating a production environment, using minikube, that is a local Kubernetes cluster, and Argo Workflows, the container-native workflow engine for orchestrating parallel jobs on Kubernetes. 5. 5. Why to use a virtual environment: Hello. A common case for a developer is working on multiple projects. So let's imagine you are working on project A and project B.
Project A needs the Django package with version 1.9, but project B needs Django version 1.10. By default, packages are installed in a system-wide location. If you use only the system-wide Python installation with its packages, whenever you switch between project A and B, you have to uninstall the package and re-install it with the required version. This is definitely time-consuming and annoying. How can we switch between projects without uninstalling and re-installing packages? The standard approach is to use a separate virtual environment for each project. In the Python context, a virtual environment is an isolated environment that contains its own python.exe and its own set of third-party packages. This means you can now create a virtual environment with Django version 1.9 for project A, and another virtual environment with Django version 1.10 for project B, and whenever you want to switch between the projects, you only have to activate the corresponding virtual environment. The good thing is that there is no limit to the number of environments, since basically they are just directories containing a few scripts. There are several tools out there to manage virtual environments in Python. In the standard library there is pyvenv; it was shipped with Python 3 but deprecated in Python 3.6 because of some issues, and the equivalent for Python 3.6 and higher is venv. Then there are some third-party libraries such as virtualenv. This is a very popular tool since it has more features than venv and it supports Python 2 and 3. pyenv is used to isolate Python versions and to make the process of downloading and installing multiple Python versions easier. pyenv-virtualenv is a plugin for pyenv, by the same author as pyenv, that combines pyenv and virtualenv so they can be used at the same time. virtualenvwrapper is a set of extensions to virtualenv that gives it more features, and pyenv-virtualenvwrapper is a plugin for pyenv, by the same author as pyenv, for the integration of virtualenvwrapper into pyenv. The last tool I want to introduce is pipenv. This tool is meant to be used when you develop Python applications, not libraries or packages. Since we are going to develop an application, we will use this tool to manage the virtual environment, for two or actually three reasons. The first reason is that it is the tool recommended for application dependency management by python.org. It combines pip and virtualenv in one tool, it replaces requirements.txt, and it especially solves some issues that occur when using requirements.txt with multiple environments such as test, integration and production, as in our case. This means that we are able to manage virtual environments and packages with only one tool, pipenv. How exactly it solves the mentioned issues, and what the issues actually are, we will discuss in a later step. If you want to know a bit more about virtual environments, the tools and pip, I have provided some useful links with further information. But don't worry: the features and commands we need, you are going to learn just in time. 6. 6. Setting up a virtual environment: Hello. Now let's set up a virtual environment for our project. Therefore, I have already opened the python.org downloads page, and here you can see we can download the latest version for our system, in my case for Windows. I recommend using really the latest Python version for our project. Here you just have to download and install the version.
And I will not show you how to do this; I am sure that you are able to manage it by yourself. Once the latest Python version is installed, we can check where it is actually installed. Therefore, I open a command line window. On a Windows machine I can use the command where python; on a Unix machine you can use which python. Here we can see the location of my system-wide installation, it is Python 3.8. You can have multiple versions of Python on your machine, so let's open the path. I copy it, open the File Explorer and go to the location, and here we can see that on my system I currently have Python 3.8 and Python 3.9. For my virtual environment I want to use Python 3.9, so I go inside that folder, and here is the python.exe I have to use. Now let's go back to the terminal and install pipenv into our system-wide Python installation using pip: pip install pipenv. Let's just wait until it is installed, and we can check it with pip list; here we can see pipenv with its version is installed. Now let's go to the File Explorer once again and find any location where you want to create a new project folder. I go to This PC, C:, and create a new folder with the name xetra_project. Now I go into my folder, copy the path, go to my command line and cd into my project folder. And now I run pipenv shell --python followed by the path to the python.exe of the Python version I want to use, so I copy the directory path and add python.exe. Now it should create a new virtual environment; here we can see the name, so this is correct. And now we can check our project folder, and here we have a new file that was generated: the Pipfile. Let's just open it. Here we can see the URL of our package repository, then the packages that are installed, and here our Python version 3.9, as we wanted. And in the terminal we can do pip list, and here we can see that there is only pip, setuptools and wheel, so this is a new, empty environment. If we want to check where our environments are actually created, we can use pipenv --venv, and here we see the path to our virtual environments. We can also go there, and here we can see that I already have several virtual environments. The first thing now: I want to install a first package, and for installing new packages we type pipenv install pandas, pandas being the package we want to use. So let's install it. pipenv now takes care of the installation; under the hood it also uses pip, so it is finding the latest version as well. And here you can see that it is working with the Pipfile.lock and resolving dependencies. Now let's check that it is installed: pip list, and here we can see pandas is installed. And let's check our Pipfile: in the updated version we now have pandas as a package with a star. This means that we are going to use the latest version whenever we install. So that's it, our virtual environment is now set up. 7. 7. AWS setup: Hello. In order to get access to the Xetra dataset, we need an AWS account. Although it is a public dataset, we need at least a user with programmatic access. So, if you don't have an AWS account, just register with AWS; it is pretty easy, it is free, and you will not regret it. There are a lot of options and a lot of resources you can use for free, and some not for free, for sure.
So, I am already logged in to the AWS Management Console, and here we just go to IAM, the Identity and Access Management. Here we go to Users, where you can see I already have several users, and we add a user. I just call it xetra_user, and I choose the access type programmatic access. This is sufficient for us; AWS Management Console access we don't need for this. Next: permissions. Here we choose Attach existing policies directly, search for S3, and in our case, for simplicity, let's just take AmazonS3FullAccess. I choose this, then Next, Next, Create user. And here we get the AWS credentials for access. This is what we need, and we get a CSV file. This is the only time we get this CSV file, so I recommend downloading it and keeping it safe. Here we have it in the downloads, and we can open it: we got our access key ID and the secret access key. This is what we have to take. There are actually two approaches to get programmatic access. One way is to create environment variables, and this is what we are going to do now. I just go to the environment variables of my account, choose New, and add AWS_ACCESS_KEY_ID; I copy the access key ID and paste it here. Then I copy the secret key and add the environment variable AWS_SECRET_ACCESS_KEY. They should be written in capital letters. OK, and OK. And now we just go into our project folder. I copy the path, open the command line, cd into the folder and activate our virtual environment, the one we created before, with pipenv shell. Now you can see that our virtual environment is activated; here is its name. Next I want to install the AWS CLI; this is the command line interface to interact with AWS resources, S3 as well. I want to make sure that we have access to the public dataset, to the Xetra data. So I just do pipenv install awscli. It takes a moment, it is installing, and the installation succeeded; this is good. We can check it with pip list, and we can see the AWS CLI is installed. And now we can do aws s3 ls on the bucket deutsche-boerse-xetra-pds. I just want to choose a date, here the 21st of April 2021, and the options are --recursive and --no-sign-request. Let's see what happens. And here we can see that we got access to the public Xetra dataset on AWS S3: there is a list of the files that are available for this date, as you can see here. So this means we now have access to our source data. 8. 8. Understanding the source data: Hello. Now the first thing we want to do is to understand the source data. I usually use Jupyter Notebook to get a first impression of the data and to play around with it, to do some analysis or even some plots; you can do that with Jupyter Notebook quite easily. To get Jupyter Notebook, I open a command line window; I am already inside my project directory. At first I have to activate the virtual environment with pipenv shell. Now I want to install Jupyter to get Jupyter Notebook, with pipenv install jupyter. This just takes a moment, let's wait. After this, I also have to install boto3. This is the AWS SDK for Python; if we want to use Python to get access to AWS, we have to use boto3.
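As an aside, the anonymous access check done above with the AWS CLI could also be scripted in Python with boto3. This is only a minimal sketch, assuming unsigned requests against the public bucket named above; it is not part of the course code:

```python
# A minimal sketch: the same access check as the AWS CLI call above, done from
# Python with boto3 and anonymous (unsigned) requests, since the Xetra bucket
# is public. Bucket name and date prefix as used in the lesson.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3_client.list_objects_v2(
    Bucket="deutsche-boerse-xetra-pds",  # public source bucket
    Prefix="2021-04-21",                 # one trading day
)
for obj in response.get("Contents", []):
    print(obj["Key"])
```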
Jupyter is still installing, so let's wait; it just takes a moment, installing dependencies from the Pipfile.lock. And now: pipenv install boto3, the same thing, hopefully a bit quicker. Now our packages are installed, so let's do pip list. Here we can see that several other packages were actually installed as well; these came with Jupyter. So here we have Jupyter installed, and boto3, which is what we need, and for sure pandas, but pandas we already installed. Now we can open Jupyter Notebook by just typing jupyter notebook, and in my browser an instance of the interface opens. Under New, Python 3, I can open a new Python 3 notebook. What I want to do now is to take one file of the Xetra dataset and just read it as a Pandas DataFrame. It is a CSV file, and having it as a Pandas DataFrame we can play around, take a look, understand the data, plot it, or whatever we want to do with it. Therefore, at first I import the libraries I need: I definitely need boto3, and I need pandas as pd. There is a Pandas method to read CSV, but with the S3 object storage you can't just pass the key or the path to the S3 file directly. Therefore I use from io import StringIO as an intermediate step; this is a kind of in-memory buffer we can use so that the read_csv method works. OK, we import this. At first we create a boto3 S3 resource, and a bucket: s3.Bucket with the bucket name, and the bucket name is deutsche-boerse-xetra-pds. Then I want to get a list of all files of one date. We already saw that the Xetra data is saved per date and then per hour, and the AWS SDK provides a method where we can filter by prefix; this is exactly what we need here. I just call it bucket_objects, and then I use the bucket instance I created: bucket.objects.filter, and then I can specify the prefix, which is basically the date: first the year, minus, the month, I take 03, minus 15, so I want to take a look at the 15th of March. Then I have to iterate over these bucket objects; I use a list comprehension, just saying: every object for object in bucket_objects. In Jupyter Notebook we just have to type the name in order to get it printed. And if you don't know how to execute a line here: executing is done by pressing Shift and Enter, not just Enter; if you just press Enter, you only get a new line. And here we can see all files for each hour of the 15th of March of the Xetra dataset; here, 22-23 is the last hour of this date. Now I want to actually read one CSV object; I just take any of them. In order to read it, we can use bucket.Object, and then we can say which key we want to use; the key is the file name, let's call it like this, and I just take this one, well, maybe this one here will have data, I guess. OK, and then I have to type get, then Body, then read and decode; I will use UTF-8. And if we take a look at the CSV object, here we can see our CSV data, and we want to read it now as a Pandas DataFrame. Therefore we use StringIO, which we imported: I call it data, StringIO, and here I pass the CSV object as input argument.
And finally, we can read the data as a DataFrame: we use pd.read_csv(data), and the delimiter is a comma, as we see here. Now we can print the DataFrame, and here we have our data. We can take a look, get an impression, plot it, or whatever you want to do. So just feel free to take a look at the data and get an impression; that's it, a first understanding of the data and a first access. Now we know how to access this data, and we already get a feeling for it, which is important when we want to code a production-ready application. 9. 9. Quick and Dirty: Read multiple files: Hello. Now we know how to access the Xetra dataset with Python using boto3, so let's jump directly into a quick-and-dirty solution using Jupyter Notebook. Except, of course, if you already have a clear picture of what you have to do in order to get production-ready code; I mean, if you have done it plenty of times already, the same task, the same thing, you probably have it in your mind and you don't need this quick-and-dirty solution. But personally, I prefer to just open the Jupyter Notebook and get a feeling for the code: what I have to write, what is possible, which functions and methods I have to use. And like this, once you have a quick-and-dirty solution, you know what to do and in which direction you have to go. That's why I prefer the Jupyter Notebook over an IDE at this stage: it is interactive, you can just play around with the data and try some things. This is a personal choice. So now, the first thing: let's take the notebook we wrote in the last lesson. Here we created bucket objects using this date, the 15th of March. Let's say we want to take not only the 15th but also the 16th. And now, as a first step, let's try to read all the files of these dates and create one DataFrame out of them. So now we have these two listings, bucket_objects_1 and bucket_objects_2, and we can just combine them into one list by iterating over both and using the plus operator. Let's take a look; we can run it all again, and here we see all our files from the 15th and the 16th. This is how we want to have it. Now, the first thing that comes to my mind to read everything, because so far we only read one key: the easiest thing, not only for Python programmers but for all kinds of programmers, is a for loop. So let's iterate using a for loop, for obj in objects, using exactly what we defined here as objects. Then we create our CSV object, and the key here is just obj.key; we can use the object's key and get, Body, read, decode, this is all the same. Next we also need these two lines with data and the DataFrame, let's just take them. And then we have the possibility to do df_all equals df_all.append(df), with ignore_index set to True. So here we are creating one DataFrame, and each time we append the DataFrame we just built, or read, from the CSV file to df_all. But, for sure, at first we have to initialize df_all.
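Before continuing with the multi-file loop, here is the single-file read from lesson 8 put together as a rough sketch (bucket name and date prefix as used above; credentials come from the AWS_* environment variables set earlier):

```python
# Sketch of the single-file read: list one day's objects with a prefix filter,
# fetch one object's body, and parse it with pandas via StringIO.
from io import StringIO

import boto3
import pandas as pd

s3 = boto3.resource("s3")
bucket = s3.Bucket("deutsche-boerse-xetra-pds")

# all objects of one trading day (the key starts with the date)
bucket_objects = [obj for obj in bucket.objects.filter(Prefix="2021-03-15")]

# read the first object's body as text and load it into a DataFrame
csv_obj = bucket.Object(key=bucket_objects[0].key).get()["Body"].read().decode("utf-8")
df = pd.read_csv(StringIO(csv_obj), delimiter=",")
```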
df_all equals pd.DataFrame. The point is that we have to use the same columns; there should be the same columns as in all the other objects. So, as a first step for this quick-and-dirty solution, I suggest we read an initial DataFrame and take the columns from there. So: columns equals question mark for now. I insert one cell above, and I want to read an initial DataFrame. I just call it csv_obj_init, and here I use the first key from our objects. The rest is the same, we can copy it, but we have to use csv_obj_init, and here df_init. By pressing Alt and Enter, the cell executes and creates a new, empty cell below. I want to see df_init, and it is actually empty, probably because during this hour in the night the stock exchange, the Deutsche Börse, is not open, so there is no data. But we only want to use the columns anyway, so we can use df_init.columns. These are the columns, and now we can use exactly these columns. Now let's see what happens if we execute our code. I create a new line, I want to see what df_all is, it takes a few moments, and here we got all our data for the 15th and 16th of March. As you can see by the date, the first lines are from the 15th of March and the last ones are from the 16th. This is what we want to have. Now let's take only the columns we are interested in. The columns we are interested in are ISIN; Mnemonic we could take, it depends on the requirements, whether the client wants to have it, but we don't really need it; then the Date, definitely the Time; the StartPrice, MaxPrice, MinPrice and EndPrice we need; and the TradedVolume. The NumberOfTrades we actually don't need for our report; if you take a look at our report, it becomes clear that we don't need it. So for df_all I suggest we just create a list of the columns we want to use: ISIN (Mnemonic, as said, we don't use, and Currency we don't need), Date, Time, StartPrice, MaxPrice, MinPrice, EndPrice and TradedVolume. And now we only want to have these columns in our DataFrame, so we re-select them using df_all equals df_all.loc, we want all rows, and here we use just these columns. Let's see; this is what we want, this is correct. The cell we don't need anymore we can delete by marking it and pressing D two times, D meaning delete. And now we can do a kind of filtering, in case there are entries without data: df_all.dropna with inplace=True. Although I don't expect a lot of empty entries, we need values in all the columns, and in case there are no values we cannot use those rows. We can check df_all.shape; probably this is the same number as before, so there was nothing to filter, but there might be. So this is just a kind of filtering. 10. 10. Quick and Dirty: Transformations: Now let's get, at first, the opening price. I just write it here as "Get opening price per ISIN and day", and I want to have it as Markdown.
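For reference, before turning to the transformations, the loop-based multi-file read just described might be sketched as follows, continuing from the previous sketch. The column names are assumptions based on the source CSV header and the report requirements, and the append call is the quick-and-dirty variant that lesson 13 later replaces with pd.concat:

```python
# Rough sketch of the loop-based quick-and-dirty build of df_all: an initial
# object supplies the column names, every object is appended, then the report
# columns are selected and incomplete rows dropped.
objects = [obj for obj in bucket.objects.filter(Prefix="2021-03-15")] \
        + [obj for obj in bucket.objects.filter(Prefix="2021-03-16")]

csv_obj_init = bucket.Object(key=objects[0].key).get()["Body"].read().decode("utf-8")
df_init = pd.read_csv(StringIO(csv_obj_init), delimiter=",")
columns = df_init.columns

df_all = pd.DataFrame(columns=columns)
for obj in objects:
    csv_obj = bucket.Object(key=obj.key).get()["Body"].read().decode("utf-8")
    df = pd.read_csv(StringIO(csv_obj), delimiter=",")
    df_all = df_all.append(df, ignore_index=True)  # DataFrame.append was removed in pandas 2.0

# keep only the columns needed for the report and drop incomplete rows
columns_use = ["ISIN", "Date", "Time", "StartPrice", "MaxPrice",
               "MinPrice", "EndPrice", "TradedVolume"]
df_all = df_all.loc[:, columns_use]
df_all.dropna(inplace=True)
```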
And yes, we want to have the opening price for each ISIN and for each day. So we can do df_all['opening_price']; like this we create a new column with the name opening_price, and we take df_all.sort_values by the column Time. Like this, the first entry of a day will be our opening price. Then we can do groupby (that was the wrong button), and we want to group by ISIN and Date. And we want to choose the StartPrice column, because the opening price is the first StartPrice when we sort our values by time and group them by ISIN and Date. So we choose the column StartPrice and do transform('first'). We can execute the code, and let's see what we have: we can see the new column opening_price, and it holds the StartPrice. This means, per ISIN, for example if we filter df_all on one ISIN and take a look, we can see that for each day this column shows the first StartPrice of the day, the opening price so to say. For the 16th of March we have another price, which makes sense. OK, next step: we want to get the closing price, the closing price per ISIN and Date, again as Markdown. Here we do almost the same, but we take the last one. So this will be closing_price, and here we just use 'last'. We can check this again: the opening price is here, and, although there are too many entries to see it all, here we can see the last entry of the day is at 16:25, and the EndPrice there matches the closing price, so it seems OK. And now we can do our aggregations, all the aggregations we need, again as Markdown. Now we can group the whole DataFrame: df_all equals df_all.groupby ISIN and Date, and we choose the option as_index=False. If we don't set as_index to False, then ISIN and Date will become the index, and we don't want to have them as index. Now we can use .agg for aggregation. We want the opening price in euro, and here we choose our column, the opening_price we just created, and we use min. Then we want the closing price in euro, the closing_price column, and here we choose min as well. Actually, here it doesn't matter whether we use min, max or whatever, because the values are always the same for each ISIN and each date, as we can see; but when grouping we have to choose some aggregation, so I chose min, more or less arbitrarily. What's next? We want the minimum price in euro, and here we really want the min of MinPrice. Then, definitely, the maximum price in euro, the column MaxPrice, where we have to choose max. And the daily traded volume is the TradedVolume with sum. This looks good. Now let's see what we have. This looks good: for each day and for each ISIN we now have exactly one entry. We can see that for one ISIN we have the data for the 15th of March and for the 16th of March, and the same for the next ISIN, and so on. So this is what we want to have; this is correct.
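A sketch of the transformations described above, continuing from the earlier sketches and with the same assumed column names:

```python
# Opening/closing price per ISIN and day via groupby + transform, then one
# aggregated row per ISIN and day.
df_all["opening_price"] = df_all.sort_values(by=["Time"]) \
    .groupby(["ISIN", "Date"])["StartPrice"].transform("first")
df_all["closing_price"] = df_all.sort_values(by=["Time"]) \
    .groupby(["ISIN", "Date"])["EndPrice"].transform("last")

df_all = df_all.groupby(["ISIN", "Date"], as_index=False).agg(
    opening_price_eur=("opening_price", "min"),
    closing_price_eur=("closing_price", "min"),
    minimum_price_eur=("MinPrice", "min"),
    maximum_price_eur=("MaxPrice", "max"),
    daily_traded_volume=("TradedVolume", "sum"),
)
```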
So now we want one last thing: a column that shows the change from the previous day's closing price to today's closing price; let's just call it percent change to previous closing. The first thing I would do, probably the easiest, is to create just another column for the previous closing price. Here we have df_all.sort_values by Date (it is important to sort the values by date first), then groupby ISIN, and here we want to take the closing_price_eur and shift it by one. Like this, we get the closing price from the previous day. Looking at df_all, we can see that for the ISIN we already looked at there is no value for the 15th, because we don't have data for the 14th of March; for the rest, the previous day's closing price is there, as you can see. So this is the first step. The next step is that we create change_prev_closing_% equal to df_all closing_price_eur minus df_all prev_closing_price, divided by df_all prev_closing_price, multiplied by 100. There is an error; what is the problem? Ah, it should be df_all here, I forgot something, and no, only one. Let's execute it again. Now we want to drop a column, the prev_closing_price column that we don't need anymore; we just wanted it to get to the final change column. We drop it with inplace=True, so that the DataFrame df_all is updated as well. What else do we want? We want to round everything, and the number of decimals should be two; this is the format we want. Now let's see what we got. This looks good, this is what we expect: ISIN, Date, opening price, closing price, minimum price, maximum price, daily traded volume. Oh, here is a typo, "volume", for sure it should be traded volume. Where is it... here, yes, let's fix it and run it again. Now we have to run everything again, Cells, Run All, because the DataFrame was already updated with the other names and the other columns, so we couldn't just re-execute that one cell. So I just run everything, and here we can see it: the daily traded volume is now as expected, and the change of the previous closing price to today's closing price is there. So these are our transformations; this is already, let's say, our report. We could say we are already done, but to production it is a long way. The next step is to use a date as an input argument, because our requirement is to configure our job in such a way that the data will be extracted since one minimal, first date. That's why we have to rebuild our code a bit to use a date as input argument. This we will do in the next step. 11. 11. Quick and Dirty: Argument Date: Hello. Now we want to use a date as input argument. So here I have the notebook open that we just wrote, and let's insert one cell. Let's say we want to give an argument date. And which date do we want to choose? Let's take a look: today is the 9th of May, so it is Sunday, the exchange was not working, no data, and on Saturday no data either. Let's take, for example, Friday, the 7th of May. This means that if we take this as input argument, all data since this day should be extracted and the report built. And now I have to insert one more cell here.
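The previous-close comparison just built might be sketched like this, continuing with the same assumed column names:

```python
# Shift the closing price by one day per ISIN, compute the percentage change,
# drop the helper column, and round the report to two decimals.
df_all["prev_closing_price"] = df_all.sort_values(by=["Date"]) \
    .groupby(["ISIN"])["closing_price_eur"].shift(1)
df_all["change_prev_closing_%"] = (
    (df_all["closing_price_eur"] - df_all["prev_closing_price"])
    / df_all["prev_closing_price"] * 100
)
df_all.drop(columns=["prev_closing_price"], inplace=True)
df_all = df_all.round(decimals=2)
```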
Now, what I need as an argument: this is a string, so our input is a string, but as argument I want a datetime object, and actually not of this day but of the previous day. Because, as you remember, we didn't have a previous closing price for the first date. This means that whenever we choose a minimum date, we have to extract the data from the day before as well, and afterwards we have to filter so that the report only shows data from the minimum day we gave as input argument onwards. So here I just call it arg_date_dt; we should use datetime to do proper filtering, and here I use the datetime package. I have to import it: from datetime import datetime and timedelta. Actually, I said we need it for proper filtering; we don't need it for the filtering itself, but I want to calculate the day before this day, and the datetime library gives us some good functions for that. So at first we want to convert our string into a datetime object, and this we do using strptime. We pass the arg_date, then we have to tell it the format we want to use: first the year, minus, %m for the month, minus, %d for the day. Then we only want the date part, since datetime holds date and time, so .date(), and minus timedelta with days=1. It is just that simple. Let's see what our date is now: arg_date is the 7th of May, and the minimum day from which data should be extracted is the 6th of May. Now we can remove our two bucket_objects listings; they are not needed anymore, because there we were filtering by a prefix, but here we want a date list. The easiest thing is to say: objects equals object for object in bucket.objects.all(); like this we get all objects in the bucket, and here we provide a filter. Imagine we have our keys; they are just strings, and we want to filter so that the date part is later than or equal to this date, the 6th of May. We can do it like this: datetime.strptime, doing the same as before and converting this part of the key to a datetime object. How can we get this part? We just use obj.key and split the string by a slash, and then we use the first list item. Then we have to give the format again, we just take the same format, and .date(), and this should be greater than or equal to arg_date_dt, our argument. Let's see what happens; this cell we can delete. It will take a bit of time, I just run from the first cell and wait. I'll pause the recording until it's done; it takes maybe one or two minutes. Now it's done, and we can list all the objects, and we can see that all objects from the 6th of May up to the 7th of May are listed; obviously there is no other data available yet. Now let's continue and run our code up to the point where, finally, we also want to filter by date. So let's insert one more cell, and here we apply a filter: df_all where df_all.Date is greater than or equal to arg_date; here we can immediately use our string.
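A compact sketch of the date handling just described, assuming the same bucket object and key layout as above:

```python
# Parse the input date, go one day back (for the previous-close calculation),
# and keep only objects whose key starts with a date on or after that day.
from datetime import datetime, timedelta

arg_date = "2021-05-07"
src_format = "%Y-%m-%d"
arg_date_dt = datetime.strptime(arg_date, src_format).date() - timedelta(days=1)

objects = [
    obj for obj in bucket.objects.all()
    if datetime.strptime(obj.key.split("/")[0], src_format).date() >= arg_date_dt
]
```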
And we have to use this date because our report should not show more, or older, data than this argument date (the argument, the argument, that sounds funny). So let's see, and let's take a look. Now we can see that only for the 7th of May there is data, because the other days are Saturday and Sunday, and here we have all the values we need; and as input argument we have our date up until today. So this is one more step we implemented. Let's go further and save our report to S3. 12. 12. Quick and Dirty: Save to S3: Hello. Now it's time to save our report to an S3 target bucket, but at first we have to create this bucket. Therefore I open the AWS Management Console of my AWS account, and here I go to S3 and create a bucket. I just name this bucket xetra-1234; you can name it whatever you like, but it has to be a globally unique name. The settings: the region I leave as it is, Block all public access I also leave, and I scroll down and create the bucket. Here we have our xetra-1234 and we can use it. Now I open the command line window inside my project folder in order to install one more package, pyarrow. This is what we need to save our report to S3 as Parquet. At first I activate our virtual environment with pipenv shell, and now I can just do pipenv install pyarrow and wait until it's installed. Now it is installed, and we can open Jupyter Notebook. Let's open the notebook we wrote in the last session; I just saved it as "quick and dirty solution". The first thing we additionally have to import is: from io import BytesIO. Because we want to save it as Parquet, we can't use StringIO, we have to use BytesIO. Now at first let's run all our code, Cell, Run All, so that we have df_all here, and this is what we want to write. Here I just write "Write to S3" as a Markdown heading and create a new cell. The first thing we have to think about is a key, the name under which we want to save the file. I just call it, for example, xetra_daily_report_ plus today's datetime as a string; I can just use datetime.today().strftime, which converts it to a string, and the format is year, month, day, underscore, hour, minute, second, plus .parquet. That will be the key. Now we can use BytesIO to create an output buffer: out_buffer equals BytesIO(), and df_all.to_parquet, which is the method we are going to use, writing to our buffer, with index set to False because we don't want to save the index. Now we have to create a bucket instance for our target bucket: we use s3.Bucket with the name of the bucket, xetra-1234. And now we can use this bucket_target.put_object, and as the Body we use our out_buffer.getvalue(), and the Key we pass as key. Now we can execute it and see; it says it's executed, it's saved. Let's check the Management Console: here we have our xetra bucket, and we can see our saved Parquet file. Let's also check whether the content is correct. I just want to read the uploaded file, and here I do the same as before: for obj in bucket_target.objects.all().
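The write path just described, gathered into one sketch; the target bucket name is the one created above, and pyarrow is assumed to be installed:

```python
# Serialise the report into an in-memory buffer and put it into the target
# bucket under a timestamped key.
from io import BytesIO
from datetime import datetime

key = "xetra_daily_report_" + datetime.today().strftime("%Y%m%d_%H%M%S") + ".parquet"
out_buffer = BytesIO()
df_all.to_parquet(out_buffer, index=False)

bucket_target = s3.Bucket("xetra-1234")
bucket_target.put_object(Body=out_buffer.getvalue(), Key=key)
```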
And I want to print all the keys in the bucket. This is what we have right now, and this is also what I want to read; let's just take this key for reading. It is not a CSV object now; I call it prq_obj, a parquet object, and as the key we just use this one, and we don't have to decode it now. To read it as data we use BytesIO, passing this object, and the DataFrame we can read as df_report equals pd.read_parquet. Since we are using the BytesIO buffer, we can use the pandas read_parquet method. And here is a problem; what is it saying? An error occurred, NoSuchKey, when calling the GetObject operation: the specified key does not exist. That means: here we used the source bucket, but for sure we have to use the target bucket. And now it is reading the report; this is our report, the data we wanted is saved. We can just compare: df_all should be the same as the data from our Parquet file. 13. 13. Quick and Dirty: Code Improvements: Hello. Now let's look through our code and see what we can improve, what we can write more Pythonically, or maybe in a way that is more efficient. Therefore I open the Jupyter Notebook we wrote in the last sessions. The first thing I would do is to write all our arguments and parameters at the beginning of the code. Here we have our date already, and the next is our format; we can just parameterize it, I name it src_format, and then here we use this src_format, and the same here. The next is our source bucket name: I call it src_bucket, equal to deutsche-boerse-xetra-pds, and here we just use src_bucket. The same we can do with our target bucket: we scroll down, find the target bucket name we use, call it trg_bucket, and paste it here. Then we have our columns; we can just take them and move them up as well. Let's see what else we have: the format we decode with could also be a parameter, but I just leave it as it is for now, and the same for all the column names; for now we leave them as they are, but I would do it. I would also take the key and move it up here. This cell we can remove, the objects we can put together here, and this cell we can remove too. One thing we definitely can improve is this big for loop; this is really not Pythonic. What we can do instead: we could insert one cell here, take everything that is inside the for loop, and write a function, for example csv_to_df, with the file name as an argument. Inside goes our code, as key we use the filename, and we return a DataFrame. Afterwards, we can just build our df_all using pd.concat, creating a list comprehension that iterates through our objects and uses our csv_to_df function: for obj in objects, we call csv_to_df with the object as input argument, and we want ignore_index equal to True. One thing is missing, because we need the key, our file name, as input argument for our function, but we are not passing the key here; either we write obj.key here or inside the function, it doesn't really matter, but let's use it here. So now this should work. Let's just test it using the date.
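The refactoring just described, gathered in one place as a sketch, reusing the bucket, objects and imports from the earlier sketches:

```python
# The per-file logic moves into a small function, and the for-loop/append is
# replaced by pd.concat over a list comprehension.
def csv_to_df(filename):
    """Read one CSV object from the source bucket into a DataFrame."""
    csv_obj = bucket.Object(key=filename).get()["Body"].read().decode("utf-8")
    return pd.read_csv(StringIO(csv_obj), delimiter=",")

df_all = pd.concat([csv_to_df(obj.key) for obj in objects], ignore_index=True)
```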
What is the date today? It's the tenth of May, so let's use the ninth of May and run the code up to here. I wait until it has executed — done — and we can check the result: it is as expected. That means we can now remove the old for loop and everything in it; it really was not Pythonic. As you can see, creating a small function and concatenating with a list comprehension is a much better way — more readable and more efficient. So we remove the cells we no longer need, keep the shape check, keep the filtering, and the write-to-S3 part looks good. The reading part is not actually part of our job anyway, it was just for checking, so we can leave it as it is. Now we execute everything with Cell → Run All; it takes a few moments, so I pause the recording. It executed successfully. The only thing is that a new file was written, so of course we want to read this new one and not the old one. The report looks pretty good. So now we have working code: it extracts from the source, it creates the report, and it loads to the target. We could think we are done — we have a working ETL job — but I tell you, it will take a bit more effort to get production-ready, high-quality code. Let's see in the next sessions how we proceed. 14. 14. Why a code design is needed?: Hello. Why do we need a code design and how does it help us as developers? Imagine we are in our development environment. Basically, we could say the code we wrote in the Jupyter notebook is working. We could add a few more requirements and do some end-to-end tests, meaning we execute the code under several different conditions according to the requirements inside the development environment. We could think our work as data engineers is done, deploy to production, and everything works: the job is triggered again and again and the code runs fine. But imagine you are on a project, you write your code and then you leave the company or the project, you get sick, or whatever — another colleague has to continue, and first of all has to understand the code. Everyone should be able to understand it; it should be as intuitive and self-documenting as possible. Is that the case with our code? Probably it is not that difficult to understand yet, because it is just a small script. But imagine the project continues, more requirements and features are added, and the code grows and grows. If we keep scripting the way we have so far, I promise you the code will not be readable and understandable anymore, even for yourself. I had exactly this problem in one of my first projects: I was writing an application that predicts sports bets and simulates the results. I was coding without caring much about clean code, without any design or comments. Then I couldn't work on it for about two months, and after the break I wanted to add a small new feature. It took me two days just to understand the code and figure out how to change it to add this small feature.
And I don't even want to talk about how difficult it was to debug. I learned this lesson the hard way, because I understood that this code could not be used, modified or extended. So I decided to redesign my entire code according to common software development principles. These are some of the reasons why a proper design beforehand is so important, and I promise it will save you and other developers a lot of time. 15. 15. Functional vs Object Oriented Programming: Hello. The first design decision we have to take is which programming paradigm we follow. There are basically two mainstream paradigms that are important for applications like ours, which execute ETL tasks — extracting data from a source system, applying transformations to the data, and loading the transformed data to the target system. These are functional programming and object-oriented programming. Functional programming basically means structuring the code in small chunks named functions. A function contains some statements that execute one particular task for the application, and its output relies only on the given input arguments. In this way side effects are eliminated, and a function, once written, can be invoked and reused everywhere in the code, so the code stays modular and clean. The easiest function you can imagine adds two integers and returns the sum; the output relies only on the two input arguments. A functional program is especially suitable where state is not a factor and mutable data is not, or only very little, involved — frankly speaking, where no boundaries are required or they are predefined. There are several advantages, such as efficiency, lazy evaluation, nested functions, less error-prone code, and the ability to parallelize. Considering the way ETL applications work, functional programming seems to be the natural way of structuring our code, because data pipelines can nicely be built by chaining multiple small functions. But let's also take a look at the object-oriented paradigm. When you program object-oriented, you program objects to represent things exactly as they are in the real world. Consider the object car as an example: each car has a color, a body type, a length, and at each moment a speed, an acceleration, and so on. This means that each object has attributes containing data, and to manipulate these attributes there are functions — in object-oriented programming they are called methods. Methods for the car could be drive, park, reverse, or even update_color. Using objects provides the ability to model real-world scenarios more easily. There are four major principles when we talk about object-oriented programming. The first is encapsulation: summarizing all features of a thing in one object containing attributes and methods. The attributes describe the object's state — in our car case the current color or the speed — and they shouldn't be directly accessible by other objects. For manipulating and communicating with the object there are the methods, which are callable by other objects. This separation of attributes and methods, bundled in one object, is called encapsulation. The second principle is abstraction: objects are implemented in a way that other objects should not need to know their inner details in order to use them — they only need to know the interface. Think of a car object and another object, a person. The person only needs to know the car's interface to use it: methods such as speed_up, brake or park are all the person has to know to use the car and manipulate its state.
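To make encapsulation and abstraction a bit more concrete, here is a small, purely illustrative sketch — the class and its attributes and methods are only examples, not the course code:

```python
# Illustration of encapsulation (state + methods bundled in one object)
# and abstraction (users only touch the interface, not the internals).
class Car:
    def __init__(self, color, length):
        # attributes describe the object's state
        self._color = color
        self._length = length
        self._speed = 0

    # methods are the interface other objects use to manipulate that state
    def speed_up(self, amount):
        self._speed += amount

    def brake(self):
        self._speed = 0

    def update_color(self, new_color):
        self._color = new_color

# A "person" only needs the interface, not the inner details:
my_car = Car(color='red', length=4.5)
my_car.speed_up(50)
my_car.brake()
```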
Then there is inheritance. The description of an object like a car is called a class in object-oriented programming. Let's say we create a class Car, meaning we describe what attributes and methods a car object has. The principle of inheritance provides the ability to create parent and child classes. We could create a parent class Car — every car definitely has the attributes color, length and height and methods like drive, park and update_color — and then child classes such as LuxuryCar, SportsCar and SUV, for instance. The child class LuxuryCar automatically inherits the car's attributes and methods, so of course a luxury car also has a color, length and height and the methods drive, park and update_color. Additionally, attributes and methods can be defined that belong only to the class LuxuryCar — say an extra seating-level attribute and a set-seating-level method. That is inheritance. And the fourth principle is polymorphism: each child class can use exactly the same methods as its parent class, but there is the opportunity to implement a method individually for each child class. Coming back to our parent class Car, we implemented a method start_engine. For the child class LuxuryCar, whenever the engine is started the driver-assistance systems should also be initialized; for the SportsCar, the steering wheel and the seats should move into a special sporty position, or whatever. So every car has the method start_engine, but it is implemented differently for each child class of Car. Now you have a good idea of what object-oriented and functional programming mean. Which approach should we finally use for writing ETL jobs and data pipelines? Functional programming is all about data manipulation and is highly reusable, which seems perfect for our case. But in object-oriented programming we have the four principles that give us a strong feature set and can be highly efficient. I suggest we first go the functional way and restructure our code accordingly, and afterwards we look at the pros and cons of the object-oriented approach. 16. 16. Why Software Testing?: Hello. Before starting to restructure our code using the functional approach, I want to talk about testing: why we should test our code, and how. The diagram shows the costs over time for three cases — no tests, manual tests and automated tests. It is not immediately apparent why you should write unit tests for your software: in the diagram you can see that initially, when you start writing code, having no tests or doing manual tests is much cheaper than writing automated tests. That is because the developers first have to get familiar with the test suite and the test-driven workflow, and they need to write the tests. But maintaining code without tests is no option, since it will lead to expensive production bugs. Manual tests scale linearly — as the code grows, the testing time increases — so in the long run providing automated tests is cheaper. The most expensive bugs are production bugs: the earlier in the software cycle a bug is found, the cheaper it is to fix. Testing also prevents regression: developers cannot remember all the features that need to be re-tested after a while when refactoring, adding new features or removing features, and the cost of manual tests grows linearly as the number of features increases.
But automated tests are written once. Writing testable code improves the overall quality, because developers are forced to stop and think about their architectural choices when every new class must have unit tests. A good test suite helps with code reviews as well: reviewers can check that the test cases are well written and don't miss any important case before reviewing the actual code. Tests also help developers add new features, especially in a bigger, older codebase, because they show whether changes or new features broke something important. And tests can be used to fix a bug that already showed up in production: once the bug is there, a developer can write a test that reproduces the error and use it as a blueprint to fix the issue. What types of testing can we apply? The first and earliest tests are unit tests; they test individual units or components such as functions or classes. Integration tests determine whether independently developed units of software work correctly when they are connected to each other. System testing is conducted on a complete, integrated system or application, taking all integrated components that have passed integration testing. And acceptance testing is conducted on the complete system or application to determine whether the requirements are met; acceptance tests are usually done by the customer. We as developers definitely have to care about unit testing. Depending on the project and whether there is a separate test team, we might be involved in integration or system testing or not, but in any case it is always good to have automated integration or even system tests for your code. Whenever we write functions or classes, we have to ensure they work as expected using unit tests — this is really important when developing code for production, and it will also be a big part of this course. But since we are data engineers, data scientists or ETL developers, we always deal with data source and target systems — file-based systems, databases, object storages, or whatever — which means that running integration or system tests is fundamental to ensure a working application. Whether such tests count as integration or system tests is not always clear to separate; in this course we will write at least one integration and system test to ensure that all components of our application and the source and target systems work together. 17. 17. Quick and Dirty to Functions: Architecture Design: Hello. Before restructuring our code into functions, let's take a quick look at a commonly used architecture style, the layered architecture, which fits our concerns quite well. Here we see four different layers: the infrastructure, adapter, application and domain layer. The domain layer includes entities, value objects or domain events, which are not interesting for us right now. The infrastructure layer includes databases, routes or caches, for example — the object storage for the source and target data is in this layer — and regarding our code we don't need to consider this layer either. Important for our code are only the adapter and the application layer: the adapter layer is responsible for access to infrastructure and external APIs, and the application layer contains the features of our application. So we can use these two layers as a first way to structure our code.
18. 18. Quick and Dirty to Functions: Restructure Part 1: Hello. Here I opened our code in the Jupyter notebook. Let's see which parts of our code we can map to the adapter layer — certainly all operations that directly access or communicate with the S3 storage. Each function we write should be as pure as possible, meaning no side effects and an output that relies only on the input, and as small as possible: the smaller the functions are, the better we can test them, the better the code is readable — especially if we give them meaningful names — and the better it can be maintained and extended with new features. Whatever we can, we should pass in as arguments; it should be configurable, because that brings flexibility to the application. So let's just start. We insert a cell below and write a comment that this is the adapter layer. The first function we can imagine is, of course, reading our CSV files from the source and converting them to a DataFrame. We actually already wrote this function, so we can take the whole piece and just rename it a bit to read_csv_to_df. Now we think about what we want to configure: there is the bucket, there is the key, and we can also configure the decoding and the delimiter. So we add bucket as an argument, and instead of filename I call the next argument key, since we are talking about object storages — there are always keys, not file names. Then the decoding, with 'utf-8' as a default value, and the delimiter or separator, which I call sep. That is our read_csv_to_df function. The next function I can think of, since we are now able to read CSV files into a DataFrame, is writing the DataFrame to S3: write_df_to_s3. We can take our existing code, drop what we don't need, and again use the bucket as an input argument — a bucket object should be passed in. The DataFrame we definitely need, so instead of df_all we just use df, and we also need the key, which we can leave as it is. As the return value we simply return True. That is our write_df_to_s3 function. Then there is only one part left where we interact with the S3 storage: where we get our file list by listing the objects. Together with the date handling we can create a small, nice function that takes the argument date as input and returns a list of the file names. I take this part, and the date variable I call min_date, which seems a bit more meaningful to me. Now we create a function — let's call it return_objects — and as input arguments we want the bucket and our arg_date. One more thing to pay attention to: right now we return a list of objects, not strings. We should consider returning a list of key strings instead, using obj.key in the comprehension, so the function returns the file names rather than the objects themselves.
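A condensed sketch of how these adapter-layer functions might end up, following the walkthrough. Signatures and defaults are the ones just chosen; the date filter in return_objects assumes the Xetra keys start with the trade date, as they do in the public source bucket:

```python
# Adapter layer: everything that talks to S3 directly.
from io import StringIO, BytesIO
from datetime import datetime, timedelta

import pandas as pd

def read_csv_to_df(bucket, key, decoding='utf-8', sep=','):
    csv_obj = bucket.Object(key=key).get()['Body'].read().decode(decoding)
    return pd.read_csv(StringIO(csv_obj), delimiter=sep)

def write_df_to_s3(bucket, df, key):
    out_buffer = BytesIO()
    df.to_parquet(out_buffer, index=False)
    bucket.put_object(Body=out_buffer.getvalue(), Key=key)
    return True

def return_objects(bucket, arg_date, src_format='%Y-%m-%d'):
    # one day before arg_date, so the previous closing price can be calculated
    min_date = datetime.strptime(arg_date, src_format).date() - timedelta(days=1)
    # keys in the source bucket start with the date, e.g. "2021-05-07/..."
    return [obj.key for obj in bucket.objects.all()
            if datetime.strptime(obj.key.split('/')[0], src_format).date() >= min_date]
```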
Now we have our adapter layer: read_csv_to_df, write_df_to_s3 and return_objects — all the functionality where we interact with our S3 storage. The rest of the code is part of the application layer. One way to structure our application core is in three parts: extract, transform and load. Since it is an ETL job, why not structure the code exactly like this? So we insert another cell below for the application layer and immediately add the extract function, which should return a DataFrame. What do we take for extract? The only part we need is the concatenation of the source files, except that here we call the renamed read_csv_to_df function, and the arguments we need are the bucket and the key, where the key comes from our objects list. The default values for decoding and separator we don't have to pass. So the input arguments for extract are the bucket and our list of objects — our list of file names — and we return a DataFrame. The next function we need is our transform function. I call it transform_report1: imagine this is the first report we are building and there will be several more reports, so for the sake of simplicity, transform_report1. As input we have a DataFrame, and as return value we want another DataFrame — our transformed report. Which parts of our code go into the transform function? The column selection, where we use df instead of df_all and take the columns as an input argument; then the filtering, again with df; and all parts of our aggregation, everywhere replacing df_all with df. We also see that we use the argument date, so we have to take arg_date as an input argument as well. Now our transform_report1 function is ready to use, and next we need the load function. Here I basically use our write_df_to_s3 function, and as input arguments we need the bucket, a DataFrame and a key, so the bucket and the DataFrame are passed in, and we need the key. For the key we can use the part we already created, and I would make its components configurable as well: the first part I call target_key and pass it as an input argument, and the same for the format — maybe we are going to change the format we save to, so it could be CSV or whatever other format.
So we call this parameter target_format — for now it is Parquet — and add it as an input argument as well. Finally, we just return True, and this is our load function. And now I would add one more function that calls all three and runs the whole ETL job. 19. 19. Quick and Dirty to Functions: Restructure Part 2: So I create a function etl_report1. Here I simply call my extract function with its arguments, and the return value is a DataFrame; then the transform function takes this DataFrame and returns our transformed report DataFrame; and finally we call load, and here we can also return True. Now we just need our input arguments. We have one bucket in extract and another bucket in load, but they are different buckets, so I call them src_bucket and trg_bucket. Then we have our objects list, our columns, our argument date, and the target key and target format. And that's it — this is our application layer: extract, transform and load, quite simple, plus one function doing the pipeline of extracting, transforming and loading. Now that we have a good structure using the functional approach, let's create a kind of entry point for our application and call this function main. Here we write all our parameters hard-coded; later we will take care of a proper way to configure the job, but right now we keep the values inside the main function. Then we have to initialize our connections to S3 and finally call the functions that are needed. So let's create our main function, our entry point. First we list our parameters, our configuration — later we want to read this from a config — and we want to use all of them: the argument date, the formats, the bucket names, the columns, the target key and the target format. Then we init our connections: with boto3 we create the S3 resource and then two bucket instances, one for the source and one for the target. I admit it is a bit confusing that we are using almost the same names — src_bucket is just the string with the bucket name, and bucket_src is the bucket object we are going to use. And now we run our application: first we call our return_objects function, which returns a list of objects, or rather a list of files, passing the source bucket object and our argument date; then we call etl_report1 with the source and target bucket objects, the objects list, the columns, the arg date, the target key and the target format. Is anything else left? The source format — we probably forgot it. Yes, in return_objects we forgot the source format as an input argument, so we have to add it there.
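To keep the overview, here is a condensed sketch of the application layer as just assembled. It builds on the adapter functions sketched earlier; the transformation body is abbreviated, and the column and aggregation details follow the earlier quick-and-dirty cells, so treat the names as approximations:

```python
# Application layer: extract, transform, load, plus the report pipeline.
from datetime import datetime
import pandas as pd

def extract(bucket, objects_list):
    df = pd.concat([read_csv_to_df(bucket, obj) for obj in objects_list],
                   ignore_index=True)
    return df

def transform_report1(df, columns, arg_date):
    df = df.loc[:, columns]
    # ... filtering, opening/closing price per ISIN and day, aggregation,
    #     percent change vs. previous closing price (as in the earlier cells) ...
    df = df[df['Date'] >= arg_date]   # column name as in the source data
    return df

def load(bucket, df, target_key, target_format):
    key = target_key + datetime.today().strftime('%Y%m%d_%H%M%S') + target_format
    write_df_to_s3(bucket, df, key)
    return True

def etl_report1(src_bucket, trg_bucket, objects_list, columns, arg_date,
                target_key, target_format):
    df = extract(src_bucket, objects_list)
    df = transform_report1(df, columns, arg_date)
    load(trg_bucket, df, target_key, target_format)
    return True
```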
Where do we call return_objects? Actually only in the main function, so we add the source format argument there, and then we have all our input arguments — this is our entry point now. Finally we want to run our application, and for that we just call the main function. The old cells for the S3 target we don't need anymore; we just keep the reading part. I suggest we run the whole notebook with Cell → Run All and see what happens. I pause the recording — and I get a first error. In the traceback you can see it is in the extract function: df is not defined, because, as we can see, I forgot to remove df_all there. I fix it and run everything again, and another error occurs: df_all again, this time inside the transform function, where I forgot it as well. So I run it again, and now the code itself runs successfully, but there is an error in the reading part: bucket_target is not defined, because we defined it inside the main function. So for the reading check we create the target bucket instance again, and now it lists a third file. Today is the 11th of May, 12:58 right now, and this file was loaded at 12:55 on the 11th — that is our file, so let's read it. The report DataFrame looks good so far: we have data for the 10th and the 11th. Our minimum date was the 9th, but the 9th was a Sunday, so the exchange wasn't trading and of course there is no source data for that day. Well, we could improve the code, but let's just say this is according to requirements — this is how our client wants it — so we don't mind right now; it can be improved whenever we want to. Let's assume our code works fine and matches the requirements. And now we already have a nice piece of code: an adapter layer with three functions, an application layer with four functions, and an entry point with our main function. We have well-structured code that we can easily test and extend, and everyone who looks at it can easily understand it: you read the main function, it says etl_report1, you go to etl_report1, and you see extract, transform and load — easily understandable. So now we can go ahead.
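For orientation, the main entry point as it stands might look roughly like this. The bucket names, column list and date are illustrative values reproduced from the walkthrough and the Xetra source data; the real notebook may differ in details:

```python
# Hard-coded entry point; the configuration moves to a config file later.
import boto3

def main():
    # parameters / configuration
    arg_date = '2021-05-11'
    src_format = '%Y-%m-%d'
    src_bucket = 'deutsche-boerse-xetra-pds'
    trg_bucket = 'xetra-1234'
    columns = ['ISIN', 'Date', 'Time', 'StartPrice', 'MaxPrice',
               'MinPrice', 'EndPrice', 'TradedVolume']
    target_key = 'xetra_daily_report_'
    target_format = '.parquet'

    # init connections
    s3 = boto3.resource('s3')
    bucket_src = s3.Bucket(src_bucket)
    bucket_trg = s3.Bucket(trg_bucket)

    # run application
    objects_list = return_objects(bucket_src, arg_date, src_format)
    etl_report1(bucket_src, bucket_trg, objects_list, columns, arg_date,
                target_key, target_format)

main()
```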
20. 20. Restructure get_objects Intro: Hello. Right now we have a function flow as shown here: we take arg_date as an argument, the first date since when we want to have the report. But what we finally want is that the job runs once a week and loads all the missing files since the last load — and the logic for this we don't have yet. We wrote the function return_objects in such a way that it takes the minimum date, our arg_date, as a string and returns all files we have to extract since one day before that date. So return_objects actually has two tasks to solve right now. The first is to determine the days our application should extract; here we simply assumed that we take all days from arg_date minus one day until today. The second task is to connect to the source and list all files that should be loaded for the determined days. A good idea now is to split this function into two separate functions: return_date_list and list_files_in_prefix. The exact logic of return_date_list we will take care of a bit later. The name list_files_in_prefix is really close to the boto3 API, where we can filter the files in a bucket by a prefix, and this function will be called inside the extract function as well, because there we anyway have to connect to S3. The function return_date_list could even become a call to an external service or whatever. This way everything is decoupled in a better way, and it is also more convenient to write good unit tests. The extract function should now take a list of dates to be extracted as its input argument, and not a list of all objects as before. 21. 21. Restructure get_objects Implementation: Hello. Now we want to restructure our return_objects function. In the last session we discussed why and how we are going to restructure it: there will basically be two parts — the first gets the list of dates we need, and the second gets, for each date, the list of files for that date, and for this we use the boto3 filter by prefix. So let's start with the list_files_in_prefix function and add it to the adapter layer. We need almost the same arguments as before, so I copy them, and we again build a list comprehension — but instead of bucket.objects.all() we use bucket.objects.filter(Prefix=prefix), and the prefix will be a date. In this way we can filter by date and get all files of a date; since it is an object storage there are no folders, so we can only filter by prefix. We return the files, and the input arguments we need are the bucket and, of course, the prefix. That is the list_files_in_prefix function. Now let's write our return_date_list function by rewriting return_objects. Here we need the bucket, the arg_date and the source format as well. This min_date line we keep, and then we need today's date: today = datetime.today(), where we only want the date part. Then we want to build our date list: the first date is our min_date, converted back to a string with strftime and our source format, and we want to get all the dates from min_date until today. We could write a for loop, but the Pythonic way is a list comprehension that iterates over the number of days from min_date until today — min_date plus a timedelta of x days, where x is the running variable of the loop.
We iterate with x in range, starting at 0, because the first date should be the min_date itself — a timedelta of zero days gives exactly min_date. Then we iterate from 0 up to today minus min_date in days; since that difference gives one day less than we need, we add one. Like this we get a date list for all days from min_date until today, and this is what we return. The old code we can remove, and now we have our two functions. return_date_list is actually not part of the adapter layer, so we move it out of there — it is more part of the application layer, but not exactly the core, because it could later even be replaced by another function from another module. Let's see; I just say it is part of the application layer, but not the core. Now we have these functions and we only have to change the main function: instead of calling return_objects we call return_date_list — the input arguments are the same — and the result is now a date_list, which we also pass on. This means we have to change etl_report1 as well: its objects argument becomes date_list, and the same in extract. And in extract we now have to add one more thing: building the objects list by iterating through the date list and, for each date, calling our list_files_in_prefix function. Here we can use a nested list comprehension: first we iterate over the date list with for date in date_list, and for each date we iterate over the list of files we get for that prefix — for key in list_files_in_prefix(bucket, date) — and we collect each key. Like this we get a list of all keys, of all files, for all dates in the date list. And that's it — we should now have a working application with the return_objects function split up. Let's run the code; but first let's set our arg_date to another day — today is the 12th of May in my case, so I set the 11th in order not to have a long processing time. I save and Run All, and we see an error: a syntax error, invalid syntax — I forgot the brackets that the list comprehension needs. I run everything again, and now I called the variable start instead of min_date — when I first implemented this for myself I called it start, and I still had that in my mind — so of course start is not defined right now and we have to call it min_date. Let's run all again and see whether other errors occur. Now it seems to process everything, so let's just wait — processed. We choose the right file to read, and we get our report as we wanted. This is good. So this is working, and we can continue.
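A minimal sketch of the two functions that replace return_objects, plus the nested list comprehension now used in the extract step. Variable names such as `bucket_src` and `date_list` follow the walkthrough; the bucket argument of return_date_list is not used yet but is kept for the meta-file logic added later:

```python
from datetime import datetime, timedelta

def list_files_in_prefix(bucket, prefix):
    # an object storage has no folders, so we filter the keys by prefix (the date)
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]

def return_date_list(bucket, arg_date, src_format='%Y-%m-%d'):
    # bucket is unused for now; the meta-file handling comes in a later session
    min_date = datetime.strptime(arg_date, src_format).date() - timedelta(days=1)
    today = datetime.today().date()
    return [(min_date + timedelta(days=x)).strftime(src_format)
            for x in range(0, (today - min_date).days + 1)]

# inside extract / etl_report1: all keys for all dates in the date list
objects_list = [key for date in date_list
                for key in list_files_in_prefix(bucket_src, date)]
```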
22. 22. Design Principles OOP: Hello. We restructured our code going the functional approach; now let's use this knowledge as input and go the object-oriented approach. But before we come to our code, I want to introduce some design principles that are well-established, battle-tested, object-oriented best practices. I will just go through them quickly and not in detail, but I recommend that you familiarize yourself with these principles if you haven't yet. The first principle is called composition over inheritance, or the composite reuse principle. It says that code should be reused by composing classes: if you already have an implemented class, you can reuse it by holding an instance of it in your new class, rather than trying to find commonality between classes and building a family tree. The next principle is encapsulate what varies: encapsulating anything that is likely to change saves time — wherever there is variation, encapsulate it. Encapsulation refers to objects bundling all of their own attributes and methods so that using them does not affect other objects. When it comes to designing an object-oriented program, requirements that are likely to change in the future should be encapsulated, so that the change can be carried out simply when required. So we can say that encapsulating what varies minimizes the amount of code that may need altering. Then there is the dependency inversion principle, which says we should program against abstractions, not concretions. It states that high-level modules must not depend on low-level modules; both should depend on abstractions. The idea here is that if one class depends on an instance of another class, it should not depend on or create concrete instances of that other class; rather, it should depend on an abstract interface which that class implements, and a concrete instance is injected in. This enables loose coupling and testing. Inversion of control, frankly speaking, follows two steps: separate the what-to-do part from the when-to-do part, and ensure that the when part knows as little as possible about the what part, and vice versa. Basically, the two principles — dependency inversion and inversion of control — are all about removing dependencies from your code: classes should be as decoupled as possible. Also really important are the SOLID principles. SOLID stands for single responsibility principle, open-closed principle, Liskov substitution principle, interface segregation principle and dependency inversion principle — the last one we already discussed. What I want to emphasize here is the single responsibility principle, which states that there should never be more than one reason for a class to change, meaning every class should have only one responsibility. Another important principle is DRY — don't repeat yourself. Whenever you write something that you already wrote, you should rethink your design: write it only once and reuse it afterwards. And the last principle I want to mention is YAGNI — you aren't gonna need it. It states that a programmer should not add functionality until it is deemed necessary: you should always implement things only when you need them, not when you merely foresee that you might need them.
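A tiny, purely illustrative sketch of how composition and dependency injection can look in Python — the class names are examples only, not the design we will build:

```python
import boto3

class BucketConnector:
    """Adapter around one S3 bucket (illustrative)."""
    def __init__(self, bucket_name):
        self._bucket = boto3.resource('s3').Bucket(bucket_name)

class EtlJob:
    """High-level class that depends on injected connectors instead of
    creating them itself (loose coupling, easy to replace or mock in tests)."""
    def __init__(self, src: BucketConnector, trg: BucketConnector):
        self.src = src   # composition: reuse by holding instances
        self.trg = trg
```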
To conclude: that is a lot of theory and a lot of principles, and they work together. Not all of them can be implemented strictly as stated — it always depends on the project and the requirements. 23. 23. More Requirements - Configuration, Meta Data, Logging, Exceptions, Entrypoint: Hello. It is time to discuss all the further requirements we need to incorporate and consider when doing the final code design. Logging is an important topic: we should definitely decide what we want to log and how it should be logged. For our case we assume that everything we want to log should only be written to standard out, and that we only want to log information about the progress of the job. Since the job is supposed to be run by an orchestration tool, we assume this tool will be able to take the logs from standard out, save them, and give us the possibility to process the logs further with whatever tool. To implement this logging there is a very nice built-in Python library with the name — surprise — logging. It provides a nice out-of-the-box decoupling of the logging definition and our application code, so we don't have to reinvent the wheel. We also have the ability to configure the logging using a config file such as a YAML file, and we can log to standard out or to a file. How exactly we implement the logging in Python I will show later, once we are implementing our design. The next important point is exception handling. Here we should always be aware that we are running an ETL job, and the goal is to ensure that the required data is in the target, in time. So whenever something goes wrong in the job — in the data transformations, in reading or writing — an exception should be raised, the job should fail, the orchestration tool should show it, and maybe responsible people should be notified so that someone can check what happened. This means that when writing each line of code you should think about what can happen and how it could fail: is there an exception raised by the packages you use, and is the way those exceptions are handled and logged sufficient for you, especially regarding debugging? There should be a consistent concept, and I see three ways of doing it. The first way is a single point of exit: a wrapper around the main method that catches all raised exceptions and handles them according to our needs. It should be pointed out that, to do so, all kinds of exceptions have to be managed by this wrapper — handling and logging added as wished, and finally an exception raised so that the job, and with it the orchestration tool, fails. The second way is to catch the exceptions inside the methods where they are raised: there they can be handled, logging can be added as wished, and an exception can be raised in order to fail the job if wanted. And the third way is to not catch the exceptions at all — just let them propagate as they are and let the job fail. For our case I prefer a hybrid of the second and the third way: for each individual case I will decide whether the exceptions that are raised are sufficient or not, and in addition I will create custom exceptions to be raised when something happens that is not as required and no exception is raised yet.
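To make this a bit more concrete, here is a minimal sketch of both ideas — progress logging to standard out and a custom exception. The logger setup and the exception name are illustrative; the real logging configuration will come from a YAML file later in the course:

```python
import logging
import sys

# minimal setup: log progress information to standard out
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)

class WrongFormatException(Exception):
    """Illustrative custom exception: raised when something is not as
    required and no built-in exception fires."""

logger.info('Xetra ETL job started')
# raise WrongFormatException('...') wherever a requirement is violated
```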
Another requirement we need is an entry point for the orchestration tool to run the code. The orchestration tool calls our code, according to its needs, through a single entry point, which we can call, for example, run.py: the tool executes the command python run.py and the file run.py is executed — this is our so-called entry point. run.py contains our main function, which we actually already wrote in the Jupyter notebook: the main function first configures the job, then everything that is needed is initialized, and finally our job is called and executed. As you might already guess, another requirement is how we deal with configuration. When we talk about configuration, it is important to understand why it matters. Let's assume we develop our application according to the initial requirements and hard-code values like the source and target endpoint URLs and the column names of the target data. We finish coding and create a Docker image; the scheduler creates a container from this image, starts our entry point run.py, and everything works fine, since we ensured the application runs by doing a lot of testing before deployment. Suddenly the requirements change: the target columns should be renamed. What do we have to do as developers? Change the code, run all the tests, commit, and deploy again — which means another version of the application code and a new Docker image are created. Then the scheduler runs it and everything works fine again. Now environment parameters change, such as the source or target endpoint, or we want to deploy the code to another environment — which we have to do at least twice anyway, since developing in the development environment means deploying to the integration and then the production environment. Every time, the application code needs to be changed, tested and committed, and every time a new Docker image is created. Sure, this is a process that works, but you as a developer always have to touch the code and test it. What if we could reduce the effort so that developers do not have to touch the code every time some parameters change? This is where a configuration file comes into the game. Let's look at the process when configuration and application are separated: we get the initial requirements, write the code and create an image that can be reused, and all parameters that could change we write into a config.yml file. Now, when the scheduler wants to run the code, it takes both at runtime: it creates a Docker container from the image, mounts the configuration file inside the container, and the application runs as expected. Now the requirements change and the target columns should be renamed: you only have to change the config file and create a new version of it — the image stays the same, since the application code did not change — and it just runs as expected. And again, if you want to change the environment, you just create another config file for that environment, take the image, which is still the same, and the scheduler can run it. As you can see, the application code does not have to be changed, and the effort for development and maintenance is much lower. Where do we actually read in our config file? Inside our main function, in our entry point run.py, there is a configuration block: there we just read in all necessary parameters from the config file and pass them to the other parts of the code. That's it — it doesn't seem so difficult, right?
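As an illustration, the configuration block in main() could look roughly like this — the file path and the key names inside the YAML are assumptions, and PyYAML's safe_load does the parsing:

```python
import yaml

def main():
    # configuration: read all parameters from a mounted YAML file
    # (example keys: s3 -> src_bucket / trg_bucket, application -> src_format, ...)
    config_path = 'configs/xetra_report1_config.yml'   # illustrative path
    with open(config_path, encoding='utf-8') as config_file:
        config = yaml.safe_load(config_file)

    src_bucket = config['s3']['src_bucket']
    trg_bucket = config['s3']['trg_bucket']
    src_format = config['application']['src_format']
    # ... init connections and run the job as before ...
```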
And one thing is still left that we haven't talked about much before: how do we ensure that all files from the source are processed if we schedule our job, for example, once per week? If the job fails, afterwards we don't know what was processed and what wasn't, and we also don't want to reprocess everything week by week. The answer is using a meta file for job control. Just to remember: here we see the part of our code with list_files_in_prefix inside the extract function and the return_date_list function. We deliberately separated them, and now we can implement return_date_list as needed — in such a way that it processes a meta file, meta_file.csv. So let's design our meta file process in detail. The meta file can be quite simple, so a CSV file is sufficient, and we can read it easily using the pandas read_csv method. As input argument we have our date argument, here called min date, and let's assume we want to process all files after the 22nd of April 2021; today's date we assume to be the 25th of April 2021. How could the structure of the meta file look? Even one column would be sufficient: a date column in which we write the dates of the source data that were already processed by our job. Here we can see that the 21st and the 23rd were already processed, but our min date says that we want the 22nd to be processed as well — for whatever reason the 22nd was not processed, maybe an error occurred or whatever. The easiest way is to first build the list of dates from the min date until today — for our case the 22nd, 23rd, 24th and 25th. Then we load the processed dates from the meta file and get a list containing the 21st and the 23rd. The last step is to determine all dates since the first date that is not in the meta file and is greater than or equal to the min date — like this we get the 22nd, 23rd, 24th and 25th. I am aware that this way we process the 23rd again, but since we want to calculate the change of the closing price compared with the day before, for the 24th we anyway have to read the data of the 23rd. That's why, to keep it easy for our case, we just read all days starting from the last unprocessed day that is later than or equal to the min date. The important thing is that all source files will be processed: if the job fails one week, the next week — or a manually triggered run — detects the unprocessed dates automatically. The second column of the meta file, the processing column, is just for tracking purposes, to see when exactly which date was processed last. So now all requirements are clear, I guess. Let's do a small proof of concept for the meta file in the next session, to see whether we can realize it the way we thought, and after that we can do the final design of our code. 24. 24. Meta Data: return_date_list Quick and Dirty: Hello. Let's now do a quick and dirty version of our return_date_list function. First we have to create a meta file, so I open Notepad++ and write a header: we are going to have two columns, and I call the first one source_date and the second one datetime_of_processing — I guess that is more meaningful. Then, as in our example, we add the 23rd of April with a processing time of, say, 12:33:23 on the same date, and the 21st of April with a processing time of 12:30:21. I save this, and then I upload the created meta file to the bucket we created on AWS S3: in the Management Console, already inside the bucket, I choose Upload, pick the meta file and upload it. Now we have the meta file and we can work with it. Back in the Jupyter notebook, I insert another cell, and what we can use is our read_csv_to_df function — I just copy it here.
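For reference, the meta file just created looks roughly like this (the timestamps are only approximate and the date format follows the one used elsewhere in the course):

```csv
source_date,datetime_of_processing
2021-04-23,2021-04-23 12:33:23
2021-04-21,2021-04-21 12:30:21
```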
The first thing is just to read our meta file and see that we can read it. So I create a string meta_key = 'meta_file.csv' — that is how our meta file is called — the target bucket name is xetra-1234, then s3 = boto3.resource('s3') and bucket_target = s3.Bucket(bucket_name_target), and our DataFrame df_meta we read by calling our function with bucket_target and the meta key. Let's see — this is our meta file, so we can read it. Then we have our arg_date, or min date, however we call it: we said this is the 22nd of April, and our today's date, which I call today_string, is the 25th of April. Now we can just reuse the logic of our return_date_list function — I copy it and paste it here, and the only thing to change is that today should be built from today_string in the same way; and of course we need the source format, so we copy that as well and execute again. Our date list is now the list from the argument date minus one until today, which we decided is the 25th of April — correct as a first step. Now, if we take this date list and remove all the dates we have already processed according to the meta file — the 21st and the 23rd — we can determine the first date that has to be processed: the 22nd. But since we also want the previous day's change, we have to calculate minus one day, so the 21st is the first date to extract, while our minimum date for the report will still be the 22nd — the date since when we want to have the report. A good idea now is to transform these dates into datetime objects, and the same for the dates in the meta file. So we can just remove the strftime here to keep datetime objects, and for our meta file we build — let's call it src_dates, the source dates that were already processed — and I create a set (we will see why): pd.to_datetime(df_meta['source_date']).dt.date, using the column name, then .dt and .date. Let's see — this is what we wanted. Now we can determine our actual min date: it is the minimum of the set of our date list minus the set of source dates — because with sets we can simply use the minus operator — and then, as we said, minus a timedelta of one day. The min date should then be the 21st, if I am correct — yes. And now we can calculate the actual return dates with a list comprehension that converts back to strings: for each date in the date list, if the date is greater than or equal to our min date, we take date.strftime with our source format. Then we have our return dates — from the 21st until today, yes, which is what we want to return. What else do we need? The min date needs to be returned as well, because that is what we finally need for the filtering inside our transform function, as you might remember: we can use min_date plus a timedelta, or actually we can just use our arg_date. Let's try several options.
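Put together, the quick-and-dirty date logic from this session might look like the following sketch. `df_meta` is the meta file read above, `src_format` is '%Y-%m-%d', and the example values are the ones used in the walkthrough:

```python
from datetime import datetime, timedelta
import pandas as pd

src_format = '%Y-%m-%d'
arg_date = '2021-04-22'
today = datetime.strptime('2021-04-25', src_format).date()   # today_string
min_date = datetime.strptime(arg_date, src_format).date() - timedelta(days=1)

# all dates from min_date until today
dates = [min_date + timedelta(days=x) for x in range((today - min_date).days + 1)]

# dates already processed according to the meta file
src_dates = set(pd.to_datetime(df_meta['source_date']).dt.date)

# dates from arg_date on that are not in the meta file yet
dates_missing = set(dates[1:]) - src_dates

if dates_missing:
    # earliest unprocessed date minus one day, so the previous close can be calculated
    min_date = min(dates_missing) - timedelta(days=1)
    return_dates = [date.strftime(src_format) for date in dates if date >= min_date]
else:
    return_dates = []
```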
Now I remove what we no longer need and move everything into one cell, and let's play through a few scenarios. Imagine we want the 24th as the first day to be processed: I expect all days since the 23rd to be extracted, with a minimum date of the 24th — and the return dates are correct. Now let's try the 23rd. Since the 23rd was already processed, it should determine that we only need the 24th and the 25th in the report, so the first day to extract should be the 23rd, one day before the 24th. But the code says the 22nd. The reason is that our date list starts at arg_date minus one day — the 22nd — and we subtract the processed source dates from the full list including this first day, then take the minimum and subtract another day; that is why we got this result. So when removing the source dates we have to use only the second item onwards, dates[1:], not the list from the start. Let's try again — now it is correct. Next, let's simulate that all days until today are already processed. I don't want to change the meta file right now, so I set today_string to the 21st and the arg_date to the 21st as well. What happens? An error, because min() gets an empty sequence. So we have to handle this case: I call the set difference dates_missing — and by the way, I do not want to fail the job when it is empty. The job is correct in that case, because of course all days were already processed, so it should simply return an empty list; failing would give the wrong impression that something is wrong. So: if dates_missing is not empty, we execute the logic as before; else we return an empty list of dates, and the returned min date is just a date far in the future, datetime(2200, 1, 1).date() — like this we also know that there is no date to be extracted. Trying it again: return_dates is empty, and we can even check return_min_date — it is the future date, as expected. Trying the earlier setup again, we get what we expect. You can play around with it — it is working the way we were thinking — and in the next session we can form a function out of this. 25. 25. Meta Data: return_date_list Function: Hello. Now we want to change the return_date_list function using the code we wrote in the last session. Let's check what we have in the current return_date_list: the min date, today's date and the date list. From the new code, the bucket we don't have to create; what we need is the reading of the meta data and then essentially everything else, which we insert into the function. So far, so good. One thing I should change: we were using return_date_list both as the function name and as a variable name, so I call the variable dates instead, just to not get confused. Well, just one thing is left, but I will not remove it.
Because let's imagine we run our job for the very first time and there is no meta file yet. What will happen? We call the return_date_list function, it connects to S3 to read the meta file, and it throws a NoSuchKey error — and that is what we don't want. We want the job to run even if there is no meta file; in that case we simply return all our dates from min_date until today. That's why it is good to catch the error, and the way to do it is a try/except block. The two lines computing min_date and today don't have to be inside the try block, because there the error cannot be thrown anyway; it can only happen when we call the read method. So the try wraps everything we want to execute in case there is a meta file, and in the except we catch the S3 client's NoSuchKey exception through the bucket instance; in that case we return the dates as the full list, and the returned min date is anyway our arg_date. So here we can see that in both of these cases return_min_date is just our arg_date — though we actually have three cases: in the third, when everything is already processed, it is the far-future date datetime(2200, 1, 1). Okay, no duplicated code. And we return the dates and the min date — the variable here is return_dates, so that is what we return together with return_min_date. One thing is left: we have to use the key of our meta file as an input argument, meta_key, so that we can pass it whenever we call read_csv_to_df. That is our function now — this should work so far — so we can clean up the old cells and execute the code. Today is the 14th of May, so I set the arg_date to, let's say, the 12th. We also have to change our meta file: I open it and set the processed days to the 12th and the 13th of May — the processing timestamps we can leave as they are, they are not important for our method; as I said, they are just for tracking and documenting. The final result should then be that we only get the report for the 14th. In the AWS Management Console with our S3 bucket I click Upload, choose my meta file, upload and close. Now we can execute the notebook with Run All and just wait. And sure enough there is an error: I forgot the argument meta_key — not in the etl_report1 function, but where return_date_list is called — and our meta key was 'meta_file.csv'. Let's run it once again — and again an error, this time a NameError: bucket_target is not defined, because here we only have bucket. And one thing I notice I forgot as well: we have to use the target bucket here, and not the source bucket, because the meta file will be in the target bucket, not in the source. And let's call our return values extract_date and date_list.
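For reference, here is a sketch of where return_date_list ends up after this session and the small min-date fix described in the next run. It relies on the read_csv_to_df adapter function from earlier; reaching the modeled error via bucket.meta.client is one way to get at boto3's NoSuchKey exception:

```python
from datetime import datetime, timedelta
import pandas as pd

def return_date_list(bucket, arg_date, src_format, meta_key):
    min_date = datetime.strptime(arg_date, src_format).date() - timedelta(days=1)
    today = datetime.today().date()
    try:
        # meta file exists: only return the unprocessed dates
        df_meta = read_csv_to_df(bucket, meta_key)
        dates = [min_date + timedelta(days=x)
                 for x in range((today - min_date).days + 1)]
        src_dates = set(pd.to_datetime(df_meta['source_date']).dt.date)
        dates_missing = set(dates[1:]) - src_dates
        if dates_missing:
            min_date = min(dates_missing) - timedelta(days=1)
            return_dates = [date.strftime(src_format)
                            for date in dates if date >= min_date]
            return_min_date = (min_date + timedelta(days=1)).strftime(src_format)
        else:
            return_dates = []
            return_min_date = datetime(2200, 1, 1).date()
    except bucket.meta.client.exceptions.NoSuchKey:
        # first run, no meta file yet: return everything since arg_date - 1 day
        return_dates = [(min_date + timedelta(days=x)).strftime(src_format)
                        for x in range((today - min_date).days + 1)]
        return_min_date = arg_date
    return return_min_date, return_dates
```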
And this extract date will be our arc date here for our ETL report. This we have to use right now, like this. And now let's run it again. And now it seems to run. Let's just wait. The code successfully executed we can use here are our new file. And so here we can see did I made something wrong because we only wanted to have the report for the 14th, but we have the 13th. So there should be something wrong with our date, probably with Our extract date here. Let's see what we get inset or a return date list. So whenever there is no metafile, what does our return mandate? It is our update. This means in our case, our arc date was the 12th. And then we want to have the data since the 12th. This is correct. If it is today, if all days were already processed, this is this path here. If there's a metafile and days were processed, this means also today, then we get this mandate. Okay? And an empty list of dates. This is not the case. We got the list. Then this means here should be the wrong code. Yes, here we get returned mandate Octavia. It wasn't my mistake because now we have the Min data actually are arc date is the 12th, and then we give it here, we return it here. Then we even submitted to our ETL report and to our transform function here. And finally, this is what we filter. So we use all dates after or equal. The 12th of May. This is for sure not correct. So here we have to use actually this one, this is our correct mandate. So if we are using here mandate, we have 2 plus timedelta again, because here we remove or reduce one day. We can actually use this. And then we want to convert it to a string. And here we can use string F, time and the source formed. And now it should work. Hopefully. Let's run it a gain and weight the code executed now, and let's use our latest file. And now we only have the report for the 14th. So this is correct. Now, this means now we have our function return date list as we wanted. And in this way, we automatically will only use the unprocessed days of the source whenever our job to run. 26. 26. Meta Data: update_meta_file: Hello. Now we are able to get a list of dates generated from our metafile. And now we want to write another function to update omega phi whenever a job run. So therefore, here I open the Jupyter notebook. And here in the application layer, not the core, I will add a function update metafile. And as arguments, I need the bucket, the Meta key, and a list of the dates to be updated. And this will be just the list of the extracted dates. So I call it extract date list. And the first thing, I want to create a new DataFrame, and then I read the old dataframe. This means the DataFrame, the data that is already in our modifier. And then I just can concatenate the new and the old dataframe and right back to our S3 mystified. So d f mu equal to pd DataFrame. And then the columns equal to source date, datetime of processing. So we have two columns, and this will be an empty DataFrame at first. And then we add data for our source state column. And this would be the extract date list and the date time of processing column. Here we are at the today's timestamp. So I use datetime dot today and string f time. And here we can use our actually our source format as well. This would be another format. So at first I just use it as a string. I write at hard-coded year minus month, minus d. And later we can think about if we want to parameterize it or not. So now we have the new DataFrame, and now the old dataframe. We can just read csv. Read CSV to DataFrame. 
And then we use the bucket and the meta key. Then we concatenate df_new and df_old and create a df_all with pd.concat of df_old and df_new. And finally we can use our write DataFrame function. Actually, we cannot use it: we have to create another function, because we want to write a DataFrame to S3 as CSV and not, as in our case here, as Parquet. So we can almost copy and paste the function we already have; I just call it write_df_to_s3_csv. And here we have to use StringIO instead of BytesIO, which is already imported. The df.to_csv into the output buffer we can leave as it is, and then put_object. This looks good. Now we can use this function: write_df_to_s3_csv with the bucket, df_all as the DataFrame and the meta key as the key. And now we just have to add it to our other functions; where exactly do we want to update our meta file? The best place is most probably right after the load, right after writing the DataFrame to S3. Once the report is written, we can update our meta file to record that exactly these source files were processed. So here we pass our extract_date_list, the meta key and the bucket we can use from here, and these two arguments we have to add as parameters. Then we also have to add them to our ETL report function: the meta key, and the extract_date_list, which we can create from our date list. As you remember, our date list starts one day before the first date we actually want to have inside the report, so we just have to filter it. Here we can use a list comprehension: date for date in date_list, with the filter if date >= arg_date, which is the first day we want to extract. The meta key we already have here. And we also have to add this inside our main function when calling the ETL report: the meta key, meta_file.csv, and here we add it. And now, hopefully, it should work. Let's see: what is our arg_date and what is today's date? In my case today is the 20th of May, so I just use the 19th. This means that once the job is done, the meta file should be updated with the 19th and the 20th. So let's see how it works and run. Now we wait. Here we have an error, and what did I do here? In our load function, no, not in the main function, but in our ETL report: df, target key, target format. Here we have to use the target key. I don't know what happened there actually, but I can just run it again and wait. Now it finished successfully. Let's check in our AWS Management Console, inside the bucket; mine is the xetra-1234 one, and then the meta file. I can download it and open it, and here we see our updated meta file: the 19th and 20th are added as expected. The only thing I notice is the format of our datetime_of_processing column. I forgot to add the time, I only added the date, but this we can change later, it's not that important. The main thing is that it's working, it's updating, and the job runs. 27. 27. Code Design - Class design, methods, attributes, arguments: Hello. Finally, we are going to do a proper code design. Here we can see all the functions we created; since we already grouped them before, why not just keep the grouping as it is? You probably noticed that we only have one write_df_to_s3 function and not two.
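As a quick recap before the design discussion, here is a sketch of the two pieces built in the last lesson: the CSV writer (the second write function just mentioned) and update_meta_file. read_csv_to_df is again the earlier notebook helper, and the exact column names and the index=False choice are assumptions on my side.

```python
from datetime import datetime
from io import StringIO

import pandas as pd

def write_df_to_s3_csv(bucket, df, key):
    """Write a DataFrame to S3 as CSV (sketch of the notebook helper)."""
    out_buffer = StringIO()
    df.to_csv(out_buffer, index=False)
    bucket.put_object(Body=out_buffer.getvalue(), Key=key)

def update_meta_file(bucket, meta_key, extract_date_list):
    """Append the freshly processed dates to the meta file on S3."""
    df_new = pd.DataFrame(columns=['source_date', 'datetime_of_processing'])
    df_new['source_date'] = extract_date_list
    df_new['datetime_of_processing'] = datetime.today().strftime('%Y-%m-%d')
    df_old = read_csv_to_df(bucket, meta_key)  # notebook helper from the earlier lessons
    df_all = pd.concat([df_old, df_new])
    write_df_to_s3_csv(bucket, df_all, meta_key)
```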
Back to the write function: here we have two ways to go, either a separate function for each file type, or one common function that takes the output format as an input argument and uses conditionals to execute the right logic. The pros of separate functions are that each function itself stays straightforward, and we get more flexibility to set things up like separator, encoding and so on, depending on the output format. The con is that we will most probably have duplicated code. So I decided to use one function in order not to have duplicated code, and because the requirements allow it; we don't need much flexibility in the settings yet. So we can take the functions list_files_in_prefix, read_csv_to_df and write_df_to_s3 and create a class S3BucketConnector. We can also decide that the S3BucketConnector should be written in a file s3.py. Why do I like it this way and not differently? Since we want to do object-oriented programming, it is about objects as representations of the real world, and whenever we have to connect to source or target systems, we need an interface. So our S3BucketConnector class is just the interface to the source and the target, and its methods are everything we need to connect to the S3 buckets. This way we are also following the single responsibility principle: whenever there is a change in the interface of the S3 buckets, only the S3BucketConnector is changed. Its only responsibility is the communication with the S3 bucket. The functions update_meta_file and return_date_list can be grouped in a class MetaProcess, and I will save it in a file meta_process.py. Here it is more or less the same as with the S3BucketConnector: it is the interface object for the communication with the meta file. Then we have our ETL methods: extract, transform_report1, load and etl_report1. For them we create a class named XetraETL and save it in the file xetra_transformer.py. Here we have to find a compromise between abstraction and the YAGNI principle, "you aren't gonna need it". We want to write the XetraETL class as abstract as possible; one could even think about an abstract parent class with extract and load methods and, in the future, multiple child classes. But since we don't have more requirements about what other functionalities and reports we need, I would just recommend taking the four functions we have and creating one more or less abstract class, in a way that it can be extended. And we have our main function as the entry point; we create just a run.py file. Somehow there has to be a starting point, even in object-oriented programming: one that does not start with a class, but where you create an instance of a class and call its methods. Now let's focus on the S3BucketConnector and how we are going to create it. In Python there are three types of methods in classes: instance, class and static methods. I provided a link in the course material to an explanation of the differences between these method types. Here we just use instance methods, because we need access to the instance attributes in order to list, read and write. As instance attributes I decided to have the endpoint URL as a string, the boto3 session, the boto3 session resource s3, and the s3 resource bucket. The idea here is to create a separate instance of the class for each bucket we want to connect to. Like this, we will have one instance or object for the source and one for the target. And then we have the arguments we need to connect to an S3 bucket.
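Translated into code, this design might look roughly like the sketch below. It is a design-level sketch only: the full class is written later in the course, the attribute names follow the description above, and the method bodies are intentionally left empty.

```python
import os

import boto3

class S3BucketConnector:
    """Interface to an S3 bucket (design-level sketch)."""

    def __init__(self, access_key: str, secret_key: str, endpoint_url: str, bucket: str):
        # access_key / secret_key are the *names* of environment variables,
        # so no credentials end up in the code itself
        self.endpoint_url = endpoint_url
        self.session = boto3.Session(aws_access_key_id=os.environ[access_key],
                                     aws_secret_access_key=os.environ[secret_key])
        self._s3 = self.session.resource(service_name='s3', endpoint_url=endpoint_url)
        self._bucket = self._s3.Bucket(bucket)

    # one instance per bucket; the methods are filled in during later lessons
    def list_files_in_prefix(self, prefix: str):
        ...

    def read_csv_to_df(self, key: str):
        ...

    def write_df_to_s3(self, df, key: str, file_format: str):
        ...
```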
There are the access and secret key strings. For security reasons we don't put the keys in our code; we just use their names, and in order for it to work, there should be environment variables with these names containing the keys. When we create a Docker image with our code and spin up a container using the image, we can just create the environment variables inside the container during its creation, and the keys can be stored in a more secure way than just inside our code. In addition to the keys, we need the endpoint URL and the bucket name as strings. So basically the design for our S3BucketConnector is done. But since I decided to use just one write method, I have to build if-statements that decide between CSV and Parquet, and I don't want to hard-code those strings. Therefore I will create a constants.py file defining everything that will hardly change over time and is hard-coded. Having all constants separated in one file just provides better structured code. For the file formats we use, I create a class S3FileTypes. Right now we only have CSV and Parquet, but in the future more file types could be added. These will be class attributes, in order to use them immediately without the need of creating an instance. With this, we are more or less just namespacing the parts of the code that belong together. In the same way, I just want to group the methods for the meta file and write them as static methods. Right now there is really no need to build the MetaProcess class in a way that we have to create instances of it and initialize it with arguments; the way we use the meta file just seems too static, and most probably it won't change over time at all. Actually, it would even be sufficient to just create two functions and group them in one file, just as functions. But let's already create a class to have a consistent structure, and who knows how the requirements change over time; maybe later we will have to refactor the MetaProcess class. The formatting of the meta file could either be hard-coded in our MetaProcess class, or configurable, or hard-coded in our constants.py file. I assume that it does not change over time, so I decided to create another class MetaProcessFormat inside constants.py. Here we only need attributes as strings: the meta date format, the meta process date format, the meta source date column, the meta process column, and the meta file format. In case there are changes, we only have to change them here. Now let's come to the XetraETL class. Whenever we want to execute our Xetra ETL job, we want to create an instance of the XetraETL class. This instance should, first of all, be able to execute our functions or methods, so we write them as instance methods. To execute our methods, we need some instance attributes. We need a connection to our source and target buckets; according to the principle of composition over inheritance, we use S3BucketConnector instances to provide these connections. Then we need the meta key and the source and target arguments; these will be all the arguments we choose to be configurable, and the datatype we choose for them we will discuss right after. And there are the extract date, the extract date list and the meta update list. All these instance attributes are basically the values we need as arguments to execute the methods. We want the S3BucketConnector instances to be provided from outside, because we don't want to deal with instance creation inside our class.
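For reference, the constants.py module described here might look roughly like this. Only the class and member names are taken from the description above; the concrete string values are my assumptions.

```python
from enum import Enum

class S3FileTypes(Enum):
    """Supported file types for the S3BucketConnector."""
    CSV = 'csv'
    PARQUET = 'parquet'

class MetaProcessFormat(Enum):
    """Formats and column names used for the meta file (values are illustrative)."""
    META_DATE_FORMAT = '%Y-%m-%d'
    META_PROCESS_DATE_FORMAT = '%Y-%m-%d %H:%M:%S'
    META_SOURCE_DATE_COL = 'source_date'
    META_PROCESS_COL = 'datetime_of_processing'
    META_FILE_FORMAT = 'csv'
```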
Coming back to the XetraETL class: we want encapsulation, to be as independent as possible from the inner structure of the S3BucketConnector. The meta key and the source and target arguments will be configurable in the config YAML, so they also have to be provided from outside during instance creation. The extract date, extract date list and meta update list I decided to calculate during instance creation, using the MetaProcess method return_date_list. They could also be calculated outside of this class and submitted as arguments, but since we can't execute our ETL job without them, I decided to have them inside the instance to make sure the correct values are used. So what about the source and target arguments that come from the configuration file? The easiest way would be to just read them in the main function and submit each parameter separately as an argument, but that would be a lot of arguments. Another way to write less would be to use double-star keyword arguments as the class input and submit a dictionary. Like this we don't have to write much, but it makes the code hard to read; it won't be obvious where and which parameters are used. For this reason, I decided to create a named tuple class for the source and one for the target. I call them XetraSourceConfig and XetraTargetConfig, and they will be written in the same module, xetra_transformer.py. With this approach, we can just create a named tuple instance by submitting the dictionary in the main function, and then use the named tuples as input arguments of the XetraETL class, and the code is still readable. Instead of a named tuple, we could use an object or a data class as well; I provided a link with further information about the differences, but for our case it doesn't really matter which one we use. The two named tuples basically take everything I found reasonable to parameterize: the first extract date, the columns to be used during extraction, the different column names to be used during transformation for source and target, the key of the target file, the date format in the target filename, and the target file format as well. As you can see, they are attributes inside the named tuples. And now we can replace the question marks in our XetraETL design accordingly with XetraSourceConfig and XetraTargetConfig. Now we are almost done with our class design; just one more thing: I want to add one file for custom exceptions. There we can create our own exceptions if we need them. Now let's take a look at how the flow of our program looks. First of all, we have our orchestration tool; it will call the entry point named run.py. Inside run.py the main function is called, where at first we read all parameters from the config file. Then we initialize the S3 connections using the S3BucketConnector. After this we create instances of the named tuples for the configuration, and then an instance of the XetraETL. During the initialization of the XetraETL instance, the MetaProcess class is used to process the meta file and determine the dates that should be extracted. After initialization the execution can be triggered, and at the end of the execution the meta file will be updated. Now we are done with the code design. We determined which files, classes, arguments, attributes, methods and functions we want to create and how the program flow looks. I guess we have a good idea of what needs to be implemented now. 28. 28. Comparison Functional Programming and OOP: Hello.
I think now we can conclude by comparing the functional and the object-oriented approach when we talk about ETL jobs and data pipelines. We more or less first implemented the Xetra job using the functional approach, and afterwards we designed the code using the object-oriented approach. Using functional programming is quite natural and intuitive and, at first, easier to implement; for object-oriented programming we need to put more effort into a good design. All in all, as we saw, both approaches are suitable and everyone can decide which approach to go with. I prefer object-oriented programming because it provides more features to structure the code, do good encapsulation, and make the code easier to extend and reuse. Especially when it is going to be a longer and bigger project, I would always prefer the object-oriented approach. 29. 29. Setting up Git Repository: Hello. Now we want to create a GitHub repository, clone it to our local machine and do a first commit in a new branch. Therefore I open my GitHub; my username is itsme87. Here I just want to create a new repository, and I call it xetra_1234. I skip the description, I keep it public, I add a README file, I add a .gitignore and choose Python, and I create the repository. Then, as a next step, if you don't have it yet, you have to install Git locally. On the website git-scm.com, under Downloads, you can download the latest release for your operating system. In my case it's Windows, and I have it installed already, so I will not show it; I'm sure you are able to manage it. Once it is installed, we can go inside our GitHub repository, click on Code, and copy the HTTPS URL. Then I go inside my project folder, it is xetra_project, do a right-click and "Git Bash Here", and there I type git clone and paste the URL. And now it has cloned the repository. Then I take my Pipfile and Pipfile.lock and copy them here. I copy the path, open the command line, cd to my folder, and then I run pipenv install. Now it is creating a new virtual environment and installing the dependencies according to the Pipfile and the Pipfile.lock. This just takes a moment, so I pause the recording. Now all dependencies are installed, so we can do pipenv shell to activate our virtual environment and pip list to see which packages were installed. This is what we need. Now we can go to Git Bash and I cd into my repository. Here we can see we are on the main branch; at first I want to create a new branch, so I do git checkout -b and call it develop. This is the develop branch where I am going to create the code. Now I run git status, and here we can see we have two files that are not added and not committed. I can do git add *, so all files are added; git status, and here we have our Pipfile and Pipfile.lock, they are green now. Now I can do git commit -m and paste a comment, for example "initial version with Pipfile". Then I can do git push, and here we can see that we don't have an origin develop branch on our GitHub repository yet, so I have to do git push --set-upstream origin develop. Now it's pushing. We can check our GitHub repository, and there we see our main branch and our develop branch. I click on the develop branch, and here we see our Pipfile and Pipfile.lock. So we set up our GitHub repository and our local Git repository. 30.
30. Setting up Python Project - Folder Structure: Hello. Now we are going to set up a proper folder structure for our project. Here I am inside my Git repository xetra_1234. At first I create a new folder and call it xetra; this will be the folder for our source code, so inside it we are going to create all our source code. Another folder we need I call tests; here will be all the written tests, unit tests and integration tests, in order to test our modules and source code. Then we need a new folder configs; inside this configs folder I will put our configuration file, or maybe several configuration files, whatever we need. And inside our xetra folder I create one folder common, for common libraries and common modules, and a folder transformers. Here will be our actual transformers, the transformation for our Xetra report; in the common module will be, for example, the meta process and the S3 bucket connector. The same structure we are going to create for our tests as well: common and transformers, plus one folder integration_test. Inside common and transformers will be the unit tests, and the integration tests will test everything together. One more thing we need is an __init__.py in each of our folders, in order to be able to import our modules. So with Notepad++ I just create an empty init file; this is everything we need, nothing more. I just have to find our xetra_1234 folder, and the first place where we save it is here, as __init__.py. And here it is, a Python file. Now I just copy and paste it into the transformers folder, and actually into each of our folders. Let's see: here we don't need it, but inside our tests as well, here, here, here, and most probably here as well. The init file we need in order to import the modules we write. For example, in our common folder we are going to write the S3 bucket connector and the meta process, and to be able to import that code inside our transformer, for example, we have to provide init files. This is necessary for Python. And this is our folder structure. 31. 31. Installation Visual Studio Code: Hello. Now it's time to choose an IDE. In my case I choose Visual Studio Code, so I opened the website here. You can download it if you also want to use Visual Studio Code, according to your system requirements. I downloaded the Windows version, and I will not show you how to download and install it; I'm sure you are able to manage it. I already started it. The way I usually open Visual Studio Code: I go inside my directory, which is my development working directory, copy the path, open the command line, cd there, and then I just run the code command. And then Visual Studio Code opens here. Important to know is that you have to install extensions. There is a Python extension for Visual Studio Code; here on the left side we have Extensions, and here we can see which extensions are installed: Jupyter, which is actually a global extension. Important is that you have the Python extension installed. I also have Pylance; that is something going on under the hood. Basically, the Python extension is what you need in order to be able to develop here with Visual Studio Code.
And then we should open our working directory or folder. So here on the left side, we can go to Explorer. Here you can see that my photo is already opened, cetera One 234. If it is not, in your case, you can just click on File and Open Folder. And then here you can choose your directory, select folder, and then it's open and then you can work on it. 32. 32. Setting up class frame - Task Description: Hello. Now it's time for you to make your hands dirty. In the same way we just created the S3 dot PY file and the S3 bucket connector class together. Now it's your turn to create the Meta process PY and emitter process class with the methods update metafile, and return date lists, just leave them empty. Afterwards, I would recommend that you create the constants dot p-y file with the enum classes as 35 types and Meta process format containing the, all the constants for our history class and for the meter process class. And then there are custom exceptions. We don't know what exactly we need right now, but you can already create at least the file. And then there is our exedra transformer PY contain the cetera IDL class and our two name2 pits. This you already can create, just leave all methods empty, or at least providing the pass that you can execute the code. And then there is finally our entry point or run dot p-y with our main function. You don't have to write code inside, but you already can create it. So the solution I provided in the course materials. And after, I'm going to show you what is the code and we are going to discuss it in the solution videos. So have fun. 33. 33. Setting up class frame - Solution S3: Hello. Now let's start writing our classes. We're just going to write our classes and whatever we need for our classes. And the methods we are just going to list, but not to fool these, we are going to fill in a bit later. So therefore, I want to start with the S3 bucket connector. And at first, I want to create a new file here. I save it here. Under cetera, in the common folder and in our class design and our co-design, we decided, or I decided to call it as x3 dot p-y. So it is a Python file as three. And now the first thing I want to create is a docstring. So docstrings are really important for clean code, like comments. In order to make everything readable, I just call it connector and methods accessing S3, you can write whatever you want for other people to make it understandable. And then we can do our imports. What we need is import OS and then import bow 2, 3. And our class is called tree bucket connector, right? Basically it's just writing, just typing. Now the design, we already did the names, what attributes we want, and so on. So then again, docstrings for our class. So this is a class for interacting with S3 buckets. And the first thing we want to create is an init method. So the init method is needed in order to initialize an object. So we create a class and we want to create an instance of the object. We want, for example, to connect to our source bucket. So therefore we create an instance of this class for our sauce bucket. And whenever we have some arguments, we need the init method. And in Python we have to use the self keyword. And this tells the Python interpreter that actually here you're referencing to the instance. So this is an instance method. And like this you're telling, okay, Well whenever you create an instance, for example, our source packet connector, then we want to call the init function or the init method. And our arguments we are submitting here. 
So what are the arguments we want to use this as x is key and what is Willie help fool in Python? Actually I use it, our type hints. So usually you don't have to tell in Python what type is which variable or argument or whatever. But in order to make it more readable, it is good practice, best practice to provide typing. So I'm just telling with a colon and then it is a string. And then we need the secret key is a string, endpoint, URL, string, and the bucket string as well. And now doc strings as well. So this is the constructor for S3 bucket connector. And here I am telling what are the arguments of the parameters. I have the x's key. This is x is key for accessing S3. Then I have param, secret key. This is the secret key for accessing S3. Then I have to per endpoint URL. And this is the endpoint URL to S3. I mean, kind of self-explaining. But anyways, you can write whatever you think is good to make it more understandable, more readable. And then we have bucket. So these are just commons S3 bucket name. Whenever you call Actually this class doc strings out there that you can read it in the shell. And now these arguments in order to be able to access them inside the instance using other instance methods, we have to create instance variables, self-taught. Endpoint, URL, endpoint URL, like this. You're telling, well this endpoint URL we can use as self.init point URL inside our class whenever we create an instance, then self-taught session. So here I use BOT-2 three session. This is what we used in our Jupyter notebook, right? And the AWS X is key ID equal to OS environ Xs P. And then a lot of stuff is here. And then AWS Secret x is key, OS environ, secret key like this, using OS dot environ. Actually, whatever we submit as arguments here, x is key. We decided to use it as environment variables. So these are just strings, disorder names. For example, if we are creating an environment variable, AWS, access key or key KID, however you call it. And this is the name of the environment variable we created. And here we are getting the value of this environment variable. Because anyways for Bo tree we need the AWS access key, ID and the AWS secret access key. And afterwards we need self dot underlying S3 equals two self-taught session to the resource. Service name equal to S3. And the endpoint URL is equal to endpoint URL. Here, I decided to use self-taught underlying S3 using one on the line in front of a variable, it symbolizes death. It is a protected variable. So in Python, we don't have private and public variables or methods in classes, not as in other languages like Java or in Python. They are just conventions. So using a single underline, it says it is a protected variable omitted. And using a double underline, it says that is a actually private method or private variable. And they shouldn't be touched actually. So this is just a convention for other developers that they know, okay, well, these variables we shouldn't touch. So I decided Ds as tree session, this should not be touched. And the self bucket, the actual bucket we're going to use in order to connect to our real bucket. This also should be a protected one. So self underline, S3, bucket, and this is bucket. And now we have our init function or our init method. And now we can just write our methods we want to add here we already know it. List files in prefix. Self. You always have to use the keyword itself. And here I just keep it empty. That's why I ride pass. And then define, read csv to df self. 
We also could have named SDF, but I already decided to DEF probably SDF is better, but yeah, it doesn't matter. They're just names, let's say. And write df2, S3, self, PES. So I save it. This I can ignore. And here we go. We have our S3 bucket connector class defined. 34. 34. Setting up class frame - Solution meta_process: Hello, Our Meta process file and ohmmeter process class. I guess they are quite quick to implement. So I just TO provided some docstrings for our module and for our class, just to have some documentation, so some information. Then here I created the class without any init. And here, as we discussed in the code designed it, we are going to use just static methods in order to use them just as usual functions. I would say so. And yeah, here, I just use the pass in order to leave the functions empty right now. And what else to mention is that the meter process file or module I created here in the common folder. And that's it for our Meta process. 35. 35. Setting up class frame - Solution constantss: Hello, The constants P, why I created in the common folder as well, the same as for the Meta process and the S3. And here basically we have two classes. At first the import from enum input ema, enum. And here these classes are child classes of the enum class. And yes, basically here we just define our constants. They are class attributes. Here in our S3 file types. There's CSV parquet, and here an ohmmeter process format. There's metadata format, and so on, exactly as we discussed in the code design. So quite quick, I guess nothing to think about just to quote two, purely write it down. 36. 36. Setting up class frame - Solution custom_exceptions: Hello, The custom exceptions file I created in the common folder as well. And there is nothing to code yet, just the docstrings. And at least we already created the custom exceptions file. And in case we need custom exceptions, we can create them here. 37. 37. Setting up class frame - Solution xetra_transformer: Hello. Now let's talk about the cetera transformer module. And I created it inside the exedra transformers folder. And at first I imported the name tuple from typing. Then I imported our S3 bucket connector we just created. And I imported it from Sandra, common as tree. So we use our folder, cetera or our project folder exedra 1, 2, 3, 4 as a kind of base. And right now it won't recognize this. But we are going to add this to our Python path a bit later. I will show you. We will do it together, that it will be able to import our modules. So then here I created both named tuple classes, cetera source config, and here cetera target config. So basically here I provide a doc strings with some explanations, some descriptions about our attributes and T are purely the attributes with the datatype, string list and so on. Here's the same. And then we have our exedra ETL class. And there the main thing is our init constructor, our init method. And here we are submitting the S3 bucket soils S3 bucket target. So therefore we needed the S3 bucket connector. And then we have our Meta key, source arguments and target arguments. Here we are just submitting our name two bits we created here. So they are, by the way, they are all the attributes or parameters that should be configurable for our job. And here in our, our exedra ETL init method, here we are defining, creating our instance attributes that are needed here. For the source bucket, the target bucket meter key and so on. 
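Putting the pieces of this lesson together, a sketch of the xetra_transformer.py frame might look like the following. The named-tuple fields shown are only a representative subset (the course parameterizes more, and the exact field names are assumptions); the ETL methods are intentionally left empty, just as in the lesson.

```python
from typing import NamedTuple

from xetra.common.s3 import S3BucketConnector

class XetraSourceConfig(NamedTuple):
    """Source-side configuration (representative subset of the fields used in the course)."""
    src_first_extract_date: str
    src_columns: list
    src_col_date: str

class XetraTargetConfig(NamedTuple):
    """Target-side configuration (representative subset of the fields used in the course)."""
    trg_key: str
    trg_key_date_format: str
    trg_format: str
    trg_columns: list

class XetraETL:
    """Reads the Xetra source data, transforms it and writes the report (frame only)."""

    def __init__(self, s3_bucket_src: S3BucketConnector, s3_bucket_trg: S3BucketConnector,
                 meta_key: str, src_args: XetraSourceConfig, trg_args: XetraTargetConfig):
        self.s3_bucket_src = s3_bucket_src
        self.s3_bucket_trg = s3_bucket_trg
        self.meta_key = meta_key
        self.src_args = src_args
        self.trg_args = trg_args
        # extract_date, extract_date_list and meta_update_list will be derived
        # here later, using MetaProcess.return_date_list

    def extract(self):
        pass

    def transform_report1(self):
        pass

    def load(self):
        pass

    def etl_report1(self):
        pass
```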
I could already have created the instance attributes for our extract date, extract date list and meta update list, but I just skipped that. And then we just have our methods extract, transform_report1, load and etl_report1, and I left them empty for now. 38. 38. Setting up class frame - Solution run: And then we have our entry point, run.py. I created it in our main directory, xetra_1234. For now it is just the "if __name__ == '__main__'" block, so that whenever the file is started, this code is executed, and then I created the main function that is called there and just left it empty. So that's it. 39. 39. Logging in Python - Intro: Hello. So what about logging best practices in Python? It is actually quite simple, and it is definitely not just using simple print statements. Just use the standard Python logging module, since it is well designed, easy to use and very flexible. There are multiple importance levels, DEBUG, INFO, WARNING, ERROR and CRITICAL, that can be used; what level each message should use can be controlled, and where and how to output can be totally controlled and even decided later. There are already many useful built-in handlers available that can be used out of the box, but you can even create your own ones. And last but not least, there is the opportunity to use config files such as JSON or YAML. This gives us the option to just add a logging section to our config file, which provides a lot of flexibility. If you're not familiar with the built-in Python logging module, I provided the link to the official documentation and also one link to a very useful and informative post that is really hands-on. I will not go into too much detail, but we are going to use the logging module in this course: we will set up the configuration in the YAML file and use it in our code. 40. 40. Logging in Python - Implementation: Hello. Now it's time to implement the logging in our code. At first we have to install another package. So here I am in our virtual environment and I do pip install pyyaml; this is what we need to read and parse our YAML config file. The package is installed now, so let's go to our project folder. Here, inside the configs folder, I created a new YAML file; I just called it xetra_report1_config.yml. Let's open it. As a first step we are going to implement a logging section: this is our logging configuration. The section name I just call logging. Then we define the version, version 1. At first we have to define a formatter under formatters; I call it xetra, this is the formatter I want to use, and here I define the format. I just want to have "Xetra Transformer" logged, then the time, then the level name, and finally the message. Now we have to define a handler under handlers, and I just call it console, because we just want to log to the console. Then we define a class; there are already predefined classes we can use in the Python built-in logging package, and for our needs there is the StreamHandler that logs to the console only. Then we have to define the formatter we want to use; we already defined the one with the name xetra, so we use it here. And the level we want to use is, for example, DEBUG. And now we have to define a logger that we can use. A common scenario is to just use the root logger, and here we call it root; this is a best practice, let's say.
And in this way, it will just propagate all the settings we do here to descendent loggers. Here we can also define the level I use debug. And then we have to tell what handless we want to use. And we only define one handler here, the console handler. And this is what I also will use. And now we can add first implemented logging to our entry point, to our one dot p-y here I just open it in Visual Studio Code. And the first thing here is to import the packages we need. So import logging, then import logging dot config, and import yaml. And here in our main function at first we can add docstrings. This is just the entry point to run two exedra ETL job. And here we are going to parse the YAML file. So the config path is, in my case, see cetera, project. Sandra, underlying 1, 2, 3, 4 conflicts, and then cetera report one. Conflict dot yammer. And the config. To read, I can use the jamovi package dot safe load. And here I open conflict. And this way I should be able to see the config. I just can print it. And let's just run it. Lets it first select interpreter. This is Sandra, 1, 2, 3, 4. And local variable config reference before assignment are here. I just used config. I have to use conflict path for sure. And here we can see our content of our logging file. This looks good. And now we can configure our logging. And here I define log config. This is just our logging section we want to use from our config file or a YAML file. And then we can use the handy function, logging dot config, dot dict config. And here we use our log config as a dictionary. In this way, we will loading actually our yaml file, choosing here the section. Do this. This way. We have it as a dictionary. And here this dict config, this is a very useful way just to load our config as a dictionary that like this. It will set our logging, our former does, our handlers and our logos. And now a common practice is to define the logger using logging dot get logger. And here we just use as a argument the name of our file. We decided to use the root logger. And the way using here actually the name. This is a common practice that it creates a logger using the name of the file. And in this way, actually we also could use or we could output the name of the file if it is wished. So I decided we are, we don't need it or I don't need it. And the way that we defined our root logger, anyways, it will just propagate the settings we just made here to any logo we're going to use either here or in any other module. So, and let's just try it. Logger dot info, I want just to print out or to log a info message. This is a test dot and Let's run it. And here we see our logging message to the console, cetera transformer, the date and time info. This is the level we decided we want to use or I decided. And this is the message. This is a test. And now let's add the logger to our S3 and rock cetera transformer fire. And here let's go to Sandra, common to S3. And here at first we have to import logging. And usually I defined the logger inside the init function, inside the init method of our class. So here I use it as a, let's see, week private attribute with a single underscore. And here I create the self on the line logger equal to logging dot logger. And here, the same way, using the name means the name of our file. And like this, we can use to serve underlying logger inside our class. This I just copy and I opened the transformers, etc, transformer. And here I do the same at first import logging. 
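As a recap of the wiring described here, a minimal, self-contained sketch is shown below. The YAML is inlined as a string only for illustration (in run.py it is read from the config file), the format string is an assumption, and SomeClass is a hypothetical stand-in for a class like the S3BucketConnector.

```python
import logging
import logging.config

import yaml

# The logging section of the config, inlined as a string for illustration only;
# in run.py this comes from the YAML config file, and the exact format string differs.
LOGGING_YAML = """
logging:
  version: 1
  formatters:
    xetra:
      format: "Xetra Transformer - %(asctime)s - %(levelname)s - %(message)s"
  handlers:
    console:
      class: logging.StreamHandler
      formatter: xetra
      level: DEBUG
  root:
    level: DEBUG
    handlers: [ console ]
"""

def main():
    """Entry point sketch: configure logging from the config, then log a test message."""
    config = yaml.safe_load(LOGGING_YAML)
    logging.config.dictConfig(config['logging'])
    logger = logging.getLogger(__name__)
    logger.info('This is a test.')

class SomeClass:
    """Hypothetical class showing how a logger is created inside __init__ (as in s3.py)."""

    def __init__(self):
        self._logger = logging.getLogger(__name__)

if __name__ == '__main__':
    main()
```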
And here in the init method, I just pasted here, and that's it. Now we are able to use our logging in case we want to change the logging behavior, we just have to change our config file here. We can add formulas or we can change the formatter. We can add handlers or whatever. Now we are really flexible and we can create our own handlers, even as I already mentioned before. And there are plenty of handlers we can just use. So if you're not familiar with this, you can just inform yourself, read a bit and see what is able with the built-in logging package. And he has that's it about logging. 41. 41. Create Pythonpath: Hello. In this video, I want to talk about how to ensure to import our modules we wrote from everywhere on our system. Therefore, lets make a small test. So here I'm inside the project directory. I activated the virtual environment with pip and shell. And let's just jump into the Python shell. And let's try to import our S3 file types. So from here, it is possible to import because we have the direct connection. Let's say we are already inside our project directory. Now. Let's exit and let's go to here we can see let's go to our test directory because we also want to be able to import our modules and ordered to execute tests from inside our tests directory. Again, let's activate the Python shell and let's try to import. And here you can see no module named Sandra is found. So what is problem? What does the issue actually? Let's import sys and let's do for P in sys.path print p, It contains a list of directories that the interpreter with search in for the required module. When a module is imported within a Python file, the interpreter first searches for the specified module. Among its built-in modules. If not found, it looks through the list of directories defined by sys.path. And as you can see here, that there's no directory, no path to our project folder. So what we can do, we can add an environment variable python path, containing the path to our project folder. So therefore, I just open here the environment variables and create a new one with Python path. And here I insert the path to our Python directory and I save it. So now I have to close the shell and reopen it. I have to go to my directory. I go to the tests folder and I activate the shell. And then not the shell. I activated the virtual environment. Then I activate the shell. And Let's take a look to import sys to our SIS system path for P in this path, print p. And here you can see now we have the path to our project directory. Let's try to import it now. Import as 35 types, and now it's possible to import it. So this means we solved that in a really easy way. To be able to import whatever modules we ride from everywhere on our own system. Now I want to do one more small thing. You remember we created init.py file. It's everywhere in each directory, let's say an order to be a before the interpreter to recognize them as regular packages. Let's make an experiment. Let's, for example, remove this init top pie. And let's try to import it again. And here we can see that it is working actually. But what is the difference using a init.py or not? It will be treated as a regular package if you use an init.py and if not, it will be a namespace package. This is possible for Python 3.3 plus. So this means the imports are working even without in it dot py. I provided a link for you to have a further read just to get more knowledge and bit more understanding about this topic. 
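For reference, the quick check from this lesson looks roughly like this in a Python shell; the import at the end is only expected to work once the PYTHONPATH environment variable contains the project folder.

```python
# sys.path lists the directories the interpreter searches when importing a module;
# the project folder has to appear here (via PYTHONPATH) for project imports to
# work from anywhere on the system.
import sys

for p in sys.path:
    print(p)

from xetra.common.constants import S3FileTypes  # works once PYTHONPATH points to the project root
```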
But important to know is that if you want to have a regular package, it is recommended to use an __init__.py, although, as you can see, it works without it anyway. So we are going to add our __init__.py again, and now we can continue our work. 42. 42. Python Clean Coding: Hello. Now it's time to talk about what clean code is in Python, a really important topic. For sure there is an overall understanding of clean code independent of the language, but since every language has its own features, there are differences as well. The first thing I want to mention is the Zen of Python. This is a collection of 19 so-called guiding principles written by Tim Peters, who was a major contributor to the Python language. It is actually straightforward, so I just read it so that you have at least heard it once, if you haven't heard it yet: Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules, although practicality beats purity. Errors should never pass silently, unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one, and preferably only one, obvious way to do it, although that way may not be obvious at first unless you're Dutch. Now is better than never, although never is often better than right now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea, let's do more of those. Then there is PEP 8. This is a must-know: it is the style guide for Python code, and I strongly recommend using PEP 8 whenever you code in Python. PEP 8 defines what is considered a clean code layout, just to mention indentation, tabs or spaces, maximum line length, blank lines, imports and more. It guides you on how to use string quotes, whitespace, trailing commas, comments and naming conventions, and gives programming recommendations. In addition to the two points mentioned, don't forget to use the design principles we already talked about, like DRY, don't repeat yourself, which is really important, and SOLID. Then keep it simple, stupid, KISS, which is actually the same as what the Zen of Python says in the third line: simple is better than complex. Keep it as simple as possible. Another point for writing clean code in Python is to write Pythonic code. If you are not yet familiar with the special features Python has, get familiar with them and use them, such as dunder methods, context managers, decorators and comprehensions. And finally, use a linter such as Pylint to have good, reliable support. I provided a bunch of links in the course material with a lot of material about clean code in Python, but for sure you can also just google and you will find many, many sources on this topic. 43. 43. list_files_in_prefix - Thoughts: Hello. Now it's time to implement each method one by one. The way I usually do it: before writing the code of a method, I take a look at the function I already wrote and have some final thoughts about how exactly I am going to write the code. What else is needed? What are the cases, what will the different inputs and outputs be, and what exceptions do I have to add? Are there any exceptions to be added at all? And what about logging? Do I want to have logging for this method, or maybe not? So here, now it's about our list_files_in_prefix function.
This will be a method of the S3BucketConnector. One thing that definitely needs to be changed here is the bucket with the underscore; this is the way we created the instance variable, _bucket, just with an underscore. And yes, it's a small method for sure, there is not a lot to think about, but what are the different cases, the different inputs and outputs? The first case is that we have the prefix as a string, and this prefix exists in the bucket. What will the output be? It returns all files with the prefix; this is the case we want to have. What will happen if the prefix is a string, but it doesn't actually exist inside the bucket? In this case the boto3 method objects.filter will just return an empty collection, so our files will be an empty list as well. This means the boto3 method takes care of this case; we don't have to add anything additionally. And what if the prefix is not a string? In this case the boto3 filter method takes care of it as well: it throws an exception if this is not a valid prefix, and our job will fail. So we don't have to take care of any exception; we don't have to catch an exception in this case. As we agreed before, or as I decided before, let's say: if the other packages we are using take care of these exceptions, that is fine for me, because the job will fail, I can just debug it, I will see the exception, and that is okay for me. The last case I can think of is that there are connection problems, something wrong with the bucket, or resource problems. All of these cases boto3 actually takes care of, so I also don't have to think about or add any exceptions. This means that for this small method, which is still basically a function, I don't have to implement any exceptions. What about logging? There is actually no point in logging, because it just returns files; for me there is nothing to log, and in case of exceptions, boto3 takes care of the logging, let's say. So now let's write our code. 44. 44. list_files_in_prefix - Implementation: Hello. For the etl_report1 method, I removed all the arguments; here you can see there are no arguments anymore. And there was a part with the extract date list in the Jupyter notebook that I removed as well; we don't need it here anymore, because we are executing this in the init method of our XetraETL class. Apart from this, everything is exactly the same; there is not much code, it's just calling the methods. 45. 45. list_files_in_prefix - Linting Intro: Hello. Before we continue with our method list_files_in_prefix, I want to give a short introduction to linters. I already mentioned earlier that using a linter is a good idea for clean coding. What actually is a linter? According to Wikipedia, a linter is a tool that analyzes source code to flag programming errors, bugs, stylistic errors and suspicious constructs. Basically, it helps you to keep your code clean. There are several popular and well-tested linters for Python out there, such as Pylint, Pyflakes, pycodestyle, pydocstyle, Bandit and mypy. Before committing to Git, Git hooks can run linters to meet quality standards, and linters can be used in continuous integration to ensure high code quality. In this course we are going to use Pylint. It is one of the oldest linters, still well maintained, and the core features are well developed, so it is a reliable tool to ensure good code quality. 46. 46. list_files_in_prefix - Pylint: Hello.
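For reference, the method that Pylint is about to be run over is essentially this thin wrapper around the boto3 objects.filter call, written inside the S3BucketConnector class (a sketch following the thoughts above):

```python
def list_files_in_prefix(self, prefix: str):
    """Return all file keys in the bucket that start with the given prefix."""
    files = [obj.key for obj in self._bucket.objects.filter(Prefix=prefix)]
    return files
```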
In order to use Pylint, we have to install it in our virtual environment first. So here I opened the terminal and activated the virtual environment, and then I use pip install pylint. Now it is installed and I can open Visual Studio Code. At first we have to change some settings. Therefore we go to File, Preferences, Settings, and here I go to Extensions and Python. Then I go down and search for the button where we can click Edit in settings.json. Here I want to add two settings. At first, python.linting.pylintEnabled should be set to true. And then, in python.linting.pylintArgs, I want to enable fatal, error and warning messages, and here I can even disable some messages. For now I just disable no-absolute-import. These messages are actually meant for Python 2, where they make sense; now that we use Python 3, we should disable some messages, and this is one of them. I know this one is going to be thrown, so that's why I disable it already. Whenever there are Pylint messages, we have to take a look at whether they really make sense. In this course I won't do a detailed introduction to Pylint; there are a lot of sources out there where you can find information to get acquainted with it and more details about how to use Pylint, which settings should be made and which messages should be enabled or disabled. For our case here this is just enough, and I save settings.json. Then here we are in our S3 bucket connector, and I just want to do the linting. At first I press Ctrl+Shift+P and select an interpreter; the interpreter I want to use is the xetra_1234 one. Then again I press Ctrl+Shift+P and select a linter: it is Pylint. And then again Ctrl+Shift+P and Run Linting. Like this we can run the linter manually, but the linter also always runs whenever you save your code. Then here you can see the problems that occur; this is what the linting gives us. There is, for example, trailing whitespace, and now we can work on it. It just means this is not beautiful, not clean code. Same here: this is trailing whitespace, and missing function or method docstring. That one is about the two methods we are still going to implement, where we will also add docstrings, so this we can ignore right now. This means the rest, our list_files_in_prefix and our class as we wrote it, is just fine. 47. 47. list_files_in_prefix - Unit Testing Intro: Hello. Best practice for executing automated unit tests in Python is to use a testing framework such as the built-in unittest or the third-party pytest. Although pytest is a framework with a lightweight syntax and quite popular, I decided to use the built-in one here in this course. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework. Unit tests should be written so that they test all methods that can be called from outside, and in such a way that all possible scenarios are covered. The goal should always be to achieve 100 percent test coverage, meaning that all lines of the code are exercised by the unit tests. There are some situations where that is not possible, but you should always try to reach it; it also gives you more confidence that your code works as expected. There is a package called coverage that we will use to check the test coverage of our tests.
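As a minimal illustration of the built-in framework mentioned here (a purely illustrative test, not part of the project):

```python
import unittest

class ExampleTest(unittest.TestCase):
    """Minimal unittest example: methods starting with 'test' are discovered and run."""

    def test_upper(self):
        self.assertEqual('xetra'.upper(), 'XETRA')

if __name__ == '__main__':
    unittest.main()
```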
When we test a method, often imported objects are needed that do not influence the functionality of the tested method itself. In this case we can just mock these objects using unittest.mock.patch. This will look up the object in a given module and replace it with a mock; like this, you don't have to execute code that doesn't need to be executed. The same goes for interfaces like our object storage S3. We should not connect to it during unit tests, because if there is a problem with the connection, the test will fail; but we only want to test our unit, meaning our method, so we don't need the real storage. S3 can also be mocked: there is a package named moto for mocking the whole S3 interface, and you can just use it as if you were using the real S3 storage. The same applies in case you use databases for your application: then you should use fake databases, so-called stubs, or in-memory databases instead of the real ones. But in our application we are not using databases; I just wanted to mention it. I will not give a deep introduction to unit testing in Python and using mock and patch, but I provided links to posts about how to write unit tests in Python and about understanding the Python mock object library. 48. 48. list_files_in_prefix - Unit Test Specification: Hello. Before we are going to write our unit tests, we should do a test specification. This means we should think about what we want to test, in order to be able to write our unit tests easily afterwards. Before, we were already thinking about the different cases our method will meet. There was the first case where the prefix is a string and it exists in the bucket. The second is where the prefix is a string and doesn't exist in the bucket. And then, if the prefix is not a string, boto3 takes care of the exception that will be thrown, so this case we don't have to cover. The same for problems with connections, with the bucket, or resource problems: boto3 takes care of those anyway. Connection issues we are not testing in unit tests, and for bucket problems boto3 takes care of throwing an exception, so this we also don't have to cover. This means we only need two cases. Here I prepared an extra sheet with some columns: module, method, test name, test description, test init, input and expected output. And here we have our module s3, the method list_files_in_prefix and two test cases. The first test I just named test_list_files_in_prefix_ok; this means that the prefix exists in our bucket. The other test case is test_list_files_in_prefix_wrong_prefix. The test description for the first one is: test the list_files_in_prefix method for getting two file keys as a list on the mocked S3 bucket. And for the second test case: test the list_files_in_prefix method in case of a wrong or not existing prefix. This covers both cases, wrong or not existing prefix; for the boto3 method we are calling, it doesn't matter which. Then we have the test init: for our first test case we at first have to initialize the mocked S3 bucket, then upload two files to the bucket with the prefix, and then create the S3BucketConnector test instance. The input will be the prefix, and the expected output is a list containing the two file keys. For the second test case there is basically the same test init, the mocked S3 bucket, uploading two files to the bucket with the prefix and creating the S3BucketConnector test instance; the input will be a wrong prefix, a prefix that doesn't exist on the bucket.
So these are our test cases, and now we can just write them.

49. 49. list_files_in_prefix - Unit Test Implementation 1: Hello. Now let's implement our first unit tests. Therefore, let's create a file inside the tests/common folder: File, New File, and I save it in the tests folder under common and name it test_s3.py, a Python file. Important to know is that the file name has to start with test in order to be recognized by the unittest framework, and the same goes for the testing methods. So we can save it. At first, let's create the frame of our unit tests, our unit test class actually. We start with a docstring, just "Test S3BucketConnector methods". Then we import os, which we will need, import unittest, import boto3, from moto import mock_s3, and from xetra.common.s3 import S3BucketConnector, which is what we are going to test. Then we can create our testing class. In unit tests we usually create classes that inherit from the unittest TestCase class; I call it TestS3BucketConnectorMethods, and it inherits from unittest.TestCase. Again a docstring with a description, "Testing the S3BucketConnector class"; you can write whatever you want here. Then we can use the setUp method. This is a method that can be used to initialize our tests; whatever we need to do beforehand we can do here. Whenever you run unittest, it executes setUp first, before executing the tests, and if something is needed for multiple tests it makes sense to do it inside this method. So here a docstring, "Setting up the environment", and at first I just write pass. In the same way there is a tearDown method: if something needs to be done after executing the unit tests, you can write it inside tearDown, "Executing after unit tests". I also just keep pass for now. Then we have our test_list_files_in_prefix_ok method, and here I just write the description we already created in our test specification: tests the list_files_in_prefix method for getting two file keys as a list from the mocked S3 bucket. I skip the body for the moment. The second unit test is test_list_files_in_prefix_wrong_prefix(self), with the description: tests the list_files_in_prefix method in case of a wrong or not existing prefix. Here we can write pass as well. In order to run our tests, at the bottom we write if __name__ == '__main__': unittest.main(). Here we can see that Pylint throws some errors and warnings; most important for us right now is that the moto import cannot be resolved, so we have to install moto first. Let's go to our console and do pip install moto. It is installed now, and when we save again, the moto issue is resolved; the rest are just some warnings. Now we can write our setUp method. The first thing we need is to mock the S3 connection: mocking S3 connection start. Here we can use our mock_s3: I create an instance variable self.mock_s3 = mock_s3() and start it with self.mock_s3.start(). Like this we are starting our mocked S3, and after we have executed our tests we can stop the mock again in tearDown with self.mock_s3.stop(): mocking S3 connection stop. Next we need some class arguments that we have to set up; we have to define them in order to pass them to our S3BucketConnector.
So, defining the class arguments: there is the S3 access key, and I just name it AWS_ACCESS_KEY_ID. Then we have self.s3_secret_key, which I call AWS_SECRET_ACCESS_KEY. For these two keys it actually doesn't matter what you write here, they just have to be some strings. Then self.s3_endpoint_url, where I use https://s3.eu-central-1.amazonaws.com, and the bucket name, which I just call test-bucket. Then we need to create the S3 access keys as environment variables. Here we can use os.environ with the two key names I just created; the value is just a dummy like KEY1, because we are only mocking the storage, but in real life it would be the actual value. The same for the S3 secret key, which I just set to KEY2. The next step is to create the bucket on the mocked S3. I create self.s3 using boto3.resource with service_name='s3' and endpoint_url=self.s3_endpoint_url, and then self.s3.create_bucket with the bucket name and a CreateBucketConfiguration with a LocationConstraint; this is necessary, otherwise it throws an error, and here I use eu-central-1. Finally I can create the bucket instance so that we can use it to upload files: self.s3_bucket = self.s3.Bucket(self.s3_bucket_name). And what we still need is to create our S3BucketConnector testing instance. So, finally creating a testing instance: I call it self.s3_bucket_conn, using the S3BucketConnector we imported, and pass our access key, secret key, endpoint URL and bucket name, which is exactly what we created here. This should be enough for our setUp and tearDown.
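Putting the pieces described so far together, the test frame might look roughly like this. It is a sketch: the S3BucketConnector constructor arguments follow the course description (access key name, secret key name, endpoint URL, bucket name), and the mock_s3 import matches older moto releases (in moto 5 and later the entry point is mock_aws instead).

```python
"""Tests for the S3BucketConnector methods."""
import os
import unittest

import boto3
from moto import mock_s3

from xetra.common.s3 import S3BucketConnector


class TestS3BucketConnectorMethods(unittest.TestCase):
    """Testing the S3BucketConnector class."""

    def setUp(self):
        """Setting up the environment."""
        # mocking the S3 connection: start
        self.mock_s3 = mock_s3()
        self.mock_s3.start()
        # defining the class arguments
        self.s3_access_key = 'AWS_ACCESS_KEY_ID'
        self.s3_secret_key = 'AWS_SECRET_ACCESS_KEY'
        self.s3_endpoint_url = 'https://s3.eu-central-1.amazonaws.com'
        self.s3_bucket_name = 'test-bucket'
        # the access keys only need to exist as environment variables; the values are dummies
        os.environ[self.s3_access_key] = 'KEY1'
        os.environ[self.s3_secret_key] = 'KEY2'
        # creating the bucket on the mocked S3
        self.s3 = boto3.resource(service_name='s3', endpoint_url=self.s3_endpoint_url)
        self.s3.create_bucket(Bucket=self.s3_bucket_name,
                              CreateBucketConfiguration={'LocationConstraint': 'eu-central-1'})
        self.s3_bucket = self.s3.Bucket(self.s3_bucket_name)
        # creating the testing instance
        self.s3_bucket_conn = S3BucketConnector(self.s3_access_key,
                                                self.s3_secret_key,
                                                self.s3_endpoint_url,
                                                self.s3_bucket_name)

    def tearDown(self):
        """Executing after unit tests."""
        # mocking the S3 connection: stop
        self.mock_s3.stop()


if __name__ == '__main__':
    unittest.main()
```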
Now let's write our unit tests.

50. 50. list_files_in_prefix - Unit Test Implementation 2: Let's start with test_list_files_in_prefix_ok. The usual way I write unit tests is that at first I write the expected results, then the test init that is needed for this specific test, then the method execution, then the tests after the method execution, and, if needed, some cleanup after the tests. Let's start with the expected results. What we expect is a prefix; I call it prefix_exp and just write "prefix" as a string. Then we need two keys, because I want to upload two files: key1_exp, for which I use an f-string with the expected prefix followed by test1.csv, and the same for the second file with test2.csv. Then the test init: I create a CSV content that I can use for both files we are going to upload. I just use triple quotes with col1, col2 and valA, valB; it can be as simple as possible for the sake of testing. Then self.s3_bucket.put_object, where the Body is our CSV content and the Key is our first expected key, and the same again for the second expected key. Then the method execution: we get a list_result by calling self.s3_bucket_conn.list_files_in_prefix with prefix_exp as the argument. Afterwards we can do some tests. The first thing I want to test is that the returned list has two items, with self.assertEqual(len(list_result), 2). Then I check that key1_exp is in the list result and that key2_exp is in the list result, using self.assertIn for both. This should be enough to make sure that the result is as expected. Finally we can do some cleanup: I clean up the mocked S3 so the other tests don't get confused, using self.s3_bucket.delete_objects with a Delete dictionary whose Objects list contains the entries for key1_exp and key2_exp. This should be okay now, and we can just execute it and see whether the test runs successfully. I select the interpreter, not this one, this one, and run without debugging. Here I can see it ran two tests in about two seconds and both are okay, which means the test is fine. In case anything is wrong with a test and you want to execute only this one test instead of running all tests (the way we run it right now, calling unittest.main(), executes setUp first, then all the tests we defined, and then the tearDown method), you can also do it manually for debugging, to see what happens when an error occurs; you can of course also set a breakpoint. To call only one method, you could create a test instance, I call it test_instance, and then call step by step: first setUp, then the test method, and tearDown as the last step, the same way unittest does it whenever you execute the tests. If we execute this, we can see it runs; with a print statement in the test you can see the output, which is also one way to debug if anything goes wrong, but a breakpoint and the debugger are probably the better way. Now it's time to write the second test, test_list_files_in_prefix_wrong_prefix. Here we basically can copy and paste our first test, but one difference is that we want an expected prefix that does not exist or is wrong, so I just call it "no-prefix", for example. One thing we actually don't have to do is upload the files; I did put that into the test specification, uploading the files and then using a wrong prefix, but it doesn't matter, it's the same case and the same handling whether there is a wrong prefix or no prefix at all. So this part we can skip and remove. The method execution is the same, and the cleanup we also don't need, so this test is much smaller. The only thing we have to do here is self.assertTrue: the result should be empty, and we can check it using not list_result. Now we can execute our tests again and see if they are okay, and this looks good: two tests, okay, this is just fine.
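As a sketch, the two test methods described above, written against the setUp from the previous lesson, might look like this; the assertions follow the walkthrough, while the exact prefix and key strings are just placeholders.

```python
    # these two methods belong inside the TestS3BucketConnectorMethods class shown earlier

    def test_list_files_in_prefix_ok(self):
        """Tests list_files_in_prefix for getting two file keys as a list from the mocked S3 bucket."""
        # expected results
        prefix_exp = 'prefix/'
        key1_exp = f'{prefix_exp}test1.csv'
        key2_exp = f'{prefix_exp}test2.csv'
        # test init: upload two small CSV files under the prefix
        csv_content = 'col1,col2\nvalA,valB'
        self.s3_bucket.put_object(Body=csv_content, Key=key1_exp)
        self.s3_bucket.put_object(Body=csv_content, Key=key2_exp)
        # method execution
        list_result = self.s3_bucket_conn.list_files_in_prefix(prefix_exp)
        # tests after method execution
        self.assertEqual(len(list_result), 2)
        self.assertIn(key1_exp, list_result)
        self.assertIn(key2_exp, list_result)
        # cleanup after tests
        self.s3_bucket.delete_objects(
            Delete={
                'Objects': [
                    {'Key': key1_exp},
                    {'Key': key2_exp}
                ]
            }
        )

    def test_list_files_in_prefix_wrong_prefix(self):
        """Tests list_files_in_prefix in case of a wrong or not existing prefix."""
        # method execution with a prefix that does not exist on the mocked bucket
        list_result = self.s3_bucket_conn.list_files_in_prefix('no-prefix/')
        # test after method execution: the returned list should be empty
        self.assertTrue(not list_result)
```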
And now there is just one more thing we can check: the test coverage. There is a handy package called coverage. We go to our shell, activate our environment, and first do pip install coverage. One more thing: we should also be inside our project folder. Once it is installed, we can run coverage and it gives a good number of how many lines of our application code are covered by the tests. Coverage is installed now, so we can run coverage run. As an argument I want to omit some folders, the virtual environment folders and also the tests folder, because we really only want to check the coverage of our application code. Then -m unittest discover, so that it finds all the test cases in our test suite, and -v to get some more details. When it executes, it should run our two tests, and here we can see: ran two tests, they are okay. Now we can call the report with coverage report -m, and here we can see what is covered by our unit tests. We have our common module; the __init__.py is of course empty, but what is important is the s3.py file, and here we can see 88 percent of our code is covered. We are going to add the tests for two more methods, and then it will be up to 100 percent. Now we can do a git commit, because we finished writing our list_files_in_prefix method and its unit tests. We check git status, and we can see that there are several directories and files, like run.py, that we should add, so I just do git add *, then git commit -m "add list_files_in_prefix and unit tests", and then a git push. That's it, now our first method is on the remote repository as well. Whenever you finish a method, I recommend doing a git commit. So that's it for our first method; let's continue with the next ones.

51. 51. Task Description - Writing Methods: Hello. Now it's your turn to write all the other methods and their unit tests. For the S3BucketConnector we implemented the list_files_in_prefix method; you have to implement the read_csv_to_df and write_df_to_s3 methods. You have to implement the MetaProcess class methods update_meta_file and return_date_list. You have to write the MetaProcessFormat class and the S3FileTypes class as well. And finally our Xetra transformer classes: the XetraETL class, with xetra_transformer being the name of the file. There you have to implement extract, transform_report1, load and etl_report1, and also the auxiliary XetraSourceConfig and XetraTargetConfig classes. Just go through the methods one by one. Think about what you need and what the different cases are. Think about exceptions, maybe you can write some custom exceptions, think about logging, and then implement the methods. Finally, also implement the unit tests according to the cases you thought about, and try to reach 100 percent coverage. I provide all the solutions, all the scripts for the methods and also for the unit tests, as well as a test specification for each method, and I will go through the solutions. I wish you a lot of fun implementing all of this; it shouldn't be that tough. It isn't important to do it exactly as I did it, just do it, and do it using clean code and the best practices we have learned so far in this course. See you later.

52. 52. Solution - read_csv_to_df - Implementation: Hello. Here you can see my implementation of the update_meta_file method. Basically it is the same code as we already wrote in the Jupyter notebook; I just added some comments and replaced the hard-coded values with MetaProcessFormat values.
What I added is this try/except block, because there is one case we also have to take care of: whenever we read the meta file into a DataFrame and there is no meta file yet, we get an exception, NoSuchKey. In this case I just take the new DataFrame we created here and use it as df_all. That is one case. Another case is when there already is a meta file, but the columns of the existing meta file are different from the columns we want to write, so the columns of df_new are not equal to the columns of df_old; here I just use collections.Counter for the comparison. In this case I raise a WrongMetaFileException. We can take a look at the custom exceptions: there I also created a WrongMetaFileException, in the same way I created the WrongFormatException. Afterwards you can add whatever you want to it, but at least it raises an error and we get this WrongMetaFileException. And in case everything is right, we do the concat. That's it for this method.

53. 53. Solution - read_csv_to_df - Unit Test Implementation: For unit testing I created four different test cases. The first is test_write_df_to_s3 with an empty DataFrame. As the expected result, return_exp, I expect None, and the expected log is "The dataframe is empty" and so on. I create an empty DataFrame, df_empty, and when I execute the method I pass the empty DataFrame as the argument. I get the result, check the log output, and then check that the expected return value None is equal to the result. The next test case is test_write_df_to_s3_csv: here I expect a CSV file, so I create the DataFrame I expect, df_exp, and also pass it as the input argument to the write_df_to_s3 method. After the method execution I first read the object back from the mocked S3 bucket, and I expect to get the same DataFrame as the expected one; here I just use self.assertTrue(df_exp.equals(df_result)), and the return value of course should be True. After this I do a cleanup. The same for Parquet: I created a unit test where, instead of CSV, I just use Parquet. And the last test case is the one with a wrong format; I call the unit test test_write_df_to_s3_wrong_format. The only difference is that I set the expected format to "wrong_format", just some arbitrary string, and pass it as the argument. What we expect is that the WrongFormatException is raised. unittest provides a convenient method, self.assertRaises, which, used as a context manager with the with statement, lets us check that the exception we created, the WrongFormatException, is raised. If everything is as expected, the test case passes. So these are the four unit test cases.

54. 54. Solution - write_df_to_s3 - Implementation: Hello. Here you can see my implementation of the update_meta_file method. Basically it is the same code as we already wrote in the Jupyter notebook; I just added some comments and replaced the hard-coded values with MetaProcessFormat values. What I added is the try/except block: whenever we read the meta file into a DataFrame and there is no meta file, we get a NoSuchKey exception, and in this case I just take the new DataFrame and use it as df_all. The other case is when there already is a meta file but its columns are different from the columns we want to write; I compare them using collections.Counter and raise a WrongMetaFileException, which I created in the custom exceptions in the same way as the WrongFormatException. If everything is right, we do the concat. That's it for this method.
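To make the exception handling just described concrete, here is a minimal sketch. The two exception classes follow the course description; the helper function name and the way the old and new meta DataFrames arrive are assumptions, since in the real method they come from reading the meta file on S3.

```python
from collections import Counter

import pandas as pd


class WrongFormatException(Exception):
    """Raised when an unsupported file format is given."""


class WrongMetaFileException(Exception):
    """Raised when the existing meta file does not have the expected columns."""


def merge_meta_frames(df_old: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    """Column check and concatenation as used inside update_meta_file (hypothetical helper)."""
    # the meta file is only extended if old and new entries share the same columns
    if Counter(df_new.columns) != Counter(df_old.columns):
        raise WrongMetaFileException
    return pd.concat([df_old, df_new])
```

In the wrong-format unit test, the corresponding assertion is then simply a with self.assertRaises(WrongFormatException): block around the write_df_to_s3 call.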
55. 55. Solution - write_df_to_s3 - Unit Test Implementation: For unit testing the etl_report1 method I only created one test case, because I just want to see the whole pipeline working as expected. It is basically almost the same as for the other methods: expected results, test init, method execution, where I call etl_report1, then checks on the mocked S3 that the DataFrame and the meta file are as expected, and afterwards the cleanup. Now we have implemented the unit tests for all our methods, and our target is a coverage of 100 percent, so let's check it; I hope you already checked it step by step. We do coverage run, it executes all the written tests, ran 21 tests, seems okay, and then coverage report -m. Here we see 100 percent everywhere: our constants, custom exceptions, meta_process, s3 and our xetra_transformer. All our modules, all our code, are covered by our unit tests.

56. 56. Solution - update_meta_file - Implementation: Hello. For the return_date_list method I use exactly the same code as in the Jupyter notebook, with just some modifications: I beautified the code a bit and of course removed the hard-coded values. You can see it is almost exactly the same, so there is not a lot to explain.

57. 57. Solution - update_meta_file - Unit Test Implementation: For unit testing the etl_report1 method I only created one test case, because I just want to see the whole pipeline working as expected: expected results, test init, method execution calling etl_report1, then checking on the mocked S3 that the DataFrame and the meta file are as expected, and the cleanup afterwards. With all unit tests implemented, coverage run executes the 21 tests and coverage report -m shows 100 percent for all our modules.

58. 58. Solution - return_date_list - Implementation: Hello. For the return_date_list method I use exactly the same code as in the Jupyter notebook, just beautified a bit and with the hard-coded values removed, so there is not a lot to explain.
59. 59. Solution - return_date_list - Unit Test Implementation: For unit testing the etl_report1 method I only created one test case, because I just want to see the whole pipeline working as expected: expected results, test init, method execution calling etl_report1, then checking on the mocked S3 that the DataFrame and the meta file are as expected, and the cleanup afterwards. With the unit tests for all methods implemented, coverage run executes 21 tests and coverage report -m shows 100 percent for all modules: constants, custom exceptions, meta_process, s3 and the xetra_transformer.

60. 60. Solution - extract - Implementation: Hello. For the etl_report1 method I removed all the arguments; you can see there are no arguments anymore. There was a part, extract_date_list, in the Jupyter notebook, and that I removed as well; we don't need it here anymore because we are executing it in the __init__ method of our XetraETL class. Apart from this, everything is exactly the same. There is not much code, just calling the methods.

61. 61. Solution - extract - Unit Test Implementation: For unit testing I created two different test cases. The first test case is when the input DataFrame is empty. Here I basically do the same thing as for the extract method: a test init with the extract date and the extract date list, then in the method execution I patch the return_date_list object, create a test instance of the XetraETL class, and execute the transform_report1 method. I expect a log message, and as the DataFrame input I create an empty DataFrame; in the tests after the method execution I assertTrue that the DataFrame result we get back from our transform method is empty as well. The next test case is the one where the report is okay, where we actually create a report. I define the DataFrame input using the source DataFrame I created in the setUp method, and set the extract date and extract date list to some values. What I expect is the report DataFrame, which I also created in the setUp: this is the report data corresponding to the source data I created before, so this is what the data should look like after the transformation. And this is exactly what I check: after asserting the expected log, I assert that the expected DataFrame is equal to the DataFrame result.

62. 62. Solution - transform_report1 - Implementation: Hello. For the etl_report1 method I removed all the arguments, and the extract_date_list part from the Jupyter notebook is removed as well, because it is executed in the __init__ method of our XetraETL class. Apart from this, everything is exactly the same; there is not much code, just calling the methods.

63. 63. Solution - transform_report1 - Unit Test Implementation: For unit testing I created two different test cases. The first one is for an empty input DataFrame: a test init with the extract date and extract date list, patching return_date_list in the method execution, creating a XetraETL test instance, executing transform_report1 with an empty DataFrame, and then asserting that the resulting DataFrame is empty as well. The second one is the case where a report is created: the DataFrame input is the source DataFrame from the setUp, the extract date list is set to some values, and the expected result is the report DataFrame created in the setUp; after asserting the expected log, I assert that the expected DataFrame equals the DataFrame result.
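The patching pattern mentioned above might look roughly like the sketch below. It assumes return_date_list lives on the MetaProcess class, that the transformer module sits in a xetra.transformers package, and that the XetraETL constructor takes the two bucket connectors, the meta key and the source and target configurations; the exact patch target, signatures and values are assumptions and depend on your implementation.

```python
import unittest
from unittest.mock import patch

import pandas as pd

from xetra.common.meta_process import MetaProcess
from xetra.transformers.xetra_transformer import XetraETL


class TestXetraEtlMethods(unittest.TestCase):
    # setUp is assumed to create self.s3_bucket_src, self.s3_bucket_trg, self.meta_key,
    # self.source_config and self.target_config, as described in the course

    def test_transform_report1_emptydf(self):
        """transform_report1 should hand an empty DataFrame straight back."""
        # hypothetical values returned by the patched return_date_list
        extract_date = '2021-04-01'
        extract_date_list = ['2021-04-01']
        df_input = pd.DataFrame()
        # patch return_date_list so that creating the XetraETL instance
        # does not depend on a real meta file
        with patch.object(MetaProcess, 'return_date_list',
                          return_value=[extract_date, extract_date_list]):
            xetra_etl = XetraETL(self.s3_bucket_src, self.s3_bucket_trg,
                                 self.meta_key, self.source_config, self.target_config)
            df_result = xetra_etl.transform_report1(df_input)
        # an empty input DataFrame should come back empty
        self.assertTrue(df_result.empty)
```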
64. 64. Solution - load - Implementation: Hello. For the etl_report1 method I removed all the arguments; there are no arguments anymore. The extract_date_list part from the Jupyter notebook is removed as well, because it is executed in the __init__ method of our XetraETL class. Apart from this, everything is exactly the same; there is not much code, just calling the methods.

65. 65. Solution - load - Unit Test Implementation: For unit testing the etl_report1 method I only created one test case, because I just want to see the whole pipeline working as expected: expected results, test init, method execution calling etl_report1, then checking on the mocked S3 that the DataFrame and the meta file are as expected, and the cleanup afterwards. With all unit tests implemented, coverage run executes the 21 tests and coverage report -m shows 100 percent for all modules.

66. 66. Solution - etl_report1 - Implementation: Hello. For the etl_report1 method I removed all the arguments; there are no arguments anymore. The extract_date_list part from the Jupyter notebook is removed as well, because we are executing it in the __init__ method of our XetraETL class. Apart from this, everything is exactly the same; there is not much code, just calling the methods.

67. 67. Solution - etl_report1 - Unit Test Implementation: For unit testing the etl_report1 method I only created one test case, because I just want to see the whole pipeline working as expected. It is basically almost the same as for the other methods: expected results, test init, method execution, where I just call etl_report1, and afterwards I check on the mocked S3 that the DataFrame is as expected and the meta file as well, and then do the cleanup. Now we have implemented the unit tests for all our methods, and our target is a coverage of 100 percent. So let's check it; I hope you already checked it step by step. We do coverage run, it executes all the written tests, seems okay, ran 21 tests. Then coverage report -m, and here we see all our 100 percents: our constants, custom exceptions, meta_process, s3 and our xetra_transformer. This means all our modules, all our code, are covered by our unit tests.
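Since the lesson describes etl_report1 as "not much code, just calling the methods", the body is essentially a chain of the three steps. The sketch below reflects that description as a method of the XetraETL class; the return value and the absence of any extra guards are assumptions.

```python
    def etl_report1(self):
        """Extract, transform and load the Xetra report 1 (sketch)."""
        # extract the source files for the dates determined in __init__
        df = self.extract()
        # transform to the report 1 format
        df = self.transform_report1(df)
        # write the report to the target bucket and update the meta file
        self.load(df)
        return True
```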
68. 68. Integration Tests - Intro: Hello. For writing and executing integration tests, basically the same testing framework, unittest, can be used, but now it should be tested how the modules work together, and real interfaces, such as S3 in our case, should be used. Some of the unit tests can just be reused, but this time without mocking and patching other objects, because the purpose is to test how they work together.

69. 69. Integration Tests - Test Specification: For integration tests I also do a test specification beforehand, and here I already noted down what I want to test. I only want to test the case where there is no meta file. In this course I do it like this: I only show you one example, we do it together, and then you can do it by yourself. You already got good practice doing all the unit tests, so in your project and your job you can do it on your own; it won't be a big deal for you. Here I have the test name, the test description, then the test init: I will create a source bucket on my AWS account with content and create an S3BucketConnector test instance. What I expect in the end is a Parquet file on S3 with the source data content in the report 1 format.

70. 70. Integration Tests - Implementation: Hello. For the integration test I am going to use my real AWS S3, and therefore I already created two buckets, xetra-int-test-source and xetra-int-test-target; the first will be our source bucket and the second our target bucket. As a next step I create a new file in my integration test folder and call it test_int_xetra_transformer.py. The first thing I do is take our unit test file for the Xetra transformer, copy and paste everything, and then modify it a bit. The docstring becomes "Integration tests for the XetraETL methods". I remove most of the test methods; this one I keep as it is for the moment, test_extract I can remove, and this one as well. What I also don't need are moto and mock/patch, because I want to use my real AWS S3. Then I rename the class to IntTestXetraEtlMethods, or whatever, for integration testing. In the setUp I can remove the S3 mocking. Then I have the source bucket name, xetra-int-test-source, and the target, xetra-int-test-target. Then my meta key, the meta file, which I call meta_file.csv. This part I can remove. One thing you have to take care of are the environment variables: your AWS access key and your AWS secret access key have to be set. As a next step, the boto3 resource we still need, but we don't have to create buckets as in the unit tests, because I already created them. The instances I want to keep: I want to create the S3BucketConnector instances.
As a next step, as you remember, we had hard-coded date values here; since we won't be patching the return_date_list method, it will take today's date as a reference, so I am going to create a date list with the dates before today. This way we can execute the code today, tomorrow, or even half a year later, but we have to substitute all the hard-coded dates. So, creating a list of dates: self.dates, using a list comprehension with datetime.today().date() minus a timedelta of the given number of days; by the way, we have to import datetime and timedelta from the datetime module. Then a line break and .strftime with the format we need, and here I use MetaProcessFormat.META_DATE_FORMAT.value. The for loop is "for day in range(8)", so I just take the last eight days. MetaProcessFormat I have to import from xetra.common.constants. And now, for example instead of the first of April, we can use self.dates[3], and the same for all the other dates. In the tearDown method we don't need the mock stop anymore, but instead I want to clean my buckets, and here I just use a for loop. Now the setUp and tearDown are done, so let's write our integration test. I just rename it, and for the expected results I again use self.dates. The patch from the unit test we don't need, so I remove it. Then our XetraETL instance with our buckets and so on, this is what we are creating, then the method we are going to execute, and here we test it. Everything looks good; that's basically it. If we had more tests, we would do the cleanup here in the test, but since we only have one test, for me the cleanup in tearDown is okay. Now I can just execute it: I select the interpreter, the xetra environment, and run it, and here we can see our test ran okay, which means it is working fine. If you want to play around with it, you could for example remove the cleaning part in the tearDown and check what was created and what is there, whether everything is as expected, and debug it a bit. And that's it; this is our one integration test. As I already mentioned, I only write this one integration test here, but in a real project you should always write more integration tests to make sure your application and your code work as expected.
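As a recap of the setUp change described above, the date list might be built like this. The MetaProcessFormat constant name follows the course constants module; the helper function is hypothetical, since in the course the comprehension is written directly inside setUp.

```python
from datetime import datetime, timedelta

from xetra.common.constants import MetaProcessFormat


def last_n_dates(days: int = 8):
    """Return today's date and the previous days, formatted with the meta date format."""
    return [
        (datetime.today().date() - timedelta(days=day)).strftime(
            MetaProcessFormat.META_DATE_FORMAT.value)
        for day in range(days)
    ]


# in setUp of the integration test class: self.dates = last_n_dates(8)
```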
71. 71. Entrypoint run - Implementation: Hello. Now it's time to finish our entry point, run.py. The first thing I do is add the imports I need. I want to pass the path to my config file as an argument whenever we call run.py, and for that we can use the argparse package. Then it's time to change our config file and update the values we need. I open the config and add the sections we need: the configuration for the source and target according to our newly created classes, the first extract date, for which I choose the 24th of June 2021 because today is the 27th, and the configuration for our meta file. Now the config file is updated, and we can go to our run.py and fill in all the sections. The logging part we already did. What else we need is to create two S3BucketConnector instances, for the source bucket and the target bucket: we read the S3 configuration and then create the two instances. Then we read our source and target configuration and create instances of the classes we wrote; here we can use the two stars to pass the dictionaries as keyword arguments, which is a really handy Python feature. The same for our meta file configuration. Then we create a XetraETL instance and just run our job. What I also did is add logger infos that the Xetra ETL job started and finished. Now our run.py should be done and it should work. Before running run.py as an end-to-end test, I want to create a launch.json. This is a launch file where I configure which Python path I want to use when I run the program, which environment variables, and which arguments to add; this is especially useful because we are using the parser here with the config argument. I am going to create it in Visual Studio Code; other IDEs have other ways to do this, in PyCharm it is also really handy. I go to Run and Debug and create a launch.json file, choose Python file, and in this general configuration I just add a configuration: I choose Python, Python file, name it however I want, "Xetra ETL run", type python. Then I add the PYTHONPATH, choose my run.py file, set the console to the integrated terminal, and add the arguments. By the way, I missed a comma here. Now we can run our program: I go to run.py, then to Run and Debug, and choose the configuration I created in the launch.json, Xetra ETL run; the launch.json is saved inside the project folder in a .vscode folder. Then I just run it and see what happens. Here I got an error already, because I made a mistake here; we don't need this. Let's run it again and see what problems we get. It seems to load our CSV files, and that's it. Now we can check inside my AWS account whether everything is as expected; at least according to the logs it looks good: all the CSV files since the 24th of June were downloaded and extracted and then successfully written. To check it, I open my AWS account, S3, my target bucket xetra-1234. Here we can see that there is a meta file for report 1, updated just now on June 27th, and also the report 1 Parquet file. This means our job is working and we are almost done.
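Putting the steps above together, run.py might look roughly like the sketch below. The class names follow the course; the config keys, the assumption that the config file is YAML, and the exact constructor signatures are assumptions, so adapt them to your own config layout.

```python
"""Entry point for running the Xetra ETL job."""
import argparse
import logging
import logging.config

import yaml

from xetra.common.s3 import S3BucketConnector
from xetra.transformers.xetra_transformer import XetraETL, XetraSourceConfig, XetraTargetConfig


def main():
    """Read the config file given on the command line and run the ETL job."""
    # parse the path to the config file from the command line
    parser = argparse.ArgumentParser(description='Run the Xetra ETL job.')
    parser.add_argument('config', help='Path to the configuration file.')
    args = parser.parse_args()
    with open(args.config) as config_file:
        config = yaml.safe_load(config_file)

    # configure logging as defined in the config file
    logging.config.dictConfig(config['logging'])
    logger = logging.getLogger(__name__)

    # create the connectors for the source and the target bucket
    s3_config = config['s3']
    s3_bucket_src = S3BucketConnector(s3_config['access_key'], s3_config['secret_key'],
                                      s3_config['src_endpoint_url'], s3_config['src_bucket'])
    s3_bucket_trg = S3BucketConnector(s3_config['access_key'], s3_config['secret_key'],
                                      s3_config['trg_endpoint_url'], s3_config['trg_bucket'])

    # the source/target sections of the config are unpacked as keyword arguments
    source_config = XetraSourceConfig(**config['source'])
    target_config = XetraTargetConfig(**config['target'])
    meta_config = config['meta']

    logger.info('Xetra ETL job started.')
    xetra_etl = XetraETL(s3_bucket_src, s3_bucket_trg,
                         meta_config['meta_key'], source_config, target_config)
    xetra_etl.etl_report1()
    logger.info('Xetra ETL job finished.')


if __name__ == '__main__':
    main()
```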
72. 72. Dependency Management - Intro: Hello. When we talk about dependency management in Python, the first thing we have to be aware of is that we have to differentiate between applications and packages. The standard approach to manage the dependencies of an application is a requirements.txt file. This file is actually just a list of the requirements of a complete Python environment, meaning the required packages are listed; it is often an exhaustive listing of pinned versions for repeatable installations of complete environments. For packages, the standard approach is a setup.py file that lists the minimum requirements for a single project to run correctly. Here pinned versions are not best practice, because we don't want to lose the benefit of dependency upgrades. Since we created an application, let's focus on application dependency management with a requirements.txt file and see how we can list our requirements. Imagine we are in the development environment and wrote a Flask application using Flask 0.12.1. In our requirements.txt we only write flask, without a version. Using this requirements.txt, we want to deploy our application to the production environment with pip install -r requirements.txt. What actually happens? Since we did not specify a version of the Flask package, pip installs the latest version by default, and at that point in time that was, for example, Flask 1.1.2. This might be a problem, because we did not test our application with this version, so there is a potential break of our application in production: the build is not deterministic for the required dependencies. For this reason we decide to use pinned dependencies in our requirements.txt. Now we specify the exact version we need, 0.12.1. But in addition to the Flask package there is a sub-dependency: Flask requires the Werkzeug package. Although we did not explicitly mention it, Werkzeug with version 0.14 is installed and used in our development environment, and we execute and test our application with exactly this version. What can happen when we deploy to production with pip install -r requirements.txt? The Flask package is installed with the pinned version, but for the Werkzeug package the latest version is installed. Again there is a potential break of our application in production, because we did not test our application with that Werkzeug version: the build is not deterministic for the required sub-dependencies. Another problem can occur when we have sub-dependencies that we don't specify in the requirements.txt. Let's imagine we have package A and package B in our requirements.txt, and both depend on a package C: package A requires version 1 or higher, and package B requires version 2 or less. If we run pip install -r requirements.txt, what happens? Pip sees package A first in the requirements.txt, detects the dependency specified in the setup.py of package A, and since it says C >= 1, it installs the latest version by default. Let's assume the latest version of package C is 3, so package C version 3 is installed. Now pip wants to install package B, checks its dependencies, and sees that it requires package C with version 2 or smaller. There is now a conflict and the installation fails. How do we solve this issue? We as developers have to take care of the sub-dependencies of the packages we use. We have to check which packages the packages we directly use require, and determine which versions are needed. We have to explicitly list package C with a version higher than or equal to 1 and smaller than or equal to 2, and we have to list it before package A and package B. Like this, pip first installs a version of package C that is suitable for both A and B. This means that if we take care of all sub-dependencies of all packages we use, we won't get conflicts anymore and our application will work.
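As a sketch, the requirements.txt for this example would then explicitly pin the shared sub-dependency first; the package names are just the placeholders from the example above and the pinned versions are hypothetical.

```
# requirements.txt
package_c>=1,<=2    # shared sub-dependency, listed first with bounds that satisfy both consumers
package_a==1.0.0    # hypothetical pinned versions
package_b==2.3.0
```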
But there is one big drawback: we are responsible for updating the versions of our dependencies and even of the sub-dependencies. This means that whenever any package we use has a bug or a security issue, and a fixed version is released, we as developers have to be informed and have to update our requirements.txt. Not so good, as you understand; if we miss that, there might be security or other issues. So the correct way of deploying our application to other environments, such as a production environment, is to do pip freeze. This generates an exhaustive list with all pinned versions of our environment, and this way we make sure that our application works with the exact same package versions in all environments. But what about packages we don't want to use in production, such as pytest? The solution is to use two requirements files, one for production and one for development. Whenever we deploy to production, we only use the production requirements.txt; when another developer wants to work on our application in the dev environment, he or she can use the dev requirements.txt. But here again we are still responsible for updating all dependencies and sub-dependencies, which might bring some security issues. The question is now: how do you allow deterministic builds for your Python project without taking on the responsibility of updating the versions of sub-dependencies? The solution is quite easy. There is a tool named pipenv. It solves exactly these issues and, in addition, it manages virtual environments. This means we only need one tool for virtual environments and package management, and this is also the tool python.org recommends for managing library dependencies when developing Python applications. I provided a link to python.org and a link with a really good introduction to pipenv and more explanations. I will not give a detailed introduction to the tool, but in the next lesson we will use it for our purposes and for managing our virtual environment; we already know the tool.

73. 73. pipenv Implementation: Hello. Now I opened the Pipfile that was created by pipenv. Whenever we install new packages, they are added here in the Pipfile. Now I want to pick the packages that I don't need in production and move them from the packages section to the dev-packages section. Then let's do a pipenv lock, to ensure we have exactly the environment as it was tested and is running; the Pipfile.lock is now successfully created, and let's do a git commit. Now let's imagine we are inside our production environment. I create a new folder and just call it xetra. Just be careful that this folder, whenever you create it, is not inside another folder where you created another virtual environment using pipenv, because pipenv will auto-detect it and it won't work. I go inside my folder, and the first thing I do is git clone; then I cd into my cloned folder and do git checkout develop. We can see that the Pipfile and Pipfile.lock are there. At first I create a new virtual environment, and with pip list we see that there are just the three packages pip, setuptools and wheel. Now we do pipenv install --ignore-pipfile. Using the --ignore-pipfile argument, it will install the packages from the Pipfile.lock, and this is what we need now, because we pinned the versions and want to use exactly the same packages.
Doing pip list, we see that the packages we want to use, with the exact versions we need, are installed inside our production environment. Now let's imagine me as a developer: I want to install the application and the environment inside the development environment and do some development, add features or whatever. I just create a new folder, xetra_dev, do a git clone, cd into my folder, and git checkout develop to change the branch. Then I create a new virtual environment with pipenv shell, and then I do pipenv install --dev. This way we use the Pipfile; it will create a Pipfile.lock with the latest updated versions, and it will also install all the packages for development. Let's do pip list, and here we can see that a lot more packages were installed. Now we have set up our dev environment and we can start developing.

74. 74. Profiling and Timing - Intro: Hello. When we talk about optimization or performance issues, profiling is a widely used tool. A profile is a set of statistics that describes how often and for how long various parts of the program executed. There are actually multiple profilers out there; I just mention a few. The Python standard library provides two different implementations of the same profiling interface: cProfile is a C extension with reasonable overhead, which makes it suitable for profiling long-running programs, and profile is a pure Python module whose interface is mimicked by cProfile, but which adds significant overhead to profiled programs. Then there are third-party libraries with very good tools, such as line_profiler, which profiles the time individual lines of code take to execute, and memory_profiler, which monitors the memory consumption of a process as well as providing a line-by-line analysis of the memory consumption of Python programs. You can also plot the memory consumption over time to get a good idea of both memory and performance issues. I provided links with more information about all the profilers; here in the course I will just show how to use the memory profiler. By the way, the profiling modules are designed to provide an execution or memory profile for a given program, not for benchmarking purposes. If you want to time your code snippets exactly, there is timeit for reasonably accurate results: the timeit module provides a simple way to time small bits of Python code and has a command-line interface as well as a callable one. For our purpose here it is not needed, but I just wanted to mention it.

75. 75. Mem-Profiler: Hello. In order to use the memory profiler, we first have to install it in our environment, and I have to install matplotlib as well. As a next step, I set the first extract date in my config to yesterday's date, just so that the program doesn't take too much time. Then I have to add the @profile decorator to all methods I want to measure, and I want to measure our S3 methods as well. Now I can run my program using mprof run.
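A quick sketch of what adding the decorator looks like; memory_profiler provides the profile decorator, and mprof run / mprof plot are its command-line front end. The class and method here are only an illustrative stub, since in the project the decorators go onto the real S3 and transformer methods.

```python
from memory_profiler import profile


class S3BucketConnector:
    """Illustrative stub; in the project the decorator is added to the real methods."""

    @profile
    def read_csv_to_df(self, key: str):
        """Reading a CSV file from S3 into a DataFrame (body omitted in this sketch)."""
        ...
```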
The program ran, and now I can plot the result using mprof plot; in my folder there is now a PNG of the Xetra run, and I open it. Here you can see the timing and memory profile of our program. You can see that mostly read_csv_to_df was called and executed; we could exclude it from profiling by removing its decorator, because there are so many files that were read, and then we would probably get a better picture. But you can already see where one method starts and where it finishes, the same method always shown in the same color; the yellow ones are read_csv_to_df, and then finally some others for which there was obviously no space left to print the labels. This is a good way to get an impression of which functions or methods take the most time and where it makes sense to optimize something, because it doesn't make sense to optimize a method that takes maybe one percent of the total runtime of your program.

76. 76. Dockerization: Hello. Now it's time to create a Docker image containing the application code and the environment we need. At first we have to create a Dockerfile. Here I use the python 3.9 slim-buster base image. Then I set some environment variables: PIP_NO_CACHE_DIR, so that the Python packages are not cached; PYTHONDONTWRITEBYTECODE, so that no bytecode is written; and of course the PYTHONPATH, which has to point to our working directory where our application code is. Then we initialize the new working directory, and the directories and files that are needed are copied in. Then we pip install pipenv in the image and install our packages according to our Pipfile; here I use the --system argument, which means the packages are installed into the system-wide Python installation and no virtual environment is created. Now I want to build the Docker image using the Dockerfile I just wrote. I am inside the directory where my Dockerfile is, and I use docker build --rm -t, then the name according to my Docker Hub user account, slash xetra-etl and a tag, and then a dot. And there you can see there is an unexpected statement I have to fix: these should be curly brackets. Let's try again; another issue, or rather the same issue, I should have read it properly: the quotes are missing. Let's try once again: Pipfile not found, here I just have a typo, and let's see if it works now. And again an issue, this time with the pipenv install arguments: it should be --ignore-pipfile, so we remove the s. And now it finally works. The image is built, and I can do docker login and then docker push. Now I can see on my Docker Hub that the image is there, xetra-etl, and now I can use it.
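To recap, the Dockerfile described above looks roughly like this; the working directory, the copied paths and the exact pipenv flags are assumptions based on the walkthrough.

```dockerfile
FROM python:3.9-slim-buster

# do not cache pip packages, do not write .pyc files,
# and make the application code importable from the working directory
ENV PIP_NO_CACHE_DIR=yes
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONPATH="${PYTHONPATH}:/code"

# initialize the working directory and copy in what is needed
WORKDIR /code
COPY Pipfile Pipfile.lock run.py ./
COPY xetra ./xetra

# install the pinned dependencies into the system-wide Python, no virtual environment
RUN pip install pipenv && pipenv install --system --ignore-pipfile
```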
77. 77. Run in Production Environment: Hello. Now it's time to run the full job in a production environment. In my production environment I created a minikube cluster as the execution platform, and as the orchestration tool I use the container-native Kubernetes orchestrator Argo Workflows; the Docker image we already created is on Docker Hub. Once we want to run the job, the orchestrator Argo Workflows will trigger the job and create a container using our Docker image, and there we have our application code inside the image. Then we need the config file and the secrets. The secrets will be created as Kubernetes secrets and mounted as environment variables into our container. For the config file I will create a ConfigMap as a Kubernetes resource, and this will be mounted as a volume into the container. Once we run the job, Argo Workflows triggers it, the job loads the data from the source Xetra S3 bucket, creates the report, and loads the data into our target S3 bucket. Here we can see our target S3 bucket, and it is empty right now because we are going to do the first load. And here we can see the minikube Kubernetes dashboard with all the resources. Let's take a look at the ConfigMaps: I created a xetra-config ConfigMap, and here we see just the content of our config file. As the first extract date I chose the sixth of July, because today is the 8th of July, so I just want to extract the last two days. Then let's take a look at the secrets: I created the xetra secrets, and here we see the AWS_ACCESS_KEY_ID and the AWS_SECRET_ACCESS_KEY. Then we have our orchestration tool, Argo Workflows. There I created a CronWorkflow, I called it cron-xetra, with a defined schedule; it should run only on Saturdays. Let's see what the workflow actually does: in the workflow I just define the image that should be used, the image I created, which will be pulled from my Docker Hub, then I mount the secrets I created as environment variables, and I also mount the ConfigMap as a volume. Now we can just submit the workflow, and the workflow is running. Here we can see our parameters: the image we use, the command (I am just running run.py), and the environment variables that are mounted. And now the job is already done, so let's take a look at our target bucket. It's updated, and here we can see our meta data and our report. Looking at the meta data, the data of the last three days was extracted: the 6th, the 7th, and today, the 8th. So our job executed successfully, and now we have a robust job running in a production environment.

78. 78. Summary: Congratulations, you have successfully worked through all the lessons of the course. Now let's summarize what you learned. After understanding the Xetra source data and implementing a first quick and dirty solution in a Jupyter notebook, we discussed the importance of a proper code design and the different approaches, namely functional and object-oriented programming. Then we made a first change to the code using the functional approach, and afterwards did a proper object-oriented code design. Finally, we implemented the design, applying best practices for design principles, clean coding, virtual environments, project and folder setup, configuration, logging, exception handling, linting, dependency management, performance tuning with profiling, unit testing, integration testing, and dockerization. After the object-oriented implementation of the ETL job using Visual Studio Code, we built the image, and I showed you how it can be used in a production environment using Kubernetes and Argo Workflows as the orchestration tool. I hope that you had a lot of fun in my course and, above all, that you learned a lot. If you enjoyed the course, please give it a good rating; with this you are helping to keep the course alive. Now I say goodbye and wish you all the best.