Learn and practice each part of the data engineering process, then apply those skills to build an end-to-end data pipeline from the ground up.
This course consists of modules, workshops, and a final project that lets you apply the concepts and tools covered along the way. The syllabus is structured to guide you step by step through the world of data engineering.
- Module 1: Containerization and Infrastructure as Code
- Module 2: Workflow Orchestration
- Workshop 1: Data Ingestion
- Module 3: Data Warehouse
- Module 4: Analytics Engineering
- Module 5: Batch Processing
- Module 6: Streaming
- Project
Module 1: Containerization and Infrastructure as Code
- Course overview
- Introduction to GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment for the course
- Homework
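
Much of this module comes down to running Postgres in a Docker container and loading data into it from a Python script. The sketch below shows only the loading step and assumes a local Postgres reachable on port 5432 (for example, one started with `docker run -e POSTGRES_USER=root -e POSTGRES_PASSWORD=root -e POSTGRES_DB=ny_taxi -p 5432:5432 postgres:13`); the credentials, file name, and table name are placeholders, not values prescribed by the course.

```python
# Minimal sketch: load a CSV into a Postgres database running in Docker.
# Requires pandas, sqlalchemy, and a Postgres driver such as psycopg2.
# All connection details below are placeholders -- adjust to your setup.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Read the source file in chunks so large files do not exhaust memory,
# appending each chunk to the target table.
for chunk in pd.read_csv("yellow_tripdata_2021-01.csv", chunksize=100_000):
    chunk.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)

print("load finished")
```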
Module 2: Workflow Orchestration
- Data Lake
- Workflow orchestration
- Workflow orchestration with Kestra
- Homework
Workshop 1: Data Ingestion
- Reading from APIs
- Building scalable pipelines
- Normalising data
- Incremental loading
- Homework
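
The workshop topics above follow a common pattern: page through an API, yield records, and remember where the last run stopped so the next run only loads new data. The sketch below is a generic illustration of that pattern; the endpoint and query parameters are hypothetical, not the workshop's actual API.

```python
# Generic sketch of paginated API reading with incremental loading.
# The endpoint and parameters are hypothetical; real APIs vary, but the
# pattern (paginate, yield, resume from a cursor) is the same.
import requests

BASE_URL = "https://example.com/api/rides"  # placeholder endpoint


def fetch_pages(since_id: int = 0, page_size: int = 1000):
    """Yield pages of records with an id greater than `since_id`."""
    offset = 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"since_id": since_id, "limit": page_size, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break  # no more data to read
        yield page
        offset += page_size


# Incremental run: only request records newer than the last one we saw.
last_seen_id = 0  # in practice this comes from stored pipeline state
for page in fetch_pages(since_id=last_seen_id):
    last_seen_id = max(last_seen_id, max(r["id"] for r in page))
    print(f"loaded {len(page)} records, cursor now at {last_seen_id}")
```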
Module 3: Data Warehouse
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- BigQuery Machine Learning
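
Partitioning and clustering in BigQuery are usually set up with DDL. The sketch below runs such a statement from Python using the google-cloud-bigquery client; the dataset, table, and column names are placeholders only, and your own project will differ.

```python
# Minimal sketch: create a partitioned and clustered BigQuery table from Python.
# Requires the google-cloud-bigquery package and default GCP credentials.
# Dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE OR REPLACE TABLE my_dataset.trips_partitioned
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id AS
SELECT * FROM my_dataset.trips_raw
"""

client.query(ddl).result()  # .result() blocks until the job completes
print("table created")
```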
Module 4: Analytics Engineering
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
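
dbt models themselves are SQL files plus YAML configuration rather than Python, but recent dbt-core versions (1.5 and later) expose a programmatic runner, which is one way the deployment topic above can be scripted. A minimal sketch, assuming an existing dbt project in ./my_dbt_project with a valid profile (both names are placeholders):

```python
# Minimal sketch: invoke dbt programmatically (dbt-core >= 1.5).
# Assumes an existing dbt project in ./my_dbt_project with a configured profile.
from dbt.cli.main import dbtRunner

runner = dbtRunner()
result = runner.invoke(["run", "--project-dir", "my_dbt_project"])
print("success:", result.success)
```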
Module 5: Batch Processing
- Batch processing
- What is Spark
- Spark DataFrames
- Spark SQL
- Internals: GroupBy and joins
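
The Spark topics in this module (DataFrames, Spark SQL, groupBy and joins) can be previewed with a few lines of PySpark. In the sketch below, the input path and column names are placeholders:

```python
# Minimal PySpark sketch: DataFrame API and Spark SQL doing the same aggregation.
# The input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-example").getOrCreate()

df = spark.read.parquet("data/trips/*")  # placeholder path

# DataFrame API: trips and revenue per pickup zone per day.
daily = (
    df.withColumn("pickup_date", F.to_date("pickup_datetime"))
      .groupBy("pickup_date", "pickup_zone")
      .agg(F.sum("total_amount").alias("revenue"),
           F.count("*").alias("trips"))
)
daily.show(5)

# The same aggregation expressed in Spark SQL.
df.createOrReplaceTempView("trips")
spark.sql("""
    SELECT to_date(pickup_datetime) AS pickup_date,
           pickup_zone,
           SUM(total_amount)        AS revenue,
           COUNT(1)                 AS trips
    FROM trips
    GROUP BY to_date(pickup_datetime), pickup_zone
""").show(5)
```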
Module 6: Streaming
- Introduction to Kafka
- Schemas (Avro)
- Kafka Streams
- Kafka Connect and KSQL
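
The Kafka topics in this module start from producing and consuming messages. The sketch below uses the kafka-python client with plain JSON; the broker address and topic name are placeholders, and the course may rely on a different client library or Avro-serialized messages instead.

```python
# Minimal sketch: produce and consume JSON messages with Kafka,
# using the kafka-python client. Broker and topic names are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("rides", {"ride_id": 1, "pickup_zone": 42})
producer.flush()

consumer = KafkaConsumer(
    "rides",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)
```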
Project
Putting everything we learned into practice:
- Weeks 1 and 2: working on your project
- Week 3: reviewing your peers' projects