Data Engineering Roadmap

A detailed, practical roadmap to become a job-ready Data Engineer — from foundations to production-grade pipelines.

Overview

Data engineering is the discipline of building reliable systems that ingest, store, transform, serve, and monitor data.

A strong Data Engineer knows how to:

  • collect data from APIs, applications, databases, and files
  • design clean and scalable data models
  • create reliable ETL/ELT pipelines
  • schedule and monitor workflows
  • handle batch and streaming workloads
  • work with cloud platforms and modern data tools
  • ensure data quality, governance, and security
  • support analytics, machine learning, and operational systems

This roadmap is designed to help you progress from beginner to production-ready practitioner with a clear sequence of skills, tools, and projects.


Who This Roadmap Is For

This roadmap is ideal for:

  • beginners entering data engineering
  • data analysts moving toward engineering
  • software engineers transitioning into data platforms
  • BI developers who want stronger backend data skills
  • students preparing for internships or entry-level roles
  • professionals who want a structured path instead of random tutorials

What a Data Engineer Does

A Data Engineer typically works on:

  • Data ingestion — pulling data from databases, APIs, SaaS tools, logs, and streams
  • Data transformation — cleaning, joining, standardizing, validating, and modeling data
  • Data storage — operating data lakes, warehouses, and lakehouses
  • Pipeline orchestration — scheduling, dependency management, retries, and alerting
  • Data reliability — testing, monitoring, lineage, observability, and SLAs
  • Infrastructure — cloud services, containers, CI/CD, permissions, and deployment workflows
  • Collaboration — partnering with analysts, analytics engineers, ML engineers, and product teams

Core Skills You Need

A solid data engineering foundation combines multiple skill areas:

1. Programming

Focus on:

  • Python
  • shell basics
  • scripting and automation
  • debugging and logging
  • packages and virtual environments
  • clean code practices

2. SQL

This is non-negotiable.

You should master:

  • filtering, grouping, joins
  • window functions
  • CTEs
  • subqueries
  • aggregations
  • date handling
  • query optimization basics
  • schema design

3. Data Modeling

Learn how to structure data so it is useful and scalable:

  • normalization and denormalization
  • fact and dimension tables
  • star and snowflake schemas
  • slowly changing dimensions
  • surrogate keys
  • business grain definition

4. Distributed Processing

For large-scale systems, learn:

  • partitioning
  • parallelism
  • shuffling
  • fault tolerance
  • cluster computing concepts

5. Cloud and Infrastructure

Modern data engineering usually runs in the cloud:

  • object storage
  • IAM / permissions
  • compute services
  • containers
  • infrastructure basics
  • deployment pipelines

6. Reliability and Operations

Production systems require:

  • data quality checks
  • alerting
  • retry logic
  • idempotency
  • monitoring
  • incident handling
  • cost awareness

Roadmap Stages

Stage 0: Mindset and Setup

Before tools, build the right habits.

Learn

  • what ETL vs ELT means
  • difference between OLTP and OLAP
  • basics of files, tables, APIs, logs, events, and streams
  • why data pipelines fail in production
  • importance of reproducibility and observability

Set Up Your Environment

Install and configure:

  • Python
  • VS Code or PyCharm
  • Git and GitHub
  • Docker
  • a local SQL database such as PostgreSQL
  • Jupyter Notebook (optional)

Output

By the end of this stage, you should be able to:

  • write and run scripts locally
  • manage code with Git
  • use Docker at a basic level
  • connect Python to a database

Stage 1: Programming Fundamentals

Primary Language: Python

Python is widely used in data engineering for scripting, orchestration, API integrations, and Spark jobs.

Topics to Learn

  • variables, data types, functions, classes
  • loops and conditionals
  • list/dict/set comprehensions
  • file handling
  • exception handling
  • modules and packages
  • virtual environments
  • logging
  • type hints
  • unit testing basics

Libraries to Know Early

  • requests
  • pandas
  • pathlib
  • json
  • datetime
  • sqlalchemy
  • psycopg / database drivers

Practice Tasks

  • pull data from a public API
  • save JSON responses to files
  • transform raw data into tabular format
  • load transformed data into PostgreSQL
  • add logging and error handling

Goal

Build small scripts that behave like simple pipelines.
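
The practice tasks above fit into one small extract/transform/load script. Here is a minimal, offline sketch: `extract()` returns a canned payload standing in for a real `requests.get(...).json()` call, and the record fields are made up for illustration.

```python
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("mini_pipeline")

def extract() -> list[dict]:
    # Stand-in for a real API call so the sketch runs offline.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "bad"}]

def transform(records: list[dict]) -> list[dict]:
    clean = []
    for r in records:
        try:
            clean.append({"id": r["id"], "amount": float(r["amount"])})
        except (KeyError, ValueError):
            log.warning("skipping malformed record: %r", r)
    return clean

def load(records: list[dict], path: Path) -> None:
    path.write_text(json.dumps(records, indent=2))
    log.info("wrote %d records to %s", len(records), path)

if __name__ == "__main__":
    load(transform(extract()), Path("clean_records.json"))
```

Even at this size the script has the shape of a pipeline: separable stages, logging, and error handling that quarantines bad records instead of crashing.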


Stage 2: SQL and Data Modeling

SQL is the heart of analytics and warehouse work.

SQL Topics to Master

Fundamentals

  • SELECT
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • JOIN
  • UNION

Intermediate

  • window functions
  • CTEs
  • nested queries
  • case expressions
  • date/time functions
  • text manipulation
  • null handling

Advanced

  • execution plans
  • indexing concepts
  • query performance tuning
  • incremental logic
  • deduplication patterns
  • change data capture concepts

Data Modeling Topics

  • entities and relationships
  • business keys vs surrogate keys
  • dimensional modeling
  • medallion layering (bronze, silver, gold)
  • snapshots and historical tracking
  • schema evolution basics

Practice Tasks

  • design a mini e-commerce schema
  • write 25–50 SQL queries of increasing difficulty
  • create fact and dimension tables
  • build daily summary tables
  • solve window function problems

Goal

Be able to transform messy source data into analyst-ready models.
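
As a taste of the deduplication and window-function patterns listed above, here is a small sketch using SQLite (window functions require SQLite 3.25+, which ships with modern Python). Keeping only the latest copy of each key via `ROW_NUMBER()` is a very common warehouse pattern; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (order_id INT, customer TEXT, loaded_at TEXT);
INSERT INTO raw_orders VALUES
  (1, 'ana', '2024-01-01'),
  (1, 'ana', '2024-01-02'),   -- duplicate, re-ingested later
  (2, 'ben', '2024-01-01');
""")

# Keep only the most recently loaded copy of each order_id.
rows = conn.execute("""
SELECT order_id, customer, loaded_at FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id ORDER BY loaded_at DESC
         ) AS rn
  FROM raw_orders
)
WHERE rn = 1
ORDER BY order_id
""").fetchall()

print(rows)  # [(1, 'ana', '2024-01-02'), (2, 'ben', '2024-01-01')]
```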


Stage 3: Databases and Storage Systems

You need to understand both transactional and analytical systems.

Learn the Difference

OLTP Systems

Used for application transactions:

  • PostgreSQL
  • MySQL
  • SQL Server

OLAP / Analytical Systems

Used for reporting and analytics:

  • data warehouses
  • columnar storage systems
  • lakehouse engines

Concepts to Learn

  • row vs column storage
  • partitions
  • indexes
  • replication basics
  • transactions and ACID
  • file formats: CSV, JSON, Parquet, Avro
  • compression and serialization
  • schema-on-write vs schema-on-read

Practice Tasks

  • compare CSV vs Parquet size and speed
  • load large files into a relational database
  • partition sample data by date
  • benchmark simple queries

Goal

Understand how storage design affects performance, cost, and maintainability.


Stage 4: Data Warehousing and Analytics Engineering

This is where raw data becomes trusted business data.

Learn

  • warehouse architecture
  • layered data models
  • transformation workflows
  • testing and documentation
  • incremental transformations
  • semantic consistency across teams
  • dbt fundamentals
  • source freshness checks
  • tests for unique/not-null/relationships
  • modular SQL models
  • documentation generation
  • staging → intermediate → marts structure

Practice Tasks

  • build a dbt project on top of PostgreSQL or a cloud warehouse
  • model raw order data into marts
  • add tests and documentation
  • create a dashboard-ready schema

Goal

Be able to take raw ingested data and produce reliable business-ready datasets.


Stage 5: Batch Data Processing

Most organizations still rely heavily on batch workloads.

Learn

  • scheduled ingestion
  • chunked processing
  • retries
  • backfills
  • idempotent loads
  • full refresh vs incremental loads
  • partition-based processing
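
Idempotent loads deserve a concrete illustration: if each batch upserts keyed on its partition, re-running the batch (during a retry or backfill) overwrites instead of duplicating. A minimal sketch with SQLite's upsert syntax (`ON CONFLICT`, SQLite 3.24+); the table name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, revenue REAL)")

def load_partition(day: str, revenue: float) -> None:
    # Upsert keyed on the partition (the day), so a re-run replaces
    # the previous load instead of inserting a duplicate row.
    conn.execute(
        """
        INSERT INTO daily_sales (day, revenue) VALUES (?, ?)
        ON CONFLICT(day) DO UPDATE SET revenue = excluded.revenue
        """,
        (day, revenue),
    )

load_partition("2024-01-01", 100.0)
load_partition("2024-01-01", 120.0)  # retry with corrected data

print(conn.execute("SELECT * FROM daily_sales").fetchall())
# [('2024-01-01', 120.0)]
```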

Tools to Explore

  • Python-based ETL scripts
  • Spark / PySpark
  • SQL transformations in warehouses

Spark Topics

  • DataFrames
  • transformations vs actions
  • lazy execution
  • partitioning
  • joins and aggregations
  • reading/writing Parquet
  • Spark SQL
  • performance basics

Practice Tasks

  • process millions of rows with PySpark
  • clean and aggregate large event data
  • write partitioned Parquet outputs
  • compare local Python vs Spark for scale

Goal

Learn when simple scripts are enough and when distributed processing is necessary.


Stage 6: Workflow Orchestration

Pipelines are not just scripts. They need scheduling, dependency management, and monitoring.

Learn

  • DAG concepts
  • task dependencies
  • retries and backoff
  • scheduling intervals
  • parameterization
  • backfills
  • SLAs and alerts
  • secrets handling
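
Before reaching for Airflow, the core DAG ideas can be sketched in plain Python with the standard library's `graphlib` (Python 3.9+). This is not Airflow code, just a toy illustration of dependency ordering and a retry loop; the four task names mirror the practice task below and are otherwise invented.

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline: ingest -> transform -> validate -> publish,
# where validate also depends on a separate quality_rules task.
dag = {
    "transform": {"ingest"},
    "quality_rules": set(),
    "validate": {"transform", "quality_rules"},
    "publish": {"validate"},
}

def run_task(name: str, retries: int = 2) -> None:
    # A crude retry loop; real orchestrators add backoff and alerting.
    for attempt in range(retries + 1):
        try:
            print(f"running {name} (attempt {attempt + 1})")
            return  # a real task would do work here and may raise
        except Exception:
            if attempt == retries:
                raise

# static_order() yields tasks so that every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run_task(task)
```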

Tool to Prioritize

  • Apache Airflow

Practice Tasks

  • create DAGs for ingest → transform → validate → publish
  • run daily and hourly schedules
  • add retry policies and failure alerts
  • parameterize DAGs for different environments

Goal

Move from “script runner” thinking to “production workflow system” thinking.


Stage 7: Streaming and Event-Driven Data

Many systems need near-real-time data movement.

Learn

  • events, topics, partitions, offsets
  • producers and consumers
  • delivery guarantees
  • stateful vs stateless processing
  • watermarking and late-arriving data
  • event time vs processing time
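
The event-time vs processing-time distinction is easiest to see in code. This toy sketch buckets simulated click events into one-minute tumbling windows by their event timestamp, so an event that arrives late still lands in the window it belongs to; the field names are invented.

```python
from collections import defaultdict
from datetime import datetime

# Simulated click events; the third arrives out of order.
events = [
    {"ts": "2024-01-01T10:00:05", "user": "a"},
    {"ts": "2024-01-01T10:01:10", "user": "b"},
    {"ts": "2024-01-01T10:00:40", "user": "c"},  # late by event time
]

def window_start(ts: str) -> str:
    # Truncate to the minute: the start of a 60-second tumbling window.
    t = datetime.fromisoformat(ts)
    return t.replace(second=0, microsecond=0).isoformat()

counts: dict[str, int] = defaultdict(int)
for e in events:
    # Bucketing on event time (the ts field), not arrival order,
    # puts the late event into the 10:00 window where it belongs.
    counts[window_start(e["ts"])] += 1

print(dict(counts))
# {'2024-01-01T10:00:00': 2, '2024-01-01T10:01:00': 1}
```

Real stream processors add watermarks on top of this idea: a rule for how long to wait for late events before a window is considered closed.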

Tools to Explore

  • Apache Kafka
  • Kafka Connect concepts
  • Spark Structured Streaming
  • stream processing fundamentals

Practice Tasks

  • stream click events into Kafka
  • consume messages with Python
  • aggregate events over time windows
  • send enriched records to storage or a database

Goal

Understand real-time patterns even if your first job is primarily batch-oriented.


Stage 8: Cloud Data Engineering

Most real-world roles expect cloud familiarity.

Pick One Cloud First

Choose one deeply before sampling the others:

  • AWS
  • GCP
  • Azure

Universal Cloud Concepts

  • object storage
  • managed databases
  • serverless compute
  • data warehouse services
  • secrets and IAM
  • networking basics
  • logging and monitoring
  • cost optimization

Typical Services by Category

Storage

  • S3 / GCS / Azure Blob Storage

Compute

  • EC2 / Compute Engine / Virtual Machines
  • Lambda / Cloud Functions / Azure Functions
  • container services

Warehousing

  • Redshift
  • BigQuery
  • Snowflake
  • Azure Synapse / Fabric-oriented ecosystems

Orchestration and Integration

  • Airflow-managed offerings
  • cloud-native schedulers
  • connectors and ingestion services

Practice Tasks

  • store raw files in object storage
  • run transformations in cloud compute or warehouse SQL
  • create an end-to-end ELT pipeline in one cloud
  • track costs and permissions

Goal

Be able to build cloud-native pipelines instead of only local demos.


Stage 9: Data Quality, Governance, and Security

This is what separates toy projects from trustworthy systems.

Data Quality Topics

  • null checks
  • uniqueness checks
  • referential integrity
  • freshness checks
  • volume anomaly checks
  • distribution checks
  • schema drift detection
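
Several of these checks are simple to express directly. A minimal sketch of not-null and uniqueness checks over in-memory rows; in practice you would run these as SQL or dbt tests against the warehouse, and the column names here are invented.

```python
rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@x.com"},  # duplicate id
]

def check_not_null(rows, col):
    # Rows where the column is missing or null fail the check.
    return [r for r in rows if r.get(col) is None]

def check_unique(rows, col):
    # Any value seen more than once fails the check.
    seen, dupes = set(), []
    for r in rows:
        if r[col] in seen:
            dupes.append(r)
        seen.add(r[col])
    return dupes

failures = {
    "email_not_null": check_not_null(rows, "email"),
    "id_unique": check_unique(rows, "id"),
}
for name, bad in failures.items():
    if bad:
        print(f"CHECK FAILED: {name} -> {len(bad)} offending row(s)")
```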

Governance Topics

  • metadata
  • lineage
  • ownership
  • naming conventions
  • documentation
  • data classification
  • retention policies

Security Topics

  • principle of least privilege
  • secret management
  • encryption at rest and in transit
  • PII awareness
  • auditability

Practice Tasks

  • add validation steps to every pipeline
  • create data contracts for key tables
  • document lineage from raw to mart
  • separate dev/staging/prod access

Goal

Learn to build pipelines others can trust.


Stage 10: DevOps for Data Engineers

Data engineers need software engineering discipline.

Learn

  • Git workflows
  • branching strategy
  • pull requests
  • CI/CD basics
  • automated tests
  • Docker images
  • environment variables
  • IaC concepts (Terraform is a strong plus)
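
To make CI/CD concrete, here is a minimal GitHub Actions workflow sketch for a Python data project. It assumes a `requirements.txt`, `ruff` for linting, and `pytest` for tests; adapt the file names and tools to your own repo.

```yaml
# .github/workflows/ci.yml (hypothetical layout)
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check .   # lint
      - run: pytest         # unit tests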

Practice Tasks

  • dockerize a pipeline project
  • add unit tests and SQL tests
  • configure GitHub Actions for linting/tests
  • deploy the same project to dev and prod configs

Goal

Make your data projects reproducible, reviewable, and deployable.


Stage 11: Production Readiness

At this stage, focus less on tools and more on system design.

Learn

  • scalability trade-offs
  • fault tolerance
  • observability
  • cost vs performance
  • schema evolution
  • reprocessing and replay strategies
  • SLA / SLO thinking
  • stakeholder communication

Questions You Should Be Able to Answer

  • What happens if a pipeline fails halfway through?
  • How do you backfill historical data safely?
  • How do you detect bad upstream data?
  • How do you avoid duplicates?
  • How do you debug slow jobs?
  • When should you use Spark instead of SQL?
  • When is streaming worth the complexity?

Goal

Think like an engineer responsible for business-critical data systems.



Project Roadmap

Projects matter more than endless tutorials.

Project 1: API to PostgreSQL Pipeline

Goal: Build a small ETL pipeline.

Stack: Python, requests, PostgreSQL, Docker

Features:

  • extract data from a public API
  • transform nested JSON into clean tables
  • load into PostgreSQL
  • add logs and retry handling

Project 2: Analytics Modeling Project

Goal: Create business-ready warehouse models.

Stack: SQL, dbt, PostgreSQL or cloud warehouse

Features:

  • staging models
  • marts
  • tests
  • documentation
  • incremental models

Project 3: Orchestrated ELT Pipeline

Goal: Run a full multi-step pipeline automatically.

Stack: Airflow, dbt, Docker

Features:

  • ingest raw data
  • transform with dbt
  • run data quality checks
  • publish final tables
  • add alerts

Project 4: Batch Big Data Pipeline

Goal: Process large datasets efficiently.

Stack: PySpark, Parquet, object storage

Features:

  • read large event logs
  • clean and aggregate data
  • partition outputs by date
  • benchmark performance changes

Project 5: Streaming Pipeline

Goal: Demonstrate real-time thinking.

Stack: Kafka, Python consumers, Spark Structured Streaming or similar

Features:

  • ingest event stream
  • aggregate in windows
  • store results downstream
  • monitor lag/failures

Project 6: Cloud End-to-End Data Platform Demo

Goal: Show production-style architecture.

Stack: cloud storage + cloud warehouse + orchestration + IaC

Features:

  • raw, refined, curated layers
  • automated scheduling
  • cost-aware storage choices
  • permissions and environment separation

Beginner to Advanced Tool Stack

Below is a practical progression, not a mandatory list.

Foundation Tools

  • Python
  • SQL
  • PostgreSQL
  • Git/GitHub
  • Docker

Early Data Engineering Tools

  • pandas
  • SQLAlchemy
  • dbt
  • Airflow

Scale Tools

  • PySpark
  • Kafka
  • Parquet / Avro
  • object storage

Cloud / Platform Tools

  • AWS / GCP / Azure
  • warehouse platform
  • Terraform
  • CI/CD pipelines

Reliability Tools

  • data tests
  • alerting systems
  • logging / metrics / dashboards
  • lineage and catalog tools

Portfolio Checklist

Your GitHub portfolio should show more than notebooks.

Include

  • clear README files
  • architecture diagrams
  • setup instructions
  • sample datasets or generation scripts
  • tests
  • screenshots of DAGs or dashboards
  • documentation of trade-offs
  • failure handling notes
  • deployment steps

Recruiter-Friendly Project Questions Your Repo Should Answer

  • What problem does this solve?
  • What tools are used and why?
  • How is data ingested, transformed, and stored?
  • How is quality validated?
  • How is the workflow scheduled?
  • How would this scale in production?

Interview Preparation

Technical Topics

  • SQL joins, windows, deduplication
  • Python data structures and scripting
  • ETL vs ELT
  • dimensional modeling
  • Airflow concepts
  • Spark basics
  • partitioning and file formats
  • Kafka fundamentals
  • warehousing concepts
  • cloud basics
  • testing and observability

Scenario Questions

Prepare for questions like:

  • design a daily sales pipeline
  • handle duplicate events in ingestion
  • backfill 2 years of historical data
  • optimize a slow transformation job
  • model orders, customers, and products for analytics
  • move from batch to near-real-time reporting

Behavioral Topics

  • debugging production issues
  • communicating with analysts/stakeholders
  • prioritizing reliability vs speed
  • documenting assumptions and trade-offs

Common Mistakes to Avoid

  • learning too many tools at once
  • skipping SQL depth
  • building only notebooks instead of pipelines
  • ignoring testing and logging
  • using Spark before understanding basic data workflows
  • copying tutorials without changing anything
  • focusing on certificates over projects
  • neglecting cloud cost and permissions
  • avoiding documentation

Learn from Official Documentation

Practice Platforms

  • SQL practice websites
  • LeetCode SQL sections
  • Kaggle datasets
  • public APIs
  • cloud free tiers / sandbox projects

Books / Concepts to Study

  • data modeling
  • distributed systems basics
  • database internals basics
  • analytics engineering principles
  • software engineering best practices

How to Use This Roadmap

Use this roadmap in cycles:

  1. Learn a concept
  2. Apply it in a small exercise
  3. Use it in a project
  4. Document what you learned
  5. Refactor and improve

Do not wait until you “know enough” before building.

Build early. Break things. Fix them. Document them. Repeat.


Final Advice

The fastest way to become a Data Engineer is not to memorize every tool. It is to consistently build systems that move data from raw → reliable → usable.

A strong path looks like this:

  1. Master Python + SQL
  2. Learn databases + modeling
  3. Build ETL/ELT pipelines
  4. Add dbt + Airflow
  5. Learn Spark + cloud
  6. Understand Kafka + streaming
  7. Practice testing, monitoring, and production design
  8. Build a portfolio that proves you can do the work

If you stay consistent and project-focused, this roadmap can take you from beginner to job-ready with real confidence.


Intro to Python

Python is the de facto language for modern data engineering—valued for its readable syntax, rich ecosystem (Pandas, PySpark, Kafka clients), and strong integration with cloud and big‑data platforms; mastering its architecture, libraries, and operational patterns is essential for building reliable, scalable pipelines in production.

Introduction to Python for Data Engineers

  • Purpose: Python is used to ingest, transform, validate, orchestrate, and serve data at scale.
  • Why it matters for advanced engineers: Python balances developer productivity with extensibility into high‑performance systems (native extensions, distributed runtimes), making it suitable for both prototyping and production workloads.

Core concepts (non‑code)

  • Language model and runtime: Understand the interpreter model, Global Interpreter Lock (GIL) implications for concurrency, and when to prefer multi‑processing, async IO, or offloading to JVM/compiled libraries.
  • Data model & memory: Know how Python represents arrays, dataframes, and in‑memory objects, and the tradeoffs between in‑memory processing (Pandas) and out‑of‑core/distributed approaches (Dask, PySpark).
  • Ecosystem roles: Different libraries serve distinct layers—Pandas/NumPy for local transforms, PySpark for distributed compute, SQLAlchemy for DB access, Kafka clients for streaming ingestion.

Python Resources

This page contains curated resources to help you master Python, specifically tailored for the needs of a Data Engineer.


Official Python Tutorial

  • Source: docs.python.org
  • Why it’s here: The gold standard. If you want to understand how the language is designed and stay up-to-date with the latest features, always start here.

FreeCodeCamp: Python for Beginners

  • Source: YouTube
  • Why it’s here: Great for visual learners. This comprehensive video covers the fundamentals in a project-based way that is easy to follow.

Real Python

  • Source: realpython.com
  • Why it’s here: Excellent for deep dives. Once you know the basics, use this to master advanced topics like Decorators, Generators, and Context Managers—all of which are vital for writing clean data pipelines.

Python Practice Tasks

These tasks are designed to build the foundational logic required for Data Engineering pipelines.


🛠️ Data Engineering Prep Tasks

📝 Practice Tasks (Standard Library)

1. The CSV to JSON Converter

Goal: Master the csv and json modules.

  • Scenario: A legacy system exports data in CSV, but your modern API only accepts JSON.
  • Requirements:
    • Read a file users.csv without using external libraries.
    • Convert each row into a Dictionary.
    • Handle “dirty” data: if a numeric field contains text, default it to 0.
    • Output: A beautifully formatted users.json file.
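
One possible solution sketch, using only `csv` and `json` from the standard library. A `StringIO` stands in for `users.csv` so the snippet runs without a file on disk; swap it for `open("users.csv", newline="")` in practice, and note the column names here are invented.

```python
import csv
import io
import json

# Stand-in for users.csv.
raw = io.StringIO("name,age\nana,34\nben,not_a_number\n")

users = []
for row in csv.DictReader(raw):
    try:
        row["age"] = int(row["age"])
    except ValueError:
        row["age"] = 0  # "dirty" numeric field defaults to 0
    users.append(row)

# indent=2 gives the "beautifully formatted" output the task asks for;
# in practice you would write this to users.json.
output = json.dumps(users, indent=2)
print(output)
```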

2. The Memory-Efficient Log Scanner

Goal: Use Generators to handle “Big Data” on a small machine.

  • Scenario: You have a 5GB log file, but your computer only has 4GB of RAM.
  • Requirements:
    • Use a generator function (yield) to read the file line-by-line.
    • Count occurrences of keywords like ERROR, WARNING, and INFO.
    • Output: A summary dictionary: {'ERROR': 450, 'WARNING': 120, 'INFO': 2000}.
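
A sketch of the generator approach: iterating a file handle yields one line at a time, so memory use stays flat regardless of file size. A small `StringIO` stands in for the 5GB log here; pass `open("app.log")` in practice.

```python
import io

# Stand-in for the huge log file.
fake_log = io.StringIO("ERROR disk full\nINFO started\nWARNING slow\nERROR oom\n")

def read_lines(handle):
    # yield keeps only one line in memory at a time
    for line in handle:
        yield line

counts = {"ERROR": 0, "WARNING": 0, "INFO": 0}
for line in read_lines(fake_log):
    for level in counts:
        if level in line:
            counts[level] += 1

print(counts)  # {'ERROR': 2, 'WARNING': 1, 'INFO': 1}
```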

Control Flow: Basic to Advanced

Control flow is the “brain” of your script, determining which chunks of code run based on specific conditions.

🚦 Basic Control Flow

Learn how to use if, elif, and else to guide your data logic.

🔄 Advanced Iteration

In Data Engineering, we often deal with large datasets where performance matters.
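
A small sketch tying both ideas together: `if`/`elif`/`else` routes records by quality, a list comprehension replaces an append loop, and a generator expression filters lazily without building a second list. The field names are invented for illustration.

```python
# Basic branching: route a record by its data quality.
def classify(row: dict) -> str:
    if row.get("price") is None:
        return "quarantine"
    elif row["price"] < 0:
        return "reject"
    else:
        return "accept"

rows = [{"price": 10}, {"price": -5}, {"price": None}]

# List comprehension: one pass, no manual .append() loop.
labels = [classify(r) for r in rows]
print(labels)  # ['accept', 'reject', 'quarantine']

# Generator expression: counts matches without materializing a list.
accepted = sum(1 for r in rows if classify(r) == "accept")
```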


🧪 Quick Challenge

Write a Nested Loop that flattens a list of lists (a common task when dealing with nested JSON data).

  • Input: [[1, 2], [3, 4], [5, 6]]
  • Output: [1, 2, 3, 4, 5, 6]
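
One possible solution, first with the explicit nested loop the challenge asks for, then the equivalent comprehension one-liner.

```python
nested = [[1, 2], [3, 4], [5, 6]]

# Explicit nested loop:
flat = []
for inner in nested:
    for item in inner:
        flat.append(item)

# Idiomatic one-liner equivalent:
flat_comp = [item for inner in nested for item in inner]

print(flat)  # [1, 2, 3, 4, 5, 6]
assert flat == flat_comp
```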

Pandas for Data Engineering

Pandas is a fast, powerful, and flexible open-source data analysis and manipulation tool built on top of the Python programming language.

🔗 Learning Resources

📝 Practice Tasks (The Basics)

1. The Schema Validator

Goal: Ensure incoming data matches your requirements.

  • Scenario: You have a CSV of “Sales” data. You need to ensure no prices are negative and all dates are valid.
  • Requirements:
    • Load sales.csv.
    • Use df.info() to check data types.
    • Use boolean indexing to find rows where price < 0 and drop them.
    • Convert the transaction_date column to datetime64[ns].
    • Output: A report of how many rows were deleted during cleaning.

2. The Aggregator (Group-By)

Goal: Create a summary report from raw events.

  • Scenario: A CSV contains store_id, product_category, and revenue.
  • Requirements:
    • Group the data by store_id.
    • Calculate the Total Revenue and Average Revenue per store.
    • Sort the results so the highest-earning store is at the top.
    • Output: Save the summary to store_performance.csv.
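
A sketch of the group-by task, assuming pandas is installed. An in-memory `StringIO` stands in for the CSV so the snippet is self-contained; swap it for `pd.read_csv("sales.csv")` and add `summary.to_csv("store_performance.csv")` in practice.

```python
import io
import pandas as pd

# Stand-in for the raw CSV of sales events.
raw = io.StringIO(
    "store_id,product_category,revenue\n"
    "s1,produce,10.0\n"
    "s1,dairy,20.0\n"
    "s2,produce,5.0\n"
)
df = pd.read_csv(raw)

# Group by store, compute total and average revenue, highest earner first.
summary = (
    df.groupby("store_id")["revenue"]
      .agg(total_revenue="sum", mean_revenue="mean")
      .sort_values("total_revenue", ascending=False)
      .reset_index()
)
print(summary)
```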

Introduction to SQL

SQL is the language of data. For a Data Engineer, it is essential for performing transformations (the ‘T’ in ETL) and querying data warehouses.

🔗 Learning Resources

📊 SQL Practice Tasks: Retail Database

In this module, you will build a relational structure for a grocery store and perform data audits.

Task 1: Table Architecture (DDL)

Create a table named store_sales with the following requirements:

  • transaction_id: Integer, Primary Key.
  • item_name: String, cannot be null.
  • price: Float, must be greater than 0.
  • quantity: Integer, default 1.
  • category: String, default to ‘General’.
  • sale_date: Date.

Task 2: Data Ingestion (DML)

Insert 5 rows of data. Ensure at least one row has a NULL category to test your default constraint, and one row has a high quantity (over 10) for future filtering.

Task 3: The “Revenue” Report

Write a query to calculate:

  1. The Total Revenue (Price * Quantity) per category.
  2. The Average Price of items in the ‘Produce’ category.
  3. Sort the results so the highest-earning category is at the top.

Task 4: Conditional Flagging

Use a CASE statement to create a new column sale_type:

  • If quantity > 10, label as ‘Wholesale’.
  • Otherwise, label as ‘Retail’.
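
The tasks above can be tried locally without installing a database server. This sketch runs the DDL plus the Task 4 `CASE` query against an in-memory SQLite database (the schema includes a `quantity` column, which the later tasks rely on); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_sales (
  transaction_id INTEGER PRIMARY KEY,
  item_name TEXT NOT NULL,
  price REAL CHECK (price > 0),
  quantity INTEGER DEFAULT 1,
  category TEXT DEFAULT 'General',
  sale_date TEXT
);
INSERT INTO store_sales (transaction_id, item_name, price, quantity)
VALUES (1, 'Apples', 2.5, 12), (2, 'Milk', 1.2, 2);
""")

# Task 4: flag wholesale vs retail sales with a CASE expression.
rows = conn.execute("""
SELECT item_name,
       CASE WHEN quantity > 10 THEN 'Wholesale' ELSE 'Retail' END AS sale_type
FROM store_sales
""").fetchall()
print(rows)  # [('Apples', 'Wholesale'), ('Milk', 'Retail')]
```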

Introduction to dbt (data build tool)

dbt is the “T” (Transform) in ELT: a development framework that combines SQL with software engineering best practices (version control, testing, CI/CD) to build data transformations. It lets Data Engineers and Analysts write transformations in plain SQL while dbt handles the complex work of dependency management, documentation, and data testing.

Core Concepts

  • T in ELT: dbt does not move data; it transforms data already loaded into your warehouse (Snowflake, BigQuery, etc.).
  • SQL-based: If you know SELECT statements, you can use dbt.
  • Modular: It turns monolithic scripts into reusable models.

🔗 Learning Resources

💎 dbt (Data Build Tool) Practice Tasks

dbt sits on top of your Data Warehouse. It allows you to transform “Raw” data into “Clean” data using SQL and Jinja templating.


📝 Practice Tasks: Analytics Engineering

Task 1: Project Initialization & Profiles

Goal: Set up the dbt environment.

  • Requirement: Run dbt init grocery_project.
  • Challenge: Configure your profiles.yml to connect to a local SQLite database or a cloud warehouse (Snowflake/BigQuery).
  • Verification: Run dbt debug to ensure the connection is successful.

Task 2: Building Modular Models (The ref function)

Goal: Create a dependency between two models.

  • Requirement: 1. Create a “Staging” model stg_sales.sql that selects all data from your raw sales table. 2. Create a “Fact” model fct_daily_summary.sql that uses the {{ ref('stg_sales') }} function to calculate daily totals.
  • Discussion: Why is using ref() better than hardcoding the table name like FROM my_db.my_schema.stg_sales?

Task 3: Data Documentation & YAML

Goal: Document your data for the business team.

  • Requirement: Create a schema.yml file in your models/ folder.
  • Challenge: Add a description for the transaction_id and total_revenue columns.
  • Command: Run dbt docs generate followed by dbt docs serve. Can you see your documentation in the browser?

Task 4: Automated Testing (Data Quality)

Goal: Catch bad data before it hits the dashboard.

  • Requirement: In your schema.yml, add the following tests to the transaction_id column:
    • unique
    • not_null
  • Challenge: Add an accepted_values test to the category column to ensure only ‘Produce’, ‘Dairy’, and ‘Bakery’ are allowed.
  • Command: Run dbt test and check if your data passes.
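
For Tasks 3 and 4 together, a `schema.yml` along these lines would declare both the documentation and the tests; the model name `stg_sales` follows Task 2, and the descriptions are placeholders to fill in.

```yaml
# models/schema.yml (sketch)
version: 2

models:
  - name: stg_sales
    description: "Staging model over the raw sales table."
    columns:
      - name: transaction_id
        description: "Unique id of each sale."
        tests:
          - unique
          - not_null
      - name: category
        tests:
          - accepted_values:
              values: ['Produce', 'Dairy', 'Bakery']
```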

Introduction to PySpark

PySpark allows you to use the power of Apache Spark with Python, enabling large-scale data processing across clusters.

🔗 Learning Resources

⚙️ PySpark Practice Tasks: Big Data Processing

When data grows to Terabytes, we move from Pandas to Spark. Note: Spark uses Lazy Evaluation.

Task 1: Schema Enforcement

Instead of using inferSchema=True, manually define a StructType schema for the grocery data prepared in the SQL practice tasks.

Task 2: Distributed Transformations

Using the Spark DataFrame API:

  1. Filter rows where quantity is less than 1 (Data Quality check).
  2. Add a column taxed_price which is price * 1.15.
  3. Rename the column item_name to product_description.

Task 3: Wide Transformations (Shuffle)

Perform a groupBy on the category column and calculate the sum of quantity.

  • Discussion: Explain why groupBy is considered a “Wide Transformation” compared to filter.

Task 4: Optimization

Use the .cache() method on a DataFrame. In what scenario would a Data Engineer choose to cache a dataset?

Introduction to Databricks

Databricks is a unified analytics platform built on top of Apache Spark. It pioneered the “Lakehouse” architecture, which provides the performance of a data warehouse with the massive scale and flexibility of a data lake.

🔗 Learning Resources

🧱 Databricks Practice Tasks: The Lakehouse Challenge

Databricks is a unified data analytics platform. It combines the best of Data Warehouses (SQL/Performance) and Data Lakes (Storage/Flexibility) into a Lakehouse.


📝 Practice Tasks: Notebooks & Delta Lake

Task 1: Cluster & Notebook Setup

Goal: Understand the Compute layer.

  • Requirement: Spin up a “Personal Compute” or “Community Edition” cluster.
  • Challenge: Create a New Notebook.
    • Set the default language to Python.
    • Use a “Magic Command” (%sql) in the second cell to switch to SQL.
  • Verification: Run spark.version in Python and SELECT 1 in SQL.

Task 2: The Magic of Delta Lake (Time Travel)

Goal: Understand ACID transactions on a Data Lake.

  • Requirement: Load the grocery CSV into a Delta Table.
    df.write.format("delta").saveAsTable("sales_delta")
    
  • Challenge: 1. Update a price in the table: UPDATE sales_delta SET price = 10.0 WHERE id = 1. 2. Use Time Travel to see the previous version:
    SELECT * FROM sales_delta VERSION AS OF 0
    
  • Discussion: Why is “Time Travel” a lifesaver for Data Engineers during a pipeline failure?

Task 3: Multi-Language Transformation (Polyglot)

Goal: Use the best tool for the job within one notebook.

  • Requirement: 1. Use PySpark to read the raw data and filter out nulls. 2. Register the result as a Temporary View: df.createOrReplaceTempView("cleaned_data"). 3. Use SQL (%sql) to perform a GROUP BY on that temporary view.
  • Challenge: Why is this “Polyglot” approach better than forcing everything into one language?

Task 4: Databricks Utilities (dbutils)

Goal: Interact with the file system (DBFS).

  • Requirement: Use dbutils.fs.ls("/") to list the root directories.
  • Challenge: Create a new folder named /tmp/my_project/ and move a sample file into it using dbutils.fs.mv.
  • Command: display(dbutils.fs.ls("/tmp/my_project/"))

🏗️ Mini Project: The Medallion Architecture

Build a simplified “Medallion” pipeline in a single notebook:

  1. Bronze Layer (Raw): Load the raw CSV exactly as it is into a Delta table named bronze_sales.
  2. Silver Layer (Cleaned): Filter out negative prices, fix date formats, and deduplicate. Save as silver_sales.
  3. Gold Layer (Aggregated): Create a final table gold_category_revenue that sums up revenue for the business dashboard.
  4. Optimization: Run OPTIMIZE gold_category_revenue ZORDER BY (sale_date) to speed up queries.

Apache Kafka & Event Streaming

Kafka is a distributed event store and stream-processing platform. It allows you to handle trillions of events a day, making it the backbone of real-time Data Engineering.

🔗 Learning Resources

⚡ Apache Kafka Practice Tasks: Real-Time Streaming

Kafka is a distributed event streaming platform. Instead of processing “batches” of data, we process “events” as they happen.


📝 Practice Tasks: Topics, Producers, and Consumers

Task 1: Infrastructure Setup (CLI)

Goal: Understand the core components of a Kafka Cluster.

  • Requirement: Start a Zookeeper and Kafka broker (using Docker or local binaries).
  • Challenge: 1. Create a Topic named grocery-sales with 3 partitions and a replication factor of 1. 2. List all topics to verify it exists. 3. Describe the topic to see the partition distribution.
  • Command: kafka-topics.sh --create --topic grocery-sales --bootstrap-server localhost:9092 ...

Task 2: The Console Producer & Consumer

Goal: Send and receive your first messages manually.

  • Requirement: Open two terminal windows.
  • Challenge: 1. In Terminal A (Producer), type 3 messages representing sales: {"id": 1, "item": "Apple"}. 2. In Terminal B (Consumer), read the messages from the beginning (--from-beginning).
  • Discussion: What happens to the messages in Terminal B if you close it and reopen it? Do they disappear?

Task 3: Python Integration (confluent-kafka or kafka-python)

Goal: Programmatically stream data.

  • Requirement: Write a Python script to act as a Producer.
  • Challenge: 1. Create a loop that sends a new “Sale Event” every 2 seconds to the grocery-sales topic. 2. Give each event a random price and a timestamp.
  • Verification: Use the Console Consumer to watch the live stream of data appearing in real-time.
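A sketch of that producer loop using the kafka-python client (the topic name comes from Task 1; the event schema here is illustrative):

```python
import json
import random
import time
from datetime import datetime, timezone

def make_sale_event():
    # Illustrative event schema: a random price plus a UTC timestamp
    return {
        "item": random.choice(["Apple", "Bread", "Milk"]),
        "price": round(random.uniform(0.5, 20.0), 2),
        "ts": datetime.now(timezone.utc).isoformat(),
    }

def run_producer(bootstrap="localhost:9092", topic="grocery-sales"):
    # Needs a running broker and `pip install kafka-python`
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        event = make_sale_event()
        producer.send(topic, event)
        producer.flush()
        print("sent:", event)
        time.sleep(2)  # one event every 2 seconds
```

Run `run_producer()` in one terminal, then attach the console consumer to grocery-sales to watch the live stream.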

Task 4: Consumer Groups & Scalability

Goal: Understand how Kafka handles high volume.

  • Requirement: Start two instances of a Python Consumer script using the same group.id.
  • Challenge: Observe the output. Does each consumer see every message, or do they split the load?
  • Discussion: Why is the number of Partitions the “ceiling” for how many consumers you can have in a group?
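A minimal consumer for this experiment, again assuming the kafka-python package (the group id is arbitrary); the small helper makes the partition “ceiling” explicit:

```python
def max_active_consumers(num_partitions, num_consumers):
    # Each partition is assigned to exactly one consumer in a group,
    # so consumers beyond the partition count sit idle.
    return min(num_partitions, num_consumers)

def run_consumer(bootstrap="localhost:9092"):
    # Needs a running broker and `pip install kafka-python`.
    # Start this script twice: both instances share the same group.id,
    # so Kafka splits the topic's partitions between them.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "grocery-sales",
        bootstrap_servers=bootstrap,
        group_id="sales-readers",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
```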

🏗️ Mini Project: The “Real-Time Alert” Pipeline

Build a small streaming application that monitors for high-value transactions.

  1. Producer: A script that generates random sales data ($0 to $100).
  2. Stream Processor (Python): A consumer script that reads the stream.
  3. The Logic: If a sale is greater than $80, print a “🚨 ALERT: High Value Sale Detected!” message.
  4. The Sink: Save only the “Alert” transactions into a local file named alerts_log.txt.
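One way to sketch steps 2–4, assuming the producer from the earlier task is feeding the topic (threshold and file name come from the project spec; kafka-python is the assumed client):

```python
import json

ALERT_THRESHOLD = 80.0  # sales above this value trigger an alert

def process_event(raw_json, alert_path="alerts_log.txt"):
    # Returns True when the sale triggered an alert (and was logged)
    sale = json.loads(raw_json)
    if sale["price"] > ALERT_THRESHOLD:
        print("🚨 ALERT: High Value Sale Detected!", sale)
        with open(alert_path, "a") as f:
            f.write(json.dumps(sale) + "\n")
        return True
    return False

def run_alert_pipeline(bootstrap="localhost:9092"):
    # Needs a running broker and `pip install kafka-python`
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "grocery-sales",
        bootstrap_servers=bootstrap,
        group_id="alert-pipeline",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        process_event(msg.value.decode("utf-8"))
```

Keeping the filter logic in `process_event` (separate from the Kafka plumbing) also lets you unit-test the alert rule without a broker.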

Introduction to Snowflake

Snowflake is a cloud-native data warehouse known for its “near-zero” management. Its unique architecture separates storage from compute, allowing Data Engineers to scale resources up or down instantly without affecting data availability.

🔗 Learning Resources

🧪 Practice Tasks

  1. The Warehouse Control: Create a Virtual Warehouse (X-Small), set it to “Auto-Suspend” after 1 minute, and observe how it automatically resumes when a query arrives.
  2. The Data Loader: Create an Internal Stage, upload a local file using PUT, and use the COPY INTO command to load it into a structured table.
  3. The Time Traveler: Delete several records from a table, then use the AT(OFFSET => -60) clause to select the data as it existed 60 seconds ago and restore it.
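The Time Travel task can also be scripted with the snowflake-connector-python package; in this sketch the connection parameters are placeholders, and the helpers only assemble the Time Travel SQL from the task:

```python
def time_travel_sql(table, seconds_ago):
    # OFFSET is a negative number of seconds relative to the current time
    return f"SELECT * FROM {table} AT(OFFSET => -{int(seconds_ago)})"

def restore_sql(table, seconds_ago):
    # One common restore pattern: zero-copy clone the table as it existed in the past
    return (
        f"CREATE OR REPLACE TABLE {table}_restored CLONE {table} "
        f"AT(OFFSET => -{int(seconds_ago)})"
    )

def run(sql):
    # Needs `pip install snowflake-connector-python` and real credentials
    import snowflake.connector
    conn = snowflake.connector.connect(
        account="your_account",  # placeholder
        user="your_user",        # placeholder
        password="...",          # use a secrets manager in practice
        warehouse="COMPUTE_WH",
    )
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()
```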

Apache Airflow

Orchestration is the “glue” that holds data pipelines together. Apache Airflow allows you to programmatically author, schedule, and monitor workflows.

🔗 Learning Resources

🚀 Apache Airflow Practice Tasks: Data Orchestration

Airflow uses DAGs (Directed Acyclic Graphs) written in Python to schedule and monitor your data pipelines.


📝 Practice Tasks: Operators & Scheduling

Task 1: Your First DAG (The “Hello World”)

Goal: Understand the basic structure of a DAG.

  • Requirement: Create a Python file in your dags/ folder.
  • Challenge: 1. Define a DAG that runs every day at midnight (@daily). 2. Use the EmptyOperator for a ‘start’ and ‘end’ task. 3. Use the BashOperator to print “Hello Airflow” in the middle.
  • Verification: Open the Airflow UI (localhost:8080) and trigger the DAG manually.
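A sketch of that first DAG, using Airflow 2.4+ operator and parameter names. It is wrapped in a function so the file imports cleanly even without Airflow installed; in a real dags/ file the DAG object lives at module level so the scheduler can discover it:

```python
def build_hello_dag():
    # Minimal DAG: start -> hello -> end, scheduled daily (requires Airflow 2.4+)
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="hello_airflow",
        schedule="@daily",            # run every day at midnight
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        start = EmptyOperator(task_id="start")
        hello = BashOperator(task_id="hello", bash_command='echo "Hello Airflow"')
        end = EmptyOperator(task_id="end")
        start >> hello >> end         # dependency chain
    return dag

# In your dags/ folder, expose the DAG at module level:
# dag = build_hello_dag()
```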

Task 2: PythonOperator & XComs

Goal: Pass data between tasks.

  • Requirement: Airflow tasks are isolated, so we use XComs (Cross-Communications) to share small amounts of data.
  • Challenge: 1. Task A: A Python function that returns a random number. 2. Task B: A Python function that pulls that number and prints whether it is “Even” or “Odd”.
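The two callables might look like this. When a PythonOperator's function returns a value, Airflow pushes it to XCom automatically, and the downstream task pulls it by task_id (the task ids here are illustrative; the `number` parameter is a seam for running the logic standalone):

```python
import random

def push_number():
    # The return value is pushed to XCom automatically (key "return_value")
    return random.randint(0, 100)

def classify_number(ti=None, number=None):
    # In Airflow, `ti` (the task instance) is injected and used to pull the XCom
    if number is None:
        number = ti.xcom_pull(task_ids="push_number")
    label = "Even" if number % 2 == 0 else "Odd"
    print(f"{number} is {label}")
    return label
```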

Task 3: Task Dependencies & Branching

Goal: Control the flow of logic.

  • Requirement: Use the BranchPythonOperator.
  • Challenge: 1. Create a task that checks if today is a weekday. 2. If it is a weekday, run a “Work Task”. 3. If it is a weekend, run a “Maintenance Task”.
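The callable you hand to BranchPythonOperator returns the task_id to follow; a sketch (task ids are illustrative):

```python
from datetime import date

def choose_branch(run_date=None):
    # BranchPythonOperator runs whichever task_id this function returns
    d = run_date or date.today()
    return "work_task" if d.weekday() < 5 else "maintenance_task"  # Mon-Fri = 0-4
```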

Task 4: Catchup & Backfilling

Goal: Understand how Airflow handles historical data.

  • Requirement: Set a start_date to 3 days ago.
  • Challenge: 1. Set catchup=True and see how Airflow automatically triggers runs for the past 3 days. 2. Change catchup=False and see the difference.

Introduction to AWS (Amazon Web Services)

AWS is the world’s most comprehensive and broadly adopted cloud platform. For a Data Engineer, AWS provides the foundational storage (S3) and compute (EC2/Lambda) layers needed to build robust data pipelines.

🔗 Learning Resources

☁️ AWS Practice Tasks: Cloud Data Engineering

AWS is the most widely used cloud provider. In this module, you will build a “Data Lake” and a serverless ETL pipeline.


📝 Practice Tasks: S3, Lambda, and Glue

Task 1: The S3 Data Lake Setup

Goal: Master object storage and permissions.

  • Requirement: Log into the AWS Console (or use LocalStack for local practice).
  • Challenge: 1. Create an S3 Bucket named raw-grocery-data-unique-id. 2. Upload your grocery_sales.csv manually. 3. Create a folder structure inside the bucket: inbound/, processed/, and archive/.
  • Command: aws s3 cp grocery_sales.csv s3://raw-grocery-data-unique-id/inbound/

Task 2: Serverless ETL with AWS Lambda

Goal: Trigger code based on an event (File Upload).

  • Requirement: Create a Python Lambda function.
  • Challenge: 1. Set an S3 Trigger so the Lambda runs every time a .csv is dropped into the inbound/ folder. 2. Write Python code (using the boto3 library) to read the file’s metadata and print the file name to CloudWatch Logs.
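A minimal handler for that trigger. The event shape below is the standard S3 ObjectCreated notification, and reading the metadata needs no boto3 call at all (anything printed here lands in CloudWatch Logs):

```python
import urllib.parse

def lambda_handler(event, context):
    # Triggered by S3 ObjectCreated events; logs each uploaded file's key and size
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (e.g. spaces as "+"), so decode them
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"New file: s3://{bucket}/{key} ({size} bytes)")
    return {"processed": len(records)}
```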

Task 3: AWS Glue Crawler & Data Catalog

Goal: Automatically discover the schema of your data.

  • Requirement: Point an AWS Glue Crawler at your S3 bucket.
  • Challenge: 1. Run the Crawler to scan your CSV files. 2. Check the AWS Glue Data Catalog to see if a table was created automatically. 3. Use Amazon Athena to run a SQL query (SELECT * FROM ...) directly against the files sitting in S3.
  • Concept: This is the “Schema-on-Read” pattern.

Task 4: IAM Roles & Security

Goal: Understand the “Principle of Least Privilege.”

  • Requirement: Your Lambda function needs permission to read from S3.
  • Challenge: Create an IAM Role that only has AmazonS3ReadOnlyAccess. Attach it to your Lambda.
  • Verification: Try to make the Lambda delete the file. The delete should fail, confirming the role grants only the access the function actually needs.

🏗️ Mini Project: The “S3-to-Athena” Pipeline

Build a production-ready serverless analytics layer:

  1. Storage: Landing zone in S3 (/raw).
  2. Transformation: An AWS Glue Job (PySpark) that reads the raw CSV, converts it to Parquet format, and saves it to a /parquet folder.
  3. Cataloging: A Glue Crawler that updates the metadata for the Parquet files.
  4. Analysis: Use Amazon Athena to calculate the total monthly revenue. Compare the speed of querying the CSV vs. the Parquet version.

Introduction to Azure

Azure is Microsoft’s cloud computing platform and a leader in enterprise data solutions. For a Data Engineer, Azure provides a seamless ecosystem for storing, integrating, and managing massive data workloads using specialized tools like ADLS and Synapse.

🔗 Learning Resources

☁️ Azure Practice Tasks: Enterprise Data Engineering

Azure is the cloud of choice for many enterprise organizations. In this module, you will build a scalable data ingestion and transformation pipeline.


📝 Practice Tasks: Data Lake, Data Factory, and Synapse

Task 1: ADLS Gen2 Hierarchy

Goal: Master “Hierarchical Namespaces” for data organization.

  • Requirement: Create an Azure Storage Account (Standard V2).
  • Challenge: 1. Enable Hierarchical Namespace (this turns it into ADLS Gen2). 2. Create a “Container” named landing-zone. 3. Upload grocery_sales.csv using Azure Storage Explorer or the Portal.
  • Concept: Understand why a hierarchical (“true directory”) namespace makes operations like directory renames and per-directory permissions faster than a “flat” S3-style key namespace.

Task 2: Azure Data Factory (ADF) - The Copy Activity

Goal: Your first “No-Code” ETL pipeline.

  • Requirement: Provision an Azure Data Factory instance.
  • Challenge: 1. Create a Linked Service to your Storage Account. 2. Create a Pipeline with a Copy Data Activity. 3. Source: grocery_sales.csv. Sink: A new folder named /processed/.
  • Verification: Trigger the pipeline and check if the file moved successfully.

Task 3: Azure Key Vault Security

Goal: Secure your credentials (Secret Management).

  • Requirement: Create an Azure Key Vault.
  • Challenge: 1. Store your Storage Account Access Key as a Secret. 2. Link your Data Factory to the Key Vault. 3. Configure the Data Factory to pull the secret instead of hardcoding the password in the connection string.

🏗️ Mini Project: The “ADF-to-SQL” Pipeline

Build an end-to-end automated ingestion layer:

  1. Storage: Landing zone in ADLS Gen2.
  2. Orchestration: An Azure Data Factory pipeline triggered by a “Storage Event” (when a file is uploaded).
  3. Transformation: A Mapping Data Flow in ADF that:
    • Removes rows with NULL prices.
    • Converts item_name to lowercase.
  4. Loading: Sink the cleaned data into an Azure SQL Database table.
  5. Monitoring: Use the ADF Monitor tab to verify the run time and rows processed.
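The Mapping Data Flow itself is configured in the ADF UI, but its two transformations are equivalent to this plain-Python reference (column names taken from the project spec):

```python
def clean_rows(rows):
    # Mirror of the Data Flow: drop rows with a NULL price, lowercase item_name
    cleaned = []
    for row in rows:
        if row.get("price") is None:
            continue  # the Data Flow's NULL-price filter
        out = dict(row)  # copy so the input rows are left untouched
        out["item_name"] = out["item_name"].lower()
        cleaned.append(out)
    return cleaned
```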

Introduction to GCP (Google Cloud Platform)

GCP is a powerful and rapidly growing cloud platform by Google, offering scalable infrastructure and advanced data analytics capabilities. For a Data Engineer, GCP provides essential services like Cloud Storage for data lakes, BigQuery for analytics, and Dataflow for building scalable data pipelines.


🔗 Learning Resources


⚙️ Key GCP Services for Data Engineering

  • Cloud Storage → Data lake storage for raw and processed data
  • BigQuery → Serverless data warehouse for analytics
  • Dataflow → Stream & batch data processing (Apache Beam)
  • Pub/Sub → Real-time messaging and event ingestion
  • Dataproc → Managed Spark/Hadoop clusters
  • Cloud Composer → Workflow orchestration (Apache Airflow)

🧪 Practice Tasks

1. The Bucket Builder

Create a Cloud Storage bucket, upload a .csv file, and configure IAM permissions to allow read-only access.

2. The Cost Optimizer

Set up a lifecycle rule in Cloud Storage to move objects to Coldline or Archive storage after 30 days.

3. The Query Explorer

Load your .csv file into BigQuery and run SQL queries:

  • Count total rows
  • Filter records
  • Aggregate data (e.g., GROUP BY)
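The three queries might look like this for a hypothetical grocery table (the table id and the `price`/`category` columns are assumptions; run the SQL in the BigQuery console or via the bq CLI):

```python
TABLE = "my-project.grocery.sales"  # placeholder project.dataset.table

# Count total rows
count_rows = f"SELECT COUNT(*) AS n FROM `{TABLE}`"

# Filter records (assumed `price` column)
filter_records = f"SELECT * FROM `{TABLE}` WHERE price > 10"

# Aggregate data (assumed `category` column)
aggregate = f"""
SELECT category, SUM(price) AS total_revenue
FROM `{TABLE}`
GROUP BY category
"""
```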

4. The Access Controller

Create a service account with least privilege and use it to access BigQuery or Cloud Storage.


🚀 Mini Project: Build a Data Pipeline in GCP

Objective:

Build a simple batch data pipeline from Cloud Storage → BigQuery.

Steps:

  1. Upload a sample dataset (.csv) to Cloud Storage
  2. Create a BigQuery dataset and table
  3. Use the BigQuery UI, the bq CLI, or a Dataflow template to load data
  4. Transform data using SQL (e.g., cleaning, filtering)
  5. Schedule pipeline (optional) using Cloud Composer or scheduled queries
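Steps 2–3 can also be scripted with the google-cloud-bigquery client; the bucket, project, and dataset names below are placeholders:

```python
def table_id(project, dataset, table):
    # Fully-qualified BigQuery table id, e.g. "my-project.grocery.sales"
    return f"{project}.{dataset}.{table}"

def load_csv_from_gcs(gcs_uri, full_table_id):
    # Needs `pip install google-cloud-bigquery` and GCP credentials
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # infer the schema from the file
    )
    job = client.load_table_from_uri(gcs_uri, full_table_id, job_config=job_config)
    job.result()  # block until the load job finishes

# Example (placeholder names):
# load_csv_from_gcs("gs://my-bucket/grocery_sales.csv",
#                   table_id("my-project", "grocery", "sales"))
```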