Data Engineering Roadmap

A detailed, practical roadmap to become a job-ready Data Engineer — from foundations to production-grade pipelines.

Overview

Data engineering is the discipline of building reliable systems that ingest, store, transform, serve, and monitor data.

A strong Data Engineer knows how to:

  • collect data from APIs, applications, databases, and files
  • design clean and scalable data models
  • create reliable ETL/ELT pipelines
  • schedule and monitor workflows
  • handle batch and streaming workloads
  • work with cloud platforms and modern data tools
  • ensure data quality, governance, and security
  • support analytics, machine learning, and operational systems

This roadmap is designed to help you progress from beginner to production-ready practitioner with a clear sequence of skills, tools, and projects.


Who This Roadmap Is For

This roadmap is ideal for:

  • beginners entering data engineering
  • data analysts moving toward engineering
  • software engineers transitioning into data platforms
  • BI developers who want stronger backend data skills
  • students preparing for internships or entry-level roles
  • professionals who want a structured path instead of random tutorials

What a Data Engineer Does

A Data Engineer typically works on:

  • Data ingestion — pulling data from databases, APIs, SaaS tools, logs, and streams
  • Data transformation — cleaning, joining, standardizing, validating, and modeling data
  • Data storage — operating data lakes, warehouses, and lakehouses
  • Pipeline orchestration — scheduling, dependency management, retries, and alerting
  • Data reliability — testing, monitoring, lineage, observability, and SLAs
  • Infrastructure — cloud services, containers, CI/CD, permissions, and deployment workflows
  • Collaboration — partnering with analysts, analytics engineers, ML engineers, and product teams

Core Skills You Need

A solid data engineering foundation combines multiple skill areas:

1. Programming

Focus on:

  • Python
  • shell basics
  • scripting and automation
  • debugging and logging
  • packages and virtual environments
  • clean code practices

2. SQL

This is non-negotiable.

You should master:

  • filtering, grouping, joins
  • window functions
  • CTEs
  • subqueries
  • aggregations
  • date handling
  • query optimization basics
  • schema design

3. Data Modeling

Learn how to structure data so it is useful and scalable:

  • normalization and denormalization
  • fact and dimension tables
  • star and snowflake schemas
  • slowly changing dimensions
  • surrogate keys
  • business grain definition

4. Distributed Processing

For large-scale systems, learn:

  • partitioning
  • parallelism
  • shuffling
  • fault tolerance
  • cluster computing concepts

5. Cloud and Infrastructure

Modern data engineering usually runs in the cloud:

  • object storage
  • IAM / permissions
  • compute services
  • containers
  • infrastructure basics
  • deployment pipelines

6. Reliability and Operations

Production systems require:

  • data quality checks
  • alerting
  • retry logic
  • idempotency
  • monitoring
  • incident handling
  • cost awareness

Roadmap Stages

Stage 0: Mindset and Setup

Before tools, build the right habits.

Learn

  • what ETL vs ELT means
  • difference between OLTP and OLAP
  • basics of files, tables, APIs, logs, events, and streams
  • why data pipelines fail in production
  • importance of reproducibility and observability

Set Up Your Environment

Install and configure:

  • Python
  • VS Code or PyCharm
  • Git and GitHub
  • Docker
  • a local SQL database such as PostgreSQL
  • Jupyter Notebook (optional)

Output

By the end of this stage, you should be able to:

  • write and run scripts locally
  • manage code with Git
  • use Docker at a basic level
  • connect Python to a database

Stage 1: Programming Fundamentals

Primary Language: Python

Python is widely used in data engineering for scripting, orchestration, API integrations, and Spark jobs.

Topics to Learn

  • variables, data types, functions, classes
  • loops and conditionals
  • list/dict/set comprehensions
  • file handling
  • exception handling
  • modules and packages
  • virtual environments
  • logging
  • type hints
  • unit testing basics

Libraries to Know Early

  • requests
  • pandas
  • pathlib
  • json
  • datetime
  • sqlalchemy
  • psycopg / database drivers

Practice Tasks

  • pull data from a public API
  • save JSON responses to files
  • transform raw data into tabular format
  • load transformed data into PostgreSQL
  • add logging and error handling

Goal

Build small scripts that behave like simple pipelines.
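
The practice tasks above fit into one small extract/transform/load script. Here is a minimal, offline sketch: `extract()` returns a canned payload standing in for a real `requests.get(...).json()` call, and the record fields are made up for illustration.

```python
import json
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("mini_pipeline")

def extract() -> list[dict]:
    # Stand-in for a real API call so the sketch runs offline.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "bad"}]

def transform(records: list[dict]) -> list[dict]:
    clean = []
    for r in records:
        try:
            clean.append({"id": r["id"], "amount": float(r["amount"])})
        except (KeyError, ValueError):
            log.warning("skipping malformed record: %r", r)
    return clean

def load(records: list[dict], path: Path) -> None:
    path.write_text(json.dumps(records, indent=2))
    log.info("wrote %d records to %s", len(records), path)

if __name__ == "__main__":
    load(transform(extract()), Path("clean_records.json"))
```

Even at this size the script has the shape of a pipeline: separable stages, logging, and error handling that quarantines bad records instead of crashing.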


Stage 2: SQL and Data Modeling

SQL is the heart of analytics and warehouse work.

SQL Topics to Master

Fundamentals

  • SELECT
  • WHERE
  • GROUP BY
  • HAVING
  • ORDER BY
  • JOIN
  • UNION

Intermediate

  • window functions
  • CTEs
  • nested queries
  • case expressions
  • date/time functions
  • text manipulation
  • null handling

Advanced

  • execution plans
  • indexing concepts
  • query performance tuning
  • incremental logic
  • deduplication patterns
  • change data capture concepts

Data Modeling Topics

  • entities and relationships
  • business keys vs surrogate keys
  • dimensional modeling
  • medallion layering (bronze, silver, gold)
  • snapshots and historical tracking
  • schema evolution basics

Practice Tasks

  • design a mini e-commerce schema
  • write 25–50 SQL queries of increasing difficulty
  • create fact and dimension tables
  • build daily summary tables
  • solve window function problems

Goal

Be able to transform messy source data into analyst-ready models.
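
As a taste of the deduplication and window-function patterns listed above, here is a small sketch using SQLite (window functions require SQLite 3.25+, which ships with modern Python). Keeping only the latest copy of each key via `ROW_NUMBER()` is a very common warehouse pattern; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (order_id INT, customer TEXT, loaded_at TEXT);
INSERT INTO raw_orders VALUES
  (1, 'ana', '2024-01-01'),
  (1, 'ana', '2024-01-02'),   -- duplicate, re-ingested later
  (2, 'ben', '2024-01-01');
""")

# Keep only the most recently loaded copy of each order_id.
rows = conn.execute("""
SELECT order_id, customer, loaded_at FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id ORDER BY loaded_at DESC
         ) AS rn
  FROM raw_orders
)
WHERE rn = 1
ORDER BY order_id
""").fetchall()

print(rows)  # [(1, 'ana', '2024-01-02'), (2, 'ben', '2024-01-01')]
```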


Stage 3: Databases and Storage Systems

You need to understand both transactional and analytical systems.

Learn the Difference

OLTP Systems

Used for application transactions:

  • PostgreSQL
  • MySQL
  • SQL Server

OLAP / Analytical Systems

Used for reporting and analytics:

  • data warehouses
  • columnar storage systems
  • lakehouse engines

Concepts to Learn

  • row vs column storage
  • partitions
  • indexes
  • replication basics
  • transactions and ACID
  • file formats: CSV, JSON, Parquet, Avro
  • compression and serialization
  • schema-on-write vs schema-on-read

Practice Tasks

  • compare CSV vs Parquet size and speed
  • load large files into a relational database
  • partition sample data by date
  • benchmark simple queries

Goal

Understand how storage design affects performance, cost, and maintainability.


Stage 4: Data Warehousing and Analytics Engineering

This is where raw data becomes trusted business data.

Learn

  • warehouse architecture
  • layered data models
  • transformation workflows
  • testing and documentation
  • incremental transformations
  • semantic consistency across teams
  • dbt fundamentals
  • source freshness checks
  • tests for unique/not-null/relationships
  • modular SQL models
  • documentation generation
  • staging → intermediate → marts structure

Practice Tasks

  • build a dbt project on top of PostgreSQL or a cloud warehouse
  • model raw order data into marts
  • add tests and documentation
  • create a dashboard-ready schema

Goal

Be able to take raw ingested data and produce reliable business-ready datasets.


Stage 5: Batch Data Processing

Most organizations still rely heavily on batch workloads.

Learn

  • scheduled ingestion
  • chunked processing
  • retries
  • backfills
  • idempotent loads
  • full refresh vs incremental loads
  • partition-based processing
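
Idempotent loads deserve a concrete illustration: if each batch upserts keyed on its partition, re-running the batch (during a retry or backfill) overwrites instead of duplicating. A minimal sketch with SQLite's upsert syntax (`ON CONFLICT`, SQLite 3.24+); the table name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, revenue REAL)")

def load_partition(day: str, revenue: float) -> None:
    # Upsert keyed on the partition (the day), so a re-run replaces
    # the previous load instead of inserting a duplicate row.
    conn.execute(
        """
        INSERT INTO daily_sales (day, revenue) VALUES (?, ?)
        ON CONFLICT(day) DO UPDATE SET revenue = excluded.revenue
        """,
        (day, revenue),
    )

load_partition("2024-01-01", 100.0)
load_partition("2024-01-01", 120.0)  # retry with corrected data

print(conn.execute("SELECT * FROM daily_sales").fetchall())
# [('2024-01-01', 120.0)]
```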

Tools to Explore

  • Python-based ETL scripts
  • Spark / PySpark
  • SQL transformations in warehouses

Spark Topics

  • DataFrames
  • transformations vs actions
  • lazy execution
  • partitioning
  • joins and aggregations
  • reading/writing Parquet
  • Spark SQL
  • performance basics

Practice Tasks

  • process millions of rows with PySpark
  • clean and aggregate large event data
  • write partitioned Parquet outputs
  • compare local Python vs Spark for scale

Goal

Learn when simple scripts are enough and when distributed processing is necessary.


Stage 6: Workflow Orchestration

Pipelines are not just scripts. They need scheduling, dependency management, and monitoring.

Learn

  • DAG concepts
  • task dependencies
  • retries and backoff
  • scheduling intervals
  • parameterization
  • backfills
  • SLAs and alerts
  • secrets handling
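
Before reaching for Airflow, the core DAG ideas can be sketched in plain Python with the standard library's `graphlib` (Python 3.9+). This is not Airflow code, just a toy illustration of dependency ordering and a retry loop; the four task names mirror the practice task below and are otherwise invented.

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline: ingest -> transform -> validate -> publish,
# where validate also depends on a separate quality_rules task.
dag = {
    "transform": {"ingest"},
    "quality_rules": set(),
    "validate": {"transform", "quality_rules"},
    "publish": {"validate"},
}

def run_task(name: str, retries: int = 2) -> None:
    # A crude retry loop; real orchestrators add backoff and alerting.
    for attempt in range(retries + 1):
        try:
            print(f"running {name} (attempt {attempt + 1})")
            return  # a real task would do work here and may raise
        except Exception:
            if attempt == retries:
                raise

# static_order() yields tasks so that every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
for task in order:
    run_task(task)
```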

Tool to Prioritize

  • Apache Airflow

Practice Tasks

  • create DAGs for ingest → transform → validate → publish
  • run daily and hourly schedules
  • add retry policies and failure alerts
  • parameterize DAGs for different environments

Goal

Move from “script runner” thinking to “production workflow system” thinking.


Stage 7: Streaming and Event-Driven Data

Many systems need near-real-time data movement.

Learn

  • events, topics, partitions, offsets
  • producers and consumers
  • delivery guarantees
  • stateful vs stateless processing
  • watermarking and late-arriving data
  • event time vs processing time
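
The event-time vs processing-time distinction is easiest to see in code. This toy sketch buckets simulated click events into one-minute tumbling windows by their event timestamp, so an event that arrives late still lands in the window it belongs to; the field names are invented.

```python
from collections import defaultdict
from datetime import datetime

# Simulated click events; the third arrives out of order.
events = [
    {"ts": "2024-01-01T10:00:05", "user": "a"},
    {"ts": "2024-01-01T10:01:10", "user": "b"},
    {"ts": "2024-01-01T10:00:40", "user": "c"},  # late by event time
]

def window_start(ts: str) -> str:
    # Truncate to the minute: the start of a 60-second tumbling window.
    t = datetime.fromisoformat(ts)
    return t.replace(second=0, microsecond=0).isoformat()

counts: dict[str, int] = defaultdict(int)
for e in events:
    # Bucketing on event time (the ts field), not arrival order,
    # puts the late event into the 10:00 window where it belongs.
    counts[window_start(e["ts"])] += 1

print(dict(counts))
# {'2024-01-01T10:00:00': 2, '2024-01-01T10:01:00': 1}
```

Real stream processors add watermarks on top of this idea: a rule for how long to wait for late events before a window is considered closed.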

Tools to Explore

  • Apache Kafka
  • Kafka Connect concepts
  • Spark Structured Streaming
  • stream processing fundamentals

Practice Tasks

  • stream click events into Kafka
  • consume messages with Python
  • aggregate events over time windows
  • send enriched records to storage or a database

Goal

Understand real-time patterns even if your first job is primarily batch-oriented.


Stage 8: Cloud Data Engineering

Most real-world roles expect cloud familiarity.

Pick One Cloud First

Choose one deeply before sampling the others:

  • AWS
  • GCP
  • Azure

Universal Cloud Concepts

  • object storage
  • managed databases
  • serverless compute
  • data warehouse services
  • secrets and IAM
  • networking basics
  • logging and monitoring
  • cost optimization

Typical Services by Category

Storage

  • S3 / GCS / Azure Blob Storage

Compute

  • EC2 / Compute Engine / Virtual Machines
  • Lambda / Cloud Functions / Azure Functions
  • container services

Warehousing

  • Redshift
  • BigQuery
  • Snowflake
  • Azure Synapse / Fabric-oriented ecosystems

Orchestration and Integration

  • Airflow-managed offerings
  • cloud-native schedulers
  • connectors and ingestion services

Practice Tasks

  • store raw files in object storage
  • run transformations in cloud compute or warehouse SQL
  • create an end-to-end ELT pipeline in one cloud
  • track costs and permissions

Goal

Be able to build cloud-native pipelines instead of only local demos.


Stage 9: Data Quality, Governance, and Security

This is what separates toy projects from trustworthy systems.

Data Quality Topics

  • null checks
  • uniqueness checks
  • referential integrity
  • freshness checks
  • volume anomaly checks
  • distribution checks
  • schema drift detection
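
Several of these checks are simple to express directly. A minimal sketch of not-null and uniqueness checks over in-memory rows; in practice you would run these as SQL or dbt tests against the warehouse, and the column names here are invented.

```python
rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@x.com"},  # duplicate id
]

def check_not_null(rows, col):
    # Rows where the column is missing or null fail the check.
    return [r for r in rows if r.get(col) is None]

def check_unique(rows, col):
    # Any value seen more than once fails the check.
    seen, dupes = set(), []
    for r in rows:
        if r[col] in seen:
            dupes.append(r)
        seen.add(r[col])
    return dupes

failures = {
    "email_not_null": check_not_null(rows, "email"),
    "id_unique": check_unique(rows, "id"),
}
for name, bad in failures.items():
    if bad:
        print(f"CHECK FAILED: {name} -> {len(bad)} offending row(s)")
```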

Governance Topics

  • metadata
  • lineage
  • ownership
  • naming conventions
  • documentation
  • data classification
  • retention policies

Security Topics

  • principle of least privilege
  • secret management
  • encryption at rest and in transit
  • PII awareness
  • auditability

Practice Tasks

  • add validation steps to every pipeline
  • create data contracts for key tables
  • document lineage from raw to mart
  • separate dev/staging/prod access

Goal

Learn to build pipelines others can trust.


Stage 10: DevOps for Data Engineers

Data engineers need software engineering discipline.

Learn

  • Git workflows
  • branching strategy
  • pull requests
  • CI/CD basics
  • automated tests
  • Docker images
  • environment variables
  • IaC concepts (Terraform is a strong plus)
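
To make CI/CD concrete, here is a minimal GitHub Actions workflow sketch for a Python data project. It assumes a `requirements.txt`, `ruff` for linting, and `pytest` for tests; adapt the file names and tools to your own repo.

```yaml
# .github/workflows/ci.yml (hypothetical layout)
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check .   # lint
      - run: pytest         # unit tests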

Practice Tasks

  • dockerize a pipeline project
  • add unit tests and SQL tests
  • configure GitHub Actions for linting/tests
  • deploy the same project to dev and prod configs

Goal

Make your data projects reproducible, reviewable, and deployable.


Stage 11: Production Readiness

At this stage, focus less on tools and more on system design.

Learn

  • scalability trade-offs
  • fault tolerance
  • observability
  • cost vs performance
  • schema evolution
  • reprocessing and replay strategies
  • SLA / SLO thinking
  • stakeholder communication

Questions You Should Be Able to Answer

  • What happens if a pipeline fails halfway through?
  • How do you backfill historical data safely?
  • How do you detect bad upstream data?
  • How do you avoid duplicates?
  • How do you debug slow jobs?
  • When should you use Spark instead of SQL?
  • When is streaming worth the complexity?

Goal

Think like an engineer responsible for business-critical data systems.



Project Roadmap

Projects matter more than endless tutorials.

Project 1: API to PostgreSQL Pipeline

Goal: Build a small ETL pipeline.

Stack: Python, requests, PostgreSQL, Docker

Features:

  • extract data from a public API
  • transform nested JSON into clean tables
  • load into PostgreSQL
  • add logs and retry handling

Project 2: Analytics Modeling Project

Goal: Create business-ready warehouse models.

Stack: SQL, dbt, PostgreSQL or cloud warehouse

Features:

  • staging models
  • marts
  • tests
  • documentation
  • incremental models

Project 3: Orchestrated ELT Pipeline

Goal: Run a full multi-step pipeline automatically.

Stack: Airflow, dbt, Docker

Features:

  • ingest raw data
  • transform with dbt
  • run data quality checks
  • publish final tables
  • add alerts

Project 4: Batch Big Data Pipeline

Goal: Process large datasets efficiently.

Stack: PySpark, Parquet, object storage

Features:

  • read large event logs
  • clean and aggregate data
  • partition outputs by date
  • benchmark performance changes

Project 5: Streaming Pipeline

Goal: Demonstrate real-time thinking.

Stack: Kafka, Python consumers, Spark Structured Streaming or similar

Features:

  • ingest event stream
  • aggregate in windows
  • store results downstream
  • monitor lag/failures

Project 6: Cloud End-to-End Data Platform Demo

Goal: Show production-style architecture.

Stack: cloud storage + cloud warehouse + orchestration + IaC

Features:

  • raw, refined, curated layers
  • automated scheduling
  • cost-aware storage choices
  • permissions and environment separation

Beginner to Advanced Tool Stack

Below is a practical progression, not a mandatory list.

Foundation Tools

  • Python
  • SQL
  • PostgreSQL
  • Git/GitHub
  • Docker

Early Data Engineering Tools

  • pandas
  • SQLAlchemy
  • dbt
  • Airflow

Scale Tools

  • PySpark
  • Kafka
  • Parquet / Avro
  • object storage

Cloud / Platform Tools

  • AWS / GCP / Azure
  • warehouse platform
  • Terraform
  • CI/CD pipelines

Reliability Tools

  • data tests
  • alerting systems
  • logging / metrics / dashboards
  • lineage and catalog tools

Portfolio Checklist

Your GitHub portfolio should show more than notebooks.

Include

  • clear README files
  • architecture diagrams
  • setup instructions
  • sample datasets or generation scripts
  • tests
  • screenshots of DAGs or dashboards
  • documentation of trade-offs
  • failure handling notes
  • deployment steps

Recruiter-Friendly Project Questions Your Repo Should Answer

  • What problem does this solve?
  • What tools are used and why?
  • How is data ingested, transformed, and stored?
  • How is quality validated?
  • How is the workflow scheduled?
  • How would this scale in production?

Interview Preparation

Technical Topics

  • SQL joins, windows, deduplication
  • Python data structures and scripting
  • ETL vs ELT
  • dimensional modeling
  • Airflow concepts
  • Spark basics
  • partitioning and file formats
  • Kafka fundamentals
  • warehousing concepts
  • cloud basics
  • testing and observability

Scenario Questions

Prepare for questions like:

  • design a daily sales pipeline
  • handle duplicate events in ingestion
  • backfill 2 years of historical data
  • optimize a slow transformation job
  • model orders, customers, and products for analytics
  • move from batch to near-real-time reporting

Behavioral Topics

  • debugging production issues
  • communicating with analysts/stakeholders
  • prioritizing reliability vs speed
  • documenting assumptions and trade-offs

Common Mistakes to Avoid

  • learning too many tools at once
  • skipping SQL depth
  • building only notebooks instead of pipelines
  • ignoring testing and logging
  • using Spark before understanding basic data workflows
  • copying tutorials without changing anything
  • focusing on certificates over projects
  • neglecting cloud cost and permissions
  • avoiding documentation

Learn from Official Documentation

Practice Platforms

  • SQL practice websites
  • LeetCode SQL sections
  • Kaggle datasets
  • public APIs
  • cloud free tiers / sandbox projects

Books / Concepts to Study

  • data modeling
  • distributed systems basics
  • database internals basics
  • analytics engineering principles
  • software engineering best practices

How to Use This Roadmap

Use this roadmap in cycles:

  1. Learn a concept
  2. Apply it in a small exercise
  3. Use it in a project
  4. Document what you learned
  5. Refactor and improve

Do not wait until you “know enough” before building.

Build early. Break things. Fix them. Document them. Repeat.


Final Advice

The fastest way to become a Data Engineer is not to memorize every tool. It is to consistently build systems that move data from raw → reliable → usable.

A strong path looks like this:

  1. Master Python + SQL
  2. Learn databases + modeling
  3. Build ETL/ELT pipelines
  4. Add dbt + Airflow
  5. Learn Spark + cloud
  6. Understand Kafka + streaming
  7. Practice testing, monitoring, and production design
  8. Build a portfolio that proves you can do the work

If you stay consistent and project-focused, this roadmap can take you from beginner to job-ready with real confidence.


Intro to Python

Python is the de facto language for modern data engineering—valued for its readable syntax, rich ecosystem (Pandas, PySpark, Kafka clients), and strong integration with cloud and big‑data platforms; mastering its architecture, libraries, and operational patterns is essential for building reliable, scalable pipelines in production.

Introduction to Python for Data Engineers

  • Purpose: Python is used to ingest, transform, validate, orchestrate, and serve data at scale.
  • Why it matters for advanced engineers: Python balances developer productivity with extensibility into high‑performance systems (native extensions, distributed runtimes), making it suitable for both prototyping and production workloads.

Core concepts (non‑code)

  • Language model and runtime: Understand the interpreter model, Global Interpreter Lock (GIL) implications for concurrency, and when to prefer multi‑processing, async IO, or offloading to JVM/compiled libraries.
  • Data model & memory: Know how Python represents arrays, dataframes, and in‑memory objects, and the tradeoffs between in‑memory processing (Pandas) and out‑of‑core/distributed approaches (Dask, PySpark).
  • Ecosystem roles: Different libraries serve distinct layers—Pandas/NumPy for local transforms, PySpark for distributed compute, SQLAlchemy for DB access, Kafka clients for streaming ingestion.

Python Resources

This page contains curated resources to help you master Python, specifically tailored for the needs of a Data Engineer.


Official Python Tutorial

  • Source: docs.python.org
  • Why it’s here: The gold standard. If you want to understand how the language is designed and stay up-to-date with the latest features, always start here.

FreeCodeCamp: Python for Beginners

  • Source: YouTube
  • Why it’s here: Great for visual learners. This comprehensive video covers the fundamentals in a project-based way that is easy to follow.

Real Python

  • Source: realpython.com
  • Why it’s here: Excellent for deep dives. Once you know the basics, use this to master advanced topics like Decorators, Generators, and Context Managers—all of which are vital for writing clean data pipelines.

Python Practice Tasks

These tasks are designed to build the foundational logic required for Data Engineering pipelines.


🛠️ Data Engineering Prep Tasks

📝 Practice Tasks (Standard Library)

1. The CSV to JSON Converter

Goal: Master the csv and json modules.

  • Scenario: A legacy system exports data in CSV, but your modern API only accepts JSON.
  • Requirements:
    • Read a file users.csv without using external libraries.
    • Convert each row into a Dictionary.
    • Handle “dirty” data: if a numeric field contains text, default it to 0.
    • Output: A beautifully formatted users.json file.
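
One possible solution sketch, using only `csv` and `json` from the standard library. A `StringIO` stands in for `users.csv` so the snippet runs without a file on disk; swap it for `open("users.csv", newline="")` in practice, and note the column names here are invented.

```python
import csv
import io
import json

# Stand-in for users.csv.
raw = io.StringIO("name,age\nana,34\nben,not_a_number\n")

users = []
for row in csv.DictReader(raw):
    try:
        row["age"] = int(row["age"])
    except ValueError:
        row["age"] = 0  # "dirty" numeric field defaults to 0
    users.append(row)

# indent=2 gives the "beautifully formatted" output the task asks for;
# in practice you would write this to users.json.
output = json.dumps(users, indent=2)
print(output)
```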

2. The Memory-Efficient Log Scanner

Goal: Use Generators to handle “Big Data” on a small machine.

  • Scenario: You have a 5GB log file, but your computer only has 4GB of RAM.
  • Requirements:
    • Use a generator function (yield) to read the file line-by-line.
    • Count occurrences of keywords like ERROR, WARNING, and INFO.
    • Output: A summary dictionary: {'ERROR': 450, 'WARNING': 120, 'INFO': 2000}.
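
A sketch of the generator approach: iterating a file handle yields one line at a time, so memory use stays flat regardless of file size. A small `StringIO` stands in for the 5GB log here; pass `open("app.log")` in practice.

```python
import io

# Stand-in for the huge log file.
fake_log = io.StringIO("ERROR disk full\nINFO started\nWARNING slow\nERROR oom\n")

def read_lines(handle):
    # yield keeps only one line in memory at a time
    for line in handle:
        yield line

counts = {"ERROR": 0, "WARNING": 0, "INFO": 0}
for line in read_lines(fake_log):
    for level in counts:
        if level in line:
            counts[level] += 1

print(counts)  # {'ERROR': 2, 'WARNING': 1, 'INFO': 1}
```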

Control Flow: Basic to Advanced

Control flow is the “brain” of your script, determining which chunks of code run based on specific conditions.

🚦 Basic Control Flow

Learn how to use if, elif, and else to guide your data logic.

🔄 Advanced Iteration

In Data Engineering, we often deal with large datasets where performance matters.
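
A small sketch tying both ideas together: `if`/`elif`/`else` routes records by quality, a list comprehension replaces an append loop, and a generator expression filters lazily without building a second list. The field names are invented for illustration.

```python
# Basic branching: route a record by its data quality.
def classify(row: dict) -> str:
    if row.get("price") is None:
        return "quarantine"
    elif row["price"] < 0:
        return "reject"
    else:
        return "accept"

rows = [{"price": 10}, {"price": -5}, {"price": None}]

# List comprehension: one pass, no manual .append() loop.
labels = [classify(r) for r in rows]
print(labels)  # ['accept', 'reject', 'quarantine']

# Generator expression: counts matches without materializing a list.
accepted = sum(1 for r in rows if classify(r) == "accept")
```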


🧪 Quick Challenge

Write a Nested Loop that flattens a list of lists (a common task when dealing with nested JSON data).

  • Input: [[1, 2], [3, 4], [5, 6]]
  • Output: [1, 2, 3, 4, 5, 6]
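
One possible solution, first with the explicit nested loop the challenge asks for, then the equivalent comprehension one-liner.

```python
nested = [[1, 2], [3, 4], [5, 6]]

# Explicit nested loop:
flat = []
for inner in nested:
    for item in inner:
        flat.append(item)

# Idiomatic one-liner equivalent:
flat_comp = [item for inner in nested for item in inner]

print(flat)  # [1, 2, 3, 4, 5, 6]
assert flat == flat_comp
```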

Pandas for Data Engineering

Pandas is a fast, powerful, and flexible open-source data analysis and manipulation tool built on top of the Python programming language.

🔗 Learning Resources

📝 Practice Tasks (The Basics)

1. The Schema Validator

Goal: Ensure incoming data matches your requirements.

  • Scenario: You have a CSV of “Sales” data. You need to ensure no prices are negative and all dates are valid.
  • Requirements:
    • Load sales.csv.
    • Use df.info() to check data types.
    • Use boolean indexing to find rows where price < 0 and drop them.
    • Convert the transaction_date column to datetime64[ns].
    • Output: A report of how many rows were deleted during cleaning.

2. The Aggregator (Group-By)

Goal: Create a summary report from raw events.

  • Scenario: A CSV contains store_id, product_category, and revenue.
  • Requirements:
    • Group the data by store_id.
    • Calculate the Total Revenue and Average Revenue per store.
    • Sort the results so the highest-earning store is at the top.
    • Output: Save the summary to store_performance.csv.
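
A sketch of the group-by task, assuming pandas is installed. An in-memory `StringIO` stands in for the CSV so the snippet is self-contained; swap it for `pd.read_csv("sales.csv")` and add `summary.to_csv("store_performance.csv")` in practice.

```python
import io
import pandas as pd

# Stand-in for the raw CSV of sales events.
raw = io.StringIO(
    "store_id,product_category,revenue\n"
    "s1,produce,10.0\n"
    "s1,dairy,20.0\n"
    "s2,produce,5.0\n"
)
df = pd.read_csv(raw)

# Group by store, compute total and average revenue, highest earner first.
summary = (
    df.groupby("store_id")["revenue"]
      .agg(total_revenue="sum", mean_revenue="mean")
      .sort_values("total_revenue", ascending=False)
      .reset_index()
)
print(summary)
```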

Introduction to SQL

SQL is the language of data. For a Data Engineer, it is essential for performing transformations (the ‘T’ in ETL) and querying data warehouses.

🔗 Learning Resources

📊 SQL Practice Tasks: Retail Database

In this module, you will build a relational structure for a grocery store and perform data audits.

Task 1: Table Architecture (DDL)

Create a table named store_sales with the following requirements:

  • transaction_id: Integer, Primary Key.
  • item_name: String, cannot be null.
  • price: Float, must be greater than 0.
  • quantity: Integer, default 1.
  • category: String, default to ‘General’.
  • sale_date: Date.

Task 2: Data Ingestion (DML)

Insert 5 rows of data. Ensure at least one row has a NULL category to test your default constraint, and one row has a high quantity (over 10) for future filtering.

Task 3: The “Revenue” Report

Write a query to calculate:

  1. The Total Revenue (Price * Quantity) per category.
  2. The Average Price of items in the ‘Produce’ category.
  3. Sort the results so the highest-earning category is at the top.

Task 4: Conditional Flagging

Use a CASE statement to create a new column sale_type:

  • If quantity > 10, label as ‘Wholesale’.
  • Otherwise, label as ‘Retail’.
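
The tasks above can be tried locally without installing a database server. This sketch runs the DDL plus the Task 4 `CASE` query against an in-memory SQLite database (the schema includes a `quantity` column, which the later tasks rely on); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_sales (
  transaction_id INTEGER PRIMARY KEY,
  item_name TEXT NOT NULL,
  price REAL CHECK (price > 0),
  quantity INTEGER DEFAULT 1,
  category TEXT DEFAULT 'General',
  sale_date TEXT
);
INSERT INTO store_sales (transaction_id, item_name, price, quantity)
VALUES (1, 'Apples', 2.5, 12), (2, 'Milk', 1.2, 2);
""")

# Task 4: flag wholesale vs retail sales with a CASE expression.
rows = conn.execute("""
SELECT item_name,
       CASE WHEN quantity > 10 THEN 'Wholesale' ELSE 'Retail' END AS sale_type
FROM store_sales
""").fetchall()
print(rows)  # [('Apples', 'Wholesale'), ('Milk', 'Retail')]
```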

Introduction to dbt (data build tool)

dbt is the “T” (Transform) in ELT: a development framework that combines SQL with software engineering best practices (version control, testing, CI/CD) to build data transformations. It lets Data Engineers and Analysts write transformations in plain SQL while dbt handles the complex work of dependency management, documentation, and data testing.

Core Concepts

  • T in ELT: dbt does not move data; it transforms data already loaded into your warehouse (Snowflake, BigQuery, etc.).
  • SQL-based: If you know SELECT statements, you can use dbt.
  • Modular: It turns monolithic scripts into reusable models.

🔗 Learning Resources

💎 dbt (Data Build Tool) Practice Tasks

dbt sits on top of your Data Warehouse. It allows you to transform “Raw” data into “Clean” data using SQL and Jinja templating.


📝 Practice Tasks: Analytics Engineering

Task 1: Project Initialization & Profiles

Goal: Set up the dbt environment.

  • Requirement: Run dbt init grocery_project.
  • Challenge: Configure your profiles.yml to connect to a local SQLite database or a cloud warehouse (Snowflake/BigQuery).
  • Verification: Run dbt debug to ensure the connection is successful.

Task 2: Building Modular Models (The ref function)

Goal: Create a dependency between two models.

  • Requirement: 1. Create a “Staging” model stg_sales.sql that selects all data from your raw sales table. 2. Create a “Fact” model fct_daily_summary.sql that uses the {{ ref('stg_sales') }} function to calculate daily totals.
  • Discussion: Why is using ref() better than hardcoding the table name like FROM my_db.my_schema.stg_sales?

Task 3: Data Documentation & YAML

Goal: Document your data for the business team.

  • Requirement: Create a schema.yml file in your models/ folder.
  • Challenge: Add a description for the transaction_id and total_revenue columns.
  • Command: Run dbt docs generate followed by dbt docs serve. Can you see your documentation in the browser?

Task 4: Automated Testing (Data Quality)

Goal: Catch bad data before it hits the dashboard.

  • Requirement: In your schema.yml, add the following tests to the transaction_id column:
    • unique
    • not_null
  • Challenge: Add an accepted_values test to the category column to ensure only ‘Produce’, ‘Dairy’, and ‘Bakery’ are allowed.
  • Command: Run dbt test and check if your data passes.
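
For Tasks 3 and 4 together, a `schema.yml` along these lines would declare both the documentation and the tests; the model name `stg_sales` follows Task 2, and the descriptions are placeholders to fill in.

```yaml
# models/schema.yml (sketch)
version: 2

models:
  - name: stg_sales
    description: "Staging model over the raw sales table."
    columns:
      - name: transaction_id
        description: "Unique id of each sale."
        tests:
          - unique
          - not_null
      - name: category
        tests:
          - accepted_values:
              values: ['Produce', 'Dairy', 'Bakery']
```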

Introduction to PySpark

PySpark allows you to use the power of Apache Spark with Python, enabling large-scale data processing across clusters.

🔗 Learning Resources

⚙️ PySpark Practice Tasks: Big Data Processing

When data grows to Terabytes, we move from Pandas to Spark. Note: Spark uses Lazy Evaluation.

Task 1: Schema Enforcement

Instead of using inferSchema=True, manually define a StructType schema for the grocery data prepared in the SQL practice tasks.

Task 2: Distributed Transformations

Using the Spark DataFrame API:

  1. Filter rows where quantity is less than 1 (Data Quality check).
  2. Add a column taxed_price which is price * 1.15.
  3. Rename the column item_name to product_description.

Task 3: Wide Transformations (Shuffle)

Perform a groupBy on the category column and calculate the sum of quantity.

  • Discussion: Explain why groupBy is considered a “Wide Transformation” compared to filter.

Task 4: Optimization

Use the .cache() method on a DataFrame. In what scenario would a Data Engineer choose to cache a dataset?

Introduction to Databricks

Databricks is a unified analytics platform built on top of Apache Spark. It pioneered the “Lakehouse” architecture, which provides the performance of a data warehouse with the massive scale and flexibility of a data lake.

🔗 Learning Resources

🧱 Databricks Practice Tasks: The Lakehouse Challenge

Databricks is a unified data analytics platform. It combines the best of Data Warehouses (SQL/Performance) and Data Lakes (Storage/Flexibility) into a Lakehouse.


📝 Practice Tasks: Notebooks & Delta Lake

Task 1: Cluster & Notebook Setup

Goal: Understand the Compute layer.

  • Requirement: Spin up a “Personal Compute” or “Community Edition” cluster.
  • Challenge: Create a New Notebook.
    • Set the default language to Python.
    • Use a “Magic Command” (%sql) in the second cell to switch to SQL.
  • Verification: Run spark.version in Python and SELECT 1 in SQL.

Task 2: The Magic of Delta Lake (Time Travel)

Goal: Understand ACID transactions on a Data Lake.

  • Requirement: Load the grocery CSV into a Delta Table.
    df.write.format("delta").saveAsTable("sales_delta")
    
  • Challenge: 1. Update a price in the table: UPDATE sales_delta SET price = 10.0 WHERE id = 1. 2. Use Time Travel to see the previous version:
    SELECT * FROM sales_delta VERSION AS OF 0
    
  • Discussion: Why is “Time Travel” a lifesaver for Data Engineers during a pipeline failure?

Task 3: Multi-Language Transformation (Polyglot)

Goal: Use the best tool for the job within one notebook.

  • Requirement: 1. Use PySpark to read the raw data and filter out nulls. 2. Register the result as a Temporary View: df.createOrReplaceTempView("cleaned_data"). 3. Use SQL (%sql) to perform a GROUP BY on that temporary view.
  • Challenge: Why is this “Polyglot” approach better than forcing everything into one language?

Task 4: Databricks Utilities (dbutils)

Goal: Interact with the file system (DBFS).

  • Requirement: Use dbutils.fs.ls("/") to list the root directories.
  • Challenge: Create a new folder named /tmp/my_project/ and move a sample file into it using dbutils.fs.mv.
  • Command: display(dbutils.fs.ls("/tmp/my_project/"))

🏗️ Mini Project: The Medallion Architecture

Build a simplified “Medallion” pipeline in a single notebook:

  1. Bronze Layer (Raw): Load the raw CSV exactly as it is into a Delta table named bronze_sales.
  2. Silver Layer (Cleaned): Filter out negative prices, fix date formats, and deduplicate. Save as silver_sales.
  3. Gold Layer (Aggregated): Create a final table gold_category_revenue that sums up revenue for the business dashboard.
  4. Optimization: Run OPTIMIZE gold_category_revenue ZORDER BY (sale_date) to speed up queries.

Apache Kafka & Event Streaming

Kafka is a distributed event store and stream-processing platform. It allows you to handle trillions of events a day, making it the backbone of real-time Data Engineering.

🔗 Learning Resources

⚡ Apache Kafka Practice Tasks: Real-Time Streaming

Kafka is a distributed event streaming platform. Instead of processing “batches” of data, we process “events” as they happen.


📝 Practice Tasks: Topics, Producers, and Consumers

Task 1: Infrastructure Setup (CLI)

Goal: Understand the core components of a Kafka Cluster.

  • Requirement: Start a Zookeeper and Kafka broker (using Docker or local binaries).
  • Challenge: 1. Create a Topic named grocery-sales with 3 partitions and a replication factor of 1. 2. List all topics to verify it exists. 3. Describe the topic to see the partition distribution.
  • Command: kafka-topics.sh --create --topic grocery-sales --bootstrap-server localhost:9092 ...

Task 2: The Console Producer & Consumer

Goal: Send and receive your first messages manually.

  • Requirement: Open two terminal windows.
  • Challenge: 1. In Terminal A (Producer), type 3 messages representing sales: {"id": 1, "item": "Apple"}. 2. In Terminal B (Consumer), read the messages from the beginning (--from-beginning).
  • Discussion: What happens to the messages in Terminal B if you close it and reopen it? Do they disappear?

Task 3: Python Integration (confluent-kafka or kafka-python)

Goal: Programmatically stream data.

  • Requirement: Write a Python script to act as a Producer.
  • Challenge: 1. Create a loop that sends a new “Sale Event” every 2 seconds to the grocery-sales topic. 2. Give each event a random price and a timestamp.
  • Verification: Use the Console Consumer to watch the live stream of data appearing in real-time.
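A sketch of that producer loop using the kafka-python client (the topic name comes from Task 1; the event schema here is illustrative):

```python
import json
import random
import time
from datetime import datetime, timezone

def make_sale_event():
    # Illustrative event schema: a random price plus a UTC timestamp
    return {
        "item": random.choice(["Apple", "Bread", "Milk"]),
        "price": round(random.uniform(0.5, 20.0), 2),
        "ts": datetime.now(timezone.utc).isoformat(),
    }

def run_producer(bootstrap="localhost:9092", topic="grocery-sales"):
    # Needs a running broker and `pip install kafka-python`
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        event = make_sale_event()
        producer.send(topic, event)
        producer.flush()
        print("sent:", event)
        time.sleep(2)  # one event every 2 seconds
```

Run `run_producer()` in one terminal, then attach the console consumer to grocery-sales to watch the live stream.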

Task 4: Consumer Groups & Scalability

Goal: Understand how Kafka handles high volume.

  • Requirement: Start two instances of a Python Consumer script using the same group.id.
  • Challenge: Observe the output. Does each consumer see every message, or do they split the load?
  • Discussion: Why is the number of Partitions the “ceiling” for how many consumers you can have in a group?
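A minimal consumer for this experiment, again assuming the kafka-python package (the group id is arbitrary); the small helper makes the partition “ceiling” explicit:

```python
def max_active_consumers(num_partitions, num_consumers):
    # Each partition is assigned to exactly one consumer in a group,
    # so consumers beyond the partition count sit idle.
    return min(num_partitions, num_consumers)

def run_consumer(bootstrap="localhost:9092"):
    # Needs a running broker and `pip install kafka-python`.
    # Start this script twice: both instances share the same group.id,
    # so Kafka splits the topic's partitions between them.
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "grocery-sales",
        bootstrap_servers=bootstrap,
        group_id="sales-readers",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
```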

🏗️ Mini Project: The “Real-Time Alert” Pipeline

Build a small streaming application that monitors for high-value transactions.

  1. Producer: A script that generates random sales data ($0 to $100).
  2. Stream Processor (Python): A consumer script that reads the stream.
  3. The Logic: If a sale is greater than $80, print a “🚨 ALERT: High Value Sale Detected!” message.
  4. The Sink: Save only the “Alert” transactions into a local file named alerts_log.txt.
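One way to sketch steps 2–4, assuming the producer from the earlier task is feeding the topic (threshold and file name come from the project spec; kafka-python is the assumed client):

```python
import json

ALERT_THRESHOLD = 80.0  # sales above this value trigger an alert

def process_event(raw_json, alert_path="alerts_log.txt"):
    # Returns True when the sale triggered an alert (and was logged)
    sale = json.loads(raw_json)
    if sale["price"] > ALERT_THRESHOLD:
        print("🚨 ALERT: High Value Sale Detected!", sale)
        with open(alert_path, "a") as f:
            f.write(json.dumps(sale) + "\n")
        return True
    return False

def run_alert_pipeline(bootstrap="localhost:9092"):
    # Needs a running broker and `pip install kafka-python`
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "grocery-sales",
        bootstrap_servers=bootstrap,
        group_id="alert-pipeline",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        process_event(msg.value.decode("utf-8"))
```

Keeping the filter logic in `process_event` (separate from the Kafka plumbing) also lets you unit-test the alert rule without a broker.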

Introduction to Snowflake

Snowflake is a cloud-native data warehouse known for its “near-zero” management. Its unique architecture separates storage from compute, allowing Data Engineers to scale resources up or down instantly without affecting data availability.

🔗 Learning Resources

🧪 Practice Tasks

  1. The Warehouse Control: Create a Virtual Warehouse (X-Small), set it to “Auto-Suspend” after 1 minute, and observe how it automatically resumes when a query arrives.
  2. The Data Loader: Create an Internal Stage, upload a local file using PUT, and use the COPY INTO command to load it into a structured table.
  3. The Time Traveler: Delete several records from a table, then use the AT(OFFSET => -60) clause to select the data as it existed 60 seconds ago and restore it.
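The Time Travel task can also be scripted with the snowflake-connector-python package; in this sketch the connection parameters are placeholders, and the helpers only assemble the Time Travel SQL from the task:

```python
def time_travel_sql(table, seconds_ago):
    # OFFSET is a negative number of seconds relative to the current time
    return f"SELECT * FROM {table} AT(OFFSET => -{int(seconds_ago)})"

def restore_sql(table, seconds_ago):
    # One common restore pattern: zero-copy clone the table as it existed in the past
    return (
        f"CREATE OR REPLACE TABLE {table}_restored CLONE {table} "
        f"AT(OFFSET => -{int(seconds_ago)})"
    )

def run(sql):
    # Needs `pip install snowflake-connector-python` and real credentials
    import snowflake.connector
    conn = snowflake.connector.connect(
        account="your_account",  # placeholder
        user="your_user",        # placeholder
        password="...",          # use a secrets manager in practice
        warehouse="COMPUTE_WH",
    )
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()
        conn.close()
```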

Apache Airflow

Orchestration is the “glue” that holds data pipelines together. Apache Airflow allows you to programmatically author, schedule, and monitor workflows.

🔗 Learning Resources

🚀 Apache Airflow Practice Tasks: Data Orchestration

Airflow uses DAGs (Directed Acyclic Graphs) written in Python to schedule and monitor your data pipelines.


📝 Practice Tasks: Operators & Scheduling

Task 1: Your First DAG (The “Hello World”)

Goal: Understand the basic structure of a DAG.

  • Requirement: Create a Python file in your dags/ folder.
  • Challenge: 1. Define a DAG that runs every day at midnight (@daily). 2. Use the EmptyOperator for a ‘start’ and ‘end’ task. 3. Use the BashOperator to print “Hello Airflow” in the middle.
  • Verification: Open the Airflow UI (localhost:8080) and trigger the DAG manually.
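A sketch of that first DAG, using Airflow 2.4+ operator and parameter names. It is wrapped in a function so the file imports cleanly even without Airflow installed; in a real dags/ file the DAG object lives at module level so the scheduler can discover it:

```python
def build_hello_dag():
    # Minimal DAG: start -> hello -> end, scheduled daily (requires Airflow 2.4+)
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="hello_airflow",
        schedule="@daily",            # run every day at midnight
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        start = EmptyOperator(task_id="start")
        hello = BashOperator(task_id="hello", bash_command='echo "Hello Airflow"')
        end = EmptyOperator(task_id="end")
        start >> hello >> end         # dependency chain
    return dag

# In your dags/ folder, expose the DAG at module level:
# dag = build_hello_dag()
```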

Task 2: PythonOperator & XComs

Goal: Pass data between tasks.

  • Requirement: Airflow tasks are isolated, so we use XComs (Cross-Communications) to share small amounts of data.
  • Challenge: 1. Task A: A Python function that returns a random number. 2. Task B: A Python function that pulls that number and prints whether it is “Even” or “Odd”.
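The two callables might look like this. When a PythonOperator's function returns a value, Airflow pushes it to XCom automatically, and the downstream task pulls it by task_id (the task ids here are illustrative; the `number` parameter is a seam for running the logic standalone):

```python
import random

def push_number():
    # The return value is pushed to XCom automatically (key "return_value")
    return random.randint(0, 100)

def classify_number(ti=None, number=None):
    # In Airflow, `ti` (the task instance) is injected and used to pull the XCom
    if number is None:
        number = ti.xcom_pull(task_ids="push_number")
    label = "Even" if number % 2 == 0 else "Odd"
    print(f"{number} is {label}")
    return label
```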

Task 3: Task Dependencies & Branching

Goal: Control the flow of logic.

  • Requirement: Use the BranchPythonOperator.
  • Challenge: 1. Create a task that checks if today is a weekday. 2. If it is a weekday, run a “Work Task”. 3. If it is a weekend, run a “Maintenance Task”.
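The callable you hand to BranchPythonOperator returns the task_id to follow; a sketch (task ids are illustrative):

```python
from datetime import date

def choose_branch(run_date=None):
    # BranchPythonOperator runs whichever task_id this function returns
    d = run_date or date.today()
    return "work_task" if d.weekday() < 5 else "maintenance_task"  # Mon-Fri = 0-4
```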

Task 4: Catchup & Backfilling

Goal: Understand how Airflow handles historical data.

  • Requirement: Set a start_date to 3 days ago.
  • Challenge: 1. Set catchup=True and see how Airflow automatically triggers runs for the past 3 days. 2. Change catchup=False and see the difference.

Introduction to AWS (Amazon Web Services)

AWS is the world’s most comprehensive and broadly adopted cloud platform. For a Data Engineer, AWS provides the foundational storage (S3) and compute (EC2/Lambda) layers needed to build robust data pipelines.

🔗 Learning Resources

☁️ AWS Practice Tasks: Cloud Data Engineering

AWS is the most widely used cloud provider. In this module, you will build a “Data Lake” and a serverless ETL pipeline.


📝 Practice Tasks: S3, Lambda, and Glue

Task 1: The S3 Data Lake Setup

Goal: Master object storage and permissions.

  • Requirement: Log into the AWS Console (or use LocalStack for local practice).
  • Challenge: 1. Create an S3 Bucket named raw-grocery-data-unique-id. 2. Upload your grocery_sales.csv manually. 3. Create a folder structure inside the bucket: inbound/, processed/, and archive/.
  • Command: aws s3 cp grocery_sales.csv s3://raw-grocery-data-unique-id/inbound/

Task 2: Serverless ETL with AWS Lambda

Goal: Trigger code based on an event (File Upload).

  • Requirement: Create a Python Lambda function.
  • Challenge: 1. Set an S3 Trigger so the Lambda runs every time a .csv is dropped into the inbound/ folder. 2. Write Python code (using the boto3 library) to read the file’s metadata and print the file name to CloudWatch Logs.
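A minimal handler for that trigger. The event shape below is the standard S3 ObjectCreated notification, and reading the metadata needs no boto3 call at all (anything printed here lands in CloudWatch Logs):

```python
import urllib.parse

def lambda_handler(event, context):
    # Triggered by S3 ObjectCreated events; logs each uploaded file's key and size
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (e.g. spaces as "+"), so decode them
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"New file: s3://{bucket}/{key} ({size} bytes)")
    return {"processed": len(records)}
```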

Task 3: AWS Glue Crawler & Data Catalog

Goal: Automatically discover the schema of your data.

  • Requirement: Point an AWS Glue Crawler at your S3 bucket.
  • Challenge: 1. Run the Crawler to scan your CSV files. 2. Check the AWS Glue Data Catalog to see if a table was created automatically. 3. Use Amazon Athena to run a SQL query (SELECT * FROM ...) directly against the files sitting in S3.
  • Concept: This is the “Schema-on-Read” pattern.

Task 4: IAM Roles & Security

Goal: Understand the “Principle of Least Privilege.”

  • Requirement: Your Lambda function needs permission to read from S3.
  • Challenge: Create an IAM Role that only has AmazonS3ReadOnlyAccess. Attach it to your Lambda.
  • Verification: Try to make the Lambda delete the file. The delete should fail, confirming the role grants only the access the function actually needs.

🏗️ Mini Project: The “S3-to-Athena” Pipeline

Build a production-ready serverless analytics layer:

  1. Storage: Landing zone in S3 (/raw).
  2. Transformation: An AWS Glue Job (PySpark) that reads the raw CSV, converts it to Parquet format, and saves it to a /parquet folder.
  3. Cataloging: A Glue Crawler that updates the metadata for the Parquet files.
  4. Analysis: Use Amazon Athena to calculate the total monthly revenue. Compare the speed of querying the CSV vs. the Parquet version.

Introduction to Azure

Azure is Microsoft’s cloud computing platform and a leader in enterprise data solutions. For a Data Engineer, Azure provides a seamless ecosystem for storing, integrating, and managing massive data workloads using specialized tools like ADLS and Synapse.

🔗 Learning Resources

☁️ Azure Practice Tasks: Enterprise Data Engineering

Azure is the cloud of choice for many enterprise organizations. In this module, you will build a scalable data ingestion and transformation pipeline.


📝 Practice Tasks: Data Lake, Data Factory, and Synapse

Task 1: ADLS Gen2 Hierarchy

Goal: Master “Hierarchical Namespaces” for data organization.

  • Requirement: Create an Azure Storage Account (Standard V2).
  • Challenge: 1. Enable Hierarchical Namespace (this turns it into ADLS Gen2). 2. Create a “Container” named landing-zone. 3. Upload grocery_sales.csv using Azure Storage Explorer or the Portal.
  • Concept: Understand why a hierarchical (“true directory”) namespace makes operations like directory renames and per-directory permissions faster than a “flat” S3-style key namespace.

Task 2: Azure Data Factory (ADF) - The Copy Activity

Goal: Your first “No-Code” ETL pipeline.

  • Requirement: Provision an Azure Data Factory instance.
  • Challenge: 1. Create a Linked Service to your Storage Account. 2. Create a Pipeline with a Copy Data Activity. 3. Source: grocery_sales.csv. Sink: A new folder named /processed/.
  • Verification: Trigger the pipeline and check if the file moved successfully.

Task 3: Azure Key Vault Security

Goal: Secure your credentials (Secret Management).

  • Requirement: Create an Azure Key Vault.
  • Challenge: 1. Store your Storage Account Access Key as a Secret. 2. Link your Data Factory to the Key Vault. 3. Configure the Data Factory to pull the secret instead of hardcoding the password in the connection string.

🏗️ Mini Project: The “ADF-to-SQL” Pipeline

Build an end-to-end automated ingestion layer:

  1. Storage: Landing zone in ADLS Gen2.
  2. Orchestration: An Azure Data Factory pipeline triggered by a “Storage Event” (when a file is uploaded).
  3. Transformation: A Mapping Data Flow in ADF that:
    • Removes rows with NULL prices.
    • Converts item_name to lowercase.
  4. Loading: Sink the cleaned data into an Azure SQL Database table.
  5. Monitoring: Use the ADF Monitor tab to verify the run time and rows processed.
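The Mapping Data Flow itself is configured in the ADF UI, but its two transformations are equivalent to this plain-Python reference (column names taken from the project spec):

```python
def clean_rows(rows):
    # Mirror of the Data Flow: drop rows with a NULL price, lowercase item_name
    cleaned = []
    for row in rows:
        if row.get("price") is None:
            continue  # the Data Flow's NULL-price filter
        out = dict(row)  # copy so the input rows are left untouched
        out["item_name"] = out["item_name"].lower()
        cleaned.append(out)
    return cleaned
```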

Introduction to GCP (Google Cloud Platform)

GCP is a powerful and rapidly growing cloud platform by Google, offering scalable infrastructure and advanced data analytics capabilities. For a Data Engineer, GCP provides essential services like Cloud Storage for data lakes, BigQuery for analytics, and Dataflow for building scalable data pipelines.


🔗 Learning Resources


⚙️ Key GCP Services for Data Engineering

  • Cloud Storage → Data lake storage for raw and processed data
  • BigQuery → Serverless data warehouse for analytics
  • Dataflow → Stream & batch data processing (Apache Beam)
  • Pub/Sub → Real-time messaging and event ingestion
  • Dataproc → Managed Spark/Hadoop clusters
  • Cloud Composer → Workflow orchestration (Apache Airflow)

🧪 Practice Tasks

1. The Bucket Builder

Create a Cloud Storage bucket, upload a .csv file, and configure IAM permissions to allow read-only access.

2. The Cost Optimizer

Set up a lifecycle rule in Cloud Storage to move objects to Coldline or Archive storage after 30 days.

3. The Query Explorer

Load your .csv file into BigQuery and run SQL queries:

  • Count total rows
  • Filter records
  • Aggregate data (e.g., GROUP BY)
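The three queries might look like this for a hypothetical grocery table (the table id and the `price`/`category` columns are assumptions; run the SQL in the BigQuery console or via the bq CLI):

```python
TABLE = "my-project.grocery.sales"  # placeholder project.dataset.table

# Count total rows
count_rows = f"SELECT COUNT(*) AS n FROM `{TABLE}`"

# Filter records (assumed `price` column)
filter_records = f"SELECT * FROM `{TABLE}` WHERE price > 10"

# Aggregate data (assumed `category` column)
aggregate = f"""
SELECT category, SUM(price) AS total_revenue
FROM `{TABLE}`
GROUP BY category
"""
```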

4. The Access Controller

Create a service account with least privilege and use it to access BigQuery or Cloud Storage.


🚀 Mini Project: Build a Data Pipeline in GCP

Objective:

Build a simple batch data pipeline from Cloud Storage → BigQuery.

Steps:

  1. Upload a sample dataset (.csv) to Cloud Storage
  2. Create a BigQuery dataset and table
  3. Use the BigQuery UI, the bq CLI, or a Dataflow template to load data
  4. Transform data using SQL (e.g., cleaning, filtering)
  5. Schedule pipeline (optional) using Cloud Composer or scheduled queries
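Steps 2–3 can also be scripted with the google-cloud-bigquery client; the bucket, project, and dataset names below are placeholders:

```python
def table_id(project, dataset, table):
    # Fully-qualified BigQuery table id, e.g. "my-project.grocery.sales"
    return f"{project}.{dataset}.{table}"

def load_csv_from_gcs(gcs_uri, full_table_id):
    # Needs `pip install google-cloud-bigquery` and GCP credentials
    from google.cloud import bigquery
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # infer the schema from the file
    )
    job = client.load_table_from_uri(gcs_uri, full_table_id, job_config=job_config)
    job.result()  # block until the load job finishes

# Example (placeholder names):
# load_csv_from_gcs("gs://my-bucket/grocery_sales.csv",
#                   table_id("my-project", "grocery", "sales"))
```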