Data Engineering Roadmap
A detailed, practical roadmap to become a job-ready Data Engineer — from foundations to production-grade pipelines.
Table of Contents
- Overview
- Who This Roadmap Is For
- What a Data Engineer Does
- Core Skills You Need
- Roadmap Stages
- Stage 0: Mindset and Setup
- Stage 1: Programming Fundamentals
- Stage 2: SQL and Data Modeling
- Stage 3: Databases and Storage Systems
- Stage 4: Data Warehousing and Analytics Engineering
- Stage 5: Batch Data Processing
- Stage 6: Workflow Orchestration
- Stage 7: Streaming and Event-Driven Data
- Stage 8: Cloud Data Engineering
- Stage 9: Data Quality, Governance, and Security
- Stage 10: DevOps for Data Engineers
- Stage 11: Production Readiness
- Suggested Learning Timeline
- Project Roadmap
- Beginner to Advanced Tool Stack
- Daily / Weekly Study Plan
- Portfolio Checklist
- Interview Preparation
- Common Mistakes to Avoid
- Recommended Resources
- How to Use This Roadmap
- Final Advice
Overview
Data engineering is the discipline of building reliable systems that ingest, store, transform, serve, and monitor data.
A strong Data Engineer knows how to:
- collect data from APIs, applications, databases, and files
- design clean and scalable data models
- create reliable ETL/ELT pipelines
- schedule and monitor workflows
- handle batch and streaming workloads
- work with cloud platforms and modern data tools
- ensure data quality, governance, and security
- support analytics, machine learning, and operational systems
This roadmap is designed to help you progress from beginner to production-ready practitioner with a clear sequence of skills, tools, and projects.
Who This Roadmap Is For
This roadmap is ideal for:
- beginners entering data engineering
- data analysts moving toward engineering
- software engineers transitioning into data platforms
- BI developers who want stronger backend data skills
- students preparing for internships or entry-level roles
- professionals who want a structured path instead of random tutorials
What a Data Engineer Does
A Data Engineer typically works on:
- Data ingestion — pulling data from databases, APIs, SaaS tools, logs, and streams
- Data transformation — cleaning, joining, standardizing, validating, and modeling data
- Data storage — operating data lakes, warehouses, and lakehouses
- Pipeline orchestration — scheduling, dependency management, retries, and alerting
- Data reliability — testing, monitoring, lineage, observability, and SLAs
- Infrastructure — cloud services, containers, CI/CD, permissions, and deployment workflows
- Collaboration — partnering with analysts, analytics engineers, ML engineers, and product teams
Core Skills You Need
A solid data engineering foundation combines multiple skill areas:
1. Programming
Focus on:
- Python
- shell basics
- scripting and automation
- debugging and logging
- packages and virtual environments
- clean code practices
2. SQL
This is non-negotiable.
You should master:
- filtering, grouping, joins
- window functions
- CTEs
- subqueries
- aggregations
- date handling
- query optimization basics
- schema design
3. Data Modeling
Learn how to structure data so it is useful and scalable:
- normalization and denormalization
- fact and dimension tables
- star and snowflake schemas
- slowly changing dimensions
- surrogate keys
- business grain definition
4. Distributed Processing
For large-scale systems, learn:
- partitioning
- parallelism
- shuffling
- fault tolerance
- cluster computing concepts
5. Cloud and Infrastructure
Modern data engineering usually runs in the cloud:
- object storage
- IAM / permissions
- compute services
- containers
- infrastructure basics
- deployment pipelines
6. Reliability and Operations
Production systems require:
- data quality checks
- alerting
- retry logic
- idempotency
- monitoring
- incident handling
- cost awareness
Roadmap Stages
Stage 0: Mindset and Setup
Before tools, build the right habits.
Learn
- what ETL vs ELT means
- difference between OLTP and OLAP
- basics of files, tables, APIs, logs, events, and streams
- why data pipelines fail in production
- importance of reproducibility and observability
Set Up Your Environment
Install and configure:
- Python
- VS Code or PyCharm
- Git and GitHub
- Docker
- a local SQL database such as PostgreSQL
- Jupyter Notebook (optional)
Output
By the end of this stage, you should be able to:
- write and run scripts locally
- manage code with Git
- use Docker at a basic level
- connect Python to a database
Stage 1: Programming Fundamentals
Primary Language: Python
Python is widely used in data engineering for scripting, orchestration, API integrations, and Spark jobs.
Topics to Learn
- variables, data types, functions, classes
- loops and conditionals
- list/dict/set comprehensions
- file handling
- exception handling
- modules and packages
- virtual environments
- logging
- type hints
- unit testing basics
Libraries to Know Early
- requests
- pandas
- pathlib
- json
- datetime
- SQLAlchemy
- psycopg / other database drivers
Practice Tasks
- pull data from a public API
- save JSON responses to files
- transform raw data into tabular format
- load transformed data into PostgreSQL
- add logging and error handling
Goal
Build small scripts that behave like simple pipelines.
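The practice tasks above can be stitched into one tiny pipeline. Below is a hedged sketch using only the standard library: the nested payload stands in for an API response (in a real script it would come from `requests.get(...).json()`), and an in-memory SQLite database stands in for PostgreSQL.

```python
import json
import logging
import sqlite3

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("mini_pipeline")

# Stand-in for an API response body (invented structure, for illustration only).
raw_payload = '[{"id": 1, "user": {"name": "Ada"}}, {"id": 2, "user": {"name": "Linus"}}]'

def transform(payload):
    """Flatten nested JSON records into (id, name) rows."""
    rows = []
    for record in json.loads(payload):
        rows.append((record["id"], record["user"]["name"]))
    return rows

def load(rows):
    """Load rows into a table; SQLite in-memory stands in for PostgreSQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    loaded = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
    return loaded

try:
    rows = transform(raw_payload)
    count = load(rows)
    log.info("loaded %d rows", count)
except (KeyError, json.JSONDecodeError) as exc:
    log.error("pipeline failed: %s", exc)
```

The extract, transform, and load steps are separate functions on purpose: each can be tested and retried independently, which is the same structure an orchestrator will expect later.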
Stage 2: SQL and Data Modeling
SQL is the heart of analytics and warehouse work.
SQL Topics to Master
Fundamentals
`SELECT`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `JOIN`, `UNION`
Intermediate
- window functions
- CTEs
- nested queries
- case expressions
- date/time functions
- text manipulation
- null handling
Advanced
- execution plans
- indexing concepts
- query performance tuning
- incremental logic
- deduplication patterns
- change data capture concepts
Data Modeling Topics
- entities and relationships
- business keys vs surrogate keys
- dimensional modeling
- medallion layering (bronze, silver, gold)
- snapshots and historical tracking
- schema evolution basics
Practice Tasks
- design a mini e-commerce schema
- write 25–50 SQL queries of increasing difficulty
- create fact and dimension tables
- build daily summary tables
- solve window function problems
Goal
Be able to transform messy source data into analyst-ready models.
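One deduplication pattern worth knowing cold: keep the latest record per business key using `ROW_NUMBER()`. A minimal sketch, run here against SQLite (table and column names are invented); the same SQL works in most warehouses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INT, status TEXT, loaded_at TEXT);
    INSERT INTO raw_orders VALUES
        (1, 'pending', '2024-01-01'),
        (1, 'shipped', '2024-01-02'),
        (2, 'pending', '2024-01-01');
""")

# Keep only the most recent row per order_id.
dedup_sql = """
    SELECT order_id, status
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id
                   ORDER BY loaded_at DESC
               ) AS rn
        FROM raw_orders
    )
    WHERE rn = 1
    ORDER BY order_id
"""
rows = conn.execute(dedup_sql).fetchall()
print(rows)  # [(1, 'shipped'), (2, 'pending')]
```

Order 1 appears twice in the raw table; the window function ranks its versions by load time so only the newest survives.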
Stage 3: Databases and Storage Systems
You need to understand both transactional and analytical systems.
Learn the Difference
OLTP Systems
Used for application transactions:
- PostgreSQL
- MySQL
- SQL Server
OLAP / Analytical Systems
Used for reporting and analytics:
- data warehouses
- columnar storage systems
- lakehouse engines
Concepts to Learn
- row vs column storage
- partitions
- indexes
- replication basics
- transactions and ACID
- file formats: CSV, JSON, Parquet, Avro
- compression and serialization
- schema-on-write vs schema-on-read
Practice Tasks
- compare CSV vs Parquet size and speed
- load large files into a relational database
- partition sample data by date
- benchmark simple queries
Goal
Understand how storage design affects performance, cost, and maintainability.
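The "partition sample data by date" task can be tried with nothing but the standard library: one directory per day, using the Hive-style `date=YYYY-MM-DD` naming that Spark and most lake engines recognize. The event rows below are invented for illustration.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Invented sample events.
events = [
    {"event_date": "2024-03-01", "user": "a", "amount": "10"},
    {"event_date": "2024-03-01", "user": "b", "amount": "20"},
    {"event_date": "2024-03-02", "user": "a", "amount": "5"},
]

root = Path(tempfile.mkdtemp())

# Group rows by date, then write one file per partition directory.
by_date = defaultdict(list)
for event in events:
    by_date[event["event_date"]].append(event)

for day, rows in by_date.items():
    part_dir = root / f"date={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["event_date", "user", "amount"])
        writer.writeheader()
        writer.writerows(rows)

partitions = sorted(p.name for p in root.iterdir())
print(partitions)  # ['date=2024-03-01', 'date=2024-03-02']
```

A query for one day now only needs to read one directory instead of scanning everything, which is the whole point of partition pruning.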
Stage 4: Data Warehousing and Analytics Engineering
This is where raw data becomes trusted business data.
Learn
- warehouse architecture
- layered data models
- transformation workflows
- testing and documentation
- incremental transformations
- semantic consistency across teams
Recommended Focus Areas
- dbt fundamentals
- source freshness checks
- tests for unique/not-null/relationships
- modular SQL models
- documentation generation
- staging → intermediate → marts structure
Practice Tasks
- build a dbt project on top of PostgreSQL or a cloud warehouse
- model raw order data into marts
- add tests and documentation
- create a dashboard-ready schema
Goal
Be able to take raw ingested data and produce reliable business-ready datasets.
Stage 5: Batch Data Processing
Most organizations still rely heavily on batch workloads.
Learn
- scheduled ingestion
- chunked processing
- retries
- backfills
- idempotent loads
- full refresh vs incremental loads
- partition-based processing
Tools to Explore
- Python-based ETL scripts
- Spark / PySpark
- SQL transformations in warehouses
Spark Topics
- DataFrames
- transformations vs actions
- lazy execution
- partitioning
- joins and aggregations
- reading/writing Parquet
- Spark SQL
- performance basics
Practice Tasks
- process millions of rows with PySpark
- clean and aggregate large event data
- write partitioned Parquet outputs
- compare local Python vs Spark for scale
Goal
Learn when simple scripts are enough and when distributed processing is necessary.
Stage 6: Workflow Orchestration
Pipelines are not just scripts. They need scheduling, dependency management, and monitoring.
Learn
- DAG concepts
- task dependencies
- retries and backoff
- scheduling intervals
- parameterization
- backfills
- SLAs and alerts
- secrets handling
Tool to Prioritize
- Apache Airflow
Practice Tasks
- create DAGs for ingest → transform → validate → publish
- run daily and hourly schedules
- add retry policies and failure alerts
- parameterize DAGs for different environments
Goal
Move from “script runner” thinking to “production workflow system” thinking.
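Retries and backoff are worth prototyping before you touch an orchestrator, because Airflow's retry settings implement exactly this loop. A minimal sketch (the flaky task is simulated; delays are shortened for the example):

```python
import time

def run_with_retries(task, max_attempts=3, base_delay=0.01):
    """Retry a task with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_extract():
    # Fails twice, then succeeds, simulating a transient API outage.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return "payload"

result = run_with_retries(flaky_extract)
print(result, calls["n"])  # payload 3
```

Note the final attempt re-raises instead of swallowing the error: a pipeline that fails loudly is far easier to operate than one that fails silently.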
Stage 7: Streaming and Event-Driven Data
Many systems need near-real-time data movement.
Learn
- events, topics, partitions, offsets
- producers and consumers
- delivery guarantees
- stateful vs stateless processing
- watermarking and late-arriving data
- event time vs processing time
Tools to Explore
- Apache Kafka
- Kafka Connect concepts
- Spark Structured Streaming
- stream processing fundamentals
Practice Tasks
- stream click events into Kafka
- consume messages with Python
- aggregate events over time windows
- send enriched records to storage or a database
Goal
Understand real-time patterns even if your first job is primarily batch-oriented.
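The core streaming idea, windowed aggregation, can be sketched without Kafka or Spark: bucket events by event time into fixed (tumbling) windows. The event data below is invented; real engines add watermarks on top of exactly this logic to handle late arrivals.

```python
from collections import defaultdict

# (event_time_seconds, value) pairs, e.g. click counts. Invented sample data.
events = [(3, 1), (12, 1), (14, 1), (37, 1), (61, 1)]

WINDOW = 30  # 30-second tumbling windows

counts = defaultdict(int)
for ts, value in events:
    window_start = (ts // WINDOW) * WINDOW  # floor timestamp to window boundary
    counts[window_start] += value

print(dict(counts))  # {0: 3, 30: 1, 60: 1}
```

Each event lands in exactly one window (tumbling windows never overlap); a sliding window would assign an event to several overlapping buckets instead.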
Stage 8: Cloud Data Engineering
Most real-world roles expect cloud familiarity.
Pick One Cloud First
Choose one deeply before sampling the others:
- AWS
- GCP
- Azure
Universal Cloud Concepts
- object storage
- managed databases
- serverless compute
- data warehouse services
- secrets and IAM
- networking basics
- logging and monitoring
- cost optimization
Typical Services by Category
Storage
- S3 / GCS / Azure Blob Storage
Compute
- EC2 / Compute Engine / Virtual Machines
- Lambda / Cloud Functions / Azure Functions
- container services
Warehousing
- Redshift
- BigQuery
- Snowflake
- Azure Synapse / Fabric-oriented ecosystems
Orchestration and Integration
- Airflow-managed offerings
- cloud-native schedulers
- connectors and ingestion services
Practice Tasks
- store raw files in object storage
- run transformations in cloud compute or warehouse SQL
- create an end-to-end ELT pipeline in one cloud
- track costs and permissions
Goal
Be able to build cloud-native pipelines instead of only local demos.
Stage 9: Data Quality, Governance, and Security
This is what separates toy projects from trustworthy systems.
Data Quality Topics
- null checks
- uniqueness checks
- referential integrity
- freshness checks
- volume anomaly checks
- distribution checks
- schema drift detection
Governance Topics
- metadata
- lineage
- ownership
- naming conventions
- documentation
- data classification
- retention policies
Security Topics
- principle of least privilege
- secret management
- encryption at rest and in transit
- PII awareness
- auditability
Practice Tasks
- add validation steps to every pipeline
- create data contracts for key tables
- document lineage from raw to mart
- separate dev/staging/prod access
Goal
Learn to build pipelines others can trust.
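Several of the checks above fit in a handful of lines and can be bolted onto any pipeline. A hedged sketch over a list of dicts (the rules, column names, and rows are all invented for illustration):

```python
def check_batch(rows, key="id", not_null=("id", "email")):
    """Return a list of human-readable data quality failures (empty = clean)."""
    failures = []
    # Null checks on required columns
    for i, row in enumerate(rows):
        for col in not_null:
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
    # Uniqueness check on the business key
    keys = [row.get(key) for row in rows if row.get(key) is not None]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in {key}")
    return failures

good = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": "b@x.com"}]
bad = [{"id": 1, "email": None}, {"id": 1, "email": "b@x.com"}]

print(check_batch(good))  # []
print(check_batch(bad))   # ['row 0: email is null', 'duplicate values in id']
```

In a real pipeline a non-empty failure list would fail the task (or quarantine the batch) before the bad data reaches downstream tables.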
Stage 10: DevOps for Data Engineers
Data engineers need software engineering discipline.
Learn
- Git workflows
- branching strategy
- pull requests
- CI/CD basics
- automated tests
- Docker images
- environment variables
- IaC concepts (Terraform is a strong plus)
Practice Tasks
- dockerize a pipeline project
- add unit tests and SQL tests
- configure GitHub Actions for linting/tests
- deploy the same project to dev and prod configs
Goal
Make your data projects reproducible, reviewable, and deployable.
Stage 11: Production Readiness
At this stage, focus less on tools and more on system design.
Learn
- scalability trade-offs
- fault tolerance
- observability
- cost vs performance
- schema evolution
- reprocessing and replay strategies
- SLA / SLO thinking
- stakeholder communication
Questions You Should Be Able to Answer
- What happens if a pipeline fails halfway through?
- How do you backfill historical data safely?
- How do you detect bad upstream data?
- How do you avoid duplicates?
- How do you debug slow jobs?
- When should you use Spark instead of SQL?
- When is streaming worth the complexity?
Goal
Think like an engineer responsible for business-critical data systems.
Project Roadmap
Projects matter more than endless tutorials.
Project 1: API to PostgreSQL Pipeline
Goal: Build a small ETL pipeline.
Stack: Python, requests, PostgreSQL, Docker
Features:
- extract data from a public API
- transform nested JSON into clean tables
- load into PostgreSQL
- add logs and retry handling
Project 2: Analytics Modeling Project
Goal: Create business-ready warehouse models.
Stack: SQL, dbt, PostgreSQL or cloud warehouse
Features:
- staging models
- marts
- tests
- documentation
- incremental models
Project 3: Orchestrated ELT Pipeline
Goal: Run a full multi-step pipeline automatically.
Stack: Airflow, dbt, Docker
Features:
- ingest raw data
- transform with dbt
- run data quality checks
- publish final tables
- add alerts
Project 4: Batch Big Data Pipeline
Goal: Process large datasets efficiently.
Stack: PySpark, Parquet, object storage
Features:
- read large event logs
- clean and aggregate data
- partition outputs by date
- benchmark performance changes
Project 5: Streaming Pipeline
Goal: Demonstrate real-time thinking.
Stack: Kafka, Python consumers, Spark Structured Streaming or similar
Features:
- ingest event stream
- aggregate in windows
- store results downstream
- monitor lag/failures
Project 6: Cloud-End-to-End Data Platform Demo
Goal: Show production-style architecture.
Stack: cloud storage + cloud warehouse + orchestration + IaC
Features:
- raw, refined, curated layers
- automated scheduling
- cost-aware storage choices
- permissions and environment separation
Beginner to Advanced Tool Stack
Below is a practical progression, not a mandatory list.
Foundation Tools
- Python
- SQL
- PostgreSQL
- Git/GitHub
- Docker
Early Data Engineering Tools
- pandas
- SQLAlchemy
- dbt
- Airflow
Scale Tools
- PySpark
- Kafka
- Parquet / Avro
- object storage
Cloud / Platform Tools
- AWS / GCP / Azure
- warehouse platform
- Terraform
- CI/CD pipelines
Reliability Tools
- data tests
- alerting systems
- logging / metrics / dashboards
- lineage and catalog tools
Portfolio Checklist
Your GitHub portfolio should show more than notebooks.
Include
- clear README files
- architecture diagrams
- setup instructions
- sample datasets or generation scripts
- tests
- screenshots of DAGs or dashboards
- documentation of trade-offs
- failure handling notes
- deployment steps
Recruiter-Friendly Project Questions Your Repo Should Answer
- What problem does this solve?
- What tools are used and why?
- How is data ingested, transformed, and stored?
- How is quality validated?
- How is the workflow scheduled?
- How would this scale in production?
Interview Preparation
Technical Topics
- SQL joins, windows, deduplication
- Python data structures and scripting
- ETL vs ELT
- dimensional modeling
- Airflow concepts
- Spark basics
- partitioning and file formats
- Kafka fundamentals
- warehousing concepts
- cloud basics
- testing and observability
Scenario Questions
Prepare for questions like:
- design a daily sales pipeline
- handle duplicate events in ingestion
- backfill 2 years of historical data
- optimize a slow transformation job
- model orders, customers, and products for analytics
- move from batch to near-real-time reporting
Behavioral Topics
- debugging production issues
- communicating with analysts/stakeholders
- prioritizing reliability vs speed
- documenting assumptions and trade-offs
Common Mistakes to Avoid
- learning too many tools at once
- skipping SQL depth
- building only notebooks instead of pipelines
- ignoring testing and logging
- using Spark before understanding basic data workflows
- copying tutorials without changing anything
- focusing on certificates over projects
- neglecting cloud cost and permissions
- avoiding documentation
Recommended Resources
Learn by Official Documentation
- Python documentation
- PostgreSQL documentation
- Apache Airflow documentation
- dbt documentation
- Apache Spark documentation
- Apache Kafka documentation
- your cloud provider’s official data engineering guides
Practice Platforms
- SQL practice websites
- LeetCode SQL sections
- Kaggle datasets
- public APIs
- cloud free tiers / sandbox projects
Books / Concepts to Study
- data modeling
- distributed systems basics
- database internals basics
- analytics engineering principles
- software engineering best practices
How to Use This Roadmap
Use this roadmap in cycles:
- Learn a concept
- Apply it in a small exercise
- Use it in a project
- Document what you learned
- Refactor and improve
Do not wait until you “know enough” before building.
Build early. Break things. Fix them. Document them. Repeat.
Final Advice
The fastest way to become a Data Engineer is not to memorize every tool. It is to consistently build systems that move data from raw → reliable → usable.
A strong path looks like this:
- Master Python + SQL
- Learn databases + modeling
- Build ETL/ELT pipelines
- Add dbt + Airflow
- Learn Spark + cloud
- Understand Kafka + streaming
- Practice testing, monitoring, and production design
- Build a portfolio that proves you can do the work
If you stay consistent and project-focused, this roadmap can take you from beginner to job-ready with real confidence.
Intro to Python
Python is the de facto language for modern data engineering—valued for its readable syntax, rich ecosystem (Pandas, PySpark, Kafka clients), and strong integration with cloud and big‑data platforms; mastering its architecture, libraries, and operational patterns is essential for building reliable, scalable pipelines in production.
Introduction to Python for Data Engineers
- Purpose: Python is used to ingest, transform, validate, orchestrate, and serve data at scale.
- Why it matters for advanced engineers: Python balances developer productivity with extensibility into high‑performance systems (native extensions, distributed runtimes), making it suitable for both prototyping and production workloads.
Core concepts (non‑code)
- Language model and runtime: Understand the interpreter model, Global Interpreter Lock (GIL) implications for concurrency, and when to prefer multi‑processing, async IO, or offloading to JVM/compiled libraries.
- Data model & memory: Know how Python represents arrays, dataframes, and in‑memory objects, and the tradeoffs between in‑memory processing (Pandas) and out‑of‑core/distributed approaches (Dask, PySpark).
- Ecosystem roles: Different libraries serve distinct layers—Pandas/NumPy for local transforms, PySpark for distributed compute, SQLAlchemy for DB access, Kafka clients for streaming ingestion.
Python Resources
This page contains curated resources to help you master Python, specifically tailored for the needs of a Data Engineer.
🎓 Recommended Learning
Official Python Tutorial
- Source: docs.python.org
- Why it’s here: The gold standard. If you want to understand how the language is designed and stay up-to-date with the latest features, always start here.
FreeCodeCamp: Python for Beginners
- Source: YouTube
- Why it’s here: Great for visual learners. This comprehensive video covers the fundamentals in a project-based way that is easy to follow.
Real Python
- Source: realpython.com
- Why it’s here: Excellent for deep dives. Once you know the basics, use this to master advanced topics like Decorators, Generators, and Context Managers—all of which are vital for writing clean data pipelines.
🛠️ Official Documentation Quick-Links
For quick reference while coding, bookmark these:
Python Practice Tasks
These tasks are designed to build the foundational logic required for Data Engineering pipelines.
🛠️ Data Engineering Prep Tasks
📝 Practice Tasks (Standard Library)
1. The CSV to JSON Converter
Goal: Master the csv and json modules.
- Scenario: A legacy system exports data in CSV, but your modern API only accepts JSON.
- Requirements:
  - Read a file `users.csv` without using external libraries.
  - Convert each row into a Dictionary.
  - Handle “dirty” data: if a numeric field contains text, default it to `0`.
  - Output: A beautifully formatted `users.json` file.
2. The Memory-Efficient Log Scanner
Goal: Use Generators to handle “Big Data” on a small machine.
- Scenario: You have a 5GB log file, but your computer only has 4GB of RAM.
- Requirements:
  - Use a generator function (`yield`) to read the file line by line.
  - Count occurrences of keywords like `ERROR`, `WARNING`, and `INFO`.
  - Output: A summary dictionary: `{'ERROR': 450, 'WARNING': 120, 'INFO': 2000}`.
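One possible shape for this task, as a hedged sketch: a tiny temporary file stands in for the 5GB log, but the generator pattern is identical at any size because only one line is ever in memory.

```python
import tempfile

# Write a tiny stand-in for the huge log file (invented sample lines).
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write("ERROR disk full\nINFO started\nWARNING slow query\nINFO done\n")
    log_path = f.name

def read_lines(path):
    """Generator: yields one line at a time, never loading the whole file."""
    with open(path) as fh:
        for line in fh:
            yield line

counts = {"ERROR": 0, "WARNING": 0, "INFO": 0}
for line in read_lines(log_path):
    for keyword in counts:
        if keyword in line:
            counts[keyword] += 1

print(counts)  # {'ERROR': 1, 'WARNING': 1, 'INFO': 2}
```

Replacing `read_lines` with `open(path).readlines()` would produce the same counts but would try to pull the entire file into RAM, which is exactly what the scenario rules out.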
Control Flows: Basic to Advance
Control flow is the “brain” of your script, determining which chunks of code run based on specific conditions.
🚦 Basic Control Flow
Learn how to use `if`, `elif`, and `else` to guide your data logic.
- Resource: Python If…Else (W3Schools)
- Resource: Python Loops (Real Python)
🔄 Advanced Iteration
In Data Engineering, we often deal with large datasets where performance matters.
- List Comprehensions: A concise way to create lists.
- Error Handling: Using `try...except` to prevent pipelines from crashing when they encounter bad data.
🧪 Quick Challenge
Write a Nested Loop that flattens a list of lists (a common task when dealing with nested JSON data).
- Input: `[[1, 2], [3, 4], [5, 6]]`
- Output: `[1, 2, 3, 4, 5, 6]`
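For comparison, here is the explicit nested-loop solution alongside its list-comprehension equivalent:

```python
nested = [[1, 2], [3, 4], [5, 6]]

# Explicit nested loop
flat = []
for sublist in nested:
    for item in sublist:
        flat.append(item)

# The same logic as a nested list comprehension:
# the `for` clauses read left to right, outer loop first.
flat_comp = [item for sublist in nested for item in sublist]

print(flat)       # [1, 2, 3, 4, 5, 6]
print(flat_comp)  # [1, 2, 3, 4, 5, 6]
```

Both produce the same result; the comprehension is the more idiomatic form once the double `for` ordering feels natural.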
Pandas for Data Engineering
Pandas is a fast, powerful, and flexible open-source data analysis and manipulation tool built on top of the Python programming language.
🔗 Learning Resources
- Official Documentation: Pandas Documentation
- Pandas Getting Started: 10 Minutes to Pandas - The best “first stop” for beginners.
- Kaggle Course: Pandas Tutorial - Interactive and specifically focused on data manipulation.
- Visualization: Pandas Plotting Guide
📝 Practice Tasks (The Basics)
1. The Schema Validator
Goal: Ensure incoming data matches your requirements.
- Scenario: You have a CSV of “Sales” data. You need to ensure no prices are negative and all dates are valid.
- Requirements:
  - Load `sales.csv`.
  - Use `df.info()` to check data types.
  - Use boolean indexing to find rows where `price < 0` and drop them.
  - Convert the `transaction_date` column to `datetime64[ns]`.
  - Output: A report of how many rows were deleted during cleaning.
2. The Aggregator (Group-By)
Goal: Create a summary report from raw events.
- Scenario: A CSV contains `store_id`, `product_category`, and `revenue`.
- Requirements:
  - Group the data by `store_id`.
  - Calculate the Total Revenue and Average Revenue per store.
  - Sort the results so the highest-earning store is at the top.
  - Output: Save the summary to `store_performance.csv`.
Introduction to SQL
SQL is the language of data. For a Data Engineer, it is essential for performing transformations (the ‘T’ in ETL) and querying data warehouses.
🔗 Learning Resources
- Official Documentation: PostgreSQL Documentation
- Interactive Learning: SQLZoo - Great for hands-on practice in the browser.
- SQL Crash Course: Mode Analytics SQL Tutorial - Highly recommended for data professionals.
📊 SQL Practice Tasks: Retail Database
In this module, you will build a relational structure for a grocery store and perform data audits.
Task 1: Table Architecture (DDL)
Create a table named `store_sales` with the following requirements:
- `transaction_id`: Integer, Primary Key.
- `item_name`: String, cannot be null.
- `price`: Float, must be greater than 0.
- `quantity`: Integer (Tasks 2–4 below rely on this column).
- `category`: String, default to ‘General’.
- `sale_date`: Date.
Task 2: Data Ingestion (DML)
Insert 5 rows of data. Ensure at least one row has a NULL category to test your default constraint, and one row has a high quantity (over 10) for future filtering.
Task 3: The “Revenue” Report
Write a query to calculate:
- The Total Revenue (Price * Quantity) per category.
- The Average Price of items in the ‘Produce’ category.
- Sort the results so the highest-earning category is at the top.
Task 4: Conditional Flagging
Use a `CASE` statement to create a new column `sale_type`:
- If `quantity > 10`, label as ‘Wholesale’.
- Otherwise, label as ‘Retail’.
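The conditional-flagging idea looks like this in SQL, run here against SQLite for convenience (the two sample rows are invented stand-ins for your `store_sales` data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (item_name TEXT, quantity INT);
    INSERT INTO store_sales VALUES ('apples', 25), ('milk', 2);
""")

# CASE maps each row to a label based on its quantity.
rows = conn.execute("""
    SELECT item_name,
           CASE WHEN quantity > 10 THEN 'Wholesale'
                ELSE 'Retail'
           END AS sale_type
    FROM store_sales
    ORDER BY item_name
""").fetchall()

print(rows)  # [('apples', 'Wholesale'), ('milk', 'Retail')]
```

`CASE` is the SQL equivalent of if/else and is one of the most common tools for deriving categorical columns in transformations.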
Introduction to dbt (data build tool)
dbt is the “T” (Transform) in ELT: a development framework that combines SQL with software engineering best practices (version control, testing, CI/CD) to build data transformations. It allows Data Engineers and Analysts to write transformations in simple SQL while dbt handles the complex work of dependency management, documentation, and data testing.
Core Concepts
- T in ELT: dbt does not move data; it transforms data already loaded into your warehouse (Snowflake, BigQuery, etc.).
- SQL-based: If you know `SELECT` statements, you can use dbt.
- Modular: It turns monolithic scripts into reusable models.
Official Resources
🔗 Learning Resources
- dbt Learn: dbt Fundamentals - The gold standard course for learning the modern data stack.
- Best Practices: The dbt Viewpoint - Essential reading on why dbt works the way it does.
- Documentation: dbt Core & Cloud Docs
💎 dbt (Data Build Tool) Practice Tasks
dbt sits on top of your Data Warehouse. It allows you to transform “Raw” data into “Clean” data using SQL and Jinja templating.
📝 Practice Tasks: Analytics Engineering
Task 1: Project Initialization & Profiles
Goal: Set up the dbt environment.
- Requirement: Run `dbt init grocery_project`.
- Challenge: Configure your `profiles.yml` to connect to a local SQLite database or a cloud warehouse (Snowflake/BigQuery).
- Verification: Run `dbt debug` to ensure the connection is successful.
Task 2: Building Modular Models (The ref function)
Goal: Create a dependency between two models.
- Requirement:
  1. Create a “Staging” model `stg_sales.sql` that selects all data from your raw sales table.
  2. Create a “Fact” model `fct_daily_summary.sql` that uses the `{{ ref('stg_sales') }}` function to calculate daily totals.
- Discussion: Why is using `ref()` better than hardcoding the table name like `FROM my_db.my_schema.stg_sales`?
Task 3: Data Documentation & YAML
Goal: Document your data for the business team.
- Requirement: Create a `schema.yml` file in your `models/` folder.
- Challenge: Add a description for the `transaction_id` and `total_revenue` columns.
- Command: Run `dbt docs generate` followed by `dbt docs serve`. Can you see your documentation in the browser?
Task 4: Automated Testing (Data Quality)
Goal: Catch bad data before it hits the dashboard.
- Requirement: In your `schema.yml`, add the following tests to the `transaction_id` column: `unique` and `not_null`.
- Challenge: Add an `accepted_values` test to the `category` column to ensure only ‘Produce’, ‘Dairy’, and ‘Bakery’ are allowed.
- Command: Run `dbt test` and check if your data passes.
Introduction to PySpark
PySpark allows you to use the power of Apache Spark with Python, enabling large-scale data processing across clusters.
🔗 Learning Resources
- Official Documentation: Apache Spark Python API (PySpark)
- PySpark Quickstart: Spark Official Guide
- Free Course: SparkByExamples - An excellent blog-style reference for common PySpark functions.
⚙️ PySpark Practice Tasks: Big Data Processing
When data grows to Terabytes, we move from Pandas to Spark. Note: Spark uses Lazy Evaluation.
Task 1: Schema Enforcement
Instead of using `inferSchema=True`, manually define a `StructType` schema for the grocery data prepared in the SQL practice tasks.
Task 2: Distributed Transformations
Using the Spark DataFrame API:
- Filter rows where `quantity` is less than 1 (Data Quality check).
- Add a column `taxed_price` which is `price * 1.15`.
- Rename the column `item_name` to `product_description`.
Task 3: Wide Transformations (Shuffle)
Perform a `groupBy` on the `category` column and calculate the sum of `quantity`.
- Discussion: Explain why `groupBy` is considered a “Wide Transformation” compared to `filter`.
Task 4: Optimization
Use the `.cache()` method on a DataFrame. In what scenario would a Data Engineer choose to cache a dataset?
Introduction to Databricks
Databricks is a unified analytics platform built on top of Apache Spark. It pioneered the “Lakehouse” architecture, which provides the performance of a data warehouse with the massive scale and flexibility of a data lake.
🔗 Learning Resources
- Official Academy: Databricks Lakehouse Fundamentals - Free accreditation for students.
- Community Edition: Databricks Community Cloud - Access a free micro-cluster for hands-on practice.
- Delta Lake Docs: Delta Lake Guide - Understand how to bring ACID transactions to big data.
🧱 Databricks Practice Tasks: The Lakehouse Challenge
Databricks is a unified data analytics platform. It combines the best of Data Warehouses (SQL/Performance) and Data Lakes (Storage/Flexibility) into a Lakehouse.
📝 Practice Tasks: Notebooks & Delta Lake
Task 1: Cluster & Notebook Setup
Goal: Understand the Compute layer.
- Requirement: Spin up a “Personal Compute” or “Community Edition” cluster.
- Challenge: Create a New Notebook.
- Set the default language to Python.
- Use a “Magic Command” (`%sql`) in the second cell to switch to SQL.
- Verification: Run `spark.version` in Python and `SELECT 1` in SQL.
Task 2: The Magic of Delta Lake (Time Travel)
Goal: Understand ACID transactions on a Data Lake.
- Requirement: Load the grocery CSV into a Delta Table: `df.write.format("delta").saveAsTable("sales_delta")`.
- Challenge:
  1. Update a price in the table: `UPDATE sales_delta SET price = 10.0 WHERE id = 1`.
  2. Use Time Travel to see the previous version: `SELECT * FROM sales_delta VERSION AS OF 0`.
- Discussion: Why is “Time Travel” a lifesaver for Data Engineers during a pipeline failure?
Task 3: Multi-Language Transformation (Polyglot)
Goal: Use the best tool for the job within one notebook.
- Requirement:
  1. Use PySpark to read the raw data and filter out nulls.
  2. Register the result as a Temporary View: `df.createOrReplaceTempView("cleaned_data")`.
  3. Use SQL (`%sql`) to perform a `GROUP BY` on that temporary view.
- Challenge: Why is this “Polyglot” approach better than forcing everything into one language?
Task 4: Databricks Utilities (dbutils)
Goal: Interact with the file system (DBFS).
- Requirement: Use `dbutils.fs.ls("/")` to list the root directories.
- Challenge: Create a new folder named `/tmp/my_project/` and move a sample file into it using `dbutils.fs.mv`.
- Command: `display(dbutils.fs.ls("/tmp/my_project/"))`
🏗️ Mini Project: The Medallion Architecture
Build a simplified “Medallion” pipeline in a single notebook:
- Bronze Layer (Raw): Load the raw CSV exactly as it is into a Delta table named `bronze_sales`.
- Silver Layer (Cleaned): Filter out negative prices, fix date formats, and deduplicate. Save as `silver_sales`.
- Gold Layer (Aggregated): Create a final table `gold_category_revenue` that sums up revenue for the business dashboard.
- Optimization: Run `OPTIMIZE gold_category_revenue ZORDER BY (sale_date)` to speed up queries.
Apache Kafka & Event Streaming
Kafka is a distributed event store and stream-processing platform. It allows you to handle trillions of events a day, making it the backbone of real-time Data Engineering.
🔗 Learning Resources
- Official Documentation: Apache Kafka Docs
- Introduction to Kafka: Confluent’s Kafka 101 - The best conceptual starting point.
- Interactive Course: Learn Apache Kafka (YouTube)
- Visual Guide: Kafka Visualizer - Great for understanding how offsets and partitions work.
⚡ Apache Kafka Practice Tasks: Real-Time Streaming
Kafka is a distributed event streaming platform. Instead of processing “batches” of data, we process “events” as they happen.
📝 Practice Tasks: Topics, Producers, and Consumers
Task 1: Infrastructure Setup (CLI)
Goal: Understand the core components of a Kafka Cluster.
- Requirement: Start a Zookeeper and Kafka broker (using Docker or local binaries).
- Challenge: 1. Create a Topic named
grocery-saleswith 3 partitions and a replication factor of 1. 2. List all topics to verify it exists. 3. Describe the topic to see the partition distribution. - Command:
kafka-topics.sh --create --topic grocery-sales --bootstrap-server localhost:9092 ...
Task 2: The Console Producer & Consumer
Goal: Send and receive your first messages manually.
- Requirement: Open two terminal windows.
- Challenge:
  1. In Terminal A (Producer), type 3 messages representing sales, e.g. `{"id": 1, "item": "Apple"}`.
  2. In Terminal B (Consumer), read the messages from the beginning (`--from-beginning`).
- Discussion: What happens to the messages in Terminal B if you close it and reopen it? Do they disappear?
Task 3: Python Integration (confluent-kafka or kafka-python)
Goal: Programmatically stream data.
- Requirement: Write a Python script to act as a Producer.
- Challenge:
  - Create a loop that sends a new “Sale Event” every 2 seconds to the `grocery-sales` topic.
  - Each event should have a random `price` and a `timestamp`.
- Verification: Use the Console Consumer to watch the live stream of data appearing in real-time.
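A minimal sketch of the producer script, assuming the `confluent-kafka` client and a broker on `localhost:9092`. The event-building logic is pure Python; the actual produce loop is shown in comments because it needs a running cluster. The field names are invented for this exercise.

```python
import json
import random
import time

def build_sale_event(sale_id: int) -> dict:
    """Build one random 'Sale Event' with a price and a timestamp."""
    return {
        "id": sale_id,
        "item": random.choice(["Apple", "Bread", "Milk"]),
        "price": round(random.uniform(0.5, 20.0), 2),
        "timestamp": time.time(),
    }

# Producer loop (uncomment with a running broker and `pip install confluent-kafka`):
# from confluent_kafka import Producer
# producer = Producer({"bootstrap.servers": "localhost:9092"})
# for sale_id in range(10):
#     event = build_sale_event(sale_id)
#     producer.produce("grocery-sales", value=json.dumps(event).encode("utf-8"))
#     producer.flush()
#     time.sleep(2)

print(json.dumps(build_sale_event(1)))
```

Keeping event construction in its own function makes it easy to unit-test the payload shape without any Kafka infrastructure.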
Task 4: Consumer Groups & Scalability
Goal: Understand how Kafka handles high volume.
- Requirement: Start two instances of a Python Consumer script using the same `group.id`.
- Challenge: Observe the output. Does each consumer see every message, or do they split the load?
- Discussion: Why is the number of Partitions the “ceiling” for how many consumers you can have in a group?
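The partition ceiling is easy to see with a toy assignor. This is a simplified round-robin model (`assign_partitions` is a hypothetical helper; the real broker uses configurable assignment strategies), but it shows why a group can never have more active consumers than partitions.

```python
def assign_partitions(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions across the consumers of one group (toy model)."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: the load is split; no message is read twice within the group.
print(assign_partitions(3, ["c1", "c2"]))  # -> {'c1': [0, 2], 'c2': [1]}

# 3 partitions, 4 consumers: one consumer gets nothing -- partitions are the ceiling.
print(assign_partitions(3, ["c1", "c2", "c3", "c4"]))  # 'c4' is idle: []
```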
🏗️ Mini Project: The “Real-Time Alert” Pipeline
Build a small streaming application that monitors for high-value transactions.
- Producer: A script that generates random sales data ($0 to $100).
- Stream Processor (Python): A consumer script that reads the stream.
- The Logic: If a sale is greater than $80, print a “🚨 ALERT: High Value Sale Detected!” message.
- The Sink: Save only the “Alert” transactions into a local file named `alerts_log.txt`.
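The processor side of this project can be sketched without a broker: in the real pipeline the sale dicts would come from a Kafka consumer poll loop, but the threshold-and-sink logic is the same. The $80 threshold and field names follow the project spec above.

```python
import json

ALERT_THRESHOLD = 80.0

def process_sale(sale: dict, log_path: str = "alerts_log.txt") -> bool:
    """Return True (and append to the alert log) when a sale exceeds the threshold."""
    if sale["price"] > ALERT_THRESHOLD:
        print(f"🚨 ALERT: High Value Sale Detected! {sale}")
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(sale) + "\n")
        return True
    return False

# Simulated stream -- in the real pipeline these dicts arrive from the Kafka consumer.
for sale in [{"id": 1, "price": 12.5}, {"id": 2, "price": 95.0}]:
    process_sale(sale)
```

Only the second sale crosses the threshold, so `alerts_log.txt` ends up with exactly one line.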
Introduction to Snowflake
Snowflake is a cloud-native data warehouse known for its “near-zero” management. Its unique architecture separates storage from compute, allowing Data Engineers to scale resources up or down instantly without affecting data availability.
🔗 Learning Resources
- Snowflake University: Hands-on Essentials - Data Warehousing
- Quickstart Guide: Snowflake in 20 Minutes - The fastest way to understand the UI.
- SQL Reference: Snowflake SQL Commands - Comprehensive map of Snowflake-specific SQL.
🧪 Practice Tasks
- The Warehouse Control: Create a Virtual Warehouse (X-Small), set it to “Auto-Suspend” after 1 minute, and observe how it handles an incoming query.
- The Data Loader: Create an Internal Stage, upload a local file using
PUT, and use theCOPY INTOcommand to load it into a structured table. - The Time Traveler: Delete several records from a table, then use the
AT (OFFSET => -60)command to select the data as it existed 1 minute ago and restore it.
Apache Airflow
Orchestration is the “glue” that holds data pipelines together. Apache Airflow allows you to programmatically author, schedule, and monitor workflows.
🔗 Learning Resources
- Official Documentation: Apache Airflow Documentation
- Airflow Quickstart: Running Airflow Locally
- Astronomer Guides: Airflow Concepts - The best place to learn about DAGs, Operators, and Tasks.
🚀 Apache Airflow Practice Tasks: Data Orchestration
Airflow uses DAGs (Directed Acyclic Graphs) written in Python to schedule and monitor your data pipelines.
📝 Practice Tasks: Operators & Scheduling
Task 1: Your First DAG (The “Hello World”)
Goal: Understand the basic structure of a DAG.
- Requirement: Create a Python file in your `dags/` folder.
- Challenge:
  1. Define a DAG that runs every day at midnight (`@daily`).
  2. Use the `EmptyOperator` for a ‘start’ and ‘end’ task.
  3. Use the `BashOperator` to print “Hello Airflow” in the middle.
- Verification: Open the Airflow UI (localhost:8080) and trigger the DAG manually.
Task 2: PythonOperator & XComs
Goal: Pass data between tasks.
- Requirement: Airflow tasks are isolated, so we use XComs (Cross-Communications) to share small amounts of data.
- Challenge:
  1. Task A: A Python function that returns a random number.
  2. Task B: A Python function that pulls that number and prints whether it is “Even” or “Odd”.
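The two callables can be written and tested without Airflow installed. In a real DAG each would be wrapped in a `PythonOperator`; a callable’s return value is pushed to XCom automatically, and the downstream task pulls it via the task instance (`ti`). The `FakeTI` class below is an invented stand-in so the logic runs locally.

```python
import random

def push_number(**context) -> int:
    """Task A: the return value is pushed to XCom automatically by Airflow."""
    return random.randint(1, 100)

def classify_number(**context) -> str:
    """Task B: pull Task A's number from XCom and classify it."""
    number = context["ti"].xcom_pull(task_ids="push_number")
    label = "Even" if number % 2 == 0 else "Odd"
    print(f"{number} is {label}")
    return label

# Minimal fake TaskInstance so the logic can be exercised outside Airflow:
class FakeTI:
    def __init__(self, value):
        self.value = value
    def xcom_pull(self, task_ids=None):
        return self.value

print(classify_number(ti=FakeTI(7)))  # -> Odd
```

Note that XComs are meant for small metadata, not datasets: pass file paths or table names, not the data itself.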
Task 3: Task Dependencies & Branching
Goal: Control the flow of logic.
- Requirement: Use the `BranchPythonOperator`.
- Challenge:
  1. Create a task that checks if today is a weekday.
  2. If it is a weekday, run a “Work Task”.
  3. If it is a weekend, run a “Maintenance Task”.
Task 4: Catchup & Backfilling
Goal: Understand how Airflow handles historical data.
- Requirement: Set a `start_date` 3 days in the past.
- Challenge:
  1. Set `catchup=True` and watch Airflow automatically trigger runs for the past 3 days.
  2. Change to `catchup=False` and observe the difference.
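A rough model of what catchup does, in plain Python. This is a simplification: real Airflow schedules a run at the *end* of each data interval and the exact set depends on the schedule and timezone, but the toy below captures the difference between the two settings.

```python
from datetime import date, timedelta

def catchup_runs(start_date: date, today: date, catchup: bool) -> list[date]:
    """Toy model: daily logical dates a scheduler would create up to (not including) today."""
    if not catchup:
        # catchup=False: only the most recent interval is scheduled.
        return [today - timedelta(days=1)]
    days = (today - start_date).days
    return [start_date + timedelta(days=i) for i in range(days)]

today = date(2024, 1, 10)
start = today - timedelta(days=3)
print(catchup_runs(start, today, catchup=True))   # three backfilled logical dates
print(catchup_runs(start, today, catchup=False))  # only the latest one
```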
Introduction to AWS (Amazon Web Services)
AWS is the world’s most comprehensive and broadly adopted cloud platform. For a Data Engineer, AWS provides the foundational storage (S3) and compute (EC2/Lambda) layers needed to build robust data pipelines.
🔗 Learning Resources
- Official Training: AWS Cloud Practitioner Essentials
- Storage Deep-Dive: Amazon S3 User Guide
- Interactive Workshop: AWS Data Engineering Immersion Day
☁️ AWS Practice Tasks: Cloud Data Engineering
AWS is the most widely used cloud provider. In this module, you will build a “Data Lake” and a serverless ETL pipeline.
📝 Practice Tasks: S3, Lambda, and Glue
Task 1: The S3 Data Lake Setup
Goal: Master object storage and permissions.
- Requirement: Log into the AWS Console (or use LocalStack for local practice).
- Challenge: 1. Create an S3 Bucket named
raw-grocery-data-unique-id. 2. Upload yourgrocery_sales.csvmanually. 3. Create a folder structure inside the bucket:inbound/,processed/, andarchive/. - Command:
aws s3 cp grocery_sales.csv s3://raw-grocery-data-unique-id/inbound/
Task 2: Serverless ETL with AWS Lambda
Goal: Trigger code based on an event (File Upload).
- Requirement: Create a Python Lambda function.
- Challenge:
  1. Set an S3 Trigger so the Lambda runs every time a `.csv` file is dropped into the `inbound/` folder.
  2. Write Python code (using the `boto3` library) to read the file’s metadata and print the file name to CloudWatch Logs.
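A sketch of such a handler. The `sample_event` below is a trimmed version of the real S3 notification payload (the actual event carries many more fields); S3 URL-encodes object keys, hence the `unquote_plus`. `print` output in Lambda lands in CloudWatch Logs.

```python
import urllib.parse

def lambda_handler(event: dict, context=None) -> dict:
    """Log bucket and key for every S3 record in the trigger event."""
    names = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes keys in the event payload (e.g. spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"New file landed: s3://{bucket}/{key}")
        names.append(key)
    return {"statusCode": 200, "files": names}

# Trimmed shape of an S3 put-event payload, for local testing:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-grocery-data-unique-id"},
                "object": {"key": "inbound/grocery_sales.csv"}}}
    ]
}
print(lambda_handler(sample_event))
```

Reading the object’s contents would add a `boto3` `get_object` call; the metadata alone needs nothing beyond the event itself.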
Task 3: AWS Glue Crawler & Data Catalog
Goal: Automatically discover the schema of your data.
- Requirement: Point an AWS Glue Crawler at your S3 bucket.
- Challenge:
  1. Run the Crawler to scan your CSV files.
  2. Check the AWS Glue Data Catalog to see if a table was created automatically.
  3. Use Amazon Athena to run a SQL query (`SELECT * FROM ...`) directly against the files sitting in S3.
- Concept: This is the “Schema-on-Read” pattern.
Task 4: IAM Roles & Security
Goal: Understand the “Principle of Least Privilege.”
- Requirement: Your Lambda function needs permission to read from S3.
- Challenge: Create an IAM Role that only has the `AmazonS3ReadOnlyAccess` policy and attach it to your Lambda.
- Verification: Try to make the Lambda delete the file. It should fail, confirming the role cannot write or delete and that the function follows least privilege.
🏗️ Mini Project: The “S3-to-Athena” Pipeline
Build a production-ready serverless analytics layer:
- Storage: Landing zone in S3 (`/raw`).
- Transformation: An AWS Glue Job (PySpark) that reads the raw CSV, converts it to Parquet format, and saves it to a `/parquet` folder.
- Cataloging: A Glue Crawler that updates the metadata for the Parquet files.
- Analysis: Use Amazon Athena to calculate the total monthly revenue. Compare the speed of querying the CSV vs. the Parquet version.
Introduction to Azure
Azure is Microsoft’s cloud computing platform and a leader in enterprise data solutions. For a Data Engineer, Azure provides a seamless ecosystem for storing, integrating, and managing massive data workloads using specialized tools like ADLS and Synapse.
🔗 Learning Resources
- Official Training: Azure Data Fundamentals (DP-900) - The best starting point for the Microsoft ecosystem.
- Storage Guide: Introduction to Azure Data Lake Storage (ADLS) Gen2
- Architecture Center: Azure Data Engineering Architecture
☁️ Azure Practice Tasks: Enterprise Data Engineering
Azure is the cloud of choice for many enterprise organizations. In this module, you will build a scalable data ingestion and transformation pipeline.
📝 Practice Tasks: Data Lake, Data Factory, and Synapse
Task 1: ADLS Gen2 Hierarchy
Goal: Master “Hierarchical Namespaces” for data organization.
- Requirement: Create an Azure Storage Account (Standard V2).
- Challenge:
  1. Enable Hierarchical Namespace (this turns it into ADLS Gen2).
  2. Create a container named `landing-zone`.
  3. Upload `grocery_sales.csv` using Azure Storage Explorer or the Portal.
- Concept: Understand why a true directory structure is faster for Big Data workloads than a flat, S3-style bucket.
Task 2: Azure Data Factory (ADF) - The Copy Activity
Goal: Your first “No-Code” ETL pipeline.
- Requirement: Provision an Azure Data Factory instance.
- Challenge:
  1. Create a Linked Service to your Storage Account.
  2. Create a Pipeline with a Copy Data Activity.
  3. Source: `grocery_sales.csv`. Sink: a new folder named `/processed/`.
- Verification: Trigger the pipeline and check if the file moved successfully.
Task 3: Azure Key Vault Security
Goal: Secure your credentials (Secret Management).
- Requirement: Create an Azure Key Vault.
- Challenge:
  1. Store your Storage Account Access Key as a Secret.
  2. Link your Data Factory to the Key Vault.
  3. Configure the Data Factory to pull the secret instead of hardcoding the password in the connection string.
🏗️ Mini Project: The “ADF-to-SQL” Pipeline
Build an end-to-end automated ingestion layer:
- Storage: Landing zone in ADLS Gen2.
- Orchestration: An Azure Data Factory pipeline triggered by a “Storage Event” (when a file is uploaded).
- Transformation: A Mapping Data Flow in ADF that:
  - removes rows with `NULL` prices
  - converts `item_name` to lowercase
- Loading: Sink the cleaned data into an Azure SQL Database table.
- Monitoring: Use the ADF Monitor tab to verify the run time and rows processed.
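Before wiring up the Mapping Data Flow, it helps to make the transformation rules concrete. The sketch below is a plain-Python stand-in (ADF expresses the same rules visually as Filter and Derived Column transformations); the sample rows are invented.

```python
def transform(rows: list[dict]) -> list[dict]:
    """Stand-in for the Mapping Data Flow: drop NULL prices, lowercase item_name."""
    return [
        {**row, "item_name": row["item_name"].lower()}
        for row in rows
        if row["price"] is not None
    ]

raw = [
    {"item_name": "Apple", "price": 1.2},
    {"item_name": "Bread", "price": None},   # dropped: NULL price
    {"item_name": "MILK", "price": 0.9},
]
print(transform(raw))  # two rows survive, item names lowercased
```

Writing the rules as a tiny function first gives you test cases to validate the ADF run against (row counts in the Monitor tab should match).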
Introduction to GCP (Google Cloud Platform)
GCP is a powerful and rapidly growing cloud platform by Google, offering scalable infrastructure and advanced data analytics capabilities. For a Data Engineer, GCP provides essential services like Cloud Storage for data lakes, BigQuery for analytics, and Dataflow for building scalable data pipelines.
🔗 Learning Resources
- Official Training: Google Cloud Digital Leader Training
- Storage Deep-Dive: Cloud Storage Documentation
- Data Engineering Path: Google Cloud Skills Boost - Data Engineering
- BigQuery Guide: BigQuery Documentation
⚙️ Key GCP Services for Data Engineering
- Cloud Storage → Data lake storage for raw and processed data
- BigQuery → Serverless data warehouse for analytics
- Dataflow → Stream & batch data processing (Apache Beam)
- Pub/Sub → Real-time messaging and event ingestion
- Dataproc → Managed Spark/Hadoop clusters
- Cloud Composer → Workflow orchestration (Apache Airflow)
🧪 Practice Tasks
1. The Bucket Builder
Create a Cloud Storage bucket, upload a .csv file, and configure IAM permissions to allow read-only access.
2. The Cost Optimizer
Set up a lifecycle rule in Cloud Storage to move objects to Coldline or Archive storage after 30 days.
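Such a rule can be expressed as a JSON lifecycle policy and applied with `gsutil lifecycle set <file> gs://<bucket>` or in the console. A minimal sketch, assuming Coldline after 30 days:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 30}
    }
  ]
}
```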
3. The Query Explorer
Load your .csv file into BigQuery and run SQL queries:
- Count total rows
- Filter records
- Aggregate data (e.g., GROUP BY)
4. The Access Controller
Create a service account with least privilege and use it to access BigQuery or Cloud Storage.
🚀 Mini Project: Build a Data Pipeline in GCP
Objective:
Build a simple batch data pipeline from Cloud Storage → BigQuery.
Steps:
- Upload a sample dataset (`.csv`) to Cloud Storage
- Create a BigQuery dataset and table
- Use the BigQuery UI, the `bq` command-line tool, or a Dataflow template to load the data
- Transform the data using SQL (e.g., cleaning, filtering)
- Schedule the pipeline (optional) using Cloud Composer or scheduled queries