Introduction to AWS (Amazon Web Services)
AWS is the world’s most comprehensive and broadly adopted cloud platform. For a Data Engineer, AWS provides the foundational storage (S3) and compute (EC2/Lambda) layers needed to build robust data pipelines.
🔗 Learning Resources
- Official Training: AWS Cloud Practitioner Essentials
- Storage Deep-Dive: Amazon S3 User Guide
- Interactive Workshop: AWS Data Engineering Immersion Day
☁️ AWS Practice Tasks: Cloud Data Engineering
AWS is the most widely used cloud provider. In this module, you will build a “Data Lake” and a serverless ETL pipeline.
📝 Practice Tasks: S3, Lambda, and Glue
Task 1: The S3 Data Lake Setup
Goal: Master object storage and permissions.
- Requirement: Log into the AWS Console (or use LocalStack for local practice).
- Challenge:
  1. Create an S3 bucket named `raw-grocery-data-unique-id`.
  2. Upload your `grocery_sales.csv` manually.
  3. Create a folder structure inside the bucket: `inbound/`, `processed/`, and `archive/`.
- Command: `aws s3 cp grocery_sales.csv s3://raw-grocery-data-unique-id/inbound/`
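The same setup can be scripted with `boto3` instead of the console or CLI. A minimal sketch, assuming AWS credentials (or LocalStack) are configured — the bucket and file names are the ones from the task:

```python
# Sketch of Task 1 in boto3. S3 has no real folders: zero-byte objects whose
# keys end in "/" make the prefixes appear as folders in the console.

BUCKET = "raw-grocery-data-unique-id"  # must be globally unique in practice
PREFIXES = ["inbound/", "processed/", "archive/"]

def folder_keys():
    """Keys for the zero-byte 'folder' marker objects."""
    return list(PREFIXES)

def setup_bucket():
    """Run this once credentials (or a LocalStack endpoint) are configured."""
    import boto3  # deferred so the module imports without AWS installed/configured

    s3 = boto3.client("s3")  # LocalStack: endpoint_url="http://localhost:4566"
    s3.create_bucket(Bucket=BUCKET)  # us-east-1; other regions also need
                                     # CreateBucketConfiguration={"LocationConstraint": ...}
    for key in folder_keys():
        s3.put_object(Bucket=BUCKET, Key=key)  # zero-byte folder markers
    s3.upload_file("grocery_sales.csv", BUCKET, "inbound/grocery_sales.csv")
```

Calling `setup_bucket()` reproduces the three challenge steps plus the `aws s3 cp` command in one go.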
Task 2: Serverless ETL with AWS Lambda
Goal: Trigger code based on an event (File Upload).
- Requirement: Create a Python Lambda function.
- Challenge:
  1. Set an S3 trigger so the Lambda runs every time a `.csv` file is dropped into the `inbound/` folder.
  2. Write Python code (using the `boto3` library) to read the file’s metadata and print the file name to CloudWatch Logs.
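A sketch of what that handler can look like. The S3 put-event payload already carries the object key and size, so no API call is needed just to log the file name; anything written with `print()` lands in CloudWatch Logs automatically:

```python
# Sketch of the Task 2 Lambda handler: log each uploaded file's name and size.

def lambda_handler(event, context):
    seen = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size")
        # For richer metadata (ContentType, ETag, ...), you would call
        # boto3.client("s3").head_object(Bucket=bucket, Key=key) here.
        print(f"New file: s3://{bucket}/{key} ({size} bytes)")
        seen.append(key)
    return {"processed": seen}
```

You can exercise it locally with a hand-built event in the standard S3 notification shape before wiring up the real trigger.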
Task 3: AWS Glue Crawler & Data Catalog
Goal: Automatically discover the schema of your data.
- Requirement: Point an AWS Glue Crawler at your S3 bucket.
- Challenge:
  1. Run the Crawler to scan your CSV files.
  2. Check the AWS Glue Data Catalog to see if a table was created automatically.
  3. Use Amazon Athena to run a SQL query (`SELECT * FROM ...`) directly against the files sitting in S3.
- Concept: This is the “Schema-on-Read” pattern.
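The Athena step can also be run programmatically via `boto3`. A sketch under stated assumptions: the database and table names below are placeholders for whatever the Glue Crawler actually created, and the output location is any S3 path Athena may write results to:

```python
# Sketch: starting the Task 3 Athena query with boto3.
# "grocery_db" and the table name are assumed; check the Glue Data Catalog
# for the names the Crawler really generated.

QUERY = 'SELECT * FROM "grocery_db"."raw_grocery_data_unique_id" LIMIT 10'

def run_query():
    """Run with AWS credentials configured; returns the query execution id."""
    import boto3  # deferred so the module imports without AWS configured

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "grocery_db"},
        ResultConfiguration={
            # Athena writes result files here; any writable S3 prefix works.
            "OutputLocation": "s3://raw-grocery-data-unique-id/athena-results/"
        },
    )
    return resp["QueryExecutionId"]
```

Note that `start_query_execution` is asynchronous: you poll `get_query_execution` (or use `get_query_results`) to fetch the output once the query finishes.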
Task 4: IAM Roles & Security
Goal: Understand the “Principle of Least Privilege.”
- Requirement: Your Lambda function needs permission to read from S3.
- Challenge: Create an IAM Role that has only the `AmazonS3ReadOnlyAccess` managed policy attached, and set it as your Lambda’s execution role.
- Verification: Try to make the Lambda delete the file. The call should fail with an AccessDenied error, confirming the function cannot do more than its role allows.
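A sketch of the setup and the verification. The role name is hypothetical; the policy ARN is the AWS-managed one. The delete attempt only proves anything when it runs *inside* the Lambda, where the execution role’s permissions apply:

```python
# Sketch for Task 4: attach the read-only managed policy, then verify
# that a delete attempt (from inside the Lambda) is denied.

READONLY_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"

def attach_readonly(role_name="grocery-lambda-role"):  # hypothetical role name
    """Run with admin credentials to configure the Lambda's execution role."""
    import boto3  # deferred so the module imports without AWS configured

    iam = boto3.client("iam")
    iam.attach_role_policy(RoleName=role_name, PolicyArn=READONLY_POLICY_ARN)

def attempt_delete(bucket, key):
    """Drop this into the Lambda handler: returns the error code if denied,
    or None if the delete (wrongly) succeeded."""
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    try:
        s3.delete_object(Bucket=bucket, Key=key)
        return None  # least privilege is NOT enforced
    except ClientError as err:
        return err.response["Error"]["Code"]  # expect "AccessDenied"
```

If `attempt_delete` returns `"AccessDenied"` from inside the Lambda, the role is doing its job.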
🏗️ Mini Project: The “S3-to-Athena” Pipeline
Build a production-ready serverless analytics layer:
- Storage: Landing zone in S3 (`/raw`).
- Transformation: An AWS Glue Job (PySpark) that reads the raw CSV, converts it to Parquet format, and saves it to a `/parquet` folder.
- Cataloging: A Glue Crawler that updates the metadata for the Parquet files.
- Analysis: Use Amazon Athena to calculate the total monthly revenue. Compare the speed of querying the CSV vs. the Parquet version.
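Before running the analysis in Athena, it helps to know the numbers you expect. A local sketch of the “total monthly revenue” aggregation, using only the standard library — the column names (`date`, `quantity`, `unit_price`) are assumptions about `grocery_sales.csv` and may need adjusting:

```python
# Local sketch of the monthly-revenue aggregation from the Analysis step,
# equivalent in spirit to an Athena query like:
#   SELECT substr(date, 1, 7) AS month, SUM(quantity * unit_price) AS revenue
#   FROM raw_grocery_data GROUP BY 1;
import csv
from collections import defaultdict
from io import StringIO

def monthly_revenue(csv_text):
    """Sum quantity * unit_price per YYYY-MM month."""
    totals = defaultdict(float)
    for row in csv.DictReader(StringIO(csv_text)):
        month = row["date"][:7]  # "2024-03-15" -> "2024-03"
        totals[month] += int(row["quantity"]) * float(row["unit_price"])
    return dict(totals)
```

Running the same query in Athena against both the CSV and the Parquet copies should return identical totals; the Parquet version is typically faster and scans fewer bytes, because Athena reads only the columns the query touches.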