Pandas for Data Engineering

Pandas is a fast, powerful, and flexible open-source data analysis and manipulation tool built on top of the Python programming language.

🔗 Learning Resources

📝 Practice Tasks (The Basics)

1. The Schema Validator

Goal: Ensure incoming data matches your requirements.

  • Scenario: You have a CSV of “Sales” data. You need to ensure no prices are negative and all dates are valid.
  • Requirements:
    • Load sales.csv.
    • Use df.info() to check data types.
    • Use boolean indexing to find rows where price < 0 and drop them.
    • Convert the transaction_date column to datetime64[ns] (for example with pd.to_datetime) and decide how to handle values that fail to parse.
    • Output: A report of how many rows were deleted during cleaning.
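
One possible way to approach the steps above is sketched below. It is not a reference solution. The inline sample data, the order_id column, the clean_sales helper, and the report keys are all hypothetical, and the choice to drop rows whose dates fail to parse is an assumption; the task only says dates must be valid.

```python
import io

import pandas as pd

# Hypothetical sample standing in for sales.csv; the real file may have other columns.
SAMPLE_CSV = """order_id,price,transaction_date
1,19.99,2024-01-05
2,-5.00,2024-01-06
3,42.50,not-a-date
4,7.25,2024-01-08
"""


def clean_sales(source):
    """Load sales data, drop invalid rows, and return (clean_df, report)."""
    df = pd.read_csv(source)
    df.info()  # prints column dtypes and non-null counts for a quick check
    rows_before = len(df)

    # Boolean indexing: select rows with a negative price, then drop them by index.
    negative = df[df["price"] < 0]
    df = df.drop(index=negative.index)

    # Unparseable dates become NaT instead of raising an error.
    df["transaction_date"] = pd.to_datetime(df["transaction_date"], errors="coerce")

    # Assumption: rows with an invalid date are removed rather than repaired.
    invalid_dates = df[df["transaction_date"].isna()]
    df = df.drop(index=invalid_dates.index)

    # Pin the resolution explicitly, since the default can vary by pandas version.
    df["transaction_date"] = df["transaction_date"].astype("datetime64[ns]")

    report = {
        "rows_before": rows_before,
        "dropped_negative_price": len(negative),
        "dropped_invalid_date": len(invalid_dates),
        "rows_after": len(df),
    }
    return df, report


# Pass "sales.csv" instead of the in-memory sample to run against a real file.
cleaned, report = clean_sales(io.StringIO(SAMPLE_CSV))
print(report)
```

Counting each kind of removal separately makes the final report easier to audit than a single before-and-after difference.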

2. The Aggregator (Group-By)

Goal: Create a summary report from raw events.

  • Scenario: A CSV contains store_id, product_category, and revenue.
  • Requirements:
    • Group the data by store_id.
    • Calculate the Total Revenue and Average Revenue per store.
    • Sort the results so the highest-earning store is at the top.
    • Output: Save the summary to store_performance.csv.
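
The group-by steps above might look like the following minimal sketch. The inline sample rows, the summarize_stores helper, and the output column names total_revenue and average_revenue are hypothetical. "Highest-earning" is read here as highest total revenue, which is an assumption.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the raw events CSV.
SAMPLE_CSV = """store_id,product_category,revenue
A,toys,100.0
A,books,50.0
B,toys,300.0
B,books,100.0
C,toys,20.0
"""


def summarize_stores(source):
    """Return one row per store with total and average revenue, best store first."""
    df = pd.read_csv(source)
    summary = (
        df.groupby("store_id")["revenue"]
        # Named aggregation: output column name = aggregation function.
        .agg(total_revenue="sum", average_revenue="mean")
        .sort_values("total_revenue", ascending=False)
        .reset_index()  # turn store_id back into a regular column
    )
    return summary


# Pass a real file path instead of the in-memory sample to run on actual data.
summary = summarize_stores(io.StringIO(SAMPLE_CSV))
summary.to_csv("store_performance.csv", index=False)
print(summary)
```

Writing with index=False keeps the saved file free of the positional index, so store_id stays the first column.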