

Banking Transaction Pipeline
(Python • Spark • S3)

Banking pipeline diagram

A Python-based Spark pipeline that ingests banking transactions into S3.
Bronze → Silver → Gold architecture with data quality validation.

Note

This project is intended to demonstrate analytics engineering and lakehouse design patterns.


Table of Contents
  1. About The Project
  2. Architecture
  3. Data Quality & Validation
  4. Outputs
  5. Roadmap
  6. License
  7. Connect With Me

About The Project

This project simulates a banking transaction data pipeline using Python + Apache Spark with an S3-backed data lake.

It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a Bronze → Silver → Gold architecture.

Tech Stack: Python, Apache Spark (PySpark), S3 object storage

Key Features

  • Batch ingestion of banking-style transaction data into an S3-backed Bronze layer
  • Bronze → Silver → Gold lakehouse-style architecture
  • Data validation gates (required fields, schema enforcement, duplicates, constraints)
  • Curated datasets designed for BI and ad-hoc analytics
  • Designed around analytics engineering principles: reliable outputs, repeatable runs, and clear data modeling

(back to top)

Architecture

The pipeline follows a lakehouse pattern where each layer has a clear responsibility.

Bronze (Raw)

Purpose

  • Store transactions “as received” with minimal transformation

Why it matters

  • Preserves an auditable source of truth
  • Enables reprocessing into Silver/Gold without re-ingesting from the source
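
As a rough illustration of this layer, a minimal Bronze write in PySpark could look like the sketch below. The bucket paths, the CSV landing source, and the _ingested_at metadata column are assumptions made for illustration, not the project's actual code.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

  # Read a raw file drop exactly as received; no cleaning beyond header parsing.
  raw = (spark.read
         .option("header", True)
         .csv("s3a://<bucket>/landing/banking/transactions.csv"))

  # Tag each record with ingestion metadata, then land it in the Bronze layer.
  (raw.withColumn("_ingested_at", F.current_timestamp())
      .write.mode("append")
      .parquet("s3a://<bucket>/bronze/banking/"))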

Silver (Clean & Validated)

Purpose

  • Standardize schema and datatypes
  • Validate records and isolate invalid data
  • Deduplicate and normalize for analysis

Typical transformations

  • Datatype casting (timestamps, numeric amounts)
  • Standardized column names and formats
  • Deduplication rules (e.g., transaction_id collisions)
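
A minimal sketch of these transformations, assuming illustrative bucket paths and a raw column named txn_ts (an assumption about the source schema), might look like this:

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.getOrCreate()
  bronze = spark.read.parquet("s3a://<bucket>/bronze/banking/")

  silver = (bronze
            .withColumnRenamed("txn_ts", "transaction_timestamp")          # standardize column names
            .withColumn("transaction_timestamp",
                        F.to_timestamp("transaction_timestamp"))           # cast timestamps
            .withColumn("amount", F.col("amount").cast("decimal(18,2)"))   # cast numeric amounts
            .dropDuplicates(["transaction_id"]))                           # resolve transaction_id collisions

  silver.write.mode("overwrite").parquet("s3a://<bucket>/silver/banking/")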

Gold (Curated & Analytics-Ready)

Purpose

  • Create business-friendly datasets and aggregations for analytics and BI

Example outputs

  • Daily transaction counts & totals
  • Account/customer-level summaries
  • Error/invalid transaction metrics
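
For example, the first two outputs above could be derived from Silver with aggregations along these lines; the dataset paths and column names are assumptions for illustration.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.getOrCreate()
  silver = spark.read.parquet("s3a://<bucket>/silver/banking/")

  # Daily transaction counts and totals.
  daily = (silver
           .groupBy(F.to_date("transaction_timestamp").alias("transaction_date"))
           .agg(F.count("*").alias("transaction_count"),
                F.sum("amount").alias("total_amount")))

  # Account-level summaries.
  by_account = (silver
                .groupBy("account_id")
                .agg(F.count("*").alias("transaction_count"),
                     F.sum("amount").alias("total_amount")))

  daily.write.mode("overwrite").parquet("s3a://<bucket>/gold/banking/daily_totals/")
  by_account.write.mode("overwrite").parquet("s3a://<bucket>/gold/banking/account_summary/")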

Notes

  • Bronze should contain raw ingested data (audit layer)
  • Silver should contain cleaned and validated records
  • Gold should contain curated outputs ready for analytics and BI

(back to top)


Data Quality & Validation

The pipeline applies checks to prevent bad data from reaching curated datasets.

Common checks include:

  • Required fields (e.g., transaction_id, account_id, amount, timestamp)
  • Schema enforcement (consistent datatypes between runs)
  • Duplicate detection (e.g., transaction_id collisions)
  • Value constraints (e.g., amounts must be non-negative)
  • Timestamp parsing and validation
  • Quarantine routing for invalid records (optional, stored under errors/)

These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
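
A minimal sketch of such a validation gate, assuming the field names listed above plus an illustrative transaction_timestamp column and bucket paths, could look like this:

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.getOrCreate()
  df = spark.read.parquet("s3a://<bucket>/bronze/banking/")

  # A record passes when required fields are present, the amount is a
  # non-negative number, and the timestamp parses.
  is_valid = (
      F.col("transaction_id").isNotNull()
      & F.col("account_id").isNotNull()
      & F.col("amount").cast("decimal(18,2)").isNotNull()
      & (F.col("amount").cast("decimal(18,2)") >= 0)
      & F.to_timestamp("transaction_timestamp").isNotNull()
  )

  valid = df.filter(is_valid).dropDuplicates(["transaction_id"])   # duplicate detection
  invalid = df.filter(~is_valid)                                   # quarantine routing

  valid.write.mode("overwrite").parquet("s3a://<bucket>/silver/banking/")
  invalid.write.mode("overwrite").parquet("s3a://<bucket>/errors/banking/")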


Outputs

Example S3 layout:

s3://<bucket>/
  bronze/banking/
  silver/banking/
  gold/banking/
  errors/banking/

Gold-layer datasets are structured to support:

  • Business intelligence tools (Tableau / Power BI)
  • Ad-hoc querying (Spark SQL / DuckDB)
  • Downstream analytics and metric definitions
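
As one example of ad-hoc use, a Spark SQL query over an assumed daily_totals Gold dataset (path and column names are illustrative) might look like:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Register the curated dataset as a temporary view for SQL access.
  (spark.read.parquet("s3a://<bucket>/gold/banking/daily_totals/")
        .createOrReplaceTempView("daily_totals"))

  spark.sql("""
      SELECT transaction_date, transaction_count, total_amount
      FROM daily_totals
      ORDER BY transaction_date DESC
      LIMIT 7
  """).show()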

(back to top)

Roadmap

  • Add orchestration (Airflow / Dagster)
  • Implement incremental processing and partitioning
  • Add automated pipeline health checks (row counts, null rates, duplicates)
  • Add unit tests for validation logic
  • Add monitoring, alerting, and run logs
  • Add CDC-style ingestion simulation

License

Distributed under the MIT License. See LICENSE.txt for more information.

💬 Connect With Me

LinkedIn • Portfolio • Kaggle • Email • Resume

(back to top)