Banking Transaction Pipeline
(Python • Spark • S3)
A Python-based Spark pipeline that ingests banking transactions into S3.
Bronze → Silver → Gold architecture with data quality validation.
Note
This project is intended to demonstrate analytics engineering and lakehouse design patterns.
About The Project
This project simulates a banking transaction data pipeline using Python + Apache Spark with an S3-backed data lake.
It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a Bronze → Silver → Gold architecture.
Tech Stack: Python, Apache Spark (PySpark), S3-compatible object storage
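
As an illustration of the stack, the sketch below shows how a SparkSession might be configured for S3 access through the s3a connector. The hadoop-aws package coordinates and the credential provider are assumptions that depend on your Spark/Hadoop build and environment, not values taken from this repository.

```python
from pyspark.sql import SparkSession

# Minimal SparkSession configured for S3 access via the s3a connector.
# NOTE: the hadoop-aws version below is an assumption; it must match your
# Spark/Hadoop build. Credentials are resolved from the environment or an
# instance profile via the default AWS credentials provider chain.
spark = (
    SparkSession.builder
    .appName("banking-transaction-pipeline")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    )
    .getOrCreate()
)
```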
Key Features
- Batch ingestion of banking-style transaction data into an S3-backed Bronze layer
- Bronze → Silver → Gold lakehouse-style architecture
- Data validation gates (required fields, schema enforcement, duplicates, constraints)
- Curated datasets designed for BI and ad-hoc analytics
- Designed with analytics engineering principles: reliable outputs, repeatability, clear modeling
Architecture
The pipeline follows a lakehouse pattern where each layer has a clear responsibility.
Bronze (Raw)
Purpose
- Store transactions “as received” with minimal transformation
Why it matters
- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source
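A minimal sketch of Bronze ingestion under these assumptions: source files are headered CSVs, the bucket name is a placeholder, and the only additions to the raw data are audit metadata columns. Function and path names are illustrative, not the repository's actual code.

```python
from pyspark.sql import functions as F

BRONZE_PATH = "s3a://<bucket>/bronze/banking/"  # placeholder bucket


def ingest_to_bronze(spark, source_path: str) -> None:
    """Land raw transaction files in Bronze with minimal transformation."""
    raw = (
        spark.read
        .option("header", "true")  # keep source values as strings; cast later in Silver
        .csv(source_path)
    )
    (
        raw
        .withColumn("_ingested_at", F.current_timestamp())  # audit metadata only
        .withColumn("_source_file", F.input_file_name())
        .write.mode("append")
        .parquet(BRONZE_PATH)
    )
```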
Silver (Clean & Validated)
Purpose
- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis
Typical transformations
- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., transaction_id collisions)
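The transformations listed above might look like the following sketch. The column names (amount, timestamp, transaction_id) and the decimal precision are illustrative assumptions, not the project's actual schema.

```python
from pyspark.sql import functions as F


def bronze_to_silver(spark, bronze_path: str, silver_path: str) -> None:
    """Standardize types, normalize column names, and deduplicate on transaction_id."""
    df = spark.read.parquet(bronze_path)

    cleaned = (
        df
        # normalize column names to a consistent lowercase format
        .select([F.col(c).alias(c.strip().lower()) for c in df.columns])
        # cast string values from Bronze into analysis-ready types
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
        .withColumn("event_ts", F.to_timestamp("timestamp"))
        # keep a single row per transaction_id
        .dropDuplicates(["transaction_id"])
    )

    cleaned.write.mode("overwrite").parquet(silver_path)
```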
Gold (Curated & Analytics-Ready)
Purpose
- Create business-friendly datasets and aggregations for analytics and BI
Example outputs
- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics
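As a sketch, a daily counts-and-totals Gold dataset could be built like this, assuming the Silver columns from the previous example (event_ts, transaction_id, amount); the output subpath and function name are placeholders.

```python
from pyspark.sql import functions as F


def silver_to_gold_daily(spark, silver_path: str, gold_path: str) -> None:
    """Build a daily transaction counts & totals dataset from Silver."""
    txns = spark.read.parquet(silver_path)

    daily = (
        txns
        .withColumn("txn_date", F.to_date("event_ts"))
        .groupBy("txn_date")
        .agg(
            F.count("transaction_id").alias("txn_count"),
            F.sum("amount").alias("total_amount"),
        )
    )

    # hypothetical Gold dataset name
    daily.write.mode("overwrite").parquet(gold_path + "daily_summary/")
```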
Notes
- Bronze should contain raw ingested data (audit layer)
- Silver should contain cleaned and validated records
- Gold should contain curated outputs ready for analytics and BI
Data Quality & Validation
The pipeline applies checks to prevent bad data from reaching curated datasets.
Common checks include:
- Required fields (e.g., transaction_id, account_id, amount, timestamp)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., transaction_id collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under errors/)
These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
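A sketch of how several of these checks could be combined into a single valid/invalid split; the required-field list matches the examples above, while the helper name and the temporary flag column are hypothetical.

```python
from pyspark.sql import functions as F

REQUIRED_FIELDS = ["transaction_id", "account_id", "amount", "timestamp"]


def split_valid_invalid(df):
    """Flag rows that fail basic checks and route them to a quarantine DataFrame."""
    has_required = F.lit(True)
    for c in REQUIRED_FIELDS:
        has_required = has_required & F.col(c).isNotNull()

    checked = df.withColumn(
        "_is_valid",
        has_required
        & (F.col("amount").cast("double") >= 0)      # non-negative amounts
        & F.to_timestamp("timestamp").isNotNull(),   # parseable timestamps
    )

    valid = checked.filter("_is_valid").drop("_is_valid")
    invalid = checked.filter(~F.col("_is_valid")).drop("_is_valid")
    return valid, invalid


# valid rows continue into Silver; invalid rows can be written under errors/banking/
```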
Outputs
Example S3 layout:
s3://<bucket>/
    bronze/banking/
    silver/banking/
    gold/banking/
    errors/banking/
Gold-layer datasets are structured to support:
- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
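
For ad-hoc querying, Gold datasets written as Parquet can be read directly, for example with DuckDB's httpfs extension. This is a sketch: the object path, column names, and the assumption that S3 credentials are already configured for DuckDB are all placeholders, not part of this repository.

```python
import duckdb

# Ad-hoc query against Gold-layer Parquet files using DuckDB's httpfs extension.
# Assumes S3 credentials/region are already configured for DuckDB.
con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

result = con.execute(
    """
    SELECT txn_date, txn_count, total_amount
    FROM read_parquet('s3://<bucket>/gold/banking/daily_summary/*.parquet')
    ORDER BY txn_date DESC
    LIMIT 7
    """
).df()
print(result)
```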
Roadmap
- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation
License
Distributed under the MIT License. See LICENSE.txt for more information.