# Banking Transaction Pipeline (Python • Spark • S3)
A Python-based Spark pipeline that ingests banking-style transactions into S3 and processes them through a Bronze → Silver → Gold architecture with data quality validation.
## About The Project
This project simulates a banking transaction data pipeline using Python + Apache Spark with an S3-backed data lake. It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a Bronze → Silver → Gold architecture.
## Key Features
- Batch ingestion of banking-style transaction data into an S3-backed Bronze layer
- Bronze → Silver → Gold lakehouse-style architecture
- Data validation gates (required fields, schema enforcement, duplicates, constraints)
- Curated datasets designed for BI and ad-hoc analytics
- Designed with analytics engineering principles: reliable outputs, repeatability, clear modeling
## Architecture
The pipeline follows a lakehouse pattern where each layer has a clear responsibility.
### Bronze (Raw)

**Purpose**
- Store transactions “as received” with minimal transformation
**Why it matters**
- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source
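
A minimal sketch of what a Bronze ingestion step can look like in PySpark. The bucket name, landing path, and metadata column names below are illustrative assumptions, not the exact ones used in this repo:

```python
# Illustrative Bronze ingestion: land raw files in S3 with minimal changes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

# Read raw transaction files exactly as received (no cleaning yet).
raw_df = spark.read.option("header", "true").csv("s3a://my-bucket/landing/banking/")

# Tag each record with ingestion metadata so Bronze stays auditable.
bronze_df = (
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append to the Bronze layer without altering the original values.
bronze_df.write.mode("append").parquet("s3a://my-bucket/bronze/banking/")
```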
### Silver (Clean & Validated)

**Purpose**
- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis
**Typical transformations**
- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., transaction_id collisions)
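
A rough sketch of these Silver transformations, assuming illustrative paths, lowercase column names, and `transaction_id` plus ingestion time as the dedup rule:

```python
# Illustrative Silver step: standardize, cast, and deduplicate Bronze data.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("silver_clean").getOrCreate()

bronze_df = spark.read.parquet("s3a://my-bucket/bronze/banking/")

# Normalize column names and cast datatypes to a consistent schema.
typed_df = (
    bronze_df.toDF(*[c.lower() for c in bronze_df.columns])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("timestamp", F.to_timestamp("timestamp"))
)

# Resolve transaction_id collisions, keeping the most recently ingested copy.
w = Window.partitionBy("transaction_id").orderBy(F.col("_ingested_at").desc())
silver_df = (
    typed_df
    .withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

silver_df.write.mode("overwrite").parquet("s3a://my-bucket/silver/banking/")
```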
### Gold (Curated & Analytics-Ready)

**Purpose**
- Create business-friendly datasets and aggregations for analytics and BI
**Example outputs**
- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics
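
As one example, a daily counts-and-totals table could be derived from Silver roughly like this (paths, column names, and metric names are assumptions):

```python
# Illustrative Gold aggregation: daily transaction counts and totals.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_daily_metrics").getOrCreate()

silver_df = spark.read.parquet("s3a://my-bucket/silver/banking/")

daily_metrics = (
    silver_df
    .withColumn("txn_date", F.to_date("timestamp"))
    .groupBy("txn_date")
    .agg(
        F.count("transaction_id").alias("txn_count"),
        F.sum("amount").alias("txn_total"),
    )
)

daily_metrics.write.mode("overwrite").parquet("s3a://my-bucket/gold/banking/daily_metrics/")
```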
**Notes**
- Bronze should contain raw ingested data (audit layer)
- Silver should contain cleaned and validated records
- Gold should contain curated outputs ready for analytics and BI
For deeper implementation details, see the code in this repo.
## Data Quality & Validation
The pipeline applies checks to prevent bad data from reaching curated datasets.
Common checks include:
- Required fields (e.g., `transaction_id`, `account_id`, `amount`, `timestamp`)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., `transaction_id` collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under `errors/`)
These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
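
A simplified sketch of such a validation gate, assuming the rule set above and illustrative bucket paths (deduplication is shown in the Silver sketch earlier); records failing any check are routed to the quarantine path instead of Silver:

```python
# Illustrative validation gate: split valid records from quarantined ones.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validate_transactions").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/bronze/banking/")

required = ["transaction_id", "account_id", "amount", "timestamp"]

checks = (
    # Required fields must be present.
    F.expr(" AND ".join(f"`{c}` IS NOT NULL" for c in required))
    # Amounts must be non-negative and timestamps must parse.
    & (F.col("amount").cast("decimal(18,2)") >= 0)
    & F.to_timestamp("timestamp").isNotNull()
)
# Treat checks that cannot be evaluated (e.g. uncastable amounts) as failures.
is_valid = F.coalesce(checks, F.lit(False))

valid_df = df.filter(is_valid)
invalid_df = df.filter(~is_valid)

# Quarantine invalid records instead of dropping them silently.
invalid_df.write.mode("append").parquet("s3a://my-bucket/errors/banking/")
valid_df.write.mode("overwrite").parquet("s3a://my-bucket/silver/banking/")
```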
## Outputs
Example S3 layout:
    s3://<bucket>/
      bronze/banking/
      silver/banking/
      gold/banking/
      errors/banking/
Gold-layer datasets are structured to support:
- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
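
For instance, an ad-hoc Spark SQL query over a Gold table might look like this (the `daily_metrics` path and column names are assumptions carried over from the sketch above):

```python
# Illustrative ad-hoc query against a Gold dataset using Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_adhoc").getOrCreate()

spark.read.parquet("s3a://my-bucket/gold/banking/daily_metrics/") \
    .createOrReplaceTempView("daily_metrics")

# Last seven days of transaction volume.
spark.sql("""
    SELECT txn_date, txn_count, txn_total
    FROM daily_metrics
    ORDER BY txn_date DESC
    LIMIT 7
""").show()
```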
## Roadmap

- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation
See the open issues for a full list of proposed features and known issues.
## License

Distributed under the MIT License. See LICENSE.txt for more information.
## Contact

Cameron Seamons (Ogden, Utah)
Email: CameronSeamons@gmail.com
LinkedIn: linkedin_username
Project Link: https://github.com/github_username/repo_name