<a name="readme-top"></a>
<!-- PROJECT LOGO -->
<br />
<h3 align="center">Banking Transaction Pipeline (Python • Spark • S3)</h3>
<p align="center">
A Python-based Spark pipeline that ingests banking-style transactions into S3 and processes them through a Bronze → Silver → Gold architecture with data quality validation.
<br />
</p>
<!-- TABLE OF CONTENTS -->
<details open>
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#about-the-project">About The Project</a>
<ul>
<li><a href="#key-features">Key Features</a></li>
</ul>
</li>
<li>
<a href="#architecture">Architecture</a>
<ul>
<li><a href="#bronze-raw">Bronze (Raw)</a></li>
<li><a href="#silver-clean--validated">Silver (Clean & Validated)</a></li>
<li><a href="#gold-curated--analytics-ready">Gold (Curated & Analytics-Ready)</a></li>
</ul>
</li>
<li>
<a href="#getting-started">Getting Started</a>
<ul>
<li><a href="#prerequisites">Prerequisites</a></li>
<li><a href="#installation">Installation</a></li>
<li><a href="#configuration">Configuration</a></li>
</ul>
</li>
<li><a href="#usage">Usage</a></li>
<li><a href="#data-quality--validation">Data Quality & Validation</a></li>
<li><a href="#outputs">Outputs</a></li>
<li><a href="#roadmap">Roadmap</a></li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#license">License</a></li>
<li><a href="#contact">Contact</a></li>
<li><a href="#acknowledgments">Acknowledgments</a></li>
</ol>
</details>
<!-- ABOUT THE PROJECT -->
## About The Project
This project simulates a **banking transaction data pipeline** using **Python + Apache Spark** with an **S3-backed data lake**. It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a **Bronze → Silver → Gold** architecture.
### Key Features
- **Batch ingestion** of banking-style transaction data into an S3-backed Bronze layer
- **Bronze → Silver → Gold** lakehouse-style architecture
- **Data validation gates** (required fields, schema enforcement, duplicates, constraints)
- **Curated datasets** designed for BI and ad-hoc analytics
- Designed with **analytics engineering principles**: reliable outputs, repeatability, clear modeling
<p align="right">(<a href="#readme-top">back to top</a>)</p>
## Architecture
The pipeline follows a lakehouse pattern where each layer has a clear responsibility.
### Bronze (Raw)
**Purpose**
- Store transactions “as received” with minimal transformation
**Why it matters**
- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source
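
Below is a minimal sketch of what a Bronze ingestion step could look like, assuming CSV source files under a hypothetical `landing/` prefix; the bucket name, paths, and read options are placeholders, not the project's actual layout.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

# Read the source files as-is; the header/CSV options are assumptions about the feed.
raw_df = spark.read.option("header", True).csv("s3a://<bucket>/landing/banking/")

# Add only an ingestion timestamp so the raw payload stays auditable.
bronze_df = raw_df.withColumn("ingested_at", F.current_timestamp())

# Append to the Bronze layer without reshaping the data.
bronze_df.write.mode("append").parquet("s3a://<bucket>/bronze/banking/")
```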
---
### Silver (Clean & Validated)
**Purpose**
- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis
**Typical transformations**
- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., transaction_id collisions)
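
As a sketch of those transformations (the lowercase renaming rule and the "keep the latest ingested record" deduplication rule are illustrative choices, not necessarily the project's exact logic):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("silver_build").getOrCreate()
bronze_df = spark.read.parquet("s3a://<bucket>/bronze/banking/")

# Standardize column names (illustrative rule: lowercase) and cast datatypes.
silver_df = (
    bronze_df.toDF(*[c.lower() for c in bronze_df.columns])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("timestamp", F.to_timestamp("timestamp"))
)

# Resolve transaction_id collisions by keeping the most recently ingested record.
dedup_window = Window.partitionBy("transaction_id").orderBy(F.col("ingested_at").desc())
silver_df = (
    silver_df
    .withColumn("row_num", F.row_number().over(dedup_window))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)

silver_df.write.mode("overwrite").parquet("s3a://<bucket>/silver/banking/")
```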
---
### Gold (Curated & Analytics-Ready)
**Purpose**
- Create business-friendly datasets and aggregations for analytics and BI
**Example outputs**
- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics
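
For example, a daily transaction summary could be built roughly like this (output path and column names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_build").getOrCreate()
silver_df = spark.read.parquet("s3a://<bucket>/silver/banking/")

# Daily transaction counts and totals for BI and ad-hoc analysis.
daily_summary = (
    silver_df
    .withColumn("txn_date", F.to_date("timestamp"))
    .groupBy("txn_date")
    .agg(
        F.count("*").alias("transaction_count"),
        F.sum("amount").alias("total_amount"),
    )
)

daily_summary.write.mode("overwrite").parquet("s3a://<bucket>/gold/banking/daily_summary/")
```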
<p align="right">(<a href="#readme-top">back to top</a>)</p>
### Notes
- **Bronze** should contain raw ingested data (audit layer)
- **Silver** should contain cleaned and validated records
- **Gold** should contain curated outputs ready for analytics and BI
For deeper implementation details, see the code in this repo.
<p align="right">(<a href="#readme-top">back to top</a>)</p>
---
## Data Quality & Validation
The pipeline applies checks to prevent bad data from reaching curated datasets.
**Common checks include:**
- Required fields (e.g., `transaction_id`, `account_id`, `amount`, `timestamp`)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., `transaction_id` collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under `errors/`)
These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
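
A hedged sketch of how such a gate could be wired up (column names follow the Silver sketch above, and the thresholds and paths are assumptions, not the project's exact rules):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_gate").getOrCreate()

# Assumes column names have already been standardized (see the Silver sketch).
df = spark.read.parquet("s3a://<bucket>/bronze/banking/")

# Combine required-field, value-constraint, and timestamp checks into one flag.
required = ["transaction_id", "account_id", "amount", "timestamp"]
is_invalid = F.lit(False)
for col_name in required:
    is_invalid = is_invalid | F.col(col_name).isNull()
is_invalid = is_invalid | (F.col("amount").cast("double") < 0)
is_invalid = is_invalid | F.to_timestamp("timestamp").isNull()

flagged = df.withColumn("is_invalid", is_invalid)
valid_df = flagged.filter(~F.col("is_invalid")).drop("is_invalid")
errors_df = flagged.filter(F.col("is_invalid")).drop("is_invalid")

# Quarantine invalid records so they never reach the curated layers.
errors_df.write.mode("append").parquet("s3a://<bucket>/errors/banking/")
```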
<p align="right">(<a href="#readme-top">back to top</a>)</p>
---
## Outputs
**Example S3 layout:**
```text
s3://<bucket>/
bronze/banking/
silver/banking/
gold/banking/
errors/banking/
```
Gold-layer datasets are structured to support:
- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
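
For instance, a Gold dataset could be queried ad hoc with Spark SQL along these lines (the dataset path and columns follow the hypothetical daily summary shown earlier):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc_query").getOrCreate()

# Register the hypothetical daily summary as a temporary view for Spark SQL.
gold_df = spark.read.parquet("s3a://<bucket>/gold/banking/daily_summary/")
gold_df.createOrReplaceTempView("daily_summary")

spark.sql("""
    SELECT txn_date, transaction_count, total_amount
    FROM daily_summary
    ORDER BY txn_date
""").show()
```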
<p align="right">(<a href="#readme-top">back to top</a>)</p>
## Roadmap
- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation

See the open issues for a full list of proposed features and known issues.
<p align="right">(<a href="#readme-top">back to top</a>)</p>License
Distributed under the MIT License. See LICENSE.txt for more information.
<p align="right">(<a href="#readme-top">back to top</a>)</p>
## Contact
Cameron Seamons (Ogden, Utah)
- Email: CameronSeamons@gmail.com
- LinkedIn: linkedin_username
- Project Link: https://github.com/github_username/repo_name
<p align="right">(<a href="#readme-top">back to top</a>)</p>