Update readme.md

parent 7ba4db709f
commit 1e5ce96375

1 changed file with 29 additions and 25 deletions (readme.md)
@@ -7,16 +7,23 @@
<br />

<h2 align="center">Banking Transaction Pipeline <br> (Python • Spark • S3)</h2>

<p align="center">
  <img height="250" src="https://git.camcodes.dev/Cameron/Data_Lab/raw/branch/main/images/Banking.jpg" alt="Banking pipeline diagram" />
  <br />
</p>
</div>

A Python-based Spark pipeline that ingests banking transactions into S3.

Bronze → Silver → Gold architecture with data quality validation.

> [!NOTE]
> This project is intended to demonstrate **analytics engineering and lakehouse design patterns**.

---

<!-- TABLE OF CONTENTS -->
<details open>
@@ -44,12 +51,17 @@
  </ol>
</details>

---

<!-- ABOUT THE PROJECT -->
# About The Project

This project simulates a **banking transaction data pipeline** using **Python + Apache Spark** with an **S3-backed data lake**.

It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a **Bronze → Silver → Gold** architecture.
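
To make that flow concrete, here is a minimal sketch of a Spark session configured for S3 access plus a raw read; the hadoop-aws version, credentials provider, and `landing/` prefix are illustrative assumptions, not values taken from this repository:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: the hadoop-aws version, credentials provider, and
# S3 prefixes below are assumptions, not values taken from this repository.
spark = (
    SparkSession.builder
    .appName("banking-transaction-pipeline")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

# Read raw transaction files from a hypothetical landing prefix.
raw_df = spark.read.json("s3a://<bucket>/landing/transactions/")
```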

**Tech Stack:** Python, PySpark, Apache Spark, S3 storage

### Key Features

@@ -63,11 +75,11 @@

# Architecture

The pipeline follows a lakehouse pattern where each layer has a clear responsibility.

## Bronze (Raw)

**Purpose**
- Store transactions “as received” with minimal transformation

@@ -78,7 +90,7 @@
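
As a rough sketch of what an “as received” Bronze write can look like (continuing from the session sketch above; the prefix, partition column, and Parquet format are assumptions, not details from this repo):

```python
from pyspark.sql import functions as F

# Hypothetical Bronze write: keep raw columns untouched, add ingestion
# metadata, and append to a bronze prefix partitioned by ingest date.
bronze_df = (
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_ingest_date", F.current_date())
)

(
    bronze_df.write
    .mode("append")
    .partitionBy("_ingest_date")
    .parquet("s3a://<bucket>/bronze/transactions/")
)
```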

---

## Silver (Clean & Validated)

**Purpose**
- Standardize schema and datatypes

@@ -92,7 +104,7 @@
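
A hedged sketch of the kind of standardization this layer performs (the column names, types, and rules below are assumed for illustration):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, TimestampType

# Hypothetical Silver step: enforce types, normalize values, and drop
# duplicate transaction ids before validation. Column names are assumed.
silver_df = (
    bronze_df
    .withColumn("amount", F.col("amount").cast(DecimalType(18, 2)))
    .withColumn("transaction_ts", F.col("transaction_ts").cast(TimestampType()))
    .withColumn("currency", F.upper(F.trim(F.col("currency"))))
    .dropDuplicates(["transaction_id"])
)
```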

---

## Gold (Curated & Analytics-Ready)

**Purpose**
- Create business-friendly datasets and aggregations for analytics and BI

@@ -102,9 +114,6 @@
- Account/customer-level summaries
- Error/invalid transaction metrics
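
For illustration, an aggregation along these lines could produce one such summary (the grouping columns and output prefix are assumptions, not taken from the repo):

```python
from pyspark.sql import functions as F

# Hypothetical Gold output: a daily account-level transaction summary.
gold_daily_account = (
    silver_df
    .groupBy("account_id", F.to_date("transaction_ts").alias("transaction_date"))
    .agg(
        F.count("*").alias("txn_count"),
        F.sum("amount").alias("total_amount"),
    )
)

(
    gold_daily_account.write
    .mode("overwrite")
    .partitionBy("transaction_date")
    .parquet("s3a://<bucket>/gold/daily_account_summary/")
)
```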

### Notes

@@ -112,13 +121,12 @@
- **Silver** should contain cleaned and validated records
- **Gold** should contain curated outputs ready for analytics and BI

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

# Data Quality & Validation

The pipeline applies checks to prevent bad data from reaching curated datasets.

@@ -132,7 +140,6 @@
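
The concrete checks live in the pipeline code; as an illustrative sketch of the general pattern only (the rules below are assumptions, not the project's actual checks), a validation split might look like:

```python
from pyspark.sql import functions as F

# Illustrative validation pass: split records into valid and rejected sets.
# The specific rules are assumptions, not the project's actual checks.
checks = (
    F.col("transaction_id").isNotNull()
    & F.col("account_id").isNotNull()
    & (F.col("amount") != 0)
    & F.col("currency").rlike("^[A-Z]{3}$")
)

valid_df = silver_df.filter(checks)
rejected_df = silver_df.filter(~checks)
# Rejected records can feed the error/invalid transaction metrics in Gold.
```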

These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.

---

@@ -149,11 +156,9 @@ s3://<bucket>/

Gold-layer datasets are structured to support:

- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
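
As a sketch of the ad-hoc querying case with DuckDB (the path and column names follow the assumed layout from the sketches above, and S3 credentials are presumed to be configured for httpfs):

```python
import duckdb

# Hypothetical ad-hoc query against the Gold layer; the path and columns
# follow the assumed layout above, and S3 access is provided by httpfs.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

top_accounts = con.execute("""
    SELECT account_id, SUM(total_amount) AS total_amount
    FROM read_parquet('s3://<bucket>/gold/daily_account_summary/*/*.parquet')
    GROUP BY account_id
    ORDER BY total_amount DESC
    LIMIT 10
""").df()
```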

<p align="right">(<a href="#readme-top">back to top</a>)</p>

@@ -167,7 +172,6 @@

- Add CDC-style ingestion simulation

## License