# Banking Transaction Pipeline (Python • Spark • S3)

A Python-based Spark pipeline that ingests banking transactions into an S3 data lake and curates them through a Bronze → Silver → Gold architecture with data quality validation.
> [!NOTE]
> This project is intended to demonstrate **analytics engineering and lakehouse design patterns**.
---
## Table of Contents

- About The Project
- Architecture
- Data Quality & Validation
- Outputs
- Roadmap
- License
- Connect With Me
---
## About The Project
This project simulates a **banking transaction data pipeline** using **Python + Apache Spark** with an **S3-backed data lake**.
It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a **Bronze → Silver → Gold** architecture.
**Tech Stack:** Python, Apache Spark (PySpark), S3 object storage
### Key Features
- **Batch ingestion** of banking-style transaction data into an S3-backed Bronze layer
- **Bronze → Silver → Gold** lakehouse-style architecture
- **Data validation gates** (required fields, schema enforcement, duplicates, constraints)
- **Curated datasets** designed for BI and ad-hoc analytics
- Designed with **analytics engineering principles**: reliable outputs, repeatability, clear modeling
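
The sketches in the sections below assume a SparkSession configured for S3 access roughly like the following. This is a minimal, hypothetical setup rather than the repo's actual configuration: the bucket name, `hadoop-aws` version, and path constants are placeholders.

```python
# Minimal, hypothetical SparkSession setup for an S3-backed lakehouse.
# Assumes the s3a connector (hadoop-aws) is available and AWS credentials
# are resolvable via the default provider chain.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("banking-transaction-pipeline")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
    )
    .getOrCreate()
)

# Placeholder layer paths; substitute your own bucket.
BUCKET = "s3a://<bucket-name>"
BRONZE = f"{BUCKET}/bronze/banking/"
SILVER = f"{BUCKET}/silver/banking/"
GOLD = f"{BUCKET}/gold/banking/"
ERRORS = f"{BUCKET}/errors/banking/"
```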
(back to top)
## Architecture
The pipeline follows a lakehouse pattern where each layer has a clear responsibility.
### Bronze (Raw)
**Purpose**
- Store transactions “as received” with minimal transformation
**Why it matters**
- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source
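
As a rough illustration (not the project's actual ingestion code), landing a batch into Bronze can be as simple as reading the source files and stamping them with ingestion metadata; the source path and CSV format are assumptions, and the path constants come from the setup sketch above.

```python
# Hypothetical Bronze ingestion: keep the data "as received", add only lineage columns.
from pyspark.sql import functions as F

raw = (
    spark.read
    .option("header", True)  # assuming CSV sources with a header row
    .csv("s3a://<source-bucket>/incoming/transactions/")
)

bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())  # when the batch landed
    .withColumn("_source_file", F.input_file_name())    # which raw file it came from
)

bronze.write.mode("append").parquet(BRONZE)  # BRONZE path from the setup sketch above
```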
---
### Silver (Clean & Validated)
**Purpose**
- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis
**Typical transformations** (sketched after this list)
- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., transaction_id collisions)
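
A hedged sketch of what these transformations might look like, continuing from the setup sketch above; the source column names are illustrative, not taken from this repo.

```python
# Hypothetical Silver step: standardize, cast, and deduplicate the Bronze data.
from pyspark.sql import functions as F

bronze = spark.read.parquet(BRONZE)

silver = (
    bronze
    .withColumnRenamed("TxnID", "transaction_id")                 # example rename; source names may differ
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # numeric amounts
    .withColumn("timestamp", F.to_timestamp("timestamp"))         # parsed event time
    .dropDuplicates(["transaction_id"])                           # resolve transaction_id collisions
)

silver.write.mode("overwrite").parquet(SILVER)
```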
---
### Gold (Curated & Analytics-Ready)
**Purpose**
- Create business-friendly datasets and aggregations for analytics and BI
**Example outputs** (one is sketched after this list)
- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics
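
For instance, a daily summary table might be derived from Silver as follows. This is a sketch; the output name and grain are assumptions, and the paths reuse the setup sketch above.

```python
# Hypothetical Gold aggregate: daily transaction counts and totals from Silver.
from pyspark.sql import functions as F

silver = spark.read.parquet(SILVER)

daily_summary = (
    silver
    .withColumn("transaction_date", F.to_date("timestamp"))
    .groupBy("transaction_date")
    .agg(
        F.count("*").alias("transaction_count"),
        F.sum("amount").alias("total_amount"),
    )
)

daily_summary.write.mode("overwrite").parquet(GOLD + "daily_summary/")
```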
### Notes
- **Bronze** should contain raw ingested data (audit layer)
- **Silver** should contain cleaned and validated records
- **Gold** should contain curated outputs ready for analytics and BI
(back to top)
---
## Data Quality & Validation
The pipeline applies checks to prevent bad data from reaching curated datasets.
**Common checks include:**
- Required fields (e.g., `transaction_id`, `account_id`, `amount`, `timestamp`)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., `transaction_id` collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under `errors/`)
These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
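
A minimal sketch of such a validation gate, assuming the column names listed above and the paths from the setup sketch; the exact rules in this repo may differ.

```python
# Hypothetical validation gate between Bronze and Silver.
from pyspark.sql import functions as F

records = spark.read.parquet(BRONZE)

is_valid = (
    F.col("transaction_id").isNotNull()        # required fields
    & F.col("account_id").isNotNull()
    & F.col("amount").isNotNull()
    & (F.col("amount") >= 0)                   # value constraint: non-negative amounts
    & F.to_timestamp("timestamp").isNotNull()  # timestamp must parse
)

valid = records.filter(is_valid).dropDuplicates(["transaction_id"])  # duplicate detection
invalid = records.filter(~is_valid)                                  # everything else is quarantined

valid.write.mode("overwrite").parquet(SILVER)
invalid.write.mode("overwrite").parquet(ERRORS)  # quarantine under errors/
```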
---
## Outputs
**Example S3 layout:**
```text
s3://<bucket-name>/
├── bronze/banking/
├── silver/banking/
├── gold/banking/
└── errors/banking/
```
Gold-layer datasets are structured to support:
- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
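
For example, a Gold table can be queried ad hoc with Spark SQL. This is a sketch that reuses the hypothetical `daily_summary` output and path constants from the sections above.

```python
# Hypothetical ad-hoc query against a Gold dataset with Spark SQL.
gold = spark.read.parquet(GOLD + "daily_summary/")
gold.createOrReplaceTempView("daily_summary")

spark.sql("""
    SELECT transaction_date, transaction_count, total_amount
    FROM daily_summary
    ORDER BY transaction_date DESC
    LIMIT 7
""").show()
```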
(back to top)
## Roadmap
- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation
## License
Distributed under the MIT License. See [LICENSE.txt](https://git.camcodes.dev/Cameron/Data_Lab/src/branch/main/LICENSE.txt) for more information.
## 💬 Connect With Me
[LinkedIn](https://www.linkedin.com/in/cameron-css/) · [Website](https://CamDoesData.com) · [Kaggle](https://www.kaggle.com/cameronseamons) · [Email](mailto:CameronSeamons@gmail.com) · [Google Drive](https://drive.google.com/file/d/1YaM4hDtt2-79ShBVTN06Y3BU79LvFw6J/view?usp=sharing)
(back to top)