Banking Transaction Pipeline
(Python • Spark • S3)

Banking pipeline diagram

A Python-based Spark pipeline that ingests banking transactions into S3. Bronze → Silver → Gold architecture with data quality validation.

> [!NOTE]
> This project is intended to demonstrate **analytics engineering and lakehouse design patterns**.

---

Table of Contents

- About The Project
  - Key Features
- Architecture
- Data Quality & Validation
- Outputs
- Roadmap
- License
- Contact

---

# About The Project

This project simulates a **banking transaction data pipeline** using **Python + Apache Spark** with an **S3-backed data lake**. It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a **Bronze → Silver → Gold** architecture.

## **Tech Stack:** Python, PySpark, Apache Spark, S3 storage

### Key Features

- **Batch ingestion** of banking-style transaction data into an S3-backed Bronze layer
- **Bronze → Silver → Gold** lakehouse-style architecture
- **Data validation gates** (required fields, schema enforcement, duplicates, constraints)
- **Curated datasets** designed for BI and ad-hoc analytics
- Designed with **analytics engineering principles**: reliable outputs, repeatability, clear modeling


# Architecture

The pipeline follows a lakehouse pattern where each layer has a clear responsibility.

## Bronze (Raw)

**Purpose**

- Store transactions “as received” with minimal transformation

**Why it matters**

- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source

---

## Silver (Clean & Validated)

**Purpose**

- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis

**Typical transformations**

- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., `transaction_id` collisions)

---

## Gold (Curated & Analytics-Ready)

**Purpose**

- Create business-friendly datasets and aggregations for analytics and BI

**Example outputs**

- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics

### Notes

- **Bronze** should contain raw ingested data (audit layer)
- **Silver** should contain cleaned and validated records
- **Gold** should contain curated outputs ready for analytics and BI
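To make the layer responsibilities concrete, here is a minimal sketch of the Bronze → Silver → Gold flow in plain Python. It is not the project's Spark code: the function names (`to_silver`, `to_gold_daily`), the sample rows, and the "first occurrence wins" deduplication rule are assumptions chosen for illustration. In the pipeline itself, the same steps would presumably be expressed as Spark DataFrame operations.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# Hypothetical Bronze rows: every value is a string, stored "as received".
bronze_rows = [
    {"transaction_id": "t1", "account_id": "a1", "amount": "10.50",
     "timestamp": "2024-01-01T09:00:00+00:00"},
    {"transaction_id": "t1", "account_id": "a1", "amount": "10.50",
     "timestamp": "2024-01-01T09:00:00+00:00"},  # duplicate transaction_id
    {"transaction_id": "t2", "account_id": "a2", "amount": "4.25",
     "timestamp": "2024-01-01T17:30:00+00:00"},
    {"transaction_id": "t3", "account_id": "a1", "amount": "20.00",
     "timestamp": "2024-01-02T08:15:00+00:00"},
]


def to_silver(rows):
    """Cast datatypes and keep the first row seen for each transaction_id."""
    seen = set()
    silver = []
    for row in rows:
        if row["transaction_id"] in seen:
            continue  # assumed deduplication rule: first occurrence wins
        seen.add(row["transaction_id"])
        silver.append({
            "transaction_id": row["transaction_id"],
            "account_id": row["account_id"],
            "amount": Decimal(row["amount"]),  # exact decimal arithmetic
            "timestamp": datetime.fromisoformat(row["timestamp"]),
        })
    return silver


def to_gold_daily(silver):
    """Aggregate Silver rows into daily transaction counts and totals."""
    daily = defaultdict(lambda: {"txn_count": 0, "total_amount": Decimal("0")})
    for row in silver:
        day = row["timestamp"].date().isoformat()
        daily[day]["txn_count"] += 1
        daily[day]["total_amount"] += row["amount"]
    return dict(daily)


silver_rows = to_silver(bronze_rows)
gold_daily = to_gold_daily(silver_rows)
```

With the sample rows above, Silver keeps three of the four Bronze rows, and `gold_daily["2024-01-01"]` holds two transactions totaling 14.75. Bronze is never modified, so Silver and Gold can be rebuilt from it at any time.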


---

# Data Quality & Validation

The pipeline applies checks to prevent bad data from reaching curated datasets.

**Common checks include:**

- Required fields (e.g., `transaction_id`, `account_id`, `amount`, `timestamp`)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., `transaction_id` collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under `errors/`)

These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.

---

## Outputs

**Example S3 layout:**

```text
s3:///
  bronze/banking/
  silver/banking/
  gold/banking/
  errors/banking/
```

Gold-layer datasets are structured to support:

- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
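The validation checks and the optional `errors/` quarantine path described above can be sketched as a validate-then-route step. The sketch below is plain Python under stated assumptions, not the project's Spark implementation: `validate`, `split_valid_invalid`, the error labels, and the sample rows are all hypothetical, and values are assumed to arrive as strings. Valid rows would continue to the Silver layer, while quarantined rows would be written under the `errors/banking/` prefix.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("transaction_id", "account_id", "amount", "timestamp")


def validate(row, seen_ids):
    """Return a list of rule violations for one record (empty means valid)."""
    # Required fields: missing keys and empty strings both count as missing.
    errors = [f"missing:{name}" for name in REQUIRED_FIELDS if not row.get(name)]
    if errors:
        return errors
    # Value constraint: the amount must parse and be non-negative.
    try:
        if Decimal(row["amount"]) < 0:
            errors.append("negative_amount")
    except InvalidOperation:
        errors.append("bad_amount")
    # Timestamp parsing: ISO 8601 is an assumed input format.
    try:
        datetime.fromisoformat(row["timestamp"])
    except ValueError:
        errors.append("bad_timestamp")
    # Duplicate detection against already accepted transaction IDs.
    if row["transaction_id"] in seen_ids:
        errors.append("duplicate_transaction_id")
    return errors


def split_valid_invalid(rows):
    """Route each row to the valid set or to quarantine with its error list."""
    valid, quarantined = [], []
    seen_ids = set()
    for row in rows:
        errors = validate(row, seen_ids)
        if errors:
            quarantined.append({**row, "errors": errors})
        else:
            seen_ids.add(row["transaction_id"])
            valid.append(row)
    return valid, quarantined


# Hypothetical sample input covering each rule.
rows = [
    {"transaction_id": "t1", "account_id": "a1", "amount": "25.00",
     "timestamp": "2024-01-01T09:00:00"},
    {"transaction_id": "t1", "account_id": "a1", "amount": "25.00",
     "timestamp": "2024-01-01T09:00:00"},   # duplicate
    {"transaction_id": "t2", "account_id": "a2", "amount": "-5.00",
     "timestamp": "2024-01-01T10:00:00"},   # negative amount
    {"transaction_id": "t3", "account_id": "", "amount": "7.00",
     "timestamp": "2024-01-01T11:00:00"},   # missing account_id
    {"transaction_id": "t4", "account_id": "a3", "amount": "7.00",
     "timestamp": "not-a-date"},            # unparseable timestamp
]
valid, quarantined = split_valid_invalid(rows)
```

Keeping the error list on each quarantined row means the reason for rejection travels with the record, which also makes it straightforward to derive the error and invalid-transaction metrics mentioned for the Gold layer.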


## Roadmap

- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation

## License

Distributed under the MIT License. See [LICENSE.txt](https://git.camcodes.dev/Cameron/Data_Lab/src/branch/main/LICENSE.txt) for more information.

## 💬 Connect With Me

[![LinkedIn](https://github.com/user-attachments/assets/d1d2f882-0bda-46cb-9b7c-9f01eff81da9)](https://www.linkedin.com/in/cameron-css/)
[![Portfolio](https://github.com/user-attachments/assets/eb2e9672-e765-442f-89d7-149c7e7db0a8)](https://CamDoesData.com)
[![Kaggle](https://github.com/user-attachments/assets/ef5fbcf3-067a-4bb1-b5cd-fd4e369df980)](https://www.kaggle.com/cameronseamons)
[![Email](https://github.com/user-attachments/assets/12af3cba-137e-498f-abe1-c66108e5e57a)](mailto:CameronSeamons@gmail.com)
[![Resume](https://github.com/user-attachments/assets/1ee4d4d1-22cd-42ff-b2e4-be7185269306)](https://drive.google.com/file/d/1YaM4hDtt2-79ShBVTN06Y3BU79LvFw6J/view?usp=sharing)

