# Banking Transaction Pipeline (Python • Spark • S3)
A Python-based Spark pipeline that ingests banking-style transactions into S3 and processes them through a Bronze → Silver → Gold architecture with data quality validation.
## About The Project
This project simulates a banking transaction data pipeline using Python + Apache Spark with an S3-backed data lake. It demonstrates how raw transactional data can be ingested, validated, transformed, and curated into analytics-ready datasets using a Bronze → Silver → Gold architecture.
## Key Features
- Batch ingestion of banking-style transaction data into an S3-backed Bronze layer
- Bronze → Silver → Gold lakehouse-style architecture
- Data validation gates (required fields, schema enforcement, duplicates, constraints)
- Curated datasets designed for BI and ad-hoc analytics
- Designed with analytics engineering principles: reliable outputs, repeatability, clear modeling
## Architecture
The pipeline follows a lakehouse pattern where each layer has a clear responsibility.
### Bronze (Raw)

**Purpose**
- Store transactions “as received” with minimal transformation
**Why it matters**
- Preserves an auditable source of truth
- Enables reprocessing into Silver/Gold without re-ingesting from the source
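
A minimal sketch of what a Bronze ingestion step can look like in PySpark. The bucket name, landing path, and metadata column names below are illustrative assumptions, not the exact ones used in this repo:

```python
# Illustrative Bronze ingestion: land raw files in S3 with minimal changes.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

# Read raw transaction files exactly as received (no cleaning yet).
raw_df = spark.read.option("header", "true").csv("s3a://my-bucket/landing/banking/")

# Tag each record with ingestion metadata so Bronze stays auditable.
bronze_df = (
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append to the Bronze layer without altering the original values.
bronze_df.write.mode("append").parquet("s3a://my-bucket/bronze/banking/")
```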
### Silver (Clean & Validated)

**Purpose**
- Standardize schema and datatypes
- Validate records and isolate invalid data
- Deduplicate and normalize for analysis
**Typical transformations**
- Datatype casting (timestamps, numeric amounts)
- Standardized column names and formats
- Deduplication rules (e.g., transaction_id collisions)
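
A rough sketch of these Silver transformations, assuming illustrative paths, lowercase column names, and `transaction_id` plus ingestion time as the dedup rule:

```python
# Illustrative Silver step: standardize, cast, and deduplicate Bronze data.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("silver_clean").getOrCreate()

bronze_df = spark.read.parquet("s3a://my-bucket/bronze/banking/")

# Normalize column names and cast datatypes to a consistent schema.
typed_df = (
    bronze_df.toDF(*[c.lower() for c in bronze_df.columns])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("timestamp", F.to_timestamp("timestamp"))
)

# Resolve transaction_id collisions, keeping the most recently ingested copy.
w = Window.partitionBy("transaction_id").orderBy(F.col("_ingested_at").desc())
silver_df = (
    typed_df
    .withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

silver_df.write.mode("overwrite").parquet("s3a://my-bucket/silver/banking/")
```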
### Gold (Curated & Analytics-Ready)

**Purpose**
- Create business-friendly datasets and aggregations for analytics and BI
**Example outputs**
- Daily transaction counts & totals
- Account/customer-level summaries
- Error/invalid transaction metrics
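
As one example, a daily counts-and-totals table could be derived from Silver roughly like this (paths, column names, and metric names are assumptions):

```python
# Illustrative Gold aggregation: daily transaction counts and totals.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_daily_metrics").getOrCreate()

silver_df = spark.read.parquet("s3a://my-bucket/silver/banking/")

daily_metrics = (
    silver_df
    .withColumn("txn_date", F.to_date("timestamp"))
    .groupBy("txn_date")
    .agg(
        F.count("transaction_id").alias("txn_count"),
        F.sum("amount").alias("txn_total"),
    )
)

daily_metrics.write.mode("overwrite").parquet("s3a://my-bucket/gold/banking/daily_metrics/")
```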
**Notes**
- Bronze should contain raw ingested data (audit layer)
- Silver should contain cleaned and validated records
- Gold should contain curated outputs ready for analytics and BI
For deeper implementation details, see the code in this repo.
## Data Quality & Validation
The pipeline applies checks to prevent bad data from reaching curated datasets.
Common checks include:
- Required fields (e.g., `transaction_id`, `account_id`, `amount`, `timestamp`)
- Schema enforcement (consistent datatypes between runs)
- Duplicate detection (e.g., `transaction_id` collisions)
- Value constraints (e.g., amounts must be non-negative)
- Timestamp parsing and validation
- Quarantine routing for invalid records (optional, stored under `errors/`)
These checks keep the Silver and Gold layers consistent and trustworthy for downstream analytics.
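
A simplified sketch of such a validation gate, assuming the rule set above and illustrative bucket paths (deduplication is shown in the Silver sketch earlier); records failing any check are routed to the quarantine path instead of Silver:

```python
# Illustrative validation gate: split valid records from quarantined ones.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validate_transactions").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/bronze/banking/")

required = ["transaction_id", "account_id", "amount", "timestamp"]

checks = (
    # Required fields must be present.
    F.expr(" AND ".join(f"`{c}` IS NOT NULL" for c in required))
    # Amounts must be non-negative and timestamps must parse.
    & (F.col("amount").cast("decimal(18,2)") >= 0)
    & F.to_timestamp("timestamp").isNotNull()
)
# Treat checks that cannot be evaluated (e.g. uncastable amounts) as failures.
is_valid = F.coalesce(checks, F.lit(False))

valid_df = df.filter(is_valid)
invalid_df = df.filter(~is_valid)

# Quarantine invalid records instead of dropping them silently.
invalid_df.write.mode("append").parquet("s3a://my-bucket/errors/banking/")
valid_df.write.mode("overwrite").parquet("s3a://my-bucket/silver/banking/")
```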
## Outputs
Example S3 layout:
    s3://<bucket>/
      bronze/banking/
      silver/banking/
      gold/banking/
      errors/banking/
Gold-layer datasets are structured to support:
- Business intelligence tools (Tableau / Power BI)
- Ad-hoc querying (Spark SQL / DuckDB)
- Downstream analytics and metric definitions
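
For instance, an ad-hoc Spark SQL query over a Gold table might look like this (the `daily_metrics` path and column names are assumptions carried over from the sketch above):

```python
# Illustrative ad-hoc query against a Gold dataset using Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_adhoc").getOrCreate()

spark.read.parquet("s3a://my-bucket/gold/banking/daily_metrics/") \
    .createOrReplaceTempView("daily_metrics")

# Last seven days of transaction volume.
spark.sql("""
    SELECT txn_date, txn_count, txn_total
    FROM daily_metrics
    ORDER BY txn_date DESC
    LIMIT 7
""").show()
```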
## Roadmap

- Add orchestration (Airflow / Dagster)
- Implement incremental processing and partitioning
- Add automated pipeline health checks (row counts, null rates, duplicates)
- Add unit tests for validation logic
- Add monitoring, alerting, and run logs
- Add CDC-style ingestion simulation
See the open issues for a full list of proposed features and known issues.
## License

Distributed under the MIT License. See LICENSE.txt for more information.
## Contact

Cameron Seamons (Ogden, Utah)
Email: CameronSeamons@gmail.com
LinkedIn: linkedin_username
Project Link: https://github.com/github_username/repo_name