Project

The goal of DA331 Big Data Analytics is not only to equip you with the tools but also to give you an understanding of the underlying technologies and know-how, so that you can build on them or apply them to real-world problems. The final project is intended to push you in that direction.

Getting Started

List of Probable Project Topics

  • Batch ETL Pipeline with Spark (clean → transform → store in Parquet; partitioning strategy)
  • Streaming Analytics Dashboard (Kafka-like stream → Spark Streaming → real-time KPIs)
  • Log Analytics at Scale (web/server logs: sessionization, funnels, anomaly spikes)
  • Clickstream Processing System (event ingestion + user journeys + cohort retention)
  • Social Media Text Mining Pipeline (topic/sentiment at scale; distributed NLP features)
  • IoT Sensor Data Platform (time-series ingestion, aggregation, alert rules)
  • Fraud/Anomaly Detection on Big Transactions (feature generation + scalable scoring)
  • Large-Scale Recommendation Data Prep (implicit feedback matrix building + sampling)
  • Distributed Join & Query Optimization Study (broadcast vs shuffle joins; benchmarks)
  • Data Lakehouse Build (raw/bronze → silver → gold layers; Delta/Iceberg-like design)
  • NoSQL vs SQL Benchmarking (MongoDB/Cassandra vs Postgres for read/write workloads)
  • Search Indexing Pipeline (ingest docs → build inverted index/Elasticsearch-style analysis)
  • Graph Processing Project (PageRank/community detection using GraphX or similar)
  • Cloud Cost & Performance Tuning (cluster sizing, partitioning, caching, resource configs)
  • Large-Scale Deduplication & Entity Resolution (blocking + similarity joins)
  • Air Quality / Weather Big Data Analytics (multi-source ingestion + geo/time aggregations)
  • Data Quality Monitoring System (schema checks, drift detection, alerting)
  • End-to-End BI on Big Data (warehouse tables + aggregated marts + dashboarding)
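To give a flavor of the kind of core logic these topics involve, here is a minimal single-machine sketch of the sessionization step from the Log Analytics topic. Everything here is an illustrative assumption, not a spec: events are `(user, timestamp)` tuples, and a 30-minute inactivity gap starts a new session. In a real project this same per-user logic would run distributedly, e.g. in Spark after grouping events by user key.

```python
from datetime import datetime, timedelta

# Hypothetical inactivity threshold; tune per dataset.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group (user, timestamp) events into per-user sessions.

    Returns {user: [[ts, ts, ...], ...]} where each inner list is one
    session in time order. A gap longer than SESSION_TIMEOUT between
    consecutive events starts a new session.
    """
    sessions = {}
    # Sort by user, then time, so each user's events arrive in order.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= SESSION_TIMEOUT:
            user_sessions[-1].append(ts)   # continue current session
        else:
            user_sessions.append([ts])     # gap too large: new session
    return sessions

if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 10, 0)
    events = [
        ("alice", t0),
        ("alice", t0 + timedelta(minutes=10)),  # within timeout: same session
        ("alice", t0 + timedelta(minutes=50)),  # 40-min gap: new session
        ("bob",   t0 + timedelta(minutes=5)),
    ]
    result = sessionize(events)
    print(len(result["alice"]))  # 2
```

A distributed version would replace the global sort with a `groupBy(user)` plus a sort within each group, which is where partitioning strategy starts to matter.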