Syllabus

All materials, lectures, and assignments (along with their deadlines) are provided here.

Textbook:

Various interesting and useful topics that will be touched on during the course are discussed in the following textbook.
  • Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman, Cambridge University Press, 2nd Edition, 2014. Download for Free.
  • Materials and chapters will be referred to as required.

Lectures

Event | Date | Lecture | Suggested Readings | Assignments and Deadlines
Lecture 1 -- Topics: (no slides)
  • Formal introduction
  • Course details
  • Syllabus
Lecture 2 -- Topics: (slides)
  • Why big data: 5Vs, scaling pain points; batch vs streaming; architecture patterns
Lecture 3 -- Topics: (slides)
  • Distributed systems basics: partitioning, replication, consistency, fault tolerance
Lecture 4 -- Topics: (slides)
  • HDFS + MapReduce (conceptual): data locality, jobs, combiner; classic wordcount
Lecture 5 -- Topics: (slides)
  • HDFS + MapReduce (conceptual): data locality, jobs, combiner; classic wordcount
Lecture 6 -- Topics: (slides)
  • YARN + Hadoop ecosystem overview: Hive, HBase, Sqoop, Flume; when to use each
Lecture 7 -- Topics: (slides)
  • YARN + Hadoop ecosystem overview: Hive, HBase, Sqoop, Flume; when to use each
Lecture 8 -- Topics: (slides)
  • Spark fundamentals: RDD vs DataFrame; lazy evaluation; actions vs transformations
Lecture 9 -- Topics: (slides)
  • PySpark DataFrames I: schema, reading/writing CSV/JSON/Parquet; basic ops
Lecture 10 -- Topics: (slides)
  • PySpark DataFrames I: schema, reading/writing CSV/JSON/Parquet; basic ops
Lecture 11 -- Topics: (slides)
  • PySpark DataFrames II: groupBy, joins, window functions; handling skew
Lecture 12 -- Topics: (slides)
  • PySpark DataFrames II: groupBy, joins, window functions; handling skew
Lecture 13 -- Topics: (slides)
  • Spark SQL: temporary views, SQL queries; optimization intuition
Lecture 14 -- Topics: (slides)
  • Performance tuning I: partitions, caching, broadcast joins; explain plans
Lecture 15 -- Topics: (slides)
  • Performance tuning II: shuffle, skew mitigation, salting, AQE; memory tuning basics
Lecture 16 -- Topics: (slides)
  • Data formats: Parquet/ORC, compression; partitioned tables; lakehouse idea
Lecture 17 -- Topics: (slides)
  • Data formats: Parquet/ORC, compression; partitioned tables; lakehouse idea
Lecture 18 -- Topics: (slides)
  • Special Topics: PageRank Algorithm and Implementation
Lecture 19 -- Topics: (slides)
  • Special Topics: PageRank Algorithm and Implementation
Lecture 20 -- Topics: (slides)
  • Special Topics: Streaming Algorithms
-- -- (Feb 25) Last Date for Proposal Submission.
-- -- Mid-Semester Exam Week. Best of luck.
Lecture 21 -- Topics: (slides)
  • Special Topics: Streaming Algorithms
Lecture 22 -- Topics: (slides)
  • Special Topics: Data Structures for Big Data, k-d Trees, Bloom Filters
Lecture 23 -- Topics: (slides)
  • Special Topics: Data Structures for Big Data, k-d Trees, Bloom Filters
Lecture 24 -- Topics: (slides)
  • Special Topics: Decision Making in Distributed Systems and Algorithms, Game Theory and Advertisements
Lecture 25 -- Topics: (slides)
  • Special Topics: Decision Making in Distributed Systems and Algorithms, Game Theory and Advertisements
Lecture 26 -- Topics: (slides)
  • Special Topics: Language Embedding and Applications
Lecture 27 -- Topics: (slides)
  • Special Topics: Language Embedding and Applications
Lecture 28 -- Topics: (slides)
  • Special Topics: Graph Neural Networks
Lecture 29 -- Topics: (slides)
  • Workflow orchestration: DAGs, retries, idempotency; Airflow concepts
Lecture 30 -- Topics: (slides)
  • Workflow orchestration: DAGs, retries, idempotency; Airflow concepts
Lecture 31 -- Topics: (slides)
  • Data quality: checks, expectations, anomaly detection; lineage basics
Lecture 32 -- Topics: (slides)
  • Streaming fundamentals: event time vs processing time; watermarks; exactly-once intuition
-- -- (Mar 30) Last Date for Mid Presentation.
Lecture 33 -- Topics: (slides)
  • Kafka basics: topics, partitions, consumer groups; offset management
Lecture 34 -- Topics: (slides)
  • Spark Structured Streaming: sources/sinks, windows, state; demo pipeline
Lecture 35 -- Topics: (slides)
  • NoSQL overview: key-value, document, columnar; HBase/Cassandra/Mongo tradeoffs
Lecture 36 -- Topics: (slides)
  • NoSQL overview: key-value, document, columnar; HBase/Cassandra/Mongo tradeoffs
Lecture 37 -- Topics: (slides)
  • Graph & search: graph processing intro; Elasticsearch/OpenSearch concepts
Lecture 38 -- Topics: (slides)
  • Graph & search: graph processing intro; Elasticsearch/OpenSearch concepts
Lecture 39 -- Topics: (slides)
  • Security & governance: access control, encryption, PII handling; audit trails
Lecture 40 -- Topics: (slides)
  • Security & governance: access control, encryption, PII handling; audit trails
Lecture 41 -- Topics: (slides)
  • Capstone: build a batch + streaming pipeline; tuning report
Lecture 42 -- Topics: (slides)
  • Wrap-up
-- -- (Apr 30) Last Date for Code Submission.
-- -- (Apr 30) Deadline for Final Presentation Video Submission.
-- -- (Apr 30) Deadline for Report Submission.
-- -- End-Semester Exam Week. Best of luck.
Link Added on Last Date for Submission: