Data-intensive computing systems (2019/2020)

Damiano Carra
Teaching is organised as follows:
Activity | Credits | Period | Academic staff
Theory | 5 | 1st semester | Damiano Carra
Laboratory | 1 | 1st semester | Damiano Carra

Learning outcomes

The course provides the fundamental concepts of distributed computing systems that handle very large data sets, together with the programming paradigms these systems adopt. At the end of the course, students must demonstrate that they have acquired the knowledge needed to evaluate design alternatives for the analysis of large amounts of data, weighing the benefits and limitations of each. This knowledge will allow students to: i) configure parallel data processing systems; ii) design solutions to analyze large amounts of data; iii) evaluate data analysis solutions on parallel systems, considering the system resources the analysis requires; iv) continue their studies autonomously in the development of advanced analyses of large amounts of data.


* Programming frameworks:
-- Distributed filesystems (HDFS);
-- Data and graph processing (MapReduce, Pregel);
-- SQL-like systems (Pig, Hive);
-- NoSQL systems (HBase, Cassandra).
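
As an illustration of the MapReduce paradigm listed above, the following is a minimal single-process sketch in Python, not Hadoop code. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are assumptions made for this example; a real framework would run the map and reduce functions in parallel across machines and read the input from a distributed filesystem.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, the step a framework performs
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a dict of documents."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical sample input: document id -> text
docs = {"d1": "big data big systems", "d2": "data systems"}
counts = word_count(docs)
```

Here counts maps each word to its total number of occurrences across both documents. Keeping the mapper and reducer free of shared state is what would allow a framework to distribute them.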

* Algorithms:
-- Design of algorithms for text processing;
-- Indexing algorithms (inverted indexing);
-- Graph analysis (PageRank).
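
Two of the algorithms listed above, inverted indexing and PageRank, can be sketched on a single machine as follows. This is an illustrative sketch under stated assumptions, not the course's reference implementation: the function names, the damping factor of 0.85, the fixed number of iterations, and the sample data are all choices made for this example, and the graph is assumed to list every node as a key.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank on a dict: node -> list of nodes it links to.

    Assumes every link target also appears as a key in `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes an equal share of its rank to every node it links to.
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical sample data
docs = {"d1": "big data systems", "d2": "data processing"}
index = build_inverted_index(docs)

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In a MapReduce setting, each PageRank iteration would typically become one job: the mapper emits rank shares along outgoing links and the reducer sums the shares received by each node.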

* Datacenter architectures:
-- Datacenter organization;
-- Datacenter networking;
-- Failure management.

Assessment methods and criteria

The examination consists of a project and its accompanying documentation. The project verifies comprehension of the course contents and the ability to apply them to solve a problem. The project topic is agreed with the teacher and focuses on specific case studies. The project includes a performance evaluation for different input sizes and an evaluation of the implementation alternatives. After the project documentation has been evaluated, the student may take an oral exam in which the details of the project are discussed.

Reference books
Activity | Author | Title | Publisher | Year | ISBN
Theory | Jimmy Lin, Chris Dyer | Data-Intensive Text Processing with MapReduce (1st edition) | Morgan & Claypool Publishers | 2010 | 978-1608453429
Laboratory | Tom White | Hadoop: The Definitive Guide (3rd edition) | O'Reilly & Associates Inc | 2012 | 978-1449311520