BST 262

Computing for Big Data

This course gives a critical presentation of software implementations, theoretical/algorithmic software development, and modern software tools to collect, store, and process data at scale. It includes hands-on programming practice, R package development (with C++ integration), software design and good software development practice, multiprocessing with OpenMP, cloud computing on the Harvard computing cluster, container images (Docker), and an introduction to big data stacks. A basic level of programming in R and C++ is required. The goal of the course is not limited to recipes for manipulating data; it is also to learn state-of-the-art workflows for software design and dissemination, software tool selection, and maintenance. We will see how big data influences several aspects of data science (for instance, software development and data management) and how we can leverage modern tools to work with data efficiently.