Introduction to Data Science¶

This course provides the tools and concepts needed to manage and analyse large volumes of data. Its three modules cover software engineering applied to scientific projects, mathematical statistics, and hands-on projects that apply these concepts in both academic and industrial settings.

The detailed programmes for the modules and the instructors responsible for them are listed below:

Research Software Engineering in Python
Probability and Statistics
Projects
- High Energy Physics Projects
- Complex Systems Projects

Module 1: Research Software Engineering in Python (2 ECTS)¶

Instructors:¶

Arturo Sánchez Pineda, Laboratoire d’Annecy de Physique des Particules (LAPP), France (previously ICTP, Italy and CERN, Switzerland).
Juan Carlos Basto Pineda, Universidad Industrial de Santander, Colombia.

Number of hours: 56 hours of total work.¶

16 hours of classes/tutorials, over 4 weeks.
4 hours of evaluation work, over 1 week.
36 hours of independent study.

Module description¶

The aim of this module is to learn how to construct reliable, readable, efficient research software in a collaborative environment. The module is based on Python, but the general ideas apply to any other programming language.

Topic overview¶

Introduction to software engineering for reproducible research
Python basics
Research Data in Python
Version Control
Testing your code
Building software projects
Reproducible environments
Construction and design practices
Speed and optimisation
Advanced Programming Techniques

Pre-requisites/Co-requisites¶

Basic programming knowledge. Students should have completed an introductory course in Python; we will provide a recommended online course.

Schedule¶

Check academic calendar here

Class Structure¶

Each session is a self-contained module
A set of questions and exercises is proposed at the end of each session as individual work

Assessments¶

The assessment of this module is based on a code assignment.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, present results, and briefly describe design choices and implementation. This corresponds to 20% of the total grade for the data science module.

Homework Repo¶

The repository has one folder per class, and each class folder contains:

Docs
Codes
Data
README.md

Schedule and topics¶

Introduction to open reproducible data science (2 hours)¶

Diagnostic test
What is open science?
Why is reproducibility important in science?

Version Control (2 hours)¶

Short intro to the Unix Terminal
Why use version control?
Solo use of version control
Publishing your code to GitHub
Collaborating with others through Git
Forks, Pull/Merge Requests and the GitHub Flow
Branching (Optional)
Rebasing and Merging (Optional)
Debugging with git bisect (Optional)

Introduction to Python (2 hours)¶

Why use scripting languages?
Python, IPython and the IPython notebook
List comprehensions
Functions in Python
Modules in Python
Data structures: lists, dictionaries, and sets
An introduction to classes (Optional)

Research Data in Python (2 hours)¶

Working with files on the disk
Interacting with the internet: streaming
Plotting with Matplotlib
Animations with Matplotlib
JSON and YAML (Optional)
API (Optional)

Testing your code (2 hours)¶

Example code
Why test?
Unit testing and regression testing
Negative testing
Mocking
Debugging
Continuous Integration (Optional)
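
As an illustration of the unit testing and negative testing topics listed above, here is a minimal sketch. The function `mean` and both test names are hypothetical and are not taken from the course material. The tests use plain `assert` statements, so a runner such as pytest, which collects functions named `test_*`, could execute them unchanged.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        # Fail loudly on invalid input instead of dividing by zero.
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


def test_mean_known_value():
    # Unit test: a case whose answer can be checked by hand.
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_mean_rejects_empty_input():
    # Negative test: invalid input must raise, not return a number.
    try:
        mean([])
    except ValueError:
        return
    raise AssertionError("mean([]) should raise ValueError")
```

Calling the two test functions directly, or running a test runner on the file, should complete without raising.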

Construction and Design (2 hours)¶

Coding conventions
Comments
Refactoring
Documentation
Object Orientation (classes)
Design Patterns

Software Projects and Reproducible Environments (2 hours)¶

Turning your code into a package
Releasing code
Choosing an open-source license
Software project management in GitLab / GitHub: organising issues and tasks

Programming for Speed (2 hours)¶

Optimisation
Profiling
Scaling laws
NumPy
Cython

Advanced Programming Techniques (2 hours)¶

Functional programming
Metaprogramming
Duck typing and exceptions
Operator overloading
Iterators and Generators

Assessment week (4 hours)¶

Course material¶

The Alan Turing Institute Guide for SE

Required¶

This course is largely based on the course created by J. Hetherington for PhD students at The Alan Turing Institute:

Research Software Engineering in Python

Module 2: Probability and Statistics (5 ECTS)¶

Instructors:¶

José Ocariz, Université de Paris, France.
Camila Rangel, The Alan Turing Institute, United Kingdom.

Number of hours: 127 total work hours¶

28 hours of classes/tutorials over 7 weeks.
14 hours of consulting over 7 weeks.
25 hours revising required literature.
25 hours of self-guided exercises.
35 hours of evaluation work.

Course description¶

A pedagogical selection of topics in probability and statistics is presented. The choice and emphasis of topics are driven predominantly by physics analyses using experimental data from high-energy physics detectors. The final section gives a high-level description of basic concepts in machine learning. The lectures are complemented by practical consulting sessions.

Topic overview¶

Basic concepts in probability and statistics
Parametric PDFs and parameter estimation
The maximum likelihood theorem and its applications
Statistical hypothesis testing
Basic concepts in Machine Learning

Pre-requisites/co-requisites¶

Open to, and mandatory for, all LA-CoNGA students. Students must have followed the Research Software Engineering course.

Schedule¶

Check academic calendar

Class Structure¶

Each session is a self-contained module
The lectures are complemented by practical training sessions using Jupyter notebooks

Assessments¶

The assessment of this module is based on:
- Practical training sessions (20%)
- Quizzes (20%)
- Mid-term and final code assignments (60%)

Topics & hours¶

Basic concepts in probability and statistics (4 hours)¶

Random processes
Mathematical probability
Conditional probability and Bayes' theorem
The probability density function
Multidimensional PDFs
Programming/practical exercises

Parametric PDFs and parameter estimation (4 hours)¶

Expectation values
Shape characterisation
Parameter estimation
The classical examples: mean value and variance
Covariance, correlations, propagation of uncertainties
Programming/practical exercises

A survey of selected distributions (4 hours)¶

Two examples of discrete distributions: Binomial and Poisson
Common examples of real-valued distributions: Uniform, exponential, Gaussian, \(\chi^2\), Breit-Wigner and Voigtian distributions
Programming/practical exercises

The maximum likelihood theorem and its applications (4 hours)¶

Likelihood contours
Selected topics on maximum likelihood
Samples composed of multiple species
Extended ML fits
Non-parabolic likelihoods, likelihoods with multiple maxima
Programming exercises
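
The kind of programming exercise this session points to can be sketched as follows. This is an illustrative example, not part of the course material: it assumes an exponential decay-time model f(t; tau) = exp(-t / tau) / tau, generates toy data with a hypothetical true lifetime, and compares the analytic maximum-likelihood estimate (the sample mean) with a numerical scan of the negative log-likelihood.

```python
import math
import random


def neg_log_likelihood(tau, times):
    """-ln L for the model f(t; tau) = exp(-t / tau) / tau."""
    return len(times) * math.log(tau) + sum(times) / tau


random.seed(1)
true_tau = 2.0  # hypothetical lifetime used to generate the toy sample
times = [random.expovariate(1.0 / true_tau) for _ in range(1000)]

# Analytic maximum-likelihood estimate for this model: the sample mean.
tau_hat = sum(times) / len(times)

# Numerical cross-check: scan -ln L on a grid and keep the minimum.
grid = [0.5 + 0.001 * i for i in range(4000)]
tau_scan = min(grid, key=lambda tau: neg_log_likelihood(tau, times))

# For this model the variance of the estimator is tau^2 / n.
sigma_tau = tau_hat / math.sqrt(len(times))

print(f"analytic: {tau_hat:.3f} +/- {sigma_tau:.3f}, scan: {tau_scan:.3f}")
```

The grid scan is deliberately naive. In practice a numerical minimiser would replace it, but the scan makes the shape of the likelihood easy to inspect.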

Estimating efficiencies with ML fits (4 hours)¶

ML applications: MLE parameter estimation for a regression and comparison with the least-squares estimator
Systematic uncertainties
The profile-likelihood method
Programming/practical exercises

Statistical hypothesis testing (4 hours)¶

The \(\chi^2\) test
General properties of hypothesis testing
From LEP to LHC: statistics in particle physics
The modified \({\rm CL}_s\) hypothesis test
Profiled likelihood ratios
Industry example: A/B testing
Programming/practical exercises

Basic concepts in Machine Learning (4 hours)¶

Differences between Machine Learning and Statistical Modelling
Machine Learning categories: supervised, unsupervised and reinforcement learning
Basics of supervised learning:
- Cost functions and the gradient descent algorithm
- Regression problems: linear regression
- Classification problems: logistic regression, decision trees
Under-fitting & Over-fitting
Regularisation
Hyperparameter optimisation, cross validation
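
To make the link between a cost function and gradient descent concrete, here is a minimal sketch of linear regression fitted by gradient descent on a mean-squared-error cost. The function names, learning rate, number of steps and toy data are all illustrative assumptions, not course material.

```python
def cost(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Minimise the MSE cost by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}, cost = {cost(w, b, xs, ys):.2e}")
```

The learning rate is the main tuning knob here: too large and the updates diverge, too small and convergence becomes slow.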

Course material¶

Required¶

Cowan, G. (1998). Statistical Data Analysis. Oxford University Press.
Barlow, R. J. (1993). Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences (Vol. 29). John Wiley & Sons.
James, F. (2006). Statistical Methods in Experimental Physics. World Scientific Publishing Company.
Brandt, S. (1998). Statistical and Computational Methods in Data Analysis. Springer, New York.
Lyons, L. (1986). Statistics for Nuclear and Particle Physicists. Cambridge University Press, Cambridge and New York.

Module 3: Hands-on Projects (3 ECTS)¶

The hands-on Data Science projects come in two flavours: one in High Energy Physics and the other in Complex Systems.

Number of hours: 75 hours of hands-on practical work over 6 weeks.¶

Module 3-HEP: Hands-on Projects in High Energy Physics (3 ECTS)¶

Organizers:¶

Arturo Sánchez Pineda, Laboratoire d’Annecy de Physique des Particules (LAPP), France (previously ICTP, Italy and CERN, Switzerland).
Javier Solano, Universidad Nacional de Ingeniería, Perú.

Pre-requisites/Co-requisites¶

Students should have followed the Statistics and Research Software Engineering courses.

Schedule¶

Class Structure¶

Academic projects will be developed during weeks 1-3, while non-academic projects will be developed during weeks 4-6
The frequency of meetings/discussion sessions will be decided between the students and the mentors of the corresponding project

Assessments¶

The assessment of this module is based on a presentation and a report of the project's results.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, present results, and briefly describe design choices and implementation. This corresponds to 30% of the total grade for the data science module.

Module 3-CS: Hands-on Projects in Molecular Dynamics (3 ECTS)¶

Organizer:¶

Ernesto Medina, Yachay Tech, Ecuador.

Course description¶

The aims of this course are:
- To understand the probabilistic underpinnings of the Monte Carlo approach
- To study the main algorithms for Monte Carlo and molecular dynamics, and how to extract data corresponding to relevant observables
- To develop a project simulating a system, either in statistical physics or in a more general context

Topic overview¶

Overview of problem solving with random numbers
Review of probabilistic concepts, functions of random variables, linear transformations
Limit theorems and Markov processes
Random number generation
Markov Chain Monte Carlo
Ensembles: sampling from ensembles, thermodynamic averages, fluctuations, time correlation functions and transport coefficients, inhomogeneous systems
Monte Carlo methods: importance sampling, Metropolis method, Monte Carlo at constant temperature and pressure, grand canonical Monte Carlo, estimation of free energy, simulation of phase equilibria
Molecular dynamics: equations of motion for atomic systems, finite difference methods, molecular dynamics of rigid non-spherical bodies, multiple time-step algorithms, accuracy checks, molecular dynamics in contact with reservoirs
Practical simulation methods: neighbour lists, multiple time steps, organising a simulation, self-consistency, parallel simulation, loops, replica exchange, analysis of results, liquid structure, time correlation functions, estimating errors
Mesoscale methods: Langevin and Brownian dynamics, dissipative particle dynamics, lattice Boltzmann methods, developing coarse-grained potentials

Pre-requisites/Co-requisites¶

Pre-requisites: Basic knowledge of statistical physics and programming.
Co-requisites: The mandatory courses of the complex systems branch.

Class Structure¶

3 weeks for the theoretical course on Monte Carlo and molecular dynamics techniques
3 weeks for the simulation project

Assessments¶

This module is assessed in two parts: a brief dissertation on an application of the MC and MD methods at the end of the theoretical course, and a report and presentation of the simulation project results. This corresponds to 30% of the total grade for the data science module.

Schedule and weekly learning goals¶

Week 1-4: theoretical background¶

Introduction to the Monte Carlo method, probabilistic foundation
Sampling from ensembles, thermodynamic averages, fluctuations, time correlation functions
Metropolis method, detailed balance, Monte Carlo at constant temperature and pressure
Introduction to molecular dynamics
Equations of motion for atomic systems, finite difference methods
Multiple time-step algorithms, accuracy checks
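
As a concrete picture of the Metropolis method listed above, here is a minimal sketch that samples the Boltzmann distribution of a one-dimensional harmonic potential. The potential, step size, number of steps and function names are illustrative assumptions, not course material. With a symmetric proposal, the acceptance rule min(1, exp(-beta * dE)) satisfies detailed balance.

```python
import math
import random


def metropolis(energy, n_steps, step_size=1.0, beta=1.0, x0=0.0, seed=0):
    """Sample exp(-beta * energy(x)) with symmetric uniform proposals."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)
        e_new = energy(x_new)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-beta * dE).
        if e_new <= e or rng.random() < math.exp(-beta * (e_new - e)):
            x, e = x_new, e_new
        # A rejected move repeats the current state in the chain.
        samples.append(x)
    return samples


def harmonic(x):
    """Harmonic potential U(x) = x^2 / 2 in units where kT = 1."""
    return 0.5 * x * x


samples = metropolis(harmonic, n_steps=200_000, step_size=1.5)

# At beta = 1 the target is a unit Gaussian, so <x^2> should be close to 1.
mean_x2 = sum(s * s for s in samples) / len(samples)
print(f"<x^2> = {mean_x2:.3f}")
```

Successive samples are correlated, so the statistical error on the average is larger than a naive 1/sqrt(N) estimate suggests.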

Week 5-6:¶

Students carry out a project simulating a physical system using either Monte Carlo or molecular dynamics techniques.

Course material¶

Suggested¶

Kroese, D. P., Taimre, T., & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. Wiley.
Frenkel, D., & Smit, B. (2001). Understanding Molecular Simulation: From Algorithms to Applications (Vol. 1). Elsevier.
Dill, K., & Bromberg, S. (2012). Molecular Driving Forces: Statistical Thermodynamics in Biology, Chemistry, Physics, and Nanoscience. Garland Science.
Allen, M. P., & Tildesley, D. J. (2017). Computer Simulation of Liquids. Oxford University Press.
Haile, J. M. (1992). Molecular Dynamics Simulation: Elementary Methods. John Wiley & Sons.

Teaching support¶

Juan C. Basto-Pineda, UIS, Colombia
Arturo Sánchez Pineda, CNRS, France
José Ocariz, UP, France
Camila Rangel-Smith, TI, United Kingdom
Ernesto Medina, YachayTech, Ecuador
Javier Solano, UNI, Peru
