Introduction to Data Science¶
This course provides the tools and concepts to manage and analyse large volumes of data. It has three modules and covers both software engineering applied to scientific projects and mathematical statistics. It has a particular emphasis on projects that apply the concepts, both in academic and industrial areas.
The detailed programs for the modules and their instructors are listed below:
Module 1: Research Software Engineering in Python (2 ECTS)¶
Instructors:¶
 Arturo Sánchez Pineda, Laboratoire d’Annecy de Physique des Particules (LAPP), France (previously at ICTP, Italy, and CERN, Switzerland).
 Juan Carlos Basto Pineda, Universidad Industrial de Santander, Colombia.
Number of hours: 56 hours of total work.¶
 16 hours of classes/tutorials, over 4 weeks.
 4 hours of evaluation work, over 1 week.
 36 hours of independent study.
Module description¶
The aim of this module is to learn how to construct reliable, readable, efficient research software in a collaborative environment. The module is based on Python, but the general ideas can be applied to any other programming language.
Topic overview¶
 Introduction to software engineering for reproducible research
 Python basics
 Research Data in Python
 Version Control
 Testing your code
 Building software projects
 Reproducible environments
 Construction and design practices
 Speed and optimisation
 Advanced Programming Techniques
Prerequisites/Corequisites¶
Basic programming knowledge. Students should have completed an introductory course in Python; we will recommend an online course.
Schedule¶
Check academic calendar here
Class Structure¶
 Each session is a self-contained module
 A set of questions and exercises is proposed at the end of each session as individual work
Assessments¶
The assessment of this module is based on a code assignment.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, to present results and to provide a brief description of design choices and implementation.
This corresponds to 20% of the total grade for the data science module.
Homework Repo¶
Repo with classes and inside each class:
 Docs
 Codes
 Data
 README.md
Schedule and topics¶
Introduction to open reproducible data science (2 hours)¶
 Diagnostics Test
 What is open science?
 Why is reproducibility important in science?
Version Control (2 hours)¶
 Short intro to the Unix Terminal
 Why use version control?
 Solo use of version control
 Publishing your code to GitHub
 Collaborating with others through Git
 Forks, Pull/Merge Requests and the GitHub Flow
 Branching (Optional)
 Rebasing and Merging (Optional)
 Debugging with git bisect (Optional)
Introduction to Python (2 hours)¶
 Why use scripting languages?
 Python. IPython and the IPython notebook.
 List comprehensions
 Functions in Python
 Modules in Python
 Data structures: lists, dictionaries, and sets.
 An introduction to classes (Optional)
Research Data in Python (2 hours)¶
 Working with files on the disk
 Interacting with the internet: streaming
 Plotting with Matplotlib
 Animations with Matplotlib
 JSON and YAML (Optional)
 API (Optional)
Testing your code (2 hours)¶
 Example code
 Why test?
 Unit testing and regression testing
 Negative testing
 Mocking
 Debugging
 Continuous Integration (Optional)
Construction and Design (2 hours)¶
 Coding conventions
 Comments
 Refactoring
 Documentation
 Object Orientation (classes)
 Design Patterns
Software Projects and Reproducible Environments (2 hours)¶
 Turning your code into a package
 Releasing code
 Choosing an open-source license
 Software project management in GitLab / GitHub: organising issues and tasks
Programming for Speed (2 hours)¶
 Optimisation
 Profiling
 Scaling laws
 NumPy
 Cython
Advanced Programming Techniques (2 hours)¶
 Functional programming
 Metaprogramming
 Duck typing and exceptions
 Operator overloading
 Iterators and Generators
Assessment week (4 hours)¶
Course material¶
The Alan Turing Institute Guide for SE
Required¶
This course is largely based on the course created by J. Hetherington for PhD students at The Alan Turing Institute:
Recommended reading¶
Module 2 : Probability and Statistics (5 ECTS)¶
Instructors:¶
 José Ocariz, Université de Paris, France.
 Camila Rangel, The Alan Turing Institute, United Kingdom.
Number of hours: 127 total work hours¶
 28 hours of classes/tutorials over 7 weeks.
 14 hours of consulting over 7 weeks.
 25 hours revising required literature.
 25 hours of self-guided exercises.
 35 hours of evaluation work.
Course description¶
A pedagogical selection of topics in probability and statistics is presented. The choice and emphasis of topics are driven predominantly by physics analyses using experimental data from high-energy physics detectors. The final section gives a high-level description of basic concepts in machine learning. The lectures are complemented by practical consulting sessions.
Topic overview¶
 Basic concepts in probability and statistics
 Parametric PDFs and parameter estimation
 The maximum likelihood theorem and its applications
 Statistical hypothesis testing
 Basic concepts in Machine Learning
Prerequisites/corequisites¶
Open and mandatory to all LACoNGA students. Students must have followed the Research Software Engineering course.
Schedule¶
Class Structure¶
 Each session is a self-contained module
 The lectures are complemented by practical training sessions using Jupyter notebooks
Assessments¶
The assessment of this module is based on:
 Practical training sessions (20%)
 Quizzes (20%)
 Midterm and final code assignments (60%)
Topics & hours¶
Basic concepts in probability and statistics (4 hours)¶
 Random processes
 Mathematical probability
 Conditional probability and Bayes' theorem
 The probability density function
 Multidimensional PDFs
 Programming/practical exercises
Parametric PDFs and parameter estimation (4 hours)¶
 Expectation values
 Shape characterisation
 Parameter estimation
 The classical examples: mean value and variance
 Covariance, correlations, propagation of uncertainties
 Programming/practical exercises
A survey of selected distributions (4 hours)¶
 Two examples of discrete distributions: Binomial and Poisson
 Common examples of real-valued distributions: Uniform, exponential, Gaussian, \(\chi^2\), Breit-Wigner and Voigtian distributions
 Programming/practical exercises
The maximum likelihood theorem and its applications (4 hours)¶
 Likelihood contours
 Selected topics on maximum likelihood
 Samples composed of multiple species
 Extended ML fits
 Non-parabolic likelihoods, likelihoods with multiple maxima
 Programming exercises
Estimating efficiencies with ML fits (4 hours)¶
 ML applications: MLE parameter estimation for a regression and comparison with the least-squares estimator.
 Systematic uncertainties
 The profile-likelihood method
 Programming/practical exercises
Statistical hypothesis testing (4 hours)¶
 The \(\chi^2\) test
 General properties of hypothesis testing
 From LEP to LHC: statistics in particle physics
 The modified \({\rm CL}_s\) hypothesis test
 Profiled likelihood ratios
 Industry example: A/B testing
 Programming/practical exercises
Basic concepts in Machine Learning (4 hours)¶
 Differences between Machine Learning and Statistical Modelling
 Machine Learning categories: Supervised, unsupervised and reinforcement learning.
 Basics of supervised learning:
 Cost functions and the gradient descent algorithm.
 Regression problems: Linear regression
 Classification problems: Logistic regression, decision trees.
 Underfitting & Overfitting
 Regularisation
 Hyperparameter optimisation, cross validation
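As an illustration of the "cost functions and gradient descent" topic, the following is a minimal sketch (not course material) of gradient descent applied to linear regression with a mean-squared-error cost. The function name `fit_linear_gd`, the learning rate and the number of steps are assumptions chosen for this example only.

```python
def fit_linear_gd(xs, ys, learning_rate=0.1, n_steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error.

    Hypothetical helper for illustration; xs and ys are equal-length
    sequences of floats.
    """
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        # Residuals of the current model on every data point.
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradient of the cost C = mean(residual^2) with respect to w and b.
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        # Move against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    xs = [i / 49 for i in range(50)]
    ys = [3.0 * x + 1.0 for x in xs]  # noiseless toy data
    print(fit_linear_gd(xs, ys))
```

With a learning rate that is too large the updates diverge, and with one that is too small convergence is slow; the value used here is only suitable for inputs scaled to roughly the unit interval.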
Course material¶
Required¶
 Cowan, G. (1998). Statistical data analysis. Oxford university press.
 Barlow, R. J. (1993). Statistics: a guide to the use of statistical methods in the physical sciences (Vol. 29). John Wiley & Sons.
 James, F. (2006). Statistical methods in experimental physics. World Scientific Publishing Company.
 Brandt, S. (1998) Statistical and Computational Methods in Data Analysis, Springer, New York.
 Lyons, L. (1986). Statistics for Nuclear and Particle Physicists, Cambridge University Press, Cambridge and New York.
Further reading¶
 Statistical Methods in Particle Physics WS. (Heidelberg): S. Masciocchi / N. Berger: http://www.physi.uniheidelberg.de/~nberger/teaching/ws12/statistics/statistics.php
 Introduction to Statistical Methods: 2011 CERN Summer Student Lectures (Glen Cowan): http://www.pp.rhul.ac.uk/~cowan/stat_cern.html
Module 3: Hands-on Projects (3 ECTS)¶
Data science hands-on projects in two flavors: one in High Energy Physics and the other in Complex Systems.
Number of hours: 75 hours of hands-on practical work over 6 weeks.¶
Module 3-HEP: Hands-on Projects in High Energy Physics (3 ECTS)¶
Organizers:¶
 Arturo Sánchez Pineda, Laboratoire d’Annecy de Physique des Particules (LAPP), France (previously at ICTP, Italy, and CERN, Switzerland).
 Javier Solano, Universidad Nacional de Ingeniería, Peru.
Course description¶
Two hands-on projects:
 One project based on HEP open data, spread over 3 weeks
 Another project based on a dataset from an industry or non-academic environment (3 weeks)
All students will do both exercises.
Topic overview¶
A pool of projects will be made available to the students using datasets from different fields: experimental particle physics (LHC experiments, LAGO), Kaggle challenges or datasets from our industry partners. A mentor will be assigned to each project.
Prerequisites/Corequisites¶
Students should have followed the Statistics and Research Software Engineering courses.
Schedule¶
Class Structure¶
 Academic projects will be developed during weeks 1-3, while non-academic projects will be developed during weeks 4-6
 The frequency of meetings/discussion sessions will be decided between the students and the mentors of the corresponding project
Assessments¶
The assessment of this module is based on a presentation and a report of the project's results.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, to present results and to provide a brief description of design choices and implementation.
This corresponds to 30% of the total grade for the data science module.
Pool of HEP projects¶
 ‘AI Commons’: A framework for collaboration to achieve global impact
 Dataset: Need to contact the community to get the datasets. These are datasets released in collaboration with the UN.
 Objective: A common knowledge hub to accelerate the world’s challenges with Artificial Intelligence
 Comments: it might be better to use these datasets for the AI course in the second semester. Here is the link: https://aicommons.org
 New York City Airbnb Open Data
 Dataset: Kaggle challenge https://www.kaggle.com/dgomonov/newyorkcityairbnbopendata
 Objective: This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions.
 Comments: There is also the possibility of obtaining crime statistics data for the city of London.
 Scientific computing: design and deployment of physics data analysis pipelines with in-house and cloud computing.
 Dataset: LHC experiments Open Data at 8 and 13 TeV

Objectives:
 Train the student to design data analysis in terms of the resources, the procedures, versioning and maintainability.
 Develop a culture of reproducibility and proper analysis development protocols.
 Get into the so-called "Big Data" by prototyping analysis using Open Data.
 Present the final products in a modern way, under the proper licensing and DOI identification.

Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012
 Dataset: CMS open data

Objective: This analysis uses data and simulation of events at the CMS experiment from 2012 with the goal to study decays of a Higgs boson into two tau leptons in the final state of a muon lepton and a hadronically decayed tau lepton. The analysis follows loosely the setup of the official CMS analysis published in 2014.

Sample with tracker hit information for tracking algorithm ML studies TTbar_13TeV_PU50_PixelSeeds
 Dataset: CMS open data

Objective: This dataset consists of a collection of pixel doublet seeds, i.e. the hit pairs that could belong to the same particle flying through the CMS Silicon Pixel Detector. These can be used in ML studies of particle tracking algorithms. Particle tracking is the process of clustering the recorded hits into groups of points arranged along a helix.

Study of boosted Z→ee using fat electrons
 Dataset: ATLAS data
 Objective: In the search for a heavy resonance decaying as X → WZ → lνll, the acceptance×efficiency of the analysis decreases in the electron channel for signal masses above ∼ 2 TeV. The loss occurs because the electrons are boosted, get closer together, and are removed by the usual electron isolation requirements. Different approaches to recover those events are currently under study.

Data driven backgrounds for same charge WW analysis
 Dataset: ATLAS data
 Objective: In the scattering of two W bosons of the same charge, an important background comes from other processes such as ttbar or W+jets, when one of the jets is misidentified as a lepton. The modelling in the usual simulations is not accurate enough, which motivates more complex approaches to estimate this background from measured data.

Validation of Simulations in the scattering of two same charge W bosons
 Dataset: ATLAS data
 Objective: In the last year, new simulations became available that increase the theoretical accuracy of simulations of the scattering of two W bosons with the same charge. These new simulations can only be used once they are properly validated.
Module 3-CS: Hands-on Projects in Molecular Dynamics (3 ECTS)¶
Organizer:¶
 Ernesto Medina, Yachay Tech, Ecuador.
Course description¶
The aims of this course are:
 To understand the basic concepts of numerical simulations
 To study the main algorithms for Monte Carlo and molecular dynamics, and how to extract data corresponding to relevant observables
 To develop a project for the simulation of a system, either in statistical physics or in a more general context
Topic overview¶
 Ensembles: Sampling from ensembles, thermodynamic averages, fluctuations, time correlation functions and transport coefficients, inhomogeneous systems.
 Monte Carlo Methods: Importance sampling, Metropolis method, Monte Carlo at constant temperature and pressure, Grand Canonical Monte Carlo, estimation of free energy, simulation of phase equilibria.
 Molecular Dynamics: Equations of motion for atomic systems, finite difference methods, molecular dynamics of rigid non-spherical bodies, multiple timestep algorithms, accuracy checks, molecular dynamics in contact with reservoirs.
 Practical simulation methods: Neighbour lists, multiple time steps, organizing a simulation, self consistency, parallel simulation, loops, replica exchange, analysis of results, liquid structure, time correlation functions, estimating errors.
 Meso Scale Methods: Langevin and Brownian dynamics, Dissipative particle dynamics, lattice Boltzmann methods, developing coarse grained potentials
Prerequisites/Corequisites¶
Prerequisites: Basic knowledge of statistical physics and programming. Corequisites: The mandatory courses of the complex systems branch.
Schedule¶
TBD
Class Structure¶
 3 weeks for the theoretical course on Monte Carlo and molecular dynamics techniques
 3 weeks for the simulation project
Assessments¶
The assessment of this module will be done in two parts: a brief dissertation about an application of the MC and MD methods at the end of the theoretical course, and a report and a presentation of the simulation project results. This corresponds to 30% of the total grade for the data science module.
Schedule and weekly learning goals¶
Weeks 1-3: theoretical background¶
 Introduction to the Monte Carlo method
 Sampling from ensembles, thermodynamic averages, fluctuations, time correlation functions
 Metropolis method, detailed balance, Monte Carlo at constant temperature and pressure
 Introduction to molecular dynamics
 Equations of motion for atomic systems, finite difference methods
 Multiple timestep algorithms, accuracy checks
Weeks 4-6:¶
Each student carries out a project to simulate a physical system using either Monte Carlo or molecular dynamics techniques.
Pool of projects¶
 Simulation of the 2D or 3D Ising model, observation of the phase transition and estimation of some critical exponents (Monte Carlo)
 Description: Set up an importance-sampling Monte Carlo scheme to compute the equilibrium state of an Ising model of a magnetic system, accounting for boundary conditions and finite-size scaling. Find the critical exponents by collapsing data from different system sizes.
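A minimal sketch of the kind of importance-sampling scheme this project refers to, using the Metropolis method on a 2D lattice with periodic boundary conditions. The function name, the units (J = 1, k_B = 1, no external field), the all-up starting state and the choice of observable are assumptions made for illustration; a real project would add equilibration checks, error estimates and finite-size scaling.

```python
import math
import random


def metropolis_ising(size, temperature, n_sweeps, seed=0):
    """Metropolis sampling of the 2D Ising model (J = 1, k_B = 1, no field).

    Starts from all spins up, uses periodic boundary conditions and returns
    the mean absolute magnetisation per spin over the second half of the run.
    """
    rng = random.Random(seed)
    spins = [[1] * size for _ in range(size)]
    samples = []
    for sweep in range(n_sweeps):
        # One sweep = size * size single-spin flip attempts.
        for _ in range(size * size):
            i = rng.randrange(size)
            j = rng.randrange(size)
            neighbours = (spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                          + spins[i][(j + 1) % size] + spins[i][(j - 1) % size])
            # Energy change if spin (i, j) is flipped.
            delta_e = 2 * spins[i][j] * neighbours
            # Metropolis acceptance rule.
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                spins[i][j] = -spins[i][j]
        # Discard the first half of the run as equilibration.
        if sweep >= n_sweeps // 2:
            m = sum(sum(row) for row in spins) / (size * size)
            samples.append(abs(m))
    return sum(samples) / len(samples)


if __name__ == "__main__":
    for t in (1.0, 2.27, 5.0):
        print(t, metropolis_ising(size=8, temperature=t, n_sweeps=200))
```

Running the function over a range of temperatures and lattice sizes gives the magnetisation curves whose collapse is used to extract critical exponents.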

Calculate complex integrals by Monte Carlo (used in high energy physics)
 Description: Calculating an integral amounts to computing the area under a curve. In particle physics, the integrals associated with cross sections are multiple integrals that can be hard to evaluate analytically. Monte Carlo methods can address them in a very simple way.
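To make the idea concrete, here is a short sketch of plain Monte Carlo integration over a unit hypercube, where the integral is estimated as the average of the integrand at uniformly drawn points. The function names and the toy integrand are assumptions for illustration and are not taken from any cross-section calculation.

```python
import math
import random


def mc_integrate(f, dim, n_samples, seed=0):
    """Estimate the integral of f over the unit hypercube [0, 1]^dim.

    Returns (estimate, standard_error). The domain has volume 1, so the
    integral is the mean of f evaluated at uniform random points.
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        x = [rng.random() for _ in range(dim)]
        value = f(x)
        total += value
        total_sq += value * value
    mean = total / n_samples
    variance = max(total_sq / n_samples - mean * mean, 0.0)
    return mean, math.sqrt(variance / n_samples)


def gaussian_bump(x):
    """Toy integrand (an assumption for this example): exp(-|x|^2)."""
    return math.exp(-sum(xi * xi for xi in x))


if __name__ == "__main__":
    dim = 3
    estimate, error = mc_integrate(gaussian_bump, dim, n_samples=20000)
    # This integrand factorises, so an exact value is available for comparison.
    exact = (math.sqrt(math.pi) / 2 * math.erf(1.0)) ** dim
    print(estimate, error, exact)
```

The statistical error falls as one over the square root of the number of samples regardless of the dimension, which is why the method is attractive for multiple integrals.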

Law of radioactive decay and decay chains derived from Monte Carlo
 Description: Determine the history of a radioactive atom from a sequence of random numbers. Suppose that the first (m-1) successive random numbers are higher than 0.1 and that the m-th number is lower than 0.1; this sequence of random numbers represents the history of an atom that survives for (m-1) time intervals and decays during the m-th time interval. Generalize this approach to the chain A → B → C.
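The single-species procedure described above can be sketched as follows. The function names, the number of atoms and the seed are assumptions for illustration; the threshold 0.1 is the per-interval decay probability from the description. The chain A → B → C is left as the project exercise.

```python
import random


def lifetime_in_steps(p_decay, rng):
    """Return m, the index of the time interval in which one atom decays.

    The atom survives every interval whose random number is >= p_decay and
    decays in the first interval whose random number is < p_decay.
    """
    m = 1
    while rng.random() >= p_decay:
        m += 1
    return m


def survival_curve(n_atoms, p_decay, n_steps, seed=0):
    """Fraction of atoms still undecayed after 0, 1, ..., n_steps intervals."""
    rng = random.Random(seed)
    lifetimes = [lifetime_in_steps(p_decay, rng) for _ in range(n_atoms)]
    # An atom that decays in interval m has survived t intervals if m > t.
    return [sum(m > t for m in lifetimes) / n_atoms for t in range(n_steps + 1)]


if __name__ == "__main__":
    p = 0.1
    curve = survival_curve(n_atoms=20000, p_decay=p, n_steps=20)
    for t in (0, 5, 10, 20):
        # Compare the simulated fraction with the geometric law (1 - p)^t.
        print(t, curve[t], (1 - p) ** t)
```

The simulated survival fraction follows (1 - p)^t, the discrete-time form of the exponential decay law.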

Slowing down of fast neutrons (E < 1 MeV) (Monte Carlo)
 Description: Assume a homogeneous plate of thickness h bombarded perpendicularly by a flux of neutrons of energy E. Neutrons penetrate the plate and can scatter elastically or be absorbed. In each collision, scattering is assumed to be equally probable in all directions. Compute the probabilities that a neutron traverses the plate, is absorbed, or is reflected.
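A simplified sketch of such a transport simulation is given below. It tracks only the depth and direction cosine of each neutron and assumes an exponential free-path distribution with a fixed mean free path and a fixed absorption probability per collision; these two parameters, the function name and the neglect of energy loss in collisions are assumptions for illustration, not part of the project statement.

```python
import math
import random


def neutron_slab(thickness, mean_free_path, p_absorb, n_neutrons, seed=0):
    """Fractions of neutrons transmitted, reflected and absorbed by a plate.

    Simplified model: exponential free paths, a constant absorption
    probability per collision and isotropic scattering; energy loss is
    not modelled.
    """
    rng = random.Random(seed)
    counts = {"transmitted": 0, "reflected": 0, "absorbed": 0}
    for _ in range(n_neutrons):
        x, mu = 0.0, 1.0  # depth and direction cosine; perpendicular entry
        while True:
            # Sample a free path; 1 - random() lies in (0, 1], so log is finite.
            step = -mean_free_path * math.log(1.0 - rng.random())
            x += mu * step
            if x >= thickness:
                counts["transmitted"] += 1
                break
            if x < 0.0:
                counts["reflected"] += 1
                break
            # A collision happens inside the plate.
            if rng.random() < p_absorb:
                counts["absorbed"] += 1
                break
            # Isotropic scattering: the direction cosine is uniform in [-1, 1].
            mu = rng.uniform(-1.0, 1.0)
    return {key: value / n_neutrons for key, value in counts.items()}


if __name__ == "__main__":
    print(neutron_slab(thickness=2.0, mean_free_path=1.0,
                       p_absorb=0.3, n_neutrons=20000))
```

A useful sanity check is the purely absorbing limit (`p_absorb=1`), where the transmitted fraction should approach exp(-h / mean_free_path) and no neutron is reflected.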

Fluid flow in a pipe: Poiseuille's law via Molecular Dynamics
 Description: Set up a 2D molecular dynamics simulation inside a pipe with no-slip boundary conditions at the pipe walls. After the transient behavior, determine the flow profile and fit it to Poiseuille flow. Determine the viscosity of the displaced fluid.

Hard sphere gas and irreversibility by Molecular Dynamics
 Description: The molecular dynamics of molecules interacting via hard potentials must be solved in a way that is qualitatively different from the molecular dynamics of soft bodies. It allows the equations of motion to be evolved with the highest precision. In this project it will serve to test irreversibility, that is, how a gas forgets its initial conditions.

Dynamics of simple fluids and transport coefficients by Molecular dynamics
 Description: Molecular dynamics with phenomenological potentials, using linear response to compute transport coefficients such as the diffusion constant, viscosity and thermal coefficient.
Course material¶
Required¶
 Frenkel, D., & Smit, B. (2001). Understanding molecular simulation: from algorithms to applications (Vol. 1). Elsevier.
 Dill, K., & Bromberg, S. (2012). Molecular driving forces: statistical thermodynamics in biology, chemistry, physics, and nanoscience. Garland Science.
 Allen, M. P., & Tildesley, D. J. (2017). Computer simulation of liquids. Oxford university press.
 Haile, J. M. (1992). Molecular Dynamics Simulations: Elementary Methods. John Wiley and Sons.
 Kroese, D. P., Taimre, T., & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. Wiley.
Teaching staff¶
Juan C. Basto-Pineda, UIS, Colombia
Arturo Sánchez Pineda, CNRS, France
José Ocariz, UP, France
Camila Rangel-Smith, TI, United Kingdom
Ernesto Medina, Yachay Tech, Ecuador
Javier Solano, UNI, Peru
