Introduction to Data Science¶
This course covers the concepts and tools needed for the scientific handling of large volumes of data.
It consists of three modules:
Module 1: Research Software Engineering in Python (2 ECTS)¶
Instructors:¶
- William David Romero Serrano, Universidad Industrial de Santander, Colombia.
- Fabio Martínez, Universidad Industrial de Santander, Colombia.
Number of hours: 56 hours of total work.¶
- 16 hours of classes/tutorials, over 4 weeks.
- 4 hours of evaluation work, over 1 week.
- 36 hours of independent study.
Module description¶
The aim of this module is to learn how to build reliable, readable, and efficient research software in a collaborative environment. The module is based on Python, but the general ideas apply to any other programming language.
Topic overview¶
- Introduction to software engineering for reproducible research
- Python basics
- Research Data in Python
- Version Control
- Testing your code
- Building software projects
- Reproducible environments
- Construction and design practices
- Speed and optimisation
- Advanced Programming Techniques
Pre-requisites/Co-requisites¶
Basic programming knowledge. Students should have completed an introductory course in Python; we will provide a recommended online course.
Schedule¶
Check academic calendar here
Class Structure¶
- Each session is a self-contained module
- A set of questions and exercises is proposed at the end of each session as individual work
Assessments¶
The assessment of this module is based on a code assignment.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, present results, and briefly describe the design choices and implementation.
This corresponds to 20% of the total grade for the data science module.
Homework Repo¶
The repo is organised by class; each class folder contains:
- Docs
- Codes
- Data
- README.md
Schedule and topics¶
Introduction to open reproducible data science (2 hours)¶
- Diagnostic test
- What is open science?
- Why is reproducibility important in science?
Version Control (2 hours)¶
- Short intro to the Unix Terminal
- Why use version control?
- Solo use of version control
- Publishing your code to GitHub
- Collaborating with others through Git
- Forks, Pull/Merge Requests and the GitHub Flow
- Branching (Optional)
- Rebasing and Merging (Optional)
- Debugging with git bisect (Optional)
Introduction to Python (2 hours)¶
- Why use scripting languages?
- Python. IPython and the IPython notebook.
- List comprehensions
- Functions in Python
- Modules in Python
- Data structures: lists, dictionaries, and sets.
- An introduction to classes (Optional)
Research Data in Python (2 hours)¶
- Working with files on the disk
- Interacting with the internet: streaming
- Plotting with Matplotlib
- Animations with Matplotlib
- JSON and YAML (Optional)
- API (Optional)
Testing your code (2 hours)¶
- Example code
- Why test?
- Unit testing and regression testing
- Negative testing
- Mocking
- Debugging
- Continuous Integration (Optional)
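As a flavour of the unit-testing and negative-testing topics listed above, here is a minimal sketch. It assumes the standard-library `unittest` module (the session itself may use a different framework), and the function `mean` and the class `TestMean` are hypothetical names introduced only for illustration.

```python
import math
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_known_result(self):
        # Unit test: a small input whose answer is known by hand.
        self.assertTrue(math.isclose(mean([1.0, 2.0, 3.0]), 2.0))

    def test_empty_input_is_rejected(self):
        # Negative test: invalid input must raise, not return a number.
        with self.assertRaises(ValueError):
            mean([])
```

Saved in a file, these tests can be run with `python -m unittest`.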
Construction and Design (2 hours)¶
- Coding conventions
- Comments
- Refactoring
- Documentation
- Object Orientation (classes)
- Design Patterns
Software Projects and Reproducible Environments (2 hours)¶
- Turning your code into a package
- Releasing code
- Choosing an open-source license
- Software project management in GitLab / GitHub: organising issues and tasks
Programming for Speed (2 hours)¶
- Optimisation
- Profiling
- Scaling laws
- NumPy
- Cython
Advanced Programming Techniques (2 hours)¶
- Functional programming
- Metaprogramming
- Duck typing and exceptions
- Operator overloading
- Iterators and Generators
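To illustrate the iterators-and-generators topic listed above, here is a minimal sketch of a generator that processes data lazily. The function name `running_mean` is a hypothetical example, not part of the course material.

```python
def running_mean(values):
    """Yield the mean of all values seen so far, one item at a time."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Any iterable works as input (duck typing): a list, a file, another generator.
means = list(running_mean([2.0, 4.0, 6.0]))
print(means)  # [2.0, 3.0, 4.0]
```

Because values are produced on demand, the same function can consume a stream too large to fit in memory.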
Assessment week (4 hours)¶
Module 2: Probability and Statistics for Physics (5 ECTS)¶
Instructor:¶
- José Ocariz, Université Paris Cité, France.
Course description¶
A pedagogical selection of topics in probability and statistics.
The emphasis is shaped mainly by the instructor's experience in analysing experimental data in high-energy physics.
Lectures are complemented by consultation sessions, held individually or in small groups.
Course material¶
The material covered in the lectures will be kept up to date at this link.
Students are warmly invited to report errors and typos in the document. Every suggestion for improving it is most welcome!
Table of contents¶
- Chapter I: Basic concepts in mathematical probability
- Chapter II: Random variables, probability density functions
- Chapter III: Parameter estimation, propagation of uncertainties
- Chapter IV: PDFs commonly used in physics
- Chapter V: The maximum likelihood method
- Chapter VI: Systematic uncertainties
- Chapter VII: Hypothesis testing
- Chapter VIII: Multidimensional analyses
Homework¶
- Homework I: at this link
- Homework II: at this link
Pre-requisites/Co-requisites¶
Open to and mandatory for all LA-CoNGA physics students.
Previous programming experience is strongly recommended.
Schedule¶
Assessment¶
The module assessment is a non-linear multivariate function, optimised over a test statistic to be defined, and based on the following elements:
- Analytical exercises (Homework 1)
- Numerical exercises (Homework 1 and 2)
- Interpretation exercises (Homework 3)
- Exercises from the practical sessions
Suggested reading¶
- Cowan, G. (1998). Statistical Data Analysis. Oxford University Press.
- Barlow, R. J. (1993). Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences (Vol. 29). John Wiley & Sons.
- James, F. (2006). Statistical Methods in Experimental Physics. World Scientific Publishing Company.
- Brandt, S. (1998). Statistical and Computational Methods in Data Analysis. Springer, New York.
- Lyons, L. (1986). Statistics for Nuclear and Particle Physicists. Cambridge University Press, Cambridge and New York.
Further reading¶
- Statistical Methods in Particle Physics WS (Heidelberg), S. Masciocchi / N. Berger: http://www.physi.uni-heidelberg.de/~nberger/teaching/ws12/statistics/statistics.php
- Introduction to Statistical Methods: 2011 CERN Summer Student Lectures (Glen Cowan): http://www.pp.rhul.ac.uk/~cowan/stat_cern.html
Module 3: Hands-on Projects (3 ECTS)¶
Data science hands-on projects in two flavours: one in High Energy Physics and the other in Complex Systems.
Number of hours: 75 hours of hands-on practical work over 6 weeks.¶
Module 3-HEP: Hands-on Projects in High Energy Physics (3 ECTS)¶
Organizers:¶
- Arturo Sánchez Pineda, Laboratoire d’Annecy de Physique des Particules (LAPP), France (previously ICTP, Italy, and CERN, Switzerland).
- Javier Solano, Universidad Nacional de Ingeniería, Perú.
Pre-requisites/Co-requisites¶
Students should have followed the Statistics and Research Software Engineering courses.
Schedule¶
Class Structure¶
- Academic projects will be developed during weeks 1-3, while non-academic projects will be developed during weeks 4-6
- The frequency of meetings/discussion sessions will be decided between the students and the mentors of the corresponding project
Assessments¶
The assessment of this module is based on a presentation and a report of the project's results.
Students are expected to submit a short report and their code. The purpose of the report is to answer the non-coding questions, to present results and provide a brief description of design choices and implementation.
This corresponds to 30% of the total grade for the data science module.
Course material¶
The Alan Turing Institute Guide for SE
Required¶
This course is largely based on the course created by J. Hetherington for PhD students at The Alan Turing Institute:
Recommended reading¶
Module 3-CS: Hands-on Projects in Molecular Dynamics (3 ECTS)¶
List of projects: Listing
Organizer:¶
- Ernesto Medina, Yachay Tech, Ecuador.
Course description¶
The aims of this course are:
- To understand the probabilistic underpinnings of the Monte Carlo approach
- To study the main algorithms for Monte Carlo and molecular dynamics, and how to extract data corresponding to relevant observables
- To develop a project simulating a system, either in statistical physics or in a more general context
Topic overview¶
- Overview of problem solving with random numbers
- Review of probabilistic concepts, functions of random variables, linear transformations
- Limit theorems and Markov processes
- Random number generation
- Markov Chain Monte Carlo
- Ensembles: Sampling from ensembles, thermodynamic averages, fluctuations, time correlation functions and transport coefficients, inhomogeneous systems
- Monte-Carlo Methods: Importance sampling, Metropolis method, Monte Carlo at constant temperature and pressure, Grand Canonical Monte Carlo, Estimation of Free energy, Simulation of Phase equilibria.
- Molecular Dynamics: Equations of motion for atomic systems, finite difference methods, molecular dynamics of rigid non-spherical bodies, multiple time-step algorithms, accuracy checks, molecular dynamics in contact with reservoirs.
- Practical simulation methods: Neighbour lists, multiple time steps, organizing a simulation, self consistency, parallel simulation, loops, replica exchange, analysis of results, liquid structure, time correlation functions, estimating errors.
- Meso Scale Methods: Langevin and Brownian dynamics, Dissipative particle dynamics, lattice Boltzmann methods, developing coarse grained potentials
Pre-requisites/Co-requisites¶
Pre-requisites: Basic knowledge of statistical physics and programming. Co-requisites: The mandatory courses of the complex systems branch.
Class Structure¶
- 3 weeks for the theoretical course on Monte Carlo and molecular dynamics techniques
- 3 weeks for the simulation project
Assessments¶
The assessment of this module has two parts: a brief dissertation about an application of the MC and MD methods at the end of the theoretical course, and a report and presentation of the simulation project results. This corresponds to 30% of the total grade for the data science module.
Schedule and weekly learning goals¶
Week 1-4: theoretical background¶
- Introduction to the Monte Carlo method, probabilistic foundation
- Sampling from ensembles, thermodynamics averages, fluctuations, time correlation functions
- Metropolis method, detailed balance, Monte Carlo at constant temperature and pressure
- Introduction to molecular dynamics
- Equations of motion for atomic systems, finite difference methods
- Multiple time-step algorithms, accuracy checks
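As an illustration of the Metropolis method at constant temperature listed above, the following minimal sketch samples the position of a one-dimensional harmonic oscillator from the Boltzmann distribution. The function name `metropolis_harmonic`, the energy model E(x) = k x²/2, and all parameter values are assumptions chosen for illustration, not taken from the course material.

```python
import math
import random


def metropolis_harmonic(n_steps, kT=1.0, k_spring=1.0, step=1.0, seed=0):
    """Sample x from exp(-E(x)/kT) with E(x) = 0.5 * k_spring * x**2."""
    rng = random.Random(seed)
    x = 0.0
    energy = 0.0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        # Symmetric proposal: uniform displacement in [-step, step].
        x_new = x + rng.uniform(-step, step)
        energy_new = 0.5 * k_spring * x_new ** 2
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-(E_new - E_old) / kT).
        delta = energy_new - energy
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x, energy = x_new, energy_new
            accepted += 1
        # The current state is recorded even when the move is rejected.
        samples.append(x)
    return samples, accepted / n_steps


samples, acceptance = metropolis_harmonic(50_000)
mean_x2 = sum(x * x for x in samples) / len(samples)
# Equipartition predicts <x^2> = kT / k_spring = 1 for these parameters.
print(f"acceptance = {acceptance:.2f}, <x^2> = {mean_x2:.2f}")
```

The symmetric proposal combined with this acceptance rule satisfies detailed balance, so the chain's stationary distribution is the Boltzmann distribution. Successive samples are correlated, which must be accounted for when estimating errors.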
Week 5-6:¶
Students are expected to carry out a project simulating a physical system using either Monte Carlo or molecular dynamics techniques.
Course material¶
Suggested¶
- Kroese, D. P., Taimre, T., & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. Wiley.
- Frenkel, D., & Smit, B. (2001). Understanding Molecular Simulation: From Algorithms to Applications (Vol. 1). Elsevier.
- Dill, K., & Bromberg, S. (2012). Molecular Driving Forces: Statistical Thermodynamics in Biology, Chemistry, Physics, and Nanoscience. Garland Science.
- Allen, M. P., & Tildesley, D. J. (2017). Computer Simulation of Liquids. Oxford University Press.
- Haile, J. M. (1992). Molecular Dynamics Simulation: Elementary Methods. John Wiley & Sons.
Teaching support¶
| Fabio Martínez, UIS, Colombia | José Ocariz, UP, France | Camila Rangel-Smith, TI, United Kingdom | Ernesto Medina, YachayTech, Ecuador | Javier Solano, UNI, Peru |
|---|---|---|---|---|