Dirty Data science: machine-learning on non-curated data



Cleaning data before analyzing it is a major roadblock in data science. I will discuss two specific problems in the context of machine learning: missing values, and categories with variants and typos. This talk draws on recent publications but gives simple solutions in Python.


Gaël Varoquaux

I am a research director at Inria (French National Computer Science Research Institute), studying machine learning for health, as well as a visiting professor at McGill University. I have a strong academic track record in fundamental machine learning and mental health applications (many publications in the best venues, such as NeurIPS and ICML; editor at eLife, one of the reference life-sciences journals).

I have been a contributor to the numeric Python and PyData stack since the mid-2000s, contributing to NumPy and Mayavi, and later founding scikit-learn and joblib, as well as a few other domain-specific packages.

I have been talking about Python and data processing, and teaching them, for 15 years. I helped create and curate the SciPy lecture notes, and have given many tutorials and keynotes at various Python conferences.