Effective Amazon Machine Learning - Packt Publishing - April 2017

This book focuses on the recent Amazon Machine Learning service. The service is intentionally simple to use and the book follows that philosophy. It is composed of three parts:

  • A thorough introduction to data science for the new data scientist. In the first two chapters, I present the minimal, necessary concepts of predictive modeling: classification, regression, metrics, bias and variance, as well as feature engineering tactics.

  • Starting with a brief overview of the AWS Machine Learning service in Chapter 3, I explore the service in depth in Chapters 4 to 6: loading the data using S3, building a model and assessing its predictive power. Throughout the book I use classic datasets such as the Titanic dataset, which is particularly well suited to creative feature engineering.

  • In the third part, we gear up and start using the Python SDK or the AWS CLI (command line interface) to upload data and to build and assess models. Moving to scripting and the command line allows us to implement cross-validation and recursive feature selection. I close the book by showing you how to create a full data pipeline built around AWS Lambda, Redshift and AWS Machine Learning, with Twitter as the source for a fun sentiment analysis project.

My goal in writing this book has been to go beyond the AWS Machine Learning service and to offer the reader other efficient tools and methods that can be useful in a day-to-day data science / ETL workflow, for instance by showing how to leverage SQL queries and simple Bash shell scripting to perform feature extraction and feature engineering upstream.

The book is available on Amazon, on the Packtpub website and at Safari books.

The GitHub repo associated with the book contains all the code and datasets used in the book, organised by chapter.

Articles on ODSC and KDnuggets

I had the pleasure of working with the amazing team at ODSC in 2016 and taking part in the organization of the ODSC conferences. I also wrote a few articles for them that enjoyed quite a bit of traffic.

  • Riding on Large Data with Scikit-learn: When your data consumes all the memory of your laptop but does not qualify as Big Data, the trick is to use the out-of-core mode of scikit-learn’s algorithms and stream your data to the model chunk by chunk. This batch processing introduces extra complexity and parameters that you need to be aware of.

  • Dissecting the Presidential Debates with an NLP Scalpel: Back in 2015, the presidential debates were going full speed with 13 Republican candidates and 5 Democratic candidates. In this post I explore several NLP techniques to decode the debates in terms of topics, sentiment, candidate dynamics and summarization.

  • Jupyter, Zeppelin, Beaker: The Rise of the Notebooks: With the rise of the IPython notebooks, soon to become Jupyter notebooks, it was interesting to uncover other notebook projects such as the Beaker notebook, the Apache Zeppelin notebook and of course the venerable Sage notebooks.

  • Open Source and Data Science, a perfect match

  • On KDnuggets.com: Amazon Machine Learning: Nice and Easy or Overly Simple? AWS had just launched its new Machine Learning service and I could not resist trying to find out how it worked and performed. This post was the precursor of my book on the subject.
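
The chunk-by-chunk streaming idea from the scikit-learn article listed above can be illustrated with a minimal sketch. This is not the code from the article: it is a plain-Python stand-in written for illustration, and the names `iter_chunks`, `OnlineLogisticRegression` and `synthetic_stream` are hypothetical. In scikit-learn itself, the analogous step is calling `partial_fit` on an estimator that supports incremental learning, such as `SGDClassifier`.

```python
import math
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


class OnlineLogisticRegression:
    """Tiny logistic regression trained by SGD, one chunk at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def _probability(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, chunk):
        """Update the weights from one chunk of (features, label) pairs."""
        for x, y in chunk:
            error = self._probability(x) - y
            self.weights = [w - self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias -= self.learning_rate * error
        return self

    def predict(self, x):
        return 1 if self._probability(x) >= 0.5 else 0


def synthetic_stream(n_rows):
    """Hypothetical data source: the label is 1 when the feature is positive."""
    for i in range(n_rows):
        value = ((i * 37) % 200 - 100) / 100.0  # deterministic values in [-1, 1)
        yield [value], 1 if value > 0 else 0


model = OnlineLogisticRegression(n_features=1)
for chunk in iter_chunks(synthetic_stream(2000), chunk_size=100):
    model.partial_fit(chunk)  # only one chunk is held in memory at a time
```

The chunk size and the learning rate are the kind of extra parameters this style of training introduces: the first trades memory for the number of passes through the loop, the second controls how strongly each row moves the weights.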

Behind the Scenes with MOOCs: Berklee College of Music’s Experience Developing, Running, and Evaluating Courses through Coursera

The Journal of Continuing Higher Education 77:136 · January 2013

Back in 2013, MOOCs were still a novelty. We carried out an analysis of the behavior of our online students with a focus on engagement, completion and final scoring.

You can download the article.