Studying Data Science: The Analytics Edge via edX (weeks 1 and 2)

The aim of “The Analytics Edge” course is to “learn how to use data and analytics to give an edge to your career and your life”. It covers “the following analytics methods: linear regression, logistic regression, trees, text analytics, clustering, visualization, and optimization” using real-world examples. The introductory lecture mentions several: IBM Watson, a computer able to recognize and analyze natural language that competed on Jeopardy; eHarmony, an online dating website that uses the data it collects to match members; and a couple of other applications of analytics in the health sector.

The course runs for 12 weeks, with new lecture sequences followed by a recitation and a homework assignment. Each video lecture is followed by quick questions, while the recitations use R to explore in detail the concepts discussed in each lecture. After unit 6 there is a Kaggle competition for participants, and there is a final exam at the end of the course. The due date for the final exam is July 5th, so until then each of my posts will briefly describe what the course covered in the two weeks since the previous one.

The lectures are quite engaging, as each is built around a real-life example, while the recitations are brief videos (up to 8 minutes), each followed by a series of exercises. These require you to sit down at a PC and focus. This is not a course you can simply read through, which makes sense since the end goal is to teach analytics methods for use at work and in everyday life. Practice is key. Personally, I like the fact that you delve into R programming from the beginning, with each command or method you learn building on the previous one. This also requires you to be engaged and active: I have created a Google document for note-taking, and I save each recitation and homework assignment in an R script file (instead of typing away at the command line).

Week one of the course is an introduction to analytics and R. The examples mentioned at the beginning are covered, and then data frames, vectors, loading CSV files into R, plots and basic summary statistics commands are reviewed. There are four homework assignments for practice. When I saw that the fourth one was optional (for the final grading) I was relieved. With enough time it would be fine, but the exercises are long, and once you start one it is better to work through to the end in a single session. At times it feels like there is a lot of repetition in the exercises, but they are carefully planned: each assignment builds on the previous one. For example, the first time a question asks you to do something, you are guided and given clues or even the actual command. The next question that asks for something similar expects you to have mastered it and offers no clues. This is another reason to keep notes and to save your code in script files: you can refer back to them, and you can add comments in the scripts to keep track of your progress.
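To give a rough idea of what these week-one basics look like in a script file, here is a minimal sketch. It is not taken from the course: the data is a tiny made-up table, and all the names (`countries`, `population`, `life_exp`, `csv_path`) are my own placeholders. Only base R functions are used.

```r
# Toy data, invented for illustration only (not from the course).
countries  <- c("A", "B", "C", "D")    # character vector
population <- c(5.2, 10.1, 3.3, 8.7)   # numeric vector (made-up values)
life_exp   <- c(71, 78, 65, 80)        # numeric vector (made-up values)

# Combine the vectors into a data frame
df <- data.frame(Country = countries,
                 Population = population,
                 LifeExp = life_exp)

# Write the data frame to a temporary CSV file and load it back,
# to show what loading a CSV file into R looks like
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
loaded <- read.csv(csv_path)

# Inspect the structure and basic summary statistics
str(loaded)
summary(loaded)
mean(loaded$LifeExp)

# Which row has the highest value of a variable?
loaded$Country[which.max(loaded$LifeExp)]

# Keep only the rows that satisfy a condition
big <- subset(loaded, Population > 5)

# A basic scatter plot of two variables
plot(loaded$Population, loaded$LifeExp,
     xlab = "Population", ylab = "Life expectancy")
```

Saving something like this as a script, with comments, is what makes it easy to come back to a command later when an exercise no longer gives you the hint.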

Week two introduces the story of Princeton economics professor Orley Ashenfelter, who used linear regression to predict the quality of Bordeaux wine without tasting it at all. The lecture progresses from one-variable regression to multiple-variable regression, followed by correlation and multicollinearity. Finally, it covers building predictive models with training data (data used to build the model) and then making predictions on test data (data used to test the model). The second lecture for this week reviews how analytics was used in baseball to make predictions and build a championship team. The recitation, which is next up for me, is on analytics in the NBA, and as an avid basketball fan I am really looking forward to it.
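The workflow described above (check correlations, fit a regression on training data, predict on test data) can be sketched in a few lines of R. This is only an illustration under my own assumptions: it uses the `mtcars` dataset that ships with R as a stand-in for the wine data, and the 70/30 split, the seed and the choice of predictors are arbitrary. Computing the test-set R² against the mean of the training outcome is one common convention, not necessarily the only one.

```r
# mtcars is bundled with R; it stands in for the course's wine data here.
data(mtcars)

# Correlations between the outcome (mpg) and two candidate predictors.
# A high correlation between the predictors themselves (wt and hp)
# would be a sign of multicollinearity.
cor(mtcars[, c("mpg", "wt", "hp")])

# Split the rows into training and test sets.
# The 70/30 ratio and the seed are arbitrary choices for this sketch.
set.seed(1)
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# One-variable regression, then multiple-variable regression,
# both fitted on the training data only
model1 <- lm(mpg ~ wt, data = train)
model2 <- lm(mpg ~ wt + hp, data = train)
summary(model2)

# Predictions for the test data
pred <- predict(model2, newdata = test)

# Out-of-sample R^2 = 1 - SSE/SST, where the baseline prediction
# is the mean of the outcome in the training data
sse <- sum((test$mpg - pred)^2)
sst <- sum((test$mpg - mean(train$mpg))^2)
test_r2 <- 1 - sse / sst
test_r2
```

Unlike the R² reported by `summary()`, which is computed on the training data and never decreases when a variable is added, the test-set value can go down (or even below zero) if the extra variable does not help on new data.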

I will be back in two weeks to provide another update on the progress of this course. In the meantime, let me give you a feel for the content by sharing the screenshot of a linear regression exercise using R.

[Screenshot: linear regression exercise in R]