Every week or two I have been reviewing the content of The Analytics Edge, an MIT course offered online through edX, as I progress through it. So far we have covered the following:
- Weeks 1 and 2, which gave an introduction to analytics and the statistical programming language R, followed by a real-life case study on the use of linear regression.
- Week 3, which focused on logistic regression and a famous study of heart disease.
- Weeks 4 and 5, which took us on to classification and regression trees, then natural language processing.
Now we continue with week 6 of The Analytics Edge, which brings us to the important concept of clustering. This is the method that drives many of the recommendation systems we see online. The case study discusses the Netflix Prize, a competition offering $1 million to the first team that improved Netflix's recommendation system by 10%; it closed in 2009. The size of the prize reflects how valuable these systems are to some businesses.
Clustering segments the data into groups of similar observations but does not itself predict anything; the goal is to improve prediction within each group. It is a family of algorithms for performing unsupervised learning.
Two clustering methodologies are discussed in the lectures: hierarchical and k-means. Hierarchical clustering begins by treating each data point as its own cluster, then repeatedly merges the nearest clusters until only one remains. You then have to decide how many clusters make sense for the problem at hand: whether the groups are meaningful and whether the points within each selected group have something in common. In k-means, the number of clusters is specified up front and each point is randomly assigned to one of them; the centroid (mean) of each cluster is then computed and every point is reassigned to its closest centroid, and this repeats until no further improvement can be made. Clustering is quite an interesting topic, as it requires some judgement, and there are a number of different algorithms beyond the two explored in the course.
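The k-means steps above are simple enough to sketch directly. The course itself works in R (where `kmeans()` does this for you), so the following is just an illustrative toy in Python; the function and variable names are my own, not from the lectures.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Toy k-means: random assignment, centroid update, reassignment,
    repeated until the assignment stops changing."""
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of k clusters.
    assignment = [rng.randrange(k) for _ in points]
    for _ in range(iters):
        # Step 2: compute the centroid (mean) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:  # empty cluster: reseed it at a random point
                members = [rng.choice(points)]
            centroids.append(tuple(sum(dim) / len(members)
                                   for dim in zip(*members)))
        # Step 3: reassign every point to its closest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((pi - ci) ** 2
                                  for pi, ci in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # no further improvement: stop
            break
        assignment = new_assignment
    return assignment, centroids

# Two well-separated blobs of points should land in different clusters.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(pts, k=2)
```

Note that, as the lectures point out for the real algorithm, the result depends on the random initial assignment, which is why the sketch takes a seed; in practice k-means is often run several times from different starts.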
At the end of week 5, participants in the course were asked to register for a Kaggle competition, which we are given three weeks to complete. In addition, unit 7, which covers visualizations, is already open, but its deadline falls after the Kaggle competition.
At this point I would like to note what I personally believe is the biggest drawback of this course: there are strict deadlines for each homework assignment. This makes it tough to keep up for anyone who cannot work at the same pace every week. It is easy to lose track and fall behind after a busy week at work, or for any other reason. This is what has happened to me.
Other courses use the last day of the course as the deadline for all homework, allowing you to set your own pace. With the large audiences that MOOCs draw, there will obviously be arguments for either approach to deadlines.
To be fair, the passing grade for this course’s certificate is only 55%, and the lowest homework score is dropped. I am still following the course, and at the moment I feel it will be a strong introduction to some topics I was not completely familiar with, though ones I will still need to build upon.
However, I will press on. In my next review: visualizations and the Kaggle competition!