Studying Data Science: The Analytics Edge (Weeks 4-5)

The Analytics Edge continues in week 4 with classification and regression trees (CART). This methodology repeatedly splits the data on the values of the independent variables and, within each resulting subset, predicts the most frequent outcome. The lectures compare this method with random forests, a methodology designed to improve the accuracy of CART at the cost of making the results less interpretable.
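To make the splitting idea concrete, here is a minimal sketch of a single split on one independent variable. It is not the course's code: the function names and the toy data are made up for illustration, and it picks the split by counting misclassifications, whereas real CART implementations typically use an impurity measure and keep splitting recursively.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def best_split(xs, ys):
    """Find the threshold on xs that gives the fewest misclassified ys.

    Returns (threshold, left_label, right_label, errors), or None when
    xs has fewer than two distinct values and cannot be split.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between neighbouring values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        # Each side predicts its most frequent outcome.
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[3]:
            best = (threshold, left_label, right_label, errors)
    return best


# Hypothetical toy data: patient age versus a cost label.
ages = [25, 30, 35, 50, 60, 65]
costs = ["low", "low", "low", "high", "high", "high"]
split = best_split(ages, costs)
```

On this toy data the chosen split separates the two groups cleanly. A random forest would build many such trees on resampled data and combine their votes, which is where the loss of interpretability comes from.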

The real-life example presented in the lecture is the case of D2Hawkeye, a company that was founded to perform healthcare analytics based on medical insurance claims. During the lecture, new variables are created from the data by placing it into buckets of cost or risk. This allows one to run the CART methodology on the data at hand. There is quite a lot of theory in this week’s lectures, covering cross-validation, penalty errors and complexity parameters. Instead of going into the details, I will talk about week 5’s Text Analytics, a much more interesting topic.

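Natural language processing has progressed considerably in the last few years. The examples given in the lectures are Apple’s Siri and Google Now (though after yesterday’s I/O conference Google Now might be rebranded as Google Assistant). The lectures process text using the bag of words method, which counts how many times each word appears in a document and uses those counts as independent variables.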

The first step is to clean up irregularities in the text: change all words to a single case (all lowercase or all uppercase) and normalize punctuation. We then remove unhelpful terms, such as “the”, “is”, “at”, etc. The final step is stemming, which reduces related words to a common stem; we would use argu for all of the following words: argue, argued, argues.
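The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list is a tiny hand-picked set, and the stemmer is a crude suffix stripper written for this example, not the stemming algorithm used in the course.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for this example.
STOP_WORDS = {"the", "is", "at", "and", "she", "they"}

# Suffixes tried in order by the toy stemmer (an assumption, not a real algorithm).
SUFFIXES = ("ing", "ed", "es", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, drop punctuation and stop words, stem, then count."""
    # Step 1: a single case, and keep only runs of letters (drops punctuation).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2: remove unhelpful terms.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: stem each remaining word, then count occurrences.
    return Counter(crude_stem(t) for t in tokens)


bag = bag_of_words("The lawyer argued; she argues, and they argue at the court.")
```

Here argue, argued and argues all end up under the single key argu, so the bag counts them as one term.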

The real-life example for Text Analytics is the case of IBM Watson competing in Jeopardy. I leave you with a video of that episode: