One of the advantages of today’s big-data world is the ability to perform predictive analysis very effectively because of the availability of low-cost processing power. In telecoms, there are multiple areas where predictive analysis is very useful. For example, operators are interested in predicting subscriber churn – which set of customers will leave this month or quarter. They also want to solve the inference problem, i.e., understanding the revenue impact of new product launches, such as LTE, or the implications of a new competitor entering the market. Predicting the future is a key requirement as lead times for network expansion or addition of new product lines require significant planning. Telecoms is just one sector that can benefit from predictive analytics; businesses in all sectors have understood the importance of analytics and want to have a better understanding of their data in order to enhance revenue and profitability.
Whilst the goal is to develop a good predictive model, to achieve that aim businesses must also perform descriptive and exploratory analysis. Descriptive analysis, in statistical terms, focusses on measures of central tendency such as the arithmetic mean or a sum, but these provide very limited information about the data. Typically, central tendency measurement for KPIs leads to plotting averages of variables over time, or the sum of a metric over time. This gives only half the picture, because a key component is excluded: the measure of variance in the data. I believe the importance of variance is seriously underestimated. Variance measurement provides comprehensive information about the dataset, including a check on outliers, and it provides a solid foundation for building a predictive model.
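As a minimal sketch of this point, the snippet below computes the mean alongside the variance, and uses the standard deviation to flag outliers. The KPI values are purely illustrative, not real telecom measurements:

```python
import statistics

# Hypothetical daily revenue KPI values for a cell site (illustrative data)
kpi = [120.0, 118.5, 121.2, 119.8, 122.1, 120.6, 310.0, 119.3, 121.7, 120.2]

mean = statistics.mean(kpi)
variance = statistics.variance(kpi)  # sample variance
stdev = statistics.stdev(kpi)        # sample standard deviation

# Flag values more than 2 standard deviations from the mean as outliers
outliers = [x for x in kpi if abs(x - mean) > 2 * stdev]

print(f"mean={mean:.2f}, variance={variance:.2f}, outliers={outliers}")
```

The mean alone (inflated here by a single anomalous day) says little; the variance and the outlier check immediately reveal that the distribution needs a closer look before any model is built.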
Once variance measurement is tackled and the data distribution is better understood, the next step is exploratory analysis. This analysis reveals patterns within the data as seen from different perspectives. Exploratory analysis is inherently complex, and businesses generally avoid it: the complexity of the analytics results in complex charts and KPIs, although the resulting information is very valuable. Predictive analytics is the logical successor to exploratory analysis, so it is beneficial to conduct a thorough review of the exploratory results. A strong showing during the exploratory phase can result in a much better predictive model, and it is essential to understanding the relationships between the predictors (also called independent variables, features, or simply variables) and the response (the dependent variable).
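One simple way to quantify such a predictor–response relationship is the Pearson correlation coefficient. The sketch below computes it from first principles on hypothetical traffic and revenue figures (the data and the variable names are assumptions for illustration):

```python
def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of
    # the two standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical KPIs: network traffic (GB) vs revenue (thousands)
traffic = [10, 20, 30, 40, 50]
revenue = [15, 24, 36, 42, 55]

r = pearson(traffic, revenue)
print(f"r = {r:.3f}")
```

A correlation close to 1 would suggest traffic is a strong candidate predictor of revenue; a scatterplot of the same pair would confirm whether the relationship is genuinely linear.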
All statistical models are subject to reducible and irreducible errors; hence all predictive models have limitations. A very flexible model with high accuracy on one dataset might not perform as well on a different dataset, whereas a less flexible model like linear regression might not be very accurate but will perform consistently across different datasets. What is important is to minimize reducible error through the application of sound statistical techniques. Selecting variables, or predictors, plays a key role: narrowing the predictors down to an optimal number strikes a good balance between the interpretability and the flexibility of the model. The standard practice is to build the model on a training dataset and then to test it on validation and test datasets with similar characteristics. The model is considered good if its results are consistent across the training, validation and test datasets.
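The train/test workflow above can be sketched with the deliberately inflexible model the paragraph mentions, ordinary least-squares linear regression, on synthetic data (the data-generating process here is an assumption chosen for illustration):

```python
import random

# Synthetic relationship: y = 3x + 5 plus Gaussian noise
random.seed(42)
data = [(x, 3 * x + 5 + random.gauss(0, 2)) for x in range(100)]
random.shuffle(data)
train, test = data[:70], data[70:]  # 70/30 train/test split

def fit(points):
    # Least-squares estimates of intercept and slope
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(points, intercept, slope):
    # Mean squared prediction error
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in points) / len(points)

intercept, slope = fit(train)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(f"train MSE={mse(train, intercept, slope):.2f}, "
      f"test MSE={mse(test, intercept, slope):.2f}")
```

Because the model is simple, the training and test errors stay close to each other; that consistency across datasets is exactly the property the paragraph describes.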
Overfitting is a major problem in predictive analytics, especially when selecting flexible models; however, the exploratory analysis phase helps to overcome it. There are scenarios where the interpretability of the model is not as important as the prediction itself. In these cases, exploratory analysis also plays an important role in highlighting the risks associated with the predictions.
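Overfitting is easy to demonstrate with a maximally flexible model. A 1-nearest-neighbour regressor simply memorises the training data, so its training error is zero while its test error is not (synthetic data again; this is a sketch, not a telecom dataset):

```python
import random

# Synthetic relationship: y = 2x plus Gaussian noise
random.seed(1)
xs = [random.uniform(0, 10) for _ in range(60)]
noisy = [(x, 2 * x + random.gauss(0, 3)) for x in xs]
train, test = noisy[:40], noisy[40:]

def knn1(train_pts, x):
    # Predict using the single nearest training point (pure memorisation)
    return min(train_pts, key=lambda p: abs(p[0] - x))[1]

def mse(pts):
    return sum((y - knn1(train, x)) ** 2 for x, y in pts) / len(pts)

train_mse, test_mse = mse(train), mse(test)
print(f"train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
# Zero training error but clearly higher test error: the model has
# memorised the noise rather than learned the underlying trend.
```

Plotting the predictions during the exploratory phase would expose this jagged, noise-chasing fit long before the model reached production.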
If performed properly, descriptive and exploratory analyses add significant value to the development of the eventual predictive model. We can think of the predictive model as the tip of an iceberg, with many hidden components underneath from which a lot of value can be derived: scatterplots, Pareto charts, histograms, principal component analysis, box plots, and clustering, among others. Used effectively, these components provide richer information to the business. Data scientists should take care to educate the business on the importance of these intermediary analyses, even though the main objective of the project is to predict future outcomes.
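Even the humblest of these hidden components can be produced with a few lines of code. As a sketch, the snippet below renders a text histogram of some illustrative KPI values (the numbers are assumptions, not real measurements):

```python
from collections import Counter

# Illustrative KPI values, bucketed into bins of width 10
values = [12, 15, 18, 22, 25, 27, 29, 31, 34, 41, 23, 26]
bins = Counter(int(v // 10) * 10 for v in values)

for b in sorted(bins):
    print(f"{b:>3}-{b + 9:<3} {'#' * bins[b]}")
```

A distribution shape that is obvious in a histogram, such as skew or a second mode, is exactly the kind of intermediary insight worth surfacing to the business before the predictive model is presented.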