Continuing the DAND track on Udacity, I am about to submit my completed Project 2: Investigating a Dataset. The classes preparing you for this project focus on the data analysis process, which consists of the steps below.
Wrangling the data and posing questions usually happen at the same time, or as a back-and-forth process. Data wrangling entails acquiring and cleaning the data. As a first step, the data to be examined has to be obtained from the source and loaded into Python, in this case as a list. We then clean the data by, for example, converting fields to the correct data types: you might only have strings to begin with, which would not allow calculations on numeric values. As is usual in Udacity classes, the interactive sections between lectures ask you to perform calculations and/or write code yourself. For this set of lectures, a dataset of student engagement figures for Udacity is used.
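A minimal sketch of that wrangling step: load CSV rows into a list and convert the string fields into usable types. The file contents and column names here are made up for illustration, not the actual Udacity engagement data.

```python
import csv
import io
from datetime import datetime

# Stand-in for reading a real file; columns and values are illustrative.
raw = io.StringIO(
    "account_key,join_date,total_minutes_visited\n"
    "448,2014-11-10,494.9\n"
    "700,2015-01-05,37.3\n"
)

# Load the data as a list of dicts -- every value is still a string here.
rows = list(csv.DictReader(raw))

def parse_row(row):
    """Convert each field to an appropriate Python type."""
    return {
        "account_key": row["account_key"],
        "join_date": datetime.strptime(row["join_date"], "%Y-%m-%d"),
        "total_minutes_visited": float(row["total_minutes_visited"]),
    }

clean = [parse_row(r) for r in rows]

# Numeric operations now work, which they would not on the raw strings.
total = sum(r["total_minutes_visited"] for r in clean)
```

With strings converted to floats and dates, sums, averages, and date arithmetic become possible, which is exactly what the interactive exercises ask for.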
The next step in the process is to pose questions. What can I find out from this data? What would be interesting to a specific audience? In the case of student engagement for Udacity, one might ask: what is the average time spent on classes per day? How many classes does a student take before completing the project?
We then investigate, or explore, the data. The main purpose here is to get a feel for it, beginning with checks for inconsistencies such as duplicate records or inconsistent field names. At any point we can go back and refine our questions. Basic descriptive statistics, correlations, and observations from the tabular data or graphical visualizations then lead us to the conclusions part of the process. One caveat noted in the course: you have to check, using statistics, whether the results could have occurred by chance.
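The exploration step can be sketched with pandas: flag duplicates, compute descriptive statistics, and look at correlations. The table below is a small made-up engagement sample; the column names are illustrative.

```python
import pandas as pd

# Made-up engagement data; note the intentional duplicate row.
df = pd.DataFrame({
    "account_key": ["448", "448", "700", "821"],
    "minutes_per_day": [49.5, 49.5, 3.7, 120.2],
    "lessons_completed": [2, 2, 0, 6],
})

# Check for exact duplicate rows, then drop them.
dupes = df.duplicated().sum()
df = df.drop_duplicates()

# Basic descriptive statistics (count, mean, std, quartiles) per column.
summary = df.describe()

# Pairwise correlation between the numeric columns.
corr = df[["minutes_per_day", "lessons_completed"]].corr()
```

`describe()` and `corr()` give a quick quantitative feel for the data before any visualization, and often suggest which questions are worth refining.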
Finally, the results are communicated. Interesting findings should be reported to our audience or displayed visually.
The next two modules preparing for Project 2 cover NumPy and pandas. The first discusses one-dimensional data: arrays in the case of NumPy, and Series in the case of pandas. The final module, on two-dimensional data, introduces DataFrames, which are close to database tables if you have experience there. Functions can be run on a whole DataFrame; for example, taking the mean of a DataFrame returns the mean of each numeric column, displayed in tabular form.
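The one-dimensional versus two-dimensional distinction can be shown in a few lines; the values here are made up for illustration.

```python
import numpy as np
import pandas as pd

# One-dimensional data: a NumPy array, and a pandas Series built from it.
minutes = np.array([30.0, 45.0, 60.0])
s = pd.Series(minutes, index=["mon", "tue", "wed"])
daily_mean = minutes.mean()       # 45.0
tuesday = s["tue"]                # a Series adds label-based indexing

# Two-dimensional data: a DataFrame, similar to a database table.
df = pd.DataFrame({
    "minutes": [30.0, 45.0, 60.0],
    "lessons": [1, 2, 3],
})

# mean() on a DataFrame returns a Series: one mean per numeric column.
column_means = df.mean()
```

This is the behavior described above: the same `mean()` call works on the whole DataFrame and comes back as one value per column rather than a single number.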
The material for this project is also available for free and I would recommend it for two reasons:
- It provides a good foundation for data analysis and the practical use of Python in data analysis (NumPy and pandas).
- It serves as a method of getting your feet wet and determining whether this sort of material is for you.
If you have a background in programming, very little effort is required to complete this course. You can also simply watch the lectures without completing the exercises, both to get an idea of what data analysis is about and to get a feel for Udacity and its offerings.