The third module of the Udacity Data Analyst Nanodegree is entitled Data Extraction and Wrangling and deals with auditing “the validity, accuracy, completeness, consistency, and uniformity of a dataset.” Using Python libraries, we can read in data, even large data sets, iteratively (reading small chunks for processing rather than loading the whole set into memory) and then audit it against these criteria.
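To make the idea of iterative reading concrete, here is a minimal sketch using `iterparse` from Python's standard `xml.etree.ElementTree` module. The helper name `count_tags` and the tiny in-memory sample are my own assumptions for illustration, not material from the course.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element tags while streaming through an XML source.

    `source` can be a filename or a file-like object. Elements are
    handled one at a time instead of building the whole tree first.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and text once counted
    return counts


# Hypothetical miniature document standing in for a large file.
sample = io.BytesIO(
    b"<osm><node id='1'/><node id='2'/><way id='3'><nd ref='1'/></way></osm>"
)
tag_counts = count_tags(sample)
```

The same function can be pointed at a file path on disk, which is where reading in chunks starts to matter.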
Data comes from various sources and in various formats: XML, CSV, Excel (read with the xlrd library) and JSON are discussed. The module also introduces scraping data from a website and formatting correct HTTP GET requests.
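As a small illustration of formatting a GET request, the sketch below builds a query string with `urlencode` from the standard library. The base URL, the parameter names and the helper `build_get_url` are hypothetical, chosen only for the example.

```python
from urllib.parse import urlencode


def build_get_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameters, for illustration only.
url = build_get_url(
    "https://example.com/api/search",
    {"query": "data wrangling", "limit": 10},
)
```

Letting the library do the encoding avoids hand-escaping spaces and special characters in the parameter values.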
Once we have obtained the data we need to check:
- validity: ensuring data follows constraints we have set, e.g. duration cannot be negative.
- accuracy: the degree to which entries conform to a gold standard, that is, a set of data that we trust.
- completeness: do we have all the records we should have? It is stated that:
“While explaining it is pretty straightforward, actually measuring completeness is a very difficult thing to do”.
- consistency: checking that the data across systems is the same in all fields, e.g. MSC and charging systems.
- uniformity: to use another telecoms analogy, is everything in seconds or minutes?
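Two of these checks are easy to sketch in code. The example below is a rough illustration of a uniformity step (converting every duration to seconds) followed by a validity step (rejecting negative durations). The record layout and the function names are assumptions of mine, not the course's code.

```python
def to_seconds(value, unit):
    """Uniformity: express every duration in seconds."""
    factors = {"seconds": 1, "minutes": 60}
    if unit not in factors:
        raise ValueError(f"unknown unit: {unit}")
    return value * factors[unit]


def is_valid_duration(seconds):
    """Validity: a duration cannot be negative."""
    return seconds >= 0


# Hypothetical call records with mixed units.
records = [
    {"duration": 90, "unit": "seconds"},
    {"duration": 2, "unit": "minutes"},
    {"duration": -5, "unit": "seconds"},
]

normalized = [to_seconds(r["duration"], r["unit"]) for r in records]
invalid = [s for s in normalized if not is_valid_duration(s)]
```

Accuracy, completeness and consistency need a second data source to compare against, so they do not reduce to a single-record check like this.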
The next step in this module of the course is deciding whether to use MongoDB, a NoSQL (i.e. non-relational) database, or a standard relational database queried with SQL. After auditing, the data is formatted for insertion into the chosen database. I chose SQL because I have previous experience with it. I did this in the interest of time, but one could instead pick the option one does NOT have experience with in order to learn something new.
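For the relational route, loading cleaned records can look roughly like the sketch below. It assumes SQLite through Python's built-in `sqlite3` module, and the `nodes` table with its columns and sample rows is a made-up schema for illustration, not the one the project prescribes.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")

# Hypothetical audited rows, already shaped to match the table.
rows = [(1, 51.5, -0.12), (2, 48.85, 2.35)]
conn.executemany("INSERT INTO nodes (id, lat, lon) VALUES (?, ?, ?)", rows)
conn.commit()

# A first exploratory query once the data is loaded.
node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
conn.close()
```

Using `?` placeholders rather than string formatting lets the driver handle quoting of the inserted values.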
The project directs you to OpenStreetMap, where you select an area of the world, download the data in OSM format (which is XML-based), audit it, and then use SQL to explore it. I am in the final stages of completing the project. Initially I chose a very large area and my input file was more than a gigabyte, so every step of the project, even the choice of data, matters. You can see a sample of a completed project provided to students here.