DAND: Udacity Data Analyst Nanodegree Project 3

The third module of the Udacity data analyst nanodegree is entitled Data Extraction and Wrangling and deals with auditing "the validity, accuracy, completeness, consistency, and uniformity of a dataset." Using Python libraries we can read in data, even iterating over large datasets in small chunks rather than loading the whole set into memory at once, and then audit it for the qualities above.
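As a minimal sketch of that chunked approach, Python's built-in xml.etree.ElementTree.iterparse hands you one element at a time instead of the whole document (the file name data.osm below is just a placeholder):

```python
import xml.etree.ElementTree as ET

def count_tags(filename):
    """Tally element tags without reading the whole file into memory."""
    counts = {}
    # iterparse yields elements one by one as their closing tags are seen
    for _, elem in ET.iterparse(filename):
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elem.clear()  # free the element's memory once we are done with it
    return counts

print(count_tags("data.osm"))  # hypothetical input file
```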

Data comes from various sources and in various formats: XML, CSV, Excel (read via the xlrd library) and JSON are discussed. In addition, scraping data from a website is introduced, as well as formatting correct HTTP GET requests.
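As a quick illustration of such a GET request (the URL and parameters below are hypothetical placeholders, and requests is a popular third-party library rather than anything mandated by the course):

```python
import requests

# Parameters are encoded into the query string of the GET request
response = requests.get(
    "https://example.com/api/data",
    params={"format": "json", "page": 1},
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP error codes
records = response.json()    # parse a JSON response body
```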

Once we have obtained the data we need to check:

  • validity: ensuring data follows constraints we have set, e.g. a duration cannot be negative (a small sketch of such checks follows this list).
  • accuracy: the degree to which entries conform to a gold standard, that is, a set of data that we trust.
  • completeness: do we have all the records we should have? It is stated that:
    "While explaining it is pretty straightforward, actually measuring completeness is a very difficult thing to do".
  • consistency: checking that the data across systems is the same in all fields, e.g. between the MSC and the charging system.
  • uniformity: to use another telecoms analogy, is everything in seconds or minutes?
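To make the first and last of these concrete, here is a small sketch of such checks in Python; the field names and the sample record are made up for illustration:

```python
def audit_record(record):
    """Return a list of audit problems found in a single record."""
    problems = []
    # validity: a duration cannot be negative
    if record["duration_seconds"] < 0:
        problems.append("negative duration")
    # uniformity: durations must be stored in seconds, not minutes
    if record.get("duration_unit", "seconds") != "seconds":
        problems.append("duration not in seconds")
    return problems

print(audit_record({"duration_seconds": -5, "duration_unit": "minutes"}))
# ['negative duration', 'duration not in seconds']
```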

The next step in this module of the course is deciding whether to use MongoDB, a NoSQL (i.e. non-relational) database, or a standard relational database and therefore standard SQL. After auditing, the data is formatted for insertion into one of the above databases. In my case I chose standard SQL because I have previous experience with it. This was done in the interest of time, but one could instead select the flavour of database they have NO experience with, in order to gain experience in something new.
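For instance, audited rows can be loaded into a relational database with Python's built-in sqlite3 module; the table and column names below are hypothetical, not the project's actual schema:

```python
import sqlite3

# Hypothetical audited rows: (type, id, latitude, longitude)
rows = [("node", 123, 35.17, 33.36), ("way", 456, None, None)]

conn = sqlite3.connect("project.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS elements "
    "(type TEXT, id INTEGER, lat REAL, lon REAL)"
)
conn.executemany("INSERT INTO elements VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()
```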

The project directs you to OpenStreetMap, where you select an area of the world, download the data in OSM format (which is XML-based), audit it, and then use SQL to explore the data. I am in the final stages of completing the project. Initially I chose a very large area and my input file was more than a gigabyte, so every step of the project, even the choice of data, matters. You can see a sample of a completed project provided to students here.
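One common way to cut an oversized OSM extract down to a workable size is to keep only every k-th top-level element. The sketch below is not the exact sampling script provided by the course; the file names and the sampling rate k are assumptions:

```python
import xml.etree.ElementTree as ET

def get_element(osm_file, tags=("node", "way", "relation")):
    """Yield each top-level element of interest, clearing memory as we go."""
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag in tags:
            yield elem
            elem.clear()

k = 10  # keep one element in every ten
with open("sample.osm", "wb") as out:
    out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
    for i, element in enumerate(get_element("large_area.osm")):
        if i % k == 0:
            out.write(ET.tostring(element))
    out.write(b"</osm>\n")
```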

Michael Lazarou

Michael Lazarou manages revenue assurance and fraud at Epic, a Cypriot telco, having joined their RA function in March 2011. His background includes a double major in Computer Science and Economics, as well as an MBA. Before being lured into the exciting world of telecoms he worked as a software developer.

Michael is interested in gaining a better understanding of different aspects of RA and data analysis. He shares his insights on the training courses he participates in with Commsrisk.
