Quote of the Day
I wish to do something great and wonderful, but I must start by doing the little things like they were great and wonderful.
— Albert Einstein
I recently decided to take some classes in data analysis at Datacamp, an online training site. My first classes were in dplyr and tidyr – two excellent R-based tools for manipulating files that are not amenable to analysis because of inconsistencies and structure: tidyr provides many tools for cleaning up messy data, dplyr provides many tools for restructuring data. After completing the two classes, I decided that I needed to firm up my knowledge of these tools by applying them to a concrete example.
The gnarliest data that I know of is hosted on the Hyperwar web site. The data seems to be scanned and OCRed WW2 records that were translated into HTML. While a fantastic resource for history aficionados, these records are painful to analyze using software because of:
- Numerous OCR errors
The original source is often difficult to figure out.
- Inconsistent expression of zeros
I see blanks, zeros, periods, dashes.
- Mixing data and totals in seemingly random ways
- Arithmetic errors (e.g. incorrect totals)
I often think of some poor secretary/yeoman typing numbers all day and then having to use an adding machine to complete a total. It would have been miserable work.
- Typing errors
For example, a cargo ship being listed as 70,006 tons during WW2 was most likely 7,006 tons. At that time, cargo ship were typically in the 10,000 ton range. This error actually occurred. I was able to confirm the error because the Wikipedia has a remarkable amount of data on WW2 ships, and it listed 7,006 tons for this ship.
- Unbelievably complex headings (literally three and four levels)
This is not a complaint about the Hyperwar site itself. This is just a statement about the format and quality of WW2 records that were written by humans typing numbers at keyboards. For this post, I decided to download data on the cost of WW2 Army Air Force combat aircraft and clean it up for graphing. All of the heavy lifting is done in RStudio – only the Appendix is done in Excel because I like how it formats tables.
I have read quite a bit about air operations during WW2, and I often have read that the P-47 and P-38 were very expensive fighters compared to the P-51. One reason given for the P-51 replacing the P-47 and P-38 in many applications was its cheaper unit cost. I decided that I would research the cost of these aircraft as a motivating application for practicing with tidyr and dplyr.
- tidy data
- Tidy data is a term coined by Hadley Wickham (creator of tidyr and dplyr) to describe data that is formatted with variables in columns and observations in rows with no blank rows or columns. Missing data is clearly marked in a consistent matter, as are zeros. If you are a database person, you will see hints of his concepts in Codd's Rules for relational databases.
- untidy data
- Untidy data is data that is not tidy – I hate definitions like this, but it works here. There are some common problems:
- Values of variables used for column headings (e.g. column headings containing years – years should be a column with specific year values stored for each row).
- Data that should be in multiple tables combined into one table. (e.g. describing a person's pets means multiple table rows for people with more than one pet – the pet details should be in a separate table.)
- Aggregate data (e.g. totals) in the rows.
- Blank rows or columns.
- Single observation scattered across multiple rows.
All costs are expressed in WW2 dollars. The point of this exercise is to provide a worked Rmarkdown-based example of cleaning an untidy data file. Converting WW2 dollars to modern dollars is fraught with issues, which I leave to others.
The Rmarkdown file is a bit long, so I include a PDF of it here as a link. For the complete source, see this zip file. The zip file contains the raw Hyperwar tab-separated file, a cleaned-up .csv, Rmarkdown source, and a PDF of the processed Rmarkdown document.
I am not interested in plotting all the data. There are just a few WW2 aircraft in which I am interested. I plotted the costs of my favorite WW2 fighters and bombers versus time. Note that some of the costs dropped dramatically over time. This is true for any product (example).
Fighter Plane Costs Versus Time
Figure 2 show how the unit costs changes with time for my favorite WW2 fighter planes. Note that the P-38 and P-47 are quite expensive relative to the P-51. The most expensive US fighter in WW2 was the P-61 night fighter (Figure 1).
Bomber Plane Costs Versus Time
Figure 3 show how the costs changes with time for my favorite WW2 bombers. Notice how the B-29 was ~2.6x more expensive that the B-17. However, only the B-29 had the range needed to bomb Japan from the Marianas Islands.
Cleaned-Up CSV File
The Rmarkdown file outputs a cleaned-up version of the data as a .csv file. I include that file in with my source. You can see a tabular version of the data in Appendix A.
I work in corporate America, where Excel is the standard for processing data. While Excel has some strengths, it is not the most powerful data analysis tool available. RStudio provides a wonderful data analysis and presentation environment. This post provides a fully worked example of how to use RStudio with tidyr and dplyr to tidy-up some very messy data.
Appendix A: Hyperwar Aircraft Cost Table Reconstruction.
Figure 4 shows how my tidy data version of the cost data can easily be converted back into the original form.