Quote of the Day
In defeat unbeatable, in victory unbearable.
You may have noticed that many of my recent posts are focused on data processing and analysis. These recent posts reflect the fact that I am preparing for a career change as I near "retirement," and I plan on working in the data analysis arena. This means that I have been in serious Python, R, and statistics training. While I love working with Python and R, I keep finding myself drawn back to Excel and Power Query (aka Get and Transform) for quick, ad hoc analysis work. While this blog will look at the frequency over time of the name Mark, you can use the tool to generate the same worksheet for any name.
In this post, I am going to use Social Security Administration (SSA) data on baby names, which dates back to 1880 – I assume the original data came from the Census Bureau because the SSA did not exist in 1880. The data is in the form of text files, one file for each year, that contain the number of babies given each name during that year. The SSA did not list names that occurred less than five times in a year. As an example of the information that can be extracted, Figure 1 shows the ten most popular names (male and female) in 2016.
I was inspired to look for this data because I am interested in the historical popularity of the name Mark (my name), which I became curious about because virtually every Mark I meet has an age from 50 to 60 years. Given that observation, it seems that I should see Mark being a very popular name during the late 1950s through the 1960s. I will show that the data supports this observation. I will present the results graphically. I also plotted data for other names in my family, but decided not to post that data for privacy reasons.
This is a really good data set to use for a Power Query example because:
- The data consists of many text files in the same format that need to be combined.
- Each text file name includes the year of the data, which I must extract and put into a column within the data.
- There are well over a million lines of data, which exceeds the number of rows allowed in Excel. Power Query and Power Pivot allow these large data sets to be processed within the Excel environment. You can present the data using pivot tables, which allow you to aggregate the information down to a reasonable size.
- I was able to use of an m-code routine to rank the names. I was inspired to use this approach by Chris Webb's excellent blog post on the topic.
- I was able to demonstrate how to make the workbook relocatable by using sending Power Query the file address as a parameter.
Figure 2 show that the name Mark was quite popular (6th or 7th ranked) from 1957 to 1968. So my anecdotal evidence is supported by the data.
My workbook and the associated data is located here – the zip file is quite large (22.5 MB). You can generate Figure 2 for any name by inputting the name and gender (M/F), then pressing the update chart button. You can use it to check the popularity of any name over time. As an example, Figure 3 shows the popularity of the name Peter over time.
The workbook contains a single line macro that updates the query.