Power Query Regular Expression Hack

Quote of the Day

May I give you a word of advice? Next time you invade Italy, do not start at the bottom.

General Fridolin von Senger und Etterlin, WW2 German general in Italy during WW2. This was his post-war advice to a British historian. The Allied Italian Campaign was a disaster in many ways – General von Senger und Etterlin's critique of the Allied effort was probably correct.


Introduction

Figure 1: Regular Expressions Used in My Example.

Figure 1: Regular Expressions Used in My Example. Drawn using Regexpr.com.

Because many corporations are focused on the Microsoft Office suite of tools, businesses often require that I use Excel/Power Query for my analysis work so that they can work with the tools I develop after I am gone. Fortunately, I really enjoy using Power Query, but I find it irritating that it does not support regular expressions. I must admit that Power Query's standard functions can do a good job of extracting strings, but the process is a bit tedious.  I have a large library of regular expressions for extracting email addresses, phone numbers, social security numbers, and the like that would be efficient for me to use if Power Query could run them.

Fortunately, I saw a blog post by Imke Feldmann that showed a hack that allows the use of a regular expression in Power Query. The hack involves writing a query that contains a small HTML page that calls a Javascript regex routine. This may not be the world's most efficient piece of code, but much of my work involves cleaning up relatively small dimension tables where convenience is more important than efficiency.

My M-code example is included in this workbook and is shown below. I used a US Census table that I copied from the Wikipedia for an application example. I use this table all the time for grouping US state data by region. While it is a small table that could have been cleaned up with only Power Query's editor (i.e., no coding), using regular expressions resulted in a tad shorter query.

Reg Ex Query

The Power Query query that contains the regex is shown below.

Wikipedia Description of US Census Regions and Divisions

This is the text I copied from the Wikipedia.

Region 1: Northeast
Division 1: New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic (New Jersey, New York, and Pennsylvania)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)
Region 3: South
Division 5: South Atlantic (Delaware, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, District of Columbia, and West Virginia)
Division 6: East South Central (Alabama, Kentucky, Mississippi, and Tennessee)
Division 7: West South Central (Arkansas, Louisiana, Oklahoma, and Texas)
Region 4: West
Division 8: Mountain (Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, and Wyoming)
Division 9: Pacific (Alaska, California, Hawaii, Oregon, and Washington)

Tidied Table of US State Regions and Divisions

This is a tidy version of the Wikipedia data.

Cleaned Up Table of US States By Region and Division

RegionDivisionState
MidwestEast North CentralIllinois
MidwestEast North CentralIndiana
MidwestEast North CentralMichigan
MidwestEast North CentralOhio
MidwestEast North CentralWisconsin
MidwestWest North CentralIowa
MidwestWest North CentralKansas
MidwestWest North CentralMinnesota
MidwestWest North CentralMissouri
MidwestWest North CentralNebraska
MidwestWest North CentralNorth Dakota
MidwestWest North CentralSouth Dakota
NortheastNew EnglandConnecticut
NortheastNew EnglandMaine
NortheastNew EnglandMassachusetts
NortheastNew EnglandNew Hampshire
NortheastNew EnglandRhode Island
NortheastNew EnglandVermont
NortheastMid-AtlanticNew Jersey
NortheastMid-AtlanticNew York
NortheastMid-AtlanticPennsylvania
SouthSouth AtlanticDelaware
SouthSouth AtlanticFlorida
SouthSouth AtlanticGeorgia
SouthSouth AtlanticMaryland
SouthSouth AtlanticNorth Carolina
SouthSouth AtlanticSouth Carolina
SouthSouth AtlanticVirginia
SouthSouth AtlanticWest Virginia
SouthEast South CentralAlabama
SouthEast South CentralKentucky
SouthEast South CentralMississippi
SouthEast South CentralTennessee
SouthWest South CentralArkansas
SouthWest South CentralLouisiana
SouthWest South CentralOklahoma
SouthWest South CentralTexas
WestMountainArizona
WestMountainColorado
WestMountainIdaho
WestMountainMontana
WestMountainNevada
WestMountainNew Mexico
WestMountainUtah
WestMountainWyoming
WestPacificAlaska
WestPacificCalifornia
WestPacificHawaii
WestPacificOregon
WestPacificWashington

 
This entry was posted in Excel. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *