Text Match and Fuzzy Lookup

January 28th, 2015 by datapig 4 comments »

I would bet that every one of us has seen Customer lists from separate sources that can't be matched up because of spelling mistakes and inconsistent naming conventions and abbreviations. In fact, there are consulting firms whose entire business revolves around helping organizations reconcile their "Master Data". I can't count how many times I've seen entire projects launched to create a "gold source for master data". Yet here we are; manually matching up addresses, customer names, or people names from different data sources.

.

Well today, I'd like to introduce you to a free tool from Microsoft that can help ease your pain a little. This tool is a free Excel Add-in called Fuzzy Lookup. The Fuzzy Lookup Add-in enables you to perform similarity analysis to match textual strings that mean the same thing but are not spelled exactly the same. With this tool, you can match things like customer addresses from two separate data sources, or even find imperfect duplicates in a single dataset.

.

Understanding the Jaccard Index of Similarity

The magic behind the Fuzzy Lookup Add-in comes from the Jaccard Index of Similarity.

The Jaccard index, also known as the Jaccard similarity coefficient, was developed by Paul Jaccard as a way to document the distribution of different types of flora (yawn). Jaccard's index gave him a statistical way to measure similarities between sample sets. The gist of Jaccard's index is this:

.

You take two sample sets. You count the attributes shared by both samples (call this Z). You count the attributes observed in only one sample (call this X). You then count the attributes observed in only the other sample (call this Y). Finally, you can calculate the similarity index by using the three counts in this operation: Z / (Z+X+Y).

.

Confused? Don't worry. I have an example.

Let's say we want to compare the similarities between these two Excel nerds.

We define some attributes, then we tag each nerd with a 1 when each attribute evaluates to TRUE.

.

We can now get the count of the intersections and then perform the math.

As you can see, the similarity index is .57 or 57%.

The more similar two sample sets are, the closer Jaccard's index will be to 1.

 

In its simplest form, the same kind of analysis can be done on Textual Strings using the Jaccard Index. You can see in this example, the words hair and hare result in an index of .60.

By the way, in terms of textual matches, 60% is not a great score. In my experience, most textual matches below 70% index is suspect.

 

Understanding the Fuzzy Lookup Transformations

Ok. Now you know that the Fuzzy Lookup algorithm is based on the Jaccard Index of Similarity. But it's important to note that Fuzzy Lookup does not run on your raw data. Before running the Jaccard Index, Fuzzy Lookup applies some built-in transformation algorithms designed to improve its matching results.

 

The first thing to understand is that Fuzzy Lookup converts data using something called a Tokenizer. For example, the record {"DataPig", "Bacon Boulevard"} might be tokenized into the set, {" DataPig", "Bacon", "Boulevard"}. Each word is considered to be a token. Tokens are assigned weights based on frequency of use. High weights are applied to tokens that are uncommon. Low weights are applied to tokens that come up frequently. For example, frequent words such as "Boulevard" are given a lower weight than less frequent words such as "DataPig". Fuzzy Lookup gives you the opportunity to override the default token weights by supplying your own weights via an Excel Table.

.

Fuzzy Lookup also has algorithms and built-in dictionaries that allow for the automatic correction of spelling mistakes, word merge scenarios, string split scenarios, and string prefix corrections. All of these automatic transformation can be turned on and off by via the Fuzzy Lookup Configure dialog box. You can add custom Transformation rules by pointing the TransformationRowsetName property to your own Excel Table of rules.

Note: The sample workbook installed with the Fuzzy Lookup Add-in includes a Customization tab that shows you how to set up your own custom transformation table.

.

Using the Fuzzy Lookup Add-In

To use the Fuzzy Lookup Add-in, you'll need to first convert your datasets into Excel Tables (click anywhere inside your data and press Ctrl+T).

In this example, we have two Excel Tables (IndProfile and CustInfo). We want to match up the customer names.

.

Click the Fuzzy Lookup command button to launch the tool.

.

In the Fuzzy Lookup dialog box, do the following:

1. Select the two tables you want matched.

2. Select the columns that hold the data you want matched.

3. Click the join button to drop your two columns down to the Match Columns section.

4. Select all the columns you want included in your output. Be sure to include the FuzzyLookupSimilarity column.

5. Define how many matches you want per record. I recommend setting this to 1.

6. Define the Similarity Threshold. This is the Jaccard Index number. As mentioned before, in my experience, any match with an index of less than .70 is typically not that accurate. I usually set this to .60 just to see the results.

.

Once you have defined your selections, be sure to place your cursor in a new blank worksheet. Be warned, Fuzzy Lookup WILL OVERWRITE DATA.

Press the Go button to output the results.

You will see a new table that includes the Output Columns you selected (in step 4 above). The Similarity column will show you the Jaccard index for each match. Any row with no matches will get an index of 0.

.

It's important to review any matches with a similarity index less than 1.

Don't take for granted that Fuzzy Lookup will get everything right.

For example, you can see in Row 7 below, Fuzzy Lookup gave us a .86 index score on the match for NVISION Inc. and WORLD VISION INC. An index score of .86 is typically a good match. However, we can see that in this case, the two names are not the same.

.

If you want to perform a Fuzzy search for duplicates within the same table, simply reference only that table when setting up the search parameters. For instance, in this example we are looking for Fuzzy duplicates in the Org Name column of the IndProfile table.

.

I guess that's about it.

Fuzzy Lookup is one of those tools you really need to experiment with to get a feel for how helpful it can be.

Although it's not perfect, it has saved me hours of scripting and manual matching.

So before you burn out your eyeballs staring at two data sets again, take some time to play with Fuzzy Lookup. It may become your new best friend.

Business Intelligence vs Data Science

January 26th, 2015 by datapig No comments »

Hello all! Belated Happy New Year! I've been busy working on a new book and a few other projects. It's time to get back to blogging.

.

I recently saw this article explaining the difference between BI Analysts and Data Scientists.

On the surface, this article highlights the operational differences between BI Analysts and Data Scientists. But at the core, it's a view into how the BI industry has changed over the last few years. Here's a graphic from that article.

.

As I look at this graphic, I personally recognize many of the tasks outlined here.

I can remember a time when I felt the needs of my customers change from simple trending dashboards to more predictive analytics. In the last few years, many Excel analysts have been asked to do more Data Science'y kind of stuff. Stuff like:

  • Analyze the correlation between key demographic data and transactional metrics such as number of purchases, purchase amounts, and frequency of visits
  • Score customers based on customer satisfaction levels, likelihood to recommend, and capacity utilization
  • Analyze association rules; for example, if Customer A buys Product A, we can predict (with 90% confidence level) that there is an 85% likelihood that this customer will buy Product B.
  • Predict outcomes based on certain independent variables found through regression analysis; for example, a machine's likelihood to break down given a number of interrelated variables.

.

In the past, business intelligence primarily consisted of dashboards showing historic data on an agreed upon set of key performance metrics. Organizations today are demanding more dynamic predictive analysis, the ability to iteratively perform data discovery, and the freedom to take the hard left and right turns on data presentation. These managers often turn to Excel analysts to provide the needed analytics and visualization tools.

.

Over the new few months, I'd like to focus my blog posts here on some of the more advanced Data Science'y kind of stuff that we'll all be asked to do.

.

This leads me to the Excel Power BI Boot Camp I'm hosting with Bill Jelen (Mr. Excel) in Dallas from May 20th – May 22nd.

.

This 3-day event is aimed squarely at Excel analysts who find it increasingly necessary to:

  • Analyze large amounts of data and report those results in a meaningful way
  • Get better visibility into data with new Data Mining tools
  • Perform advanced Clustering and Market Basket analytics
  • Add interactive controls to your dashboards without VBA
  • Automate repetitive data cleansing and transformation tasks
  • Create eye-catching visualizations and Dashboards
  • Create map-based dashboards
  • Access external data sources to expand your message

.

Bill and I only have 25 seats left in the class. If you sign up before March 1st, you'll get a $100 discount off registration.

.

I'll be back tomorrow with an exciting new series of posts on the SQL Server Data Mining Add-in.

See you then!

 

My Take on Linkedin

December 23rd, 2014 by datapig 6 comments »

The Case Against Maths

December 19th, 2014 by datapig 10 comments »

I didn't want to finish out the last full work week of the year without a blog entry.

So let me bring up something that has been bothering me for quite some time.

Take a look at these two statements, then point to the one that looks right to you.

.

If you're American, you probably chose the statement on the right. If you're English, I bet you chose the statement on the left. In researching articles for some of my statistics posts, I often encounter variations on how the abbreviated word for mathematics is used. American authors use the word math as in – "Do the Math". The English use the word maths as in – "Do the Maths".

.

As a proud overweight American, I find it odd to say "Do the Maths".

What's bothers me the most is the seemingly arbitrary plural designation given to the abbreviation for mathematics.

.

Here is the way I see it. Mathematics is not a pluralized word for Mathematic. You don't have 1 Mathematic and 5 Mathematics. You simply have Mathematics. In fact, the word Mathematics (the overarching name for the study of all things mathematical) is used solely in singular verb forms.

You would say "Mathematics is useful in business".

You would not say "Mathematics are useful in business".

.

Even the English agree.

They say "Maths is fun".

No one says "Maths are fun"

.

That's right, even though there is an S in Maths, it's not treated as a plural.

It's obvious that the English use the abbreviated term "maths" (with an S at the end) to account for the s in Mathematics. Ok, I see that. So the question now is why is there an S at the end of Mathematics in the first place? After all, we don't say Literatures or Musics.

.

Well, according to Wikipedia, The most probable reason is that the origin of the word Mathematics is the Greek word mathematika. In Greek, when a word ends in the letter A, that typically denotes that the word is plural. So when mathematika was being translated to English, the letter S was added to the end.

,

In short, the term Maths is a singular word that looks plural.

And that, my friends, gives me the heebee-jeebees.

.

Ahhh…it feels good to get that off my chest…finally.

Well, have a great weekend – and remember:

Office Update Breaks ActiveX Controls

December 11th, 2014 by datapig 7 comments »

I checked my email today and saw a bevy of emails from friends and clients claiming that their workbooks broke over-night. When one person contacts me saying something is broken, my general attitude is "they're on crack". But when I get a flood of emails, that's a horse of another color.

.

Apparently, a recent Office Update introduced security "fixes" that break any workbook or code involving ActiveX Controls. I have to admit, I didn't even notice it. I guess my workbooks are sorely lacking in ActiveX Controls.

.

The ever amazing Jan Karel Pieterse shows us how to fix the issue at Daily Dose of Excel.

.

His fix addresses the set of .exd files that prevent the addition or use of ActiveX Controls. These .exd files are added to your machine through the update.

The answer is to delete these .exd files (which are located under C:\users\[your name]\App Data\local\temp).

Here are the ones I found on my machine. I deleted the nasty buggers and my Excel is back to normal.

.

I bet it's a great day to be on the Microsoft Help Desk today.

.

Thanks Jan Karel for the fix!

 

 

Changing the Look of Embedded Objects

December 4th, 2014 by datapig 5 comments »

Most of us know that you can embed objects like Word Files, Outlook Messages, and even other Excel Files. Here's an example of a Word object I embedded in my spreadsheet. My customers can double-click on this object to open the Word document I've embedded. This is cool, but it sure does come out ugly. I'd like to make it more attractive.

.

.

Here are some steps I can take to make a better looking Embedded Object.

  1. Click Insert->Object to open the Object dialog box.
  2. In the Object dialog box, go the Create from File tab and click the Browse button to find the file I want embedded (I selected a document on the mating habits of koalas).
  3. Click the Display as Icon check box.


    .

  4. Click the Change icon button.
  5. In the Change Icon dialog box, select the icon that is just a blank space, and then clear the caption input so that no caption will be shown.

    .
    Don't see a blank icon option?

    You can click the Browse button on the Change Icon dialog box and choose any strange file type that doesn't have associated icons. My go-to file is the bootstat.dat file under the Windows directory (c:\windows\bootstat.dat). Again, you can pick any strange file type.
    .


    .
    .

  6. After pressing OK, I see an empty square. I can click on that square and go to the Format tab. On the Format tab, I click Shape Fill->Picture.


    .
    .

  7. I can select a picture I want to use instead of that lame Word icon. I've got a cool picture of a Koala. I choose that and press the Insert button.

    .

.

And voila! I've got a nice looking embedded object. When I double-click the object, my document on the mating habits of koalas opens.

.

Of course, you don't have to use pictures of over-sexed koalas.

You can use this technique to display your own application icons, info-graphics, or any other image that fits the theme of your report or dashboard. These embedded objects are great for distributing different kinds of files with your Excel workbook. You can use these to include things like help files, back-up documentation, change logs, etc.

.

New Comments Filter on Bacon Bits

December 3rd, 2014 by datapig 3 comments »

Over the last few days, I've been fighting an issue with the comments on this blog.

Apparently, the spam robots have been posting over 25 comments per second on my blog, causing time outs and overloads. I'll never understand why anyone would think this practice of shot gunning comments would be an effective way of driving revenue.

.

In any case, I had to do some back-end cleanup of my WordPress SQL database and install a new Captcha mechanism to hopefully curb the spam bots.

.

I typically don't like Captcha tools, but I found one that fits the irreverent style of this blog

If and when you enter a comment, you'll need to solve a small puzzle.

Simply look at the picture on the right and click the image on the left that best corresponds with the picture.

.

Here's to the end of my WordPress woes.

Happy commenting.

Happy Thanksgiving 2014

November 25th, 2014 by datapig 5 comments »

No posts this week, but here's a gift for you. I made a Thanksgiving GIF you can share with the kids. I used a bunch of Excel shapes to make a cozy Thanksgiving movie. Who says Excel doesn't appeal to kids?

.
thanksgiving
.

I'll be back next week. Enjoy your time off!

The Best Reviews on Amazon

November 21st, 2014 by datapig 1 comment »

Over at Amazon, you'll find the Samsung 105 inch 4K Ultra-HD Curved TV.

The price tag on this substantial purchase is $120,000.

.

If you're apprehensive about taking the leap on this purchase, rest assured.

It's getting some of the best reviews I've seen.

.

.

.

Check out all the reviews here.

.

.

 

 

 

 

Prevent Worksheet Delete without Workbook Protection

November 20th, 2014 by datapig 8 comments »

My friend Tim sent me a message yesterday asking:

"I would like to allow people to insert and re-order worksheets but not delete any. Is that possible?"

Searching all the forums, I found that this question comes up with some frequency. There seems to be a common need to prevent the deletion of a worksheet, but still give users the ability to change the structure of the workbook. Excel does have a Workbook Protection feature, but that feature doesn't give you many options when you protect a workbook. That is to say, you can't specify that you want to prevent one action, but not other actions. So you're stuck with either protecting the workbook structure or not.

.

The answer will have to come from VBA.

In my search, I discovered that Excel 2013 added a new Worksheet Event called BeforeDelete.


.

Wow. I hadn't noticed this event before.

Of course, the Microsoft help files are as useful as ever (sarcastic tone).

.

Judging by the utter lack of examples on how other folks are using this, I'm assuming this new addition to the Worksheet events has gone largely unnoticed by many of us.

.

Apparently, this event triggers when you attempt to delete any worksheet in your workbook. Unfortunately, this event does not come with a Cancel method. Meaning that when a user deletes a worksheet, this event triggers, but then Excel goes ahead and deletes the worksheet anyway. It seems that the purpose of this event is to do some action before the worksheet is delete – not give the developer an opportunity to cancel the delete. So with this event, you can do things like: log the time the worksheet was deleted, save the workbook before deleting the worksheet, send an email before worksheet is deleted, or anything else you can think of.

.

I decided to solve Tim's problem with this small bit of code in the Worksheet_BeforeDelete event. This code simply renames the worksheet, then creates a copy with the original name. So before the worksheet is deleted, you essentially create a copy of it. No matter how many times the user tries to delete the worksheet, it will always be there.

Private Sub Worksheet_BeforeDelete()

Dim MyName As String

'Capture the original worksheet name
MyName = ThisWorkbook.ActiveSheet.Name

'Rename the worksheet
ThisWorkbook.ActiveSheet.Name = Left(MyName, 30) + "#"

'Create a copy of the worksheet
ThisWorkbook.ActiveSheet.Copy _
After:=Sheets(ThisWorkbook.ActiveSheet.Index)

'Name the copy to the original name
ThisWorkbook.ActiveSheet.Name = MyName

End Sub

.

If you have Excel 2013, you can right-click on the worksheet you want protected, then select the View Code option. Then simply paste this code into the VBE. Note that you'll have to do this for every worksheet you don't want deleted.

.

For those of you with Excel 2010 or prior versions, you're basically out of luck. You don't have the BeforeDelete event available to you. In those version, you'd have to employ messy tricks to hide the Delete Commands from the interface. These require VBA and even some RibbonX customization that I'd rather not get into.

Update: Jan Karel Pieterse, observing that my technique will zap any formula pointing to the deleted worksheet, has provided an excellent solution that is far superior to the one I proposed here.

In a normal module, paste this code:

Sub UnprotectBook()
ThisWorkbook.Unprotect
End Sub

Then for every worksheet, right-click, select View Code, and then paste this:

Private Sub Worksheet_Deactivate()
ThisWorkbook.Protect , True
Application.OnTime Now, "UnprotectBook"
End Sub

And this works in any version of Excel! Thanks Jan Karel!

 

Feel free to comment and tell us if and how you're using the BeforeDelete event. Like I said, I've just now noticed it. I'd like to know how anyone else is using it.