Data Cleanliness Is Next to Usefulness

As people sort through the Boston Marathon bombing and the activities of the two brothers, a reasonable question is how authorities could not have known of the pair’s travels and inclinations. Now it seems that they did know, after all.

Homeland Security Secretary Janet Napolitano said that her agency knew that alleged Boston bomber Tamerlan Tsarnaev traveled to Russia last year, even though a misspelling of his name threw off the FBI, according to an AP report.

Napolitano said that even though Tsarnaev’s name was misspelled, redundancies in the system allowed his departure to be captured by U.S. authorities in January 2012. But she said that by the time he came back six months later, an FBI alert on him had expired and so his re-entry was not noted.

In other words, a vital clue was lost because of a typo. And while the stakes are nowhere near as high, businesses should take note, because similar problems can plague their decision-making processes. It shouldn’t take an M.S. in Business Intelligence to know it.

In any sort of BI or big data operation, making mistakes is ridiculously easy. One of the biggest problems comes from “dirty” data — information that has not been cleaned up and brought to a standard format. The possible mistakes are legion:

  • Anything in a record — names, addresses, order numbers — can be misspelled.
  • Abbreviations can become a bugaboo. Is Main Road the same as Main Rd.? How about Main St. in the same town?
  • Records may hold old information that never received an update. For example, even though the National Change of Address register notes when people notify the U.S. Postal Service of having moved, your own records will not reflect the move until you obtain and apply those updates.
  • Redundant data, like contact information in both CRM and order systems, becomes a bear if an update in one place does not automatically percolate through to all others.
  • Database fields may be laid out with built-in assumptions, such as that all addresses and phone numbers are in the U.S., even though some customers are overseas.
  • Duplicates may seem like two different people, particularly if a person moved but the old contact information wasn’t purged.
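
To make a few of these problems concrete, here is a minimal Python sketch of how a cleanup pass might standardize street abbreviations and flag likely duplicates despite a misspelled name. Everything in it is an assumption made for illustration: the record layout (dictionaries with "name" and "address" keys), the tiny abbreviation table, the function names, and the 0.85 similarity threshold. Real data quality tools use far larger reference tables and more careful matching.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far longer.
STREET_ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}


def normalize_address(address: str) -> str:
    """Lowercase, strip punctuation, and expand known street abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_candidate_duplicates(records, threshold=0.85):
    """Return (i, j, same_address) for every pair of records with similar names.

    The threshold is an assumed value, not a recommended setting.
    same_address is True when the two addresses match after normalization.
    """
    candidates = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if name_similarity(records[i]["name"], records[j]["name"]) >= threshold:
                same_address = (
                    normalize_address(records[i]["address"])
                    == normalize_address(records[j]["address"])
                )
                candidates.append((i, j, same_address))
    return candidates


# Made-up sample records for illustration only.
records = [
    {"name": "Jon Smith", "address": "12 Main Rd."},
    {"name": "John Smith", "address": "12 Main Road"},
    {"name": "Mary Jones", "address": "12 Main St."},
]

for i, j, same_address in find_candidate_duplicates(records):
    print(records[i]["name"], "~", records[j]["name"], "| same address:", same_address)
```

In this sketch, Main Rd. and Main Road normalize to the same string while Main St. stays distinct, and the misspelled name is flagged for review rather than merged automatically. A pair flagged with a different address may be a person who moved, so the decision to merge is best left to a person or a stricter rule.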

These might seem like trivial problems best suited to intern grunt work, but don’t fool yourself. Bad data quality can disrupt orders, throw off analyses, and waste time and resources.

Any attempt at BI, data mining, data warehousing, or big data must start with data quality. Professionals learn techniques to recognize and address problems. Then ongoing data quality processes can help eliminate issues going forward.