Modern data verification and cleaning options in DTM Data Scrubber

DTM Data Scrubber is a tool that designed to make your data better. It offers two modes: data verification and data clearing. In the first mode the tool finds problems in data only. In the second it fixes the found problem or performs some problem depended action.

DTM Data Scrubber: well known data menu

The modern versions of the scrubber support at least dozen of data verification methods also known as 'checks'. Most popular methods are: 'match regular expression', 'compare with results of SQL statement execution', 'in list' and 'contains substring'.

This article discloses new verification methods: 'well known data' and 'by value library'. Well known data method contains seven predefined patterns for popular data formats: URL, e-mail address, phone number, etc. This method allows the user to save time for mentioned format verification or clearing. It is not required to design regular expression manually anymore, just select data format from drop down list.

The second of mentioned methods is value library. The value library is a set of predefined lists for popular data. Currently, the library includes following sets:

DTM Data Scrubber: value library item selection
  • Country name ('United Kingdom' as example)
  • Country code (2 symbols like 'UK')
  • Country code (3 symbols like 'GBR')
  • Currency name (like 'Pound Sterling')
  • Currency code (3 symbols like 'GBP')
  • First Name
  • Month (like 'March')
  • Month code (like 'Jan')
  • US States ('California')
  • US State code (like 'CA')

The program can compare value to be verified as is or in case insensitive mode. The value library based verification rule is an easy way to find mistypes in data like 'Ilinois' instead of 'Illinois'.

Please keep in mind that the scrubber offers good alternative if you have your own value list with reference data. It is comparison with result set of the SQL query execution. Fro example, if you have table 'Customers' with 'Name' column you can use 'select Name from Customers' query as a source for this check. The verification tool finds any row with wrong spelled customer name, i.e. name that not in existing list.

Another set of news deals with 'actions'. The action is an operation that must be executed for each data item marked as wrong. There are eleven actions but we'll describe only new: trim spaces, execute custom SQL statement and write to execution log.

The trim action helps to remove extra spaces at begin or end of value. For example, 'Mask ' will be converted to 'Mask'. The user can select left, right or both sides trim option.

DTM Data Scrubber: data cleaning actions menu

For the complex cases there is too difficult to fix data problem automatically. The program offers 'write log' action to save information about found problem to log for manual review. By the way, we recommend using DTM Data Editor for manual data correction. It provides easy and visual way to view or modify your data.

Let's look at the 'custom SQL statement' action. It is most powerful way to correct your data because you can call built-in database functions or procedures. To make this action really useful the program provides two macros that can be used as a part of SQL statement: %VALUE% and %WHERE%.

The 'VALUE' macro corresponds to original data item value. 'WHERE' can be user to identify data row with wrong data item. For example, your 'Discount' values in 'Order Details' table must be less 0.1; otherwise the value must be divided by 5. You should:

  1. Create data clearing rule.
  2. Specify 'compare with value' as condition for 'Discount' column.
  3. Specify 0.1 as check value.
  4. Use 'Custom SQL statement' as action.
  5. Specify 'update [Order Details] set Discount=%VALUE%/5 %WHERE%' as a statement.