Data profiling with SQL Server 2008

One of the many improvements brought about SQL Server 2008 at the ETL with Integration Services is their ability to perform data profiling with its new Data Profile Task.

The data profiling is one of the first tasks typically addressed in Data Quality processes, and involves an initial analysis of the source data, usually on tables, with the goal of beginning to know their structure, format and level of quality. Inquiries are made at the table level, column, relationships between columns, and even relationships between tables.

The SSIS Data Profile Task works by selecting a table in a SQLServer 2000 database or above (no use on other databases) the profiling options you want to perform on the data in the table, and an XML file for saving the result. It's really simple.

You can select up to 8 types of profiling, 5 for column level and 3 several columns level analysis.

Column level profile

  • Length distribution of values
  • Percentage of null values
  • Patterns expressed by regular expressions
  • Column statistics: minimum, maximum, mean or standard deviation
  • Distribution of values, different values and percentage of occurrence of each of the total number of rows 

Multicolumn profile: 

  • Candidate keys, which could be primary key columns of the table
  • Functional dependence, the values of a column may depend on other
  • Inclusion of values, which could be foreign key columns of other
     

SQL Server profile task SSIS

 

After running, the task generates an XML file in the chosen location where all information is stored. To examine these findings SQL Server provides the Data Profile Viewer application. In a typical installation it shold be in this path:

C:\Program Files\Microsoft SQL Server\100\DTS\Binn\DataProfileViewer.exe

Select the XML generated by the SSIS task and begin exploring the results:

SSIS Data Profile Viewer

For more detailed information please consult the chapter Task profiling data from the online documentation from Microsoft Technet.

This task is also well commented articles from Performance SQLServer SSIS New Features in SQL Server 2008 - Part 3 and Using The Data Profiler Task and FTP Task in SQL Server 2008 Integration Services