To most, data analysis is as exciting as auditing or watching paint dry. Strangely, I enjoy scrutinizing and exploring data, and do it frequently when implementing CRM applications and data warehouses. Properly analyzing your data before a data migration and integration is extremely important, because incomplete analysis leads to flawed logic, and a lot of hasty data remodeling and rewriting of ETL code before go-live. Below are some steps I go through to understand existing data before ETL development. It’s quite tedious and by no means complete, but hopefully there are bits and pieces that you can leverage.
Tables – After step 1, you should have a better understanding of the application and the data from the business perspective. Now go to the database. Open each table and execute something like SELECT * FROM [TABLE1] ORDER BY [MODIFIEDDATE] DESC LIMIT 10. Copy the result set into a single spreadsheet and use that spreadsheet to take notes. Having quick access to the table structure and sample data along with your notes helps you learn the data faster.
ERD – Using primary keys and foreign keys, diagram just the tables and relationships. In other words, create an entity-relationship diagram. The field info is captured in step 2, so you don’t need the RSD or a complete data model unless you like to see the data types as well. Modeling tools like ERwin and Visio can automatically render the ERD for you, but I do it manually so that I can examine every relationship during the process. Also, PKs and FKs are not sometimes set in the database, so the tool will not draw those relationships.
Fields – Now the painful part - scan values in each column. Pull the first 250 and last 250 rows in each table and look for odd values while scrolling through the records. e.g. date columns with 1900-01-01, first name column with company names, email column with comma separated values, etc. Columns with odd values - as well as the fields that the business identified in step 1 – should get extra scrutiny with SQL code, using Group By, Distinct, Count of nulls, etc.
Here are some more specific things to look for:
Hopefully this was informative.
Do you have more questions about how to best analyze data, data migrations, or data integration? Get in touch with us.
I am even more accessible than the other modals.