Data Quality
If you want good data products, data quality is essential.
Overview
Some organizations keep their data as messy as a toddler's bedroom. To get good results from your data products, you need quality source data. Remember, the product (the software) is only the delivery mechanism. The data is what people are really interested in. For me, data quality is essential.
Some of the most common causes of data quality issues are apathy, poor system design, manual edits, fixes for one issue that create a new problem, data entry errors and errors introduced during data processing.
Poor data quality is very prevalent. It persists because people believe that if they have data, it must be good. It isn't, unless it is intentionally managed. Fortunately, it isn't hard to fix, technically speaking. In fact, it's quite easy. Getting people to change is a whole other thing. High-quality data just takes an intentional effort to do things correctly: better oversight, data quality standards, consistent data cleaning through automation and fixing issues in the source data where possible.
Dimensions
These are the most common dimensions of data quality.
Complete
The data set is complete and isn't missing values that should be there. Optional fields can be empty, but required fields must always be populated.
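A completeness check can be sketched in a few lines of Python. This is only an illustration: the field names and the REQUIRED_FIELDS list are hypothetical, and a real list would come from your own business rules.

```python
# Hypothetical required fields; replace with the fields your business rules demand.
REQUIRED_FIELDS = ("order_id", "customer_email")


def missing_required(record, required=REQUIRED_FIELDS):
    """Return the required field names that are absent, None, or blank."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


records = [
    {"order_id": 1, "customer_email": "a@example.com", "notes": None},  # notes is optional
    {"order_id": 2, "customer_email": "  ", "notes": "gift"},           # blank required field
]

# Pair each incomplete record's id with the fields it is missing.
incomplete = [
    (r["order_id"], missing_required(r)) for r in records if missing_required(r)
]
```

The optional notes field is allowed to be empty, while a blank customer_email is reported.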
Consistent
The data set is internally consistent. It doesn't use unknown acronyms, and terms are recorded the same way every time. If this isn't done, it affects reporting and queries and lowers the performance of technology products.
Accurate
The data set accurately reflects the real world. For example, an out-of-date address on an order is inaccurate data.
Timely
The data set is up to date at the time it is needed. If it takes weeks to update a data set, it isn't timely. Automated programs can help with this.
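As one example of what an automated program might check, here is a minimal staleness test in Python. The function name, the timestamps and the one-day threshold are assumptions made for illustration, not a standard.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age, now=None):
    """Return True when the data set was last refreshed longer ago than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age


# Hypothetical timestamps: the data set was refreshed three weeks before it was needed.
needed_at = datetime(2024, 3, 22, tzinfo=timezone.utc)
last_refresh = datetime(2024, 3, 1, tzinfo=timezone.utc)

stale = is_stale(last_refresh, max_age=timedelta(days=1), now=needed_at)
```

A scheduled job could run a check like this and raise an alert, or trigger a refresh, whenever it returns True.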
Unique
Each record in a data set is unique and not a duplicate of another record. Individual field values may repeat across records, but a record that duplicates another in its entirety should be deleted.
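Removing complete duplicates can be sketched as follows, assuming each record is a flat dictionary with hashable values. The sample rows are made up for the illustration.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each record; drop later records whose fields all match."""
    seen = set()
    unique = []
    for record in records:
        # A frozenset of (field, value) pairs identifies the record regardless of field order.
        key = frozenset(record.items())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


rows = [
    {"name": "Ann", "city": "Austin"},
    {"name": "Bob", "city": "Austin"},  # shares a city with Ann, but is a different record
    {"name": "Ann", "city": "Austin"},  # complete duplicate of the first row
]

deduplicated = drop_exact_duplicates(rows)
```

Only the third row is dropped. The second row repeats a field value but is still a unique record.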
Valid
The data set is within acceptable values according to business rules. So if a field takes integer values from 1 to 10, both -3 and 7.4 would be invalid values.
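The 1 to 10 rule above could be written as a small check like this one. The function name and the default bounds are illustrative assumptions.

```python
def is_valid_rating(value, low=1, high=10):
    """Return True only for whole numbers between low and high, inclusive."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return low <= value <= high


candidates = [7, -3, 7.4, 10, 0]
invalid = [v for v in candidates if not is_valid_rating(v)]
```

Here -3 and 0 fail the range test, and 7.4 fails because it is not an integer.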
Example
Here is a quick example of a situation I have dealt with numerous times.
I'm working with a data set, and in a given field there are 20-plus variations of the same term. They are close but have slight differences in spelling, spacing and function. This breaks the queries and lowers performance. While a human can tell the instances all mean the same thing, computers can't. Computers need an exact match or a predictable pattern. Neither is possible with this mess.
The solution is multi-faceted. On the database side, it's data cleaning to make the term consistent. On the application side, we use a dropdown list to ensure that one and only one version of the term can get through to the database going forward.
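The database-side cleanup might look something like the sketch below. The term "Purchase Order", the alias table and the function names are all hypothetical, since the real term and its variations depend on your data. The idea is to strip away case, spacing and punctuation differences first, then map the known spellings to a single canonical value.

```python
import re

CANONICAL = "Purchase Order"  # hypothetical term used for illustration

# Known variations, keyed by their normalized form (lowercase letters and digits only).
ALIASES = {
    "purchaseorder": CANONICAL,
    "purchasorder": CANONICAL,  # a known misspelling
    "po": CANONICAL,            # an abbreviation
}


def normalize_key(raw):
    """Lowercase the value and remove everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def canonicalize(raw):
    """Return the canonical term, or None when the value is not a known variation."""
    return ALIASES.get(normalize_key(raw))


messy = ["purchase order", "PURCHASE-ORDER", " Purchase   Order ", "P.O.", "Purchas Order"]
cleaned = [canonicalize(value) for value in messy]
```

Values that come back as None are not guessed at. They can be set aside for a person to review and, if appropriate, added to the alias table.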