At the initial ETL/ELT stage that data engineering is responsible for, you can only run superficial tests like "is not null" or "is unique" (for example, to prevent row duplication in a SQL join), but not much else. You simply don't have enough information about what to test at that stage.
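To make "superficial" concrete, here is a rough sketch of the kind of structural checks I mean: null counts, key uniqueness, and a guard against a join fanning out. The table and column names (`orders`, `customers`, `customer_id`) and the helper functions are all made up for illustration, and the join is a naive in-memory one, not what any real engine does.

```python
from collections import Counter

def null_count(rows, column):
    """Count rows where the column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)

def duplicate_keys(rows, column):
    """Return the key values that appear more than once."""
    counts = Counter(row[column] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def left_join(left, right, key):
    """Naive left join: one output row per matching pair."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[key])
        if not matches:
            out.append(dict(l))
        else:
            for m in matches:
                out.append({**l, **m})
    return out

# Hypothetical sample data: customer 11 appears twice on the right side.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
]
customers = [
    {"customer_id": 10, "name": "a"},
    {"customer_id": 11, "name": "b"},
    {"customer_id": 11, "name": "b2"},
]

joined = left_join(orders, customers, "customer_id")

# A left join on a key that is unique on the right must not add rows.
join_fans_out = len(joined) > len(orders)
```

Note that none of these checks know anything about what the data means. They only tell you the shape is intact.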
Think of it this way: you maintain pipes that can route water, oil or gasoline. You don't want to test for water purity, because the next day you may be asked to route sewage or oil through your system. At best you can test volume, pressure or velocity in the pipeline, because these actually have an impact on your system.
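In pipeline terms, volume, pressure and velocity map roughly to row counts, batch sizes and throughput. A minimal sketch of a content-agnostic check, assuming you record a few stats per batch (the `BatchStats` class, the `flow_alerts` function and the 50% tolerance are my own placeholders, not a standard):

```python
from dataclasses import dataclass

@dataclass
class BatchStats:
    row_count: int   # "volume"
    byte_size: int   # "pressure"
    seconds: float   # load duration, used for "velocity"

    @property
    def rows_per_second(self):
        return self.row_count / self.seconds if self.seconds > 0 else 0.0

def flow_alerts(stats, history, tolerance=0.5):
    """Flag a batch that is empty or whose row count deviates from the
    mean of recent batches by more than `tolerance` (as a fraction)."""
    alerts = []
    if stats.row_count == 0:
        alerts.append("empty batch")
    if history:
        mean_rows = sum(h.row_count for h in history) / len(history)
        if mean_rows > 0 and abs(stats.row_count - mean_rows) / mean_rows > tolerance:
            alerts.append("row count outside tolerance")
    return alerts

# Hypothetical history of two earlier batches.
history = [BatchStats(1000, 50_000, 2.0), BatchStats(1100, 55_000, 2.2)]
normal_batch = BatchStats(1050, 52_000, 2.1)
tiny_batch = BatchStats(100, 5_000, 0.5)
```

The check works the same whether the rows are clickstream events or invoices, which is the point of the analogy.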
The actual tests that the business asks you to put at the "raw data" stage have to come later, at the application level. E.g. when you extract a metric out of the data, you can run all kinds of time series tests on it, and those will test the pipeline thoroughly.
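As one example of such a time series test, here is a simple rolling z-score check on a daily metric. This is only a sketch: the function name, the 7-day window and the threshold of 3 are arbitrary choices of mine, and a real metric would probably need seasonality handling on top.

```python
from statistics import mean, stdev

def anomalies(series, window=7, threshold=3.0):
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # Flat history: any change at all is suspicious.
            if series[i] != mu:
                flagged.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily metric; the drop to 0 at index 8 could be a broken
# upstream load that passed every not-null and uniqueness check.
daily_metric = [100, 102, 98, 101, 99, 100, 103, 101, 0, 100]
flagged_days = anomalies(daily_metric)
```

With this sample data only index 8 is flagged. A failure like that is invisible at the raw stage but obvious once the data is turned into a metric.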
Testing is what makes data engineering different from software engineering. You don't control the input data and don't want to. You have to make your organization work in such a way that downstream users communicate data issues back to you. You only test the data thoroughly by using it.
If you are a data vendor and your downstream user is a paying customer that you don't want to embarrass yourself in front of, you need to invent a use, within your own org, for the data that you sell.
I like this analogy. I often use the term "plumbing" in relation to data engineering, and this extends it perfectly!
> You maintain pipes that can route water, oil or gasoline. You don't want to test for water purity, because next day you are asked to route sewage or oil through your system. You can at best test volume, pressure or velocity in the pipeline because these actually have an impact on your system.