Data Engineering Hub

Great Expectations

Great Expectations is a [[Python]] library for creating [[Data Unit Test|data unit tests]] that can be used in your [[Data Pipeline|data pipelines]].

Summary

Great Expectations operates on the principle that data engineering pipelines tend towards entropy over time, accumulating what its authors dub “pipeline debt”. It aims to provide a testing and evaluation suite that helps data engineering teams clean up their pipelines and work on them collaboratively with more confidence.

Great Expectations calls each unit test an “Expectation”, because it states what a check on the data is expected to return. Unlike many other testing and evaluation suites, Great Expectations encourages testing at batch time (when new data arrives) rather than at compile or deploy time. By testing at batch time, teams have a safety net if code behaves unexpectedly on new data, and they can pinpoint the root cause as soon as possible.
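To make the idea of a batch-time expectation concrete, here is a minimal plain-Python sketch. It does not use the Great Expectations API: the function is hand-rolled for illustration (its name only mirrors the library's naming style), and the `order_id` and `amount` columns and both batches are hypothetical.

```python
# Plain-Python illustration only: this is NOT the Great Expectations API.
from typing import Any


def expect_column_values_to_be_between(
    rows: list[dict[str, Any]], column: str, min_value: float, max_value: float
) -> dict[str, Any]:
    """Report whether every value of `column` lies in [min_value, max_value]."""
    unexpected = [
        row.get(column)
        for row in rows
        # Missing values count as unexpected, as do out-of-range ones.
        if row.get(column) is None or not (min_value <= row[column] <= max_value)
    ]
    return {"success": not unexpected, "unexpected_values": unexpected}


# Hypothetical batches of newly arrived data.
good_batch = [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": 5.00}]
bad_batch = [{"order_id": 3, "amount": -4.00}, {"order_id": 4, "amount": 12.50}]

# The same expectation runs against every batch as it arrives.
good_result = expect_column_values_to_be_between(good_batch, "amount", 0, 10_000)
bad_result = expect_column_values_to_be_between(bad_batch, "amount", 0, 10_000)
```

Returning the offending values along with the pass/fail flag is what makes the root cause quick to locate: the failing batch and the exact bad values are known the moment the data arrives.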

Workflow

  1. Introduce the expectation early in the process, perhaps even before you’ve built a pipeline.
  2. Show it to the stakeholder and have them validate its assumptions.
  3. Implement it in your pipeline.
  4. Continuously update the tests as the data changes, iterating with the stakeholder.
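The workflow above could be sketched roughly as follows, again in plain Python rather than with the Great Expectations API. Every name here (`SUITE`, `validate_batch`, `load_batch`, and the column names) is a hypothetical chosen for illustration.

```python
# Hypothetical sketch of the workflow; none of these names come from Great Expectations.
from typing import Any, Callable

Row = dict[str, Any]
Check = Callable[[Row], bool]

# Steps 1-2: expectations keyed by a plain-language description that a
# stakeholder can review before any pipeline code exists.
SUITE: dict[str, Check] = {
    "order_id is present": lambda row: row.get("order_id") is not None,
    "amount is non-negative": lambda row: row.get("amount") is not None
    and row["amount"] >= 0,
}


def validate_batch(rows: list[Row], suite: dict[str, Check]) -> dict[str, int]:
    """Count the failing rows per expectation for one batch."""
    return {name: sum(not check(row) for row in rows) for name, check in suite.items()}


def load_batch(rows: list[Row], suite: dict[str, Check]) -> list[Row]:
    """Step 3: gate the pipeline so a failing batch never reaches downstream code."""
    failures = {name: n for name, n in validate_batch(rows, suite).items() if n}
    if failures:
        raise ValueError(f"batch rejected: {failures}")
    return rows


# Step 4: when the data changes, agree on a new expectation and add it.
SUITE["currency is known"] = lambda row: row.get("currency") in {"USD", "EUR"}
```

Keeping the suite as data separate from the pipeline code means step 4 is a one-line change that can be reviewed with the stakeholder without touching the loading logic.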