增量更新
A delta load refers to extracting only the data that has changed since the last time the extract process has run. The most commonly used steps to perform a delta load are:
- Ensure there is a
modified_at
timestamp or incremental id column such as a primary key on the data source. - On the initial run of the pipeline, do a full load of the dataset.
- On following runs of the pipeline, query the target dataset using
MAX(column_name)
. - Query the source dataset and filter records where values are greater than the value from step 3.
- More resource efficient
- Easy to implement and maintain
- Only requires read permissions to perform
- Does not capture deleted records
- Requires extra metadata on the source (commonly a unique id or updated timestamp)
- Does not capture multiple changes between the polling interval. If a row changes multiple times, you may only capture the latest state.
- Querying the database for changes may impact the database performance.