Data Engineering Hub

Mage

Overview

Mage is an open-source data pipeline tool for transforming and integrating data. Mage features a GUI and pre-built assets for data extraction, transformation, and storage.

Mage is built around the following core abstractions: projects, pipelines, and blocks. Each project houses one or many pipelines, which can perform batch processing, streaming, or data integration.

Each pipeline is composed of discrete blocks, which load, transform, and export data. Mage’s built-in testing framework lets data engineers check the output of every block.

These outputs can be previewed and analyzed via Mage’s GUI. This reduces friction in the developer feedback loop, allowing for faster editing and troubleshooting.
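
The load, transform, export, and test flow described above can be sketched in plain Python. This is only an illustration of the idea, not Mage's API: the function names (`load_data`, `transform`, `export_data`, `check_output`, `run_with_test`) and the sample records are hypothetical, and a real Mage block would be created from the tool's own templates.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def load_data() -> Rows:
    """Loader block: stands in for reading from a source system."""
    return [
        {"user_id": 1, "amount": "10.50"},
        {"user_id": 2, "amount": "4.25"},
        {"user_id": 3, "amount": None},  # incomplete record
    ]


def transform(rows: Rows) -> Rows:
    """Transformer block: drop incomplete rows and cast amounts to float."""
    return [
        {**row, "amount": float(row["amount"])}
        for row in rows
        if row["amount"] is not None
    ]


def export_data(rows: Rows, sink: Rows) -> None:
    """Exporter block: stands in for writing to a warehouse table."""
    sink.extend(rows)


def check_output(rows: Rows) -> None:
    """Test attached to a block: fail fast if the output is unusable."""
    assert rows, "block produced no rows"
    assert all(isinstance(row["amount"], float) for row in rows), "bad amount"


def run_with_test(block: Callable[..., Rows], test: Callable[[Rows], None], *args: Any) -> Rows:
    """Run a block, then validate its output before passing it downstream."""
    output = block(*args)
    test(output)
    return output


warehouse: Rows = []  # hypothetical in-memory destination
loaded = load_data()
cleaned = run_with_test(transform, check_output, loaded)
export_data(cleaned, warehouse)
```

Running the test immediately after the block means a bad output stops the pipeline before the export step, which is the kind of per-step check the paragraph above describes.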

🔮 Features

🎶 Orchestration: Schedule and manage data pipelines with observability.
📓 Notebook: Interactive Python, SQL, and R editor for coding data pipelines.
🏗️ Data integrations: Synchronize data from third-party sources to your internal destinations.
🚰 Streaming pipelines: Ingest and transform real-time data.
❎ dbt: Build, run, and manage your dbt models with Mage.

🏔️ Core design principles

Every user experience and technical design decision adheres to these principles.

💻 Easy developer experience: Open-source engine that comes with a custom notebook UI for building data pipelines.
🚢 Engineering best practices built-in: Build and deploy data pipelines using modular code. No more writing throwaway code or trying to turn notebooks into scripts.
💳 Data is a first-class citizen: Designed from the ground up specifically for running data-intensive workflows.
🪐 Scaling is made simple: Analyze and process large data quickly for rapid iteration.

🛸 Core abstractions

These are the fundamental concepts that Mage uses to operate.

Project: Like a repository on GitHub; this is where you write all your code.
Pipeline: Contains references to all the blocks of code you want to run and charts for visualizing data, and organizes the dependencies between blocks.
Block: A file with code that can be executed independently or within a pipeline.
Data product: Every block produces data after it’s been executed. These are called data products in Mage.
Trigger: A set of instructions that determines when or how a pipeline should run.
Run: A record of one execution of a pipeline or block. It stores when the run started, its status, when it completed, and any runtime variables used.
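
To make the relationships between these abstractions concrete, here is a minimal plain-Python sketch. It is not Mage's implementation: the `Block`, `Pipeline`, and `Run` classes and their fields are hypothetical stand-ins, and triggers are left out. It only shows how a pipeline can order blocks by their dependencies, collect each block's data product, and record run metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Block:
    """A unit of code plus the names of the blocks it depends on."""
    name: str
    func: Callable[..., Any]
    upstream: List[str] = field(default_factory=list)


@dataclass
class Run:
    """Metadata for one execution, plus the data product of each block."""
    started_at: datetime
    variables: Dict[str, Any]
    status: str = "running"
    completed_at: Optional[datetime] = None
    data_products: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Pipeline:
    name: str
    blocks: Dict[str, Block] = field(default_factory=dict)

    def add(self, block: Block) -> None:
        self.blocks[block.name] = block

    def execute(self, variables: Optional[Dict[str, Any]] = None) -> Run:
        run = Run(started_at=datetime.now(timezone.utc), variables=dict(variables or {}))
        # Map each block to its predecessors so upstream blocks run first.
        graph = {b.name: set(b.upstream) for b in self.blocks.values()}
        try:
            for name in TopologicalSorter(graph).static_order():
                block = self.blocks[name]
                inputs = [run.data_products[up] for up in block.upstream]
                run.data_products[name] = block.func(*inputs, **run.variables)
            run.status = "completed"
        except Exception:
            run.status = "failed"
            raise
        finally:
            run.completed_at = datetime.now(timezone.utc)
        return run


pipeline = Pipeline("daily_totals")
pipeline.add(Block("load", lambda **kw: [1, 2, 3]))
pipeline.add(Block("scale", lambda rows, **kw: [r * kw["factor"] for r in rows], upstream=["load"]))
pipeline.add(Block("total", lambda rows, **kw: sum(rows), upstream=["scale"]))

run = pipeline.execute({"factor": 2})
```

Each block receives the data products of its upstream blocks as positional arguments and the runtime variables as keyword arguments; that calling convention is a choice made for this sketch only.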

Advantages

  • Hybrid GUI/Code Tool: design-driven GUI for building and editing pipelines while still allowing the flexibility of code.
  • Easy developer experience: start developing locally with a single command or launch a dev environment in the cloud.
  • Engineering best practices built-in: each step in your pipeline is a standalone file containing modular code that’s reusable and testable with data validations.
  • Interactive code: immediately see results from your code’s output with an interactive notebook UI.
  • Data is a first-class citizen: Each block of code in your pipeline produces data that can be versioned, partitioned, and cataloged for future use.
  • Collaborate on cloud: Develop collaboratively on cloud resources, version control with Git, and test pipelines without waiting for an available shared staging environment.
  • Scaling made simple: Transform very large datasets directly in your data warehouse or through a native integration with Spark.
  • Observability: Operationalize your pipelines with built-in monitoring, alerting, and observability through an intuitive UI.
  • Rapidly growing community: Mage has a vibrant community of over 2.5k data professionals as of July 2023.
  • Data integration: Use existing connectors or build your own with the Singer spec for a free alternative to paid tools like Fivetran. Supports both full-table and incremental syncs, the latter via CDC (change data capture).
  • Native integration with dbt: preview dbt results, orchestrate dbt model runs, schedule dbt models to depend on non-dbt tasks (e.g. ETL/ELT pipelines).
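
As a rough illustration of the Singer spec mentioned under data integration, the sketch below emits the three Singer message types (SCHEMA, RECORD, STATE) as JSON lines. It uses only the standard library and is not one of Mage's connectors: the `orders` stream, `SOURCE_ROWS`, and the `sync` and `write_message` helpers are hypothetical. The incremental step here uses a simple `updated_at` bookmark rather than true CDC.

```python
import io
import json
from typing import Any, Dict, Optional, TextIO

# Hypothetical source table, assumed to be sorted by updated_at.
SOURCE_ROWS = [
    {"id": 1, "updated_at": "2023-07-01T00:00:00Z"},
    {"id": 2, "updated_at": "2023-07-02T00:00:00Z"},
    {"id": 3, "updated_at": "2023-07-03T00:00:00Z"},
]


def write_message(out: TextIO, message: Dict[str, Any]) -> None:
    """Singer messages are written as one JSON object per line."""
    out.write(json.dumps(message) + "\n")


def sync(out: TextIO, state: Optional[Dict[str, str]] = None, stream: str = "orders") -> Dict[str, Optional[str]]:
    """Emit SCHEMA, then new RECORDs, then a STATE message with the bookmark."""
    bookmark = (state or {}).get(stream)
    write_message(out, {
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "updated_at": {"type": "string", "format": "date-time"},
            },
        },
        "key_properties": ["id"],
    })
    for row in SOURCE_ROWS:
        # Skip rows already synced; identically formatted ISO 8601
        # timestamps compare correctly as strings.
        if bookmark is not None and row["updated_at"] <= bookmark:
            continue
        write_message(out, {"type": "RECORD", "stream": stream, "record": row})
        bookmark = row["updated_at"]
    new_state = {stream: bookmark}
    write_message(out, {"type": "STATE", "value": new_state})
    return new_state


first_output = io.StringIO()
state = sync(first_output)           # no prior state: full-table sync

second_output = io.StringIO()
state = sync(second_output, state)   # bookmark present: only newer rows
```

With no saved state the sync sends every row; with the returned state passed back in, only rows newer than the bookmark are sent, which is the difference between full-table and incremental loading.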

Disadvantages

  • Requires frequent patches: Frequent releases/version upgrades mean that some maintenance is required.
  • Work in progress: version 0.9.3 as of July 2023; this is not yet a v1 tool.
  • No managed offering: currently, there is only a self-hosted option.
  • Compliance: At this time, Mage only supports SOC-2 security standards.
  • SLAs: Since Mage is self-hosted, no support service-level agreements are available.

Learning Resources