Data Engineering 101 - A real beginner's approach
TL;DR
This article introduces Bruin, a unified data engineering framework that simplifies building pipelines by combining ingestion, transformation, and governance into a code-based approach. It covers core concepts like assets, policies, and glossaries, with a step-by-step guide for beginners.
Key Takeaways
- Bruin consolidates the roles of tools like Airflow and dbt into a single framework, using YAML, SQL, and Python to define pipelines as code.
- Key components include assets for modular tasks, policies for data quality enforcement, and glossaries for consistent terminology.
- The framework supports multi-language workflows, dependency management, and local development with tools like DuckDB and Postgres.
- It emphasizes reproducibility, version control, and governance without vendor lock-in, making it ideal for developers new to data engineering.
This is the article about Data Engineering that you find if you search the subject on Google and get redirected after clicking "I'm Feeling Lucky".
Also, this article is the POV of an experienced web dev who just started exploring a new topic in his career. That means this is my study and my research: if you have anything to add, feel free to teach me in the comments below! I would love to learn more!
Table of Contents
- 1. Prologue
- 2. First Impressions: Exploring Bruin’s Structure
- 3. Building Our First Pipeline
- 4. Full Preview Data Flow
- 5. Overpowered VS Code Extension
- 6. Conclusion
1. Prologue
I have to admit: whenever someone mentioned Data Engineering, I used to tune out. It always sounded like something impossibly complex — almost magical.
This week, I finally decided to dive in. I thought it would be fairly straightforward, but it didn’t take long to realize how deep the rabbit hole goes.
This field isn’t just a few scripts or SQL queries; it’s an entire ecosystem of interconnected tools, concepts, and responsibilities that form the backbone of modern data systems.
Concepts like:
- Data Catalogs and Governance: understanding who owns the data, how to ensure quality, and how to track lineage.
- Orchestration: coordinating dependencies and workflows with tools like Apache Airflow or Dagster.
- Transformation (ETL/ELT): cleaning and standardizing data with tools such as dbt (while managed ELT platforms like Fivetran handle extraction and loading).
- Ingestion and Streaming: connecting sources and moving data in real time with Kafka, Airbyte, or Confluent Cloud.
- Observability and Quality: monitoring data health with solutions like Monte Carlo and Datafold.
Each article I clicked just opened up a new world of tools, words, frameworks, architectures, and best practices.
And somehow, all of it has to work together — governance, orchestration, transformation, ingestion, observability, infrastructure.
As a developer, I’m used to learning a language and a framework and then getting to work.
But in data engineering, it’s different.
It’s about understanding an entire ecosystem and how each piece connects to the next.
After hours of reading docs, chasing GitHub repos, and jumping between tools, articles, and endless definitions, I finally found the tool that made everything click — Bruin.
Imagine a single framework that offers:
- Pipelines as Code — Everything lives in version-controlled text (YAML, SQL, Python). No hidden UIs or databases. Reproducible, reviewable, and automatable.
- Multi-Language by Nature — Native support for SQL and Python, plus the ability to plug in binaries or containers for more complex use cases.
- Composable Pipelines — Combine technologies, sources, and destinations in one seamless flow — no glue code, no hacks.
- No Lock-In — 100% open-source (Apache-licensed) CLI that runs anywhere: locally, in CI, or in production. You keep full control of your pipelines and data.
- Built for Developers and Data Quality — Fast local runs, integrated checks, and quick feedback loops. Data products that are tested, trusted, and easy to ship.
…and it covers all the core Data Engineering concepts mentioned earlier.
I’ll admit it — I’m the kind of person who embraces productive laziness. If there’s a way to do more with fewer tools and less friction, I’m in.
So before we get started, here’s the plan:
In most setups, data flows from OLTP databases → ingestion → data lake/warehouse → transformation → marts → analytics dashboards.
Tools like Airbyte handle ingestion, dbt handles transformation, Airflow orchestrates dependencies — and Bruin combines those layers into one unified framework.
This article will walk through the fundamental principles of Data Engineering, while exploring how Bruin brings them all together through a simple, real-world pipeline.
2. First Impressions: Exploring Bruin’s Structure
I’ll be honest — I used to have a bit of a bias against Data Science/Engineering projects. Every time I looked at one, it felt messy and unstructured, with files and notebooks scattered everywhere. Coming from a software development background, that kind of chaos always bothered me.
But once I started looking at Bruin’s project structure, that perception completely changed. Everything suddenly felt organized and intentional.
The framework naturally enforces structure through its layers — and once you follow them, everything starts to make sense.
Example: Project Structure
├── duckdb.db
├── ecommerce-mart
│ ├── assets
│ │ ├── ingestion
│ │ │ ├── raw.customers.asset.yml
│ │ │ ├── raw.order_items.asset.yml
│ │ │ ├── raw.orders.asset.yml
│ │ │ ├── raw.products.asset.yml
│ │ │ └── raw.product_variants.asset.yml
│ │ ├── mart
│ │ │ ├── mart.customers-by-age.asset.py
│ │ │ ├── mart.customers-by-country.asset.yml
│ │ │ ├── mart.product_performance.sql
│ │ │ ├── mart.sales_daily.sql
│ │ │ └── mart.variant_profitability.sql
│ │ └── staging
│ │ ├── stg.customers.asset.yml
│ │ ├── stg.order_items.sql
│ │ ├── stg.orders.sql
│ │ ├── stg.products.sql
│ │ └── stg.product_variants.sql
│ └── pipeline.yml
├── glossary.yml
├── policy.yml
└── .bruin.yml
What Each Part Does
- .bruin.yml
  - The main configuration file for your Bruin environment.
  - Defines global settings like default connections, variables, and behavior for all pipelines.
- policy.yml
  - Your data governance and validation policy file.
  - Defines data quality rules, access controls, and compliance checks that Bruin can automatically enforce before shipping data products.
- glossary.yml
  - Works as a lightweight data catalog for your project.
  - Documents terms, metrics, and datasets so everyone on the team speaks the same language.
  - Also helps with lineage, documentation, and discoverability.
- some-feature/pipeline.yml
  - Defines a specific pipeline for a domain or project (in this example, ecommerce).
  - Describes the end-to-end data flow — which assets to run, their dependencies, and schedules.
  - Pipelines are modular, so you can maintain separate ones for different business domains.
- some-feature/assets/*
  - Contains all the assets — the building blocks of your data pipelines.
  - Each asset handles a distinct task: ingesting raw data, transforming it, or generating analytical tables.
  - Since every asset is a file, it’s version-controlled, testable, and reusable — just like code.
With just that, we're able to run a full pipeline. However, I still think we need to go through each step and file individually — I promise it’ll be quick!
2.1. Core File: .bruin.yml
Think of .bruin.yml as the root configuration of your project — the file that tells Bruin how and where to run everything.
Instead of scattering settings across scripts or environment variables, Bruin centralizes them here: connections, credentials, and environment-specific configurations all live in one place.
It also serves as Bruin’s default secrets backend, so your pipelines can access databases or warehouses securely and consistently.
You can point any run at a specific config file explicitly:
bruin run ecommerce/pipeline.yml --config-file /path/to/.bruin.yml
A simple example:
default_environment: default
environments:
  default:
    connections:
      postgres:
        - name: pg-default
          username: postgres # hardcoded values work too
          password: ${PG_PASSWORD}
          host: ${PG_HOST}
          port: ${PG_PORT}
          database: ${PG_DATABASE}
      duckdb:
        - name: duckdb-default
          path: duckdb.db
What’s Happening Here
- default_environment — sets the environment Bruin will use unless specified otherwise.
- environments — defines multiple setups (e.g., dev, staging, prod), each with its own configuration.
- connections — lists every system Bruin can connect to, like Postgres or DuckDB. Each connection gets a name (e.g., pg-default) that you’ll reference across pipelines and assets.
- Environment variable support — any value wrapped in ${...} is automatically read from your system environment, so you can keep credentials out of source control while still running locally or in CI/CD environments.
This design keeps everything centralized, secure, and version-controlled, while giving you the flexibility to inject secrets dynamically through environment variables — perfect for switching between local, staging, and production without touching the code.
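As a concrete illustration of that flexibility, here is a minimal sketch of what adding a second environment could look like. Only the environments/connections structure comes from the example above; the prod values and the PG_PROD_* variables are made up for illustration:

default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db # local file for development
  prod:
    connections:
      duckdb:
        - name: duckdb-default
          path: /data/warehouse.duckdb # hypothetical production path
      postgres:
        - name: pg-default
          username: ${PG_PROD_USER} # hypothetical prod variables
          password: ${PG_PROD_PASSWORD}
          host: ${PG_PROD_HOST}
          port: ${PG_PROD_PORT}
          database: ${PG_PROD_DATABASE}

Because both environments expose connections with the same names (duckdb-default, pg-default), the assets and pipelines don’t change when you switch environments; only the credentials behind those names do.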
2.2. Pipeline: A WAY EASIER Apache Airflow
Each feature (or domain) you build comes with its own pipeline.yml file.
This is the file that groups all your assets and tells Bruin it isn’t a single asset running, but a chained list of assets.
- ecommerce-mart/
├─ pipeline.yml -> you're here
└─ assets/
├─ some-asset.sql
├─ definitely-an-asset.yml
└─ another-asset.py
This is also where you configure the connections this specific pipeline should use:
name: product_ecommerce_marts
schedule: daily # relevant for Bruin Cloud deployments
default_connections:
duckdb: "duckdb-default"
postgres: "pg-default"
2.3. Assets: The Building Blocks of Data Products
Every data pipeline in Bruin is composed of assets — modular, self-contained units that define a specific operation: ingesting, transforming, or producing a dataset.
Each asset exists as a file under the assets/ directory, and its filename doubles as its identity inside the pipeline graph.
If you look back at the file structure from the beginning, you’ll notice the pipeline mixes multiple types of assets. That’s the coolest part: you can write assets in several languages and still keep things simple. Here are some possibilities:
| Type | Description | Filename (in the file tree) |
|---|---|---|
| YAML | Declarative configuration for ingestion or metadata-heavy assets | raw.customers.asset.yml |
| SQL | Pure transformation logic — think dbt-style models | stg.orders.sql |
| Python | Custom logic or integrations (e.g., APIs, validations, or machine learning steps) | mart.customers-by-age.asset.py |
You’re free to organize assets however you like — there’s no rigid hierarchy to follow.
The key insight is that the orchestration happens implicitly through dependencies, not through an external DAG engine like Airflow.
Each asset declares what it depends on, and Bruin automatically builds and executes the dependency graph for you.
Example:
- raw.orders.asset.yml
# raw.orders.asset.yml
name: raw.orders
type: ingestr
description: Ingest OLTP orders from Postgres into the DuckDB raw layer.
parameters:
source_connection: pg-default
source_table: "public.orders"
destination: duckdb
- raw.order_items.asset.yml
# raw.order_items.asset.yml
name: raw.order_items
type: ingestr
description: Ingest OLTP order_items from Postgres into the DuckDB raw layer.
depends:
- raw.orders # declares a dependency on the 'raw.orders' asset
parameters:
source_connection: pg-default
source_table: "public.order_items"
destination: duckdb
…which turns into this dependency graph:
graph TD
raw.orders --> raw.order_items;
By chaining assets like this, you describe logical relationships between data operations rather than manually orchestrating steps.
The result is a declarative, composable, and maintainable pipeline — easy to read, version, and extend just like application code.
One of the most powerful aspects of Bruin is how it connects data quality and governance directly into your assets.
By defining checks under each column, you’re not only validating your data but also documenting ownership, expectations, and constraints — all version-controlled and enforceable at runtime.
This means Bruin doesn’t just run pipelines — it audits, documents, and governs them as part of the same workflow.
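As a small sketch of what that looks like on an asset (the full ingestion example in section 3.1 uses the same pattern), the snippet below adds column checks and an owner. The owner value and the status column are illustrative assumptions of mine; the checks (not_null, unique) and the other fields are the ones used elsewhere in this article:

name: raw.orders
type: ingestr
description: Ingest OLTP orders from Postgres into the DuckDB raw layer.
owner: data-team@example.com # assumed to be the field the asset-has-owner policy (next section) looks for
columns:
  - name: id
    type: integer
    primary_key: true
    checks:
      - name: not_null
      - name: unique
  - name: status
    type: string
    checks:
      - name: not_null

The idea is that a failing check surfaces at run time, before bad data reaches the downstream layers.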
2.4. Policies: Enforcing Quality and Governance
Policies in Bruin act as the rulebook that keeps your data pipelines consistent, compliant, and high quality.
They ensure every asset and pipeline follows best practices — from naming conventions and ownership to validation and metadata completeness.
At their core, policies are defined in a single policy.yml file located at the root of your project.
This file lets you lint, validate, and enforce standards automatically before a pipeline runs.
Quick Overview
rulesets:
- name: standard
selector:
- path: .*/ecommerce/.*
rules:
- asset-has-owner
- asset-name-is-lowercase
- asset-has-description
Each ruleset defines:
- where the rule applies (selector → match by path, tag, or name),
- what to enforce (rules → built-in or custom validation rules).
Once defined, you can validate your entire project:
bruin validate ecommerce
# Validating pipelines in 'ecommerce' for 'default' environment...
# Pipeline: ecommerce_pg_to_duckdb (.)
# raw.order_items (assets/ingestion/raw.order_items.asset.yml)
# └── Asset must have an owner (policy:standard:asset-has-owner)
Bruin automatically lints assets before execution — ensuring that non-compliant pipelines never run.
Built-in and Custom Rules
| Rule | Target | Description |
|---|---|---|
| asset-has-owner | asset | Each asset must define an owner. |
| asset-has-description | asset | Assets must include a description. |
| asset-name-is-lowercase | asset | Asset names must be lowercase. |
| pipeline-has-retries | pipeline | Pipelines must define retry settings. |
You can also define your own rules:
custom_rules:
- name: asset-has-owner
description: every asset should have an owner
criteria: asset.Owner != ""
Rules can target either assets or pipelines, and they use logical expressions to determine compliance.
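For example, you could require every asset to carry a non-empty description. This sketch assumes the expression language exposes asset.Description the same way it exposes asset.Owner above (the built-in asset-has-description rule already covers this case, so it is purely illustrative):

custom_rules:
  - name: asset-description-is-not-empty
    description: every asset should include a human-readable description
    criteria: asset.Description != "" # assumes asset.Description is exposed like asset.Owner

Once defined, a custom rule is referenced by name from a ruleset’s rules list, the same way the rules shown earlier are.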
Policies transform Bruin into a self-governing data platform — one where best practices aren’t optional, they’re enforced.
By committing your rules to version control, you make data governance part of the development workflow, not an afterthought.
2.5. Glossary: Speaking the Same Language
In data projects, one of the hardest problems isn’t technical — it’s communication.
Different teams often use the same word to mean different things.
That’s where Bruin’s Glossary comes in.
A glossary is defined in glossary.yml at the root of your project.
It acts as a shared dictionary of business concepts (like Customer or Order) and their attributes, keeping teams aligned across pipelines.
entities:
Customer:
description: A registered user or business in our platform.
attributes:
ID:
type: integer
description: Unique customer identifier.
You can reference these definitions inside assets using extends, avoiding duplication and ensuring consistency:
# raw.customers.asset.yml
name: raw.customers
type: ingestr
columns:
- name: customer_id
extends: Customer.ID
This automatically inherits the type and description from the glossary.
It’s a simple idea, but a powerful one — your data definitions become version-controlled and shared, just like code.
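To see how the glossary grows with the project, here is a sketch that adds a second attribute to the same entity. The Email attribute is my own illustrative addition, not part of the example above, but it reuses only the entities/attributes structure already shown:

entities:
  Customer:
    description: A registered user or business in our platform.
    attributes:
      ID:
        type: integer
        description: Unique customer identifier.
      Email:
        type: string
        description: Primary contact email for the customer.

In raw.customers.asset.yml, the email column would then simply declare extends: Customer.Email instead of repeating its type and description.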
3. Building Our First Pipeline
Now that we’ve explored the structure and philosophy behind Bruin, it’s time to build an end-to-end pipeline.
We’ll go from raw ingestion to a clean staging layer, and finally, to analytics-ready marts — all defined as code.
We’ll assume you already have:
- A Postgres database as your data source.
- A DuckDB database as your analytical storage.
- A working .bruin.yml file configured with both connections.

3.1 Step 1: Ingest from Your Source to a Data Lake
The first step is to move data from Postgres into DuckDB.
This creates your Raw Layer — data replicated from the source with minimal transformation.
Create an ingestion asset file:
touch assets/ingestion/raw.customers.asset.yml
Then define the asset:
# assets/ingestion/raw.customers.asset.yml
name: raw.customers
type: ingestr
description: Ingest OLTP customers from Postgres into the DuckDB raw layer.
parameters:
source_connection: pg-default
source_table: "public.customers"
destination: duckdb
columns:
- name: id
type: integer
primary_key: true
checks:
- name: not_null
- name: unique
- name: email
type: string
checks:
- name: not_null
- name: unique
- name: country
type: string
checks:
- name: not_null
This tells Bruin to extract data from your Postgres table public.customers, validate column quality, and store it in the DuckDB raw layer.
Running the Asset
bruin run ecommerce/assets/ingestion/raw.customers.asset.yml
Expected output:
Analyzed the pipeline 'ecommerce_pg_to_duckdb' with 13 assets.
Running only the asset 'raw.customers'
Pipeline: ecommerce_pg_to_duckdb (../../..)
No issues found
✓ Successfully validated 13 assets across 1 pipeline, all good.
Interval: 2025-10-12T00:00:00Z - 2025-10-12T23:59:59Z
Starting the pipeline execution...
PASS raw.customers ........
bruin run completed successfully in 2.095s
✓ Assets executed 1 succeeded
You can now query the ingested data:
bruin query --connection duckdb-default --query "SELECT * FROM raw.customers LIMIT 5"
Result:
┌────┬───────────────────┬───────────────────────────┬───────────┬──────────────────┬──────────────────────────────────────┬──────────────────────────────────────┐
│ ID │ FULL_NAME │ EMAIL │ COUNTRY │ CITY │ CREATED_AT │ UPDATED_AT │
├────┼───────────────────┼───────────────────────────┼───────────┼──────────────────┼──────────────────────────────────────┼──────────────────────────────────────┤
│ 1 │ Allison Hill │ [email protected] │ Uganda │ New Roberttown │ 2025-10-10 18:19:13.083281 +0000 UTC │ 2025-10-10 00:42:59.71112 +0000 UTC │
│ 2 │ David Guzman │ [email protected] │ Cyprus │ Lawrencetown │ 2025-10-10 07:52:47.643619 +0000 UTC │ 2025-10-10 06:23:42.864287 +0000 UTC │
│ 3 │ Caitlin Henderson │ [email protected] │ Hong Kong │ West Melanieview │ 2025-10-10 21:06:02.639412 +0000 UTC │ 2025-10-10 19:23:17.540169 +0000 UTC │
│ 4 │ Monica Herrera │ [email protected] │ Niger │ Barbaraland │ 2025-10-11 01:33:43.032929 +0000 UTC │ 2025-10-10 02:29:27.22515 +0000 UTC │
│ 5 │ Darren Roberts │ [email protected] │ Fiji │ Reidstad │ 2025-10-10 12:05:18.734246 +0000 UTC │ 2025-10-10 00:51:13.406526 +0000 UTC │
└────┴───────────────────┴───────────────────────────┴───────────┴──────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘
Your raw layer is now established and validated.
3.2 Step 2: Formatting and Validating the Data (Staging Layer)
Next, we’ll clean and standardize the ingested data before using it in analytics.
This layer is called Staging (stg) — it’s where you enforce schem