Data Engineering 101 - A real beginner's approach
TL;DR
This article introduces Bruin, a unified data engineering framework that simplifies building pipelines by combining ingestion, transformation, and governance into a code-based approach. It covers core concepts like assets, policies, and glossaries, with a step-by-step guide for beginners.
Key Takeaways
- Bruin consolidates the roles of tools like Airflow and dbt into a single framework, using YAML, SQL, and Python to define pipelines as code.
- Key components include assets for modular tasks, policies for data quality enforcement, and glossaries for consistent terminology.
- The framework supports multi-language workflows, dependency management, and local development with tools like DuckDB and Postgres.
- It emphasizes reproducibility, version control, and governance without vendor lock-in, making it ideal for developers new to data engineering.
This is the article about Data Engineering that you find if you search the subject on Google and get redirected after clicking "I'm Feeling Lucky".
Also, this article is the POV of an experienced web dev who just started exploring a new topic in his career. That means this is my study and my research: if you have anything to add, feel free to teach me in the comments below! I would love to learn more!
Table of Contents
- 1. Prologue
- 2. First Impressions: Exploring Bruin’s Structure
- 3. Building Our First Pipeline
- 4. Full Preview Data Flow
- 5. Overpowered VS Code Extension
- 6. Conclusion
1. Prologue
I have to admit: whenever someone mentioned Data Engineering, I used to tune out. It always sounded like something impossibly complex — almost magical.
This week, I finally decided to dive in. I thought it would be fairly straightforward, but it didn’t take long to realize how deep the rabbit hole goes.
This field isn’t just a few scripts or SQL queries; it’s an entire ecosystem of interconnected tools, concepts, and responsibilities that form the backbone of modern data systems.
Concepts like:
- Data Catalogs and Governance: understanding who owns the data, how to ensure quality, and how to track lineage.
- Orchestration: coordinating dependencies and workflows with tools like Apache Airflow or Dagster.
- Transformation (ETL/ELT): cleaning and standardizing data with tools such as dbt (while managed ELT platforms like Fivetran handle extraction and loading).
- Ingestion and Streaming: connecting sources and moving data in real time with Kafka, Airbyte, or Confluent Cloud.
- Observability and Quality: monitoring data health with solutions like Monte Carlo and Datafold.
Each article I clicked just opened up a new world of tools, words, frameworks, architectures, and best practices.
And somehow, all of it has to work together — governance, orchestration, transformation, ingestion, observability, infrastructure.
As a developer, I’m used to learning a language and a framework and then getting to work.
But in data engineering, it’s different.
It’s about understanding an entire ecosystem and how each piece connects to the next.
After hours of reading docs, chasing GitHub repos, and jumping between tools, articles, and endless definitions, I finally found the tool that made everything click — Bruin.
Imagine a single framework that offers:
- Pipelines as Code — Everything lives in version-controlled text (YAML, SQL, Python). No hidden UIs or databases. Reproducible, reviewable, and automatable.
- Multi-Language by Nature — Native support for SQL and Python, plus the ability to plug in binaries or containers for more complex use cases.
- Composable Pipelines — Combine technologies, sources, and destinations in one seamless flow — no glue code, no hacks.
- No Lock-In — 100% open-source (Apache-licensed) CLI that runs anywhere: locally, in CI, or in production. You keep full control of your pipelines and data.
- Built for Developers and Data Quality — Fast local runs, integrated checks, and quick feedback loops. Data products that are tested, trusted, and easy to ship.
…and it covers all the core Data Engineering concepts mentioned earlier.
I’ll admit it — I’m the kind of person who embraces productive laziness. If there’s a way to do more with fewer tools and less friction, I’m in.
So before we get started, here’s the plan:
In most setups, data flows from OLTP databases → ingestion → data lake/warehouse → transformation → marts → analytics dashboards.
Tools like Airbyte handle ingestion, dbt handles transformation, Airflow orchestrates dependencies — and Bruin combines those layers into one unified framework.
This article will walk through the fundamental principles of Data Engineering, while exploring how Bruin brings them all together through a simple, real-world pipeline.
2. First Impressions: Exploring Bruin’s Structure
I’ll be honest — I used to have a bit of a bias against Data Science/Engineering projects. Every time I looked at one, it felt messy and unstructured, with files and notebooks scattered everywhere. Coming from a software development background, that kind of chaos always bothered me.
But once I started looking at Bruin’s project structure, that perception completely changed. Everything suddenly felt organized and intentional.
The framework naturally enforces structure through its layers — and once you follow them, everything starts to make sense.
Example: Project Structure
├── duckdb.db
├── ecommerce-mart
│ ├── assets
│ │ ├── ingestion
│ │ │ ├── raw.customers.asset.yml
│ │ │ ├── raw.order_items.asset.yml
│ │ │ ├── raw.orders.asset.yml
│ │ │ ├── raw.products.asset.yml
│ │ │ └── raw.product_variants.asset.yml
│ │ ├── mart
│ │ │ ├── mart.customers-by-age.asset.py
│ │ │ ├── mart.customers-by-country.asset.yml
│ │ │ ├── mart.product_performance.sql
│ │ │ ├── mart.sales_daily.sql
│ │ │ └── mart.variant_profitability.sql
│ │ └── staging
│ │ ├── stg.customers.asset.yml
│ │ ├── stg.order_items.sql
│ │ ├── stg.orders.sql
│ │ ├── stg.products.sql
│ │ └── stg.product_variants.sql
│ └── pipeline.yml
├── glossary.yml
├── policy.yml
└── .bruin.yml
What Each Part Does
- .bruin.yml
  - The main configuration file for your Bruin environment.
  - Defines global settings like default connections, variables, and behavior for all pipelines.
- policy.yml
  - Your data governance and validation policy file.
  - Defines data quality rules, access controls, and compliance checks that Bruin can automatically enforce before shipping data products.
- glossary.yml
  - Works as a lightweight data catalog for your project.
  - Documents terms, metrics, and datasets so everyone on the team speaks the same language.
  - Also helps with lineage, documentation, and discoverability.
- some-feature/pipeline.yml
  - Defines a specific pipeline for a domain or project (in this example, ecommerce).
  - Describes the end-to-end data flow — which assets to run, their dependencies, and schedules.
  - Pipelines are modular, so you can maintain separate ones for different business domains.
- some-feature/assets/*
  - Contains all the assets — the building blocks of your data pipelines.
  - Each asset handles a distinct task: ingesting raw data, transforming it, or generating analytical tables.
  - Since every asset is a file, it’s version-controlled, testable, and reusable — just like code.
With just that, we're able to run a full pipeline. However, I still think we need to go through each step and file individually — I promise it’ll be quick!
2.1. Core File: .bruin.yml
Think of .bruin.yml as the root configuration of your project — the file that tells Bruin how and where to run everything.
Instead of scattering settings across scripts or environment variables, Bruin centralizes them here: connections, credentials, and environment-specific configurations all live in one place.
It also serves as Bruin’s default secrets backend, so your pipelines can access databases or warehouses securely and consistently.
You can point any run at a specific config file explicitly:
bruin run ecommerce/pipeline.yml --config-file /path/to/.bruin.yml
A simple example:
default_environment: default
environments:
  default:
    connections:
      postgres:
        - name: pg-default
          username: postgres # hardcoded values work too
          password: ${PG_PASSWORD}
          host: ${PG_HOST}
          port: ${PG_PORT}
          database: ${PG_DATABASE}
      duckdb:
        - name: duckdb-default
          path: duckdb.db
What’s Happening Here
- default_environment — sets the environment Bruin will use unless specified otherwise.
- environments — defines multiple setups (e.g., dev, staging, prod), each with its own configuration.
- connections — lists every system Bruin can connect to, like Postgres or DuckDB. Each connection gets a name (e.g., pg-default) that you’ll reference across pipelines and assets.
- Environment variable support — any value wrapped in ${...} is automatically read from your system environment, so you can keep credentials out of source control while still running locally or in CI/CD environments.
This design keeps everything centralized, secure, and version-controlled, while giving you the flexibility to inject secrets dynamically through environment variables — perfect for switching between local, staging, and production without touching the code.
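As a concrete illustration of that flexibility, here is a minimal sketch of what adding a second environment could look like. Only the environments/connections structure comes from the example above; the prod values and the PG_PROD_* variables are made up for illustration:

default_environment: default
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default
          path: duckdb.db # local file for development
  prod:
    connections:
      duckdb:
        - name: duckdb-default
          path: /data/warehouse.duckdb # hypothetical production path
      postgres:
        - name: pg-default
          username: ${PG_PROD_USER} # hypothetical prod variables
          password: ${PG_PROD_PASSWORD}
          host: ${PG_PROD_HOST}
          port: ${PG_PROD_PORT}
          database: ${PG_PROD_DATABASE}

Because both environments expose connections with the same names (duckdb-default, pg-default), the assets and pipelines don’t change when you switch environments; only the credentials behind those names do.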
2.2. Pipeline: A WAY EASIER Apache Airflow
Each feature (or domain) you build comes with its own pipeline.yml file.
This is the file that groups all your assets and tells Bruin it isn’t a single asset running, but a chained list of assets.
- ecommerce-mart/
├─ pipeline.yml -> you're here
└─ assets/
├─ some-asset.sql
├─ definitely-an-asset.yml
└─ another-asset.py
This is also where you configure the connections this specific pipeline should use:
name: product_ecommerce_marts
schedule: daily # relevant for Bruin Cloud deployments
default_connections:
duckdb: "duckdb-default"
postgres: "pg-default"
2.3. Assets: The Building Blocks of Data Products
Every data pipeline in Bruin is composed of assets — modular, self-contained units that define a specific operation: ingesting, transforming, or producing a dataset.
Each asset exists as a file under the assets/ directory, and its filename doubles as its identity inside the pipeline graph.
If you look back at the file structure from the beginning, you’ll notice the pipeline mixes multiple types of assets. That’s the coolest part: you can write assets in several languages and still keep things simple. Here are some possibilities:
| Type | Description | Filename (in the file tree) |
|---|---|---|
| YAML | Declarative configuration for ingestion or metadata-heavy assets | raw.customers.asset.yml |
| SQL | Pure transformation logic — think dbt-style models | stg.orders.sql |
| Python | Custom logic or integrations (e.g., APIs, validations, or machine learning steps) | mart.customers-by-age.asset.py |
You’re free to organize assets however you like — there’s no rigid hierarchy to follow.
The key insight is that the orchestration happens implicitly through dependencies, not through an external DAG engine like Airflow.
Each asset declares what it depends on, and Bruin automatically builds and executes the dependency graph for you.
Example:
- raw.orders.asset.yml
# raw.orders.asset.yml
name: raw.orders
type: ingestr
description: Ingest OLTP orders from Postgres into the DuckDB raw layer.
parameters:
source_connection: pg-default
source_table: "public.orders"
destination: duckdb
- raw.order_items.asset.yml
# raw.order_items.asset.yml
name: raw.order_items
type: ingestr
description: Ingest OLTP order_items from Postgres into the DuckDB raw layer.
depends:
- raw.orders # declares a dependency on the 'raw.orders' asset
parameters:
source_connection: pg-default
source_table: "public.order_items"
destination: duckdb
…which turns into this dependency graph:
graph TD
raw.orders --> raw.order_items;
By chaining assets like this, you describe logical relationships between data operations rather than manually orchestrating steps.
The result is a declarative, composable, and maintainable pipeline — easy to read, version, and extend just like application code.
One of the most powerful aspects of Bruin is how it connects data quality and governance directly into your assets.
By defining checks under each column, you’re not only validating your data but also documenting ownership, expectations, and constraints — all version-controlled and enforceable at runtime.
This means Bruin doesn’t just run pipelines — it audits, documents, and governs them as part of the same workflow.
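As a small sketch of what that looks like on an asset (the full ingestion example in section 3.1 uses the same pattern), the snippet below adds column checks and an owner. The owner value and the status column are illustrative assumptions of mine; the checks (not_null, unique) and the other fields are the ones used elsewhere in this article:

name: raw.orders
type: ingestr
description: Ingest OLTP orders from Postgres into the DuckDB raw layer.
owner: data-team@example.com # assumed to be the field the asset-has-owner policy (next section) looks for
columns:
  - name: id
    type: integer
    primary_key: true
    checks:
      - name: not_null
      - name: unique
  - name: status
    type: string
    checks:
      - name: not_null

The idea is that a failing check surfaces at run time, before bad data reaches the downstream layers.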
2.4. Policies: Enforcing Quality and Governance
Policies in Bruin act as the rulebook that keeps your data pipelines consistent, compliant, and high quality.
They ensure every asset and pipeline follows best practices — from naming conventions and ownership to validation and metadata completeness.
At their core, policies are defined in a single policy.yml file located at the root of your project.
This file lets you lint, validate, and enforce standards automatically before a pipeline runs.
Quick Overview
rulesets:
- name: standard
selector:
- path: .*/ecommerce/.*
rules:
- asset-has-owner
- asset-name-is-lowercase
- asset-has-description
Each ruleset defines:
- where the rule applies (selector → match by path, tag, or name),
- what to enforce (rules → built-in or custom validation rules).
Once defined, you can validate your entire project:
bruin validate ecommerce
# Validating pipelines in 'ecommerce' for 'default' environment...
# Pipeline: ecommerce_pg_to_duckdb (.)
# raw.order_items (assets/ingestion/raw.order_items.asset.yml)
# └── Asset must have an owner (policy:standard:asset-has-owner)
Bruin automatically lints assets before execution — ensuring that non-compliant pipelines never run.
Built-in and Custom Rules
| Rule | Target | Description |
|---|---|---|
| asset-has-owner | asset | Each asset must define an owner. |
| asset-has-description | asset | Assets must include a description. |
| asset-name-is-lowercase | asset | Asset names must be lowercase. |
| pipeline-has-retries | pipeline | Pipelines must define retry settings. |
You can also define your own rules:
custom_rules:
- name: asset-has-owner
description: every asset should have an owner
criteria: asset.Owner != ""
Rules can target either assets or pipelines, and they use logical expressions to determine compliance.
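For example, you could require every asset to carry a non-empty description. This sketch assumes the expression language exposes asset.Description the same way it exposes asset.Owner above (the built-in asset-has-description rule already covers this case, so it is purely illustrative):

custom_rules:
  - name: asset-description-is-not-empty
    description: every asset should include a human-readable description
    criteria: asset.Description != "" # assumes asset.Description is exposed like asset.Owner

Once defined, a custom rule is referenced by name from a ruleset’s rules list, the same way the rules shown earlier are.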
Policies transform Bruin into a self-governing data platform — one where best practices aren’t optional, they’re enforced.
By committing your rules to version control, you make data governance part of the development workflow, not an afterthought.
2.5. Glossary: Speaking the Same Language
In data projects, one of the hardest problems isn’t technical — it’s communication.
Different teams often use the same word to mean different things.
That’s where Bruin’s Glossary comes in.
A glossary is defined in glossary.yml at the root of your project.
It acts as a shared dictionary of business concepts (like Customer or Order) and their attributes, keeping teams aligned across pipelines.
entities:
Customer:
description: A registered user or business in our platform.
attributes:
ID:
type: integer
description: Unique customer identifier.
You can reference these definitions inside assets using extends, avoiding duplication and ensuring consistency:
# raw.customers.asset.yml
name: raw.customers
type: ingestr
columns:
- name: customer_id
extends: Customer.ID
This automatically inherits the type and description from the glossary.
It’s a simple idea, but a powerful one — your data definitions become version-controlled and shared, just like code.
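To see how the glossary grows with the project, here is a sketch that adds a second attribute to the same entity. The Email attribute is my own illustrative addition, not part of the example above, but it reuses only the entities/attributes structure already shown:

entities:
  Customer:
    description: A registered user or business in our platform.
    attributes:
      ID:
        type: integer
        description: Unique customer identifier.
      Email:
        type: string
        description: Primary contact email for the customer.

In raw.customers.asset.yml, the email column would then simply declare extends: Customer.Email instead of repeating its type and description.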
3. Building Our First Pipeline
Now that we’ve explored the structure and philosophy behind Bruin, it’s time to build an end-to-end pipeline.
We’ll go from raw ingestion to a clean staging layer, and finally, to analytics-ready marts — all defined as code.
We’ll assume you already have:
- A Postgres database as your data source.
- A DuckDB database as your analytical storage.
- A working .bruin.yml file configured with both connections.

3.1 Step 1: Ingest from Your Source to a Data Lake
The first step is to move data from Postgres into DuckDB.
This creates your Raw Layer — data replicated from the source with minimal transformation.
Create an ingestion asset file:
touch assets/ingestion/raw.customers.asset.yml
Then define the asset:
# assets/ingestion/raw.customers.asset.yml
name: raw.customers
type: ingestr
description: Ingest OLTP customers from Postgres into the DuckDB raw layer.
parameters:
source_connection: pg-default
source_table: "public.customers"
destination: duckdb
columns:
- name: id
type: integer
primary_key: true
checks:
- name: not_null
- name: unique
- name: email
type: string
checks:
- name: not_null
- name: unique
- name: country
type: string
checks:
- name: not_null
This tells Bruin to extract data from your Postgres table public.customers, validate column quality, and store it in the DuckDB raw layer.
Running the Asset
bruin run ecommerce/assets/ingestion/raw.customers.asset.yml
Expected output:
Analyzed the pipeline 'ecommerce_pg_to_duckdb' with 13 assets.
Running only the asset 'raw.customers'
Pipeline: ecommerce_pg_to_duckdb (../../..)
No issues found
✓ Successfully validated 13 assets across 1 pipeline, all good.
Interval: 2025-10-12T00:00:00Z - 2025-10-12T23:59:59Z
Starting the pipeline execution...
PASS raw.customers ........
bruin run completed successfully in 2.095s
✓ Assets executed 1 succeeded
You can now query the ingested data:
bruin query --connection duckdb-default --query "SELECT * FROM raw.customers LIMIT 5"
Result:
┌────┬───────────────────┬───────────────────────────┬───────────┬──────────────────┬──────────────────────────────────────┬──────────────────────────────────────┐
│ ID │ FULL_NAME │ EMAIL │ COUNTRY │ CITY │ CREATED_AT │ UPDATED_AT │
├────┼───────────────────┼───────────────────────────┼───────────┼──────────────────┼──────────────────────────────────────┼──────────────────────────────────────┤
│ 1 │ Allison Hill │ [email protected] │ Uganda │ New Roberttown │ 2025-10-10 18:19:13.083281 +0000 UTC │ 2025-10-10 00:42:59.71112 +0000 UTC │
│ 2 │ David Guzman │ [email protected] │ Cyprus │ Lawrencetown │ 2025-10-10 07:52:47.643619 +0000 UTC │ 2025-10-10 06:23:42.864287 +0000 UTC │
│ 3 │ Caitlin Henderson │ [email protected] │ Hong Kong │ West Melanieview │ 2025-10-10 21:06:02.639412 +0000 UTC │ 2025-10-10 19:23:17.540169 +0000 UTC │
│ 4 │ Monica Herrera │ [email protected] │ Niger │ Barbaraland │ 2025-10-11 01:33:43.032929 +0000 UTC │ 2025-10-10 02:29:27.22515 +0000 UTC │
│ 5 │ Darren Roberts │ [email protected] │ Fiji │ Reidstad │ 2025-10-10 12:05:18.734246 +0000 UTC │ 2025-10-10 00:51:13.406526 +0000 UTC │
└────┴───────────────────┴───────────────────────────┴───────────┴──────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘
Your raw layer is now established and validated.
3.2 Step 2: Formatting and Validating the Data (Staging Layer)
Next, we’ll clean and standardize the ingested data before using it in analytics.
This layer is called Staging (stg) — it’s where you enforce schem