@lktslionel
Created July 17, 2025 21:10
ELT vs ETL Data pipeline

The pipeline you've described is an ETL (Extract, Transform, Load) pipeline.

Here's why:

Let's break down your pipeline step by step (a short code sketch of this flow follows the list):

  1. Extract (E)

    • You get data from a source in an S3 bucket.
    • This is the extraction phase, where raw data is pulled from the source.
  2. Transform (T)

    • You normalize the data and write it out as Parquet, meaning you process and clean it before loading.
    • This is the transformation step. You're applying structure and potentially cleaning, joining, or reshaping the data.
  3. Load (L)

    • You then load the result to Snowflake.
    • This is the loading phase, where the already transformed data is inserted into the target system.
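
For concreteness, here is a minimal sketch of that ETL flow in Python. Everything named here is an assumption rather than part of your description: the bucket, file paths, table, and connection details are placeholders, the source is assumed to be CSV, and the code relies on boto3, pandas (with pyarrow), and snowflake-connector-python.

```python
# Minimal ETL sketch: Extract from S3, Transform to Parquet, Load into Snowflake.
# Bucket, key, table, and connection names are placeholders.
import os

import boto3
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# 1. Extract: pull the raw file out of S3.
s3 = boto3.client("s3")
s3.download_file("my-source-bucket", "raw/events.csv", "/tmp/events.csv")

# 2. Transform: normalize, then persist as Parquet (the intermediate format).
df = pd.read_csv("/tmp/events.csv")
df.columns = [c.strip().lower() for c in df.columns]  # example normalization
df = df.drop_duplicates()
df.to_parquet("/tmp/events.parquet", index=False)  # requires pyarrow

# 3. Load: push the already-transformed data into Snowflake.
#    (Alternatively, PUT the Parquet file to a stage and COPY INTO the table.)
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
write_pandas(conn, df, table_name="EVENTS")  # target table is assumed to exist
conn.close()
```

Because the DataFrame is already normalized before write_pandas runs, Snowflake only ever sees finished data, which is exactly what makes this ETL.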

Why it’s ETL and not ELT:

  • In ETL, the transformation happens before loading the data into the destination system (in your case, Snowflake). That’s what your pipeline does — it transforms data (normalization/parquet formatting) before loading.
  • In ELT (Extract, Load, Transform), raw data is first loaded into the destination (like Snowflake), and transformation is done within Snowflake using SQL or other processing tools.

In summary:

  • ✅ ETL: your transformation happens before loading into Snowflake.
  • ❌ Not ELT: you're not doing the transformation inside Snowflake.

Let me know if you'd like to discuss performance or cost implications of ETL vs ELT!

To convert your current ETL pipeline into an ELT pipeline, the key change is to move the transformation step into Snowflake instead of doing it before loading. Here's how you can restructure the pipeline:


🔄 Current ETL Pipeline (what you're doing now):

  1. Extract from S3
  2. Transform (normalize, convert to Parquet)
  3. Load transformed data into Snowflake

✅ Modified ELT Pipeline:

  1. Extract from S3

    • You still pull the data from S3.
  2. Load raw/untransformed data into Snowflake

    • Skip normalization/parquet formatting at this point.
    • Load the data as-is (usually CSV, JSON, or even raw Parquet) into a staging table in Snowflake.
  3. Transform the data within Snowflake

    • Use SQL scripts, Snowflake Streams, Tasks, or dbt (if you're using it) to:

      • Normalize
      • Cleanse
      • Join
      • Aggregate
    • Write the final output into transformed tables (often called "model" or "reporting" tables); a sketch of this load-then-transform flow follows the list.
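
As a rough sketch of that load-then-transform flow (still Python, still placeholder names): it assumes an external stage named raw_s3_stage already points at the S3 bucket, a staging table raw_events whose columns match the files, and a REPORTING schema for the output.

```python
# Minimal ELT sketch: land raw S3 files in a staging table, then transform in Snowflake.
# Stage, table, schema, and connection names are placeholders.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ELT_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# 2. Load: ingest the raw files as-is into a staging table.
#    raw_s3_stage is assumed to be an external stage pointing at the bucket.
cur.execute("""
    COPY INTO raw_events
    FROM @raw_s3_stage/events/
    FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")

# 3. Transform: do the normalization inside Snowflake with SQL.
cur.execute("""
    CREATE OR REPLACE TABLE ANALYTICS.REPORTING.EVENTS_CLEAN AS
    SELECT
        LOWER(TRIM(user_id))       AS user_id,
        TO_TIMESTAMP_NTZ(event_ts) AS event_ts,
        event_type
    FROM raw_events
    WHERE event_ts IS NOT NULL
""")

cur.close()
conn.close()
```

The raw data stays untouched in raw_events, so if the transformation logic changes you can rebuild EVENTS_CLEAN without re-extracting anything from S3.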


💡 Key Implementation Details:

  • Use Snowflake’s COPY INTO command to ingest files directly from S3 into a staging table.
  • Optionally register external S3 files as external tables in Snowflake.
  • Write transformation logic in SQL or dbt and automate it using Snowflake Tasks or orchestration tools like Airflow or Dagster (a Task example follows this list).
  • Use Parquet only if you want efficient columnar storage for initial ingest — but the transformation logic should still be in Snowflake to qualify as ELT.
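
To illustrate the automation point above with Snowflake Tasks (again a hedged sketch: the warehouse, schedule, and table names are placeholders, and the transformation reuses the SQL from the previous sketch):

```python
# Sketch: automate the in-Snowflake transformation with a scheduled Task.
# Warehouse, schedule, and table names are placeholders.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ELT_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Create (or replace) a task that rebuilds the reporting table every hour.
cur.execute("""
    CREATE OR REPLACE TASK REFRESH_EVENTS_CLEAN
        WAREHOUSE = ELT_WH
        SCHEDULE = '60 MINUTE'
    AS
        CREATE OR REPLACE TABLE ANALYTICS.REPORTING.EVENTS_CLEAN AS
        SELECT LOWER(TRIM(user_id))       AS user_id,
               TO_TIMESTAMP_NTZ(event_ts) AS event_ts,
               event_type
        FROM raw_events
        WHERE event_ts IS NOT NULL
""")

# Tasks are created suspended; resume the task so the schedule takes effect.
cur.execute("ALTER TASK REFRESH_EVENTS_CLEAN RESUME")

cur.close()
conn.close()
```

If you adopt dbt instead, the same SELECT would live in a model file and an orchestrator (Airflow, Dagster, or dbt Cloud) would handle the scheduling.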

✅ Benefits of ELT:

  • Scalability: Leverages Snowflake’s compute power for transformations.
  • Flexibility: Raw data remains available in Snowflake for reprocessing if logic changes.
  • Cost Optimization: Consolidates compute in Snowflake, potentially eliminating separate transformation infrastructure elsewhere.

Let me know if you want a sample SQL transformation workflow or a dbt model structure!
