The pipeline you've described is an ETL (Extract, Transform, Load) pipeline.
Let's break down your pipeline step by step:
**Extract (E)**
- You get data from a source in an S3 bucket.
- This is the extraction phase, where raw data is pulled from the source.
**Transform (T)**
- You normalize the data and convert it to Parquet, meaning you process and clean it before loading.
- This is the transformation step. You're applying structure and potentially cleaning, joining, or reshaping the data.
**Load (L)**
- You then load the result to Snowflake.
- This is the loading phase, where the already transformed data is inserted into the target system.
- In ETL, the transformation happens before the data is loaded into the destination system (in your case, Snowflake). That's exactly what your pipeline does: it transforms the data (normalization and Parquet formatting) before loading.
- In ELT (Extract, Load, Transform), raw data is first loaded into the destination (like Snowflake), and transformation is done within Snowflake using SQL or other processing tools.
✅ ETL: Because your transformation happens before loading into Snowflake.
❌ Not ELT: You're not doing transformation inside Snowflake.
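For concreteness, the load step of your current ETL setup might look roughly like this on the Snowflake side. This is only a sketch: the table, stage, and column names are hypothetical, and it assumes a stage already points at the S3 prefix where your Parquet output lands.

```sql
-- Hypothetical final table whose columns match the normalized Parquet schema.
CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id    NUMBER,
    customer_id NUMBER,
    order_date  DATE,
    amount      NUMBER(12,2)
);

-- ETL-style load: the files are already transformed, so Snowflake just ingests them.
-- '@etl_output_stage' is assumed to be an existing stage over your S3 output prefix.
COPY INTO analytics.orders
FROM @etl_output_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```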
Let me know if you'd like to discuss performance or cost implications of ETL vs ELT!
To convert your current ETL pipeline into an ELT pipeline, the key change is to move the transformation step into Snowflake instead of doing it before loading. Your current pipeline is:
- Extract from S3
- Transform (normalize, convert to Parquet)
- Load transformed data into Snowflake

Here's how to restructure it:
**1. Extract from S3**
- You still pull the data from S3.
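In this version, the extract boundary is thin: you mostly just point Snowflake at the S3 location. A minimal sketch, assuming a storage integration has already been configured by an admin (all names below are placeholders):

```sql
-- Hypothetical external stage over the raw S3 prefix.
-- 's3_raw_integration' is assumed to be an existing storage integration.
CREATE STAGE IF NOT EXISTS raw.landing_stage
  URL = 's3://my-raw-bucket/events/'
  STORAGE_INTEGRATION = s3_raw_integration
  FILE_FORMAT = (TYPE = JSON);
```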
**2. Load raw/untransformed data into Snowflake**
- Skip the normalization/Parquet formatting at this point.
- Load the data as-is (usually CSV, JSON, or even raw Parquet) into a staging table in Snowflake.
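A minimal sketch of that load, landing each raw record untouched in a VARIANT column (the staging table name and the stage are the hypothetical ones from the sketch above):

```sql
-- Hypothetical staging table that stores each raw record as-is.
CREATE TABLE IF NOT EXISTS raw.events_staging (
    raw_record VARIANT,
    loaded_at  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Ingest the files without transforming them; every record goes in verbatim.
COPY INTO raw.events_staging (raw_record)
FROM (SELECT $1 FROM @raw.landing_stage)
FILE_FORMAT = (TYPE = JSON);
```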
**3. Transform the data within Snowflake**
- Use SQL scripts, Snowflake Streams, Tasks, or dbt (if you're using it) to:
  - Normalize
  - Cleanse
  - Join
  - Aggregate
- Write the final output into transformed tables (often called "model" or "reporting" tables).
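As a rough sketch of what that in-warehouse transformation could look like, reusing the hypothetical staging table from above (the JSON paths and the analytics.orders target are assumptions about your data, not a known schema). In dbt, the SELECT below would live in a model; here it is a plain CREATE TABLE AS for illustration:

```sql
-- Rebuild a normalized reporting table from the raw staging data inside Snowflake.
CREATE OR REPLACE TABLE analytics.orders AS
SELECT
    raw_record:order_id::NUMBER      AS order_id,
    raw_record:customer.id::NUMBER   AS customer_id,
    raw_record:order_date::DATE      AS order_date,
    raw_record:amount::NUMBER(12,2)  AS amount,
    loaded_at
FROM raw.events_staging
WHERE raw_record:order_id IS NOT NULL;  -- basic cleansing: drop malformed records
```

Joins and aggregations would follow the same pattern, selecting from other staging tables into further model tables.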
**How to implement this:**
- Use Snowflake’s COPY INTO command to ingest files directly from S3 into a staging table.
- Optionally register external S3 files as external tables in Snowflake.
- Write transformation logic in SQL or dbt and automate it using Snowflake Tasks or orchestration tools like Airflow or Dagster (see the Task sketch after this list).
- Use Parquet only if you want efficient columnar storage for initial ingest — but the transformation logic should still be in Snowflake to qualify as ELT.
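If you want to automate the in-Snowflake step without an external orchestrator, a scheduled Task is one option. A sketch, with the warehouse name, schedule, and table names as placeholders:

```sql
-- Hypothetical task that rebuilds the reporting table every hour.
CREATE OR REPLACE TASK analytics.refresh_orders
  WAREHOUSE = transform_wh
  SCHEDULE  = '60 MINUTE'
AS
  CREATE OR REPLACE TABLE analytics.orders AS
  SELECT
      raw_record:order_id::NUMBER     AS order_id,
      raw_record:customer.id::NUMBER  AS customer_id,
      raw_record:order_date::DATE     AS order_date,
      raw_record:amount::NUMBER(12,2) AS amount
  FROM raw.events_staging;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK analytics.refresh_orders RESUME;
```

With Airflow or Dagster you would instead trigger the same SQL (or a dbt run) from the orchestrator, keeping scheduling and dependency logic outside Snowflake.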
**Benefits of this ELT approach:**
- Scalability: Leverages Snowflake's compute power for transformations.
- Flexibility: Raw data remains available in Snowflake for reprocessing if the transformation logic changes.
- Cost Optimization: Pushes compute into Snowflake, potentially reducing the need for separate transformation infrastructure.
Let me know if you want a sample SQL transformation workflow or a dbt model structure!