
@lalitsingh24x7
Created November 14, 2024 06:04
partitionBy
1. Partitioned Writes:
# Write the DataFrame partitioned by Product and Date directly to S3
df.write.mode("overwrite").partitionBy("Product", "Date").csv(output_s3_base_path, header=True)
Alternative: keep Product as a regular data column and partition only by Date.
# Select the relevant columns, keeping Product as a column
df = df.select("Product", "Date", "Amount")
# Write the DataFrame partitioned only by Date
df.write.mode("overwrite").partitionBy("Date").csv(output_s3_base_path, header=True)
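For context, a self-contained sketch of the first write (partitionBy("Product", "Date")) that can run locally; the toy DataFrame and the /tmp path are illustrative stand-ins for the real sales data and output_s3_base_path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write-demo").getOrCreate()

# Toy rows standing in for the real sales DataFrame
df = spark.createDataFrame(
    [("Book", "2024-11-14", 120.0), ("Pen", "2024-11-14", 15.0)],
    ["Product", "Date", "Amount"],
)

output_s3_base_path = "/tmp/partitioned_sales"  # illustrative; use an s3:// URI on Glue/EMR

# Each distinct (Product, Date) pair becomes a directory such as
# Product=Book/Date=2024-11-14/; Spark drops the partition columns from the CSV files themselves.
df.write.mode("overwrite").partitionBy("Product", "Date").csv(output_s3_base_path, header=True)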
2. Repartitioning for Parallelism:
# Repartition by Product and Date to ensure parallel processing
repartitioned_df = df.repartition("Product", "Date")
# Write the repartitioned DataFrame
repartitioned_df.write.mode("overwrite").partitionBy("Product", "Date").csv(output_s3_base_path, header=True)
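If finer control over parallelism is needed, repartition also accepts an explicit partition count along with the columns; a short sketch, with the count of 8 purely illustrative:
# Hash-partition by Product and Date into a fixed number of partitions
repartitioned_df = df.repartition(8, "Product", "Date")

# Inspect how many tasks will write in parallel
print(repartitioned_df.rdd.getNumPartitions())  # prints 8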
3. Dynamic Frame Conversion (AWS Glue):
from awsglue.dynamicframe import DynamicFrame

# Convert the DataFrame to a DynamicFrame
dynamic_df = DynamicFrame.fromDF(df, glueContext, "dynamic_df")

# Write the DynamicFrame partitioned by Product and Date
glueContext.write_dynamic_frame.from_options(
    frame=dynamic_df,
    connection_type="s3",
    connection_options={
        "path": output_s3_base_path,
        "partitionKeys": ["Product", "Date"]
    },
    format="csv"
)
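The snippet above assumes glueContext already exists. A minimal sketch of the standard Glue job boilerplate that provides it (the JOB_NAME argument is the usual default, shown for illustration):
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... build df with spark, convert with DynamicFrame.fromDF, and write as shown above ...

job.commit()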