You can append data to a Parquet dataset in PyArrow by using the `pyarrow.parquet.write_to_dataset()` function and setting the `existing_data_behavior` parameter to `'overwrite_or_append'`.
This function writes new Parquet files to the dataset directory. If the data belongs to a new partition, a new partition directory is created. If it belongs to an existing partition, a new file is added to that partition's directory.
-----
## How It Works
The key is the `existing_data_behavior` argument, which tells PyArrow how to handle the new data in relation to any data already at the destination.
* `'error'` (Default): The operation will fail if any data already exists at the target location. This prevents accidental overwrites.
* `'overwrite_or_append'` (For Appending): This is the mode you need. It will add new Parquet files for the data you're writing. It doesn't modify existing files but adds new ones, effectively appending to the dataset.
* `'delete_matching'`: Before writing the new data, it will remove all existing data in partitions that the new data will be written to. This is useful for "upsert" or "replace" operations.
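
As a contrast to appending, here is a minimal sketch of `'delete_matching'` used to replace the contents of a partition. It reuses the dataset path and columns from the example in the next section, so the exact data is illustrative:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Corrected figures for January 2024: instead of adding another file next to
# the old data, wipe the year=2024/month=1 partition and write fresh files.
df_fix = pd.DataFrame({
    'value': [35],
    'year': [2024],
    'month': [1]
})

pq.write_to_dataset(
    pa.Table.from_pandas(df_fix),
    root_path='./my_partitioned_dataset',      # same dataset root as the example below
    partition_cols=['year', 'month'],
    existing_data_behavior='delete_matching'   # removes existing files in matched partitions first
)
```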
-----
## Code Example
Here’s a complete example showing how to create a partitioned dataset and then append new data to it.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the root directory for our dataset
dataset_path = './my_partitioned_dataset'
# --- 1. Create and write the initial dataset ---
print("Writing initial data...")
# Create some initial data using pandas
df_initial = pd.DataFrame({
    'value': [10, 20, 30],
    'year': [2023, 2023, 2024],
    'month': [12, 12, 1]
})
# Convert pandas DataFrame to a PyArrow Table
table_initial = pa.Table.from_pandas(df_initial)
# Write the table to a partitioned Parquet dataset
pq.write_to_dataset(
    table_initial,
    root_path=dataset_path,
    partition_cols=['year', 'month']
)
print(f"Initial dataset created at '{dataset_path}'")
# You will now have a directory structure like:
# my_partitioned_dataset/
# ├── year=2023/
# │   └── month=12/
# │       └── [some_uuid].parquet
# └── year=2024/
#     └── month=1/
#         └── [some_uuid].parquet
# --- 2. Create new data to append ---
print("\nWriting new data to append...")
df_new = pd.DataFrame({
    'value': [40, 50, 60],
    # Adds to the existing year=2024/month=1 partition and creates new
    # year=2024/month=2 and year=2025/month=1 partitions
    'year': [2024, 2024, 2025],
    'month': [1, 2, 1]
})
table_new = pa.Table.from_pandas(df_new)
# --- 3. Append the new data to the existing dataset ---
# The key is existing_data_behavior='overwrite_or_append'
pq.write_to_dataset(
    table_new,
    root_path=dataset_path,
    partition_cols=['year', 'month'],
    existing_data_behavior='overwrite_or_append'
)
print("Append operation complete.")
# --- 4. Verify the appended data ---
print("\nReading full dataset to verify...")
# Read the entire dataset back
full_dataset = pq.read_table(dataset_path)
print(full_dataset.to_pandas())
# Expected output:
#    value  year  month
# 0     10  2023     12
# 1     20  2023     12
# 2     30  2024      1
# 3     40  2024      1   <-- appended data
# 4     50  2024      2   <-- appended data
# 5     60  2025      1   <-- appended data
```
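
If you only need part of the data back, `pq.read_table()` also accepts a `filters` argument, which prunes whole partition directories instead of scanning everything. A small sketch, assuming the same `dataset_path` as above:

```python
import pyarrow.parquet as pq

dataset_path = './my_partitioned_dataset'

# Read only the year=2024 partitions; the other directories are never opened.
table_2024 = pq.read_table(dataset_path, filters=[('year', '=', 2024)])
print(table_2024.to_pandas())
```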
-----
## Important Considerations
* **Schema Consistency**: The PyArrow Table you are appending **must have the same schema** (column names and data types) as the existing dataset. `write_to_dataset()` does not check the new data against what is already on disk, so a mismatch usually only surfaces as an error when the dataset is read back.
* **Small Files Problem**: Repeatedly appending small batches of data can lead to a large number of small Parquet files. This can significantly slow down read performance because query engines have to open and process many individual files. For optimal performance, it's often better to batch data into larger chunks before writing or to run a periodic "compaction" job that rewrites smaller files into larger ones. 👍
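
As a rough sketch of the compaction idea, and only workable while the dataset still fits in memory, you can read everything back and rewrite it with `'delete_matching'` so each partition ends up with a single consolidated file:

```python
import pyarrow.parquet as pq

dataset_path = './my_partitioned_dataset'

# Materialize the whole dataset in memory (fine for small datasets only).
full_table = pq.read_table(dataset_path)

# Rewrite every partition in one pass; 'delete_matching' deletes the old
# small files in each partition before the consolidated file is written.
pq.write_to_dataset(
    full_table,
    root_path=dataset_path,
    partition_cols=['year', 'month'],
    existing_data_behavior='delete_matching'
)
```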