You can append data to a Parquet dataset in PyArrow by using the `pyarrow.parquet.write_to_dataset()` function and setting the `existing_data_behavior` parameter to `'overwrite_or_ignore'`.

This function writes new Parquet files to the dataset directory. If the data belongs to a new partition, a new partition directory is created. If it belongs to an existing partition, a new file is added to that partition's directory.
-----

## How It Works

The key is the `existing_data_behavior` argument, which tells PyArrow how to handle the new data in relation to any data already at the destination.
* `'error'` (Default): The operation will fail if any data already exists at the target location. This prevents accidental overwrites.
* `'overwrite_or_ignore'` (For Appending): This is the mode you need. It writes new Parquet files for the data you're writing and ignores the files that are already there. Because `write_to_dataset()` names its output files with a GUID-based template by default, the new files don't collide with existing ones, so the write effectively appends to the dataset.
* `'delete_matching'`: Before writing the new data, it will remove all existing data in partitions that the new data will be written to. This is useful for "upsert" or "replace" operations; see the sketch after this list.
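
A minimal sketch of that replace workflow, reusing the `dataset_path` and partition columns from the full example below; the corrected values here are hypothetical:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical corrected figures for a partition that was already written.
df_corrected = pd.DataFrame({
    'value': [31, 32],
    'year': [2024, 2024],
    'month': [1, 1]
})

# 'delete_matching' first removes everything under year=2024/month=1,
# then writes the new files, so the partition is replaced rather than appended to.
pq.write_to_dataset(
    pa.Table.from_pandas(df_corrected),
    root_path='./my_partitioned_dataset',
    partition_cols=['year', 'month'],
    existing_data_behavior='delete_matching'
)
```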
-----

## Code Example

Here’s a complete example showing how to create a partitioned dataset and then append new data to it.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the root directory for our dataset
dataset_path = './my_partitioned_dataset'

# --- 1. Create and write the initial dataset ---
print("Writing initial data...")

# Create some initial data using pandas
df_initial = pd.DataFrame({
    'value': [10, 20, 30],
    'year': [2023, 2023, 2024],
    'month': [12, 12, 1]
})

# Convert the pandas DataFrame to a PyArrow Table
table_initial = pa.Table.from_pandas(df_initial)

# Write the table to a partitioned Parquet dataset
pq.write_to_dataset(
    table_initial,
    root_path=dataset_path,
    partition_cols=['year', 'month']
)
print(f"Initial dataset created at '{dataset_path}'")

# You will now have a directory structure like:
# my_partitioned_dataset/
# ├── year=2023/
# │   └── month=12/
# │       └── [some_uuid].parquet
# └── year=2024/
#     └── month=1/
#         └── [some_uuid].parquet

# --- 2. Create new data to append ---
print("\nWriting new data to append...")
df_new = pd.DataFrame({
    'value': [40, 50, 60],
    'year': [2024, 2024, 2025],  # Adds to year=2024/month=1, creates year=2024/month=2 and year=2025/month=1
    'month': [1, 2, 1]
})
table_new = pa.Table.from_pandas(df_new)

# --- 3. Append the new data to the existing dataset ---
# The key is existing_data_behavior='overwrite_or_ignore'
pq.write_to_dataset(
    table_new,
    root_path=dataset_path,
    partition_cols=['year', 'month'],
    existing_data_behavior='overwrite_or_ignore'
)
print("Append operation complete.")

# --- 4. Verify the appended data ---
print("\nReading full dataset to verify...")

# Read the entire dataset back
full_dataset = pq.read_table(dataset_path)
print(full_dataset.to_pandas())

# Expected output:
#    value  year  month
# 0     10  2023     12
# 1     20  2023     12
# 2     30  2024      1
# 3     40  2024      1   <-- Appended data
# 4     50  2024      2   <-- Appended data
# 5     60  2025      1   <-- Appended data
```
-----

## Important Considerations

* **Schema Consistency**: The PyArrow Table you are appending **must have the exact same schema** (column names, data types, and order) as the existing dataset. A mismatch will either fail outright or leave you with a dataset that errors when read back; a pre-flight check is sketched below.
* **Small Files Problem**: Repeatedly appending small batches of data can lead to a large number of small Parquet files. This can significantly slow down read performance because query engines have to open and process many individual files. For optimal performance, it's often better to batch data into larger chunks before writing, or to run a periodic "compaction" job that rewrites the small files into larger ones (see the second sketch below). 👍
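
A minimal sketch of such a pre-flight check, reusing `dataset_path` and `table_new` from the example above. The reconciliation step via `select`/`cast` is one possible way to handle a mismatch, not a required part of the API:

```python
import pyarrow.dataset as ds

# The dataset's unified schema; the partition columns (year, month) are included
# because we declare the same hive-style partitioning the data was written with.
existing_schema = ds.dataset(dataset_path, partitioning="hive").schema

if not table_new.schema.equals(existing_schema):
    # Reorder the columns and cast the types to match the existing dataset
    # before appending; this raises if the data cannot be reconciled.
    table_new = table_new.select(existing_schema.names).cast(existing_schema)
```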
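
And a minimal sketch of a compaction pass, assuming the `dataset_path` from the example above and a hypothetical `compacted_path` to write into. Rewriting into a fresh directory and swapping it in afterwards avoids disturbing the live dataset mid-rewrite:

```python
import pyarrow.parquet as pq

# Hypothetical target directory for the rewritten dataset.
compacted_path = './my_partitioned_dataset_compacted'

# Read the whole partitioned dataset back (partition columns are restored
# from the directory names), then rewrite it in a single pass so each
# partition ends up with one larger file instead of many small ones.
table = pq.read_table(dataset_path)
pq.write_to_dataset(
    table,
    root_path=compacted_path,
    partition_cols=['year', 'month']
)
```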