You can append data to a Parquet dataset in PyArrow by using the `pyarrow.parquet.write_to_dataset()` function and setting the `existing_data_behavior` parameter to `'overwrite_or_ignore'`.

This function writes new Parquet files to the dataset directory. If the data belongs to a new partition, a new partition directory is created. If it belongs to an existing partition, a new file is added to that partition's directory.
-----

## How It Works

The key is the `existing_data_behavior` argument, which tells PyArrow how to handle the new data in relation to any data already at the destination.
* `'error'` (Default): The operation will fail if any data already exists at the target location. This prevents accidental overwrites.
* `'overwrite_or_ignore'` (For Appending): This is the mode you need. It writes new Parquet files for the data you're writing and ignores the files that are already there. Because `write_to_dataset()` names its output files with a GUID-based template by default, the new files don't collide with existing ones, so the write effectively appends to the dataset.
* `'delete_matching'`: Before writing the new data, it will remove all existing data in partitions that the new data will be written to. This is useful for "upsert" or "replace" operations; see the sketch after this list.
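
A minimal sketch of that replace workflow, reusing the `dataset_path` and partition columns from the full example below; the corrected values here are hypothetical:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical corrected figures for a partition that was already written.
df_corrected = pd.DataFrame({
    'value': [31, 32],
    'year': [2024, 2024],
    'month': [1, 1]
})

# 'delete_matching' first removes everything under year=2024/month=1,
# then writes the new files, so the partition is replaced rather than appended to.
pq.write_to_dataset(
    pa.Table.from_pandas(df_corrected),
    root_path='./my_partitioned_dataset',
    partition_cols=['year', 'month'],
    existing_data_behavior='delete_matching'
)
```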
-----

## Code Example

Here’s a complete example showing how to create a partitioned dataset and then append new data to it.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Define the root directory for our dataset
dataset_path = './my_partitioned_dataset'

# --- 1. Create and write the initial dataset ---
print("Writing initial data...")

# Create some initial data using pandas
df_initial = pd.DataFrame({
    'value': [10, 20, 30],
    'year': [2023, 2023, 2024],
    'month': [12, 12, 1]
})

# Convert the pandas DataFrame to a PyArrow Table
table_initial = pa.Table.from_pandas(df_initial)

# Write the table to a partitioned Parquet dataset
pq.write_to_dataset(
    table_initial,
    root_path=dataset_path,
    partition_cols=['year', 'month']
)
print(f"Initial dataset created at '{dataset_path}'")

# You will now have a directory structure like:
# my_partitioned_dataset/
# ├── year=2023/
# │   └── month=12/
# │       └── [some_uuid].parquet
# └── year=2024/
#     └── month=1/
#         └── [some_uuid].parquet

# --- 2. Create new data to append ---
print("\nWriting new data to append...")
df_new = pd.DataFrame({
    'value': [40, 50, 60],
    'year': [2024, 2024, 2025],  # Adds to year=2024/month=1, creates year=2024/month=2 and year=2025/month=1
    'month': [1, 2, 1]
})
table_new = pa.Table.from_pandas(df_new)

# --- 3. Append the new data to the existing dataset ---
# The key is existing_data_behavior='overwrite_or_ignore'
pq.write_to_dataset(
    table_new,
    root_path=dataset_path,
    partition_cols=['year', 'month'],
    existing_data_behavior='overwrite_or_ignore'
)
print("Append operation complete.")

# --- 4. Verify the appended data ---
print("\nReading full dataset to verify...")

# Read the entire dataset back
full_dataset = pq.read_table(dataset_path)
print(full_dataset.to_pandas())

# Expected output:
#    value  year  month
# 0     10  2023     12
# 1     20  2023     12
# 2     30  2024      1
# 3     40  2024      1   <-- Appended data
# 4     50  2024      2   <-- Appended data
# 5     60  2025      1   <-- Appended data
```
-----

## Important Considerations

* **Schema Consistency**: The PyArrow Table you are appending **must have the exact same schema** (column names, data types, and order) as the existing dataset. A mismatch will either fail outright or leave you with a dataset that errors when read back; a pre-flight check is sketched below.
* **Small Files Problem**: Repeatedly appending small batches of data can lead to a large number of small Parquet files. This can significantly slow down read performance because query engines have to open and process many individual files. For optimal performance, it's often better to batch data into larger chunks before writing, or to run a periodic "compaction" job that rewrites the small files into larger ones (see the second sketch below). 👍
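
A minimal sketch of such a pre-flight check, reusing `dataset_path` and `table_new` from the example above. The reconciliation step via `select`/`cast` is one possible way to handle a mismatch, not a required part of the API:

```python
import pyarrow.dataset as ds

# The dataset's unified schema; the partition columns (year, month) are included
# because we declare the same hive-style partitioning the data was written with.
existing_schema = ds.dataset(dataset_path, partitioning="hive").schema

if not table_new.schema.equals(existing_schema):
    # Reorder the columns and cast the types to match the existing dataset
    # before appending; this raises if the data cannot be reconciled.
    table_new = table_new.select(existing_schema.names).cast(existing_schema)
```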
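
And a minimal sketch of a compaction pass, assuming the `dataset_path` from the example above and a hypothetical `compacted_path` to write into. Rewriting into a fresh directory and swapping it in afterwards avoids disturbing the live dataset mid-rewrite:

```python
import pyarrow.parquet as pq

# Hypothetical target directory for the rewritten dataset.
compacted_path = './my_partitioned_dataset_compacted'

# Read the whole partitioned dataset back (partition columns are restored
# from the directory names), then rewrite it in a single pass so each
# partition ends up with one larger file instead of many small ones.
table = pq.read_table(dataset_path)
pq.write_to_dataset(
    table,
    root_path=compacted_path,
    partition_cols=['year', 'month']
)
```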