This document describes how the Gondola Points Explorer filters, cleans, and bounds hotel loyalty point redemption data across its extract, preprocess, and visualization stages.
The Points Explorer computes cents-per-point (CPP) redemption values for hotel loyalty programs by comparing cash rates to points rates at the same property and date. To produce meaningful distributions, the pipeline applies filters at three stages: data extraction, preprocessing, and visualization.
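The document does not spell out the CPP formula itself, so the helper below assumes the conventional definition (cash value of the stay in cents, divided by the points the same stay would cost); the function name `cents_per_point` is illustrative:

```python
def cents_per_point(cash_cost_usd: float, points_cost: float) -> float:
    """Redemption value in cents per point (CPP).

    Assumes the conventional definition: cash value in cents divided by
    the number of points required for the same property and date.
    """
    if points_cost <= 0:
        raise ValueError("points_cost must be positive")
    return (cash_cost_usd * 100) / points_cost

# A $250 night that costs 25,000 points redeems at 1.0 cpp.
print(cents_per_point(250.0, 25_000))  # -> 1.0
```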
- Null exclusion: The SQL query requires `points_cost IS NOT NULL`, `cash_cost IS NOT NULL`, and `cash_currency IS NOT NULL`.
- Freshness: Only records with `created_at >= '2025-01-01'` and `search_date > {min_search_date}` are fetched.
- Deduplication at source: A CTE selects only the latest `created_at` per `(vervotech_property_id, search_date)`, preventing stale data from inflating the dataset.
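The dedup-at-source CTE can be sketched with an in-memory SQLite table; the table name `hotel_rates` and the exact column set are assumptions here, and the real query likely differs:

```python
import sqlite3

# Illustrative table; column names follow the document, data is made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hotel_rates (
        vervotech_property_id TEXT, search_date TEXT,
        created_at TEXT, points_cost REAL, cash_cost REAL
    )
""")
conn.executemany(
    "INSERT INTO hotel_rates VALUES (?, ?, ?, ?, ?)",
    [
        ("P1", "2025-06-01", "2025-03-01", 20000, 180.0),  # stale scrape
        ("P1", "2025-06-01", "2025-03-05", 20000, 175.0),  # latest wins
        ("P2", "2025-06-01", "2025-03-02", 50000, 400.0),
    ],
)

# Keep only the freshest created_at per (property, search_date).
rows = conn.execute("""
    WITH latest AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY vervotech_property_id, search_date
            ORDER BY created_at DESC
        ) AS rn
        FROM hotel_rates
    )
    SELECT vervotech_property_id, search_date, cash_cost
    FROM latest WHERE rn = 1
""").fetchall()
```

Window functions require SQLite 3.25+, which ships with all recent Python builds.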
- Zero cash cost removal: Rows where `cash_cost == 0` are dropped before computing redemption values, since division by zero or free-night rates produces meaningless CPP.
- Null points cost after merge: When merging refundable rates (which lack their own points cost) with the base points data, any rows that fail to match (leaving `points_cost` as `None`) are dropped.
All cash values are converted to USD using exchange rates from the database. Properties with unmappable currencies are excluded (the conversion returns `None`).
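The conversion step can be sketched as follows; the rate table and the `to_usd` helper are illustrative, not the database's actual exchange rates or the pipeline's real function:

```python
# Hypothetical rate table standing in for the database lookup.
USD_RATES = {"USD": 1.0, "EUR": 1.08, "JPY": 0.0067}

def to_usd(amount, currency):
    """Convert a cash amount to USD; return None for an unmappable
    currency so the caller can drop the row."""
    rate = USD_RATES.get(currency)
    return None if rate is None else amount * rate

records = [(100.0, "EUR"), (15000.0, "JPY"), (500.0, "XXX")]
converted = [to_usd(amount, ccy) for amount, ccy in records]
# Rows that convert to None (unmappable currency) are excluded downstream.
kept = [value for value in converted if value is not None]
```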
Rows with a NULL `hotel_reward_task_id` and `is_refundable != True` are removed. These "phantom rows" originate from a non-standard ingestion path, exhibit trailing whitespace in rate names, and land disproportionately on Sundays, all signs of data quality issues. Travelport-sourced refundable rows (null task ID but `is_refundable == True`) are explicitly preserved.
Any row with `search_date <= run_date` is dropped. The pipeline only analyzes future availability.
Rows with `brand_name` values not recognized in the `CHAIN_TO_BRANDS` mapping are filtered out, preventing unknown or miscategorized brands from polluting chain-level statistics.
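Taken together, the three preprocessing filters above might look like the sketch below; the frame's values are made up, and `VALID_BRANDS` stands in for the brand set derived from `CHAIN_TO_BRANDS`:

```python
import pandas as pd

# Illustrative frame; column names follow the document, data is made up.
df = pd.DataFrame({
    "hotel_reward_task_id": [None, "t1", None, "t2"],
    "is_refundable": [False, False, True, True],
    "search_date": pd.to_datetime(
        ["2025-05-01", "2025-07-01", "2025-07-02", "2025-07-03"]
    ),
    "brand_name": ["Hyatt Place", "Unknown Inn", "Courtyard", "Hilton Garden Inn"],
})
run_date = pd.Timestamp("2025-06-01")
# Assumed flattening of the CHAIN_TO_BRANDS mapping into one brand set.
VALID_BRANDS = {"Hyatt Place", "Courtyard", "Hilton Garden Inn"}

# Phantom rows: null task id AND not refundable (Travelport refundables kept).
phantom = df["hotel_reward_task_id"].isna() & (df["is_refundable"] != True)
df = df[~phantom]
# Only future availability.
df = df[df["search_date"] > run_date]
# Drop brands not present in the mapping.
df = df[df["brand_name"].isin(VALID_BRANDS)]
```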
The pipeline deduplicates to one row per `(vervotech_property_id, search_date, is_refundable)`, keeping the row with the lowest redemption value (cheapest CPP). This ensures:
- Each property contributes at most one data point per date per refundable status.
- Duplicate rate entries from overlapping scrapes don't inflate counts.
- The cheapest available rate is what travelers would actually book.
For the "All" (non-refundable-filtered) view, an additional re-dedup collapses to one row per `(vervotech_property_id, search_date)`, again keeping the cheapest. This prevents `NaN`-refundable rows from inflating the count and skewing the median upward.
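The two-level keep-cheapest deduplication can be sketched in pandas (toy data; a stable sort followed by `drop_duplicates` keeps the first, i.e. cheapest, row per key):

```python
import pandas as pd

df = pd.DataFrame({
    "vervotech_property_id": ["P1", "P1", "P1", "P2"],
    "search_date": ["2025-07-01", "2025-07-01", "2025-07-01", "2025-07-01"],
    "is_refundable": [False, False, True, False],
    "redemption_value": [1.2, 0.9, 1.5, 2.0],
})

# One row per (property, date, refundable-status), cheapest CPP wins.
deduped = (
    df.sort_values("redemption_value")
      .drop_duplicates(["vervotech_property_id", "search_date", "is_refundable"])
)

# "All" view: collapse again to one row per (property, date).
all_view = deduped.sort_values("redemption_value").drop_duplicates(
    ["vervotech_property_id", "search_date"]
)
```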
Trailing whitespace in `supplier_rate_name` is stripped to prevent near-duplicate rates from appearing as separate entries in rate-level breakdowns.
The primary outlier exclusion for chart rendering applies two simultaneous bounds:
```python
df_filtered = value_counts[
    (value_counts["percentile"] <= 0.999)       # Exclude top 0.1% of values
    & (value_counts["redemption_value"] <= 10)  # Hard cap: 10 cents per point
]
```

Redemption values beyond the 99.9th percentile are excluded. These represent extreme outliers, typically caused by:
- Data entry errors in cash or points pricing
- Promotional rates with artificially low points costs
- Currency conversion anomalies
- Properties with unusual tax/fee structures that inflate the apparent cash value
A hard ceiling of 10 cents per point removes any redemption value that would be unrealistically high for any major hotel loyalty program. For context, even the best hotel point redemptions rarely exceed 3-5 cpp, so 10 cpp provides a generous buffer while still excluding clearly erroneous data.
There is no explicit lower bound beyond the zero-cash-cost filter applied during extraction. Very low but non-zero CPP values (e.g., 0.01 cpp) pass through. These typically represent legitimate — if poor — redemptions at budget properties or during high-availability periods.
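The document does not show how the `percentile` column is derived; the sketch below assumes a cumulative rank (`rank(pct=True)`), which is enough to reproduce the two-bound filter on toy data:

```python
import pandas as pd

# Toy distribution with one absurd value (55 cpp).
value_counts = pd.DataFrame(
    {"redemption_value": [0.5, 0.8, 1.1, 1.4, 2.0, 55.0]}
)
# Assumed derivation: cumulative rank as a fraction of all rows.
value_counts["percentile"] = value_counts["redemption_value"].rank(pct=True)

# Both bounds applied at once, as in the chart-rendering filter.
df_filtered = value_counts[
    (value_counts["percentile"] <= 0.999)
    & (value_counts["redemption_value"] <= 10)
]
```

Very low values such as 0.5 cpp pass through, matching the absence of a lower bound.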
For each filter combination, the pipeline computes and stores:
| Statistic | Description |
|---|---|
| Count | Total number of rate observations |
| Median CPP | 50th percentile — the "typical" redemption value |
| P25 CPP | 25th percentile — lower quartile |
| P75 CPP | 75th percentile — upper quartile |
| P90 CPP | 90th percentile — "good" redemption threshold |
| Mean CPP | Arithmetic mean (more sensitive to outliers than median) |
These are computed before the visualization-stage percentile/cap filters, on the full preprocessed dataset.
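The table above maps directly onto pandas aggregations; the series below is illustrative, not real pipeline output:

```python
import pandas as pd

# Toy CPP observations for one filter combination.
cpp = pd.Series([0.4, 0.7, 0.9, 1.1, 1.3, 1.8, 2.5, 6.0])

stats = {
    "count": int(cpp.count()),
    "median_cpp": cpp.median(),        # 50th percentile
    "p25_cpp": cpp.quantile(0.25),     # lower quartile
    "p75_cpp": cpp.quantile(0.75),     # upper quartile
    "p90_cpp": cpp.quantile(0.90),     # "good" redemption threshold
    "mean_cpp": cpp.mean(),            # pulled up by the 6.0 outlier
}
```

Note how the 6.0 outlier lifts the mean well above the median, which is why the median is reported as the "typical" value.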
| Stage | Filter | Purpose |
|---|---|---|
| Extract | `points_cost IS NOT NULL` | Exclude properties without points pricing |
| Extract | `cash_cost IS NOT NULL` | Exclude properties without cash pricing |
| Extract | `cash_cost != 0` | Prevent division-by-zero in CPP calculation |
| Extract | Latest `created_at` per date | Deduplicate stale scrape data |
| Extract | Null points after refundable merge | Drop unmatched refundable rates |
| Preprocess | Phantom row filter | Remove non-standard ingestion artifacts |
| Preprocess | `search_date > run_date` | Only include future availability |
| Preprocess | Brand validation | Exclude unrecognized brands |
| Preprocess | 1 row per (property, date, refundable) | Deduplicate, keep cheapest |
| Preprocess | Re-dedup for "All" view | 1 row per (property, date), keep cheapest |
| Visualization | `percentile <= 0.999` | Trim top 0.1% extreme values |
| Visualization | `redemption_value <= 10` | Hard cap at 10 cents per point |