@skyler1537
Created March 4, 2026 05:37

Points Explorer: Data Pipeline & Outlier Filtering Methodology

This document describes how the Gondola Points Explorer filters, cleans, and bounds hotel loyalty point redemption data across its extract, preprocess, and visualization stages.

Overview

The Points Explorer computes cents-per-point (CPP) redemption values for hotel loyalty programs by comparing cash rates to points rates at the same property and date. To produce meaningful distributions, the pipeline applies filters at three stages: data extraction, preprocessing, and visualization.


Stage 1: Data Extraction (data_loader.py)

SQL-Level Filtering

  • Null exclusion: The SQL query requires points_cost IS NOT NULL, cash_cost IS NOT NULL, and cash_currency IS NOT NULL.
  • Freshness: Only records with created_at >= '2025-01-01' and search_date > {min_search_date} are fetched.
  • Deduplication at source: A CTE selects only the latest created_at per (vervotech_property_id, search_date), preventing stale data from inflating the dataset.
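The dedup-at-source step can be sketched in pandas as an in-memory equivalent of the SQL CTE (the sample rows below are made up; only the column names follow the document):

```python
import pandas as pd

# Hypothetical sample mimicking two scrapes of the same property/date.
rows = pd.DataFrame({
    "vervotech_property_id": [101, 101, 102],
    "search_date": ["2026-06-01", "2026-06-01", "2026-06-01"],
    "created_at": ["2026-03-01", "2026-03-03", "2026-03-02"],
    "points_cost": [50000, 45000, 30000],
})

# Equivalent of the SQL CTE: keep only the latest created_at
# per (vervotech_property_id, search_date).
latest = (
    rows.sort_values("created_at")
        .drop_duplicates(["vervotech_property_id", "search_date"], keep="last")
)
```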

Post-Fetch Filtering

  • Zero cash cost removal: Rows where cash_cost == 0 are dropped before computing redemption values, since division by zero and free-night rates both produce meaningless CPP values.
  • Null points cost after merge: When merging refundable rates (which lack their own points cost) with the base points data, any rows that fail to match — leaving points_cost as None — are dropped.
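Both post-fetch drops reduce to simple boolean masks. A minimal sketch (hypothetical frame standing in for the post-merge data):

```python
import pandas as pd

# Hypothetical frame after the refundable-rate merge; column names
# mirror the document, values are illustrative.
df = pd.DataFrame({
    "cash_cost": [200.0, 0.0, 150.0],
    "points_cost": [50000, 40000, None],
})

# Drop free-night (zero-cash) rates, which yield meaningless CPP.
df = df[df["cash_cost"] != 0]
# Drop rows whose refundable merge left points_cost unmatched.
df = df[df["points_cost"].notna()]
```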

Currency Normalization

All cash values are converted to USD using exchange rates from the database. Properties with unmappable currencies are excluded (the conversion returns None).
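The conversion-plus-CPP arithmetic can be sketched as follows. The rate table and function names here are hypothetical (the real rates come from the database); only the None-on-unmappable-currency behavior is from the document:

```python
# Hypothetical exchange-rate lookup; the real pipeline reads rates
# from the database.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08}

def to_usd(amount, currency):
    """Convert to USD; return None for unmappable currencies."""
    rate = RATES_TO_USD.get(currency)
    return None if rate is None else amount * rate

def cpp(cash, currency, points_cost):
    """Cents-per-point: (cash in USD * 100) / points_cost."""
    usd = to_usd(cash, currency)
    return None if usd is None else usd * 100 / points_cost
```

A $300 cash rate against a 50,000-point rate works out to 0.6 cpp; a property priced in an unmapped currency drops out as None.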


Stage 2: Preprocessing (preprocess_data.py)

Phantom Row Filtering

Rows with a NULL hotel_reward_task_id and is_refundable != True are removed. These "phantom rows" originate from a non-standard ingestion path, exhibit trailing whitespace in rate names, and land disproportionately on Sundays — all signs of data quality issues. Travelport-sourced refundable rows (null task ID but is_refundable == True) are explicitly preserved.
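The keep/drop logic above is a single mask: null task ID and not refundable is phantom; null task ID with is_refundable == True (Travelport) survives. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows: phantom, Travelport refundable, and standard.
df = pd.DataFrame({
    "hotel_reward_task_id": [None, None, "t1"],
    "is_refundable": [False, True, False],
})

# Phantom = null task ID AND not refundable; Travelport refundable
# rows (null task ID, is_refundable == True) are preserved.
phantom = df["hotel_reward_task_id"].isna() & (df["is_refundable"] != True)
df = df[~phantom]
```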

Past-Date Exclusion

Any row with search_date <= run_date is dropped. The pipeline only analyzes future availability.

Brand Validation

Rows with brand_name values not recognized in the CHAIN_TO_BRANDS mapping are filtered out, preventing unknown or miscategorized brands from polluting chain-level statistics.
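As a sketch, brand validation is an isin check against the flattened mapping. The mapping contents below are invented for illustration; only the name CHAIN_TO_BRANDS is from the document:

```python
import pandas as pd

# Hypothetical mapping; the real CHAIN_TO_BRANDS lives in the pipeline code.
CHAIN_TO_BRANDS = {"Marriott": ["Courtyard", "Westin"], "Hilton": ["DoubleTree"]}
KNOWN_BRANDS = {b for brands in CHAIN_TO_BRANDS.values() for b in brands}

df = pd.DataFrame({"brand_name": ["Courtyard", "MysteryInn", "DoubleTree"]})
# Keep only rows whose brand is recognized in the mapping.
df = df[df["brand_name"].isin(KNOWN_BRANDS)]
```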

Deduplication Strategy

The pipeline deduplicates to one row per (vervotech_property_id, search_date, is_refundable), keeping the row with the lowest redemption value (cheapest CPP). This ensures:

  • Each property contributes at most one data point per date per refundable status.
  • Duplicate rate entries from overlapping scrapes don't inflate counts.
  • The cheapest available rate is what travelers would actually book.

For the "All" (non-refundable-filtered) view, an additional re-dedup collapses to one row per (vervotech_property_id, search_date), again keeping the cheapest. This prevents NaN-refundable rows from inflating the count and skewing the median upward.
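Both dedup passes are a sort-then-drop-duplicates in pandas. A minimal sketch (one hypothetical property with three rates on one date):

```python
import pandas as pd

df = pd.DataFrame({
    "vervotech_property_id": [1, 1, 1],
    "search_date": ["2026-06-01"] * 3,
    "is_refundable": [False, False, True],
    "redemption_value": [0.8, 0.6, 0.9],
})

# Pass 1: one row per (property, date, refundable), cheapest CPP wins.
deduped = (
    df.sort_values("redemption_value")
      .drop_duplicates(["vervotech_property_id", "search_date", "is_refundable"])
)

# Pass 2 ("All" view): collapse further to one row per (property, date).
all_view = deduped.sort_values("redemption_value").drop_duplicates(
    ["vervotech_property_id", "search_date"]
)
```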

Rate Name Normalization

Trailing whitespace in supplier_rate_name is stripped to prevent near-duplicate rates from appearing as separate entries in rate-level breakdowns.


Stage 3: Visualization Outlier Bounds (app.py)

The primary outlier exclusion for chart rendering applies two simultaneous bounds:

df_filtered = value_counts[
    (value_counts["percentile"] <= 0.999) &      # Exclude top 0.1% of values
    (value_counts["redemption_value"] <= 10)      # Hard cap: 10 cents per point
]
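One way the percentile column in that snippet could be derived is as the empirical CDF rank of each value. This is an assumption for illustration (app.py may compute it differently); the two-bound filter itself is verbatim from the document:

```python
import pandas as pd

values = pd.DataFrame({"redemption_value": [0.4, 0.6, 0.8, 12.0]})

# Assumed derivation: fractional rank == empirical percentile.
values["percentile"] = values["redemption_value"].rank(pct=True)

# Both bounds applied simultaneously, as in app.py.
filtered = values[
    (values["percentile"] <= 0.999) & (values["redemption_value"] <= 10)
]
```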

Upper Bound: Percentile Cutoff (99.9th percentile)

Redemption values beyond the 99.9th percentile are excluded. These represent extreme outliers — typically caused by:

  • Data entry errors in cash or points pricing
  • Promotional rates with artificially low points costs
  • Currency conversion anomalies
  • Properties with unusual tax/fee structures that inflate the apparent cash value

Upper Bound: Hard Price Cap (10 cpp)

A hard ceiling of 10 cents per point removes any redemption value that would be unrealistically high for any major hotel loyalty program. For context, even the best hotel point redemptions rarely exceed 3-5 cpp, so 10 cpp provides a generous buffer while still excluding clearly erroneous data.

Lower Bound

There is no explicit lower bound beyond the zero-cash-cost filter applied during extraction. Very low but non-zero CPP values (e.g., 0.01 cpp) pass through. These typically represent legitimate — if poor — redemptions at budget properties or during high-availability periods.


Statistical Aggregations

For each filter combination, the pipeline computes and stores:

Statistic    Description
Count        Total number of rate observations
Median CPP   50th percentile — the "typical" redemption value
P25 CPP      25th percentile — lower quartile
P75 CPP      75th percentile — upper quartile
P90 CPP      90th percentile — "good" redemption threshold
Mean CPP     Arithmetic mean (more sensitive to outliers than median)

These are computed before the visualization-stage percentile/cap filters, on the full preprocessed dataset.
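A sketch of those aggregations on a toy CPP series (values invented; note how a single high outlier pulls the mean well above the median, which is why both are stored):

```python
import pandas as pd

# Hypothetical redemption values, one per deduped (property, date) row.
cpp = pd.Series([0.4, 0.5, 0.6, 0.7, 5.0])

# Computed on the full preprocessed data, before the
# visualization-stage percentile/cap trims.
stats = {
    "count": int(cpp.count()),
    "median": cpp.quantile(0.50),
    "p25": cpp.quantile(0.25),
    "p75": cpp.quantile(0.75),
    "p90": cpp.quantile(0.90),
    "mean": cpp.mean(),
}
```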


Summary

Stage          Filter                                    Purpose
Extract        points_cost IS NOT NULL                   Exclude properties without points pricing
Extract        cash_cost IS NOT NULL                     Exclude properties without cash pricing
Extract        cash_cost != 0                            Prevent division-by-zero in CPP calculation
Extract        Latest created_at per date                Deduplicate stale scrape data
Extract        Null points after refundable merge        Drop unmatched refundable rates
Preprocess     Phantom row filter                        Remove non-standard ingestion artifacts
Preprocess     search_date > run_date                    Only include future availability
Preprocess     Brand validation                          Exclude unrecognized brands
Preprocess     1 row per (property, date, refundable)    Deduplicate, keep cheapest
Preprocess     Re-dedup for "All" view                   1 row per (property, date), keep cheapest
Visualization  percentile <= 0.999                       Trim top 0.1% extreme values
Visualization  redemption_value <= 10                    Hard cap at 10 cents per point