@skyler1537
Created March 4, 2026 05:37

Points Explorer: Data Pipeline & Outlier Filtering Methodology

This document describes how the Gondola Points Explorer filters, cleans, and bounds hotel loyalty point redemption data across its extract, preprocess, and visualization stages.

Overview

The Points Explorer computes cents-per-point (CPP) redemption values for hotel loyalty programs by comparing cash rates to points rates at the same property and date. To produce meaningful distributions, the pipeline applies filters at three stages: data extraction, preprocessing, and visualization.


Stage 1: Data Extraction (data_loader.py)

SQL-Level Filtering

  • Null exclusion: The SQL query requires points_cost IS NOT NULL, cash_cost IS NOT NULL, and cash_currency IS NOT NULL.
  • Freshness: Only records with created_at >= '2025-01-01' and search_date > {min_search_date} are fetched.
  • Deduplication at source: A CTE selects only the latest created_at per (vervotech_property_id, search_date), preventing stale data from inflating the dataset.
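The dedup-at-source step can be sketched in pandas as an in-memory equivalent of the SQL CTE (the sample rows below are made up; only the column names follow the document):

```python
import pandas as pd

# Hypothetical sample mimicking two scrapes of the same property/date.
rows = pd.DataFrame({
    "vervotech_property_id": [101, 101, 102],
    "search_date": ["2026-06-01", "2026-06-01", "2026-06-01"],
    "created_at": ["2026-03-01", "2026-03-03", "2026-03-02"],
    "points_cost": [50000, 45000, 30000],
})

# Equivalent of the SQL CTE: keep only the latest created_at
# per (vervotech_property_id, search_date).
latest = (
    rows.sort_values("created_at")
        .drop_duplicates(["vervotech_property_id", "search_date"], keep="last")
)
```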

Post-Fetch Filtering

  • Zero cash cost removal: Rows where cash_cost == 0 are dropped before computing redemption values, since division by zero and free-night rates both produce meaningless CPP values.
  • Null points cost after merge: When merging refundable rates (which lack their own points cost) with the base points data, any rows that fail to match — leaving points_cost as None — are dropped.
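Both post-fetch drops reduce to simple boolean masks. A minimal sketch (hypothetical frame standing in for the post-merge data):

```python
import pandas as pd

# Hypothetical frame after the refundable-rate merge; column names
# mirror the document, values are illustrative.
df = pd.DataFrame({
    "cash_cost": [200.0, 0.0, 150.0],
    "points_cost": [50000, 40000, None],
})

# Drop free-night (zero-cash) rates, which yield meaningless CPP.
df = df[df["cash_cost"] != 0]
# Drop rows whose refundable merge left points_cost unmatched.
df = df[df["points_cost"].notna()]
```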

Currency Normalization

All cash values are converted to USD using exchange rates from the database. Properties with unmappable currencies are excluded (the conversion returns None).
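The conversion-plus-CPP arithmetic can be sketched as follows. The rate table and function names here are hypothetical (the real rates come from the database); only the None-on-unmappable-currency behavior is from the document:

```python
# Hypothetical exchange-rate lookup; the real pipeline reads rates
# from the database.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08}

def to_usd(amount, currency):
    """Convert to USD; return None for unmappable currencies."""
    rate = RATES_TO_USD.get(currency)
    return None if rate is None else amount * rate

def cpp(cash, currency, points_cost):
    """Cents-per-point: (cash in USD * 100) / points_cost."""
    usd = to_usd(cash, currency)
    return None if usd is None else usd * 100 / points_cost
```

A $300 cash rate against a 50,000-point rate works out to 0.6 cpp; a property priced in an unmapped currency drops out as None.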


Stage 2: Preprocessing (preprocess_data.py)

Phantom Row Filtering

Rows with a NULL hotel_reward_task_id and is_refundable != True are removed. These "phantom rows" originate from a non-standard ingestion path, exhibit trailing whitespace in rate names, and land disproportionately on Sundays — all signs of data quality issues. Travelport-sourced refundable rows (null task ID but is_refundable == True) are explicitly preserved.
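The keep/drop logic above is a single mask: null task ID and not refundable is phantom; null task ID with is_refundable == True (Travelport) survives. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows: phantom, Travelport refundable, and standard.
df = pd.DataFrame({
    "hotel_reward_task_id": [None, None, "t1"],
    "is_refundable": [False, True, False],
})

# Phantom = null task ID AND not refundable; Travelport refundable
# rows (null task ID, is_refundable == True) are preserved.
phantom = df["hotel_reward_task_id"].isna() & (df["is_refundable"] != True)
df = df[~phantom]
```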

Past-Date Exclusion

Any row with search_date <= run_date is dropped. The pipeline only analyzes future availability.

Brand Validation

Rows with brand_name values not recognized in the CHAIN_TO_BRANDS mapping are filtered out, preventing unknown or miscategorized brands from polluting chain-level statistics.
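As a sketch, brand validation is an isin check against the flattened mapping. The mapping contents below are invented for illustration; only the name CHAIN_TO_BRANDS is from the document:

```python
import pandas as pd

# Hypothetical mapping; the real CHAIN_TO_BRANDS lives in the pipeline code.
CHAIN_TO_BRANDS = {"Marriott": ["Courtyard", "Westin"], "Hilton": ["DoubleTree"]}
KNOWN_BRANDS = {b for brands in CHAIN_TO_BRANDS.values() for b in brands}

df = pd.DataFrame({"brand_name": ["Courtyard", "MysteryInn", "DoubleTree"]})
# Keep only rows whose brand is recognized in the mapping.
df = df[df["brand_name"].isin(KNOWN_BRANDS)]
```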

Deduplication Strategy

The pipeline deduplicates to one row per (vervotech_property_id, search_date, is_refundable), keeping the row with the lowest redemption value (cheapest CPP). This ensures:

  • Each property contributes at most one data point per date per refundable status.
  • Duplicate rate entries from overlapping scrapes don't inflate counts.
  • The cheapest available rate is what travelers would actually book.

For the "All" (non-refundable-filtered) view, an additional re-dedup collapses to one row per (vervotech_property_id, search_date), again keeping the cheapest. This prevents NaN-refundable rows from inflating the count and skewing the median upward.
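Both dedup passes are a sort-then-drop-duplicates in pandas. A minimal sketch (one hypothetical property with three rates on one date):

```python
import pandas as pd

df = pd.DataFrame({
    "vervotech_property_id": [1, 1, 1],
    "search_date": ["2026-06-01"] * 3,
    "is_refundable": [False, False, True],
    "redemption_value": [0.8, 0.6, 0.9],
})

# Pass 1: one row per (property, date, refundable), cheapest CPP wins.
deduped = (
    df.sort_values("redemption_value")
      .drop_duplicates(["vervotech_property_id", "search_date", "is_refundable"])
)

# Pass 2 ("All" view): collapse further to one row per (property, date).
all_view = deduped.sort_values("redemption_value").drop_duplicates(
    ["vervotech_property_id", "search_date"]
)
```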

Rate Name Normalization

Trailing whitespace in supplier_rate_name is stripped to prevent near-duplicate rates from appearing as separate entries in rate-level breakdowns.


Stage 3: Visualization Outlier Bounds (app.py)

The primary outlier exclusion for chart rendering applies two simultaneous bounds:

df_filtered = value_counts[
    (value_counts["percentile"] <= 0.999) &      # Exclude top 0.1% of values
    (value_counts["redemption_value"] <= 10)      # Hard cap: 10 cents per point
]
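One way the percentile column in that snippet could be derived is as the empirical CDF rank of each value. This is an assumption for illustration (app.py may compute it differently); the two-bound filter itself is verbatim from the document:

```python
import pandas as pd

values = pd.DataFrame({"redemption_value": [0.4, 0.6, 0.8, 12.0]})

# Assumed derivation: fractional rank == empirical percentile.
values["percentile"] = values["redemption_value"].rank(pct=True)

# Both bounds applied simultaneously, as in app.py.
filtered = values[
    (values["percentile"] <= 0.999) & (values["redemption_value"] <= 10)
]
```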

Upper Bound: Percentile Cutoff (99.9th percentile)

Redemption values beyond the 99.9th percentile are excluded. These represent extreme outliers — typically caused by:

  • Data entry errors in cash or points pricing
  • Promotional rates with artificially low points costs
  • Currency conversion anomalies
  • Properties with unusual tax/fee structures that inflate the apparent cash value

Upper Bound: Hard Price Cap (10 cpp)

A hard ceiling of 10 cents per point removes any redemption value that would be unrealistically high for any major hotel loyalty program. For context, even the best hotel point redemptions rarely exceed 3-5 cpp, so 10 cpp provides a generous buffer while still excluding clearly erroneous data.

Lower Bound

There is no explicit lower bound beyond the zero-cash-cost filter applied during extraction. Very low but non-zero CPP values (e.g., 0.01 cpp) pass through. These typically represent legitimate — if poor — redemptions at budget properties or during high-availability periods.


Statistical Aggregations

For each filter combination, the pipeline computes and stores:

Statistic    Description
Count        Total number of rate observations
Median CPP   50th percentile — the "typical" redemption value
P25 CPP      25th percentile — lower quartile
P75 CPP      75th percentile — upper quartile
P90 CPP      90th percentile — "good" redemption threshold
Mean CPP     Arithmetic mean (more sensitive to outliers than median)

These are computed before the visualization-stage percentile/cap filters, on the full preprocessed dataset.
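A sketch of those aggregations on a toy CPP series (values invented; note how a single high outlier pulls the mean well above the median, which is why both are stored):

```python
import pandas as pd

# Hypothetical redemption values, one per deduped (property, date) row.
cpp = pd.Series([0.4, 0.5, 0.6, 0.7, 5.0])

# Computed on the full preprocessed data, before the
# visualization-stage percentile/cap trims.
stats = {
    "count": int(cpp.count()),
    "median": cpp.quantile(0.50),
    "p25": cpp.quantile(0.25),
    "p75": cpp.quantile(0.75),
    "p90": cpp.quantile(0.90),
    "mean": cpp.mean(),
}
```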


Summary

Stage          Filter                                    Purpose
Extract        points_cost IS NOT NULL                   Exclude properties without points pricing
Extract        cash_cost IS NOT NULL                     Exclude properties without cash pricing
Extract        cash_cost != 0                            Prevent division-by-zero in CPP calculation
Extract        Latest created_at per date                Deduplicate stale scrape data
Extract        Null points after refundable merge        Drop unmatched refundable rates
Preprocess     Phantom row filter                        Remove non-standard ingestion artifacts
Preprocess     search_date > run_date                    Only include future availability
Preprocess     Brand validation                          Exclude unrecognized brands
Preprocess     1 row per (property, date, refundable)    Deduplicate, keep cheapest
Preprocess     Re-dedup for "All" view                   1 row per (property, date), keep cheapest
Visualization  percentile <= 0.999                       Trim top 0.1% extreme values
Visualization  redemption_value <= 10                    Hard cap at 10 cents per point