@emmairwin
Created August 3, 2025 00:34

Contributor activity criteria

Contributor Activity Criteria

1. Data Collection Phase

Time Window: Analyzes the last N days (default 365, user-configurable 1-365)
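
As a rough illustration of the time window, the sketch below computes the cutoff timestamp and checks whether an event falls inside it. The function names are hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption; the description above only states the default and the allowed range.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_WINDOW_DAYS = 365                  # default stated above
MIN_WINDOW_DAYS, MAX_WINDOW_DAYS = 1, 365  # configurable range stated above


def window_start(days=DEFAULT_WINDOW_DAYS, now=None):
    """Return the earliest timestamp that still falls inside the window."""
    # Assumption: out-of-range values are clamped rather than rejected.
    days = max(MIN_WINDOW_DAYS, min(MAX_WINDOW_DAYS, int(days)))
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)


def in_window(timestamp, days=DEFAULT_WINDOW_DAYS, now=None):
    """True if a commit, issue, or PR timestamp is recent enough to analyze."""
    return timestamp >= window_start(days, now)
```

Passing an explicit `now` keeps the calculation reproducible, which helps when the same repository is analyzed more than once.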

Data Sources:

  • Recent commits within the analysis window
  • Recent issues and pull requests within the analysis window
  • Review comments and participants

2. Bot Filtering

Bot Detection: Filters out automated accounts using multiple criteria:

  • Username contains: "bot", "automated", "auto", "ci", "cd", "deploy", "build", "github-actions", "dependabot", "renovate", "codecov", "travis", "jenkins", etc.
  • Display name contains similar bot indicators
  • Email patterns suggesting automation

Result: Accounts flagged as bots are excluded, so only contributors presumed to be human are included in the analysis
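
The filter above could be sketched roughly as follows. The keyword tuple copies the list given above (the original ends in "etc.", so it is not exhaustive). The email hints are hypothetical, since the description does not name the email patterns, and treating a match on any single criterion as sufficient is also an assumption.

```python
# Keywords copied from the list above; the original list is open-ended.
BOT_KEYWORDS = (
    "bot", "automated", "auto", "ci", "cd", "deploy", "build",
    "github-actions", "dependabot", "renovate", "codecov", "travis", "jenkins",
)

# Hypothetical email hints; the actual patterns are not specified above.
BOT_EMAIL_HINTS = ("[bot]",)


def is_bot(username="", display_name="", email=""):
    """Return True if any field looks automated (assumed 'any' semantics)."""
    username = (username or "").lower()
    display_name = (display_name or "").lower()
    email = (email or "").lower()
    if any(keyword in username for keyword in BOT_KEYWORDS):
        return True
    if any(keyword in display_name for keyword in BOT_KEYWORDS):
        return True
    return any(hint in email for hint in BOT_EMAIL_HINTS)
```

Plain substring matching on short keywords such as "ci" or "cd" also flags human usernames like `lucia`, so a real filter would probably need word-boundary checks or an allow-list.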

3. Contributor Activity Tracking

For each human contributor, the system tracks:

  • Commits: Number of commits authored
  • Issues Created: Issues opened by the contributor
  • PRs Created: Pull requests opened by the contributor
  • Reviews Given: Code reviews given on other contributors' PRs
  • Comments Made: Estimated number of comments on issues/PRs (each thread's comment count is distributed among its participants)
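
A minimal sketch of how these counters might be held per contributor. The class and function names are hypothetical. Splitting a thread's comment count evenly among its participants is an assumption, as is defining total activity as the sum of the five metrics; the description says only that comments are estimated and distributed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Activity:
    commits: int = 0
    issues_created: int = 0
    prs_created: int = 0
    reviews_given: int = 0
    comments_made: float = 0.0  # estimated, so it may be fractional

    @property
    def total(self):
        # Assumption: total activity is the plain sum of all five metrics.
        return (self.commits + self.issues_created + self.prs_created
                + self.reviews_given + self.comments_made)


def distribute_comments(stats, comment_count, participants):
    """Spread a thread's comment count evenly across its participants."""
    if not participants:
        return  # nothing to attribute the comments to
    share = comment_count / len(participants)
    for login in participants:
        stats[login].comments_made += share


stats = defaultdict(Activity)  # contributor login -> Activity counters
```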

4. Email Classification

Each contributor's email is classified as:

  • Company: Known corporate domains (microsoft.com, google.com, etc.)
  • Personal: Gmail, Yahoo, Outlook, etc.
  • Academic: .edu, .ac.uk, .edu.au domains
  • Custom: Company domains supplied by the user for filtering
  • "no email available": When email is missing or invalid
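
The classification above might look roughly like this. The domain sets contain only the examples named above and are not the full lists. The precedence order (custom first) and the `"unknown"` fallback for unrecognized domains are assumptions, since the description does not say how unmatched domains are handled.

```python
COMPANY_DOMAINS = {"microsoft.com", "google.com"}             # examples from above
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}  # examples from above
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".edu.au")


def classify_email(email, custom_domains=()):
    """Map an email address to one of the categories listed above."""
    email = (email or "").strip().lower()
    local, sep, domain = email.partition("@")
    # Missing or malformed addresses get the "no email available" label.
    if not local or not sep or "@" in domain or "." not in domain:
        return "no email available"
    # Assumption: user-provided domains take precedence over built-in lists.
    if domain in {d.lower() for d in custom_domains}:
        return "custom"
    if domain in COMPANY_DOMAINS:
        return "company"
    if domain in PERSONAL_DOMAINS:
        return "personal"
    if domain.endswith(ACADEMIC_SUFFIXES):
        return "academic"
    return "unknown"  # hypothetical fallback, not named in the description
```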

5. Activity Trend Analysis

Quarterly Breakdown: Divides the analysis window into 4 quarters, Q0 (oldest) through Q3 (most recent)

Trend Calculation: Compares activity in the recent half (Q2+Q3) with activity in the older half (Q0+Q1)

Trend Categories:

  • Increasing: Recent activity > 1.5x older activity
  • Decreasing: Recent activity < 0.67x older activity
  • Stable: Recent activity between 0.67x and 1.5x older activity
  • Insufficient Data: Fewer than 10 total activities
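
The trend rules above can be sketched as follows. The function names are hypothetical. Two assumptions are made: the quarters are equal-length slices of the window, and the insufficient-data check runs before the ratio comparison.

```python
def quarter_index(timestamp, start, end):
    """Bucket a timestamp into Q0..Q3 (works for datetimes or plain numbers)."""
    # Assumption: the four quarters are equal-length slices of the window.
    fraction = (timestamp - start) / (end - start)
    return min(3, max(0, int(fraction * 4)))


def classify_trend(quarters):
    """quarters: activity counts [Q0, Q1, Q2, Q3], oldest first."""
    if len(quarters) != 4:
        raise ValueError("expected exactly four quarterly counts")
    older = quarters[0] + quarters[1]
    recent = quarters[2] + quarters[3]
    # Assumption: the minimum-activity check takes precedence over the ratio.
    if older + recent < 10:
        return "insufficient data"
    if recent > 1.5 * older:
        return "increasing"
    if recent < 0.67 * older:
        return "decreasing"
    return "stable"
```

Writing the comparison as `recent > 1.5 * older` rather than as a division avoids a zero-division error when the older half is empty. In that case any contributor with at least 10 recent activities is classified as increasing.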

6. Sentiment Analysis

Eligibility: Only contributors with 10+ activities get sentiment analysis

Data Sources: Commit messages, issue comments, PR comments, review comments

Output: Average polarity, subjectivity, and sentiment label (positive/negative/neutral)
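
The aggregation step might be sketched as below. The description does not name the sentiment library, so this sketch takes per-text `(polarity, subjectivity)` scores as input from whichever scorer is used. The label cutoffs of ±0.1 are hypothetical, as is the assumption that polarity is centered on zero.

```python
from statistics import mean

MIN_ACTIVITIES = 10     # eligibility threshold stated above
POSITIVE_CUTOFF = 0.1   # hypothetical cutoff, not given in the description
NEGATIVE_CUTOFF = -0.1  # hypothetical cutoff, not given in the description


def summarize_sentiment(scores, total_activities):
    """scores: list of (polarity, subjectivity) pairs, one per analyzed text.

    Returns None for contributors who are not eligible or have no text.
    """
    if total_activities < MIN_ACTIVITIES or not scores:
        return None
    polarity = mean(p for p, _ in scores)
    subjectivity = mean(s for _, s in scores)
    if polarity > POSITIVE_CUTOFF:
        label = "positive"
    elif polarity < NEGATIVE_CUTOFF:
        label = "negative"
    else:
        label = "neutral"
    return {"polarity": polarity, "subjectivity": subjectivity, "label": label}
```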

7. Final Output Structure

Each contributor entry includes the activity counts, email classification, activity trend, and (for eligible contributors) sentiment results described above.

8. Sorting & Risk Analysis

Sorting: Contributors sorted by total activity (highest first)

Concentration Risk: Calculated based on top contributor's percentage of total activity

Activity Distribution: Shows how activity is distributed among top 1, 3, and 5 contributors
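
A minimal sketch of the sorting and distribution step, assuming each contributor has already been reduced to a single total activity count. The function name is hypothetical. The description gives no thresholds for turning the top contributor's share into a risk level, so the sketch returns only the percentages.

```python
def activity_distribution(totals, top_ns=(1, 3, 5)):
    """totals: mapping of contributor login -> total activity count.

    Returns the contributors sorted by activity (highest first) and the
    percentage of all activity covered by the top 1, 3, and 5 contributors.
    """
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    grand_total = sum(totals.values())
    shares = {}
    for n in top_ns:
        top_sum = sum(count for _, count in ranked[:n])
        # Guard against an empty repository to avoid dividing by zero.
        shares[n] = 100.0 * top_sum / grand_total if grand_total else 0.0
    return ranked, shares
```

Here `shares[1]` is the top contributor's percentage of total activity, which is the input to the concentration risk described above.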

Together, these steps show who is actively contributing to a repository, how their engagement changes over time, the sentiment of their communication, and their likely organizational affiliation as inferred from email domains.
