|
#!/usr/bin/env python2
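
"""Accumulate per-file stats (authors, assignees, lgtmers, approvers, ...) for
the files touched by a list of GitHub PRs. See main() for details of the
output format."""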
|
import argparse |
|
import json |
|
import logging |
|
import os |
|
import re |
|
import sys |
|
|
|
from collections import OrderedDict |
|
|
|
import requests |
|
|
|
# TODO: doesn't paginate, so results are capped at the first page (callers request per_page=100)
|
class DumbGitHubCache(object): |
|
def __init__(self, workdir): |
|
self.workdir = workdir |
|
if not os.path.exists(workdir): |
|
os.makedirs(workdir) |
|
|
|
self.token = os.environ['GITHUB_TOKEN'] |
|
self.host = 'https://api.github.com' |
|
self.headers = { |
|
'authorization': "token %s" % self.token |
|
} |
|
|
|
def write_json(self, data, filename): |
|
dirname = os.path.dirname(filename) |
|
if not os.path.exists(dirname): |
|
os.makedirs(dirname) |
|
with open(filename, 'w') as fp: |
|
json.dump(data, fp) |
|
|
|
def get_json(self, path, localpath): |
|
localpath = localpath[1:] if localpath.startswith("/") else localpath |
|
filename = os.path.join(self.workdir, localpath) |
|
if os.path.exists(filename): |
|
# logging.info("get %s HIT cache at %s", path, filename) |
|
with open(filename) as fp: |
|
return json.load(fp) |
|
# logging.info("get %s missed cache at %s", path, filename) |
|
r = requests.get("%s%s" % (self.host, path), headers=self.headers) |
|
r.raise_for_status() |
|
data = r.json() |
|
self.write_json(data, filename) |
|
return data |
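
# Example usage (hypothetical PR number; requires GITHUB_TOKEN to be exported):
#   cache = DumbGitHubCache('data/pr-file-info')
#   pr = cache.get_json('/repos/kubernetes/kubernetes/pulls/12345',
#                       'repos/kubernetes/kubernetes/pulls/12345.json')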
|
|
|
|
|
def loginfo(msg): |
|
logging.info(msg) |
|
|
|
|
|
def main(workdir, prfile, repo, file_regex, min_count, users): |
|
""" |
|
Given a list of pr numbers and a repo, pull down the pr, its reviews. |
|
and the files that it touches. Then walk through all prs, and |
|
accumulate some per-file stats for all files touched by all prs. |
|
|
|
eg: |
|
foo/bar/baz: |
|
assignees: |
|
spiffxp: 23 |
|
labels: |
|
size/XL: 2 |
|
size/S: 21 |
|
|
|
means: |
|
of all the prs surveyed, for those that touched file foo/bar/baz: |
|
- spiffxp was an assignee on 23 prs |
|
- the size/XL label was on 2 prs |
|
- he size/S label was on 21 prs |
|
|
|
Other keys: |
|
authors: was a pr author |
|
requested_reviewers: was requested for review on a pr |
|
lgtmers: issued /lgtm on a pr (OR clicked "approve changes" in github's ui) |
|
approvers: issued /approve on a pr |
|
triagers: issued area/kind/sig/priority/lifecycle/milestone commands on a pr |
|
holders: issued a /hold on a pr (OR clicked "request changes" in github ui) |
|
|
|
|
|
Other ideas: |
|
- sift these against OWNERS files somehow |
|
- take file of org/repo#123 instead of --repo org/repo + file of 123 |
|
- filter by date ranges |
|
- filter by labels |
|
- (maybe these should be github queries) |
|
""" |
|
|
|
setup_logging() |
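
    # assumes GITHUB_TOKEN is set in the environment (DumbGitHubCache reads it),
    # and that prfile contains one PR number per line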
|
|
|
ghcache = DumbGitHubCache(workdir) |
|
|
|
    def cache_get(endpoint):
        # reuse the API path as the on-disk cache path, eg
        # https://api.github.com/repos/org/repo/pulls/123 is cached at
        # <workdir>/repos/org/repo/pulls/123.json
        ep = endpoint.replace('https://api.github.com', '')
        return ghcache.get_json(ep, "%s.json" % ep)
|
|
|
with open(prfile) as fp: |
|
prnums = [l.strip() for l in fp] |
|
|
|
|
|
|
# IMO while some of these can be done by author, they are more meaningful |
|
# signal when done by someone other than author |
|
# TODO(spiffxp): /lgtm is meaningless if not done by a member |
|
# TODO(spiffxp): /approve is meaningless if not in OWNERS |
|
commands = { |
|
'close': re.compile(r'^/close', re.MULTILINE), |
|
'lgtm': re.compile(r'^/lgtm', re.MULTILINE), |
|
'approve': re.compile(r'^/approve', re.MULTILINE), |
|
'triage': re.compile(r'^/(remove-)?(area|kind|sig|priority|milestone|lifecycle)', re.MULTILINE), |
|
'hold': re.compile(r'^/hold$', re.MULTILINE), |
|
# other ideas: |
|
# test-shepherd: retest|test|ok-to-test |
|
# review-shepherd: (un)(assign|cc) (if not self) |
|
} |
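
    # note: these patterns are anchored to the start of a line (re.MULTILINE),
    # so a comment body like "/lgtm\nthanks!" counts as an lgtm, while an
    # inline "please /lgtm when ready" does not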
|
|
|
def process_comments(comments, processed): |
|
"""comments: [{user: ..., body: ...}], processed:{k:[] for k in commands}]""" |
|
leftovers=[] |
|
for c in comments: |
|
# ignore reviews or comments by users that no longer exist; they have user: null |
|
if c['user'] is None: |
|
continue |
|
matched = False |
|
for name, regex in commands.iteritems(): |
|
if regex.findall(c['body']): |
|
matched = True |
|
processed[name].append(c) |
|
# TODO(spiffxp): make these distinct commands? |
|
if 'state' in c: |
|
if c['state'] == "APPROVED": |
|
matched = True |
|
processed['lgtm'].append(c) |
|
elif c['state'] == "CHANGES_REQUESTED": |
|
matched = True |
|
processed['hold'].append(c) |
|
if not matched: |
|
leftovers.append(c) |
|
return leftovers |
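
    # eg: processed['lgtm'] ends up with comments whose body has a line starting
    # with /lgtm, plus reviews in the APPROVED state; the returned leftovers are
    # comments and reviews that matched no command at all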
|
|
|
# get the prs we'll be dealing with |
|
prs = {} |
|
for num in prnums: |
|
pr_id = '%s#%s' % (repo, num) |
|
url = 'https://api.github.com/repos/%s/pulls/%s' % (repo, num) |
|
loginfo("pr: %s..." % url) |
|
pr = cache_get(url) |
|
loginfo(" files...") |
|
files = cache_get("%s/files?per_page=100" % url) |
|
|
|
loginfo(" reviews...") |
|
reviews = cache_get("%s/reviews?per_page=100" % url) |
|
loginfo(" comments...") |
|
comments = cache_get("%s?per_page=100" % pr['comments_url']) |
|
loginfo(" review_comments...") |
|
review_comments = cache_get("%s?per_page=100" % pr['review_comments_url']) |
|
|
|
# things I want to filter out of reviews |
|
# - reviews of {body:"", state:"COMMENTED"}, means they dropped |
|
# review comments, so the review itself isn't meaningful |
|
github_reviews = filter(lambda x: x['body'] != "" or x['state'] != "COMMENTED", reviews) |
|
|
|
processed={name:[] for name in commands} |
|
github_reviews = process_comments(github_reviews, processed) |
|
comments = process_comments(comments, processed) |
|
review_comments = process_comments(review_comments, processed) |
|
|
|
# TODO: try parsing out events like so |
|
# did they use the native review ui to approve? cool |
|
# did they issue an approve? cool |
|
# did they issue an lgtm? cool |
|
# did they triage? cool |
|
# if their comment was none of these things, they commented |
|
# if their review_comment was none of these things, they review commented |
|
filenames = [f['filename'] for f in files] |
|
labels = [l['name'] for l in pr['labels']] |
|
prs[pr_id] = { |
|
'id': pr_id, |
|
'url': pr['html_url'], |
|
'author': pr['user']['login'], |
|
'assignees': [x['login'] for x in pr['assignees']], |
|
'requested_reviewers': [x['login'] for x in pr['requested_reviewers']], |
|
'lgtms': [{ |
|
'login': c['user']['login'], |
|
'html_url': c['html_url'], |
|
} for c in processed['lgtm']], |
|
'approves': [{ |
|
'login': c['user']['login'], |
|
'html_url': c['html_url'], |
|
} for c in processed['approve']], |
|
'triages': [{ |
|
'login': c['user']['login'], |
|
'html_url': c['html_url'], |
|
} for c in processed['triage']], |
|
'holds': [{ |
|
'login': c['user']['login'], |
|
'html_url': c['html_url'], |
|
} for c in processed['hold']], |
|
'files': filenames, |
|
'labels': labels, |
|
'merged': pr['merged'], |
|
'github_reviews': [{ |
|
'login': r['user']['login'], |
|
'body': r['body'], |
|
'state': r['state'], |
|
'html_url': r['html_url'] |
|
} for r in github_reviews], |
|
} |
|
|
|
# filter to prs touching files that match a regex |
|
file_pattern = re.compile(file_regex) |
|
prs_matching_file_regex = [] |
|
for pr_id, info in prs.iteritems(): |
|
f_files = [f for f in info['files'] if file_pattern.match(f)] |
|
relevant_files = len(f_files) != 0 |
|
if relevant_files: |
|
prs_matching_file_regex.append(info) |
|
|
|
# swap around: for every file, what prs touched it |
|
file_to_prs = {} |
|
for info in prs_matching_file_regex: |
|
for f in info['files']: |
|
file_prs = file_to_prs.get(f, []) |
|
file_prs.append(info) |
|
file_to_prs[f]=file_prs |
|
|
|
ignore_users = users is None or len(users) == 0 |
|
def user_count_tuple_matches(x): |
|
return x[1] >= min_count and (ignore_users or x[0] in users) |
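
    # the big expression below builds, for every file, per-user counts of PRs in
    # each role (author, assignees, lgtmers, ...), drops entries below --min-count
    # (or outside --users), sorts each role's users by count, and finally sorts
    # files by their total author count, so the busiest files come last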
|
|
|
    file_info = OrderedDict(sorted(
        # build each file's summary as a list of (key, value) tuples so the
        # OrderedDict keeps this key order (a plain dict literal would not)
        [(
            f, OrderedDict([
                ('author', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len(x[1])),
                            group_by(lambda x: x['author'], file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                ('assignees', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len(x[1])),
                            categorize_by(lambda x: x['assignees'], file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                ('requested_reviewers', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len(x[1])),
                            categorize_by(lambda x: x['requested_reviewers'], file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                # ('labels', OrderedDict(sorted(
                #     filter(lambda x: x[1] >= min_count,
                #         map(lambda x: (x[0], len(x[1])),
                #             categorize_by(lambda x: x['labels'], file_prs).iteritems())),
                #     key=lambda x: x[1]
                # ))),
                # ('github_reviewers', OrderedDict(sorted(
                #     filter(lambda x: len(x[1]) >= min_count,
                #         map(lambda x: (x[0], [{
                #             'html_url': r['html_url'],
                #             'state': r['state']
                #         } for pr in x[1] for r in pr['github_reviews'] if r['login'] == x[0]]),
                #             categorize_by(lambda x: set([r['login'] for r in x['github_reviews']]), file_prs).iteritems())),
                #     key=lambda x: x[1]
                # ))),
                ('lgtmers', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len([[r['html_url'] for r in pr['lgtms'] if r['login'] == x[0]] for pr in x[1]])),
                            categorize_by(lambda x: set([r['login'] for r in x['lgtms']]), file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                ('approvers', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len([[r['html_url'] for r in pr['approves'] if r['login'] == x[0]] for pr in x[1]])),
                            categorize_by(lambda x: set([r['login'] for r in x['approves']]), file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                ('triagers', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len([[r['html_url'] for r in pr['triages'] if r['login'] == x[0]] for pr in x[1]])),
                            categorize_by(lambda x: set([r['login'] for r in x['triages']]), file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                ('holders', OrderedDict(sorted(
                    filter(user_count_tuple_matches,
                        map(lambda x: (x[0], len([[r['html_url'] for r in pr['holds'] if r['login'] == x[0]] for pr in x[1]])),
                            categorize_by(lambda x: set([r['login'] for r in x['holds']]), file_prs).iteritems())),
                    key=lambda x: x[1]
                ))),
                # TODO(spiffxp): include commenter / review_commenter
            ])
        ) for f, file_prs in file_to_prs.iteritems()],
        # sort by files that have had the most PRs against them (by adding up author counts)
        key=lambda x: reduce(lambda r, x2: r + x2[1], x[1]['author'].items(), 0)
    ))
|
|
|
print json.dumps(file_info, indent=2) |
|
|
|
# naming things is hard: |
|
# - categorize_by: a thing can be in multiple categories |
|
# - group_by: a thing can only be in one group |
|
|
|
# given fn(value)->[keys] fn and [values], return {key:[values]} |
|
def categorize_by(fn, xs): |
|
r = {} |
|
for x in xs: |
|
ks = fn(x) |
|
for k in ks: |
|
vs = r.get(k, []) |
|
vs.append(x) |
|
r[k] = vs |
|
return r |
|
|
|
# given fn(value)->key fn and [values], return {key:[values]} |
|
def group_by(fn, xs): |
|
r = {} |
|
for x in xs: |
|
k = fn(x) |
|
vs = r.get(k, []) |
|
vs.append(x) |
|
r[k] = vs |
|
return r |
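
# eg, with hypothetical prs:
#   group_by(lambda pr: pr['author'], prs) puts each pr under exactly one author
#   categorize_by(lambda pr: pr['assignees'], prs) puts a pr with two assignees
#   under both assignees' keys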
|
|
|
def setup_logging(): |
|
"""Initialize logging to screen""" |
|
# See https://docs.python.org/2/library/logging.html#logrecord-attributes |
|
# [IWEF]mmdd HH:MM:SS.mmm] msg |
|
fmt = '%(levelname).1s%(asctime)s.%(msecs)03d] %(message)s' # pylint: disable=line-too-long |
|
datefmt = '%m%d %H:%M:%S' |
|
logging.basicConfig( |
|
level=logging.INFO, |
|
format=fmt, |
|
datefmt=datefmt, |
|
) |
|
|
|
if __name__ == '__main__': |
|
    parser = argparse.ArgumentParser(description='Summarize per-file review activity for the prs listed in prfile')
|
    parser.add_argument('prfile', nargs='?', default='prfile', help='file containing a list of pr numbers, one per line')
|
parser.add_argument('--repo', default='kubernetes/kubernetes', help='repo that will be queried') |
|
parser.add_argument('--file-regex', default='.*', help='only consider prs whose filenames match this regex') |
|
    parser.add_argument('--min-count', default=1, type=int, help='minimum count of occurrences to be included in summary')
|
parser.add_argument('--workdir', default='data/pr-file-info', help='Work directory to cache things') |
|
parser.add_argument('--users', nargs='+', help='Filter to just these users') |
|
args = parser.parse_args() |
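
    # example invocation (hypothetical script name; token elided):
    #   GITHUB_TOKEN=... ./pr_file_stats.py prfile --repo kubernetes/kubernetes \
    #       --file-regex '^test/e2e/' --min-count 2 --users spiffxp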
|
|
|
main(args.workdir, args.prfile, args.repo, args.file_regex, int(args.min_count), args.users) |