This gist is related to this blog post
The steps below reproduce the issue: incorrectly formatted entries in the override list of the Topics API lead to misclassifications.
- Get the `override_list.pb.gz` shipped with model v4 (the current version of the `model.tflite` for the Topics API shipped in Chrome).
  - Visit `chrome://topics-internals`; under the Classifier tab, get the path where the model is stored. In that same folder you will find the `override_list.pb.gz` file.
  - For archival purposes, the file can also be found here.
- Then, check that the override list contains domains with the invalid `"/"` character and that these domains are flipped around that character. You will need to decompress the protobuf file, get the corresponding `.proto`, and decode it. Here is how I do it; you will need these 2 scripts:

  `convert_pb_override.sh` bash script:

    ```bash
    #!/bin/bash
    # Usage: ./convert_pb_override.sh <override_list.pb.gz> <output.tsv>
    override_pb_gz=$1
    override_tsv=$2
    override_pb=override.pb
    proto_path=page_topics_override_list.proto
    python_proto_path=page_topics_override_list_pb2.py

    # Skip the conversion if the .tsv already exists
    if [ ! -f "$override_tsv" ]
    then
        # Fetch page_topics_override_list.proto
        wget -q -O "$proto_path" https://raw.githubusercontent.com/chromium/chromium/main/components/optimization_guide/proto/page_topics_override_list.proto
        # Generate the Python bindings (page_topics_override_list_pb2.py)
        protoc "$proto_path" --python_out=.
        # Decompress override.pb.gz
        gzip -cdk "$override_pb_gz" > "$override_pb"
        # Decode the protobuf and write it as .tsv
        python3 convert_pb_override.py "$override_pb" > "$override_tsv"
        # Clean up intermediate files
        rm "$proto_path" "$override_pb" "$python_proto_path"
    fi
    ```

  `convert_pb_override.py` python script:

    ```python
    import argparse

    import page_topics_override_list_pb2

    # Create Argument Parser
    parser = argparse.ArgumentParser(
        prog="python3 convert_pb_override.py",
        description="Convert .pb override list to .tsv",
    )
    parser.add_argument("input_file", help="input file")
    args = parser.parse_args()

    # Load override list
    override_list = page_topics_override_list_pb2.PageTopicsOverrideList()
    with open(args.input_file, "rb") as f:
        override_list.ParseFromString(f.read())

    # Print one line per entry: the domain, a tab, then comma-separated topic IDs
    print("domain\ttopics")
    for entry in override_list.entries:
        line = "{}".format(entry.domain)
        first_topic = True
        for topic_id in entry.topics.topic_ids:
            if first_topic:
                line += "\t{}".format(topic_id)
                first_topic = False
            else:
                line += ",{}".format(topic_id)
        print(line)
    ```
  - And then run:

    ```bash
    # Decode override list to .tsv format
    ./convert_pb_override.sh override_list.pb.gz override_list.tsv
    # Extract domains (and corresponding topics) with invalid character:
    grep ".*[^[:alpha:][:space:][:digit:]^,].*" override_list.tsv
    ```

  - I obtain the following incorrectly formatted domains.
The domain entries in that override list are supposed to be pre-processed the same way as the input that would be passed to the model.tflite of the Topics API.
This means: take the FQDN, remove the "www." prefix if the domain to classify has one, and then replace each of the characters "-", "_", ".", "+" with a space (https://source.chromium.org/chromium/chromium/src/+/main:components/browsing_topics/annotator_impl.cc;l=269).
Some examples:
- `candy-crush-soda-saga.web.app` -> `candy crush soda saga web app` and not `web app/candy crush soda saga`
- `subscribe.free.fr` -> `subscribe free fr` and not `free fr/subscribe`
- `uk.instructure.com` -> `uk instructure com` and not `instructure com/uk`
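
The pre-processing described above can be sketched in a few lines of Python. This is only an illustration of the two steps stated here (strip a leading "www.", then replace "-", "_", ".", "+" with a space); the function name `preprocess_domain` is hypothetical and this is not Chromium's implementation, which may apply further normalization not covered here.

```python
def preprocess_domain(fqdn: str) -> str:
    """Sketch of the pre-processing described in the text (not Chromium's code)."""
    # Remove a leading "www." prefix if present
    if fqdn.startswith("www."):
        fqdn = fqdn[len("www."):]
    # Replace each of "-", "_", ".", "+" with a single space
    return fqdn.translate(str.maketrans("-_.+", "    "))


for host in ("candy-crush-soda-saga.web.app", "subscribe.free.fr", "uk.instructure.com"):
    print(host, "->", preprocess_domain(host))
```

Running it on the three hostnames should give the left-hand forms listed above, i.e. the keys one would expect to find in the override list.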
As a result, when these domains are classified by the Topics API in Chrome, the correctly pre-processed domain has no match in the override list.
Thus, they are classified by the ML model, which does not output the intended classification (the Chrome classification can be obtained from `chrome://topics-internals`):
- `candy-crush-soda-saga.web.app` -> `183. Computer & video games - 186. Casual games - 1. Arts & entertainment` by ML model -> `186. Casual games - 215. Internet & telecom` from override list for `web app/candy crush soda saga`
- `subscribe.free.fr` -> `217. Internet service providers (ISPs)` by ML model -> `217. Internet service providers (ISPs) - 365. Movie & TV streaming - 129. Consumer electronics - 218. Phone service providers` from override list for `free fr/subscribe`
- `uk.instructure.com` -> `229. Colleges & universities - 227. Education` by ML model -> `230. Distance learning - 234. Standardized & admissions tests - 140. Software - 227. Education - 229. Colleges & universities` from override list for `instructure com/uk`
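
To make the lookup miss concrete, here is a toy sketch. It assumes the override list behaves like an exact-match lookup keyed by the pre-processed domain, modelled here with a plain Python dict. The helper `unflip` is hypothetical and assumes every malformed entry has the `suffix/prefix` shape seen in the three examples above.

```python
def unflip(entry: str) -> str:
    """Turn a flipped 'suffix/prefix' entry back into 'prefix suffix' (hypothetical helper)."""
    if "/" not in entry:
        return entry  # well-formed entries are left untouched
    suffix, prefix = entry.split("/", 1)
    return f"{prefix} {suffix}"


# Toy override list holding one malformed entry (topic IDs taken from the example above)
override_list = {"web app/candy crush soda saga": [186, 215]}

# Correctly pre-processed form of candy-crush-soda-saga.web.app
key = "candy crush soda saga web app"

print(override_list.get(key))  # None: no match, so the ML model would be used

# Repairing the keys makes the same lookup succeed
repaired = {unflip(k): v for k, v in override_list.items()}
print(repaired.get(key))  # [186, 215]
```

The same swap applied to `free fr/subscribe` and `instructure com/uk` gives `subscribe free fr` and `uk instructure com`, matching the expected pre-processed forms.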