Warning
This document is a work in progress.
This document discusses considerations for implementing telemetry in FRRouting. At the time of writing, no telemetry features exist in FRR (and there is no known private implementation in any vendor branch, though such a thing may of course exist in secret.)
The purpose and goal of adding telemetry to FRR is to understand which parts of the code are used to what extent. This may include features at large, specific details of features, but also workarounds and kludges necessary to handle some edge cases.
A non-goal of telemetry as discussed here is crash reporting and data collection. While a lot of other telemetry implementations include this, getting a useful crash report for FRR would involve a significant amount of likely sensitive information. Sanitizing or minimizing a crash report would make it useless in the vast majority of cases.
FRRouting is packaged for a variety of distributions, and sees significant vendorization as well. Any telemetry solution should take into account:
- in general, operators will by default not want to reveal arbitrary information about their network. There is a significant hurdle to clear here, and the fact that only very few operators will clear it makes the collected data much less useful due to the smaller sample size.
- telemetry metadata can reveal sensitive information about operators, e.g. how many devices are in a particular network. It can also reveal which routing protocols may be worth attempting to exploit in a network.
- most common distributions have strict policies against privacy breaches in packages they are distributing, e.g. https://wiki.debian.org/PrivacyIssues#Phone_home https://udd.debian.org/lintian-tag.cgi?tag=privacy-breach-generic
- vendors may want to have FRR report telemetry to them, sometimes in addition, sometimes exclusively.
- telemetry data may reveal details about "private" customizations made to the FRR codebase, whether by vendors or individual operators. Simultaneously, if customizations are applied but not visible, this may make telemetry data misleading.
- the systems FRR is running on may have only limited connectivity to send telemetry data, whether due to operator policy, or simply because the routing setup is designed that way (e.g. isolated management VRF.)
- the FRR community is not a monolithic block, and in particular there is no legal entity that "is" the FRR community. This is especially relevant for privacy agreements: since the FRR community does not "exist", it cannot agree to perform (or not perform) specific handling of telemetry data.
- there is a preexisting amount of metadata already being generated at the package repository systems (deb.frrouting.org, rpm.frrouting.org) for installations using those repositories.
Submitting any kind of telemetry in FRR by default is precluded by several points, first and foremost that most operators will disagree with such a policy. It would also violate most distribution policies. It would also simply not work in a good number of scenarios.
Note
This applies only to submitting telemetry data. Collecting telemetry data locally can (and probably should) be enabled by default.
All telemetry data collected by FRR must be auditable by users so they can confirm no sensitive data is included. This is also necessary to allow operators to set up exfiltration countermeasures, as in, prevent an FRR telemetry channel from being abused by malicious entities to bypass data outflow controls.
To deal with routers without good connectivity, anti-exfiltration and audit requirements, and minimize exposed metadata (particularly number of telemetry sources), it is necessary to allow the operator to collect and merge telemetry data with a separate process.
This also implies that submitting telemetry data is a function that should be able to run independent of the FRR daemons, e.g. on a jumphost that has no routing functions.
Having any association between telemetry data and submission origin (e.g. IP address) will be unacceptable for some operators at minimum, and is in fact not desirable from an FRR community perspective. Having a way to associate telemetry data with its origin makes these systems a target for malicious entities, even if only to identify what vulnerabilities a network of interest may have. Not having this metadata also immunizes the FRR community from any possible accusations of mishandling that metadata.
As such, all telemetry data submissions should be fully anonymized. The method of choice is probably the Tor network https://en.wikipedia.org/wiki/Tor_(network)
Note
It is a common misunderstanding that using the Tor network requires running a node and forwarding traffic on it. This is not the case. Tor can be used in a purely "client" manner.
To actually get a good amount of telemetry data, submitting the data must be
reasonably easy for the user. While default-enabled submission may not be
acceptable, a simple way of doing this may be to make available a separate
package, e.g. frr-telemetry, that enables submitting telemetry data.
This package may be (almost) empty; it serves primarily as a switch. It could also be the place for a vendor to add custom telemetry data endpoints.
For any telemetry data point, there are two considerations:
- usability in the sense of fulfilling all requirements towards security and privacy. There must be as strong a guarantee as possible that the data point cannot leak sensitive data.
- usefulness to achieve some data collection goal. This particularly covers framing the data point with sufficient context.
Essentially, the output of the "show modules" command. This may reveal the
existence of custom modules; a whitelist filter could be added to prevent that.
The version is also reasonably important to have as context for all other data points. The telemetry data should probably be structured such that all other data points are subordinate to the version.
As FRR is already collecting counters for memory allocations on MTYPE_*,
this data is very relevant to include in telemetry. Even plain (non-)usage
of some MTYPE_* already gives a lot of information which features are used
and which aren't.
Luckily, there are in fact two ways to collect sanitized data for CLI commands:
- reporting on an unparsed DEFUN command string level, e.g.
  "router bgp [ASNUM$instasn [<view|vrf> VIEWVRFNAME] [as-notation <dot|dot+|plain>]]"
- reporting on the DEFUN tokens actually matching the command being executed, e.g.
  "router bgp ASNUM as-notation dot"
The latter clearly has some additional information. The former has the advantage of not requiring extra collection housekeeping since counters can be placed directly in the DEFUN's data (zero runtime memory allocations needed.)
In both cases, the most immediately obvious data point is to count executions of each command. An additional data point to consider would be to count failure results separately.
Statistics on these commands would however not work when other northbound interfaces are in use.
Similar to CLI commands, counters can be placed on YANG schema items, next to the callbacks implementing their operations. It is crucial in this case to use the path from the schema only.
Conversely, this only works for YANG-ified daemons.
In some ways, telemetry data on log messages might be a "holy grail" of data
since it provides visibility into actual code paths being hit. The
XXXXX-XXXXX message identification mechanism provides the necessary
scaffolding for referencing them in a sane, sanitized way.
However, the most useful data on this would be from debug messages, which are not generally enabled and thus cannot be counted if the logging code is never executed. PR #12272 addresses this, but will require (automated) adjustment of debug logging calls across the FRR codebase.
The first half of an FRR telemetry feature is to just collect the data locally,
which will probably end up in some file in a location like /var/lib/frr.
There are a bunch of implementation considerations with this (e.g. race conditions from daemons updating the file simultaneously, frequency of writing out data from memory to disk, etc.)
However, one consideration stands out to evaluate early: the telemetry data shouldn't be allowed to grow without bounds. As long as only a number of counters is collected for an exhaustively enumerable list of items, this should not be a problem, but it needs to be considered regardless.
With FRR slowly moving to an increased usage of multithreading, plain counters
are no longer as cheap and simple as they sound. The MTYPE_* accounting
mechanism uses atomic operations for its counters, and these atomic ops are
already visible in perf data. Since allocating memory is in itself a somewhat
costly operation, the added cost of counting may be acceptable there, but in
particular for disabled debugging messages this may not be the case.
There are possible cheaper alternatives in either using thread-local counters,
or the recently added rseq mechanism on Linux (cf.
https://github.com/compudj/librseq).