Warning
This document is a work in progress.
This document discusses considerations for implementing telemetry in FRRouting. At the time of writing, no telemetry features exist in FRR (and there is no known private implementation in any vendor branch, though such a thing may of course exist in secret.)
The purpose and goal of adding telemetry to FRR is to understand which parts of the code are used to what extent. This may include features at large, specific details of features, but also workarounds and kludges necessary to handle some edge cases.
A non-goal of telemetry as discussed here is crash reporting and data collection. While a lot of other telemetry implementations include this, getting a useful crash report for FRR would involve a significant amount of likely sensitive information. Sanitizing or minimizing a crash report would make it useless in the vast majority of cases.
FRRouting is packaged for a variety of distributions, and sees significant vendorization as well. Any telemetry solution should take into account:
- in general, operators will by default not want to reveal arbitrary information about their network. There is a significant hurdle to clear here, and the fact that only very few operators will clear it makes the collected data much less useful due to the smaller sample size.
- telemetry metadata can reveal sensitive information about operators, e.g. how many devices are in a particular network. It can also reveal which routing protocols may be worth attempting to exploit in a network.
- most common distributions have strict policies against privacy breaches in packages they are distributing, e.g. https://wiki.debian.org/PrivacyIssues#Phone_home https://udd.debian.org/lintian-tag.cgi?tag=privacy-breach-generic
- vendors may want to have FRR report telemetry to them, sometimes in addition, sometimes exclusively.
- telemetry data may reveal details about "private" customizations made to the FRR codebase, whether by vendors or individual operators. Simultaneously, if customizations are applied but not visible, this may make telemetry data misleading.
- the systems FRR is running on may have only limited connectivity to send telemetry data, whether due to operator policy, or simply because the routing setup is designed that way (e.g. isolated management VRF.)
- the FRR community is not a monolithic block, and in particular there is no legal entity that "is" the FRR community. This is especially relevant for privacy agreements: since the FRR community does not "exist", it cannot agree to perform (or not perform) specific handling of telemetry data.
- there is a preexisting amount of metadata already being generated at the package repository systems (deb.frrouting.org, rpm.frrouting.org) for installations using those repositories.
Submitting any kind of telemetry in FRR by default is precluded by several points, first and foremost that most operators will disagree with such a policy. It would also violate most distribution policies. It would also simply not work in a good number of scenarios.
Note
This applies only to submitting telemetry data. Collecting telemetry data locally can (and probably should) be enabled by default.
All telemetry data collected by FRR must be auditable by users so they can confirm no sensitive data is included. This is also necessary to allow operators to set up exfiltration countermeasures, as in, prevent an FRR telemetry channel from being abused by malicious entities to bypass data outflow controls.
To deal with routers without good connectivity, anti-exfiltration and audit requirements, and minimize exposed metadata (particularly number of telemetry sources), it is necessary to allow the operator to collect and merge telemetry data with a separate process.
This also implies that submitting telemetry data is a function that should be able to run independent of the FRR daemons, e.g. on a jumphost that has no routing functions.
Having any association between telemetry data and submission origin (e.g. IP address) will be unacceptable for some operators at minimum, and is in fact not desirable from an FRR community perspective. Having a way to associate telemetry data with its origin makes these systems a target for malicious entities, even if only to identify what vulnerabilities a network of interest may have. Not having this metadata also immunizes the FRR community from any possible accusations of mishandling that metadata.
As such, all telemetry data submissions should be fully anonymized. The method of choice is probably the Tor network https://en.wikipedia.org/wiki/Tor_(network)
Note
It is a common misunderstanding that using the Tor network requires running a node and forwarding traffic on it. This is not the case. Tor can be used in a purely "client" manner.
To actually get a good amount of telemetry data, submitting the data must be
reasonably easy for the user. While default-enabled submission may not be
acceptable, a simple way of doing this may be to make available a separate
package, e.g. frr-telemetry, that enables submitting telemetry data.
This package may be (almost) empty; it serves primarily as a switch. It could also be the place for a vendor to add custom telemetry data endpoints.
For any telemetry data point, there are two considerations:
- usability in the sense of fulfilling all requirements towards security and privacy. There must be as strong a guarantee as possible that the data point cannot leak sensitive data.
- usefulness to achieve some data collection goal. This particularly covers framing the data point with sufficient context.
Essentially, the output of the "show modules" command. This may reveal the
existence of custom modules; a whitelist filter could be added to prevent that.
The version is also reasonably important to have as context for all other data points. The telemetry data should probably be structured such that all other data points are subordinate to the version.
As FRR is already collecting counters for memory allocations on MTYPE_*,
this data is very relevant to include in telemetry. Even plain (non-)usage
of some MTYPE_* already gives a lot of information which features are used
and which aren't.
Luckily, there are in fact two ways to collect sanitized data for CLI commands:
- reporting on an unparsed DEFUN command string level, e.g.
  "router bgp [ASNUM$instasn [<view|vrf> VIEWVRFNAME] [as-notation <dot|dot+|plain>]]"
- reporting on the DEFUN tokens actually matching the command being executed, e.g.
  "router bgp ASNUM as-notation dot"
The latter clearly has some additional information. The former has the advantage of not requiring extra collection housekeeping since counters can be placed directly in the DEFUN's data (zero runtime memory allocations needed.)
In both cases, the most immediately obvious data point is to count executions of each command. An additional data point to consider would be to count failure results separately.
Statistics on these commands would however not work when other northbound interfaces are in use.
Similar to CLI commands, counters can be placed on YANG schema items, next to the callbacks implementing their operations. It is crucial in this case to use the path from the schema only.
Conversely, this only works for YANG-ified daemons.
In some ways, telemetry data on log messages might be a "holy grail" of data
since it provides visibility into actual code paths being hit. The
XXXXX-XXXXX message identification mechanism provides the necessary
scaffolding for referencing them in a sane, sanitized way.
However, the most useful data on this would be from debug messages, which are not generally enabled and thus cannot be counted if the logging code is never executed. PR #12272 addresses this, but will require (automated) adjustment of debug logging calls across the FRR codebase.
The first half of an FRR telemetry feature is to just collect the data locally,
which will probably end up in some file in a location like /var/lib/frr.
There are a bunch of implementation considerations with this (e.g. race conditions from daemons updating the file simultaneously, frequency of writing out data from memory to disk, etc.)
However, one consideration stands out to evaluate early: the telemetry data shouldn't be allowed to grow without bounds. As long as only a number of counters is collected for an exhaustively enumerable list of items, this should not be a problem, but it needs to be considered regardless.
With FRR slowly moving to an increased usage of multithreading, plain counters
are no longer as cheap and simple as they sound. The MTYPE_* accounting
mechanism uses atomic operations for its counters, and these atomic ops are
already visible in perf data. Since allocating memory is in itself a somewhat
costly operation, the added cost of counting may be acceptable there, but in
particular for disabled debugging messages this may not be the case.
There are possible cheaper alternatives in either using thread-local counters,
or the recently added rseq mechanism on Linux (cf.
https://github.com/compudj/librseq).