Platforms Are Still Flying Blind When It Comes to Societal Impact — Better Metrics Could Help
After the 2020 election, a Twitter dashboard I had first prototyped four years earlier started going wild. It estimates misinformation prevalence by monitoring “the percent of retweets and likes pointing toward domains that had made a habit of sharing misinformation.” This metric had been climbing throughout the election cycle, from a low of around 10% to almost 20% on November 3rd. Then it jumped to 30% over the next week and stayed there for almost a month. Something was likely very wrong.
Tracking this sort of change is a valuable step toward understanding the platform’s impact. It was almost the simplest possible “health metric,” and it likely should have triggered alarm bells and executive meetings — a spike like this “should” have been treated as seriously as a significant drop in a revenue dashboard. But it was clearly not visible enough.
In practice, product teams at companies like Twitter, Facebook, and YouTube are primarily rewarded for two things: moving metrics and shipping products — which then move metrics. If you want to change a system, you must understand its incentives. Since metrics are core to how platforms make decisions, tools that make measurement or improvement easier can be one of the most effective routes to moving them. We can’t know if we’ve reduced the extent of misinformation or hate speech without an estimate of how much there is. Paying more attention to the type of information tracked by such metrics could also help ensure platforms can compete on characteristics more valuable to society than attention or stock value. Shortly after the 2016 election, I published a piece suggesting a focus on one such metric, to measure attention toward misinformation.
We still don’t have a great way to measure this — the metric that showed the spike of misinformation right after the 2020 election is only a very rough proxy. I developed the initial version independently, then brought the project first to the Tow Center at Columbia and finally to the Center for Social Media Responsibility at the University of Michigan, where it found a permanent home. The center’s support was crucial in taking it from a Jupyter notebook to the publicly available continuous dashboard it is today.
Here is what that looked like through the spring of 2018:
This dashboard shows a metric for the percent of engagement on Facebook and Twitter with articles from “iffy” (potentially unreliable) sources — notice the spike in the fall of 2016.
The devil is in the details — in what we consider engagement, what is news, and what counts as “iffy.” But even this simple metric, with a very coarse measure of reliability, is powerful. It enables us to observe how real-world events and platform mitigations correspond to significant moments in engagement with potentially unreliable sources.
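To make the shape of such a metric concrete, here is a minimal sketch of the “iffy engagement” proxy: the share of engagement (here, retweets plus likes) whose links resolve to domains on a list of potentially unreliable sources. The domain list, field names, and data are all illustrative assumptions, not the actual dashboard’s schema or source list.

```python
from urllib.parse import urlparse

# Placeholder list standing in for a curated "iffy" source list.
IFFY_DOMAINS = {"example-unreliable.com", "madeup-news.net"}

def iffy_engagement_share(posts):
    """posts: iterable of dicts with 'url', 'retweets', 'likes' (hypothetical schema).
    Returns the engagement-weighted share of links to iffy domains."""
    iffy = total = 0
    for post in posts:
        weight = post["retweets"] + post["likes"]
        domain = urlparse(post["url"]).netloc.lower().removeprefix("www.")
        total += weight
        if domain in IFFY_DOMAINS:
            iffy += weight
    return iffy / total if total else 0.0

sample = [
    {"url": "https://example-unreliable.com/story", "retweets": 60, "likes": 40},
    {"url": "https://reliable.example.org/report", "retweets": 500, "likes": 400},
]
print(iffy_engagement_share(sample))  # 0.1 — 10% of engagement is "iffy"
```

Note that weighting by engagement, rather than counting posts, is what makes this a measure of attention rather than of content volume.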
Platform health metric criteria
Here is a set of criteria that a metric should ideally satisfy. They form the mnemonic “AIIM” — as aiming for healthy platforms is what such metrics can help us do.
- Actionable: Could changes in this metric affect platform decisions and actions?
- Important: Does this metric target issues of societal importance?
- Impactful: Could such platform actions have significant real world impact?
- Meaningful: Does this metric meaningfully capture what it seeks to measure?
To fulfill the AIIM criteria, we need to understand what is societally important and what goals, values, and properties platforms should be striving to uphold. We expect this to be a continuous and evolving conversation. At the same time, we want to make sure that these conversations around principles don’t block urgent action and collaboration.
Creating a community of metrics
Developing such metrics is often challenging and expensive, and there is a broad set of issues and products that are potentially valuable to measure. An increasing number of studies are doing this type of measurement, which is a great step forward. However, these are “one-off” studies, and it can be difficult to build on this work or make crucial comparisons across products and issues.
We can do better.
The first step is a form of lightweight collaboration. To understand how that might work, we need to dive into the weeds of what a metric is made of.
At its core, a metric is a way to evaluate the aggregate properties of a stream of data. There are many types of properties one might measure that could be considered beneficial or harmful about a platform, for example: information credibility (and sub-components of that like political and health information credibility), polarization, outrage, civic utility, authenticity, toxicity, silencing, extremism, “conspiracyness,” bottiness, meaningfulness, addictiveness, and so on. We can explore these properties for various types of data of a platform product — the output of their recommendation engines for content, for accounts, for trends; the actual content being engaged with by users, relationships between users, and so on. Any combination of these could satisfy the criteria for a platform health metric.
Practically, this can often be broken down into two core “metric infrastructure” components: a data collector and a property classifier. A classifier might be as simple as a list of URLs to match against, or it might involve machine learning, or crowdsourcing, or user survey questions — it can be anything that helps organize data for measurement.
Some health metrics examples
Putting that all together, here are some more examples of potential metrics, with the data bolded and the property italicized.
- Percent of **posts** with *polarizing* content.
- Percent of *toxic* **accounts**.
- Percent of **communities** with predominantly *respectful* behavior.
- Percent of **recommendations** of *conspiracy* content.
- Percent of **users** *feeling more informed* after using a platform.
- Percent of **engagement** with *forged* images or videos.
As you can see, there is a lot of variety — and many challenging data collection and classification tasks to undertake. More broadly, this document lays out an initial framework sketch of what might be measured (though it is not meant to be exhaustive or to provide full detail).
Rows are collectors/datastreams; columns are classifiers/properties — so each cell is a metric.
It could be invaluable to create a “metrics community inventory” of potentially useful collectors or classifiers, and perhaps individuals or organizations working toward them. These can then be carefully mixed, matched, and compared with other collectors and classifiers. For example, a classifier for polarizing content used for a study of Facebook could be repurposed to also apply to Twitter, Reddit, and Tumblr. In the longer term, hubs, tools, and standards could be created to facilitate sharing collectors, datastreams, datasets, and classifiers across metrics projects in a privacy preserving way.
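A minimal sketch of what such an inventory enables: collectors and classifiers registered once, then crossed so that each (collector, classifier) cell becomes a metric. The registry entries and data below are invented for illustration.

```python
# Hypothetical inventory: rows are collectors/datastreams,
# columns are classifiers/properties; each cell is a metric.
collectors = {
    "twitter_shares": lambda: ["iffy.example/a", "news.example/b"],
    "reddit_links":   lambda: ["news.example/c", "news.example/d", "iffy.example/e"],
}
classifiers = {
    "iffy":      lambda item: item.startswith("iffy.example"),
    "political": lambda item: "politics" in item,  # toy stand-in for a real classifier
}

def metric_matrix(collectors, classifiers):
    """Compute the prevalence of each property in each datastream."""
    matrix = {}
    for cname, collect in collectors.items():
        items = list(collect())
        for pname, classify in classifiers.items():
            matrix[(cname, pname)] = sum(classify(i) for i in items) / len(items)
    return matrix

for cell, share in metric_matrix(collectors, classifiers).items():
    print(cell, f"{share:.0%}")
```

Reuse falls out for free: adding one new collector (say, for Tumblr) immediately yields a new metric for every existing classifier, and vice versa.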
Many of these metrics would need to be built on sampling and qualitative assessments — humans doing manual evaluations of content, behaviors, perceptions, and so on. Those assessments eventually need to be compressed into numbers for comparison over time and place, but they are also valuable in their own right. Sampled well and shared along with metrics, they can help us get a fuller sense of what is happening on a platform — and what the metrics are actually indicating. These sampled data points and evaluations should also ideally be shared where possible (given privacy constraints), enabling a deeper “mixed methods” approach to understanding platform health.
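One simple way to compress sampled human judgments into a number, while keeping the uncertainty visible, is a prevalence estimate with a confidence interval. The sketch below uses a normal-approximation interval on binary labels; the sample counts are invented.

```python
import math

def prevalence_with_ci(labels, z=1.96):
    """labels: list of 0/1 human judgments on a random sample of items.
    Returns (estimate, margin) for an approximate 95% confidence interval."""
    n = len(labels)
    p = sum(labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)  # normal approximation
    return p, margin

# Hypothetical: 400 sampled posts, 60 judged problematic by human raters.
labels = [1] * 60 + [0] * 340
p, m = prevalence_with_ci(labels)
print(f"{p:.1%} ± {m:.1%}")  # 15.0% ± 3.5%
```

Reporting the margin alongside the estimate matters here: with sampled qualitative labels, an apparent month-over-month change smaller than the interval may not be a change at all.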
A way forward
Measuring the health of a complex ecosystem is never an easy task. We can’t do this without community, coordination, investment, and infrastructure. For example, I am now supporting an organization that is enabling large platforms and researchers to help quantify and sample the extremely qualitative evaluations required for meaningful metrics. This is incredibly challenging to do with high quality, ethically, and cost-efficiently at scale — it’s the type of new critical infrastructure we will need to make sense of this new world.
Understanding the impact of platforms on discourse and trust is one of the most important challenges of our time — and we are still failing. If we want to safely move toward a more connected future, we need better visibility into how our information ecosystem is changing, and platform health metrics are a crucial way of gaining that visibility. Let’s build the community and infrastructure we need so that we are up to that task.
We can’t afford to keep flying blind.