Predictive Results Calculations
Introduction
Acquiring actionable cybersecurity information for a third-party cyber risk management program can be extremely time-consuming, with assessment completion and validation taking months and sometimes years. Once the initial assessment is complete, the data can quickly become stale and irrelevant to emerging cyberattacks. The Global Risk Exchange offers a solution: a dynamic risk profile for any company in the Exchange that provides actionable information for urgent cybersecurity decisions on a much shorter timeline. These Predictive Risk Profiles are produced by applying advanced machine learning to data from varied sources, including self-attested assessments from our third-party risk Exchange, firmographic information, outside-in scanning, and threat intelligence data from our partners. Using this data and machine learning, we can produce cybersecurity information for any given third party in near real-time.
Overview
The Exchange has curated a dataset of structured cybersecurity assessments enriched with external and internal data to train a predictive model using a standard 80/20 split, focusing on control areas, maturity levels, and various cybersecurity metrics.
Bayesian networks model the joint probability distribution of assessment answers using conditional dependencies to manage complexity, enhance explainability, incorporate expert knowledge, and effectively handle missing data.
Training a Bayesian network involves structure learning using the hill-climbing algorithm with BIC scoring, manual mapping of dependencies by experts, and parameter fitting using CPD tables, with strategies for handling sparse data and validation weighting to refine probabilities.
To predict assessment answers using a company's input data, we query the Bayesian network with particle-based methods, generating 1000 samples per company for robust analysis, which takes about 30 seconds per block to compute and can be run in parallel for multiple companies.
To provide insights into a company's cybersecurity program, we use Bayesian network-generated sample blocks to estimate true scores, produce histograms for control and maturity group coverage, and display medians and confidence ranges.
Accuracy is measured by comparing the predicted results for each control to the attested answers in the test set, using the maximum predictive outcome for Yes/No/Not Applicable.
Structured Data
The Exchange has been collecting structured cybersecurity assessments from its members since 2016. Additionally, a subset of these assessments includes validated controls. This data has been curated to create a dataset ideal for training a machine learning model. The standard 80/20 split was used to create the training and testing sets. The information in these assessments covers the most relevant control areas of a member's cybersecurity program. The maturity level of each control group's related people, processes, and technology is also considered. The assessment data provides primarily binary information, with up to 250 answers at the GRX Tier 2 Assessment level. Tier 2 assessments allow for the confirmation of cybersecurity controls but do not consider the effectiveness of the controls.
The predictive model's input variables were chosen from external sources and internally Exchange-sourced data. For example, a company's firmographics, such as industry, revenue, size, age, and online popularity, were paired with data about its maturity and control coverage. Additional cybersecurity information enters the model through breach monitoring relating to leaked passwords, product vulnerabilities, policy violations, domain weaknesses, etc. This is paired with numerical ratings for vulnerability severity, such as web security, software patching, and email security, which are collected through automated network scanning.
The predictor variables are listed in Table 1 with their data types; a categorical variable takes one of several discrete classes. For example, a company is classified by the number of employees into the ranges 1-5, 6-10, 11-20, 21-50, 51-100, 101-250, 251-500, 501-1000, 1001-2000, 2001-5000, 5001-10000, and 10000+.
TABLE 1
Feature | Data Type |
Industry | Categorical |
Revenue | Categorical |
Employee Number | Categorical |
Age | Integer |
Online Presence | Integer |
Network Scanning | Float |
Breach Monitoring | Categorical |
Table 1: External data used as input into the model. Network scanning is a family of several variables that describe a company's security at the domain-accessible level. Breach monitoring consists of several variables gathered from the dark web, breach signals, and datasets.
Model
For modeling, we consider each assessment control a random variable that will be observed once the Exchange member has completed the assessment. We model the joint probability distribution of assessment answers given an Exchange member's attested answers. The controls are primarily variables with a binomial distribution over Yes/No outcomes, with a few being multinomial over three outcomes: Yes/No/Not Applicable. The maturity questions, meanwhile, are multinomial over six answer options. For illustration, if we consider 200 independent binary target variables, there are 2^200 − 1 (approximately 1.6 × 10^60) parameters to determine in the joint distribution. Since cybersecurity controls tend to correlate with one another, given the context of the question in the assessment, we can reduce the complexity of fitting so many parameters by leveraging conditional dependencies between response variables while maintaining correlations between questions. For these reasons, a Bayesian network was utilized to model the data.
The underlying structure of a Bayesian network is a graph where each node represents a random variable, in this case, the assessment question, and is connected to other nodes through conditional dependence, i.e., the answer to a question depends on other questions. This structure allows questions to have local distributions by only considering the questions they depend on, known as parents.
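The parameter reduction from this factorization can be made concrete with a quick calculation. The sketch below compares the free parameters of a fully general joint distribution over 200 binary controls against a network in which each node has at most three parents; the parent counts are hypothetical, for illustration only.

```python
# Parameter counts: full joint distribution vs. Bayesian network
# factorization over binary variables. Parent counts are hypothetical.

def full_joint_params(n_vars):
    """A general joint over n binary variables needs 2^n - 1 free parameters."""
    return 2 ** n_vars - 1

def bn_params(parent_counts):
    """A binary node needs one free parameter, P(Yes), per parent
    configuration: 2^|parents| parameters per node."""
    return sum(2 ** k for k in parent_counts)

# 200 controls; hypothetical structure with at most 3 parents per node
parents = [min(i, 3) for i in range(200)]
print(full_joint_params(200))  # 2^200 - 1, on the order of 1.6e60
print(bn_params(parents))      # a few thousand parameters at most
```

The gap of dozens of orders of magnitude is what makes fitting the model from a finite set of assessments feasible.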
In addition to the parameter reduction, Bayesian networks have several properties that are advantageous for modeling assessment data:
- Due to the statistical nature of the model, they are explainable compared to other popular models that fit high-dimensional data.
- Expert domain knowledge can be included as prior distributions over the variables.
- The model still produces predictions for the response variables even if there is missing data in the input.
- The theory behind Bayesian networks has been developed over decades.
Training
Training a Bayesian network is a two-step process: learning the structure of the dependencies and fitting the parameters of the variables. We used a popular score-based hill-climbing (HC) algorithm in this development. HC finds a local optimum by searching the possible orientations of edges connecting one variable to another and assigning a score to each candidate structure. The Bayesian Information Criterion (BIC) was used to score the structures.
The BIC score penalizes the number of parameters in the model while maximizing the likelihood of the observed answers. This penalty keeps the model's complexity from growing too large, which could otherwise overfit the training data and lead to poor performance on unseen data. The HC algorithm iteratively searches for a maximum BIC score until the pre-defined maximum number of iterations is reached or no further increases in the score are found.
With the assistance of internal cybersecurity professionals, several input variables were manually mapped to assessment controls to force dependencies between variables and reduce the search space for the structure. Examples of these mappings are in Figure 1, where several of the inputs were mapped to selected variables in the graph.
Figure 1.1: Email security input mapped to two controls |
Figure 1.2: Web app security input mapped to four controls |
Figure 1.3: Software patching input mapped to six controls; four included in the image |
Figure 1: Images from the structure of the model where the input nodes were manually mapped.
Once the structure is found, the conditional dependencies between the variables are known, and we can fit the parameters of the variables in the graph using the assessment answers. The probabilities of the outcomes are stored in a Conditional Probability Distribution (CPD) table that captures the local distribution of a variable given its parents. A prior probability distribution was included over the outcomes of the variables to smooth out the bias in the assessment answers.
Table 2 shows an example of a fitted CPD table, with the conditional distributions in columns depending on the parents' outcomes.
TABLE 2
Parent 1 | Parent 1 (0) | Parent 1 (0) | Parent 1 (1) | Parent 1 (1) |
Parent 2 | Parent 2 (0) | Parent 2 (1) | Parent 2 (0) | Parent 2 (1) |
Variable (0) | 0.86 | 0.25 | 0.4 | 0.7 |
Variable (1) | 0.14 | 0.75 | 0.6 | 0.3 |
Table 2: An example of a CPD table where the columns are the conditional probabilities of the variable given the observed outcomes of the parents shown in parentheses.
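A minimal sketch of the fitting step: counting answers per parent configuration and smoothing each cell with a pseudocount prior (a Laplace-style prior here for illustration; the exact prior the Exchange uses is not specified in this document). Variable and state names are hypothetical.

```python
from collections import Counter
from itertools import product

def fit_cpd(data, child, parents, states, pseudocount=1.0):
    """Fit P(child | parents) from counted answers, adding a pseudocount
    to every cell so unseen parent configurations still yield a
    well-defined (uniform) distribution."""
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    cpd = {}
    for pa in product(*(states[p] for p in parents)):
        total = sum(counts[(pa, s)] for s in states[child]) \
                + pseudocount * len(states[child])
        cpd[pa] = {s: (counts[(pa, s)] + pseudocount) / total
                   for s in states[child]}
    return cpd

# Toy data: three Yes and one No answer, all under parent state "0";
# parent state "1" is never observed but still gets a distribution.
states = {"parent": ["0", "1"], "control": ["Yes", "No"]}
data = [{"parent": "0", "control": "Yes"}] * 3 + [{"parent": "0", "control": "No"}]
cpd = fit_cpd(data, "control", ["parent"], states)
```

With the pseudocount, the observed column is pulled slightly toward uniform (4/6 instead of 3/4 for Yes), and the unobserved column defaults to 50/50 rather than being undefined.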
For some variables, we do not have enough data to reliably build a probability distribution for every possible state. We use two methods to overcome this lack of data.
First, we group states to form categories with larger data sets. After training a CPD on these binned categories, we retrain the model on the original unbinned data using the binned CPDs as prior probability distributions. This ensures that outliers do not have an outsized effect in any state, giving us sensible distributions for states with little to no data.
Due to limitations in the algorithms that train the model, we cannot use continuous (linear-scale) distributions as parent nodes. For data inputs that would lend themselves well to a linear statistical model, such as network scanning data, we first use tree-based algorithms to determine the most effective binning strategy for these variables, then train discrete CPDs for those bins. The data points and bins are then used to build a linear approximation of these CPDs so that each possible input value has its own distinct probability distribution.
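One way to realize that linear approximation is piecewise-linear interpolation between the per-bin probabilities, sketched below. The bin midpoints and per-bin P(Yes) values are made-up placeholders standing in for the tree-learned bins.

```python
def interpolate_p_yes(bin_mids, bin_p_yes, x):
    """Piecewise-linear approximation of per-bin CPD probabilities:
    clamp outside the outermost midpoints, interpolate between them."""
    if x <= bin_mids[0]:
        return bin_p_yes[0]
    if x >= bin_mids[-1]:
        return bin_p_yes[-1]
    for (m0, p0), (m1, p1) in zip(zip(bin_mids, bin_p_yes),
                                  zip(bin_mids[1:], bin_p_yes[1:])):
        if m0 <= x <= m1:
            t = (x - m0) / (m1 - m0)
            return p0 + t * (p1 - p0)

# Hypothetical scan-score bins (0-10 rating) with trained P(Yes) per bin
mids = [2.0, 5.0, 8.0]
p_yes = [0.30, 0.60, 0.90]
```

A score of 3.5 then receives its own probability (0.45 here) rather than the coarse probability of whichever bin it falls in.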
Additionally, validation results are used to weight the importance of each assessment and its influence on the fitted probabilities. Implementation proportions were compared between attested answers and validated answers, and a test of proportion equality was performed; its p-value, which ranges from 0 to 1, serves as the weight, so larger differences equate to lower weight. Non-validated assessments are weighted with a similar procedure but capped at a maximum weight of 0.5, since we cannot validate the answers.
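A minimal sketch of that weighting, assuming a standard two-proportion z-test as the test of proportion equality (the specific test the Exchange runs is not named in this document); the inputs are implementation counts out of the attested and validated control totals.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def assessment_weight(x_attested, n_attested, x_validated, n_validated, cap=1.0):
    """Two-proportion z-test of equal implementation rates; the p-value
    (0..1) becomes the weight, so larger disagreement -> lower weight.
    Non-validated assessments would pass cap=0.5."""
    p1 = x_attested / n_attested
    p2 = x_validated / n_validated
    p_pool = (x_attested + x_validated) / (n_attested + n_validated)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_attested + 1 / n_validated))
    if se == 0:
        return cap  # degenerate case: all answers identical
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - phi(abs(z)))
    return min(p_value, cap)
```

Perfect agreement (e.g. 50/100 attested vs. 50/100 validated) yields a weight of 1.0, while a large gap (90/100 vs. 50/100) drives the weight toward zero.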
TABLE 3
Input | Training Base Model | Refreshing Predictive Profile with Current Data |
Historical Attested Assessments | ✔️ | |
Cyber Security Expert Opinion | ✔️ | |
Validated Controls | ✔️ | |
Assessee Comments on Validated Controls | ✔️ | |
Industry | ✔️ | ✔️ |
Revenue Range | ✔️ | ✔️ |
Employee Range | ✔️ | ✔️ |
Company Founding Date | ✔️ | ✔️ |
Online Presence | ✔️ | ✔️ |
Company Technology Infrastructure | ✔️ | ✔️ |
Perimeter Scanning | ✔️ | ✔️ |
Threat Intelligence | ✔️ |
Table 3: All input data used to attain a base model and all input data used to refresh a company's profile.
Querying
To predict the assessment answers given a company’s input data or evidence, we must query the Bayesian network for inference on those variables. We use particle-based methods, such as random sampling from the Bayesian network, to produce a more robust analysis of the potential assessment outcomes. Given enough samples, we can approximate the true distribution.
Sampling a variable consists of selecting the CPD column whose parent outcomes match the observed evidence and then drawing an outcome with the probabilities of that column. Variables are sampled in topological order: the sampled outcomes of parents feed into the CPDs of the variables that depend on them, known as children, until the end of the ordering is reached. This process is then repeated a predefined number of times to produce the same number of potential assessments.
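This ancestral-sampling loop can be sketched as follows; the two-node network (SSO -> MFA), its CPD values, and the control names are hypothetical stand-ins for the full assessment graph.

```python
import random

def sample_assessment(order, cpds, evidence, rng):
    """Draw one possible assessment: visit variables in topological
    order, fix any observed evidence, and sample the rest from the
    CPD column selected by the already-sampled parent outcomes."""
    outcome = dict(evidence)
    for var, parents in order:
        if var in outcome:
            continue  # evidence stays fixed
        column = cpds[var][tuple(outcome[p] for p in parents)]
        states, probs = zip(*column.items())
        outcome[var] = rng.choices(states, weights=probs, k=1)[0]
    return outcome

# Tiny hypothetical network: SSO -> MFA
cpds = {
    "SSO": {(): {"Yes": 0.6, "No": 0.4}},
    "MFA": {("Yes",): {"Yes": 0.9, "No": 0.1},
            ("No",):  {"Yes": 0.3, "No": 0.7}},
}
order = [("SSO", []), ("MFA", ["SSO"])]
rng = random.Random(42)
block = [sample_assessment(order, cpds, {"SSO": "Yes"}, rng) for _ in range(1000)]
```

With SSO observed as Yes, roughly 90% of the 1000 sampled assessments answer MFA as Yes, matching the conditioned CPD column.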
To process a company for results, we need a set of possible assessments to analyze, given the input variables for that company. After some experimentation, a sample size of 1000 was chosen; this size balanced time to completion against accuracy of approximation. It takes about 30 seconds to score and analyze a block of 1000 sampled assessments for a company. The computation is then distributed to run in parallel across multiple companies to obtain results.
Predictive Risk Calculations
Insights into a company's cybersecurity program must remain consistent with the metrics and summaries already established for self-attested assessments. When an Exchange member completes an assessment, scores are generated from their answers to provide an overview of the level of risk in their cybersecurity program. These include group coverage for the most relevant control areas and maturity group coverage. To estimate the true scores, we describe a company's possible distribution of scores by querying the possible assessments from the Bayesian network, producing what we call a sample block. An illustration of a sample block is included in Table 4. This provides a more robust view of the possibilities and builds a confidence score around the distribution of scores.
TABLE 4
Outcomes | Question 1 | Question 2 | … | Question 251 |
Assessment 1 | Yes | No | … | No |
Assessment 2 | Yes | Yes | … | Yes |
… | … | … | … | … |
Assessment 1000 | No | Yes | … | No |
Producing coverage and maturity scores at the group level begins by iterating through each assessment in the sample block and scoring that possible assessment for each group. Once all the sampled assessments have been scored, a histogram per group is created with the scores. A histogram allows us to display the median as the expected value of the scores and confidence around the expected value for the control coverage.
The maturity coverage is displayed as a range of low, median, and high scores. The median is the expected score value, and the low and high scores cover a margin of error for the estimate.
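The scoring of a sample block can be sketched as below: each sampled assessment is scored per group, and the resulting score distribution yields a median and a low/high band. The 10th/90th percentile band and the control names are illustrative assumptions, not the Exchange's exact confidence definition.

```python
from statistics import median, quantiles

def group_coverage(assessment, group_controls):
    """Fraction of a group's controls answered Yes in one sampled assessment."""
    return sum(assessment[c] == "Yes" for c in group_controls) / len(group_controls)

def group_summary(block, group_controls):
    """Median score plus a low/high band over the whole sample block.
    The 10th/90th percentile band is an illustrative choice."""
    scores = sorted(group_coverage(a, group_controls) for a in block)
    deciles = quantiles(scores, n=10)  # nine cut points: 10th..90th percentile
    return {"low": deciles[0], "median": median(scores), "high": deciles[-1]}

# Toy sample block over a two-control group (hypothetical control names)
block = [
    {"MFA": "Yes", "SSO": "Yes"},
    {"MFA": "Yes", "SSO": "No"},
    {"MFA": "No",  "SSO": "No"},
]
summary = group_summary(block, ["MFA", "SSO"])
```

In production, the same loop runs over the 1000-assessment block for every control and maturity group, and the resulting histogram backs the displayed median and range.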
Accuracy
Accuracy is measured by predicting results for the companies in the test set, then comparing, control by control, the predicted result to the attested answer. Since an answer is required, the maximum predictive outcome is used among Yes/No/Not Applicable.
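The comparison can be sketched as follows: the modal outcome across a company's sample block serves as the maximum predictive outcome, and per-control accuracy is the share of test companies whose attested answer matches it. The data layout here is a hypothetical simplification.

```python
from collections import Counter

def predicted_answer(sample_block, control):
    """Most frequent outcome across the sample block: the maximum
    predictive outcome for this control."""
    return Counter(a[control] for a in sample_block).most_common(1)[0][0]

def control_accuracy(test_set, control):
    """Share of test companies whose attested answer matches the
    modal predicted answer for one control."""
    hits = sum(
        predicted_answer(company["block"], control) == company["attested"][control]
        for company in test_set
    )
    return hits / len(test_set)

# Two toy test companies: the first is predicted correctly, the second is not
test_set = [
    {"block": [{"c": "Yes"}, {"c": "Yes"}, {"c": "No"}], "attested": {"c": "Yes"}},
    {"block": [{"c": "No"}, {"c": "No"}, {"c": "Yes"}], "attested": {"c": "Yes"}},
]
```

Repeating this per control over the whole test set produces the accuracy distribution summarized below.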
Full assessment accuracy:
Figure 2: Distribution of overall assessment accuracy.
Summarization statements pulled from the histogram based on percentiles:
- At the 33rd percentile, the model predicts two-thirds (147) of the controls with 70% or greater accuracy.
- At the median (50th percentile), the model predicts half (110) of the controls with 76% or greater accuracy.
Control Score Distributions:
There are controls for which obtaining data beyond attested answers is difficult; physical security is one example. At the other extreme, some controls are standard practice today and are predicted with high accuracy, such as having a risk program.
The following histograms show the control accuracy for the previous model in blue, the current model in orange, and their overlap in red. Examining these 220 distributions allows an end user to lean on the model to accelerate a risk program and to understand how control predictions improve iteratively from the inputs listed in Table 3.
Strategic Controls
Operational Controls
Core Controls
Management Controls
Privacy Controls
Timing
The process that generates predictive results for companies on the Exchange has run daily since August 2024. Every day around 17:00 Eastern Time, we trigger the predictive model and calculate results for a company if any of the following conditions are met:
- New company on the Exchange without predictive results
- Industry change
- Risk Recon change
- Recorded Future change
- Company on the Exchange with predictive results more than 90 days old
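The trigger conditions above can be expressed as a simple predicate; the field names and function name here are hypothetical, while the 90-day threshold comes from the list itself.

```python
from datetime import datetime, timedelta

def needs_refresh(company, now):
    """Mirror the daily trigger conditions: new company, changed
    inputs, or predictive results older than 90 days."""
    if company.get("last_predicted") is None:
        return True  # new company on the Exchange without predictive results
    if (company.get("industry_changed")
            or company.get("riskrecon_changed")
            or company.get("recordedfuture_changed")):
        return True  # an upstream input changed
    return now - company["last_predicted"] > timedelta(days=90)  # stale results
```

Each day, companies satisfying this predicate are queued for a fresh run of the predictive model.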
This ensures that customers have risk data they can rely on 365 days of the year.