This tutorial provides an overview of using BNW to build a Bayesian network model from a dataset and use the network to make predictions. The dataset used in this tutorial is a synthetic example of a genetic dataset that has a total of 8 variables. Two of the variables are genotypes labeled Geno1 and Geno2, and the remaining 6 variables are gene expression levels or other quantitative traits that are labeled Trait1 to Trait6. The dataset is available here.

The data file is formatted according to the guidelines on the BNW help page. The first row of the file contains the names of the variables and the remaining rows contain the data for each sample of the dataset. The genotypes (Geno1 and Geno2), which have two possible states (1 and 2), are the only discrete variables in the network and are the leftmost variables in the input file. The quantitative traits are continuous variables.
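Before uploading, it can help to check a file against the layout described above. The following is a minimal sketch, not part of BNW: the function name `check_bnw_layout`, the tab delimiter, and the rule "a column is discrete when all of its values are integers" are assumptions made for illustration. The BNW help page remains the reference for the actual format requirements.

```python
import csv
import io


def check_bnw_layout(text, delimiter="\t"):
    """Return (names, is_discrete) for a delimited table with a header row.

    Hypothetical helper: raises ValueError if a discrete column appears
    to the right of a continuous one.
    """
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    names, data = rows[0], rows[1:]

    is_discrete = []
    for col in range(len(names)):
        values = [row[col] for row in data]
        # Assumption: a column counts as discrete when every entry is an integer.
        is_discrete.append(all(v.lstrip("-").isdigit() for v in values))

    # Discrete variables are expected to be the leftmost columns.
    first_continuous = is_discrete.index(False) if False in is_discrete else len(names)
    if any(is_discrete[first_continuous:]):
        raise ValueError("discrete columns must come before continuous columns")
    return names, is_discrete


# Made-up rows that only mimic the tutorial's variable names.
sample = "Geno1\tGeno2\tTrait1\n1\t2\t0.35\n2\t1\t-1.20\n"
names, is_discrete = check_bnw_layout(sample)
print(list(zip(names, is_discrete)))
```

The View uploaded variables and data option in BNW reports how the server itself classified each variable, so a local check like this is only a first pass.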

1. Structure learning using default options

We do not know the network structure that underlies the relationships between the variables in this dataset, so we will use BNW to learn the structure that best explains the data. To begin, select Learn a network model from data from the BNW home page. Next, click Choose File at the top of the file upload page, navigate to and select the data file, and click Upload. A screen similar to the image shown below should be displayed.

Clicking View uploaded variables and data on the left menu brings up a pop-up window that displays the uploaded data file or information about the dataset, such as whether each variable was classified as discrete or continuous. This information can be used to confirm that the input file was uploaded and interpreted correctly by BNW.

The other options on the left menu continue with structure learning. Initially, we will perform structure learning with the default options in BNW by clicking the Perform Bayesian network modeling using default settings button. By default, BNW limits the maximum number of parents for any node in the network to 4 and presents the structure of the single highest scoring network. Structure learning can take a significant amount of time for larger networks. For this dataset, structure learning should take only a few seconds, and the structure below will soon be displayed. The network can also be accessed here.

In the single best scoring network structure, the genotype node Geno1 directly influences Trait2, and the genotype node Geno2 directly influences Traits 3, 4, and 6. There are also several direct interactions between traits; for example, Trait2 influences Trait1 and Trait4. Additionally, Trait5 is absent from the network, indicating that this variable did not interact with the other variables strongly enough to be located within the highest scoring network.

2. Modifying structure learning settings

To test whether the edges present in the single best scoring network are conserved across high scoring networks, we can modify the structure learning settings to identify the structures of many high scoring networks and perform model averaging over these structures. To do this, return to the BNW home page, select Learn a network model from data, and upload the data file. Instead of using the default settings, select Go to structure learning settings and the BNW structural constraint interface. A more detailed overview of the structural constraint interface is provided in another tutorial; here, we will investigate the impact of modifying some of the structure learning settings shown below:

Change the Number of networks to include in model averaging to 100 and select Perform Bayesian network modeling. Now, instead of displaying the single highest scoring network, BNW will determine the 100 highest scoring networks, perform model averaging over these networks, and display the structure that contains every edge whose model averaging score is greater than the Model averaging edge selection threshold of 0.5. Model averaging over the 100 highest scoring structures changes the network structure, as shown below; the resulting network is also available here.

Specifically, the single best scoring network contains a directed edge from Trait6 to Trait3, while this edge is absent from the structure after model averaging over the 100 best scoring networks. Clicking Display structure matrix displays the model averaging scores as well as the structure matrix. The structure matrix file can be downloaded and used in return sessions to BNW, allowing users to skip structure learning and more quickly use the model to make predictions.

In the model averaging scores, we can observe that most of the network edges in the model average network were found in all or nearly all of the 100 highest scoring networks. For example, edges from Geno2 to Traits 3 and 4 have posterior probabilities of 1. Also, the edge from Trait6 to Trait3 that was observed in the single highest scoring network has a posterior probability of 0.46, and it, therefore, falls just short of being included in the model averaged network structure.
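The thresholding step described above can be sketched in a few lines. This is an illustration only: the function name `average_edges` is hypothetical, the sketch assumes every network gets equal weight unless weights are supplied (how BNW weights the networks is not described here), and the toy structures are made up so that one edge reproduces the 0.46 score mentioned above.

```python
from collections import defaultdict


def average_edges(structures, weights=None, threshold=0.5):
    """Return (scores, kept_edges) for a list of edge lists.

    Each edge score is the weighted fraction of structures that contain
    the edge. Equal weights are an assumption, not BNW's documented rule.
    """
    if weights is None:
        weights = [1.0] * len(structures)
    total = sum(weights)

    weight_with_edge = defaultdict(float)
    for edges, w in zip(structures, weights):
        for edge in set(edges):  # count each edge once per structure
            weight_with_edge[edge] += w

    scores = {edge: w / total for edge, w in weight_with_edge.items()}
    # An edge is kept only if its score is strictly greater than the threshold.
    kept = {edge for edge, s in scores.items() if s > threshold}
    return scores, kept


# Toy input: 50 made-up structures, 23 of which contain Trait6 -> Trait3.
common = [("Geno2", "Trait3"), ("Geno2", "Trait4")]
structures = [common + [("Trait6", "Trait3")]] * 23 + [common] * 27

scores, kept = average_edges(structures)
for edge in sorted(scores):
    print(edge, round(scores[edge], 2), "kept" if edge in kept else "dropped")
```

Raising or lowering `threshold` in this sketch mirrors the effect of changing the Model averaging edge selection threshold: a lower value admits edges that appear in fewer of the high scoring networks.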

3. Using the network structure to make predictions

To make predictions with the network, we will use the structure learned after model averaging over the 100 highest scoring networks. First, we will use the model to compare the expected values for nodes in the network based on observed genotypes. These predictions use evidence mode, which is the default behavior in BNW. The difference between evidence and intervention modes is discussed in the BNW FAQ page. To make predictions based on Geno1, click on one of the blue bars in the Geno1 node and enter 1 or 2 to indicate which genotype value should be used to predict the values of the other network nodes. In the figure below, Geno1 is outlined in red and state 2 has a 100% probability, indicating that the value of this node has been entered as evidence. The red lines in the figure show the predicted distributions of the nodes after this evidence is known and can be compared with the blue lines, which show the distributions of the variables in the original data.

If Geno1 has genotype 2, the values of Traits 1, 2, and 4 are expected to increase compared with the distributions for all data. Specifically, the mean value of Trait2 is expected to be near 1 for Geno1=2 data, while it is close to 0 when this evidence is not known. Traits 1 and 4 are also expected to increase, but by a smaller amount. The predicted distributions for the other nodes in the network, which are not descendants of Geno1, are expected to be nearly identical to their original distributions, so the red line covers the blue line for some nodes.

To quantitatively assess the impact of this evidence on the network predictions, the View parameters button on the left menu can be selected. Clicking this button brings up a pop-up window with the network parameters (i.e., the probability distributions of the states of discrete nodes and the means and standard deviations of the Gaussian distributions for continuous nodes) for both the original data set and the data when considering the entered evidence.
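The shift in Trait2 described above can be illustrated with a small calculation. The sketch below assumes a single discrete parent (Geno1) and a Gaussian child (Trait2) whose mean depends on the parent's state. The function name `mixture_mean_sd` and all parameter values are invented for illustration; they are not read from BNW, whose own inference procedure is not described here.

```python
import math


def mixture_mean_sd(probs, means, sds):
    """Mean and standard deviation of a mixture of Gaussians.

    probs[i] is the probability of parent state i; means[i] and sds[i]
    describe the child's Gaussian distribution given that state.
    """
    mean = sum(p * m for p, m in zip(probs, means))
    second_moment = sum(p * (s * s + m * m) for p, m, s in zip(probs, means, sds))
    return mean, math.sqrt(second_moment - mean * mean)


# Hypothetical parameters for Trait2 given Geno1 = 1 and Geno1 = 2.
means = [-1.0, 1.0]
sds = [1.0, 1.0]

# Without evidence: both genotypes equally likely (an assumption).
print(mixture_mean_sd([0.5, 0.5], means, sds))

# With evidence Geno1 = 2: all probability mass on the second state.
print(mixture_mean_sd([0.0, 1.0], means, sds))
```

With these made-up numbers, entering evidence moves the mean of the child from the weighted average of the two state means to the mean for the observed state and narrows the distribution, which is the qualitative pattern of the red and blue curves compared above.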

Evidence for multiple nodes can be considered at the same time by clicking on another node in the network and entering a value. Alternatively, users can select Clear evidence to reset the network and show the original distributions in a new tab.

Next, we will make predictions using intervention mode. To switch to intervention mode, click the button next to Intervention at the top of the page. The Selected mode tab on the left of the screen should now display Intervention. The effects of an experimental intervention on Trait2 can be predicted by clicking on the blue line in the Trait2 node and entering a value for the variable. The figure below shows the network after setting Trait2 to a value of 1.5.

Entering a value for Trait2 in intervention mode results in a change in the structure of the network. Specifically, as Trait2 is now set by the experimental intervention, it is no longer dependent on its parents. In this case, as Geno1 was only connected to the network by being the parent of Trait2, Geno1 no longer appears in the network. Also, intervention mode only predicts how nodes that are descendants of Trait2 are affected. Setting Trait2 to 1.5 results in increases in the predicted values of both Trait1 and Trait4.
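The structural change described above can be sketched as a small graph operation. The edge list below contains only the edges named in this tutorial, so it is a partial view of the model-averaged network, and the function names `intervene`, `descendants`, and `connected_nodes` are hypothetical helpers rather than anything BNW exposes.

```python
def intervene(edges, target):
    """Drop every edge that points into the intervened node."""
    return [(parent, child) for parent, child in edges if child != target]


def descendants(edges, node):
    """Return all nodes reachable from `node` by following directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def connected_nodes(edges):
    """Nodes that still take part in at least one edge."""
    return {node for edge in edges for node in edge}


# Only the edges mentioned in the text; the full network may contain more.
edges = [
    ("Geno1", "Trait2"),
    ("Geno2", "Trait3"), ("Geno2", "Trait4"), ("Geno2", "Trait6"),
    ("Trait2", "Trait1"), ("Trait2", "Trait4"),
]

after = intervene(edges, "Trait2")
print("Geno1 still connected:", "Geno1" in connected_nodes(after))
print("Affected by the intervention:", sorted(descendants(after, "Trait2")))
```

On this partial edge list, removing the edge into Trait2 leaves Geno1 with no remaining connections, and the only descendants of Trait2 are Trait1 and Trait4, which matches the behavior described above.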