This tutorial provides an overview of using BNW to build a Bayesian network model from a dataset and use the network to make predictions. The dataset used in this tutorial is a synthetic example of a genetic dataset that has a total of 8 variables. Two of the variables are genotypes labeled Geno1 and Geno2, and the remaining 6 variables are gene expression levels or other quantitative traits that are labeled Trait1 to Trait6. The dataset is available here.

The data file is formatted according to the guidelines on the BNW help page. The first row of the file contains the names of the variables, and each remaining row contains the data for one sample. The genotypes (Geno1 and Geno2), which are the only discrete variables in the network, are the leftmost variables in the input file and take integer values (1 and 2) in every sample. The quantitative traits are continuous variables, so each of them contains a "." in the value for at least one sample.
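The formatting rules above can be checked before uploading a file. The following is a minimal sketch, not part of BNW itself: the function names are hypothetical, the sample rows are made up for illustration, and whitespace-delimited fields are an assumption (consult the BNW help page for the exact delimiter).

```python
def classify_columns(text):
    """Classify each column as discrete or continuous.

    Assumption: fields are separated by whitespace (tabs or spaces).
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    if any(len(row) != len(header) for row in data):
        raise ValueError("every data row must have one value per variable")
    kinds = {}
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        # A column counts as continuous if any of its values contains a "."
        is_continuous = any("." in value for value in column)
        kinds[name] = "continuous" if is_continuous else "discrete"
    return header, kinds


def discrete_columns_first(header, kinds):
    """Return True if no discrete column sits to the right of a continuous one."""
    seen_continuous = False
    for name in header:
        if kinds[name] == "continuous":
            seen_continuous = True
        elif seen_continuous:
            return False
    return True


# Made-up rows for illustration only; these are not values from the tutorial dataset.
sample = """Geno1 Geno2 Trait1 Trait2
1 2 0.52 -1.10
2 1 1.30 0.75
2 2 -0.20 2.00
"""

header, kinds = classify_columns(sample)
print(kinds)
print(discrete_columns_first(header, kinds))
```

A file that passes both checks follows the layout described above: a header row, discrete variables on the left, and a "." somewhere in every continuous column.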

1. Structure learning using default options

We do not know the network structure of this dataset, so we will use BNW to learn the structure that best explains the data. To begin, select Learn a network model from data from the BNW home page. Next, click Choose File at the top of the file upload page, navigate to and select the data file, and click Upload. The contents of the uploaded file should now be visible as shown below.

Initially, we will perform structure learning using default options in BNW. By default, BNW limits the maximum number of parents for any node in the network to 4 and presents the structure of the single highest scoring network. To perform structure learning using the default settings and view the network structure, click Perform Bayesian network modeling using default settings. The structure below should be displayed and the network is available here.

In the single best scoring network structure, the genotype node Geno1 directly influences Trait2, and Geno2 directly influences Traits 3, 4, and 6. There are also several direct interactions between traits; for example, Trait2 influences Trait1 and Trait4. Additionally, Trait5 is absent from the network, indicating that this variable did not interact with the other variables strongly enough to be included in the highest scoring network.

2. Modifying structure learning settings

To test whether the edges present in the single best scoring network are conserved across high scoring networks, we can modify the structure learning settings to identify the structures of many high scoring networks and perform model averaging over these structures. To do this, return to the BNW home page, select Learn a network model from data, and upload the data file. Instead of using the default settings, select Go to structure learning settings and the BNW structural constraint interface. A more detailed overview of the structural constraint interface is provided in another tutorial; here, we will investigate the impact of modifying some of the structure learning settings shown below:

Change the Number of networks to include in model averaging to 100 and select Perform Bayesian network modeling. Now, instead of displaying the single highest scoring network, BNW will determine the 100 highest scoring networks, perform model averaging over these networks, and display the structure that includes all edges whose model averaging score exceeds the Model averaging edge selection threshold of 0.5. Model averaging over the 100 highest scoring structures has resulted in a change in the network structure as shown below and available here.

Specifically, the single best model network contains a directed edge linking Trait6 with Trait3, while this edge is absent from the structure after model averaging over the 100 best scoring networks. Clicking Display structure matrix displays the model averaging scores as well as the structure matrix. The structure matrix file can be downloaded and used in return sessions to BNW, allowing users to skip structure learning and more quickly use the model to make predictions.

In the model averaging scores, we can observe that most of the network edges in the model average network were found in all or nearly all of the 100 highest scoring networks. For example, edges from Geno2 to Traits 3 and 4 have posterior probabilities of 1. Also, the edge from Trait6 to Trait3 that was observed in the single highest scoring network has a posterior probability of 0.46, and it, therefore, falls just short of being included in the model averaged network structure.
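The thresholding step described above can be sketched in a few lines. This is an illustrative simplification, not BNW's implementation: it weights every network equally, whereas BNW may weight networks by their scores, and the function names are hypothetical. The 100 toy networks are constructed so that the Trait6 to Trait3 edge appears in 46 of them, reproducing the 0.46 score reported above; they are not the networks BNW actually found.

```python
from collections import Counter


def edge_frequencies(networks):
    """Fraction of networks that contain each directed edge (parent, child)."""
    counts = Counter(edge for net in networks for edge in set(net))
    return {edge: n / len(networks) for edge, n in counts.items()}


def averaged_structure(networks, threshold=0.5):
    """Keep only the edges whose frequency exceeds the threshold."""
    freqs = edge_frequencies(networks)
    return {edge for edge, p in freqs.items() if p > threshold}


# Two edges that every toy network shares.
base = {("Geno2", "Trait3"), ("Geno2", "Trait4")}

# Hypothetical set of 100 networks: 46 also contain Trait6 -> Trait3, 54 do not.
networks = [base | {("Trait6", "Trait3")} for _ in range(46)]
networks += [set(base) for _ in range(54)]

freqs = edge_frequencies(networks)
print(freqs[("Trait6", "Trait3")])          # 0.46
print(sorted(averaged_structure(networks)))  # Trait6 -> Trait3 is dropped
```

With the threshold at 0.5, an edge has to appear in more than half of the networks to survive, which is why an edge at 0.46 is left out even though it appears in the single best network.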

3. Using the network structure to make predictions

To make predictions with the network, we will use the structure learned after model averaging of the 100 highest scoring networks. First, we will use the model to compare the expected values for nodes in the network based on observed genotypes. For these predictions, we will keep the prediction in evidence mode. The difference between evidence and intervention modes is discussed in the BNW FAQ page. To use the model to make predictions based on Geno1, click on one of the blue bars in the Geno1 node and enter 1 or 2 to indicate which genotype value should be used to predict the values of the other network nodes. In the figure below, Geno1 is outlined in red and state 2 has a 100% probability, indicating that the value of this node has been entered as evidence. The red lines in the figure show the predicted distributions of the nodes after this evidence is known and can be compared with the blue lines, which show the distributions of the variables in the original data.

If Geno1 has genotype 2, the values of Traits 1, 2, and 4 are expected to increase compared with the distribution for all data. Specifically, the mean value of Trait2 is expected to be near 1 for Geno1=2 data, while it is close to 0 when this evidence is not known. Traits 1 and 4 are also expected to increase, but the magnitude of this increase is not expected to be as large. Predicted distributions for the other nodes in the network, which are not descendants of Geno1, are expected to be nearly identical to their original distributions, so the red line covers the blue line for some nodes. Evidence for multiple nodes can be considered at the same time by clicking on a new node in the network and entering a value. Alternatively, users can select Clear evidence to reset the network and show the original distributions in a new tab.

Next, we will make predictions using the intervention mode. To use the intervention mode, click the button next to Intervention at the top of the page. The Selected mode tab on the left of the screen should now display Intervention. The effects of experimental intervention on Trait2 can be predicted by clicking on the blue line in the Trait2 node and entering a value for the variable. The figure below shows the network after setting Trait2 to a value of 1.5.

Entering a value for Trait2 in intervention mode results in a change in the structure of the network. Specifically, as Trait2 is now set by the experimental intervention, it is no longer dependent on its parents. In this case, as Geno1 was only connected to the network by being the parent of Trait2, Geno1 no longer appears in the network. Also, intervention mode only predicts how nodes that are descendants of Trait2 are affected. Setting Trait2 to 1.5 results in increases in the predicted values of both Trait1 and Trait4.
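The structural change under intervention can be illustrated with a small sketch. This is not BNW code: the function names are hypothetical, and the edge list contains only the edges named in this tutorial, so the full learned network may include others.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def intervene(edges, node):
    """Remove every edge pointing into `node`, since its value is now set directly."""
    return {(parent, child) for parent, child in edges if child != node}


def connected_nodes(edges):
    """Nodes that take part in at least one edge."""
    return {n for edge in edges for n in edge}


# Partial edge list: only the edges mentioned in the text of this tutorial.
edges = {
    ("Geno1", "Trait2"),
    ("Geno2", "Trait3"), ("Geno2", "Trait4"), ("Geno2", "Trait6"),
    ("Trait2", "Trait1"), ("Trait2", "Trait4"),
}

after = intervene(edges, "Trait2")
print(sorted(descendants(edges, "Geno1")))   # ['Trait1', 'Trait2', 'Trait4']
print(sorted(descendants(after, "Trait2")))  # ['Trait1', 'Trait4']
print("Geno1" in connected_nodes(after))     # False
```

Under these assumptions, the descendants of Geno1 match the traits that shifted in evidence mode, and removing the edge into Trait2 leaves Geno1 with no remaining connections, consistent with it disappearing from the displayed network.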