Bayesian Network Webserver for Biological Network Modeling

Table of Contents

  1. Introduction to BNW
  2. Overview of structure learning in BNW
  3. Structural constraint interface
  4. Parameter learning and using models to make predictions
  5. Details of structure learning methods
  6. Data formatting guidelines
  7. Downloadable structure learning package

1. Introduction to BNW

Use of the Bayesian network webserver (BNW) can be broken up into two main parts: learning the structure of a network model and using the model to make predictions about the interactions between the variables in the model. Both of these steps are more fully described later in this help file. Users who prefer a worked example may want to start with the tutorial for network modeling of a dataset containing 8 variables, which is available here.

2. Overview of structure learning in BNW

The first step in Bayesian network modeling of a dataset is identifying the network structure. In BNW, users can either upload a known network structure or learn the network structure that best explains the data. Structure learning from a dataset identifies which directed edges between network variables (nodes) should be included in the network to represent the conditional dependencies observed in the data. The structure learning method implemented in BNW can be used to learn the network structures of discrete, continuous, and hybrid (i.e., containing both discrete and continuous variables) datasets. After uploading a text file containing a dataset, users can either immediately perform structure learning using default settings or add or modify structural constraints that can improve the performance of structure learning. By default, BNW limits the maximum number of parents for each node in the network to 4 and presents only the highest scoring network structure (i.e., no model averaging is performed). The structure learning method is more fully described later in this help file.

3. Structural constraint interface

BNW includes a structural constraint interface that provides users with options that can increase the speed of structure learning, aid in identifying robust network structures, and limit structure searches to biologically or physically meaningful networks by incorporating prior knowledge. Examples of using the structural constraint interface are available here.

Global structure learning settings: The first section of the structural constraint interface allows users to set the following options that define global properties of the network structure search:

Maximum number of parents: This option sets a limit on the number of immediate parents for every node in the network and can impact structure learning in two main ways. First, limiting the maximum number of parents can dramatically increase the speed of structure learning for larger networks. Second, this limit may also help in avoiding over-fitting a network model, as it prevents a variable from being directly influenced by a large number of the other variables in the network. By default, the maximum number of parents of a node in BNW is 4.

Number of networks to include in model averaging: This option specifies k, the number of high scoring networks that will be included in model averaging. For k=1, only the highest scoring network is considered and no model averaging is performed. For other values of k, model averaging is performed over the k-best networks. Increasing the value of k will increase the time required to perform the structure learning search but may improve the performance of model averaging.

Model averaging selection threshold: This option specifies the threshold that should be used to select directed edges to be included in the network given their posterior probabilities after model averaging. All directed edges with posterior probabilities greater than the threshold will be included in the network. It is ignored if k=1. By default, the threshold is set to 0.5.
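The selection rule above amounts to a simple filter on the edge posteriors. The sketch below illustrates it with hypothetical edge names and posterior values (this is an illustration of the stated rule, not BNW's own code):

```python
# Hypothetical posteriors for three directed edges after model averaging.
posteriors = {
    ("A", "B"): 0.92,
    ("B", "C"): 0.55,
    ("A", "C"): 0.31,
}

def select_edges(posteriors, threshold=0.5):
    """Keep directed edges whose posterior probability exceeds the threshold."""
    return [edge for edge, p in posteriors.items() if p > threshold]

print(select_edges(posteriors))       # default threshold of 0.5 keeps two edges
print(select_edges(posteriors, 0.9))  # a stricter threshold keeps only one
```

Raising the threshold trades sensitivity for confidence: fewer edges survive, but each is supported by a larger fraction of the high scoring networks.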

Number of tiers: Users of BNW can separate the nodes in the network into tiers, which can then be used to specify structure learning constraints as discussed below. The number of tiers is set to 3 by default. If the network variables are not assigned to tiers, this option will be ignored when structure learning is performed.

Tier Assignment: The next section of the structural constraint interface allows users to separate the nodes in the network into tiers that can be used to easily indicate structural constraints. A node can be placed in a tier by simply clicking and dragging the node into the appropriate box.

Tier Interactions: This section allows users to describe the interactions that are allowed within and between tiers. By default, edges are allowed within all tiers, and nodes within a tier are only allowed to be parents of nodes in lower ranking tiers. For example, nodes in Tier 2 of a network with 4 tiers could be the parents of nodes in Tiers 3 and 4, but could not be the parents of nodes in Tier 1. Users can modify the allowed interactions to fit the details of their networks. For example, they may want to prohibit interactions within a tier containing variables that do not causally depend on each other.
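The default tier rule can be viewed as implicitly generating a list of banned edges. The sketch below makes that explicit for a hypothetical assignment of five nodes to three tiers (node names and tiers are made up for this example):

```python
# Hypothetical tier assignments: smaller number = higher-ranking tier.
tiers = {"G1": 1, "G2": 1, "P1": 2, "P2": 2, "T1": 3}

def banned_edges(tiers):
    """Directed edges (parent, child) forbidden by the default tier rule:
    a node may not be the parent of a node in a higher-ranking tier."""
    return sorted((a, b) for a in tiers for b in tiers
                  if a != b and tiers[a] > tiers[b])

print(banned_edges(tiers))  # e.g. ('T1', 'G1') is banned; ('G1', 'T1') is not
```

Edges within a tier (e.g. G1 to G2) are not banned by default, matching the behavior described above.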

Specific Banned and Required Edges: Finally, users can enter lists of banned and required edges to identify specific interactions that should or should not be included in the network. Banned edges can be used if experimental testing has shown that a particular regulatory relationship does not exist, while required edges can specify known regulatory relationships.

4. Parameter learning and using models to make predictions

After structure learning is completed, BNW automatically performs parameter learning of the network model using Kevin Murphy's Bayes Net Toolbox (BNT) and displays the network model. Discrete variables in the network are displayed as bar charts, and continuous variables are displayed as Gaussian distributions. The networks can be used to make predictions after clicking on a node and entering a value for that variable. Specifically, click on either the blue bar for a discrete node or the blue line for a continuous node to bring up a pop-up box that can be used to enter a value for the node. After submitting a value for the variable, the distributions of the other nodes in the network will change, allowing for visualization of the impact of setting the variable to the given value. The updated distributions that account for the entered value are shown in red, while the original distributions are shown in blue. The node for which data was entered is outlined in red.

Two prediction modes are available in BNW: evidence and intervention. In the evidence mode, entered values will alter the distributions of the other variables in the network, but will not alter the network structure. In intervention mode, the intervention alters both the distributions of the network variables and the network structure. Specifically, the intervened variable becomes independent of its parents. Evidence mode is appropriate when making predictions of other network variables after the value of one variable in the network is observed, while intervention mode is appropriate for predictions after experimental interventions that alter the values of some variables in the network. Further discussion of the difference between evidence and intervention prediction modes is given on the BNW FAQ page.
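The structural difference between the two modes can be summarized very compactly: intervention severs the intervened node from its parents, while evidence leaves the graph unchanged. A schematic sketch, assuming a network stored as a dict mapping each node to the list of its parents (node names here are hypothetical):

```python
# A small chain network: A -> B -> C.
parents = {"A": [], "B": ["A"], "C": ["B"]}

def intervene(parents, node):
    """Intervention (the 'do' operation): the intervened node is cut off
    from its parents, so upstream variables no longer influence it."""
    new_parents = dict(parents)
    new_parents[node] = []
    return new_parents

# Evidence mode would leave the structure intact; intervening on B severs
# the edge A -> B, so A's value no longer tells us anything about B.
print(intervene(parents, "B"))
```

Downstream edges are unaffected in both modes: after intervening on B, its value still propagates to C.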

5. Details of structure learning methods

The structure learning method used in BNW can be broken down into three main steps: calculating local network scores, determining global structures that optimize network scores, and, if indicated by user settings, performing model averaging over high scoring structures.

Local score calculation: The score of a local structure of a Bayesian network considers how well a node is explained by the nodes that are its immediate parents. To begin structure learning in BNW, we calculate all possible local scores by performing an exhaustive search of local structures given the structural constraints specified by the user. In order to allow structure learning of hybrid datasets, BNW calculates local scores using the scoring metric proposed by Bøttcher and Dethlefsen, which they previously incorporated in the R package deal. Briefly, to allow for the use of hybrid datasets containing both discrete and continuous variables, local structures are scored based on conditional probability tables for entirely discrete local structures, Gaussian distributions for entirely continuous local structures, and conditional Gaussian distributions for hybrid local structures. One property of this scoring metric is that it does not allow discrete nodes to be the children of continuous nodes. To improve the speed of structure learning, we do not use deal, but instead calculate local scores using code that we have written in the C programming language.
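To give a feel for what a local score measures, the sketch below scores a continuous node with and without a single continuous parent using a simplified BIC-style score (Gaussian log-likelihood minus a complexity penalty). This is a stand-in for illustration only; it is not the Bøttcher and Dethlefsen metric that BNW actually uses:

```python
import math
from statistics import mean

def gaussian_loglik(residuals):
    """Log-likelihood of residuals under a Gaussian with MLE variance."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def local_score(child, parent=None):
    """Simplified BIC-style local score for a continuous node with at most
    one continuous parent (simple linear regression on the parent)."""
    n = len(child)
    if parent is None:
        mu = mean(child)
        resid = [y - mu for y in child]
        k = 2                       # parameters: mean, variance
    else:
        mx, my = mean(parent), mean(child)
        beta = (sum((x - mx) * (y - my) for x, y in zip(parent, child))
                / sum((x - mx) ** 2 for x in parent))
        resid = [y - (my + beta * (x - mx)) for x, y in zip(parent, child)]
        k = 3                       # parameters: intercept, slope, variance
    return gaussian_loglik(resid) - 0.5 * k * math.log(n)

# A parent that strongly predicts the child raises the local score.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9, 12.1]   # roughly y = 2x
print(local_score(y, x) > local_score(y))
```

The penalty term is what discourages adding weakly informative parents; limiting the maximum number of parents (Section 3) bounds how many such candidate local structures must be scored.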

Search for the k-best global optimal structures: After calculating local scores, BNW performs a search for the k-best global optimal structures, using a user-specified value of k. The k-best structure search method used in BNW was developed by Tian and He and is available here. It is an adaptation of an algorithm for identifying the optimal network structure developed by Silander and Myllymäki that is available here.

Model averaging: Model averaging can be used to reduce the risk of over-fitting data to a single model. In BNW, model averaging is automatically performed over the k-best scoring structures when users select values of k > 1. To select features (i.e., directed edges between nodes), the posterior probability of each feature is calculated by a weighted average over the k-best networks, where the weight of each network is given by the score of its global structure. This posterior probability, which ranges from 1 for features included in all high scoring networks to 0 for features in no high scoring networks, reflects confidence in the feature.
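The weighted average can be sketched as follows, using made-up networks and log-scores (each network is a set of directed edges; weights are proportional to the exponential of the log-score, normalized over the k-best networks — an illustration of the idea, not BNW's implementation):

```python
import math

# Three hypothetical high scoring networks with their log-scores (k = 3).
networks = [
    ({("A", "B"), ("B", "C")}, -10.0),
    ({("A", "B"), ("A", "C")}, -10.5),
    ({("B", "A"), ("B", "C")}, -12.0),
]

def edge_posteriors(networks):
    """Posterior probability of each edge: score-weighted fraction of the
    k-best networks that contain it."""
    top = max(s for _, s in networks)          # subtract max for stability
    weights = [math.exp(s - top) for _, s in networks]
    total = sum(weights)
    posteriors = {}
    for (edges, _), w in zip(networks, weights):
        for e in edges:
            posteriors[e] = posteriors.get(e, 0.0) + w / total
    return posteriors

post = edge_posteriors(networks)
print(post)  # A->B appears in the two best networks, so its posterior is high
```

An edge present in every one of the k-best networks would receive posterior 1.0; applying the selection threshold from Section 3 to these values yields the final averaged structure.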

Model averaging may be particularly advantageous when learning network models using small datasets. As the number of samples in a dataset increases, the differences between the scores of the highest scoring networks often also increase. With a small dataset, therefore, structure learning may identify a group of structures with similar scores rather than a single network structure with a clearly best score. Model averaging can be used to select features that are common to many of these high scoring networks.

6. Data formatting guidelines

Data file format

Data files uploaded to the Bayesian Network Webserver should be tab-delimited text files with the names of the variables in the first row of the file and the values of the variables for each sample or individual in the remaining rows.

BNW automatically determines whether variables contain continuous or discrete data. To help ensure that BNW correctly parses data files, users should follow these formatting guidelines:
1. Variable names should start with a letter, not a number, and should not contain any whitespace characters.
2. Discrete variables should be listed before continuous variables, that is, discrete variables should be the leftmost columns of the file.
3. The values of the levels of discrete variables should be integers starting with 1.
4. The data values for each continuous variable should include a period (.) followed by a number in at least one of the samples.

An example input data file with 5 variables is given below. The network contains 2 discrete variables (Disc1 and Disc2), which are given in the first two columns of the file, and 3 continuous variables (Cont1, Cont2, and Cont3). Disc1 is a discrete variable with two states (1 and 2), while Disc2 has three states (1, 2, and 3). Although the samples of Cont2 are integral values, we wish to treat this variable as continuous, not discrete. Therefore, the value of Cont2 for the first sample is given as '3.0' instead of '3' so that one of the values of Cont2 contains a '.' followed by a number and Cont2 is interpreted as a continuous variable.

Disc1 Disc2 Cont1 Cont2 Cont3
2 1 3.25 3.0 0.97
2 3 2.46 2 0.93
1 2 4.21 33 0.43
2 3 3.76 8 0.88
2 1 3.69 4 0.91
1 1 4.27 13 0.38
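The type-detection rule described in the guidelines can be sketched as a small parser: a column is treated as continuous if at least one of its values contains a '.' followed by a digit, and discrete otherwise. (This mirrors the stated rules; it is not BNW's actual parsing code.)

```python
def detect_types(text):
    """Classify each tab-delimited column as 'discrete' or 'continuous'
    following the BNW formatting guidelines."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    names, data = rows[0], rows[1:]
    types = {}
    for j, name in enumerate(names):
        values = [r[j] for r in data]
        continuous = any("." in v and v.split(".", 1)[1][:1].isdigit()
                         for v in values)
        types[name] = "continuous" if continuous else "discrete"
    return types

# First rows of the example data file above, tab-delimited.
example = ("Disc1\tDisc2\tCont1\tCont2\tCont3\n"
           "2\t1\t3.25\t3.0\t0.97\n"
           "2\t3\t2.46\t2\t0.93\n")
print(detect_types(example))
```

Note how Cont2 is classified as continuous solely because of the '3.0' in the first sample; without it, every value in that column would look like an integer.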

Structure file format

If the structure of the network model for a dataset is already known, users can upload this structure by selecting "Upload structure" on the BNW home page. The structure file should be tab-delimited, with the variable names on the first row. The remainder of the file should be an n x n matrix of 0's and 1's, where n is the number of variables in the network. A '1' in row i and column j of this matrix indicates that there is a directed edge from variable i to variable j in the network. A '0' indicates that there is no directed edge from variable i to variable j.

An example of a structure data file is shown below. The following edges would be included in the network:
1. Disc1 -> Cont1
2. Disc2 -> Cont2
3. Cont1 -> Cont3
4. Cont2 -> Cont3

Disc1 Disc2 Cont1 Cont2 Cont3
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
0 0 0 0 1
0 0 0 0 0
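Reading a structure file of this format back into an edge list is straightforward, as the sketch below shows (whitespace splitting is used here so the example also works with the space-separated listing above; actual uploads should be tab-delimited):

```python
def read_edges(text):
    """Parse a BNW-style structure file: first row is variable names, the
    rest an n x n 0/1 matrix where row i, column j == 1 means an edge
    from variable i to variable j."""
    rows = [line.split() for line in text.strip().splitlines()]
    names, matrix = rows[0], rows[1:]
    return [(names[i], names[j])
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v == "1"]

structure = """Disc1 Disc2 Cont1 Cont2 Cont3
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
0 0 0 0 1
0 0 0 0 0"""
print(read_edges(structure))
```

Applied to the example file, this recovers exactly the four edges listed above.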

7. Downloadable structure learning package

To learn the structure of large networks or for large values of k when identifying the k-highest scoring networks for model averaging, users may want to download a package containing the BNW structure learning method, which is available here. The model_averaging.txt output file provided by the package can be loaded into BNW by selecting "Make predictions using a known structure" on the left menu of the BNW homepage or here and used to make predictions. The input data file required by the downloadable package differs from the BNW input file format in one way: the downloadable package requires that users specify the variable type on the second line of the input file. For discrete variables, enter the number of unique states on this line; for continuous variables, enter '1'. The downloadable package is written in C and is intended for use on computers with a Linux operating system and the gcc compiler.
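Converting a web-format data file to the package format therefore amounts to inserting one extra line. The sketch below derives that line from the data using the formatting rules described earlier (a convenience sketch based on the description above, not a tool shipped with the package):

```python
def add_type_line(text):
    """Insert the package's second line of variable types: the number of
    unique states for a discrete column, or '1' for a continuous column."""
    lines = text.strip().splitlines()
    header, data = lines[0], [l.split("\t") for l in lines[1:]]
    types = []
    for j in range(len(header.split("\t"))):
        values = [r[j] for r in data]
        if any("." in v for v in values):     # continuous column
            types.append("1")
        else:                                 # discrete: count unique states
            types.append(str(len(set(values))))
    return "\n".join([header, "\t".join(types)] + lines[1:])

sample = "Disc1\tCont1\n1\t0.5\n2\t0.7\n1\t0.9\n"
print(add_type_line(sample))
```

For this sample, the inserted second line reads '2' for Disc1 (two unique states) and '1' for Cont1 (continuous).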

Contact us

Please send questions and comments to Dr. Yan Cui at the University of Tennessee Health Science Center.