Bayesian Network Webserver for Biological Network Modeling

Table of Contents

  1. Introduction
  2. Structure learning overview
  3. Structural constraint interface
  4. Parameter learning and using models to make predictions
  5. Details of structure learning methods
  6. Formatting files for BNW
  7. Downloadable structure learning package
  8. Recent updates to BNW

1. Introduction to BNW


Use of the Bayesian Network Webserver (BNW) can be broken up into two main parts: learning the structure of a network model and using the model to make predictions about the interactions between the variables in the model. Both of these steps are more fully described later in this help file. However, users may also want to start using BNW by following this tutorial for network modeling of a dataset containing 8 variables, which is available here.


2. Overview of structure learning in BNW


The first step in Bayesian network modeling of a dataset is identifying the network structure. In BNW, users can either upload a known network structure or learn the network structure that best explains the data. Structure learning from a dataset identifies which directed edges between network variables (nodes) should be included in the network to represent the conditional dependencies observed in the data. The structure learning method implemented in BNW can be used to learn network structures from discrete, continuous, and hybrid datasets (i.e., datasets containing both discrete and continuous variables). After uploading a text file containing a dataset, users can either immediately perform structure learning using default settings or add or modify structural constraints that can improve the performance of structure learning. By default, BNW limits the maximum number of parents for each node in the network to 4 and presents only the highest scoring network structure (i.e., no model averaging is performed). The structure learning method is more fully described later in this help file.


3. Structural constraint interface


BNW includes a structural constraint interface that provides users with options that can increase the speed of structure learning, aid in identifying robust network structures, and limit structure searches to biologically or physically meaningful networks by incorporating prior knowledge. Examples of using the structural constraint interface are available here.

Global structure learning settings: The first section of the structural constraint interface allows users to set the following options that define global properties of the network structure search:

Maximum number of parents: This option sets a limit on the number of immediate parents for every node in the network and can impact structure learning in two main ways. First, limiting the maximum number of parents can dramatically increase the speed of structure learning for larger networks. Second, this limit may also help in avoiding over-fitting a network model, as it prevents a variable from being directly influenced by a large number of the other variables in the network. By default, the maximum number of parents of a node in BNW is 4.

Number of networks to include in model averaging: This option specifies k, the number of the high scoring networks that will be included in model averaging. For k=1, only the highest scoring network is considered and no model averaging is performed. For other values of k, model averaging is performed over the k-best networks. Increasing the value of k will increase the time required to perform the structure learning search but may increase the performance of model averaging.

Model averaging selection threshold: This option specifies the threshold that should be used to select directed edges to be included in the network given their posterior probabilities after model averaging. All directed edges with posterior probabilities greater than the threshold will be included in the network. It is ignored if k=1. By default, the threshold is set to 0.5.
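The selection rule is a simple threshold on posterior probabilities. A minimal sketch, with hypothetical edge posteriors and the default threshold of 0.5:

```python
# Hypothetical posterior probabilities after model averaging,
# keyed by (parent, child); the 0.5 threshold is the BNW default.
posteriors = {("Disc1", "Cont1"): 0.92,
              ("Cont1", "Cont3"): 0.61,
              ("Cont2", "Cont1"): 0.34}

threshold = 0.5
# Keep every directed edge whose posterior exceeds the threshold.
selected = [edge for edge, p in posteriors.items() if p > threshold]
print(selected)
```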

Number of tiers: Users of BNW can separate the nodes in the network into tiers, which can then be used to specify structure learning constraints as discussed below. The number of tiers is set to 3 by default. If the network variables are not assigned to tiers, this option will be ignored when structure learning is performed.

Tier Assignment: The next section of the structural constraint interface allows users to separate the nodes in the network into tiers that can be used to easily indicate structural constraints. A node can be placed in a tier by simply clicking and dragging the node into the appropriate box.

Tier Interactions: This section allows for the description of the interactions that are allowed within and between tiers. By default, edges are allowed within tiers for all tiers, and nodes within a tier are only allowed to be parents of nodes within lower ranking tiers. For example, nodes in Tier 2 of a network with 4 tiers could be the parents of nodes in Tiers 3 and 4, but could not be the parents of nodes in Tier 1. Users can modify the allowed interactions to fit the details of their networks. For example, they may want to prohibit interactions within a tier containing variables that do not causally depend on each other.
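The default tier rule can be sketched as a small predicate (hypothetical tier assignment and variable names; an illustration of the rule, not BNW's implementation):

```python
# Hypothetical tier assignment; Tier 1 outranks Tier 2, and so on.
tiers = {"Disc1": 1, "Disc2": 1, "Cont1": 2, "Cont2": 2, "Cont3": 3}

def edge_allowed(parent, child, allow_within_tier=True):
    """Default BNW tier rule: edges may point within a tier or from a
    higher-ranking (lower-numbered) tier to a lower-ranking one."""
    if tiers[parent] == tiers[child]:
        return allow_within_tier and parent != child
    return tiers[parent] < tiers[child]

print(edge_allowed("Disc1", "Cont3"))  # True: Tier 1 -> Tier 3
print(edge_allowed("Cont3", "Disc1"))  # False: edges cannot point up-tier
```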

Specific Banned and Required Edges: Finally, users can enter lists of banned and required edges to identify specific interactions that should or should not be included in the network. Banned edges can be used if experimental testing has shown that a particular regulatory relationship does not exist, while required edges can specify known regulatory relationships.



4. Parameter learning and using models to make predictions


After structure learning is completed, BNW automatically performs parameter learning of the network model using Kevin Murphy's Bayes Net Toolbox (BNT) and displays the network model. BNW has recently been updated, and Dirichlet prior distributions are now used during parameter learning.

Discrete variables in the network are displayed as bar charts and continuous variables are displayed as Gaussian distributions. The networks can be used to make predictions after clicking on a node and entering a value for that variable. Specifically, click on either the blue bar for a discrete node or the blue line for a continuous node to bring up a pop-up box that can be used to enter a value for the node. After submitting a value for the variable, the distributions of the other nodes in the network will change, allowing for visualization of the impact of setting the variable to the given value. The updated distributions, which take the entered value into account, are shown in red, while the original distributions are shown in blue. The node for which data was entered is outlined in red.

Two prediction modes are available in BNW: evidence and intervention. In the evidence mode, entered values will alter the distributions of the other variables in the network, but will not alter the network structure. In intervention mode, the intervention alters both the distributions of the network variables and the network structure. Specifically, the intervened variable becomes independent of its parents. Evidence mode is appropriate when making predictions of other network variables after the value of one variable in the network is observed, while intervention mode is appropriate for predictions after experimental interventions that alter the values of some variables in the network. Further discussion of the difference between evidence and intervention prediction modes is given on the BNW FAQ page.
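The structural difference between the two modes can be illustrated with a toy parent list (hypothetical variable names; a sketch of the idea, not BNW's implementation). Evidence mode leaves the graph untouched, while intervention mode "mutilates" it by cutting the intervened node off from its parents:

```python
# Parent sets of a hypothetical network (child -> list of parents).
parents = {"Cont1": ["Disc1"], "Cont2": ["Disc2"], "Cont3": ["Cont1", "Cont2"]}

def intervene(parents, node):
    """Intervention mode: the intervened node becomes independent of
    its parents, so upstream variables no longer explain its value.
    Evidence mode would leave the structure unchanged."""
    mutilated = dict(parents)
    mutilated[node] = []  # sever all incoming edges to the node
    return mutilated

print(intervene(parents, "Cont3")["Cont3"])  # []
```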


5. Details of structure learning methods


The structure learning method used in BNW can be broken down into three main steps: calculating local network scores, determining global structures that optimize network scores, and, if indicated by user settings, performing model averaging over high scoring structures.

Local score calculation: The score of a local structure of a Bayesian network reflects how well a node is explained by the nodes that are its immediate parents. To begin structure learning in BNW, we calculate all possible local scores by performing an exhaustive search of local structures given the structural constraints specified by the user. In order to allow structure learning of hybrid datasets, BNW calculates local scores using the scoring metric proposed by Bottcher and Dethlefsen, which they have previously incorporated in the R package deal. Briefly, to allow for the use of hybrid datasets containing both discrete and continuous variables, local structures are scored based on conditional probability tables for entirely discrete local structures, Gaussian distributions for entirely continuous local structures, and conditional Gaussian distributions for hybrid local structures. One property of this scoring metric is that it does not allow discrete nodes to be the children of continuous nodes. To improve the speed of structure learning, we do not use deal, but instead calculate local scores using code that we have written in the C programming language.
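The exhaustive enumeration of local structures can be sketched as follows. The toy scoring function here is only a placeholder for the Bottcher and Dethlefsen conditional Gaussian metric that BNW actually computes in C; the point is the search over parent sets under the max-parents constraint:

```python
from itertools import combinations

def best_parent_set(node, candidates, score, max_parents=4):
    """Exhaustively score every allowed parent set for one node and
    keep the best; `score` stands in for BNW's actual local metric."""
    best = (score(node, ()), ())
    for size in range(1, max_parents + 1):
        for ps in combinations(candidates, size):
            s = score(node, ps)
            if s > best[0]:
                best = (s, ps)
    return best

# Toy score: reward parent sets containing "Cont1", penalize size.
toy_score = lambda node, ps: ("Cont1" in ps) - 0.1 * len(ps)
print(best_parent_set("Cont3", ["Disc1", "Cont1", "Cont2"], toy_score))
```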

Search for the k-best global structures: After calculating local scores, BNW performs a search for the k best-scoring global structures, using a user-specified value of k. The k-best structure search method used in BNW was developed by Tian and He. It is an adaptation of an algorithm for identifying the optimal network structure developed by Silander and Myllymaki that is available here.

Model averaging: Model averaging can be used to reduce the risk of over-fitting data to a single model. In BNW, model averaging is automatically performed over the k-best scoring structures when users select values of k > 1. To select features (i.e., directed edges between nodes), the posterior probability of each feature is calculated by a weighted average over the k-best networks, with weights given by the scores of the global network structures. This posterior probability, which ranges from 1 for features included in all high scoring networks to 0 for features included in none, reflects confidence in the feature.
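The weighted average can be sketched with hypothetical numbers (the scores below are assumed to be already normalized to weights; variable names are illustrative):

```python
# Hypothetical normalized weights and edge sets of the k=3 best
# structures found by the search.
networks = [
    (0.6, {("Disc1", "Cont1"), ("Cont1", "Cont3")}),
    (0.3, {("Disc1", "Cont1")}),
    (0.1, {("Cont2", "Cont3")}),
]

def edge_posterior(edge):
    """Score-weighted fraction of the k-best networks containing edge."""
    total = sum(w for w, _ in networks)
    return sum(w for w, edges in networks if edge in edges) / total

print(edge_posterior(("Disc1", "Cont1")))  # ~0.9: in the top two networks
```

With the default selection threshold of 0.5, ("Disc1", "Cont1") would be kept and ("Cont2", "Cont3") discarded.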

Model averaging may be particularly advantageous when learning network models from small datasets. As the number of samples in a dataset increases, the differences between the scores of the highest scoring networks often also increase. Conversely, structure learning of small datasets may identify a group of structures with similar scores rather than a single structure with a clearly superior score. Model averaging can be used to select features that are common to many of these high scoring networks.


6. Formatting files for BNW


Data file format

Data files uploaded to the Bayesian Network Webserver should be tab-delimited text files with the names of the variables in the first row of the file and the values of the variables for each sample or individual in the remaining rows.

Variable names should not contain any whitespace characters.

BNW automatically determines whether each variable contains continuous or discrete data. BNW applies the following rules, in order, to determine if variables should be considered discrete or continuous:

1. If a variable contains 3 or fewer different values, the variable is considered to be discrete.

2. If a variable contains more than 20 different values, the variable is considered to be continuous.

3. If the ratio of the number of different values for a variable to the number of cases in the data set is 1/3 or larger, the variable is considered to be continuous.

4. If none of the first three rules apply, the data set is inspected to determine if any of the values for the variable contain a period (.). If at least one value contains a period, the variable is considered to be continuous; otherwise, the variable is considered to be discrete.
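The four rules above can be sketched in Python (an illustration of the stated rules, not BNW's actual code; values are taken as the strings read from the file):

```python
def classify(values, n_cases):
    """Apply the four BNW type rules, in order, to the list of
    string values observed for one variable."""
    distinct = set(values)
    if len(distinct) <= 3:
        return "discrete"                     # rule 1
    if len(distinct) > 20:
        return "continuous"                   # rule 2
    if len(distinct) / n_cases >= 1 / 3:
        return "continuous"                   # rule 3
    if any("." in v for v in distinct):
        return "continuous"                   # rule 4
    return "discrete"                         # rule 4, no period found

print(classify(["3.25", "2.46", "4.21", "3.76"], 4))  # continuous (rule 3)
print(classify(["1", "2", "1", "2"], 4))              # discrete (rule 1)
```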

Users can examine whether or not BNW has correctly loaded input data files and classified variables by clicking on "View uploaded variables and data" on the left-hand menu after uploading a dataset. We believe that BNW should correctly classify variables in most cases, but users may occasionally need to add or remove a period to the data of some variables.

An example input data file with 5 variables is given below. The network contains 2 discrete variables (Disc1 and Disc2), which are given in the first two columns of the file, and 3 continuous variables (Cont1, Cont2, and Cont3). Disc1 is a discrete variable with two states (1 and 2), while Disc2 has two states (A and B). Although the samples of Cont2 are integer values, we wish to treat this variable as continuous, not discrete. Therefore, the value of Cont2 for the first sample is given as '3.0' instead of '3' so that one of the values of Cont2 contains a '.', helping to ensure that Cont2 is interpreted as a continuous variable.


Disc1 Disc2 Cont1 Cont2 Cont3
2 A 3.25 3.0 0.97
2 B 2.46 2 0.93
1 A 4.21 33 0.43
2 B 3.76 8 0.88
2 A 3.69 4 0.91
1 B 4.27 13 0.38
1 A 4.12 9 0.45


Structure file format

If the structure of the network model for a dataset is already known, users can upload this structure by selecting "Upload structure" on the BNW home page. The structure file should be tab-delimited, with the variable names on the first row. The remainder of the file should be an n x n matrix of 0's and 1's, where n is the number of variables in the network. A '1' in row i and column j of this matrix indicates that there is a directed edge from variable i to variable j (i.e., there is an edge from i to j in the network). A '0' indicates that there is not a directed edge from variable i to variable j.

An example of a structure data file is shown below. The following edges would be included in the network:
1. Disc1 -> Cont1
2. Disc2 -> Cont2
3. Cont1 -> Cont3
4. Cont2 -> Cont3


Disc1 Disc2 Cont1 Cont2 Cont3
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
0 0 0 0 1
0 0 0 0 0
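The structure file above can be parsed into an edge list with a few lines of Python (a sketch of the format, not BNW's parser; the file is tab-delimited, shown here with \t escapes):

```python
# Parse a structure file: a '1' at row i, column j means a directed
# edge from variable i to variable j.
structure = """Disc1\tDisc2\tCont1\tCont2\tCont3
0\t0\t1\t0\t0
0\t0\t0\t1\t0
0\t0\t0\t0\t1
0\t0\t0\t0\t1
0\t0\t0\t0\t0"""

lines = structure.splitlines()
names = lines[0].split("\t")
edges = [(names[i], names[j])
         for i, row in enumerate(lines[1:])
         for j, cell in enumerate(row.split("\t")) if cell == "1"]
print(edges)
# [('Disc1', 'Cont1'), ('Disc2', 'Cont2'), ('Cont1', 'Cont3'), ('Cont2', 'Cont3')]
```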


7. Downloadable structure learning package


To learn the structure of large networks, or for large values of k when identifying the k-highest scoring networks for model averaging, users may want to download a package containing the BNW structure learning method, which is available here. The model_averaging.txt output file provided by the package can be loaded into BNW by selecting "Make predictions using a known structure" on the left menu on the BNW homepage or here and used to make predictions. The data file required by the downloadable package differs from the BNW input file format in one way; namely, the downloadable package requires that users specify the variable type on the second line of the input file. Enter the number of unique states on this line for discrete variables, and enter '1' on this line to indicate a continuous variable. The downloadable package is written in C and is intended for use on computers with a Linux operating system and the gcc compiler.
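Constructing the extra type line can be sketched as follows (hypothetical data; assumes the set of continuous variables is already known, e.g., from BNW's classification rules):

```python
# A small data table in BNW format: header row, then one row per case.
rows = [["Disc1", "Disc2", "Cont1"],
        ["2", "A", "3.25"],
        ["1", "B", "2.46"]]

def type_line(rows, continuous):
    """Build the second line the downloadable package expects:
    '1' for continuous variables, otherwise the number of states."""
    counts = []
    for col, name in enumerate(rows[0]):
        if name in continuous:
            counts.append("1")
        else:
            counts.append(str(len({r[col] for r in rows[1:]})))
    return counts

print(type_line(rows, {"Cont1"}))  # ['2', '2', '1']
```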



8. Recent updates to BNW


BNW has recently been updated to add features and improve the user experience. These updates have included improving the network model visualizations and allowing users to more quickly load large data sets. Additionally, in a change that is invisible to users, BNW now uses Octave to perform parameter learning with the Bayes Net Toolbox.

Major changes and new features that have been added to BNW include:

1) Parameter learning settings have been modified. Specifically, network parameters (e.g., the distributions of states for discrete variables and means and standard deviations of Gaussian distributions) are now learned using Dirichlet prior distributions. In our testing, this has had a minimal impact on the parameters of most networks, but has helped reduce the impact of cases with rare combinations of states on the predicted distributions of some networks. In the original version of BNW, no priors were used.

2) Increased flexibility in formatting of uploaded data files. One major change is that BNW now allows for users to upload data files in which discrete variables have alphabetic values. For example, a file containing a node for the genotype of BXD mice can now have values of B and D; the values do not have to be recoded as integers. A full description of the proper format for input files in BNW considering these changes can be found here.

3) Added "View uploaded variables and data" button to allow users to ensure that uploaded data sets are loaded and parsed correctly. After users upload a data set, clicking this button provides the ability to view either the uploaded data file directly or a variable description file that shows how BNW has parsed the data. The variable description file lists the number of variables and cases (e.g., individuals or samples) in the data file. It also indicates whether each variable is discrete or continuous and the criterion that was used to make this determination. For discrete variables, the possible states of the variable are provided. For continuous variables, the mean and standard deviation of the variable are provided.

4) Added "View parameters" button to allow users to quantify network parameters. Users can now view either the parameters of the network model learned from the originally uploaded data set or the predicted parameters after evidence or an intervention has been entered. This feature allows users to quantify predictions made using the Bayesian network model.

After performing structure learning and viewing their network model, users now have the ability to click a "View parameters" button on the menu to the left of the model structure. This button provides a link to a file containing the original parameters of the model. For discrete nodes, the fraction of cases for each possible state of the variable is provided. For continuous nodes, the mean and standard deviation of the Gaussian distribution that best fits the data in the variable are provided.

If users have made predictions using either the evidence or intervention modes, a link to a file containing the parameters of the network considering the entered evidence or intervention is provided. The file lists the predicted fraction of states for discrete nodes and predicted mean and standard deviation of the Gaussian distribution for continuous nodes. Nodes for which evidence or intervention has been entered are also noted in the file.

5) Added a "Use a network ID to return to a network" button on the BNW home page. This button allows users to enter a network ID to return to a previously generated network model. The network ID can also be used to share the network model with collaborators. The network ID is a three character string that is listed in the left menu of a BNW network page.

6) BNW is now able to perform cross-validation and make predictions on a test data set using the "Cross validation and predictions" button on the left menu of the network page. Clicking this button presents three options: leave-one-out cross validation, k-fold cross validation, and uploading a test data set to make predictions using the network model.

Leave-one-out cross validation: Users should enter the name of a network variable that they want to investigate. These calculations can take up to 3 minutes for large data sets. After the calculation is complete, the cross-validation output contains the following information:

For discrete variables, the predicted likelihood of each state for the left-out case, given the values of its parent variables in the network, is provided.

For continuous variables, the predicted mean and standard deviation of the left-out variable, given the values of its parent variables in the network, is provided.

k-fold cross validation: Users should enter the name of the variable that should be predicted as well as the number of folds into which the data set should be split. The output file contains information similar to that produced by leave-one-out cross-validation.
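How k-fold cross-validation partitions the cases can be sketched as follows (an illustration of the general idea; BNW's actual fold assignment may differ):

```python
def kfold_indices(n_cases, k):
    """Split case indices 0..n_cases-1 into k roughly equal folds;
    each fold is held out once while the rest train the model."""
    folds = [[] for _ in range(k)]
    for i in range(n_cases):
        folds[i % k].append(i)
    return folds

# 7 cases, 3 folds: every case is held out exactly once.
print(kfold_indices(7, 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```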

Predictions on a test data set: Users can upload a file containing a test data set. The format of the file should follow the format of the input data file with two exceptions:

a) The first line of the file should contain the name of the variable that should be predicted.

b) "NA" can be used for missing data. The test data file can contain missing data. If data for a variable of a given sample or case is not known, an "NA" can be entered in the input file. Data can be missing for the variable that is to be predicted or for the variables that are to be used as predictors.





Contact us

Please send questions and comments to Dr. Yan Cui at University of Tennessee Health Science Center.