--network inference or link prediction of drug-drug interaction using network-based similarity.
Consider an undirected network G(V,E), such as drug-drug or protein-protein interaction network, where V is the set of nodes and E is the set of links. Multiple links and self-connections are not allowed. Denote by U the universal set containing all possible |V|×(|V|-1)/2 links, where |V| denotes the number of elements in set V. Then, the set of nonexistent links is U-E. We assume there are some missing links (or the links that will appear in the future) in the set U-E and the task of link prediction is to find out these links.
We obtained the information about the interactions between drugs from the DrugBank database (version 4.0). At the time of writing, we collected 11802 drug-drug interactions associated with 1178 known drugs (see Figure 1). The data statistics for drugs and their interactions are summarized in Table 1. Before further analysis, all drug-drug associations and their indications are automatically obtained from DrugBank. The set of known drug-drug interactions is regarded as the 'gold standard' data in this study, and is used for evaluating the performance of the propose association indices in the validation experiments.
Figure 1: the drug-drug interaction network. In this network, drugs are denoted by nodes, and the edge is linked if two drugs have interactions. The drug-drug interaction network consists of 11802 drug-drug interactions associated with 1178 known drugs from DrugBank.
|No. of drugs||1178|
|No. of drug-drug interactions||11802|
|Average degree of drugs||20.12|
|Minimal degree of drugs||1|
|Maximal degree of drugs||254|
|Cluster coefficient of drugs||0.190|
|Proportion of missing links in the network||0.0085|
To evaluate the performance of each similarity measure, we randomly divided the total data set into a training set and a test set. The training samples are used for generating similarity measures and constructing the classifier, and the test samples are used for evaluating the prediction performance of the similarity measure. However, in each division, we try to keep the inter-drug connectivity of the constructed training set. The process was repeated ten times to obtain robust and reliable statistics. We used the area under ROC (AUC) as performance statistics. Provided the rank of all non-observed links, the AUC value can be interpreted as the probability that a randomly chosen missing link (i.e., a link in validation set） is given a higher score than a randomly chosen nonexistent link (i.e., a link in U−E). In the algorithmic implementation, we usually calculate the score of each non-observed link instead of giving the ordered list since the latter task is more time-consuming. Then, at each time we randomly pick a missing link and a nonexistent link to compare their scores, if among n independent comparisons, there are n1 times the missing link having a higher score and n2 times they have the same score, the AUC value is (n1+0.5×n2)/n. If all the scores are generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much better the algorithm performs than pure chance.
The identification of drug-drug interactions is a key area in the clinical practice. The early identification of drug-drug interaction could efficiently avoid the risk of adverse drug reactions caused by these interactions. Therefore, there is a strong incentive to develop new methods capable of detecting these potential drug-drug interactions efficiently. To well demonstrate and compare the prediction ability of these network-based similarity indices in network inference and link prediction, we applied them to the prediction of drug-drug interactions. The AUC scores for each network-based similarity indices are shown in Figure 2. One can clearly see that these network-based similarity indices obtained the prediction performance of 0.74-0.97, which indicated that they could be used for predicting potential drug-drug interactions. However, different types of similarity indices obtained different prediction performance. The common neighbor-based similarity indices such as CN, Salton, Jaccard, Sorensen, HPI, HDI, LHN1, AA, RA, yield the relatively low prediction performance, and the AUC scores range from 0.74-0.82. The preferential attachment index obtained the AUC score of 0.88. Although the preferential attachment concept is simple, it does perform well in the prediction of drug-drug interactions. Compared to the previous similarity indices, the network-path-based similarity indices such as LP, Katz obtained the better prediction. However, LHN2 performs relative poorly, whose prediction accuracy is the similar to those from common neighbor-based similarity indices. Among all similarity indices, random walk-based similarity indices yield the best prediction ability, and ranged from 0.86 to 0.97. Local random walk index gave the highest AUC score (AUC = 0.97).
Figure 2: Accuracies of different similarity indices subject to link prediction, measured by the AUC value. Each number is obtained by averaging over 10 implementations with independently random partitions of the training set (80%) and the test set (20%). The eighteen similarity indices are CN, Salton, Jaccard, Sorensen, HPI, HDI, LHN1, AA, RA, PA, LP, Katz, LHN2, ACT, CosPlus, LRW, SRW, and MFI, respectively.