We investigate four different representations of word frequencies of viral sequences including the relative word frequency and three normalized word frequencies by subtracting the number of expected from the observed word counts. We also study five machine learning methods including logistic regression, support vector machine, random forest, Gaussian naive Bayes and Bernoulli naive Bayes for separating infectious from non-infectious viruses for nine bacterial host genera with at least 45 infecting viruses. Area under the receiver operating characteristic curve (AUC) is used to compare the performance of different machine learning method and feature combinations. We then evaluate the performance of the best method for the identification of the hosts of contigs in metagenomic studies. We also develop a maximum likelihood method to estimate the fraction of true infectious viruses for a given host in viral tagging experiments.
Based on nine bacterial host genera with at least 45 infectious viruses, we show that random forest together with the relative word frequency vector performs the best in identifying viruses infecting particular hosts. For all the nine host genera, the AUC is over 0.85 and for five of them, the AUC is higher than 0.98 when the word size is 6 indicating the high accuracy of using machine learning approaches for identifying viruses infecting particular hosts. We also show that our method can predict the hosts of viral contigs of length at least 1kbps in metagenomic studies with high accuracy. The random forest together with word frequency vector outperforms current available methods based on Manhattan and [Formula: see text] dissimilarity measures. Based on word frequencies, we estimate that about 95% of the identified T4-like viruses in viral tagging experiment infect Synechococcus, while only about 29% of the identified non-T4-like viruses and 30% of the contigs in the study potentially infect Synechococcus.
The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relative simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem is significantly understudied. We hypothesize that machine learning methods based on word frequencies can be efficiently used to study virus-host infectious associations.
The random forest machine learning method together with the relative word frequencies as features of viruses can be used to predict viruses and viral contigs for specific bacterial hosts. The maximum likelihood approach can be used to estimate the fraction of true infectious associated viruses in viral tagging experiments.