How to classify sequences
From BioWeka
In this short tutorial you will learn how to read a FASTA sequence file with class annotations into Weka and apply different classification procedures to the data.
Note: This is not a description of a scientific analysis but a practical introduction to Weka and BioWeka.
Contents |
Data loading
To begin with, download the SCOP sequences from http://astral.berkeley.edu/, e.g. the data set titled "ASTRAL SCOP 1.69 genetic domain sequence subsets, based on PDB SEQRES records, with less than 40% identity to each other".
Note: Because the following procedures needs a lot of computing time, we recommend just to use a subsample of the dataset by opening the downloaded FASTA file into your favorite text editor, selecting only the n first sequences and then copying them to a new file. However, if you do so, please be aware that the results of the classification procedure will be different from those described below. For a very quick test, you can use the following FASTA-formatted data instead:
>d1a6wl_ b.1.1.1 (L:) Immunoglobulin light chain lambda variable domain, VL-lambda {Mouse (Mus musculus)}
avvtqesalttspgetvtltcrsstgavttsnyanwvqekpdhlftgliggtnnrapgvp
arfsgslignkaaltitgaqtedeaiyfcalwysnhwvfgggtkltvle
>d1b0wa_ b.1.1.1 (A:) Immunoglobulin light chain kappa variable domain, VL-kappa {Human (Homo sapiens), cluster 1}
diqmtqspsslsasvgdrvtitcqasqdisdyliwyqqklgkapnlliydastletgvps
rfsgsgsgteytftisslqpediatyycqqyddlpytfgqgtkveikr
>d1dlya_ a.1.1.1 (A:) Protozoan/bacterial hemoglobin {Green alga (Chlamydomonas eugametos)}
slfaklggreaveaavdkfynkivadptvstyfsntdmkvqrskqfaflayalggasewk
gkdmrtahkdlvphlsdvhfqavarhlsdtltelgvppeditdamavvastrtevlnmpq
q
>d1uvya_ a.1.1.1 (A:) Protozoan/bacterial hemoglobin {Ciliate (Paramecium caudatum)}
slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalggpnawt
grnlkevhanmgvsnaqfttvighlrsaltgagvaaalveqtvavaetvrgdvvtv
Next, start Weka and open the Explorer GUI. On the Preprocess panel hit the Open file ... button (see Figure 1). Go to the directory where you stored the FASTA file. To see that file, you have to set the file type to All files. Open the sequence file.
An error message pops up telling you that the file has an unrecognized file type. This is OK. Just hit the Use converter button. In the next dialog, push the button Choose. You will see a tree. Within this tree follow the path bioweka/core/converters/sequence and choose the entry FastaSequenceLoader. This selects the FASTA to ARFF converter for reading the sequence file. In addition, to read the SCOP class annotation stored in the FASTA description line, you have to change the sequence mapper. The description line of a sequence looks as follows:
>d1dlwa_ a.1.1.1 (A:) Protozoan/bacterial hemoglobin {Ciliate (Paramecium caudatum)}
slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalggpnawt
grnlkevhanmgvsnaqfttvighlrsaltgagvaaalveqtvavaetvrgdvvtv
As you can see the first token of the description line is the sequence identifier. The second token, however, is the SCOP class.
The preselected sequence mapper of the FastaSequenceLoader is the FastaSequenceMapper, which doesn't consider the second token. Therefore, hit the button Choose next to the caption sequenceMapper and choose from the tree the ClassificationFastaSequenceMapper. After the tree disappeared click on the caption ClassificationFastaSequenceMapper. Another dialog should open up.
Within this dialog, set the classLevel to 1. The ClassificationFastaSequenceMapper supports hierarchical classifications like SCOP, but Weka does not. For this reason the hierarchical classes have to be mapped into a flat class system. By setting the class level to -1 each hierarchical class becomes a flat class, e.g. in the case of the classes a.1.1.1 and a.1.1.2 the information that both classes share the same parent class a.1.1 gets lost. Setting the class level to 3 means that only the first three levels should be considered. Thus the classes a.1.1.1 and a.1.1.2 become both a.1.1. In this example, we only consider the top level class, i.e. the class level equals 1.
Finish the loading procedure by hitting on each open dialog the OK button. It takes a while loading the about 7200 sequences into the Weka Explorer, so don't get jittery.
Data preparation
As you will realize (when selecting the last attribute in the left frame) the class attribute sequence.annotation.class isn't of type nominal but of type string. To make the class attribute a nominal attribute, so it can be used for classification, a filter called StringToNominal has to be applied. Click on the button titled Choose in the left upper corner and select root/weka/filters/unsupervised/attribute/StringToNominal. Afterwards hit the Apply button on the opposite side.
After a few seconds the class attribute should be shown in new colors (see Figure 5). As you can see, there are seven classes. However, for the following classification procedure we just consider the classes a and b. This is because to minimize the numbers of alignments and thus the time needed to do the classification.
This can be done by using the filter RemoveWithValues and the configuration shown in Figure 6. Before you apply this filter, ensure that the no class attribute is selected. So choose from the combobox on the right side near the button Visualize All the value No class. After applying this filter to the current data set, there are only two classes, a and b, and so 2955 instances left. Each instance has seven values attached: its unique sequence identifier, the sequence data, a so called URN (uniform resource name), which can be disregarded here, the length of the sequence (i.e. the numbers of amino acids in this example), a nominal value, that identifies the alphabet (here PROTEIN), the description line from the FASTA sequence file and finally the class value.
Classification using local alignments
Now switch to the Classify panel and choose as classifier bioweka/classifiers/sequence/alignments/SimpleAlignmentClassifier. We will use a local alignment implementation called JAligner to calculate the alignment scores. By choosing as score evaluator the MaxScoreEvaluator, a test instance is classified by the class of the training instance, which scores highest in an alignment with the test instance.
In most cases a cross-validation is the best choice to evaluate the performance (i.e. prediction accuracy) of a classification procedure. However, in this case our intention is only to demonstrate the appplication of Weka and BioWeka. To minimize the time needed for that demonstration, instead of a cross-validation a percentage-split is applied. So 66 % of the sequences are used as training instances, the remaining 33 % are used as test instances.
Finally, hit the Start button and watch the alignment classifier do his work (you'll see log messages within the console window). When the classification is finished, the results are shown in the text box under Classifier output. As you can see in Figure 8, about 62 % of the sequences are correctly classified. This is just twelve percent better than when using random class assignment.
Classification using BLAST
The classification using the local alignment implementation JAligner isn't very fast. If time plays a role, BLAST may be a solution. Using the BlastClassifier a BLAST database is created from the training set and for each sequence from the test set a BLAST query is performed resulting in a ranked list of scored alignment results. The evaluation and subsequent class assignment is done in the same way as for the local alignment classification.
Before you continue, please ensure that you have installed BLAST on your local machine and configured it in the right way. It is e.g. on Windows especially important to set the BLASTDB and BLASTDIR environment variables, to include the BLAST installation directory into the PATH environment variable and to have a file called ncbi.ini in your Windows directory, as described in the BLAST installation instructions.
On the Classify panel within the Weka Explorer choose as classifier bioweka/classifiers/sequence/alignments/BlastClassifier. Because the data set we are using here contains protein sequences, we don't have to change anything. However, if you use nucleotid sequences, you have to adjust the commands used by the BlastClassifier to call the various programs from the BLAST suite. Here's a list of properties you have to change (changes are marked in bold):
- buildCommand
- formatdb -i stdin -n %DBNAME% -l %DBNAME%.log -p F
- command
- blastall -p blastn -d %DBNAME%
As you can see, the command property contains the program call itself as you would use it on command line. If you like to use PSI-BLAST instead of BLAST for the classification of protein sequences, you have to modify the command property appropriately and choose as parser the PsiBlastParser component instead of the preselected BlastParser component whenever you use more than one iteration. Depending on your BLAST version, you may have to set the working directory, too:
Per default the BLAST databases are created within the working directory. This is the directory from which you've started BioWeka. If you like to change that, set the workingDirectory property to the directory you want BioWeka to use for the BLAST databases. A special case is blastpgp in the latest version (2.2.15) under Windows, for which you have to set the working directory in the BlastClassifier dialog to the directory of your BLAST installation.
Please be aware of the fact, that if you apply e.g. a ten-fold cross-validation, the BlastClassifier will create ten BLAST databases. However, because each database has the same name, in each iteration the database from the previous iteration is overwritten. To avoid this behaviour, you have to turn on the createTemporaryFile property. Now, the name of each BLAST database is determined by the relation name of the data set and an unique random suffix. Please note, that this is an experimental feature, which is not fully tested. It is not necessarily required for running a cross-validation only locally. However, if you plan to use software such as Weka-Grid and run the cross-validation distributed in parallel mode on many machines, which share a common file space, you have turn on this property.
Rebuilding the same BLAST database from the same training set again and again, is not very efficient, but sometimes necessary, if e.g. just the query parameters are changed. In such a case it may be useful to build the BLAST database just once and reuse it the following times. This can be done by setting the ignoreTrainingSet property to True. In this case no BLAST database is created from the training set, but the relation name of the training set is used to identify an existing BLAST database.
The actual classification procedure is not different from that using the local alignments. After you made your settings (see e.g. Figure 9a), hit the Start button. Exemplary results for the settings in Figure 9a on the reduced set with four sequences as given above are shown in Figure 9b.
Classification based on a feature vector
So let's try another approach: Instead of using sequence alignments, we compute a numeric vector, i.e. a feature vector, for each sequence and apply afterwards a standard Weka classifier. One possibility to create a feature vector for protein sequences is to count the frequencies of the amino acids. This can be done by using the SymbolCounter filter.
So go back to the Preprocess panel and search for bioweka/filters/sequence/extractors/SymbolCounter within the filter field. Instead of using the direct counts of each amino acid, we want to use their relative frequencies. Therefore, we choose within the SymbolCounter properties dialog as normalizer the class SumNormalizer. Finally, we set the property pseudoCounts to 1.0, before we apply the filter to the sequence set.
As you will see, there are 21 new attributes named after the three letter codes of the 20 amino acids plus SEC for selenocysteine. Next, we must remove all non-numeric data from the data set, because the classifier we apply in the next step only supports numeric and nominal data. We also remove the alphabet and the sequence length attribute, because we do not want the classifier to consider these properties when classifying the sequences.
To remove an attribute, put a mark into the check box next to the attribute name. Do this for the attributes No. 1 until 6. Finally, hit the button Remove. The selected attributes should now disappear.
Now switch to the Classify panel and choose as a classifier weka/classifiers/functions/SMO. This is Weka's implementation of a support vector machine. Before you hit the Start button to run the classification, ensure that above the Start button the attribute sequence.annotation.class is selected. This combo box identifies the class attribute.
The evaluation procedure (percentage-split) now just takes a few seconds. The results on the ASTRAL data are shown in Figure 10b.
Classification using InterProScan
Another method to classify sequences is to create a pattern vector for each sequence. A pattern vector counts for a number of sequence patterns how often a given pattern occurs within the sequence. Of course, for a given sequence a pattern vector will contain mostly zeros and just a few non-zero entries, so we are talking about sparse instances.
One way to create such a pattern vector from a set of sequences is to use InterProScan. InterProScan provides a command line tool called iprscan, which accepts as an input a FASTA file and outputs a set of pattern vectors in different formats. One output format is XML. This is also the format BioWeka can read.
So for this part of the tutorial please download and install iprscan. Then let iprscan calculate the pattern vectors for our data set. At this point the (reduced) data set only exists within the Weka Explorer. To use it with iprscan, we must export it in the FASTA file format.
As you can see in Figure 5 the Explorer interface provides a button titled Save. However, this button provides only the option to save the current dataset as ARFF file. Of course, BioWeka delivers a solution to this problem, i.e. it offers a special filter called Save.
From the filter list choose bioweka/filters/universal/Save. In the filter's options choose as saver the component bioweka/core/converters/sequence/FastaSequenceSaver. Finally, set the outputFile property e.g. to subset.fa, close all dialogs using the OK button and in the Explorer interface hit the Apply button. In addition, save the dataset as ARFF file, say subset.arff. It will be clear in a moment, why we will need both formats.
In the next step, you have to run iprscan on the FASTA file. Please note, that you must tell iprscan to export the pattern vectors in the XML format. After iprscan finished its work, there should be a file called subset.xml, or whatever you've called your FASTA input file.
Back in the Weka Explorer, try opening this XML file. You'll get the well-known error message telling you that this is not a ARFF file. So hit the button Use converter and choose as a converter the component bioweka/core/converters/xml/XslXmlLoader. This component requires the stylesheet property to be set. Point it to the file iprscan.xsl within the directory etc/stylesheets of BioWeka's installation directory. Close the dialog.
It requires some time to load the pattern vectors into the Weka Explorer. Afterwards the Preprocess panel shows the pattern names within the attribute list. Each attribute is named by the prefix sequence.pattern. and the name of the pattern, e.g "PRODOM_PD003791" for a pattern from ProDom.
What is missing is the class annotation, i.e. the sequence.annotation.class attribute. Therefore, you may choose as filter the component bioweka/filters/universal/AddLastAttributeByName (available in BioWeka since version 0.6.0) and then configure its properties. As input file use the subset.arff file. After closing the dialog and pressing the Apply button, the attribute list will show the additional attribute sequence.annotation.class from the subset.arff file.
Note: This filter joins the instances of the two sets by their first attribute (usually the sequence name). It adds the last attribute of the second set (usually the class attribute) to the instances of the first set. This attribute must not be String. Please convert it to nominal as described above before applying the filter, if necessary.
For the following classification procedure we recommend to use a SVM, i.e. the component weka/classifiers/functions/SMO. However, you may choose any classifier that operates on numeric data. Please note that you may wish to delete the attributes sequence.name, sequence.crc and sequence.length first, if you want to concentrate on InterPro patterns only.
For illustrative purposes, we again use the four test sequences of the reduced test set given above: Figure 11a shows an example XML file for the test data. Figure 11b shows the dialog box of the XmlXslLoader. Figure 11c shows the setting for the AddLastAttributeByName filter, where we load the class attribute from the ARFF file for the four test sequences. As you can see in Figure 11d, now we have both pattern annotations and class annotations (attribute no. 18). Please find the test files here: the test sequence file, the test sequence ARFF file, and the test XML file from iprscan.
See also
Jan E. Gewehr¹, Martin Szugat¹ and Ralf Zimmer. BioWeka - extending the Weka framework for bioinformatics. Bioinformatics. 2007 Mar 1;23(5):651-3. HTML
¹authors contributed equally

