Iprscan.xsl
From BioWeka
Description
This stylesheet converts XML files created by InterProScan into the ARFF format required by Weka.
Prerequisites
- InterProScan 4.1 or later recommended: Download the standalone program iprscan from the EBI FTP site.
- XSL 1.0 processor
Application
Each protein is converted into a sparse instance with at least three attributes: sequence.name of type nominal (not string, because of a problem with sparse instances and string attributes), sequence.length of type numeric, and sequence.crc64 of type numeric. For each pattern there is a separate numeric attribute that counts the occurrences of that pattern in a given protein sequence, so each instance represents a (sparse) pattern vector. Pattern attributes are named sequence.pattern.{$dbname}_{$id}, where {$dbname} is the value of the @dbname attribute of the match element and {$id} is the value of its @id attribute. The pattern attributes are sorted by the value of {$dbname}_{$id}.
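The mapping described above can be illustrated with a small Python sketch. It is not the stylesheet itself, only an approximation of the resulting layout. The sample XML is hypothetical: apart from the match element and its @dbname and @id attributes, all element and attribute names are assumptions. The sketch counts each match element as one occurrence, which may differ from how the stylesheet counts, and it omits sequence.crc64 for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical input. Only match/@dbname and match/@id come from the
# description above; the other names are assumptions for illustration.
SAMPLE_XML = """
<interpro_matches>
  <protein id="P1" length="120">
    <match dbname="PFAM" id="PF00001"/>
    <match dbname="PFAM" id="PF00001"/>
    <match dbname="PRINTS" id="PR00237"/>
  </protein>
  <protein id="P2" length="80">
    <match dbname="PFAM" id="PF00002"/>
  </protein>
</interpro_matches>
"""


def pattern_key(match):
    """Return the {$dbname}_{$id} part of a pattern attribute name."""
    return f"{match.get('dbname')}_{match.get('id')}"


def to_sparse_arff(xml_text, relation="interpro"):
    """Build a sparse ARFF document from the (assumed) XML layout."""
    proteins = ET.fromstring(xml_text).findall("protein")

    # Assumption: one match element equals one occurrence of the pattern.
    counts = {
        p.get("id"): Counter(pattern_key(m) for m in p.iter("match"))
        for p in proteins
    }
    # Pattern attributes are sorted by the value of {$dbname}_{$id}.
    patterns = sorted({key for c in counts.values() for key in c})

    names = ",".join(p.get("id") for p in proteins)
    lines = [
        f"@relation {relation}",
        "",
        f"@attribute sequence.name {{{names}}}",
        "@attribute sequence.length numeric",
    ]
    lines += [f"@attribute sequence.pattern.{key} numeric" for key in patterns]
    lines += ["", "@data"]

    # Attributes 0 and 1 are name and length; patterns start at index 2.
    index = {key: i + 2 for i, key in enumerate(patterns)}
    for p in proteins:
        cells = [f"0 {p.get('id')}", f"1 {p.get('length')}"]
        cells += [
            f"{index[key]} {n}" for key, n in sorted(counts[p.get("id")].items())
        ]
        # Sparse instances list only the attributes that are present.
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines)


print(to_sparse_arff(SAMPLE_XML))
```

In the printed result, protein P1 has no entry for sequence.pattern.PFAM_PF00002 and P2 has entries for neither of the other two patterns. Leaving out zero counts is what makes the instances sparse.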
Parameters
The relation name defaults to interpro. To change it, set the relationName parameter to the appropriate value.
The database names and identifiers of the patterns should not contain whitespace. If this cannot be avoided, set the patternSeperator parameter to a character that does not occur in any pattern id or database name.
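How the parameters are passed depends on the XSL processor. As a minimal sketch, assuming xsltproc is the processor (this page does not name one), the following Python helper builds a command line that sets both parameters through xsltproc's --stringparam option. The file name proteins.xml and the values kinases and | are hypothetical.

```python
import shlex


def xslt_command(stylesheet, xml_file, **params):
    """Build an xsltproc argument list that passes string parameters."""
    args = ["xsltproc"]
    for name, value in params.items():
        # xsltproc expects: --stringparam <name> <value>
        args += ["--stringparam", name, value]
    # Options come first, then the stylesheet, then the input file.
    return args + [stylesheet, xml_file]


# Hypothetical values: a custom relation name and a separator character
# that is assumed not to occur in any pattern id or database name.
cmd = xslt_command(
    "iprscan.xsl",
    "proteins.xml",
    relationName="kinases",
    patternSeperator="|",
)
print(shlex.join(cmd))
```

The list in cmd can be handed to subprocess.run, with standard output redirected to the target ARFF file. Other XSL 1.0 processors offer comparable ways to set stylesheet parameters.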
Known problems
- iprscan.xsl can't load XML files
- String attributes get lost when using sparse instances (only before version 0.5.0)

