How to import gene expression data (CSV and MAGE-ML)
Here you will learn how to import gene expression data into Weka using
BioWeka's import layer. We will deal with two formats: tab-separated
data (using files from the Stanford Microarray Database as examples) and
MAGE-ML with external data files (using data downloaded from
http://sgdlite.princeton.edu/download/yeast_datasets/).
Loading Tab-Separated Data
Tab-separated files are common for gene expression data. Here, we will show the import of the fly dataset by Arbeitman et al. (Science, 2002, 297(5590):2270-5), downloaded from the Stanford Microarray Database (SMD, http://genome-www5.stanford.edu). Figure 1 shows a screenshot of the file in WordPad.
In Weka/BioWeka, we select this file in the Open dialog. Weka does not recognize the format at first (see Figure 2), so we need a converter. We choose BioWeka's CSV loader and select the range 2-last as numeric values, because the first value of each row contains the ID (see Figure 3). After clicking "OK", the dataset is available in the Weka GUI (see Figure 4). The converter assumes the first line to be the header and the rest to be data. The EWEIGHT row visible in Fig. 1 is therefore imported as an ordinary instance; you may wish to remove it using the Edit button in the Explorer GUI before proceeding.
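The layout the converter expects can be sketched in plain Java, independent of Weka and BioWeka. This is only an illustration under assumptions: the class name `TabExpressionSketch`, its methods, and the sample rows are hypothetical and are not part of BioWeka or of the Arbeitman dataset. It treats the first line as the header, column 1 as the ID, columns 2-last as numeric values, and skips a row whose ID is EWEIGHT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TabExpressionSketch {

    /** One data row: the ID from column 1 and the numeric values from columns 2-last. */
    static final class Row {
        final String id;
        final double[] values;

        Row(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Splits one tab-separated line; empty or non-numeric cells become NaN (missing). */
    static Row parseLine(String line) {
        String[] cells = line.split("\t", -1); // -1 keeps trailing empty cells
        double[] values = new double[cells.length - 1];
        for (int i = 1; i < cells.length; i++) {
            try {
                values[i - 1] = Double.parseDouble(cells[i].trim());
            } catch (NumberFormatException e) {
                values[i - 1] = Double.NaN;
            }
        }
        return new Row(cells[0], values);
    }

    /** Treats the first line as the header, skips the EWEIGHT row, and parses the rest. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        for (String line : lines.subList(1, lines.size())) {
            Row row = parseLine(line);
            if (!row.id.equals("EWEIGHT")) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the real file.
        List<String> lines = Arrays.asList(
            "ID\tsample1\tsample2",
            "EWEIGHT\t1\t1",
            "geneA\t0.5\t-1.2",
            "geneB\t\t2.0");
        for (Row row : parse(lines)) {
            System.out.println(row.id + " " + Arrays.toString(row.values));
        }
    }
}
```

Dropping the EWEIGHT row at parse time corresponds to removing it by hand in the Explorer; whether that is desirable depends on how the weights are used later.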
Loading MAGE-ML Data
Java Preparation
In order to process MAGE-ML data with XSLT 2.0, we need some preparations. First, we download Saxon 8.4 from SourceForge (http://saxon.sourceforge.net/). The jars need to be placed in BioWeka's lib directory, where they are automatically included in the class path when the startup script is called. Next, we add the following line to the jaxp.properties file in the lib directory of our Java installation:
javax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
If this file does not exist already, we can simply copy it from BioWeka's "etc/stylesheets" directory. Now Saxon is the default processor for XSLT. These preparations have to be done only once before the first MAGE-ML file can be imported.
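To verify that the setting took effect, one can ask JAXP which factory it resolves. The following is a minimal sketch using only the standard `javax.xml.transform.TransformerFactory` API; the class name `XsltProcessorCheck` is hypothetical. It should be run with the same class path that BioWeka's startup script uses.

```java
import javax.xml.transform.TransformerFactory;

public class XsltProcessorCheck {

    /** Returns the class name of the TransformerFactory that JAXP currently resolves. */
    static String activeFactory() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        String name = activeFactory();
        System.out.println("Active XSLT factory: " + name);
        // With the jaxp.properties line in place and the Saxon jars on the class path,
        // the name is expected to be net.sf.saxon.TransformerFactoryImpl.
        if (!name.startsWith("net.sf.saxon.")) {
            System.out.println("Saxon is not the default processor; "
                + "check jaxp.properties and the class path.");
        }
    }
}
```

If jaxp.properties names the Saxon factory but the jars are not on the class path, `TransformerFactory.newInstance()` fails with a `TransformerFactoryConfigurationError` instead of returning a factory.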
Data Preparation
Another important point is the MAGE-ML DTD (MAGE-ML.dtd). The header of a MAGE-ML file has to reference this DTD correctly so that the processor can find it. Some files reference a DTD somewhere in the local filesystem of the person who created the file; in that case, the reference has to be corrected. A simple workaround is to download the MAGE-ML.dtd file and put it into the working directory together with the MAGE-ML file to be imported. The reference in the header can then simply be set to
<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
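For more than a handful of files, the header correction can be scripted. The sketch below is an assumption-laden illustration, not a BioWeka tool: the class name `DoctypeFixSketch` and the sample path are hypothetical, and the regular expression only handles a plain `SYSTEM` declaration without an internal subset. It rewrites the first MAGE-ML DOCTYPE so that it points to a MAGE-ML.dtd in the working directory.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoctypeFixSketch {

    // Matches a MAGE-ML DOCTYPE with a SYSTEM identifier in single or double quotes.
    private static final Pattern DOCTYPE =
        Pattern.compile("<!DOCTYPE\\s+MAGE-ML\\s+SYSTEM\\s+(\"[^\"]*\"|'[^']*')\\s*>");

    private static final String LOCAL_DOCTYPE = "<!DOCTYPE MAGE-ML SYSTEM \"MAGE-ML.dtd\">";

    /** Points the first MAGE-ML DOCTYPE at MAGE-ML.dtd; returns the input unchanged if none is found. */
    static String fixDoctype(String xml) {
        Matcher matcher = DOCTYPE.matcher(xml);
        return matcher.replaceFirst(Matcher.quoteReplacement(LOCAL_DOCTYPE));
    }

    public static void main(String[] args) {
        // Hypothetical header with a reference into someone else's local filesystem.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE MAGE-ML SYSTEM \"D:/some/local/path/MAGE-ML.dtd\">\n"
            + "<MAGE-ML identifier=\"example\"/>";
        System.out.println(fixDoctype(xml));
    }
}
```

To apply this to a file, the content can be read into a string, passed through `fixDoctype`, and written back; keeping a copy of the original file is advisable.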
MAGE-ML and DataExternal
We use a dataset downloaded from sgdlite (see above) for this tutorial, namely
the experiment data from Preiss et al., 2003, Nat. Struct. Biol. 10(12):1039-47.
This dataset consists of a MAGE-ML file and a number of text files that the
XML file references through DataExternal elements
(please note that MAGE-ML is basically a description
of the experiment and its parameters and does not necessarily contain the expression
data itself). We select the XML file in the Open dialog. Again, Weka does not
recognize the format, so we choose BioWeka's XslXmlLoader and choose the MAGE-ML
stylesheet from BioWeka's "etc/stylesheets" directory (see Figure 5).
As a result, the first table referenced as external data in the XML file is loaded into BioWeka (see Figure 6). All other tables are converted into ARFF format and stored on disk in the directory that contains the selected XML file (see Figure 7) for future use.
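Before importing, it can help to check which external files the XML file expects, so that missing text files are noticed early. The sketch below is not how BioWeka's stylesheet works internally; it is a separate illustration using the JDK's DOM parser. It assumes that each DataExternal element names its file in a `filenameURI` attribute, which should be checked against the MAGE-ML.dtd in use. The class name `DataExternalListSketch` and the heavily simplified sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DataExternalListSketch {

    /** Collects the filenameURI attribute (assumed name) of every DataExternal element. */
    static List<String> externalFiles(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            NodeList nodes = builder.parse(new InputSource(new StringReader(xml)))
                                    .getElementsByTagName("DataExternal");
            List<String> files = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                files.add(((Element) nodes.item(i)).getAttribute("filenameURI"));
            }
            return files;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) {
        // Simplified, hypothetical document; real MAGE-ML nests these elements much deeper.
        String xml = "<MAGE-ML>"
            + "<DataExternal filenameURI=\"array1.txt\"/>"
            + "<DataExternal filenameURI=\"array2.txt\"/>"
            + "</MAGE-ML>";
        for (String file : externalFiles(xml)) {
            System.out.println(file);
        }
    }
}
```

When the input contains a DOCTYPE declaration, the parser will try to load the referenced DTD, so the DTD reference has to be resolvable here as well.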
MAGE-ML and DataInternal
Alternatively, the same stylesheet can deal with data that is stored in MAGE-ML
files as DataInternal elements. Again, the first dataset will be loaded and the
following datasets will be stored on disk. The names will be chosen according to
the positions in the XML file. The content of each internal data element will
be regarded as an individual dataset.

