How to import gene expression data (CSV and MAGE-ML)

Here you will learn how to import gene expression data into Weka using BioWeka's import layer. We will deal with two formats: tab-separated data (using files from the Stanford Microarray Database as examples) and MAGE-ML with external data files (using data downloaded from http://sgdlite.princeton.edu/download/yeast_datasets/).

Loading Tab-Separated Data

Figure 1

Tab-separated files are common for gene expression data. Here, we import the fly dataset by Arbeitman et al. (Science, 2002, 297(5590):2270-5), downloaded from the Stanford Microarray Database (SMD, http://genome-www5.stanford.edu). Figure 1 shows a screenshot of the file in WordPad.

Figure 2
Figure 3
Figure 4

In Weka/BioWeka, we select this file in the Open dialog. Weka does not recognize the format at first (see Figure 2), so we need a converter. We choose BioWeka's CSV loader and select the range 2-last as numeric values, because the first value of each row contains the ID (see Figure 3). After clicking "OK", the dataset is available in the Weka GUI (see Figure 4). The loader assumes that the first line is the header and that all remaining lines are data. As Figure 1 shows, the file also contains an EWEIGHT row, which is therefore loaded as an instance; you may wish to remove it using the Edit button in the Explorer GUI before proceeding.
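
The loader's assumptions can be illustrated outside Weka. The following Python sketch is not BioWeka's code; it only mimics the behaviour described above (first line as header, first column as ID, columns 2-last as numbers) and skips the EWEIGHT row. The function name load_expression_table, the sample IDs, and the column names are made up for illustration.

```python
import csv
import io
import math


def load_expression_table(text, skip_ids=("EWEIGHT",)):
    """Parse tab-separated text into (header, rows).

    The first line is taken as the header, the first column as the row ID,
    and every remaining column as a number. Empty or non-numeric cells
    become NaN. Rows whose ID is listed in skip_ids are dropped.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = {}
    for record in reader:
        if not record or record[0] in skip_ids:
            continue
        values = []
        for cell in record[1:]:
            try:
                values.append(float(cell))
            except ValueError:
                values.append(math.nan)  # missing measurement
        rows[record[0]] = values
    return header, rows


# Hypothetical miniature file: two genes, two time points, one EWEIGHT row.
sample = "ID\tE01h\tE02h\nEWEIGHT\t1\t1\nCG1234\t0.52\t\nCG5678\t-1.3\t0.08\n"
header, rows = load_expression_table(sample)
```

Dropping the EWEIGHT row before loading has the same effect as deleting the corresponding instance in the Explorer afterwards.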

Loading MAGE-ML Data

Java Preparation

Figure 5
Figure 6

In order to process MAGE-ML data with XSLT 2.0, we need some preparations. First, we download Saxon 8.4 from SourceForge (http://saxon.sourceforge.net/). The jars need to be placed in BioWeka's lib directory, where they are automatically included in the class path when the startup script is called. Further, we add the following line to the jaxp.properties file in the lib directory of our Java installation:

javax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl

If this file does not exist yet, we can simply copy it from BioWeka's "etc/stylesheets" directory. Saxon is now the default XSLT processor. These preparations have to be done only once, before the first MAGE-ML file is imported.
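
Editing jaxp.properties by hand is easy to get wrong, so the step can also be scripted. The following Python sketch uses a hypothetical helper, ensure_saxon_factory, that is not part of BioWeka or Saxon. It makes sure the file contains exactly one TransformerFactory line. The example runs against a temporary file; the real path inside the Java installation has to be supplied by the reader.

```python
import tempfile
from pathlib import Path

KEY = "javax.xml.transform.TransformerFactory"
VALUE = "net.sf.saxon.TransformerFactoryImpl"


def ensure_saxon_factory(properties_file):
    """Create or update a jaxp.properties file so that it selects Saxon."""
    path = Path(properties_file)
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop any earlier setting of the same key, keep all other lines.
    kept = [line for line in lines if not line.strip().startswith(KEY)]
    kept.append(f"{KEY}={VALUE}")
    path.write_text("\n".join(kept) + "\n")
    return path


# Demonstration on a throwaway file, not on a real Java installation.
with tempfile.TemporaryDirectory() as tmp:
    demo = ensure_saxon_factory(Path(tmp) / "jaxp.properties")
    ensure_saxon_factory(demo)  # a second call must not duplicate the line
    content = demo.read_text()
```

Because the helper replaces an existing entry instead of appending blindly, it is safe to run more than once.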

Data Preparation

Another important point is the MAGE-ML.dtd. The header of a MAGE-ML file has to reference this DTD correctly so that the processor can find it. Some files instead reference a DTD somewhere in the local filesystem of the person who created the file; in that case, the reference has to be corrected. A simple workaround is to download the MAGE-ML.dtd file and put it into the working directory together with the MAGE-ML file to be imported. The reference in the header can then simply be set to

<!DOCTYPE MAGE-ML SYSTEM "MAGE-ML.dtd">
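
For more than a handful of files, the header correction can be automated. The sketch below is one possible approach in Python, not a BioWeka feature. It assumes that the DOCTYPE declaration uses a double-quoted SYSTEM identifier and has no internal subset; the function name and the sample path are hypothetical.

```python
import re

# Matches a declaration such as <!DOCTYPE MAGE-ML SYSTEM "some/path.dtd">
DOCTYPE_PATTERN = re.compile(r'<!DOCTYPE\s+MAGE-ML\s+SYSTEM\s+"[^"]*"\s*>')


def point_doctype_to_local_dtd(xml_text, dtd_name="MAGE-ML.dtd"):
    """Rewrite the first MAGE-ML DOCTYPE so that it references dtd_name."""
    replacement = f'<!DOCTYPE MAGE-ML SYSTEM "{dtd_name}">'
    # A function is used as replacement so that no escapes are interpreted.
    return DOCTYPE_PATTERN.sub(lambda match: replacement, xml_text, count=1)


# Hypothetical header pointing into someone else's local filesystem.
before = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE MAGE-ML SYSTEM "D:/projects/mage/MAGE-ML.dtd">\n'
    '<MAGE-ML identifier="demo"/>\n'
)
after = point_doctype_to_local_dtd(before)
```

The rest of the file is left untouched, and a file that already carries the local reference comes out unchanged.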

MAGE-ML and DataExternal

For this tutorial, we use a dataset downloaded from sgdlite (see above), namely the experiment data from Preiss et al., 2003, Nat. Struct. Biol. 10(12):1039-47. This dataset consists of a MAGE-ML file and a number of text files that the XML file references through DataExternal elements (please note that MAGE-ML is basically a description of the experiment and its parameters and does not necessarily contain the expression data itself). We select the XML file in the Open dialog. Again, Weka does not recognize the format, so we choose BioWeka's XslXmlLoader and select the MAGE-ML stylesheet from BioWeka's "etc/stylesheets" directory (see Figure 5).
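
Before importing, it can be useful to check which text files a MAGE-ML file expects to find next to it. The Python sketch below lists them. It assumes that each DataExternal element names its file in a filenameURI attribute, which should be verified against the MAGE-ML.dtd in use. The sample document is heavily simplified and its file names are made up.

```python
import xml.etree.ElementTree as ET


def external_data_files(xml_text):
    """Return the files referenced by DataExternal elements, in document order."""
    root = ET.fromstring(xml_text)
    return [
        element.get("filenameURI")
        for element in root.iter("DataExternal")
        if element.get("filenameURI")  # assumed attribute name, see lead-in
    ]


# Simplified, hypothetical excerpt; real MAGE-ML files nest these elements
# much more deeply and carry many more attributes.
sample = """<MAGE-ML identifier="demo">
  <BioAssayData_package>
    <BioDataCube>
      <DataExternal_assn>
        <DataExternal filenameURI="array1.txt"/>
      </DataExternal_assn>
    </BioDataCube>
    <BioDataCube>
      <DataExternal_assn>
        <DataExternal filenameURI="array2.txt"/>
      </DataExternal_assn>
    </BioDataCube>
  </BioAssayData_package>
</MAGE-ML>"""
files = external_data_files(sample)
```

Any file in the resulting list that is missing from the working directory is a likely cause of a failed import.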

Figure 7

As a result, the first table referenced as external data in the XML file is loaded into BioWeka (see Figure 6). All other tables are converted into ARFF format and stored on disk, in the directory from which the XML file was selected (see Figure 7), for future use.
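
For readers unfamiliar with ARFF, the following Python sketch shows roughly what such a file looks like for an expression table. It is not BioWeka's converter and makes no claim about the exact files BioWeka writes. The relation name, IDs, and column names are hypothetical, and attribute names are assumed to contain no spaces or commas, so no quoting is done.

```python
import math


def to_arff(relation, header, rows):
    """Render a table as ARFF text: a string ID attribute plus numeric columns."""
    lines = [f"@relation {relation}", ""]
    lines.append(f"@attribute {header[0]} string")
    for name in header[1:]:
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row_id, values in rows:
        # ARFF marks missing values with a question mark.
        cells = ["?" if math.isnan(value) else repr(value) for value in values]
        lines.append(",".join([row_id] + cells))
    return "\n".join(lines) + "\n"


arff = to_arff(
    "demo",
    ["ID", "E01h", "E02h"],
    [("CG1234", [0.52, math.nan]), ("CG5678", [-1.3, 0.08])],
)
```

Files of this kind can later be opened in Weka directly, without any converter.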

MAGE-ML and DataInternal

Alternatively, the same stylesheet can handle data stored directly in MAGE-ML files as DataInternal elements. Again, the first dataset is loaded and the following datasets are stored on disk. Their names are chosen according to their positions in the XML file. The content of each internal data element is treated as an individual dataset.