RNeXML tutorial


for v2.0.4


An R package for reading, writing, integrating and publishing data using the Ecological Metadata Language (EML) format.

An extensive and rapidly growing collection of richly annotated phylogenetics data is now available in the NeXML format. NeXML relies on state-of-the-art data exchange technology to provide a format that can be both validated and extended, providing a data quality assurance and and adaptability to the future that is lacking in other formats Vos et al 2012.

Installation

The stable version is on CRAN

install.packages("RNeXML")

The development version of RNeXML is available on Github. With the devtools package installed on your system, RNeXML can be installed using:

install.packages("devtools")
devtools::install_github("ropensci/RNeXML")

Usage

Understanding the nexml S4 object

The RNeXML package provides many convenient functions to add and extract information from nexml objects in the R environment without requiring the reader to understand the details of the NeXML data structure and making it less likely that a user will generate invalid NeXML syntax that could not be read by other parsers. The nexml object we have been using in all of the examples is built on R’s S4 mechanism. Advanced users may sometimes prefer to interact with the data structure more directly using R’s S4 class mechanism and subsetting methods. Many R users are more familiar with the S3 class mechanism (such as in the ape package phylo objects) rather than the S4 class mechanism used in phylogenetics packages such as ouch and phylobase. The phylobase vignette provides an excellent introduction to these data structures. Users already familiar with subsetting lists and other S3 objects in R are likely familar with the use of the $ operator, such as phy$edge. S4 objects simply use an @ operator instead (but cannot be subset using numeric arguments such as phy[[1]] or named arguments such as phy[[“edge”]]).

The nexml object is an S4 object, as are all of its components (slots). Its hierarchical structure corresponds exactly with the XML tree of a NeXML file, with the single exception that both XML attributes and children are represented as slots. S4 objects have constructor functions to initialize them. We create a new nexml object with the command:

nex <- new("nexml")

We can see a list of slots contained in this object with

slotNames(nex)
 [1] "version"            "generator"          "xsi:schemaLocation"
 [4] "namespaces"         "otus"               "trees"
 [7] "characters"         "meta"               "about"
[10] "xsi:type"

Some of these slots have already been populated for us, for instance, the schema version and default namespaces:

nex@version
[1] "0.9"
nex@namespaces
                                             nex
                     "http://www.nexml.org/2009"
                                             xsi
     "http://www.w3.org/2001/XMLSchema-instance"
                                             xml
          "http://www.w3.org/XML/1998/namespace"
                                            cdao
       "http://purl.obolibrary.org/obo/cdao.owl"
                                             xsd
             "http://www.w3.org/2001/XMLSchema#"
                                              dc
              "http://purl.org/dc/elements/1.1/"
                                         dcterms
                     "http://purl.org/dc/terms/"
                                             ter
                     "http://purl.org/dc/terms/"
                                           prism
"http://prismstandard.org/namespaces/1.2/basic/"
                                              cc
                "http://creativecommons.org/ns#"
                                            ncbi
         "http://www.ncbi.nlm.nih.gov/taxonomy#"
                                              tc
 "http://rs.tdwg.org/ontology/voc/TaxonConcept#"

                     "http://www.nexml.org/2009"

Recognize that nex@namespaces serves the same role as get_namespaces function, but provides direct access to the slot data. For instance, with this syntax we could also overwrite the existing namespaces with nex@namespaces <- NULL. Changing the namespace in this way is not advised.

Some slots can contain multiple elements of the same type, such as trees, characters, and otus. For instance, we see that

class(nex@characters)
[1] "ListOfcharacters"
attr(,"package")
[1] "RNeXML"

is an object of class ListOfcharacters, and is currently empty,

length(nex@characters)
[1] 0

In order to assign an object to a slot, it must match the class definition of the slot. We can create a new element of any given class with the new function,

nex@characters <- new("ListOfcharacters", list(new("characters")))

and now we have a length-1 list of character matrices,

length(nex@characters)
[1] 1

and we access the first character matrix using the list notation, [[1]]. Here we check the class is a characters object.

class(nex@characters[[1]])
[1] "characters"
attr(,"package")
[1] "RNeXML"

Direct subsetting has two primary use cases: (a) useful in looking up (and possibly editing) a specific value of an element, or (b) when adding metadata annotations to specific elements. Consider the example file

f <- system.file("examples", "trees.xml", package="RNeXML")
nex <- nexml_read(f)

We can look up the species label of the first otu in the first otus block:

nex@otus[[1]]@otu[[1]]@label
      label
"species 1"

We can add metadata to this particular OTU using this subsetting format

nex@otus[[1]]@otu[[1]]@meta <-
  c(meta("skos:note",
          "This species was incorrectly identified"),
         nex@otus[[1]]@otu[[1]]@meta)

Here we use the c operator to append this element to any existing meta annotations to this otu.

Writing NeXML metadata

The add_basic_meta() function takes as input an existing nexml object (like the other add_ functions, if none is provided it will create one), and at the time of this writing any of the following parameters: title, description, creator, pubdate, rights, publisher, citation. Other metadata elements and corresponding parameters may be added in the future.

Load data:

data(bird.orders)

Create an nexml object for the phylogeny bird.orders and add appropriate metadata:

data("bird.orders")
birds <- add_trees(bird.orders)
birds <- add_basic_meta(
  title = "Phylogeny of the Orders of Birds From Sibley and Ahlquist",

  description = "This data set describes the phylogenetic relationships of the
     orders of birds as reported by Sibley and Ahlquist (1990). Sibley
     and Ahlquist inferred this phylogeny from an extensive number of
     DNA/DNA hybridization experiments. The ``tapestry'' reported by
     these two authors (more than 1000 species out of the ca. 9000
     extant bird species) generated a lot of debates.

     The present tree is based on the relationships among orders. The
     branch lengths were calculated from the values of Delta T50H as
     found in Sibley and Ahlquist (1990, fig. 353).",

  citation = "Sibley, C. G. and Ahlquist, J. E. (1990) Phylogeny and
     classification of birds: a study in molecular evolution. New
     Haven: Yale University Press.",

  creator = "Sibley, C. G. and Ahlquist, J. E.",
    nexml=birds)

Instead of a literal string, citations can also be provided in R’s bibentry type, which is the one in which R package citations are obtained:

birds <- add_basic_meta(citation = citation("ape"), nexml = birds)

Taxonomic identifiers

The taxize_nexml() function uses the R package taxize [@Chamberlain_2013] to check each taxon label against the NCBI database. If a unique match is found, a metadata annotation is added to the taxon providing the NCBI identification number to the taxonomic unit.

birds <- taxize_nexml(birds, "NCBI")

If no match is found, the user is warned to check for possible typographic errors in the taxonomic labels provided. If multiple matches are found, the user will be prompted to choose between them.

Custom metadata extensions

We can get a list of namespaces along with their prefixes from the nexml object:

prefixes <- get_namespaces(birds)
prefixes["dc"]
                                dc
"http://purl.org/dc/elements/1.1/"

We create a meta element containing this annotation using the meta function:

modified <- meta(property = "prism:modificationDate", content = "2013-10-04")

We can add this annotation to our existing birds NeXML file using the add_meta() function. Because we do not specify a level, it is added to the root node, referring to the NeXML file as a whole.

birds <- add_meta(modified, birds)

The built-in vocabularies are just the tip of the iceberg of established vocabularies. Here we add an annotation from the skos namespace which describes the history of where the data comes from:

history <- meta(property = "skos:historyNote",
  content = "Mapped from the bird.orders data in the ape package using RNeXML")

Because skos is not in the current namespace list, we add it with a url when adding this meta element. We also specify that this annotation be placed at the level of the trees sub-node in the NeXML file.

birds <- add_meta(history,
                birds,
                level = "trees",
                namespaces = c(skos = "http://www.w3.org/2004/02/skos/core#"))

For finer control of the level at which a meta element is added, we will manipulate the nexml R object directly using S4 sub-setting, as shown in the supplement.

Much richer metadata annotation is possible. Later we illustrate how metadata annotation can be used to extend the base NeXML format to represent new forms of data while maintaining compatibility with any NeXML parser. The RNeXML package can be easily extended to support helper functions such as taxize_nexml to add additional metadata without imposing a large burden on the user.

Reading NeXML metadata

A call to the nexml object prints some metadata summarizing the data structure:

birds
A nexml object representing:
 	 1 phylogenetic tree blocks, where:
 	 block 1 contains 1 phylogenetic trees
 	 46 meta elements
 	 0 character matrices
 	 23 taxonomic units
 Taxa: 	 Struthioniformes, Tinamiformes, Craciformes, Galliformes, Anseriformes, Turniciformes ...

 NeXML generated by RNeXML using schema version: 0.9
 size: 372.7 Kb

We can extract all metadata pertaining to the NeXML document as a whole (annotations of the XML root node, <nexml>) with the command

meta <- get_metadata(birds)

This returns a data.frame of available metadata. We can see the kinds of metadata recorded from the names:

meta
Source: local data frame [10 x 7]

    meta                      property   datatype
   (chr)                         (chr)      (chr)
1     m2                      dc:title xsd:string
2     m3                    dc:creator xsd:string
3     m4                dc:description xsd:string
4     m5                            NA         NA
5     m6 dcterms:bibliographicCitation xsd:string
6     m7                    dc:creator xsd:string
7     m8                    dc:pubdate xsd:string
8     m9                            NA         NA
9    m20 dcterms:bibliographicCitation xsd:string
10   m44        prism:modificationDate xsd:string
Variables not shown: content (chr), xsi.type (chr), rel (chr), href (chr)

We can also access a table of taxonomic metadata:

get_taxa(birds)
Source: local data frame [23 x 5]

     otu            label about xsi.type  otus
   (chr)            (chr) (chr)    (lgl) (chr)
1    ou1 Struthioniformes  #ou1       NA   os1
2    ou2     Tinamiformes  #ou2       NA   os1
3    ou3      Craciformes  #ou3       NA   os1
4    ou4      Galliformes  #ou4       NA   os1
5    ou5     Anseriformes  #ou5       NA   os1
6    ou6    Turniciformes  #ou6       NA   os1
7    ou7       Piciformes  #ou7       NA   os1
8    ou8    Galbuliformes  #ou8       NA   os1
9    ou9   Bucerotiformes  #ou9       NA   os1
10  ou10      Upupiformes #ou10       NA   os1
..   ...              ...   ...      ...   ...

Which returns text from the otu element labels, typically used to define taxonomic names, rather than text from explicit meta elements.

We can also access metadata at a specific level (or use level=all to extract all meta elements in a list). Here we show only the first few results:

otu_meta <- get_metadata(birds, level="otus/otu")
otu_meta
Source: local data frame [23 x 9]

    meta property datatype content     xsi.type        rel
   (chr)    (lgl)    (lgl)   (lgl)        (chr)      (chr)
1    m21       NA       NA      NA ResourceMeta tc:toTaxon
2    m22       NA       NA      NA ResourceMeta tc:toTaxon
3    m23       NA       NA      NA ResourceMeta tc:toTaxon
4    m24       NA       NA      NA ResourceMeta tc:toTaxon
5    m25       NA       NA      NA ResourceMeta tc:toTaxon
6    m26       NA       NA      NA ResourceMeta tc:toTaxon
7    m27       NA       NA      NA ResourceMeta tc:toTaxon
8    m28       NA       NA      NA ResourceMeta tc:toTaxon
9    m29       NA       NA      NA ResourceMeta tc:toTaxon
10   m30       NA       NA      NA ResourceMeta tc:toTaxon
..   ...      ...      ...     ...          ...        ...
Variables not shown: href (chr), otu (chr), otus (chr)

Merging metadata tables

We often want to combine metadata from multiple tables. For instance, in this exercise we want to include the taxonomic identifier and id value for each species returned in the character table. This helps us more precisely identify the species whose traits are described by the table.

library("dplyr")
library("geiger")

To begin, let’s generate a NeXML file using the tree and trait data from the geiger package’s “primates” data:

data("primates")
add_trees(primates$phy) %>%
  add_characters(primates$dat, ., append=TRUE) %>%
  taxize_nexml() -> nex

(Note that we’ve used dplyr's cute pipe syntax, but unfortunately our add_ methods take the nexml object as the second argument instead of the first, so this isn’t as elegant since we need the stupid . to show where the piped output should go…)

We now read in the three tables of interest. Note that we tell get_characters to give us species labels as there own column, rather than as rownames. The latter is the default only because this plays more nicely with the default format for character matrices that is expected by geiger and other phylogenetics packages, but is in general a silly choice for data manipulation.

otu_meta <- get_metadata(nex, "otus/otu")
taxa <- get_taxa(nex)
char <- get_characters(nex, rownames_as_col = TRUE)

We can take a peek at what the tables look like, just to orient ourselves:

otu_meta
Source: local data frame [216 x 9]

    meta property datatype content     xsi.type        rel
   (chr)    (lgl)    (lgl)   (lgl)        (chr)      (chr)
1    m49       NA       NA      NA ResourceMeta tc:toTaxon
2    m50       NA       NA      NA ResourceMeta tc:toTaxon
3    m51       NA       NA      NA ResourceMeta tc:toTaxon
4    m52       NA       NA      NA ResourceMeta tc:toTaxon
5    m53       NA       NA      NA ResourceMeta tc:toTaxon
6    m54       NA       NA      NA ResourceMeta tc:toTaxon
7    m55       NA       NA      NA ResourceMeta tc:toTaxon
8    m56       NA       NA      NA ResourceMeta tc:toTaxon
9    m57       NA       NA      NA ResourceMeta tc:toTaxon
10   m58       NA       NA      NA ResourceMeta tc:toTaxon
..   ...      ...      ...     ...          ...        ...
Variables not shown: href (chr), otu (chr), otus (chr)
taxa
Source: local data frame [233 x 5]

     otu                       label about xsi.type  otus
   (chr)                       (chr) (chr)    (lgl) (chr)
1   ou24 Allenopithecus_nigroviridis #ou24       NA   os2
2   ou25         Allocebus_trichotis #ou25       NA   os2
3   ou26           Alouatta_belzebul #ou26       NA   os2
4   ou27             Alouatta_caraya #ou27       NA   os2
5   ou28          Alouatta_coibensis #ou28       NA   os2
6   ou29              Alouatta_fusca #ou29       NA   os2
7   ou30           Alouatta_palliata #ou30       NA   os2
8   ou31              Alouatta_pigra #ou31       NA   os2
9   ou32               Alouatta_sara #ou32       NA   os2
10  ou33          Alouatta_seniculus #ou33       NA   os2
..   ...                         ...   ...      ...   ...
head(char)
Source: local data frame [6 x 2]

                         taxa        x
                        (chr)    (dbl)
1 Allenopithecus_nigroviridis 8.465900
2         Allocebus_trichotis 4.368181
3           Alouatta_belzebul 8.729074
4             Alouatta_caraya 8.628735
5          Alouatta_coibensis 8.764053
6              Alouatta_fusca 8.554489

Now that we have nice data.frame objects for all our data, it’s easy to join them into the desired table with a few obvious dplyr commands:

taxa %>%
  left_join(char, by = c("label" = "taxa")) %>%
  left_join(otu_meta, by = "otu") %>%
  select(otu, label, x, href)
Source: local data frame [233 x 4]

     otu                       label        x
   (chr)                       (chr)    (dbl)
1   ou24 Allenopithecus_nigroviridis 8.465900
2   ou25         Allocebus_trichotis 4.368181
3   ou26           Alouatta_belzebul 8.729074
4   ou27             Alouatta_caraya 8.628735
5   ou28          Alouatta_coibensis 8.764053
6   ou29              Alouatta_fusca 8.554489
7   ou30           Alouatta_palliata 8.791790
8   ou31              Alouatta_pigra 8.881836
9   ou32               Alouatta_sara 8.796339
10  ou33          Alouatta_seniculus 8.767173
..   ...                         ...      ...
Variables not shown: href (chr)

Because these are all from the same otus block anyway, we haven’t selected that column, but were it of interest it is also available in the taxa table.

Citing

To cite RNeXML in publications use:


Carl Boettiger, Scott Chamberlain, Hilmar Lapp, Kseniia Shumelchyk and Rutger Vos (2015). RNeXML: Implement semantically rich I/O for NeXML format. R package version 2.0.4. http://CRAN.R-project.org/package=RNeXML

License and bugs

Back to top