Match tables with different taxon names
rOpenSci package: taxize
A common situation when a set of species is having more than one table of data, each with taxonomic names. These tables need to be merged at some point.
The problem is that some of the names in the separate tables may actually be the same, but one table may have updated names while the other may have old names. Therefore, when merging, two rows of data that should be matched are not.
We can easily take care of this problem using taxize.
Install taxize from CRAN
install.packages("taxize")
Load taxize
library("taxize")
Table 1
df1 <- data.frame(species = c("Helianthus annuus", "Silene acaulis", "Mimulus alpinus","Phlox drummondii", "Triteleia hyacinthina"), var1 = rnorm(5), var2 = rnorm(5),stringsAsFactors = FALSE)
Table 2
df2 <- data.frame(species = c("Helianthus annuus texanus", "Cucubalus acaulis","Mimulus langsdorffii alpinus", "Armeria drummondii", "Triteleia hyocinthina"),var3 = rnorm(5), var4 = rnorm(5), stringsAsFactors = FALSE)
Take a look.
df1
species var1 var2
1 Helianthus annuus 0.7285 -0.4922
2 Silene acaulis -0.6368 0.1510
3 Mimulus alpinus 0.1223 -0.2955
4 Phlox drummondii -0.1071 -0.6371
5 Triteleia hyacinthina 3.0563 2.3601
df2
species var3 var4
1 Helianthus annuus texanus 1.2547 1.5582
2 Cucubalus acaulis 0.1917 -0.4014
3 Mimulus langsdorffii alpinus -0.7310 0.3817
4 Armeria drummondii -0.8728 -1.8467
5 Triteleia hyocinthina 2.1583 0.5587
Get the up to date accepted names for both tables
And add that as a new column to each table. In this example I’m using
the tpl_search
function which searches the Plantlist.org, but you can
use other functions for the Taxonomic Name Resolution Service (TNRS),
NCBI, EOL, and more.
(df1_corr <- tpl_search(as.character(df1$species), corr = TRUE))
Genus Species Infraspecific Plant.Name.Index Taxonomic.status
1 Helianthus annuus TRUE Accepted
2 Silene acaulis TRUE Accepted
3 Mimulus alpinus TRUE Accepted
4 Phlox drummondii TRUE Accepted
5 Triteleia hyacinthina TRUE Accepted
Family New.Genus New.Species New.Infraspecific Authority
1 Compositae Helianthus annuus L.
2 Caryophyllaceae Silene acaulis (L.) Jacq.
3 Phrymaceae Mimulus alpinus <NA> (A. Gray) Piper
4 Polemoniaceae Phlox drummondii Hook.
5 Asparagaceae Triteleia hyacinthina (Lindl.) Greene
Typo WFormat
1 FALSE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 FALSE FALSE
5 FALSE FALSE
(df2_corr <- tpl_search(as.character(df2$species), corr = TRUE))
Genus Species Infraspecific Plant.Name.Index Taxonomic.status
1 Helianthus annuus texanus TRUE Synonym
2 Cucubalus acaulis TRUE Synonym
3 Mimulus langsdorffii alpinus TRUE Synonym
4 Armeria drummondii TRUE Synonym
5 Triteleia hyocinthina TRUE Accepted
Family New.Genus New.Species New.Infraspecific Authority
1 Compositae Helianthus annuus L.
2 Caryophyllaceae Silene acaulis (L.) Jacq.
3 Phrymaceae Mimulus alpinus <NA> (A. Gray) Piper
4 Polemoniaceae Phlox drummondii Hook.
5 Asparagaceae Triteleia hyacinthina (Lindl.) Greene
Typo WFormat
1 FALSE FALSE
2 FALSE FALSE
3 FALSE FALSE
4 FALSE FALSE
5 TRUE FALSE
It looks like 4 of the 5 names in df2 are synonyms, while 1 name was misspelled. Let’s get the new names for df2 into df2.
df2$newsp <- apply(df2_corr[c("New.Genus", "New.Species")], 1, paste, collapse = " ")
df2 <- df2[, -1]
Merge tables and make a plot
(alldat <- merge(df1, df2, by.x="species", by.y="newsp"))
species var1 var2 var3 var4
1 Helianthus annuus 0.7285 -0.4922 1.2547 1.5582
2 Mimulus alpinus 0.1223 -0.2955 -0.7310 0.3817
3 Phlox drummondii -0.1071 -0.6371 -0.8728 -1.8467
4 Silene acaulis -0.6368 0.1510 0.1917 -0.4014
5 Triteleia hyacinthina 3.0563 2.3601 2.1583 0.5587
library("ggplot2")
ggplot(alldat, aes(var1, var4)) +
theme_grey(base_size=18) +
geom_point() +
labs(x="Cool variable", y="Other super sweet variable")
And that’s it. Of course in practice your tables will be larger to much larger, but the same concept applies - it will just take longer.