spocc tutorial
for v0.7.0
Installation
Stable version from CRAN
install.packages("spocc")
Or dev version from GitHub
if (!require("devtools")) install.packages("devtools")
devtools::install_github("ropensci/spocc")
library('spocc')
Usage
Data retrieval
The most significant function in spocc is the occ (short for occurrence) function. The occ function takes a query, often a species name, and searches across all data sources specified in the from argument. For example, one can search for occurrences of Sharp-shinned Hawks (Accipiter striatus) in the GBIF database with the following R call.
library('spocc')
(df <- occ(query = 'Accipiter striatus', from = 'gbif'))
#> Searched: gbif
#> Occurrences - Found: 618,207, Returned: 500
#> Search type: Scientific
#> gbif: Accipiter striatus (500)
The data returned are part of an S3 class called occdat. This class has a slot for each of the data sources that spocc supports. One can easily switch the source by changing the from parameter in the function call above.
Within each data source is the set of species queried. In the above example, we only asked for occurrence data for one species, but we could have asked for any number. Let’s say we asked for data for two species, Accipiter striatus and Pinus contorta. Then the structure of the response would be:
response -- |
            | -- gbif ------- |
                              | -- Accipiter_striatus
                              | -- Pinus_contorta
            | -- ecoengine -- |
                              | -- Accipiter_striatus
                              | -- Pinus_contorta
            ... and so on for each data source
If you only request data from gbif, like from = 'gbif', then the slots for the other sources are still present in the response object, but they have no data.
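The nested layout described above can be imitated offline with plain lists. The following is only a sketch: mock_response is a hypothetical stand-in, not a real occdat object (which is returned by occ), and the coordinates are copied from outputs shown in this tutorial purely for illustration.

```r
# Mock of the nested layout sketched above, built from plain lists.
# NOT a real occdat object; a real one is returned by occ().
mock_response <- list(
  gbif = list(
    data = list(
      Accipiter_striatus = data.frame(
        name = "Accipiter striatus",
        longitude = -105.14459,
        latitude = 39.91499
      ),
      Pinus_contorta = data.frame(
        name = "Pinus contorta",
        longitude = -135.34804,
        latitude = 57.05074
      )
    )
  ),
  # A source that was not queried: the slot exists but holds no data.
  ecoengine = list(data = list())
)

names(mock_response)                    # one element per data source
names(mock_response$gbif$data)          # one element per species queried
mock_response$gbif$data$Pinus_contorta  # records for a single species
length(mock_response$ecoengine$data)    # number of species tables in an unqueried source
```

Indexing by source, then data, then species name (with the space replaced by an underscore) follows the same pattern used on real results throughout this tutorial.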
You can quickly get just the GBIF data by indexing to it:
df$gbif
#> Species [Accipiter striatus (500)]
#> First 10 rows of [Accipiter_striatus]
#>
#> # A tibble: 500 x 106
#> name longitude latitude prov issues key
#> <chr> <dbl> <dbl> <chr> <chr> <int>
#> 1 Accipiter striatus -105.14459 39.91499 gbif cdround,gass84 1572375879
#> 2 Accipiter striatus -77.41813 39.49461 gbif cdround,gass84 1453332084
#> 3 Accipiter striatus -122.05344 36.95316 gbif cdround,gass84 1453346783
#> 4 Accipiter striatus -96.84084 33.13948 gbif cdround,gass84 1562918951
#> 5 Accipiter striatus -83.06435 42.27640 gbif cdround,gass84 1453374806
#> 6 Accipiter striatus -93.60027 42.13699 gbif cdround,gass84 1453347396
#> 7 Accipiter striatus -96.94657 32.83480 gbif cdround,gass84 1453372191
#> 8 Accipiter striatus -99.14944 19.34328 gbif cdround,gass84 1453336277
#> 9 Accipiter striatus NA NA gbif 1453348280
#> 10 Accipiter striatus -79.33628 43.60350 gbif cdround,gass84 1453387327
#> # ... with 490 more rows, and 100 more variables: datasetKey <chr>,
#> # publishingOrgKey <chr>, publishingCountry <chr>, protocol <chr>,
#> # lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
#> # basisOfRecord <chr>, taxonKey <int>, kingdomKey <int>,
#> # phylumKey <int>, classKey <int>, orderKey <int>, familyKey <int>,
#> # genusKey <int>, scientificName <chr>, kingdom <chr>, phylum <chr>,
#> # order <chr>, family <chr>, genus <chr>, genericName <chr>,
#> # specificEpithet <chr>, taxonRank <chr>, dateIdentified <chr>,
#> # coordinateUncertaintyInMeters <dbl>, year <int>, month <int>,
#> # day <int>, eventDate <date>, modified <chr>, lastInterpreted <chr>,
#> # references <chr>, license <chr>, geodeticDatum <chr>, class <chr>,
#> # countryCode <chr>, country <chr>, rightsHolder <chr>,
#> # identifier <chr>, informationWithheld <chr>, verbatimEventDate <chr>,
#> # datasetName <chr>, verbatimLocality <chr>, gbifID <chr>,
#> # collectionCode <chr>, occurrenceID <chr>, taxonID <chr>,
#> # catalogNumber <chr>, recordedBy <chr>,
#> # `http://unknown.org/occurrenceDetails` <chr>, institutionCode <chr>,
#> # rights <chr>, eventTime <chr>, occurrenceRemarks <chr>,
#> # identificationID <chr>, identificationRemarks <chr>, locality <chr>,
#> # individualCount <int>, elevation <dbl>, elevationAccuracy <dbl>,
#> # continent <chr>, stateProvince <chr>, institutionID <chr>,
#> # county <chr>, identificationVerificationStatus <chr>, language <chr>,
#> # type <chr>, locationAccordingTo <chr>, preparations <chr>,
#> # identifiedBy <chr>, georeferencedDate <chr>, nomenclaturalCode <chr>,
#> # higherGeography <chr>, georeferencedBy <chr>,
#> # georeferenceProtocol <chr>, georeferenceVerificationStatus <chr>,
#> # endDayOfYear <chr>, verbatimCoordinateSystem <chr>,
#> # otherCatalogNumbers <chr>, organismID <chr>,
#> # previousIdentifications <chr>, identificationQualifier <chr>,
#> # samplingProtocol <chr>, accessRights <chr>,
#> # higherClassification <chr>, georeferenceSources <chr>,
#> # infraspecificEpithet <chr>, recordNumber <chr>,
#> # ownerInstitutionCode <chr>, startDayOfYear <chr>, datasetID <chr>,
#> # verbatimElevation <chr>, collectionID <chr>, sex <chr>,
#> # dynamicProperties <chr>, lifeStage <chr>, vernacularName <chr>,
#> # reproductiveCondition <chr>, locationRemarks <chr>
When you get data from multiple providers, the fields returned are slightly different, e.g.:
df <- occ(query = 'Accipiter striatus', from = c('gbif', 'ecoengine'), limit = 25)
head(df$gbif$data$Accipiter_striatus)[1:6,1:10]
#> # A tibble: 6 x 10
#> name longitude latitude issues prov key
#> <chr> <dbl> <dbl> <chr> <chr> <int>
#> 1 Accipiter striatus -105.14459 39.91499 cdround,gass84 gbif 1572375879
#> 2 Accipiter striatus -77.41813 39.49461 cdround,gass84 gbif 1453332084
#> 3 Accipiter striatus -122.05344 36.95316 cdround,gass84 gbif 1453346783
#> 4 Accipiter striatus -96.84084 33.13948 cdround,gass84 gbif 1562918951
#> 5 Accipiter striatus -83.06435 42.27640 cdround,gass84 gbif 1453374806
#> 6 Accipiter striatus -93.60027 42.13699 cdround,gass84 gbif 1453347396
#> # ... with 4 more variables: datasetKey <chr>, publishingOrgKey <chr>,
#> # publishingCountry <chr>, protocol <chr>
head(df$ecoengine$data$Accipiter_striatus)
#> # A tibble: 6 x 17
#> longitude latitude
#> <dbl> <dbl>
#> 1 -118.8567 39.4120
#> 2 -115.7112 40.1013
#> 3 -115.7112 40.1013
#> 4 -114.4585 32.8861
#> 5 -114.8859 39.9005
#> 6 -115.0657 36.1750
#> # ... with 15 more variables: url <chr>, key <chr>,
#> # observation_type <chr>, name <chr>, country <chr>,
#> # state_province <chr>, begin_date <date>, end_date <chr>, source <chr>,
#> # remote_resource <chr>, locality <chr>,
#> # coordinate_uncertainty_in_meters <int>, recorded_by <chr>,
#> # last_modified <chr>, prov <chr>
We provide a function occ2df that pulls out a few key columns needed for making maps:
head(occ2df(df))
#> # A tibble: 6 x 6
#> name longitude latitude prov date key
#> <chr> <dbl> <dbl> <chr> <date> <chr>
#> 1 Accipiter striatus -105.14459 39.91499 gbif 2017-01-04 1572375879
#> 2 Accipiter striatus -77.41813 39.49461 gbif 2017-01-05 1453332084
#> 3 Accipiter striatus -122.05344 36.95316 gbif 2017-01-11 1453346783
#> 4 Accipiter striatus -96.84084 33.13948 gbif 2017-01-26 1562918951
#> 5 Accipiter striatus -83.06435 42.27640 gbif 2017-01-23 1453374806
#> 6 Accipiter striatus -93.60027 42.13699 gbif 2017-01-11 1453347396
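To make the idea of combining sources concrete, here is a rough base-R sketch of stacking a few shared columns from two tables whose other columns differ. It is not how occ2df is implemented: combine_key_columns is a hypothetical helper, the date column is left out for brevity, and the rows are small hand-built tables using values that appear in outputs in this tutorial.

```r
# Hand-built per-source tables with different column sets (illustrative only).
gbif_rows <- data.frame(
  name = "Pinus contorta",
  longitude = c(-135.34804, 17.56537),
  latitude = c(57.05074, 59.84456),
  issues = c("cdround,gass84", "cdround,gass84,rdatm"),
  prov = "gbif",
  key = c(1453348580L, 1433800465L),   # integer keys, as in the gbif output
  stringsAsFactors = FALSE
)
ecoengine_rows <- data.frame(
  longitude = c(-119.4928, -123.7772),
  latitude = c(38.1013, 39.4155),
  key = c("vtm:plot:71E15:7", "POM213040"),  # character keys
  name = "Pinus contorta",
  prov = "ecoengine",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of spocc): keep only the shared columns,
# coerce keys to character so both sources agree on type, then stack rows.
combine_key_columns <- function(tables,
                                cols = c("name", "longitude", "latitude", "prov", "key")) {
  trimmed <- lapply(tables, function(tbl) {
    out <- tbl[, cols, drop = FALSE]
    out$key <- as.character(out$key)
    out
  })
  do.call(rbind, trimmed)
}

combined <- combine_key_columns(list(gbif_rows, ecoengine_rows))
combined
```

Coercing key to character mirrors the output above, where the combined key column is of type chr even though GBIF keys are integers.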
Fix names
One problem you often run into is that there can be various names for the same taxon in any one source. For example:
df <- occ(query = 'Pinus contorta', from = c('gbif', 'ecoengine'), limit = 50)
head(df$gbif$data$Pinus_contorta)[1:6, 1:5]
#> # A tibble: 6 x 5
#> name longitude latitude issues prov
#> <chr> <dbl> <dbl> <chr> <chr>
#> 1 Pinus contorta -135.34804 57.05074 cdround,gass84 gbif
#> 2 Pinus contorta 17.56537 59.84456 cdround,gass84,rdatm gbif
#> 3 Pinus contorta 17.56467 59.84493 cdround,gass84,rdatm gbif
#> 4 Pinus contorta 12.39827 59.59845 gass84,rdatm gbif
#> 5 Pinus contorta -123.99101 46.22554 cdround,gass84 gbif
#> 6 Pinus contorta 17.56458 59.84525 cdround,gass84,rdatm gbif
head(df$ecoengine$data$Pinus_contorta)[1:6, 1:5]
#> # A tibble: 6 x 5
#> longitude latitude
#> <dbl> <dbl>
#> 1 -119.7976 38.5184
#> 2 -120.2007 38.8368
#> 3 -119.4928 38.1013
#> 4 -119.4500 38.0337
#> 5 -118.7309 36.6564
#> 6 -122.4149 37.5497
#> # ... with 3 more variables: url <chr>, key <chr>, observation_type <chr>
This is fine, but when you make a map in which points are colored by taxon, a single taxon can end up with many colors, where one color per taxon is more appropriate. The fixnames function in spocc addresses this. It offers a few options: take the shortest name (usually just the plain binomial, like Homo sapiens), use the original name queried, or use a vector of names supplied by the user.
df <- fixnames(df, how = 'shortest')
head(df$gbif$data$Pinus_contorta[,1:2])
#> # A tibble: 6 x 2
#> name longitude
#> <chr> <dbl>
#> 1 Pinus contorta -135.34804
#> 2 Pinus contorta 17.56537
#> 3 Pinus contorta 17.56467
#> 4 Pinus contorta 12.39827
#> 5 Pinus contorta -123.99101
#> 6 Pinus contorta 17.56458
head(df$ecoengine$data$Pinus_contorta[,1:2])
#> # A tibble: 6 x 2
#> longitude latitude
#> <dbl> <dbl>
#> 1 -119.7976 38.5184
#> 2 -120.2007 38.8368
#> 3 -119.4928 38.1013
#> 4 -119.4500 38.0337
#> 5 -118.7309 36.6564
#> 6 -122.4149 37.5497
df_comb <- occ2df(df)
head(df_comb); tail(df_comb)
#> # A tibble: 6 x 6
#> name longitude latitude prov date key
#> <chr> <dbl> <dbl> <chr> <date> <chr>
#> 1 Pinus contorta -135.34804 57.05074 gbif 2017-01-12 1453348580
#> 2 Pinus contorta 17.56537 59.84456 gbif 2017-01-04 1433800465
#> 3 Pinus contorta 17.56467 59.84493 gbif 2017-01-25 1434022908
#> 4 Pinus contorta 12.39827 59.59845 gbif 2017-01-03 1433805430
#> 5 Pinus contorta -123.99101 46.22554 gbif 2017-01-16 1453371064
#> 6 Pinus contorta 17.56458 59.84525 gbif 2017-01-07 1433834252
#> # A tibble: 6 x 6
#> name longitude latitude prov date key
#> <chr> <dbl> <dbl> <chr> <date> <chr>
#> 1 Pinus contorta -119.4928 38.1013 ecoengine NA vtm:plot:71E15:7
#> 2 Pinus contorta -119.4246 37.8488 ecoengine NA vtm:plot:76B115:3
#> 3 Pinus contorta -119.3799 37.8051 ecoengine NA vtm:plot:76C24:3
#> 4 Pinus contorta -119.3634 37.7275 ecoengine NA vtm:plot:76D27:4
#> 5 Pinus contorta -123.7772 39.4155 ecoengine NA POM213040
#> 6 Pinus contorta -121.4035 40.4450 ecoengine NA CAS:DS:40775
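The "shortest name" idea used with fixnames above can be sketched in a few lines of base R. This is only an illustration of the idea, not the code inside fixnames: use_shortest_name is a hypothetical helper, and the name variants are made-up input rather than output from a real query.

```r
# Illustrative name variants for one taxon (hypothetical input).
raw_names <- c(
  "Pinus contorta Douglas ex Loudon",
  "Pinus contorta",
  "Pinus contorta var. murrayana"
)

# Hypothetical helper: find the shortest string and use it for every record,
# so that all records of the taxon share a single label.
use_shortest_name <- function(x) {
  shortest <- x[which.min(nchar(x))]
  rep(shortest, length(x))
}

use_shortest_name(raw_names)
```

With one label per taxon, a map legend shows one color per taxon instead of one per name variant.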
Clean data
All data cleaning functionality is now in a separate package, scrubr, which is available on CRAN.
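As a small illustration of the kind of cleaning step meant here, the sketch below drops records that lack coordinates, like row 9 of the GBIF output earlier in this tutorial. It uses base R only and does not show scrubr's own functions; the records table is hand-built from values in that output.

```r
# Hand-built records, including one with missing coordinates (illustrative only).
records <- data.frame(
  name = "Accipiter striatus",
  longitude = c(-99.14944, NA, -79.33628),
  latitude = c(19.34328, NA, 43.60350),
  key = c("1453336277", "1453348280", "1453387327"),
  stringsAsFactors = FALSE
)

# Keep only rows where both longitude and latitude are present.
has_coords <- !is.na(records$longitude) & !is.na(records$latitude)
cleaned <- records[has_coords, ]

nrow(records)  # rows before cleaning
nrow(cleaned)  # rows after dropping incomplete coordinates
```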
Make maps
All mapping functionality is now in a separate package, mapr (formerly known as spoccutils), to make spocc easier to maintain. mapr is available on CRAN.
Citing
To cite spocc in publications use:
Scott Chamberlain (2017). spocc: Interface to Species Occurrence Data Sources. R package version 0.7.0. https://CRAN.R-project.org/package=spocc
License and bugs
- License: MIT
- Report bugs at our GitHub repo for spocc