solr - an R interface to Solr
A number of the APIs we interact with (e.g., PLOS full text API, and USGS’s BISON API in rplos and rbison, respectively) expose Solr endpoints. Solr is an Apache hosted project - it is a powerful search server. Given that at least two, and possibly more in the future, of the data providers we interact with provide Solr endpoints, it made sense to create an R package to make robust functions to interact with Solr that work across any Solr endpoint. This is then useful to us, and hopefully others.
The following are a few examples covering some of things you can do in Solr that fall in to six categories:
- Search: via
solr_search
- Grouping: via
solr_group
- Faceting: via
solr_facet
- Highlighting: via
solr_highlight
- Stats: via
solr_stats
- More like this: via
solr_mlt
The solr
package generally has two steps for any query: a) send the request given your inputs, and b) parse the output into a useful R data structure. Part a) is quite easy. However, part b) is harder. We are working hard on making parsers that are as general as possible for each of the data formats that are returned by group, facet, highlight, etc., but of course we will still definitely fail in many cases. Please do submit bug reports to our issue tracker so we can make the parsers work better.
Installation
solr
is on CRAN, so you can install the more stable version there, and some dependencies.
install.packages("solr")
You can install the development version from Github as follows. Below we’ll use the Github version - most of below is available in the CRAN version too, except solr_group
.
install.packages("devtools")
devtools::install_github("ropensci/solr")
Load the library
library("solr")
Define url endpoint and key
As solr
is a general interface to Solr endpoints, you need to define the url. Here, we’ll work with the Public Library of Science full text search API (docs here). Some Solr endpoints will require authentication - I should note that we don’t yet handle authentication schemes other than passing in a key in the url, but that’s on the to do list.
url <- 'https://api.plos.org/search'
Search
solr_search(q='*:*', rows=2, fl='id', base=url)
#> id
#> 1 10.1371/annotation/c313df3a-52bd-4cbe-af14-6676480d1a43
#> 2 10.1371/annotation/c313df3a-52bd-4cbe-af14-6676480d1a43/title
Search for words “sports” and “alcohol” within seven words of each other
solr_search(q='everything:"sports alcohol"~7', fl='title', rows=3, base=url)
#> title
#> 1 Alcohol Ingestion Impairs Maximal Post-Exercise Rates of Myofibrillar Protein Synthesis following a Single Bout of Concurrent Training
#> 2 “Like Throwing a Bowling Ball at a Battle Ship” Audience Responses to Australian News Stories about Alcohol Pricing and Promotion Policies: A Qualitative Focus Group Study
#> 3 Development and Validation of a Risk Score Predicting Substantial Weight Gain over 5 Years in Middle-Aged European Men and Women
Groups
Most recent publication by journal
solr_group(q='*:*', group.field='journal', rows=5, group.limit=1, group.sort='publication_date desc', fl='publication_date, score', base=url)
#> groupValue numFound start publication_date score
#> 1 plos one 931323 0 2014-11-24T00:00:00Z 1
#> 2 plos genetics 40603 0 2014-11-20T00:00:00Z 1
#> 3 plos medicine 18514 0 2014-11-18T00:00:00Z 1
#> 4 plos pathogens 35497 0 2014-11-24T00:00:00Z 1
#> 5 plos biology 26133 0 2014-11-18T00:00:00Z 1
First publication by journal
solr_group(q='*:*', group.field='journal', group.limit=1, group.sort='publication_date asc', fl='publication_date, score', fq="publication_date:[1900-01-01T00:00:00Z TO *]", base=url)
#> groupValue numFound start publication_date
#> 1 plos one 931323 0 2006-12-01T00:00:00Z
#> 2 plos genetics 40603 0 2005-06-17T00:00:00Z
#> 3 plos medicine 18514 0 2004-09-07T00:00:00Z
#> 4 plos pathogens 35497 0 2005-07-22T00:00:00Z
#> 5 plos biology 26133 0 2003-08-18T00:00:00Z
#> 6 none 57566 0 2005-08-23T00:00:00Z
#> 7 plos computational biology 29838 0 2005-06-24T00:00:00Z
#> 8 plos neglected tropical diseases 25119 0 2007-08-30T00:00:00Z
#> 9 plos clinical trials 521 0 2006-04-21T00:00:00Z
#> 10 plos medicin 9 0 2012-04-17T00:00:00Z
#> score
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 1
Facet
solr_facet(q='*:*', facet.field='journal', facet.query='cell,bird', base=url)
#> $facet_queries
#> term value
#> 1 cell,bird 17
#>
#> $facet_fields
#> $facet_fields$journal
#> X1 X2
#> 1 plos one 931323
#> 2 plos genetics 40603
#> 3 plos pathogens 35497
#> 4 plos computational biology 29838
#> 5 plos biology 26133
#> 6 plos neglected tropical diseases 25119
#> 7 plos medicine 18514
#> 8 plos clinical trials 521
#> 9 plos medicin 9
#>
#>
#> $facet_dates
#> NULL
#>
#> $facet_ranges
#> NULL
Range faceting with > 1 field
head( solr_facet(q='*:*', base=url, facet.range='alm_twitterCount', facet.range.start=5, facet.range.end=1000, facet.range.gap=10)$facet_ranges$alm_twitterCount )
#> X1 X2
#> 1 5 60938
#> 2 15 13668
#> 3 25 6379
#> 4 35 2952
#> 5 45 2297
#> 6 55 1497
Highlight
solr_highlight(q='alcohol', hl.fl = 'abstract', rows=2, base = url)
#> $`10.1371/journal.pmed.0040151`
#> $`10.1371/journal.pmed.0040151`$abstract
#> [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
#>
#>
#> $`10.1371/journal.pone.0027752`
#> $`10.1371/journal.pone.0027752`$abstract
#> [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"
Stats
solr_stats(q='ecology', stats.field='alm_twitterCount', stats.facet=c('journal','volume'), base=url)
#> min max count missing sum sumOfSquares mean stddev
#> 1 0 1624 24326 0 113589 19746631 4.669448 28.10656
More like this
solr_mlt
is a function to return similar documents to the ones searched for.
out <- solr_mlt(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5, base=url)
out$docs
#> id counter_total_all
#> 1 10.1371/journal.pbio.1001805 10102
#> 2 10.1371/journal.pbio.0020440 16630
#> 3 10.1371/journal.pone.0087217 2922
#> 4 10.1371/journal.pone.0040117 2514
#> 5 10.1371/journal.pone.0072525 1112
Raw data?
You can optionally get back raw json
or xml
from all functions by setting parameter raw=TRUE
. You can then parse after the fact with solr_parse
, or just process as you wish. For example:
(out <- solr_highlight(q='alcohol', hl.fl = 'abstract', rows=2, base = url, raw=TRUE))
#> [1] "{\"response\":{\"numFound\":15301,\"start\":0,\"docs\":[{},{}]},\"highlighting\":{\"10.1371/journal.pmed.0040151\":{\"abstract\":[\"Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting\"]},\"10.1371/journal.pone.0027752\":{\"abstract\":[\"Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"
Then parse
solr_parse(out, 'df')
#> names
#> 1 10.1371/journal.pmed.0040151
#> 2 10.1371/journal.pone.0027752
#> abstract
#> 1 Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting
#> 2 Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking
Verbosity
As you have noticed, we include in each function the acutal call to the Solr endpoint made so you know exactly what was submitted to the remote or local Solr instance. You can suppress the message with verbose=FALSE
. This message isn’t in the CRAN version.
Advanced: Function Queries
Function Queries allow you to query on actual numeric fields in the SOLR database, and do addition, multiplication, etc on one or many fields to stort results. For example, here, we search on the product of counter_total_all and alm_twitterCount, using a new temporary field “val”
solr_search(q='_val_:"product(counter_total_all,alm_twitterCount)"', rows=5, fl='id,title', fq='doc_type:full', base=url)
#> id
#> 1 10.1371/journal.pmed.0020124
#> 2 10.1371/journal.pone.0105948
#> 3 10.1371/journal.pone.0046362
#> 4 10.1371/journal.pone.0069841
#> 5 10.1371/journal.pbio.1001535
#> title
#> 1 Why Most Published Research Findings Are False
#> 2 Sliding Rocks on Racetrack Playa, Death Valley National Park: First Observation of Rocks in Motion
#> 3 The Power of Kawaii: Viewing Cute Images Promotes a Careful Behavior and Narrows Attentional Focus
#> 4 Facebook Use Predicts Declines in Subjective Well-Being in Young Adults
#> 5 An Introduction to Social Media for Scientists
Here, we search for the papers with the most citations
solr_search(q='_val_:"max(counter_total_all)"', rows=5, fl='id,counter_total_all', fq='doc_type:full', base=url)
#> id counter_total_all
#> 1 10.1371/journal.pmed.0020124 1002083
#> 2 10.1371/journal.pmed.0050045 324559
#> 3 10.1371/journal.pone.0007595 315117
#> 4 10.1371/journal.pone.0033288 305965
#> 5 10.1371/journal.pone.0069841 277609
Or with the most tweets
solr_search(q='_val_:"max(alm_twitterCount)"', rows=5, fl='id,alm_twitterCount', fq='doc_type:full', base=url)
#> id alm_twitterCount
#> 1 10.1371/journal.pone.0061981 2298
#> 2 10.1371/journal.pmed.0020124 1700
#> 3 10.1371/journal.pbio.1001535 1624
#> 4 10.1371/journal.pone.0046362 1368
#> 5 10.1371/journal.pmed.1001747 1361