solrium tutorial
for v0.3.0
solrium
is a general purpose R interface to Apache Solr
Installation
Stable version from CRAN
install.packages("solrium")
Or the development version from GitHub
install.packages("devtools")
devtools::install_github("ropensci/solrium")
Load
library("solrium")
Usage
Solr search
Setup connection
You can setup for a remote Solr instance or on your local machine.
solr_connect('http://api.plos.org/search')
#> <solr_connection>
#> url: http://api.plos.org/search
#> errors: simple
#> verbose: TRUE
#> proxy:
Rundown
solr_search()
only returns the docs
element of a Solr response body. If docs
is
all you need, then this function will do the job. If you need facet data only, or mlt
data only, see the appropriate functions for each of those below. Another function,
solr_all()
has a similar interface in terms of parameter as solr_search()
, but
returns all parts of the response body, including, facets, mlt, groups, stats, etc.
as long as you request those.
Search docs
solr_search()
returns only docs. A basic search:
solr_search(q = '*:*', rows = 2, fl = 'id')
#> Source: local data frame [2 x 1]
#>
#> id
#> (chr)
#> 1 10.1371/journal.pone.0142241
#> 2 10.1371/journal.pone.0142241/title
Search in specific fields with :
Search for word ecology in title and cell in the body
solr_search(q = 'title:"ecology" AND body:"cell"', fl = 'title', rows = 5)
#> Source: local data frame [5 x 1]
#>
#> title
#> (chr)
#> 1 The Ecology of Collective Behavior
#> 2 Ecology's Big, Hot Idea
#> 3 Spatial Ecology of Bacteria at the Microscale in Soil
#> 4 Biofilm Formation As a Response to Ecological Competition
#> 5 Ecology of Root Colonizing Massilia (Oxalobacteraceae)
Wildcards
Search for word that starts with “cell” in the title field
solr_search(q = 'title:"cell*"', fl = 'title', rows = 5)
#> Source: local data frame [5 x 1]
#>
#> title
#> (chr)
#> 1 Tumor Cell Recognition Efficiency by T Cells
#> 2 Cancer Stem Cell-Like Side Population Cells in Clear Cell Renal Cell Carcin
#> 3 Dcas Supports Cell Polarization and Cell-Cell Adhesion Complexes in Develop
#> 4 MS4a4B, a CD20 Homologue in T Cells, Inhibits T Cell Propagation by Modulat
#> 5 Cell-Cell Contact Preserves Cell Viability via Plakoglobin
Proximity search
Search for words “sports” and “alcohol” within four words of each other
solr_search(q = 'everything:"stem cell"~7', fl = 'title', rows = 3)
#> Source: local data frame [3 x 1]
#>
#> title
#> (chr)
#> 1 Correction: Reduced Intensity Conditioning, Combined Transplantation of Hap
#> 2 A Recipe for Self-Renewing Brain
#> 3 Gene Expression Profile Created for Mouse Stem Cells and Developing Embryo
Range searches
Search for articles with Twitter count between 5 and 10
solr_search(q = '*:*', fl = c('alm_twitterCount', 'id'), fq = 'alm_twitterCount:[5 TO 50]',
rows = 10)
#> Source: local data frame [10 x 2]
#>
#> id alm_twitterCount
#> (chr) (int)
#> 1 10.1371/journal.pone.0085872 9
#> 2 10.1371/journal.pone.0085872/body 9
#> 3 10.1371/journal.pone.0085872/introduction 9
#> 4 10.1371/journal.pone.0085872/results_and_discussion 9
#> 5 10.1371/journal.pone.0085872/materials_and_methods 9
#> 6 10.1371/journal.pone.0108178 7
#> 7 10.1371/journal.pone.0108178/title 7
#> 8 10.1371/journal.pone.0108178/abstract 7
#> 9 10.1371/journal.pone.0108178/references 7
#> 10 10.1371/journal.pone.0108178/body 7
Boosts
Assign higher boost to title matches than to body matches (compare the two calls)
solr_search(q = 'title:"cell" abstract:"science"', fl = 'title', rows = 3)
#> Source: local data frame [3 x 1]
#>
#> title
#> (chr)
#> 1 I Want More and Better Cells! – An Outreach Project about Stem Cells and It
#> 2 Centre of the Cell: Science Comes to Life
#> 3 Globalization of Stem Cell Science: An Examination of Current and Past Coll
solr_search(q = 'title:"cell"^1.5 AND abstract:"science"', fl = 'title', rows = 3)
#> Source: local data frame [3 x 1]
#>
#> title
#> (chr)
#> 1 Centre of the Cell: Science Comes to Life
#> 2 I Want More and Better Cells! – An Outreach Project about Stem Cells and It
#> 3 Derivation of Hair-Inducing Cell from Human Pluripotent Stem Cells
Search all
solr_all()
differs from solr_search()
in that it allows specifying facets, mlt, groups,
stats, etc, and returns all of those. It defaults to parsetype = "list"
and wt="json"
,
whereas solr_search()
defaults to parsetype = "df"
and wt="csv"
. solr_all()
returns
by default a list, whereas solr_search()
by default returns a data.frame.
A basic search, just docs output
solr_all(q = '*:*', rows = 2, fl = 'id')
#> $response
#> $response$numFound
#> [1] 1505603
#>
#> $response$start
#> [1] 0
#>
#> $response$docs
#> $response$docs[[1]]
#> $response$docs[[1]]$id
#> [1] "10.1371/journal.pone.0142241"
#>
#>
#> $response$docs[[2]]
#> $response$docs[[2]]$id
#> [1] "10.1371/journal.pone.0142241/title"
Get docs, mlt, and stats output
solr_all(q = 'ecology', rows = 2, fl = 'id', mlt = 'true', mlt.count = 2, mlt.fl = 'abstract', stats = 'true', stats.field = 'counter_total_all')
#> $response
#> $response$numFound
#> [1] 31523
#>
#> $response$start
#> [1] 0
#>
#> $response$docs
#> $response$docs[[1]]
#> $response$docs[[1]]$id
#> [1] "10.1371/journal.pone.0059813"
#>
#>
#> $response$docs[[2]]
#> $response$docs[[2]]$id
#> [1] "10.1371/journal.pone.0001248"
#>
#>
#>
#>
#> $moreLikeThis
#> $moreLikeThis$`10.1371/journal.pone.0059813`
#> $moreLikeThis$`10.1371/journal.pone.0059813`$numFound
#> [1] 152983
#>
#> $moreLikeThis$`10.1371/journal.pone.0059813`$start
#> [1] 0
#>
#> $moreLikeThis$`10.1371/journal.pone.0059813`$docs
#> $moreLikeThis$`10.1371/journal.pone.0059813`$docs[[1]]
#> $moreLikeThis$`10.1371/journal.pone.0059813`$docs[[1]]$id
#> [1] "10.1371/journal.pone.0111996"
#>
#>
#> $moreLikeThis$`10.1371/journal.pone.0059813`$docs[[2]]
#> $moreLikeThis$`10.1371/journal.pone.0059813`$docs[[2]]$id
#> [1] "10.1371/journal.pone.0143687"
#>
#>
#>
#>
#> $moreLikeThis$`10.1371/journal.pone.0001248`
#> $moreLikeThis$`10.1371/journal.pone.0001248`$numFound
#> [1] 159342
#>
#> $moreLikeThis$`10.1371/journal.pone.0001248`$start
#> [1] 0
#>
#> $moreLikeThis$`10.1371/journal.pone.0001248`$docs
#> $moreLikeThis$`10.1371/journal.pone.0001248`$docs[[1]]
#> $moreLikeThis$`10.1371/journal.pone.0001248`$docs[[1]]$id
#> [1] "10.1371/journal.pone.0001275"
#>
#>
#> $moreLikeThis$`10.1371/journal.pone.0001248`$docs[[2]]
#> $moreLikeThis$`10.1371/journal.pone.0001248`$docs[[2]]$id
#> [1] "10.1371/journal.pone.0024192"
#>
#>
#>
#>
#>
#> $stats
#> $stats$stats_fields
#> $stats$stats_fields$counter_total_all
#> $stats$stats_fields$counter_total_all$min
#> [1] 0
#>
#> $stats$stats_fields$counter_total_all$max
#> [1] 368015
#>
#> $stats$stats_fields$counter_total_all$count
#> [1] 31523
#>
#> $stats$stats_fields$counter_total_all$missing
#> [1] 0
#>
#> $stats$stats_fields$counter_total_all$sum
#> [1] 141705919
#>
#> $stats$stats_fields$counter_total_all$sumOfSquares
#> [1] 3.169268e+12
#>
#> $stats$stats_fields$counter_total_all$mean
#> [1] 4495.318
#>
#> $stats$stats_fields$counter_total_all$stddev
#> [1] 8962.864
#>
#> $stats$stats_fields$counter_total_all$facets
#> named list()
Facet
solr_facet(q = '*:*', facet.field = 'journal', facet.query = c('cell', 'bird'))
#> $facet_queries
#> term value
#> 1 cell 128838
#> 2 bird 13095
#>
#> $facet_fields
#> $facet_fields$journal
#> X1 X2
#> 1 plos one 1236240
#> 2 plos genetics 49331
#> 3 plos pathogens 42853
#> 4 plos computational biology 36391
#> 5 plos neglected tropical diseases 33991
#> 6 plos biology 28753
#> 7 plos medicine 19957
#> 8 plos clinical trials 521
#> 9 plos medicin 9
#>
#>
#> $facet_pivot
#> NULL
#>
#> $facet_dates
#> NULL
#>
#> $facet_ranges
#> NULL
Highlight
solr_highlight(q = 'alcohol', hl.fl = 'abstract', rows = 2)
#> $`10.1371/journal.pmed.0040151`
#> $`10.1371/journal.pmed.0040151`$abstract
#> [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
#>
#>
#> $`10.1371/journal.pone.0027752`
#> $`10.1371/journal.pone.0027752`$abstract
#> [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"
Stats
out <- solr_stats(q = 'ecology', stats.field = c('counter_total_all', 'alm_twitterCount'), stats.facet = c('journal', 'volume'))
out$data
#> min max count missing sum sumOfSquares
#> counter_total_all 0 368015 31523 0 141705919 3.169268e+12
#> alm_twitterCount 0 1756 31523 0 168996 3.277982e+07
#> mean stddev
#> counter_total_all 4495.318307 8962.86436
#> alm_twitterCount 5.361038 31.79876
out$facet
#> $counter_total_all
#> $counter_total_all$volume
#> min max count missing sum sumOfSquares mean stddev
#> 1 0 166346 941 0 2666674 64059423768 2833.872 7752.996
#> 2 740 103659 105 0 1024186 23772918060 9754.152 11512.055
#> 3 1952 69750 69 0 706027 13818140343 10232.275 9847.273
#> 4 1001 14423 9 0 50561 406580461 5617.889 3913.668
#> 5 1871 182659 81 0 1510206 87348144472 18644.519 27200.892
#> 6 1668 117987 482 0 5838545 162659609485 12113.164 13825.190
#> 7 1340 128119 741 0 7717886 188793785566 10415.501 12103.615
#> 8 667 362492 1010 0 9697699 340690657979 9601.682 15664.231
#> 9 103 113281 1539 0 12101852 219212200772 7863.452 8980.903
#> 10 72 244134 2948 0 17709895 327671234859 6007.427 8665.255
#> 11 51 184267 4825 0 24213725 383352120687 5018.389 7367.377
#> 12 16 368015 6360 0 26395482 534025707012 4150.233 8170.212
#> 13 42 288338 6620 0 21032561 615479463617 3177.124 9104.457
#> 14 0 161868 5793 0 11040620 207978083278 1905.855 5681.101
#> volume
#> 1 11
#> 2 12
#> 3 13
#> 4 14
#> 5 1
#> 6 2
#> 7 3
#> 8 4
#> 9 5
#> 10 6
#> 11 7
#> 12 8
#> 13 9
#> 14 10
#>
#> $counter_total_all$journal
#> min max count missing sum sumOfSquares mean stddev
#> 1 667 117987 243 0 4077530 1.462728e+11 16779.959 17936.069
#> 2 1001 265845 884 0 14019239 5.517226e+11 15858.868 19314.215
#> 3 8463 13799 2 0 22262 2.620348e+08 11131.000 3773.122
#> 4 0 368015 25969 0 96176268 1.948962e+12 3703.503 7831.729
#> 5 922 61985 595 0 4794131 6.591271e+10 8057.363 6777.446
#> 6 815 76376 758 0 6332922 9.182935e+10 8354.778 7170.244
#> 7 0 212159 1241 0 5885535 1.011674e+11 4742.575 7686.121
#> 8 740 288338 580 0 4216858 1.415464e+11 7270.445 13838.941
#> journal
#> 1 plos medicine
#> 2 plos biology
#> 3 plos clinical trials
#> 4 plos one
#> 5 plos pathogens
#> 6 plos genetics
#> 7 plos neglected tropical diseases
#> 8 plos computational biology
#>
#>
#> $alm_twitterCount
#> $alm_twitterCount$volume
#> min max count missing sum sumOfSquares mean stddev volume
#> 1 0 1756 941 0 12434 4046744 13.213603 64.267113 11
#> 2 0 1061 105 0 6519 1921409 62.085714 120.761694 12
#> 3 0 283 69 0 3487 509935 50.536232 70.054092 13
#> 4 8 338 9 0 721 142819 80.111111 103.113341 14
#> 5 0 42 81 0 177 5001 2.185185 7.594589 1
#> 6 0 87 482 0 696 22972 1.443983 6.757915 2
#> 7 0 48 741 0 652 11036 0.879892 3.760087 3
#> 8 0 239 1010 0 1042 75008 1.031683 8.559996 4
#> 9 0 126 1539 0 1906 90496 1.238467 7.570023 5
#> 10 0 887 2948 0 4363 1247307 1.479986 20.519631 6
#> 11 0 822 4825 0 19648 2037668 4.072124 20.144888 7
#> 12 0 1503 6360 0 35945 6507435 5.651730 31.486433 8
#> 13 0 1539 6620 0 49857 12852975 7.531269 43.417759 9
#> 14 0 863 5793 0 31549 3309017 5.446056 23.273237 10
#>
#> $alm_twitterCount$journal
#> min max count missing sum sumOfSquares mean stddev
#> 1 0 777 243 0 4262 1028972 17.539095 62.79378
#> 2 0 1756 884 0 16570 6170914 18.744344 81.46674
#> 3 0 3 2 0 3 9 1.500000 2.12132
#> 4 0 1539 25969 0 123579 23533715 4.758712 29.72561
#> 5 0 122 595 0 4267 160645 7.171429 14.79629
#> 6 0 179 758 0 4304 149338 5.678100 12.84495
#> 7 0 887 1241 0 4992 1052826 4.022562 28.85930
#> 8 0 285 580 0 4181 267459 7.208621 20.24546
#> journal
#> 1 plos medicine
#> 2 plos biology
#> 3 plos clinical trials
#> 4 plos one
#> 5 plos pathogens
#> 6 plos genetics
#> 7 plos neglected tropical diseases
#> 8 plos computational biology
More like this
solr_mlt
is a function to return similar documents to the one
out <- solr_mlt(q = 'title:"ecology" AND body:"cell"', mlt.fl = 'title', mlt.mindf = 1, mlt.mintf = 1, fl = 'counter_total_all', rows = 5)
out$docs
#> Source: local data frame [5 x 2]
#>
#> id counter_total_all
#> (chr) (int)
#> 1 10.1371/journal.pbio.1001805 17135
#> 2 10.1371/journal.pbio.0020440 23883
#> 3 10.1371/journal.pone.0087217 5941
#> 4 10.1371/journal.pbio.1002191 13086
#> 5 10.1371/journal.pone.0040117 4327
out$mlt
#> $`10.1371/journal.pbio.1001805`
#> id counter_total_all
#> 1 10.1371/journal.pone.0082578 2196
#> 2 10.1371/journal.pone.0098876 2453
#> 3 10.1371/journal.pone.0102159 1181
#> 4 10.1371/journal.pcbi.1002652 3103
#> 5 10.1371/journal.pone.0087380 1895
#>
#> $`10.1371/journal.pbio.0020440`
#> id counter_total_all
#> 1 10.1371/journal.pone.0102679 3113
#> 2 10.1371/journal.pone.0035964 5573
#> 3 10.1371/journal.pone.0003259 2800
#> 4 10.1371/journal.pone.0101568 2673
#> 5 10.1371/journal.pntd.0003377 3392
#>
#> $`10.1371/journal.pone.0087217`
#> id counter_total_all
#> 1 10.1371/journal.pone.0131665 412
#> 2 10.1371/journal.pcbi.0020092 19615
#> 3 10.1371/journal.pone.0133941 475
#> 4 10.1371/journal.pone.0123774 998
#> 5 10.1371/journal.pone.0140306 322
#>
#> $`10.1371/journal.pbio.1002191`
#> id counter_total_all
#> 1 10.1371/journal.pbio.1002232 1952
#> 2 10.1371/journal.pone.0131700 979
#> 3 10.1371/journal.pone.0070448 1615
#> 4 10.1371/journal.pone.0044766 2271
#> 5 10.1371/journal.pone.0144763 512
#>
#> $`10.1371/journal.pone.0040117`
#> id counter_total_all
#> 1 10.1371/journal.pone.0069352 2765
#> 2 10.1371/journal.pone.0148280 472
#> 3 10.1371/journal.pone.0035502 4034
#> 4 10.1371/journal.pone.0014065 5766
#> 5 10.1371/journal.pone.0113280 1984
Groups
solr_groups()
is a function to return similar documents to the one
solr_group(q = 'ecology', group.field = 'journal', group.limit = 1, fl = c('id', 'alm_twitterCount'))
#> groupValue numFound start
#> 1 plos one 25969 0
#> 2 plos computational biology 580 0
#> 3 plos biology 884 0
#> 4 none 1251 0
#> 5 plos medicine 243 0
#> 6 plos neglected tropical diseases 1241 0
#> 7 plos pathogens 595 0
#> 8 plos genetics 758 0
#> 9 plos clinical trials 2 0
#> id alm_twitterCount
#> 1 10.1371/journal.pone.0059813 56
#> 2 10.1371/journal.pcbi.1003594 21
#> 3 10.1371/journal.pbio.1002358 16
#> 4 10.1371/journal.pone.0046671 2
#> 5 10.1371/journal.pmed.1000303 0
#> 6 10.1371/journal.pntd.0002577 2
#> 7 10.1371/journal.ppat.1003372 2
#> 8 10.1371/journal.pgen.1001197 0
#> 9 10.1371/journal.pctr.0020010 0
Parsing
solr_parse()
is a general purpose parser function with extension methods for parsing outputs from functions in solr
. solr_parse()
is used internally within functions to do parsing after retrieving data from the server. You can optionally get back raw json
, xml
, or csv
with the raw=TRUE
, and then parse afterwards with solr_parse()
.
For example:
(out <- solr_highlight(q = 'alcohol', hl.fl = 'abstract', rows = 2, raw = TRUE))
#> [1] "{\"response\":{\"numFound\":20309,\"start\":0,\"docs\":[{},{}]},\"highlighting\":{\"10.1371/journal.pmed.0040151\":{\"abstract\":[\"Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting\"]},\"10.1371/journal.pone.0027752\":{\"abstract\":[\"Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"
Then parse
solr_parse(out, 'df')
#> names
#> 1 10.1371/journal.pmed.0040151
#> 2 10.1371/journal.pone.0027752
#> abstract
#> 1 Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting
#> 2 Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking
Document management
Initialize connection. By default, you connect to http://localhost:8983
solr_connect()
#> <solr_connection>
#> url: http://localhost:8983
#> errors: simple
#> verbose: TRUE
#> proxy:
Create documents from R objects
For now, only lists and data.frame’s supported.
data.frame
df <- data.frame(id = c(67, 68), price = c(1000, 500000000))
add(df, "books")
list
delete_by_id(1:2, "books")
ss <- list(list(id = 1, price = 100), list(id = 2, price = 500))
add(ss, "books")
Delete documents
By id
Add some documents first
delete_by_id(1:3, "gettingstarted")
docs <- list(list(id = 1, price = 100, name = "brown"),
list(id = 2, price = 500, name = "blue"),
list(id = 3, price = 2000L, name = "pink"))
add(docs, "gettingstarted")
And the documents are now in your Solr database
tail(solr_search(name = "gettingstarted", "*:*", base = "http://localhost:8983/solr/select", rows = 100))
Now delete those documents just added
delete_by_id(ids = c(1, 2, 3), "gettingstarted")
And now they are gone
tail(solr_search("gettingstarted", "*:*", base = "http://localhost:8983/solr/select", rows = 100))
By query
Add some documents first
add(docs, "gettingstarted")
And the documents are now in your Solr database
tail(solr_search("gettingstarted", "*:*", base = "http://localhost:8983/solr/select", rows = 100))
Now delete those documents just added
delete_by_query(query = "(name:blue OR name:pink)", "gettingstarted")
And now they are gone
tail(solr_search("gettingstarted", "*:*", base = "http://localhost:8983/solr/select", rows = 100))
Update documents from files
This approach is best if you have many different things you want to do at once, e.g., delete and add files and set any additional options. The functions are:
update_xml()
update_json()
update_csv()
There are separate functions for each of the data types as they take slightly different parameters - and to make it more clear that those are the three input options for data types.
JSON
file <- system.file("examples", "books.json", package = "solrium")
update_json(file, "books")
Add and delete in the same file
Add a document first, that we can later delete
ss <- list(list(id = 456, name = "cat"))
add(ss, "books")
Now add a new document, and delete the one we just made
file <- system.file("examples", "add_delete.xml", package = "solrium")
cat(readLines(file), sep = "\n")
update_xml(file, "books")
Notes
Note that update_xml()
and update_json()
have exactly the same parameters, but simply use different data input formats. update_csv()
is different in that you can’t provide document or field level boosts or other modifications. In addition update_csv()
can accept not just csv, but tsv and other types of separators.
Cores/collections management
Initialize connection
solr_connect()
#> <solr_connection>
#> url: http://localhost:8983
#> errors: simple
#> verbose: TRUE
#> proxy:
Cores
There are many operations you can do on cores, including:
core_create()
- create a corecore_exists()
- check if a core existscore_mergeindexes()
- merge indexescore_reload()
- reload a corecore_rename()
- rename a corecore_requeststatus()
- check request statuscore_split()
- split a corecore_status()
- check core statuscore_swap()
- core swapcore_unload()
- delete a core
Create a core
core_create()
Delete a core
core_unload()
Collections
There are many operations you can do on collections, including:
collection_addreplica()
collection_addreplicaprop()
collection_addrole()
collection_balanceshardunique()
collection_clusterprop()
collection_clusterstatus()
collection_create()
collection_createalias()
collection_createshard()
collection_delete()
collection_deletealias()
collection_deletereplica()
collection_deletereplicaprop()
collection_deleteshard()
collection_list()
collection_migrate()
collection_overseerstatus()
collection_rebalanceleaders()
collection_reload()
collection_removerole()
collection_requeststatus()
collection_splitshard()
Create a collection
collection_create()
Delete a collection
collection_delete()
Citing
To cite solrium
in publications use:
Scott Chamberlain (2016). solrium: General Purpose R Interface to ‘Solr’. R package version 0.3.0. https://github.com/ropensci/solrium
License and bugs
- License: MIT
- Report bugs at our Github repo for solrium