New rOpenSci Packages for Text Processing in R
Textual data and natural language processing are still a niche domain within the R ecosystem. The NLP task view gives an overview of existing work; however, a lot of basic infrastructure is still missing. At the rOpenSci text workshop in April we discussed many ideas for improving text processing in R, which revealed several core areas that need improvement: Reading: better tools for extracting text and metadata from documents in various formats (doc, rtf, pdf, etc).
Unconf projects 5: mwparser, Gargle, arresteddev
And finally, we end our series of unconf project summaries (day 1, day 2, day 3, day 4). mwparser Summary: Wikimarkup is the language used on Wikipedia and similar projects, and as such contains a lot of valuable data both for scientists studying collaborative systems and people studying things documented on or in Wikipedia. mwparser parses wikimarkup, allowing a user to filter down to specific types of tags such as links or templates, and then extract components of those tags.
Unconf projects 4: cityquant, notary, packagemetrics, pegax
Continuing our series of blog posts (day 1, day 2, day 3) this week about unconf 17. cityquant Summary: The goal of the cityquant project was to build a digital dashboard for sustainable cities. The team also had a “spin-off” project called selfquant, which pulls data from a quantified-self Google Sheets template to keep track of weekly performance in various categories. Team: Reka Solymosi, Ben Best, Chelsea Ursaner, Tim Phan, Jasmine Dumas
Unconf projects 3: available, miner, rcheatsheet, ponyexpress
Continuing our series of blog posts (day 1, day 2) this week about unconf 17. available Summary: Ever have trouble naming your software package? Find a great name and realize it’s already taken on CRAN, or further along in development on GitHub? The available package makes it easy to check for valid, available names, and also checks various sources for any unintended meanings. The package can also suggest names based on the description and title of your package.
Unconf projects 2: checkers, gramr, data-packages, exploRingJSON
Following Stefanie’s recap of unconf 17, we are spending this entire week summarizing projects developed at the event. We plan to highlight 4-5 projects each day, with detailed posts from a handful of teams to follow. checkers Summary: checkers is a framework for reviewing analysis projects. It provides automated checks for best practices, using extensions on the goodpractice package. In addition, checkers includes a descriptive guide for best practices.