Hacker News
Rvest: Easy web scraping with R (rstudio.org)
142 points by hadley on Nov 24, 2014 | hide | past | favorite | 28 comments



The RStudio guys have really made R a pleasure to use. Thank you guys!

The core language is still a confusing mess (I'm still never sure when to use a matrix, a data frame, or a list...), but if you use their tools you can ignore it for the most part.

In under 10 lines you can massage data and generate fantastic graphics.
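For instance, here's a rough sketch of the kind of short pipeline being described, assuming the dplyr and ggplot2 packages and the built-in mtcars dataset:

```r
library(dplyr)
library(ggplot2)

# Summarise mpg by cylinder count, then plot the summary
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_bar(stat = "identity") +
  labs(x = "Cylinders", y = "Mean MPG")
```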

A little off topic, but does anyone know what their business model is? Are they going to run out of money and burn out in a year or two?


If you're confused about R's data structures, please read http://adv-r.had.co.nz/Data-structures.html and let me know if it doesn't help.

And no, we're not planning on burning out. We currently sell three things:

* RStudio Server Pro. A commercial version of the open-source server edition that provides stuff that corporate IT wants (e.g. monitoring, more auth options, ...)

* Shiny Server Pro. A more flexible version of the open-source shiny server that offers more configurability (e.g. number of R processes per app), and again other stuff that corporate IT wants.

* The right to use the RStudio desktop IDE, for companies who don't want to use AGPL software


[deleted]


Have a look at http://adv-r.had.co.nz/Subsetting.html#subsetting-operators. There was just too much material to fit in one chapter.


Here's how I think of it, which has been working for me:

matrix - If you have data that would make sense to be in a spreadsheet-type format and all your data are numbers.

dataframe - If you have data that would make sense to be in a spreadsheet-type format and some columns are numbers but other columns are something else (character strings, dates, TRUE/FALSE); but each column is only one thing. That is, you have one column that's all dates, another column that's all numbers, yet another column that's all character strings, etc.

list - if you need to mix data types within a certain entity (vector or column of data).
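A quick base-R illustration of that three-way distinction, using made-up toy data:

```r
# matrix: rectangular, every cell a number
m <- matrix(1:6, nrow = 2)

# data frame: rectangular, but each column can be a different type
df <- data.frame(
  date  = as.Date(c("2014-11-24", "2014-11-25")),
  count = c(10, 12),
  label = c("a", "b")
)

# list: mixed types within one object, no rectangular constraint
l <- list(name = "r", versions = c(3.0, 3.1), released = TRUE)
```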


Unless you're doing linear algebra (or really care about memory usage), you almost never need to use a matrix in R.


To piggyback on what hadley said a bit, I find thinking of a data frame as a "collection of records", and a matrix as "two dimensional data" to be a bit better.

One useful heuristic worth asking is "Does it make sense to sort this data by something". In that case, you have a data frame. Whereas if you want to perform matrix math on something (inverting it, multiplying it by another matrix, reducing it, etc.), you have a matrix. Things that I use a matrix for can generally also be expressed as a data frame with columns rowId, colId, and value. If it doesn't make sense in that format, a matrix is generally not the appropriate structure.
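One way to see that equivalence in code; the rowId/colId/value column names follow the comment above, and the data is made up:

```r
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# Flatten the matrix into the rowId/colId/value long form
# (as.vector() unrolls a matrix column by column)
df <- data.frame(
  rowId = rep(rownames(m), times = ncol(m)),
  colId = rep(colnames(m), each = nrow(m)),
  value = as.vector(m)
)
```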


That's a great explanation! Data frame for data analysis; matrix for math.


I'd amend that a little: use a matrix when you're actually calculating statistics (internally to the function). Clean your data so it always fits in a data frame when you load it. Lists are for representing things like data scraped from html before converting it to a data frame.


It's always great when you spend 10 hours trying to debug something and then find out from a mailing list that it's actually a bug in R. :(


Business model is sell to enterprise and consulting: http://www.rstudio.com/pricing/


FWIW we don't do any consulting, although we do a decent amount of training.


The community and ecosystem around R is rapidly changing and adapting. R has a long and storied history as a niche language for statistics and analysis. Much like those disciplines have entered the mainstream of modern technology-enabled businesses, so follows the R ecosystem. Previously laborious tasks are being revamped with new, elegant APIs, as rvest does for scraping (and dplyr for manipulation, lubridate for date manipulation, etc.)

Performance is also historically an R bugaboo, but with changes to R's copy-on-write semantics and other optimizations in the base language, current benchmarks show it performing on par with Python and other dynamic languages (if not slightly better with tools such as dplyr and data.table.)

The magrittr package's implementation of a "pipe" semantic (often considered the only truly successful implementation of a 'component architecture') and the adoption of that model by tools such as rvest are really allowing the functional, vectorized nature of R to shine through. These are really darned exciting times to be a part of this community!
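A sketch of that pipe style, assuming the magrittr and rvest packages; the URL and CSS selector here are placeholders:

```r
library(magrittr)
library(rvest)

# Each step feeds its result to the next via %>%
titles <- html("http://example.com") %>%
  html_nodes(".title") %>%
  html_text()
```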


Do you have any links to articles on the current benchmarks that you mention? Not being snide just curious to read more.


I'd start here: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-G...

Performance isn't currently a huge focus for dplyr. In my opinion dplyr is fast enough that the bottleneck becomes mostly cognitive - you spend more time thinking about what you want to do than actually doing it.


Does lubridate solve the unintended timezone conversions of POSIXct classes?


Can you provide a bit more detail?


In the R Help Desk 2004 (http://www.r-project.org/doc/Rnews/Rnews_2004-1.pdf), Gabor Grothendieck recommends chron over POSIXct classes on account of the time zone conversions which occur when the tz attribute of the latter object is not "GMT". Will this not be a problem with lubridate? Thanks in advance.


I've never found that to be a problem in practice. Do you have an example where it's bitten you in practice?

(Also you should use UTC and not GMT)


Hi Hadley, yes for instance

> as.chron("1970-01-01") + unclass(as.chron("2001-04-01"))
[1] 04/01/01

> as.POSIXct("1970-01-01", "EST") + unclass(as.POSIXct("2014-06-01", "EST"))
[1] "2014-06-01 05:00:00 EST"

If there is any conversion necessary it is difficult to get back the original intended time.
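For what it's worth, lubridate does give explicit control over the two distinct operations that usually cause these surprises. A sketch assuming the lubridate package:

```r
library(lubridate)

t <- ymd_hms("2014-06-01 00:00:00", tz = "America/New_York")

# with_tz(): same instant in time, displayed in another zone
with_tz(t, "UTC")

# force_tz(): same clock time, reinterpreted in another zone
force_tz(t, "UTC")
```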


What does adding two dates together mean?


Isn't this the conventional way of converting variables which have been coerced to their numeric representations back to time/date classes?
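In base R the usual round trip looks like this; supplying origin and tz explicitly is what avoids the conversion ambiguity (a sketch, with a made-up timestamp):

```r
t <- as.POSIXct("2014-06-01 12:00:00", tz = "UTC")
n <- as.numeric(t)   # seconds since the epoch

# Converting back: origin and tz must be stated explicitly
t2 <- as.POSIXct(n, origin = "1970-01-01", tz = "UTC")
identical(t, t2)
```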


I guess I really do skip over who the submitter is when checking out HN links...if this submission had been titled, "Rvest: Easy web scraping R library by Hadley Wickham", I would've immediately been non-skeptical.

It looks like rvest intends to be the equivalent of Mechanize, with stateful navigation in the works. Is there an R equivalent to just Beautiful Soup or Nokogiri?


What are you looking for? rvest should support all the navigation tools from beautiful soup/nokogiri (unless I've missed something), but currently doesn't have any support for modifying the document (in which case I think your only option is the XML package).
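For example, the parsing/navigation side (the rough equivalent of Nokogiri's css/text/attr) looks something like this in rvest; the URL is a placeholder:

```r
library(rvest)

page <- html("http://example.com")

# CSS selectors, as in Beautiful Soup / Nokogiri
links <- html_nodes(page, "a")
html_text(links)
html_attr(links, "href")

# XPath works too
html_nodes(page, xpath = "//h1")
```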


No you didn't miss anything...I meant if there were standalone parsers for R...Mechanize uses Nokogiri/Soup as a dependency.


Not that I'm aware of - rvest uses the R package XML which uses the C library libxml.


Will this work with https websites?

One of the reasons I learned Python for data scraping was that R in general does not play nice with https (RCurl requires a certificate and even then it's pretty fussy)


Yes. It uses httr which wraps RCurl in such a way that everything should just work.
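A minimal httr sketch of an https request (URL is a placeholder):

```r
library(httr)

r <- GET("https://example.com")
status_code(r)            # e.g. 200 on success
content(r, as = "text")   # the response body as a string
```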


Hadley Wickham writes R packages faster than I can read the documentation on them.

A true 10xer.



