Hacker News
Rvest: Easy web scraping with R (rstudio.org)
142 points by hadley on Nov 24, 2014 | hide | past | favorite | 28 comments



The RStudio guys have really made R a pleasure to use. Thank you guys!

The core language is still a confusing mess (I'm still never sure when to use a matrix, a data frame, or a list...), but if you use their tools you can ignore it for the most part.

In under 10 lines you can massage data and generate fantastic graphics.
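For instance, here's a rough sketch of the kind of short pipeline being described, assuming the dplyr and ggplot2 packages and the built-in mtcars dataset:

```r
library(dplyr)
library(ggplot2)

# Summarise mpg by cylinder count, then plot the summary
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_bar(stat = "identity") +
  labs(x = "Cylinders", y = "Mean MPG")
```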

A little off topic, but does anyone know what their business model is? Are they going to run out of money and burn out in a year or two?


If you're confused about R's data structures, please read http://adv-r.had.co.nz/Data-structures.html and let me know if it doesn't help.

And no, we're not planning on burning out. We currently sell three things:

* RStudio Server Pro. A commercial version of the open-source server edition that provides stuff that corporate IT wants (e.g. monitoring, more auth options, ...)

* Shiny Server Pro. A more flexible version of the open-source shiny server that offers more configurability (e.g. number of R processes per app), and again other stuff that corporate IT wants.

* The right to use the RStudio desktop IDE, for companies who don't want to use AGPL software


[deleted]


Have a look at http://adv-r.had.co.nz/Subsetting.html#subsetting-operators. There was just too much material to fit in one chapter.


Here's how I think of it, which has been working for me:

matrix - If you have data that would make sense to be in a spreadsheet-type format and all your data are numbers.

dataframe - If you have data that would make sense to be in a spreadsheet-type format and some columns are numbers but other columns are something else (character strings, dates, TRUE/FALSE); but each column is only one thing. That is, you have one column that's all dates, another column that's all numbers, yet another column that's all character strings, etc.

list - if you need to mix data types within a certain entity (vector or column of data).
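A quick base-R illustration of that three-way distinction, using made-up toy data:

```r
# matrix: rectangular, every cell a number
m <- matrix(1:6, nrow = 2)

# data frame: rectangular, but each column can be a different type
df <- data.frame(
  date  = as.Date(c("2014-11-24", "2014-11-25")),
  count = c(10, 12),
  label = c("a", "b")
)

# list: mixed types within one object, no rectangular constraint
l <- list(name = "r", versions = c(3.0, 3.1), released = TRUE)
```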


Unless you're doing linear algebra (or really care about memory usage), you almost never need to use a matrix in R.


To piggyback on what hadley said a bit, I find thinking of a data frame as a "collection of records", and a matrix as "two dimensional data" to be a bit better.

One useful heuristic worth asking is "Does it make sense to sort this data by something". In that case, you have a data frame. Whereas if you want to perform matrix math on something (inverting it, multiplying it by another matrix, reducing it, etc.), you have a matrix. Things that I use a matrix for can generally also be expressed as a data frame with columns rowId, colId, and value. If it doesn't make sense in that format, a matrix is generally not the appropriate structure.
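One way to see that equivalence in code; the rowId/colId/value column names follow the comment above, and the data is made up:

```r
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

# Flatten the matrix into the rowId/colId/value long form
# (as.vector() unrolls a matrix column by column)
df <- data.frame(
  rowId = rep(rownames(m), times = ncol(m)),
  colId = rep(colnames(m), each = nrow(m)),
  value = as.vector(m)
)
```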


That's a great explanation! Data frame for data analysis; matrix for math.


I'd amend that a little: use a matrix when you're actually calculating statistics (internally to the function). Clean your data so it always fits in a data frame when you load it. Lists are for representing things like data scraped from html before converting it to a data frame.


It's always great when you spend 10 hours trying to debug something and then find out from a mailing list that it's actually a bug in R. :(


Business model is sell to enterprise and consulting: http://www.rstudio.com/pricing/


FWIW we don't do any consulting, although we do a decent amount of training.


The community and ecosystem around R is rapidly changing and adapting. R has a long and storied history as a niche language for statistics and analysis. Much like those disciplines have entered the mainstream of modern technology-enabled businesses, so follows the R ecosystem. Previously laborious tasks are being revamped with new, elegant APIs, as rvest does for scraping (and dplyr for manipulation, lubridate for date manipulation, etc.)

Performance is also historically an R bugaboo, but with changes to R's copy-on-write semantics and other optimizations in the base language, current benchmarks show it performing on par with Python and other dynamic languages (if not slightly better with tools such as dplyr and data.table.)

The magrittr package's implementation of a "pipe" semantic (often considered the only truly successful implementation of a 'component architecture') and the adoption of that model by tools such as rvest are really allowing the functional, vectorized nature of R to shine through. These are really darned exciting times to be a part of this community!
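A sketch of that pipe style, assuming the magrittr and rvest packages; the URL and CSS selector here are placeholders:

```r
library(magrittr)
library(rvest)

# Each step feeds its result to the next via %>%
titles <- html("http://example.com") %>%
  html_nodes(".title") %>%
  html_text()
```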


Do you have any links to articles on the current benchmarks that you mention? Not being snide just curious to read more.


I'd start here: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-G...

Performance isn't currently a huge focus for dplyr. In my opinion dplyr is fast enough that the bottleneck becomes mostly cognitive - you spend more time thinking about what you want to do than actually doing it.


Does lubridate solve the unintended timezone conversions of POSIXct classes?


Can you provide a bit more detail?


In the R Help Desk 2004 (http://www.r-project.org/doc/Rnews/Rnews_2004-1.pdf), Gabor Grothendieck recommends chron over POSIXct classes on account of the time zone conversions which occur when the tz attribute of the latter object is not "GMT". Will this not be a problem with lubridate? Thanks in advance.


I've never found that to be a problem in practice. Do you have an example where it's bitten you in practice?

(Also you should use UTC and not GMT)


Hi Hadley, yes for instance

> as.chron("1970-01-01") + unclass(as.chron("2001-04-01"))
[1] 04/01/01

> as.POSIXct("1970-01-01", "EST") + unclass(as.POSIXct("2014-06-01", "EST"))
[1] "2014-06-01 05:00:00 EST"

If there is any conversion necessary it is difficult to get back the original intended time.
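For what it's worth, lubridate does give explicit control over the two distinct operations that usually cause these surprises. A sketch assuming the lubridate package:

```r
library(lubridate)

t <- ymd_hms("2014-06-01 00:00:00", tz = "America/New_York")

# with_tz(): same instant in time, displayed in another zone
with_tz(t, "UTC")

# force_tz(): same clock time, reinterpreted in another zone
force_tz(t, "UTC")
```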


What does adding two dates together mean?


Isn't this the conventional way of converting variables which have been coerced to their numeric representations back to time/date classes?
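In base R the usual round trip looks like this; supplying origin and tz explicitly is what avoids the conversion ambiguity (a sketch, with a made-up timestamp):

```r
t <- as.POSIXct("2014-06-01 12:00:00", tz = "UTC")
n <- as.numeric(t)   # seconds since the epoch

# Converting back: origin and tz must be stated explicitly
t2 <- as.POSIXct(n, origin = "1970-01-01", tz = "UTC")
identical(t, t2)
```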


I guess I really do skip over who the submitter is when checking out HN links...if this submission had been titled, "Rvest: Easy web scraping R library by Hadley Wickham", I would've immediately been non-skeptical.

It looks like rvest intends to be the equivalent of Mechanize, with stateful navigation in the works. Is there an R equivalent to just Beautiful Soup or Nokogiri?


What are you looking for? rvest should support all the navigation tools from beautiful soup/nokogiri (unless I've missed something), but currently doesn't have any support for modifying the document (in which case I think your only option is the XML package).
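For example, the parsing/navigation side (the rough equivalent of Nokogiri's css/text/attr) looks something like this in rvest; the URL is a placeholder:

```r
library(rvest)

page <- html("http://example.com")

# CSS selectors, as in Beautiful Soup / Nokogiri
links <- html_nodes(page, "a")
html_text(links)
html_attr(links, "href")

# XPath works too
html_nodes(page, xpath = "//h1")
```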


No you didn't miss anything...I meant if there were standalone parsers for R...Mechanize uses Nokogiri/Soup as a dependency.


Not that I'm aware of - rvest uses the R package XML which uses the C library libxml.


Will this work with https websites?

One of the reasons I learned Python for data scraping was that R in general does not play nice with https (RCurl requires a certificate and even then it's pretty fussy)


Yes. It uses httr which wraps RCurl in such a way that everything should just work.
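A minimal httr sketch of an https request (URL is a placeholder):

```r
library(httr)

r <- GET("https://example.com")
status_code(r)            # e.g. 200 on success
content(r, as = "text")   # the response body as a string
```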


Hadley Wickham writes R packages faster than I can read the documentation on them.

A true 10xer.



