A new ggplot is here (yhat.com)
31 points by avyfain on July 8, 2016 | 16 comments



> aes(x='np.log(B - A)')

I never understood the pattern of putting source code in a string. Why not just use np.log(B - A) directly and configure the function to accept columns? With strings you lose highlighting, semantic analysis from editors, as well as the ability to know what computations are happening when and where. There seems to be no point and a significant drawback to this. What's the rationale?


It's because R is lazily evaluated and Python is eagerly evaluated [1].

Eager evaluation is when the interpreter evaluates the argument np.log(B-A) BEFORE passing it into aes(). aes() can only see the resulting VALUE, not the EXPRESSION itself.

In contrast, lazy evaluation means that aes() gets the raw expression as an argument. It can evaluate it, pass it on to another function, or otherwise manipulate it (e.g. serialize it back to a string).
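To make the difference concrete, here is a minimal sketch in plain Python. The names aes_eager and aes_deferred are hypothetical and are not part of ggplot; math.log stands in for np.log so the snippet has no dependencies.

```python
import math

def aes_eager(x):
    # Python evaluated the argument before the call, so only a float arrives.
    # The text "math.log(B - A)" is already gone at this point.
    return x

def aes_deferred(expr, env):
    # The expression arrives as text. It can be kept (e.g. for a label) and
    # evaluated later against whatever variables are supplied in env.
    return expr, eval(expr, {"math": math}, env)

A, B = 1.0, 3.0
value = aes_eager(math.log(B - A))
label, deferred_value = aes_deferred("math.log(B - A)", {"A": A, "B": B})

print(value)                  # just a number
print(label, deferred_value)  # the expression text plus the same number
```

The string version is the only one of the two that still knows what was asked for, which is the property the aes(x='...') style relies on.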

In the case of ggplot, this is used to actually evaluate the expression at DIFFERENT values, so you can plot it. Suppose you want to plot f(x) = square(x). It doesn't make any sense to write plot(square(x)), because if x = 5.0, you will get plot(25.0), and you can't make the plot. plot(lambda x: square(x)) makes more sense, because then the plot() function can evaluate it with 100 different values of x to get 100 pixel values.
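As a rough illustration of what such a plot() would do internally, assuming a hypothetical helper called sample (not taken from any plotting library):

```python
def square(x):
    return x * x

def sample(f, lo, hi, n):
    """Evaluate f at n evenly spaced points in [lo, hi]; n must be >= 2."""
    step = (hi - lo) / (n - 1)
    return [(lo + i * step, f(lo + i * step)) for i in range(n)]

# The callable is passed unevaluated; sample() decides where to call it.
points = sample(lambda x: square(x), 0.0, 2.0, 3)
print(points)
```

Passing square(5.0) instead would hand sample() the number 25.0, which it could not call again at other values of x.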

And it's also used to print the expression on the axes. That is, you actually want to print the expression "square(x)" on the graph. You don't want to print "25.0" on the graph -- that makes no sense.

This is related to the concept of quotations in Lisp. Quotations are UNEVALUATED program fragments. In Lisp it's an AST, but in Python or any other dynamic language, it has to be a string. In C there is no way to do this (short of shelling out to the C compiler at runtime, which some people have actually done ...)
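For what it's worth, once the fragment is held as a string, Python's standard ast module can turn it into a tree without evaluating it. A small sketch (names_in is a hypothetical helper) that lists the variable names the expression refers to:

```python
import ast

def names_in(expr):
    """Return the sorted variable names used in a Python expression string."""
    tree = ast.parse(expr, mode="eval")  # parse only, nothing is executed
    return sorted({node.id for node in ast.walk(tree)
                   if isinstance(node, ast.Name)})

print(names_in("np.log(B - A)"))
```

Here log does not show up because it is an attribute access on np, not a bare name.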

[1] R has crazy caveats, but that's beyond the scope of this post...


So why not accept a lambda? I saw that http://stackoverflow.com/questions/334851/print-the-code-whi... gives you the source, though I'm not sure if it'll always work. This does work though:

  ~ λ echo "a = lambda x: x * 2" > test.py
  ~ λ py -i test.py               
  IPython 4.2.0 -- An enhanced Interactive Python.
  In [1]: a(2)
  Out[1]: 4
  In [2]: import inspect
  In [3]: inspect.getsource(a)
  Out[3]: 'a = lambda x: x * 2\n'
I'm definitely a big fan of using Lisp's quoting for code-as-data, but I feel stringifying it makes the problem even worse.


Yeah that's a good point. I haven't used this ggplot library, but it seems like it could use lambdas. And then you don't break syntax highlighting.

One other place I've seen this done is in the numexpr library for Python.

https://github.com/pydata/numexpr

It does seem like this

    ne.evaluate('a*b-4.1*a > 2.5*b') 
could be

    ne.evaluate(lambda a, b: a*b - 4.1*a > 2.5*b)
The lambda is never executed because it compiles to machine code and not Python byte code, but that shouldn't make a difference. You should still be able to use the AST of the body as input to the compiler.

And then a and b have to be pulled out of locals() automatically or something.
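A sketch of the name-lookup half of that idea, assuming a hypothetical evaluate() that is not numexpr's API. It reads the lambda's parameter names with inspect.signature and fills them from a namespace passed in explicitly (a caller could pass locals()). Unlike numexpr, it simply calls the lambda and compiles nothing.

```python
import inspect

def evaluate(fn, namespace):
    """Call fn, filling each parameter from the same-named entry in namespace."""
    params = inspect.signature(fn).parameters
    return fn(**{name: namespace[name] for name in params})

a, b = 3.0, 2.0
result = evaluate(lambda a, b: a*b - 4.1*a > 2.5*b, {"a": a, "b": b})
print(result)
```

Extra entries in the namespace are ignored, and a missing one raises KeyError, which is arguably easier to debug than a typo inside a string.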


That is likely a hack necessary for it to work with Python syntax. (R's ggplot2 does not require wrapping aesthetics in strings)


I actually usually use the aes_string command instead, which allows strings. Easy for programmatic reuse of complex ggplot commands.


Indeed, whereas function arguments in R are lazily evaluated, in Python they're not.


But is laziness that important for a graphing library where presumably you'd be graphing the function anyway? I guess there are some cases where one of the other options would disable the y-axis or something, but this seems pretty rare. Is there another reason?


I'm not sure whether this library supports it, but if you plot a function, a smart plotting library will evaluate the function at points that depend on the function and the resolution of your output device; you can't just evaluate the function at n points on the range to be plotted and expect to get an accurate plot.

For example, to plot sin(1/x) accurately, it must be evaluated many, many times near zero, but you do not want to do that further away from the origin, as it slows down plotting, and can produce less nice plots if you export to a vector format such as svg or postscript.
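A minimal sketch of that kind of adaptive sampling, assuming a simple midpoint test and a depth limit (adaptive_sample is a hypothetical helper, and real plotting libraries use more careful criteria):

```python
import math

def adaptive_sample(f, lo, hi, tol=0.01, max_depth=12):
    """Return increasing x values, denser where f is far from locally linear."""
    def refine(x0, y0, x1, y1, depth):
        xm = 0.5 * (x0 + x1)
        ym = f(xm)
        # Stop if the midpoint lies close to the chord, or the depth is used up.
        # A single midpoint check can be fooled; this is only a sketch.
        if depth == 0 or abs(ym - 0.5 * (y0 + y1)) <= tol:
            return [x1]
        return (refine(x0, y0, xm, ym, depth - 1)
                + refine(xm, ym, x1, y1, depth - 1))
    return [lo] + refine(lo, f(lo), hi, f(hi), max_depth)

xs = adaptive_sample(lambda x: math.sin(1 / x), 0.01, 1.0)
near = sum(1 for x in xs if x <= 0.1)
far = sum(1 for x in xs if x >= 0.5)
print(len(xs), near, far)
```

The sampler needs the function itself, not its value at one point, which is another reason for a plotting API to accept an unevaluated expression or a callable.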


Off the top of my head:

When plotting in R, the names of the variables in the caller of the plot function are used as axis labels. This is pretty nifty and plotting without feels very unnatural to me (now).


Important clarification: this refers to the Python port of ggplot2, not the R package itself.


I really think they should get another name; it's confusing for no real gain, especially if they intend to go their own way. A good grammar-based language to make plots in Python is needed, but I'm not sure we need to be restrained by all the R legacy and quirks.


To be fair, there is some precedent. Printf, QuickCheck, and other libraries and features keep the exact same name as they're copied from language to language.


I wonder if people will think this one came first because of the '2'.


Context would be nice. I looked through it baffled until I realized this was a port of the Grammar of Graphics to Python, meant to work and look exactly like Hadley's ggplot.

I learned R because of my frustrations with where pandas was years ago. Seeing this kind of port continues to make me think about moving back to Python from start to finish.




