I never understood the pattern of putting source code in a string. Why not just use np.log(B - A) directly and configure the function to accept columns? With strings you lose highlighting, semantic analysis from editors, as well as the ability to know what computations are happening when and where. There seems to be no point and a significant drawback to this. What's the rationale?
It's because R is lazily evaluated and Python is eagerly evaluated [1].
Eager evaluation is when the interpreter evaluates the argument np.log(B-A) BEFORE passing it into aes(). aes() can only see the resulting VALUE, not the EXPRESSION itself.
In contrast, lazy evaluation means that aes() gets the raw expression as an argument. It can evaluate it, pass it on to another function, or otherwise manipulate it (e.g. serialize it back to a string).
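To make the difference concrete, here is a minimal Python sketch. The names eager_aes and deferred_aes are made up for illustration and are not part of any ggplot port; math.log stands in for np.log so the snippet has no dependencies.

```python
import math

def eager_aes(value):
    # Hypothetical helper: Python has already computed the argument,
    # so all we can see here is a number.
    return repr(value)

def deferred_aes(expr_source, env):
    # Hypothetical helper: the caller hands over the expression as text,
    # so we can keep the text AND evaluate it later against names we choose.
    return expr_source, eval(expr_source, {"math": math}, env)

A, B = 2.0, 10.0
seen_eager = eager_aes(math.log(B - A))
seen_source, seen_value = deferred_aes("math.log(B - A)", {"A": A, "B": B})
```

seen_eager only holds the text of a float, while seen_source still holds the original expression next to its computed value.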
In the case of ggplot, this is used to actually evaluate the expression at DIFFERENT values, so you can plot it. Suppose you want to plot f(x) = square(x). It doesn't make any sense to write plot(square(x)), because if x = 5.0, you will get plot(25.0), and you can't make the plot. plot(lambda x: square(x)) makes more sense, because then the plot() function can evaluate it with 100 different values of x to get 100 pixel values.
And it's also used to print the expression on the axes. That is, you actually want to print the expression "square(x)" on the graph. You don't want to print "25.0" on the graph -- that makes no sense.
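A rough sketch of that idea, assuming a made-up sample() helper in place of a real plot() function: because it receives the function rather than a value, it gets to pick the 100 x values itself.

```python
def square(x):
    return x * x

def sample(f, lo, hi, n=100):
    # Hypothetical stand-in for plot(): choose n evenly spaced x values
    # and evaluate the deferred function at each one.
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return xs, [f(x) for x in xs]

xs, ys = sample(lambda x: square(x), -5.0, 5.0)
```

Passing square(5.0) instead would hand sample() the single number 25.0, with nothing left to evaluate.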
This is related to the concept of quotations in Lisp. Quotations are UNEVALUATED program fragments. In Lisp it's an AST, but in Python or any other dynamic language, it has to be a string. In C there is no way to do this (short of shelling out to the C compiler at runtime, which some people have actually done ...)
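For what it's worth, once the quotation arrives as a string, Python's standard ast module can turn it into a tree that a library can inspect without running it. A small sketch, using the expression from the original question:

```python
import ast

source = "np.log(B - A)"

# Parse the quoted expression without evaluating it.
tree = ast.parse(source, mode="eval")

# Collect every bare name the expression refers to.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})
```

Here names lists the identifiers the expression depends on (A, B and np), which is the kind of information a plotting function would need in order to look up columns.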
[1] R has crazy caveats, but that's beyond the scope of this post...
The lambda is never executed because it compiles to machine code and not Python byte code, but that shouldn't make a difference. You should still be able to use the AST of the body as input to the compiler.
And then a and b have to be pulled out of locals() automatically or something.
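One way that lookup could work, as a sketch only: evaluate_in_caller is a hypothetical helper, and it relies on inspect.currentframe(), which is available in CPython but not guaranteed on every Python implementation.

```python
import inspect

def evaluate_in_caller(expr_source):
    # Hypothetical helper: evaluate the quoted expression using the
    # caller's local and global variables.
    caller = inspect.currentframe().f_back
    try:
        return eval(expr_source, caller.f_globals, caller.f_locals)
    finally:
        # Drop the frame reference to avoid a reference cycle.
        del caller

def demo():
    a, b = 3, 4
    return evaluate_in_caller("a * b + 1")

result = demo()
```

The caller never passes a or b explicitly, which is convenient but also makes it harder to see where the values come from.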
But is laziness that important for a graphing library where presumably you'd be graphing the function anyway? I guess there are some cases where one of the other options would disable the y-axis or something, but this seems pretty rare. Is there another reason?
I'm not sure whether this library supports it, but if you plot a function, a smart plotting library will evaluate the function at points that depend on the function and the resolution of your output device; you can't just evaluate the function at n evenly spaced points on the range to be plotted and expect to get an accurate plot.
For example, to plot sin(1/x) accurately, it must be evaluated many, many times near zero, but you do not want to do that further away from the origin, as it slows down plotting, and can produce less nice plots if you export to a vector format such as svg or postscript.
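A simple way to get that behavior is recursive subdivision: split an interval only where the function deviates from a straight line. This is a generic sketch, not what ggplot or any particular library does; adaptive_sample, tol and depth are names and defaults picked for illustration.

```python
import math

def adaptive_sample(f, lo, hi, tol=0.01, depth=12):
    """Return sorted x values, subdividing where f is far from linear."""
    mid = (lo + hi) / 2.0
    # Compare the true midpoint value with the straight-line estimate.
    linear = (f(lo) + f(hi)) / 2.0
    if depth == 0 or abs(f(mid) - linear) < tol:
        return [lo, hi]
    left = adaptive_sample(f, lo, mid, tol, depth - 1)
    right = adaptive_sample(f, mid, hi, tol, depth - 1)
    # Both halves contain mid, so drop the duplicate.
    return left + right[1:]

xs = adaptive_sample(lambda x: math.sin(1.0 / x), 0.01, 1.0)
near = sum(1 for x in xs if x < 0.1)   # samples close to the origin
far = sum(1 for x in xs if x >= 0.5)   # samples in the smooth region
```

Comparing near and far shows the samples piling up close to zero. A single midpoint test can be fooled by a coincidence, so a real implementation would want a sturdier criterion.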
When plotting in R, the names of the variables in the caller of the plot function are used as axis labels. This is pretty nifty, and plotting without it feels very unnatural to me (now).
I really think they should get another name; it's confusing for no real gain, especially if they intend to go their own way. A good grammar-based language to make plots in Python is needed, and I'm not sure we need to be restrained by all the R legacy and quirks.
To be fair, there is some precedent. Printf, QuickCheck, and other libraries and features keep the exact same name as they're copied from language to language.
Context would be nice. I looked through it baffled until I realized this was a port of the Grammar of Graphics to Python, meant to work and look exactly like Hadley's ggplot.
I learned R because of my frustrations with where pandas was years ago; seeing this kind of port makes me think about moving back to Python from start to finish.