Mixing DOM and PHP objects for fun & profit

hackernewz · on Feb 3, 2011

Sounds like really bad coding taken to another level. At some point in your program you will know which thumbnails are needed for the page. Instead of structuring that logic and commenting it well, perhaps even turning it into a sub-system or library, you pepper the templates with database calls. And then, you waste more CPU time by scanning the full HTML output of a page multiple times.

This is like AJAX all on the server side. All the hassle of async processing with none of the benefits.

kouiskas · on Feb 3, 2011

I think you misunderstood the article, the point of that technique is to greatly reduce the amount of database calls. We don't "pepper the templates with database calls", to the contrary, we avoid doing them on the spot like a bad implementation would. We treat the data needs for these objects all at once, as late as possible (so we can group the data fetching needs of as many objects as possible), greatly reducing the amount of DB calls needed.

We don't scan the HMTL either, this system doesn't have templates, this is another point of the technique. The data structure holding the output contains both DOM and objects precisely to avoid a templating syntax, which would be costly to scan.

The tree that contains intertwined objects and DOM is only traversed once, when it's echoed. The data resolution is done in passes, but doesn't require traversing the "output" tree.

_hgt1 · on Feb 3, 2011

Ok, so you're not writing "mysql_query()" in the middle of your templates, but you are letting them decide what data they need. Your view layer is pulling data down from the model layer. And the whole DOM-Object "intertwining" thing would worry me: how do you test it, for a start.

kouiskas · on Feb 3, 2011

Only some of the data, when it makes sense to use that technique. Most of the data fetching on our pages is still done the old-fashioned MVC way. You can keep the best of both worlds by carefully selecting which parts you decide to defer. The technique is most interesting in situations where the relationship between controllers and views is complex and "bubbling" a new need for data inside a view's logic back to its controller is difficult. It makes even more sense to use when you identify similar needs in views/controllers pairs that are far in scope from each other. Like the AJAX example I gave, you can group DB queries across completely separate AJAX requests that have nothing to do with each other.

The intertwining part is quite straightforward to sanity check and unit test. For its output buffering aspect, PHP provides a way to check that you closed as many levels of buffering as you opened. As for object resolution, you can detect easily when things go wrong like attempting to render objects whose data hasn't been fetched by the resolution passes. It's difficult for me to prove "how safe" that part of the technique is without going into a lot of details about our implementation. That tool is so low-level for us that it's one of the most tested, error-checking areas of our code. When debugging you can also visualize the tree of intertwined DOM and objects at any stage of construction or resolution.

yinrunning · on Feb 3, 2011

Wow. This is a really ugly workaround for a problem that began when you started using MVC. I would assume that any savings in db calls is completely wasted in processing overhead. Even more was wasted just coming up with this rigmarole. More reasons why MVC is bad juju.

banksyw · on Feb 4, 2011

With respect I think you have missed the point a little. The thing this idea solves is separating data-fetching optimisation from code layout. Regardless of whether you use MVC or entirely procedural code, it gets very hard to keep data access optimal when you have very complex and highly modular pages without introducing hacky globals or coupling unrelated parts of the code together (should be obvious why that's bad).

For any one page you may be able to change the structure so that it is more efficient without compromising the code too much but some of those modules will need to be reused in a completely different context on a different page.

I'm perhaps not explaining very well but this idea isn't solving deficiencies with MVC, it is solving a problem inherent in ANY attempt to modularise/re-use code. To gain the same data-access efficiency without MVC would leave your code unreadable not to mention un-maintanable with a complex site and high frequency changes.

Many sites simply don't need this so yes it would be overkill and of little benefit for them. I don't think the article was suggesting such a scheme would work well for everyone else. If you do have pages which legitimately need to make hundreds of queries/cache gets to resolve the data requirements though, techniques like this can help a lot.

tlack · on Feb 3, 2011

I wish the examples here were more concrete. This could be a powerful mechanism, beyond the usual tiresome MVC.

Nycto · on Feb 3, 2011

I had the same thought, so I whipped up how I would have implemented it with PHP:

http://pastebin.com/YD79HaxU

Caveat: That code is completely untested

toolate · on Feb 4, 2011

That looks like a good start, but is missing the multi-pass approach from the original article.

kouiskas · on Feb 3, 2011

disclaimer: I'm the author of the article

I can share more details about how we use it. I didn't feel like giving code samples because our implementation is kind of "besides the point". People might find better ways to implement the same concept, that fits better with their particular needs or language of choice.

If by concrete examples you mean where and what we use this technique for on the website, I'm happy to explain more about that.

paisawalla · on Feb 3, 2011

Without concrete examples, it's difficult to tell what problem you are solving and how exactly you are solving it. From reading it twice I've gathered that you defer the execution/access of something until sometime. I don't mean to be flippant, but your explanation is too abstract.

Concrete examples, even contrived ones, would give me a sense of how it's different to write your views or your controller logic, as opposed to how I might do it in Symfony (for example.)

kouiskas · on Feb 3, 2011

A good example would require an entire source tree, as the full technique requires a different structure and flow of code for the entire application. That's why I found it a bit hard to fit into an article. Anyway, I'll give a shot at giving a more concrete example here.

Very naïve code might do something like this:

  function render_some_info($imageid) {
	$image = ImageFactory::get($imageid);
	echo '<div class="container">';
	echo '<img src="',$image->get_src(),'">';
	echo '</div>';
  }

You get the data about the image in the ImageFactory call and output it on the spot. In a better MVC architecture, that factory call would be in a controller. But for the sake of the argument, lets imagine that you couldn't predict that $imageid early enough, it came from another data fetching/logic.

What we would do with datex would look like that:

  function render_some_info($imageid) {
	$datex = new_datex();
	echo '<div class="container">';
	$datex->merge(new ImageDatum($imageid));
	echo '</div>';
	return $datex->end();
  }

The datex object captures the DOM output, and the datum object, keeping the order. The datum object is just a shell, it contains only the $imageid at this point, nothing else.

The $datex object returned by this method, is then merged into the general output tree that represents your page.

Processing continues until you've built your entire page. You still haven't touched the DB or cache for anything about that $imageid.

Then the resolving phase happens, you traverse your $datex tree object, that contains your entire page, and you look for all objects of type ImageDatum. Say you have 10 of them, you get the $imageid they contain, remove the dupes, then you do your efficient one-time DB query that gets the URL data for all these images at once. You iterate through all the ImageDatum objects and give them the data they need.

Once you've ran out of datum objects that needed data, you know that they're all resolvable (populated). You run through the $datex tree once more, you echo all the DOM you find, and you call ->render() on all the objects you find. In the case of ImageDatum your render() method will look like:

  function render() {
    echo '<img src="',$this->src,'">';
  }

Where the $src member was populated by the resolving phase of datex (right after hitting the DB once).

Does that give you a better idea, or do I need to go into more details about this specific example?

paisawalla · on Feb 3, 2011

I think that gives a good idea. It helps to see how someone on the consuming side of your API would do things, without having to know any of your internals.

In other languages/contexts, this has been called things like promise objects -- i.e., the object promises to give you certain data in the future when you request that data. It defers the computation of that data until the last possible moment. You guys go one step extra by retrieving all your promises from factories, so fulfillment of those promises can happen in a way that requires the least number of passes, or is optimized in some other way.

Is that a fair characterization of your technique? It's pretty cool

kouiskas · on Feb 3, 2011

Yeah, I think you have a good understanding of what we do. It gives a lot of control over when, in which order and how you group your data needs, and allows you to "drop" promise objects in the output. You can create dependencies between promise objects too (what we call "datum" in my example), which creates a web of relationships between these objects that lives in parallel to the realm of their DOM placement.

A typical use case for that kind of dependency is advertisements on deviantART. We serve different ad inventory depending on whether or not there is mature content on the page. Our ad datum objects thus depend on all the image/thumbnail datum objects being resolved first. That dependency between datum object classes is defined in the code and guarantees that we never serve the wrong inventory.

Before datex we had to be really careful never to output an image after the ad code had been called. And of course we have a banner at the very top of the website, above any image in the DOM. That meant we had to use templating, output buffering and all sorts of convoluted ways to make sure that we generated the ad code after all images had been output, then inserted the ad back at the top of the page where it belonged.

Now we can just render the page in logical top-to-bottom order, merging the ad datum object into the datex stream at the very beginning of the page generation, then move on and merge image datums as they come. We know that the dependency will always be respected; the ad datums will see their data populated after all the image datums have been processed.

tlack · on Feb 4, 2011

To me what's most interesting -- and something you touched on briefly in your article -- is that it seems this concept could be a powerful way to tie in Ajax and lazy loading (like Facebook does where parts of the page are loaded after the main DOM, to enhance perceived loading speed) in addition to being a better way to structure your regular procedural top-to-bottom page load, including providing a crawlable alternative/mobile view of highly dynamic content.

wvenable · on Feb 3, 2011

I agree this concrete example should be in the article -- I sort of got the idea from the article but this really solidified the concept in my mind. Very interesting stuff.

_hgt1 · on Feb 3, 2011

"Datex"? Scary. Not sure I agree that the example given at the beginning of the article is a big problem (why not just make the model apis more fine-grained?), and also not sure what this has to do with MVC specifically.

mildweed · on Feb 3, 2011

Do you have any stats on the CPU/memory usage difference using this approach? Obviously your database/memcache calls are optimized, but I'm curious what the overall effect is.

kouiskas · on Feb 3, 2011

When we deployed it, I started by adding only the overhead (output buffering and storing the output in a tree to render it later) throughout the entire codebase. When I benchmarked the before/after of this step, the performance difference on our pages wasn't measurable, it was within the margin of error. Which means that it was all gain from then on when we started using it to really save time on the DB and cache.

It's impossible for me to compare right now between using it and not using on a given page, because it requires so much rewriting and change in the way you structure parts of the code that you can't just turn it on or off.

I think the best way to compare would be to fork an open-source framework to use that technique and then look at the difference in CPU and memory usage to run the same site. I wish I had time to do that kind of tedious research for an article, but I don't... My goal was merely to share the idea, I'd be thrilled if someone picks it up and does an implementation that everyone can measure.

I developed that technology over months on a constantly shifting closed source codebase with 20+ developers committing code daily, that's another reason why comparing before/after is a bit difficult to achieve sometimes. Deploying that tech was a massive task itself, very far from a single source branching.

If I had to guess, I'd say that PHP memory usage would be increased, but not by much. After all you're only storing as much data as your final page HTML output - generally you want to keep that to a small size - plus some very small objects. However if your traditional MVC framework was already doing a lot of output buffering, there might not be a difference, the buffering is just moved to this new technique. As for CPU, I don't think it would be noticeable, we're just adding things to a small tree then traversing it once.

I think what is most wasteful about out implementation is the little extra code it creates to deal with datex instead of just echoing content. That's why I mention in the article that this would be better if handled at the language level, where all the concerns about memory and CPU could be highly optimized, in addition to benefiting from lighter syntax.

amccloud · on Feb 4, 2011

So... lazy loading database queries and including partial templates?