I've read the ALBERT paper extensively[0] and agree with your point. We can't boil down years of research effort into just 4 bullet points.

The intention in the blog post above isn't to suggest that we discard detailed research papers. It's just to think about additional mediums we can use to make papers more accessible (good examples are distill.pub and paperswithcode). As mentioned in other threads below, PDF can be a limiting medium.

[0] https://amitness.com/2020/02/albert-visual-summary/




I seem to find myself in the minority, but I don't think distill.pub is a particularly good model for publicizing research.

distill.pub heavily favors fancy, interactive visualization over actually meaningful research content. This is not to say that the research publicized on distill.pub is not meaningful, but that it is biased toward research that lends itself to fancy visualizations. So you end up seeing a lot of tweakable plots, image augmentations, and attention-weight visualizations. It is also further biased toward research groups that have the resources to build a range of D3 plots with sliders, carved out of actual research time.

For instance, I don't think BERT could ever make it into a distill.pub post. Despite completely upending the NLP field over the last 2 years, it has no fancy plots, multi-headed self-attention is too messy to visualize, and its setup is dead simple. You could maybe have one gif explaining how masked language modeling works. The best presentation of the significance of BERT is "here is a table of results showing BERT handily beating every other hand-tweaked implementation for every non-generation NLP task we could find with a dead-simple fine-tuning regime, and all it had was masked language modeling."
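To be fair, the objective itself is tiny. Here's a minimal sketch of masked language modeling using the Hugging Face transformers library (the checkpoint name, example sentence, and API are my assumptions for illustration, not anything from the paper):

    # Minimal masked-language-modeling sketch. "bert-base-uncased" and the
    # transformers API are assumptions for illustration, not the paper's code.
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Locate the masked position and take the highest-scoring vocabulary entry.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    predicted_id = logits[0, mask_pos].argmax(-1).item()
    print(tokenizer.decode([predicted_id]))  # typically "paris"

That's the whole pretraining objective: mask tokens, predict them back. Everything else is scale.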

To give another example: I think it's one of the reasons why a lot of junior researchers spend time trying to extract structure from attention and self-attention mechanisms. As someone who's spent some time looking into this topic, I can tell you you'll find a ton of one-off analysis papers, and next to no insights that actually inform the field (other than super-trivial observations like "tokens tend to attend to themselves and adjacent tokens").
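(For the curious, most of that genre reduces to something like the sketch below: dump the attention maps and measure where the mass lands. The checkpoint, example sentence, and transformers API are assumptions for illustration, not from any particular paper.)

    # Hedged sketch of a typical attention-structure analysis: average the
    # attention maps per layer and measure mass on the same / adjacent token.
    # Checkpoint and API are assumptions (current transformers library).
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
    model.eval()

    inputs = tokenizer("Attention analysis papers rarely generalize.",
                       return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions  # per layer: (1, heads, seq, seq)

    for layer, att in enumerate(attentions):
        att = att[0].mean(0)  # average over heads -> (seq, seq)
        self_mass = att.diagonal().mean().item()
        adj_mass = (att.diagonal(1).mean() + att.diagonal(-1).mean()).item()
        print(f"layer {layer}: self={self_mass:.2f} adjacent={adj_mass:.2f}")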


Oh for sure. PDF is tough for so many reasons. Remember that article about the Apple programmer trying to implement the "fit text to screen width" feature for PDF a couple months back? PDF is so challenging as a medium. Even something that reads and looks identical, but is different under the hood, could be a big improvement, apparently (I don't actually know how PDF works under the hood, other than hearsay that "it's difficult"). Then again, in the spirit of Chesterton's fence, maybe not.

I totally agree that additional media could be good. I got caught up on the "most papers could be compressed to < 50 lines" line and misunderstood the premise you were presenting.



