A commit history of BERT and its forks (amitness.com)
103 points by amitness on May 9, 2020 | 43 comments



First we need to move away from PDF as a method of distribution of research papers. While it's an immense improvement over the dead-tree medium and images containing scanned paper pages, it's a visual presentation medium instead of a semantic medium.

We'll see how long that particular revolution in communication takes.

Once / if we have all that information in even a loosely structured but plaintext format (whatever that may be ... JSON? XML? semantic HTML? MHTML? TeX? as long as both images and text can live in the same file), there will be another revolution in accessibility, ease of distribution and accountability.

Who knows, we might even stop mentioning GitHub as a best practice for that and use something distributed, blockchain-like.

Imagine a future where citations of previous work look like "As H.G. Wells et al. mention in block 4f98aaf1 document #391, the foo is better than the bar." (with hyperlinks, of course). Peddling my own wares, it can be done with something like https://github.com/ivoras/daisy .
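
A purely hypothetical sketch (not daisy's actual API or data model) of what resolving such a content-addressed citation could look like, assuming blocks are stored locally as JSON files named by their hash:

    # Hypothetical sketch: resolve a citation like "block 4f98aaf1 document #391"
    # against a local block store. Not tied to daisy or any existing system.
    import json
    from pathlib import Path

    def resolve_citation(block_hash, doc_number, store_dir="blockstore"):
        """Look up a document inside an immutable block, addressed by the block's hash."""
        block_path = Path(store_dir) / f"{block_hash}.json"
        block = json.loads(block_path.read_text())
        return block["documents"][doc_number]

    # A citation in a paper would then just carry the (hash, index) pair:
    citation = {"block": "4f98aaf1", "document": 391}
    # doc = resolve_citation(citation["block"], citation["document"])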


Wikipedia/WikiData seem to have already solved all these problems.

Which is why I think a lot of burnt out scientists spend their time there editing articles.


I totally agree, but I think the state of a research field is too malleable / moves too quickly to be a Wikipedia article (where people expect things to be "true").

A few weeks ago I spun up wikisurvey.org to try and make a place for Wikipedia-like collaboration on research. There's nothing on there yet, but I'm hoping it will enable low-latency collaboration once we get the ball rolling.


Likes for mentioning Wikipedia. (by a Wikipedian)


I somewhat dislike this idea. It seems like it treats scientific papers as independent, isolated containers of reliable facts which can be reliably and authoritatively cited without any need for context.

This is a really common way of reading scientific papers, but I think it is ultimately harmful: I believe it contributed to the replication crisis as well as terrible scientific journalism in the popular press. Current scientific papers certainly do not meet this standard and I'm not sure it's even possible or desirable to require that they do.

Ultimately, I think we just have to accept that scientific papers only really make sense as part of a large body of scientific literature, to which they are responding and which responds to them.

The problem is a lot of people would rather just pick one or a few papers, cite them to justify a claim with total certainty, and then move on. I worry about encouraging this tendency.


PDF may not be perfect, but I prefer it over any fancy system every single time.


> Imagine a future where citations of previous work look like "As H.G. Wells et al. mention in block 4f98aaf1 document #391, the foo is better than the bar."

I dunno, it’s easier for me to describe and to remember “Radin’s 801 paper” or “the ultimate papers” or such.


My really simple preference for a stupidly minor improvement would just be PDF without pagination. Like, we don't have page breaks on web pages, so why not have print-optimized and scroll-optimized renderings of the paper?


I feel like the papers published at https://distill.pub/ do a decent job of moving toward a more modern solution. There's no doubt that doing it this way often requires a lot more work, and maybe even a new set of skills just to produce or visualize the content, though.

And it doesn't really incorporate the diff concept in any way atm, but I imagine it's up their alley at least.


Can these papers be downloaded?

The greatest concern with any web-based solution is reliability. A PDF might not be as easy to grasp as a web-based solution with live examples and graphs that make the content more approachable, but it is still worth its weight in gold since I can back it up and have it available whenever I need it. Web resources, on the other hand, can become unavailable at a moment's notice.


I think this retrofitting of models from software engineering onto research is ill-targeted.

Research is non-linear by nature. One of the reasons research code is usually not well written is because of this same fact - there is no clear plan (or else it wouldn’t be research). Restricting research cultures to follow this linear progress model also imposes an artificial block that may be detrimental to productivity. Alternatively, this exercise could demand a second pass over research code which cleans it up and puts it in its “place”. This may or may not work because “clean up” is a non-trivial task which the researcher may find limited time and utility for.

By no means am I against writing clean code. In fact, I've had multiple discussions advocating for it, but it's hard to keep everyone aligned. The primary objective of research in a fast-paced field like Machine Learning is to get the idea out the door. Unfortunately, the exceptionally compounding benefits of writing clean, thoughtful code from the beginning are realized much later in this timeline. By then, it's too late and too hard to retro-fix the code.

My understanding so far (having written code for multiple research projects) is that the only way to fix this culture is to deeply ingrain the compounding benefits of clean code in new researchers. By exposing junior researchers to the compounding benefits of “clean” code, we can gradually nudge the community into a more favorable culture with respect to this “commit history” of research.


Agreed - and I would say this is part of a larger phenomenon of SWE hubris: the assumption that expertise in software automatically translates to process expertise in every other domain under the sun.


To be fair, SWE patterns are insightful.

But here's where the problem lies - the proposal in the article reflects how accustomed the community has become to the nice side effect of neural-network modularity, which gives the impression that a clean, linear decomposition of contributions is always achievable.

There are examples where this modularity isn't possible (or not optimal even if possible). The poster child, I would say, is probabilistic programming. It is easy to think about building general purpose inference algorithms in probabilistic programming. However, more often than not, inference heavily relies on context.

An example of this is Gaussian processes. In theory, it is easy to conceive a chain of ideas connecting threads of research within this tiny field. Unfortunately, in practice it turns out that inference with GPs is more efficient when we exploit structural assumptions within the context of modeling choices, and general-purpose few-line code changes would be terribly inefficient. You'd find the code looks vastly different to an untrained eye even though it composes similar-sounding (or similar-looking) building blocks.


In my neck of the woods (biomedical sciences), this hubris is called "the Andy Grove fallacy": https://blogs.sciencemag.org/pipeline/archives/2007/11/06/an...

And it is rampant.


I understand the sentiment, although the article you link is more of an ad hominem attack than a criticism.

Communities must be wary of giving in to the Semmelweis reflex. Oftentimes, the perceived "outsiders" tend to have interesting perspectives.

Another example to consider is Fermat vs. Descartes - Fermat was a relatively unpopular figure at a time when mathematics was confined to closed-group, elitist figures like Descartes. Nevertheless, he provided more elegant perspectives on problems that Descartes thought couldn't be handled any better.


Not Invented Hereism is different from "look, you outsiders fundamentally just don't understand the score, and until you do, i'm going to ignore you".

Swanking into the joint, assuming spherical cows, and berating the locals for being lazy roustabouts who just need to see the light is not going to make you friends, because it's demonstrably not useful. To the extent that there's human politics and gatekeeping, it's resolved by getting one's hands dirty in the lab before holding forth ex cathedra about things one obviously doesn't understand. It's certainly possible to take valuable ideas from one field and apply them in a productive cross-disciplinary way. The people who do it well don't play this game.

Intuitions and practices developed to handle designed and often linearizable systems for which spec sheets of some form are available just aren't applicable without heavy modification in the biosciences, which routinely deal with eking out victory in systems which are not designed, are wildly nonlinear, and for which the partial specs which do exist are known to be at best lies of omission and often just complete misunderstandings which can be, and are, revised at a moment's notice.


Or it's the converse: being an expert in a thing you do with code, and thinking that makes your code good.


I think this is missing the point of the research. The point of e.g. ALBERT isn't

    -Next Sentence Prediction
    +Sentence Order Prediction
    +Cross-layer Parameter Sharing
    +Factorized Embeddings
The point was what each of those things does, in terms of theory and experimentation. Factorized embeddings are needed because of the portion of memory used in embeddings, NSP isn't useful because it's too easy, etc. Consider a paper like Smoothness and Stability in GANs [0] from ICLR the other week. You could summarize it as Wasserstein loss + spectral normalization + gradient penalty. Or you could summarize the same work as a generator's convex conjugate + Fenchel-Moreau duality + L-smoothness + inf-convolutions. Both would be missing the point. Research isn't code. Ideas, their motivation, their demonstration, their explanation, and their testing are represented in the form they are for a reason: natural language + mathematical language + tables + graphs + references, etc.

It's why we aren't having this discussion as diffs of the comments we're replying to.

[0] https://iclr.cc/virtual_2020/poster_HJeOekHKwr.html


I've read the ALBERT paper extensively[0] and agree with what you mean. We can't boil down years of research effort into just 4 bullet points.

The intention in the above blog post isn't to suggest that we discard detailed research papers. It's just thinking about additional media we can use to make papers more accessible (good examples are distill.pub and paperswithcode). As mentioned in other threads below, PDF can be a limiting medium.

[0] https://amitness.com/2020/02/albert-visual-summary/


I seem to find myself in the minority, but I don't think distill.pub is a particularly ideal model for publicizing research.

distill.pub heavily favors fancy and interactive visualization over actually meaningful research content. This is not to say that the research publicized on distill.pub is not meaningful, but that it is biased toward research that lends itself to fancy visualizations. So you end up seeing a lot of tweakable plots, image augmentations, and attention-weight visualizations. It is also further biased toward research groups that have the resources to create a range of D3 plots with sliders, carved out of actual research time.

For instance, I don't think BERT could ever make it into a distill.pub post. Despite completely upending the NLP field over the last 2 years, it has no fancy plots, multi-headed self-attention is too messy to visualize, and its setup is dead simple. You could maybe have one gif explaining how masked language modeling works. The best presentation of the significance of BERT is "here is a table of results showing BERT handily beating every other hand-tweaked implementation for every non-generation NLP task we could find with a dead-simple fine-tuning regime, and all it had was masked language modeling."

To give another example: I think it's one of the reasons why a lot of junior researchers spend time trying to extract structure from attention and self-attention mechanisms. As someone who's spent some time looking into this topic, you'll find a ton of one-off analysis papers, and next to no insights that actually inform the field (other than super-trivial observations like "tokens tend to attend to themselves and adjacent tokens").


Oh for sure. PDF is tough for so many reasons. Remember that article about the Apple programmer trying to implement the "fit text to screen width" thing for PDF a couple months back? PDF is so challenging as a medium. Even something that reads and looks identical, but is different under the hood, could be a big improvement, apparently (I don't actually know how PDF works under the hood other than hearsay of "it's difficult"). In the spirit of Chesterton's fence, maybe not, though.

I totally agree that additional media could be good. I got caught up on the "most papers could be compressed to < 50 lines" line and misunderstood the premise you were presenting.


Related to this, I've had an idea for a while that publishing a paper could mean submitting it to a Hacker News-style forum where the users are identified members of the scientific community.

It would allow a kind of public review process of the paper, and it would float the papers most interesting to the community to the "front page".


In theory, this exists for arXiv preprints with arxiv-sanity[0]. In practice, nobody is discussing anything there.

Same thing for most other preprint sites that have integrated comment sections.

[0]: http://www.arxiv-sanity.com


When I was doing my PhD there was also SciRate: https://scirate.com/


Instead of a forum, wouldn't a wiki, blog, or other type of CMS make more sense? You can still add comment and interlinking functionality. With a CMS it's even better, as you can comment directly on a paragraph or even a sentence.

Or do you mean the river-of-news style, the front page, maybe also with upvotes, so people can better direct attention to papers?



I've given some thought to this and a related idea.

Let's say that, for a subset of scientific papers, you have the possibility of specifying both the premises and the results in a way that can be composed.

Let's say, for example, that your result is that the rate of expansion of the universe is N. Other papers could cite this result through, say, a URL of the result. They could then use this URL as a premise for their own results, and we could create an automatic system that notifies all these papers if the result changes after new data arrives for any element of the chain. Scientists could be notified that they should revise their own papers to see if their conclusions change with the new data, so that papers depending on them can be notified in turn. A paper could even be marked stale if, after a change to one of its premises, the authors have not confirmed that the conclusions are still valid; this staleness would propagate down the chain.

A very simplified structure of the data would be something like this:

{ premises: ['urlOfResult1', 'urlOfResult2',...], conclusions: [RateOfExpansion: > N] }

This is obviously terribly simplified; I suppose it'd take me a lot of time to explain clearly and in more detail how such a system could work, and I thought about it a long time ago.

It could be interesting to apply this to other fields, for example public policy: in this case, let's say that we have created a law because of a certain piece of data. If the data changes, we could be notified that perhaps the law could benefit from a new look.

{ premises: ['urlOfResultLeadIsNotDangerous'], conclusions: ['PeopleCanUseLeadInPipesAsMuchAsTheyWant'] }

Such a system could be made even more generic. People have probably already worked on this kind of system, but I never took the time to investigate it. If someone knows of examples of such systems being applied, I'd be very happy to know more.
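
As a minimal sketch of how the staleness propagation described above could work (all names here are hypothetical; this illustrates the idea, not an existing system), assume an in-memory registry keyed by result URL:

    # Hypothetical sketch: propagate "staleness" through a graph of results.
    # Each result is keyed by its URL and lists the URLs of the premises it depends on.
    from dataclasses import dataclass, field

    @dataclass
    class Result:
        url: str
        premises: list = field(default_factory=list)  # URLs this result depends on
        stale: bool = False

    class ResultRegistry:
        def __init__(self):
            self.results = {}     # url -> Result
            self.dependents = {}  # url -> set of urls citing it as a premise

        def register(self, result):
            self.results[result.url] = result
            for premise_url in result.premises:
                self.dependents.setdefault(premise_url, set()).add(result.url)

        def mark_changed(self, url):
            """A result changed: everything downstream becomes stale until re-confirmed."""
            queue = list(self.dependents.get(url, ()))
            while queue:
                dependent_url = queue.pop()
                dependent = self.results[dependent_url]
                if not dependent.stale:
                    dependent.stale = True
                    queue.extend(self.dependents.get(dependent_url, ()))

        def reconfirm(self, url):
            """Authors checked their conclusions against the new premise data."""
            self.results[url].stale = False

    # Usage: a revised expansion-rate measurement marks downstream papers stale.
    registry = ResultRegistry()
    registry.register(Result("urlOfRateOfExpansion"))
    registry.register(Result("urlOfPaperA", premises=["urlOfRateOfExpansion"]))
    registry.register(Result("urlOfPaperB", premises=["urlOfPaperA"]))
    registry.mark_changed("urlOfRateOfExpansion")
    print([r.url for r in registry.results.values() if r.stale])  # ['urlOfPaperA', 'urlOfPaperB']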


Hi, nfc. This is along the lines of the problem we are trying to solve with Mimix[0]. Like your JSON structure, we are using s-expressions to denote facts in text narratives so we can detect when they have changed between papers.

Instead of a centralized, URL-based solution, we do the fact comparisons locally, allowing each user to decide what source of facts he or she considers "canonical." The facts and their sources are recorded in that user's writing going forward so the next user has the same access to sources.

[0] http://mimix.io


That's great; it's certainly something fascinating to explore, and I'm happy you are working on this kind of idea :). I'll have a look when I have a bit more time.

Edit: Removed the question about contact info since you already answered it in your comment in another thread :)


We are working on a similar concept[0] using a programming language designed for this exact purpose[1]. One of our key ideas is "fact diffing" between two papers with different narrative text. We think this will be useful in all kinds of scientific and academic work.

[0] http://mimix.io/recipes

[1] http://mimix.io/specs
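
As an abstract illustration of what "fact diffing" could mean (a hypothetical sketch; this is not Mimix's actual s-expression representation), facts from each paper could be modeled as (subject, attribute, value) triples and the two sets compared directly:

    # Hypothetical sketch of fact diffing: represent each paper's claims as
    # (subject, attribute, value) triples and compare the two sets.
    def diff_facts(facts_a, facts_b):
        """Return facts only in A, facts only in B, and attributes whose values changed."""
        index_a = {(s, attr): val for s, attr, val in facts_a}
        index_b = {(s, attr): val for s, attr, val in facts_b}
        only_in_a = [(s, a, v) for (s, a), v in index_a.items() if (s, a) not in index_b]
        only_in_b = [(s, a, v) for (s, a), v in index_b.items() if (s, a) not in index_a]
        changed = [
            (s, a, index_a[(s, a)], index_b[(s, a)])
            for (s, a) in sorted(index_a.keys() & index_b.keys())
            if index_a[(s, a)] != index_b[(s, a)]
        ]
        return only_in_a, only_in_b, changed

    paper_old = [("universe", "expansion_rate_km_s_Mpc", 67.4), ("lead", "safe_in_pipes", True)]
    paper_new = [("universe", "expansion_rate_km_s_Mpc", 73.0), ("lead", "safe_in_pipes", False)]
    _, _, changed = diff_facts(paper_old, paper_new)
    print(changed)  # both facts changed: old and new values are reported side by side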


Hi mimixco, I just wrote a comment (https://news.ycombinator.com/item?id=23124707) about some thoughts I had on this kind of idea. What do you think? I'll try to read more about what you are doing as soon as I find some free time :)


Ok, I wrote a little something over there, too. :-) Thanks for looking at our stuff and feel free to email me. My contact info is on our site.


A meaningful diff summary like this can only be made in retrospect after some time. There are a lot more differences between, say, BERT and XLNet than the ones listed. At the time of publication, it wasn’t yet clear which particular differences were the important ones (and to some extent it is still not clear).


A really interesting take! It makes scientific papers feel more like a tool than an article or book. Certainly nice in the case of computational work. It also opens the door to long-term projects that get incremented on in what would've been separate papers. Additionally, if new, relevant information comes to light long after a paper has been published, the authors could reference this to give a more complete story.


Agree. This makes a lot of sense for incremental papers.


“shingling” is the (fairly rude) term of art.


I don't know how it's rude; shingling means 'many small roof tiles' to me.


'Term of art' means that there's a specific meaning inside a particular sphere of discussion for this term (which has a normal sense elsewhere).

To accuse a research group of shingling means that you think that that group releases a lot of papers one after the other which have a lot of overlap between them and back-cite each other-- and that this is done to artificially boost publication count and citation count to make the group look prominent.


Cool, thanks. I did a quick search but nothing came up - I assumed it was something scatological :o)


See https://nextjournal.com/#feature-immutability for an example with immutability at all levels of the system - not just the source code, but also the computations/analyses.


I wish research papers had a much stronger notion of "effective date".

By that, I mean the date to use to interpret any wording like "current" or "recently" or "yet" used inside the paper.

For some reason preprints often have no visible date on them at all, and automated datestamps can be misleadingly recent if someone makes a minor change without rewriting the whole thing.


Almost, but not quite, entirely unlike this?

https://en.wikipedia.org/wiki/Special:History/Wikipedia:No_o...


Very interesting and creative way to visualize related papers.



