Visualizing 40,000 student code submissions

Shizka · on Oct 8, 2013

Quite cool when you think about it. Each cluster probably represents a different method for solving the problem. Awesome that it's possible to classify the solutions like this. I think this might be usable for better feedback on Coursera. Cool!

informatimago · on Oct 8, 2013

Yes, and used in reverse, starting from a red cluster, you can derive a working program. Now let's just find a way to find those clusters from problem statements ;-)

Shizka · on Oct 8, 2013

Ahh yes, I didn't think about that. I wonder if it would be possible to build the best possible solution from this data in some way?

rube · on Oct 8, 2013

Interesting that the outer edges basically have fewer failed answers. I guess that means that there is a positive correlation between thinking outside the box and success? ;)

cdman · on Oct 8, 2013

Interesting choice of colors - red signifying that all the unit tests pass :-) (that is usually considered "green")

yaddayadda · on Oct 8, 2013

The authors say that the colors correlate with similar implementations that result in similar behavior, with red specifically indicating that all tests pass. (I totally agree with you that green would have been a much more logical choice.) Which only leaves green and blue. I'm curious what the distinction is between those implementations (e.g., blue passed some of the tests, green didn't pass any tests).

emilesilvis · on Oct 8, 2013

I would agree!

chrismorgan · on Oct 8, 2013

Abstract art? Yes. Of the best variety!

A couple of years ago, I made some abstract art of the inheritance structure of a large project written in a language with (extensively used) multiple inheritance, there being around 1800 classes. No one was game to produce a 15m-wide, 30cm-high wallpaper (the traditional type) of it, so I just removed the class names, leaving the classes as just dots, and made it my computer's wallpaper. It's got quite a few comments. Still, it was nowhere near as pretty as this.

akjetma · on Oct 8, 2013

Do you still have the image? I've created a few myself and they're really fun to look at. It's interesting to see the symmetry and orderliness of a project in its early stages as compared to the Frankenstein's monster it eventually becomes. I'll post mine if I can find or re-run them.

chrismorgan · on Oct 8, 2013

Matter of fact, while I'm still officially employed by that company (part time casual) I haven't worked for them this year at all, having been focusing on the final year of my Uni degree. And the images are at work. So I won't be able to access it for at least a month and a half. I believe the number of classes would now be in excess of 2,500. Certain things have been going on in the past few years which have led to significant growth in both the development team and the number of classes!

The language used is an in-house language, developed in the late 1980s and early 1990s, and one that has aged surprisingly well (with comparatively few modifications to the language since then), though there are now better options available.

iMark · on Oct 8, 2013

Looks like a load of Pollocks :)

khawkins · on Oct 8, 2013

I don't exactly see the value in this visualization. Clustering measures and feature analysis would provide far more insight into what's going on here. In fact, it's not even clear how large the dominant clusters are or what all of those speckles mean.

tlarkworthy · on Oct 8, 2013

?

Clustering is putting similar things near similar things. Tree edit distance is quite a natural measure of distance for tree-like things such as programs.

You can't avoid some warping when putting a high-dimensional manifold onto a low-dimensional one. You can see a lot of their data does cluster properly, but there are some long-range red arcs (in the embedding space) which are side effects of warping (they are near in data space).

You can see a cluster of green which is clearly of interest ... why did so many students get the wrong answer in the same way?

I see lots of value in that picture.
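To make the tree edit distance idea concrete, here is a minimal sketch in Python. It is not the Codewebs authors' method: it uses a simplified top-down (Selkow-style) variant, which is only an upper bound on the general tree edit distance, and it labels nodes by AST node type alone, so renamed variables count as identical. The helper names `to_tree`, `size`, `distance`, and `ast_distance` are made up for illustration.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Turn an ast node into a hashable (label, children) tuple.

    Assumption: the label is just the node type name, so identifiers
    and literal values are ignored.
    """
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


@lru_cache(maxsize=None)
def size(tree):
    """Number of nodes in a subtree (cost of inserting or deleting it whole)."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def distance(a, b):
    """Top-down tree edit distance between two (label, children) trees."""
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # Sequence alignment over the two child lists, where dropping or
    # adding a child costs the size of that whole subtree.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),                    # delete subtree
                dp[i][j - 1] + size(ys[j - 1]),                    # insert subtree
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),  # match children
            )
    return relabel + dp[len(xs)][len(ys)]


def ast_distance(src_a, src_b):
    """Distance between two Python source strings via their ASTs."""
    return distance(to_tree(ast.parse(src_a)), to_tree(ast.parse(src_b)))


plus_one = "def f(x):\n    return x + 1\n"
renamed = "def g(y):\n    return y + 1\n"
different = "def f(x):\n    return x * 2 + 1\n"

print(ast_distance(plus_one, renamed))    # same structure, only names differ
print(ast_distance(plus_one, different))  # extra multiplication in the tree
```

A matrix of such pairwise distances is the kind of input an embedding or clustering step would consume; a real pipeline over tens of thousands of submissions would need a much faster distance than this quadratic-per-node sketch.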

has2k1 · on Oct 8, 2013

But what is it good for?

Well we have a lot of ideas! One thing that we did, for example, was to apply clustering to discover the "typical" approaches to this problem. This allowed us to discover common failure modes in the class, but also gave us a way to find multiple correct approaches to the same problem. Stay tuned for more results from the codewebs team!

RVijay007 · on Oct 8, 2013

Probably also allows them to more easily detect cheating on coding assignments.

mcherm · on Oct 8, 2013

No, comparison of text rather than comparison of ASTs is better for that purpose. There are many good reasons for ASTs to be equivalent and few good reasons for text to match.
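A small Python sketch of that distinction, using only the standard library (`ast` and `difflib`). The two snippets and the helper names `text_similarity` and `same_ast` are invented for illustration; this is not how any particular plagiarism detector works.

```python
import ast
import difflib


def text_similarity(a, b):
    """Character-level similarity ratio in [0, 1] between two source strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def same_ast(a, b):
    """True if the two sources parse to the same AST.

    ast.dump omits line and column attributes by default, so comments,
    spacing, and line layout do not affect the comparison.
    """
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# Two hypothetical submissions: same program, written with different layout.
original = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
reformatted = (
    "# sum a list\n"
    "def total( xs ):\n"
    "\n"
    "    s=0\n"
    "    for x in xs: s+=x\n"
    "    return s\n"
)

print(same_ast(original, reformatted))         # structure matches
print(text_similarity(original, reformatted))  # text does not match exactly
```

Two students who each write the obvious loop will produce equal ASTs without ever seeing each other's work, whereas character-for-character identical text, down to spacing and comments, is much harder to explain innocently.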

mrcactu5 · on Oct 8, 2013

I keep meaning to take the machine learning course.

This is a great way of using metadata to search for patterns in student assignments. This could detect different "approaches" or "strategies".

mkelley82 · on Oct 8, 2013

Very cool visualization. I'd like to see how this sort of technique could be applied to other real-world problems.

yeukhon · on Oct 8, 2013

And probably a way to find who is cheating and who isn't :)