Square open-sources Tesseract: fast filtering for coordinated views

nodata · on March 19, 2012

Please don't call it tesseract, we already have a prominent open source project with that name: http://code.google.com/p/tesseract-ocr/

law · on March 19, 2012

Calling it a "prominent open source project" is an understatement. Granted, it's not perfect, but it's the de facto standard in FOSS OCR software. Mention "tesseract" to just about anyone in the data mining/machine learning/artificial intelligence communities (which is pretty much the target user base), and they will automatically think you're referring to the OCR software.

knowtheory · on March 19, 2012

I was just about to come here and lament this very fact.

Because I work on DocumentCloud I very much do touch on both OCRing software and frontend JS libraries.

It will only make things confusing to have these two projects fighting over the same namespace. And the Tesseract OCR engine is not going away. It's the de facto standard for FOSS OCRing.

Edit: Having played around with the examples for this lib, it's awesome! I will have to find a project to work this into. Looks like fun.

Terretta · on March 19, 2012

Concur -- fantastic tool, unfortunate name, due to the overlap.

Cubic Prism is another name for tesseract and could also remind one of both SQL data rollup and looking at data in new lights.

mbostock · on March 19, 2012

Apologies; I wasn't aware of the Tesseract OCR project until very recently, and then I hoped there would not be much harm given that the two projects are so unrelated. (The name "Tesseract" was the natural progression from "Square" and "Cube".) What's the saying? There are only two hard things in Computer Science: cache invalidation and naming things.

spaznode · on March 19, 2012

http://en.wikipedia.org/wiki/Tesseract "In geometry, the tesseract, also called an 8-cell or regular octachoron or cubic prism, is the four-dimensional analog of the cube. The tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of 6 square faces, the hypersurface of the tesseract consists of 8 cubical cells."

Looks like the perfect name to me. Sounds like the "other" project already has an established name of "tesseract ocr", don't see any reason why this library would be confused with that. Lame for people to focus on this instead of the crazy beautiful api that comes with this thing:

https://github.com/square/tesseract/wiki/API-Reference

ma2rten · on March 19, 2012

If you were going to make software that lets you see places through a webcam, would you call it Windows, even if it's a fitting name?

Tesseract goes by the name Tesseract, tesseract-ocr is just the name of the Google code project.

spaznode · on March 19, 2012

If we're going to talk about time, let's not forget a book I read as a child - http://en.wikipedia.org/wiki/A_Wrinkle_in_Time. I believe they win the "who used tesseract first" award. At least as far as this thread is concerned for now.

To be honest though, my view is very biased because I don't care about the ocr project at all. I'm sure it's a very nice project, but hardly a tesseract really.. =p

garethsprice · on March 19, 2012

There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors.

grayrest · on March 19, 2012

Unless you're 37Signals and then it's Javascript and naming things.

https://37signals.com/svn/posts/3112-how-basecamp-next-got-t...

dylanz · on March 19, 2012

I'd go with "Tessquare", or some combination of Square and Tesseract. I'd definitely change the name. Apart from the name talk, congratulations, and thank you for opening this up to the community.

aw3c2 · on March 19, 2012

Or Squaresseract :)

gojomo · on March 20, 2012

Squesseract or STesseract also have potential for disambiguation and googleability, without being too forgettably eccentric.

jashkenas · on March 19, 2012

If you're settled on "Tesseract", perhaps a small tweak would suffice to clear up the potential confusion. Call it "Tesseract.js".

fourneau · on March 19, 2012

That would lead me to believe that it's 'Tesseract OCR, but in Javascript!'. I think a more substantial name change will be required because of how prominent Tesseract is.

ktizo · on March 20, 2012

Tesseroct maybe..

Tesserect? ;p

Mind you Tesseract OCR did miss a chance to be TesseractOCRus, which would be great, obviously.

wlesieutre · on March 20, 2012

There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.

Another name option would be "hypercube," which means the same thing. It does make it sound more related to "cube" though.

premchai21 · on March 19, 2012

"Octachoron" would work just as well (progression from "tetragon" and "hexahedron").

sparky · on March 19, 2012

Hypercube?

repsilat · on March 20, 2012

Call it "4-square". Nobody will get confused then.

mattdeboard · on March 19, 2012

n-dimensional hypercube

tesseract · on March 19, 2012

hypercubes are already n-dimensional, that's not adding any extra information.

zdw · on March 19, 2012

Bug filed on their github issues page:

https://github.com/square/tesseract/issues/1

dylanz · on March 19, 2012

+1 ... The title totally confused me. I use Tesseract already, and it's a popular OCR project.

tbe · on March 21, 2012

I'm more confused about the name of the company. To me, Square was the maker of Final Fantasy and other role playing games in the golden years of 8- and 16-bit video game consoles.

http://en.wikipedia.org/wiki/Square_(company)

samstave · on March 19, 2012

Would you prefer Ono-Sendai, or Ice-9?

drewda · on March 19, 2012

@mbostock: Why is there Dart code in the project? Just because that's where you borrowed one of the sort functions from?

Thanks for another useful library!

mbostock · on March 19, 2012

Yep. Typed arrays don't have a built-in sort method, and even the built-in array.sort is extremely slow. I ported Dart's dual-pivot quicksort implementation, which reduced the time to sort 1M floats from ~2.5s to ~350ms (timed in Node v0.6.2).

https://github.com/square/tesseract/blob/master/src/quicksor...
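For readers unfamiliar with the technique, here is a minimal sketch of a dual-pivot quicksort over a typed array. It is not the code ported from Dart: the function names are made up for illustration, it picks the two endpoints as pivots (so it degrades on already-sorted input, where a production version would sample pivots and fall back to insertion sort for small ranges), and it assumes the array contains no NaN values.

```javascript
// Illustrative sketch only; not the library's quicksort.
// Swap two elements of an array or typed array in place.
function swap(a, i, j) {
  const t = a[i];
  a[i] = a[j];
  a[j] = t;
}

// Sort a[lo..hi] in place using two pivots p <= q, which split the
// range into three parts: (< p), (between p and q), (> q).
function dualPivotQuicksort(a, lo = 0, hi = a.length - 1) {
  if (lo >= hi) return a;
  if (a[lo] > a[hi]) swap(a, lo, hi);
  const p = a[lo];
  const q = a[hi];
  let lt = lo + 1; // a[lo+1 .. lt-1] holds values < p
  let gt = hi - 1; // a[gt+1 .. hi-1] holds values > q
  let i = lo + 1;  // a[lt .. i-1] holds values in [p, q]
  while (i <= gt) {
    if (a[i] < p) {
      swap(a, i, lt);
      lt++;
      i++;
    } else if (a[i] > q) {
      swap(a, i, gt);
      gt--; // do not advance i: the swapped-in value is unexamined
    } else {
      i++;
    }
  }
  // Move the pivots into their final positions.
  lt--;
  gt++;
  swap(a, lo, lt);
  swap(a, hi, gt);
  // Recurse on the three partitions, excluding the two pivots.
  dualPivotQuicksort(a, lo, lt - 1);
  dualPivotQuicksort(a, lt + 1, gt - 1);
  dualPivotQuicksort(a, gt + 1, hi);
  return a;
}

const floats = Float64Array.from([5, 3, 9, 1, 3, 7, 2, 8, 0, 6]);
dualPivotQuicksort(floats);
```

Because the comparisons are plain numeric `<` and `>` on a typed array, there is no comparator callback and no boxing, which is where much of the speedup over a generic array sort tends to come from.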

jasondavies · on March 19, 2012

For sorting numbers, you might also want to look at Radixsort.js, which takes O(n) time, although it isn't in-place like quicksort:

https://github.com/jasondavies/radixsort.js

I haven't finished implementing the Float64Array version, but it beats everything else I've compared it with even for relatively small sizes e.g. my benchmark is 65,536 floats and it's already around 2.5x faster than native sort (using Node.js)! Admittedly, Float32Array vs. native sorting of 64-bit floats is not a fair comparison, but you could argue that many applications would get away fine with 32-bit floats anyway. :)
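To make the O(n) claim concrete, here is a rough sketch of a least-significant-digit radix sort for 32-bit floats. It is not the Radixsort.js source: `radixSortFloat32` is a hypothetical name, and the approach (reinterpret the float bits as unsigned integers, remap them so integer order matches float order, then do four 8-bit counting passes) is one common way to do it. It uses a scratch buffer, so like the library described above it is not in-place. NaN values are not handled specially.

```javascript
// Illustrative sketch only; not the Radixsort.js implementation.
function radixSortFloat32(input) {
  const n = input.length;
  // Copy the input and view the same bytes as unsigned 32-bit keys.
  let keys = new Uint32Array(Float32Array.from(input).buffer);
  let scratch = new Uint32Array(n);

  // Remap bits so unsigned integer order equals float order:
  // negative floats get all bits flipped, others get the sign bit set.
  for (let i = 0; i < n; i++) {
    const k = keys[i];
    keys[i] = (k & 0x80000000) ? ~k : (k | 0x80000000);
  }

  // Four stable counting-sort passes, one per byte, lowest byte first.
  const counts = new Uint32Array(256);
  for (let shift = 0; shift < 32; shift += 8) {
    counts.fill(0);
    for (let i = 0; i < n; i++) counts[(keys[i] >>> shift) & 0xff]++;
    // Turn counts into starting offsets for each bucket.
    let sum = 0;
    for (let b = 0; b < 256; b++) {
      const c = counts[b];
      counts[b] = sum;
      sum += c;
    }
    for (let i = 0; i < n; i++) {
      const k = keys[i];
      scratch[counts[(k >>> shift) & 0xff]++] = k;
    }
    [keys, scratch] = [scratch, keys];
  }

  // Undo the key remapping to recover the original float bits.
  for (let i = 0; i < n; i++) {
    const k = keys[i];
    keys[i] = (k & 0x80000000) ? (k & 0x7fffffff) : ~k;
  }
  return new Float32Array(keys.buffer);
}

const sortedFloats = radixSortFloat32([3.5, -1.25, 0, 2, -7.5, 100, 0.5]);
```

Each pass touches every element a constant number of times, so the total work is proportional to n times the number of key bytes, with no comparisons at all.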

jasondavies · on March 19, 2012

Got inspired and it now supports Float64Array. :)

drewda · on March 19, 2012

Good to know. I'll have to consider switching away from Underscore's sort, which is probably equally slow.

By the way, quicksort.js is code you wrote, right? I assume that's too nice and concise to be from the Dart generator...

jashkenas · on March 19, 2012

Underscore doesn't have a "sort" apart from the native Array.prototype.sort, but it does have a "sortBy", which is something else entirely.

drewda · on March 19, 2012

The sortBy method is just producing an integer array that is then sorted using each browser's implementation of Array.prototype.sort. Is that a proper reading of the code?

jashkenas · on March 19, 2012

Bingo. You got it.
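For anyone following along, the pattern being described can be sketched roughly like this. This is not Underscore's actual source; `sortBySketch` is a made-up name, and the index tiebreak is an assumption added here to keep the result stable. The point is only that the criterion is computed once per element and the real sorting is delegated to the native Array.prototype.sort.

```javascript
// Illustrative sketch only; not Underscore's implementation.
function sortBySketch(list, iteratee) {
  return list
    // Compute each element's sort criterion exactly once.
    .map((value, index) => ({ value, index, criterion: iteratee(value) }))
    // Delegate the ordering to the engine's native sort.
    .sort((a, b) => {
      if (a.criterion < b.criterion) return -1;
      if (a.criterion > b.criterion) return 1;
      return a.index - b.index; // tie: keep original order
    })
    // Strip the bookkeeping and return the original values.
    .map((entry) => entry.value);
}

const byLength = sortBySketch(["pear", "fig", "banana"], (s) => s.length);
// byLength is ["fig", "pear", "banana"]
```

So its speed is bounded by whatever the browser's native sort can do, plus the cost of the two map passes.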

mbostock · on March 19, 2012

Right; I ported it from the Dart implementation, not from the generated JavaScript.

nodata · on March 29, 2012

Update: Square have changed the name to Crossfilter:

"Renamed to Crossfilter, partly in homage to Chris Weaver's work on multidimensional visualization. It may not have the intrigue of "tesseract", but it does describe the library's function succinctly."

-- https://github.com/square/crossfilter/issues/1#issuecomment-...

troels · on March 19, 2012

That's timely. Recently I've been looking around for various widget interfaces to explore multidimensional data. This looks quite useful - right up there with the good old pivot table.

What do you think of parallel coordinates as a widget? (F.ex. http://exposedata.com/parallel/veggie/)

mnutt · on March 19, 2012

This is really great work, and I can't wait to see some of it filter into Square's Cube project.

mukaiji · on March 19, 2012

I got a preview of this project from Square's CTO 3 weeks ago at Stanford. The fast-filtering is MIND-BLOWING.

Tyr42 · on March 20, 2012

It'd be cool if it could, say, show the data for all weekends at once, skipping over the weekdays.

swah · on March 20, 2012

So mbostock works for Square?

huggyface · on March 19, 2012

Thank you for the release. I've been heavily promoting the web stack for over a decade and a half, yet I'm still surprised by what it is capable of. This also provides a clear demonstration of the power of algorithms even on an imperfect implementation. Excellent, clean API as well.