Square open-sources Tesseract: fast filtering for coordinated views (square.github.com)
263 points by mbostock on March 19, 2012 | 42 comments



Please don't call it tesseract, we already have a prominent open source project with that name: http://code.google.com/p/tesseract-ocr/


Calling it a "prominent open source project" is an understatement. Granted, it's not perfect, but it's the de facto standard in FOSS OCR software. Mention "tesseract" to just about anyone in the data mining/machine learning/artificial intelligence communities (which is pretty much the target user base), and they will automatically think you're referring to the OCR software.


I was just about to come here and lament this very fact.

Because I work on DocumentCloud, I touch on both OCR software and frontend JS libraries.

It will only make things confusing to have these two projects fighting over the same namespace. And the Tesseract OCR engine is not going away. It's the de facto standard for FOSS OCR.

Edit: Having played around with the examples for this lib, it's awesome! I will have to find a project to work this into. Looks like fun.


Concur -- fantastic tool, unfortunate name, due to the overlap.

Cubic Prism is another name for tesseract and could also remind one of both SQL data rollup and looking at data in new lights.


Apologies; I wasn't aware of the Tesseract OCR project until very recently, and then I hoped there would not be much harm given that the two projects are so unrelated. (The name "Tesseract" was the natural progression from "Square" and "Cube".) What's the saying? There are only two hard things in Computer Science: cache invalidation and naming things.


http://en.wikipedia.org/wiki/Tesseract "In geometry, the tesseract, also called an 8-cell or regular octachoron or cubic prism, is the four-dimensional analog of the cube. The tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of 6 square faces, the hypersurface of the tesseract consists of 8 cubical cells."

Looks like the perfect name to me. It sounds like the "other" project already has an established name of "tesseract ocr", so I don't see any reason why this library would be confused with it. Lame for people to focus on this instead of the crazy beautiful API that comes with this thing:

https://github.com/square/tesseract/wiki/API-Reference


If you were going to make software that lets you see places through a webcam, would you call it Windows, even if it's a fitting name?

Tesseract goes by the name Tesseract, tesseract-ocr is just the name of the Google code project.


If we're going to talk about time, let's not forget a book I read as a child - http://en.wikipedia.org/wiki/A_Wrinkle_in_Time. I believe they win the "who used tesseract first" award. At least as far as this thread is concerned for now.

To be honest though, my view is very biased because I don't care about the ocr project at all. I'm sure it's a very nice project, but hardly a tesseract really.. =p


There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors.


Unless you're 37Signals and then it's Javascript and naming things.

https://37signals.com/svn/posts/3112-how-basecamp-next-got-t...


I'd go with "Tessquare", or some combination of Square and Tesseract. I'd definitely change the name. Apart from the name talk, congratulations, and thank you for opening this up to the community.


Or Squaresseract :)


Squesseract or STesseract also have potential for disambiguation and googleability, without being too forgettably eccentric.


If you're settled on "Tesseract", perhaps a small tweak would suffice to clear up the potential confusion. Call it "Tesseract.js".


That would lead me to believe that it's 'Tesseract OCR, but in JavaScript!'. I think a more substantial name change will be required because of how prominent Tesseract is.


Tesseroct maybe..

Tesserect? ;p

Mind you, Tesseract OCR did miss a chance to be TesseractOCRus, which would be great, obviously.


There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors.

Another name option would be "hypercube," which means the same thing. It does make it sound more related to "cube" though.


"Octachoron" would work just as well (progression from "tetragon" and "hexahedron").


Hypercube?


Call it "4-square". Nobody will get confused then.


n-dimensional hypercube


Hypercubes are already n-dimensional; that's not adding any extra information.


Bug filed on their GitHub issues page:

https://github.com/square/tesseract/issues/1


+1 ... The title totally confused me. I use Tesseract already, and it's a popular OCR project.


I'm more confused about the name of the company. To me, Square was the maker of Final Fantasy and other role playing games in the golden years of 8- and 16-bit video game consoles.

http://en.wikipedia.org/wiki/Square_(company)


Would you prefer Ono-Sendai, or Ice-9?


@mbostock: Why is there Dart code in the project? Just because that's where you borrowed one of the sort functions from?

Thanks for another useful library!


Yep. Typed arrays don't have a built-in sort method, and even the built-in array.sort is extremely slow. I ported Dart's dual-pivot quicksort implementation, which reduced the time to sort 1M floats from ~2.5s to ~350ms (timed in Node v0.6.2).

https://github.com/square/tesseract/blob/master/src/quicksor...
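For anyone unfamiliar with the technique named above, here is a minimal sketch of a dual-pivot quicksort over a typed array. It is not the code from the linked quicksort.js, nor Dart's implementation. The names `swap` and `dualPivotQuicksort` are hypothetical, and it simply takes the first and last elements as pivots, where tuned implementations sample their pivots and switch to insertion sort on small ranges. It also assumes the input contains no NaN values.

```javascript
// Swap two elements of an array or typed array in place.
function swap(a, i, j) {
  const t = a[i];
  a[i] = a[j];
  a[j] = t;
}

// Sorts a[lo..hi] in place, ascending, and returns a.
// Sketch only: naive pivot choice means deep recursion on already-sorted input.
function dualPivotQuicksort(a, lo = 0, hi = a.length - 1) {
  if (lo >= hi) return a;

  // Use the two ends as pivots, making sure p <= q.
  if (a[lo] > a[hi]) swap(a, lo, hi);
  const p = a[lo];
  const q = a[hi];

  // Invariant while scanning:
  //   a[lo+1 .. lt-1] <  p
  //   a[lt   .. i-1 ] is between p and q (inclusive)
  //   a[gt+1 .. hi-1] >  q
  let lt = lo + 1;
  let gt = hi - 1;
  let i = lo + 1;
  while (i <= gt) {
    if (a[i] < p) {
      swap(a, i, lt);
      lt++;
      i++;
    } else if (a[i] > q) {
      swap(a, i, gt);
      gt--; // the element swapped in from the right has not been examined yet
    } else {
      i++;
    }
  }

  // Move the pivots into their final positions.
  lt--;
  gt++;
  swap(a, lo, lt);
  swap(a, hi, gt);

  // Recurse on the three partitions.
  dualPivotQuicksort(a, lo, lt - 1);
  dualPivotQuicksort(a, lt + 1, gt - 1);
  dualPivotQuicksort(a, gt + 1, hi);
  return a;
}
```

Calling `dualPivotQuicksort(new Float64Array([3, 1, 2]))` sorts the typed array in place and returns it.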


For sorting numbers, you might also want to look at Radixsort.js, which takes O(n) time, although it isn't in-place like quicksort:

https://github.com/jasondavies/radixsort.js

I haven't finished implementing the Float64Array version, but it beats everything else I've compared it with, even at relatively small sizes: my benchmark is 65,536 floats, and it's already around 2.5x faster than native sort (using Node.js)! Admittedly, Float32Array vs. native sorting of 64-bit floats is not a fair comparison, but you could argue that many applications would get away fine with 32-bit floats anyway. :)
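To illustrate the general idea of an O(n), not-in-place radix sort on 32-bit floats, here is a rough sketch. It is not taken from radixsort.js. The function name `radixSortFloat32` is hypothetical, and the key transform (flip every bit of negative floats, flip only the sign bit of the others) is one common way to make IEEE 754 bit patterns sort as unsigned integers. The sketch assumes the input has no NaN values.

```javascript
// Returns a new, ascending-sorted Float32Array; the input is left untouched.
function radixSortFloat32(input) {
  const n = input.length;
  const floats = Float32Array.from(input);    // working copy
  let keys = new Uint32Array(floats.buffer);  // same bytes viewed as integers
  let scratch = new Uint32Array(n);           // second buffer: not in place

  // Map float bit patterns to unsigned keys whose integer order
  // matches the numeric order of the floats.
  for (let i = 0; i < n; i++) {
    const k = keys[i];
    keys[i] = (k & 0x80000000) ? ~k : (k | 0x80000000);
  }

  // Four least-significant-digit passes, 8 bits per pass.
  const counts = new Uint32Array(256);
  for (let shift = 0; shift < 32; shift += 8) {
    counts.fill(0);
    for (let i = 0; i < n; i++) counts[(keys[i] >>> shift) & 0xff]++;

    // Turn the histogram into starting offsets for each bucket.
    let sum = 0;
    for (let b = 0; b < 256; b++) {
      const c = counts[b];
      counts[b] = sum;
      sum += c;
    }

    // Stable scatter into the other buffer, then swap buffer roles.
    for (let i = 0; i < n; i++) {
      const k = keys[i];
      scratch[counts[(k >>> shift) & 0xff]++] = k;
    }
    [keys, scratch] = [scratch, keys];
  }

  // Undo the key transform so the bits are valid floats again.
  for (let i = 0; i < n; i++) {
    const k = keys[i];
    keys[i] = (k & 0x80000000) ? (k ^ 0x80000000) : ~k;
  }
  return new Float32Array(keys.buffer);
}
```

Each pass touches every element a constant number of times, which is where the linear running time comes from; the price is the extra `scratch` buffer.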


Got inspired and it now supports Float64Array. :)


Good to know. I'll have to consider switching away from Underscore's sort, which is probably equally slow.

By the way, quicksort.js is code you wrote, right? I assume that's too nice and concise to be from the Dart generator...


Underscore doesn't have a "sort" apart from the native Array.prototype.sort, but it does have a "sortBy", which is something else entirely.


The sortBy method is just producing an integer array that is then sorted using each browser's implementation of Array.prototype.sort. Is that a proper reading of the code?


Bingo. You got it.
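The pattern described in this exchange (compute a sort key per element, hand the result to the native Array.prototype.sort, then unwrap) can be sketched roughly as follows. This is not Underscore's source: the name `sortBy` is reused only for illustration, the wrapper-object shape is an assumption, and the index tie-breaker is an addition to keep equal keys in their original order.

```javascript
// Returns a new array sorted by the key that iteratee computes for each element.
function sortBy(list, iteratee) {
  return list
    // Decorate: pair each value with its sort key and original position.
    .map((value, index) => ({ value, index, criteria: iteratee(value) }))
    // Sort with the engine's native Array.prototype.sort.
    .sort((left, right) => {
      if (left.criteria < right.criteria) return -1;
      if (left.criteria > right.criteria) return 1;
      return left.index - right.index; // tie-breaker (an addition in this sketch)
    })
    // Undecorate: keep only the original values.
    .map((entry) => entry.value);
}
```

Because the actual ordering is delegated to the native sort, the speed of a helper like this is bounded by the speed of the engine's own implementation.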


Right; I ported it from the Dart implementation, not from the generated JavaScript.


Update: Square have changed the name to Crossfilter:

"Renamed to Crossfilter, partly in homage to Chris Weaver's work on multidimensional visualization. It may not have the intrigue of "tesseract", but it does describe the library's function succinctly."

-- https://github.com/square/crossfilter/issues/1#issuecomment-...


That's timely. Recently I've been looking around for various widget interfaces for exploring multidimensional data. This looks quite useful, right up there with the good old pivot table.

What do you think of parallel coordinates as a widget? (E.g. http://exposedata.com/parallel/veggie/)


This is really great work, and I can't wait to see some of it filter into Square's Cube project.


I got a preview of this project from Square's CTO 3 weeks ago at Stanford. The fast-filtering is MIND-BLOWING.


It'd be cool if it could, say, show the data for all weekends at once, skipping over the weekdays.


So mbostock works for Square?


Thank you for the release. I've been heavily promoting the web stack for over a decade and a half, yet I'm still surprised by what it is capable of. This also provides a clear demonstration of the power of algorithms even on an imperfect implementation. Excellent, clean API as well.



