
Code iceberg is in the eye of the beholder. Newly started bizdev people consistently underestimate the time required for certain well-exercised tasks.

Some of the most common icebergs are:

-form validation (seriously -one of the most highly exercised user-interaction paths; it's all over the place, and scales semi-exponentially with the number of fields)

-search ("how hard could it be? you just put an input form there, then figure out what the user thought, then display it" -exact quote)

-anything that has to process natural language. I mean everything. Wanna split up a text into sentences? How do you differentiate between dr. mr., 2004. jun. , and valid sentence-enders? Generating an indefinite article ("a", "an") before a noun? Keep in mind that 1, 2, @, $, = and other characters might also be valid noun first-letters :) etc.
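To make the natural-language point concrete, here is a minimal sketch of the "obvious" approach and where it breaks. The abbreviation list and both function names are made up for illustration; a real system would need far more than this.

```python
import re

# Hypothetical, deliberately incomplete abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "jun"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace,
    unless the word before a period is a known abbreviation or a number."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # keep the punctuation mark
        words = text[start:end].split()
        last = words[-1].rstrip(".!?").lower()
        if text[match.start()] == "." and (last in ABBREVIATIONS or last.isdigit()):
            continue  # looks like "dr." or "2004.", so do not split here
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def indefinite_article(noun):
    """Naive rule: 'an' before a vowel letter, 'a' otherwise."""
    return "an" if noun and noun[0].lower() in "aeiou" else "a"


print(split_sentences("Dr. Smith arrived on 2004. jun. 3 with Mr. Jones. He left early. Did he?"))
print(split_sentences("It happened in 2004. Nobody noticed."))
print(indefinite_article("hour"), indefinite_article("8-bit CPU"))
```

The first call splits into three sentences as hoped. The second returns a single "sentence", because the rule that protects "2004. jun." also swallows a sentence that really does end in a year. The article helper returns "a" for both "hour" and "8-bit CPU", which is wrong in both cases since the choice depends on pronunciation, not spelling.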

In my experience, the best anti-iceberg pattern is to follow a portfolio approach: for each requirement that smells like an iceberg, have a fallback plan in place, i.e. after N hours of sunk investment, execution shifts to plan B. This usually works out much better than banging away on the same problem for days.




A fun example I had to deal with was a big site which processed lots of terribly formatted data to build its content. One particular rule for processing incoming data relied on breaking a big blob of text into very specific fields based on where capital letters fell.

We had made a point of asking before the project if regionalisation was ever going to be an issue and no, it would only ever be in English. Shortly after go live we were asked to regionalise everything into Chinese.

I'm still not sure what a capital letter looks like in Chinese.
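For what it's worth, Unicode agrees there isn't one. A small sketch, assuming a hypothetical field rule of "a new field starts at each capital A-Z" (the actual rule used on that site isn't described here):

```python
import re
import unicodedata


def split_on_capitals(blob):
    """Hypothetical field rule: each field starts at an ASCII capital letter."""
    return re.findall(r"[A-Z][^A-Z]*", blob)


def has_case(ch):
    """True if the character has distinct upper- and lower-case forms."""
    return ch.upper() != ch.lower()


print(split_on_capitals("NameAddressPhone"))  # three fields
print(split_on_capitals("姓名地址电话"))        # no fields at all

for ch in ["A", "a", "中"]:
    # Latin letters are category Lu/Ll; CJK ideographs are Lo ("Letter, other").
    print(ch, unicodedata.category(ch), ch.isupper(), has_case(ch))
```

Chinese characters fall in the caseless "Lo" category, so any rule keyed on capitalisation simply never fires on that input.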


"How do you differentiate between dr. mr., 2004. jun. , and valid sentence-enders?"

Fun fact: there exists a convention stipulating a double space after a period that ends a sentence. Not that I'm advocating relying on this for any serious purposes.


HTML ended that. I have followed all periods with two spaces in this post. How can you tell on the screen?

In fact I just checked, and HN is honoring the two spaces, in that they are output to the actual HTML sent to your browser. And of course, yes, there were other trends that would have ended this anyhow: "two spaces" is meaningless in a non-monospaced font, and ever more stuff is going proportional as the computing power necessary to do that continues its steady march from "prohibitive" to "trivial". But the WWW certainly beat the corpse to death again.


The monospaced or proportional fonts aren't the issue; it's that HTML treats all sequences of whitespace within text as a single space. It's quite annoying to those of us who like our double-spaced sentences. When a word processor generates HTML, every double-spaced sentence ends with this:

  [space] 
Or even this:

  [space]<span class="something-about-space">[space]</span>
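A rough sketch of why that workaround exists. This is only an approximation of how a browser collapses whitespace in ordinary text (it ignores `<pre>` and CSS `white-space` settings), and the function name is made up:

```python
import re

NBSP = "\u00a0"  # the character that the &nbsp; entity stands for


def collapse_whitespace(text):
    """Approximate normal HTML rendering: each run of spaces, tabs and
    line breaks becomes one space. The class is spelled out because
    Python's \\s would also match the no-break space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


double_spaced = "End of one.  Start of next."
with_nbsp = "End of one." + NBSP + " Start of next."

print(repr(collapse_whitespace(double_spaced)))  # the second space is gone
print(repr(collapse_whitespace(with_nbsp)))      # no-break space + space survive
```

The no-break space is not part of the collapsible set, so pairing it with an ordinary space is how generated HTML keeps the visual gap.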


HTML and proportional fonts are two separate issues. HTML ignores the extra space, which is one problem; two spaces in a proportional font being less immediately obvious than on a typewriter is another problem. You can come up with some other issues too if you think about it. It all adds up to a dead tradition.


It is not meaningless in a non-monospaced font; two spaces should still be wider than one space. I use a double space after a sentence in all of my technical writing, and I'd be extremely disappointed if my wordprocessing package did not honour that.


LaTeX would not honour it. But you can ask it to treat the space at the end of a sentence differently from regular spaces.


I grew up with that drilled into my head in typing class but, come to find out, now it is not followed in many cases. Most typefaces should only have one space after the period. http://www.wsu.edu/~brians/errors/spaces.html

It took me forever to stop hitting the space bar twice after ending a sentence.


For my thesis, the easiest thing was to do a find/replace on ". " with ". " when I was done.


And because of the tendency for a browser to strip whitespace from HTML to render pages readably, those two things look exactly the same.

If you want consecutive spaces in a bare HTML page, you have to use "&nbsp;" which doesn't work here because pg isn't a moron.


You can use the appropriate special character instead of the entity.  Like this.


Because you aren't a moron.


I am though :(


TeX


This is far from universal. I know that when I studied journalism, I had to unlearn that habit - I think the official style guide for the industry mandates one space.


Indeed, it seems to have gone out of fashion at the same time as the mechanical typewriter. http://en.wikipedia.org/wiki/Double_spacing_at_the_end_of_se...


That was an old typewriter convention. I learned typing on a typewriter in high school. Word processors in the '80s eliminated that convention. They would automatically perform the proper spacing.


There is also a convention of capitalizing the first letter of a sentence. The example text demonstrates that corpus quality (forgive the fuzzy terminology) trumps clean, simple rules.


Yep. First letter of a sentence, or title, or any other proper noun. So... how can your program tell which it is? The correct algorithm approaches AI in complexity.


In data/text mining disciplines, correctness is fuzzy, rather than being boolean. For most industry-wide applications, having a solution that covers 99.9% of the cases (1 misclassification in 1000 samples) is well within acceptable bounds.

So, one particular solution with these performance characteristics is to train a classifier on a bunch of training data, e.g. a decision tree or a maximum entropy classifier. Add some sample data from any openly available corpora (or fire up mturk and create your own), and you're pretty much done with it.
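As a rough illustration of what such a classifier would be fed, here is a sketch of per-period feature extraction. The feature names are invented for this example, and the training step itself (decision tree, maximum entropy, or anything else) is left out:

```python
import re


def boundary_features(text, index):
    """Describe the '.' at text[index] so a classifier can decide whether
    it ends a sentence. All feature names here are hypothetical."""
    before = text[:index].split()
    after = text[index + 1:].split()
    prev_word = before[-1] if before else ""
    next_word = after[0] if after else ""
    return {
        "prev_len": len(prev_word),
        "prev_is_digit": prev_word.isdigit(),
        "prev_is_capitalized": prev_word[:1].isupper(),
        "next_is_capitalized": next_word[:1].isupper(),
        "followed_by_two_spaces": text[index + 1:index + 3] == "  ",
    }


def candidates(text):
    """Return (position, features) for every period in the text."""
    return [(m.start(), boundary_features(text, m.start()))
            for m in re.finditer(r"\.", text)]


for position, features in candidates("Dr. Smith left.  He returned."):
    print(position, features)
```

Conventions mentioned elsewhere in this thread (double spaces, capitalised sentence starts) show up here as weak signals rather than hard rules; the labelled data decides how much each one counts.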

Of course, sentence-tokenization is only the tip of the iceberg :)



