
Japanese is usually written without spaces. Words and sentences just run into each other. When writing in hiragana (syllabic characters), word boundaries are often ambiguous.

Englishwouldbemuchhardertoparseifwrittenlikethis.


I have no stake in natural language processing, but it looks to me like a computer might be able to do a pretty good job at splitting that given a dictionary.


Sure, you can get pretty far with a fairly simple solution. But a lot of the time, you get two (or more) ways to split the string into dictionary words. For a simple English example, is it "justice was served" or "just ice was served"?
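As a toy illustration (hypothetical dictionary and code, not from any real segmenter), a naive dictionary-based splitter finds both readings:

```python
# Enumerate all ways to split a spaceless string into dictionary words.
# The dictionary here is a made-up toy example.
WORDS = {"just", "ice", "justice", "was", "served"}

def segmentations(s):
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in WORDS:
            for rest in segmentations(s[i:]):
                results.append([prefix] + rest)
    return results

print(segmentations("justicewasserved"))
# Both ["justice", "was", "served"] and ["just", "ice", "was", "served"] appear.
```

The ambiguity is real: the dictionary alone cannot tell you which split was intended.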


I guess that’s where context will have to be considered. Those two are valid sentences, so presumably humans are using context to distinguish between them, right?


The murderer came to my dinner party, and I had it all planned. In one of the ice cubes, I had frozen arsenic. The murderer would eat the same food, drink the same drink, and nobody would guess that they would die on leaving. When the evening was over, I knew what I would tell people.

Justicehadbeenserved.


Please, share this with the world on Twitter.


If you would like to, feel free. For myself, I think that the comment's context of showing how ambiguity may not be resolved merely by contextual information is important, and that it would not stand as strongly without it.


The stochastic strategy is to:

1. enumerate every possible tag combination,

2. assign a probability to each one,

3. choose the parse with the highest probability.

1. can be done either deterministically or stochastically.

2. requires you to have a language model trained on a human-tagged or semi-human-tagged corpus.

3. was just the Viterbi algorithm last time I looked.

Implementing 1 and 2 requires broad domain knowledge in two very different domains (linguistics and machine learning, respectively).

So while sentence segmentation can nowadays be considered a solved problem, it's far from trivial to implement a segmenter that competes with the state of the art on real-world data.
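A minimal sketch of step 3 (with made-up unigram probabilities standing in for a trained model): a Viterbi-style dynamic program that keeps the best-scoring split of each prefix. A real decoder would also handle tag sequences and unknown words.

```python
import math

# Toy unigram log-probabilities, invented for illustration; a real model
# would be trained on a tagged corpus, as noted above.
LOGP = {"just": math.log(0.03), "ice": math.log(0.01),
        "justice": math.log(0.02), "was": math.log(0.2),
        "served": math.log(0.01)}

def best_segmentation(s, max_word_len=10):
    # best[i] = (score, words) for the best segmentation of s[:i].
    best = {0: (0.0, [])}
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_word_len), i):
            word = s[j:i]
            if word in LOGP and j in best:
                score = best[j][0] + LOGP[word]
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best.get(len(s), (None, None))[1]

print(best_segmentation("justicewasserved"))
# "justice" outscores "just" + "ice", so the single-word reading wins.
```

With these particular numbers, P("justice") = 0.02 beats P("just")·P("ice") = 0.0003, so the model resolves the ambiguity in favor of "justice was served".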

There is also a nice body of deterministic (rule-based) literature that is practically ignored nowadays.


But Japanese is not written as character soup. It mixes two (actually three) types of characters, with the "grammatical" sounds written in hiragana and most content sounds written in kanji. Since the grammatical sounds form a closed class and tend to occur at word boundaries, it turns out to be relatively simple to separate words.
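A rough sketch of that heuristic, using nothing but Unicode code-point ranges (real tokenizers such as MeCab do far more): split wherever the script changes.

```python
def script(ch):
    # Coarse classification by Unicode block; ignores punctuation,
    # half-width forms, and other subtleties.
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def split_on_script_change(text):
    chunks = []
    for ch in text:
        if chunks and script(chunks[-1][-1]) == script(ch):
            chunks[-1] += ch
        else:
            chunks.append(ch)
    return chunks

# 私は本を読んだ ("I read a book"): the kanji/hiragana alternation
# approximates the word boundaries.
print(split_on_script_change("私は本を読んだ"))
```

Note that it wrongly splits the inflected verb 読んだ into 読 + んだ, which is exactly why practical segmenters still need a dictionary on top of the script heuristic.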


Isn't that also the case when parsing other languages from speech? Are there any audible cues between words when we speak?


"A fox was brown" is a perfectly ordinary sentence in the active voice. "Speed was involved in an incident" can also be parsed as active, if "involved" is an adjective. If the sentence is passive, then the active equivalent is "An incident involved speed", which is hardly any better.

That "ultimate in passive voice" is certainly convoluted, but it has very little if anything to do with passive voice.


Factoring into powers of 2 seems to me like an unnecessary complication. It's possible to calculate an arbitrary power in O(log N) time without memoization.

  def __get_matrix_power(self, M, p):
    if p == 1:
      return M
    if p % 2 == 1:  # odd power: M^p = M * M^(p-1)
      return self.__multiply_matrices(M, self.__get_matrix_power(M, p - 1))
    else:  # even power: M^p = (M^(p/2))^2
      K = self.__get_matrix_power(M, p // 2)
      return self.__multiply_matrices(K, K)


Also, working on pairs of consecutive Fibonacci numbers (f_n, f_(n+1)) instead of the matrix [[f_(n+1), f_n], [f_n, f_(n-1)]] makes this much simpler.

  def fib(n):
    def fib2(n):
      # returns (f_n, f_(n+1))
      if n == 0:
        return 0, 1
      if n % 2:  # odd: step up from (f_(n-1), f_n)
        a, b = fib2(n - 1)
        return b, a + b
      a, b = fib2(n // 2)  # fast doubling: f_2k = f_k * (2*f_(k+1) - f_k)
      return a * (2 * b - a), a * a + b * b
    return fib2(n)[0]


A maximum spanning tree might be misleading, as it's easy to interpret a missing edge as no correlation. When building a tree, weak correlations may be included out of necessity, while stronger ones that would create cycles are omitted.

If several dimensions are correlated just about equally strongly, you can get very different trees based on small random variation. There's no guarantee that all significant correlations are displayed, or that correlated dimensions are visually close to one another.
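A small sketch of that instability (toy correlations, hand-rolled Kruskal's algorithm): two correlation matrices differing by 0.01 in a single entry produce different trees, and each tree silently drops a correlation almost as strong as the ones it keeps.

```python
# Maximum spanning tree via Kruskal's algorithm on pairwise
# correlations (toy data, for illustration only).

def max_spanning_tree(corr, nodes):
    # corr: dict mapping frozenset({a, b}) -> correlation strength
    parent = {n: n for n in nodes}
    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for edge in sorted(corr, key=corr.get, reverse=True):
        a, b = tuple(edge)
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it doesn't close a cycle
            parent[ra] = rb
            tree.append(edge)
    return tree

nodes = ["A", "B", "C"]
# All three pairs are strongly correlated; the weakest edge gets dropped.
corr1 = {frozenset("AB"): 0.90, frozenset("AC"): 0.89, frozenset("BC"): 0.88}
# Perturb one entry by 0.01 and a different strong edge disappears.
corr2 = {frozenset("AB"): 0.90, frozenset("AC"): 0.88, frozenset("BC"): 0.89}
print([tuple(sorted(e)) for e in max_spanning_tree(corr1, nodes)])
print([tuple(sorted(e)) for e in max_spanning_tree(corr2, nodes)])
```

In both cases a correlation of 0.88 or higher is absent from the tree, and which one it is depends on noise-level differences in the data.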


I agree, it's not perfect - just a useful abstraction. Just like arbitrary correlation thresholds or a p<0.05 significance level - you often lose information but gain insight. From personal experience, I've seen MSTs map out underlying structures that validate the classical chemical kinetics of a system along a logical path: something that would not have been apparent with ordinary thresholding approaches.

Basically, IMO it's good to use all of these techniques together to get a good picture of your system. In the end, the greatest limitation is our human capacity to interpret the results, which frankly needs all the help it can get.


Thank you for the feedback. I prefer to use a graph instead of a tree because I want to spot clusters of relations.


I'm on a desktop and they're telling me to rotate my screen.


I agree. I'm not sure what kind of an algorithm they use for kerning, but it seems to ignore how letters interact optically. Pairs like "ve" and "ro" are set too far apart.


"VLC for Windows 8 might not be applicable for the store."

And that's a major problem. Without distribution on the Windows Store, this thing isn't going to see any kind of mass adoption. I'd be wary about funding until they work out whether their code and licenses pass the Store certification requirements.



It seems they are doing something similar, but with vertical scrolling/swiping. Swipe up from notifications to get to running apps, swipe again to get to the full list of apps.


Actually, A4 has an aspect ratio of one to the square root of two. Your point still stands, as this is even closer to the iPad screen than 2:3.
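The arithmetic, for the record (aspect ratios as width/height, iPad taken as 3:4 in portrait):

```python
ipad = 3 / 4          # iPad screen, portrait orientation
a4 = 1 / 2 ** 0.5     # A4 paper, 1 : sqrt(2), about 0.707
ratio_2_3 = 2 / 3     # the 2:3 figure from the parent comment

# A4 is indeed the closer match to the iPad screen.
print(abs(ipad - a4), abs(ipad - ratio_2_3))
```

The gap is roughly 0.043 for A4 versus 0.083 for 2:3, so 1:√2 is about twice as close.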


"The apps will allow for basic editing"

I find it interesting that Microsoft now has several versions of Office that don't support the full feature set of their file formats. There's already Office Web Apps and Office Mobile, and now the new apps. I wonder if they will settle on a single subset of features, a kind of Office Lite? Otherwise, it'll be quite confusing to tell what works in which app.


Perhaps that common feature set will be rebranded and sold as a new MS-Works?

