Hacker News
Show HN: 100K sentences mined from Wikipedia to help non-native English learners (buildmyvocab.in)
225 points by abhas9 on May 30, 2017 | 88 comments



I went to an interesting talk once at the Boston Python meetup, where a guy figured out how to order sentences so you could learn them in an order where you already knew the "other" words in each sentence. Basically, making a directed graph of vocabulary.

He was doing it to learn Latin, but you could do it for any language.


There's a very good book for Latin that uses that trick.

Goes from zero to extremely complex Latin. Whole book is in Latin, no translations.

https://www.amazon.com/Lingua-Latina-Illustrata-Pars-Familia...

The only requirement is knowledge of the orthographic alphabet and how each sound is produced. Latin, fortunately, has very simple sounds compared to English or Swedish.

It took me about 2 years to go through both parts and I was amazed at how easy the journey was. Could speak and write Latin fluently without issues.


Is any book like that available for contemporary languages?



This is an awesome resource, thanks a lot! Some of these books are of course quite old. For example, I would not recommend trying to learn contemporary German using this [1] book. Reading 'Fraktur' alone can be rather challenging …

[1] [Worman, J. H.: Erstes Deutsches Buch, nach der Natürlichen Methode, für Schule und Haus, American Book Company, New York 1880](https://vivariumnovum.it/edizioni/libri/dominio-pubblico/Wor...)


Holy moly, you made a languages nerd's day


I learned Dutch using a similar approach (https://www.amazon.com/Delftse-Methode-Nederlands-voor-buite...)


Don't you have to have the translations at some point, or at least some side channel (i.e. an illustration that the phrase refers to), in order to "ground the symbols"?


I have this book and I can tell you that it does use illustrations quite a bit (although most vocabulary is ultimately probably defined using other words). The very first chapter presents a labeled map of the Mediterranean region and begins:

"Rōma in Italiā est. Italia in Eurōpā est. Graecia in Eurōpā est. Italia et Graecia in Eurōpā sunt. Hispānia quoque in Eurōpā est. Hispānia et Italia et Graecia in Eurōpā sunt. Aegyptus in Eurōpā nōn est, Aegyptus in Āfricā est. Gallia nōn in Āfricā est, Gallia est in Eurōpā. Syria nōn est in Eurōpā, sed in Asiā. Arabia quoque in Asiā est. Syria et Arabia in Asiā sunt. Germānia nōn in Asiā, sed in Eurōpā est. Britannia quoque in Eurōpā est. Germānia et Britannia sunt in Eurōpā."

There are marginal notes highlighting things that the author wants you to notice or learn from the examples. Especially at the beginning, the marginal notes often do not discuss things in complete sentences but simply highlight particular grammatical features; for example the notes to the part I just quoted say "-a -ā: Italia...; in Italiā", which is supposed to make you realize that somehow the ending -a changes to -ā when something is "in" something (which later will be revealed to be the Latin ablative case), and "est sunt: Italia in Eurōpā est; Italia et Graecia in Eurōpā sunt", which is supposed to make you realize that sunt 'are' is the plural of est 'is'.


I want that, for every language I want to learn!

Edit: Before anyone says it, in my original comment, that should have been an e.g. instead of an i.e.


Thanks to you I just learnt the difference between those two. Thanks! (Also writing this comment reminded me of this https://xkcd.com/1053/ )


Are you aware of something similar for Biblical Greek?


stefan - I'm building http://pingtype.github.io for studying Chinese, and reading the Bible every day to practice.

I already forked the code to make a version to help Chinese speakers learn English.

I thought about Biblical Greek & Hebrew & Aramaic & Latin, but I wasn't sure if there's a market. Evidently there is! The biggest challenge is making a good dictionary. If I write the code, could you help to fix the dictionary?


How about this?

1. Get a frequency list. The most common word's rank is 1, the second is 2, etc. [0]

2. Then use your favorite Spaced Repetition Software (such as anki) to learn the words in that order.

3. Define a sentence's difficulty as the maximum rank over all its words. You could refine it by adding tie-breakers but I think it doesn't matter. Then sort the sentences in order of difficulty.

[0] See https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
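The three steps above can be sketched in a few lines of Python (the frequency ranks and sentences here are toy stand-ins for a real frequency list and corpus):

```python
from typing import Dict, List

def sentence_difficulty(sentence: str, rank: Dict[str, int]) -> int:
    """Difficulty = the rank of the rarest word in the sentence.

    Words missing from the frequency list get a large sentinel rank,
    pushing those sentences to the end.
    """
    return max(rank.get(w, 10**6) for w in sentence.lower().split())

def sort_by_difficulty(sentences: List[str], rank: Dict[str, int]) -> List[str]:
    # Tie-break on length so shorter sentences come first within a rank.
    return sorted(sentences, key=lambda s: (sentence_difficulty(s, rank), len(s)))

rank = {"the": 1, "a": 3, "on": 4, "cat": 120, "sat": 300, "mat": 950}
sents = ["the cat sat on a mat", "the cat", "a mat"]
print(sort_by_difficulty(sents, rank))
```

Once sorted, you would feed the sentences into your SRS in that order, so each new card introduces at most one unfamiliar word.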


Clozemaster already offers this for over 100 language pairings via the Fluency Fast Track feature, https://www.clozemaster.com.

Clozemaster's a game to learn and practice a language in context. The objective is to fill in the missing word in a given sentence for thousands of sentences. The missing word is the most difficult word in the sentence according to a frequency list for the language, and the Fluency Fast Track feature allows you to play a sentence for each unique missing word in order of difficulty like you described.


I just spent an hour playing the fast track.

As a Spanish learner with a strong foundation but struggling to get over the next hurdle this is just perfect. Thanks for this link!


Awesome - glad to hear!


Great idea but...

The word "run" is a relatively high frequency word. How many different meanings for "run" does a learner need to know? Is that in isolation? With collocations? As phrasal verbs?

In many languages, the most frequently appearing words also have the most varied meanings. Interestingly, many highly vernacular languages also use relatively few words, but those words have a lot of meanings that are clearly known by the speech communities.

FWIW, some theoretical linguists consider this a non-issue. People in the field (i.e., people who might die if I get this wrong) know otherwise.

While your idea sounds nice in principle, I hope you will accept the idea that reality may be slightly more complex.

* Collins Cobuild has 50+ meanings for "run" if phrasal verbs are included. Most non-native speakers are not even aware of the potential breadth of meanings it offers.


It's not just an idea, I've used it to learn about 2000 Hebrew words, albeit in a somewhat different form and along with other methods/materials.

I'm a somewhat experienced language learner. I'm fully aware that words have multiple meanings but that's not as problematic as you think. A lot of distinct meanings of "run" are related, so it's not like you need to memorize every one individually. Besides, they aren't all equally important. In many cases, those different meanings are paralled in my native language (or another language I know), e.g you can translate "run" as "correr" in "he runs", "the river runs..." and "they ran the risk...". More to the point, I always study words in context. In this method the context is provided by the rest of the sentence.


The blessing of language is that word frequency follows a power law. The first couple hundred words cover much of the language anyway.


This is https://en.wikipedia.org/wiki/Zipf%27s_law

The other side of the coin is that the long tail is long:

> the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. For large corpora, about 40% to 60% of the words are hapax legomena [words appearing exactly once], and another 10% to 15% are dis legomena [words appearing exactly twice]. Thus, in the Brown Corpus of American English, about half of the 50,000 words are hapax legomena within that corpus.

https://en.wikipedia.org/wiki/Hapax_legomenon

So the chance of seeing a hapax in any given sentence is really high.
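The hapax proportion is easy to measure on any corpus; a minimal sketch (the sample text is a made-up toy, so its ratio says nothing about real English):

```python
from collections import Counter

def hapax_fraction(text: str) -> float:
    """Fraction of distinct words that occur exactly once (hapax legomena)."""
    counts = Counter(text.lower().split())
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

text = "the cat sat on the mat while the dog slept"
print(hapax_fraction(text))  # 7 of 8 distinct words occur once
```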

What's worse, the rarer words in a sentence are often the important content words. When I try to decipher sentences in some language I don't know much of, what I end up understanding is often something like "And then he XXXXX-ed the YYYYY just like that hahaha" (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).

(Not that you shouldn't study the most frequent first, that's still a good rule.)


> (I understood 8/10 words! And I even know that word 4 is a past tense verb! But not at all the meaning).

That's an important point. Knowing 80% of the words in a text doesn't mean you understand 80% of its meaning, that is, you probably wouldn't get high marks on a basic comprehension test.

I've seen some studies indicating that you need to know at least (around) 95% of the words in a text in order to understand "enough" of it. (I don't have the links right now but could look them up at home if you're interested.)


Right, though I think the long tail is beyond the line between language and culture. There comes a point where additional words are not a matter of understanding utterances, but of following culture.

Effectively none of the English-speaking world would bother to say "Sochi" without the Olympics, but they knew the English language and had enough culture to understand from context that it is a place in Russia which hosted the 2014 Winter Olympics.

If you know enough of the language that you can ask "what's that?" at a non-disruptive rate in conversation (or look it up quickly in a dictionary or encyclopedia), I think that counts as good enough.


I built something like this for Mandarin Chinese at a company I worked for in Shanghai, but the company was acquired and sort of put out to pasture, and it never launched. :-(

Essentially, we took already word-segmented dialog (splitting Chinese sentences into individual words is non-trivial, so having it already segmented was super useful), matched it to words that you knew, and suggested the next lesson you should learn by the percentage of vocabulary that would be new or challenging for you. It was pretty awesome, would love to have another shot at it someday.
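For anyone curious what the segmentation step involves: greedy forward maximum matching is one classic baseline (I have no idea what that company actually used, and the lexicon below is a made-up toy):

```python
def segment(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon or length == 1:
                words.append(text[i:i + length])
                i += length
                break
    return words

lexicon = {"中文", "维基", "百科", "我", "喜欢"}
print(segment("我喜欢中文维基百科", lexicon))
```

Real segmenters handle ambiguity and out-of-vocabulary words far better than this, which is exactly why having pre-segmented dialog was such a win.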


I would love to have a chance to try it! A couple hundred hours into learning Chinese, and that sounds useful. Any chance you can release it?

....if not, what would you consider your best competition?


Sadly this was about six years ago. The code and the platform it was built on are long gone. I don't know of anyone else who is doing exactly the same thing, though Allset Learning in Shanghai (run by a friend and fellow ex-employee of the company I built this thing at) is making graded readers built around similar ideas. Not really the same as an adaptive system, though.


I'm working on http://pingtype.github.io which also does word spacing, and literal translations.


I would be really interested in a German version of this.

I wanted to do more or less the same: a "translator" that translates German text into German text, but replacing words you don't know (e.g. extracted from Memrise) with words you know. That way you can start reading texts in your foreign language without reaching for your dictionary every sentence.


Years ago I found a great example of this on the letter level for learning Cyrillic - takes only a couple minutes to run through and it's really satisfying:

http://www.alphadictionary.com/rusgrammar/alphabet.html


Don't Assimil courses work like that? Not word by word, but they definitely build up on common words to reach ever more complex sentences.


How do you decide on which sentences to use?

I'm interested in generating example sentences myself, but in a way that chooses sentences that are simple, easy to understand, and supportive of the word they are supposed to exemplify.

For example, "She got a car for her birthday, while she was traveling in Italy eating pizza" does not tell the reader anything about what a car is, or how the word should be used. However, "He drives his car to work" is a much better example of what a car is, what a common associated verb is, and how it fits in a sentence.

How do you optimise selection for sentences like the latter?


There's the linguistic principle of "You shall know a word by the company it keeps", so for any particular word you can identify which other words are most specifically related to that word, the simplest measure that can be used for that is freq(both words together)/freq(that other word in general).

That would allow you to prioritize sentences containing "getting a car" over "driving a car" - even if "getting a car" is more frequent, driving is more specific according to such a measure.
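That measure is easy to sketch, assuming whitespace-tokenized input and counting each word at most once per sentence (the example sentences here are invented):

```python
from collections import Counter

def association_scores(sentences, target):
    """Score each word by how specifically it co-occurs with `target`:
    freq(word in sentences containing target) / freq(word overall)."""
    word_counts = Counter()
    pair_counts = Counter()
    for s in sentences:
        words = set(s.lower().split())
        word_counts.update(words)
        if target in words:
            pair_counts.update(w for w in words if w != target)
    return {w: pair_counts[w] / word_counts[w] for w in pair_counts}

sents = [
    "he drives his car to work",
    "she drives her car downtown",
    "she got a car for her birthday",
    "her birthday is in june",
    "he walks to work",
]
print(association_scores(sents, "car"))
```

Here "drives" scores 1.0 (it only ever appears with "car"), while "birthday" scores 0.5, so sentences containing "drives" would be preferred as examples for "car".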


Hmm, maybe I've been overcomplicating the problem in my mind. You've given me some good ideas.

Bigrams, as your own example shows, are too simple: in both examples, "car" will get related to "a", instead of "getting" and "driving".

Maybe I should parse all sentences with a dependency parser and build dependency bigrams, then score sentences with frequency/inverse_freq and sentence length (shorter sentences are better).


That's a great question. Optimizing for sentence selection is important for teaching. For now, I have a simple check that filters out sentences which are longer than 160 characters.

Also, I believe that this is one thing which humans can do better. I, therefore, plan to add upvote & downvote buttons to rate the quality of sentences.


Up/down voting seems good.

I wonder if you might get a bit of a head start if you combine the shorter-sentence idea with selection based on higher n-gram counts. For instance, if the keyword plus the words on either side match a common n-gram, you could expect that sentence to be reasonably representative and boost it in the initial rankings compared to an n-gram with a much lower count.
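A rough sketch of that boost, assuming you already have trigram counts from some reference corpus (the counts below are entirely hypothetical):

```python
from collections import Counter

def rank_examples(sentences, keyword, trigram_counts):
    """Rank candidate example sentences: prefer sentences whose
    keyword-in-context trigram is common, then prefer shorter ones."""
    def score(s):
        words = s.lower().split()
        boost = 0
        for i, w in enumerate(words):
            if w == keyword and 0 < i < len(words) - 1:
                boost = max(boost, trigram_counts[(words[i - 1], w, words[i + 1])])
        return (-boost, len(s))  # high boost first, then shorter sentences
    return sorted(sentences, key=score)

# Hypothetical trigram counts from a large corpus
trigrams = Counter({("his", "car", "to"): 50, ("a", "car", "for"): 5})
cands = ["she got a car for her birthday", "he drives his car to work"]
print(rank_examples(cands, "car", trigrams))
```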


I know!

You could check to see if there are some verbs that are used predominantly with the word you're trying to generate sentences for. For example, I would expect "drive" appearing in a sentence to carry a higher than average probability of "car" also appearing in the sentence. Or "wind" for "watch", or "sit" for "chair" and "couch".

Then, I think sentences containing "car" that also contain the verb "drive" would probably give better clues for the meaning of "car" than verbs like "bought".

Just a thought.


That's the geek inside you talking. But the hard part was the word stats. Now that you have the 1K words, writing a thousand sentences manually to illustrate them is not really hard. It's one day of manual work, less than the work needed to figure out the automation, and with far better results.


I've always thought that the Simple English versions of Wikipedia articles were useful for non-native English speakers. https://simple.wikipedia.org/

Most people seem to be unaware of this Wikipedia aspect.


Furthermore, the Simple English version tends to be a very high-level TL;DR of the associated article.


Very cool! Although I feel that sometimes you really need a human touch to make it truly comprehensible. For instance, I randomly clicked on "antediluvian":

https://buildmyvocab.in/antediluvian/

Everything here will get you a "good enough" understanding of what the word means, but this is the only one that really comes close to explaining the word's literal meaning, and it's too vague to be of much use:

> any of the early patriarchs who lived prior to the Noachian deluge

A non-native speaker isn't going to have any idea what "Noachian" means (a native speaker probably isn't either unless they can explicitly identify "Noah" as the root), and "deluge" is part of the root of the word we're defining, so simply using the word "deluge" without explaining what it means doesn't really help.

In short, this is a good groundwork, but I think it needs a human editor to push the individual definitions from "acceptable" to "correct".


Thanks for your awesome feedback. And yes, this is just the initial groundwork and part of a larger experiment. We are also trying to teach English using Bollywood movies and GIFs[1]. I agree that human editing is very important, and an upvote/downvote feature is next in the pipeline.

[1] https://buildmyvocab.com/ddlj.html


Hey, cool project. Can this be used to learn French as an English speaker?


> In short, this is a good groundwork, but I think it needs a human editor to push the individual definitions from "acceptable" to "correct".

Good point. A StackOverflow-style upvoting system may provide exactly that.


I find there's a lot of material for studying isolated words, but as an engineer, analyzing the sentence patterns and grammar is more interesting.

I'm working on a project to do this for a database of Chinese grammar patterns. When there are enough example sentences for each pattern as structured data, we can then make games and other learning tools. For example: yīnwèi / 因为 / because http://cgram.rikai-bots.com/grammar/yinwei

Now there's a magnets game to try to use that pattern: http://cgram.rikai-bots.com/magnets/?cnames=yinwei

I would be happy to share the repo with anyone who's interested, or who wants to use the data to make other language learning games. PS: I did a similar thing for Japanese before, JGram.org, and it really helped me learn Japanese quickly.


In the same vein, for French translation, Linguee[1] uses many sources from websites of organisations that display official content in several languages (eg. the websites of the EU, of the Canadian Parliament...). The fact that it's official texts (eg. laws) makes it quite reliable.

[1] http://www.linguee.fr


This is pretty cool.

The second word I clicked was "cant"... and about half the sentences I saw were typos of "can't". So there's some bad data in there if you're trying to learn standard English, but it's good data if you want to understand things people actually write.

Anyway, time to go through and add some apostrophes to a few articles. :-)


Nicely done. You could add in a mailing list to send users a digest of new or top vocabulary words every week.


There's a subscription link at the bottom of https://buildmyvocab.com/


Second that - would also be interesting for English speakers learning a language like Chinese, where the average literate speaker can recognize about 5000 of the most frequently used characters.


galen - I'm making http://pingtype.github.io for learning Chinese! I don't think that studying characters is the most important thing though. Words are more important than individual characters! My program can help you type parts of a character, and put spaces between words automatically.


Is there a word list for TOEFL and/or IELTS?

I'm using a similar strategy (movies, music, Bible, articles) for studying Chinese. I'm using the TOCFL and HSK word lists. My friend uses a book with a list of 15000 vocabulary words by Morris Hill. I can't find a txt version though.


https://play.google.com/store/apps/details?id=com.buildmyvoc...

Is this your app, abhas? Quite interesting.


Yes, this app is also a part of an initiative to make learning simpler and fun.


This list is to help non-native English learners? Many native English speakers might have trouble with a few of these: abeyance, abscission, accretion, amalgamate, anodyne, antediluvian, apposite, arabesque, atavism, and avuncular.


As an Italian native with a classical studies background, these kinds of words are easy for me. They're almost all Latin-derived and they usually sound very similar to their Italian equivalents. You wanna know what's hard for us? The street talk. You know, like when you shoot the breeze before you really spill the beans about your shenanigans while riding shotgun in a friend's old jalopy.


Those are fairly rare words in day to day English. Some of those words I have a "feeling" for what they mean, I've definitely seen some of them in print. Most of those words I would look up if reading on a Kindle, just to check my feeling about the word. Have I used any of those words in my own writing or speech? Nope!

From the linked website, the best words to learn for a learner are from the 1000 Most Common English Words section.


Yes, besides those you listed (which are not common in normal conversation), there is a large number of Latin-derived words in English. As a rule of thumb, there are almost always two words in English with more or less the same meaning (or a near enough one): a non-Latin-derived one and a Latin-derived one.

Usually native English speakers will use the non-Latin one, but of course they generally also know the other one (or have at least heard or read it), so they can understand you all right. But I am told that if/once you are proficient enough in English, when you mistakenly use the Latin-based ones you sound snobbish, "upper class", or very formal (and as if you want to seem so).

A few examples (of common words):

apartment=flat < I never managed to get this right

obscurity=darkness

arms=weapons

annually=yearly

legal=lawful

constructor=builder

transmit/transmitted=send/sent

custodian=keeper


Exactly :)

My friends say that I usually sound "academic" more than snobbish.

Some time ago I witnessed this first-hand. I was at a speaking event, and the speaker was Canadian. He said: "If you ever had the fortune to encounter [Mister XYZ]". I would have definitely expected something like "If you were lucky to meet him" / "had the luck to have met him" or similar. He was native, but very academic indeed. :)


> use the latin based ones you sound like being snob or "upper class" or very formal (and wanting to seem so)

I'm guessing that distinction comes from the Norman Conquest, with the upper class speaking French with the Latin versions for 500 years and the lower class speaking a Germanic variety.


> You know like when you shoot the breeze before you really spill the beans about your shenanigans while riding shotgun on a friend's old jalopy.

That sounds like quite the night out...


Well, in the early morn' we called a cab to steam off the debauchery and quit hopscotchin' around :)


The goal (afaict) is not precisely conversational fluency, but rather an ability to pass the exams such as GRE and GMAT - inferred from the fact that the lists are mostly from "Barron's" guides for those exams.


Very nice! Thanks for sharing!

Some words are not found: https://buildmyvocab.in/affinity

(Just a little correction there: does not exist*)


Thanks for pointing this out. Will fix this soon.


Completely OT, but is there a mathematical explanation for why, when scrolling, the spaces between the words appear to form connected channels?


I think you're referring to word rivers (https://en.wikipedia.org/wiki/River_(typography)).


This is good stuff. I like the sentences part but I would put the definition before the sentences.

This would be a great foreign language tool too.


I created something similar using the Wordnik api. https://www.greedge.com/grewordlist/


Very cool. As others have said I think you should add definitions for the words (even a link off to an external one is fine) and pronunciation (with audio, perhaps link to forvo.com?) would be superb.


Mining canonical papers/text to generate standardized tests (SAT/GRE) might be a further step. My guess is that both tests and commercial prep-material are produced by committee.


It would be more intuitive if the meaning were presented first and then the example sentences.


While I think it needs a little polishing (a lot of wikipedia sentences are fairly hairy), I really like the core idea here. Keep up the good work.


Neat.

::clicks on a random word::

"We couldn't find any sentences for the word centripetal."

So... Why is it one of the chosen few?


What?! No "cromulent"?


Could something similar be done with other languages as well, say Simplified Chinese?


Simplified Chinese is a bit tricky: because Wikipedia is blocked by the Great Firewall, the Chinese Wikipedia is mostly written in Traditional Chinese.

Otherwise, if you can find a large corpus, segment it into words and do some basic statistics, you could build something like this for any language.

The most similar implementation I am aware of (using the word 中文 (Chinese) as an example) is http://ce.linedict.com/#/cnen/example?query=%E4%B8%AD%E6%96%...
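The basic-statistics step for an already-segmented corpus might look something like this sketch, which surfaces short sentences built from common words first (`build_examples` and its parameters are my own invention, not how buildmyvocab works):

```python
from collections import Counter, defaultdict

def build_examples(corpus_sentences, max_len=160, examples_per_word=3):
    """For each word, collect up to N short example sentences, preferring
    sentences whose rarest word is still relatively common."""
    freq = Counter(w for s in corpus_sentences for w in s.lower().split())
    def difficulty(s):
        # Inverse frequency of the rarest word in the sentence.
        return max(1 / freq[w] for w in s.lower().split())
    examples = defaultdict(list)
    for s in sorted(corpus_sentences, key=difficulty):
        if len(s) > max_len:
            continue  # skip overly long sentences, as the OP does
        for w in set(s.lower().split()):
            if len(examples[w]) < examples_per_word:
                examples[w].append(s)
    return examples

corpus = ["the cat sat", "the dog ran", "the cat ran"]
print(build_examples(corpus)["cat"])
```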


What about other languages?


I'm working on Chinese, and everything I do gets put into http://pingtype.github.io

This week I plan to finish making clips for words in movie subtitles.


1. I don't really understand what this is about. Having a description on the landing page would help.

> Barron's 800 Words list with example sentences

who is this Barron?

2. Please can you add pronunciation :D

3. Words need a definition as well; I'm not sure what some of these mean even with the examples.


Thanks for the feedback. I understand that the current UX is very unintuitive as this was a one day hack which I published online. I will add a description and search feature soon.

Regarding Pronunciation and definitions: I am exploring the possibility of integrating "Wordnet" by Princeton University. I will try to integrate it as soon as possible.


> who is this Barron?

http://barronstestprep.com/


What kind of pronunciation? IPA? Phonics?


>who is this Barron?

I think if you are asking that question you are probably not the target audience for this website.


I'd like to know who or what Barron is too.


that comment helped a lot


Can you make a container that lets you crawl any other language?


mined with a mithril axe


I am one of the learners of English, but I can't.. Any tips to help me get my English perfect? http://www.mrstatus.in/himbhoomi-jamabandi-copy/



