Archive for the ‘corpora’ Category

POS-Tagsets. A list.

Learn and teach writing in your second language on


To me, improving language learning with technology seems to have two avenues: artificial intelligence and human intelligence. Automated feedback on writing from proofing tools makes one wonder about the feasibility of the former, even though these tools have become smarter and more context-aware (in MS-Word 2007 and up) at spotting common errors like your/you’re or their/there. But the claim that automated essay-scoring tools, which have been developed and deployed (at least for ESL), score similarly to teachers makes one wonder about much more… Correcting writing remains expensive!

So maybe we should look into crowd-sourced writing correction, which needs no cutting-edge NLP, only well-understood web infrastructure to connect interested parties, but requires social engineering to attract and keep good contributors (and a viable business model to stay afloat: this site seems to be freemium).

Reading online comments and postings in your native language makes one wonder: can language teachers be replaced by crowdsourcing? I became aware of this language learning website, which offers peer correction of writing by native speakers, through a language learner corpus. I have not thoroughly evaluated the site, but the fact that its data is being used by SLA researchers seems a strong indicator that the work done on the website is of value.

To judge by the numbers accompanying the corpus (a snapshot from 2010; a newer version is available on request), these are the most-represented L2s on the site: [image]

Corpora, Treebanks, Word-Lists. A List.

How to work around the AntWordProfiler error “Cannot open the file”

  1. This seems to be a small bug in this otherwise great program. I started getting this error on Windows 7 64-bit for all files, regardless of size.
  2. It occurred to me to go to the menu: Settings / global settings / file settings / show full pathnames.
  3. Here is what you see. Note the duplicated path to the file.
  4. How did I get there? It seems you cannot take my usual preferred shortcut and paste the full file path into the browse dialog.
  5. If I browse to the file and select it, the botched double path does not appear.
  6. I can then process the file fine.
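
If you want to check a list of file paths for this kind of doubling before loading them, the idea can be sketched in a few lines of Python. This is only a sketch under an assumption: that the botched path looks like two absolute Windows paths run together (more than one drive specifier in one string). The example path and both function names are hypothetical, and the string AntWordProfiler actually builds may look different.

```python
import re

# Assumption: a doubled path contains more than one drive specifier,
# e.g. r"C:\corpora\C:\corpora\text.txt". The real string may differ.
DRIVE_SPEC = re.compile(r"[A-Za-z]:[\\/]")


def has_duplicated_root(path: str) -> bool:
    """Return True if more than one drive specifier occurs in the path."""
    return len(DRIVE_SPEC.findall(path)) > 1


def strip_duplicated_root(path: str) -> str:
    """Keep only the part of the path that starts at the last drive specifier."""
    matches = list(DRIVE_SPEC.finditer(path))
    if len(matches) <= 1:
        return path  # nothing doubled, return unchanged
    return path[matches[-1].start():]


if __name__ == "__main__":
    doubled = r"C:\corpora\C:\corpora\text.txt"  # hypothetical example
    print(has_duplicated_root(doubled))
    print(strip_duplicated_root(doubled))
```

Browsing to the file, as described above, remains the actual workaround; the snippet only helps to spot affected paths.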

Search Rhapsodie, a syntactic and prosodic Treebank of spoken French

  1. The Rhapsodie Treebank is made up of “57 short samples of spoken French (5 minutes long on average, amounting to 3 hours of speech and a 33 000 word corpus) endowed with an orthographical phoneme-aligned transcription”.
  2. Rhapsodie can be searched at
  3. View the list, read (1) the text or (2) the phonetic transcription, and click (3) and (4) to listen to the found segment.
  4. You can also search for text and download.
  5. The best part is obviously the markup and query language, which accordingly has a learning curve.
Categories: corpora, French, Listening, Speaking, websites

ELRA language corpora available in the LRC for research

The LRC has obtained a free research distribution of a 55 GB collection of language corpora from ELRA, the European Language Resources Association. This “big data” should be of interest for the translation program, as well as the language learning programs, since it enables corpus-linguistic approaches to language learning and automated production of learning materials based on natural language processing.

Here is an overview of the materials included:


A list of files included can be found here:

MS-Bing Dictionary for Chinese learners of English–and vice versa?

  1. Link:, powered by Engkoo:
  2. This looks like a pretty evolved learning tool: it has instant suggestions that include usage information and translations.
  3. It returns rich results that include contextual, parallel web-as-corpus matches in text and in text-to-speech (which, on spot-checking, is barely noticeable as computer-generated).

American National Corpus English Word Frequency Lists

The American National Corpus list is long (~300k) and lemmatized.
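
As a rough illustration of what "lemmatized" means for a frequency list, here is a minimal Python sketch that counts inflected forms under their lemma. It is not the ANC's actual procedure: the tiny `LEMMAS` table and the function name are made up for the example, and a real list would rely on a proper lemmatizer or the corpus's own annotation.

```python
from collections import Counter

# Hypothetical form-to-lemma table, for illustration only.
LEMMAS = {"goes": "go", "went": "go", "going": "go", "corpora": "corpus"}


def lemma_frequencies(tokens):
    """Count tokens by lemma; forms missing from LEMMAS count as their own lemma."""
    counts = Counter()
    for token in tokens:
        form = token.lower()
        counts[LEMMAS.get(form, form)] += 1
    return counts


if __name__ == "__main__":
    sample = "She goes where he went and both go".split()
    for lemma, count in lemma_frequencies(sample).most_common(3):
        print(lemma, count)
```

A word-form list would instead keep "goes", "went" and "go" as three separate entries.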