
Archive for the ‘Corpus-linguistics’ Category

How to use archive.org’s US-English news collection as a language-learning corpus with KWIC-like spoken samples

  1. Much of TV news nowadays amounts to little more than a constant stream of sound bites. However, precisely this brevity,
  2. the large archive and simple search interface: [screenshot]
  3. the research/browsing capabilities visible on the left here, including the varied sources – of which Arabic, French, and other European TV likely provide somewhat different perspectives on Edward Snowden –
  4. [screenshot]
  5. and the caption-like transcription make it all the more accessible for intermediate learners of English.
  6. [screenshot]
  7. Video clips of only 30 seconds are hardly enough for instruction; however, you can have students work with corpus-KWIC-like spoken samples and have them string a news history together if you design webquest-like research assignments – with the major added benefit that this corpus is spoken and trains listening.
  8. For more background info on archive.org’s transcribed TV news, consult this NYTimes article.
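For assignment design, it helps to be able to generate search links programmatically. A minimal sketch that builds a query URL against archive.org’s generic advancedsearch endpoint – the `tvnews` collection identifier and the assumption that this reaches the caption index are mine; the TV-news web interface itself may use different parameters:

```python
from urllib.parse import urlencode

def tv_news_search_url(phrase, rows=20):
    """Build a search URL for archive.org's TV News collection.

    Assumes the generic advancedsearch endpoint and the collection
    identifier "tvnews"; adjust after checking the site's own URLs.
    """
    params = {
        "q": f'collection:(tvnews) AND "{phrase}"',  # exact-phrase search
        "output": "json",
        "rows": rows,
    }
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

url = tv_news_search_url("Edward Snowden")
```

A webquest handout could then pre-generate one such link per news event for students to compare coverage across sources.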

Query treebanks with Fangorn for English SLA?

SLA classes have benefitted from query interfaces to target-language text corpora as a source of inductive empirical examples. But corpora are at best POS-tagged – and queried – which constitutes a certain “impedance mismatch” with what SLA classes actually teach. The Fangorn very-large-treebank query-language beta demonstration page

[screenshot]

already looks interesting for analyzing English in SLA (hover over tree elements to highlight the corresponding text), including – thanks to its capability of graphically editing and refining queries from the search results – for demonstrations during face-to-face classes. I wonder whether corpora other than the Penn Treebank and Wikipedia (5k and 5000k sentences) will be made available online, and whether languages other than English will be supported.
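Fangorn-style path queries (e.g. an NP directly under a VP) can be approximated offline for classroom preparation. A self-contained sketch over a Penn-style bracketed parse – the tiny parser, the matcher, and the example sentence are illustrative stand-ins, not Fangorn’s actual query engine:

```python
def parse(s):
    """Parse a Penn-style bracketed tree into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        label = tokens[i + 1]          # token after "(" is the node label
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                children.append(child)
            else:                      # a leaf word
                children.append((tokens[i], []))
                i += 1
        return (label, children), i + 1
    tree, _ = helper(0)
    return tree

def find_child_pattern(tree, parent, child):
    """Yield subtrees labeled `parent` that have a direct `child` daughter."""
    label, children = tree
    if label == parent and any(c[0] == child for c in children):
        yield tree
    for c in children:
        yield from find_child_pattern(c, parent, child)

sent = "(S (NP (NN John)) (VP (VB saw) (NP (DT the) (NN dog))))"
matches = list(find_child_pattern(parse(sent), "VP", "NP"))
```

For real-size treebanks you would of course use Fangorn itself (or tgrep2); the point here is only to make the parent/child query idea concrete.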

A search interface to the EuroParl corpus

[screenshot]

(Note that on my IE9, the text in the right column appears “blacked out” – select it to view it, or use a different web browser.)

Setting up European Union translation memories and document corpora for SDL-Trados

  1. Our SDL-Trados installation allows the translation program to teach with this industry-standard computer-aided translation application. So far, however, we had no actual translation memory loaded into this translation software.
  2. The European Union is a powerhouse for translation and interpreting – at least for the wide range of its member languages, many of which are world languages – and makes some of its resources, which have been set up here before for translation and interpreting study, available to the community free of charge, as reported during a variety of LRECs.
    1. This spring, the Language Technology Group at the Joint Research Centre of the European Union updated their translation-memory offer; the DGT-TM can fill that void, at least for the European languages that have a translation component at UNC-Charlotte.
      1. We download on demand (too big to store: http://langtech.jrc.ec.europa.eu/DGT-TM.html#Download)
        1. Is the DGT-TM 2011 truly a superset of the 2007 release, or should both be merged? Probably too much work?
      2. and extract only the language pairs with English, and only the languages marked “1” here: “G:\myfiles\doc\education\humanities\computer_linguistics\corpus\texts\multi\DGT-tm\DGT-tm_statistics.xlsx” (using “G:\myfiles\doc\education\humanities\computer_linguistics\corpus\texts\multi\DGT-tm\TMXtract.exe”)
      3. and convert
        1. English is the source language by default, but should be the target language in our programs.
        2. The TMX format this translation memory is distributed in should be “upgradeable” to the SDL Trados Studio 2011/2011-SP1 format in the “Upgrade Translation Memories” wizard.
          1. TBA: where is this component?
      4. configure Trados to load the translation memory
        1. How much computing resources does this use?
        2. How do you load a TM?
        3. Can you load on demand instead of preloading all?
      5. Here are the statistics for the translation memories for “our” languages
      6. uncc   Language     Code   Units, DGT 2007   Units, DGT 2011
         1      English      EN           2187504           2286514
         1      German       DE            532668           1922568
         1      Greek        EL            371039           1901490
         1      Spanish      ES            509054           1907649
         1      French       FR           1106442           1853773
         1      Italian      IT            542873           1926532
         1      Polish       PL           1052136           1879469
         1      Portuguese   PT            945203           1922585
         Total  8            8            7246919          15600580
    2. Would it be of interest to have the document-focused JRC-Acquis distribution of the materials underlying the translation memories available on student/teacher TRADOS computers, so that sample texts can be loaded for which reliable translation suggestions will be available – this is not certain for texts from all domains – and the use of a translation memory can be trained under realistic conditions?
      1. “The DGT Translation Memory is a collection of translation units, from which the full text cannot be reproduced. The JRC-Acquis is mostly a collection of full texts with additional information on which sentences are aligned with each other.”
      2. It remains to be seen how easily one can transfer documents from this distribution into Trados to work with the translation memory.
      3.   Here is where to download:
      4. uncc   lang   inc
         1      de     jrc-de.tgz
         1      en     jrc-en.tgz
         1      es     jrc-es.tgz
         1      fr     jrc-fr.tgz
         1      it     jrc-it.tgz
         1      pl     jrc-pl.tgz
         1      pt     jrc-pt.tgz

      5. The JRC-Acquis comes with these statistics:
    3. uncc   ISO code   Texts    Total words   Total characters   Avg. words/text
       1      de         23541       32059892          232748675           1361.87
       1      en         23545       34588383          210692059           1469.03
       1      es         23573       38926161          238016756           1651.3
       1      fr         23627       39100499          234758290           1654.91
       1      it         23472       35764670          230677013           1523.72
       1      pl         23478       29713003          214464026           1265.57
       1      pt         23505       37221668          227499418           1583.56
       Total  7          164741     247374276         1588856237          10509.96

  3. What other multilingual corpora are there (for other domains and other, non-European languages)?
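Since TMX is plain XML of `<tu>` translation units with per-language `<tuv>` variants, the per-language-pair extraction step above (what TMXtract performs) can be sketched with the Python standard library; a simplified illustration, not a replacement for the real tool:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_string, src="EN", tgt="DE"):
    """Collect (source, target) segment pairs for one language pair
    from a TMX document."""
    root = ET.fromstring(tmx_string)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.upper()] = (seg.text or "").strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

sample = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="EN"><seg>Regulation</seg></tuv>
    <tuv xml:lang="DE"><seg>Verordnung</seg></tuv></tu>
</body></tmx>"""
pairs = extract_pairs(sample)
```

For the full DGT-TM release you would stream-parse (`ET.iterparse`) rather than load whole files, but the alignment logic stays the same.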

Corpus del Español Actual (CEA)

  1. Example of a KWIC-view result: [screenshot: CQPweb concordance]
  2. Based on Europarl, Wikicorpus (2006!), and MultiUN. From their metadata page:

    Metadata for Corpus del Español Actual

    Corpus name: Corpus del Español Actual
    CQPweb’s short handles for this corpus: cea / CEA
    Total number of corpus texts: 73,010
    Total words in all corpus texts: 539,367,886
    Word types in the corpus: 1,680,309
    Type:token ratio: 0 types per token
    Text-level metadata: none; no primary classification scheme has been set
    Word-level annotation: Lemma (Lemma), Part-Of-Speech (POS), WStart (WStart)
    Primary tagging scheme: Part-Of-Speech
    Further information: http://sfn.uab.es:9080/SFN/tools/cea/english

  3. To use it, consult the IMS’s brief description of the regular-expression syntax used by CQP and their list of sample queries. If you wish to define your query in terms of grammatical and inflectional categories, you can use the part-of-speech tags listed on the CEA’s Corpus Tags page.
  4. Also provides frequency data (based on word forms or lemmas, among others – up to 1,000 items): Corpus of Contemporary Spanish frequency interface
  5. Example of a frequency-query result (click for the full-size image; note that a lemmatized list was requested here, which links all inflected forms back to the lemma and, vice versa, upon clicking the lemma displays a KWIC view containing all forms subsumed under that lemma – see picture above):
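To give a concrete flavor of the CQP syntax referred to in item 3, here are a few typical query patterns, kept as Python strings with a trivial sanity check; the exact POS tag names (e.g. whether participles are tagged `VLadj`) are assumptions – confirm them against the CEA’s Corpus Tags page:

```python
# CQP queries are regular expressions over token attributes, one
# [...] bracket per token. Tag names below are illustrative guesses.
queries = {
    # all inflected forms of the lemma "decir"
    "lemma": '[lemma="decir"]',
    # a form of "haber" followed by a past participle (compound tenses)
    "sequence": '[lemma="haber"] [pos="VLadj.*"]',
    # word-form regex: anything ending in -mente (most adverbs)
    "suffix": '[word=".*mente"]',
}

def is_bracketed(q):
    """Crude sanity check: brackets balance and at least one token pattern."""
    return q.count("[") == q.count("]") >= 1
```

Students can paste these directly into the CQPweb search box and then refine them against the KWIC output.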

MS-Bing Dictionary for Chinese learners of English–and vice versa?

  1. Link: http://dict.bing.com.cn/?ulang=EN-US&tlang=ZH-CN#%3Ahome, powered by Engkoo:
  2. This looks like a pretty evolved learning tool: it has instant suggestions that include usage information and translations.
  3. Rich results include contextual, parallel web-as-corpus matches in text, and text-to-speech (which, on spot-checking, seems barely noticeably computer-generated).

American National Corpus English Word Frequency Lists

The American National Corpus list is long (~300k entries) and lemmatized: http://americannationalcorpus.org/SecondRelease/data/ANC-all-count.txt.
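Such a frequency list is easy to post-process, e.g. for building graded vocabulary lists. A sketch that parses tab-separated records – the four-column layout (word, lemma, POS, count) and the sample lines are assumptions about ANC-all-count.txt; inspect the file before relying on this:

```python
import csv
import io

def load_freq_list(text, min_count=1):
    """Parse a tab-separated frequency list into (word, lemma, pos, count)
    rows, sorted by descending count. The four-column layout is an
    assumption about the ANC file - confirm it against the actual data."""
    rows = []
    for rec in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(rec) != 4:
            continue  # skip malformed or differently shaped lines
        word, lemma, pos, count = rec
        if int(count) >= min_count:
            rows.append((word, lemma, pos, int(count)))
    return sorted(rows, key=lambda r: -r[3])

# Invented sample lines for illustration only:
sample = "the\tthe\tDT\t1204816\nrunning\trun\tVBG\t8341\n"
rows = load_freq_list(sample)
```

Grouping the rows by lemma would then give per-lemma totals across inflected forms, which is usually what vocabulary-acquisition work needs.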

Auto-Glossing and Lookup-Tracking in a Personal Corpus for Vocabulary Acquisition