Definition: corpus linguistics
The branch of computational linguistics that works with large text corpora, which are processed with stochastic methods.

Definition: text corpus
"A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of language." (John Sinclair 91)
Any collection of texts (including spoken language), provided the texts were selected deliberately.
Problem: selection criteria, size and representativeness of a corpus.

Building usable text corpora: problems and processing steps
• Marking parts of the text (e.g. headings)
• Deleting control characters (layout, formatting)
• Normalizing special characters (umlauts, true special characters)
• Removing whitespace (tabs, spaces, line breaks)
• Separating words (e.g. "und/oder", "Artikel/Nomen/Verb", CDU/CSU, "Feldberg/Schwarzwald"; problem cases: "km/h", "WS 1970/71")
• Splitting off punctuation marks
• Undoing hyphenation at line ends
• Joining numbers that were split apart
• Disambiguating periods
• Annotation/tagging
• Sentence segmentation

Important works
Armstrong, S. (ed.) (1993). Using Large Corpora. Computational Linguistics Vol. 19, No. 1/2; repr. MIT Press 1994 (with contributions by K. Church, T. Briscoe, W. Gale and others).
Frakes, W. B. and R. A. Baeza-Yates (eds.) (1992). Information Retrieval. New Jersey: Prentice Hall.
Francis, W. and H. Kucera (1982). Frequency Analysis of English Usage. Boston: Houghton Mifflin.
Garside, R., G. Leech and G. Sampson (eds.) (1987). The Computational Analysis of English. London: Longman.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Sperberg-McQueen, C. and L. Burnard (1994). Guidelines for Electronic Text Encoding and Interchange (P3). Chicago and Oxford: Text Encoding Initiative.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Trends in Linguistics 65. Berlin: Mouton de Gruyter.
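Several of the processing steps for building a corpus (undoing line-end hyphenation, removing whitespace, separating slash compounds, splitting off punctuation) can be illustrated with a minimal Python sketch. The function names and the exception list KEEP_SLASH are hypothetical and chosen only for this illustration. The dehyphenation rule is a simple heuristic that would also merge genuine hyphens at a line end, and no attempt is made to disambiguate periods.

```python
import re

# Hypothetical exception list: slash forms that should remain one token.
KEEP_SLASH = {"km/h"}

# Leading punctuation, core of the token, trailing punctuation.
EDGE_PUNCT = re.compile(r"""([("']*)(.*?)([.,;:!?)"']*)""")


def dehyphenate(text):
    """Undo hyphenation at line ends: 'lang-\\nsam' becomes 'langsam'."""
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


def normalize_whitespace(text):
    """Collapse tabs, spaces and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_slash(token):
    """Split 'und/oder' into two words, but keep 'km/h' and '1970/71'."""
    if token in KEEP_SLASH or re.fullmatch(r"\d+/\d+", token):
        return [token]
    return [part for part in token.split("/") if part]


def tokenize(text):
    """Apply the steps in order and return a flat list of tokens."""
    tokens = []
    for raw in normalize_whitespace(dehyphenate(text)).split(" "):
        lead, core, trail = EDGE_PUNCT.fullmatch(raw).groups()
        tokens.extend(lead)    # each leading punctuation mark is a token
        if core:
            tokens.extend(split_slash(core))
        tokens.extend(trail)   # each trailing punctuation mark is a token
    return tokens


print(tokenize("Die CDU/CSU fährt 120 km/h und/oder lang-\nsam."))
# ['Die', 'CDU', 'CSU', 'fährt', '120', 'km/h', 'und', 'oder', 'langsam', '.']
```

The exception handling shows why "km/h" and "WS 1970/71" are listed as problem cases: a rule that splits at every slash needs a lexicon or a pattern to leave such forms intact.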

Linguistics Resources on the Internet, including Computational Linguistics and Natural Language Processing
• SIL Linguistics resources: Ethnologue, Living Languages of the Americas, Bibliography, Publications Catalog, SIL Electronic Working Papers, School and Training, Linguistic Glossary, LingBits, LingualLinks, CELLAR
• Conferences, Workshops, Meetings, Symposia
• Universities and Other Academic Sites: Associations, USA, Canada, Mexico, UK and Ireland, Europe, Asia and Africa, Australia and New Zealand
• Electronic Texts, Dictionaries and Data Centers: Texts, Dictionaries
• Computing Resources: General Information, SIL Resources, Commercial Sites, Software Archives, Software Tools
• Journals and Newsletters
• Resources Listed Topically: Speech and Phonetics, Phonology and Morphology, Grammar and Syntax, Semantics and Semiotics, Second Language Teaching and Learning, Pedagogical Resources, Sociolinguistics, Text Analysis and Corpus Linguistics, Translation, Scripts and Writing Systems, Languages and Language Families, Language Rights
• Other Resources: USENET Newsgroup and FAQs, Mailing Lists and Discussion Groups, Papers and Dissertations, Bibliographies, Publishers and Booksellers
• Other Indexes to Linguistics on the Internet: Indexes from the World Wide Web Virtual Library Linguistics

The ACL NLP/CL Universe
Directory listing for: RESOURCES
• ARIES Natural Language Tools
• Bibliography [DIR: 20 entries]
• Books [DIR: 32 entries]
• Corpora [DIR: 60 entries]
• Courses [DIR: 19 entries]
• Dictionaries [DIR: 26 entries]
• Electronic mailing lists [DIR: 13 entries]
• Journals [DIR: 14 entries]
• Language and Linguistic Science information sources
• Language-specific resources (e.g. German, Italian) [DIR: 8 entries]
• Linguistic News: Usenet News, Mailing Lists, Resources
• Miscellaneous FTP sites [DIR: 4 entries]
• On-line resources [DIR: 5 entries]
• Other comprehensive sites [DIR: 33 entries]
• Papers [DIR: 12 entries]
• Software on the Internet [DIR: 219 entries]
• The RELATOR language resources server
• Usenet newsgroups [DIR: 6 entries]
Total number of entries in system: 1883. Last updated: Fri May 7 15:04:00 EDT 1999.

ELSNET
• Newspapers on the internet: a list of links to electronic versions of newspapers in several languages.
• TiMBL 1.0 - Tilburg Memory Based Learner

ELSNET CD Distribution
• The HCRC Map Task Corpus: a set of 8 CD-ROMs containing linked audio and transcriptions of a total of about 18 hours of spontaneous speech, recorded from 128 two-person conversations according to a detailed experimental design. The cost of the corpus is GBP 143.25, plus VAT of GBP 25.07 for purchasers within the European Union. (Users outside Europe should contact the Linguistic Data Consortium [email protected]).
• The European Corpus Initiative Multilingual Corpus I: the European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus (ECI/MCI) to be made available in digital form for scientific research at as low a cost as possible. The corpus has been available on CD-ROM since the end of April 1994 and is now being distributed by Utrecht University on behalf of ELSNET. The price is 95 DFl (for payments made by credit card or Eurocheque), 110 DFl (for payments by bank transfer), or 120 DFl (for payments by cheques other than Eurocheques).
• The Groningen Speech Corpus: collected by A.M. Sulter, MD and Prof. H.K. Schutte as part of a research project funded by NWO (Netherlands Organization for Scientific Research). The 4 CD-ROMs contain over 20 hours of speech. It is a corpus of read speech material in Dutch, recorded on PCM tape under fairly good conditions.
See also: the European Language Resources Association (ELRA) web site.
Other Resources: a database gathering addresses of parents of bilingual or multilingual children.

European Language Resources Association
The European Language Resources Association (ELRA) was established as a non-profit organization in Luxembourg in February 1995. The overall goal of ELRA is to provide a centralized organization for the validation, management, and distribution of speech, text, and terminology resources and tools, and to promote their use within the European telematics R&TD community.
URL: http://www.icp.grenet.fr/ELRA/ - Copyright © 1996-99 ELRA - All rights reserved. Last update 19 April, 1999. Comments are welcome: [email protected]

The ELRA Catalogue: written resources
Categories: CORPUS | MONOLINGUAL LEXICON | MULTILINGUAL LEXICON | TOOLS
The descriptions of the language resources (LRs) given here are brief summaries for readability. Further information is available by following the links.
• R: for research
• C: for commercial use
• If neither abbreviation (R or C) appears, there are no restrictions on the type of use.
• Discounts on the non-member price are offered to members of organizations with which ELRA has entered into special agreements (e.g. ELSNET).
• ***: at cost
• ELRA: please contact the ELRA office
• ---: price under discussion
• WWW: please download this free resource from the Web (follow the links)
Prices are given in euros (1 EUR ≈ 1.2 USD). Some prices, which were negotiated in local currency, have been readjusted with respect to the exchange rate.

CORPORA
Columns: ELRA ref.; name; type and number of entries; language; price for members (M) and non-members (Non-M); date
• W0001: BRITISH NATIONAL CORPUS, BNC (OTA); 100 million words; English; M: R 175, Non-M: R 254; 01/09/96
• W0002: CONTEMPORARY PORTUGUESE CORPUS; 1.5 million words; Portuguese; price: ---
• W0003: CRATER, multilingual aligned corpus; 1 million tokens; English, French, Spanish; M: 20, Non-M: 100; 23/01/97
• W0004: ECI/MCI, European Corpus Initiative Multilingual Corpus; 98 million words; major European languages + Turkish, Japanese, Russian, Chinese, Malay, etc.; M: R 45, Non-M: R 45; 01/09/96
• W0005: ECI-ELSNET Italian & German tagged sub-corpus; Economy 17,000 words, Politics 14,000 words, Culture 18,000 words, Sports 9,000 words, Local Events 8,500 words; Italian & German; M: R 20, Non-M: R 45; 01/09/96
• W0006: MLCC, multilingual corpus; Het Financieele Dagblad (8.5 million words), The Financial Times (30 million words), Le Monde (10 million words), Handelsblatt (33 million words), Il Sole 24 Ore (1.88 million words), Expansion (10 million words); Dutch, English, French, German, Italian, Spanish; M: R 360, C 1500; Non-M: R 750, C 3200; 01/09/96

Structure
Text: $t_1 \dots t_N$ (tokens), $n = 1 \dots N$
$t_i = t_j$: same word form
Word form: $w_i = \{t_{i_1} \dots t_{i_k}\}$, $i \le N$, $k = 1 \dots M$
$f(k)$: number of occurrences ($k$) of $w$
Relative frequency: $f_r(w_i) = f(k) / N$
Probability: $P(w_i) = \lim_{N \to \infty} f(k) / N$
Concordance: the textual environment (context) of a word
Collocation: significant co-occurrence of two words

Statistical methods
Requirements: 1) plausible, 2) computable
A) conditional probability
B) information-theoretic models
C) special significance measures

Conditional probability
$P_B(A) = P(A \mid B)$ (also written $P(A/B)$)
Independent: $P_B(A) = P(A)$
Joint probability: $P(A \cap B) = P(B)\,P_B(A) = P(A)\,P_A(B)$
Independent: $P(A \cap B) = P(A)\,P(B)$
Example: ... dddaaabbbaaacccbbbdddccc ...
$P(a) = P(b) = P(c) = P(d)$; conditional probabilities to compare: $P_a(b)$, $P_b(b)$

Markov model
[Figure: state diagram with the states A, B, C, D; A and D keep their state with probability 0.8 and leave it with probability 0.2; each transition out of B or C has probability 0.5]
$P_A(A) = P_D(D) = 0.8$
$P_A(B) = P_D(C) = 0.2$
$P(A) = P(D) = 2.5\,P(B) = 2.5\,P(C)$
Typical chain: ... AAAAABCBDDD ...
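The relation P(A) = P(D) = 2.5 P(B) = 2.5 P(C) can be checked numerically. The sketch below assumes that the arcs of the state diagram are A→A 0.8, A→B 0.2, B→A 0.5, B→C 0.5, C→B 0.5, C→D 0.5, D→C 0.2, D→D 0.8; this assignment is inferred from the listed probabilities and is not stated explicitly in the slide. The stationary distribution is approximated by repeatedly applying the transition matrix to a uniform start distribution.

```python
STATES = ["A", "B", "C", "D"]

# Row i holds the transition probabilities out of STATES[i].
# The arc assignment is an assumption inferred from the diagram values.
P = [
    [0.8, 0.2, 0.0, 0.0],  # A: stay 0.8, go to B 0.2
    [0.5, 0.0, 0.5, 0.0],  # B: to A 0.5, to C 0.5
    [0.0, 0.5, 0.0, 0.5],  # C: to B 0.5, to D 0.5
    [0.0, 0.0, 0.2, 0.8],  # D: to C 0.2, stay 0.8
]


def stationary(matrix, steps=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


pi = dict(zip(STATES, stationary(P)))
for state in STATES:
    print(state, round(pi[state], 4))
print("P(A) / P(B) =", round(pi["A"] / pi["B"], 4))
```

Under these assumptions the ratio P(A)/P(B) comes out as 2.5, in agreement with the relation given for the model.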

Markov model for natural-language text
Phrase $Q$ with lexicon entries $T_1 \dots T_n$ and grammatical functions $G_1 \dots G_n$ (T: lexicon entry, G: grammatical function)
Lexical probabilities: $P_Q(T_i \mid G_i)$
Language model (a priori probability): $P_Q(G_i \mid G_{i-1}, G_{i-2})$
$$G_{opt} = \arg\max_{G} \prod_{i=1}^{n} P_Q(G_i \mid G_{i-1}, G_{i-2})\, P_Q(T_i \mid G_i)$$
Applications: tagging, analysis, translation memories, speech recognition

Decisions
[Figure: decision tree over the characters A to H; a character is isolated after 3 yes/no decisions]

General case
[Figure: tree that repeatedly splits the characters A to G into subsets with probabilities 1/2, 1/4, 1/8, ...]
Partition into equally probable sets.
After $k_i$ binary decisions the $i$-th character is isolated.
$p_i = (1/2)^{k_i}$
$k_i = \mathrm{ld}(1/p_i)$ bit (decision information; ld: logarithm to base 2)

Letter: A, E, F, C, B, D, G
$p_i$: 1/4, 1/4, 1/8, 1/8, 1/8, 1/16, 1/16
Code: 00, 01, 100, 101, 110, 1110, 1111
$H = \sum_i p_i\,\mathrm{ld}(1/p_i) = 2/4 + 2/4 + 3/8 + 3/8 + 3/8 + 4/16 + 4/16 = 2.625$

Some basic concepts
Decision information: the number of optimally chosen binary decisions needed to determine a character within a character set.
Decision content per character: $I_z = \mathrm{ld}(1/p_z)$ bit
Mean decision content per character: $H = p_1 I_1 + p_2 I_2 + \dots + p_n I_n = \sum_i p_i\,\mathrm{ld}(1/p_i)$ bit
Shannon function: $H(p) = p\,\mathrm{ld}(1/p) + (1-p)\,\mathrm{ld}(1/(1-p))$

Redundancy and entropy
Information content of written language: 30 characters + space, $I = \mathrm{ld}\,30 = 4.9$ bit
$H = 1.6$ bit (taking bigrams into account)
Redundancy: 4.9 bit - 1.6 bit = 3.3 bit (text remains readable even if every second letter is missing)

Redundancy: example
The German sentence below means "With reduced redundancy, reading becomes much more laborious". It is shown in normal spelling, in capitals, without spaces, and with every third letter removed:
Bei reduzierter Redundanz wird das Lesen sehr viel mühsamer
BEI REDUZIERTER REDUNDANZ WIRD DAS LESEN SEHR VIEL MÜHSAMER
BEIREDUZIERTERREDUNDANZWIRDDASLESENSEHRVIELMÜHSAMER
BE RE UZ ER ER ED ND NZ IR DA LE EN EH VI LM HS ME
(after Breuer 1995)

Message source that emits only 0 and 1.
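For a binary source of this kind, and for the seven-letter table given earlier, the information measures can be evaluated directly. The following sketch computes the mean decision content H for the letter probabilities, the mean length of the listed code, and the Shannon function H(p). The function names are chosen for this illustration only, and the code words are entered as binary strings.

```python
from math import log2

# Letter probabilities and code words from the seven-letter table.
TABLE = {
    "A": (1 / 4, "00"),
    "E": (1 / 4, "01"),
    "F": (1 / 8, "100"),
    "C": (1 / 8, "101"),
    "B": (1 / 8, "110"),
    "D": (1 / 16, "1110"),
    "G": (1 / 16, "1111"),
}


def entropy(probs):
    """Mean decision content H = sum of p * ld(1/p), in bit."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def shannon(p):
    """Shannon function H(p) for a source emitting two symbols."""
    return entropy([p, 1 - p])


def mean_code_length(table):
    """Expected number of binary decisions per character."""
    return sum(p * len(code) for p, code in table.values())


def is_prefix_free(codes):
    """True if no code word is the beginning of another one."""
    codes = list(codes)
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


H = entropy(p for p, _ in TABLE.values())
L = mean_code_length(TABLE)
print("H =", H, "bit; mean code length =", L, "bit")
print("prefix-free:", is_prefix_free(code for _, code in TABLE.values()))
for p in (0.1, 0.5, 0.9):
    print("H(%.1f) = %.3f bit" % (p, shannon(p)))
```

Because every probability in the table is a power of 1/2, the mean code length equals the entropy of 2.625 bit; for other distributions the mean length of such a code can only be greater than or equal to H.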
Probabilities: $P_0$, $1 - P_0$
Mean information value (entropy): $S(p) = P_0\,\mathrm{ld}(1/P_0) + (1 - P_0)\,\mathrm{ld}(1/(1 - P_0))$
[Figure: plot of $S(p)$ against $P_0$, axis ticks at 0, 0.5 and 1]

Content analysis: experiment based on the measure of mutual information
$$I(x,y) = \mathrm{ld}\,\frac{P(x,y)}{P(x)\,P(y)}$$
(Note: $P(x,y)$ is the probability that $x$ and $y$ occur together in a text window of arbitrary size.)
1. Segmentation of the text into sentences
2. Extraction of the sentences that share the same keyword (i.e. after reduction to the stem form)
3. Computation of the mutual information value for all stem forms in a text window
4. Definition of a suitable threshold

Zipf's law
$r_k \cdot f(k) \approx \text{constant}$
$r_k \cdot f(k) \approx A \cdot N$, $A \approx 0.1$

Estimate for low-frequency terms
$r_n$: rank of a term that occurs exactly $n$ times (e.g. exactly once)
$I_n$: number of terms that occur exactly $n$ times
$$r_n = A\,\frac{N}{n}, \qquad r_{n+1} = A\,\frac{N}{n+1}$$
$$I_n = r_n - r_{n+1} = A\,\frac{N}{n} - A\,\frac{N}{n+1} = \frac{A\,N}{n(n+1)} \quad \text{(avoids counting tied ranks repeatedly)}$$
$$I_1 = r_1 - r_2 = A\,\frac{N}{1} - A\,\frac{N}{2} = \frac{A\,N}{2}$$
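The estimate for low-frequency terms can be turned into a small calculation. The sketch below assumes A = 0.1 as in the slide and a hypothetical corpus size of one million tokens; the function names are illustrative. It also reads r_1, the rank of the last term that occurs once, as an estimate of the number of distinct terms, which is an interpretation added here and holds only as far as the Zipf approximation does.

```python
def zipf_rank(n, total_tokens, a=0.1):
    """Rank r_n of a term that occurs exactly n times: r_n = A * N / n."""
    return a * total_tokens / n


def terms_with_frequency(n, total_tokens, a=0.1):
    """I_n = r_n - r_(n+1) = A * N / (n * (n + 1))."""
    return zipf_rank(n, total_tokens, a) - zipf_rank(n + 1, total_tokens, a)


N = 1_000_000                       # hypothetical corpus size in tokens
distinct_terms = zipf_rank(1, N)    # r_1, read as the vocabulary size
once_only = terms_with_frequency(1, N)

print("estimated distinct terms:", round(distinct_terms))
print("terms occurring exactly once:", round(once_only))
print("share of such terms:", round(once_only / distinct_terms, 3))
for n in (2, 3, 4):
    print("I_%d = %d" % (n, round(terms_with_frequency(n, N))))
```

Under this model about half of all distinct terms occur only once, independently of A and N, since I_1 / r_1 = 1/2.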