Definition of Corpus Linguistics

The branch of computational linguistics that works with large text corpora and
processes them with stochastic methods.
Definition of Text Corpus
"A corpus is a collection of naturally-occurring language text, chosen to
characterize a state or variety of language." (John Sinclair 1991)
Any collection of texts (including spoken language), provided the texts were
selected deliberately.
Problem: selection criteria, size, and representativeness of a corpus
Building usable text corpora
Problems and processing steps

Marking up parts of the text (e.g. headings)

Deleting control characters (layout, formatting)

Normalizing special characters (umlauts, ß)

Removing whitespace (tabs, spaces, line breaks)

Splitting words (e.g. "und/oder", "Artikel/Nomen/Verb", CDU/CSU,
"Feldberg/Schwarzwald"; problem cases: "km/h", "WS 1970/71").

Separating punctuation marks from words

Undoing hyphenation at line ends

Joining numbers

Disambiguating periods

Annotation/tagging

Sentence segmentation
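A minimal Python sketch of some of the steps above (undoing line-end hyphenation, deleting control characters, collapsing whitespace, splitting slash compounds, separating punctuation). The function name `preprocess` and the exception list `KEEP_WHOLE` are hypothetical, chosen only for illustration. Period disambiguation, tagging, and sentence segmentation are not covered.

```python
import re

# Hypothetical exception list: slash compounds that must stay in one piece.
KEEP_WHOLE = {"km/h"}

# A number (optionally with . , / groups), a word (optionally slash-joined),
# or a single punctuation character.
TOKEN = re.compile(r"\d+(?:[.,/]\d+)*|\w+(?:/\w+)*|[^\w\s]")

def preprocess(raw):
    """Return a list of tokens for a raw text string."""
    # Undo hyphenation at line ends: "Kor-\npus" -> "Korpus".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Delete control characters other than tab, newline, carriage return.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    # Collapse tabs, spaces, and line breaks into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    tokens = []
    for tok in TOKEN.findall(text):
        is_number = tok[0].isdigit()
        if "/" in tok and tok not in KEEP_WHOLE and not is_number:
            # "und/oder" -> "und", "/", "oder"
            tokens.extend(re.split(r"(/)", tok))
        else:
            tokens.append(tok)
    return tokens

sample = "Kor-\npus  und/oder Text,\t120 km/h im WS 1970/71."
tokens = preprocess(sample)
```

Keeping the problem cases in an explicit exception list is only one possible design; a real pipeline would need a much larger list or a learned model.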
Important works
Armstrong, S. (ed.) (1993). Using Large Corpora. Computational Linguistics 19(1/2). Reprinted MIT Press, 1994 (with contributions by K. Church, T. Briscoe, W. Gale, and others).
Francis, W. and H. Kucera (1982). Frequency Analysis of English Usage. Boston: Houghton Mifflin.
Frakes, W. B. and R. A. Baeza-Yates (eds.) (1992). Information Retrieval. New Jersey: Prentice Hall.
Garside, R., G. Leech, and G. Sampson (eds.) (1987). The Computational Analysis of English. London: Longman.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sperberg-McQueen, C. and L. Burnard (1994). Guidelines for Electronic Text Encoding and Interchange (P3). Chicago and Oxford: Text Encoding Initiative.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Trends in Linguistics 65. Berlin: Mouton de Gruyter.
Linguistics Resources on the Internet
including Computational Linguistics and Natural Language Processing
• SIL Linguistics resources
  Ethnologue, Living Languages of the Americas, Bibliography, Publications Catalog, SIL Electronic Working Papers, School and Training, Linguistic Glossary, LingBits, LinguaLinks, CELLAR
• Conferences, Workshops, Meetings, Symposia
• Universities and Other Academic Sites
  Associations, USA, Canada, Mexico, UK and Ireland, Europe, Asia and Africa, Australia and New Zealand
• Electronic Texts, Dictionaries and Data
  Centers, Texts, Dictionaries
• Computing Resources
  General Information, SIL Resources, Commercial Sites, Software Archives, Software Tools
• Journals and Newsletters
• Resources Listed Topically
  Speech and Phonetics, Phonology and Morphology, Grammar and Syntax, Semantics and Semiotics, Second Language Teaching and Learning, Pedagogical Resources, Sociolinguistics, Text Analysis and Corpus Linguistics, Translation, Scripts and Writing Systems, Languages and Language Families, Language Rights
• Other Resources
  USENET Newsgroup and FAQs, Mailing Lists and Discussion Groups, Papers and Dissertations, Bibliographies, Publishers and Booksellers
• Other Indexes to Linguistics on the Internet
  Indexes from the World Wide Web Virtual Library: Linguistics
The ACL NLP/CL Universe
Directory listing for: RESOURCES:
 ARIES Natural Language Tools
 Bibliography [DIR: 20 entries] ...
 Books [DIR: 32 entries] ...
 Corpora [DIR: 60 entries] ...
 Courses [DIR: 19 entries] ...
 Dictionaries [DIR: 26 entries] ...
 Electronic mailing lists [DIR: 13 entries] ...
 Journals [DIR: 14 entries] ...
 Language and Linguistic Science information sources
 Language-specific resources (e.g. German, Italian) [DIR: 8 entries] ...
 Linguistic News Usenet News: Mailing Lists: Resources:
 Miscellaneous FTP sites [DIR: 4 entries] ...
 On-line resources [DIR: 5 entries] ...
 Other comprehensive sites [DIR: 33 entries] ...
 Papers [DIR: 12 entries] ...
 Software on the Internet [DIR: 219 entries] ...
 The RELATOR language resources server
 Usenet newsgroups [DIR: 6 entries] ...
Total number of entries in system: 1883. Last updated: Fri May 7 15:04:00 EDT 1999
ELSNET
Newspapers on the internet
A list of links to electronic versions of newspapers in several languages.
TiMBL 1.0 - Tilburg Memory Based Learner
ELSNET CD Distribution
The HCRC Map Task Corpus
The HCRC Map Task Corpus is a set of 8 CD-ROMs containing linked audio and
transcriptions of a total of about 18 hours of spontaneous speech that was
recorded from 128 two-person conversations according to a detailed
experimental design.
The cost of the corpus is GBP 143.25, plus VAT of GBP 25.07 for purchasers
within the European Union. (Users outside Europe should contact the
Linguistic Data Consortium.)
The European Corpus Initiative Multilingual Corpus I
The European Corpus Initiative (ECI) was founded to oversee the acquisition
and preparation of a large multilingual corpus (ECI/MCI) to be made available
in digital form for scientific research at as low a cost as possible.
The corpus has been available on CD-ROM since the end of April 1994, and is
now being distributed by Utrecht University on behalf of ELSNET. The price is
95 DFl (for payments made by credit card or Eurocheque); 110 DFl (for
payments by bank transfer); or 120 DFl (for payments by cheques other than
Eurocheques).
The Groningen Speech Corpus
The Groningen Speech Corpus was collected by A.M. Sulter, MD and Prof. H.K.
Schutte as part of a research project funded by NWO (Netherlands
Organization for Scientific Research). The 4 CD-ROMs contain over 20 hours of
speech. It is a corpus of read speech material in Dutch, recorded on PCM tape
under fairly good conditions.
See also: The European Language Resources Association (ELRA) web site.
Other Resources

A database gathering addresses of parents of bilingual or multilingual children.
European Language Resources Association
The European Language Resources Association (ELRA) was established as a non-profit organization in
Luxembourg in February, 1995. The overall goal of ELRA is to provide a centralized organization for
the validation, management, and distribution of speech, text, and terminology resources and tools,
and to promote their use within the European telematics R&TD community.
URL: http://www.icp.grenet.fr/ELRA/ - Copyright © 1996-99 ELRA - All rights reserved.
Last update 19 April, 1999.
WRITTEN RESOURCES
The ELRA Catalogue
The descriptions of language resources (LRs) given here are brief summaries for readability. Further information is available by following the links.
R: for research use
C: for commercial use
If neither abbreviation (R or C) appears, there are no restrictions on the type of use.
A discount for non-members is offered to members of organizations with which ELRA has entered into special agreements (e.g. ELSNET).
***: at cost
ELRA: please contact the ELRA office
---: price under discussion
WWW: free resource, to be downloaded from the Web (follow the links)
Prices are given in EURO (1 EUR ~= 1.2 USD). Some prices that were negotiated in local currency have been re-adjusted to the exchange rate.
CORPORA
(Ref. ELRA, name, type and number of entries, language, price for members (M) and non-members (Non-M), date)

W0001  BRITISH NATIONAL CORPUS, BNC (OTA)
       Size: 100 million words
       Language: English
       Price M / Non-M: R 175 / R 254
       Date: 01/09/96
W0002  CONTEMPORARY PORTUGUESE CORPUS
       Size: 1.5 million words
       Language: Portuguese
       Price: ---
W0003  CRATER Multi-lingual aligned corpus
       Size: 1 million tokens
       Languages: English, French, Spanish
       Price M / Non-M: 20 / 100
       Date: 23/01/97
W0004  ECI/MCI European Corpus Initiative Multilingual Corpus
       Size: 98 million words
       Languages: major European languages + Turkish, Japanese, Russian, Chinese, Malay, etc.
       Price M / Non-M: R 45 / R 45
       Date: 01/09/96
W0005  ECI-ELSNET Italian & German tagged sub-corpus
       Size: Economy 17,000 words; Politics 14,000 words; Culture 18,000 words; Sports 9,000 words; Local Events 8,500 words
       Languages: Italian & German
       Price M / Non-M: R 20 / R 45
       Date: 01/09/96
W0006  MLCC - Multi-lingual corpus
       Size: Het Financieele Dagblad (8.5 million words); The Financial Times (30 million words); Le Monde (10 million words); Handelsblatt (33 million words); Il Sole 24 Ore (1.88 million words); Expansion (10 million words)
       Languages: Dutch, English, French, German, Italian, Spanish
       Price M / Non-M: R 360, C 1500 / R 750, C 3200
       Date: 01/09/96
Structure
Text: $t_1 \dots t_n$ (tokens)
$N$: number of tokens, $n = 1 \dots N$
$t_i = t_j$: same word form
$w_i = \{t_{i1} \dots t_{ik}\}$, $i \le N$, $k = 1 \dots M$
$f(k)$: number $k$ of occurrences of $w$
Relative frequency: $f_r(w_i) = f(k) / N$
$P(w_i) = \lim_{N \to \infty} f(k) / N$
Concordance: the textual context of a word
Collocation: significant co-occurrence of two words
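The relative frequency and the concordance can be illustrated with a short Python sketch. The function names and the toy sentence are hypothetical; the concordance is shown as a simple keyword-in-context list with a fixed window, which is one common reading of "textual context".

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word form to f / N, its count divided by the token count."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    return {word: count / n_tokens for word, count in counts.items()}

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

# Toy text, for illustration only.
tokens = "the cat sat on the mat and the dog sat too".split()
freqs = relative_frequencies(tokens)
kwic = concordance(tokens, "sat")
```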
Statistical methods
Requirements
1) plausible
2) computable
A) conditional probability
B) information-theoretic models
C) special significance measures
Conditional probability
$P_B(A) = P(A \mid B) = P(A/B)$
Independent: $P_B(A) = P(A)$
Joint probability of occurrence
$P(A \cap B) = P(B) \cdot P_B(A) = P(A) \cdot P_A(B)$
Independent: $P(A \cap B) = P(A) \cdot P(B)$
Example
... dddaaabbbaaacccbbbdddccc ...
$P(a) = P(b) = P(c) = P(d)$
$P_a(b) \ne P_b(b)$
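The example can be checked by counting. The following Python sketch estimates the unigram probabilities and the conditional probabilities of a symbol given its predecessor from the visible part of the example string; the function names are hypothetical.

```python
from collections import Counter

text = "dddaaabbbaaacccbbbdddccc"
unigram = Counter(text)
bigram = Counter(zip(text, text[1:]))  # (previous symbol, next symbol)

def p(symbol):
    """Relative frequency of a symbol."""
    return unigram[symbol] / len(text)

def p_after(prev_symbol, next_symbol):
    """Estimate of P_prev(next): how often next follows prev."""
    total = sum(c for (a, _), c in bigram.items() if a == prev_symbol)
    return bigram[(prev_symbol, next_symbol)] / total

p_a_b = p_after("a", "b")  # b after a
p_b_b = p_after("b", "b")  # b after b
```

All four symbols are equally frequent, yet b follows b far more often than it follows a, so the symbols are not independent.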
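Markov model
[Figure: state diagram with the four states A, B, C, D. A and D each return to themselves with probability 0.8 and leave with probability 0.2; the transitions out of B and C carry probability 0.5 each.]
$P_A(A) = P_D(D) = 0.8$
$P_A(B) = P_D(C) = 0.2$
$P(A) = P(D) = 2.5\,P(B) = 2.5\,P(C)$
Typical chain: ... AAAAABCBDDD ...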
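The stated relation between the state probabilities can be checked numerically. The targets of the transitions out of B and C are not recoverable from the text, so the table below is an assumption: B goes to C or D, and C goes to A or B, which is consistent with the typical chain and with the probabilities given above. The sketch computes the stationary distribution by repeated application of the transition table.

```python
# Assumed transition table: the targets for B and C are a guess that is
# consistent with the typical chain "AAAAABCBDDD".
transitions = {
    "A": {"A": 0.8, "B": 0.2},
    "B": {"C": 0.5, "D": 0.5},
    "C": {"A": 0.5, "B": 0.5},
    "D": {"D": 0.8, "C": 0.2},
}

def stationary(table, steps=500):
    """Approximate the stationary distribution by power iteration."""
    dist = {state: 1 / len(table) for state in table}
    for _ in range(steps):
        nxt = {state: 0.0 for state in table}
        for state, prob in dist.items():
            for target, q in table[state].items():
                nxt[target] += prob * q
        dist = nxt
    return dist

pi = stationary(transitions)
chain = "AAAAABCBDDD"
chain_is_possible = all(b in transitions[a] for a, b in zip(chain, chain[1:]))
```

Under this assumption the result is P(A) = P(D) = 5/14 and P(B) = P(C) = 1/7, which reproduces the factor 2.5.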
Markov model for natural-language text
Phrase Q
T: lexicon entry
G: grammatical function
[Figure: the phrase Q with its grammatical functions $G_1 \dots G_n$ above the lexicon entries $T_1 \dots T_n$]
Lexical probabilities
$P_Q(T_i \mid G_i)$
Language model (a priori probability)
$P_Q(G_i \mid G_{i-1}, G_{i-2})$
$$G_{opt} = \arg\max_{G} \prod_{i=1}^{n} P_Q(G_i \mid G_{i-1}, G_{i-2}) \cdot P_Q(T_i \mid G_i)$$
Application: tagging
Analysis
Translation memories
Speech recognition
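For tagging, the maximization over tag sequences can be sketched in Python as follows. All probabilities, the tag set, and the start symbol `<s>` are hypothetical toy values for illustration. The sketch enumerates every tag sequence by brute force, which is only feasible for very short inputs; practical taggers use dynamic programming (the Viterbi algorithm) instead.

```python
from itertools import product

TAGS = ["DET", "N", "V"]
START = "<s>"  # padding for the two tags before the first word

# Hypothetical toy probabilities, for illustration only.
LEXICAL = {  # P(T_i | G_i), keyed as (word, tag)
    ("the", "DET"): 0.6,
    ("dog", "N"): 0.1,
    ("barks", "V"): 0.05,
    ("barks", "N"): 0.01,
}
TRANSITION = {  # P(G_i | G_{i-1}, G_{i-2}), keyed as (G_{i-2}, G_{i-1}, G_i)
    (START, START, "DET"): 0.5,
    (START, "DET", "N"): 0.7,
    ("DET", "N", "V"): 0.6,
    ("DET", "N", "N"): 0.1,
}
UNSEEN = 1e-6  # small probability for combinations missing from the tables

def score(words, tags):
    """Product of transition and lexical probabilities for one tagging."""
    history = (START, START)
    total = 1.0
    for word, tag in zip(words, tags):
        total *= TRANSITION.get((*history, tag), UNSEEN)
        total *= LEXICAL.get((word, tag), UNSEEN)
        history = (history[1], tag)
    return total

def best_tagging(words):
    """Return the tag sequence with the highest score (brute force)."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags))

best = best_tagging(["the", "dog", "barks"])
```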
Decisions
[Figure: decision tree over the symbols A to H. Each yes/no decision halves the remaining set; after 3 decisions a single symbol is isolated.]
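General case
[Figure: decision tree over the symbols A to G. Each decision splits the remaining set into two equally probable subsets, so the branches carry the probabilities 1/2, 1/4, 1/8, ...]
Partition into equally probable sets
After $k_i$ binary decisions the $i$-th symbol is isolated.
$p_i = (1/2)^{k_i}$
$k_i = \mathrm{ld}(1/p_i)$ bit (decision information), where ld is the logarithm to base 2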
Letter   p_i    Code
A        1/4    00
E        1/4    01
F        1/8    100
C        1/8    101
B        1/8    110
D        1/16   1110
G        1/16   1111
$H = \sum_i p_i \,\mathrm{ld}(1/p_i) = 2/4 + 2/4 + 3/8 + 3/8 + 3/8 + 4/16 + 4/16 = 2.625$
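The value of H can be verified with a few lines of Python. The sketch uses the probabilities and code words from the table; the variable names are hypothetical. It also checks that the mean code word length equals H and that no code word is a prefix of another.

```python
import math

# Letter -> (probability, code word), taken from the table.
CODE_TABLE = {
    "A": (1 / 4, "00"),
    "E": (1 / 4, "01"),
    "F": (1 / 8, "100"),
    "C": (1 / 8, "101"),
    "B": (1 / 8, "110"),
    "D": (1 / 16, "1110"),
    "G": (1 / 16, "1111"),
}

# H = sum of p_i * ld(1 / p_i)
entropy = sum(p * math.log2(1 / p) for p, _ in CODE_TABLE.values())

# Mean code word length: sum of p_i * k_i
mean_length = sum(p * len(code) for p, code in CODE_TABLE.values())

# A code is prefix-free if no code word starts another code word.
codes = [code for _, code in CODE_TABLE.values()]
prefix_free = not any(a != b and b.startswith(a) for a in codes for b in codes)
```

Because every probability is a power of 1/2, each code word length equals ld(1/p_i) exactly and the code reaches the entropy.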
Some basic concepts
Decision information: the number of optimally chosen binary decisions needed
to identify a symbol within a symbol set

Decision content per symbol: $I_z = \mathrm{ld}(1/p_z)$ bit

Mean decision content per symbol:
$H = p_1 I_1 + p_2 I_2 + \dots + p_n I_n = \sum_i p_i \,\mathrm{ld}(1/p_i)$ bit

Shannon function: $H(p) = p \,\mathrm{ld}(1/p) + (1-p) \,\mathrm{ld}(1/(1-p))$
Redundancy and entropy
Information content of written language
30 letters + space
$I = \mathrm{ld}\,30 = 4.9$ bit
$H = 1.6$ bit (taking bigrams into account)
Redundancy
4.9 bit - 1.6 bit = 3.3 bit
(The text remains readable even if every second letter is missing.)
Redundancy - example
Bei reduzierter Redundanz wird das Lesen sehr viel mühsamer
("With reduced redundancy, reading becomes much more laborious")
BEI REDUZIERTER REDUNDANZ WIRD DAS LESEN SEHR VIEL MÜHSAMER
BEIREDUZIERTERREDUNDANZWIRDDASLESENSEHRVIELMÜHSAMER
BE RE UZ ER ER ED ND NZ IR DA LE EN EH VI LM HS ME
(after Breuer 1995)
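The three reduced versions of the example sentence can be generated mechanically. In this Python sketch (variable names are hypothetical), the last version is obtained by dropping every third letter of the version without spaces and printing the remaining letters in pairs.

```python
sentence = "Bei reduzierter Redundanz wird das Lesen sehr viel mühsamer"

# Step 1: remove the case distinction.
upper = sentence.upper()

# Step 2: remove the spaces between words.
compact = upper.replace(" ", "")

# Step 3: keep two letters, drop the third, and so on.
thinned = " ".join(compact[i:i + 2] for i in range(0, len(compact), 3))
```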
Message source that emits only 0 and 1.
$P_0$, $1 - P_0$
Mean information value (entropy)
$S(p) = P_0 \,\mathrm{ld}(1/P_0) + P_1 \,\mathrm{ld}(1/P_1)$
$P_1 = 1 - P_0$
[Figure: S(p) plotted against P, with the axis marked at 0, 0.5, and 1]
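A short Python sketch of the entropy of such a binary source. The function name is hypothetical, and the value at the end points 0 and 1 is set to 0 by the usual convention that a term with probability 0 contributes nothing.

```python
import math

def binary_entropy(p0):
    """Entropy in bit of a source emitting 0 with probability p0, else 1."""
    if p0 <= 0.0 or p0 >= 1.0:
        return 0.0  # a certain outcome carries no information
    p1 = 1.0 - p0
    return p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)

# Sample the curve at a few points between 0 and 1.
curve = [(p / 10, binary_entropy(p / 10)) for p in range(11)]
```

The curve is symmetric around 0.5 and reaches its maximum of 1 bit there.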
Content analysis
- Experiment based on the mutual information measure
$I(x,y) = \mathrm{ld}\,[P(x,y) / (P(x)\,P(y))]$
(Note: P(x,y) is the probability that x and y occur together in a text
window of arbitrary size)
1. Segmentation of the text into sentences
2. Selection of the sentences that share the same keyword (after reduction
to stem forms)
3. Computation of the mutual information value for all stem forms in a
text window
4. Definition of a suitable threshold
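Steps 3 and 4 can be sketched in Python as follows. The sketch assumes that segmentation and stemming have already been done, so each text window is given as a list of stem forms. The toy windows, the function name, and the threshold value are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(windows):
    """I(x, y) = ld[P(x, y) / (P(x) P(y))] for all pairs sharing a window."""
    n = len(windows)
    single = Counter()
    pair = Counter()
    for window in windows:
        stems = sorted(set(window))       # count each stem once per window
        single.update(stems)
        pair.update(combinations(stems, 2))
    scores = {}
    for (x, y), count in pair.items():
        p_xy = count / n
        scores[(x, y)] = math.log2(p_xy / ((single[x] / n) * (single[y] / n)))
    return scores

# Toy windows, for illustration only.
windows = [
    ["bank", "money"],
    ["bank", "money", "loan"],
    ["river", "water"],
    ["bank", "river"],
]
scores = mutual_information(windows)

THRESHOLD = 0.5  # hypothetical threshold
strong_pairs = {pair for pair, value in scores.items() if value >= THRESHOLD}
```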
Zipf's law
$r_k \cdot f(k) \approx \text{constant}$
Relative frequency: $k / N$
$r_k = A \cdot \frac{N}{k}, \quad A \approx 0.1$
Estimate for low-frequency terms
$r_n$: rank of a term that occurs exactly n times (e.g. exactly once)
$I_n$: number of terms that occur exactly n times
$r_n = A \cdot \frac{N}{n}, \qquad r_1 = A \cdot \frac{N}{1}$
$I_n = r_n - r_{n+1}$ (avoiding repetitions)
$I_n = A \cdot \frac{N}{n} - A \cdot \frac{N}{n+1} = A \cdot \frac{N}{n(n+1)}$
$I_1 = r_1 - r_2 = A \cdot \frac{N}{1} - A \cdot \frac{N}{2} = A \cdot \frac{N}{2}$
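The estimate can be evaluated for a concrete corpus size. In this Python sketch, A = 0.1 is taken from the text, while the corpus size of one million tokens and the function names are hypothetical. Reading r_1 as the number of distinct terms is an interpretation: it is the rank of the last term in the frequency list.

```python
def rank_of_frequency(n, n_tokens, a=0.1):
    """r_n = A * N / n: rank of a term that occurs exactly n times."""
    return a * n_tokens / n

def terms_with_frequency(n, n_tokens, a=0.1):
    """I_n = r_n - r_(n+1): number of terms that occur exactly n times."""
    return rank_of_frequency(n, n_tokens, a) - rank_of_frequency(n + 1, n_tokens, a)

N_TOKENS = 1_000_000  # hypothetical corpus size

vocabulary = rank_of_frequency(1, N_TOKENS)    # r_1
hapaxes = terms_with_frequency(1, N_TOKENS)    # I_1
share_of_hapaxes = hapaxes / vocabulary
```

Under this estimate, half of all distinct terms occur exactly once, independent of N and A.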