ACE 2005 Multilingual Training Corpus

Introduction

This file contains documentation on the ACE 2005 Multilingual Training Corpus, Linguistic Data Consortium (LDC) catalog number LDC2006T06 and ISBN 1-58563-376-3.

This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events. It was created by the Linguistic Data Consortium with support from the ACE Program, with additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks.

A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST).

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

Data

Please see file.tbl for the directory structure of this publication, as well as a complete list of files. The README file contains complete documentation for the corpus.

Please see the data directory for a listing of data files.

Below is information about the amount of data included in the current release and its annotation status.

1P: data subject to first pass (complete) annotation
DUAL: data also subject to dual first pass (complete) annotation
ADJ: data also subject to discrepancy resolution/adjudication
NORM: data also subject to TIMEX2 normalization

English
words					files
	1P	DUAL	ADJ	NORM	1P	DUAL	ADJ	NORM
NW	60658	57807	33459	48399	128	124	81	106
BN	59239	58144	52444	55967	239	234	217	226
BC	46612	46110	33874	40415	68	67	52	60
WL	45210	43648	35529	37897	127	122	114	119
UN	45161	44473	26371	37366	58	57	37	49
CTS	47003	47003	34868	39845	46	46	34	39
Total	303833	297185	216545	259889	666	650	535	599

Chinese
Note: Chinese data expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
chars				files
	1P	DUAL	ADJ	1P	DUAL	ADJ
NW	127319	124175	121797	248	242	238
BN	134963	133696	120513	332	328	298
WL	71839	68063	65681	107	101	97
Total	334121	325834	307991	687	671	633
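
To compare the Chinese figures with the word counts given for English and Arabic, the character totals can be converted using the rough 1.5 characters/word ratio from the note above. The following is a minimal sketch only: the names CHARS_PER_WORD, chinese_char_totals and approx_words are illustrative and not part of the corpus, and the results are approximations, not official word counts.

```python
# Assumption: the "roughly 1.5 characters/word" ratio stated in the note above.
CHARS_PER_WORD = 1.5

# Character totals copied from the "Total" row of the Chinese table.
chinese_char_totals = {"1P": 334121, "DUAL": 325834, "ADJ": 307991}


def approx_words(chars: int, chars_per_word: float = CHARS_PER_WORD) -> int:
    """Return a rough word-count equivalent for a character count."""
    return round(chars / chars_per_word)


# Approximate word equivalents for each annotation status.
approx_word_totals = {
    status: approx_words(chars) for status, chars in chinese_char_totals.items()
}

for status, words in approx_word_totals.items():
    print(f"{status}: ~{words} words")
```

Because the ratio is only approximate, these values are best treated as order-of-magnitude estimates when sizing the Chinese data against the other two languages.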

Arabic
words				files
	1P	DUAL	ADJ	1P	DUAL	ADJ
NW	61287	56158	53026	239	226	221
BN	29259	27165	26907	134	128	127
WL	21687	20181	20181	60	55	55
Total	112233	103504	100114	433	409	403

Updates

Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2006T06.

Content Copyright

Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania
