Corpus after |parts |np : wsj9_00*
Corpus after tagger : wsj_000*.mrg
programs are in /u/s_gael/program
the corpus in /u/s_gael/corpus
the results in /u/s_gael/result
Before |parts |np, all tags must be stripped from the corpus :
perl clean.pl
The result is a set of articles as
text files > |parts |np > articles ready for extraction
NP extraction programs :
perl simplenp.pl : returns noun phrases with adjectives
and/or articles removed :
- Entire NP
- Without articles
- Without adjectives
- Neither articles nor adjectives
- Nouns only
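The five outputs listed above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the code of simplenp.pl. It assumes an NP is already available as (tag, word) pairs with Penn Treebank tags, and it treats every DT as an article, which is broader than articles proper. The names np_variants and example_np are hypothetical.

```python
# Assumption: one noun phrase given as (tag, word) pairs.
ARTICLE_TAGS = {"DT"}                      # assumption: every determiner counts as an article
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def np_variants(np):
    """Return the five variants of a noun phrase as plain strings."""
    def keep(predicate):
        return " ".join(word for tag, word in np if predicate(tag))

    return {
        "entire": keep(lambda t: True),
        "without_articles": keep(lambda t: t not in ARTICLE_TAGS),
        "without_adjectives": keep(lambda t: t not in ADJECTIVE_TAGS),
        "neither": keep(lambda t: t not in ARTICLE_TAGS | ADJECTIVE_TAGS),
        "nouns_only": keep(lambda t: t in NOUN_TAGS),
    }


example_np = [("DT", "this"), ("JJ", "British"),
              ("JJ", "industrial"), ("NN", "conglomerate")]
variants = np_variants(example_np)
```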
perl complexnp.pl : returns complex noun phrases built by combining
simple noun phrases linked by : preposition, conjunction, verb, attribution (be),
subordination
- preposition
- conjunction
- preposition and conjunction
- verbs
- be
- subordination
- all of previous
perl occ.pl : returns statistics on those results. For example, on the
"Neither articles nor adjectives" output, it gives the occurrences over the
whole database for each noun : total on its own, total inside a compound noun,
and the articles and sentences where it occurs.
result in : occurrences
perl occ2.pl : returns only the total number of occurrences in the entire
database. It is faster, and it can help to find relevant semantic tags from
words, not only from syntactic tags. result in : occ2
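The two kinds of totals described for occ.pl and occ2.pl can be sketched as follows. This is an illustration in Python, not the Perl programs. It assumes the input is a flat list of extracted noun phrases, one string each, and that a phrase of more than one word counts as a compound noun. The name count_nouns and the sample phrases are hypothetical.

```python
from collections import Counter


def count_nouns(noun_phrases):
    """Count each word on its own and inside multi-word phrases."""
    alone, in_compound = Counter(), Counter()
    for phrase in noun_phrases:
        words = phrase.lower().split()
        # Assumption: a one-word phrase is the noun "by itself".
        target = alone if len(words) == 1 else in_compound
        target.update(words)
    return alone, in_compound


phrases = ["group", "publishing group", "conglomerate", "group"]
alone, in_compound = count_nouns(phrases)
total = alone + in_compound   # overall total, in the spirit of occ2.pl
```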
The corpus is already tagged with complex tags, for example
NP-SBJ (noun phrase subject), which carry syntactic information.
To see the plain text, use parsetotext.pl
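Recovering the text from a bracketed parse can be sketched like this. It is a Python illustration, not parsetotext.pl. It assumes leaves have the form (TAG word) and that empty elements carry the tag -NONE-, as in the standard Treebank files. The name parse_to_text and the sample sentence are hypothetical.

```python
import re

# A leaf is "(TAG word)": no whitespace or parentheses inside either part.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def parse_to_text(tree):
    """Join the words of a bracketed parse, skipping empty elements."""
    return " ".join(word for tag, word in LEAF.findall(tree)
                    if tag != "-NONE-")


sample = ("(S (NP-SBJ (DT the) (NNP Dutch) (VBG publishing) (NN group)) "
          "(VP (VBZ is) (ADJP-PRD (JJ large))) (. .))")
text = parse_to_text(sample)
```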
headtag.pl : returns the first 2 lines of the item. It gives a simple view of 1- or 2-item matches and the number of occurrences, whether separated or close together in the text. result in : headtag
searchtag.pl : returns from wsj_000*.mrg the structure and content of
complex tags from the first x sentences of each article.
NP-SBJ of the first sentence of each article
generally gives information about the source of the article, the person(s)
or institution(s) involved.
VP of the first sentence of each article
generally gives information about the event developed further in the article.
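Pulling the NP-SBJ and VP out of a first sentence can be sketched with a small bracket parser. This is a Python illustration, not searchtag.pl. It assumes one well-formed bracketed sentence per string and an exact label match (so NP-SBJ-1 would need its own lookup). The names parse, find_first and words, and the sample sentence, are hypothetical.

```python
import re


def parse(tree):
    """Parse one bracketed sentence into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                              # skip "("
        label = ""                            # the outer wrapper "( (S ...))" has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                              # skip ")"
        return [label] + children

    return node()


def find_first(tree, label):
    """Depth-first search for the first constituent with this exact label."""
    if tree[0] == label:
        return tree
    for child in tree[1:]:
        if isinstance(child, list):
            found = find_first(child, label)
            if found:
                return found
    return None


def words(tree):
    """Flatten a constituent into its list of words."""
    out = []
    for child in tree[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out


first = parse("( (S (NP-SBJ (DT the) (NNP Dutch) (VBG publishing) (NN group)) "
              "(VP (VBD bought) (NP (DT a) (NN company))) (. .)) )")
subject = words(find_first(first, "NP-SBJ"))
event = words(find_first(first, "VP"))
```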
searchtag2.pl : returns from the results of searchtag.pl the internal elements we can use to build rules about the content of this information, for example who is concerned, by searching for the NNP. result in : searchtag2
To search for tags co-occurring in the same area of text :
autour.pl > autour
autour2.pl > autour2
For some statistics about an item, occurrences in the whole database,
per article and per sentence : perl stat.pl
result in stat
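The three levels of counting (whole database, per article, per sentence) can be sketched as follows. This is a Python illustration, not stat.pl. It assumes the database is a mapping from article name to a list of plain-text sentences and that items are matched as whole, case-sensitive tokens. The name item_stats and the sample data are hypothetical.

```python
def item_stats(articles, item):
    """Return (total, per-article counts, per-sentence counts) for one item."""
    per_article, per_sentence = {}, {}
    for name, sentences in articles.items():
        for number, sentence in enumerate(sentences, start=1):
            n = sentence.split().count(item)
            if n:
                per_sentence[(name, number)] = n
                per_article[name] = per_article.get(name, 0) + n
    return sum(per_article.values()), per_article, per_sentence


articles = {
    "article_a": ["the group bought the company", "the company rose"],
    "article_b": ["a group of investors"],
}
total, per_article, per_sentence = item_stats(articles, "the")
```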
To quickly view the syntactic shape of a text, use parsetotag.pl :
it returns in parsetotag the structure
with the words stripped out.
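Stripping the words while keeping the structure can be sketched with one substitution. This is a Python illustration, not parsetotag.pl, and it assumes leaves of the form (TAG word). The name parse_to_tag and the sample sentence are hypothetical.

```python
import re


def parse_to_tag(tree):
    """Replace every leaf "(TAG word)" with "(TAG)", keeping the brackets."""
    return re.sub(r"\(([^\s()]+)\s+[^\s()]+\)", r"(\1)", tree)


shape = parse_to_tag("(S (NP-SBJ (DT the) (NN group)) (VP (VBD rose)) (. .))")
```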
The program searchstring.pl returns in searchstring
the lines containing the string, with its occurrences as a whole word, prefix,
suffix, or infix.
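The four kinds of occurrence can be told apart as in the sketch below. This is a Python illustration, not searchstring.pl. It assumes case-sensitive matching and that a word is a run of letters, digits or underscores. The name classify and the sample line are hypothetical.

```python
import re


def classify(line, string):
    """Label each word of the line that contains the string."""
    found = []
    for word in re.findall(r"\w+", line):
        if word == string:
            found.append((word, "word"))
        elif word.startswith(string):
            found.append((word, "prefix"))
        elif word.endswith(string):
            found.append((word, "suffix"))
        elif string in word:
            found.append((word, "infix"))
    return found


hits = classify("the market and marketing of supermarkets in Newmarket", "market")
```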
Tag count: 33032 Tag one occurrence: 627
4647 NP | Noun phrase
2970 VP | Verb phrase
2530 NN | Common noun, singular or mass
1937 IN | Preposition or subordinating conjunction
1743 DT | Determiner
1734 S | Sentence
1542 NP-SBJ | Noun phrase subject
1447 NNP | Proper noun, singular
1248 NNS | Common noun, plural
1116 JJ | Adjective
1057 PP | Prepositional phrase
614 VB | Verb, base form
551 RB | Adverb
477 TO | To
460 CC | Coordinating conjunction
444 VBD | Verb, past tense -ed
430 VBZ | Verb, present tense, 3rd person singular -s
428 CD | Cardinal number
428 VBN | Past participle
413 SBAR | Subordinate clause
406 NONE | Ellipsis
378 PRP | Personal pronoun
297 VBG | Present participle, gerund
281 VBP | Verb, present tense, other persons
226 ADVP | Adverb phrase
217 PP-CLR | Circumstantial complement
212 MD | Modal (all verbs without -s in the 3rd person)
206 PP-LOC | Complement of place
193 NONE-*-1 |
172 NP-SBJ-1 | Subject different from that of the main clause
164 PRP$ | Possessive pronoun
162 NONE-*T*-1 |
141 PP-TMP | Complement of time
135 ADJP-PRD | Adjective phrase, subject complement
126 QP | Quantifier phrase or complement of quantity
121 ADVP-TMP | Adverb phrase of time
119 ADJP | Adjective phrase
117 POSS |
110 NP-PRD | Noun phrase, subject complement
94 SBAR-ADV | Adverbial subordinate clause
92 JJR | Comparative adjective : -er, more, less
92 WDT | Relative pronoun : which, that (wh- determiner)
88 NONE-*U |
86 NN% |
81 $$ |
78 S-NOM | Verb-based noun phrase inside a prepositional phrase
65 NNPS | Proper noun, plural
65 NONE-*T*-2 |
64 WP | Relative pronoun : who, what, whom (wh- pronoun)
63 WHNP-1 |
62 NONE-*-2 |
61 RBT |
60 PRN |
59 ADVP-MNR |
53 NP-SBJ-2 |
53 PP-DIR |
46 PRT |
45 S-TPC-1 |
44 NP-LGS |
40 SINV |
39 WHNP-2 |
37 NP-TMP |
37 RP | Verb particle
37 S-ADV |
35 WHADVP-1 |
34 SBAR-TMP |
34 WRB | Interrogative adverb : how, where, why (wh- adverb)
28 JJS | Superlative adjective
26 NONE-*-3 |
26 RRB--RRB |
25 NP-SBJ-3 |
24 NONE-*T*-3 |
23 LRB--LRB |
22 PP-MNR |
21 RBR | Comparative adverb -er : more, less, later + ADJ
20 && |
20 ADVP-DIR | Adverb phrase of direction
20 NX |
19 NP-ADV |
19 PP-PRD |
18 SBAR-PRP |
17 S-1 |
16 ADVP-LOC |
16 CD000 |
16 EX | Existential "there"
16 S-PRP |
16 SBAR-NOM |
16 SQ |
16 VBZS |
15 NP-1 |
14 NONE-*ICH*-1 |
14 PP-PRP |
12 NP-LOC |
12 S-CLR | Circumstantial clause
12 WHNP |
12 WHNP-3 |
11 NNP&P |
11 S-PRD |
11 SBARQ |
10 S-TPC-2 |
10 WHADVP-2 |
9 NP-2 |
9 POS | Possessive ending : 's
8 CC& |
8 CD\/2 |
8 NONE-*ICH*-2 |
8 NONE-*RNR*-1 |
8 PP-LOC-CLR |
8 PP-PUT |
8 RBS | Superlative adverb : most
8 S-2 |
8 S-HLN |
8 SBAR-PRD |
8 VBPRE |
7 CD\/4 |
7 NNP-PACIFIC |
7 NONE-*T*-4 |
7 NP-TMP-CLR |
7 PP-DTV |
6 PDT | Predeterminer : all the world
6 S-TPC-3 |
6 UCP |
6 VBPVE |
5 CD3 |
5 CD\/8 |
5 FRAG |
5 INTJ |
5 NNPD |
5 NONE-*EXP*-1 |
5 S-NOM-SBJ |
5 WHPP-1 |
4 ADVP-PRD |
4 NNPSA |
4 NP-EXT |
4 PP-TMP-CLR |
4 SBAR-1 |
4 SBAR-CLR |
3 ADVP-CLR |
3 ADVP-PRP |
3 CD55 |
3 LS | List item marker
3 LST | Numbered list
3 NAC |
3 NNPC |
3 NNPK |
3 NONE-*-52 |
3 NP-SBJ-4 |
3 PP-1 |
3 VBPM |
3 WHADVP-3 |
3 WHADVP-4 |
3 WHNP-4 |
3 WP$ | Possessive pronoun : wh-
2 ADJP-ADV |
2 CD07 |
2 CD25 |
2 CD4 |
2 CD64 |
2 CD95 |
2 CONJP |
2 FW | Foreign word
2 LRB--LCB |
2 NAC-LOC |
2 NNP-AMERICAN |
2 NNP-MELLON |
2 NNPBRIEN |
2 NP-CLR |
2 NP-HLN |
2 NP-VOC |
2 PP-DIR-2 |
2 PP-DIR-CLR |
2 PP-EXT |
2 RRB--RCB |
2 S-3 |
2 S-MNR |
2 S-PRP-CLR |
2 SBAR-NOM-PRD |
2 SBAR-NOM-SBJ |
2 SBARQ-NOM |
2 UCP-PRD |
2 UH | Interjection
2 VP-1 |
2 WHPP |
0 SYM | Symbol
1 ADJP-2 |
1 ADJP-CLR |
1 ADJP-TPC-1 |
1 ADVP-LOC-CLR |
1 ADVP-PUT |
1 ADVP|PRT |
1 FRAG-ADV |
1 FRAG-TTL-SBJ-1 |
1 JJ-BUSH |
1 JJ-SPEAKER |
1 JJ000 |
1 MDD |
1 MDLL |
1 NAC-TMP |
1 NNP-BACHE |
1 NNP-BUICK |
1 NNP-CONTRA |
1 NNP-DEFICIENCY |
1 NNP-SCOTT-RODINO |
1 NNP-SENATE |
1 NNP-TOTE |
1 NNP-TRACK |
1 NNPA |
1 NNPI |
1 NNPJ |
1 NNPY |
1 NNP\/DEL |
1 NNP\/FAWCETT |
1 NNSDS |
1 NP-3 |
1 NP-MNR |
1 NP-SBJ-9 |
1 NP-TMP-HLN |
1 NP-TTL |
1 PP-2 |
1 PP-BNF |
1 PP-DIR=2 |
1 PP-LOC-1 |
1 PP-LOC-CLR-TPC-1 |
1 PP-LOC-PRD |
1 PP-LOC=1 |
1 PP-TMP-PRD |
1 PP-TPC-1 |
1 RRC |
1 S-NOM-PRD |
1 S-SBJ |
1 S-TPC-4 |
1 SBAR-2 |
1 SBAR-4 |
1 SBAR-ADV-3 |
1 SBAR-LOC |
1 SBAR-MNR |
1 SBAR-NOM-1 |
1 SINV-2 |
1 SINV-TPC-1 |
1 UCP-MNR |
1 UCP-PRP |
1 VP-TPC-1 |
1 WHADVP-5 |
1 WHPP-3 |
1 X-HLN |
-PRD | Subject complement (predicate)
-SBJ | Subject
-TMP | Temporal
-CLR | Circumstantial (closely related)
-ADV | Adverbial
-LOC | Locative
-NOM | Nominal
-DIR | Directional
-MNR | Manner
-2 | Index of a tag's occurrence within a sentence, or type of tag
-HLN | Headline
2141 : NP suffixed
7914 : NP suffixed and non-suffixed
2327 : VB suffixed or not
549 : VB non-suffixed
1778 : VB suffixed
31623 tags
4165 punctuation marks
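Tag frequencies and the suffixed / non-suffixed split can be computed along these lines. This is a Python sketch and not the script that produced the figures above, so it is not expected to reproduce them. It assumes a tag is whatever follows an opening parenthesis and that the suffix is everything after the first hyphen. The names tag_counts and split_tag, and the sample sentence, are hypothetical.

```python
import re
from collections import Counter


def tag_counts(tree):
    """Count every label that directly follows an opening parenthesis."""
    return Counter(re.findall(r"\(([^\s()]+)", tree))


def split_tag(tag):
    """Split "NP-SBJ" into ("NP", "SBJ"); an unsuffixed tag gives ("NP", "")."""
    base, _, suffix = tag.partition("-")
    return base, suffix


sample = ("(S (NP-SBJ (DT the) (NN group)) "
          "(VP (VBD bought) (NP (DT a) (NN company))) (. .))")
counts = tag_counts(sample)
np_all = sum(n for tag, n in counts.items() if split_tag(tag)[0] == "NP")
np_suffixed = sum(n for tag, n in counts.items()
                  if split_tag(tag)[0] == "NP" and split_tag(tag)[1])
```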
TAGS :
thema
event
statement
description
attribut
object
result
action
detail
sub-tags :
org-pers : organization or person (usually : NNP) or (DT NNP (or JJ + capital letter) + ...)
(NP (DT the) (NNP Dutch) (VBG publishing) (NN group))
(NP (DT this) (JJ British) (JJ industrial) (NN conglomerate))
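The org-pers pattern can be tested on a noun phrase as in the sketch below. This is one possible Python reading of the pattern, not an established rule. It assumes the NP is given as (tag, word) pairs, and it only looks at the item right after the determiner. The name is_org_pers is hypothetical.

```python
def is_org_pers(np):
    """Guess whether a noun phrase names an organization or a person."""
    if not np:
        return False
    tags = [tag for tag, _ in np]
    if all(tag == "NNP" for tag in tags):          # usual case: only NNP
        return True
    if tags[0] == "DT" and len(np) > 1:            # DT + NNP, or DT + capitalized JJ
        tag, word = np[1]
        return tag == "NNP" or (tag == "JJ" and word[:1].isupper())
    return False


dutch = [("DT", "the"), ("NNP", "Dutch"), ("VBG", "publishing"), ("NN", "group")]
british = [("DT", "this"), ("JJ", "British"), ("JJ", "industrial"),
           ("NN", "conglomerate")]
plain = [("DT", "the"), ("NN", "group")]
```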
fact
cause
field
consequence
comments
nuance
affirmation
aim
definition
comparison
conclusion
exposure
event
frequency
quality
characteristic
actor
sources of information [info source]
declaration
Index of tagged items that can
be used to find some recursive semantic structures :
A
Adjectives
Adverbs
manner
time : already -> past part. :
+ [consequence] || [result] / [fact]
place
Interrogative adverbs
Articles
AS + Comparative + AS : + [comparison]
To show no difference: AS + MUCH + AS , AS + MANY + AS : + [comparison]
Auxiliary Verbs
B
Be : As an ordinary verb [definition] [description] [characteristic]
Be: As an auxiliary : + [characteristic]
C
Can
Could
Classes of adverbs
Comparative + than
Comparison of Adjectives
Comparing adverbs
Comparison of quantity (adjectives)
Compound Nouns
Countable and Uncountable nouns
D
Definite article: the : new [thema ]
Demonstratives : [cause] | [comments] / [thema]
Determiners
Distributives : either, or, neither, nor, each, every
Do : as an ordinary verb
Do : As an auxiliary
E
Either
Each
Every
Enough + noun
Enough -(adverb section)
Exclamatives: such and what
F
Form of adverbs
G
H
Have, Have got, have got to : As ordinary verbs [characteristic] [quality]
Have : As an auxiliary [consequence] [result] [fact]
I
Indefinite articles: a, an : more information about? [description], -> definite article
Indefinite Pronouns
Interrogative adverbs
Interrogative and Negative of Ordinary Verbs
J
K
L
M
May
Might
Must
Much, many
Modal auxiliary verbs
MORE, LESS, FEWER + THAN : To show difference
N
Nationalities
Need
Nouns
Not as...as
Numbers
O
One / Ones : Pronouns
Ought to
P
Personal and Possessive Pronouns
Personal Pronouns
Plural of Nouns
Possessives : my, your, his, her, its, our, their
Possessive Pronouns
Possessive with 's and '
Preposition : VPB + (PP (TO to : [object/thema]
Pronouns
Proper Nouns : [info source] [thema] [object]
Q
Quantifiers: a few, a little, much, many, a lot of, most, any, some, enough, etc.
R
Reflexive Pronouns
Relative Pronouns : Who, Whom, That, Which
S
Shall
Should
Some
Still as an adverb of time
Such
T
That : Relative Pronoun
the + Superlative
This, that, these, those
U
used to
V
VERBS
W
What (as exclamative)
Will
Who, Whom, Which : Relative Pronouns
would
X
Y
Yet as an adverb of time
Z
The major rules need to be built on a succession of lexical tags that indicate the main ideas, their relationships and an order of importance.
Some salient tags can indicate the relationship between main parts,
such as noun phrase subject or attribute.
Examples of rules observed in the first sentences (NP-SBJ extraction, VP extraction) :
thema :
n*NNP | [ quality ] | [event] || [description]
n*NNP | [ quality ] | (CC and) | n*NNP | [ quality ] | [event] || [description]
event :
[org-pers]1 | [decision to do] | [have effect on] | [org-pers]2 | [illustration]
MD will | [ object ]
illustration :
[data and amount]
description :
[ thema ] | VBZ is | [ quality ]
, | ADJP
object :
[ thema ] | [ event ] | NP
quality :
ADJ
NP
, | ADJP
org-pers :
n*NNP | [ quality ]
tagging keys :
scale of priority :
event+++ or statement++ or description+
links between tags : X about Y = X/Y = X -> Y
attribut of a thema : attribut/thema
thema with no common words :
thema1, thema2
with common noun or adjective :
thema1.1, thema 1.2
scheme after tagging :
n* : one item or more of the same
[ ] : optional