mtag - A practical Part-of-Speech Tagger
SYNOPSIS
-
mtag
{-i <Input Text> }
{-o <Output Text> }
{-m <Compiled_matrices_file> }
{-P <Precision>}
{-C Nbr fields}
{-l Primary separator}
{-p secondary separator}
{-r <Correct tag list file>}
{-O Print the results with the original tag set}
{-n <Correct tag list file>}
{-L Loop Number}
{-M <Output Matrices file >}
{-t <file> }
{-v print version}
DESCRIPTION
A part-of-speech tagger that uses context to assign the most probable part
of speech tag(s) to each word in a text from a set of tags. mtag
employs the Viterbi algorithm and makes use of ambiguity classes in the
model to reduce the number of parameters to be estimated. The tagger uses
the values set in the matrices created in the training program to calculate
the optimal tag sequence. The precision criterion allows the solution set
to be expanded to include more than one tag.
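As an illustration of the idea described above, here is a minimal Viterbi
sketch in Python. It is not mtag's implementation: the function name, the toy
probabilities, and the layout of the tables are all assumptions made for this
example. The observations are ambiguity classes (the set of tags a word may
take) rather than individual words, which is how such a model keeps the
number of emission parameters small.

```python
import math

def viterbi(observations, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence, computed in log space."""
    # best[i][t]: best log-probability of any path ending in tag t at position i
    best = [{t: math.log(start_p[t]) + math.log(emit_p[t][observations[0]])
             for t in tags}]
    back = [{}]
    for i in range(1, len(observations)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximises the path score into t.
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p][t])) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + math.log(emit_p[t][observations[i]])
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(observations) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model: three tags, two ambiguity classes.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
    "NOUN": {"DET": 0.10, "NOUN": 0.20, "VERB": 0.70},
    "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
}
# emit_p[tag][ambiguity_class]; small non-zero values avoid log(0).
EMIT_P = {
    "DET":  {"DET": 0.98, "NOUN_VERB": 0.02},
    "NOUN": {"DET": 0.01, "NOUN_VERB": 0.99},
    "VERB": {"DET": 0.01, "NOUN_VERB": 0.99},
}

# Classes for a sentence such as "the run ends".
CLASSES = ["DET", "NOUN_VERB", "NOUN_VERB"]
print(viterbi(CLASSES, TAGS, START_P, TRANS_P, EMIT_P))
```

In this toy model the two ambiguous words share one class, so only the
transition values decide between NOUN and VERB.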
An additional facility is provided to compare the output of the tagger
with a pre-tagged version of the text. If the -r option is used
with the correct list of tags (disambiguated manually with mhandtag),
the tagger prints out statistics on the errors. If the -n, -L, or -M
options are given, the tagger automatically readjusts the values
in the matrices according to the correct solutions and retags the text.
If no files are given, the standard input is used. The program returns
the disambiguated text.
OPTIONS
-
mtag supports the following options:
-
-i Input text
-
Specifies the input text formatted by the mpreptxt program (default
stdin).
-
-o Output text
-
Specifies the program output (default stdout).
-
-m compiled_matrices_file
-
Specifies the matrices file produced by the training program, which
includes the definition of the tag and class sets (see also the mcreate
program). The default is MM.cmp.
-
-C Nbr fields
-
Number of fields before the [BOS|EOS] field (the default is 1)
-
-P Precision
-
The precision criterion allows the solution set to be extended to more
than one tag. With the default value 0, no interval is tolerated and
exactly one solution (the best score) is returned per word. Increasing
the value widens the interval of probabilities accepted between
different tags, and thus the number of solutions. Any increase may
produce more than one result, depending on how close the probabilities
are for a given set of tags. Note that increasing this value increases
processing time.
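The effect of the precision interval can be sketched as follows. This is
an assumption-laden illustration, not mtag's actual rule: the function
select_tags, the use of log-probability scores, and the comparison
"best - score <= precision" are all hypothetical.

```python
def select_tags(scores, precision=0.0):
    """Keep every tag whose score lies within `precision` of the best score.

    `scores` maps a tag to a score (here: assumed log-probabilities).
    With precision 0 only the best tag (and exact ties) is kept.
    """
    best = max(scores.values())
    kept = [tag for tag, score in scores.items() if best - score <= precision]
    # Best-scoring tag first.
    return sorted(kept, key=lambda tag: -scores[tag])

# Hypothetical scores for one word.
WORD_SCORES = {"NOUN": -1.2, "VERB": -1.5, "ADJ": -4.0}

print(select_tags(WORD_SCORES))        # default: a single solution
print(select_tags(WORD_SCORES, 0.5))   # close competitors are kept too
```

A wider interval keeps more candidates per word, which is consistent with
the note above that larger values cost more processing time.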
-
-r Correct tag list file
-
mtag tags the input text and prints to the standard output statistics
on the transition values, obtained by comparing the tags it found with
the corrected tags. This information is useful for writing a biases
file. With this option the precision is 0. The correct tag list file
can be obtained with the mhandtag program, or can be created by hand
as a list of correct tags in one column.
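To make the comparison concrete, the sketch below reads a one-column tag
list and counts the mismatches against a tagger's output. It only
illustrates the kind of bookkeeping involved; the statistics mtag actually
prints are not specified here, and the helper names read_tag_column and
error_counts are hypothetical.

```python
from collections import Counter

def read_tag_column(text):
    """Parse a one-column tag list: one tag per line, blank lines skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def error_counts(found, correct):
    """Count (correct_tag, found_tag) pairs where the tagger was wrong."""
    if len(found) != len(correct):
        raise ValueError("tag lists must have the same length")
    return Counter((c, f) for f, c in zip(found, correct) if f != c)

# Hypothetical data: the hand-corrected list and the tagger's output.
CORRECT = read_tag_column("DET\nNOUN\nVERB\n")
FOUND = ["DET", "VERB", "VERB"]

ERRORS = error_counts(FOUND, CORRECT)
ACCURACY = 1 - sum(ERRORS.values()) / len(CORRECT)
print(ERRORS, ACCURACY)
```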
-
-n Correct tag list file
-
mtag tags the input text, compares the results with the hand-corrected
tag list (corresponding to the text), and retags the text. This operation
refines the transition matrix and normally improves accuracy. However,
if the hand-corrected tag list is not large enough (~ 10-20% of the
original text), performance can decrease. The correct tag list file
can be obtained with the mhandtag program, or can be created by hand
as a list of correct tags in one column. A number of loops can be
specified (-L option) before the final tagged text is printed to the
output device. With this option the precision is 0.
-
-O Print result in the original tag set.
-
Prints the result of the tagging with the original tag set. Depending on
your tag conversion list, converting back to the original tags may
introduce some disjunctions.
-
-L Loop number
-
Number of loops made by the tagger before printing the final tagged
text to the output device (used with the -n option).
-
-M Output matrices file
-
Saves the new matrices created with the -n or -r option.
-
-l primary separator
-
Specifies the separator within [LEM,TAG] pairs (the default character is '\').
-
-p secondary separator
-
Specifies the separator between [LEM,TAG] pairs (default character is '|').
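As a sketch of how the two separators interact, the snippet below splits
a field into [LEM,TAG] pairs using the default characters. The exact
layout of the field in mtag's input is an assumption here, and
parse_pairs is a hypothetical helper, not part of the mtag tools.

```python
def parse_pairs(field, primary="\\", secondary="|"):
    """Split a field into (lemma, tag) pairs.

    `secondary` separates the pairs from each other;
    `primary` separates the lemma from the tag inside one pair.
    """
    pairs = []
    for chunk in field.split(secondary):
        lemma, _, tag = chunk.partition(primary)
        pairs.append((lemma, tag))
    return pairs

# Hypothetical field for an ambiguous word with two readings.
print(parse_pairs("run\\NOUN|run\\VERB"))
```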
-
-t Print tag list in file
-
This option inserts the list of correct tags into the file as the last
column. This list is useful for the mdiff and mdiffb programs.
-
-v version
-
Prints the program version.
INPUT/OUTPUT
Description of the input and output files used by the program.
- Initial text formatted : [stdin]
- Compiled matrices : $MM.cmp
Output ==>
- Disambiguated text : [stdout]
COMMAND EXAMPLES:
The tagger uses the matrices M1. The precision 1 (-P option)
extends the solution set (one or more solutions) to include those tags
assigned to a given word with very close probabilities.
mtag -i text -o text.tag -m M1 -P 1 -C 3 -l '\' -p '|'
To print the results with the original tag set:
mtag -i text -o text.tag -m M1 -C 3 -l '\' -p '|' -O
Tag the file using the hand-corrected tag list file HAND_TAGGED
corresponding to the file text, with 5 loops, and write the final
matrices file to M5.new:
mtag -i text -o text.tag -m M1 -P 1 -C 3 -l '\' -p '|' -n HAND_TAGGED -M M5.new -L 5
SEE ALSO
mpreptxt(1)
mtrain(1)
mcreate(1)
mtagfreq(1)
mprint(1)
mdiff(1)
mdiffb(1)
mcontext(1)
mbiases(1)
mhandtag(1)
AUTHOR
Gilbert ROBERT
(Gilbert.Robert@issco.unige.ch)
ISSCO, 54 route des Acacias
1227 Geneva, Switzerland
Comments, suggestions, and bug reports are always welcome.