GeniaJ

Release date: 2010
Implementation: Java
Author: Pasquier, C.

GeniaJ (Pasquier 2010) is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical text), version 3.0.1 of April 16, 2007. The original version was developed in C++ by Yoshimasa Tsuruoka of the Tsujii Laboratory at the University of Tokyo and distributed under the modified BSD licence. GeniaJ uses the same datasets as the original C++ version, and its output should be identical to the output of the original.

For more information about the original software, see:

  • Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.

Execution

Prepare a text file containing one sentence per line, then execute the program with:

java -Xmx500m -jar GeniaJ.jar < RAWTEXT > TAGGEDTEXT

The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.

word1   base1   POStag1 chunktag1 NEtag1
word2   base2   POStag2 chunktag2 NEtag2
  :       :        :       :        :

Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
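
As an illustration of the format described above, the following is a minimal Java sketch that splits one output line into its five fields. The class name TaggedToken and its field names are hypothetical and not part of GeniaJ; the sketch assumes the columns are separated by single tab characters and that blank lines (if the tagger emits any) can be skipped.

```java
// Hypothetical helper, not part of GeniaJ: holds the five columns of one output line.
class TaggedToken {
    public final String word;   // surface form
    public final String base;   // base form
    public final String pos;    // part-of-speech tag
    public final String chunk;  // chunk tag in IOB2 format, e.g. B-NP
    public final String ne;     // named entity tag, e.g. B-protein or O

    TaggedToken(String word, String base, String pos, String chunk, String ne) {
        this.word = word;
        this.base = base;
        this.pos = pos;
        this.chunk = chunk;
        this.ne = ne;
    }

    /** Parses one tab-separated line; returns null for a blank line (assumption). */
    static TaggedToken parse(String line) {
        if (line.trim().isEmpty()) {
            return null;
        }
        String[] f = line.split("\t");
        if (f.length != 5) {
            throw new IllegalArgumentException("Expected 5 tab-separated fields: " + line);
        }
        return new TaggedToken(f[0], f[1], f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        TaggedToken t = parse("reversed\treverse\tVBD\tB-VP\tO");
        // Prints: reversed -> reverse (VBD, B-VP, O)
        System.out.println(t.word + " -> " + t.base
                + " (" + t.pos + ", " + t.chunk + ", " + t.ne + ")");
    }
}
```

In practice the lines would be read from the TAGGEDTEXT file rather than hard-coded.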

Example

> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | java -Xmx500m -jar GeniaJ.jar

Inhibition      Inhibition      NN      B-NP     O
of              of              IN      B-PP     O
NF-kappaB       NF-kappaB       NN      B-NP     B-protein
activation      activation      NN      I-NP     O
reversed        reverse         VBD     B-VP     O
the             the             DT      B-NP     O
anti-apoptotic  anti-apoptotic  JJ      I-NP     O
effect          effect          NN      I-NP     O
of              of              IN      B-PP     O
isochamaejasmin isochamaejasmin NN      B-NP     O
.               .               .       O        O

You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
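
The extraction described above can be sketched in Java as follows. This is only an illustrative post-processing step, not code shipped with GeniaJ: the class name ChunkExtractor and the method extract are hypothetical, and the sketch assumes tab-separated columns with the chunk tag in column index 3 and the named entity tag in column index 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GeniaJ: groups tokens into chunks using IOB2 tags.
class ChunkExtractor {

    /**
     * Returns the text of every chunk of the given type (e.g. "NP" or "protein"),
     * reading the IOB2 tag from the given zero-based column of each line.
     */
    static List<String> extract(List<String> lines, int column, String type) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            String[] f = line.split("\t");
            // Blank or short lines are treated as outside any chunk.
            String tag = f.length > column ? f[column] : "O";
            if (tag.equals("B-" + type)) {
                // A B- tag closes any open chunk and starts a new one.
                if (current != null) chunks.add(current.toString());
                current = new StringBuilder(f[0]);
            } else if (tag.equals("I-" + type) && current != null) {
                // An I- tag continues the open chunk.
                current.append(' ').append(f[0]);
            } else {
                // Any other tag closes the open chunk, if there is one.
                if (current != null) chunks.add(current.toString());
                current = null;
            }
        }
        if (current != null) chunks.add(current.toString());
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Inhibition\tInhibition\tNN\tB-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "NF-kappaB\tNF-kappaB\tNN\tB-NP\tB-protein",
            "activation\tactivation\tNN\tI-NP\tO",
            "reversed\treverse\tVBD\tB-VP\tO",
            "the\tthe\tDT\tB-NP\tO",
            "anti-apoptotic\tanti-apoptotic\tJJ\tI-NP\tO",
            "effect\teffect\tNN\tI-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "isochamaejasmin\tisochamaejasmin\tNN\tB-NP\tO",
            ".\t.\t.\tO\tO");
        // Prints: [Inhibition, NF-kappaB activation, the anti-apoptotic effect, isochamaejasmin]
        System.out.println(extract(lines, 3, "NP"));
        // Prints: [NF-kappaB]
        System.out.println(extract(lines, 4, "protein"));
    }
}
```

The same method handles both columns because chunk tags and named entity tags share the B-/I-/O scheme in the example output.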

Pasquier, C. (2010), “Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation,” in 5th International Workshop on Semantic Evaluation, SemEval 2010 (ACL’10), Uppsala: Association for Computational Linguistics, pp. 154–157.

Claude Pasquier
Researcher in Computer Science / Computational Biology

Université Côte d’Azur, CNRS, I3S Laboratory, Sophia Antipolis
