GeniaJ (Pasquier 2010) is a Java implementation of the Genia tagger (part-of-speech tagging and shallow parsing for biomedical texts), version 3.0.1 of April 16, 2007. The original version was developed in C++ by Yoshimasa Tsuruoka from the Tsujii Laboratory at the University of Tokyo and distributed under the modified BSD licence. The datasets are identical to those of the original C++ version, and the output of this Java version should be identical to the output of the original.
For more information about the original software, see:
- Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.
Prepare a text file containing one sentence per line, then execute the program with:
java -Xmx500m -jar GeniaJ.jar < RAWTEXT > TAGGEDTEXT
The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.
word1	base1	POStag1	chunktag1	NEtag1
word2	base2	POStag2	chunktag2	NEtag2
  :	  :	  :	  :	  :
Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | java -Xmx500m -jar GeniaJ.jar
Inhibition	Inhibition	NN	B-NP	O
of	of	IN	B-PP	O
NF-kappaB	NF-kappaB	NN	B-NP	B-protein
activation	activation	NN	I-NP	O
reversed	reverse	VBD	B-VP	O
the	the	DT	B-NP	O
anti-apoptotic	anti-apoptotic	JJ	I-NP	O
effect	effect	NN	I-NP	O
of	of	IN	B-PP	O
isochamaejasmin	isochamaejasmin	NN	B-NP	O
.	.	.	O	O
You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
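The extraction described above can be sketched in Java as follows. This is only an illustration, not part of GeniaJ: the class name ChunkExtractor and the method extractSpans are hypothetical, and the sketch assumes five tab-separated columns per token and that the NE tags use the same B-/I- prefixes as the chunk tags (as the B-protein tag in the example suggests).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reading tagger output; not shipped with GeniaJ.
class ChunkExtractor {

    // Collects the phrases whose IOB2 tag in the given 0-based column has the
    // given type, e.g. column 3 with "NP" or column 4 with "protein".
    public static List<String> extractSpans(List<String> lines, int column, String type) {
        List<String> spans = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            String[] cols = line.split("\t");
            // A blank or short line is treated as "O", which closes any open span.
            String tag = cols.length > column ? cols[column] : "O";
            if (tag.equals("B-" + type)) {
                // A B- tag always starts a new span, closing the previous one.
                if (current != null) spans.add(current.toString());
                current = new StringBuilder(cols[0]);
            } else if (tag.equals("I-" + type) && current != null) {
                // An I- tag extends the span that is currently open.
                current.append(' ').append(cols[0]);
            } else if (current != null) {
                // Any other tag ends the open span.
                spans.add(current.toString());
                current = null;
            }
        }
        if (current != null) spans.add(current.toString());
        return spans;
    }

    public static void main(String[] args) {
        // The tagged sentence from the example above, one token per line.
        List<String> lines = Arrays.asList(
            "Inhibition\tInhibition\tNN\tB-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "NF-kappaB\tNF-kappaB\tNN\tB-NP\tB-protein",
            "activation\tactivation\tNN\tI-NP\tO",
            "reversed\treverse\tVBD\tB-VP\tO",
            "the\tthe\tDT\tB-NP\tO",
            "anti-apoptotic\tanti-apoptotic\tJJ\tI-NP\tO",
            "effect\teffect\tNN\tI-NP\tO",
            "of\tof\tIN\tB-PP\tO",
            "isochamaejasmin\tisochamaejasmin\tNN\tB-NP\tO",
            ".\t.\t.\tO\tO");
        // prints [Inhibition, NF-kappaB activation, the anti-apoptotic effect, isochamaejasmin]
        System.out.println(extractSpans(lines, 3, "NP"));
        // prints [NF-kappaB]
        System.out.println(extractSpans(lines, 4, "protein"));
    }
}
```

To process real output, the lines of the TAGGEDTEXT file can be passed to extractSpans in place of the hard-coded list.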
Pasquier, C. (2010), “Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation,” in 5th International Workshop on Semantic Evaluation, SemEval 2010 (ACL’10), Uppsala: Association for Computational Linguistics, pp. 154–157.