| Stanford NER - v4.2.0 - 2020-11-17 | |
| ---------------------------------------------- | |
| This package provides a high-performance machine learning based named | |
| entity recognition system, including facilities to train models from | |
| supervised training data and pre-trained models for English. | |
| (c) 2002-2020. The Board of Trustees of The Leland | |
| Stanford Junior University. All Rights Reserved. | |
| Original CRF code by Jenny Finkel. | |
| Additional modules, features, internationalization, compaction, and | |
| support code by Christopher Manning, Dan Klein, Christopher Cox, Huy Nguyen | |
| Shipra Dingare, Anna Rafferty, and John Bauer. | |
| This release prepared by Jason Bolton. | |
| LICENSE | |
| The software is licensed under the full GPL v2+. Please see the file LICENCE.txt | |
| For more information, bug reports, and fixes, contact: | |
| Christopher Manning | |
| Dept of Computer Science, Gates 2A | |
| Stanford CA 94305-9020 | |
| USA | |
| [email protected] | |
| https://nlp.stanford.edu/software/CRF-NER.html | |
| CONTACT | |
| For questions about this distribution, please contact Stanford's JavaNLP group | |
| at [email protected]. We provide assistance on a best-effort | |
| basis. | |
| TUTORIAL | |
| Quickstart guidelines, primarily for end users who wish to use the included NER | |
| models, are below. For further instructions on training your own NER model, | |
| go to https://nlp.stanford.edu/software/crf-faq.html. | |
| INCLUDED SERIALIZED MODELS / TRAINING DATA | |
| The basic included serialized model is a 3 class NER tagger that can | |
| label: PERSON, ORGANIZATION, and LOCATION entities. It is included as | |
| english.all.3class.distsim.crf.ser.gz. It is trained on data from | |
| CoNLL, MUC6, MUC7, ACE, OntoNotes, and Wikipedia. | |
| Because this model is trained on both US | |
| and UK newswire, it is fairly robust across the two domains. | |
| We have also included a 4 class NER tagger trained on the CoNLL 2003 | |
| Shared Task training data that labels for PERSON, ORGANIZATION, | |
| LOCATION, and MISC. It is named | |
| english.conll.4class.distsim.crf.ser.gz . | |
| A third model is trained only on data from MUC and | |
| distinguishes between 7 different classes: | |
| english.muc.7class.distsim.crf.ser.gz. | |
| All of the serialized classifiers come in two versions, one trained to | |
| basically expected standard written English capitalization, and the other | |
| to ignore capitalization information. The case-insensitive versions | |
| of the three models available on the Stanford NER webpage. | |
| These models use a distributional similarity lexicon to improve performance | |
| (by between 1.5%-3% F-measure). The distributional similarity features | |
| make the models perform substantially better, but they require rather | |
| more memory. The distsim models are included in the release package. | |
| The nodistsim versions of the same models may be available on the | |
| Stanford NER webpage. | |
| Finally, we have models for other languages, including two German models, | |
| a Chinese model, and a Spanish model. The files for these models can be | |
| found at: | |
| http://nlp.stanford.edu/software/CRF-NER.html | |
| QUICKSTART INSTRUCTIONS | |
| This NER system requires Java 1.8 or later. | |
| Providing java is on your PATH, you should be able to run an NER GUI | |
| demonstration by just clicking. It might work to double-click on the | |
| stanford-ner.jar archive but this may well fail as the operating system | |
| does not give Java enough memory for our NER system, so it is safer to | |
| instead double click on the ner-gui.bat icon (Windows) or ner-gui.sh | |
| (Linux/Unix/MacOSX). Then, using the top option from the Classifier | |
| menu, load a CRF classifier from the classifiers directory of the | |
| distribution. You can then `either load a text file or web page from | |
| the File menu, or decide to use the default text in the window. Finally, | |
| you can now named entity tag the text by pressing the Run NER button. | |
| From a command line, you need to have java on your PATH and the | |
| stanford-ner.jar file and the lib directory in your CLASSPATH. (The way of doing this depends on | |
| your OS/shell.) The supplied ner.bat and ner.sh should work to allow | |
| you to tag a single file. For example, for Windows: | |
| ner file | |
| Or on Unix/Linux you should be able to parse the test file in the distribution | |
| directory with the command: | |
| java -mx600m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt | |
| Here's an output option that will print out entities and their class to | |
| the first two columns of a tab-separated columns output file: | |
| java -mx600m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile sample.txt > sample.tsv | |
| When run from a jar file, you also have the option of using a serialized | |
| classifier contained in the jar file. | |
| USING FULL STANFORD CORENLP NER FUNCTIONALITY | |
| This standalone distribution also allows access to the full NER | |
| capabilities of the Stanford CoreNLP pipeline. These capabilities | |
| can be accessed via the NERClassifierCombiner class. | |
| NERClassifierCombiner allows for multiple CRFs to be used together, | |
| and has options for recognizing numeric sequence patterns and time | |
| patterns with the rule-based NER of SUTime. | |
| Suppose one combines three CRF's CRF-1,CRF-2, and CRF-3 with the | |
| NERClassifierCombiner. When the NERClassiferCombiner runs, it will | |
| first apply the NER tags of CRF-1 to the text, then it will apply | |
| CRF-2's NER tags to any tokens not tagged by CRF-1 and so on. If | |
| the option ner.combinationMode is set to NORMAL (default), any label | |
| applied by CRF-1 cannot be applied by subsequent CRF's. For instance | |
| if CRF-1 applies the LOCATION tag, no other CRF's LOCATION tag will be | |
| used. If ner.combinationMode is set to HIGH_RECALL, this limitation | |
| will be deactivated. | |
| To use NERClassifierCombiner at the command-line, the jars in lib | |
| and stanford-ner.jar must be in the CLASSPATH. Here is an example command: | |
| java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ | |
| classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \ | |
| -ner.useSUTime false -textFile sample-w-time.txt | |
| Let's break this down a bit. The flag "-ner.model" should be followed by a | |
| list of CRF's to be combined by the NERClassifierCombiner. Some serialized | |
| CRF's are provided in the classifiers directory. In this example the CRF's | |
| trained on the CONLL 4 class data and the MUC 7 class data are being combined. | |
| When the flag "-ner.useSUTime" is followed by "false", SUTime is shut off. You should | |
| note that when the "false" is omitted, the text "4 days ago" suddenly is | |
| tagged with DATE. These are the kinds of phrases SUTime can identify. | |
| NERClassifierCombiner can be run on different types of input as well. Here is | |
| an example which is run on CONLL style input: | |
| java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ | |
| classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \ | |
| -map word=0,answer=1 -testFile sample-conll-file.txt | |
| It is crucial to include the "-map word=0,answer=1" , which is specifying that | |
| the input test file has the words in the first column and the answer labels | |
| in the second column. | |
| It is also possible to serialize and load an NERClassifierCombiner. | |
| This command loads the three sample crfs with combinationMode=HIGH_RECALL | |
| and SUTime=false, and dumps them to a file named | |
| test_serialized_ncc.ncc.ser.gz | |
| java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -ner.model \ | |
| classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,\ | |
| classifiers/english.all.3class.distsim.crf.ser.gz -ner.useSUTime false \ | |
| -ner.combinationMode HIGH_RECALL -serializeTo test.serialized.ncc.ncc.ser.gz | |
| An example serialized NERClassifierCombiner with these settings is supplied in | |
| the classifiers directory. Here is an example of loading that classifier and | |
| running it on the sample CONLL data: | |
| java -mx2g edu.stanford.nlp.ie.NERClassifierCombiner -loadClassifier \ | |
| classifiers/example.serialized.ncc.ncc.ser.gz -map word=0,answer=1 \ | |
| -testFile sample-conll-file.txt | |
| For a more exhaustive description of NERClassifierCombiner go to | |
| http://nlp.stanford.edu/software/ncc-faq.html | |
| PROGRAMMATIC USE | |
| The NERDemo file illustrates a couple of ways of calling the system | |
| programatically. You should get the same results from | |
| java -cp stanford-ner.jar:lib/*:. -mx300m NERDemo classifiers/english.all.3class.distsim.crf.ser.gz sample.txt | |
| as from using CRFClassifier. For more information on API calls, look in | |
| the enclosed javadoc directory: load index.html in a browser and look | |
| first at the edu.stanford.nlp.ie.crf package and CRFClassifier class. | |
| If you wish to train your own NER systems, look also at the | |
| edu.stanford.nlp.ie package NERFeatureFactory class. | |
| SERVER VERSION | |
| The NER code may also be run as a server listening on a socket: | |
| java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer 1234 | |
| You can specify which model to load with flags, either one on disk: | |
| java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/all.3class.crf.ser.gz 1234 | |
| Or if you have put a model inside the jar file, as a resource under, say, models: | |
| java -mx1000m -cp stanford-ner.jar:lib/* edu.stanford.nlp.ie.NERServer -loadClassifier models/all.3class.crf.ser.gz 1234 | |
| RUNNING CLASSIFIERS FROM INSIDE A JAR FILE | |
| The software can run any serialized classifier from within a jar file by | |
| following the -loadClassifier flag by some resource available within a | |
| jar file on the CLASSPATH. An end user can make | |
| their own jar files with the desired NER models contained inside. | |
| This allows single jar file deployment. | |
| PERFORMANCE GUIDELINES | |
| Performance depends on many factors. Speed and memory use depend on | |
| hardware, operating system, and JVM. Accuracy depends on the data | |
| tested on. Nevertheless, in the belief that something is better than | |
| nothing, here are some statistics from one machine on one test set, in | |
| semi-realistic conditions (where the test data is somewhat varied). | |
| ner-eng-ie.crf-3-all2006-distsim.ser.gz (older version of ner-eng-ie.crf-3-all2008-distsim.ser.gz) | |
| Memory: 320MB (on a 32 bit machine) | |
| PERSON ORGANIZATION LOCATION | |
| 91.88 82.91 88.21 | |
| -------------------- | |
| CHANGES | |
| -------------------- | |
| 2020-11-17 4.2.0 Update for compatibility | |
| 2020-05-10 4.0.0 Update to UDv2.0 tokenization | |
| 2018-10-16 3.9.2 Update for compatibility | |
| 2018-02-27 3.9.1 KBP ner models for Chinese and Spanish | |
| 2017-06-09 3.8.0 Updated for compatibility | |
| 2016-10-31 3.7.0 Improved Chinese NER | |
| 2015-12-09 3.6.0 Updated for compatibility | |
| 2015-04-20 3.5.2 synch standalone and CoreNLP functionality | |
| 2015-01-29 3.5.1 Substantial accuracy improvements | |
| 2014-10-26 3.5.0 Upgrade to Java 1.8 | |
| 2014-08-27 3.4.1 Add Spanish models | |
| 2014-06-16 3.4 Fix serialization bug | |
| 2014-01-04 3.3.1 Bugfix release | |
| 2013-11-12 3.3.0 Update for compatibility | |
| 2013-11-12 3.3.0 Update for compatibility | |
| 2013-06-19 3.2.0 Improve handling of line-by-line input | |
| 2013-04-04 1.2.8 nthreads option | |
| 2012-11-11 1.2.7 Improved English 3 class model by including | |
| data from Wikipedia, release Chinese model | |
| 2012-07-09 1.2.6 Minor bug fixes | |
| 2012-05-22 1.2.5 Fix encoding issue | |
| 2012-04-07 1.2.4 Caseless version of English models supported | |
| 2012-01-06 1.2.3 Minor bug fixes | |
| 2011-09-14 1.2.2 Improved thread safety | |
| 2011-06-19 1.2.1 Models reduced in size but on average improved | |
| in accuracy (improved distsim clusters) | |
| 2011-05-16 1.2 Normal download includes 3, 4, and 7 | |
| class models. Updated for compatibility | |
| with other software releases. | |
| 2009-01-16 1.1.1 Minor bug and usability fixes, changed API | |
| 2008-05-07 1.1 Additional feature flags, various code updates | |
| 2006-09-18 1.0 Initial release | |