Tutorial: Wrapping an NLP Application

The following is a tutorial on how to wrap a simple NLP tool as a CLAMS application, using the app template generated by the clams develop command. In particular, this article focuses on writing the _annotate() method of a CLAMS app (app.py). It may not make a lot of sense without first glancing over the recent MMIF specifications and the CLAMS SDK overview.

The NLP tool

We use an ultra-simple tokenizer in tokenizer.py as the example NLP tool. All it does is define a tokenize function that applies a simple regular expression to the text and returns a list of character-offset pairs.

import re

def tokenize(text):
    return [tok.span() for tok in re.finditer(r"\w+", text)]
>>> import tokenizer
>>> tokenizer.tokenize('Fido barks.')
[(0, 4), (5, 10)]

Wrapping the tokenizer

First, it is recommended to run clams develop on the command line and follow the instructions there to generate the skeleton templates needed for developing the app.

By convention, all the wrapping code lives in a script named app.py; this is not a strict requirement and you can give it another name. The app.py script does several things: (1) import the necessary code, (2) create a subclass of ClamsApp that defines the metadata and provides a method to run the wrapped NLP tool, and (3) provide a way to run the code as an HTTP Flask server. The template generates the third part, so only the first and second parts of the code are explained here.

Imports

Aside from a few standard modules we need the following imports:

from clams.app import ClamsApp
from clams.restify import Restifier
from clams.appmetadata import AppMetadata
from mmif.serialize import Mmif
from mmif.vocabulary import DocumentTypes
from lapps.discriminators import Uri
import tokenizer

For non-NLP CLAMS applications we would also do from mmif.vocabulary import AnnotationTypes, but this is not needed for NLP applications because they do not need the CLAMS vocabulary. What we do need to import are the URIs of all LAPPS annotation types and the NLP tool itself.

Importing lapps.discriminators.Uri is for convenience since it gives us easy access to the URIs of annotation types and some of their attributes. The following code prints a list of available variables that point to URIs:

>>> from lapps.discriminators import Uri
>>> attrs = [x for x in dir(Uri) if not x.startswith('__')]
>>> attrs = [a for a in attrs if 'org/ns' not in getattr(Uri, a)]
>>> print(' '.join(attrs))
ANNOTATION CHUNK CONSTITUENT COREF DATE DEPENDENCY DEPENDENCY_STRUCTURE DOCUMENT GENERIC_RELATION LEMMA LOCATION LOOKUP MARKABLE MATCHES NCHUNK NE ORGANIZATION PARAGRAPH PERSON PHRASE_STRUCTURE POS RELATION SEMANTIC_ROLE SENTENCE TOKEN VCHUNK
>>> print(Uri.TOKEN)
http://vocab.lappsgrid.org/Token

The application class

With the imports in place we define a subclass of ClamsApp which needs two methods:

class TokenizerApp(ClamsApp):
    def _appmetadata(self): pass

    def _annotate(self, mmif): pass

Here some background is useful. The CLAMS HTTP API routes GET and POST requests to the appmetadata() and annotate() methods of the app respectively, and both methods are defined in ClamsApp. In essence, they are wrappers around _appmetadata() and _annotate() that provide some common functionality, such as making sure the output is serialized into a string.
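For completeness, here is roughly what the template-generated third part at the bottom of app.py looks like. This is a minimal sketch; the code generated by clams develop may differ, for example by adding command-line options for the port:

if __name__ == "__main__":
    # create the app and wrap it in a Flask server using the SDK's Restifier;
    # run() starts a development server that answers the GET and POST
    # requests described above
    app = TokenizerApp()
    Restifier(app).run()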

App Metadata

The _appmetadata() method should return an AppMetadata object that defines the relevant metadata for the app. (If you are using the app template, define the metadata in metadata.py instead of app.py.)

APP_LICENSE = 'Apache 2.0'
TOKENIZER_LICENSE = 'Apache 2.0'
TOKENIZER_VERSION = tokenizer.__VERSION__

def _appmetadata(self):
    metadata = AppMetadata(
        identifier='tokenizer',
        url='https://github.com/clamsproject/app-nlp-example',
        name="Simplistic Tokenizer",
        description="Apply simple tokenization to all text documents in a MMIF file.",
        app_license=APP_LICENSE,
        analyzer_version=TOKENIZER_VERSION,
        analyzer_license=TOKENIZER_LICENSE,
    )
    metadata.add_input(DocumentTypes.TextDocument)
    metadata.add_output(Uri.TOKEN)
    metadata.add_parameter('error', 'Throw error if set to True', 'boolean')
    metadata.add_parameter('eol', 'Insert sentence boundaries', 'boolean')
    return metadata

Warning: When using the separately generated metadata.py created via clams develop, this method in app.py should be left as a do-nothing stub with a pass statement, as shown below:

def _appmetadata(self):
    # When using metadata.py, leave this do-nothing "pass" method here.
    pass

Instead, implement the appmetadata() function in metadata.py, following the instructions in the template.
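As an illustration, a metadata.py based on the template defines a module-level appmetadata() function that builds the same object as _appmetadata() above. A sketch (the generated template contains detailed instructions and may expect additional fields):

from clams.appmetadata import AppMetadata
from mmif.vocabulary import DocumentTypes
from lapps.discriminators import Uri

def appmetadata() -> AppMetadata:
    # same metadata as in the _appmetadata() example above
    metadata = AppMetadata(
        identifier='tokenizer',
        url='https://github.com/clamsproject/app-nlp-example',
        name="Simplistic Tokenizer",
        description="Apply simple tokenization to all text documents in a MMIF file.",
        app_license='Apache 2.0',
        analyzer_version='x.y.z',  # version of the wrapped tokenizer
        analyzer_license='Apache 2.0',
    )
    metadata.add_input(DocumentTypes.TextDocument)
    metadata.add_output(Uri.TOKEN)
    metadata.add_parameter('error', 'Throw error if set to True', 'boolean')
    metadata.add_parameter('eol', 'Insert sentence boundaries', 'boolean')
    return metadata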

Note: Also refer to the CLAMS App Metadata documentation for more details on which fields need to be specified.

_annotate()

The _annotate() method should accept a MMIF file/string/object as its first parameter and always return a MMIF object with an additional view containing the annotation results. This is where the bulk of your logic goes. For a text processing app, it is mostly concerned with finding text documents, running the tool over their text, creating new views, and inserting the results.

In addition to the input MMIF, this method can accept any number of keyword arguments, which hold the parameters set by the user/caller. Note that when this method is called via the public annotate() method of the ClamsApp class (the usual case when running as a CLAMS app), the keyword arguments are automatically "refined" before being passed on. The refinement includes:

  1. inserting “default” values for parameters that are not set by the user

  2. checking that the values are of the correct type and value, based on the parameter specification in the app metadata
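For example, suppose a caller sends error=true as an HTTP query parameter; the value arrives as the string 'true'. A hypothetical illustration of the refinement (the exact keys depend on the declared parameters and their defaults):

# raw query parameters as received from the HTTP layer
{'error': 'true'}
# what _annotate() receives: values cast to their declared types
# ('boolean' here), plus declared defaults for omitted parameters
{'error': True, 'eol': False}

With that in mind, here is the _annotate() method for the tokenizer app: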

def _annotate(self, mmif, **kwargs):
    # access the (refined) parameters: here just to print them
    # and to willy-nilly throw an error if the caller wants that
    for arg, val in kwargs.items():
        print("Parameter %s=%s" % (arg, val))
        # as we defined this `error` parameter in the app metadata
        if arg == 'error' and val is True:
            raise Exception("Exception - %s" % val)
    # initialize the Mmif object from the string/file if needed
    self.mmif = mmif if isinstance(mmif, Mmif) else Mmif(mmif)
    # process the text documents in the documents list
    for doc in self.mmif.get_documents_by_type(DocumentTypes.TextDocument):
        new_view = self._new_view(doc.id, kwargs)
        # _run_nlp_tool() is the method that does the actual work
        self._run_nlp_tool(doc, new_view, doc.id)
    # return the MMIF object
    return self.mmif

For language processing applications, one task is to retrieve all text documents from both the documents list and the views. Annotations generated by the NLP tool need to be anchored to those text documents: for a text document in the documents list the document identifier is sufficient, but for a text document inside a view we also need the view identifier. A view may contain many text documents, and typically all annotations created for them are put in a single new view.

For each text document in the documents list there is one invocation of _new_view(), which gets handed a document identifier so it can be recorded in the view metadata. For each view containing text documents there is also one invocation of _new_view(), but no document identifier is handed in, so no identifier is put into the view metadata.
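The _annotate() method shown above only walks the documents list. A sketch of the additional loop over views might look as follows, assuming the get_documents_in_view() helper on the Mmif object (adapt to your SDK version):

# additional loop inside _annotate(): process text documents inside views
for view in list(self.mmif.views):
    docs = self.mmif.get_documents_in_view(view.id)
    if docs:
        # no document identifier: one new view for all documents in this view
        new_view = self._new_view(None, kwargs)
        for doc in docs:
            # anchor annotations via a view-prefixed document identifier
            self._run_nlp_tool(doc, new_view, view.id + ':' + doc.id)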

The method _run_nlp_tool() is responsible for running the NLP tool and adding annotations to the new view. The third argument lets us anchor the tool's annotations by handing over the document identifier, possibly prefixed by the identifier of the view the document lives in.

Note that _annotate() as defined above will look much the same for most NLP applications; all the application-specific details are in the code that creates new views and the code that adds annotations.

Creating a new view:
def _new_view(self, docid, runtime_config):
    view = self.mmif.new_view()
    view.metadata.app = self.metadata.identifier
    # first thing you need to do after creating a new view is "sign" the view
    # the sign_view() method will record the app's identifier and the timestamp
    # as well as the user parameter inputs. This is important for reproducibility.
    self.sign_view(view, runtime_config)
    # then record what annotations you want to create in this view
    view.new_contain(Uri.TOKEN, document=docid)
    return view

This is the simplest NLP view possible since there is only one annotation type and it has no metadata properties beyond the document property. Other applications may have more annotation types, which results in repeated invocations of new_contain(), and may define other metadata properties for those types.
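For instance, a hypothetical app that also produced sentences would declare both types (Uri.SENTENCE is one of the LAPPS URIs listed earlier; the tokenizer app itself does not produce it):

view.new_contain(Uri.TOKEN, document=docid)
view.new_contain(Uri.SENTENCE, document=docid)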

Adding annotations:
def _run_nlp_tool(self, doc, new_view, full_doc_id):
    """Run the NLP tool over the document and add annotations to the view, using the
    full document identifier (which may include a view identifier) for the document
    property."""
    text = doc.text_value
    tokens = tokenizer.tokenize(text)
    for p1, p2 in tokens:
        a = new_view.new_annotation(Uri.TOKEN)
        # no need to do this for documents in the documents list
        if ':' in full_doc_id:
            a.add_property('document', full_doc_id)
        a.add_property('start', p1)
        a.add_property('end', p2)
        a.add_property('text', text[p1:p2])

First, with text_value we get the text from the text document, either from its location property or from its text property. Second, we apply the tokenizer to the text. And third, we loop over the token offsets in the tokenizer result and create annotations of type Uri.TOKEN with an identifier that is automatically generated by the SDK. All that is needed for adding an annotation is the new_annotation() method on the view object and the add_property() method on the annotation object.
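Serialized into the output MMIF, the first token of 'Fido barks.' would look roughly like this (an illustration; the exact identifier format is up to the SDK):

{
  "@type": "http://vocab.lappsgrid.org/Token",
  "properties": {
    "id": "t1",
    "start": 0,
    "end": 4,
    "text": "Fido"
  }
}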

Containerization with Docker

Apps within CLAMS typically run as Flask servers in Docker containers, and after an app is tested as a local Flask application it should be containerized. In fact, in some cases we don’t even bother running a local Flask server and move straight to the container setup.

Three configuration files for building a container image should be automatically generated through the clams develop command:

Containerfile: Describes how to create a container image for this application.

.dockerignore: Specifies which files are not needed for running this application.

requirements.txt: Lists all Python modules that need to be installed.

Here is the minimal Containerfile included with this example:

# make sure to use a specific version number here
FROM ghcr.io/clamsproject/clams-python:x.y.z
WORKDIR ./app
COPY ./ ./
CMD ["python3", "app.py"]

This starts from the base CLAMS image which is created from an official Python image (Debian-based) with the clams-python package and the code it depends on added. The Containerfile only needs to be edited if additional installations are required to run the NLP tool. In that case the Containerfile will have a few more lines:

FROM ghcr.io/clamsproject/clams-python:x.y.z
RUN apt-get update && apt-get install -y <system-packages>
WORKDIR ./app
COPY ./requirements.txt .
RUN pip3 install -r requirements.txt
COPY ./ ./
CMD ["python3", "app.py"]

With this Containerfile you typically only need to edit the requirements file for additional Python installs.
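For this example the requirements file can stay minimal, since the base image already ships with clams-python. A sketch (pin the versions your app actually relies on):

# requirements.txt
# clams-python comes with the base image; pinning it here also lets the
# app run outside the container
clams-python==x.y.z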

This repository also includes a .dockerignore file. Editing it is optional, but for large repositories with lots of documentation and images you may want to add some file paths just to keep the image as small as possible.

Use the following command to build the image:

$ docker build -t clams-nlp-example:0.0.1 -f Containerfile .

The -t option lets you pick a name and a tag for the image. You can use another name if you like, and you do not have to add a tag (you could just use -t clams-nlp-example), but it is usually a good idea to use the version as the tag. The -f option lets you specify the build file. If you do not specify one, docker looks for a file called Dockerfile in the current directory (note that this tutorial uses Containerfile as the name, not Dockerfile).

To test the app inside the container, open a bash shell in it:

$ docker run --rm -it clams-nlp-example:0.0.1 bash

You are now running a bash shell in the container, where you can run

root@c85a08b22f18:/app# python test.py input/example-1.mmif out.json

Escape out of the container with Ctrl-d.

To test the Flask app in the container from your local machine do

$ docker run --name clams-nlp-example --rm -d -p 5000:5000 clams-nlp-example:0.0.1

The --name option gives a name to the container, which we use later to stop it (if we do not name the container, docker generates a name, and we have to query docker ps to see what containers are running and then use that name to stop it). Now you can use curl to send requests (omitting the -H headers for brevity; the requests work without them):

$ curl http://localhost:5000/
$ curl -X POST -d@input/example-1.mmif http://localhost:5000/

Using the document.location property

Typically, a TextDocument in a MMIF file uses the location property to point to a text file. This will not work with the setup laid out above, because the location depends on a local path on your machine and the container has no access to that path. You need to make sure that the container can see the data on your local machine, and the -v option does that:

$ docker run --name clams-nlp-example --rm -d -p 5000:5000 -v $PWD/input/data:/data clams-nlp-example:0.0.1

We have now specified that the /data directory in the container is a mount of the ./input/data directory on the “host” machine. Given that ./input/data contains an example.txt text file, you now need to make sure that the input MMIF file uses the path as seen from the container:

{
  "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
  "properties": {
    "id": "m1",
    "mime": "text/plain",
    "location": "/data/text/example.txt"
  }
}

To generate a MMIF file like this, you can use the clams source command from your shell:

$ clams source --prefix /data text:example.txt
{
  "metadata": {
    "mmif": "http://mmif.clams.ai/1.0.0"
  },
  "documents": [
    {
      "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
      "properties": {
        "mime": "text",
        "id": "d1",
        "location": "file:///data/input/example.txt"
      }
    }
  ],
  "views": []
}

And now you can use curl again:

$ curl -X POST -d@input/example-3.mmif http://0.0.0.0:5000/