Tutorial: Wrapping an NLP Application

The following is a tutorial on how to wrap a simple NLP tool as a CLAMS application, using the app template generated by the clams develop command. In particular, this article focuses on writing the _annotate() method of a CLAMS app (app.py). It may not make much sense without first glancing over the recent MMIF specification and the CLAMS SDK overview.

The NLP tool

We use an ultra-simple tokenizer, tokenizer.py, as the example NLP tool. All it does is define a tokenize function that uses a simple regular expression and returns a list of offset pairs.

$ cat tokenizer.py


import re

__VERSION__ = 'v1'

def tokenize(text):
    return [tok.span() for tok in re.finditer(r"\w+", text)]
$ python
>>> import tokenizer
>>> tokenizer.tokenize('Fido barks.')
[(0, 4), (5, 10)]

Wrapping the tokenizer

First, it is recommended to run clams develop on the command line and follow the instructions there to generate the necessary skeleton templates for developing the app. In the rest of this tutorial, we will use TokenizerWrapper as the class name (to generate starter code with that class name, use the -n tokenizer-wrapper flag), as shown below.
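
For example (the exact prompts and any additional options, such as the target directory, are up to the clams develop tool; only the -n flag is taken from this tutorial):

$ clams develop -n tokenizer-wrapper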

By convention, all the “app” code is in a script named app.py (this is not a strict requirement and you can give it another name). The app.py script in general does several things:

  1. importing the necessary code (preamble)

  2. creating a subclass of ClamsApp that defines the metadata (_appmetadata()) and provides a method to run the wrapped tool (_annotate())

  3. providing a way to run the code as an HTTP Flask server (__main__ block).

The clams develop cookie cutter generates the third part, so only the first and second parts of the code are explained here.

Imports

Aside from a few standard modules we need the following imports:

from clams import ClamsApp, Restifier
from mmif import Mmif, View, Annotation, Document, AnnotationTypes, DocumentTypes

# For an NLP tool we need to import the LAPPS vocabulary items
from lapps.discriminators import Uri
# --- came from the starter code

import tokenizer  # THIS MEANS you put the `tokenizer.py` in the same directory with `app.py`

The starter code also imports AnnotationTypes from the CLAMS vocabulary. Non-NLP CLAMS applications need it, but this tokenizer does not, because it creates no CLAMS annotation types (it does use DocumentTypes to declare its input). What we do need to import are the URIs of all LAPPS annotation types and the NLP tool itself.

Note MMIF uses the LAPPS vocabulary for linguistic annotation types.

Importing lapps.discriminators.Uri is for convenience since it gives us easy access to the URIs of annotation types and some of their attributes. The following code prints a list of available variables that point to URIs:

>>> from lapps.discriminators import Uri
>>> attrs = [x for x in dir(Uri) if not x.startswith('__')]
>>> attrs = [a for a in attrs if 'org/ns' not in getattr(Uri, a)]
>>> print(' '.join(attrs))
ANNOTATION CHUNK CONSTITUENT COREF DATE DEPENDENCY DEPENDENCY_STRUCTURE DOCUMENT GENERIC_RELATION LEMMA LOCATION LOOKUP MARKABLE MATCHES NCHUNK NE ORGANIZATION PARAGRAPH PERSON PHRASE_STRUCTURE POS RELATION SEMANTIC_ROLE SENTENCE TOKEN VCHUNK
>>> print(Uri.TOKEN)
http://vocab.lappsgrid.org/Token

The application class

With the imports in place we define a subclass of ClamsApp which needs two methods:

class TokenizerWrapper(ClamsApp):
    def _appmetadata(self): pass

    def _annotate(self, mmif): pass

Here it is useful to introduce some background. The CLAMS HTTP API connects GET and POST requests to the appmetadata() and annotate() methods on the app, respectively; both methods are defined in ClamsApp. In essence, they are wrappers around _appmetadata() and _annotate() that provide some common functionality, like making sure the output is serialized into a string.
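
For context, here is a minimal sketch of the __main__ block that clams develop generates (the real template includes more, such as argument parsing for the port and a production-server mode; this is not the template's exact code):

if __name__ == "__main__":
    # Restifier wraps a ClamsApp as a Flask HTTP server, routing
    # GET requests to appmetadata() and POST requests to annotate()
    app = TokenizerWrapper()
    http_app = Restifier(app)
    http_app.run()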

App Metadata

The _appmetadata() method should return an AppMetadata object that defines the relevant metadata for the app:

(If you are using the app template, use metadata.py instead of app.py to define the metadata.)

from clams.appmetadata import AppMetadata  # needed for the code below; the starter code imports this for you

APP_LICENSE = 'Apache 2.0'
TOKENIZER_LICENSE = 'Apache 2.0'
TOKENIZER_VERSION = tokenizer.__VERSION__

def _appmetadata(self):
    metadata = AppMetadata(
        identifier='tokenizer-wrapper',
        name="Tokenizer Wrapper",
        url='https://github.com/clamsproject/app-nlp-example',
        description="Apply simple tokenization to all text documents in a MMIF file.",
        app_license=APP_LICENSE,
        analyzer_version=TOKENIZER_VERSION,
        analyzer_license=TOKENIZER_LICENSE,
    )
    metadata.add_input(DocumentTypes.TextDocument)
    metadata.add_output(Uri.TOKEN)
    metadata.add_parameter('error', 'Throw error if set to True', 'boolean')
    metadata.add_parameter('eol', 'Insert sentence boundaries', 'boolean')
    return metadata

Warning When using the metadata.py generated via clams develop, this method within app.py should be left as a do-nothing method with a pass statement, as shown below:

def _appmetadata(self):
    # When using ``metadata.py``, leave this do-nothing "pass" method here.
    pass

Instead, implement the appmetadata() function within metadata.py, following the instructions in the template. A sketch is shown below.
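
Here is a minimal sketch of metadata.py, assuming the structure of the clams develop template (a module-level appmetadata() function that builds and returns the AppMetadata object; consult the generated template for the authoritative layout):

from clams.appmetadata import AppMetadata
from mmif import DocumentTypes
from lapps.discriminators import Uri

def appmetadata() -> AppMetadata:
    # build the same AppMetadata object as shown above, just outside the app class
    metadata = AppMetadata(
        identifier='tokenizer-wrapper',
        name="Tokenizer Wrapper",
        url='https://github.com/clamsproject/app-nlp-example',
        description="Apply simple tokenization to all text documents in a MMIF file.",
        app_license='Apache 2.0',
    )
    metadata.add_input(DocumentTypes.TextDocument)
    metadata.add_output(Uri.TOKEN)
    return metadata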

As we see in the code above, the AppMetadata object is created with the following fields: identifier, name, url, description, app_license, analyzer_version, and analyzer_license. If you used clams develop to generate the app template, you’ll also notice that some of these fields are already filled in for you based on the -n argument you provided.

More interesting are the add_input(), add_output(), and add_parameter() parts. The add_input() method is used to specify the input annotation type(s) that the app expects. Here, we specify that the app just expects text documents. The add_output() method is used to specify the output annotation type(s) that the app produces. So our first CLAMS app will take text documents as input and produce token annotations as output. Note that the I/O types must be specified using the URIs. We are using URIs defined in CLAMS and LAPPS vocabularies, but as long as the type is defined somewhere, any URI can be used.

Finally, the add_parameter() method is used to specify the parameters that the app accepts. For the usage of these parameters, you'll find the Runtime Configuration page helpful. As a developer, you use add_parameter() to declare the parameters the app accepts, so that users can set them when running the app. Here, we are defining two parameters, error and eol, and in addition to the name of each parameter, we also specify its description and type. With the description and the type, the parameters should be pretty self-explanatory to the user. One thing to note in this code snippet is that neither parameter has a default value. This means that if the user doesn't specify a value for these parameters at runtime, the app will not run and will throw an error. If you want to make a parameter "optional" by providing a default value, you can do so by adding a default argument to the add_parameter() method, as in the sketch below.
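
For example, a hedged sketch of declaring eol as an optional parameter (the default value of False is an assumption for illustration):

metadata.add_parameter('eol', 'Insert sentence boundaries', 'boolean', default=False)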

Note Also refer to CLAMS App Metadata for more details regarding what fields need to be specified.

_annotate()

The _annotate() method should accept a MMIF file/string/object as its first parameter and always return a Mmif object with an additional view containing the annotation results. This is where the bulk of your logic goes. For a text processing app, it is mostly concerned with finding text documents, calling the code that runs over the text, creating new views, and inserting the results.

In addition to the input MMIF, this method can accept any number of keyword arguments, which are the parameters set by the user/caller. Note that when this method is called inside the annotate() public method in the ClamsApp class (which is the usual case when running as a CLAMS app), the keyword arguments are automatically “refined” before being passed here. The refinement includes

  1. inserting “default” values for parameters that are not set by the user

  2. checking that the values are of the correct type and value, based on the parameter specification in the app metadata

def _annotate(self, mmif, **parameters):
    # then, access the parameters: here to just print
    # them and to willy-nilly throw an error if the caller wants that
    for arg, val in parameters.items():
        print(f"Parameter {arg}={val}")
        # as we defined this `error` parameter in the app metadata
        if arg == 'error' and val is True:
            raise Exception(f"Exception - {parameters['error']}")
    # Initialize the MMIF object from the string if needed
    self.mmif = mmif if isinstance(mmif, Mmif) else Mmif(mmif)
    # process the text documents in the documents list
    for doc in self.mmif.get_documents_by_type(DocumentTypes.TextDocument):
        # prepare a new _View_ object to store output annotations
        new_view = self._new_view(doc.long_id, parameters)  # continue reading to see what `long_id` does
        # _run_nlp_tool() is the method that does the actual work
        self._run_nlp_tool(doc, new_view)
    # return the MMIF object
    return self.mmif

For language processing applications, one task is to retrieve all text documents from both the documents list and the views. Annotations generated by the NLP tool need to be anchored to the text documents: for documents in the documents list the text document identifier suffices, but for text documents in views we also need the view identifier. A view may have many text documents, and typically all annotations created will be put in one view.

For each text document in the documents list, there is one invocation of _new_view(), which gets handed the document identifier so it can be recorded in the view metadata. For each view with text documents there is likewise one invocation of _new_view(); there, the identifier handed over is the document's long_id, which carries the view identifier as a prefix. A sketch of picking up such view-internal documents follows.
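
As a hedged sketch of how text documents inside views might also be picked up (it assumes View.get_annotations() can filter by annotation type, and reuses the helpers defined below):

# gather text documents living inside views, in addition to the
# top-level documents list handled in _annotate() above
for view in self.mmif.views:
    for doc in view.get_annotations(DocumentTypes.TextDocument):
        # long_id is prefixed with the view's ID, e.g. 'v1:td1'
        new_view = self._new_view(doc.long_id, parameters)
        self._run_nlp_tool(doc, new_view)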

The method _run_nlp_tool() is responsible for running the NLP tool and adding annotations to the new view. Annotations created by the tool are anchored to the document by handing over the document identifier, possibly prefixed by the view the document lives in.

One thing to note about _annotate() as defined above is that it will most likely look the same for every NLP application; all the application-specific details are in the code that creates new views and the code that adds annotations.

Creating a new view:
def _new_view(self, docid, runtime_config):
    view = self.mmif.new_view()
    view.metadata.app = self.metadata.identifier
    # first thing you need to do after creating a new view is "sign" the view
    # the sign_view() method will record the app's identifier and the timestamp
    # as well as the user parameter inputs. This is important for reproducibility.
    self.sign_view(view, runtime_config)
    # then record what annotations you want to create in this view
    view.new_contain(Uri.TOKEN, document=docid)
    return view

This is the simplest NLP view possible since there is only one annotation type, and it has no metadata properties beyond the document property. Other applications may have more annotation types, which results in repeated invocations of new_contain(), and may define other metadata properties for those types.
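
For instance, a hedged sketch of what a richer app's _new_view() might add (Uri.SENTENCE and Uri.POS are real LAPPS types, but the posTagSet URL here is a made-up illustration):

view.new_contain(Uri.SENTENCE, document=docid)
# contains metadata may carry extra properties describing the annotations
view.new_contain(Uri.POS, document=docid, posTagSet='http://example.org/penn-treebank-tagset')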

Adding annotations:
def _run_nlp_tool(self, doc, new_view):
    """
    Run the NLP tool over the document and add annotations to the view, using the
    full document identifier (which may include a view identifier) for the document
    property.
    """
    text = doc.text_value
    tokens = tokenizer.tokenize(text)
    for p1, p2 in tokens:
        a = new_view.new_annotation(Uri.TOKEN)
        # `long_id` will give you the annotation object's ID, prefixed by its parents view's ID (if it has one)
        # so that when the targeting document is in a different view, we can still have back references
        a.add_property('document', doc.long_id)
        a.add_property('start', p1)
        a.add_property('end', p2)
        a.add_property('word', text[p1:p2])
        # see what properties are required / available in the LAPPS vocabulary https://vocab.lappsgrid.org/Token

First, with text_value we get the text from the text document, either from its location property or from its text property. Second, we apply the tokenizer to the text. And third, we loop over the token offsets in the tokenizer result and create annotations of type Uri.TOKEN, with identifiers that are automatically generated by the SDK. All that is needed for adding an annotation is the new_annotation() method on the view object and the add_property() method on the annotation object. The snippet below shows what a resulting annotation looks like.
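
To make this concrete, one token annotation in the new view would serialize to something like the following (the exact id value is generated by the SDK, so "tk_1" here is an assumption):

{
  "@type": "http://vocab.lappsgrid.org/Token",
  "properties": {
    "id": "tk_1",
    "document": "m2",
    "start": 0,
    "end": 5,
    "word": "Hello"
  }
}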

Containerization with Docker

Apps within CLAMS typically run as Flask servers in Docker containers, and after an app is tested as a local Flask application, it should be containerized. In fact, in some cases we don't even bother running a local Flask server and move straight to the container setup.

Three configuration files for building a container image should be automatically generated through the clams develop command:

file               description

Containerfile      Describes how to create a container image for this application.
.dockerignore      Specifies which files are not needed for running this application.
requirements.txt   File with all Python modules that need to be installed.

Here is the minimal Containerfile included with this example:

# make sure to use a specific version number here
FROM ghcr.io/clamsproject/clams-python:x.y.z
COPY ./ /app
WORKDIR ./app

CMD ["python3", "app.py", "--production"]

This starts from the base CLAMS image which is created from an official Python image (Debian-based) with the clams-python package and the code it depends on added. The Containerfile only needs to be edited if additional installations are required to run the NLP tool. In that case the Containerfile will have a few more lines:

FROM ghcr.io/clamsproject/clams-python:x.y.z
RUN apt-get update && apt-get install -y <system-packages>
WORKDIR ./app
COPY ./requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt  # no-cache-dir will save some space and reduce final image size
COPY ./ ./
CMD ["python3", "app.py"]

Note The Containerfile in the app starter template has more pre-configured lines.

With this Containerfile you typically only need to make changes to the requirements file for additional Python installs, for example:
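
(For this tokenizer nothing needs to be added, since it only uses the standard library; the entries below are purely hypothetical examples of what a wrapped tool might require.)

# hypothetical third-party dependencies of the wrapped NLP tool
some-nlp-library==1.2.3
another-dependency>=2.0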

This repository also includes a .dockerignore file. Editing it is optional, but for large repositories with lots of documentation and images you may want to add some file paths just to keep the image as small as possible, for example:
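
(The paths below are hypothetical; list whatever your repository contains that the container does not need.)

docs/
images/
*.png
.git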

Use the following command to build the image:

$ docker build -t clams-nlp-example:0.0.1 -f Containerfile .

The -t option lets you pick a name and a tag for the image. You can use another name if you like. You do not have to add a tag and could just use -t clams-nlp-example, but it is usually a good idea to use the version as the tag. The -f option lets you specify a different Containerfile. If you do not specify a file, docker will look for a file called Dockerfile in the current directory (note that in this tutorial we are using Containerfile as the name, not Dockerfile).

Note For full details on the docker build command see the docker-build documentation.

To test the Flask app in the container from your local machine, run:

$ docker run --name clams-nlp-example --rm -d -p 5000:5000 clams-nlp-example:0.0.1

There is a lot going on in this command, so let's break it down:

  --name clams-nlp-example: the name of the container. You can use any name you like, but it has to be unique. You will need this name to stop the container later if you run it with -d (see below).

  --rm: tells Docker to remove the container when it is stopped.

  -d: tells Docker to run the container in the background. If you leave this option out, you will see the output of the container in your terminal.

  -p 5000:5000: tells Docker to map port 5000 in the container to port 5000 on your local machine. This is the port that the Flask server is running on in the container.

  clams-nlp-example:0.0.1: the name and tag of the image that you want to run in a container.
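
When you are done, you can stop the container by its name (because of --rm, stopping also removes it):

$ docker stop clams-nlp-example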

Note For full details on the docker run command see the docker-run documentation.

Now you can call the server with the curl command from your terminal, but first you need an input MMIF file.

{
  "metadata": {
    "mmif": "http://mmif.clams.ai/0.4.0"
  },
  "documents": [
    {
      "@type": "http://mmif.clams.ai/0.4.0/vocabulary/TextDocument",
      "properties": {
        "id": "m2",
    "text": {
      "@value": "Hello, this is Jim Lehrer with the NewsHour on PBS. In the nineteen eighties, barking dogs have increasingly become a problem in urban areas."
    }
      }
    }
  ],
  "views": []
}

Save this as example.mmif and run the following curl commands; the first (a GET request) returns the app metadata, and the second (a POST request) sends the MMIF document to the app for processing:

$ curl http://localhost:5000/
$ curl -X POST -d@example.mmif http://localhost:5000/

Using the document.location property

Typically, a TextDocument in a MMIF file uses the location property to point to a text file. This will not work with the setup laid out above, because location depends on a local path on your machine and the container has no access to that path. What you need to do is make sure that the container can see the data on your local machine, and you can use the -v option for that:

$ docker run --name clams-nlp-example --rm -d -p 5000:5000 -v $PWD/input/data:/data clams-nlp-example:0.0.1

We have now specified that the /data directory in the container is a mount of the ./input/data directory on the "host" machine. Given that ./input/data contains an example.txt text file, you now need to make sure that the input MMIF file uses the path as seen from inside the container:

{
  "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
  "properties": {
    "id": "m1",
    "mime": "text/plain",
    "location": "file:///data/text/example.txt"
  }
}

To generate a MMIF file like this, you can use the clams source command from your shell.

$ clams source --prefix /data text:example.txt
{
  "metadata": {
    "mmif": "http://mmif.clams.ai/1.0.0"
  },
  "documents": [
    {
      "@type": "http://mmif.clams.ai/vocabulary/TextDocument/v1",
      "properties": {
        "mime": "text",
        "id": "d1",
        "location": "file:///data/input/example.txt"
      }
    }
  ],
  "views": []
}

And now, after saving the generated MMIF (here as input/example-3.mmif), you can use curl again:

$ curl -X POST -d@input/example-3.mmif http://localhost:5000/