
NLP Proof of Concept (PoC) Series Part 2: NER

In the first post of this series, I explained my plan and purpose for writing this and subsequent articles: I want to make it simple to start your NLP project from a Proof of Concept. Each of these small projects will have an API, a Dockerfile, and a streamlit demo, so you’re all set to go. In this article, I’m starting with one of the most common NLP tasks out there: Named Entity Recognition. (You can find the code for this tutorial on my GitHub.)

Task definition

Named Entity Recognition (NER) is one of the most common NLP tasks. It stems from the Entity Extraction task, a labeling task concerned with predicting a class for each token in the input text. This is usually done for the purpose of Information Extraction: the input is a chunk of text and the output is one label per token. An example:

Director General of the International Atomic Energy Agency (IAEA) Rafael Mariano Grossi attends the IAEA Board of Governors meeting at the IAEA headquarters in Vienna, Austria, on March 7.

Running this text through our system would produce:

Output of our system

Components of our system

To build this system we’re going to implement a couple of components:

  1. Named entity extractor
  2. API
  3. Streamlit demo
  4. A Dockerfile to wrap this

Ideally, the system should be laid out so that we can easily plug in different models, modify the output format, and accommodate whatever other changes we might have to implement.

The API and the Dockerfile are really important here because they let us package our system as a microservice that can be part of a larger pipeline or run on a cluster using Kubernetes.

The Named Entity Extractor

When it comes to NLP nowadays, attention is all you need. Transformers have consistently topped the state-of-the-art charts for the past few years, proving they’re here to stay. So let’s create a transformer-based NE extractor.

And when it comes to transformers, Hugging Face’s transformers library is the obvious choice, so let’s build our NE extractor with it.

Let’s first define a very basic abstract class that describes what NE extractors should look like, in case we want to extend our system to accommodate more types of models.

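from abc import ABC, abstractmethod
from typing import List


# Extraction is the dataclass shown next; define it before this class.
class NeExtractor(ABC):
    @abstractmethod
    def extract(self, text: str) -> List[Extraction]:
        """Return every named entity found in `text`."""
        pass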

And our extraction struct is defined as:

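from dataclasses import dataclass


@dataclass
class Extraction:
    start: int  # character offset where the entity starts
    end: int  # character offset just past the entity's last character
    type: str  # entity class without the BIO prefix, e.g. "PER"
    text: str  # the entity's text span
    confidence: float = 0.0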

Now let’s write a new class for the transformer extractor. Let’s call it TransformerExtractor; it should extend NeExtractor.

from transformers import AutoModelForTokenClassification, AutoTokenizer


class TransformerExtractor(NeExtractor):
    def __init__(self, model_name):
        # Load the token-classification model and its matching tokenizer.
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

According to our abstract class NeExtractor we still need to implement an extract function. We want this function to receive a text chunk and produce a list of Extractions where each extraction is the full entity’s text span.

To do this we have to:

  1. Tokenize input text
  2. Run through the model
  3. Group by BIO

BIO is a tagging scheme that denotes which consecutive words belong to the same entity. If a word’s tag has the prefix B-, it is the beginning of an entity. If it has the prefix I-, it is a continuation of the current entity. Otherwise, an O denotes a word outside any named entity.
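To make the scheme concrete, here is a minimal sketch that tags the words of a short sentence by hand and merges them into entities. The sentence, the tags, and the helper name `merge_bio` are assumptions made for illustration; they are not output from the model used in this post, and the helper is simpler than the grouping function shown later.

```python
from typing import List, Optional, Tuple

# Hand-written (word, tag) pairs; these tags are an illustrative assumption.
tagged = [
    ("Rafael", "B-PER"),
    ("Mariano", "I-PER"),
    ("Grossi", "I-PER"),
    ("visited", "O"),
    ("Vienna", "B-LOC"),
]


def merge_bio(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged words into (entity_type, entity_text) tuples."""
    entities: List[Tuple[str, List[str]]] = []
    current: Optional[Tuple[str, List[str]]] = None  # entity being built
    for word, tag in pairs:
        if tag.startswith("B-"):
            # B- always opens a new entity, closing any open one first.
            if current is not None:
                entities.append(current)
            current = (tag[2:], [word])
        elif tag.startswith("I-") and current is not None:
            # I- continues the entity that is currently open.
            current[1].append(word)
        else:
            # O (or an I- with nothing to continue) closes the open entity.
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return [(entity_type, " ".join(words)) for entity_type, words in entities]


print(merge_bio(tagged))
```

The three PER-tagged words collapse into one person entity, and the single B-LOC word becomes its own location entity.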

Let’s tokenize our text

    def extract(self, text: str) -> List[Extraction]:
        # return_offsets_mapping is only supported by fast tokenizers.
        inputs = self.tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
        tokens = inputs.tokens()

I want to keep the character offsets of the tokens as well, but I don’t want to send them to the model, so let’s store them in a variable and remove them from the inputs

        # One (start, end) character span per token.
        offsets = inputs["offset_mapping"].squeeze().numpy()
        del inputs["offset_mapping"]

I also want to run the logits through a softmax function to normalize them into probabilities, which we will later use to compute a confidence per entity

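        # Requires `import torch` at the top of the module.
        # no_grad skips gradient tracking, which inference does not need.
        with torch.no_grad():
            outputs = torch.softmax(self.model(**inputs).logits, dim=2)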
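As a rough illustration of what the softmax does for a single token, here is a plain-Python sketch. The logits and the three labels are made-up values, and the `softmax` helper is a stand-in written for this example rather than the torch function used above, which applies the same normalization along the label dimension for every token at once.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one token over three hypothetical labels.
labels = ["O", "B-PER", "I-PER"]
logits = [0.5, 4.0, 1.0]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# The winning label and its probability, which serves as the token confidence.
print(labels[best], round(probs[best], 3))
```

The largest logit wins, and its probability is what ends up as that token’s confidence.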

Next, let’s define a struct that will hold each token’s information and the predicted class for it

@dataclass
class NeToken:
    start: int
    end: int
    text: str
    bio: str
    confidence: float

Now we loop over the predictions and save those in an array

        bio_predictions = []
        for token, offset, output in zip(tokens, offsets, outputs[0]):
            # Special tokens (e.g. [CLS] and [SEP] in BERT-style models)
            # have an empty character span, so we skip them.
            if offset[0] == offset[1]:
                continue
            prediction = torch.argmax(output)
            bio_predictions.append(
                NeToken(
                    offset[0],
                    offset[1],
                    token,
                    self.model.config.id2label[prediction.item()],
                    output.max().item(),
                )
            )

Remember we still have to group by BIO?

grouped_bio = groupby_bio(bio_predictions)

To do this, we create a finite state machine that accumulates consecutive predictions of the same entity into one list. It returns a list of tuples, where each tuple holds the class without its prefix and all the tokens that belong to that entity

from typing import List, Tuple


def groupby_bio(bio_predictions: List[NeToken]) -> List[Tuple[str, List[NeToken]]]:
    entities = []
    accumulator = []
    previous_tag = "O"
    for entity in bio_predictions:
        # An O after an entity: close the entity.
        if entity.bio.replace("B-", "").replace("I-", "") == "O" and (
            previous_tag.startswith("I-") or previous_tag.startswith("B-")
        ):
            entities.append(
                (previous_tag.replace("B-", "").replace("I-", ""), accumulator)
            )
            accumulator = [entity]
            previous_tag = entity.bio
        # A B- right after an entity: close it and start a new one.
        if entity.bio.startswith("B-") and (
            previous_tag.startswith("I-") or previous_tag.startswith("B-")
        ):
            entities.append(
                (previous_tag.replace("B-", "").replace("I-", ""), accumulator)
            )
            accumulator = [entity]
            previous_tag = entity.bio
        # A B- after an O: start a new entity.
        if entity.bio.startswith("B-") and previous_tag.startswith("O"):
            accumulator = [entity]
            previous_tag = entity.bio
        # An I- after a B- or I-: continue the current entity.
        if entity.bio.startswith("I-") and (
            previous_tag.startswith("B-") or previous_tag.startswith("I-")
        ):
            accumulator.append(entity)
            previous_tag = entity.bio
    # Flush an entity that runs to the end of the input.
    if len(accumulator) > 0 and previous_tag.replace("B-", "").replace("I-", "") != "O":
        entities.append((previous_tag.replace("B-", "").replace("I-", ""), accumulator))
    return entities

Let’s go back to our extract function. We now have to transform the BIO-grouped results into the Extraction data class we implemented earlier. We also reduce the token confidences to a single score per entity by averaging the confidence of all tokens in that entity.

        extractions = []
        for entity_type, entity_tokens in grouped_bio:
            # The span runs from the first token's start to the last token's end.
            start, end = entity_tokens[0].start, entity_tokens[-1].end
            # Average the token confidences to get one score per entity.
            confidence = sum(t.confidence for t in entity_tokens) / len(entity_tokens)
            extractions.append(
                Extraction(start, end, entity_type, text[start:end], confidence)
            )
        return extractions
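Here is a small worked sketch of that last step, showing how character offsets and token confidences turn into one span and one score. The `Token` class is a hypothetical stand-in for NeToken, and the offsets and confidence values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:  # hypothetical stand-in for the NeToken class above
    start: int
    end: int
    confidence: float


text = "John Doe is a Go Developer at Google"

# Made-up predictions for the two tokens that make up "John Doe".
person_tokens: List[Token] = [Token(0, 4, 0.99), Token(5, 8, 0.97)]

# Slicing from the first token's start to the last token's end also
# recovers the space between the tokens.
start, end = person_tokens[0].start, person_tokens[-1].end
span_text = text[start:end]

# The entity confidence is the mean of its token confidences.
confidence = sum(t.confidence for t in person_tokens) / len(person_tokens)

print(span_text, round(confidence, 2))
```

Slicing the original text, rather than joining token strings, avoids having to undo the tokenizer’s subword splitting.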

You can find this class defined in this Python file

The API

Now that we have a working Named Entity extractor, let’s start implementing an API. Personally, I like to use FastAPI because it feels modern and up to date with Python’s latest features, like asynchronous functions, and it’s very easy to use.

We want our API to load the Named Entity extractor once. Then, for each incoming query, it runs the model on the text and returns a list of Extraction objects, which in JSON becomes a list of dictionaries.

Let’s make our API configurable through a config.json file. That way we don’t have to hard-code the model name in the API code or the extractor code, and we can add more configuration later on if we want.

We define the configuration struct as follows:

@dataclass
class Configuration:
    model_name: str

And to populate it I’m going to use dacite

import json

import dacite

with open("config.json") as config_ptr:
    json_config = json.load(config_ptr)

# Build a typed Configuration object from the parsed JSON dictionary.
config = dacite.from_dict(
    data_class=Configuration,
    data=json_config,
)
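For reference, here is a sketch of what a matching config.json could contain and how it maps onto the dataclass. The model name `dslim/bert-base-NER` is an assumption picked for illustration, since this post does not name the model it uses. To stay within the standard library, the sketch builds the object with `Configuration(**...)` instead of dacite, and it writes the file to a temporary directory.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Configuration:
    model_name: str


# Assumed example content for config.json; the model name is a placeholder.
example_config = {"model_name": "dslim/bert-base-NER"}

# Write the file to a temporary directory so the sketch leaves nothing behind
# in the project folder.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(example_config, indent=2))

with open(config_path) as config_ptr:
    # Plain keyword expansion works for a flat config like this one.
    config = Configuration(**json.load(config_ptr))

print(config.model_name)
```

For a flat configuration, keyword expansion is enough; dacite becomes more useful once the configuration has nested dataclasses.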

Now let’s define the input and output models of our API

from typing import List

from pydantic import BaseModel


class Input(BaseModel):
    text: str


class Output(BaseModel):
    named_entities: List[Extraction]

Then, we instantiate our API and our NE extractor

from fastapi import FastAPI

api = FastAPI()
# Created once at import time, so every request reuses the loaded model.
named_entity_extractor = TransformerExtractor(config.model_name)

Awesome, one last task is to implement a route that will take the text as input and produce the result as output

@api.post("/ner", response_model=Output)
def extractions(input_request: Input):
    ne_extractions = named_entity_extractor.extract(input_request.text)
    return {"named_entities": ne_extractions}

Great, now we can run our API with uvicorn by running the command uvicorn api:api

And to query it we can send a post request with curl

curl "http://127.0.0.1:8000/ner" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"John Doe is a Go Developer at Google"}'

The API replied with:

{"named_entities":[{"start":0,"end":8,"type":"PER","text":"John Doe","confidence":0.9973324537277222,"__initialised__":true},{"start":30,"end":36,"type":"ORG","text":"Google","confidence":0.9978129863739014,"__initialised__":true}]}
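The same query can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_ner_request` and `query_ner` are hypothetical, and the URL assumes the API is running locally on uvicorn’s default port, as in the curl example.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_URL = "http://127.0.0.1:8000/ner"  # same address as the curl example


def build_ner_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request for the /ner route."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_ner(text: str, url: str = API_URL) -> List[Dict[str, Any]]:
    """Send the request and return the list stored under "named_entities"."""
    with urllib.request.urlopen(build_ner_request(text, url)) as response:
        return json.loads(response.read())["named_entities"]


# Usage, with the API running locally (uvicorn api:api):
# for entity in query_ner("John Doe is a Go Developer at Google"):
#     print(entity["type"], entity["text"])
```

Keeping the request construction separate from the network call makes the payload easy to inspect without a running server.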

You can find the API code here

Demo

Finally, we want to have an interactive demo to explore our model and share it with our clients. This is usually one of the most important components of any freelance project because that’s how you let your clients explore your system and then maybe ask for modifications.

To implement a demo, I don’t think there’s any easier framework than streamlit. So let’s create an interactive demo with a text area where the user enters their text; the system then shows the extracted entities in a table.

Let’s give our demo a title and header

import streamlit as st

st.title("NER Demo")
st.header("Extract Named Entities from text")

Let’s also load the same configuration as we did with our API

with open("config.json") as config_ptr:
    json_config = json.load(config_ptr)


@dataclass
class Configuration:
    model_name: str


config = dacite.from_dict(
    data_class=Configuration,
    data=json_config,
)

Since streamlit reruns the whole script whenever a change occurs, we want our model to load only once. So let’s write a function that loads the model and decorate it with streamlit’s cache decorator

# Newer Streamlit releases deprecate st.cache in favor of st.cache_resource.
@st.cache(allow_output_mutation=True)
def load_ner_model(model_name: str):
    named_entity_extractor = TransformerExtractor(model_name)
    return named_entity_extractor
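The load-once idea can be illustrated without streamlit. The sketch below uses `functools.lru_cache` from the standard library as an analogy; it is not how streamlit implements its cache, and `load_model` is a hypothetical stand-in for an expensive model load.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=None)
def load_model(model_name: str) -> dict:
    """Hypothetical stand-in for an expensive model load."""
    global load_calls
    load_calls += 1
    return {"name": model_name}


# Simulate three reruns of the script: only the first call does the work,
# later calls return the cached object.
for _ in range(3):
    model = load_model("some-model")

print(load_calls)
```

Every call after the first returns the same cached object, which is the behavior we rely on to keep the demo responsive.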

Then, let’s run the load function and also add a text area to produce our first outputs

import pandas as pd

model = load_ner_model(config.model_name)
text = st.text_area(
    "Text to extract from",
    "Yann Lecun is a very famous scientist who works at Meta AI.",
    max_chars=500,
)
ne_extractions = model.extract(text)

# One row per extracted entity.
predictions = pd.DataFrame(
    data={
        "text": [e.text for e in ne_extractions],
        "type": [e.type for e in ne_extractions],
    }
)

Finally, let’s write our outputs as an interactive table

st.write(predictions)

Let’s run this and see how it looks

Our streamlit demo

It looks quite neat! The output is not bad either, except for splitting Yann’s name, but that’s a model problem.

You can find the demo’s code on Github

Dockerfile

Last but not least, let’s write a very basic Dockerfile that we can use to build a container for our application.

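FROM python:3.8-buster

# Copy the project into the image and install its dependencies.
COPY . /home
WORKDIR /home
RUN pip install -r /home/requirements.txt --ignore-installed

# The exec form needs one array element per argument. Binding to 0.0.0.0
# makes the API reachable from outside the container.
ENTRYPOINT ["uvicorn", "api:api", "--host", "0.0.0.0"]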

This file is on Github as well

Conclusion

Now that we’ve wrapped all our components and glued them together, our NER project is ready to run as a microservice, an online demo, or an HTTP API. The way we wrote this project lets us add more models, whether from Hugging Face or from other projects, as long as they extend the same abstract class. Feel free to open a pull request on GitHub if anything needs to be modified!
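As a sketch of what such an extension could look like, here is a dictionary-based extractor that implements the same interface. The `GazetteerExtractor` class and its phrase list are hypothetical and not part of the project; the Extraction and NeExtractor definitions are repeated only so the example runs on its own.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Extraction:
    start: int
    end: int
    type: str
    text: str
    confidence: float = 0.0


class NeExtractor(ABC):
    @abstractmethod
    def extract(self, text: str) -> List[Extraction]:
        pass


class GazetteerExtractor(NeExtractor):
    """Hypothetical extractor that looks up exact phrases in a fixed dictionary."""

    def __init__(self, gazetteer: Dict[str, str]):
        self.gazetteer = gazetteer  # maps phrase -> entity type

    def extract(self, text: str) -> List[Extraction]:
        extractions = []
        for phrase, entity_type in self.gazetteer.items():
            # \b restricts matches to whole words.
            for match in re.finditer(rf"\b{re.escape(phrase)}\b", text):
                extractions.append(
                    Extraction(
                        match.start(), match.end(), entity_type, match.group(), 1.0
                    )
                )
        # Return entities in the order they appear in the text.
        return sorted(extractions, key=lambda e: e.start)


extractor = GazetteerExtractor({"Google": "ORG", "John Doe": "PER"})
for extraction in extractor.extract("John Doe is a Go Developer at Google"):
    print(extraction)
```

Because it returns the same Extraction objects, an extractor like this could be swapped into the API or the demo without touching the rest of the code.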

Omar: