In the first post of this series, I explained my plan and purpose for writing this and subsequent articles: I want to make it simple to start your NLP project from a proof of concept. Each of these small projects will have an API, a Dockerfile, and a Streamlit demo, so you’re all set to go. In this article, I’m starting off with one of the most common NLP tasks out there: Named Entity Recognition. (You can find the code of this tutorial on my GitHub.)
Task definition
Named Entity Recognition (NER) stems from the Entity Extraction task, a labeling task concerned with predicting a class for each token in the input text. It is usually done for Information Extraction, where your input is a chunk of text and your output is a label per token. An example:
Director General of the International Atomic Energy Agency (IAEA) Rafael Mariano Grossi attends the IAEA Board of Governors meeting at the IAEA headquarters in Vienna, Austria, on March 7.
Running this sentence through our system would produce a list of named entities, each with its type and its character span in the text.
Components of our system
To build this system, we’re going to implement a few components:
- Named entity extractor
- API
- Streamlit demo
- A Dockerfile to wrap this
Ideally, the system should be laid out so that we can easily extend it with different models, modify the output format, and stay open to whatever changes we might have to implement later.
The API and the Dockerfile are really important here: they let us package the system as a microservice that can be part of a larger pipeline or run on a cluster using Kubernetes.
The Named Entity Extractor
When it comes to NLP nowadays, attention is all you need. Transformers have consistently topped the state-of-the-art charts for the past few years, proving they’re here to stay. So let’s create a transformer-based NE extractor.
And when we mention transformers, it’s hard to beat Hugging Face’s transformers library, so that’s what we’ll build our NE extractor with.
Let’s first define a very basic abstract class to define what NE extractors should look like in case we wanted to extend our system to accommodate more types of models.
from abc import ABC, abstractmethod
from typing import List


class NeExtractor(ABC):
    @abstractmethod
    def extract(self, text: str) -> List[Extraction]:
        """Return the named entities found in `text`."""
And our extraction struct is defined as:
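@dataclass
class Extraction:
    start: int  # character offset where the entity starts
    end: int  # character offset where the entity ends (exclusive)
    type: str  # entity class, e.g. "PER" or "ORG"
    text: str  # the entity's text as it appears in the input
    confidence: float = 0.0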
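Any class that returns a list of Extraction objects fits this interface, not only model-based ones. As a rough sketch of what extending the system could look like, here is a hypothetical rule-based extractor. RegexExtractor, its pattern, and the YEAR type are my own illustration and are not part of the project; the two classes above are repeated (with a standard-library dataclass) so the snippet runs on its own.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Extraction:
    start: int
    end: int
    type: str
    text: str
    confidence: float = 0.0


class NeExtractor(ABC):
    @abstractmethod
    def extract(self, text: str) -> List[Extraction]:
        """Return the named entities found in `text`."""


class RegexExtractor(NeExtractor):
    """Hypothetical extractor: every match of one pattern becomes one entity."""

    def __init__(self, pattern: str, entity_type: str):
        self.pattern = re.compile(pattern)
        self.entity_type = entity_type

    def extract(self, text: str) -> List[Extraction]:
        # A regex match is all-or-nothing, so we report full confidence.
        return [
            Extraction(m.start(), m.end(), self.entity_type, m.group(), 1.0)
            for m in self.pattern.finditer(text)
        ]


text = "The meeting took place in 2022, not 2021."
extractor = RegexExtractor(r"\b\d{4}\b", "YEAR")
results = extractor.extract(text)
for extraction in results:
    print(extraction)
```

Because the rest of the system only depends on extract, an extractor like this could be swapped in without touching the API or the demo.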
Now let’s write a new class for the transformer extractor. Let’s call it TransformerExtractor, and have it extend NeExtractor:
from transformers import AutoModelForTokenClassification, AutoTokenizer


class TransformerExtractor(NeExtractor):
    def __init__(self, model_name):
        # Load the token classification model and its matching tokenizer.
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
According to our abstract class NeExtractor, we still need to implement an extract function. We want this function to receive a text chunk and produce a list of Extractions, where each extraction covers the full text span of one entity.
To do this we have to:
- Tokenize input text
- Run through the model
- Group by BIO
BIO is a tagging scheme that marks which consecutive tokens belong to the same entity. A tag with the prefix B- marks the beginning of an entity, the prefix I- marks its continuation (inside), and a plain O (outside) marks a token that isn’t part of any named entity.
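To make the scheme concrete, here is a small sketch on a hand-labeled sentence. The tokens and tags are written by hand for illustration (they are not model output), and the loop is a simplified version of the grouping we implement later in this article.

```python
# Hand-written tokens and BIO tags, for illustration only.
tokens = ["John", "Doe", "works", "at", "Google", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]


def split_tag(tag):
    """Split a BIO tag into (prefix, entity_type); "O" has no type."""
    if tag == "O":
        return "O", None
    prefix, entity_type = tag.split("-", 1)
    return prefix, entity_type


entities = []
current = None  # the entity being built, or None when outside an entity
for token, tag in zip(tokens, tags):
    prefix, entity_type = split_tag(tag)
    if prefix == "B":
        # B- always opens a new entity.
        current = (entity_type, [token])
        entities.append(current)
    elif prefix == "I" and current is not None:
        # I- extends the entity that is currently open.
        current[1].append(token)
    else:
        # O (or a stray I-) closes whatever was open.
        current = None

spans = [(entity_type, " ".join(words)) for entity_type, words in entities]
print(spans)
```

The B- prefix is what lets two entities of the same type sit next to each other without being merged into one.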
Let’s tokenize our text
    def extract(self, text: str) -> List[Extraction]:
        # return_offsets_mapping adds each token's character span in `text`.
        inputs = self.tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
        tokens = inputs.tokens()
I want to extract character offsets of tokens as well but I don’t want to send that to the model, so let’s store that in a variable
        # Keep the offsets for later, but don't pass them to the model.
        offsets = inputs["offset_mapping"].squeeze().numpy()
        del inputs["offset_mapping"]
I also want to run the logits through a softmax function to turn them into per-class probabilities, which we’ll use to compute a confidence per entity. Since this is inference only, we can skip gradient tracking
        with torch.no_grad():
            outputs = torch.softmax(self.model(**inputs).logits, dim=2)
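If softmax is unfamiliar, this is roughly what it does to the logits of a single token. The sketch below uses only the standard library so it runs without torch; the label list and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["O", "B-PER", "I-PER"]  # hypothetical label set
logits = [0.5, 4.0, 1.0]  # made-up scores for one token

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], probs[best])
```

The highest probability is what ends up as the token’s confidence in our extractor.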
Next, let’s define a struct that will hold each token’s information and the predicted class for it
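@dataclass
class NeToken:
    start: int  # character offset where the token starts
    end: int  # character offset where the token ends (exclusive)
    text: str
    bio: str  # predicted BIO tag, e.g. "B-PER"
    confidence: float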
Now we loop over the predictions and save those in an array
        bio_predictions = []
        for token, offset, output in zip(tokens, offsets, outputs[0]):
            # Special tokens such as [CLS] and [SEP] have an empty span; skip them.
            if offset[0] == offset[1]:
                continue
            prediction = torch.argmax(output)
            bio_predictions.append(
                NeToken(
                    int(offset[0]),
                    int(offset[1]),
                    token,
                    self.model.config.id2label[prediction.item()],
                    output.max().item(),
                )
            )
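The offset[0] == offset[1] check is what drops special tokens: fast tokenizers give tokens like [CLS] and [SEP] an empty (0, 0) span because they don’t come from the input text. Here is a small sketch with hand-written tokens and offsets (assumed to look like BERT-style tokenizer output, not produced by a real tokenizer):

```python
text = "John Doe"

# Hand-written stand-ins for inputs.tokens() and the offset mapping.
tokens = ["[CLS]", "John", "Doe", "[SEP]"]
offsets = [(0, 0), (0, 4), (5, 8), (0, 0)]

kept = []
for token, (start, end) in zip(tokens, offsets):
    if start == end:
        # Empty span: a special token with no text behind it.
        continue
    kept.append((token, text[start:end]))

print(kept)
```

The same offsets are what later let us slice the entity’s text straight out of the original input.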
Remember that we still have to group by BIO:
        grouped_bio = groupby_bio(bio_predictions)
To do this, we create a small finite state machine that accumulates consecutive tokens of the same entity into one list. It returns a list of tuples, where each tuple holds the class without its prefix and all the tokens that belong to that entity.
from typing import List, Tuple


def groupby_bio(bio_predictions: List[NeToken]) -> List[Tuple[str, List[NeToken]]]:
    def entity_type(tag: str) -> str:
        # "B-PER" and "I-PER" both map to "PER"; "O" stays "O".
        return tag.replace("B-", "").replace("I-", "")

    entities = []
    accumulator = []
    previous_tag = "O"
    for token in bio_predictions:
        in_entity = previous_tag.startswith(("B-", "I-"))
        if token.bio == "O" and in_entity:
            # Leaving an entity: emit what we accumulated.
            entities.append((entity_type(previous_tag), accumulator))
            accumulator = []
            previous_tag = token.bio
        elif token.bio.startswith("B-") and in_entity:
            # A new entity starts right after another one: emit, then restart.
            entities.append((entity_type(previous_tag), accumulator))
            accumulator = [token]
            previous_tag = token.bio
        elif token.bio.startswith("B-"):
            # A new entity starts after an O tag.
            accumulator = [token]
            previous_tag = token.bio
        elif token.bio.startswith("I-") and in_entity:
            # Continuation of the current entity.
            accumulator.append(token)
            previous_tag = token.bio
    # Emit the last entity if the text ends inside one.
    if accumulator and entity_type(previous_tag) != "O":
        entities.append((entity_type(previous_tag), accumulator))
    return entities
Let’s go back to our extract function. We now have to transform the BIO-grouped results and return the named entities in the form of the Extraction data class we implemented earlier. We also reduce the token confidences to a single value per entity by averaging them.
        extractions = []
        for entity_type, entity_tokens in grouped_bio:
            # The entity spans from its first token's start to its last token's end.
            start, end = entity_tokens[0].start, entity_tokens[-1].end
            # Entity confidence is the mean of its tokens' confidences.
            confidence = sum(t.confidence for t in entity_tokens) / len(entity_tokens)
            extractions.append(
                Extraction(
                    start,
                    end,
                    entity_type,
                    text[start:end],
                    confidence,
                )
            )
        return extractions
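Here is that last step in isolation, as a minimal sketch. The Token class is a hypothetical stand-in for NeToken, and the offsets and confidences are made up; only the sentence is taken from the API example later in this article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Token:  # hypothetical stand-in for NeToken
    start: int
    end: int
    confidence: float


text = "John Doe is a Go Developer at Google"

# Two tokens that were grouped into one PER entity (made-up confidences).
person_tokens = [Token(0, 4, 0.99), Token(5, 8, 0.97)]

# The entity covers everything from the first token's start to the last token's end.
start, end = person_tokens[0].start, person_tokens[-1].end
entity_text = text[start:end]
confidence = mean(t.confidence for t in person_tokens)

print(entity_text, start, end, confidence)
```

Slicing the original text, instead of joining the token strings, keeps the whitespace and casing exactly as the user wrote them.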
You can find this class defined in this Python file
The API
Now that we have a working Named Entity extractor, let’s start implementing an API. Personally, I like to use FastAPI because it feels modern, keeps up with Python’s latest features such as asynchronous functions, and is very easy to use.
We want our API to load the Named Entity extractor once. Then, for each input query, it runs the model on the text and returns a list of Extraction objects, which in JSON becomes a list of dictionaries.
Let’s make our API configurable through a config.json file, so we don’t have to hard-code the model name in the API code or the extractor code, and so we can add more configuration later on if we want.
We define the configuration struct as follows:
@dataclass
class Configuration:
    model_name: str
And to populate it I’m going to use dacite
import json

import dacite

with open("config.json") as config_ptr:
    json_config = json.load(config_ptr)
# Build a typed Configuration object from the parsed JSON.
config = dacite.from_dict(
    data_class=Configuration, data=json_config,
)
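For reference, config.json only needs a single model_name key. The sketch below writes a sample file to a temporary directory and loads it back. It is an approximation under a few assumptions: it uses plain keyword unpacking instead of dacite so it runs with the standard library alone, and "your-org/your-ner-model" is a placeholder, not the model used in this project.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Configuration:
    model_name: str


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    # Placeholder model name; replace it with the model you want to serve.
    config_path.write_text(json.dumps({"model_name": "your-org/your-ner-model"}))

    with open(config_path) as config_ptr:
        json_config = json.load(config_ptr)

    # Keyword unpacking works for a flat config like this one.
    config = Configuration(**json_config)

print(config)
```

For a flat configuration the two approaches behave alike; a library like dacite becomes more useful once the configuration has nested data classes.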
Now let’s define the input and output models of our API
from typing import List

from pydantic import BaseModel


class Input(BaseModel):
    text: str


class Output(BaseModel):
    named_entities: List[Extraction]
Then, we instantiate our API and our NE extractor
api = FastAPI()
named_entity_extractor = TransformerExtractor(config.model_name)
Awesome, one last task is to implement a route that will take the text as input and produce the result as output
@api.post("/ner", response_model=Output)
def extractions(input_request: Input):
    # Run the extractor that was loaded once at startup.
    ne_extractions = named_entity_extractor.extract(input_request.text)
    return {"named_entities": ne_extractions}
Great, now we can run our API with uvicorn by running the command uvicorn api:api
And to query it we can send a post request with curl
curl "http://127.0.0.1:8000/ner" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"John Doe is a Go Developer at Google"}'
The API replied with:
{"named_entities":[{"start":0,"end":8,"type":"PER","text":"John Doe","confidence":0.9973324537277222,"__initialised__":true},{"start":30,"end":36,"type":"ORG","text":"Google","confidence":0.9978129863739014,"__initialised__":true}]}
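The same request can also be sent from Python. This is a minimal sketch using only the standard library; build_ner_request is a helper name I made up, and the URL assumes uvicorn is running locally on its default port. The request is only built here, and the lines that actually send it are left commented out because they need the API to be running.

```python
import json
import urllib.request


def build_ner_request(text, base_url="http://127.0.0.1:8000"):
    """Build a POST request for the /ner route with a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ner",
        data=payload,
        # Tell the API explicitly that the body is JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ner_request("John Doe is a Go Developer at Google")
print(request.get_method(), request.full_url)

# To send it (requires the API to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["named_entities"])
```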
You can find the API code here
Demo
Finally, we want to have an interactive demo to explore our model and share it with our clients. This is usually one of the most important components of any freelance project because that’s how you let your clients explore your system and then maybe ask for modifications.
To implement a demo, I don’t think there’s any easier framework than streamlit, so let’s create a visual interactive demo with it. The demo will have a text area where the user inputs their text, and the system will show the output as a table of extracted entities.
Let’s give our demo a title and header
st.title("NER Demo")
st.header("Extract Named Entities from text")
Let’s also load the same configuration as we did with our API
with open("config.json") as config_ptr:
    json_config = json.load(config_ptr)


@dataclass
class Configuration:
    model_name: str


config = dacite.from_dict(
    data_class=Configuration,
    data=json_config,
)
Since streamlit reruns the whole script whenever a change occurs, we want our model to load only once. So let’s write a function that loads the model and decorate it with streamlit’s cache decorator
# Newer Streamlit releases replace st.cache with st.cache_resource for objects like models.
@st.cache(allow_output_mutation=True)
def load_ner_model(model_name: str):
    named_entity_extractor = TransformerExtractor(model_name)
    return named_entity_extractor
Then, let’s run the load function and also add a text area to produce our first outputs
model = load_ner_model(config.model_name)
text = st.text_area(
    "Text to extract from",
    "Yann Lecun is a very famous scientist who works at Meta AI.",
    max_chars=500,
)
ne_extractions = model.extract(text)
# One row per extracted entity.
predictions = pd.DataFrame(
    data={
        "text": [e.text for e in ne_extractions],
        "type": [e.type for e in ne_extractions],
    }
)
Finally, let’s write our outputs as an interactive table
st.write(predictions)
Let’s run this and see how it looks
It looks quite neat! The output is not bad either, except that it splits Yann’s name, but that’s a model problem.
You can find the demo’s code on Github
Dockerfile
Last but not least, let’s write a very basic Dockerfile that we can use to build a container for our application.
FROM python:3.8-buster
COPY . /home
WORKDIR /home
RUN pip install -r /home/requirements.txt --ignore-installed
# Exec form takes one list item per argument; bind to 0.0.0.0 so the API is reachable from outside the container.
ENTRYPOINT ["uvicorn", "api:api", "--host", "0.0.0.0"]
This file is on Github as well
Conclusion
Now that we’ve wrapped all our components and glued them together, our NER project is ready to run as a microservice, an online demo, or an HTTP API. The way we wrote it also lets us add more models, whether from Hugging Face or from other projects, as long as they extend the same abstract class. If anything needs to be modified, feel free to open a pull request on GitHub!