Named Entity Recognition

George Brocklehurst

Downloading recipes

I wanted to experiment with a reasonably large recipe dataset, to play around with some meal planning ideas. The trouble was, I didn’t have a dataset.

No problem, I thought, there are loads of recipes on the Web—I’ll use some of those!

Thanks to embedded data formats like the h-recipe microformat, and the recipe schema from Schema.org, many of the recipes published on the Web are marked up semantically. Even better, there’s a Ruby gem called hangry to parse these formats. In no time, I was turning recipes into structured data.
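To give a flavour of that step, here’s roughly what it looked like. This is a sketch rather than my exact code: it assumes hangry’s Hangry.parse method and name and ingredients attributes on the parsed recipe, which is how I remember the gem’s API.

require "hangry"
require "open-uri"

# Fetch a recipe page and parse the embedded h-recipe / schema.org markup.
# The URL is a placeholder.
html = URI.open("https://example.com/some-recipe").read
recipe = Hangry.parse(html)

recipe.name         # the recipe's title
recipe.ingredients  # an array of human-readable ingredient strings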

The thing I was most interested in was ingredients, and here I hit my next problem: I had human readable lists of ingredients, but nothing sufficiently structured to compare quantities, find similarities, or convert units.

Ingredients are hard

The first few examples I looked at seemed pretty simple:

[
  "2 tablespoons butter",
  "2 tablespoons flour",
  "1/2 cup white wine",
  "1 cup chicken broth",
]

It seemed like a clear pattern was emerging, and maybe one line of Ruby code would suffice:

quantity, unit, name = description.split(" ", 3)

Unfortunately, the reality was much more complex. I found more and more examples that didn’t fit this simple pattern. Some ingredients had multiple quantities that needed to be combined (“3 cups and 2 tablespoons”, or “2 10 ounce packages”); others had alternative quantities in metric and imperial, or in cups and ounces; still others followed the ingredient name with preparation instructions, or listed multiple ingredients together in the same item.
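To make the problem concrete, here’s what that one-liner does with the “2 10 ounce packages” case (with a made-up ingredient name tacked on the end):

description = "2 10 ounce packages frozen spinach"
quantity, unit, name = description.split(" ", 3)

quantity  # => "2"
unit      # => "10" (not a unit at all)
name      # => "ounce packages frozen spinach"

None of the three variables ends up holding what its name promises.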

The special cases piled higher and higher, and my simple Ruby code got more and more tangled. I stopped feeling good about the code, then I stopped feeling like it would be OK after refactoring, and eventually I threw it away.

I needed a whole new plan.

Named entity recognition

This seemed like the perfect problem for supervised machine learning—I had lots of data I wanted to categorise; manually categorising a single example was pretty easy; but manually identifying a general pattern was at best hard, and at worst impossible.

After considering my options, I decided a named entity recogniser was the right tool to use. Named entity recognisers identify pre-defined categories in text; in my case, I wanted one to recognise the names, quantities, and units of ingredients.

I opted for the Stanford NER, which uses a conditional random field sequence model. To be perfectly honest, I don’t understand the maths behind this particular type of model, but you can read the paper [1] if you want all the gory details. The important thing for me was that I could train this NER model on my own dataset.

The process I followed to train my model was based on the Stanford NER FAQ’s Jane Austen example.

Training the model

The first thing I did was gather my example data. Within a single recipe, the way the ingredients are written is quite uniform. I wanted to make sure I had a good range of formats, so I combined the ingredients from around 30,000 online recipes into a single list, sorted them randomly, and picked the first 1,500 to be my training set.
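The sampling itself didn’t need anything clever; a few lines of Ruby along these lines will do it (the name of the combined input file is a placeholder, but train.txt matches the file used in the commands below):

# Combine all the ingredient lines, shuffle them, and keep the first 1,500
# as the training set. The input file name is illustrative.
ingredients = File.readlines("all_ingredients.txt", chomp: true)

File.write("train.txt", ingredients.shuffle.first(1500).join("\n") + "\n")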

It looked like this:

confectioners' sugar for dusting the cake
1 1/2 cups diced smoked salmon
1/2 cup whole almonds (3 oz), toasted
...

Next, I used part of Stanford’s suite of NLP tools to split these into tokens.

The following command will read text from standard input, and output tokens to standard output:

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer

In this case, I wanted to build a model that would understand a single ingredient description, not a whole set of ingredient descriptions. In NLP parlance, that means each ingredient description should be considered a separate document. To represent that to the Stanford NER tools, we need to separate each set of tokens with a blank line.

I broke them up using a little shell script:

while read -r line; do
  echo "$line" | java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >> train.tok
  echo >> train.tok
done < train.txt

The output looked like this:

confectioners
'
sugar
for
dusting
the
cake

1 1/2
cups
diced
smoked
salmon

1/2
cup
whole
almonds
-LRB-
3
oz
-RRB-
,
toasted

...

The last manual step was to tag the tokens, indicating which were part of the name of an ingredient, which were part of the quantity, and which were part of the unit. 1,500 examples came to around 10,000 tokens, each labelled by hand—never let anyone tell you machine learning is all glamour.

Every token needs a label, even tokens that aren’t interesting, which are labelled with O. Stanford NER expects each token and its label to be separated by a tab character. To get started, I labelled every token with O:

perl -ne 'chomp; $_ =~ /^$/ ? print "\n" : print "$_\tO\n"' \
  train.tok > train.tsv

Several hours in vim later, the results looked something like this:

confectioners  NAME
'              NAME
sugar          NAME
for            O
dusting        O
the            O
cake           O

1 1/2          QUANTITY
cups           UNIT
diced          O
smoked         NAME
salmon         NAME

1/2            QUANTITY
cup            UNIT
whole          O
almonds        NAME
-LRB-          O
3              QUANTITY
oz             UNIT
-RRB-          O
,              O
toasted        O

...

Now that the training set was finished, I could build the model:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -trainFile train.tsv \
  -serializeTo ner-model.ser.gz \
  -prop train.prop

The train.prop file I used was very similar to the Stanford NER FAQ’s example file, austen.prop.
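For the curious, the important line in a prop file like that is the map, which says that column 0 of the TSV holds the word and column 1 holds the answer (the label); the rest is a block of feature flags. Here’s a minimal sketch based on my memory of austen.prop; treat the exact flags as a starting point and check the FAQ for the authoritative list. (trainFile and serializeTo are passed on the command line above, so they don’t need to appear in the file.)

map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true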

And there I had it! A model that could classify new examples.

Testing the model

One of the downsides of machine learning is that it’s somewhat opaque. I knew I had trained a model, but I didn’t know how accurate it was going to be. Fortunately, Stanford provide testing tools to let you know how well your model can generalise to new examples.

I took about another 500 examples at random from my dataset and went through the same glamorous process of hand-labelling the tokens. Now I had a test set I could use to validate my model. The measures of accuracy are based on how the token labels produced by the model differ from the token labels I wrote by hand.

I tested the model using this command:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -testFile test.tsv

This test command outputs the test data, showing both the label I’d given each token and the label the model predicted, followed by a summary of the accuracy:


CRFClassifier tagged 4539 words in 514 documents at 3953.83 words per second.
         Entity P       R       F1      TP      FP      FN
           NAME 0.8327  0.7764  0.8036  448     90      129
       QUANTITY 0.9678  0.9821  0.9749  602     20      11
           UNIT 0.9501  0.9630  0.9565  495     26      19
         Totals 0.9191  0.9067  0.9129  1545    136     159

The column headings are a little opaque, but they’re standard machine learning metrics that make good sense with a little explanation.

  • P is precision: this is the number of tokens of a given type that the model identified correctly, out of the total number of tokens the model predicted were that type. 83% of the tokens the model identified as NAME tokens really were NAME tokens, 97% of the tokens the model identified as QUANTITY tokens really were QUANTITY tokens, etc.

  • R is recall: this is the number of tokens of a given type that the model identified correctly, out of the total number of tokens of that type in the test set. The model found 78% of the NAME tokens, 98% of the QUANTITY tokens, etc.

  • F1 is the F1 score, which combines precision and recall. It’s possible for a model to be quite inaccurate but still score highly on precision or on recall alone: imagine a model that labelled every token as NAME; it would get a perfect recall score for NAME. Combining the two into an F1 score gives a single number that’s more representative of overall quality (there’s a quick worked check of these numbers after this list).

  • TP, FP, and FN are true positives, false positives, and false negatives respectively.
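As that quick check, the NAME row of the table can be reproduced from its raw counts: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

# Recompute the NAME row of the results table from its raw counts.
tp, fp, fn = 448, 90, 129

precision = tp.to_f / (tp + fp)                            # => 0.8327...
recall    = tp.to_f / (tp + fn)                            # => 0.7764...
f1        = 2 * precision * recall / (precision + recall)  # => 0.8036...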

Using the model

Now that I had a model, and some confidence that it was reasonably accurate, I could use it to classify new examples that weren’t in the training or test sets.

Here’s the command to run the model:

$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin
Invoked on Wed Sep 27 08:18:42 EDT 2017 with arguments: -loadClassifier
ner-model.ser.gz -readStdin
loadClassifier=ner-model.ser.gz
readStdin=true
Loading classifier from ner-model.ser.gz ... done [0.3 sec].
1/2/QUANTITY cup/UNIT of/O flour/NAME
CRFClassifier tagged 4 words in 1 documents at 18.87 words per second.

The output looks quite noisy, but most of it goes to STDERR, so we can throw it away if we choose to:

$ echo "1/2 cup of flour" | \
  java -cp stanford-ner/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier ner-model.ser.gz \
  -readStdin 2>/dev/null
1/2/QUANTITY cup/UNIT of/O flour/NAME
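This isn’t part of the Stanford toolchain, but to close the loop on the structured data I wanted in the first place, here’s a sketch of how that token/LABEL output could be folded back into a Ruby hash. It shells out to the same command as above (the jar and model paths are the same assumptions), and splits each pair on the last slash so quantities like 1/2 survive intact:

# Run one ingredient description through the classifier and group its
# tokens by label, discarding the uninteresting O tokens.
def parse_ingredient(description)
  tagged = IO.popen(
    ["java", "-cp", "stanford-ner/stanford-ner.jar",
     "edu.stanford.nlp.ie.crf.CRFClassifier",
     "-loadClassifier", "ner-model.ser.gz", "-readStdin"],
    "r+", err: File::NULL
  ) do |io|
    io.puts(description)
    io.close_write
    io.read
  end

  tagged.split.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |pair, result|
    token, _, label = pair.rpartition("/")
    result[label] << token unless label == "O"
  end
end

parse_ingredient("1/2 cup of flour")
# => {"QUANTITY" => ["1/2"], "UNIT" => ["cup"], "NAME" => ["flour"]}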

Iterating on the model

Even with these seemingly high F1 scores, the model was only as good as its training set. When I went back and ran my full corpus of ingredient descriptions through the model I quickly discovered some flaws.

The most obvious problem was that the model couldn’t recognise fluid ounces as a unit of measurement. When I looked back at the training set and the test set, there wasn’t a single example of fluid ounces, fl ounces, or fl oz.
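A quick way to spot that kind of gap is to search the training data directly. A rough check like this (the regular expression is only an approximation) would have come back with zero matches:

# Count training examples that mention fluid ounces in any common spelling.
pattern = /\bfl(uid)?\.?\s*(oz|ounces?)\b/i

puts File.readlines("train.txt").count { |line| line.match?(pattern) }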

My random sample hadn’t been large enough to truly represent the data.

I selected additional training and testing examples, taking care to include various representations of fluid ounces in my training and test sets. The updated model got similar scores on the updated test set, and it no longer had trouble with fluid ounces.

The moral of the story

It’s an exciting time for machine learning. As with Web development a decade ago, the tools are becoming increasingly accessible, to the point where developers can focus less on the mechanism and more on the problems they’re solving.

It’s not a silver bullet—no technology solves every problem—but I’m excited to have these tools at our disposal when the right kind of problem comes along.

If you want to try this for yourself, I packaged up the commands I used into a Makefile, to save typing a lot of long-winded invocations. You can find that on GitHub: https://github.com/georgebrock/ner-tools

Named entity recognisers aren’t the only form of machine learning. If you want to learn about other models, get comfortable with ideas like precision, recall, and F1 scores, and much more, I’d recommend Andrew Ng’s machine learning course on Coursera.


[1] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.