Exploring Word Embeddings and Text Catalogs with Apple’s Natural Language Framework in iOS | by Anupam Chugh

Learn how NLGazetteer, NLWordEmbedding, and other NLP functionalities work in Apple’s Natural Language framework


NSLinguisticTagger, which was available as far back as the iOS 5 SDK, paved the path for Apple’s announcement of their Natural Language framework at WWDC 2018. Everything from language identification to lemmatization and part-of-speech tagging, all of which were present in NSLinguisticTagger, is now part of the Natural Language framework, with an API that’s been completely redesigned in Swift.

The added benefit that the Natural Language framework has over the NSLinguisticTagger is the ability to use custom NLP models.

During WWDC 2019, Apple announced the arrival of several powerful new tools in its Natural Language framework. Three notable ones were introduced:

  • Built-in sentiment analysis
  • Word embedding
  • Text catalog

Before we delve deeper into each of these features, let’s quickly walk through the important APIs that are already present in Apple’s Natural Language framework.

Natural language processing is responsible for taking unstructured text data as input and inferring a number of possible observations on it. The following are some of the key APIs that are used for NLP in iOS to process text in intelligent ways.

The NLLanguageRecognizer helps us determine a piece of text’s language from a string, as shown below:

import NaturalLanguage
let str = "Hola"
let recognizer = NLLanguageRecognizer()
recognizer.processString(str)
// Unwrap both the language code and its localized display name
if let languageCode = recognizer.dominantLanguage?.rawValue,
   let detectedLanguage = Locale.current.localizedString(forIdentifier: languageCode) {
    print(detectedLanguage) // prints Spanish (with an English locale)
}

The NLLanguageRecognizer class makes it possible to determine the dominant language code from the text’s context. Additionally, the API provides functionality to return the confidence level of the language that’s predicted.

The following function returns the top 5 predicted languages as a dictionary that maps each language code to a probability value:

recognizer.languageHypotheses(withMaximum: 5)
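
Since the hypotheses come back as an unordered dictionary, it can help to sort them before display. The following is a minimal sketch of that idea; the helper name rankedLanguages and the sample sentence are my own, not part of the framework:

```swift
import NaturalLanguage

/// Hypothetical helper: returns up to `limit` candidate languages, most probable first.
func rankedLanguages(for text: String, limit: Int = 5) -> [(code: String, probability: Double)] {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    // languageHypotheses returns a [NLLanguage: Double] dictionary
    let hypotheses: [NLLanguage: Double] = recognizer.languageHypotheses(withMaximum: limit)
    var ranked: [(code: String, probability: Double)] = []
    for (language, probability) in hypotheses {
        ranked.append((code: language.rawValue, probability: probability))
    }
    ranked.sort { $0.probability > $1.probability }
    return ranked
}

for candidate in rankedLanguages(for: "Hola, ¿cómo estás?") {
    print("\(candidate.code): \(candidate.probability)")
}
```

The exact probabilities depend on the OS version, so treat the printed values as indicative only.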

Currently, in my testing, Apple’s language identification API is less accurate than its Firebase counterpart. Languages such as Hindi (the one I tested on a wide variety of texts) aren’t currently identified from context by the NLLanguageRecognizer.

Tokenization is the process of splitting a string into chunks of words, sentences, characters, or paragraphs. These segmented texts can then be processed together or separately, depending on the use case. To tokenize a string, the NLTokenizer class is used.

We need to specify the unit on which the string should be tokenized; the text is then segmented according to that unit. In the code below, each of the questions is split into a separate token:

import NaturalLanguage
let str = "How are you? Where were you?"
let tokenizer = NLTokenizer(unit: .sentence)
tokenizer.string = str
tokenizer.enumerateTokens(in: str.startIndex..<str.endIndex) { tokenRange, _ in
    print(str[tokenRange])
    return true // keep enumerating
}
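
The same tokenizer can work at other granularities. As a small sketch, switching the unit to .word and collecting the ranges with tokens(for:) yields the individual words; the helper name words(in:) is my own:

```swift
import NaturalLanguage

/// Hypothetical helper: splits `text` into word tokens using NLTokenizer.
func words(in text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    // tokens(for:) returns the range of every token in the given range
    let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
    return ranges.map { String(text[$0]) }
}

print(words(in: "How are you? Where were you?"))
```

Collecting ranges up front is convenient when you need a count or random access; enumerateTokens is the better fit when you want to stop early.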

Lemmatization is the process of converting a word into its base form. Oftentimes you’ll come across use cases in your NLP applications where the only difference between a few words is the tense in which they’re used. For example, “assumed” and “assuming” are different flavors of the word “assume” and possess the same core meaning.

Lemmatization is commonly used in word tagging and fuzzy matching (which identifies misspelled words, like what you’d see in a Google search). The following code shows how lemmatization is implemented with NLTagger:

import NaturalLanguage
let text = "I am running late for the 10 km marathon run that was scheduled for today. Can we reschedule the meeting?"
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = text
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lemma) { tag, tokenRange in
    // The tag's raw value is the lemma of the token
    if let tag = tag {
        print("\(text[tokenRange]) - \(tag.rawValue)")
    }
    return true
}

The Natural Language framework also has the ability to classify and identify words from speech or a sentence as nouns, pronouns, verbs, adjectives, prepositions, idioms, etc. Part-of-speech tagging is done using NLTagger.

This isn’t as straightforward as it sounds, since the same word can be tagged as a verb or a noun based on the semantic context. The Natural Language framework determines the appropriate lexical class.

In the following code, we’ve used the same NLTagger as previously, but with a different scheme specified:

import NaturalLanguage
let text = "His laugh was so shrill that it jarred everybody. You laugh in a shrill manner."
let tagger = NLTagger(tagSchemes: [.lexicalClass])
let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
tagger.string = text
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
    // The tag's raw value is the lexical class, e.g. Noun or Verb
    if let tag = tag {
        print("\(text[tokenRange]) : is \(tag.rawValue)")
    }
    return true
}

In the output, you can see that the word “laugh” is tagged as a noun and a verb in different parts of the text.

Named entity recognition is a subset of PoS tagging. We can identify names of people, places, and organizations in a similar fashion by setting the tag scheme to nameType and looping through the tags. Moreover, for our specific use cases, we can set our custom Core ML models on the NLTagger class, as shown below:

// modelUrl is the URL of your compiled custom word tagger model
let scheme = NLTagScheme("CoffeeName")
let model = try NLModel(contentsOf: modelUrl)
let tagger = NLTagger(tagSchemes: [scheme])
tagger.setModels([model], forTagScheme: scheme)
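
For the built-in nameType scheme mentioned above, no custom model is needed. Here’s a rough sketch of how that loop might look; the helper name namedEntities(in:) and the sample sentence are my own, and I’ve added the joinNames option so that multi-word names are reported as a single token:

```swift
import NaturalLanguage

/// Hypothetical helper: collects people, places, and organizations found in `text`.
func namedEntities(in text: String) -> [(name: String, type: String)] {
    let tagger = NLTagger(tagSchemes: [.nameType])
    tagger.string = text
    let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation, .joinNames]
    // Only keep the three entity kinds; other tokens are tagged as non-names
    let wanted: Set<NLTag> = [.personalName, .placeName, .organizationName]
    var found: [(name: String, type: String)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .nameType, options: options) { tag, tokenRange in
        if let tag = tag, wanted.contains(tag) {
            found.append((name: String(text[tokenRange]), type: tag.rawValue))
        }
        return true
    }
    return found
}

for entity in namedEntities(in: "Tim Cook introduced the framework at an Apple event in San Jose.") {
    print("\(entity.name) - \(entity.type)")
}
```

Which entities are actually recognized depends on the built-in model, so the output may vary across OS versions.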

Now that we’ve had a good look at the older Natural Language framework tools, let’s dig deep into the newer ones.

Text classification got a boost with the inclusion of the Sentiment Analysis API. It analyzes the degree of sentiment in the text, and based on that gives a score that ranges from -1 (highly negative) to 1 (very positive). Currently, the Natural Language framework supports 7 languages for sentiment analysis.

To use built-in sentiment analysis on a piece of text, simply pass the sentimentScore tag scheme to the NLTagger instance, as shown below:

import NaturalLanguage
let text = "It won't be possible to make it today."
let tagger = NLTagger(tagSchemes: [.sentimentScore])
tagger.string = text
let (sentiment, _) = tagger.tag(at: text.startIndex, unit: .paragraph, scheme: .sentimentScore)
// The score is delivered as a string in the tag's raw value
print(sentiment?.rawValue ?? "no score") // prints a negative score, e.g. -0.8

Simply changing the unit to word or sentence in the call above would not work. For that, we’d need to enumerate over the text as we did before. Even then, the sentiment score that gets assigned to each word or sentence is the same as that of the whole text; hence, it’s recommended to specify paragraph as the unit.
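
Because the score arrives as a string, you’ll usually want to convert it to a number before acting on it. The sketch below wraps that conversion and maps the score to a coarse label; both helper names and the ±0.25 thresholds are arbitrary choices of mine, not framework values:

```swift
import NaturalLanguage

/// Hypothetical helper: returns the paragraph-level sentiment of `text` as a Double, if available.
func sentimentScore(of text: String) -> Double? {
    guard !text.isEmpty else { return nil }
    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text
    let (tag, _) = tagger.tag(at: text.startIndex, unit: .paragraph, scheme: .sentimentScore)
    // The raw value is a string such as "-0.8"; parse it into a number
    return tag.flatMap { Double($0.rawValue) }
}

/// Hypothetical helper: buckets a score in -1...1 using assumed thresholds.
func sentimentLabel(for score: Double) -> String {
    if score > 0.25 { return "positive" }
    if score < -0.25 { return "negative" }
    return "neutral"
}

if let score = sentimentScore(of: "It won't be possible to make it today.") {
    print("\(score) -> \(sentimentLabel(for: score))")
}
```

You’d want to tune the thresholds against your own data, since what counts as neutral is application-specific.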

Word tagging is the other important aspect of NLP and word embedding is a part of it. Word embedding basically maps strings to their vector counterparts. In doing so, strings that have small vector distances are deemed similar.

If we place a few random strings in the coordinate space, we can see that the semantically similar ones are clustered together.

Word embedding is a crucial part of search engines and indexing search, as it’s pretty common to search for terms that are not directly present in the search index. For such cases, by using word embedding we can retrieve the closest possible matches.

Currently, the Natural Language framework supports built-in OS embedding in 7 languages: English, Spanish, French, Italian, German, Portuguese, and Simplified Chinese. However, we can also create our own custom word embeddings, as we’ll see shortly.

An NLEmbedding class instance is instantiated as follows:

let embedding = NLEmbedding.wordEmbedding(for: .english)

Currently, OS embeddings require the word to be in lowercase only. Not doing so returns no result. The following method is used to retrieve the vector representation of a word:

embedding?.vector(for: "cat") //returns double array

The distance that’s computed between two words is a cosine distance, and in cases where it can’t be computed (for example, when a word doesn’t exist in the built-in OS embedding), the value returned is 2.0.

embedding?.distance(between: "cat", and: "dog") //0.71

To find the top K most similar words, we can enumerate this request in the following way:

embedding?.enumerateNeighbors(for: "HeartBeat".lowercased(), maximumCount: 5) { (string, distance) -> Bool in
    print("\(string) - \(distance)")
    return true // return false to stop early
}
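
To build some intuition for what the distance and neighbor lookups above are doing, here’s a plain-Swift sketch that doesn’t touch the framework at all. It assumes one common definition of cosine distance (1 minus cosine similarity), which is consistent with the 0 to 2 range mentioned earlier but isn’t confirmed to be Apple’s exact formula; the function names and the toy vectors are made up for illustration:

```swift
import Foundation

/// Cosine distance as 1 - cosine similarity (an assumed definition), in the range 0...2.
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    let dot = zip(a, b).reduce(0.0) { $0 + $1.0 * $1.1 }
    let normA = sqrt(a.reduce(0.0) { $0 + $1 * $1 })
    let normB = sqrt(b.reduce(0.0) { $0 + $1 * $1 })
    guard normA > 0, normB > 0 else { return nil }
    return 1 - dot / (normA * normB)
}

/// Brute-force lookup of the `maximumCount` closest words to `word`.
func nearestNeighbors(of word: String, in vectors: [String: [Double]], maximumCount: Int) -> [(word: String, distance: Double)] {
    guard let target = vectors[word] else { return [] }
    var scored: [(word: String, distance: Double)] = []
    for (candidate, vector) in vectors where candidate != word {
        if let distance = cosineDistance(target, vector) {
            scored.append((word: candidate, distance: distance))
        }
    }
    scored.sort { $0.distance < $1.distance }
    return Array(scored.prefix(maximumCount))
}

// Made-up 2D vectors, purely for illustration
let toyVectors: [String: [Double]] = [
    "king": [0.9, 0.1],
    "queen": [0.85, 0.2],
    "apple": [-0.2, 0.95]
]
for neighbor in nearestNeighbors(of: "king", in: toyVectors, maximumCount: 2) {
    print("\(neighbor.word) - \(neighbor.distance)")
}
```

Real embeddings have hundreds of dimensions and use indexed lookups rather than a linear scan, but the ranking principle is the same.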

For specific use cases, custom word embeddings can be built from vectors produced by GloVe, Word2Vec, BERT, or FastText. For demonstration purposes, we create a vector dictionary, as shown below:

import CreateML // available on macOS
let vectors = [
    "dog": [-0.4, 0.37],
    "cat": [-0.15, -0.02],
    "lion": [0.19, -0.4]
]
let embedding = try MLWordEmbedding(dictionary: vectors)
try embedding.write(to: URL(fileURLWithPath: "modelPathHere"))

MLWordEmbedding uses an automatic compression technique that can compress gigabytes of data into a very tiny Core ML model.

To use the Core ML model with NLEmbedding, we need to pass the URL of the compiled model, as shown below:

import CoreML
import NaturalLanguage
let compiledUrl = try MLModel.compileModel(at: URL(fileURLWithPath: "modelPathHere"))
let customEmbedding = try NLEmbedding(contentsOf: compiledUrl)
customEmbedding.enumerateNeighbors(for: "dog", maximumCount: 2) { (string, distance) -> Bool in
    print("\(string) - \(distance)")
    return true
}

Text Catalog is a newly added functionality in the Natural Language Framework that allows us to customize word tagging models without the need to create a new word tagging model.

Instead, we create a dictionary of labels and their custom tags and pass them into Create ML’s MLGazetteer model type. The output model that’s generated is an efficient (in terms of space and speed) form of the input dictionary that can then be used with an NLTagger:

let entities = [
    "American Actor": ["Tom Cruise", "Brad Pitt", "Will Smith" /* , .... */],
    "Indian Actor": ["Amitabh Bachchan", "Aamir Khan", "Sharukh Khan" /* , .... */]
]
let gazetteer = try MLGazetteer(dictionary: entities)
try gazetteer.write(to: URL(fileURLWithPath: "modelUrlPath"))

Finally, we pass the compiled model URL to the NLGazetteer and set it in the NLTagger:

let compiledUrl = try MLModel.compileModel(at: URL(fileURLWithPath: "modelUrlPath"))
let gazetteer = try NLGazetteer(contentsOf: compiledUrl)
let text = "Amitabh Bachchan and Leonardo DiCaprio featured together in the Movie Great Gatsby"
let tagger = NLTagger(tagSchemes: [.nameTypeOrLexicalClass])
let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
tagger.string = text
// Terms found in the gazetteer take the labels defined in its dictionary
tagger.setGazetteers([gazetteer], for: .nameTypeOrLexicalClass)
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange in
    if let tag = tag {
        print("\(text[tokenRange]) - \(tag.rawValue)")
    }
    return true
}

The NLTagger tags the name types with the tags specified in the gazetteer.

We explored the different functionalities that the Natural Language framework provides and looked at their use case examples side-by-side.

Natural language processing is a complex yet powerful field of study and application. Complex in the sense that we’re dealing with unstructured data—human language, where the same word can be labeled differently depending on the context it’s present in.

Create ML allows us to build our own text classifiers and word taggers, and customize them by using word embedders and text catalogs. Now that we’ve had a good look at the framework, you can go ahead and try using NLP to build intelligent text processing applications. Something like determining similar sentences using word embedders would be a really interesting thing to implement.

That’s it for this one. I hope you enjoyed reading.
