One of the “3 Vs” of Big Data is variety. Quite often, that variety includes human language.
Understanding the written word is one of the most complex things a machine can undertake. In 1950, Alan Turing published the paper “Computing Machinery and Intelligence”, which introduced what became known as the Turing Test and opened by posing the question:
“Can machines think?”
In his test, Turing proposed:
“a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another”
Whilst the paper and the test have been criticized, they remain influential and an important topic in the realm of artificial intelligence and machine learning.
Processing the written word is difficult for machines. Natural Language Processing or NLP (not to be confused with Neuro Linguistic Programming) is a field of computer science and AI concerned with writing software that attempts to understand the written word and derive meaning from it.
The Challenges of Analyzing Human Language
We know that it can be difficult for the machine to interpret and understand human language.
But why is this?
The order of words can have an effect on the overall meaning of the sentence, or an opinion may be expressed with sarcasm or irony.
Take the following sentence, for example, in which the author is being sarcastic. Challenges such as this are what NLP attempts to address.
“After a whole 5 hours away from work, I get to go back again, I’m so lucky!”
It can be time-consuming for the machine to deal with vast quantities of text that are often found in Big Data. It makes sense to pre-process the data prior to applying NLP routines by removing what are called Stop Words. These are also sometimes known as Noise Words.
For example, some search engines do not index common stop words in order to save disk space. They may also exclude stop words from queries to help speed up searches.
Removing stop words ensures that NLP is only applied to words that add meaning to the sentence and thereby improves the speed at which NLP routines can be executed.
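As a rough illustration of this pre-processing step, here is a minimal Python sketch. The stop-word list and the `remove_stop_words` helper are assumptions made up for this example; real stop-word lists are much longer and vary from tool to tool.

```python
import re

# A tiny, illustrative stop-word list (an assumption for this sketch);
# real lists usually contain a hundred or more entries.
STOP_WORDS = {"a", "an", "and", "are", "i", "is", "the", "to", "was", "were"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("The rooms were wonderful and the staff was helpful."))
```

Only the words that carry meaning are left for any later NLP routine to process.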
Finding Signals in the “Noise”
Cleansing and pre-processing data is only one part of the puzzle when trying to understand the written word.
- How can we identify topics that people are talking about?
- How can we identify if people are talking about a specific place?
- How can we attempt to derive meaning from the cleansed data?
Enter POS Tagging.
POS (Part of Speech) Tagging is one technique that can help address these challenges and therefore improve the machine’s understanding of the written word.
POS tagging is the process of assigning a ‘tag/category’ (in the form of an abbreviated code) to each word in a sentence. In the English language, common POS categories include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and determiners.
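There are numerous POS tag sets; a common one is the Penn Treebank POS Tag Set. Tags from this set that appear in this article include NN (noun, singular), JJ (adjective), VB (verb, base form), VBG (verb, gerund or present participle), DT (determiner), TO (the word “to”), and FW (foreign word).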
How can POS Tagging be Used to Understand the Written Word?
If we take the following sentence:
“I think I’m going to get a new car”
If we run this through the Stanford Tagger, it will process each word in the sentence and return the following:
I/FW think/NN I’m/NN going/VBG to/TO get/VB a/DT new/JJ car/NN
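Having the information in this format makes it easier for the machine to identify patterns, entities, and linguistic constructs. The machine can also begin to understand the context of each word. Taggers are not infallible, though: in the output above, “I” and “think” would normally be tagged PRP (personal pronoun) and VBP (present-tense verb) rather than FW and NN.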
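To show how a program might consume output in this word/TAG format, here is a small Python sketch. It does not call the Stanford Tagger; it only parses the tagged string shown above. The function names `parse_tagged` and `words_with_tag` are hypothetical helpers invented for this example.

```python
def parse_tagged(tagged_text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains a slash keeps its full text.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, tag_prefix):
    """Return the words whose tag starts with tag_prefix (e.g. 'VB' matches VB and VBG)."""
    return [word for word, tag in pairs if tag.startswith(tag_prefix)]


tagged = "I/FW think/NN I’m/NN going/VBG to/TO get/VB a/DT new/JJ car/NN"
pairs = parse_tagged(tagged)

print(words_with_tag(pairs, "JJ"))  # adjectives
print(words_with_tag(pairs, "VB"))  # verbs in any form
```

Once the text is held as (word, tag) pairs, filtering by category or searching for tag sequences becomes ordinary list processing.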
What Kind of Patterns can Help the Machine?
Consider the following POS-tagged sentence:
car/NN is/VB great/JJ
The sentence contains a noun (car), and an adjective (great) has been used to describe it. Because the word “is” would be removed as a Stop Word during pre-processing, the remaining tags form the sequence NN JJ. We can, therefore, infer that NN JJ is a pattern worth looking for if we wish to identify strings of text in which an object is being described.
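The NN JJ idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the stop-word list is a made-up three-word set, and `find_tag_pattern` is a hypothetical helper, not part of any tagging library.

```python
# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"is", "a", "the"}


def find_tag_pattern(pairs, pattern, stop_words=STOP_WORDS):
    """Return the word sequences whose tags match `pattern` once stop words are removed."""
    kept = [(word, tag) for word, tag in pairs if word.lower() not in stop_words]
    size = len(pattern)
    matches = []
    # Slide a window of len(pattern) over the remaining (word, tag) pairs.
    for start in range(len(kept) - size + 1):
        window = kept[start:start + size]
        if [tag for _, tag in window] == list(pattern):
            matches.append([word for word, _ in window])
    return matches


pairs = [("car", "NN"), ("is", "VB"), ("great", "JJ")]
print(find_tag_pattern(pairs, ("NN", "JJ")))
```

With “is” dropped as a stop word, “car” and “great” become adjacent, so the NN JJ window matches them.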
Machine Learning as a Service
In the last few years, we have seen the commoditization of machine learning technology from companies such as Microsoft, with its Cognitive Services platform, and IBM, with its Watson suite of products. These can be consumed via REST APIs for a monthly fee and integrated with your existing solutions. The services and APIs on offer include, but are not limited to:
- Classify text and its sentiment (sentiment analysis)
- Determine the content of images
- Identify which language is being used
- Face detection
- Speech recognition
- Text Translation
Below, an image is supplied to the Microsoft Cognitive Services platform. It has identified the associated properties and tags using its preloaded training data. We can see the description and tags the machine has identified along with probabilities of accuracy (or Confidence as they’ve been labeled).
Note the accuracy for: “woman”, “beach”, “person”. (0 = no confidence, 1 = 100% confident).
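In practice, confidence scores like these are often used to discard uncertain tags. The Python sketch below assumes the tags arrive as a list of name/confidence records; that shape, the threshold, and the scores themselves are invented for illustration (only the tag names come from the example above).

```python
# Hypothetical tag records: the tag names come from the example above,
# but the confidence scores are made up for illustration.
tags = [
    {"name": "woman", "confidence": 0.95},
    {"name": "beach", "confidence": 0.90},
    {"name": "person", "confidence": 0.60},
]


def confident_tags(tags, threshold=0.8):
    """Keep only the tag names whose confidence meets the threshold."""
    return [tag["name"] for tag in tags if tag["confidence"] >= threshold]


print(confident_tags(tags))
```

The right threshold depends on the application: a higher value returns fewer, more reliable tags.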
We touched on text classification when discussing POS Tagging. The main tech firms also offer other text analysis features, such as:
- Sentiment Analysis
- Key Phrase Extraction
- Topic Detection
- Language Detection
In the following example, we can see the Microsoft Key Phrase Extraction interpretation of human input text:
I had a wonderful experience! The rooms were wonderful and the staff was helpful.
The service has returned JSON. In the Key Phrases node, the key phrases “staff”, “wonderful experience” and “rooms” have been identified.
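Reading such a response from code is straightforward. The Python sketch below parses a hand-written JSON body; the `documents` and `keyPhrases` field names are an assumption about the response shape, so check the service documentation before relying on them. No network call is made.

```python
import json

# Hypothetical response body. The "documents" and "keyPhrases" field names
# are an assumption about the response shape, not taken from the article.
response_body = """
{
  "documents": [
    {"id": "1", "keyPhrases": ["staff", "wonderful experience", "rooms"]}
  ],
  "errors": []
}
"""


def extract_key_phrases(body):
    """Map each document id to the list of key phrases returned for it."""
    data = json.loads(body)
    return {
        doc["id"]: doc.get("keyPhrases", [])
        for doc in data.get("documents", [])
    }


print(extract_key_phrases(response_body))
```

Keying the result by document id makes it easy to match phrases back to the original input when several texts are sent in one request.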
The key takeaway from these platforms is that you now have this rich functionality “in a box” to use as you please. They are powerful and, because they reside in the cloud, are generally built for high availability.
Using these services shields you from the complexities of the underlying ML algorithms, thereby allowing you to focus on the business problem at hand.
Machine Learning and the Future
Given the current rate of advancement, it’s hard to predict the impact ML will have on the future. What can be said, however, is that the employment market will be affected.
For example, in 2016, DeepMind (Google’s artificial intelligence company) formed a partnership with Moorfields Eye Hospital in London. The hospital supplied Google with access to 1 million anonymized retinal scans.
Google wanted to train a neural network on these images so that the machine could identify the patterns associated with two of the most common eye diseases: age-related macular degeneration and diabetic retinopathy.
More than 100 million people around the world have these conditions.
Trained clinicians often have trouble spotting these irreversible eye conditions. (Source: New Scientist)
This could make it possible for a machine learning system to detect the onset of disease before a human doctor could.
We’ve talked about some of the challenges machines face when trying to interpret human language, which often exists in Big Data.
We’ve discussed some solutions and touched on SaaS platforms developed by Microsoft and IBM, which offer end-to-end machine learning and NLP implementations that can be integrated with your existing software products.
What’s been your experience with machine learning and NLP?