NLP


Human beings are the most advanced species on Earth, and much of our success comes from our ability to communicate and share information. That is where the concept of developing a language comes in.

Human language is one of the most diverse and complex parts of us, considering that roughly 6,500 languages exist.

Coming to the 21st century, industry estimates suggest that only 21% of the available data is present in structured form. Data is being generated as we speak, tweet, and send messages on WhatsApp or in various Facebook groups, and the majority of this data exists in textual form, which is highly unstructured in nature.

Text mining, also called text analytics, is the process of deriving meaningful information from natural language text. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output.
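The three stages described above (structure the text, derive patterns, interpret the output) can be sketched in a few lines of plain Python. This is only a minimal illustration: the sample messages are made up, and simple word counting stands in for whatever pattern a real project would look for.

```python
import re
from collections import Counter

# Hypothetical sample messages, standing in for unstructured text data.
messages = [
    "Loving the new phone, the camera is great",
    "The camera is great but the battery is poor",
    "Battery life is poor",
]

# Stage 1: structure the input by lowercasing and splitting into words.
structured = [re.findall(r"[a-z']+", m.lower()) for m in messages]

# Stage 2: derive a pattern, here simple word frequencies over all messages.
counts = Counter(word for words in structured for word in words)

# Stage 3: interpret the output, e.g. how often each topic is mentioned.
topics = {w: counts[w] for w in ("camera", "battery", "phone")}
print(topics)
```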

Natural language processing, on the other hand, refers to the artificial intelligence method of communicating with an intelligent system using natural language, whereas text mining refers to the process of deriving high-quality information from text.

The overall goal is essentially to turn text into data for analysis through the application of natural language processing. That is why text mining and NLP go hand in hand.

Applications of Natural Language Processing

Let's look at some of the applications of text mining and natural language processing. One of the first and most important is sentiment analysis. Be it Twitter sentiment analysis or Facebook sentiment analysis, it is being used heavily now.

Next, we have chatbots. You might have used the customer chat services provided by various companies; the process behind all of them relies on NLP.

Next, we have speech recognition, which includes voice assistants like Siri, Google Assistant and Cortana. The process behind all of these relies on natural language processing.

Machine translation is another use case of natural language processing. The most common example is Google Translate, which uses NLP to translate text from one language to another in real time.

Other applications of NLP include spell checking, keyword search, and extracting information from a document or a website.

Finally, one of the coolest applications of natural language processing is advertisement matching, which is basically the recommendation of ads based on your history.

Division of Natural Language Processing 

NLP is divided into two major components:

  1. Natural language understanding  
  2. Natural language generation

Understanding generally refers to mapping the given natural language input into a useful representation and analyzing the different aspects of the language, whereas generation is the process of producing natural language output from an internal representation. A lot goes into understanding a particular language, especially if you are not a human being.

Steps in Natural Language Processing 

There are various steps involved in natural language processing:

  1. Tokenization
  2. Stemming
  3. Lemmatization 
  4. POS tags
  5. Named entity recognition
  6. Chunking 

1. Tokenization 

Tokenization is the process of breaking strings into tokens, which are small structures or units that can be used for further processing.

Tokenization

If we look at the example above, the sentence can be divided into seven tokens. This is very useful in natural language processing.
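As a rough illustration of the idea, a tokenizer can be approximated with a single regular expression. This is a simplified sketch, not how NLTK's `word_tokenize` works internally, and the sample sentence is our own rather than the one in the picture.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical example sentence.
tokens = tokenize("Tokenization is the first step in NLP.")
print(tokens)
```

Treating the full stop as its own token is a design choice; some pipelines drop punctuation at this stage instead.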

2. Stemming 

The second step in natural language processing is stemming. Stemming usually refers to normalizing words into their base or root form.

Stemming

If we look at the words above, we have affectation, affects, affections, affected, affection and affecting. All of these words originate from a single root word and, as you might have guessed, it is affect.

A stemming algorithm works by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always.
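The cutting described above can be sketched with a toy suffix stripper. The suffix list and the minimum stem length are assumptions chosen for this example; real stemmers such as the Porter stemmer use far more elaborate rules.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ["ations", "ation", "ions", "ion", "ing", "ed", "s"]

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word  # no suffix matched, so the word is returned unchanged

words = ["affectation", "affects", "affections", "affected", "affection", "affecting"]
print([stem(w) for w in words])  # every word is reduced to "affect"

# The same blind cutting goes wrong elsewhere: "nation" loses "ion" and
# becomes "nat", which is not a real word.
print(stem("nation"))
```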

3. Lemmatization 

Lemmatization, on the other hand, takes into consideration the morphological analysis of the word. To do so, it needs a detailed dictionary that the algorithm can look through to link the form back to its original or root word, which is also known as the lemma.

Lemmatization groups together the different inflected forms of a word under a single lemma. It is somewhat similar to stemming, as it maps several words onto one common root, but the major difference is that the output of lemmatization is a proper word.

For example, a lemmatizer should map the words gone, going and went to go. That will not be the output of stemming.
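A minimal sketch of the dictionary lookup behind lemmatization is shown below. The tiny dictionary is hand-written for this example; a real lemmatizer relies on a full lexicon (NLTK's `WordNetLemmatizer`, for instance, is backed by WordNet).

```python
# Hypothetical, hand-written lemma dictionary covering only a few words.
LEMMAS = {
    "gone": "go",
    "going": "go",
    "went": "go",
    "goes": "go",
}

def lemmatize(word):
    """Return the dictionary lemma, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)

print([lemmatize(w) for w in ("gone", "going", "went")])
```

Because the mapping comes from a dictionary rather than from cutting characters, an irregular form such as went can still be linked to go.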

4. POS Tags

Once we have the tokens and have reduced them to their root forms, next come the POS tags. Generally speaking, the grammatical type of a word is referred to as its POS tag or part of speech, be it verb, noun, adjective, adverb, article and many more. It indicates how a word functions in meaning as well as grammatically within the sentence. A word can have more than one part of speech depending on the context in which it is used. For example, take the sentence 'Google something on the internet'. Here Google is used as a verb, although it is a proper noun.

These are some of the limitations, or problems, that occur while processing natural language. To overcome these challenges, we have named entity recognition, also known as NER.
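To make the point about context concrete, here is a toy tagger. The lexicon, the tag names and the single context rule are all assumptions invented for this sketch; practical taggers (for example the one behind NLTK's `pos_tag`) are trained on annotated text instead.

```python
# Hypothetical lexicon giving each word its most common tag.
LEXICON = {
    "google": "NOUN", "something": "PRON", "on": "ADP", "the": "DET",
    "internet": "NOUN", "is": "VERB", "a": "DET", "company": "NOUN",
}

def tag(tokens):
    """Tag tokens by lookup, then apply one context rule."""
    tags = [LEXICON.get(t.lower(), "NOUN") for t in tokens]
    # Toy heuristic: a sentence-initial noun directly followed by a pronoun
    # or determiner is read as an imperative verb.
    if len(tags) > 1 and tags[0] == "NOUN" and tags[1] in ("PRON", "DET"):
        tags[0] = "VERB"
    return list(zip(tokens, tags))

print(tag("Google something on the internet".split()))
print(tag("Google is a company".split()))
```

The same word receives a different tag in the two sentences, which is exactly why a plain dictionary lookup is not enough.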

5. Named Entity Recognition - NER

It is the process of detecting named entities, such as a person's name, a company name, quantities or a location. It has three steps:

  • Noun phrase identification
  • Phrase classification
  • Entity disambiguation 

Consider the example in the picture below: "Google CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall". As you can see, Google is identified as an organization and Sundar Pichai as a person, while New York is a location and Central Mall is also identified as an organization.

Named Entity Recognition
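The labelling in this example can be imitated with a simple lookup list (a gazetteer). This sketch only covers the classification part: it assumes the entity names are already known, and it does not attempt noun phrase identification or disambiguation. The gazetteer entries are taken from the example sentence.

```python
# Hypothetical gazetteer mapping known phrases to entity labels.
GAZETTEER = {
    ("google",): "ORGANIZATION",
    ("sundar", "pichai"): "PERSON",
    ("new", "york"): "LOCATION",
    ("central", "mall"): "ORGANIZATION",
}

def find_entities(text):
    """Scan left to right, preferring the longest matching phrase."""
    tokens = text.split()
    lowered = [t.lower() for t in tokens]
    max_len = max(len(key) for key in GAZETTEER)
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label:
                entities.append((" ".join(tokens[i:i + n]), label))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entity starts at this token
    return entities

sentence = "Google CEO Sundar Pichai introduced the new Pixel 3 at New York Central Mall"
print(find_entities(sentence))
```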

Once we have divided the sentences into tokens, done the stemming and lemmatization, and added the POS tags and named entities, it is time to group everything back together and make sense of it. For that, we have chunking.

6. Chunking 

Chunking basically means picking up individual pieces of information and grouping them together into bigger pieces, which are also known as chunks. In the context of NLP, chunking means grouping words or tokens into chunks.

Chunking

As you can see above, we have pink as an adjective, Panther as a noun and the as a determiner, and all of these are chunked together into a noun phrase. This helps in getting insights and meaningful information from the given text.

Now, you might be wondering where to run all of these programs and functions on a given text file. For that, Python has NLTK.
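A noun phrase chunker of this kind can be sketched as a pattern over POS tags: an optional determiner, any number of adjectives, then one or more nouns. The tag names (DT, JJ, NN) and the extra word in the sample input are assumptions for this example, and the input is expected to be tagged already.

```python
import re

# Noun phrase pattern: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*(<NN>)+")

def chunk_noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    # Encode the tag sequence as one string so a regular expression can scan it.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Counting "<" converts character offsets back to token positions.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

tagged = [("the", "DT"), ("pink", "JJ"), ("panther", "NN"), ("sleeps", "VBZ")]
print(chunk_noun_phrases(tagged))
```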

What is NLTK? 

NLTK is the Natural Language Toolkit, a Python library that is heavily used for natural language processing and text analysis. If you want to know the details of how to execute each part, such as tokenization, stemming and lemmatization, through NLTK, follow Blueguard and stay tuned as we delve into NLP tutorials. I hope you have enjoyed reading this post. Please be kind enough to share it, and feel free to comment with any doubts or queries.
