{"id":400,"date":"2019-02-18T22:39:32","date_gmt":"2019-02-18T22:39:32","guid":{"rendered":"https:\/\/datagradient.com\/?p=400"},"modified":"2023-02-28T12:29:13","modified_gmt":"2023-02-28T12:29:13","slug":"nlp-with-ml","status":"publish","type":"post","link":"https:\/\/datasciencediscovery.com\/index.php\/2019\/02\/18\/nlp-with-ml\/","title":{"rendered":"NLP with ML"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Text Classification<\/h2>\n\n\n\n<p><strong>Purpose:<\/strong> <br> Natural language processing (NLP) has become widely popular: with the large amount of textual data available (in emails, web pages, SMS), it is important to extract valuable information from it. An assortment of machine learning techniques has been designed to accomplish this task. With current advances in deep learning, we felt it would be interesting to compare traditional and deep learning techniques. We picked a playground Kaggle data set built around text classification and implemented both types of algorithms for comparison. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Problem<\/h3>\n\n\n\n<p>In today\u2019s world, websites have to deal with toxic and divisive content. This is especially true for major websites like Quora, which caters to heavy traffic and whose purpose is to provide a platform for people to ask and answer questions. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers. 
<\/p>\n\n\n\n<p>A question is classified as insincere if:<\/p>\n\n\n\n<ul><li>It has a non-neutral tone directed at someone<\/li><li>It is discriminatory or contains abusive language<\/li><li>It contains false information<\/li><\/ul>\n\n\n\n<p>For more information regarding the challenge, see the following <a href=\"https:\/\/www.kaggle.com\/c\/quora-insincere-questions-classification\" target=\"_blank\" rel=\"noopener\">link<\/a>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Code<\/h4>\n\n\n\n<p>The full code is available <a href=\"https:\/\/github.com\/datasciencediscovery\/NLP_Text_Classification\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Methodology<\/h3>\n\n\n\n<p>In this article we will tackle text classification using machine learning and NLP techniques. For any data science problem with textual data, the common steps include:<\/p>\n\n\n\n<ul><li>Data exploration<\/li><li>Text pre-processing<\/li><li>Feature engineering<ul><li>Text sentiment<\/li><li>Topic modelling<\/li><li>TF-IDF and Count Vectorizer<\/li><li>Text descriptive features<\/li><\/ul><\/li><li>Model selection and evaluation<\/li><\/ul>\n\n\n\n<p>Let\u2019s explore them step by step in more detail.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data Exploration<\/h3>\n\n\n\n<p>This is one of the most important steps of any project: you need to familiarize yourself with the data before implementing any modeling technique.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import os\nprint(os.listdir(\"..\/input\"))<\/pre>\n\n\n\n<p>Our dataset includes:<\/p>\n\n\n\n<ul><li>train.csv &#8211; the training set<\/li><li>test.csv &#8211; the test set<\/li><li>sample_submission.csv &#8211; a sample submission in the correct format<\/li><li>embeddings &#8211; a folder containing word embeddings<\/li><\/ul>\n\n\n\n<p>We are not allowed to use any external data sources. The following embeddings are given to us and can be used for building our models. <\/p>\n\n\n\n<ul><li><a href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\" target=\"_blank\" rel=\"noopener\">GoogleNews-vectors-negative300<\/a><\/li><li><a href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\" target=\"_blank\" rel=\"noopener\">glove.840B.300d<\/a><\/li><li><a href=\"https:\/\/cogcomp.org\/page\/resource_view\/106\" target=\"_blank\" rel=\"noopener\">paragram_300_sl999<\/a><\/li><li><a href=\"https:\/\/fasttext.cc\/docs\/en\/english-vectors.html\" target=\"_blank\" rel=\"noopener\">wiki-news-300d-1M<\/a><\/li><\/ul>\n\n\n\n<p>A look at the size of our train and test data:<\/p>\n\n\n\n<ul><li>Shape of train: 1,306,122 rows with 3 columns<\/li><li>Shape of test: 56,370 rows with 2 columns<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_data_snapshot.jpg?resize=512%2C180&#038;ssl=1\" alt=\"\" class=\"wp-image-402\" width=\"512\" height=\"180\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_data_snapshot.jpg?w=541&amp;ssl=1 541w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_data_snapshot.jpg?resize=300%2C105&amp;ssl=1 300w\" sizes=\"(max-width: 512px) 100vw, 512px\" title=\"\" data-recalc-dims=\"1\"><figcaption>What does the data look like?<\/figcaption><\/figure>\n\n\n\n<p>In the target variable, <strong>1<\/strong> represents the <strong>Insincere<\/strong> class and <strong>0<\/strong> the <strong>Sincere<\/strong> class of questions. 
<\/p>\n\n\n\n<p>Let\u2019s explore the distribution of the target variable:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nimport seaborn as sns\ncolor = sns.color_palette()\n\n%matplotlib inline\n\nimport plotly.offline as py\npy.init_notebook_mode(connected=True)\nimport plotly.graph_objs as go\n\n## target distribution (train_df is the training set loaded from train.csv) ##\ncnt_srs = train_df['target'].value_counts()\nlabels = (np.array(cnt_srs.index))\nsizes = (np.array((cnt_srs \/ cnt_srs.sum())*100))\n\ntrace = go.Pie(labels=labels, values=sizes)\nlayout = go.Layout(\n    title='Target distribution',\n    font=dict(size=18),\n    width=600,\n    height=600,\n)\ndata = [trace]\nfig = go.Figure(data=data, layout=layout)\npy.iplot(fig, filename=\"usertype\")\n<\/pre>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Target.png?resize=270%2C270&#038;ssl=1\" alt=\"\" class=\"wp-image-403\" width=\"270\" height=\"270\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Target.png?w=600&amp;ssl=1 600w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Target.png?resize=150%2C150&amp;ssl=1 150w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Target.png?resize=300%2C300&amp;ssl=1 300w\" sizes=\"(max-width: 270px) 100vw, 270px\" title=\"\" data-recalc-dims=\"1\"><figcaption>1: Insincere &amp; 0: Sincere<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Box Plots:<\/strong><\/h4>\n\n\n\n<p>The box plots shared below help us check whether there are any patterns in the dataset regarding the word count or the number of 
characters.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Insincere questions have more words per question<\/p><\/blockquote>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"884\" height=\"170\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Word_Count.png?resize=884%2C170&#038;ssl=1\" alt=\"\" class=\"wp-image-404\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Word_Count.png?w=884&amp;ssl=1 884w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Word_Count.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Word_Count.png?resize=768%2C148&amp;ssl=1 768w\" sizes=\"(max-width: 884px) 100vw, 884px\" title=\"\" data-recalc-dims=\"1\"><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"882\" height=\"170\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Unique_Word_Count.png?resize=882%2C170&#038;ssl=1\" alt=\"\" class=\"wp-image-405\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Unique_Word_Count.png?w=882&amp;ssl=1 882w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Unique_Word_Count.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Unique_Word_Count.png?resize=768%2C148&amp;ssl=1 768w\" sizes=\"(max-width: 882px) 100vw, 882px\" title=\"\" data-recalc-dims=\"1\"><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Insincere questions have more characters than sincere questions<\/p><\/blockquote>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"882\" height=\"170\" 
src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Char_Count.png?resize=882%2C170&#038;ssl=1\" alt=\"\" class=\"wp-image-406\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Char_Count.png?w=882&amp;ssl=1 882w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Char_Count.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Char_Count.png?resize=768%2C148&amp;ssl=1 768w\" sizes=\"(max-width: 882px) 100vw, 882px\" title=\"\" data-recalc-dims=\"1\"><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Sincere questions have less punctuation<\/p><\/blockquote>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"882\" height=\"170\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Punctuation_Count.png?resize=882%2C170&#038;ssl=1\" alt=\"\" class=\"wp-image-407\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Punctuation_Count.png?w=882&amp;ssl=1 882w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Punctuation_Count.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Punctuation_Count.png?resize=768%2C148&amp;ssl=1 768w\" sizes=\"(max-width: 882px) 100vw, 882px\" title=\"\" data-recalc-dims=\"1\"><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>Sincere questions have more upper-case words<\/p><\/blockquote>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"882\" height=\"170\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_UpperCase_Count.png?resize=882%2C170&#038;ssl=1\" alt=\"\" class=\"wp-image-408\" 
srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_UpperCase_Count.png?w=882&amp;ssl=1 882w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_UpperCase_Count.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_UpperCase_Count.png?resize=768%2C148&amp;ssl=1 768w\" sizes=\"(max-width: 882px) 100vw, 882px\" title=\"\" data-recalc-dims=\"1\"><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Word Clouds:<\/strong><\/h4>\n\n\n\n<p>For questions classified as sincere, we see general words like \u201cwill\u201d, \u201cone\u201d and so on. The word \u201cwill\u201d is also prevalent in insincere questions, so during the data processing steps we will have to treat such common words. Another point brought out by the word clouds is how words like \u201cTrump\u201d, \u201cliberal\u201d are very specific to insincere questions, possibly because the person is making a statement about these topics rather than genuinely seeking an answer.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><em>Sincere<\/em><\/h4>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Sincere_Cloud.png?resize=374%2C194&#038;ssl=1\" alt=\"\" class=\"wp-image-409\" width=\"374\" height=\"194\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Sincere_Cloud.png?w=602&amp;ssl=1 602w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Sincere_Cloud.png?resize=300%2C156&amp;ssl=1 300w\" sizes=\"(max-width: 374px) 100vw, 374px\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><em>Insincere<\/em><\/h4>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Insincere_Cloud.png?resize=369%2C191&#038;ssl=1\" alt=\"\" class=\"wp-image-410\" width=\"369\" height=\"191\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Insincere_Cloud.png?w=602&amp;ssl=1 602w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Insincere_Cloud.png?resize=300%2C156&amp;ssl=1 300w\" sizes=\"(max-width: 369px) 100vw, 369px\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Text pre-processing<\/h3>\n\n\n\n<p>Unstructured text data is usually dirty: it contains misspelled words, inconsistent casing and various other issues. We need to clean the text and bring it to a standardized form before extracting information from it; without this step the remaining noise will result in a poor model. <\/p>\n\n\n\n<p>Broadly, consider the following steps:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Tokenization:<\/strong><\/h4>\n\n\n\n<p>Tokenization refers to the splitting of strings of text into smaller chunks or tokens. Paragraphs or large bodies of text are tokenized into sentences and then sentences are broken down into individual words.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Normalization:<\/strong><\/h4>\n\n\n\n<p>This refers to a series of steps that transforms the corpus of text into a single standard and consistent form. The following steps are a part of this process:<\/p>\n\n\n\n<ul><li>Converting all letters to lowercase<\/li><li>Removing punctuation marks, numbers, stop words (a, is, will etc.)<\/li><\/ul>\n\n\n\n<p>Stemming involves chopping off the end of a word or its inflectional endings (-ing, -ed etc.) to get its root form or stem, using crude heuristic rules. 
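As a quick illustration, both behaviors can be sketched with NLTK (the lemmatizer additionally needs the WordNet corpus, fetched via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

# Stemming: crude suffix chopping; fast, but can produce non-words.
print(stemmer.stem("burning"))       # burn
print(stemmer.stem("difficulties"))  # difficulti

# Lemmatization: vocabulary-based, returns a real dictionary word.
lemmatizer = WordNetLemmatizer()
try:
    print(lemmatizer.lemmatize("difficulties"))  # difficulty
    print(lemmatizer.lemmatize("ran", pos="v"))  # run
except LookupError:
    print("Run nltk.download('wordnet') first")
```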
<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>burning -&gt; burn.<\/p><\/blockquote>\n\n\n\n<p>Stemming works well most of the time, but can return words that do not look correct intuitively. <\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>difficulties -&gt; difficulti<\/p><\/blockquote>\n\n\n\n<p>Lemmatization has the same goal as stemming. However, it uses a vocabulary and the morphological analysis of words to remove inflectional endings and return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization aims to reduce the word properly so that it makes sense according to the language. <\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p>ran -&gt; run, difficulties -&gt; difficulty<\/p><\/blockquote>\n\n\n\n<p>The idea behind stemming or lemmatization is to reduce words to a common form; for example, difficulties and difficulty convey the same intent and context. <\/p>\n\n\n\n<p>For our use case, we performed the following operations to clean the data (using the NLTK library):<\/p>\n\n\n\n<ul><li>Convert to lower-case. 
<\/li><li>Remove punctuation and numbers.<\/li><li>Remove stop words: the NLTK corpus contains 179 stop words such as &#8220;for&#8221;, &#8220;having&#8221;, &#8220;yours&#8221; and so on.<\/li><li>Lemmatize words.<\/li><\/ul>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import nltk\nfrom nltk.corpus import stopwords\nfrom nltk.stem import WordNetLemmatizer\n\n# Lower case\nall_data['question_text'] = all_data['question_text'].apply(lambda x: \" \".join(x.lower() for x in x.split()))\n# Remove punctuation\nall_data['question_text'] = all_data['question_text'].str.replace('[^\\w\\s]', '', regex=True)\n# Remove numbers\nall_data['question_text'] = all_data['question_text'].str.replace('[0-9]', '', regex=True)\n# Remove stop words and words with length &lt;=2\nstop = set(stopwords.words('english'))\nall_data['question_text'] = all_data['question_text'].apply(lambda x: \" \".join(x for x in x.split() if x not in stop and len(x)>2))\n# Lemmatize (treating words as verbs)\nwl = WordNetLemmatizer()\nall_data['question_text'] = all_data['question_text'].apply(lambda x: \" \".join(wl.lemmatize(x,'v') for x in x.split()))\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Feature Engineering<\/h3>\n\n\n\n<p>This part is what makes the difference between a good and a bad solution in any ML project. So what features can we create for our use case? We can start with understanding sentiment.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Text sentiment:<\/strong><\/h4>\n\n\n\n<p>Sentiment analysis is a part of opinion mining and involves building a system to extract the opinion from a text. 
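As a toy illustration only (a hypothetical mini-lexicon, not the library we actually used to build the sentiment features), a lexicon-based polarity score can be sketched as:

```python
# Hypothetical mini-lexicon for illustration; real sentiment libraries ship far larger ones.
POSITIVE = {"good", "great", "best", "helpful", "love"}
NEGATIVE = {"hate", "worst", "stupid", "toxic", "kill"}

def polarity(text):
    """Return a score in [-1, 1]: +1 all-positive, -1 all-negative, 0 neutral."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(polarity("why do people hate the worst ideas"))  # -1.0
print(polarity("what is the best way to learn"))       # 1.0
```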
That is, we wish to get a score for how positive or negative the text is. <br> Our assumption for this data set was that questions flagged as Insincere may contain toxic content and would therefore exhibit negative sentiment. However, sentiment turned out to be a weak modeling feature: on deeper evaluation, we noticed several questions with high polarity scores under both the insincere and sincere tags. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Topic modelling:<\/strong><\/h3>\n\n\n\n<p>Topic modelling is an approach for identifying topics present across a corpus of text. A topic is defined as a repeating pattern of co-occurring terms in a corpus. A document contains multiple topics in varying proportions. So, for example, a document based on healthcare is more likely to contain a higher ratio of words like \u201cdoctor\u201d and \u201csurgery\u201d than words such as \u201cbrakes\u201d and \u201cgear\u201d, which indicate a theme of automobiles.<\/p>\n\n\n\n<p>Using a technique like <strong>Latent Dirichlet Allocation<\/strong> to get the distribution of topics across the corpus would potentially help to get a sense of the themes discussed in the set of questions. 
Further, we hypothesize that there would be some difference between the topics of sincere and insincere questions.<\/p>\n\n\n\n<p>The images below show the distribution of topics for each class, averaged across questions.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"771\" height=\"393\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Sincere.jpg?resize=771%2C393&#038;ssl=1\" alt=\"\" class=\"wp-image-412\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Sincere.jpg?w=771&amp;ssl=1 771w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Sincere.jpg?resize=300%2C153&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Sincere.jpg?resize=768%2C391&amp;ssl=1 768w\" sizes=\"(max-width: 771px) 100vw, 771px\" title=\"\" data-recalc-dims=\"1\"><figcaption>Topic Distribution Sincere<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"803\" height=\"389\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Insincere.jpg?resize=803%2C389&#038;ssl=1\" alt=\"\" class=\"wp-image-413\" srcset=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Insincere.jpg?w=803&amp;ssl=1 803w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Insincere.jpg?resize=300%2C145&amp;ssl=1 300w, https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_Topic_Model_Insincere.jpg?resize=768%2C372&amp;ssl=1 768w\" sizes=\"(max-width: 803px) 100vw, 803px\" title=\"\" data-recalc-dims=\"1\"><figcaption>Insincere Topic Distribution<\/figcaption><\/figure>\n\n\n\n<p>Looking up the top words from the top topics 
from class <em>insincere<\/em>:<\/p>\n\n\n\n<ul><li><strong>Topic 45<\/strong>: trump, part, president, donald, drink, similar, sport, websites, suffer, insurance, abroad, court, respect, would, wall.<\/li><li><strong>Topic 59<\/strong>: quora, question, ask, answer, wear, control, actually, treat, people, hear, worst, western, racist, many, opportunities.<\/li><li><strong>Topic 62<\/strong>: sex, hate, act, culture, pakistan, add, society, doctor, bring, present, people, search, pressure, characteristics, enjoy.<\/li><li><strong>Topic 77<\/strong>: want, don&#8217;t, tell, guy, try, like, know, doesn&#8217;t, kill, people, say, let, brain, get, would.<\/li><li><strong>Topic 79<\/strong>: women, men, white, black, water, watch, share, video, others, character, youtube, save, problem, prevent, people.<\/li><\/ul>\n\n\n\n<p>Looking up the top words from the top topics from class <em>sincere<\/em>:<\/p>\n\n\n\n<ul><li><strong>Topic 0<\/strong>: use, like, best, possible, make, come, cause, good, become, would, singer, get, know, happen, etc.<\/li><li><strong>Topic 1<\/strong>: make, use, like, best, cause, good, happen, many, would, find, better, nutritional, jar, work, venus.<\/li><li><strong>Topic 34<\/strong>: job, engineer, company, chinese, get, work, project, interview, graduate, best, include, india, example, good, accord.<\/li><li><strong>Topic 56<\/strong>: someone, feel, love, man, process, like, post, view, would, care, else, give, advice, step, night.<\/li><li><strong>Topic 77<\/strong>: want, dont, tell, guy, try, like, know, doesnt, kill, people, say, let, brain, get, would.<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Count Vectorizer\/tf-idf:<\/strong><\/h3>\n\n\n\n<p>CountVectorizer returns a matrix showing the frequency of each term in the vocabulary per document. 
On the other hand, tf-idf (term frequency-inverse document frequency) evaluates how important a word is to a document in the corpus.<\/p>\n\n\n\n<p><strong>tf(x) =<\/strong> <em>(Number of times term x appears in a document) \/ (Total number of terms in the document)<\/em><\/p>\n\n\n\n<p><strong>idf(x) =<\/strong> <em>log(Total number of documents \/ Number of documents with term x in it)<\/em><\/p>\n\n\n\n<p><strong>tf-idf(x) =<\/strong> <em>tf(x) * idf(x)<\/em><\/p>\n\n\n\n<p>Clearly, the importance of a word in a document increases proportionally with the number of times it appears there, but it is offset by the number of times it occurs across the corpus.<\/p>\n\n\n\n<p>Both tf-idf and CountVectorizer features may indicate the relevance of a certain set of words to questions labelled as \u201cSincere\u201d as well as \u201cInsincere\u201d. <\/p>\n\n\n\n<p>The image below was obtained by using a TF-IDF vectorizer to create features for a k-fold CV logistic regression model; it shows the words with the most weight for insincere questions.<\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/08\/Quora_TFIDF.jpg?resize=107%2C403&#038;ssl=1\" alt=\"\" class=\"wp-image-414\" width=\"107\" height=\"403\" title=\"\" data-recalc-dims=\"1\"><figcaption><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Text Descriptive features<\/h3>\n\n\n\n<p>The idea behind building features such as the number of unique words, characters or exclamation points is to check for uniformity in the data set: we wish to observe whether there are similarities between the train and test sets. Some questions that these meta features help answer include:<\/p>\n\n\n\n<ul><li>Does our test set consist of much shorter questions than the train set? 
<\/li><li>An insincere question might be haphazardly framed, with disregard for the correct use of punctuation, and might contain an abnormally high punctuation count.<\/li><li>A user writing a toxic or insincere question may use uppercase letters very liberally.<\/li><\/ul>\n\n\n\n<p>The examples mentioned above suggest that there might be certain patterns specific to the respective classes that can be leveraged in our model. To give an ad hoc example of how useful meta features can be: on a musical note, the number of words per minute for Eminem differs based on the content and emotion of the song.<\/p>\n\n\n\n<p>Some of the meta features are listed below:<\/p>\n\n\n\n<ul><li>Number of words, unique words, characters<\/li><\/ul>\n\n\n\n<p>The box plots in the data exploration section give an idea of how these features are distributed across the different classes. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Model<\/h3>\n\n\n\n<p>So far we have cleaned up our text and carried out feature engineering. There are several ways to select the relevant features; however, for the purpose of this article we decided to build a separate model for each set of features, as this helps develop a general understanding and makes it easier to apply these tactics to other text classification datasets. <\/p>\n\n\n\n<p>A few things to note:<\/p>\n\n\n\n<ul><li>We are using the F1 score as our performance metric, as required by the competition rules. It also gives us a better picture than accuracy, keeping in mind the imbalance in the data.<\/li><li>For each model we use 5-fold cross validation.<\/li><li>To find a suitable threshold (to convert the probabilities to a binary label), we loop over multiple potential thresholds and choose the one that maximizes the F1 score. 
The F1 score is calculated on the validation data set.<\/li><\/ul>\n\n\n\n<p>There are two pieces of code that will be reused in most of the models:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import time\nimport numpy as np\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import KFold\nfrom tqdm import tqdm\n\n## train and test hold the engineered features; selected_features lists the feature columns\nstart_time = time.time()\nkf = KFold(n_splits=5, shuffle=True, random_state=43)\n## Initialize predictions with 0's\ntest_pred_ots = 0\noof_pred_ots = np.zeros([train.shape[0],])\n\ntrain_target = train['target'].values\n\nx_test = test[selected_features].values\n\n\n## Loop over the CV splits\nfor i, (train_index, val_index) in tqdm(enumerate(kf.split(train))):\n    x_train = train.loc[train_index][selected_features].values\n    x_val = train.loc[val_index][selected_features].values\n    y_train, y_val = train_target[train_index], train_target[val_index]\n    \n    # Model\n    classifier = LogisticRegression(C=0.1)\n    classifier.fit(x_train, y_train)\n    \n    ## Predictions on the validation set\n    val_preds = classifier.predict_proba(x_val)[:,1]\n    \n    ## Predictions on the test set (each fold contributes 0.2 of the average)\n    preds = classifier.predict_proba(x_test)[:,1]\n    test_pred_ots += 0.2*preds\n    oof_pred_ots[val_index] = val_preds\nprint(\"--- %s seconds for Model Selected Features ---\" % (time.time() - start_time))\n<\/pre>\n\n\n\n<p>The code above runs 5-fold cross validation; with each split we train the model and make predictions on the validation and test sets. At the end of all splits we get oof_pred_ots, the out-of-fold predictions on the validation sets combined into a single array. 
We also get the average prediction probabilities of each split in test_pred_ots.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nfrom sklearn.metrics import f1_score\n\nthresh_opt_ots = 0.5\nf1_opt = 0\nfor thresh in np.arange(0.1, 0.91, 0.01):\n    thresh = np.round(thresh, 2)\n    f1 = f1_score(train_target, (oof_pred_ots.astype(float) > thresh).astype(int))\n    #print(\"F1 score at threshold {0} is {1}\".format(thresh, f1))\n    if f1_opt &lt; f1:\n        f1_opt = f1\n        thresh_opt_ots = thresh\nprint(thresh_opt_ots)\npred_train_ots = (oof_pred_ots > thresh_opt_ots).astype(int)\nf1_score(train_target, pred_train_ots)\n<\/pre>\n\n\n\n<p>The code above finds the threshold that maximizes the F1 score on the out-of-fold predictions.<\/p>\n\n\n\n<p><em>First Model:<\/em><br>\nWe used the text descriptive features and ran a 5-fold cross validation logistic regression model; however, the F1 score is quite low (0.27).<\/p>\n\n\n\n<p><em>Second Model:<\/em><br>\nWe used the sentiment and topic modeling features and ran the same model as before. This time we got a better score (0.34).<\/p>\n\n\n\n<p><em>Third Model:<\/em><br>\nWe used TF-IDF features and tried logistic regression (F1 &#8211; 0.587) and LightGBM (F1 &#8211; 0.591). This is much better.<\/p>\n\n\n\n<p><em>Fourth Model:<\/em><br>\nWe used CountVectorizer features and tried logistic regression (F1 &#8211; 0.592) as well as multinomial (F1 &#8211; 0.55) and Bernoulli (F1 &#8211; 0.53) naive Bayes models. <\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><em><strong>Ensemble:<\/strong><\/em><\/h4>\n\n\n\n<p>The idea here is that one model might be observing patterns that another isn\u2019t. Further, an ensemble helps get better results and at the same time reduces the chance of overfitting. We used stacking, which means that we make predictions on the entire train set. 
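A hedged sketch of this stacking scheme on synthetic data (the base models and sizes here are illustrative, not our exact configuration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
base_models = [LogisticRegression(max_iter=1000), GaussianNB()]

# Out-of-fold predictions: one column per base model, one row per train row.
oof = np.zeros((len(y), len(base_models)))
kf = KFold(n_splits=5, shuffle=True, random_state=43)
for train_idx, val_idx in kf.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict_proba(X[val_idx])[:, 1]

# Meta-model: logistic regression stacked on the base models' probabilities.
meta = LogisticRegression().fit(oof, y)
final_probs = meta.predict_proba(oof)[:, 1]
```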
This is accomplished by splitting the data at each fold into a train and a holdout set and making predictions on the holdout set, such that there is a prediction for each row in the train data. <br> We then use these new predictions from the respective models as input variables and run another (logistic regression) model on top, giving us the final probabilities. <br><strong>Our final F1 score was 0.604, and 0.589 on the leader-board.<\/strong> <\/p>\n\n\n\n<h4 class=\"wp-block-heading\">What\u2019s Next<\/h4>\n\n\n\n<p>In the <a href=\"https:\/\/datasciencediscovery.com\/index.php\/2019\/03\/11\/nlp-with-dl\/\">next article<\/a> we will implement a deep learning approach to the same use case and draw comparisons between the two methodologies.<\/p>\n\n\n\n<p>References:<\/p>\n\n\n\n<ul><li><a href=\"https:\/\/www.kdnuggets.com\/2017\/12\/general-approach-preprocessing-text-data.html\" target=\"_blank\" rel=\"noopener\">Text processing<\/a><\/li><li><a href=\"https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/stemming-and-lemmatization-1.html\" target=\"_blank\" rel=\"noopener\">Stemming and lemmatization by Stanford<\/a><\/li><li><a href=\"http:\/\/www.tfidf.com\/\" target=\"_blank\" rel=\"noopener\">TF-IDF<\/a><\/li><li><a href=\"https:\/\/www.nltk.org\" target=\"_blank\" rel=\"noopener\">NLTK<\/a><\/li><li>Other Kaggle kernels were a tremendous help as well<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">About Us<\/h2>\n\n\n\n<p>Data Science Discovery is a step on the path of your data science journey. 
Please follow us on <a href=\"https:\/\/www.linkedin.com\/company\/data-science-discovery\/\" target=\"_blank\" rel=\"noopener\">LinkedIn<\/a> to stay updated.<\/p>\n\n\n\n<p>About the writers:<\/p>\n\n\n\n<ul><li><a href=\"http:\/\/linkedin.com\/in\/ujjayant-sinha-0852b06b\" target=\"_blank\" rel=\"noopener\">Ujjayant Sinha<\/a>: Data science enthusiast with an interest in natural language problems.<\/li><li><a href=\"http:\/\/linkedin.com\/in\/gadiankit\/\" target=\"_blank\" rel=\"noopener\">Ankit Gadi<\/a>: A knack and passion for data science, coupled with a strong foundation in Operations Research and Statistics, helped me embark on my data science journey.<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Text Classification Purpose: Natural language processing (NLP) has become widely popular: with the large amount of data available (in emails, web pages, SMS), it is important to extract valuable information from textual data. An assortment of machine learning techniques has been designed to accomplish this task. 
With current advances in deep learning, we felt it would be [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":416,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[22,24,23],"tags":[37,25,35,32,26,27,28,38,31,34,30,29,33,36,39],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/datasciencediscovery.com\/wp-content\/uploads\/2019\/02\/abc-accomplished-alphabet-48898.jpg?fit=3600%2C2400&ssl=1","jetpack_publicize_connections":[],"_links":{"self":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/400"}],"collection":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/comments?post=400"}],"version-history":[{"count":8,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/400\/revisions"}],"predecessor-version":[{"id":1063,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/posts\/400\/revisions\/1063"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/media\/416"}],"wp:attachment":[{"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/media?parent=400"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/categories?post=400"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datasciencediscovery.com\/index.php\/wp-json\/wp\/v2\/tags?post=400"}],"curies":[{"name":"wp","href":"htt
ps:\/\/api.w.org\/{rel}","templated":true}]}}