{"id":377,"date":"2022-10-24T14:10:23","date_gmt":"2022-10-24T11:10:23","guid":{"rendered":"https:\/\/quecst.qcri.org\/blog\/?p=377"},"modified":"2022-10-24T14:10:23","modified_gmt":"2022-10-24T11:10:23","slug":"log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories","status":"publish","type":"post","link":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/","title":{"rendered":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories"},"content":{"rendered":"<p>Gaining insights from text-based data can be a daunting task, even when the data is labeled with ground truth categories and ready for use in machine learning tasks.<br \/>Researchers often rely on simple methods, such as the frequency of words in each category, to understand the collection\u2019s characteristics. However, this approach is not always insightful: term frequencies alone ignore how a term is used in the other categories, so the most frequent terms are often shared across categories and do little to distinguish between them.<\/p>\n<figure id=\"attachment_383\" aria-describedby=\"caption-attachment-383\" style=\"width: 800px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-383\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/blog-img-what-customer-insights.jpeg\" alt=\"\" width=\"800\" height=\"300\" \/><figcaption id=\"caption-attachment-383\" class=\"wp-caption-text\">The issue of getting meaningful insights from textual and categorical data (<a href=\"https:\/\/powerdigitalmarketing.com\/blog\/what-are-customer-insights-in-marketing-2\/#gref\">source of image<\/a>)<\/figcaption><\/figure>\n<p>Therefore, in this tutorial, we will show you how to use the Log Odds Ratio, an alternative to term frequency, TFIDF, and other term ranking techniques, to obtain insights about the terms that represent 
categories in textual data [1].<\/p>\n<p><span style=\"text-decoration: underline;\"><strong>What is Log Odds Ratio?<\/strong><\/span><\/p>\n<p style=\"font-weight: 400;\">Let\u2019s start with what is meant by odds:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-417\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-3.59.22-PM.png\" alt=\"\" width=\"920\" height=\"89\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-3.59.22-PM.png 920w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-3.59.22-PM-300x29.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-3.59.22-PM-768x74.png 768w\" sizes=\"(max-width: 920px) 100vw, 920px\" \/><\/p>\n<p>Keep in mind that the odds can be a very small number.<br \/>So now, the odds ratio is the ratio of the odds of a term being used in one category vs. another.<br \/>It can be computed using the following equation:<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-420\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.04.53-PM.png\" alt=\"\" width=\"920\" height=\"89\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.04.53-PM.png 920w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.04.53-PM-300x29.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.04.53-PM-768x74.png 768w\" sizes=\"(max-width: 920px) 100vw, 920px\" \/><\/p>\n<p>The Log-Odds-Ratio is often used, given these quantities are often very small. 
For our purposes, the fact that it is approximately normally distributed is also important.<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-422\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.19.33-PM.png\" alt=\"\" width=\"920\" height=\"89\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.19.33-PM.png 920w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.19.33-PM-300x29.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-20-at-4.19.33-PM-768x74.png 768w\" sizes=\"(max-width: 920px) 100vw, 920px\" \/><\/p>\n<p>If term i does not appear in category b (i.e., \ud835\udc66\ud835\udc4f\ud835\udc56 = 0), the odds of term i in category b will be 0, which leaves the odds ratio undefined. We circumvent this issue by adding a pseudo-count to each term, assuming it occurs at least \ud835\udefc times.<br \/>This results in the following equation for the smoothed Log-Odds-Ratio:<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-424\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/imgonline-com-ua-twotoone-VYHu41pE8hz.jpg\" alt=\"\" width=\"1083\" height=\"88\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/imgonline-com-ua-twotoone-VYHu41pE8hz.jpg 1083w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/imgonline-com-ua-twotoone-VYHu41pE8hz-300x24.jpg 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/imgonline-com-ua-twotoone-VYHu41pE8hz-1024x83.jpg 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/imgonline-com-ua-twotoone-VYHu41pE8hz-768x62.jpg 768w\" sizes=\"(max-width: 1083px) 100vw, 1083px\" \/><br \/>The smoothed Log-Odds-Ratio ranks terms slightly better than TFIDF, yet it still favors very low-frequency terms that are unique to a category.<br \/>Note that when terms are exclusive to one category or another, they can cluster at extremely high or low scores compared to other terms because they occur infrequently. Intuitively, we&#8217;d like to score terms highly only if they have a lot of evidence behind them AND they tend to be used much more in one category than in another. Still, in cases where the dataset has many terms that appear in both categories (i.e., a lot of intersecting terms, even after cleaning), the Log Odds Ratio can be a suitable method of ranking terms to gain insights into the type of content in each category.<\/p>\n<p>For this tutorial, we will use a random sample of a dataset that was used in [2] to build a machine learning model for detecting toxicity triggers (i.e., causes of toxicity in online conversation threads). Furthermore, we will use scattertext [3], which is a powerful text visualization library that generates interactive scatter plots of corpora and renders them via HTML. 
For preprocessing the dataset, we will use spacy [4], which is a powerful open source NLP library that helps with different types of NLP-based text processing tasks.<\/p>\n<p>The dataset is available for download from <a href=\"https:\/\/www.dropbox.com\/s\/8wzmo1bieh0ecti\/toxicity_triggers_sample.csv\">this link<\/a>.<\/p>\n<p>After downloading the dataset, make sure that you install the following packages:<\/p>\n<pre style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 1: download missing libraries into Google colab with the pip command<\/span>\r\n!pip install <span style='color:#44aadd; '>-<\/span><span style='color:#44aadd; '>-<\/span>quiet scattertext\r\n!pip install <span style='color:#44aadd; '>-<\/span><span style='color:#44aadd; '>-<\/span>quiet spacy\r\n!python3 <span style='color:#44aadd; '>-<\/span>m spacy download en_core_web_sm\r\n<\/pre>\n<p><!--Created using ToHtml.com on 2022-10-24 11:01:14 UTC --><\/p>\n<p>Next, import the necessary libraries and download the required components for performing NLP tasks, as follows:<\/p>\n<pre style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 2: import the required libraries or load the needed modules<\/span>\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> re\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> pandas <span style='color:#800000; font-weight:bold; '>as<\/span> pd\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> numpy <span style='color:#800000; font-weight:bold; '>as<\/span> np\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> scattertext <span style='color:#800000; font-weight:bold; '>as<\/span> st\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> spacy\r\n<span style='color:#800000; font-weight:bold; '>from<\/span> IPython<span style='color:#808030; '>.<\/span>display <span style='color:#800000; font-weight:bold; '>import<\/span> IFrame\r\n<span 
style='color:#800000; font-weight:bold; '>from<\/span> IPython<span style='color:#808030; '>.<\/span>display <span style='color:#800000; font-weight:bold; '>import<\/span> display<span style='color:#808030; '>,<\/span> HTML\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> html\r\ndisplay<span style='color:#808030; '>(<\/span>HTML<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"&lt;style>.container { width:98% !important; }&lt;\/style>\"<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span> <span style='color:#696969; '>#For visualization<\/span>\r\nnlp <span style='color:#808030; '>=<\/span> spacy<span style='color:#808030; '>.<\/span>load<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"en_core_web_sm\"<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>from<\/span> bs4 <span style='color:#800000; font-weight:bold; '>import<\/span> BeautifulSoup\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> nltk\r\n<span style='color:#800000; font-weight:bold; '>from<\/span> nltk<span style='color:#808030; '>.<\/span>corpus <span style='color:#800000; font-weight:bold; '>import<\/span> stopwords\r\nnltk<span style='color:#808030; '>.<\/span>download<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'stopwords'<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>import<\/span> warnings\r\nwarnings<span style='color:#808030; '>.<\/span>filterwarnings<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'ignore'<\/span><span style='color:#808030; '>)<\/span>\r\n<\/pre>\n<p><!--Created using ToHtml.com on 2022-10-24 11:02:51 UTC --><\/p>\n<p>Then, define a function for cleaning the raw text from the dataset and two helpers that map the numerical category columns back to category labels, as follows:<\/p>\n<pre 
style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 3: function to convert the comments to words<\/span>\r\n<span style='color:#800000; font-weight:bold; '>def<\/span> comment_to_words<span style='color:#808030; '>(<\/span> raw_comment <span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n    <span style='color:#696969; '># 1. Remove HTML<\/span>\r\n    comment_text <span style='color:#808030; '>=<\/span> BeautifulSoup<span style='color:#808030; '>(<\/span>raw_comment<span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'lxml'<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>get_text<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span> \r\n    <span style='color:#696969; '># 2. Remove non-letters with regex<\/span>\r\n    letters_only <span style='color:#808030; '>=<\/span> re<span style='color:#808030; '>.<\/span>sub<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"[^a-zA-Z]\"<\/span><span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>\" \"<\/span><span style='color:#808030; '>,<\/span> comment_text<span style='color:#808030; '>)<\/span>   \r\n    <span style='color:#696969; '># 3. Convert to lower case, split into individual words<\/span>\r\n    words <span style='color:#808030; '>=<\/span> letters_only<span style='color:#808030; '>.<\/span>lower<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>split<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>                              \r\n    <span style='color:#696969; '># 4. 
Create set of stopwords<\/span>\r\n    stops <span style='color:#808030; '>=<\/span> <span style='color:#400000; '>set<\/span><span style='color:#808030; '>(<\/span>stopwords<span style='color:#808030; '>.<\/span>words<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"english\"<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>                    \r\n    <span style='color:#696969; '># 5. Remove stop words<\/span>\r\n    meaningful_words <span style='color:#808030; '>=<\/span> <span style='color:#808030; '>[<\/span>w <span style='color:#800000; font-weight:bold; '>for<\/span> w <span style='color:#800000; font-weight:bold; '>in<\/span> words <span style='color:#800000; font-weight:bold; '>if<\/span> <span style='color:#800000; font-weight:bold; '>not<\/span> w <span style='color:#800000; font-weight:bold; '>in<\/span> stops<span style='color:#808030; '>]<\/span>      \r\n    <span style='color:#696969; '># 6. Join the words back into one string separated by space, <\/span>\r\n    <span style='color:#696969; '># and return the result.<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>return<\/span><span style='color:#808030; '>(<\/span> <span style='color:#0000e6; '>\" \"<\/span><span style='color:#808030; '>.<\/span>join<span style='color:#808030; '>(<\/span> meaningful_words <span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#A function to change the category from numerical to text<\/span>\r\n<span style='color:#800000; font-weight:bold; '>def<\/span> get_category<span style='color:#808030; '>(<\/span>row<span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>for<\/span> c <span style='color:#800000; font-weight:bold; '>in<\/span> trigger_categories<span style='color:#808030; '>:<\/span>\r\n        <span style='color:#800000; font-weight:bold; '>if<\/span> 
row<span style='color:#808030; '>[<\/span>c<span style='color:#808030; '>]<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#008c00; '>1<\/span><span style='color:#808030; '>:<\/span>\r\n            <span style='color:#800000; font-weight:bold; '>return<\/span> <span style='color:#0000e6; '>'Trigger'<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>return<\/span> <span style='color:#0000e6; '>'Not-trigger'<\/span>\r\n    \r\n<span style='color:#800000; font-weight:bold; '>def<\/span> get_category_list<span style='color:#808030; '>(<\/span>row<span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>return<\/span> <span style='color:#808030; '>[<\/span>c <span style='color:#800000; font-weight:bold; '>for<\/span> c <span style='color:#800000; font-weight:bold; '>in<\/span> trigger_categories <span style='color:#800000; font-weight:bold; '>if<\/span> row<span style='color:#808030; '>[<\/span>c<span style='color:#808030; '>]<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#008c00; '>1<\/span><span style='color:#808030; '>]<\/span>   \r\n<\/pre>\n<p><!--Created using ToHtml.com on 2022-10-24 11:04:01 UTC --><\/p>\n<p>After that, read the dataset and perform the preprocessing task:<\/p>\n<pre style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 4: read data and preprocess it<\/span>\r\ndf <span style='color:#808030; '>=<\/span> pd<span style='color:#808030; '>.<\/span>read_csv<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'toxicity_triggers_sample.csv'<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#print dataset header before preprocessing <\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"Before preprocessing:\"<\/span><span style='color:#808030; '>)<\/span>\r\n<span 
style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span>df<span style='color:#808030; '>.<\/span>head<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Create additional columns with the numerical representations of labels<\/span>\r\n<span style='color:#696969; '>#(a single .loc call with a row mask and a column name avoids chained assignment)<\/span>\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Trigger'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> <span style='color:#008c00; '>0<\/span>\r\ndf<span style='color:#808030; '>.<\/span>loc<span style='color:#808030; '>[<\/span>df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'category'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#0000e6; '>'trigger'<\/span><span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'Trigger'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> <span style='color:#008c00; '>1<\/span>\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'NotTrigger'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> <span style='color:#008c00; '>0<\/span>\r\n<span style='color:#696969; '>#NotTrigger stays 0 for every row: get_category returns 'Trigger' as soon as any flag column is 1,<\/span>\r\n<span style='color:#696969; '>#so non-trigger rows are labeled through its 'Not-trigger' fallback<\/span>\r\ndf<span style='color:#808030; '>.<\/span>loc<span style='color:#808030; '>[<\/span>df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'category'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#0000e6; '>'non_trigger'<\/span><span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'NotTrigger'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> <span style='color:#008c00; '>0<\/span>\r\n\r\n<span style='color:#696969; '>#Delete the category column<\/span>\r\n<span 
style='color:#800000; font-weight:bold; '>del<\/span> df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'category'<\/span><span style='color:#808030; '>]<\/span>\r\n\r\n<span style='color:#696969; '>#Define the list of categories <\/span>\r\ntrigger_categories <span style='color:#808030; '>=<\/span> <span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Trigger'<\/span><span style='color:#808030; '>,<\/span><span style='color:#0000e6; '>'NotTrigger'<\/span><span style='color:#808030; '>]<\/span>\r\n\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'category'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>.<\/span>apply<span style='color:#808030; '>(<\/span>get_category<span style='color:#808030; '>,<\/span> axis<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>1<\/span><span style='color:#808030; '>)<\/span>\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'category_list'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>.<\/span>apply<span style='color:#808030; '>(<\/span>get_category_list<span style='color:#808030; '>,<\/span> axis<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>1<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Print the count of triggers and non-triggers in the dataset<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span>df<span style='color:#808030; '>.<\/span>category<span style='color:#808030; '>.<\/span>value_counts<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Fill nan or missing values with _na_<\/span>\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; 
'>'Text'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Text'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>fillna<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"_na_\"<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>values\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Processed_Text'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span>df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Text'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>apply<span style='color:#808030; '>(<\/span>comment_to_words<span style='color:#808030; '>)<\/span>\r\ndf<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'parse'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>.<\/span>Processed_Text<span style='color:#808030; '>.<\/span>apply<span style='color:#808030; '>(<\/span>nlp<span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Print dataset header after preprocessing<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"After preprocessing:\"<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span>df<span style='color:#808030; '>.<\/span>head<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<\/pre>\n<p>Now, use the Log Odds Ratio (with a Dirichlet prior) to find the top-ranked terms in each category:<\/p>\n<pre 
style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 5: Use scattertext to create a corpus from parsed documents, use log odds ratio to rank and print top 10, 30, and 50 terms per category<\/span>\r\n<span style='color:#696969; '>#Note that the terms displayed here are not using any smoothing in the LOR, instead it uses Informative Dirichlet Prior<\/span>\r\ncorpus <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>CorpusFromParsedDocuments<span style='color:#808030; '>(<\/span>\r\n    df<span style='color:#808030; '>,<\/span> \r\n    parsed_col<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'parse'<\/span><span style='color:#808030; '>,<\/span> \r\n    category_col<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'category'<\/span><span style='color:#808030; '>,<\/span> \r\n    feats_from_spacy_doc<span style='color:#808030; '>=<\/span>st<span style='color:#808030; '>.<\/span>UnigramsFromSpacyDoc<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>build<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Print the total number of terms in the corpus<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#400000; '>len<\/span><span style='color:#808030; '>(<\/span>corpus<span style='color:#808030; '>.<\/span>get_terms<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>for<\/span> i <span style='color:#800000; font-weight:bold; '>in<\/span> <span style='color:#400000; '>range<\/span><span style='color:#808030; '>(<\/span><span style='color:#008c00; '>1<\/span><span 
style='color:#808030; '>,<\/span> <span style='color:#008c00; '>20<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'Threshold:'<\/span><span style='color:#808030; '>,<\/span> i<span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'# terms:'<\/span><span style='color:#808030; '>,<\/span> \r\n          <span style='color:#400000; '>len<\/span><span style='color:#808030; '>(<\/span>corpus<span style='color:#808030; '>.<\/span>compact<span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>ClassPercentageCompactor<span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>OncePerDocFrequencyRanker<span style='color:#808030; '>,<\/span> i<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>get_terms<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n\r\ncompact_corpus <span style='color:#808030; '>=<\/span> corpus<span style='color:#808030; '>.<\/span>compact<span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>ClassPercentageCompactor<span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>OncePerDocFrequencyRanker<span style='color:#808030; '>,<\/span> <span style='color:#008c00; '>14<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\ndfs <span style='color:#808030; '>=<\/span> <span style='color:#808030; '>[<\/span><span style='color:#808030; '>]<\/span>\r\n\r\nsingle_df_columns <span style='color:#808030; '>=<\/span> <span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'parse'<\/span><span style='color:#808030; '>,<\/span> <span style='color:#0000e6; 
'>'category_list'<\/span><span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'single_category'<\/span><span style='color:#808030; '>]<\/span>\r\n<span style='color:#800000; font-weight:bold; '>for<\/span> cat <span style='color:#800000; font-weight:bold; '>in<\/span> trigger_categories<span style='color:#808030; '>:<\/span>\r\n    <span style='color:#696969; '>#Copy the rows flagged for this category so the new column is added to an independent DataFrame<\/span>\r\n    new_df <span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>[<\/span><span style='color:#808030; '>(<\/span>df<span style='color:#808030; '>[<\/span>cat<span style='color:#808030; '>]<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#008c00; '>1<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>copy<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\n    new_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'single_category'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> cat\r\n    dfs<span style='color:#808030; '>.<\/span>append<span style='color:#808030; '>(<\/span>new_df<span style='color:#808030; '>[<\/span>single_df_columns<span style='color:#808030; '>]<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#696969; '>#Rows with no flag column set to 1 are labeled 'Not-trigger'<\/span>\r\nnew_df <span style='color:#808030; '>=<\/span> df<span style='color:#808030; '>[<\/span>df<span style='color:#808030; '>[<\/span>trigger_categories<span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span><span style='color:#400000; '>sum<\/span><span style='color:#808030; '>(<\/span>axis<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>1<\/span><span style='color:#808030; '>)<\/span> <span style='color:#44aadd; '>==<\/span> <span style='color:#008c00; '>0<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>copy<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\nnew_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'single_category'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> <span style='color:#0000e6; 
'>'Not-trigger'<\/span>\r\ndfs<span style='color:#808030; '>.<\/span>append<span style='color:#808030; '>(<\/span>new_df<span style='color:#808030; '>[<\/span>single_df_columns<span style='color:#808030; '>]<\/span><span style='color:#808030; '>)<\/span>\r\nsingle_df <span style='color:#808030; '>=<\/span> pd<span style='color:#808030; '>.<\/span>concat<span style='color:#808030; '>(<\/span>dfs<span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>del<\/span> dfs\r\nsingle_category_corpus <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>CorpusFromParsedDocuments<span style='color:#808030; '>(<\/span>\r\n    single_df<span style='color:#808030; '>,<\/span> \r\n    parsed_col<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'parse'<\/span><span style='color:#808030; '>,<\/span> \r\n    category_col<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'single_category'<\/span><span style='color:#808030; '>,<\/span> \r\n    feats_from_spacy_doc<span style='color:#808030; '>=<\/span>st<span style='color:#808030; '>.<\/span>UnigramsFromSpacyDoc<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>build<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\nterm_freq_df <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>OncePerDocFrequencyRanker<span style='color:#808030; '>(<\/span>single_category_corpus<span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>get_ranks<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#800000; font-weight:bold; '>for<\/span> c <span style='color:#800000; font-weight:bold; '>in<\/span> single_category_corpus<span style='color:#808030; '>.<\/span>get_categories<span style='color:#808030; '>(<\/span><span 
style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n    \r\n    scores <span style='color:#808030; '>=<\/span>  <span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>ScaledFScorePresets<span style='color:#808030; '>(<\/span>beta <span style='color:#808030; '>=<\/span> <span style='color:#008c00; '>1<\/span><span style='color:#808030; '>,<\/span> one_to_neg_one<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span>\r\n               <span style='color:#808030; '>.<\/span>get_scores<span style='color:#808030; '>(<\/span>\r\n                   term_freq_df<span style='color:#808030; '>[<\/span>c <span style='color:#44aadd; '>+<\/span> <span style='color:#0000e6; '>' freq'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>,<\/span> \r\n                   term_freq_df<span style='color:#808030; '>[<\/span><span style='color:#808030; '>[<\/span>oc <span style='color:#800000; font-weight:bold; '>for<\/span> oc <span style='color:#800000; font-weight:bold; '>in<\/span> term_freq_df<span style='color:#808030; '>.<\/span>columns \r\n                                 <span style='color:#800000; font-weight:bold; '>if<\/span> oc <span style='color:#44aadd; '>!=<\/span> c <span style='color:#44aadd; '>+<\/span> <span style='color:#0000e6; '>' freq'<\/span> <span style='color:#800000; font-weight:bold; '>and<\/span> oc<span style='color:#808030; '>.<\/span>endswith<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>' freq'<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span><span style='color:#400000; '>sum<\/span><span style='color:#808030; '>(<\/span>axis<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>1<\/span><span style='color:#808030; '>)<\/span>\r\n               <span 
style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n    term_freq_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> scores\r\n    <span style='color:#696969; '>#Print the category and the ranked terms that + belong to the category or - don't<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span>c<span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"10\"<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'+'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>10<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'-'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; 
'>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>10<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>   \r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"30\"<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'+'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>30<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'-'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; 
'>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>30<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span> \r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>\"50\"<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'+'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>50<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span 
style='color:#0000e6; '>'-'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>term_freq_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>50<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\ntox_obs_corpus <span style='color:#808030; '>=<\/span> single_category_corpus\r\n\r\ntox_obs_corpus <span style='color:#808030; '>=<\/span> tox_obs_corpus<span style='color:#808030; '>.<\/span>compact<span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>ClassPercentageCompactor<span style='color:#808030; '>(<\/span>term_count<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>2<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\npriors <span style='color:#808030; '>=<\/span> <span style='color:#808030; '>(<\/span>st<span style='color:#808030; '>.<\/span>PriorFactory<span style='color:#808030; '>(<\/span>single_category_corpus<span style='color:#808030; '>,<\/span> \r\n            category<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'Trigger'<\/span><span style='color:#808030; '>,<\/span> \r\n            not_categories<span style='color:#808030; '>=<\/span><span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Not-trigger'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>)<\/span>\r\n      <span 
style='color:#808030; '>.<\/span>use_neutral_categories<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\n      <span style='color:#808030; '>.<\/span>align_to_target<span style='color:#808030; '>(<\/span>tox_obs_corpus<span style='color:#808030; '>)<\/span>\r\n      <span style='color:#808030; '>.<\/span>get_priors<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\nterm_ranker <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>OncePerDocFrequencyRanker\r\nterm_scorer <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>LogOddsRatioInformativeDirichletPrior<span style='color:#808030; '>(<\/span>priors<span style='color:#808030; '>,<\/span> scale_type<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'class-size'<\/span><span style='color:#808030; '>,<\/span> sigma<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>10<\/span><span style='color:#808030; '>)<\/span>\r\nrank_df <span style='color:#808030; '>=<\/span> term_ranker<span style='color:#808030; '>(<\/span>tox_obs_corpus<span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>get_ranks<span style='color:#808030; '>(<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#696969; '>#Perform an overall rank using LORIDP of the frequent trigger and non-trigger terms<\/span>\r\nrank_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>]<\/span> <span style='color:#808030; '>=<\/span> term_scorer<span style='color:#808030; '>.<\/span>get_scores<span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Trigger freq'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>,<\/span> rank_df<span style='color:#808030; '>[<\/span><span style='color:#0000e6; 
'>'Not-trigger freq'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'Trigger LORIDP 10'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>10<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'NotTrigger LORIDP 10'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>10<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span 
style='color:#808030; '>)<\/span>   \r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'Trigger LORIDP 30'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>30<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'NotTrigger LORIDP 30'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>30<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>   \r\n<span style='color:#800000; 
font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'Trigger LORIDP 50'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>False<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>50<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<span style='color:#800000; font-weight:bold; '>print<\/span><span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'NotTrigger LORIDP 50'<\/span><span style='color:#808030; '>,<\/span><span style='color:#400000; '>list<\/span><span style='color:#808030; '>(<\/span>rank_df<span style='color:#808030; '>.<\/span>sort_values<span style='color:#808030; '>(<\/span>by<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'score'<\/span><span style='color:#808030; '>,<\/span> ascending<span style='color:#808030; '>=<\/span><span style='color:#074726; '>True<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>.<\/span>iloc<span style='color:#808030; '>[<\/span><span style='color:#808030; '>:<\/span><span style='color:#008c00; '>50<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>.<\/span>index<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n<\/pre>\n<p><!--Created using ToHtml.com on 2022-10-24 11:07:02 UTC --><\/p>\n<p>Part of the output looks as 
follows:<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-431\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-3.32.19-PM.png\" alt=\"\" width=\"920\" height=\"36\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-3.32.19-PM.png 920w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-3.32.19-PM-300x12.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-3.32.19-PM-768x30.png 768w\" sizes=\"(max-width: 920px) 100vw, 920px\" \/><br \/>This outcome is consistent with what we found in [2]: named entities like \u2018trump\u2019 are triggering terms, while terms that represent a positive action, like \u2018workout\u2019, are not triggers.<\/p>\n<p>What about using Log Odds Ratio with smoothing? Let\u2019s try it and visualize the result as follows:<\/p>\n<pre style='color:#000000;background:#ffffff;'><span style='color:#696969; '>#Step 6: use scattertext to plot the Log Odds Ratio with a smoothing factor (alpha) <\/span>\r\n<span style='color:#800000; font-weight:bold; '>class<\/span> LogOddsRatioSmoothed<span style='color:#808030; '>:<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>def<\/span> <span style='color:#074726; '>__init__<\/span><span style='color:#808030; '>(<\/span>self<span style='color:#808030; '>,<\/span> alpha<span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span>\r\n        self<span style='color:#808030; '>.<\/span>alpha <span style='color:#808030; '>=<\/span> alpha\r\n    <span style='color:#696969; '>#a and b hold each term's count inside and outside the category; alpha is added to every count<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>def<\/span> get_scores<span style='color:#808030; '>(<\/span>self<span style='color:#808030; '>,<\/span> a<span style='color:#808030; '>,<\/span> b<span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span> \r\n        <span style='color:#800000; font-weight:bold; 
'>return<\/span> <span style='color:#808030; '>(<\/span>np<span style='color:#808030; '>.<\/span>log<span style='color:#808030; '>(<\/span><span style='color:#808030; '>(<\/span>a <span style='color:#44aadd; '>+<\/span> self<span style='color:#808030; '>.<\/span>alpha<span style='color:#808030; '>)<\/span><span style='color:#44aadd; '>\/<\/span><span style='color:#808030; '>(<\/span>np<span style='color:#808030; '>.<\/span><span style='color:#400000; '>sum<\/span><span style='color:#808030; '>(<\/span>a<span style='color:#808030; '>)<\/span> <span style='color:#44aadd; '>+<\/span> self<span style='color:#808030; '>.<\/span>alpha <span style='color:#44aadd; '>*<\/span> <span style='color:#400000; '>len<\/span><span style='color:#808030; '>(<\/span>a<span style='color:#808030; '>)<\/span> <span style='color:#44aadd; '>-<\/span> a <span style='color:#44aadd; '>-<\/span> self<span style='color:#808030; '>.<\/span>alpha<span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span> \r\n                <span style='color:#44aadd; '>-<\/span> np<span style='color:#808030; '>.<\/span>log<span style='color:#808030; '>(<\/span><span style='color:#808030; '>(<\/span>b <span style='color:#44aadd; '>+<\/span> self<span style='color:#808030; '>.<\/span>alpha<span style='color:#808030; '>)<\/span><span style='color:#44aadd; '>\/<\/span><span style='color:#808030; '>(<\/span>np<span style='color:#808030; '>.<\/span><span style='color:#400000; '>sum<\/span><span style='color:#808030; '>(<\/span>b<span style='color:#808030; '>)<\/span> <span style='color:#44aadd; '>+<\/span> self<span style='color:#808030; '>.<\/span>alpha <span style='color:#44aadd; '>*<\/span> <span style='color:#400000; '>len<\/span><span style='color:#808030; '>(<\/span>b<span style='color:#808030; '>)<\/span> <span style='color:#44aadd; '>-<\/span> b <span style='color:#44aadd; '>-<\/span> self<span style='color:#808030; '>.<\/span>alpha<span style='color:#808030; '>)<\/span><span 
style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\n    <span style='color:#800000; font-weight:bold; '>def<\/span> get_name<span style='color:#808030; '>(<\/span>self<span style='color:#808030; '>)<\/span><span style='color:#808030; '>:<\/span> \r\n        <span style='color:#800000; font-weight:bold; '>return<\/span> <span style='color:#0000e6; '>'Smoothed Log-Odds-Ratio'<\/span>\r\n    \r\nhtml <span style='color:#808030; '>=<\/span> st<span style='color:#808030; '>.<\/span>produce_fightin_words_explorer<span style='color:#808030; '>(<\/span>\r\n    corpus<span style='color:#808030; '>,<\/span>\r\n    category<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'Trigger'<\/span><span style='color:#808030; '>,<\/span>\r\n    not_category_name<span style='color:#808030; '>=<\/span><span style='color:#0000e6; '>'Not-trigger'<\/span><span style='color:#808030; '>,<\/span>\r\n    not_categories<span style='color:#808030; '>=<\/span><span style='color:#808030; '>[<\/span><span style='color:#0000e6; '>'Not-trigger'<\/span><span style='color:#808030; '>]<\/span><span style='color:#808030; '>,<\/span>\r\n    term_scorer<span style='color:#808030; '>=<\/span>LogOddsRatioSmoothed<span style='color:#808030; '>(<\/span><span style='color:#008000; '>0.00000001<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>,<\/span>\r\n    grey_threshold<span style='color:#808030; '>=<\/span><span style='color:#008c00; '>0<\/span><span style='color:#808030; '>,<\/span>\r\n<span style='color:#808030; '>)<\/span>\r\n\r\n<span style='color:#696969; '>#Save the html visualization file for later view<\/span>\r\nfile_name <span style='color:#808030; '>=<\/span> <span style='color:#0000e6; '>'trigger_not_lorsmooth.html'<\/span>\r\n<span style='color:#400000; '>open<\/span><span style='color:#808030; '>(<\/span>file_name<span style='color:#808030; '>,<\/span> <span style='color:#0000e6; '>'wb'<\/span><span style='color:#808030; 
'>)<\/span><span style='color:#808030; '>.<\/span>write<span style='color:#808030; '>(<\/span>html<span style='color:#808030; '>.<\/span>encode<span style='color:#808030; '>(<\/span><span style='color:#0000e6; '>'utf-8'<\/span><span style='color:#808030; '>)<\/span><span style='color:#808030; '>)<\/span>\r\nHTML<span style='color:#808030; '>(<\/span>html<span style='color:#808030; '>)<\/span> <span style='color:#696969; '>#Display the visualization in the notebook<\/span>\r\n<\/pre>\n<p>The result of the visualization is as follows:<br \/><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-381\" src=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-2.23.28-PM.png\" alt=\"\" width=\"1433\" height=\"638\" srcset=\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-2.23.28-PM.png 1433w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-2.23.28-PM-300x134.png 300w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-2.23.28-PM-1024x456.png 1024w, https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/Screen-Shot-2022-10-19-at-2.23.28-PM-768x342.png 768w\" sizes=\"(max-width: 1433px) 100vw, 1433px\" \/><br \/>Notice that this time different terms surface as triggers, such as \u2018Lebron\u2019 (another named entity that we found to be triggering in [2]). Furthermore, the top not-trigger terms in the above visualization show that technology-related terms like \u2018Computers\u2019 are not triggers. This finding is also consistent with our work in [2], as the triggering and non-triggering terms discovered here resemble those reported there.<\/p>\n<p>As a final note, we do not claim that Log Odds Ratio is the best method for gaining insights from text, as sometimes the least frequent terms may not be as informative or useful as we think. 
Therefore, we recommend trying other methods [1] on your dataset to see which term ranking technique produces the most useful insights for your analysis.<\/p>\n<p>The entire script is available on <a href=\"https:\/\/colab.research.google.com\/drive\/1A7BC07P8zYza7o8nuIIxy_02IHFuSFlJ?usp=sharing\">Google Colab<\/a>.<\/p>\n<p>Happy coding!<\/p>\n<p><strong>References:<\/strong><br \/>[1] \u201cJasonKessler\/SemioticSquaresTalk: #ddtx18 talk: Lexicon Mining for Semiotic Squares: Exploding Binary Classification.\u201d https:\/\/github.com\/JasonKessler\/SemioticSquaresTalk (accessed Oct. 19, 2022).<br \/>[2] H. Almerekhi, H. Kwak, J. Salminen, and B. J. Jansen, \u201cPROVOKE: Toxicity trigger detection in conversations from the top 100 subreddits,\u201d Data and Information Management, p. 100019, 2022.<br \/>[3] J. Kessler, \u201cScattertext: a Browser-Based Tool for Visualizing how Corpora Differ,\u201d in Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, Jul. 2017, pp. 85\u201390. Accessed: Oct. 19, 2022. [Online]. Available: https:\/\/aclanthology.org\/P17-4015<br \/>[4] \u201cspaCy \u00b7 Industrial-strength Natural Language Processing in Python.\u201d https:\/\/spacy.io\/ (accessed Oct. 19, 2022).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Gaining insights from text-based data can be a daunting task, even when the data is labeled with ground truth categories and ready for usage in machine learning tasks.Researchers often rely on simple methods like the frequency of words in each category to understand the collection\u2019s characteristics. 
However, this approach is not always insightful, as term&hellip; <a class=\"more-link\" href=\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\">Continue reading <span class=\"screen-reader-text\">Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories<\/span><\/a><\/p>\n","protected":false},"author":10,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[64],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v19.13 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team Acua<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team Acua\" \/>\n<meta property=\"og:description\" content=\"Gaining insights from text-based data can be a daunting task, even when the data is labeled with ground truth categories and ready for usage in machine learning tasks.Researchers often rely on simple methods like the frequency of words in each category to understand the collection\u2019s characteristics. 
However, this approach is not always insightful, as term&hellip; Continue reading Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories\" \/>\n<meta property=\"og:url\" content=\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\" \/>\n<meta property=\"og:site_name\" content=\"Team Acua\" \/>\n<meta property=\"article:published_time\" content=\"2022-10-24T11:10:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/blog-img-what-customer-insights.jpeg\" \/>\n<meta name=\"author\" content=\"Hind Almerekhi\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Hind Almerekhi\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\"},\"author\":{\"name\":\"Hind Almerekhi\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc\"},\"headline\":\"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual 
Categories\",\"datePublished\":\"2022-10-24T11:10:23+00:00\",\"dateModified\":\"2022-10-24T11:10:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\"},\"wordCount\":888,\"publisher\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\"},\"keywords\":[\"Log Odds Ratio\"],\"articleSection\":[\"Customer segmentation\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\",\"url\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\",\"name\":\"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team Acua\",\"isPartOf\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#website\"},\"datePublished\":\"2022-10-24T11:10:23+00:00\",\"dateModified\":\"2022-10-24T11:10:23+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/acua.qcri.org\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#website\",\"url\":\"https:\/\/acua.qcri.org\/blog\/\",\"name\":\"Team 
Acua\",\"description\":\"Audience, Customer, and User Analytics\",\"publisher\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/acua.qcri.org\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#organization\",\"name\":\"Team Acua\",\"url\":\"https:\/\/acua.qcri.org\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png\",\"contentUrl\":\"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png\",\"width\":1466,\"height\":770,\"caption\":\"Team Acua\"},\"image\":{\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc\",\"name\":\"Hind Almerekhi\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g\",\"caption\":\"Hind Almerekhi\"},\"url\":\"https:\/\/acua.qcri.org\/blog\/author\/hialmerekhihbku-edu-qa\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team Acua","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/","og_locale":"en_US","og_type":"article","og_title":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team Acua","og_description":"Gaining insights from text-based data can be a daunting task, even when the data is labeled with ground truth categories and ready for usage in machine learning tasks. Researchers often rely on simple methods like the frequency of words in each category to understand the collection\u2019s characteristics. However, this approach is not always insightful, as term&hellip; Continue reading Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories","og_url":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/","og_site_name":"Team Acua","article_published_time":"2022-10-24T11:10:23+00:00","og_image":[{"url":"https:\/\/quecst.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/blog-img-what-customer-insights.jpeg"}],"author":"Hind Almerekhi","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Hind Almerekhi","Est.
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#article","isPartOf":{"@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/"},"author":{"name":"Hind Almerekhi","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc"},"headline":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories","datePublished":"2022-10-24T11:10:23+00:00","dateModified":"2022-10-24T11:10:23+00:00","mainEntityOfPage":{"@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/"},"wordCount":888,"publisher":{"@id":"https:\/\/acua.qcri.org\/blog\/#organization"},"keywords":["Log Odds Ratio"],"articleSection":["Customer segmentation"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/","url":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/","name":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories - Team 
Acua","isPartOf":{"@id":"https:\/\/acua.qcri.org\/blog\/#website"},"datePublished":"2022-10-24T11:10:23+00:00","dateModified":"2022-10-24T11:10:23+00:00","breadcrumb":{"@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/acua.qcri.org\/blog\/log-odds-ratio-going-beyond-simple-term-frequencies-to-characterize-textual-categories\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/acua.qcri.org\/blog\/"},{"@type":"ListItem","position":2,"name":"Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories"}]},{"@type":"WebSite","@id":"https:\/\/acua.qcri.org\/blog\/#website","url":"https:\/\/acua.qcri.org\/blog\/","name":"Team Acua","description":"Audience, Customer, and User Analytics","publisher":{"@id":"https:\/\/acua.qcri.org\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/acua.qcri.org\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/acua.qcri.org\/blog\/#organization","name":"Team Acua","url":"https:\/\/acua.qcri.org\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png","contentUrl":"https:\/\/acua.qcri.org\/blog\/wp-content\/uploads\/2022\/10\/cropped-cropped-logo.png","width":1466,"height":770,"caption":"Team 
Acua"},"image":{"@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/180d6b2a32636ac94c2fecb1839931cc","name":"Hind Almerekhi","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/acua.qcri.org\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a23aaa08517337c496fdf67313626d2e?s=96&d=mm&r=g","caption":"Hind Almerekhi"},"url":"https:\/\/acua.qcri.org\/blog\/author\/hialmerekhihbku-edu-qa\/"}]}},"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/377"}],"collection":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/users\/10"}],"replies":[{"embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/comments?post=377"}],"version-history":[{"count":69,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/377\/revisions"}],"predecessor-version":[{"id":504,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/posts\/377\/revisions\/504"}],"wp:attachment":[{"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/media?parent=377"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/categories?post=377"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/acua.qcri.org\/blog\/wp-json\/wp\/v2\/tags?post=377"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}