Gaining insights from text-based data can be a daunting task, even when the data is labeled with ground truth categories and ready for usage in machine learning tasks.
Researchers often rely on simple methods, such as counting how often each word occurs in each category, to understand a collection's characteristics. However, this approach is not always insightful: raw term frequencies do not account for how differently a term is used across categories, so they are often not enough to tell the categories apart.
Therefore, in this tutorial, we show how to use the Log Odds Ratio, an alternative to term frequency, TF-IDF, and other term-ranking techniques, to obtain insights about the terms that characterize each category in textual data [1].
What is Log Odds Ratio?
Let’s start with what is meant by odds:
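One common way to write this, assuming y_i^a denotes the number of times term i occurs in category a and n^a the total number of term occurrences in category a (the symbol names are our assumption), is the probability of seeing the term divided by the probability of not seeing it:

```latex
O_i^{a} = \frac{f_i^{a}}{1 - f_i^{a}} = \frac{y_i^{a}}{n^{a} - y_i^{a}},
\qquad \text{where } f_i^{a} = \frac{y_i^{a}}{n^{a}}
```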
Keep in mind that the odds can be a very small number.
So now, the odds ratio is the ratio of the odds of a term being used in one category vs. another.
It can be computed using the following equation:
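A sketch of that equation, assuming O_i^a and O_i^b denote the odds of term i in categories a and b, each computed as the term's count y divided by the count of all other term occurrences in that category (n minus y):

```latex
\mathrm{OR}_i^{(a,b)} = \frac{O_i^{a}}{O_i^{b}}
= \frac{y_i^{a} / \left(n^{a} - y_i^{a}\right)}{y_i^{b} / \left(n^{b} - y_i^{b}\right)}
```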
Because these quantities are often very small, it is common to work with the logarithm of the odds ratio, the Log-Odds-Ratio. For our purposes, it also matters that the Log-Odds-Ratio is approximately normally distributed.
If term i does not appear in category b (i.e., y_i^b = 0), the odds of term i in category b are 0, which leaves the odds ratio undefined. We circumvent this issue by adding a pseudo-count to each term, i.e., by assuming that each term occurs at least α times.
This will result in an equation that computes the smoothed Log-Odds-Ratio as follows:
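The form below is a reconstruction that mirrors the `LogOddsRatioSmoothed.get_scores` method shown in Step 6 of this tutorial. It assumes y_i^a is the count of term i in category a, n^a the total count of all terms in category a, |V| the vocabulary size, and α the pseudo-count:

```latex
\log \mathrm{OR}_i^{(a,b)} =
\log \frac{y_i^{a} + \alpha}{n^{a} + \alpha |V| - y_i^{a} - \alpha}
- \log \frac{y_i^{b} + \alpha}{n^{b} + \alpha |V| - y_i^{b} - \alpha}
```

Setting α = 0 recovers the unsmoothed Log-Odds-Ratio, which is undefined whenever one of the counts is 0.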
The smoothed Log-Odds-Ratio ranks terms slightly better than TF-IDF, yet it still favors very low-frequency terms that are unique to a category.
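To make this behavior concrete, here is a minimal sketch with made-up counts. The function name, the counts, and the value of alpha are illustrative assumptions, not part of the dataset or of scattertext. It compares a rare term that appears only in category a with a frequent term that is three times more common in a than in b.

```python
import math


def smoothed_log_odds_ratio(count_a, count_b, total_a, total_b, vocab_size, alpha=0.01):
    """Smoothed log odds ratio of one term between categories a and b."""
    odds_a = (count_a + alpha) / (total_a + alpha * vocab_size - count_a - alpha)
    odds_b = (count_b + alpha) / (total_b + alpha * vocab_size - count_b - alpha)
    return math.log(odds_a) - math.log(odds_b)


# Hypothetical corpus: 10,000 term occurrences per category, 1,000 distinct terms
TOTAL_A = TOTAL_B = 10_000
VOCAB_SIZE = 1_000

# A term seen twice in category a and never in category b
rare_exclusive = smoothed_log_odds_ratio(2, 0, TOTAL_A, TOTAL_B, VOCAB_SIZE)
# A term seen 300 times in category a and 100 times in category b
frequent_skewed = smoothed_log_odds_ratio(300, 100, TOTAL_A, TOTAL_B, VOCAB_SIZE)

print("rare, exclusive term:", round(rare_exclusive, 2))
print("frequent, skewed term:", round(frequent_skewed, 2))
```

With these toy numbers the rare term scores about 5.3 and the frequent term about 1.1, even though the frequent term has far more evidence behind it.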
Note that terms exclusive to one category can receive extremely high or low scores compared to other terms, even though they occur infrequently. Intuitively, we'd like to score terms highly if they have a lot of evidence behind them AND they tend to be used much more in one category than in another. However, in cases where the dataset has terms that appear in both categories (i.e., a lot of intersecting terms, even after cleaning), the Log Odds Ratio can be a suitable method of ranking terms to gain insights into the type of content in each category.
For this tutorial, we will use a random sample of the dataset that was used in [2] to build a machine learning model for detecting toxicity triggers (i.e., causes of toxicity in online conversation threads). Furthermore, we will use Scattertext [3], a text visualization library that generates interactive scatter plots of corpora and renders them as HTML. For preprocessing the dataset, we will use spaCy [4], an open-source NLP library that supports many kinds of text processing tasks.
The dataset is available for download from this link.
After downloading the dataset, make sure that you install the following packages:
#Step 1: download missing libraries into Google Colab with the pip command
!pip install --quiet scattertext
!pip install --quiet spacy
!python3 -m spacy download en_core_web_sm
Next, import the necessary libraries and download the required components for performing NLP tasks, as follows:
#Step 2: import the required libraries or load the needed modules
import re
import warnings

import nltk
import numpy as np
import pandas as pd
import scattertext as st
import spacy
from bs4 import BeautifulSoup
from IPython.display import HTML, IFrame, display
from nltk.corpus import stopwords

#Widen the notebook container for visualization
display(HTML("<style>.container { width:98% !important; }</style>"))

#Load the small English spaCy model and download the NLTK stop-word list
nlp = spacy.load("en_core_web_sm")
nltk.download('stopwords')

warnings.filterwarnings('ignore')
Then, define a function for cleaning the raw text from the dataset, along with helper functions that map the numerical label flags of each row to a text category, as follows:
#Step 3: function to convert the comments to words
#Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))

def comment_to_words(raw_comment):
    # 1. Remove HTML
    comment_text = BeautifulSoup(raw_comment, 'lxml').get_text()
    # 2. Replace non-letters with spaces using a regex
    letters_only = re.sub("[^a-zA-Z]", " ", comment_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in STOP_WORDS]
    # 5. Join the words back into one string separated by spaces
    return " ".join(meaningful_words)

#A function to change the category from numerical to text
def get_category(row):
    for c in trigger_categories:
        if row[c] == 1:
            return 'Trigger'
    return 'Not-trigger'

#A function to list every category flag that is set for a row
def get_category_list(row):
    return [c for c in trigger_categories if row[c] == 1]
After that, read the dataset and perform the preprocessing task:
#Step 4: read data and preprocess it
df = pd.read_csv('toxicity_triggers_sample.csv')

#Print dataset header before preprocessing
print("Before preprocessing:")
print(df.head())

#Create a numerical flag column for the positive class
df['Trigger'] = (df['category'] == 'trigger').astype(int)

#Delete the original category column
del df['category']

#Define the list of flag columns. Only the positive class needs a flag:
#get_category labels every row with Trigger == 0 as 'Not-trigger'.
trigger_categories = ['Trigger']
df['category'] = df.apply(get_category, axis=1)
df['category_list'] = df.apply(get_category_list, axis=1)

#Print the count of triggers and non-triggers in the dataset
print(df.category.value_counts())

#Fill nan or missing values with _na_, then clean and parse the text
df['Text'] = df['Text'].fillna("_na_")
df['Processed_Text'] = df['Text'].apply(comment_to_words)
df['parse'] = df.Processed_Text.apply(nlp)

#Print dataset header after preprocessing
print("After preprocessing:")
print(df.head())
Now, use the Log Odds Ratio with an Informative Dirichlet Prior (LORIDP) to rank the terms that are most characteristic of each category:
#Step 5: use scattertext to create a corpus from parsed documents, then rank and
#print the top 10, 30, and 50 terms per category.
#Note that the log odds ratio used here does not apply additive smoothing;
#instead it uses an Informative Dirichlet Prior.
corpus = st.CorpusFromParsedDocuments(
    df,
    parsed_col='parse',
    category_col='category',
    feats_from_spacy_doc=st.UnigramsFromSpacyDoc()
).build()

#Print the total number of terms in the corpus
print(len(corpus.get_terms()))

#Show how the vocabulary size changes with the compaction threshold
for i in range(1, 20):
    print('Threshold:', i, '# terms:',
          len(corpus.compact(st.ClassPercentageCompactor(st.OncePerDocFrequencyRanker, i)).get_terms()))
compact_corpus = corpus.compact(st.ClassPercentageCompactor(st.OncePerDocFrequencyRanker, 14))

#Build a dataframe in which every document carries exactly one category label
dfs = []
single_df_columns = ['parse', 'category_list', 'single_category']
for cat in trigger_categories:
    new_df = df[df[cat] == 1].copy()  #copy to avoid assigning into a slice
    new_df['single_category'] = cat
    dfs.append(new_df[single_df_columns])
new_df = df[df[trigger_categories].sum(axis=1) == 0].copy()
new_df['single_category'] = 'Not-trigger'
dfs.append(new_df[single_df_columns])
single_df = pd.concat(dfs)
del dfs

single_category_corpus = st.CorpusFromParsedDocuments(
    single_df,
    parsed_col='parse',
    category_col='single_category',
    feats_from_spacy_doc=st.UnigramsFromSpacyDoc()
).build()

term_freq_df = st.OncePerDocFrequencyRanker(single_category_corpus).get_ranks()

#First pass: score terms per category with scattertext's scaled F-score
for c in single_category_corpus.get_categories():
    other_freq_cols = [oc for oc in term_freq_df.columns
                       if oc != c + ' freq' and oc.endswith(' freq')]
    scores = st.ScaledFScorePresets(beta=1, one_to_neg_one=True).get_scores(
        term_freq_df[c + ' freq'],
        term_freq_df[other_freq_cols].sum(axis=1)
    )
    term_freq_df['score'] = scores
    top_terms = term_freq_df.sort_values(by='score', ascending=False)
    bottom_terms = term_freq_df.sort_values(by='score', ascending=True)
    #Print the category and the ranked terms that + belong to the category or - don't
    print(c)
    for n in (10, 30, 50):
        print(n)
        print('+', list(top_terms.iloc[:n].index))
        print('-', list(bottom_terms.iloc[:n].index))

#Second pass: log odds ratio with an Informative Dirichlet Prior (LORIDP)
tox_obs_corpus = single_category_corpus.compact(st.ClassPercentageCompactor(term_count=2))
priors = (st.PriorFactory(single_category_corpus,
                          category='Trigger',
                          not_categories=['Not-trigger'])
          .use_neutral_categories()
          .align_to_target(tox_obs_corpus)
          .get_priors())
term_ranker = st.OncePerDocFrequencyRanker
term_scorer = st.LogOddsRatioInformativeDirichletPrior(priors, scale_type='class-size', sigma=10)
rank_df = term_ranker(tox_obs_corpus).get_ranks()

#Perform an overall rank using LORIDP of the frequent trigger and non-trigger terms
rank_df['score'] = term_scorer.get_scores(rank_df['Trigger freq'], rank_df['Not-trigger freq'])
trigger_ranked = rank_df.sort_values(by='score', ascending=False)
not_trigger_ranked = rank_df.sort_values(by='score', ascending=True)
for n in (10, 30, 50):
    print('Trigger LORIDP', n, list(trigger_ranked.iloc[:n].index))
    print('NotTrigger LORIDP', n, list(not_trigger_ranked.iloc[:n].index))
Part of the output looks as follows:
This outcome is similar to what we found in [2], which states that named entities like ‘trump’ are triggering terms, while terms that represent a positive action like ‘workout’ are not triggers.
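To give some intuition for why a prior-based score can favor well-attested terms, here is a minimal pure-Python sketch of a z-scored log odds ratio with a Dirichlet prior. It is an illustration under stated assumptions, not scattertext's implementation: the function name, the uniform prior of one pseudo-count per term, and the toy counts are all hypothetical, and `st.LogOddsRatioInformativeDirichletPrior` may scale its priors differently (for example through `sigma` and `scale_type`).

```python
import math


def log_odds_z_score(count_a, count_b, total_a, total_b, prior_count, prior_total):
    """Log odds ratio with a Dirichlet prior, divided by its approximate standard deviation."""
    delta = (
        math.log((count_a + prior_count) / (total_a + prior_total - count_a - prior_count))
        - math.log((count_b + prior_count) / (total_b + prior_total - count_b - prior_count))
    )
    # Approximate variance of the log odds ratio: small counts give a large variance
    variance = 1.0 / (count_a + prior_count) + 1.0 / (count_b + prior_count)
    return delta / math.sqrt(variance)


# Hypothetical corpus: 10,000 term occurrences per category and a uniform prior
# of 1 pseudo-count for each of 1,000 distinct terms (so the prior total is 1,000)
TOTAL_A = TOTAL_B = 10_000
PRIOR_COUNT, PRIOR_TOTAL = 1.0, 1_000.0

# A term seen twice in category a and never in category b
rare_z = log_odds_z_score(2, 0, TOTAL_A, TOTAL_B, PRIOR_COUNT, PRIOR_TOTAL)
# A term seen 300 times in category a and 100 times in category b
frequent_z = log_odds_z_score(300, 100, TOTAL_A, TOTAL_B, PRIOR_COUNT, PRIOR_TOTAL)

print("rare, exclusive term z-score:", round(rare_z, 2))
print("frequent, skewed term z-score:", round(frequent_z, 2))
```

With these toy numbers the rare, exclusive term gets a z-score of about 1, while the frequent, skewed term gets a z-score close to 10, because dividing by the standard deviation penalizes terms with little evidence behind them.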
Well, what about using Log Odds Ratio with smoothing? Let’s try it and visualize it as follows:
#Step 6: use scattertext to plot the Log Odds Ratio with smoothing factor
class LogOddsRatioSmoothed:
    """Term scorer: log odds ratio with an additive pseudo-count alpha."""

    def __init__(self, alpha):
        self.alpha = alpha

    def get_scores(self, a, b):
        #a and b hold the per-term counts for the category and for the other categories
        log_odds_a = np.log((a + self.alpha)
                            / (np.sum(a) + self.alpha * len(a) - a - self.alpha))
        log_odds_b = np.log((b + self.alpha)
                            / (np.sum(b) + self.alpha * len(b) - b - self.alpha))
        return log_odds_a - log_odds_b

    def get_name(self):
        return 'Smoothed Log-Odds-Ratio'

html_page = st.produce_fightin_words_explorer(
    corpus,
    category='Trigger',
    not_category_name='Not-trigger',
    not_categories=['Not-trigger'],
    term_scorer=LogOddsRatioSmoothed(0.00000001),
    grey_threshold=0,
)

#Save the html visualization file for later view
file_name = 'trigger_not_lorsmooth.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(html_page)

HTML(html_page)  #View the visualization inline
The result of the visualization is as follows:
Notice that this time we have different terms as triggers like ‘Lebron’ (another named entity that we found triggering in [2]). Furthermore, the top not-trigger terms from the above visualization show that terms related to technology like ‘Computers’ are not triggers. This finding also coincides with our work from [2], noting similarities between the discovered triggering and non-triggering terms.
As a final note, we do not claim that the Log Odds Ratio is the best method for gaining insights from text, as the least frequent terms are sometimes less informative or useful than they appear. Therefore, we recommend trying other methods [1] on your dataset to see which term ranking technique produces the most useful insights for your analysis.
The entire script is available on Google Colab.
Happy coding!
References:
[1] “JasonKessler/SemioticSquaresTalk: #ddtx18 talk: Lexicon Mining for Semiotic Squares: Exploding Binary Classification.” https://github.com/JasonKessler/SemioticSquaresTalk (accessed Oct. 19, 2022).
[2] H. Almerekhi, H. Kwak, J. Salminen, and B. J. Jansen, “PROVOKE: Toxicity trigger detection in conversations from the top 100 subreddits,” Data and Information Management, p. 100019, 2022.
[3] J. Kessler, “Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ,” in Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, Jul. 2017, pp. 85–90. Accessed: Oct. 19, 2022. [Online]. Available: https://aclanthology.org/P17-4015
[4] “spaCy · Industrial-strength Natural Language Processing in Python.” https://spacy.io/ (accessed Oct. 19, 2022).