Log Odds Ratio: Going Beyond Simple Term Frequencies to Characterize Textual Categories

Gaining insights from text data can be a daunting task, even when the data is labeled with ground-truth categories and ready for use in machine learning tasks.
Researchers often rely on simple measures, such as the frequency of words in each category, to understand a collection’s characteristics. However, this approach is not always insightful: raw term frequencies are dominated by words that are common in every category, so they say little about what actually distinguishes one category from another.

The issue of getting meaningful insights from textual and categorical data (source of image)

Therefore, in this tutorial, we show how to use the Log Odds Ratio, an alternative to term frequency, TF-IDF, and other term-ranking techniques, to obtain unique insights about the terms that represent categories in textual data [1].

What is Log Odds Ratio?

Let’s start with what is meant by odds:
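
Written out, with notation assumed here for illustration ($y_i^a$ is the number of times term $i$ occurs in category $a$, and $n^a$ is the total number of term occurrences in category $a$), the odds of term $i$ in category $a$ compare how often the term is used against how often any other term is used:

```latex
O_i^a = \frac{y_i^a}{n^a - y_i^a}
```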

Keep in mind that the odds can be a very small number.
The odds ratio is then the ratio of the odds of a term being used in one category versus another.
It can be computed using the following equation:
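
A sketch of that equation, under the assumed notation that $y_i^a$ is the count of term $i$ in category $a$ and $n^a$ is the total count of all terms in category $a$ (and likewise for category $b$):

```latex
\mathrm{OR}_i^{(a,b)} = \frac{O_i^a}{O_i^b}
  = \frac{y_i^a / (n^a - y_i^a)}{y_i^b / (n^b - y_i^b)}
```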

The Log-Odds-Ratio is often used instead, given that these quantities are often very small. For our purposes, the important property is that it is approximately normally distributed.

If term i does not appear in category b (i.e., its count y_i^b = 0), the odds of term i in category b are 0, which makes the odds ratio undefined. We circumvent this issue by adding a pseudo-count to each term, assuming it occurs at least 𝛼 times.
This results in an equation that computes the smoothed Log-Odds-Ratio as follows:
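
A reconstruction of the smoothed score that matches the `LogOddsRatioSmoothed` scorer defined in Step 6 of this tutorial. The symbols are assumptions for illustration: $y_i^a$ is the count of term $i$ in category $a$, $n^a$ is the total count of all terms in category $a$, and $|V|$ is the number of distinct terms:

```latex
\delta_i^{(a-b)} =
  \log\frac{y_i^a + \alpha}{n^a + \alpha |V| - y_i^a - \alpha}
  - \log\frac{y_i^b + \alpha}{n^b + \alpha |V| - y_i^b - \alpha}
```

Setting $\alpha = 0$ recovers the unsmoothed log-odds ratio, which is undefined whenever one of the counts is zero.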

The smoothed Log-Odds-Ratio ranks terms slightly better than TF-IDF, yet it still favors very low-frequency terms that are unique to a category.
Note that terms exclusive to one category can receive extremely high or low scores compared to others, precisely because they occur so infrequently. Intuitively, we’d like to score a term highly only if it has a lot of evidence behind it AND it tends to be used much more in one category than in another. However, when many terms appear in both categories (i.e., a lot of intersecting terms, even after cleaning), the Log Odds Ratio can be a suitable method of ranking terms to gain insights into the type of content in each category.
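
To make the bias toward rare, exclusive terms concrete, here is a minimal pure-Python sketch of the smoothed score. The term names, counts, and the helper name `smoothed_log_odds_ratio` are made up for illustration and do not come from the dataset used in this tutorial:

```python
import math

def smoothed_log_odds_ratio(y_a, y_b, alpha):
    """Smoothed log-odds ratio per term for two aligned lists of counts."""
    n_a, n_b, vocab = sum(y_a), sum(y_b), len(y_a)
    scores = []
    for ya, yb in zip(y_a, y_b):
        odds_a = (ya + alpha) / (n_a + alpha * vocab - ya - alpha)
        odds_b = (yb + alpha) / (n_b + alpha * vocab - yb - alpha)
        scores.append(math.log(odds_a) - math.log(odds_b))
    return scores

# Hypothetical counts, invented for this example only
terms = ["game", "trump", "rareword", "other"]
counts_a = [50, 40, 1, 9]    # category a
counts_b = [50, 10, 0, 40]   # category b

scores = dict(zip(terms, smoothed_log_odds_ratio(counts_a, counts_b, alpha=0.01)))
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example, "rareword" occurs only once, yet it outranks "trump", which has 40 occurrences in category a against 10 in category b. That is the behavior described above: with a small pseudo-count, a term that is exclusive to one category gets an extreme score on very little evidence.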

For this tutorial, we will use a random sample of a dataset that was used in [2] to build a machine learning model for detecting toxicity triggers (i.e., causes of toxicity in online conversation threads). Furthermore, we will use scattertext [3], a text visualization library that generates interactive scatter plots of corpora and renders them as HTML. For preprocessing the dataset, we will use spaCy [4], an open-source NLP library that supports many kinds of text processing tasks.

The dataset is available for download from this link.

After downloading the dataset, make sure that you install the following packages:

#Step 1: download missing libraries into Google colab with the pip command
!pip install --quiet scattertext
!pip install --quiet spacy
!python3 -m spacy download en_core_web_sm

Next, import the necessary libraries and download the required components for performing NLP tasks, as follows:

#Step 2: import the required libraries or load the needed modules
import re
import warnings
import numpy as np
import pandas as pd
import scattertext as st
import spacy
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from IPython.display import IFrame, display, HTML

display(HTML("<style>.container { width:98% !important; }</style>")) #For visualization
nltk.download('stopwords', quiet=True) #Stop-word list needed in Step 3
nlp = spacy.load("en_core_web_sm")

Then, write a function for cleaning the raw text from the dataset and two helpers that map the numerical label columns to a text category, as follows:

#Step 3: function to convert the comments to words
#Build the stop-word set once instead of on every call
stops = set(stopwords.words("english"))

def comment_to_words(raw_comment):
    # 1. Remove HTML
    comment_text = BeautifulSoup(raw_comment, 'lxml').get_text()
    # 2. Remove non-letters with regex
    letters_only = re.sub("[^a-zA-Z]", " ", comment_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 5. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

#A function to change the category from numerical to text
def get_category(row):
    for c in trigger_categories:
        if row[c] == 1:
            return 'Trigger'
    return 'Not-trigger'
def get_category_list(row):
    return [c for c in trigger_categories if row[c] == 1]   
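
To see what these helpers do without installing anything, here is a small standard-library sketch. It is a simplified stand-in, not the tutorial's exact code: `STOPS` is a tiny hypothetical stop-word set used in place of NLTK's list, `clean` skips the HTML-stripping step, and the rows are plain dictionaries rather than dataframe rows:

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list
STOPS = {"the", "is", "a", "and", "of", "to"}

def clean(raw):
    # Keep letters only, lower-case, split, and drop stop words
    letters_only = re.sub("[^a-zA-Z]", " ", raw)
    return " ".join(w for w in letters_only.lower().split() if w not in STOPS)

trigger_categories = ["Trigger", "NotTrigger"]

def get_category(row):
    # Any flag set to 1 yields 'Trigger'; otherwise 'Not-trigger'
    for c in trigger_categories:
        if row[c] == 1:
            return "Trigger"
    return "Not-trigger"

cleaned = clean("The game is ON, and it's 2-1!")
labels = [get_category({"Trigger": 1, "NotTrigger": 0}),
          get_category({"Trigger": 0, "NotTrigger": 0})]
print(cleaned)
print(labels)
```

Note that `get_category` returns 'Trigger' as soon as any column listed in `trigger_categories` equals 1, so a row is labeled 'Not-trigger' only when all of those columns are 0.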

After that, read the dataset and perform the preprocessing task:

#Step 4: read data and preprocess it
df = pd.read_csv('toxicity_triggers_sample.csv')

#Print dataset header before preprocessing
print("Before preprocessing:")
print(df.head())

#Create additional columns with the numerical representations of labels
df['Trigger'] = 0
df.loc[df['category'] == 'trigger', 'Trigger'] = 1
#NotTrigger stays 0 for every row: get_category labels a row 'Trigger' as soon as
#any flag equals 1, so non-triggers are identified by having no flag set
df['NotTrigger'] = 0

#Delete the category column
del df['category']

#Define the list of categories
trigger_categories = ['Trigger','NotTrigger']

df['category'] = df.apply(get_category, axis=1)
df['category_list'] = df.apply(get_category_list, axis=1)

#Print the count of triggers and non-triggers in the dataset
print(df['category'].value_counts())

#Fill nan or missing values with _na_
df['Text'] = df['Text'].fillna("_na_")
df['Processed_Text'] = df['Text'].apply(comment_to_words)
df['parse'] = df['Processed_Text'].apply(nlp)

#Print dataset header after preprocessing
print("After preprocessing:")
print(df.head())

Now, use the Log Odds Ratio (with an Informative Dirichlet Prior) to rank the terms that are most characteristic of each category in the dataset:

#Step 5: Use scattertext to create a corpus from parsed documents, then rank terms and print the top 10, 30, and 50 per category
#Note that the LOR ranking at the end of this step does not use additive smoothing; it uses an Informative Dirichlet Prior instead
#The constructor arguments are inferred from the columns created in Step 4
corpus = st.CorpusFromParsedDocuments(
    df, category_col='category', parsed_col='parse'
).build()

#Print the number of terms kept in the corpus at each compaction threshold
for i in range(1, 20):
    print('Threshold:', i, '# terms:',
          len(corpus.compact(st.ClassPercentageCompactor(st.OncePerDocFrequencyRanker, i)).get_terms()))

compact_corpus = corpus.compact(st.ClassPercentageCompactor(st.OncePerDocFrequencyRanker, 14))

#Build one dataframe with a single category label per document
dfs = []
single_df_columns = ['parse', 'category_list', 'single_category']
for cat in trigger_categories:
    new_df = df[df[cat] == 1].copy()
    new_df['single_category'] = cat
    dfs.append(new_df[single_df_columns])
new_df = df[df[trigger_categories].sum(axis=1) == 0].copy()
new_df['single_category'] = 'Not-trigger'
dfs.append(new_df[single_df_columns])
single_df = pd.concat(dfs)
del dfs
single_category_corpus = st.CorpusFromParsedDocuments(
    single_df, category_col='single_category', parsed_col='parse'
).build()
term_freq_df = st.OncePerDocFrequencyRanker(single_category_corpus).get_ranks()

for c in single_category_corpus.get_categories():
    scores = (st.ScaledFScorePresets(beta=1, one_to_neg_one=True)
              .get_scores(
                  term_freq_df[c + ' freq'],
                  term_freq_df[[oc for oc in term_freq_df.columns
                                if oc != c + ' freq' and oc.endswith(' freq')]].sum(axis=1)))
    term_freq_df['score'] = scores
    #Print the category and the ranked terms that + belong to the category or - don't
    print(c)
    for top_n in (10, 30, 50):
        print('+', list(term_freq_df.sort_values(by='score', ascending=False).iloc[:top_n].index))
        print('-', list(term_freq_df.sort_values(by='score', ascending=True).iloc[:top_n].index))
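#Compact the corpus by removing infrequent terms
tox_obs_corpus = single_category_corpus.compact(st.ClassPercentageCompactor(term_count=2))
#NOTE: the PriorFactory arguments after the corpus are assumptions (category names from this step, an illustrative starting_count); adjust them to your data
priors = (st.PriorFactory(single_category_corpus,
                          category='Trigger',
                          not_categories=['Not-trigger'],
                          starting_count=0.01)
          .use_all_categories()
          .align_to_target(tox_obs_corpus)
          .get_priors())
term_ranker = st.OncePerDocFrequencyRanker
term_scorer = st.LogOddsRatioInformativeDirichletPrior(priors, scale_type='class-size', sigma=10)
rank_df = term_ranker(tox_obs_corpus).get_ranks()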
#Perform an overall rank using LORIDP of the frequent trigger and non-trigger terms
rank_df['score'] = term_scorer.get_scores(rank_df['Trigger freq'], rank_df['Not-trigger freq'])
print('Trigger LORIDP 10',list(rank_df.sort_values(by='score', ascending=False).iloc[:10].index))
print('NotTrigger LORIDP 10',list(rank_df.sort_values(by='score', ascending=True).iloc[:10].index))   
print('Trigger LORIDP 30',list(rank_df.sort_values(by='score', ascending=False).iloc[:30].index))
print('NotTrigger LORIDP 30',list(rank_df.sort_values(by='score', ascending=True).iloc[:30].index))   
print('Trigger LORIDP 50',list(rank_df.sort_values(by='score', ascending=False).iloc[:50].index))
print('NotTrigger LORIDP 50',list(rank_df.sort_values(by='score', ascending=True).iloc[:50].index))

Part of the output looks as follows:

This outcome is similar to what we found in [2], which states that named entities like ‘trump’ are triggering terms, while terms that represent a positive action like ‘workout’ are not triggers.
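
To illustrate why a prior-based score rewards evidence and not just exclusivity, here is a minimal pure-Python sketch of a z-scored log-odds ratio with a Dirichlet prior, following the formulation commonly attributed to Monroe et al.'s "Fightin' Words" paper. It is a sketch under stated assumptions, not scattertext's implementation (which also applies the `scale_type` and `sigma` options used above). The term names, counts, and the uniform prior of 1 are made up for illustration:

```python
import math

def log_odds_dirichlet_z(y_a, y_b, prior):
    """z-scored log-odds ratio per term, given per-term prior pseudo-counts."""
    n_a, n_b, a0 = sum(y_a), sum(y_b), sum(prior)
    z_scores = []
    for ya, yb, aw in zip(y_a, y_b, prior):
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        # Approximate variance of the estimate: small counts give large variance
        variance = 1.0 / (ya + aw) + 1.0 / (yb + aw)
        z_scores.append(delta / math.sqrt(variance))
    return z_scores

# Hypothetical counts and prior, invented for this example only
terms = ["game", "trump", "rareword", "other"]
counts_a = [50, 40, 1, 9]    # category a
counts_b = [50, 10, 0, 40]   # category b
prior = [1, 1, 1, 1]         # uniform pseudo-counts; an informative prior would use background frequencies

z = dict(zip(terms, log_odds_dirichlet_z(counts_a, counts_b, prior)))
ranked = sorted(z, key=z.get, reverse=True)
print(ranked)
```

Dividing by the standard deviation penalizes terms with little evidence, so in this toy example the frequent term "trump" ranks above the exclusive but rare "rareword".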

Well, what about using Log Odds Ratio with smoothing? Let’s try it and visualize it as follows:

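#Step 6: use scattertext to plot the Log Odds Ratio with smoothing factor
class LogOddsRatioSmoothed:
    def __init__(self, alpha):
        self.alpha = alpha #Pseudo-count added to every term

    def get_scores(self, a, b):
        #a and b hold the per-term counts in the two categories
        return (np.log((a + self.alpha)/(np.sum(a) + self.alpha * len(a) - a - self.alpha))
                - np.log((b + self.alpha)/(np.sum(b) + self.alpha * len(b) - b - self.alpha)))

    def get_name(self):
        return 'Smoothed Log-Odds-Ratio'

#NOTE: the arguments below are assumptions (corpus and category names from Step 5, an illustrative alpha); adjust them to your data
html = st.produce_fightin_words_explorer(
    tox_obs_corpus,
    category='Trigger',
    not_categories=['Not-trigger'],
    term_ranker=st.OncePerDocFrequencyRanker,
    term_scorer=LogOddsRatioSmoothed(alpha=1)
)

#Save the html visualization file for later view
file_name = 'trigger_not_lorsmooth.html'
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
HTML(html) #View the file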

The result of the visualization is as follows:

Notice that this time we have different terms as triggers like ‘Lebron’ (another named entity that we found triggering in [2]). Furthermore, the top not-trigger terms from the above visualization show that terms related to technology like ‘Computers’ are not triggers. This finding also coincides with our work from [2], noting similarities between the discovered triggering and non-triggering terms.

As a final note, we do not claim that the Log Odds Ratio is the best method for gaining insights from text, since the least frequent terms are sometimes less informative or useful than their scores suggest. Therefore, we recommend trying other methods [1] on your dataset to see which term-ranking technique produces the most useful insights for your analysis.

The entire script is available on Google Colab.

Happy coding!

[1] “JasonKessler/SemioticSquaresTalk: #ddtx18 talk: Lexicon Mining for Semiotic Squares: Exploding Binary Classification.” https://github.com/JasonKessler/SemioticSquaresTalk (accessed Oct. 19, 2022).
[2] H. Almerekhi, H. Kwak, J. Salminen, and B. J. Jansen, “PROVOKE: Toxicity trigger detection in conversations from the top 100 subreddits,” Data and Information Management, p. 100019, 2022.
[3] J. Kessler, “Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ,” in Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada, Jul. 2017, pp. 85–90. Accessed: Oct. 19, 2022. [Online]. Available: https://aclanthology.org/P17-4015
[4] “spaCy · Industrial-strength Natural Language Processing in Python.” https://spacy.io/ (accessed Oct. 19, 2022).