This is an exercise in finding the most frequent words in a corpus using CountVectorizer and TfidfTransformer

This could also serve as a feature-engineering step if needed

In [1]:
import numpy as np
import pandas as pd
In [2]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.express as px
import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()
In [3]:
import nltk # Imports the library
nltk.download('stopwords') # Stopword list used during cleaning
nltk.download('wordnet') # WordNet corpus, used for lemmatization via morphy
Out[3]:
True

I am using the slogan dataset from Kaggle

In [4]:
df = pd.read_csv('sloganlist.csv')
In [5]:
df.head(10)
Out[5]:
Company Slogan
0 Costa Coffee For coffee lovers.
1 Evian Evian. Live young.
2 Dasani Designed to make a difference.
3 Heineken It's all about the beer.
4 Gatorade The Legend Continues.
5 Tío Pepe Good food tastes better after a glass of Tio Pepe
6 Tetley's Brewery Don't Do Things By Halves.
7 Batemans Brewery Good Honest Ales
8 Jones Soda Run with the little guy… create some change.
9 Grapette Thirsty or Not.
In [6]:
df.describe()
Out[6]:
Company Slogan
count 1162 1162
unique 569 568
top Domino's Pizza Taste The Feeling.
freq 31 31

From describe() it looks like the CSV file contains many duplicate rows, so I'll remove them

In [7]:
df = df.drop_duplicates() 
df
Out[7]:
Company Slogan
0 Costa Coffee For coffee lovers.
1 Evian Evian. Live young.
2 Dasani Designed to make a difference.
3 Heineken It's all about the beer.
4 Gatorade The Legend Continues.
... ... ...
1136 Hardee's Where the food's the star.
1137 Burger King Have it your way.
1138 Kit Kat Have a Break, Have a Kit Kat.
1139 Subway Eat Fresh.
1141 McDonalds I’m Lovin’ It.

569 rows × 2 columns
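With no arguments, drop_duplicates() only removes rows where every column matches; subset= narrows the comparison and keep= picks which copy survives. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['Subway', 'Subway', 'Evian'],
                   'Slogan':  ['Eat Fresh.', 'Eat fresh!', 'Live young.']})

deduped = df.drop_duplicates()                     # no fully identical rows -> 3 rows kept
by_company = df.drop_duplicates(subset='Company')  # one row per company -> 2 rows kept
print(len(deduped), len(by_company))
```

Here the full-row dedup keeps all three rows because the two Subway slogans differ, while subset='Company' collapses them to one.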

In [8]:
df.describe()
Out[8]:
Company Slogan
count 569 569
unique 569 568
top Del Taco Exquisite wodka.
freq 1 2
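For object (string) columns, describe() reports count, unique, top (the most frequent value), and freq (how often it occurs) rather than numeric statistics. A quick illustration:

```python
import pandas as pd

s = pd.Series(['Exquisite wodka.', 'Exquisite wodka.', 'Eat Fresh.'])
summary = s.describe()
print(summary)  # count 3, unique 2, top 'Exquisite wodka.', freq 2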
In [9]:
df[df['Slogan'] == 'Exquisite wodka.']
Out[9]:
Company Slogan
334 Wyborowa Vodka Exquisite wodka.
448 Stolichnaya vodka Exquisite wodka.

Two different vodka brands happen to share the same slogan, so I'll leave those rows as they are


Let's process the slogans

In [10]:
# Abbreviations and contraction remnants to drop (e.g. "I've" becomes "ive" once punctuation is stripped)
abbrv_list = ['lol', 'lmao', 'rofl', 'ive', 'youve', 'brb', 'ttyl', 'im']
special_chars_list = ['â', '™', '€', 'Ã', '©']
In [11]:
import string
from nltk.corpus import stopwords, wordnet


def text_process(msg):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation and special chars
    2. Replace abbrv with words
    3. Remove all stopwords
    4. Lemmatize words by removing plurals
    5. Returns a list of the cleaned text
    """
    no_punc = []
    for char in msg:
        if char not in string.punctuation and char not in special_chars_list:
            no_punc.append(char)
       
    no_punc = ''.join(no_punc)
    no_punc_word_list = no_punc.split()
    
    cleaned_msg = []
    for word in no_punc_word_list:
        
        if word.lower() not in stopwords.words('english') and word.lower() not in abbrv_list:
            word_lower_case = word.lower()
            word_lemmatized = wordnet.morphy(word_lower_case)
            
            if word_lemmatized is None:
                use_word = word_lower_case
            else:
                use_word = word_lemmatized
                
            cleaned_msg.append(use_word)
    
    cleaned_msg = ' '.join(cleaned_msg)

    return cleaned_msg
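As a self-contained sketch of the same steps (a hypothetical helper using a tiny hard-coded stopword list instead of NLTK's full English list, and skipping lemmatization):

```python
import string

# Tiny stand-in stopword list; the notebook uses nltk's full English list.
MINI_STOPWORDS = {'its', 'all', 'about', 'the', 'a', 'is', 'it', 'of', 'to', 'for'}

def mini_text_process(msg):
    """Strip punctuation, lowercase, and drop stopwords (no lemmatization)."""
    no_punc = ''.join(ch for ch in msg if ch not in string.punctuation)
    words = [w.lower() for w in no_punc.split()]
    return ' '.join(w for w in words if w not in MINI_STOPWORDS)

print(mini_text_process("It's all about the beer."))  # -> beer
```

Unlike text_process above, this keeps plurals ("lovers" stays "lovers"), since it omits the wordnet.morphy step.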
In [12]:
col = 'Slogan'
In [16]:
cleaned_df = df.copy() # work on a copy so the original frame is untouched
cleaned_df[col] = cleaned_df[col].apply(text_process)

Here are my cleaned slogans

In [17]:
cleaned_df[col]
Out[17]:
0                 coffee lover
1             evian live young
2       design make difference
3                         beer
4              legend continue
                 ...          
1136                 food star
1137                       way
1138             break kit kat
1139                 eat fresh
1141                     lovin
Name: Slogan, Length: 569, dtype: object

Vectorization of Slogans

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
In [19]:
# tokenize and build vocab
count_vectorizer = CountVectorizer().fit(cleaned_df[col])
In [20]:
# summarize
print(count_vectorizer.vocabulary_)
{'coffee': 133, 'lover': 424, 'evian': 219, 'live': 415, 'young': 808, 'design': 180, 'make': 433, 'difference': 183, 'beer': 46, 'legend': 403, 'continue': 143, 'good': 292, 'food': 261, 'taste': 691, 'better': 53, 'glass': 288, 'tio': 718, 'pepe': 515, 'dont': 190, 'things': 708, 'half': 316, 'honest': 342, 'ale': 13, 'run': 598, 'little': 414, 'guy': 312, 'create': 160, 'change': 111, 'thirsty': 711, 'juice': 378, 'jiffy': 375, 'bring': 84, 'crazy': 157, 'genki': 280, 'hatsuratsu': 327, 'hey': 338, 'culligan': 165, 'belvedere': 49, 'always': 14, 'go': 289, 'smoothly': 649, 'pure': 553, 'spirit': 661, 'vodka': 763, 'take': 687, 'white': 784, 'horse': 345, 'anywhere': 19, 'friend': 271, 'old': 490, 'parr': 504, 'royal': 593, 'race': 561, 'place': 529, 'strong': 679, 'healthier': 330, 'gum': 311, 'drink': 194, 'fanta': 229, 'stay': 669, 'bamboocha': 36, 'great': 300, 'head': 329, 'first': 249, 'pizza': 528, 'deliver': 178, 'expert': 223, 'peace': 511, 'love': 423, 'ice': 352, 'cream': 158, 'milo': 445, 'everyday': 214, 'customize': 169, 'cup': 167, 'vorsprung': 765, 'durch': 198, 'technik': 696, 'evolve': 220, 'skincare': 639, 'get': 283, 'nut': 485, 'start': 668, 'nescaf': 473, 'finger': 247, 'lickin': 408, 'original': 497, 'jean': 373, 'people': 514, 'tea': 695, 'secret': 611, 'beautiful': 42, 'hair': 315, 'discover': 187, 'baby': 29, 'world': 796, 'feeling': 238, 'relax': 574, 'times': 717, 'suntory': 682, 'time': 715, 'sharpen': 624, '100': 0, 'natural': 467, 'whole': 785, 'fruit': 274, 'british': 86, 'pop': 536, 'enuf': 211, 'enough': 210, 'different': 184, 'soft': 652, 'indian': 357, 'cola': 135, 'kiss': 388, 'orange': 493, 'bluna': 67, 'official': 488, 'beverage': 54, 'twist': 739, 'thirst': 710, 'crusher': 164, 'everything': 217, 'going': 290, 'ok': 489, 'sugar': 681, 'twice': 738, 'caffeine': 99, 'never': 475, 'see': 613, 'like': 411, 'real': 565, 'every': 213, 'drop': 195, 'milk': 443, 'shake': 621, 'lemon': 404, 'lime': 412, 'conscience': 142, 'busy': 
95, 'fizzy': 252, 'unleash': 750, 'beast': 41, 'town': 729, 'uniquely': 749, 'southern': 656, 'deliciously': 177, 'hot': 346, 'cold': 136, 'moxie': 462, 'mine': 447, 'mello': 440, 'yello': 805, 'move': 461, 'veri': 760, 'lemoni': 406, 'totally': 728, 'tropical': 734, 'let': 407, 'nothing': 481, 'fresca': 267, 'barqs': 38, 'bite': 61, 'bucket': 89, 'night': 478, 'still': 673, 'best': 50, 'lemonade': 405, 'perfect': 517, 'gin': 285, 'black': 63, 'seed': 614, 'free': 266, 'vibration': 761, 'nature': 469, 'wont': 794, 'pluto': 535, 'hasta': 325, 'shasta': 625, 'clean': 132, 'sport': 662, 'culture': 166, 'pleasure': 534, 'say': 605, 'australian': 24, 'wine': 788, 'peak': 512, 'commitment': 140, 'outrageously': 500, 'smooth': 647, 'simply': 633, 'seriously': 619, 'easy': 200, 'ready': 564, 'tanqueray': 690, 'whats': 779, 'inside': 363, 'truly': 735, 'count': 151, 'prague': 543, 'southwold': 657, 'pint': 527, 'nothings': 482, 'fresh': 268, 'coldie': 137, 'america': 15, 'lager': 397, 'load': 416, 'look': 419, 'harp': 324, 'side': 631, 'born': 72, 'cool': 146, 'hang': 320, 'around': 20, 'milwaukee': 446, 'famous': 228, 'life': 409, 'cointreauversial': 134, 'seduction': 612, 'king': 387, 'plantation': 531, 'refresh': 571, 'sparkle': 658, 'tasting': 693, 'play': 532, 'moderation': 453, 'feel': 237, 'velvet': 759, 'smile': 644, 'awake': 25, 'ricor': 583, 'le': 401, 'work': 795, 'mother': 459, 'decaffeinate': 175, 'one': 491, 'hit': 339, 'spot': 663, 'there': 705, 'reason': 566, 'flip': 256, 'tip': 719, 'sip': 637, 'zing': 812, 'thing': 707, 'frugo': 273, 'today': 722, 'energy': 208, 'vague': 756, 'ask': 22, 'haig': 314, 'two': 740, 'us': 753, 'power': 542, 'back': 30, 'flavour': 255, 'jungle': 380, 'beyond': 55, 'expectation': 221, 'afore': 9, 'ye': 803, 'the': 703, 'queen': 558, 'table': 686, 'waters': 773, 'heart': 331, 'bitte': 62, 'ein': 204, 'bit': 60, 'ya': 802, 'bon': 70, 'buddy': 90, 'bean': 39, 'planet': 530, 'mabel': 430, 'label': 396, 'game': 279, 
'manomanischewitz': 435, 'aye': 28, 'hours': 347, 'relentless': 575, 'no': 479, 'do': 188, 'hype': 351, 'share': 622, 'ummph': 744, 'future': 278, 'light': 410, 'new': 476, 'art': 21, 'dine': 185, 'differ': 182, 'tastefully': 692, 'thunder': 712, 'straight': 678, 'vegetable': 758, 'expertise': 224, 'family': 227, 'traditi': 730, 'cutty': 170, 'vary': 757, 'chivas': 125, 'raise': 563, 'rum': 597, 'awaken': 26, 'courage': 153, 'sit': 638, 'savour': 604, '1664': 1, 'brewing': 82, 'kootenays': 394, 'name': 466, 'filter': 244, 'guinness': 310, 'everywhere': 218, 'day': 174, 'passionate': 508, 'fill': 242, 'rim': 586, 'brim': 83, 'fiddler': 241, 'pay': 510, 'pour': 541, 'something': 653, 'priceless': 547, 'unusual': 751, 'remix': 578, 'double': 193, 'society': 651, 'enjoy': 209, 'contradiction': 144, 'fun': 276, 'calling': 101, 'exquisite': 225, 'wodka': 791, 'ordinary': 495, 'whisky': 783, 'looking': 420, 'sweet': 684, 'purity': 554, 'aged': 10, 'longer': 418, 'smoother': 648, 'tidings': 714, 'blend': 64, 'put': 555, 'talk': 688, 'cheer': 114, 'sense': 617, 'way': 775, 'obey': 487, 'growing': 307, 'karma': 381, 'wake': 767, 'special': 659, 'indulgence': 358, 'jar': 371, 'part': 505, 'wakin': 768, 'home': 341, 'brew': 81, 'refreshingly': 572, 'lucozade': 427, 'aids': 12, 'recovery': 568, 'moment': 454, 'celestial': 106, 'golden': 291, 'britain': 85, 'favourite': 235, 'green': 302, 'fine': 246, 'earth': 199, 'matter': 437, 'kettle': 384, 'everyones': 216, 'proper': 549, 'red': 569, 'rose': 591, 'begin': 48, 'zealand': 810, 'ten': 698, 'rens': 579, 'tetley': 701, 'petillant': 522, 'water': 772, 'must': 465, 'perrier': 520, 'ultimate': 743, 'refreshment': 573, 'magnesium': 432, 'hungry': 348, 'naya': 470, 'pellegrino': 513, 'italian': 368, 'volcanicity': 764, 'body': 69, 'happy': 323, 'know': 393, 'tequila': 700, 'switch': 685, 'ireland': 366, 'single': 636, 'source': 655, 'inspiration': 364, 'greatest': 301, 'nurture': 484, 'ever': 212, 'boaring': 68, 'mix': 451, 
'accordingly': 6, 'tell': 697, 'gift': 284, 'whiskey': 782, 'please': 533, 'bourbon': 73, 'keep': 383, 'cooped': 148, 'rolling': 589, 'rock': 587, 'flowing': 258, 'gorgeous': 294, 'beck': 43, 'beckon': 44, 'forget': 262, 'reassuringly': 567, 'expensive': 222, 'paulaner': 509, 'mile': 442, 'away': 27, 'hooray': 344, 'wouldnt': 800, 'give': 287, 'xxxx': 801, 'anything': 17, 'germany': 282, 'funloving': 277, 'travel': 731, 'bad': 31, 'frosty': 272, 'mug': 463, 'sensation': 616, 'slow': 642, 'crisp': 161, 'stuff': 680, 'schhh': 607, 'grip': 305, 'johnnie': 376, 'walker': 769, 'walking': 770, 'last': 398, 'weve': 778, 'kid': 385, 'crown': 162, 'dirty': 186, 'martini': 436, 'nice': 477, 'hawaiian': 328, 'punch': 551, 'itll': 370, 'tickle': 713, 'innards': 361, 'absolut': 5, 'perfection': 518, 'responsibly': 580, 'worth': 798, 'weight': 776, 'delicious': 176, 'pepsi': 516, 'inner': 362, 'carabao': 105, 'uncola': 745, 'probably': 548, 'typhoo': 741, 'worst': 797, 'could': 150, 'happen': 322, 'wiiiings': 787, 'smoke': 646, 'chupa': 129, 'chups': 130, 'mint': 448, 'toblerone': 721, 'daily': 171, 'door': 192, 'add': 7, 'flavor': 254, 'donut': 191, 'favorite': 234, 'well': 777, 'misspend': 449, 'grow': 306, 'ugly': 742, 'friday': 270, 'monster': 456, 'chew': 119, 'trix': 733, 'wafer': 766, 'break': 79, 'layer': 400, 'flake': 253, 'shatter': 626, 'hunting': 350, 'season': 610, 'naturally': 468, 'remarkable': 577, 'chips': 124, 'ingredient': 360, 'whensa': 780, 'dolmio': 189, 'stimulate': 674, 'pork': 539, 'bull': 91, 'come': 139, 'colmans': 138, 'fire': 248, 'bread': 78, 'boursin': 74, 'butter': 96, 'step': 672, 'ahead': 11, 'big': 56, 'chocolate': 126, 'unfold': 748, 'dangerously': 173, 'cheesy': 117, 'wheres': 781, 'filling': 243, 'bake': 34, 'farm': 230, 'oreo': 496, 'eat': 201, 'freak': 265, 'ruffle': 595, 'rrridges': 594, 'zem': 811, 'ing': 359, 'cake': 100, 'stop': 675, 'baking': 35, 'guarantee': 309, 'branston': 77, 'pretty': 545, 'without': 790, 'full': 275, 'burger': 
93, 'want': 771, 'handle': 318, 'crunch': 163, 'san': 602, 'francisco': 264, 'treat': 732, 'cent': 107, 'cant': 104, 'theyre': 706, 'icecreamalicious': 353, 'toast': 720, 'grain': 299, 'oat': 486, 'cereal': 109, 'kidtested': 386, 'parentapproved': 503, 'blood': 65, 'youll': 807, 'monstermad': 457, 'honey': 343, 'girl': 286, 'felt': 240, 'bubble': 88, 'melt': 441, 'center': 108, 'tootsie': 726, 'beef': 45, 'sing': 635, 'thats': 702, 'spell': 660, 'relief': 576, 'room': 590, 'jello': 374, 'beanz': 40, 'meanz': 438, 'heinz': 333, 'blue': 66, 'bonnet': 71, 'breakfast': 80, 'bird': 58, 'eye': 226, 'country': 152, 'quick': 559, 'tasty': 694, 'everyone': 215, 'candy': 103, 'bar': 37, 'idaho': 356, 'ich': 354, 'bin': 57, 'gourmeggle': 297, 'thee': 704, 'sure': 683, 'save': 603, 'shursave': 630, 'store': 677, 'shreddie': 629, 'grin': 304, 'aah': 4, 'bisto': 59, 'mouth': 460, 'meat': 439, 'mm': 452, 'champion': 110, 'grrreat': 308, 'rainbow': 562, 'small': 643, 'piece': 525, 'norway': 480, 'wonderland': 793, 'k¼sschen': 395, 'quality': 556, 'square': 665, 'child': 123, 'adult': 8, 'pur': 552, 'chicken': 121, 'tonight': 725, 'hobnob': 340, 'underneath': 746, 'bag': 32, 'hate': 326, 'top': 727, 'knock': 392, 'roll': 588, 'milky': 444, 'lunch': 428, 'reeses': 570, 'passion': 507, 'talking': 689, 'butteryness': 97, 'habit': 313, 'ought': 499, 'congratulate': 141, 'spread': 664, 'slice': 640, 'fit': 251, 'luxury': 429, 'utterly': 755, 'smite': 645, 'perfectly': 519, 'pair': 502, 'sharing': 623, 'genuine': 281, 'music': 464, 'lip': 413, 'creamy': 159, 'seeker': 615, 'would': 799, 'klondike': 391, 'cows': 154, 'eight': 203, 'round': 592, 'long': 417, 'unwrap': 752, 'britton': 87, 'ryvita': 600, 'simple': 632, 'magically': 431, 'unexplainably': 747, 'juicy': 379, 'father': 233, 'use': 754, 'originality': 498, 'scoop': 608, 'irresistible': 367, 'graeters': 298, 'open': 492, 'heaven': 332, 'cornish': 149, 'goodness': 293, 'tomato': 724, 'hunt': 349, 'buy': 98, 'bega': 47, 'choice': 
127, 'coonoisseurs': 147, 'jarlsberg': 372, 'herd': 336, 'dairylea': 172, 'years': 804, 'farming': 231, 'pilgrim': 526, 'anythings': 18, 'possible': 540, 'popsicle': 538, 'ride': 584, 'ending': 207, 'return': 582, 'classics': 131, 'bowl': 75, 'perry': 521, 'floaty': 257, 'betcha': 52, 'stops': 676, 'character': 112, 'laughter': 399, 'cheese': 115, 'joy': 377, 'tenderness': 699, 'timeless': 716, 'emotion': 206, 'hello': 334, 'hershey': 337, 'choose': 128, 'shell': 627, 'nestle': 474, 'hands': 319, 'together': 723, 'else': 205, 'fast': 232, 'cheap': 113, 'bagful': 33, 'find': 245, 'hankerin': 321, 'henry': 335, 'id': 355, 'ruther': 599, 'druthers': 197, 'restaurant': 581, 'aurelios': 23, 'money': 455, 'devilish': 181, 'foster': 263, 'wonderful': 792, 'mity': 450, 'hamburger': 317, 'right': 585, 'feed': 236, 'kitchen': 390, 'steakburgers': 671, 'canadian': 102, 'que': 557, 'chez': 120, 'flunch': 259, 'quon': 560, 'peut': 523, 'fluncher': 260, 'curb': 168, 'service': 620, 'pub': 550, 'boy': 76, 'nowhere': 483, 'chef': 118, 'burrito': 94, 'cheesesteak': 116, 'crave': 156, 'rule': 596, 'gotta': 295, 'leave': 402, 'since': 634, '1968': 2, 'trust': 736, 'neighbourhood': 472, 'grill': 303, 'eats': 202, 'scenic': 606, 'view': 762, 'lotz': 422, 'fish': 250, 'serious': 618, 'delivery': 179, 'feelings': 239, 'gottahava': 296, 'wawa': 774, 'italy': 369, 'cooking': 145, 'wiener': 786, 'lowpriced': 426, 'stand': 666, 'yogurt': 806, 'besttasting': 51, 'tuna': 737, 'nearly': 471, 'wings': 789, 'pretzel': 546, 'sonic': 654, 'short': 628, 'steak': 670, 'seafood': 609, 'salad': 601, 'prepare': 544, 'order': 494, '2am': 3, 'youre': 809, 'drunk': 196, 'party': 506, 'popeyes': 537, 'pickle': 524, 'instant': 365, 'mor': 458, 'chikin': 122, 'snap': 650, 'crackle': 155, 'loosen': 421, 'making': 434, 'american': 16, 'slicing': 641, 'freshness': 269, 'think': 709, 'outside': 501, 'bun': 92, 'star': 667, 'kit': 389, 'kat': 382, 'lovin': 425}
In [21]:
# Vector representation of all msgs
all_msgs_vector = count_vectorizer.transform(cleaned_df[col])
print(all_msgs_vector)
  (0, 133)	1
  (0, 424)	1
  (1, 219)	1
  (1, 415)	1
  (1, 808)	1
  (2, 180)	1
  (2, 183)	1
  (2, 433)	1
  (3, 46)	1
  (4, 143)	1
  (4, 403)	1
  (5, 53)	1
  (5, 261)	1
  (5, 288)	1
  (5, 292)	1
  (5, 515)	1
  (5, 691)	1
  (5, 718)	1
  (6, 190)	1
  (6, 316)	1
  (6, 708)	1
  (7, 13)	1
  (7, 292)	1
  (7, 342)	1
  (8, 111)	1
  :	:
  (557, 341)	1
  (558, 261)	1
  (558, 323)	1
  (558, 434)	1
  (558, 514)	1
  (559, 16)	1
  (559, 261)	1
  (560, 184)	1
  (560, 653)	1
  (561, 269)	1
  (561, 641)	1
  (562, 585)	1
  (562, 691)	1
  (563, 92)	1
  (563, 501)	1
  (563, 709)	1
  (564, 261)	1
  (564, 667)	1
  (565, 775)	1
  (566, 79)	1
  (566, 382)	1
  (566, 389)	1
  (567, 201)	1
  (567, 268)	1
  (568, 425)	1
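Each printed entry has the form `(row, column) count`, where the column is a vocabulary index. A minimal sketch of decoding those indices back to words, on a toy two-slogan corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ['coffee lover', 'evian live young']
vec = CountVectorizer().fit(toy)
matrix = vec.transform(toy)  # sparse (n_docs, n_terms) count matrix

# Invert vocabulary_ to map column index -> word
index_to_word = {i: w for w, i in vec.vocabulary_.items()}
rows, cols = matrix.nonzero()
for r, c in zip(rows, cols):
    print(f'doc {r}: {index_to_word[c]} x{matrix[r, c]}')
```

The vocabulary indices are assigned alphabetically, which is why 'coffee' gets a lower index than 'young' here.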

Applying TF-IDF algorithm

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer
In [23]:
tfidf_transformer = TfidfTransformer().fit(all_msgs_vector)
In [24]:
messages_tfidf = tfidf_transformer.transform(all_msgs_vector)
print(messages_tfidf)
  (0, 424)	0.8120610477121847
  (0, 133)	0.583572493173375
  (1, 808)	0.6170194344433083
  (1, 415)	0.4884404109392671
  (1, 219)	0.6170194344433083
  (2, 433)	0.3996201195064444
  (2, 183)	0.6275069936630012
  (2, 180)	0.6682355370598582
  (3, 46)	1.0
  (4, 403)	0.7071067811865476
  (4, 143)	0.7071067811865476
  (5, 718)	0.4643575542189262
  (5, 691)	0.27533002873032236
  (5, 515)	0.4643575542189262
  (5, 292)	0.27533002873032236
  (5, 288)	0.4643575542189262
  (5, 261)	0.32371314078907243
  (5, 53)	0.31098671765630925
  (6, 708)	0.5847631354795332
  (6, 316)	0.6227173750031979
  (6, 190)	0.5198799344591708
  (7, 342)	0.6103629191218752
  (7, 292)	0.4039941472967685
  (7, 13)	0.6813558805139794
  (8, 598)	0.4714330618940757
  :	:
  (557, 84)	0.6817816882288547
  (558, 514)	0.5127066510144902
  (558, 434)	0.5946055526513928
  (558, 323)	0.460169742610059
  (558, 261)	0.41451168228148844
  (559, 261)	0.5718758868245777
  (559, 16)	0.8203401550994579
  (560, 653)	0.7071067811865476
  (560, 184)	0.7071067811865476
  (561, 641)	0.7071067811865476
  (561, 269)	0.7071067811865476
  (562, 691)	0.5519405422014124
  (562, 585)	0.8338834678025526
  (563, 709)	0.5773502691896257
  (563, 501)	0.5773502691896257
  (563, 92)	0.5773502691896257
  (564, 667)	0.8203401550994579
  (564, 261)	0.5718758868245777
  (565, 775)	1.0
  (566, 389)	0.5890699532425786
  (566, 382)	0.5890699532425786
  (566, 79)	0.553166503300383
  (567, 268)	0.7200021188983894
  (567, 201)	0.6939718645462722
  (568, 425)	1.0
In [25]:
print(messages_tfidf.shape)
(569, 813)

Sorting by the IDF score should be the best bet, since a lower IDF score means the word appears in more slogans.

I will turn the IDF scores into a hashmap (dict) and then use the keys, sorted by value, to find the words
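With scikit-learn's default smoothing, the IDF is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) how many contain term t, so rarer words score higher. A quick check on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ['good taste', 'good food', 'beer']
counts = CountVectorizer().fit_transform(toy)  # vocab (alphabetical): beer, food, good, taste
idf = TfidfTransformer().fit(counts).idf_

# 'good' is in 2 of 3 docs -> ln(4/3) + 1; 'beer' in 1 of 3 -> ln(4/2) + 1
print(idf)
```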

In [26]:
idf_score_dict = {}

for i,val in enumerate(tfidf_transformer.idf_):
    idf_score_dict[i] = val   
In [27]:
import operator
sorted_idf_score_dict = dict(sorted(idf_score_dict.items(), key=operator.itemgetter(1)))
In [28]:
top_10_idf_score_list = list(sorted_idf_score_dict.keys())[0:10]
top_10_idf_score_list
Out[28]:
[292, 691, 433, 46, 53, 283, 50, 261, 409, 695]
In [29]:
unique_word_list = count_vectorizer.get_feature_names_out() # use get_feature_names() on scikit-learn < 1.0

for i in top_10_idf_score_list:
    try:
        print(unique_word_list[i], idf_score_dict[i])
    except Exception as ex:
        print("no word matching the index")
good 3.9444389791664403
taste 3.9444389791664403
make 3.9783405308421216
beer 4.455264602932431
better 4.455264602932431
get 4.573047638588815
best 4.637586159726386
food 4.637586159726386
life 4.637586159726386
tea 4.706579031213337

From the result, it looks like the dataset contains a lot of slogans for food and drink products

Verifying my results for some of these words

In [30]:
cleaned_df[cleaned_df['Slogan'].str.contains("good")]
Out[30]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
7 Batemans Brewery good honest ale
34 KFC(Kentucky Fried Chicken) finger lickin good
120 Sunkist (soft drink) good vibration
177 United Breweries king good times
296 Labatt Brewing Company good things brewing
321 Guinness guinness good
360 Jamba Juice good tidings blend
368 Lavazza good karma great coffee
400 Yogi Tea good feel
405 Kericho Gold matter good taste
409 Red Rose Tea red rose tea good tea
458 Wild Turkey (bourbon) good keep cooped
485 Paulaner Brewery good better paulaner
492 Bad Frog beer good bad
521 Maxwell House good last drop
531 Naked Juice worth weight good
718 Campbell's mm mm good
743 Mamee Double-Decker world good taste
760 Smucker good
785 Country Crock good habit delicious
836 Swensen's good father use make
865 Vadilal nothing goodness
872 Dairylea (cheese) herd dairylea goodness
874 Pilgrims Choice good choice pilgrim
920 Nestles nestle good food good life
945 Heinz good food every day
967 Applebee's together good
981 Orange Julius devilish good drink
1055 Checkers and Rally's crazy good food
1059 Village Inn good food good feelings
In [31]:
cleaned_df[cleaned_df['Slogan'].str.contains("taste")]
Out[31]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
39 Coca-Cola taste feeling
47 Campa Cola great indian taste
56 Crystal Pepsi never see taste like
83 Dr. Brown's taste town
90 Lilt totally tropical taste
92 Fresca nothing taste like fresca
99 Jolly Cola free taste
127 Sutter Home Winery taste commitment
133 Staropramen Brewery get taste prague
137 Minute Maid load taste
164 Disaronno taste seduction
208 Gilbey's gin taste smooth gin today
283 Tapal Tea differ tastefully
284 Thums Up taste thunder
298 Old Milwaukee taste great name
320 Amstel Brewery taste life pure filter
338 Bisleri sweet taste purity
339 Evan Williams aged longer taste smoother
376 Gold Peak Tea home brew taste
405 Kericho Gold matter good taste
491 Dos Equis let taste travel
640 Walkers fresh taste guarantee
741 Skittles taste rainbow
743 Mamee Double-Decker world good taste
798 Weis taste everyones lip
862 Kraft Foods little taste heaven
1008 Spangles taste better
1022 Steers taste better
1132 Wendy's taste right
In [32]:
cleaned_df[cleaned_df['Slogan'].str.contains("food")]
Out[32]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
95 Horlicks food drink night
587 Dan-D Foods fine food earth
588 Canyon Creek Food Company favorite food make easy
920 Nestles nestle good food good life
945 Heinz good food every day
969 Winky's fast food cheap
1013 Brewers Fayre pub food
1055 Checkers and Rally's crazy good food
1059 Village Inn good food good feelings
1096 ConAgra Foods food love
1098 Sizzler great steak seafood salad
1128 Carl's Jr. making people happy food
1129 A&W w american food
1136 Hardee's food star

These words do appear frequently across the slogans, which is consistent with the IDF ranking
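One caveat: str.contains("good") matches substrings too, which is why rows like "goodness" (865, 872) appear above. A regex with word boundaries restricts the match to the whole word:

```python
import pandas as pd

s = pd.Series(['nothing goodness', 'good food good feelings'])
print(s.str.contains(r'\bgood\b'))  # only the second row matches
```

For a quick sanity check this looseness is harmless, but the word-boundary form gives counts that line up exactly with the vectorizer's tokens.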