This is an exercise in finding the most frequent words in a corpus using CountVectorizer and TfidfTransformer

This could also serve as a feature-engineering step if needed

In [1]:
import numpy as np
import pandas as pd
In [2]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.express as px
import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()
In [3]:
import nltk # Imports the library
nltk.download('stopwords') # Stopword list used during cleaning
nltk.download('wordnet') # WordNet corpus, used for lemmatization via morphy
Out[3]:
True

I am using the slogan dataset from Kaggle

In [4]:
df = pd.read_csv('sloganlist.csv')
In [5]:
df.head(10)
Out[5]:
Company Slogan
0 Costa Coffee For coffee lovers.
1 Evian Evian. Live young.
2 Dasani Designed to make a difference.
3 Heineken It's all about the beer.
4 Gatorade The Legend Continues.
5 Tío Pepe Good food tastes better after a glass of Tio Pepe
6 Tetley's Brewery Don't Do Things By Halves.
7 Batemans Brewery Good Honest Ales
8 Jones Soda Run with the little guy… create some change.
9 Grapette Thirsty or Not.
In [6]:
df.describe()
Out[6]:
Company Slogan
count 1162 1162
unique 569 568
top Domino's Pizza Taste The Feeling.
freq 31 31

From describe() it looks like the CSV file contains many duplicate rows, so I'll remove them

In [7]:
df = df.drop_duplicates() 
df
Out[7]:
Company Slogan
0 Costa Coffee For coffee lovers.
1 Evian Evian. Live young.
2 Dasani Designed to make a difference.
3 Heineken It's all about the beer.
4 Gatorade The Legend Continues.
... ... ...
1136 Hardee's Where the food's the star.
1137 Burger King Have it your way.
1138 Kit Kat Have a Break, Have a Kit Kat.
1139 Subway Eat Fresh.
1141 McDonalds I’m Lovin’ It.

569 rows × 2 columns
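With no arguments, drop_duplicates() only removes rows where every column matches; subset= narrows the comparison and keep= picks which copy survives. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['Subway', 'Subway', 'Evian'],
                   'Slogan':  ['Eat Fresh.', 'Eat fresh!', 'Live young.']})

deduped = df.drop_duplicates()                     # no fully identical rows -> 3 rows kept
by_company = df.drop_duplicates(subset='Company')  # one row per company -> 2 rows kept
print(len(deduped), len(by_company))
```

Here the full-row dedup keeps all three rows because the two Subway slogans differ, while subset='Company' collapses them to one.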

In [8]:
df.describe()
Out[8]:
Company Slogan
count 569 569
unique 569 568
top Del Taco Exquisite wodka.
freq 1 2
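For object (string) columns, describe() reports count, unique, top (the most frequent value), and freq (how often it occurs) rather than numeric statistics. A quick illustration:

```python
import pandas as pd

s = pd.Series(['Exquisite wodka.', 'Exquisite wodka.', 'Eat Fresh.'])
summary = s.describe()
print(summary)  # count 3, unique 2, top 'Exquisite wodka.', freq 2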
In [9]:
df[df['Slogan'] == 'Exquisite wodka.']
Out[9]:
Company Slogan
334 Wyborowa Vodka Exquisite wodka.
448 Stolichnaya vodka Exquisite wodka.

Two different vodka brands happen to share the same slogan, so I'll leave those rows as they are


Let's process the slogans

In [10]:
# Abbreviations and contraction remnants to drop (e.g. "I've" becomes "ive" once punctuation is stripped)
abbrv_list = ['lol', 'lmao', 'rofl', 'ive', 'youve', 'brb', 'ttyl', 'im']
special_chars_list = ['â', '™', '€', 'Ã', '©']
In [11]:
import string
from nltk.corpus import stopwords, wordnet


def text_process(msg):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation and special chars
    2. Replace abbrv with words
    3. Remove all stopwords
    4. Lemmatize words by removing plurals
    5. Returns a list of the cleaned text
    """
    no_punc = []
    for char in msg:
        if char not in string.punctuation and char not in special_chars_list:
            no_punc.append(char)
       
    no_punc = ''.join(no_punc)
    no_punc_word_list = no_punc.split()
    
    cleaned_msg = []
    for word in no_punc_word_list:
        
        if word.lower() not in stopwords.words('english') and word.lower() not in abbrv_list:
            word_lower_case = word.lower()
            word_lemmatized = wordnet.morphy(word_lower_case)
            
            if word_lemmatized is None:
                use_word = word_lower_case
            else:
                use_word = word_lemmatized
                
            cleaned_msg.append(use_word)
    
    cleaned_msg = ' '.join(cleaned_msg)

    return cleaned_msg
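As a self-contained sketch of the same steps (a hypothetical helper using a tiny hard-coded stopword list instead of NLTK's full English list, and skipping lemmatization):

```python
import string

# Tiny stand-in stopword list; the notebook uses nltk's full English list.
MINI_STOPWORDS = {'its', 'all', 'about', 'the', 'a', 'is', 'it', 'of', 'to', 'for'}

def mini_text_process(msg):
    """Strip punctuation, lowercase, and drop stopwords (no lemmatization)."""
    no_punc = ''.join(ch for ch in msg if ch not in string.punctuation)
    words = [w.lower() for w in no_punc.split()]
    return ' '.join(w for w in words if w not in MINI_STOPWORDS)

print(mini_text_process("It's all about the beer."))  # -> beer
```

Unlike text_process above, this keeps plurals ("lovers" stays "lovers"), since it omits the wordnet.morphy step.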
In [12]:
col = 'Slogan'
In [16]:
cleaned_df = df.copy() # work on a copy so the original frame is untouched
cleaned_df[col] = cleaned_df[col].apply(text_process)

Here are my cleaned slogans

In [17]:
cleaned_df[col]
Out[17]:
0                 coffee lover
1             evian live young
2       design make difference
3                         beer
4              legend continue
                 ...          
1136                 food star
1137                       way
1138             break kit kat
1139                 eat fresh
1141                     lovin
Name: Slogan, Length: 569, dtype: object

Vectorization of Slogans

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
In [19]:
# tokenize and build vocab
count_vectorizer = CountVectorizer().fit(cleaned_df[col])
In [20]:
# summarize
print(count_vectorizer.vocabulary_)
{'coffee': 133, 'lover': 424, 'evian': 219, 'live': 415, 'young': 808, 'design': 180, 'make': 433, 'difference': 183, 'beer': 46, 'legend': 403, 'continue': 143, 'good': 292, 'food': 261, 'taste': 691, 'better': 53, 'glass': 288, 'tio': 718, 'pepe': 515, 'dont': 190, 'things': 708, 'half': 316, 'honest': 342, 'ale': 13, 'run': 598, 'little': 414, 'guy': 312, 'create': 160, 'change': 111, 'thirsty': 711, 'juice': 378, 'jiffy': 375, 'bring': 84, 'crazy': 157, 'genki': 280, 'hatsuratsu': 327, 'hey': 338, 'culligan': 165, 'belvedere': 49, 'always': 14, 'go': 289, 'smoothly': 649, 'pure': 553, 'spirit': 661, 'vodka': 763, 'take': 687, 'white': 784, 'horse': 345, 'anywhere': 19, 'friend': 271, 'old': 490, 'parr': 504, 'royal': 593, 'race': 561, 'place': 529, 'strong': 679, 'healthier': 330, 'gum': 311, 'drink': 194, 'fanta': 229, 'stay': 669, 'bamboocha': 36, 'great': 300, 'head': 329, 'first': 249, 'pizza': 528, 'deliver': 178, 'expert': 223, 'peace': 511, 'love': 423, 'ice': 352, 'cream': 158, 'milo': 445, 'everyday': 214, 'customize': 169, 'cup': 167, 'vorsprung': 765, 'durch': 198, 'technik': 696, 'evolve': 220, 'skincare': 639, 'get': 283, 'nut': 485, 'start': 668, 'nescaf': 473, 'finger': 247, 'lickin': 408, 'original': 497, 'jean': 373, 'people': 514, 'tea': 695, 'secret': 611, 'beautiful': 42, 'hair': 315, 'discover': 187, 'baby': 29, 'world': 796, 'feeling': 238, 'relax': 574, 'times': 717, 'suntory': 682, 'time': 715, 'sharpen': 624, '100': 0, 'natural': 467, 'whole': 785, 'fruit': 274, 'british': 86, 'pop': 536, 'enuf': 211, 'enough': 210, 'different': 184, 'soft': 652, 'indian': 357, 'cola': 135, 'kiss': 388, 'orange': 493, 'bluna': 67, 'official': 488, 'beverage': 54, 'twist': 739, 'thirst': 710, 'crusher': 164, 'everything': 217, 'going': 290, 'ok': 489, 'sugar': 681, 'twice': 738, 'caffeine': 99, 'never': 475, 'see': 613, 'like': 411, 'real': 565, 'every': 213, 'drop': 195, 'milk': 443, 'shake': 621, 'lemon': 404, 'lime': 412, 'conscience': 142, 'busy': 
95, 'fizzy': 252, 'unleash': 750, 'beast': 41, 'town': 729, 'uniquely': 749, 'southern': 656, 'deliciously': 177, 'hot': 346, 'cold': 136, 'moxie': 462, 'mine': 447, 'mello': 440, 'yello': 805, 'move': 461, 'veri': 760, 'lemoni': 406, 'totally': 728, 'tropical': 734, 'let': 407, 'nothing': 481, 'fresca': 267, 'barqs': 38, 'bite': 61, 'bucket': 89, 'night': 478, 'still': 673, 'best': 50, 'lemonade': 405, 'perfect': 517, 'gin': 285, 'black': 63, 'seed': 614, 'free': 266, 'vibration': 761, 'nature': 469, 'wont': 794, 'pluto': 535, 'hasta': 325, 'shasta': 625, 'clean': 132, 'sport': 662, 'culture': 166, 'pleasure': 534, 'say': 605, 'australian': 24, 'wine': 788, 'peak': 512, 'commitment': 140, 'outrageously': 500, 'smooth': 647, 'simply': 633, 'seriously': 619, 'easy': 200, 'ready': 564, 'tanqueray': 690, 'whats': 779, 'inside': 363, 'truly': 735, 'count': 151, 'prague': 543, 'southwold': 657, 'pint': 527, 'nothings': 482, 'fresh': 268, 'coldie': 137, 'america': 15, 'lager': 397, 'load': 416, 'look': 419, 'harp': 324, 'side': 631, 'born': 72, 'cool': 146, 'hang': 320, 'around': 20, 'milwaukee': 446, 'famous': 228, 'life': 409, 'cointreauversial': 134, 'seduction': 612, 'king': 387, 'plantation': 531, 'refresh': 571, 'sparkle': 658, 'tasting': 693, 'play': 532, 'moderation': 453, 'feel': 237, 'velvet': 759, 'smile': 644, 'awake': 25, 'ricor': 583, 'le': 401, 'work': 795, 'mother': 459, 'decaffeinate': 175, 'one': 491, 'hit': 339, 'spot': 663, 'there': 705, 'reason': 566, 'flip': 256, 'tip': 719, 'sip': 637, 'zing': 812, 'thing': 707, 'frugo': 273, 'today': 722, 'energy': 208, 'vague': 756, 'ask': 22, 'haig': 314, 'two': 740, 'us': 753, 'power': 542, 'back': 30, 'flavour': 255, 'jungle': 380, 'beyond': 55, 'expectation': 221, 'afore': 9, 'ye': 803, 'the': 703, 'queen': 558, 'table': 686, 'waters': 773, 'heart': 331, 'bitte': 62, 'ein': 204, 'bit': 60, 'ya': 802, 'bon': 70, 'buddy': 90, 'bean': 39, 'planet': 530, 'mabel': 430, 'label': 396, 'game': 279, 
'manomanischewitz': 435, 'aye': 28, 'hours': 347, 'relentless': 575, 'no': 479, 'do': 188, 'hype': 351, 'share': 622, 'ummph': 744, 'future': 278, 'light': 410, 'new': 476, 'art': 21, 'dine': 185, 'differ': 182, 'tastefully': 692, 'thunder': 712, 'straight': 678, 'vegetable': 758, 'expertise': 224, 'family': 227, 'traditi': 730, 'cutty': 170, 'vary': 757, 'chivas': 125, 'raise': 563, 'rum': 597, 'awaken': 26, 'courage': 153, 'sit': 638, 'savour': 604, '1664': 1, 'brewing': 82, 'kootenays': 394, 'name': 466, 'filter': 244, 'guinness': 310, 'everywhere': 218, 'day': 174, 'passionate': 508, 'fill': 242, 'rim': 586, 'brim': 83, 'fiddler': 241, 'pay': 510, 'pour': 541, 'something': 653, 'priceless': 547, 'unusual': 751, 'remix': 578, 'double': 193, 'society': 651, 'enjoy': 209, 'contradiction': 144, 'fun': 276, 'calling': 101, 'exquisite': 225, 'wodka': 791, 'ordinary': 495, 'whisky': 783, 'looking': 420, 'sweet': 684, 'purity': 554, 'aged': 10, 'longer': 418, 'smoother': 648, 'tidings': 714, 'blend': 64, 'put': 555, 'talk': 688, 'cheer': 114, 'sense': 617, 'way': 775, 'obey': 487, 'growing': 307, 'karma': 381, 'wake': 767, 'special': 659, 'indulgence': 358, 'jar': 371, 'part': 505, 'wakin': 768, 'home': 341, 'brew': 81, 'refreshingly': 572, 'lucozade': 427, 'aids': 12, 'recovery': 568, 'moment': 454, 'celestial': 106, 'golden': 291, 'britain': 85, 'favourite': 235, 'green': 302, 'fine': 246, 'earth': 199, 'matter': 437, 'kettle': 384, 'everyones': 216, 'proper': 549, 'red': 569, 'rose': 591, 'begin': 48, 'zealand': 810, 'ten': 698, 'rens': 579, 'tetley': 701, 'petillant': 522, 'water': 772, 'must': 465, 'perrier': 520, 'ultimate': 743, 'refreshment': 573, 'magnesium': 432, 'hungry': 348, 'naya': 470, 'pellegrino': 513, 'italian': 368, 'volcanicity': 764, 'body': 69, 'happy': 323, 'know': 393, 'tequila': 700, 'switch': 685, 'ireland': 366, 'single': 636, 'source': 655, 'inspiration': 364, 'greatest': 301, 'nurture': 484, 'ever': 212, 'boaring': 68, 'mix': 451, 
'accordingly': 6, 'tell': 697, 'gift': 284, 'whiskey': 782, 'please': 533, 'bourbon': 73, 'keep': 383, 'cooped': 148, 'rolling': 589, 'rock': 587, 'flowing': 258, 'gorgeous': 294, 'beck': 43, 'beckon': 44, 'forget': 262, 'reassuringly': 567, 'expensive': 222, 'paulaner': 509, 'mile': 442, 'away': 27, 'hooray': 344, 'wouldnt': 800, 'give': 287, 'xxxx': 801, 'anything': 17, 'germany': 282, 'funloving': 277, 'travel': 731, 'bad': 31, 'frosty': 272, 'mug': 463, 'sensation': 616, 'slow': 642, 'crisp': 161, 'stuff': 680, 'schhh': 607, 'grip': 305, 'johnnie': 376, 'walker': 769, 'walking': 770, 'last': 398, 'weve': 778, 'kid': 385, 'crown': 162, 'dirty': 186, 'martini': 436, 'nice': 477, 'hawaiian': 328, 'punch': 551, 'itll': 370, 'tickle': 713, 'innards': 361, 'absolut': 5, 'perfection': 518, 'responsibly': 580, 'worth': 798, 'weight': 776, 'delicious': 176, 'pepsi': 516, 'inner': 362, 'carabao': 105, 'uncola': 745, 'probably': 548, 'typhoo': 741, 'worst': 797, 'could': 150, 'happen': 322, 'wiiiings': 787, 'smoke': 646, 'chupa': 129, 'chups': 130, 'mint': 448, 'toblerone': 721, 'daily': 171, 'door': 192, 'add': 7, 'flavor': 254, 'donut': 191, 'favorite': 234, 'well': 777, 'misspend': 449, 'grow': 306, 'ugly': 742, 'friday': 270, 'monster': 456, 'chew': 119, 'trix': 733, 'wafer': 766, 'break': 79, 'layer': 400, 'flake': 253, 'shatter': 626, 'hunting': 350, 'season': 610, 'naturally': 468, 'remarkable': 577, 'chips': 124, 'ingredient': 360, 'whensa': 780, 'dolmio': 189, 'stimulate': 674, 'pork': 539, 'bull': 91, 'come': 139, 'colmans': 138, 'fire': 248, 'bread': 78, 'boursin': 74, 'butter': 96, 'step': 672, 'ahead': 11, 'big': 56, 'chocolate': 126, 'unfold': 748, 'dangerously': 173, 'cheesy': 117, 'wheres': 781, 'filling': 243, 'bake': 34, 'farm': 230, 'oreo': 496, 'eat': 201, 'freak': 265, 'ruffle': 595, 'rrridges': 594, 'zem': 811, 'ing': 359, 'cake': 100, 'stop': 675, 'baking': 35, 'guarantee': 309, 'branston': 77, 'pretty': 545, 'without': 790, 'full': 275, 'burger': 
93, 'want': 771, 'handle': 318, 'crunch': 163, 'san': 602, 'francisco': 264, 'treat': 732, 'cent': 107, 'cant': 104, 'theyre': 706, 'icecreamalicious': 353, 'toast': 720, 'grain': 299, 'oat': 486, 'cereal': 109, 'kidtested': 386, 'parentapproved': 503, 'blood': 65, 'youll': 807, 'monstermad': 457, 'honey': 343, 'girl': 286, 'felt': 240, 'bubble': 88, 'melt': 441, 'center': 108, 'tootsie': 726, 'beef': 45, 'sing': 635, 'thats': 702, 'spell': 660, 'relief': 576, 'room': 590, 'jello': 374, 'beanz': 40, 'meanz': 438, 'heinz': 333, 'blue': 66, 'bonnet': 71, 'breakfast': 80, 'bird': 58, 'eye': 226, 'country': 152, 'quick': 559, 'tasty': 694, 'everyone': 215, 'candy': 103, 'bar': 37, 'idaho': 356, 'ich': 354, 'bin': 57, 'gourmeggle': 297, 'thee': 704, 'sure': 683, 'save': 603, 'shursave': 630, 'store': 677, 'shreddie': 629, 'grin': 304, 'aah': 4, 'bisto': 59, 'mouth': 460, 'meat': 439, 'mm': 452, 'champion': 110, 'grrreat': 308, 'rainbow': 562, 'small': 643, 'piece': 525, 'norway': 480, 'wonderland': 793, 'k¼sschen': 395, 'quality': 556, 'square': 665, 'child': 123, 'adult': 8, 'pur': 552, 'chicken': 121, 'tonight': 725, 'hobnob': 340, 'underneath': 746, 'bag': 32, 'hate': 326, 'top': 727, 'knock': 392, 'roll': 588, 'milky': 444, 'lunch': 428, 'reeses': 570, 'passion': 507, 'talking': 689, 'butteryness': 97, 'habit': 313, 'ought': 499, 'congratulate': 141, 'spread': 664, 'slice': 640, 'fit': 251, 'luxury': 429, 'utterly': 755, 'smite': 645, 'perfectly': 519, 'pair': 502, 'sharing': 623, 'genuine': 281, 'music': 464, 'lip': 413, 'creamy': 159, 'seeker': 615, 'would': 799, 'klondike': 391, 'cows': 154, 'eight': 203, 'round': 592, 'long': 417, 'unwrap': 752, 'britton': 87, 'ryvita': 600, 'simple': 632, 'magically': 431, 'unexplainably': 747, 'juicy': 379, 'father': 233, 'use': 754, 'originality': 498, 'scoop': 608, 'irresistible': 367, 'graeters': 298, 'open': 492, 'heaven': 332, 'cornish': 149, 'goodness': 293, 'tomato': 724, 'hunt': 349, 'buy': 98, 'bega': 47, 'choice': 
127, 'coonoisseurs': 147, 'jarlsberg': 372, 'herd': 336, 'dairylea': 172, 'years': 804, 'farming': 231, 'pilgrim': 526, 'anythings': 18, 'possible': 540, 'popsicle': 538, 'ride': 584, 'ending': 207, 'return': 582, 'classics': 131, 'bowl': 75, 'perry': 521, 'floaty': 257, 'betcha': 52, 'stops': 676, 'character': 112, 'laughter': 399, 'cheese': 115, 'joy': 377, 'tenderness': 699, 'timeless': 716, 'emotion': 206, 'hello': 334, 'hershey': 337, 'choose': 128, 'shell': 627, 'nestle': 474, 'hands': 319, 'together': 723, 'else': 205, 'fast': 232, 'cheap': 113, 'bagful': 33, 'find': 245, 'hankerin': 321, 'henry': 335, 'id': 355, 'ruther': 599, 'druthers': 197, 'restaurant': 581, 'aurelios': 23, 'money': 455, 'devilish': 181, 'foster': 263, 'wonderful': 792, 'mity': 450, 'hamburger': 317, 'right': 585, 'feed': 236, 'kitchen': 390, 'steakburgers': 671, 'canadian': 102, 'que': 557, 'chez': 120, 'flunch': 259, 'quon': 560, 'peut': 523, 'fluncher': 260, 'curb': 168, 'service': 620, 'pub': 550, 'boy': 76, 'nowhere': 483, 'chef': 118, 'burrito': 94, 'cheesesteak': 116, 'crave': 156, 'rule': 596, 'gotta': 295, 'leave': 402, 'since': 634, '1968': 2, 'trust': 736, 'neighbourhood': 472, 'grill': 303, 'eats': 202, 'scenic': 606, 'view': 762, 'lotz': 422, 'fish': 250, 'serious': 618, 'delivery': 179, 'feelings': 239, 'gottahava': 296, 'wawa': 774, 'italy': 369, 'cooking': 145, 'wiener': 786, 'lowpriced': 426, 'stand': 666, 'yogurt': 806, 'besttasting': 51, 'tuna': 737, 'nearly': 471, 'wings': 789, 'pretzel': 546, 'sonic': 654, 'short': 628, 'steak': 670, 'seafood': 609, 'salad': 601, 'prepare': 544, 'order': 494, '2am': 3, 'youre': 809, 'drunk': 196, 'party': 506, 'popeyes': 537, 'pickle': 524, 'instant': 365, 'mor': 458, 'chikin': 122, 'snap': 650, 'crackle': 155, 'loosen': 421, 'making': 434, 'american': 16, 'slicing': 641, 'freshness': 269, 'think': 709, 'outside': 501, 'bun': 92, 'star': 667, 'kit': 389, 'kat': 382, 'lovin': 425}
In [21]:
# Vector representation of all msgs
all_msgs_vector = count_vectorizer.transform(cleaned_df[col])
print(all_msgs_vector)
  (0, 133)	1
  (0, 424)	1
  (1, 219)	1
  (1, 415)	1
  (1, 808)	1
  (2, 180)	1
  (2, 183)	1
  (2, 433)	1
  (3, 46)	1
  (4, 143)	1
  (4, 403)	1
  (5, 53)	1
  (5, 261)	1
  (5, 288)	1
  (5, 292)	1
  (5, 515)	1
  (5, 691)	1
  (5, 718)	1
  (6, 190)	1
  (6, 316)	1
  (6, 708)	1
  (7, 13)	1
  (7, 292)	1
  (7, 342)	1
  (8, 111)	1
  :	:
  (557, 341)	1
  (558, 261)	1
  (558, 323)	1
  (558, 434)	1
  (558, 514)	1
  (559, 16)	1
  (559, 261)	1
  (560, 184)	1
  (560, 653)	1
  (561, 269)	1
  (561, 641)	1
  (562, 585)	1
  (562, 691)	1
  (563, 92)	1
  (563, 501)	1
  (563, 709)	1
  (564, 261)	1
  (564, 667)	1
  (565, 775)	1
  (566, 79)	1
  (566, 382)	1
  (566, 389)	1
  (567, 201)	1
  (567, 268)	1
  (568, 425)	1
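Each printed entry has the form `(row, column) count`, where the column is a vocabulary index. A minimal sketch of decoding those indices back to words, on a toy two-slogan corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy = ['coffee lover', 'evian live young']
vec = CountVectorizer().fit(toy)
matrix = vec.transform(toy)  # sparse (n_docs, n_terms) count matrix

# Invert vocabulary_ to map column index -> word
index_to_word = {i: w for w, i in vec.vocabulary_.items()}
rows, cols = matrix.nonzero()
for r, c in zip(rows, cols):
    print(f'doc {r}: {index_to_word[c]} x{matrix[r, c]}')
```

The vocabulary indices are assigned alphabetically, which is why 'coffee' gets a lower index than 'young' here.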

Applying TF-IDF algorithm

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer
In [23]:
tfidf_transformer = TfidfTransformer().fit(all_msgs_vector)
In [24]:
messages_tfidf = tfidf_transformer.transform(all_msgs_vector)
print(messages_tfidf)
  (0, 424)	0.8120610477121847
  (0, 133)	0.583572493173375
  (1, 808)	0.6170194344433083
  (1, 415)	0.4884404109392671
  (1, 219)	0.6170194344433083
  (2, 433)	0.3996201195064444
  (2, 183)	0.6275069936630012
  (2, 180)	0.6682355370598582
  (3, 46)	1.0
  (4, 403)	0.7071067811865476
  (4, 143)	0.7071067811865476
  (5, 718)	0.4643575542189262
  (5, 691)	0.27533002873032236
  (5, 515)	0.4643575542189262
  (5, 292)	0.27533002873032236
  (5, 288)	0.4643575542189262
  (5, 261)	0.32371314078907243
  (5, 53)	0.31098671765630925
  (6, 708)	0.5847631354795332
  (6, 316)	0.6227173750031979
  (6, 190)	0.5198799344591708
  (7, 342)	0.6103629191218752
  (7, 292)	0.4039941472967685
  (7, 13)	0.6813558805139794
  (8, 598)	0.4714330618940757
  :	:
  (557, 84)	0.6817816882288547
  (558, 514)	0.5127066510144902
  (558, 434)	0.5946055526513928
  (558, 323)	0.460169742610059
  (558, 261)	0.41451168228148844
  (559, 261)	0.5718758868245777
  (559, 16)	0.8203401550994579
  (560, 653)	0.7071067811865476
  (560, 184)	0.7071067811865476
  (561, 641)	0.7071067811865476
  (561, 269)	0.7071067811865476
  (562, 691)	0.5519405422014124
  (562, 585)	0.8338834678025526
  (563, 709)	0.5773502691896257
  (563, 501)	0.5773502691896257
  (563, 92)	0.5773502691896257
  (564, 667)	0.8203401550994579
  (564, 261)	0.5718758868245777
  (565, 775)	1.0
  (566, 389)	0.5890699532425786
  (566, 382)	0.5890699532425786
  (566, 79)	0.553166503300383
  (567, 268)	0.7200021188983894
  (567, 201)	0.6939718645462722
  (568, 425)	1.0
In [25]:
print(messages_tfidf.shape)
(569, 813)

Sorting by the IDF score should be the best bet, since a lower IDF score means the word appears in more slogans.

I will turn the IDF scores into a hashmap (dict) and then use the keys, sorted by value, to find the words
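With scikit-learn's default smoothing, the IDF is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) how many contain term t, so rarer words score higher. A quick check on a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ['good taste', 'good food', 'beer']
counts = CountVectorizer().fit_transform(toy)  # vocab (alphabetical): beer, food, good, taste
idf = TfidfTransformer().fit(counts).idf_

# 'good' is in 2 of 3 docs -> ln(4/3) + 1; 'beer' in 1 of 3 -> ln(4/2) + 1
print(idf)
```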

In [26]:
idf_score_dict = {}

for i,val in enumerate(tfidf_transformer.idf_):
    idf_score_dict[i] = val   
In [27]:
import operator
sorted_idf_score_dict = dict(sorted(idf_score_dict.items(), key=operator.itemgetter(1)))
In [28]:
top_10_idf_score_list = list(sorted_idf_score_dict.keys())[0:10]
top_10_idf_score_list
Out[28]:
[292, 691, 433, 46, 53, 283, 50, 261, 409, 695]
In [29]:
unique_word_list = count_vectorizer.get_feature_names_out() # use get_feature_names() on scikit-learn < 1.0

for i in top_10_idf_score_list:
    try:
        print(unique_word_list[i], idf_score_dict[i])
    except Exception as ex:
        print("no word matching the index")
good 3.9444389791664403
taste 3.9444389791664403
make 3.9783405308421216
beer 4.455264602932431
better 4.455264602932431
get 4.573047638588815
best 4.637586159726386
food 4.637586159726386
life 4.637586159726386
tea 4.706579031213337

From the result, it looks like the dataset contains a lot of slogans for food and drink products

Verifying my results for some of these words

In [30]:
cleaned_df[cleaned_df['Slogan'].str.contains("good")]
Out[30]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
7 Batemans Brewery good honest ale
34 KFC(Kentucky Fried Chicken) finger lickin good
120 Sunkist (soft drink) good vibration
177 United Breweries king good times
296 Labatt Brewing Company good things brewing
321 Guinness guinness good
360 Jamba Juice good tidings blend
368 Lavazza good karma great coffee
400 Yogi Tea good feel
405 Kericho Gold matter good taste
409 Red Rose Tea red rose tea good tea
458 Wild Turkey (bourbon) good keep cooped
485 Paulaner Brewery good better paulaner
492 Bad Frog beer good bad
521 Maxwell House good last drop
531 Naked Juice worth weight good
718 Campbell's mm mm good
743 Mamee Double-Decker world good taste
760 Smucker good
785 Country Crock good habit delicious
836 Swensen's good father use make
865 Vadilal nothing goodness
872 Dairylea (cheese) herd dairylea goodness
874 Pilgrims Choice good choice pilgrim
920 Nestles nestle good food good life
945 Heinz good food every day
967 Applebee's together good
981 Orange Julius devilish good drink
1055 Checkers and Rally's crazy good food
1059 Village Inn good food good feelings
In [31]:
cleaned_df[cleaned_df['Slogan'].str.contains("taste")]
Out[31]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
39 Coca-Cola taste feeling
47 Campa Cola great indian taste
56 Crystal Pepsi never see taste like
83 Dr. Brown's taste town
90 Lilt totally tropical taste
92 Fresca nothing taste like fresca
99 Jolly Cola free taste
127 Sutter Home Winery taste commitment
133 Staropramen Brewery get taste prague
137 Minute Maid load taste
164 Disaronno taste seduction
208 Gilbey's gin taste smooth gin today
283 Tapal Tea differ tastefully
284 Thums Up taste thunder
298 Old Milwaukee taste great name
320 Amstel Brewery taste life pure filter
338 Bisleri sweet taste purity
339 Evan Williams aged longer taste smoother
376 Gold Peak Tea home brew taste
405 Kericho Gold matter good taste
491 Dos Equis let taste travel
640 Walkers fresh taste guarantee
741 Skittles taste rainbow
743 Mamee Double-Decker world good taste
798 Weis taste everyones lip
862 Kraft Foods little taste heaven
1008 Spangles taste better
1022 Steers taste better
1132 Wendy's taste right
In [32]:
cleaned_df[cleaned_df['Slogan'].str.contains("food")]
Out[32]:
Company Slogan
5 Tío Pepe good food taste better glass tio pepe
95 Horlicks food drink night
587 Dan-D Foods fine food earth
588 Canyon Creek Food Company favorite food make easy
920 Nestles nestle good food good life
945 Heinz good food every day
969 Winky's fast food cheap
1013 Brewers Fayre pub food
1055 Checkers and Rally's crazy good food
1059 Village Inn good food good feelings
1096 ConAgra Foods food love
1098 Sizzler great steak seafood salad
1128 Carl's Jr. making people happy food
1129 A&W w american food
1136 Hardee's food star

These words do appear frequently across the slogans, which is consistent with the IDF ranking
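One caveat: str.contains("good") matches substrings too, which is why rows like "goodness" (865, 872) appear above. A regex with word boundaries restricts the match to the whole word:

```python
import pandas as pd

s = pd.Series(['nothing goodness', 'good food good feelings'])
print(s.str.contains(r'\bgood\b'))  # only the second row matches
```

For a quick sanity check this looseness is harmless, but the word-boundary form gives counts that line up exactly with the vectorizer's tokens.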