A user has fitted a CountVectorizer to some documents in scikit-learn. He would like to see all the terms and their corresponding frequencies in the text corpus in order to select stop-words. For example: 'and' 123 times, 'to' 100 times, 'for' 90 times, and so on. Is there a built-in function for this?

Asked by GayatriJaiteley in Data Science on Dec 11, 2019

If cv is the fitted CountVectorizer and X is the vectorized corpus, then the following code should work:

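import numpy as np

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
list(zip(cv.get_feature_names_out(), np.asarray(X.sum(axis=0)).ravel()))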

It will return a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus.

However, the pairs will not be sorted by frequency. Another way of doing it, which also sorts the result, is given below:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]

vec = CountVectorizer().fit(texts)

bag_of_words = vec.transform(texts)

sum_words = bag_of_words.sum(axis=0)

words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

sorted(words_freq, key = lambda x: x[1], reverse=True)

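The above code gives the following output (terms with equal counts may appear in a different order, and newer NumPy versions may display each count as np.int64(...)):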

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
