How to sort word frequency (from a file) in decreasing order? I need help

Question

alivip 0 Newbie Poster

17 Years Ago

I want to find the 10 most frequent words in a specific file. For that I have written this code:

import sys
import string
import re
file = open ( "corpora.txt", "r" )
text = file.read ( )
file.close ( )

word_freq ={ }

word_list = string.split ( text )

for word in word_list:
    count = word_freq.get ( string.lower ( word ), 0 )
    word_freq[string. lower ( word )] = count + 1

keys = word_freq.keys ( )
keys.sort ( )
i=0
while i<10:
 for word in keys:
     
     print word, word_freq[word]
     i=1+1

but it only gets the words and their frequencies.
Sample output:

blanklines 2
blanklines, 1
characters 1
console 1
count 3
blanklines 2
blanklines, 1
characters 1
console 1
count 3

and it reads the file again and again. Also, it does not sort the
output, as you can see.

How can I sort the output in decreasing order, so that I can stop printing after 10 words?

Please help me ASAP.

python


ZZucker 342 Practically a Master Poster

17 Years Ago

There are mistakes; for example, your last line should be i = i + 1. Also, the string functions have been built-in string methods since version 2.2, and module re is not needed.

Here is one way to do this with version 2.5:

# count words in a text and show the first ten items
# by decreasing frequency

# sample text for testing
text = """\
My name is Fred Flintstone and I am a famous TV
star.  I have as much authority as the Pope, I
just don't have as many people who believe it.
"""

word_freq = {}

word_list = text.split()

for word in word_list:
    # word all lower case
    word = word.lower()
    # strip any trailing period or comma
    word = word.rstrip('.,')
    # build the dictionary
    count = word_freq.get(word, 0)
    word_freq[word] = count + 1

# create a list of (freq, word) tuples
freq_list = [(freq, word) for word, freq in word_freq.items()]

# sort the list by the first element in each tuple (default)
freq_list.sort(reverse=True)

for n, tup in enumerate(freq_list):
    # print the first ten items
    if n < 10:
        freq, word = tup
        print freq, word
        # or
        #print word, freq

"""
my output -->
3 i
3 as
2 have
1 who
1 tv
1 the
1 star
1 pope
1 people
1 name
"""

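On Python 3, much the same counting can be sketched with collections.Counter from the standard library. This is only an illustrative variant of the loop above, not a replacement for it; the function name top_words and its parameter n are made up for the example.

```python
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs.

    Hypothetical helper for illustration; it lower-cases each word and
    strips trailing periods and commas, like the loop above.
    """
    words = (w.lower().rstrip('.,') for w in text.split())
    # drop tokens that consisted only of punctuation
    counts = Counter(w for w in words if w)
    return counts.most_common(n)


sample = "I have as much authority as the Pope, I just don't have as many."
for word, freq in top_words(sample, 3):
    print(freq, word)
```

Changing the second argument changes how many items come back, so the same function covers "top 10", "top 44" and so on.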
ZZucker 342 Practically a Master Poster

17 Years Ago

... can I remove marks like (? "" [] ) etc.? I need only words

Instead of
word = word.rstrip('.,')
use
word = word.rstrip('.,?"[]()')
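Note that rstrip only removes characters from the right-hand end of each word. Here is a small sketch, in Python 3 syntax, of stripping punctuation from both ends with strip and string.punctuation; clean_word is a hypothetical name used only for this example.

```python
import string


def clean_word(word):
    """Lower-case a word and strip punctuation from both ends.

    Hypothetical helper; string.punctuation holds only the ASCII
    punctuation characters, so non-ASCII marks are left alone.
    """
    return word.lower().strip(string.punctuation)


print(clean_word('"Hello,"'))   # hello
print(clean_word("(don't)"))    # don't
```

Punctuation inside a word, such as the apostrophe in don't, is kept because strip only works on the ends.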

... let the user control the number of words

Where you now have
if n < 10:
use
if n < select:
where variable select is an integer from the user's input
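One way to obtain select, sketched in Python 3 syntax: take the text the user typed and convert it with int(), falling back to a default when it is not a usable number. The name parse_count and the default of 10 are assumptions for the example. In Python 2.5 the text would come from raw_input rather than input.

```python
def parse_count(raw, default=10):
    """Convert typed text to a positive integer, or return default.

    Hypothetical helper, not part of the code above.
    """
    try:
        value = int(raw.strip())
    except ValueError:
        # the text was not a whole number
        return default
    return value if value > 0 else default


# in a real script:  select = parse_count(input("How many words? "))
print(parse_count("44"))     # 44
print(parse_count("many"))   # 10
```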

... can Python build a user interface (button, text box, etc.) and how?

Python comes with a simple GUI toolkit called Tkinter that can do all that for you. You need to study up on it; it's a whole new ball of wax. Here is a typical example:

# a look at the Tkinter Text widget
# use ctrl+c to copy, ctrl+x to cut selected text, 
# ctrl+v to paste, and ctrl+/ to select all

import Tkinter as tk

def get_text():
    # get text widget contents between start_index and end_index
    # start_index = "%d.%d" % (line, column)  here "1.0"
    # line starts with 1 and column with 0
    # here end_index = tk.END    
    # set the label text to the typed-in text
    v1.set(text1.get(1.0, tk.END))
    # clear the text
    text1.delete(1.0, tk.END)      
    text1.insert(tk.INSERT, ' new text')
    text1.insert(tk.INSERT, '\n and more text')

# this sets the window title caption too
# without the leading space Text will be text!?
root = tk.Tk(className = " Text, Button, Label ...")

# text entry field, width=width chars, height=lines text
text1 = tk.Text(root, width=50, height=2, bg='yellow')
text1.pack()

# function listed in command will be executed on button click
button1 = tk.Button(root, text='get the text', command=get_text)
button1.pack(pady=5)

# define a variable to hold the label text
v1 = tk.StringVar()

# label text will always be the textvariable's value
# width/height in char size
label1 = tk.Label(root, textvariable=v1, width=50, height=2) 
label1.pack(pady=5)

# do some calculation and format result
pi_approx = 355/113.0
str1 = "%.4f" % (pi_approx)    # 3.1416
# show result in text widget
text1.insert(tk.INSERT, str1)

# start cursor in text1
text1.focus()

root.mainloop()


alivip 0 Newbie Poster · Answer 1 · 2008-03-15T23:01:16+00:00

Thank you very much, it was very helpful.
But is there a way to let the user control the number of words?
For example, rather than the 10 most frequent words, he could ask for the 11, 50 or 44 most frequent words, etc.

And can I remove marks like (? "" [] ) etc.? I need only words.

And can Python build a user interface (button, text box, etc.), and how?

If not, how can I integrate the code into a user interface (button, text box, etc.)?

alivip 0 Newbie Poster · Answer 2 · 2008-03-16T15:17:35+00:00

Your reply was so helpful.
But how can I make an integer from the user's input (select)?

Does Python provide a way to search a directory that contains subdirectories and folders?
For example, the directory is named cars and its subdirectories are Toyota, Honda and BMW; Toyota contains folders named camry and corola, Honda contains accord, and BMW contains a folder named X5.

Is there a way to enter the name of the parent directory (cars) and search in all subdirectories (Toyota, Honda and BMW)?
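For a directory search like this, the standard library's os.walk visits a folder and everything beneath it. Below is a rough sketch in Python 3 syntax; find_files, the folder name cars, and the .txt filter are assumptions for illustration only.

```python
import os


def find_files(parent, suffix=".txt"):
    """Return paths of files under parent (at any depth) ending in suffix."""
    matches = []
    # os.walk yields (folder, subfolder names, file names) for each level
    for folder, _subfolders, filenames in os.walk(parent):
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(folder, name))
    return sorted(matches)


# e.g. every .txt file under cars/Toyota, cars/Honda, cars/BMW, ...
for path in find_files("cars"):
    print(path)
```

Each path found this way could then be opened and fed to the word-counting code.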

alivip 0 Newbie Poster · Answer 3 · 2008-03-16T21:58:42+00:00

This is the modified code:

# a look at the Tkinter Text widget
# use ctrl+c to copy, ctrl+x to cut selected text,
# ctrl+v to paste, and ctrl+/ to select all
import Tkinter as tk


def most_frequant_word():

      # count words in a text and show the first ten items
    # by decreasing frequency
     
    # sample text for testing

    import sys
    import string
    import re
    v1.set(text1.get(1.0, tk.END))
    text1.delete(1.0, tk.END)
    file = open ("arb.txt", "r")
    text = file.read ( )
    file.close ( )
     
    word_freq = {}
     
    word_list = text.split()
     
    for word in word_list:
        # word all lower case
        word = word.lower()
        # strip trailing punctuation characters
        word = word.rstrip('.,/"-_;\[]()')
        # build the dictionary
        count = word_freq.get(word, 0)
        word_freq[word] = count + 1
     
    # create a list of (freq, word) tuples
    freq_list = [(freq, word) for word, freq in word_freq.items()]
     
    # sort the list by the first element in each tuple (default)
    freq_list.sort(reverse=True)
     
    for n, tup in enumerate(freq_list):
        # print the first ten items
        if n < 10:
            text1.insert(tk.INSERT, freq)
            text1.insert(tk.INSERT, word)
            text1.insert(tk.INSERT, "\n")
            freq, word = tup
            print freq, word
root = tk.Tk(className = " most_frequant_word")


# text entry field, width=width chars, height=lines text


text1 = tk.Text(root, width=50, height=20, bg='green')
text1.pack()
# function listed in command will be executed on button click
button1 = tk.Button(root, text='result', command=most_frequant_word)
button1.pack(pady=5)

# define a variable to hold the label text
v1 = tk.StringVar()
# label text will always be the textvariable's value
# width/height in char size
label1 = tk.Label(root, textvariable=v1, width=50, height=20)
label1.pack(pady=5)

# start cursor in text1.
text1.focus()
root.mainloop()

But unfortunately, when I want to search in non-English text, for example an Arabic file, it does not read it properly; it prints text like
3ÇáäíÇÈÉ
28Ýí
11Úáì
11ÊÜÊÜãÜÉ
10ãä
10Úä
7Ãä
6ÈÓÈÈ
5ÎÈÑ
5ÇáãÓáãæä

The sample file is attached.

I use

text1.insert(tk.INSERT, freq)
text1.insert(tk.INSERT, word)
text1.insert(tk.INSERT, "\n")

to insert into the text widget.
Please, I need your help with this and the previous question.
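The garbled output looks like Arabic bytes being displayed as Western European characters. Below is a minimal sketch, in Python 3 syntax, of decoding the file's bytes into text before counting. The cp1256 (Windows Arabic) encoding is a guess based on the sample output, not something stated in the thread, and count_decoded is a hypothetical name.

```python
from collections import Counter


def count_decoded(raw_bytes, encoding="cp1256"):
    """Decode raw file bytes to text, then count the words.

    The default encoding is an assumption; use whatever encoding the
    file was actually saved in (for example "utf-8").
    """
    text = raw_bytes.decode(encoding)
    return Counter(text.split())


# in a real script:  raw = open("arb.txt", "rb").read()
raw = b"\xdd\xed \xe3\xe4 \xdd\xed"   # three short Arabic words in cp1256
top = count_decoded(raw).most_common(10)   # list of (word, count) pairs
```

The decoded strings, rather than the raw bytes, are what would then be inserted into the Text widget.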