Hi all,

I'm a student learning Java right now, so I'm still very new to what Java can do.
I've written a program to perform a word count (thanks to VernonDozier for help with case-sensitivity), but another problem has come up. The program treats a word with punctuation attached as different from the same word without it. For example, "process" is counted separately from "process." when the latter comes at the end of a sentence.

The reason this is important is that it's part of the assignment requirements for the program to be able to identify distinct words.

I currently have no idea how to address this whatsoever. All help is appreciated!

Thanks!

The program is as follows:

import java.util.*;

public class WordCount {
	
	public static void main(String[] args) {
		// Defining total word count variable
		int totalCount = 0;
		
		// Setting up linked hash map so output will display words in order of appearance
		Map<String, Integer> textInput = new LinkedHashMap<String, Integer>();
		
		// Determining the number of distinct words and their frequencies of occurrence
		for (String a : args) {
			a = a.toLowerCase();
			System.out.println(a);
			Integer freq = textInput.get(a);
			textInput.put(a, (freq == null) ? 1 : freq + 1);
		}
		
		// Determining the word count using the occurrence frequencies
		// Every stored frequency is at least 1, so summing them gives the total
		for (int count : textInput.values()) {
			totalCount += count;
		}
		
		// Determining correct grammar and printing word count results
		// If there's only one word
		if (totalCount == 1) {
			System.out.println("The total word count is " + totalCount + " word.");
			System.out.println("The word is " +textInput.keySet());
		}
		
		// If there are no words
		else if (totalCount == 0) {
			System.out.println("There are no words.");
			
		// If there's more than one word
		} else {
			System.out.println("The total word count is " + totalCount + " words.");
			System.out.println("There are "+ textInput.size() + " different words.");
			System.out.println("The words are: " +textInput.keySet());
			
		}
		
    }
	
}

Hmm complicated.

I'm not sure whether you want to treat "process" and "process." as different words or as the same word. Which is it? If you create a Scanner, read in tokens one by one, add each one to an ArrayList if it is a new word, and increment a counter for that word, then "process" and "process." will be considered different. By default the Scanner splits on whitespace, so it reads them in as "process" and "process.", and a search of the ArrayList when you get to "process." will not match "process". If you want them to match, strip punctuation and any other special chars you don't want from the end of every word.
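
In case it helps, here is a minimal sketch of what stripping from the end of a word could look like. The class name TrailingPunctuation and the method stripTrailing are made up for illustration, and it assumes that "unwanted" means anything that is not a letter or a digit, which may not be the rule your assignment wants.

```java
class TrailingPunctuation {
    // Hypothetical helper: drops characters from the end of the word
    // until the last remaining character is a letter or a digit.
    // Assumption: anything else at the end of a word is unwanted.
    static String stripTrailing(String word) {
        int end = word.length();
        while (end > 0 && !Character.isLetterOrDigit(word.charAt(end - 1))) {
            end--;
        }
        return word.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(stripTrailing("process."));  // prints process
        System.out.println(stripTrailing("process"));   // prints process
        System.out.println(stripTrailing("really?!"));  // prints really
    }
}
```

Note that this only touches the end of the word, so something like "don't" keeps its apostrophe. A token made up entirely of punctuation comes back as an empty string, which you would probably want to skip before counting.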

PS nice help Jenn.

http://java.sun.com/javase/6/docs/api/java/lang/Character.html
http://java.sun.com/javase/6/docs/api/java/lang/String.html

The above are links to the documentation for the String and Character classes. As with many things in Java, there is more than one way to do this. Scroll down the list of methods in these classes and you'll find ones that:

  • Split strings into several different strings using delimiters.
  • Isolate characters.
  • Tell whether a character is white space.
  • Tell whether a character is a letter.
  • Tell whether a character is a digit.
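  • Tell which general category a character belongs to, including the punctuation categories (Character.getType; there is no single "is punctuation" method).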

and a lot more. Try not to get overwhelmed. I'm not 100% sure what the punctuation stuff detects. You'll have to experiment or possibly write your own.
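
To make the list above a bit more concrete, here is a rough sketch that exercises those String and Character methods. The class name CharacterChecks and the isPunctuation helper are my own invention (the Character class has no method by that name); the helper just checks the result of Character.getType against the punctuation category constants.

```java
class CharacterChecks {
    // Hypothetical helper: true if Character.getType places c in one of
    // the Unicode punctuation categories.
    static boolean isPunctuation(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Split a string into several strings using whitespace as the delimiter
        String[] words = "The process. Stop, now!".split("\\s+");
        for (String word : words) {
            // Isolate the last character and classify it
            char last = word.charAt(word.length() - 1);
            System.out.println(word
                    + " letter=" + Character.isLetter(last)
                    + " digit=" + Character.isDigit(last)
                    + " space=" + Character.isWhitespace(last)
                    + " punctuation=" + isPunctuation(last));
        }
    }
}
```

One thing experimenting turns up: the punctuation categories follow Unicode, so '.' , ',' , '?' and '!' count as punctuation, but '$' is a currency symbol and '<' is a math symbol, and neither is reported as punctuation. That may be a reason to write your own check.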

You have a good start on defining exactly what you're trying to do, but I think there's a bit more to do regarding what legal data is and how to count it all and what assumptions you can make. For example, can one assume that all words are made up of letters and only letters, separated by spaces, and the sentences end in one and only one of these ( ? . ! ) ? Do you have to be able to parse nonsense like:

a4353s$@!(887  TrwQ:,<kasdYEW  !!# JklP9

or can you assume reasonable input, and what's "reasonable"? These are all things you have to decide and which will affect how you program it. The more "weirdness" you have to be able to catch and the less you can assume, the more complicated it gets.
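
As a quick illustration of how much the assumptions matter, here is a sketch that splits the nonsense line above under two different rules. The class name InputAssumptions and the tokens helper are hypothetical, and neither rule is meant as the "right" one.

```java
import java.util.ArrayList;
import java.util.List;

class InputAssumptions {
    // Hypothetical helper: splits text on the given separator regex and
    // drops any empty pieces the split leaves behind.
    static List<String> tokens(String text, String separatorRegex) {
        List<String> result = new ArrayList<String>();
        for (String piece : text.split(separatorRegex)) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String messy = "a4353s$@!(887  TrwQ:,<kasdYEW  !!# JklP9";

        // Assumption 1: a word is anything between runs of whitespace
        System.out.println(tokens(messy, "\\s+"));

        // Assumption 2: a word is a run of ASCII letters and nothing else
        System.out.println(tokens(messy, "[^A-Za-z]+"));
    }
}
```

The first rule gives four "words", including "!!#". The second gives five fragments: a, s, TrwQ, kasdYEW and JklP. Same input, different answers, which is why the rules need to be pinned down before coding.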

Thanks all for your help!

I found a different way. There's a class called StringBuilder which you guys might find interesting/useful. It builds up a string piece by piece from whatever you append to it.

I created a new class named Normalize and defined a Normalize method within it.

public class Normalize {
	public static String Normalize(String a) {
		char search;
		StringBuilder result = new StringBuilder();
		for (int i = 0; i < a.length(); ++i) {
			search = a.charAt(i);
			// Place all unwanted symbols here
			if (search == '.' || search == ',' || search == '?' || search == '!') {
				continue;
			} else {
				result.append(search);
			}
		}
		return result.toString().toLowerCase();
	}
}

The idea is that the StringBuilder will recreate the word without any punctuation. Seems to work very well!
Thanks Best for the idea of stripping punctuation. That's initially what I was aiming to do directly till I found StringBuilder, which gave a roundabout way of losing unwanted marks.
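
For anyone reading later, here is a rough sketch of how a normalizing step like this might plug into the counting loop. It is not my exact program: the class name NormalizedWordCount and the method names normalize and countWords are made up for this example, the filtering logic is restated so the sketch compiles on its own, and skipping tokens that end up empty is an extra assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NormalizedWordCount {
    // Rebuilds the word without '.', ',', '?' or '!' and lowercases it.
    static String normalize(String word) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c != '.' && c != ',' && c != '?' && c != '!') {
                result.append(c);
            }
        }
        return result.toString().toLowerCase();
    }

    // Counts normalized words, keeping them in order of first appearance.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String word : words) {
            String key = normalize(word);
            if (key.isEmpty()) {
                continue; // assumption: a token that was only punctuation is not a word
            }
            Integer freq = counts.get(key);
            counts.put(key, (freq == null) ? 1 : freq + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(args));
    }
}
```

With this, "Process", "process." and "process" all land on the same key, so they are counted as one distinct word.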

Yeah, we've used StringBuilder before. Personally I would have used Scanner, which could have been used just as effectively, but I'm glad you got it working.

Also, what is the reason for the 'continue' statement? Doesn't it just do nothing in your case? Btw, you could equivalently say:

if (search != '.' && search != ',' && search != '!' && search != '?') result.append(search);

I guess the continue is a bit useless. I guess I'll learn to identify those types of things better as I get more experience.

Thanks for the suggestion!
