HashSet contains duplicate elements?

Question

Bladtman242 8 Junior Poster

13 Years Ago

I have this annoying problem:
I am filling a HashSet out with Road objects, only doing so to avoid duplicate elements.
However, the output from lines 81 though 95 shows that the set contains duplicate elements.
Sorry for the messy code, this is a pretty ad-hoc sollution, in addition to being a parsing class.
I have a feeling the problem is that the Road class (line 229) doesn't overide Objects hashCode() method.
But my attemps to so had some weird consequences, so now I thought I'd try asking here :)

A little BG info of ot helps the readability of the code.
The code reads a few files provided by a danish map and phonebook company (it is a school project).
The purpose is to make an entry set that will later be read into a trie, to allow prefix seaches on road names etc.
[EDIT:] would it have been cleaner to link to pastebin, than to show the code in-post?

//A second attempt at creating a data set for the trie.
//These classes are completely ad-hoc, and thus not optimized for reuse.

import java.io.File;
import java.io.IOException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.io.Closeable;

import java.util.Scanner;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.HashMap;
import java.util.Set;
import java.util.TreeSet;
import java.util.HashSet;

public class ConstructTrieData {
    private static final String outFile="TrieData.txt";
    private Map<Short, String> zipMap = new HashMap<Short, String>(); //map from zipcode to cityname
    private List<Road> roads = new ArrayList<Road>(); //list of (unique) roads.
    private Map<String,HashSet<Road>> zipBList = new HashMap<String, HashSet<Road>>();

    //The constructor controlls the flow of this ad-hoc  parser
    public ConstructTrieData (String zips, String edges, String nodes) {
        //initialize the file objects, terminate by runtimeexception if the files do not exist/can't be read
        File zipsF  = assignRFile(zips);
        File edgesF = assignRFile(edges);
        File nodesF = assignRFile(edges);
        File outF = assignWFile(outFile);

        //read contents of files into appropriate data structures
        readZips(zipsF, zipMap);        //allows zip-code -> city-name mapping
        makeRoadList(edgesF, zipMap, roads);    //stores unique roads (unique for roadname + zip)
        writeFormats(nodesF, outF, roads);  //transfer controll to the writeFormats method, which pretty much takes over from here.
    }

    //Reads zips in the format zip; cityName from File f, intro the map zips
    private void readZips(File f, Map<Short, String> zips) {
        System.out.println("Reading zip codes");
        Scanner sc = getFReader(f);
        sc.nextLine(); sc.nextLine(); //Jump 2 lines    
        while(sc.hasNextLine()) {
            String[] l = sc.nextLine().split(";");
            //add K=zip V=cityName tothe zipmap
            zips.put(Short.parseShort(l[0]), l[1]);
        }
        closeStream(sc);
    }

    //Fills ArrayList roads with Road objects, created by reading File edges and mapping the zips to cities via Map zips
    //this method also fills out zipBList
    private void makeRoadList (File edges, Map<Short, String> zips, List<Road> roads) {
        System.out.println("Reading edges to make the road list");
        long ST = System.currentTimeMillis();

        //actually reading the roads into a set, so that one road is stored only once, regardless of its number of edges.   
        Set<Road> uniqueRoads = new HashSet<Road>();

        //read the file
        Scanner sc = getFReader(edges);
        sc.nextLine(); //jump a line
        while(sc.hasNextLine()) {
            String[] l = sc.nextLine().split(",");

            String rName = l[6]; //name at sixth place
            rName = rName.substring(1, rName.length()-1); // remove the leading and thrailing '
            Short zip = Short.parseShort(l[17]);    //zip code at place 17 (and another at 18)
            String city = zips.get(zip);        //get city name from the zip-map
            Integer nId = Integer.parseInt(l[0]);   //node-id at zeroth place (and another at 1)

            if (city == null || rName.length()==0 ) continue;   //this goes for sweedish roads, where the data from post danmark is inadequate 
                                        //so the city name is null.
                                        //And for kdv_unload entries without a road name
            Road r = new Road(rName, zip, city, nId);
            uniqueRoads.add(r);
            addToZipList(r);
        }
        Road one = null, two = null, three = null;
        int i =0;
        for (Road r : uniqueRoads) {
            if (r.zipCode == 4600 && r.rName.equals("Lærkevej")) {
                System.out.println("found one");
                if (i==0) one=r;
                if (i==1) two=r;
                if (i==2) three=r;
                i++;
            }
        }

        System.out.println(one.equals(one));
        System.out.println(one.equals(two));
        System.out.println(one.equals(three));

        roads.addAll(uniqueRoads);
        System.out.println("Done reading edges and making the road list. It took: " + ((System.currentTimeMillis()-ST)/1000) + " seconds.\n" );
        closeStream(sc);
    }

    private void writeFormats(File nodeFile, File outFile, List<Road> roads) {
        writeR  (outFile, roads);
        //writeC    (nodeFile, outFile, zipBList);
    }

    private void writeR (File outF, List<Road> roads) {
        System.out.println("Writing normal formats formats (i.e; City## not yet included)");
        long ST = System.currentTimeMillis();
        BufferedWriter bw = getFWriter(outF);
        for (Road r : roads) {
            //Road##City
            writeTo(bw, r.rName + "##" + r.cityName + ";" + r.nodeId);
            //Road#Zip#City
            writeTo(bw, r.rName + "#" + r.zipCode + "#" + r.cityName + ";" + r.nodeId);
            //City#Zip#Road
            writeTo(bw, r.cityName + "#" + r.zipCode + "#" + r.rName + ";" + r.nodeId);
            //City##Road
            writeTo(bw, r.cityName + "##" + r.rName + ";" + r.nodeId);
        }
        System.out.println("Done writing normal formats. It took: " + (System.currentTimeMillis()-ST) + " milliseconds.\n" );
        closeStream(bw);
    }

    private void writeC(File nodeF, File outF, Map<String, HashSet<Road>> zipBList) {
        System.out.println("Writing City## format");
        long ST = System.currentTimeMillis();
        BufferedWriter bw = getFWriter(outF);
        for (Set<Road> s : zipBList.values()){
            Road r = centralRoad(s);
            writeTo(bw, r.cityName + "##" + ";" + r.nodeId);
        }

        System.out.println("Done writing City format. It took: " + (System.currentTimeMillis()-ST) + " milliseconds.\n" );
        closeStream(bw);
    }

    private Road centralRoad(Set<Road> roads) {
        Object[] arr = roads.toArray();
        @SuppressWarnings("unchecked")
        Road tmp = (Road) arr[0];
        return tmp;
    }

    /*
     * Helper methods from here on out
     * convention for the IO helpers is to throw a RuntimeException on fatal IO problems, and thereby crash the program, unless otherwise noted
     */ 

    //getRead/write methods, simply provides a 'cleaner' way of instatiating fileIO objects
    private Scanner getFReader(File f){
        try {
            return new Scanner(f);
        } catch (IOException e){
            throw new RuntimeException(e);
        }
    }

    //the two following methods, clumsy as they may seem, provide a means of writing to a file using relavtively clean code
    //returns a BufferedWriter on the File outF (asumes UTF-8)
    private BufferedWriter getFWriter(File outF) {
        try {
            return new BufferedWriter( new OutputStreamWriter ( new FileOutputStream(outF, true), "UTF-8")); //true means append instead of overwriting
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    //write String s, followed by a newline character, to bw
    private void writeTo (BufferedWriter bw, String s) {
        try {
            bw.write(s);
            bw.newLine();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    //verifies the validity of a file denoted by String path, and assigns it to a File f on a successful verification.
    private File assignRFile(String path) {
        File tmp = new File(path);
        if (!tmp.canRead()) {
            throw new RuntimeException("Cant read file " + path);
        }
        return tmp;
    }

    //removes the file if it exists, and returns a file obj representing the newly created file
    //terminates by runtimeexception if any of the IO fails fataly.
    private File assignWFile(String path) {
        File tmp = new File(path);
        if (tmp.exists()) {
            if (!tmp.delete()) throw new RuntimeException("Could not delete file: " + path);
        }
        try {
            tmp.createNewFile();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return tmp;
    }

    //convient way to close streams.
    //doesnt crash on IO problems, as an error here is non-fatal for the parser.)
    private void closeStream(Closeable c) {
        try {
            c.close();
        } catch (IOException e) {
            System.out.println("Huh, couldn't close this");
            e.printStackTrace();
        }
    }

    private void addToZipList(Road r) {
        if (zipBList.containsKey(r.cityName)) {
            zipBList.get(r.cityName).add(r);
        } else {
            HashSet<Road> hs = new HashSet<Road>();
            hs.add(r);
            zipBList.put(r.cityName ,hs);
        }
    }

    /*
     * Helper section over, from here on out only private classes and main()
     */

    //class representing  a road, with all its neccesary data.
    private class Road {    
        public String rName, cityName;
        public Short zipCode;
        public Integer nodeId;

        public Road(String rn, Short zip, String cName, Integer nId) {
            rName = rn;
            zipCode = zip;
            cityName = cName;
            nodeId = nId;
        }

        @Override
        public boolean equals(Object r) {
            if (r == null || r.getClass() != this.getClass()) return false;
            if (this == r) return true;
            Road other = (Road) r;
            return (this.rName.equals(other.rName) && this.zipCode.equals(other.zipCode));
        }
    }

    public static void main (String[] args){
        if (args.length != 3) {
            System.out.println("Usage: zipfile edgefile nodefile");
            return;
        }

        new ConstructTrieData(args[0], args[1], args[2]);
    }
}

java

Edited 13 Years Ago by Bladtman242

2 Contributors
18 Replies
3K Views
6 Hours Discussion Span
Latest Post 13 Years Ago Latest Post by NormR1

NormR1 563 Posting Sage

13 Years Ago

Do you have a small input file for testing?

How did you code the hashCode method so as to maintain the general contract for the hashCode method, which states that equal objects must have equal hash codes?

Edited 13 Years Ago by NormR1

NormR1 563 Posting Sage

13 Years Ago

Another idea: rName.hashCode() + zipCode

NormR1 563 Posting Sage

13 Years Ago

Try working with a very small input file. Make a special test file with a couple of duplicates. Add printlns to the methods for debugging and execute the code to see what happens.

NormR1 563 Posting Sage

13 Years Ago

To be able to debug the code you need to know what the code is doing incorrectly. What items are being omitted from the Set that should be there?

NormR1 563 Posting Sage

13 Years Ago

What items are being omitted from the Set that should be there?

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

Bladtman242 8 Junior Poster · Answer 1 · 2012-05-07T15:54:46+00:00

I have my data set, which is about 115 MB. So it's small enough that I can just use that :)
Well, thats the point. I haven't made a hashCode() method, right now it just uses Object's method. The most resent I tried looked like this:

public int hashCode() {
    int hash =1;
    hash = (hash * 17) + this.zipCode;
    hash = (hash * 31) + this.rName;
    return hash;
}

It's pretty much stolen from wikipedia :)
As far as I can tell it doesn't handle overflow (?).
With this method I had no duplicates, but nowhere near all the roads where in the set :/

Bladtman242 8 Junior Poster · Answer 2 · 2012-05-07T16:20:09+00:00

That was my first attempt, the dataset is missing about 1/5 of the entries when it is done.
I have no idea why :/

Bladtman242 8 Junior Poster · Answer 3 · 2012-05-07T17:42:50+00:00

Ok, I'll do that and return when I've figured something out :)

Bladtman242 8 Junior Poster · Answer 4 · 2012-05-07T18:09:39+00:00

At the risk of seeming a bit helpless; I can't reproduce it with the smaller data set.
Its weird, I know the original data set has significantly more than one duplicate per street, but only a few remain in the output file.

Bladtman242 8 Junior Poster · Answer 5 · 2012-05-07T18:36:44+00:00

I ran diff on the large output (the complete one + duplicates) and the (too) small one.
And I can't see any 'pattern'. Its not exclusively the ones with æøå, or the ones with whitespaces, its not certain formats. Its just entire roads.
Actually thats the only 'pattern'. All formats of the missing road seems to be missing.

Bladtman242 8 Junior Poster · Answer 6 · 2012-05-07T19:28:20+00:00

Here are a few examples:

<Bakkehaven##Kalundborg;444900
< Bakkehaven#4400#Kalundborg;444900
< Kalundborg#4400#Bakkehaven;444900
< Kalundborg##Bakkehaven;444900
< Gyvelvej##Jerup;8286
< Gyvelvej#9981#Jerup;8286
< Jerup#9981#Gyvelvej;8286
< Jerup##Gyvelvej;8286

It seems when an entry is ignored, all entries/formats of that specific road/city combination are ignored.
But it's not specific to the road name. There are other streets by the name "Gyvelvej" from other cities that are present in both the long and short output.
Ass well as other streets from the city Jerup are present.

NormR1 563 Posting Sage Team Colleague · Answer 7 · 2012-05-07T19:36:51+00:00

what does ignored mean?
Left out of the Set because it is a duplicate?
Or left out of the Set with no known reason?
I assume the latter and that what you are trying to find out is the reason the item was not included in the Set. You are 100% sure that there is not a Road object with the same name and zipcode already in the Set.

Your equals() method does not look at city or id.

Bladtman242 8 Junior Poster · Answer 8 · 2012-05-07T20:05:47+00:00

Sorry, it means it's not in the output file, and it is supposed to be.
The equals() method looks only at zip and street name. The id isn't really defining for a street , and the city is represented by the zip.
(a street has serveral id's though only is used, and several streets can have the same id.
I compare by zip instead of city because the zip can be read in prior to the city name.
(The data set contains zip codes but not city names, the zips are then mapped to city names later, from another file).

NormR1 563 Posting Sage Team Colleague · Answer 9 · 2012-05-07T20:12:52+00:00

You need to find out why the missing entry is not in the Set. It must be equal to another entry. You need to find that entry. If you know of a name and zip that are not in the Set but should be, add some debug code to the equals method to test for that name and zip and print out a message every time it sees it. If you get 2 print outs you will know the there are more than one and that is the reason the second is not in the Set.

Bladtman242 8 Junior Poster · Answer 10 · 2012-05-07T21:19:45+00:00

I am going to beat my self over the head with this:
Flushing the outputstream did it.
The original problem was that I didn't override hashCode().
Fixing that didn't seem like a fix because it still skipped some lines :)
I'm hessitant to say so, because the output is still only 13 MB, and not the ~50 I was expecting, but it seems to have done the trick.
Thank you very much for your help and patience :)

NormR1 563 Posting Sage Team Colleague · Answer 11 · 2012-05-07T21:25:08+00:00

Glad you got it working. I had expected your debug print outs to show the size of the Set that you'd be using to determine how many were found. I guess you did not look at that because you would have seen that the size of the Set was larger than the size of the file.

Bladtman242 8 Junior Poster · Answer 12 · 2012-05-07T21:46:19+00:00

Actually that is (kind of) how I found out. I found that the set was complete when the file was not :)

NormR1 563 Posting Sage Team Colleague · Answer 13 · 2012-05-07T21:48:27+00:00

NormR1 563 Posting Sage

13 Years Ago

printlns are a useful tool for debugging.