How to fix this mess...

Question

ppetree 2 Junior Poster

13 Years Ago

Hi All!

I have a zip code table that has the zip code, city, state and various other info in each row. The problem is that if I pull up all the cities in a zip code (say 33308) I get the same city spelled twice like this:

Fort Lauderdale
Ft Lauderdale
Laud By Sea
Lauderdale By The Sea
Oakland Park
Oakland Pk
Sea Ranch Lakes
Sea Ranch Lks

The table looks like this:

zip_code varchar(5)               
lat float                
lon float                
city varchar(100)                
state_prefix varchar(100)                
county varchar(100)                
country varchar(100)                
population varchar(255)                
landarea varchar(255)

Because zip codes can span cities, counties and states, there are over 79,000 rows in this table!

I should also say I have no idea how many city names are messed up or how to find the ones that are messed up.

So the big question is this: What is the best strategy for fixing the city names in the database?

Thanks,

Pete

mysql

Edited 13 Years Ago by ppetree because: n/a

3 Contributors
4 Replies
146 Views
1 Day Discussion Span
Latest Post 13 Years Ago Latest Post by almostbob

urtrivedi 276 Nearly a Posting Virtuoso

13 Years Ago

I am not sure about the best solution. This happened to mean for person address directory. I had to go for manual correction. Though I developed a tool in php, which list thing is alphabetic order,
If I found more than two rows I used to delete all rows but keeping one row of them.

I do not think there is any automatic way to removing duplicates.

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

ppetree 2 Junior Poster · Answer 1 · 2011-06-10T22:15:42+00:00

I am not sure about the best solution. This happened to mean for person address directory. I had to go for manual correction. Though I developed a tool in php, which list thing is alphabetic order,
If I found more than two rows I used to delete all rows but keeping one row of them.
I do not think there is any automatic way to removing duplicates.

Hi urtrivedi. Thanks for the response!

Removing duplicates isn't the problem... You can run this query to find and list the duplicates:

SELECT city, count( * ) AS n
FROM zip_code
GROUP BY city
HAVING n >1
LIMIT 30 , 30

This reduces the count to about 9900 duplicate city names but still doesn't solve this problem:

ADDISLEIGH PARK 2 
ADDISLEIGH PK 2
ALPINE MDWS 2 
ALPINE MEADOWS 2
ALTAMONTE SPG 4 
ALTAMONTE SPRINGS 4

Also, searching for duplicates doesn't solve the problem where you find just 1 instance of AMERICAN CANYON and 1 one instance of AMERICAN CYN.

I might be able to use part of your idea to compare the 9900 where x% of the characters match to catch most of the dups contained in the above search.

ppetree 2 Junior Poster · Answer 2 · 2011-06-11T03:03:41+00:00

OK, I wrote a .php script (shared below) that found most of the really bad ones and then we manually edited those (cross checking them on google).

Now I need to figure out how to go through the table and find all the single instances like these:

BND - BEND
BRG - BURG
CY/CTY - CITY
DT - DEPOT
FT - FORT
GRV - GROVE
LK - LAKE
LND - LAND
PK - PARK
PT -POINT
SPR/SPGS/SPNGS - SPRINGS
TOWNSHP - TOWNSHIP
TN/TWN - TOWN

Anyone have any thoughts?

php code:

<?php
include "../inc/sqlinc.php";
include "../inc/sessmgr.php";

$match = .75;     // % of characters that must match
$old_city = "";   // holds previous city name
$new_city = "";   // holds current city name
$old_zip = "";
$new_zip = "";
$total = 0;
$count = 0;

$sql = "SELECT zip_code, city, COUNT(*) AS n FROM zip_code GROUP BY city HAVING n > 1";
$result = mysql_query($sql) or log_and_die(mysql_error());
$total = mysql_num_rows($result);

echo "Processing $total records<br>";

while($row = mysql_fetch_assoc($result))
{
  if($old_city == "")   // first one?
  {
    $old_city = $row['city'];
    $old_zip = $row['zip_code'];
    continue;
  }
  elseif($old_city == $new_city)  // duplicate?
  {
    continue;
  }
  else
  {
    $new_city = $row['city'];
    $new_zip = $row['zip_code'];
    
    $old_len = strlen($old_city);    
    $new_len = strlen($new_city);
    
    // different zips are just same names in different places
    if($old_zip == $new_zip)
    {
      // AMERICAN CANYON 15
      // AMERICAN CYN 12
      // use the longer of the two string lengths
      $percent = ceil(($old_len > $new_len ? $old_len : $new_len) * $match);
      if(strncmp($old_city, $new_city, $percent) == 0)
      {
        echo "$old_zip - $old_city - $percent<br>";
        echo "$new_zip - $new_city<br><br>";
        $count++;
      }
    }
    $old_city = $new_city;
    $old_zip = $new_zip;
    $new_city = $new_zip = "";    
  }  
}
echo "<br>------------------------------------<br>";
echo "Matches found: $count<br>";

?>

almostbob 866 Retired: passive income ROCKS · Answer 3 · 2011-06-11T05:38:47+00:00

perhaps step through the list of contractions to make them all the full name, then remove duplicates,something like

UPDATE TABLE set CITY=replace(CITY, 'Ft.', 'Fort') 
UPDATE TABLE set CITY=replace(CITY, 'Ft', 'Fort') 
UPDATE TABLE set CITY=replace(CITY, 'Bnd.', 'Bend') 
UPDATE TABLE set CITY=replace(CITY, 'Brg.', 'Burg')

not a tested code, it may be neccessary to pad the replace statments with spaces replace(city," Ft. ", " Fort ") in case there is a valid town name with that string inside the words