UTF-8 encoding issues with strtolower

Question

diafol

14 Years Ago

Hi all. Got a bit of a problem with this.

Here's my code:

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>

<body>
<?php
echo strtolower("TWIUBCÜ");
echo "TWIUBCÜ";
?>
</body>
</html>

OK, perhaps sending both the header and the meta tag is a bit of overkill, but never mind.

The second echo displays the Ü, no problem. The first however, gives me a �! Yes, that horrible diamond encrusted question mark.

Having researched this, I know strtolower can't cope with UTF-8 very well, and the usual advice is to use the mb_ functions instead. Fine. BUT strtoupper() actually works, in the sense that it leaves accented characters in their original case. Why the difference between them?

This isn't life or death, but the accursed diamond is messing up my ajax script - the strtolower() result comes back as NULL (via js) if it contains an accented character.

Pretty screwy?

//EDIT

Oh no!

Even using mb_strtolower() does the same! I'm using Apache 2.2.14 / php 5.3.1 on Vista.

phpinfo() says:

Accept-Charset	ISO-8859-1,utf-8;q=0.7,*;q=0.3
Content-Type	text/html; charset=utf-8

I get the same problem on my Linux remote server too.

Oh boy, even worse now, I've found that strtoupper("TWIUBCÜ") gives TWIUBCÌ.

Well that's torn it.
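
One way to make sense of both symptoms is to look at the bytes. The sketch below assumes (the thread does not confirm this) that the case functions were treating the UTF-8 bytes of Ü as characters of a single-byte Western encoding. It uses the mb_ functions with a deliberately wrong encoding argument to reproduce that behaviour, so it needs the mbstring extension; the variable names are illustrative only.

```php
<?php
// Ü (U+00DC) as explicit UTF-8 bytes, so this file's own encoding does not matter.
$utf8 = "\xC3\x9C";

// Lowercasing while treating the bytes as ISO-8859-1: 0xC3 (Ã) becomes 0xE3 (ã).
$lowered = mb_strtolower($utf8, 'ISO-8859-1');
echo bin2hex($lowered), PHP_EOL;                 // e39c
var_dump(mb_check_encoding($lowered, 'UTF-8'));  // bool(false): no longer valid UTF-8

// Uppercasing while treating the bytes as Windows-1252: 0x9C (œ) becomes 0x8C (Œ).
$raised = mb_strtoupper($utf8, 'Windows-1252');
echo bin2hex($raised), PHP_EOL;                  // c38c, the UTF-8 bytes of Ì (U+00CC)

// Lowercasing with the correct encoding gives ü (U+00FC).
$correct = mb_strtolower($utf8, 'UTF-8');
echo bin2hex($correct), PHP_EOL;                 // c3bc
```

Under that assumption, the broken lowercase result is an invalid byte sequence, which a browser renders as the replacement character, while the uppercase result happens to be a valid but different character.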


mschroeder 251 Bestower of Knowledge

14 Years Ago

Well, neither strtolower nor strtoupper is UTF-8 compatible, but I believe you already knew this.

http://www.phpwact.org/php/i18n/utf-8 is a great resource on the compatibility of UTF-8 with the current string functions in PHP.

I get the same results as you when I put my own test together using a recommended internationalization test string (Iñtërnâtiônàlizætiøn).

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
 
<body>
<?php
//header ('Content-type: text/html; charset=utf-8');
$string = 'Iñtërnâtiônàlizætiøn';

echo strtoupper($string).PHP_EOL;
echo strtolower($string).PHP_EOL;
echo mb_strtoupper($string).PHP_EOL;
echo mb_strtolower($string).PHP_EOL;
?>
</body>
</html>

It appears that strtoupper returns the string with the ASCII characters capitalized and the UTF-8 characters untouched, whereas strtolower corrupts the string. The mb_ calls, made here without an encoding argument, give the same two results.

IñTëRNâTIôNàLIZæTIøN
i�t�rn�ti�n�liz�ti�n
IñTëRNâTIôNàLIZæTIøN
i�t�rn�ti�n�liz�ti�n


diafol · Answer 1 · 2011-01-26T04:49:57+00:00

Hey, thanks MS. Sorry I took so long to respond - my laptop died! Probably due to key thumping born of incredulity at the way strtolower works! Anyhow, thanks for the link.

mschroeder 251 Bestower of Knowledge Team Colleague · Answer 2 · 2011-01-26T07:26:35+00:00

So I've been messing with this problem a little bit more this evening.
UTF-8 has always intrigued me but I've never had an application I had to worry about it on.

I've been using the chart from http://www.utf8-chartable.de/unicode-utf8-table.pl to get UTF-8 characters to test.

$string = 'Թ';
echo 'Uppercase: '.mb_convert_case($string, MB_CASE_UPPER, "UTF-8").'<br />';
echo 'Lowercase: '.mb_convert_case($string, MB_CASE_LOWER, "UTF-8").'<br />';
echo 'Original: '.$string.'<br />';

Once my file was actually saved with UTF-8 encoding, this started working like a charm.
It seems the hardest problem will be determining the encoding of incoming data and then converting it to UTF-8.

diafol · Answer 3 · 2011-01-27T07:43:25+00:00

Yes, I've been messing with the MB functions myself lately. I forgot to include the all important "UTF-8" parameter, so that's why I was still getting the horrible question mark.
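
A minimal sketch of the two ways to supply that encoding, assuming the mbstring extension is loaded. The test string is the one from the first post, written as UTF-8 bytes so the example does not depend on how the file is saved.

```php
<?php
// "TWIUBCÜ" from the first post, with Ü given as explicit UTF-8 bytes.
$string = "TWIUBC\xC3\x9C";

// Option 1: pass the encoding on every call.
$perCall = mb_strtolower($string, 'UTF-8');
echo $perCall, PHP_EOL;

// Option 2: set the default once, then omit the argument.
mb_internal_encoding('UTF-8');
$viaDefault = mb_strtolower($string);
echo $viaDefault, PHP_EOL;

// Uppercasing the lowercased string restores the original.
$roundTrip = mb_strtoupper($viaDefault);
echo $roundTrip, PHP_EOL;
```

Passing the encoding explicitly is the safer habit if the code may run on servers whose default internal encoding is not UTF-8, as was the case with the PHP 5.3 setups discussed here.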

I have to say, UTF-8 + MB will do me fine for now. I can't believe that I've never run into the strtolower issue before - although I should have read the blurb in the PHP manual. I've seen 'grown-up' open-source apps use strtolower when checking usernames in a DB. Huh, I can just imagine what would happen if somebody had a non-ASCII character in there somewhere.
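
For the username case, a rough sketch of a comparison that survives non-ASCII characters might look like this. The function name sameUsername is hypothetical, both inputs are assumed to be valid UTF-8, and Unicode normalization is not handled.

```php
<?php
// Hypothetical helper: compare two UTF-8 usernames case-insensitively.
function sameUsername($a, $b)
{
    // Lowercase both sides with an explicit encoding before comparing.
    return mb_strtolower($a, 'UTF-8') === mb_strtolower($b, 'UTF-8');
}

var_dump(sameUsername("J\xC3\x9CRGEN", "j\xC3\xBCrgen")); // JÜRGEN vs jürgen: bool(true)
var_dump(sameUsername("J\xC3\x9CRGEN", "jurgen"));         // Ü is not u: bool(false)
```

In practice the database collation often decides how usernames compare, so this only covers checks done on the PHP side.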

As I understand it, UTF-8 will cover *almost* anything.

When I tested my remote setup I pretty much had all the main supported locales (no Norwegian for some odd reason!), and they ALL were UTF-8-ified.

OK, there are also the UTF-16/UTF-32 variants (big-endian or little-endian, with or without a BOM).

If you need to check the encoding, I think there's the mb_detect_encoding() function.
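
A possible sketch of the detect-then-convert step, assuming the mbstring extension. The helper name toUtf8 and its fallback list are hypothetical. Detection is only a guess: single-byte encodings such as ISO-8859-1 accept any byte, so the sketch first checks whether the input is already valid UTF-8 and only then tries the fallbacks.

```php
<?php
// Hypothetical helper: return $input as UTF-8, or false if no candidate fits.
function toUtf8($input, array $fallbacks = array('ISO-8859-1'))
{
    // Valid UTF-8 is returned unchanged; legacy text with accents rarely passes this check.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Strict mode (third argument true) rejects candidates the bytes are invalid for.
    $encoding = mb_detect_encoding($input, $fallbacks, true);
    if ($encoding === false) {
        return false;
    }

    return mb_convert_encoding($input, 'UTF-8', $encoding);
}

$latin1 = "\xDC";     // Ü in ISO-8859-1
$utf8   = "\xC3\x9C"; // Ü in UTF-8
echo bin2hex(toUtf8($latin1)), PHP_EOL; // c39c
echo bin2hex(toUtf8($utf8)), PHP_EOL;   // c39c
```

Where the sender can declare its encoding (an HTTP header or a form's accept-charset, for example), trusting that declaration is generally more reliable than guessing from the bytes.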