utf8 and Unicode

Question

gerard4143 371 Nearly a Posting Maven

10 Years Ago

I have a simple question about Unicode and UTF-8.

How does a UTF-8 encoding know what its uppercase encoding is? I understand how a UTF-8 encoding carries its Unicode value embedded in itself, but I fail to see how a UTF-8 encoding gets mapped to an uppercase Unicode value. What mechanism maps UTF-8 encodings to their uppercase encodings, or to the other features available in the Unicode universe?


deceptikon 1,790 Code Sniper

10 Years Ago

I'm not sure I understand the question. Uppercase and lowercase characters each have their own unique encoding; they're independent of each other in the same way digits and letters are.


gerard4143 371 Nearly a Posting Maven · Answer 1 · 2014-08-04T11:17:44+00:00

@deceptikon

That's the point. How does an uppercase function work when it's used with UTF-8? An uppercase function would be pretty simple arithmetic with ASCII values, but I fail to see how that function would work with UTF-8.

deceptikon 1,790 Code Sniper Team Colleague Featured Poster · Answer 2 · 2014-08-04T12:46:27+00:00

Let's first differentiate Unicode and UTF-8. UTF-8 is an encoding technique for Unicode code points. So you can break the problem down by removing the encoding aspect and only thinking about code points.
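To make the split between encoding and code point concrete, here is a minimal sketch of pulling the code point out of a single UTF-8 sequence. The function name `decode_utf8_char` is made up for this illustration, and the sketch skips the stricter checks a real decoder performs (overlong forms, surrogates, range limits).

```python
def decode_utf8_char(data: bytes) -> int:
    """Return the code point encoded by the first UTF-8 sequence in data.

    Illustrative only: overlong forms and surrogates are not rejected.
    """
    first = data[0]
    if first < 0x80:               # 0xxxxxxx: one byte, same value as ASCII
        return first
    if first >> 5 == 0b110:        # 110xxxxx: two-byte sequence
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(data) < length:
        raise ValueError("truncated UTF-8 sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:         # continuation bytes look like 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append the six payload bits
    return cp


# "é" is stored as the two bytes C3 A9, but its code point is U+00E9.
print(hex(decode_utf8_char("é".encode("utf-8"))))
```

Once the code point is recovered this way, the byte-level encoding no longer matters for questions like case conversion.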

For simplicity's sake, consider the code points that fit in a single byte. These correspond to ASCII, and case conversion for them is conceptually identical to ASCII case conversion. The only real difference is the more complex logic needed to decode the value, because UTF-8 is a variable-width encoding.
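The single-byte case described above can be sketched directly on the raw bytes. The helper name `ascii_upper` is hypothetical. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so those bytes can never be mistaken for the ASCII letters `a` to `z`.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase only the ASCII letters in UTF-8 encoded data."""
    out = bytearray(data)
    for i, b in enumerate(out):
        if 0x61 <= b <= 0x7A:   # 'a'..'z'
            out[i] = b - 0x20   # 'A'..'Z' sit exactly 0x20 lower
    return bytes(out)


# ASCII letters change; the two bytes of "é" pass through untouched.
print(ascii_upper("café".encode("utf-8")).decode("utf-8"))
```

The non-ASCII character is left in lowercase, which is exactly where the arithmetic approach stops being enough.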

Extending that to the non-ASCII code points, a simple arithmetic transform may not work. At that point the uppercase function would use a lookup table to map each lowercase code point to its uppercase counterpart.
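A rough sketch of the lookup approach: decode to code points, map each one through a table, then encode again. The three-entry `UPPER_TABLE` and the function names are made up for illustration. Real libraries derive a much larger table from the Unicode Character Database, and they also handle mappings this sketch ignores, such as "ß" uppercasing to the two characters "SS".

```python
# Hypothetical, deliberately tiny table: lowercase code point -> uppercase.
UPPER_TABLE = {
    0x00E9: 0x00C9,  # é -> É (offset happens to be 0x20, as in ASCII)
    0x00FF: 0x0178,  # ÿ -> Ÿ (no fixed offset works here)
    0x03B1: 0x0391,  # α -> Α
}


def to_upper_code_point(cp: int) -> int:
    """Map one code point to uppercase: arithmetic for ASCII, lookup otherwise."""
    if 0x61 <= cp <= 0x7A:          # 'a'..'z'
        return cp - 0x20
    return UPPER_TABLE.get(cp, cp)  # unknown code points are left unchanged


def utf8_upper(data: bytes) -> bytes:
    """Decode UTF-8, uppercase each code point, and encode back to UTF-8."""
    text = data.decode("utf-8")
    upper = "".join(chr(to_upper_code_point(ord(ch))) for ch in text)
    return upper.encode("utf-8")


sample = "café ÿ α"
result = utf8_upper(sample.encode("utf-8")).decode("utf-8")
print(result)
# For these particular characters the result matches Python's built-in
# str.upper(), which uses the full Unicode case mappings.
print(result == sample.upper())
```

Note that the table is keyed by code point, not by byte sequence, so the same table would serve UTF-16 or UTF-32 text equally well.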