Number of code points from a file

Question

manoj_93 0 Newbie Poster

13 Years Ago

I am trying to calculate the number of code points present in a file, but it always shows me the number of characters instead.

I am using a buffered reader to read the file, and this is the relevant part of the code:

while ((sCurrentLine = br.readLine()) != null) {
    System.out.println(sCurrentLine);
    // Number of chars (UTF-16 code units) on this line
    numChar += sCurrentLine.length();
    // Number of code points on this line; a surrogate pair counts once
    numCdpoints += sCurrentLine.codePointCount(0, sCurrentLine.length());
}

If anyone has a suggestion, it would be really helpful.

java

~s.o.s~ 2,560 Failure as a human

13 Years Ago

What kind of file is it? What kind of "text" does it contain? If it contains only ASCII text, the char count of a string will always be the same as its number of code points. Read this, try to understand it and get back in case of more queries.
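A minimal sketch of that difference, for illustration only. The class and method names are hypothetical, and the sample character U+1D70B (MATHEMATICAL ITALIC SMALL PI) is just an assumed example of a character outside the Basic Multilingual Plane, not something taken from the poster's file.

```java
final class CodePointDemo {
    /** Returns the number of UTF-16 code units (Java chars) in s. */
    static int charCount(String s) {
        return s.length();
    }

    /** Returns the number of Unicode code points in s. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "pi";
        // U+1D70B lies outside the BMP, so Java stores it as the
        // surrogate pair \uD835 \uDF0B (two chars, one code point).
        String supplementary = "\uD835\uDF0B";

        // prints "2 2": for ASCII text both counts agree
        System.out.println(charCount(ascii) + " " + codePointCount(ascii));
        // prints "2 1": the counts differ only when surrogate pairs are present
        System.out.println(charCount(supplementary) + " " + codePointCount(supplementary));
    }
}
```

If both counts come out equal for a real file, either the text has no supplementary characters or it was decoded with the wrong charset before it ever reached the String.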

~s.o.s~ 2,560 Failure as a human

13 Years Ago

OK, now that we have confirmed that you have some characters stored as surrogate pairs, the next point is to understand which charset is used when opening the file stream for reading. Make sure that you don't rely on the default OS charset (windows-1252, latin etc.) and explicitly pass UTF8. Something like (not tested):

new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));

Since UTF-8 is backwards compatible with ASCII (that is, any ASCII text is also valid UTF-8), you'll be able to read regular characters along with supplementary characters, which Java represents as surrogate pairs.

If it still doesn't work, post/attach a small fragment of your text file.
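As a rough sketch of the same idea using `java.nio.file.Files` (Java 7 and later), under the assumption that the charset of the file is known in advance. The class name, method name and temporary file are hypothetical and exist only to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileCodePointCounter {
    /**
     * Counts the code points in a file decoded with the given charset.
     * Line terminators are not counted, because readLine() strips them.
     */
    static long countCodePoints(Path file, Charset charset) throws IOException {
        long total = 0;
        // The charset is passed explicitly instead of relying on the platform default.
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = br.readLine()) != null) {
                total += line.codePointCount(0, line.length());
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: 'a', U+1D70B (stored as a surrogate pair), 'b', newline.
        Path tmp = Files.createTempFile("codepoints", ".txt");
        try {
            Files.write(tmp, "a\uD835\uDF0Bb\n".getBytes(StandardCharsets.UTF_8));
            // prints 3: three code points, although the line holds four chars
            System.out.println(countCodePoints(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that `Files.newBufferedReader` throws an exception on malformed input instead of silently substituting replacement characters, which can make a wrong charset guess easier to spot.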

manoj_93 0 Newbie Poster · Answer 1 · 2012-02-24T23:09:54+00:00

In the text, there are some characters in a different font, so when I print the string I get something like this:

��5ܲ hile the ��5ܲ ecimal representation

So, there are some characters that are surrogate pairs (code points).

manoj_93 0 Newbie Poster · Answer 2 · 2012-02-25T00:02:46+00:00

Actually, I am using
br = new BufferedReader(new FileReader("C:\\piblurb.txt"));
to read the file, and I have to use UTF-16 encoding only.

I have attached the text file

~s.o.s~ 2,560 Failure as a human Team Colleague Featured Poster · Answer 3 · 2012-02-25T01:48:55+00:00

Assuming the text is from the Wikipedia description of pi, there are a few problems. First, the text is completely garbled; how did you generate the text file? Second, why UTF-16? Did you specifically encode the file as UTF-16? If so, is it with or without a BOM?
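To illustrate why the BOM question matters, here is a small sketch. The class and method names are hypothetical, and the sample text is an assumed single supplementary character rather than the contents of the attached file. It relies on documented behaviour of Java's `UTF-16` charset: when decoding it honours a leading BOM and otherwise assumes big-endian byte order, and when encoding it writes a big-endian BOM.

```java
import java.nio.charset.StandardCharsets;

final class Utf16BomDemo {
    /** Reports which UTF-16 byte-order mark, if any, starts the data. */
    static String detectUtf16Bom(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE BOM";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE BOM";
        }
        return "no UTF-16 BOM";
    }

    public static void main(String[] args) {
        String text = "\uD835\uDF0B"; // assumed sample: U+1D70B as a surrogate pair

        byte[] withBom = text.getBytes(StandardCharsets.UTF_16);      // BOM FE FF is written
        byte[] withoutBom = text.getBytes(StandardCharsets.UTF_16LE); // no BOM is written

        System.out.println(detectUtf16Bom(withBom));    // prints "UTF-16BE BOM"
        System.out.println(detectUtf16Bom(withoutBom)); // prints "no UTF-16 BOM"

        // With a BOM, the generic UTF-16 decoder recovers the text.
        System.out.println(new String(withBom, StandardCharsets.UTF_16).equals(text));    // prints true
        // Little-endian bytes without a BOM are misread as big-endian, giving garbled text.
        System.out.println(new String(withoutBom, StandardCharsets.UTF_16).equals(text)); // prints false
    }
}
```

For a file without a BOM, the byte order has to be named explicitly (`UTF_16LE` or `UTF_16BE`); guessing wrong produces exactly the kind of garbled output discussed above.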