Hello,


I'm writing a program for creating language glossaries. The problem is that Windows uses ANSI encoding for text files, and the program that will read these files (which is not mine) only displays words correctly when they are UTF-8 encoded.


Since my program is multiplatform, it can also work under Linux. In Linux there is no problem at all, because it uses UTF-8 as default, so it works smoothly. The problem is Windows.

Right now I have this:

The program takes the words and converts them to UTF-8 (or at least that's what I think; see the code), then writes them to the file, but when I open the file under Windows the character encoding is still ANSI.

I understand I need to make the file a UTF-8 file from Python. (Right now I have to open the file and change the encoding myself; everything works fine after that.)

t = word.get()    #I'm using tkinter, word is an entry field
e = meaning.get() #I'm using tkinter, meaning is an entry field
meaning.delete(0, END)
word.focus()
es = e.encode("utf-8") 
ts = t.encode("utf-8") 
es.decode("utf-8")     
ts.decode("utf-8")     

# then the usual write procedure, where I write es and ts to the file

Thank you

You would do something along the lines of the following. Note that if you open the file using a program that does not handle the encoding, you will see what appear to be garbage characters. This is not the file's fault; the problem lies with the program being used to view it. Also, Python 3.x has Unicode built in, so what happens depends on which version of Python you are using. If you have any further questions, include the version of Python you are using with the question.

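import codecs
# The u prefix is Python 2 syntax; Python 3.0-3.2 reject it, so drop it there.
with codecs.open('test', encoding='utf-8', mode='w+') as fp:
    fp.write(u'\u4500 blah blah blah\n')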

See this page http://docs.python.org/howto/unicode.html
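For Python 3, the built-in open() accepts an encoding argument directly, so codecs.open is not strictly needed. Here is a minimal sketch of that approach; the function name write_glossary_entry, the tab-separated layout, and the file name glossary_demo.txt are made up for illustration and are not from the original program.

```python
import os
import tempfile


def write_glossary_entry(path, word, meaning):
    """Append one 'word<TAB>meaning' line to path, encoded as UTF-8."""
    # encoding="utf-8" makes Python encode every str written to UTF-8 bytes,
    # regardless of the platform default (an ANSI code page on Windows).
    # newline="\n" stops Windows from translating "\n" into "\r\n".
    with open(path, mode="a", encoding="utf-8", newline="\n") as fp:
        fp.write(word + "\t" + meaning + "\n")


# Hypothetical demo file in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "glossary_demo.txt")
if os.path.exists(path):
    os.remove(path)

write_glossary_entry(path, "ni\u00f1o", "child")

# Read the file back as raw bytes to see what was actually stored.
with open(path, "rb") as fp:
    raw = fp.read()
print(raw)  # b'ni\xc3\xb1o\tchild\n'
```

With tkinter, the values returned by word.get() and meaning.get() are already str objects in Python 3, so they can be passed to a function like this without calling encode() or decode() first.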

Oh, I forgot that. I'm using Python 3.1.

Could you please explain what the first line does? Is it encoding the whole file?

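The first line does not encode an existing file. It opens the file so that every string written through fp is encoded to UTF-8 bytes on the way out. A plain text file carries no encoding information as such: with encoding='utf-8' no header is written, and the file is just bytes, as all files are. The closest thing to a header is the optional byte order mark (BOM) that the 'utf-8-sig' codec writes at the start of the file, which some Windows programs use to recognize the file as UTF-8.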
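One way to check what actually ends up at the beginning of the file is to read it back in binary mode. The sketch below assumes Python 3 and uses made-up file names (plain.txt, with_bom.txt) in a temporary directory. It writes the same text once with the 'utf-8' codec and once with 'utf-8-sig', then compares the raw bytes.

```python
import codecs
import os
import tempfile

text = "caf\u00e9\n"

# Hypothetical file names inside a fresh temporary directory.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "plain.txt")
bom_path = os.path.join(tmpdir, "with_bom.txt")

# 'utf-8' writes only the encoded text, with no marker in front.
with open(plain_path, "w", encoding="utf-8", newline="\n") as fp:
    fp.write(text)

# 'utf-8-sig' writes a 3-byte byte order mark first, then the same text.
with open(bom_path, "w", encoding="utf-8-sig", newline="\n") as fp:
    fp.write(text)

with open(plain_path, "rb") as fp:
    plain_bytes = fp.read()
with open(bom_path, "rb") as fp:
    bom_bytes = fp.read()

print(plain_bytes)  # b'caf\xc3\xa9\n'
print(bom_bytes)    # b'\xef\xbb\xbfcaf\xc3\xa9\n'
```

Whether the BOM helps depends on the reader. Some Windows editors use it to detect UTF-8, while other programs display it as stray characters at the start of the first line, so it is worth testing against the program that will read the glossary.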
