I am trying to run some text processing tasks on a collection of files stored in a directory. The data set is the standard 20 Newsgroups data. However, running the following code segment gives an error message such as

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte

I think it is related to a Unicode problem, but I am not clear on how to solve it.

    9: DIR = 'C:\\Users\\Desktop\\data\\rec.sport.hockey'
   10: posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
   11: x_train = vectorizer.fit_transform(posts)

The traceback message is as follows:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 11, in <module>
    x_train = vectorizer.fit_transform(posts)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte

For Python 3:
By default open() uses the ASCII encoding, which only recognizes the first 128 values. 'Latin-1' maps all 256 possible byte values, so decoding with it never fails. So specify the encoding explicitly in your open() call by passing the keyword argument encoding='latin-1'.
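
A minimal sketch of that suggestion, for illustration only. The helper name read_posts is hypothetical, and 'latin-1' is an assumption about the data rather than a confirmed fact. The traceback in the question shows Python 2.7, where the built-in open() does not accept an encoding argument, so the sketch uses io.open(), which takes encoding on both Python 2.7 and Python 3.

```python
import io
import os


def read_posts(directory, encoding="latin-1"):
    """Return the decoded text of every regular file in `directory`.

    `encoding` is an assumption about the data; latin-1 never raises on
    decode, but it may show the wrong characters if the files actually
    use another encoding.
    """
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # io.open accepts `encoding` on both Python 2.7 and Python 3
        with io.open(path, encoding=encoding) as handle:
            posts.append(handle.read())
    return posts
```

With this in place, posts = read_posts(DIR) would hand already-decoded text to vectorizer.fit_transform(posts). The traceback also shows the vectorizer decoding with self.encoding and self.decode_error, so setting those on the vectorizer may be an alternative; check the scikit-learn documentation for your version.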

"By default open() uses the ASCII encoding"

According to the documentation, the default encoding in Python 3 is whatever locale.getpreferredencoding() returns. For me it is

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

You can try to guess your file's encoding with the chardet module or its command-line utility.
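
A rough sketch of that idea, under stated assumptions: chardet is a third-party package that may not be installed, so the code falls back to a plain standard-library check. The function names and the candidate list are hypothetical, and any guess can be wrong, especially on short samples.

```python
def first_decodable(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates worked


def guess_encoding(raw):
    """Guess the encoding of the bytes in `raw`.

    Uses chardet.detect() when chardet is installed, otherwise (or when
    chardet gives no answer) falls back to first_decodable().
    """
    try:
        import chardet  # third-party, optional
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    return first_decodable(raw)
```

A typical use would be to read a file in binary mode, for example open(path, 'rb').read(), and pass the resulting bytes to guess_encoding() before deciding which encoding to give to open().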

On my system (Anaconda3), locale.getpreferredencoding() returns 'cp1252', so the default clearly varies between machines. If you need a specific encoding, it is best to pass it to open() explicitly.
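
To illustrate why the choice matters, here is a small sketch that decodes the byte from the error message, 0x80, with three encodings. The helper name decode_attempts is hypothetical, and the three encodings are picked only because they came up in this thread.

```python
def decode_attempts(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


attempts = decode_attempts(b"\x80")
# utf-8 rejects the byte (the "invalid start byte" from the traceback),
# cp1252 reads it as the euro sign, and latin-1 reads it as the control
# character U+0080.
```

The same byte therefore yields an error, a euro sign, or a control character depending on the encoding, which is why relying on the machine's default can give different results on different systems.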
