    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    print text
    fp.close()
    device.close()
    retstr.close()
    return text
You need to edit line 167 of the pdfminer source file converter.py, changing
self.outfp.write(u"é")
to
self.outfp.write(u"é".encode('utf-8'))
Otherwise, the following error is raised:
Traceback (most recent call last):
  File "/Users/Administer/Desktop/pdfReader.py", line 33, in <module>
    convert_pdf_to_txt('document1.pdf')
  File "/Users/Administer/Desktop/pdfReader.py", line 13, in convert_pdf_to_txt
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=None)
  File "/Library/Python/2.7/site-packages/pdfminer/converter.py", line 180, in __init__
    PDFConverter.__init__(self, rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)
  File "/Library/Python/2.7/site-packages/pdfminer/converter.py", line 167, in __init__
    self.outfp.write(u"é")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
[Finished in 0.2s with exit code 1]
[shell_cmd: python -u "/Users/Administer/Desktop/pdfReader.py"]
[dir: /Users/Administer/Desktop]
[path: /usr/bin:/bin:/usr/sbin:/sbin]
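The traceback shows the 'ascii' codec failing on u'\xe9'. A likely explanation, assuming retstr is a byte-oriented buffer such as cStringIO.StringIO, is that Python 2 tries to encode the unicode string with its default ASCII codec before writing it. The minimal sketch below does not use pdfminer at all. It only reproduces the encoding step in isolation and shows why calling encode('utf-8') first avoids the error. The names char, ascii_ok, buf and encoded are illustrative, not part of pdfminer.

```python
import io

# The character "é" that converter.py writes on line 167.
char = u"\xe9"

# Reproduce the failing step: "é" has no representation in ASCII,
# so encoding it with the ASCII codec raises UnicodeEncodeError.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The fix: encode explicitly as UTF-8, which yields plain bytes
# that a byte-oriented buffer can store without further conversion.
buf = io.BytesIO()  # stand-in (assumption) for the byte buffer passed as outfp
buf.write(char.encode("utf-8"))
encoded = buf.getvalue()
```

Here ascii_ok ends up False, while encoded holds the two-byte UTF-8 form of "é". Patching an installed library is fragile, since the change is lost on reinstall, so it may be worth checking whether passing a unicode-capable buffer as outfp fits your setup instead.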