Python 3 Extended Tutorial: A Chat About File Encoding

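While learning how to read and write files in Python today, I ran into an encoding problem, so I'm writing it down here. Fair warning: this post is not suitable for children, and children over 18 should read it in the company of a pretty lady (damn, yours truly is still single).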

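My environment is macOS with Python 3.6.

The video tutorial I was following used Windows and Python 3.3. There everything went smoothly: the entire contents of the txt file were read out on the first try.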

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Then I switched to my own environment. On the Mac, the IDLE shell raised an error:

Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>> WARNING: The version of Tcl/Tk (8.5.9) in use may be unstable.
Visit http://www.python.org/download/mac/tcltk/ for current information.
>>> 
>>> f = open('/Users/liurenkui/Desktop/recode.txt','r')
>>> f
<_io.TextIOWrapper name='/Users/liurenkui/Desktop/recode.txt' mode='r' encoding='US-ASCII'>
>>> f.read()
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    f.read()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>>

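Running the same Python code from Sublime Text produced the same error.

Then I tried once more, this time in PyCharm CE, and the code ran perfectly without a single error. Now that was puzzling.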

The contents of recode.txt are as follows:

你好Anson 你好Aimi

欢迎使用程序喵博客！

Ready Go...

姑娘你说你是不是傻？
哥们从小习武，称霸校园。年少得志，学得一手计算机控制计算机挖掘机炒菜技能，站稳职场。
一人吃饱全家小康，保证能把你养的膘肥体胖，貌美如花。姑娘你还担心啥？
写得了代码，翻的过围墙，开的了跑车又能耍流氓。
宠的了姑娘，修的了家电，还能哄丈母娘。
爱护小动物，知识而丰富，嗓音贼拉酷，光背影一杵就能吸引粉丝无数，你还在想啥？
姑娘你说你是不是傻？

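Would a little problem like this stump me? You must be joking! I'm a man with a Java programming background, and I don't go down that easily.

On closer inspection: in the Windows tutorial, the f object in IDLE reports a default encoding of cp936, while IDLE on my Mac reports US-ASCII. When open() is called without an encoding argument, Python asks the platform for its preferred encoding, so the same code can behave differently on each machine and in each tool. That is presumably also why PyCharm CE had no trouble: it most likely runs Python with a UTF-8 default.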
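To see which default a given tool is using, the value can be queried directly. This is a minimal sketch; the helper name default_text_encoding is made up for illustration, and it assumes the Python 3.6 behavior described in the help(open) text, where the default comes from locale.getpreferredencoding(False).

```python
import locale
import sys


def default_text_encoding():
    """Return the encoding open() falls back to in text mode when
    encoding is not given (hypothetical helper, Python 3.6 behavior)."""
    return locale.getpreferredencoding(False)


# Run this in IDLE, Sublime Text and PyCharm CE and compare the results.
print("open() default encoding:", default_text_encoding())
print("stdout encoding:", sys.stdout.encoding)
```

If the first line prints US-ASCII (or ANSI_X3.4-1968), any file containing Chinese text will fail to decode unless an encoding is passed to open() explicitly.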

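So here's the question: what is cp936?

After a round of Googling, I found the answer: cp936 is code page 936, the Windows code page for Simplified Chinese, which is based on GBK (the extended national-standard character set).

Since it is effectively GBK, reading Chinese text on Windows naturally works, as long as the file itself was saved in that encoding, which is the likely case for a text file created on Chinese-language Windows.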
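A small sketch can make the relationship concrete. It assumes nothing beyond the standard codecs shipped with Python: in CPython, "cp936" is an alias of the "gbk" codec, and the first UTF-8 byte of "你" is 0xe4, which is exactly the byte the ASCII decoder complained about.

```python
text = "你好"

gbk_bytes = text.encode("gbk")
cp936_bytes = text.encode("cp936")
utf8_bytes = text.encode("utf-8")

# Python treats cp936 as another name for its gbk codec.
print(gbk_bytes == cp936_bytes)  # True

# The first UTF-8 byte of "你" is outside the ASCII range (0-127).
print(hex(utf8_bytes[0]))  # 0xe4

# Decoding those bytes as ASCII reproduces the error from the IDLE session.
try:
    utf8_bytes.decode("ascii")
    message = ""
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)
```

The same file can therefore be read correctly only when the decoder matches the encoding it was saved with: GBK bytes need a GBK (cp936) decoder, and UTF-8 bytes need a UTF-8 decoder.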

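Fixing the encoding problem on the Mac

Analysis: call the help function to see where the encoding can be set, then set it to match the encoding of the file.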

>>> help(open)
Help on built-in function open in module io:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise IOError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position).
    In text mode, if encoding is not specified the encoding used is platform
    dependent: locale.getpreferredencoding(False) is called to get the
    current locale encoding. (For reading and writing raw bytes use binary
    mode and leave encoding unspecified.) The available modes are:
    
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
    
    (remaining output omitted...)

The help text above states clearly that an encoding can be specified when calling open(). That is our way in, and it is worth poking at.

The modified code is as follows:

>>> f = open('/Users/liurenkui/Desktop/recode.txt', 'r', encoding='utf-8')
>>> f
<_io.TextIOWrapper name='/Users/liurenkui/Desktop/recode.txt' mode='r' encoding='utf-8'>
>>> f.read()
'\ufeff你好Anson 你好Aimi\n\n欢迎使用程序喵博客！\n\nReady Go...\n\n姑娘你说你是不是傻？\n哥们从小习武，称霸校园。年少得志，学得一手计算机控制计算机挖掘机炒菜技能，站稳职场。\n一人吃饱全家小康，保证能把你养的膘肥体胖，貌美如花。姑娘你还担心啥？\n写得了代码，翻的过围墙，开的了跑车又能耍流氓。\n宠的了姑娘，修的了家电，还能哄丈母娘。\n爱护小动物，知识而丰富，嗓音贼拉酷，光背影一杵就能吸引粉丝无数，你还在想啥？\n姑娘你说你是不是傻？'
>>>
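The session above leaves the file open. In a script, the same idea is usually written with a with statement so the file is closed automatically. The following is only a sketch: read_text is a hypothetical helper, and the demo file is created in a temporary directory so the example does not depend on the path used above.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Create a small demo file so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "recode_demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("你好Anson 你好Aimi\n")

content = read_text(demo_path)
print(content)
```

Passing the encoding explicitly makes the result independent of whatever default the current tool or operating system happens to use.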

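Problem solved. So happy, a round of applause!

Wait, something about the output looks off?

Look closely at the output:

In IDLE: an extra "\ufeff" appears before "你好".
In PyCharm CE: an extra "space" appears before "你好" (most likely the same character, which has no visible glyph when printed).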

'\ufeff你好Anson 你好Aimi\n\n欢迎使用程序喵博客！\n\nReady Go...\n\n姑娘你说你是不是傻？\n哥们从小习武，称霸校园。年少得志，学得一手计算机控制计算机挖掘机炒菜技能，站稳职场。\n一人吃饱全家小康，保证能把你养的膘肥体胖，貌美如花。姑娘你还担心啥？\n写得了代码，翻的过围墙，开的了跑车又能耍流氓。\n宠的了姑娘，修的了家电，还能哄丈母娘。\n爱护小动物，知识而丰富，嗓音贼拉酷，光背影一杵就能吸引粉丝无数，你还在想啥？\n姑娘你说你是不是傻？'

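What is "\ufeff", and how do we get rid of it?

So what on earth is this "\ufeff"? Don't worry. Perhaps you are still basking in the afterglow of that last Google session (good lad), so let's run one more search and find the answer.

In the end, the answer: "\ufeff" is the byte order mark (BOM) that some editors write at the start of a UTF-8 file. We only need to convert the file to UTF-8 without BOM. Many tools can do the conversion, such as Sublime Text, which I use, or Notepad++. Different styles, same effect; there is always one that suits you!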
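As an alternative to converting the file in an editor, Python's standard "utf-8-sig" codec skips a leading BOM while decoding. This is a swapped-in technique, not the editor-based fix described above, shown here as a sketch on a throwaway demo file (the file name bom_demo.txt is made up).

```python
import codecs
import os
import tempfile
import unicodedata

bom = "\ufeff"
bom_name = unicodedata.name(bom)
print(bom_name)                                # ZERO WIDTH NO-BREAK SPACE
print(bom.encode("utf-8") == codecs.BOM_UTF8)  # True

# Build a small demo file that starts with a UTF-8 BOM.
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
with open(demo_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "你好Anson".encode("utf-8"))

# Plain utf-8 keeps the BOM as the first character of the text.
with open(demo_path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# utf-8-sig drops a leading BOM if one is present.
with open(demo_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom))     # '\ufeff你好Anson'
print(repr(without_bom))  # '你好Anson'

# Writing the cleaned text back as plain utf-8 gives a file without a BOM.
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(without_bom)
with open(demo_path, "rb") as f:
    starts_with_bom = f.read().startswith(codecs.BOM_UTF8)
print(starts_with_bom)    # False
```

Reading with "utf-8-sig" also works on files that have no BOM, so it is a reasonable choice when the origin of a UTF-8 file is unknown.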

Encoding set, job done, problem solved. Clap! Clap! Clap!

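Note

As you may have noticed, IDLE shows the newline \n and other special characters as escape sequences, while in PyCharm CE they are interpreted, so \n becomes an actual line break. The difference most likely comes from how the value is displayed rather than from the tools themselves: evaluating f.read() at an interactive prompt echoes the repr() of the string, with escapes written out, whereas print() renders the characters.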
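The two display styles can be reproduced in any environment. This sketch assumes only built-in functions; the sample string is made up for illustration.

```python
text = "Ready Go...\n你好"

# What an interactive prompt shows when you evaluate an expression:
# the repr(), with the newline written as an escape sequence.
echoed = repr(text)
print(echoed)  # 'Ready Go...\n你好'

# What print() shows: the newline becomes a real line break.
print(text)

# The BOM behaves the same way: visible in repr(), invisible when printed.
bom_echoed = repr("\ufeff你好")
print(bom_echoed)  # '\ufeff你好'
```

This also explains the earlier observation: the "\ufeff" seen in IDLE and the blank seen in PyCharm CE are two renderings of the same character.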

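Summary

When you hit a problem, don't panic. Make good use of the help documentation, and if that doesn't satisfy you, throw yourself into the arms of Google and search away.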

Do not repost without permission: 程序喵 » Python 3 Extended Tutorial: A Chat About File Encoding
