How do I analyze log files whose content mixes different Chinese encodings?

This topic was created 2744 days ago; the information in it may have since evolved or changed.

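At work I need to analyze some files sent over by a customer, but the file encoding turns out to be a mess. I suspect the customer stitched together content from several sources without transcoding any of it. file -I can't identify it: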

$ file -I logs/mv1.txt
logs/mv1.txt: text/plain; charset=unknown-8bit

Opened in Atom, the file looks roughly like this: Log

I wrote a function that extracts the garbled part, and I want to re-encode it:


if __name__ == "__main__":
    cont = get_content("mv1.txt")
    #print cont.decode('gbk')
    print cont.decode('gb2312')

But it always fails with this error:

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 0-1: illegal multibyte sequence

The customer is fairly pushy and I can't get them to change the log file format. Does anyone have a solution?

9 replies • 2017-05-24 16:58:40 +08:00

binux

2017-05-24 16:20:11 +08:00

Read it line by line and decode each line separately.
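
A minimal sketch of the line-by-line idea, written for Python 3 (the snippets in this thread are Python 2). The helper name `decode_lines` and the sample bytes are made up for illustration; for a real file the bytes would come from opening it in binary mode. Splitting raw bytes on newlines should be safe here because the newline bytes do not occur inside a multi-byte character in UTF-8, GBK, GB18030, or Big5.

```python
def decode_lines(raw, encoding="utf-8"):
    """Decode each line on its own; keep lines that fail as raw bytes."""
    decoded, failed = [], []
    for lineno, line in enumerate(raw.splitlines(), start=1):
        try:
            decoded.append((lineno, line.decode(encoding)))
        except UnicodeDecodeError:
            # Hold on to the original bytes so they can be retried later
            # with another codec instead of being lost.
            failed.append((lineno, line))
    return decoded, failed


# Hypothetical sample: one UTF-8 line, one GBK line, one ASCII line.
# For a real file: raw = open("logs/mv1.txt", "rb").read()
sample = (
    "first 行\n".encode("utf-8")
    + "第二行\n".encode("gbk")
    + b"plain ascii\n"
)
ok_lines, bad_lines = decode_lines(sample)
```

Only the lines in `bad_lines` then need a second look, which keeps one bad line from aborting the whole file.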

fyooo

2017-05-24 16:25:23 +08:00

@binux cont already follows the line-by-line idea: I pulled the garbled "content" line out on its own, but I don't know how to decode it correctly.

fyooo

2017-05-24 16:26:35 +08:00

The result from `chardet.detect` is:

```
{'confidence': 0.3888443803816883, 'language': 'Turkish', 'encoding': 'Windows-1254'}
```

That doesn't seem to be much use as a reference.

enenaaa

2017-05-24 16:37:14 +08:00

At the very least, get the customer to tell you which encodings are in there.

shalk

2017-05-24 16:40:39 +08:00

So what encoding is it, actually?

binux

2017-05-24 16:48:53 +08:00

@fyooo #2 Chinese text generally comes in only 2-3 encodings; just try utf8, gbk, and big5 one by one.
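
Trying the candidates one by one could look roughly like this in Python 3. The function name `decode_first_match` is hypothetical, and the order is an assumption: UTF-8 goes first because it is the strictest of the three, while GBK and Big5 share much of the same byte range, so a successful decode with either of them is only a guess that still needs a human eye.

```python
CANDIDATES = ("utf-8", "gbk", "big5")


def decode_first_match(raw, encodings=CANDIDATES):
    """Return (text, encoding) for the first codec that decodes raw cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    return None, None  # nothing matched; leave the bytes for manual inspection


utf8_result = decode_first_match("中文".encode("utf-8"))
gbk_result = decode_first_match("中文".encode("gbk"))
unknown_result = decode_first_match(b"\xff\xff")
```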

binux

2017-05-24 16:49:21 +08:00

@fyooo #2 Also, don't use gb2312; use gb18030.
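
A small Python 3 illustration of why gb18030 is the safer choice: GB18030 is a superset of GBK, which in turn extends GB2312, so it accepts everything gb2312 does plus many characters gb2312 lacks. The helper `decodes_with` is made up for this sketch, and "镕" is used as an example of a character that is absent from GB2312 but present in GBK and GB18030.

```python
def decodes_with(raw, encoding):
    """Report whether raw decodes without error under the given codec."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A character outside GB2312, encoded with the larger character set.
extended = "镕".encode("gb18030")
# Ordinary text encoded with the smaller character set.
basic = "中文".encode("gb2312")

extended_as_gb2312 = decodes_with(extended, "gb2312")
extended_as_gb18030 = decodes_with(extended, "gb18030")
basic_as_gb18030 = basic.decode("gb18030")
```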

murmur

2017-05-24 16:50:13 +08:00

Looks like this one calls for parsing it as binary slices and then decoding it segment by segment.
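
One way to start on the binary side, sketched in Python 3: `UnicodeDecodeError` carries `start` and `end` offsets, so the offending bytes can be sliced out and inspected as hex before deciding where to cut the data. The function name `locate_decode_failure` is hypothetical, and this only finds the first failure; it does not decide the segment boundaries by itself.

```python
def locate_decode_failure(raw, encoding):
    """Return (start, end, hex of the offending bytes), or None if raw decodes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the byte range the codec rejected.
        return err.start, err.end, raw[err.start:err.end].hex()
    return None


failure = locate_decode_failure(b"abc\xff", "utf-8")
clean = locate_decode_failure(b"abc", "utf-8")
```

Looking at the rejected bytes in hex is often enough to tell whether a slice looks like UTF-8 or like a double-byte encoding.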

fyooo

2017-05-24 16:58:40 +08:00

Thanks to everyone above. It turns out it was cont.decode('utf8').encode('gb18030'). Sigh!
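
Read literally, that fix suggests the extracted bytes were UTF-8 all along, which would explain why decoding them as gb2312 failed. That is an inference from the one-liner, not something the thread states. A rough Python 3 equivalent, with a made-up helper `transcode` and sample text:

```python
def transcode(raw, src="utf-8", dst="gb18030"):
    """Decode bytes from src into text, then encode that text as dst."""
    text = raw.decode(src)
    return text.encode(dst)


raw = "中文".encode("utf-8")   # stand-in for the extracted "content" bytes
converted = transcode(raw)      # same text, now as GB18030 bytes

# The same bytes are not valid gb2312, matching the error in the question.
try:
    raw.decode("gb2312")
    gb2312_failed = False
except UnicodeDecodeError:
    gb2312_failed = True
```

In Python 3 the encode step is only needed if the output really has to be GB18030 bytes; for printing or analysis, the decoded text is usually enough.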