Chinese Encoding Issues (Converting UTF-8 to Chinese) - CarlZeng

Updated on 2023-10-02

escape(): encodes a string using the ISO Latin-1 character set. Spaces, punctuation, accented characters, and other non-ASCII characters are replaced with %xx, where xx is the character's hexadecimal code; a space, for example, becomes %20. Characters above 0xFF, Chinese characters included, are written as %uXXXX instead. unescape() reverses the process. Characters that escape() leaves unencoded: ASCII letters and digits, plus @ * _ + - . /
English reference: MSDN JScript Reference: The escape method returns a string value (in Unicode format) that contains the contents of [the argument]. All spaces, punctuation, accented characters, and any other non-ASCII characters are replaced with %xx encoding, where xx is equivalent to the hexadecimal number representing the character. For example, a space is returned as "%20."
Mozilla Developer Core JavaScript Guide: The escape and unescape functions let you encode and decode strings. The escape function returns the hexadecimal encoding of an argument in the ISO Latin character set. The unescape function returns the ASCII string for the specified hexadecimal encoding value.
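To make the rules above concrete, here is a small sketch of what escape() returns for a few inputs. The sample strings are my own choice; escape() and unescape() are legacy globals that browsers and Node.js still provide.

```javascript
// A space, a Latin-1 accented letter, a Chinese character,
// and the punctuation that escape() leaves alone.
const escapeSamples = ["a b", "\u00E9", "一", "@*_+-./"];
const escapeResults = escapeSamples.map((s) => escape(s));

console.log(escapeResults[0]); // a%20b
console.log(escapeResults[1]); // %E9      (Latin-1 character: %xx)
console.log(escapeResults[2]); // %u4E00   (above 0xFF: %uXXXX)
console.log(escapeResults[3]); // @*_+-./  (unchanged)

// unescape() reverses both forms.
console.log(unescape("%u4E00")); // 一
```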

encodeURI(): encodes a complete URI string, replacing each character it does not leave alone with the percent-encoded bytes of that character's UTF-8 form. Characters that encodeURI() leaves unencoded: ASCII letters and digits, plus ; , / ? : @ & = + $ - _ . ! ~ * ' ( ) #
English reference: MSDN JScript Reference: The encodeURI method returns an encoded URI. If you pass the result to decodeURI, the original string is returned. The encodeURI method does not encode the following characters: ":", "/", ";", and "?". Use encodeURIComponent to encode these characters.
Mozilla Developer Core JavaScript Guide: Encodes a Uniform Resource Identifier (URI) by replacing each instance of certain characters by one, two, or three escape sequences representing the UTF-8 encoding of the character.
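As a quick illustration, the sketch below runs encodeURI() over a whole URL. The URL itself is a made-up example; the point is that the characters with a structural meaning in a URI (: / ? & = #) survive, while the space and the Chinese character are percent-encoded as UTF-8 bytes.

```javascript
// A made-up URL containing a space and a Chinese character.
const rawUrl = "http://example.com/a b/index.html?q=一&x=1#top";
const encodedUrl = encodeURI(rawUrl);

console.log(encodedUrl);
// http://example.com/a%20b/index.html?q=%E4%B8%80&x=1#top

// decodeURI() restores the original string.
console.log(decodeURI(encodedUrl) === rawUrl); // true
```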

encodeURIComponent(): encodes a single URI component, again as percent-encoded UTF-8 bytes. Compared with encodeURI(), it encodes more characters, such as / ? & and =. So if a string contains several parts of a URI, do not encode it with this method: once the / characters are encoded, the URL no longer works. Characters that encodeURIComponent() leaves unencoded: ASCII letters and digits, plus - _ . ! ~ * ' ( )
English reference: MSDN JScript Reference: The encodeURIComponent method returns an encoded URI. If you pass the result to decodeURIComponent, the original string is returned. Because the encodeURIComponent method encodes all characters, be careful if the string represents a path such as /folder1/folder2/default.html. The slash characters will be encoded and will not be valid if sent as a request to a web server. Use the encodeURI method if the string contains more than a single URI component.
Mozilla Developer Core JavaScript Guide: Encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, or three escape sequences representing the UTF-8 encoding of the character.
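The usual place for encodeURIComponent() is each individual key and value of a query string. The helper below is only a sketch: buildSearchUrl, the base URL, and the parameter names are hypothetical and not part of any library.

```javascript
// Hypothetical helper: encode every key and value separately,
// then join them with the structural characters ? = &.
function buildSearchUrl(base, params) {
  const query = Object.entries(params)
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
  return base + "?" + query;
}

const searchUrl = buildSearchUrl("http://example.com/search", {
  q: "一",
  next: "/folder1/folder2/default.html",
});

console.log(searchUrl);
// http://example.com/search?q=%E4%B8%80&next=%2Ffolder1%2Ffolder2%2Fdefault.html
```

Note how the slashes inside the next value are encoded, which is exactly what you want for a value and exactly what breaks a path.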

----------------------------------------

So, for Chinese strings: if you do not want the string converted to UTF-8 (for example, when the source page and the target page use the same charset), escape() is enough. Keep in mind that escape() writes Chinese characters in the non-standard %uXXXX form, so the receiving side has to know how to decode it. If your page is GB2312 or some other encoding and the page receiving the parameter is UTF-8, use encodeURI() or encodeURIComponent().
Also, encodeURI() and encodeURIComponent() were introduced in JavaScript 1.5, whereas escape() has been there since JavaScript 1.0.
English note: The escape() method does not encode the + character which is interpreted as a space on the server side as well as generated by forms with spaces in their fields. Due to this shortcoming, you should avoid use of escape() whenever possible. The best alternative is usually encodeURIComponent(). Use of the encodeURI() method is a bit more specialized than escape() in that it encodes for URIs as opposed to the querystring, which is part of a URL. Use this method when you need to encode a string to be used for any resource that uses URIs and needs certain characters to remain un-encoded. Note that this method does not encode the ' character, as it is a valid character within URIs. Lastly, the encodeURIComponent() method should be used in most cases when encoding a single component of a URI. This method will encode certain chars that would normally be recognized as special chars for URIs so that many components may be included. Note that this method does not encode the ' character, as it is a valid character within URIs.
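The + problem described in the note is easy to reproduce. In the sketch below, formDecode is a hypothetical stand-in for what a typical server does with form data (treat + as a space, then percent-decode); it is an assumption for illustration, not any particular server's code.

```javascript
const original = "a+b c'd";

const viaEscape = escape(original);                // a+b%20c%27d  (+ left as-is)
const viaEncodeURI = encodeURI(original);          // a+b%20c'd    (+ and ' left as-is)
const viaComponent = encodeURIComponent(original); // a%2Bb%20c'd  (+ encoded, ' left as-is)

// Hypothetical stand-in for server-side form decoding.
function formDecode(value) {
  return decodeURIComponent(value.replace(/\+/g, " "));
}

console.log(formDecode(viaEscape));    // a b c'd  (the + was lost)
console.log(formDecode(viaComponent)); // a+b c'd  (intact)
```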

----------------------------------------

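The Chinese character standard interchange code (GB2312) is divided into two levels. Level 1 holds the 3,755 commonly used characters, ordered by Hanyu Pinyin; level 2 holds the 3,008 less commonly used characters, ordered by radical. GB2312 codes range from 2121H to 777EH.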
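The bytes that show up in GB-encoded URLs are these codes with the high bit of each byte set (the EUC-CN form), so a code can be recovered by subtracting 80H from each byte. The sketch below does that for one character and also reports its row/cell number and level; describeGb2312 is a hypothetical helper, and the row ranges used (16 to 55 for level 1, 56 to 87 for level 2) are the standard GB2312 layout rather than something stated above.

```javascript
// Hypothetical helper: interpret the two bytes of a GB2312 character
// as they appear in a URL (high bit set on each byte).
function describeGb2312(hi, lo) {
  const code = ((hi - 0x80) << 8) | (lo - 0x80); // 7-bit code, 2121H..777EH
  const row = hi - 0xa0;  // row number, 1..94
  const cell = lo - 0xa0; // cell number, 1..94
  let level = "non-hanzi";
  if (row >= 16 && row <= 55) level = "level 1";
  else if (row >= 56 && row <= 87) level = "level 2";
  return { code: code.toString(16).toUpperCase() + "H", row, cell, level };
}

// The character 一 is the byte pair D2 BB in GB2312.
const yiInfo = describeGb2312(0xd2, 0xbb);
console.log(yiInfo); // { code: '523BH', row: 50, cell: 27, level: 'level 1' }
```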

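Legacy code pages such as GB2312 mix one-byte and two-byte characters: the value range of a byte decides whether it is an ASCII character or the high byte of a Chinese character. If the data is corrupted at some point, the byte pairing after the damage can shift and garble the Chinese characters that follow. Unicode in its original 16-bit form (UCS-2) instead uses two bytes for every character, ASCII included, and its most obvious benefit is that it simplifies the handling of Chinese text. (UTF-16, its successor, still uses two bytes for the characters discussed here and four bytes for characters outside the Basic Multilingual Plane.)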
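JavaScript strings are sequences of 16-bit units, so the "two bytes per character" model can be checked directly and compared with UTF-8. This is a rough sketch: byteSizes is a hypothetical helper, and it assumes an environment with the standard TextEncoder global (modern browsers and Node.js).

```javascript
const utf8Encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: size of a string in 16-bit form vs. UTF-8.
function byteSizes(text) {
  return {
    utf16Bytes: text.length * 2, // .length counts 16-bit code units
    utf8Bytes: utf8Encoder.encode(text).length,
  };
}

console.log(byteSizes("A"));  // { utf16Bytes: 2, utf8Bytes: 1 }
console.log(byteSizes("一")); // { utf16Bytes: 2, utf8Bytes: 3 }
```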


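Baidu's pages are GB2312, so its URL encoding is naturally derived from the GB bytes. The character "一", for example, becomes %D2%BB on Baidu, whereas the UTF-8 form, as used by Google, is %E4%B8%80. (GB2312 uses two bytes per Chinese character; UTF-8 is a variable-length encoding that uses three bytes for most Chinese characters.)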

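JavaScript's encodeURI and decodeURI give you the UTF-8 result; they always work in UTF-8, whatever the page encoding is. The GB form is what the browser itself produces when it submits a form or builds a query string on a GB2312 page.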
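A short sketch of the UTF-8 side, plus what goes wrong when the GB bytes for "一" are handed to a UTF-8 decoder. The bytes D2 BB happen to form a valid two-byte UTF-8 sequence, so instead of an error you silently get a different character (U+04BB).

```javascript
const utf8Form = encodeURI("一");
console.log(utf8Form);            // %E4%B8%80
console.log(decodeURI(utf8Form)); // 一

// The GB2312 form of the same character, misread as UTF-8.
const misread = decodeURIComponent("%D2%BB");
console.log(misread === "一");                           // false
console.log(misread.charCodeAt(0).toString(16));         // 4bb
```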

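I searched online and could not find a ready-made converter, so I had to write my own. Fortunately there is no shortage of GB-to-UTF mapping tables online, and one of them was usable after a few changes: gb-utf.txt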

This table maps GB byte codes to hexadecimal Unicode code points, not to UTF-8 byte sequences.

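In JavaScript, escape and unescape convert to and from hexadecimal codes, so the idea for turning a GB character into a UTF character is: take the percent-encoded GB form of the character, look up its hexadecimal Unicode code in the mapping table, then call unescape on that code in %uXXXX form to get the character.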
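For a single character, the lookup idea can be sketched as follows. The three-entry gbToUnicode object is a hypothetical excerpt standing in for the real mapping table, and gbCharToText is my own name for the helper.

```javascript
// Hypothetical excerpt of a GB -> Unicode table (hex strings):
// 一, 中, 文.
const gbToUnicode = { D2BB: "4E00", D6D0: "4E2D", CEC4: "6587" };

function gbCharToText(percentEncoded) {
  // "%D2%BB" -> "D2BB"
  const gbHex = percentEncoded.replace(/%/g, "").toUpperCase();
  const unicodeHex = gbToUnicode[gbHex];
  if (unicodeHex === undefined) {
    throw new Error("no table entry for " + gbHex);
  }
  // unescape() understands the %uXXXX form.
  return unescape("%u" + unicodeHex);
}

console.log(gbCharToText("%D2%BB")); // 一
console.log(gbCharToText("%d6%d0") + gbCharToText("%ce%c4")); // 中文
```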

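The URL encoding conversion tool takes a percent-encoded GB string such as %55%52%4C%20%B1%E0%C2%EB%D7%AA%BB%BB%B9%A4%BE%DF, looks up the hexadecimal Unicode code for each double-byte character in the mapping table, and then calls unescape on that code in %uXXXX form to get the Chinese text.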
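Applied to a whole string, the same idea needs to tell single ASCII bytes from double-byte characters. The following is a sketch under two assumptions: every byte in the input is percent-encoded (as in the sample string), and sampleTable is a hypothetical six-entry excerpt covering only the characters in that sample, not the real gb-utf.txt.

```javascript
// Hypothetical excerpt: GB2312 byte pair (hex) -> Unicode code point (hex).
const sampleTable = new Map([
  ["B1E0", "7F16"], // 编
  ["C2EB", "7801"], // 码
  ["D7AA", "8F6C"], // 转
  ["BBBB", "6362"], // 换
  ["B9A4", "5DE5"], // 工
  ["BEDF", "5177"], // 具
]);

function decodeGbUrl(encoded, table) {
  // "%55%52" -> [0x55, 0x52]
  const bytes = encoded.split("%").filter(Boolean).map((h) => parseInt(h, 16));
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] < 0x80) {
      out += String.fromCharCode(bytes[i]); // plain ASCII byte
      continue;
    }
    // High byte: pair it with the next byte and look the pair up.
    const key = ((bytes[i] << 8) | bytes[i + 1]).toString(16).toUpperCase();
    const unicodeHex = table.get(key);
    if (unicodeHex === undefined) throw new Error("no table entry for " + key);
    out += unescape("%u" + unicodeHex);
    i++; // the low byte has been consumed
  }
  return out;
}

const decodedSample = decodeGbUrl(
  "%55%52%4C%20%B1%E0%C2%EB%D7%AA%BB%BB%B9%A4%BE%DF",
  sampleTable
);
console.log(decodedSample); // URL 编码转换工具
```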

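The middle step is the crucial one, and it is the only step my converter implements itself; the other two steps simply call those two functions. The conversion program follows: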

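The mapping table is gb-utf.txt; just change the path it is read from to suit your setup.
Note also that the program above has to run on the server side, because it involves file operations.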
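For the file-reading part, a minimal Node.js sketch might look like the following. It is not the original program. The layout of gb-utf.txt is not shown here, so the format assumed below (one "GBHEX UNICODEHEX" pair per line, separated by whitespace) is purely an assumption, and parseGbTable and loadGbTable are hypothetical names.

```javascript
// Assumed line format: "D2BB 4E00". Lines that do not start with
// a four-digit hex code (comments, blanks) are skipped.
function parseGbTable(text) {
  const table = new Map();
  for (const line of text.split(/\r?\n/)) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 2 || !/^[0-9A-Fa-f]{4}$/.test(fields[0])) continue;
    table.set(fields[0].toUpperCase(), fields[1].toUpperCase());
  }
  return table;
}

// Server-side only: reads the table from disk with Node's fs module.
function loadGbTable(path) {
  const fs = require("fs");
  return parseGbTable(fs.readFileSync(path, "utf8"));
}

// Parsing an in-memory sample instead of a real file:
const demoTable = parseGbTable("# gb unicode\nD2BB 4E00\nd6d0 4e2d\n");
console.log(demoTable.get("D2BB")); // 4E00
console.log(demoTable.size);        // 2
```

Keeping the parsing separate from the file access means the lookup logic can be tried out without the file, and only loadGbTable needs the path adjusted.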
