如何:转换旧式编码的MySQL数据库到UTF-8

| 7 Comments

最近帮公司的一个客户做了一个数据库迁移,客户声称数据是 utf-8 的,然而在使用过程中出现了许多乱码,检查发现数据并非 utf-8,而是 utf-8 编码之后的 big5,而排序方式更是混乱不堪的默认的utf8-swedian-ci。

MySQL的国际化支持很差。MySQL从 4.1 版本开始大刀阔斧地进行了不兼容的改动,简单地说,这些改动让相当多的操作默认以UTF-8进行,然而这会给旧的应用程序带来很多问题。许多文献推荐使用 SET CHARACTER SET 作为 workaround,尽管这能够解决一些问题,但这种做法本质上会导致 MySQL 进行额外的转换,因此并不是十分理想。

正确的方法是把所有的东西都转换成 UTF-8。一个比较常见的错误是使用SET NAMES UTF8并直接从原始的 Big5 表中导出。这种情况下,文字会被以UTF-8编码的big5方式保存。比较简单的做法是将其直接导入数据表,并以 latin-1 导出:

mysqldump -u 用户名 -p密码 --skip-extended-insert --default-character-set=latin1 --databases 数据库名 > big5.sql

接着,使用 piconv (随 Perl 提供)来转换编码。我个人认为这个工具比 uconv 和 iconv 都要好。

piconv -f big5 -t utf8 < big5.sql > utf8.sql

接下来要把 utf8.sql 中的 latin1 等不正确的编码改为 utf8(通常,简单地sed即可)

最后重新导入。前面 skip-extended-insert 可以让发生错误时显示较具体的行号,这对于较大的数据库而言会比较方便。

声明:此 HOWTO 文档以现状方式提供,不提供任何担保。此文档为个人作弊条,本人会对部分疑问进行解释,但同样不提供任何保证。

此篇按下述授权发表:

/*-
* Copyright (c) 2009 Xin LI <delphij@FreeBSD.org>
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
*    notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
*    notice, this list of conditions and the following disclaimer in the
*    documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
*/

7 Comments

openid 登录了还可以输 URL 哦~~

但是不能输入 nickname 呢~~

大家好,我其实叫 atppp。。。

要是早点看到这篇文章就好了。。。

能说下为啥 piconv 比那两个要好?

delphij大神的这个license很强大

@atppp 想知道ID的来历。惯性的看做atapp。

Leave a comment

About this Entry

This page contains a single entry by Xin LI published on March 26, 2009 8:23 AM.

终于,基于BIOS的终极木马诞生了 was the previous entry in this blog.

Overhaul of libc Berkeley DB has been committed is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Pages

OpenID accepted here Learn more about OpenID
Powered by Movable Type 5.01