Dev Corner

Software Developer’s Notepad

Problem: You have a UTF-8 encoded string and you need to convert it to a string of two-byte Unicode code units. For example, you might want to convert the string to CP1251 (Windows-1251) encoding using standard encoding tables.
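To make the target format concrete, here is a minimal sketch that converts a single two-byte UTF-8 sequence by hand. The sample character and the variable names are chosen for illustration only and are not part of the original recipe.

```php
<?php
// "Я" (U+042F) is stored in UTF-8 as the two bytes 0xD0 0xAF.
$bytes = "\xD0\xAF";

// Strip the UTF-8 marker bits: 110xxxxx for the lead byte, 10yyyyyy for the continuation byte.
$lead = ord($bytes[0]) & 0x1F;
$cont = ord($bytes[1]) & 0x3F;

// Join the payload bits into one code point.
$codePoint = ($lead << 6) | $cont;

// Emit the code point as two bytes, high byte first.
$twoByte = chr($codePoint >> 8) . chr($codePoint & 0xFF);

echo bin2hex($twoByte), "\n"; // prints 042f
```

The two output bytes, 0x04 and 0x2F, are simply the code point U+042F split into its high and low halves.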

Solution:
Here is a simple PHP function that converts a UTF-8 string to the corresponding two-byte Unicode representation. It was written to solve a specific problem and is not complete: it supports only one-, two- and three-octet (byte) UTF-8 sequences, and it does not handle a BOM.

/**
 * Decode a UTF-8 string (without BOM) to a string of two-byte Unicode code units, high byte first.
 *
 * Note: PHP already has a built-in utf8_decode(), so this function uses a different name.
 *
 * @param $string string Original UTF-8 encoded string we need to decode.
 * @param $strip_zeroes bool Omit zero high bytes from the converted code units. Default is false.
 * @return string Two-byte representation of the original UTF-8 encoded string.
 *
 * @todo Add support for four byte characters.
 * @author Ivan Georgiev
 */
function utf8_to_two_byte($string, $strip_zeroes = false) {
	$pos = 0;
	$len = strlen($string);
	$result = '';
 
	while ($pos < $len) {
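		$code1 = ord($string[$pos++]);
		if ($code1 < 0x80) {
			// One byte (ASCII): the high byte is always zero
			if (!$strip_zeroes) {
				$result .= "\0";
			}
			$result .= chr($code1);
		} elseif ($code1 < 0xC0) {
			// Stray continuation byte (10xxxxxx), not a valid lead byte: skip it
			continue;
		} elseif ($code1 < 0xE0) {
			// Two byte: 110xxxxx 10yyyyyy
			$code1 = 0x1F & $code1;
			$code2 = 0x3F & ord($string[$pos++]);
			$res_code1 = $code1 >> 2;
			// Keep a zero high byte unless the caller asked to strip it
			if ($res_code1 > 0 || !$strip_zeroes) {
				$result .= chr($res_code1);
			}
			$result .= chr((($code1 << 6) | $code2) & 0xFF);
		} elseif ($code1 < 0xF0) {
			// Three byte: 1110xxxx 10yyyyyy 10zzzzzz
			$code1 = 0x0F & $code1;
			$code2 = 0x3F & ord($string[$pos++]);
			$code3 = 0x3F & ord($string[$pos++]);
			// Keep the high byte as an integer so it can be compared and passed to chr()
			$res_code1 = (($code1 << 4) | ($code2 >> 2)) & 0xFF;
			if ($res_code1 > 0 || !$strip_zeroes) {
				$result .= chr($res_code1);
			}
			$result .= chr((($code2 << 6) | $code3) & 0xFF);
		}
		// Lead bytes 0xF0 and above (four byte sequences) are not supported and are skipped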
	}
 
	return $result;
}
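
Once the text is in two-byte form, converting it to CP1251 is a table lookup. As a rough sketch of that step, the hypothetical helper below (two_byte_to_cp1251 is not part of the original recipe) covers only ASCII, the basic Russian alphabet and Ё/ё. It assumes the input consists of fixed two-byte units with the high byte first, that is, zero bytes were not stripped. A complete converter would use the full Windows-1251 table.

```php
<?php
/**
 * Hypothetical helper: map two-byte code units (high byte first) to CP1251 bytes.
 * Covers only ASCII, U+0410..U+044F and U+0401/U+0451; anything else becomes '?'.
 */
function two_byte_to_cp1251($units) {
	$result = '';
	$len = strlen($units);
	for ($pos = 0; $pos + 1 < $len; $pos += 2) {
		$code = (ord($units[$pos]) << 8) | ord($units[$pos + 1]);
		if ($code < 0x80) {
			// ASCII is identical in CP1251
			$result .= chr($code);
		} elseif ($code >= 0x0410 && $code <= 0x044F) {
			// А..я occupy 0xC0..0xFF in CP1251
			$result .= chr($code - 0x0410 + 0xC0);
		} elseif ($code == 0x0401) {
			$result .= chr(0xA8); // Ё
		} elseif ($code == 0x0451) {
			$result .= chr(0xB8); // ё
		} else {
			$result .= '?'; // not covered by this sketch
		}
	}
	return $result;
}

// "Пр" as two-byte units: U+041F U+0440
echo bin2hex(two_byte_to_cp1251("\x04\x1F\x04\x40")), "\n"; // prints cff0
```

Working on fixed two-byte units keeps the lookup simple, which is one reason to leave the zero bytes in place when the result is going to be fed into a table like this.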

See also: Generic File Input Stream in PHP with UTF-8 Support