Fails to break long multibyte strings. #273

Closed
opened 2023-03-09 13:06:57 +00:00 by fg_toshiyuki · 6 comments

When TINYIB_WORDBREAK is set, posting a long multibyte string (for example, Japanese text) causes problems such as the post content being cut off partway through.

Posting long sentences without spaces, like the one below, is common on Japanese Futaba boards. However, the word-break processing does not appear to handle multibyte encodings correctly, so the message is cut off in the middle.

(Example: the well-known Futaba meme KOUSHIROU-bot. The encoding is UTF-8.)

よく来たぴるす君、まあ座りたまえ。何、今日は君を叱るために呼んだんじゃないんだよ、たか子、彼にまろ茶は入れてやりなさい。それで話というのはだねぷるす君、なんだねまか子、今話をして・・・何?たろ茶がない?ならば最初から茶葉のままお出しすればいい!なんだねピエンロー君、茶ころの入れたたまが飲めないのかね!なら君には何もやらん!立て!のんびりしている暇はないのだよピチパツ白スク水君!公演は近いぞ練習をしなさい!何?舞台には立てない?それでも歌舞伎役者かねピーチジョン君!そんな弱音を吐く口にもバンテリンはスーッと効いて・・・何?口に入れる物じゃない?しかし染五郎よ少し遅かったようだ。せめて彼の亡骸をHIRAKIにして菩提を弔うと・・・まだ息がある!ちゃんと止めを刺さないか!

Sending this sentence results in something like the attached image.

![koushirou_issue](https://code.rocketnine.space/attachments/0280605f-ad6d-404f-9db5-935b19648a23)

We found that the TINYIB_WORDBREAK handling does not treat multibyte strings correctly, so we applied the following temporary fix. It assumes the encoding is UTF-8.

defines.php
// Line:11
define('TINYIB_WORDBREAK_IDENTIFIER', '@!@TINYIB_WORDBREAK@!@');

// toshiyuki: Bug fix for multibyte word break processing.
// The PCRE 'u' modifier makes the pattern and subject be treated as UTF-8,
// so the quantifier counts characters instead of bytes.
if (function_exists('mb_internal_encoding') && mb_internal_encoding() == 'UTF-8') {
	define('TINYIB_WORDBREAK_MULTIBYTE', 'u');
} else {
	define('TINYIB_WORDBREAK_MULTIBYTE', '');
}

imgboard.php
// Line:334

		} else {
			// toshiyuki: Bug fix for multibyte word break processing.
			if (TINYIB_WORDBREAK > 0) {
				$post['message'] = preg_replace('/([^\s]{' . TINYIB_WORDBREAK . '})(?=[^\s])/' . TINYIB_WORDBREAK_MULTIBYTE, '$1' . TINYIB_WORDBREAK_IDENTIFIER, $post['message']);
			}
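
To illustrate why the modifier matters, here is a minimal standalone sketch. It assumes the unpatched pattern is the same one without the `u` modifier; `DEMO_WORDBREAK`, `DEMO_IDENTIFIER`, and `demo_wordbreak()` are hypothetical stand-ins for this demo, not TinyIB names. Without `u`, `{4}` counts bytes, so the marker can land inside a 3-byte UTF-8 character; with `u`, it counts characters. The source file is assumed to be saved as UTF-8, and the default "C" locale is assumed.

```php
<?php
// Hypothetical stand-ins for TINYIB_WORDBREAK and TINYIB_WORDBREAK_IDENTIFIER.
const DEMO_WORDBREAK = 4;
const DEMO_IDENTIFIER = '|';

// Same pattern shape as the patch above; $modifier is '' or 'u'.
function demo_wordbreak($message, $modifier)
{
    return preg_replace(
        '/([^\s]{' . DEMO_WORDBREAK . '})(?=[^\s])/' . $modifier,
        '$1' . DEMO_IDENTIFIER,
        $message
    );
}

$text = 'よく来たぴるす君'; // 8 characters, 24 bytes in UTF-8

$bytewise = demo_wordbreak($text, '');  // breaks every 4 bytes
$charwise = demo_wordbreak($text, 'u'); // breaks every 4 characters

// preg_match('//u', ...) returns 1 only when the subject is valid UTF-8.
$bytewiseIsValid = preg_match('//u', $bytewise) === 1; // false: a character was split
$charwiseIsValid = preg_match('//u', $charwise) === 1; // true

echo $charwise, "\n"; // よく来た|ぴるす君
```

The byte-wise result is no longer valid UTF-8, which matches the kind of truncated output shown in the attached image.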

Author

This modified code probably has the following problems.

  • It will not work correctly for multibyte strings when the mbstring module is not enabled, because the 'u' modifier is only set when mb_internal_encoding() exists. However, most people who need to post UTF-8 multibyte strings on TinyIB will have mbstring enabled.

  • It cannot handle multibyte encodings other than UTF-8 (such as Shift_JIS, "sjis"), because PHP's PCRE functions only support UTF-8 as a multibyte encoding. If you run into this problem, you have no choice but to force UTF-8 encoding.
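
One possible way around the mbstring dependency in the first point, sketched here as an assumption and not as what TinyIB does: decide per message by asking PCRE itself whether the text is valid UTF-8. `wordbreak_modifier_for()` is a hypothetical helper name, and the sketch assumes PHP's PCRE was built with Unicode support.

```php
<?php
// Hypothetical helper, not part of TinyIB.
// Returns 'u' when $message is valid UTF-8, otherwise '' (byte-wise matching).
function wordbreak_modifier_for($message)
{
    // With the 'u' modifier, preg_match() returns false for a subject that
    // is not valid UTF-8, so an empty pattern works as a validity check
    // that does not need mbstring.
    return preg_match('//u', $message) === 1 ? 'u' : '';
}

$utf8Modifier = wordbreak_modifier_for('まあ座りたまえ'); // valid UTF-8
$rawModifier  = wordbreak_modifier_for("\xE3\x82");       // truncated sequence
```

Text in another encoding such as Shift_JIS would usually fail this check and fall back to byte-wise matching, so it would still need to be converted to UTF-8 before the word break is applied.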

tslocum added the bug label 2023-03-28 04:59:00 +00:00
Owner

Thanks for reporting this. There are several areas of TinyIB which do not handle UTF-8 properly. I will look into this when I have the time.

Owner

This will probably require a version bump in the minimum supported PHP version. PHP 7 was released in late 2015, about eight years ago now.

Author

gettext probably does not work on PHP 5. To use gettext for translation, PHP 7.4 or later is probably required, because gettext uses typed class properties.

On PHP 5.6.40 it failed with the following gettext errors:

Warning: Unsupported declare 'strict_types' in /var/www/html/imgboard/inc/gettext/src/Loader/PoLoader.php on line 2
Parse error: syntax error, unexpected ':', expecting ';' or '{' in /var/www/html/imgboard/inc/gettext/src/Loader/PoLoader.php on line 14

Owner

Thanks. I had upgraded the gettext library to a version that was incompatible with versions before PHP 7. I've resolved this just now so TinyIB can continue to work on PHP 5.

Author

I checked the latest sources and confirmed that TinyIB works with PHP 5.6.40. In my environment, the TINYIB_WORDBREAK_MULTIBYTE modification also works, but this may depend on the behavior of mbstring.

In Japan, mbstring is enabled in almost all environments, so it is difficult for me to verify operation without mbstring.

Reference: tslocum/tinyib#273