
Thread: LineReader does not correctly read multi-byte character.


supercreek

Posts: 22
Registered: 09/04/06
LineReader does not correctly read multi-byte character.
Posted: 21-Jun-2006 13:09

Please see the screenshot attached to this thread.
Multi-byte characters are NOT displayed correctly on the log tailing view page.
This is a major bug. I investigated it and have found the cause.

The problem is in the LineReader class.
In the LineReader constructor, the BackwardsFileStream object is wrapped in a new InputStreamReader instance.
--------------------------------------------------------------------------------------------------------------
    public LineReader(InputStream stream, boolean backwards) {
        this.readBackwards = backwards;

        this.streamReader = new InputStreamReader(stream); // <-- HERE!
        tokenizer = new Tokenizer(streamReader);
        tokenizer.addSymbol(new TokenizerSymbol(LINE_SEPARATOR, "\n", null, false, false, true, false));
        tokenizer.addSymbol(new TokenizerSymbol(LINE_SEPARATOR, "\r", null, false, false, true, false));

        if (backwards) {
            tokenizer.addSymbol(new TokenizerSymbol(LINE_SEPARATOR, "\n\r", null, false, false, true, false));
        } else {
            tokenizer.addSymbol(new TokenizerSymbol(LINE_SEPARATOR, "\r\n", null, false, false, true, false));
        }
    }
--------------------------------------------------------------------------------------------------------------
In this case, only the multi-byte characters are decoded incorrectly: their bytes reach the decoder in reversed order, so the InputStreamReader always returns the wrong characters for them.

The root cause is that the InputStreamReader reads the reversed bytes and decodes them into characters as they arrive.
Therefore the "read-line" algorithm in the LineReader class has to be fundamentally reconsidered.
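
To illustrate the effect described above, here is a minimal sketch that decodes the bytes of a string in reversed order. It is not code from Lambda Probe: the class and method names are hypothetical, and it uses UTF-8 instead of MS932 only so that it runs on any JVM. The same thing happens with any multi-byte encoding.

```java
import java.nio.charset.StandardCharsets;

class ReversedDecodeDemo {

    // Returns a reversed copy of the given byte array.
    static byte[] reverse(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[in.length - 1 - i];
        }
        return out;
    }

    // Decodes the bytes of `line` in reversed order, which is what a reader
    // wrapped directly around a backwards byte stream ends up doing.
    static String decodeBackwards(String line) {
        byte[] reversed = reverse(line.getBytes(StandardCharsets.UTF_8));
        return new String(reversed, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Single-byte text simply comes out reversed, character by character.
        System.out.println(decodeBackwards("abc").equals("cba"));
        // Two Japanese characters (3 bytes each in UTF-8): the reversed bytes
        // no longer form the original characters, so the result is garbage
        // rather than the two characters in swapped order.
        System.out.println(decodeBackwards("\u65e5\u672c").equals("\u672c\u65e5"));
    }
}
```

The first comparison holds and the second does not, which matches the symptom in the screenshot: ASCII parts of a log line look fine while the multi-byte parts are garbled.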



turbomonkey

Posts: 329
Registered: 05/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 21-Jun-2006 18:08   in response to: supercreek

can you also attach the file containing the multibyte characters that is causing the problem, please?

Also, what is the default charset on your system?


supercreek

Posts: 22
Registered: 09/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 22-Jun-2006 01:34   in response to: turbomonkey
Attachment catalina.out (75.0 K)

I have attached the log file that caused this problem; please review it.
The default character set of my Java runtime environment (on Windows 2000, Japanese Locale) is "MS932".

Regards,

turbomonkey

Posts: 329
Registered: 05/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 22-Jun-2006 09:03   in response to: supercreek

thanks Kan.


supercreek

Posts: 22
Registered: 09/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 29-Jun-2006 20:43   in response to: turbomonkey

I am submitting sample code for a backwards line reader. This sample code supports reading multi-byte characters.
It is a simple sample, and I think it will be very helpful for this issue.
Can you resolve the issue by referring to it?

Regards,
Kan Ogawa



turbomonkey

Posts: 329
Registered: 05/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 29-Jun-2006 21:40   in response to: supercreek

thanks a lot Kan. Unfortunately it's been a very busy week at work and i didn't have a chance to get much done on the project.

I have, however, looked into the problem and know how it can be fixed. The existing LineReader is flawed because it uses Tokenizer, which is simply unable to read a stream backwards. The tokenizer has to be dropped, and a simple line parsing algorithm is in order.
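
A rough sketch of what such a simple line parsing algorithm could look like follows. It is not the code that was submitted in this thread, and all names in it are hypothetical. It finds line boundaries on the raw bytes while walking backwards, then decodes each line's bytes in their original order. It assumes an encoding in which the '\n' and '\r' byte values never occur inside a multi-byte character (true for UTF-8 and, as far as I know, for Shift_JIS-style encodings such as MS932, but not for UTF-16). For brevity it works on an in-memory byte array rather than a file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BackwardsLinesSketch {

    // If a line terminator ("\r\n", "\n" or "\r") ends at index `end`,
    // returns the index just before it; otherwise returns `end` unchanged.
    static int skipTerminator(byte[] data, int end) {
        if (end > 0 && data[end - 1] == '\n') {
            end--;
            if (end > 0 && data[end - 1] == '\r') {
                end--;
            }
        } else if (end > 0 && data[end - 1] == '\r') {
            end--;
        }
        return end;
    }

    // Returns up to maxLines lines from the end of `data`, last line first.
    static List<String> lastLines(byte[] data, int maxLines, Charset charset) {
        List<String> lines = new ArrayList<>();
        int end = skipTerminator(data, data.length); // drop the final terminator
        boolean more = data.length > 0;
        while (more && lines.size() < maxLines) {
            int start = end;
            // Walk backwards over raw bytes until the previous terminator.
            while (start > 0 && data[start - 1] != '\n' && data[start - 1] != '\r') {
                start--;
            }
            // Decode this line's bytes in forward order, so multi-byte
            // characters are seen by the decoder exactly as they were written.
            lines.add(new String(data, start, end - start, charset));
            more = start > 0;
            end = skipTerminator(data, start);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] log = "first\r\n\u65e5\u672c\u8a9e\nlast\n".getBytes(StandardCharsets.UTF_8);
        // Prints the last two lines, newest first.
        System.out.println(lastLines(log, 2, StandardCharsets.UTF_8));
    }
}
```

The point of the design is that decoding never sees reversed bytes: only the search for line boundaries runs backwards.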

I'll have a look at your code and will replace the existing LineReader over this weekend.

sorry if i've appeared to've ignored the issue. :(


turbomonkey

Posts: 329
Registered: 05/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 29-Jun-2006 22:26   in response to: supercreek

Ok, had a look at your version of LineReader and it works just fine. The update is submitted to trunk.

I've modified the code slightly though:

1. reduced the line buffer size from 8K to 512 bytes. I think it is unlikely that tomcat logs will contain lines as long as 8K.

2. the reverse byte array method only needs to go through half of the array, which is a bit faster.

3. encoding is unused in this particular circumstance because the files are written using the default encoding and should be read using the same one. The option of using an encoding is still there in the code though.
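
Points 2 and 3 can be sketched roughly as follows. This is only an illustration, not the code committed to trunk: the class, the method names and the `length` parameter are assumptions made for the example.

```java
import java.nio.charset.Charset;

class ReverseBufferSketch {

    // Reverses the first `length` bytes of `buf` in place. Each step swaps
    // one byte from the front with one from the back, so the loop only runs
    // over half of the range.
    static void reverseInPlace(byte[] buf, int length) {
        for (int i = 0, j = length - 1; i < j; i++, j--) {
            byte tmp = buf[i];
            buf[i] = buf[j];
            buf[j] = tmp;
        }
    }

    // Decodes the first `length` bytes of `buf`. A null encoding means
    // "use the JVM default charset"; otherwise the named charset is used.
    static String decode(byte[] buf, int length, String encoding) {
        Charset charset = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding);
        return new String(buf, 0, length, charset);
    }

    public static void main(String[] args) {
        // A line collected while reading backwards arrives last byte first;
        // only the first 3 bytes of this buffer are in use.
        byte[] buf = {'c', 'b', 'a', 0};
        reverseInPlace(buf, 3);
        System.out.println(decode(buf, 3, "US-ASCII"));
    }
}
```

Reversing the bytes before decoding is what keeps multi-byte characters intact; the half-length loop is just the cheaper way to do that reversal.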

Otherwise i think the problem is fixed (based on the log file you submitted). Could you check the update on your system anyway?

Regards,
Vlad


supercreek

Posts: 22
Registered: 09/04/06
Re: LineReader does not correctly read multi-byte character.
Posted: 29-Jun-2006 23:33   in response to: turbomonkey

I have updated the source code and reviewed it.
It works correctly. Thanks for applying my patch!

Regards,
Kan Ogawa

