extensions/universalchardet/doc/ChardetInterface.htm
author martijn.martijn@gmail.com
Thu, 12 Apr 2007 09:10:42 -0700
changeset 494 c80966c1021b2e2d547990dc06c7d677dc4f858f
parent 1 9b2a99adc05e53cd4010de512f50118594756650
permissions -rw-r--r--
Mochikit test for bug 371724, r=sayrer

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
   <title>Charset Detector Interface</title>
</head>
<body>
<font face="Arial">Charset Detector Interface</font>
<p><font size=-1>This is the charset detector�s interface that is exposed
to outside world, in our case, the browser. In the very beginning, caller
calls detector�s "Init()" method and let detector know how it would like
to be notified about the detecting result. Observer pattern is used in
this case. Then the caller just need to feed charset detector with text
data through "DoIt()". This can be done through a series "DoIt()" calls,
with each call only contains part of the data. This can be very useful
if the data is only partially available at one time. In our case, since
the data comes from network, we can start detecting long before network
finishes transferring all data. When detector is confident enough about
one encoding, it will notify its caller and stop detecting. If all data
has been feed to detector but detector still is not confident enough about
any encoding, method "Done" will tell detector to make a best guess.</font>
<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
nsISupports {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
<p><font face="Courier New"><font size=-1>&nbsp; //Setup the observer so
it know how to notify the answer</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; NS_IMETHOD Init(nsICharsetDetectionObserver*
observer) = 0;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp; //Feed a block of bytes
to the detector.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; //It will call the Notify
function of the nsICharsetObserver if it</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; //find out the answer.</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; // aBytesArray - array
of bytes</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; // aLen - length of aBytesArray</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; // oDontFeedMe - return
PR_TRUE if the detector do not need the</font></font>
<br>&nbsp; <font face="Courier New"><font size=-1>// following block</font></font>
<br>&nbsp; <font face="Courier New"><font size=-1>// PR_FALSE it need more
bytes.</font></font>
<br>&nbsp; <font face="Courier New"><font size=-1>// This is used to enhance
performance</font></font>
<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD DoIt(const
char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
<p>&nbsp; /<font face="Courier New"><font size=-1>/It also tell the detector
the last chance the make a decision</font></font>
<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD Done() = 0;</font></font>
<br><font size=-1>}<font face="Courier New">;</font></font>
<br>&nbsp;
<br>&nbsp;
<p><font face="Arial">Inside Charset Detector</font>
<p><font size=-1>Inside Charset Detector, major work is done by function
"HandleData()". In fact, "DoIt" has very little extra thing to do other
than call "HandleData". The following is the algorithm logic using C-Like
Pseudo-Language. Some detail is drop in order to make main point more clear.</font>
<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
<br><font face="Courier New"><font size=-1>{</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; if (batch_of_text contains
BOM)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; report UCS2;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; if ((inputState is PureAscii)
|| (inputState is EscAscii))</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; if (batch_of_text
contains 8-bits-byte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
inputState = HighByte;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; else if ((inputState
is PureAscii ) &amp;&amp; (batch_of_text contains Esc_Sequence) )</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
inputState = EscAscii;</font></font>
<p><font face="Courier New"><font size=-1>&nbsp; if (inputState is HighByte)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Remove Ascii
character that is not neighboring to 8-bits byte</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
prober in multibyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
prober in singlebyte_probers</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; else if (inputState is
EscAscii)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
prober in (ISO2022_XX or HZ)</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
<br><font face="Courier New"><font size=-1>}</font></font>
<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
<p><i><font face="Courier New"><font size=-1>Implemented the high level
control logic.</font></font></i>
<br>&nbsp;
<br>&nbsp;
<p>Charset Prober
<p><font size=-1>A charset prober verifies if the input data is belong
to certain encoding or group of encoding. It maintains its state in member
"mState", which has 3 possible value. State "eDetecting" means it hasn�t
found any sure answer yet, "eFoundIt" and "eNotMe" carries the same meaning
as their names. Method "GetCharSetName" tell its caller its sure answer
or best guess.</font>
<p><font size=-1>Generally, for each encoding we implemented a charset
prober. Several probers can be wrapped together with a wrapper prober.
It is also possible for a prober to "probe" several encodings. Each charset
prober is designed, implemented and working independently. This enables
prober caller to eliminate certain probers when it has any pre-knowledge.
For example, if user know that an html page is some kind of Japanese encoding,
non-Japanese charset probers will not be fired. If user have not interest
in certain languages, they can also eliminate those charset probers. Those
measures will lead to a small footprint and faster performance.</font>
<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; eDetecting = 0,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; eFoundIt = 1,</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; eNotMe = 2</font></font>
<br><font size=-1>}<font face="Courier New"> nsProbingState;</font></font>
<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsCharSetProber(){};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual const
char* GetCharSetName() {return "";};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual nsProbingState
HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
GetState(void) {return mState;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
Reset(void) {mState = eDetecting;};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual float
GetConfidence(void) = 0;</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
SetOpion() {};</font></font>
<br><font face="Courier New"><font size=-1>&nbsp; protected:</font></font>
<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
mState;</font></font>
<br><font face="Courier New"><font size=-1>};</font></font>
<br>&nbsp;
<br>&nbsp;
<p><font face="Arial">How multi-byte encoding charset prober works</font>
<p><font size=-1>For charset prober verifying SJIS, EUC-JP, EUC-KR, EUC-CN
(or GB2312), EUC-TW, Big5 encodings, each prober embeds state machine (mCodingSM),
which identify legal byte sequence base on its encoding scheme. If an illegal
byte sequence is met, this state machine will reach "eError" state. That
signifies a failure for this prober, and prober will report negative answer
to its caller. Once state machine reach "eStart" state, it means sequence
of bytes has been identified as a character. This character will be sent
to Character distribution analyzer (mDistributionAnalyser) and 2-Char sequence
analyzer (mContextAnalyser) for statistic sampling. "GetConfidence" call
will let its caller know the likelihood of input charset being of this
encoding.</font>
<p><font size=-1>Inside "HandleData" method each time after a batch of
text has been processed, shortcut judgement is performed. If the prober
receives enough data and reaches certain confidence level, it will set
its state to be "eFoundIt" and notify its caller an immediate sure answer.</font>
<p><font size=-1>For encoding like ISO_2022 and HZ, since the embedded
state machine can do almost a perfect job along, no other statistic sampling
is done.</font>
<p><i><font size=-1>Big5Freq.tab</font></i>
<p><i><font size=-1>EUCKRFreq.tab</font></i>
<p><i><font size=-1>EUCTWFreq.tab</font></i>
<p><i><font size=-1>GB2312Freq.tab</font></i>
<p><i><font size=-1>JISFreq.tab</font></i>
<p><i><font size=-1>Those files defined the frequency table (Character
to frequency order mapping) for each language. Since Big5 and EUC-TW are
not basing on the same charset standard like EUC-JP and SJIS do, 2 tables
is defined.</font></i>
<p><i><font size=-1>CharDistribution.h</font></i>
<p><i><font size=-1>CharDistribution.cpp</font></i>
<p><i><font size=-1>Implementation for Character distribution analyzer.</font></i>
<p><i><font size=-1>nsPkgInt.h</font></i>
<p><i><font size=-1>nsCodingStateMachine.h</font></i>
<p><i><font size=-1>Those are bases of state machine implementation.</font></i>
<p><i><font size=-1>nsEscSM.cpp</font></i>
<p><i><font size=-1>State machine for ISO-2022XX and HZ.</font></i>
<p><i><font size=-1>nsMBCSSM.cpp</font></i>
<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
SJIS, and UTF8.</font></i>
<p><i><font size=-1>JpCntx.h</font></i>
<p><i><font size=-1>JpCntx.cpp</font></i>
<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
<p><i><font size=-1>nsBig5Prober.h</font></i>
<p><i><font size=-1>nsBig5Prober.cpp</font></i>
<p><i><font size=-1>nsEUCKRProber.h</font></i>
<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
<p><i><font size=-1>nsEUCJPProber.h</font></i>
<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
<p><i><font size=-1>nsEUCTWProber.h</font></i>
<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
<p><i><font size=-1>nsSJISProber.h</font></i>
<p><i><font size=-1>nsSJISProber.cpp</font></i>
<p><i><font size=-1>nsGB2312Prober.h</font></i>
<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
<p><i><font size=-1>nsUTF8Prober.h</font></i>
<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
<p><i><font size=-1>Charset Prober classes definition and implementation
for each encoding. Each prober has an embedded state machine and a character
distribution analyzer except UTF8, which state machine is good enough.</font></i>
<p><i><font size=-1>nsMBCSProber.h</font></i>
<p><i><font size=-1>nsMBCSProber.cpp</font></i>
<p><i><font size=-1>This is a wrapper of all the MBCS probers. I was expecting
to put some high level logic which base on multiple encoding knowledge
to appears here in the very beginning. That might still be needed in future.</font></i>
<br>&nbsp;
<br>&nbsp;
<p><font face="Arial">How single-byte encoding charset prober works</font>
<p><font size=-1>For each encoding, a table is used to map a character
to an encoding independent identification number. Those identification
numbers in fact come from characters� frequency order but with some adjustment.
For each language, a 2-D matrix is defined as language model. If cell &lt;x,
y> is 0, it means sequence &lt;character(x), character(y)> is a rarely
used sequence in this language, with character(x) representing the character
whose identification number is x. The 2-D matrix only defines sequence
of a subset of all the characters. For characters whose identification
number is out of this range, those characters are ignored. Since some of
the sequences, like ascii-to-ascii sequences, have no relation with the
language we try to verify, and those sequences should not be counted. In
current implementation, a sequence will be counted if both characters are
8-bits ones. In some situations, one 8-bits character sequence is expected
to be counted.</font>
<p><i><font size=-1>LangCyrillicModel.cpp : these files defined a mapping
table for each encoding and a 2-D matrix for all Cyrillic languages. A
"SequenceModel" structure is also defined for each encoding. This structure
will be used to initialize a single-byte character prober class. All Cyrillic
encodings are sharing the same prober class implementation.</font></i>
<p><i><font size=-1>nsSBCharSetProber.h</font></i>
<p><i><font size=-1>nsSBCharSetProber.cpp : These 2 files defined and implemented
single-byte charset prober.</font></i>
</body>
</html>