Remove extension/universalchardet/doc/. Bug 471480, r=smontagu
author tyler.downer@gmail.com
Sat, 13 Mar 2010 12:08:57 -0800
changeset 39401 14a2482192f2a910dd7202b3226a27d1e2bbb957
parent 39400 3dadc7cacb96aa4005fe60b0ee54971519ddcb9d
child 39402 28be6df4ba1227ba53a6394e0379f01962edfe9f
push id 12174
push user smontagu@mozilla.com
push date Sat, 13 Mar 2010 20:11:30 +0000
treeherder mozilla-central@14a2482192f2
reviewers smontagu
bugs 471480
milestone 1.9.3a3pre
extensions/universalchardet/doc/ChardetInterface.htm
extensions/universalchardet/doc/UniversalCharsetDetection.doc
deleted file mode 100644
--- a/extensions/universalchardet/doc/ChardetInterface.htm
+++ /dev/null
@@ -1,231 +0,0 @@
-<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
-<html>
-<head>
-   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
-   <meta name="GENERATOR" content="Mozilla/4.76 [en] (WinNT; U) [Netscape]">
-   <title>Charset Detector Interface</title>
-</head>
-<body>
-<font face="Arial">Charset Detector Interface</font>
-<p><font size=-1>This is the charset detector's interface that is exposed
-to outside world, in our case, the browser. In the very beginning, caller
-calls detector's "Init()" method and let detector know how it would like
-to be notified about the detecting result. Observer pattern is used in
-this case. Then the caller just need to feed charset detector with text
-data through "DoIt()". This can be done through a series "DoIt()" calls,
-with each call only contains part of the data. This can be very useful
-if the data is only partially available at one time. In our case, since
-the data comes from network, we can start detecting long before network
-finishes transferring all data. When detector is confident enough about
-one encoding, it will notify its caller and stop detecting. If all data
-has been feed to detector but detector still is not confident enough about
-any encoding, method "Done" will tell detector to make a best guess.</font>
-<p><font face="Courier New"><font size=-1>class nsICharsetDetector : public
-nsISupports {</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; NS_DEFINE_STATIC_IID_ACCESSOR(NS_ICHARSETDETECTOR_IID)</font></font>
-<p><font face="Courier New"><font size=-1>&nbsp; //Setup the observer so
-it know how to notify the answer</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; NS_IMETHOD Init(nsICharsetDetectionObserver*
-observer) = 0;</font></font>
-<p><font face="Courier New"><font size=-1>&nbsp; //Feed a block of bytes
-to the detector.</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; //It will call the Notify
-function of the nsICharsetObserver if it</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; //find out the answer.</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; // aBytesArray - array
-of bytes</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; // aLen - length of aBytesArray</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; // oDontFeedMe - return
-PR_TRUE if the detector do not need the</font></font>
-<br>&nbsp; <font face="Courier New"><font size=-1>// following block</font></font>
-<br>&nbsp; <font face="Courier New"><font size=-1>// PR_FALSE it need more
-bytes.</font></font>
-<br>&nbsp; <font face="Courier New"><font size=-1>// This is used to enhance
-performance</font></font>
-<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD DoIt(const
-char* aBytesArray, PRUint32 aLen, PRBool* oDontFeedMe) = 0;</font></font>
-<p>&nbsp; /<font face="Courier New"><font size=-1>/It also tell the detector
-the last chance the make a decision</font></font>
-<br>&nbsp; <font face="Courier New"><font size=-1>NS_IMETHOD Done() = 0;</font></font>
-<br><font size=-1>}<font face="Courier New">;</font></font>
-<br>&nbsp;
-<br>&nbsp;
-<p><font face="Arial">Inside Charset Detector</font>
-<p><font size=-1>Inside Charset Detector, major work is done by function
-"HandleData()". In fact, "DoIt" has very little extra thing to do other
-than call "HandleData". The following is the algorithm logic using C-Like
-Pseudo-Language. Some detail is drop in order to make main point more clear.</font>
-<p><font face="Courier New"><font size=-1>HandleData(batch_of_text)</font></font>
-<br><font face="Courier New"><font size=-1>{</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; if (batch_of_text contains
-BOM)</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; report UCS2;</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; if ((inputState is PureAscii)
-|| (inputState is EscAscii))</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; if (batch_of_text
-contains 8-bits-byte)</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-inputState = HighByte;</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; else if ((inputState
-is PureAscii ) &amp;&amp; (batch_of_text contains Esc_Sequence) )</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-inputState = EscAscii;</font></font>
-<p><font face="Courier New"><font size=-1>&nbsp; if (inputState is HighByte)</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Remove Ascii
-character that is not neighboring to 8-bits byte</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
-prober in multibyte_probers</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
-prober in singlebyte_probers</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; else if (inputState is
-EscAscii)</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; {</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; For each
-prober in (ISO2022_XX or HZ)</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; Prober.HandleData(batch_of_text);</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; }</font></font>
-<br><font face="Courier New"><font size=-1>}</font></font>
-<p><i><font face="Courier New"><font size=-1>nsUniversalDetector.h</font></font></i>
-<br><i><font face="Courier New"><font size=-1>nsUniversalDetector.cpp</font></font></i>
-<p><i><font face="Courier New"><font size=-1>Implemented the high level
-control logic.</font></font></i>
-<br>&nbsp;
-<br>&nbsp;
-<p>Charset Prober
-<p><font size=-1>A charset prober verifies if the input data is belong
-to certain encoding or group of encoding. It maintains its state in member
-"mState", which has 3 possible value. State "eDetecting" means it hasn't
-found any sure answer yet, "eFoundIt" and "eNotMe" carries the same meaning
-as their names. Method "GetCharSetName" tell its caller its sure answer
-or best guess.</font>
-<p><font size=-1>Generally, for each encoding we implemented a charset
-prober. Several probers can be wrapped together with a wrapper prober.
-It is also possible for a prober to "probe" several encodings. Each charset
-prober is designed, implemented and working independently. This enables
-prober caller to eliminate certain probers when it has any pre-knowledge.
-For example, if user know that an html page is some kind of Japanese encoding,
-non-Japanese charset probers will not be fired. If user have not interest
-in certain languages, they can also eliminate those charset probers. Those
-measures will lead to a small footprint and faster performance.</font>
-<p><font face="Courier New"><font size=-1>typedef enum {</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; eDetecting = 0,</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; eFoundIt = 1,</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; eNotMe = 2</font></font>
-<br><font size=-1>}<font face="Courier New"> nsProbingState;</font></font>
-<p><font face="Courier New"><font size=-1>class nsCharSetProber {</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; public:</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsCharSetProber(){};</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual const
-char* GetCharSetName() {return "";};</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual nsProbingState
-HandleData(const char* aBuf, PRUint32 aLen) = 0;</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
-GetState(void) {return mState;};</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
-Reset(void) {mState = eDetecting;};</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual float
-GetConfidence(void) = 0;</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; virtual void
-SetOpion() {};</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp; protected:</font></font>
-<br><font face="Courier New"><font size=-1>&nbsp;&nbsp;&nbsp; nsProbingState
-mState;</font></font>
-<br><font face="Courier New"><font size=-1>};</font></font>
-<br>&nbsp;
-<br>&nbsp;
-<p><font face="Arial">How multi-byte encoding charset prober works</font>
-<p><font size=-1>For charset prober verifying SJIS, EUC-JP, EUC-KR, EUC-CN
-(or GB2312), EUC-TW, Big5 encodings, each prober embeds state machine (mCodingSM),
-which identify legal byte sequence base on its encoding scheme. If an illegal
-byte sequence is met, this state machine will reach "eError" state. That
-signifies a failure for this prober, and prober will report negative answer
-to its caller. Once state machine reach "eStart" state, it means sequence
-of bytes has been identified as a character. This character will be sent
-to Character distribution analyzer (mDistributionAnalyser) and 2-Char sequence
-analyzer (mContextAnalyser) for statistic sampling. "GetConfidence" call
-will let its caller know the likelihood of input charset being of this
-encoding.</font>
-<p><font size=-1>Inside "HandleData" method each time after a batch of
-text has been processed, shortcut judgement is performed. If the prober
-receives enough data and reaches certain confidence level, it will set
-its state to be "eFoundIt" and notify its caller an immediate sure answer.</font>
-<p><font size=-1>For encoding like ISO_2022 and HZ, since the embedded
-state machine can do almost a perfect job along, no other statistic sampling
-is done.</font>
-<p><i><font size=-1>Big5Freq.tab</font></i>
-<p><i><font size=-1>EUCKRFreq.tab</font></i>
-<p><i><font size=-1>EUCTWFreq.tab</font></i>
-<p><i><font size=-1>GB2312Freq.tab</font></i>
-<p><i><font size=-1>JISFreq.tab</font></i>
-<p><i><font size=-1>Those files defined the frequency table (Character
-to frequency order mapping) for each language. Since Big5 and EUC-TW are
-not basing on the same charset standard like EUC-JP and SJIS do, 2 tables
-is defined.</font></i>
-<p><i><font size=-1>CharDistribution.h</font></i>
-<p><i><font size=-1>CharDistribution.cpp</font></i>
-<p><i><font size=-1>Implementation for Character distribution analyzer.</font></i>
-<p><i><font size=-1>nsPkgInt.h</font></i>
-<p><i><font size=-1>nsCodingStateMachine.h</font></i>
-<p><i><font size=-1>Those are bases of state machine implementation.</font></i>
-<p><i><font size=-1>nsEscSM.cpp</font></i>
-<p><i><font size=-1>State machine for ISO-2022XX and HZ.</font></i>
-<p><i><font size=-1>nsMBCSSM.cpp</font></i>
-<p><i><font size=-1>State machines for Big5, EUC-JP, EUC-KR, EUC-TW, GB2312,
-SJIS, and UTF8.</font></i>
-<p><i><font size=-1>JpCntx.h</font></i>
-<p><i><font size=-1>JpCntx.cpp</font></i>
-<p><i><font size=-1>Japanese hiragana sequence analyzer.</font></i>
-<p><i><font size=-1>nsBig5Prober.h</font></i>
-<p><i><font size=-1>nsBig5Prober.cpp</font></i>
-<p><i><font size=-1>nsEUCKRProber.h</font></i>
-<p><i><font size=-1>nsEUCKRProber.cpp</font></i>
-<p><i><font size=-1>nsEUCJPProber.h</font></i>
-<p><i><font size=-1>nsEUCJPProber.cpp</font></i>
-<p><i><font size=-1>nsEUCTWProber.h</font></i>
-<p><i><font size=-1>nsEUCTWProber.cpp</font></i>
-<p><i><font size=-1>nsSJISProber.h</font></i>
-<p><i><font size=-1>nsSJISProber.cpp</font></i>
-<p><i><font size=-1>nsGB2312Prober.h</font></i>
-<p><i><font size=-1>nsGB2312Prober.cpp</font></i>
-<p><i><font size=-1>nsUTF8Prober.h</font></i>
-<p><i><font size=-1>nsUTF8Prober.cpp</font></i>
-<p><i><font size=-1>Charset Prober classes definition and implementation
-for each encoding. Each prober has an embedded state machine and a character
-distribution analyzer except UTF8, which state machine is good enough.</font></i>
-<p><i><font size=-1>nsMBCSProber.h</font></i>
-<p><i><font size=-1>nsMBCSProber.cpp</font></i>
-<p><i><font size=-1>This is a wrapper of all the MBCS probers. I was expecting
-to put some high level logic which base on multiple encoding knowledge
-to appears here in the very beginning. That might still be needed in future.</font></i>
-<br>&nbsp;
-<br>&nbsp;
-<p><font face="Arial">How single-byte encoding charset prober works</font>
-<p><font size=-1>For each encoding, a table is used to map a character
-to an encoding independent identification number. Those identification
-numbers in fact come from characters' frequency order but with some adjustment.
-For each language, a 2-D matrix is defined as language model. If cell &lt;x,
-y> is 0, it means sequence &lt;character(x), character(y)> is a rarely
-used sequence in this language, with character(x) representing the character
-whose identification number is x. The 2-D matrix only defines sequence
-of a subset of all the characters. For characters whose identification
-number is out of this range, those characters are ignored. Since some of
-the sequences, like ascii-to-ascii sequences, have no relation with the
-language we try to verify, and those sequences should not be counted. In
-current implementation, a sequence will be counted if both characters are
-8-bits ones. In some situations, one 8-bits character sequence is expected
-to be counted.</font>
-<p><i><font size=-1>LangCyrillicModel.cpp : these files defined a mapping
-table for each encoding and a 2-D matrix for all Cyrillic languages. A
-"SequenceModel" structure is also defined for each encoding. This structure
-will be used to initialize a single-byte character prober class. All Cyrillic
-encodings are sharing the same prober class implementation.</font></i>
-<p><i><font size=-1>nsSBCharSetProber.h</font></i>
-<p><i><font size=-1>nsSBCharSetProber.cpp : These 2 files defined and implemented
-single-byte charset prober.</font></i>
-</body>
-</html>
deleted file mode 100644
index c4eee5258641f48b81411a455bdd9ef78fac331a..0000000000000000000000000000000000000000
GIT binary patch
literal 0
Hc$@<O00001