intl/hyphenation/README.compound
author Nicolas B. Pierron <nicolas.b.pierron@mozilla.com>
Wed, 01 Oct 2014 19:17:51 +0200

New option of Libhyphen 2.7: NOHYPHEN

The hyphen, the apostrophe and other characters may be word boundary
characters, but they don't need (extra) hyphenation. With the NOHYPHEN
option, it is possible to hyphenate the word parts correctly.

Example:

ISO8859-1
NOHYPHEN -,'
1-1
1'1
NEXTLEVEL

Description:

1-1 and 1'1 declare the hyphen and the apostrophe as word boundary
characters, and NOHYPHEN, with its comma-separated list of characters
(or character sequences), forbids the (extra) hyphens at the hyphen and
apostrophe characters.
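
The effect of NOHYPHEN can be illustrated with a small Python sketch.
It is not Hyphen's implementation: the function name
drop_breaks_at_nohyphen and the representation of break points as
string offsets (offset i means "a break is allowed before word[i]") are
assumptions made for this illustration only.

```python
def drop_breaks_at_nohyphen(word, breaks, nohyphen):
    """Remove candidate break offsets that touch a NOHYPHEN sequence.

    word     -- the word being hyphenated
    breaks   -- offsets i meaning "a break is allowed before word[i]"
    nohyphen -- list of character sequences, e.g. ["-", "'"]
    """
    forbidden = set()
    for seq in nohyphen:
        start = word.find(seq)
        while start != -1:
            forbidden.add(start)             # break directly before the sequence
            forbidden.add(start + len(seq))  # break directly after the sequence
            start = word.find(seq, start + 1)
    return sorted(b for b in breaks if b not in forbidden)


# "NOHYPHEN -,'" is a comma-separated list: "-" and "'".
nohyphen = "-,'".split(",")

# Hypothetical candidates for "well-known": the 1-1 pattern proposes breaks
# on both sides of the hyphen (offsets 4 and 5); offset 7 is a made-up
# within-word break used only for this demonstration.
print(drop_breaks_at_nohyphen("well-known", {4, 5, 7}, nohyphen))
```

Only the made-up within-word break survives; the two candidates next to
the hyphen are dropped, which is the behaviour described above.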

Implicit NOHYPHEN declaration

Without an explicit NEXTLEVEL declaration, Hyphen 2.8 uses the
previous settings. In addition, in UTF-8 encoding, the en dash (U+2013)
and the typographical apostrophe (U+2019) are NOHYPHEN characters, too.

It's possible to enlarge the hyphenation distance from these
NOHYPHEN characters by using the COMPOUNDLEFTHYPHENMIN and
COMPOUNDRIGHTHYPHENMIN attributes.

Compound word hyphenation

The Hyphen library supports better compound word hyphenation, including
the special compound hyphenation rules of Germanic languages and other
languages whose compounds may have an arbitrary number of components.
The new options, COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN, help
to set the right style for the hyphenation of compound words.

Algorithm

The algorithm is an extension of the original pattern-based hyphenation
algorithm. It uses two hyphenation pattern sets, defined in the same
pattern file and separated by the NEXTLEVEL keyword. The first pattern
set is for hyphenation only at compound word boundaries; the second one
is for hyphenation within words or word parts.
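
As an illustration of the file layout, the following Python sketch
splits the text of a pattern file at the NEXTLEVEL keyword. The function
name split_pattern_levels is hypothetical, and the sketch assumes only
what the example above shows: the first line names the encoding, and a
line consisting of NEXTLEVEL separates the two sets. It does not
separate option lines (such as NOHYPHEN) from patterns.

```python
def split_pattern_levels(text):
    """Split pattern-file text into (encoding, first_level, second_level).

    Lines before NEXTLEVEL form the first (compound) level, lines after
    it form the second level. Blank lines are skipped. If the keyword is
    absent, the single set is returned as-is and second_level is None;
    how to interpret that case is left to the caller.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    encoding, body = lines[0], lines[1:]
    if "NEXTLEVEL" in body:
        cut = body.index("NEXTLEVEL")
        return encoding, body[:cut], body[cut + 1:]
    return encoding, body, None


sample = """ISO8859-1
NOHYPHEN -,'
1-1
1'1
NEXTLEVEL
"""

encoding, first_level, second_level = split_pattern_levels(sample)
print(encoding)
print(first_level)
print(second_level)
```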

Recursive compound level hyphenation

The algorithm is recursive: every word part produced by a successful
first (compound) level hyphenation is rehyphenated
with the same (first) pattern set.

Finally, when first level hyphenation is not possible, Hyphen uses
the second level hyphenation for the word or the word parts.
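
The recursion can be sketched in Python as follows. This is a minimal
illustration of the control flow described above, not the code of the
library. The two pattern sets are replaced by hypothetical toy functions
(toy_first_level, toy_second_level) that return break offsets inside a
string; real pattern matching is out of scope here.

```python
def hyphenate(word, first_level, second_level, offset=0):
    """Return sorted break offsets, relative to the start of the full word.

    first_level and second_level take a string and return break offsets
    strictly inside it (0 < offset < len(string)).
    """
    compound_breaks = sorted(first_level(word))
    if not compound_breaks:
        # First-level hyphenation is not possible: use the second level.
        return [offset + b for b in second_level(word)]

    result = []
    start = 0
    for cut in compound_breaks + [len(word)]:
        part = word[start:cut]
        # Each part goes through the same procedure, so it is tried
        # against the first pattern set again before the second one.
        result.extend(hyphenate(part, first_level, second_level, offset + start))
        if cut < len(word):
            result.append(offset + cut)
        start = cut
    return sorted(result)


# Toy stand-ins for the two pattern sets (purely hypothetical data).
def toy_first_level(word):
    # Finds only the first compound boundary, so recursion does the rest.
    for component in ("kinder", "garten", "fest"):
        if word.startswith(component) and len(word) > len(component):
            return [len(component)]
    return []


def toy_second_level(word):
    return {"kinder": [3], "garten": [3]}.get(word, [])


word = "kindergartenfest"
breaks = hyphenate(word, toy_first_level, toy_second_level)
print("-".join(word[i:j] for i, j in zip([0] + breaks, breaks + [len(word)])))
```

The toy first level deliberately reports one boundary at a time; the
remaining compound boundaries are found because each part is hyphenated
again by the same procedure.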

Word endings and word parts

Patterns for word endings (patterns with ellipses) also match the
ends of word parts.

Options

COMPOUNDLEFTHYPHENMIN: minimum hyphenation distance from the left
  compound word boundary
COMPOUNDRIGHTHYPHENMIN: minimum hyphenation distance from the right
  compound word boundary
NEXTLEVEL: marks the start of the second level hyphenation patterns

Default hyphenmin values

The default values of COMPOUNDLEFTHYPHENMIN and COMPOUNDRIGHTHYPHENMIN
are 0, and they are applied as 0 during hyphenation, too. (By contrast,
"0" values of LEFTHYPHENMIN and RIGHTHYPHENMIN mean the default "2"
during hyphenation.)
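
The difference between the two kinds of defaults can be shown with a
short Python sketch. The helper names effective_min and
filter_part_breaks are hypothetical, and the distance check is one
plausible reading of "minimum hyphenation distance from the compound
word boundary", counted in characters inside a single word part. It is
an assumption, not a description of the library's code.

```python
def effective_min(value, zero_means):
    """Map a configured value of 0 to the value it stands for."""
    return zero_means if value == 0 else value


def filter_part_breaks(part, breaks, left_min, right_min):
    """Keep breaks at least left_min characters from the left edge of
    the word part and at least right_min characters from its right edge.
    """
    return [b for b in breaks
            if b >= left_min and len(part) - b >= right_min]


# LEFTHYPHENMIN 0 stands for 2; COMPOUNDLEFTHYPHENMIN 0 stays 0.
print(effective_min(0, zero_means=2))
print(effective_min(0, zero_means=0))

# Made-up candidate breaks inside the word part "garten".
candidates = [1, 3, 5]
print(filter_part_breaks("garten", candidates, left_min=0, right_min=0))
print(filter_part_breaks("garten", candidates, left_min=2, right_min=3))
```

With both minimums at 0 every candidate is kept; with 2 and 3 only the
break in the middle of the part remains.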

Examples

See tests/compound* test files.

Preparation of hyphenation patterns

There is no special pattern generator tool for compound hyphenation
patterns yet. It is possible to use PATGEN to generate both
pattern sets, concatenate them manually and set the requested HYPHENMIN
values. (But don't forget the preprocessing step with substrings.pl
before concatenation.) One disadvantage of this method is that PATGEN
doesn't know about the recursive compound hyphenation of Hyphen.

László Németh
<nemeth (at) openoffice.org>