From b8be01adca7f9b70d04cbd967628136398a7abaa Mon Sep 17 00:00:00 2001
From: Simon Tatham
Date: Sun, 10 Oct 2021 14:51:17 +0100
Subject: [PATCH] Complete rewrite of the bidi algorithm.

A user reported that PuTTY's existing bidi algorithm will generate
misordered text in cases like this (assuming UTF-8):

    echo -e '12 A \xD7\x90\xD7\x91 B'

The hex codes in the middle are the Hebrew letters aleph and beth.
Appearing in the middle of a line whose primary direction is
left-to-right, those two letters should appear in the opposite order,
but not cause the rest of the line to move around. That is, you expect
the displayed text in this situation to be

    12 A <beth><aleph> B

But in fact, the digits '12' were erroneously reversed, so you would
actually see '21 A <beth><aleph> B'.

I tried to debug the existing bidi algorithm, but it was very hard,
because the Unicode bidi spec has been extensively changed since
Arabeyes contributed that code, and I couldn't even reliably work out
which version of the spec the code was intended to implement. I found
some problems, notably that the resolution phase was running once on
the whole line instead of separately on runs of characters at the same
level, and also that the 'sor' and 'eor' values were being wrongly
computed. But I had no way to test any fix to ensure it hadn't
introduced another bug somewhere else.

Unicode provides a set of conformance tests in the UCD. That was just
what I wanted - but they're too up-to-date to run against the old
algorithm and expect to pass!

So, paradoxically, it seemed to me that the _easiest_ way to fix this
bidi bug would be to bring absolutely everything up to date. But the
revised bidi algorithm is significantly more complicated, so I also
didn't think it would be sensible to try to gradually evolve the
existing code into it. Instead, I've done a complete rewrite of my
own.
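
To make the failing case at the top of this message concrete, here is
a minimal sketch in C of the display order expected for that one line.
It is not the code in this patch and it is not the UAX#9 algorithm: it
assumes, purely for illustration, that every code point in the Hebrew
block U+0590..U+05FF is right-to-left and that everything else stays
where it is. The names is_hebrew and toy_reorder are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: treat the whole Hebrew block as right-to-left. */
static int is_hebrew(unsigned ch)
{
    return ch >= 0x0590 && ch <= 0x05FF;
}

/*
 * Hypothetical toy reordering for a left-to-right line: reverse each
 * maximal run of Hebrew code points in place and leave everything
 * else, digits included, exactly where it was.
 */
static void toy_reorder(unsigned *text, size_t len)
{
    assert(text != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        if (!is_hebrew(text[i])) {
            i++;
            continue;
        }

        /* Find the end of this run of Hebrew code points. */
        size_t start = i;
        while (i < len && is_hebrew(text[i]))
            i++;

        /* Swap inwards from both ends of the run [start, i). */
        for (size_t a = start, b = i - 1; a < b; a++, b--) {
            unsigned tmp = text[a];
            text[a] = text[b];
            text[b] = tmp;
        }
    }
}
```

Given the nine code points '1', '2', ' ', 'A', ' ', U+05D0, U+05D1,
' ', 'B', this swaps only the two Hebrew letters, so the digits stay
as '12'. The sketch should not be expected to match the real algorithm
in general, for example on lines where digits directly follow
right-to-left letters.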

The new code implements the full UAX#9 rev 44 algorithm, including in
particular support for the new 'directional isolate' control
characters, and also special handling for matched pairs of brackets in
the text (see rule N0 in the spec). I've managed to get it to pass the
entire UCD conformance test suite, so I'm reasonably confident it's
right, or at the very least a lot closer to right than the old
algorithm was.

So the upshot is: the test case shown at the top of this file now
passes, but also, other detailed bidi handling might have changed,
certainly some cases involving brackets, but perhaps also other things
that were either bugs in the old algorithm or updates to the standard.
---
 terminal/bidi.c | 3799 ++++++++++++++++++++++++++++++++++-------------
 terminal/bidi.h |   69 +
 2 files changed, 2798 insertions(+), 1070 deletions(-)

diff --git a/terminal/bidi.c b/terminal/bidi.c
index 15ed23d0..ea05e2bd 100644
--- a/terminal/bidi.c
+++ b/terminal/bidi.c
@@ -1,42 +1,132 @@
-/************************************************************************
- *
- * ------------
- * Description:
- * ------------
- * This is an implementation of Unicode's Bidirectional Algorithm
- * (known as UAX #9).
- *
- *   http://www.unicode.org/reports/tr9/
- *
- * Author: Ahmad Khalifa
- *
- * (www.arabeyes.org - under MIT license)
- *
- ************************************************************************/
-
 /*
- * TODO:
- * =====
- * - Explicit marks need to be handled (they are not 100% now)
- * - Ligatures
+ * Implementation of the Unicode bidirectional and Arabic shaping
+ * algorithms for PuTTY.
+ *
+ * Original version written and kindly contributed to this code base
+ * by Arabeyes. The bidi part was almost completely rewritten in 2021
+ * by Simon Tatham to bring it up to date, but the shaping part is
+ * still the one by the original authors.
+ *
+ * Implementation notes:
+ *
+ * Algorithm version
+ * -----------------
+ *
+ * This algorithm is up to date with Unicode Standard Annex #9
+ * revision 44:
+ *
+ *   https://www.unicode.org/reports/tr9/tr9-44.html
+ *
+ * and passes the full conformance test suite in Unicode 14.0.0.
+ *
+ * Paragraph and line handling
+ * ---------------------------
+ *
+ * The full Unicode bidi algorithm expects to receive text containing
+ * multiple paragraphs, together with a decision about how those
+ * paragraphs are broken up into lines. It calculates embedding levels
+ * a whole paragraph at a time without considering the line breaks,
+ * but then the final reordering of the text for display is done to
+ * each _line_ independently based on the levels computed for the text
+ * in that line.
+ *
+ * This algorithm omits all of that, because it's intended for use as
+ * a display-time transformation of a text terminal, which doesn't
+ * preserve enough semantic information to decide what's a paragraph
+ * break and what is not. So a piece of input text provided to this
+ * algorithm is always expected to consist of exactly one paragraph
+ * *and* exactly one line.
+ *
+ * Embeddings, overrides and isolates
+ * ----------------------------------
+ *
+ * This implementation has full support for all the Unicode special
+ * control characters that modify bidi behaviour, such as
+ *
+ *   U+202A LEFT-TO-RIGHT EMBEDDING
+ *   U+202B RIGHT-TO-LEFT EMBEDDING
+ *   U+202D LEFT-TO-RIGHT OVERRIDE
+ *   U+202E RIGHT-TO-LEFT OVERRIDE
+ *   U+202C POP DIRECTIONAL FORMATTING
+ *   U+2068 FIRST STRONG ISOLATE
+ *   U+2066 LEFT-TO-RIGHT ISOLATE
+ *   U+2067 RIGHT-TO-LEFT ISOLATE
+ *   U+2069 POP DIRECTIONAL ISOLATE
+ *
+ * However, at present, the terminal emulator that is a client of this
+ * code has no way to pass those in (because they're dropped during
+ * escape sequence processing and don't get stored in the terminal
+ * state).
+ * Nonetheless, the code is all here, so if the terminal
+ * emulator becomes able to record those characters at some later
+ * point, we'll be all set to take account of them during bidi.
+ *
+ * But the _main_ purpose of supporting the full bidi algorithm is
+ * simply that that's the easiest way to be sure it's correct, because
+ * if you support the whole thing, you can run the full conformance
+ * test suite. (And I don't 100% believe that restricting to the
+ * subset of _tests_ valid with a reduced character set will test the
+ * full set of _functionality_ relevant to the reduced set.)
+ *
+ * Retained formatting characters
+ * ------------------------------
+ *
+ * The standard bidi algorithm, in step X9, deletes assorted
+ * formatting characters from the text: all the embedding and override
+ * section initiator characters, the Pop Directional Formatting
+ * character that closes one of those sections again, and any
+ * character labelled as Boundary Neutral. So the characters it
+ * returns are not a _full_ reordering of the input; some input
+ * characters vanish completely.
+ *
+ * This would be fine, if it were not for the fact that - as far as I
+ * can see - _exactly one_ Unicode code point in the discarded
+ * category has a wcwidth() of more than 0, namely U+00AD SOFT HYPHEN
+ * which is a printing character for terminal purposes but has a bidi
+ * class of BN.
+ *
+ * Therefore, we must implement a modified version of the algorithm,
+ * as described in section 5.2 of TR9, which retains those formatting
+ * characters so that a client can find out where they ended up in the
+ * reordering.
+ *
+ * Section 5.2 describes a set of modifications to the algorithm that
+ * are _intended_ to achieve this without changing the rest of the
+ * behaviour: that is, if you take the output of the modified
+ * algorithm and delete all the characters that the standard algorithm
+ * would have removed, you should end up with the remaining characters
+ * in the same order that the standard algorithm would have delivered.
+ * However, section 5.2 admits the possibility of error, and says "in
+ * case of any deviation the explicit algorithm is the normative
+ * statement for conformance". And indeed, in one or two places I
+ * found I had to make my own tweaks to the section 5.2 description in
+ * order to get the whole test suite to pass, because I think the 5.2
+ * modifications if taken literally don't quite achieve that. My
+ * justification is that sentence of 5.2: in case of doubt, the right
+ * thing is to make the code behave the same as the official
+ * algorithm.
+ *
+ * It's possible that there might still be some undiscovered
+ * discrepancies between the behaviour of the standard and modified
+ * algorithms. So, just in case, I've kept in this code the ability to
+ * implement the _standard_ algorithm too! If you compile with
+ * -DREMOVE_FORMATTING_CHARS, this code should go back to implementing
+ * the literal UAX#9 bidi algorithm - so you can run your suspect
+ * input through both versions, making it much easier to figure out
+ * why they differ, and in which of the many stages of the algorithm
+ * the difference was introduced.
+ *
+ * However, beware that when compiling in this mode, the do_bidi
+ * interface to the terminal will stop working, and just abort() when
+ * called! The only useful thing you can do with this mode is to run
+ * the companion program bidi_test.c.
*/ -#include <stdlib.h> /* definition of wchar_t*/ +#include <stdlib.h> /* definition of wchar_t */ #include "putty.h" #include "misc.h" #include "bidi.h" -/* function declarations */ -static void flipThisRun( - bidi_char *from, unsigned char *level, int max, int count); -static int findIndexOfRun( - unsigned char *level, int start, int count, int tlevel); -static unsigned char setOverrideBits( - unsigned char level, unsigned char override); -static int getPreviousLevel(unsigned char *level, int from); -static void doMirror(unsigned int *ch); - typedef struct { char type; wchar_t form_b; @@ -228,66 +318,18 @@ static const shape_node shapetypes[] = { /* 6D2 */ {SR, 0xFBAE}, }; -/* - * Flips the text buffer, according to max level, and - * all higher levels - * - * Input: - * from: text buffer, on which to apply flipping - * level: resolved levels buffer - * max: the maximum level found in this line (should be unsigned char) - * count: line size in bidi_char - */ -static void flipThisRun( - bidi_char *from, unsigned char *level, int max, int count) -{ - int i, j, k, tlevel; - bidi_char temp; - - j = i = 0; - while (i j; k--, j++) { - temp = from[k]; - from[k] = from[j]; - from[j] = temp; - } - } -} - -/* - * Finds the index of a run with level equals tlevel - */ -static int findIndexOfRun( - unsigned char *level , int start, int count, int tlevel) -{ - int i; - for (i=start; i>) { + chomp; s{\s}{}g; s{#.*$}{}; next unless /./; + @_ = split /;/, $_; + $src = hex $_[0]; $dst = hex $_[1]; + $m{$src}=$dst; $m{$dst}=$src; + } + for $src (sort {$a <=> $b} keys %m) { + printf " {0x%04x, 0x%04x},\n", $src, $m{$src}; + } +' BidiMirroring.txt + + * + * FIXME: there are also glyphs which the text rendering engine is + * supposed to display left-right reflected, since no mirrored glyph + * exists in Unicode itself to indicate the reflected form. Those are + * listed in comments in BidiMirroring.txt. Many of them are + * mathematical, e.g. 
the square root sign, or set difference + * operator, or integral sign. No API currently exists here to + * communicate the need for that reflected display back to the client. + */ +static unsigned mirror_glyph(unsigned int ch) +{ + static const struct { + unsigned src, dst; + } mirror_pairs[] = { + {0x0028, 0x0029}, + {0x0029, 0x0028}, + {0x003c, 0x003e}, + {0x003e, 0x003c}, + {0x005b, 0x005d}, + {0x005d, 0x005b}, + {0x007b, 0x007d}, + {0x007d, 0x007b}, + {0x00ab, 0x00bb}, + {0x00bb, 0x00ab}, + {0x0f3a, 0x0f3b}, + {0x0f3b, 0x0f3a}, + {0x0f3c, 0x0f3d}, + {0x0f3d, 0x0f3c}, + {0x169b, 0x169c}, + {0x169c, 0x169b}, + {0x2039, 0x203a}, + {0x203a, 0x2039}, + {0x2045, 0x2046}, + {0x2046, 0x2045}, + {0x207d, 0x207e}, + {0x207e, 0x207d}, + {0x208d, 0x208e}, + {0x208e, 0x208d}, + {0x2208, 0x220b}, + {0x2209, 0x220c}, + {0x220a, 0x220d}, + {0x220b, 0x2208}, + {0x220c, 0x2209}, + {0x220d, 0x220a}, + {0x2215, 0x29f5}, + {0x221f, 0x2bfe}, + {0x2220, 0x29a3}, + {0x2221, 0x299b}, + {0x2222, 0x29a0}, + {0x2224, 0x2aee}, + {0x223c, 0x223d}, + {0x223d, 0x223c}, + {0x2243, 0x22cd}, + {0x2245, 0x224c}, + {0x224c, 0x2245}, + {0x2252, 0x2253}, + {0x2253, 0x2252}, + {0x2254, 0x2255}, + {0x2255, 0x2254}, + {0x2264, 0x2265}, + {0x2265, 0x2264}, + {0x2266, 0x2267}, + {0x2267, 0x2266}, + {0x2268, 0x2269}, + {0x2269, 0x2268}, + {0x226a, 0x226b}, + {0x226b, 0x226a}, + {0x226e, 0x226f}, + {0x226f, 0x226e}, + {0x2270, 0x2271}, + {0x2271, 0x2270}, + {0x2272, 0x2273}, + {0x2273, 0x2272}, + {0x2274, 0x2275}, + {0x2275, 0x2274}, + {0x2276, 0x2277}, + {0x2277, 0x2276}, + {0x2278, 0x2279}, + {0x2279, 0x2278}, + {0x227a, 0x227b}, + {0x227b, 0x227a}, + {0x227c, 0x227d}, + {0x227d, 0x227c}, + {0x227e, 0x227f}, + {0x227f, 0x227e}, + {0x2280, 0x2281}, + {0x2281, 0x2280}, + {0x2282, 0x2283}, + {0x2283, 0x2282}, + {0x2284, 0x2285}, + {0x2285, 0x2284}, + {0x2286, 0x2287}, + {0x2287, 0x2286}, + {0x2288, 0x2289}, + {0x2289, 0x2288}, + {0x228a, 0x228b}, + {0x228b, 0x228a}, + {0x228f, 0x2290}, + {0x2290, 0x228f}, 
+ {0x2291, 0x2292}, + {0x2292, 0x2291}, + {0x2298, 0x29b8}, + {0x22a2, 0x22a3}, + {0x22a3, 0x22a2}, + {0x22a6, 0x2ade}, + {0x22a8, 0x2ae4}, + {0x22a9, 0x2ae3}, + {0x22ab, 0x2ae5}, + {0x22b0, 0x22b1}, + {0x22b1, 0x22b0}, + {0x22b2, 0x22b3}, + {0x22b3, 0x22b2}, + {0x22b4, 0x22b5}, + {0x22b5, 0x22b4}, + {0x22b6, 0x22b7}, + {0x22b7, 0x22b6}, + {0x22b8, 0x27dc}, + {0x22c9, 0x22ca}, + {0x22ca, 0x22c9}, + {0x22cb, 0x22cc}, + {0x22cc, 0x22cb}, + {0x22cd, 0x2243}, + {0x22d0, 0x22d1}, + {0x22d1, 0x22d0}, + {0x22d6, 0x22d7}, + {0x22d7, 0x22d6}, + {0x22d8, 0x22d9}, + {0x22d9, 0x22d8}, + {0x22da, 0x22db}, + {0x22db, 0x22da}, + {0x22dc, 0x22dd}, + {0x22dd, 0x22dc}, + {0x22de, 0x22df}, + {0x22df, 0x22de}, + {0x22e0, 0x22e1}, + {0x22e1, 0x22e0}, + {0x22e2, 0x22e3}, + {0x22e3, 0x22e2}, + {0x22e4, 0x22e5}, + {0x22e5, 0x22e4}, + {0x22e6, 0x22e7}, + {0x22e7, 0x22e6}, + {0x22e8, 0x22e9}, + {0x22e9, 0x22e8}, + {0x22ea, 0x22eb}, + {0x22eb, 0x22ea}, + {0x22ec, 0x22ed}, + {0x22ed, 0x22ec}, + {0x22f0, 0x22f1}, + {0x22f1, 0x22f0}, + {0x22f2, 0x22fa}, + {0x22f3, 0x22fb}, + {0x22f4, 0x22fc}, + {0x22f6, 0x22fd}, + {0x22f7, 0x22fe}, + {0x22fa, 0x22f2}, + {0x22fb, 0x22f3}, + {0x22fc, 0x22f4}, + {0x22fd, 0x22f6}, + {0x22fe, 0x22f7}, + {0x2308, 0x2309}, + {0x2309, 0x2308}, + {0x230a, 0x230b}, + {0x230b, 0x230a}, + {0x2329, 0x232a}, + {0x232a, 0x2329}, + {0x2768, 0x2769}, + {0x2769, 0x2768}, + {0x276a, 0x276b}, + {0x276b, 0x276a}, + {0x276c, 0x276d}, + {0x276d, 0x276c}, + {0x276e, 0x276f}, + {0x276f, 0x276e}, + {0x2770, 0x2771}, + {0x2771, 0x2770}, + {0x2772, 0x2773}, + {0x2773, 0x2772}, + {0x2774, 0x2775}, + {0x2775, 0x2774}, + {0x27c3, 0x27c4}, + {0x27c4, 0x27c3}, + {0x27c5, 0x27c6}, + {0x27c6, 0x27c5}, + {0x27c8, 0x27c9}, + {0x27c9, 0x27c8}, + {0x27cb, 0x27cd}, + {0x27cd, 0x27cb}, + {0x27d5, 0x27d6}, + {0x27d6, 0x27d5}, + {0x27dc, 0x22b8}, + {0x27dd, 0x27de}, + {0x27de, 0x27dd}, + {0x27e2, 0x27e3}, + {0x27e3, 0x27e2}, + {0x27e4, 0x27e5}, + {0x27e5, 0x27e4}, + {0x27e6, 0x27e7}, + {0x27e7, 0x27e6}, 
+ {0x27e8, 0x27e9}, + {0x27e9, 0x27e8}, + {0x27ea, 0x27eb}, + {0x27eb, 0x27ea}, + {0x27ec, 0x27ed}, + {0x27ed, 0x27ec}, + {0x27ee, 0x27ef}, + {0x27ef, 0x27ee}, + {0x2983, 0x2984}, + {0x2984, 0x2983}, + {0x2985, 0x2986}, + {0x2986, 0x2985}, + {0x2987, 0x2988}, + {0x2988, 0x2987}, + {0x2989, 0x298a}, + {0x298a, 0x2989}, + {0x298b, 0x298c}, + {0x298c, 0x298b}, + {0x298d, 0x2990}, + {0x298e, 0x298f}, + {0x298f, 0x298e}, + {0x2990, 0x298d}, + {0x2991, 0x2992}, + {0x2992, 0x2991}, + {0x2993, 0x2994}, + {0x2994, 0x2993}, + {0x2995, 0x2996}, + {0x2996, 0x2995}, + {0x2997, 0x2998}, + {0x2998, 0x2997}, + {0x299b, 0x2221}, + {0x29a0, 0x2222}, + {0x29a3, 0x2220}, + {0x29a4, 0x29a5}, + {0x29a5, 0x29a4}, + {0x29a8, 0x29a9}, + {0x29a9, 0x29a8}, + {0x29aa, 0x29ab}, + {0x29ab, 0x29aa}, + {0x29ac, 0x29ad}, + {0x29ad, 0x29ac}, + {0x29ae, 0x29af}, + {0x29af, 0x29ae}, + {0x29b8, 0x2298}, + {0x29c0, 0x29c1}, + {0x29c1, 0x29c0}, + {0x29c4, 0x29c5}, + {0x29c5, 0x29c4}, + {0x29cf, 0x29d0}, + {0x29d0, 0x29cf}, + {0x29d1, 0x29d2}, + {0x29d2, 0x29d1}, + {0x29d4, 0x29d5}, + {0x29d5, 0x29d4}, + {0x29d8, 0x29d9}, + {0x29d9, 0x29d8}, + {0x29da, 0x29db}, + {0x29db, 0x29da}, + {0x29e8, 0x29e9}, + {0x29e9, 0x29e8}, + {0x29f5, 0x2215}, + {0x29f8, 0x29f9}, + {0x29f9, 0x29f8}, + {0x29fc, 0x29fd}, + {0x29fd, 0x29fc}, + {0x2a2b, 0x2a2c}, + {0x2a2c, 0x2a2b}, + {0x2a2d, 0x2a2e}, + {0x2a2e, 0x2a2d}, + {0x2a34, 0x2a35}, + {0x2a35, 0x2a34}, + {0x2a3c, 0x2a3d}, + {0x2a3d, 0x2a3c}, + {0x2a64, 0x2a65}, + {0x2a65, 0x2a64}, + {0x2a79, 0x2a7a}, + {0x2a7a, 0x2a79}, + {0x2a7b, 0x2a7c}, + {0x2a7c, 0x2a7b}, + {0x2a7d, 0x2a7e}, + {0x2a7e, 0x2a7d}, + {0x2a7f, 0x2a80}, + {0x2a80, 0x2a7f}, + {0x2a81, 0x2a82}, + {0x2a82, 0x2a81}, + {0x2a83, 0x2a84}, + {0x2a84, 0x2a83}, + {0x2a85, 0x2a86}, + {0x2a86, 0x2a85}, + {0x2a87, 0x2a88}, + {0x2a88, 0x2a87}, + {0x2a89, 0x2a8a}, + {0x2a8a, 0x2a89}, + {0x2a8b, 0x2a8c}, + {0x2a8c, 0x2a8b}, + {0x2a8d, 0x2a8e}, + {0x2a8e, 0x2a8d}, + {0x2a8f, 0x2a90}, + {0x2a90, 0x2a8f}, + {0x2a91, 0x2a92}, 
+ {0x2a92, 0x2a91}, + {0x2a93, 0x2a94}, + {0x2a94, 0x2a93}, + {0x2a95, 0x2a96}, + {0x2a96, 0x2a95}, + {0x2a97, 0x2a98}, + {0x2a98, 0x2a97}, + {0x2a99, 0x2a9a}, + {0x2a9a, 0x2a99}, + {0x2a9b, 0x2a9c}, + {0x2a9c, 0x2a9b}, + {0x2a9d, 0x2a9e}, + {0x2a9e, 0x2a9d}, + {0x2a9f, 0x2aa0}, + {0x2aa0, 0x2a9f}, + {0x2aa1, 0x2aa2}, + {0x2aa2, 0x2aa1}, + {0x2aa6, 0x2aa7}, + {0x2aa7, 0x2aa6}, + {0x2aa8, 0x2aa9}, + {0x2aa9, 0x2aa8}, + {0x2aaa, 0x2aab}, + {0x2aab, 0x2aaa}, + {0x2aac, 0x2aad}, + {0x2aad, 0x2aac}, + {0x2aaf, 0x2ab0}, + {0x2ab0, 0x2aaf}, + {0x2ab1, 0x2ab2}, + {0x2ab2, 0x2ab1}, + {0x2ab3, 0x2ab4}, + {0x2ab4, 0x2ab3}, + {0x2ab5, 0x2ab6}, + {0x2ab6, 0x2ab5}, + {0x2ab7, 0x2ab8}, + {0x2ab8, 0x2ab7}, + {0x2ab9, 0x2aba}, + {0x2aba, 0x2ab9}, + {0x2abb, 0x2abc}, + {0x2abc, 0x2abb}, + {0x2abd, 0x2abe}, + {0x2abe, 0x2abd}, + {0x2abf, 0x2ac0}, + {0x2ac0, 0x2abf}, + {0x2ac1, 0x2ac2}, + {0x2ac2, 0x2ac1}, + {0x2ac3, 0x2ac4}, + {0x2ac4, 0x2ac3}, + {0x2ac5, 0x2ac6}, + {0x2ac6, 0x2ac5}, + {0x2ac7, 0x2ac8}, + {0x2ac8, 0x2ac7}, + {0x2ac9, 0x2aca}, + {0x2aca, 0x2ac9}, + {0x2acb, 0x2acc}, + {0x2acc, 0x2acb}, + {0x2acd, 0x2ace}, + {0x2ace, 0x2acd}, + {0x2acf, 0x2ad0}, + {0x2ad0, 0x2acf}, + {0x2ad1, 0x2ad2}, + {0x2ad2, 0x2ad1}, + {0x2ad3, 0x2ad4}, + {0x2ad4, 0x2ad3}, + {0x2ad5, 0x2ad6}, + {0x2ad6, 0x2ad5}, + {0x2ade, 0x22a6}, + {0x2ae3, 0x22a9}, + {0x2ae4, 0x22a8}, + {0x2ae5, 0x22ab}, + {0x2aec, 0x2aed}, + {0x2aed, 0x2aec}, + {0x2aee, 0x2224}, + {0x2af7, 0x2af8}, + {0x2af8, 0x2af7}, + {0x2af9, 0x2afa}, + {0x2afa, 0x2af9}, + {0x2bfe, 0x221f}, + {0x2e02, 0x2e03}, + {0x2e03, 0x2e02}, + {0x2e04, 0x2e05}, + {0x2e05, 0x2e04}, + {0x2e09, 0x2e0a}, + {0x2e0a, 0x2e09}, + {0x2e0c, 0x2e0d}, + {0x2e0d, 0x2e0c}, + {0x2e1c, 0x2e1d}, + {0x2e1d, 0x2e1c}, + {0x2e20, 0x2e21}, + {0x2e21, 0x2e20}, + {0x2e22, 0x2e23}, + {0x2e23, 0x2e22}, + {0x2e24, 0x2e25}, + {0x2e25, 0x2e24}, + {0x2e26, 0x2e27}, + {0x2e27, 0x2e26}, + {0x2e28, 0x2e29}, + {0x2e29, 0x2e28}, + {0x2e55, 0x2e56}, + {0x2e56, 0x2e55}, + {0x2e57, 0x2e58}, 
+ {0x2e58, 0x2e57}, + {0x2e59, 0x2e5a}, + {0x2e5a, 0x2e59}, + {0x2e5b, 0x2e5c}, + {0x2e5c, 0x2e5b}, + {0x3008, 0x3009}, + {0x3009, 0x3008}, + {0x300a, 0x300b}, + {0x300b, 0x300a}, + {0x300c, 0x300d}, + {0x300d, 0x300c}, + {0x300e, 0x300f}, + {0x300f, 0x300e}, + {0x3010, 0x3011}, + {0x3011, 0x3010}, + {0x3014, 0x3015}, + {0x3015, 0x3014}, + {0x3016, 0x3017}, + {0x3017, 0x3016}, + {0x3018, 0x3019}, + {0x3019, 0x3018}, + {0x301a, 0x301b}, + {0x301b, 0x301a}, + {0xfe59, 0xfe5a}, + {0xfe5a, 0xfe59}, + {0xfe5b, 0xfe5c}, + {0xfe5c, 0xfe5b}, + {0xfe5d, 0xfe5e}, + {0xfe5e, 0xfe5d}, + {0xfe64, 0xfe65}, + {0xfe65, 0xfe64}, + {0xff08, 0xff09}, + {0xff09, 0xff08}, + {0xff1c, 0xff1e}, + {0xff1e, 0xff1c}, + {0xff3b, 0xff3d}, + {0xff3d, 0xff3b}, + {0xff5b, 0xff5d}, + {0xff5d, 0xff5b}, + {0xff5f, 0xff60}, + {0xff60, 0xff5f}, + {0xff62, 0xff63}, + {0xff63, 0xff62}, + }; + + int i, j, k; + + i = -1; + j = lenof(mirror_pairs); + + while (j - i > 1) { + k = (i + j) / 2; + if (ch < mirror_pairs[k].src) + j = k; + else if (ch > mirror_pairs[k].src) + i = k; + else + return mirror_pairs[k].dst; + } + + return ch; +} + +/* + * Identify the bracket characters treated specially by bidi rule + * BD19, and return their paired character(s). 
+ * + * The data table in this function is constructed from the Unicode + * Character Database version 14.0.0, downloadable from unicode.org at + * the URL + * + * https://www.unicode.org/Public/14.0.0/ucd/ + * + * by the following fragment of Perl: + +perl -e ' + open BIDIBRACKETS, "<", $ARGV[0] or die; + while (<BIDIBRACKETS>) { + chomp; s{\s}{}g; s{#.*$}{}; next unless /./; + @_ = split /;/, $_; + $src = hex $_[0]; $dst = hex $_[1]; $kind = $_[2]; + $m{$src}=[$kind, $dst]; + } + open UNICODEDATA, "<", $ARGV[1] or die; + while (<UNICODEDATA>) { + chomp; @_ = split /;/, $_; + $src = hex $_[0]; next unless defined $m{$src}; + if ($_[5] =~ /^[0-9a-f]+$/i) { + $equiv = hex $_[5]; + $e{$src} = $equiv; + $e{$equiv} = $src; + } + } + for $src (sort {$a <=> $b} keys %m) { + ($kind, $dst) = @{$m{$src}}; + $equiv = 0 + $e{$dst}; + printf " {0x%04x, {0x%04x, 0x%04x, %s}},\n", $src, $dst, $equiv, + $kind eq "c" ? "BT_CLOSE" : "BT_OPEN"; + } +' BidiBrackets.txt UnicodeData.txt + + */ +typedef enum { BT_NONE, BT_OPEN, BT_CLOSE } BracketType; +typedef struct BracketTypeData { + unsigned partner, equiv_partner; + BracketType type; +} BracketTypeData; +static BracketTypeData bracket_type(unsigned int ch) +{ + static const struct { + unsigned src; + BracketTypeData payload; + } bracket_pairs[] = { + {0x0028, {0x0029, 0x0000, BT_OPEN}}, + {0x0029, {0x0028, 0x0000, BT_CLOSE}}, + {0x005b, {0x005d, 0x0000, BT_OPEN}}, + {0x005d, {0x005b, 0x0000, BT_CLOSE}}, + {0x007b, {0x007d, 0x0000, BT_OPEN}}, + {0x007d, {0x007b, 0x0000, BT_CLOSE}}, + {0x0f3a, {0x0f3b, 0x0000, BT_OPEN}}, + {0x0f3b, {0x0f3a, 0x0000, BT_CLOSE}}, + {0x0f3c, {0x0f3d, 0x0000, BT_OPEN}}, + {0x0f3d, {0x0f3c, 0x0000, BT_CLOSE}}, + {0x169b, {0x169c, 0x0000, BT_OPEN}}, + {0x169c, {0x169b, 0x0000, BT_CLOSE}}, + {0x2045, {0x2046, 0x0000, BT_OPEN}}, + {0x2046, {0x2045, 0x0000, BT_CLOSE}}, + {0x207d, {0x207e, 0x0000, BT_OPEN}}, + {0x207e, {0x207d, 0x0000, BT_CLOSE}}, + {0x208d, {0x208e, 0x0000, BT_OPEN}}, + {0x208e, {0x208d, 0x0000, BT_CLOSE}}, + {0x2308, 
{0x2309, 0x0000, BT_OPEN}}, + {0x2309, {0x2308, 0x0000, BT_CLOSE}}, + {0x230a, {0x230b, 0x0000, BT_OPEN}}, + {0x230b, {0x230a, 0x0000, BT_CLOSE}}, + {0x2329, {0x232a, 0x3009, BT_OPEN}}, + {0x232a, {0x2329, 0x3008, BT_CLOSE}}, + {0x2768, {0x2769, 0x0000, BT_OPEN}}, + {0x2769, {0x2768, 0x0000, BT_CLOSE}}, + {0x276a, {0x276b, 0x0000, BT_OPEN}}, + {0x276b, {0x276a, 0x0000, BT_CLOSE}}, + {0x276c, {0x276d, 0x0000, BT_OPEN}}, + {0x276d, {0x276c, 0x0000, BT_CLOSE}}, + {0x276e, {0x276f, 0x0000, BT_OPEN}}, + {0x276f, {0x276e, 0x0000, BT_CLOSE}}, + {0x2770, {0x2771, 0x0000, BT_OPEN}}, + {0x2771, {0x2770, 0x0000, BT_CLOSE}}, + {0x2772, {0x2773, 0x0000, BT_OPEN}}, + {0x2773, {0x2772, 0x0000, BT_CLOSE}}, + {0x2774, {0x2775, 0x0000, BT_OPEN}}, + {0x2775, {0x2774, 0x0000, BT_CLOSE}}, + {0x27c5, {0x27c6, 0x0000, BT_OPEN}}, + {0x27c6, {0x27c5, 0x0000, BT_CLOSE}}, + {0x27e6, {0x27e7, 0x0000, BT_OPEN}}, + {0x27e7, {0x27e6, 0x0000, BT_CLOSE}}, + {0x27e8, {0x27e9, 0x0000, BT_OPEN}}, + {0x27e9, {0x27e8, 0x0000, BT_CLOSE}}, + {0x27ea, {0x27eb, 0x0000, BT_OPEN}}, + {0x27eb, {0x27ea, 0x0000, BT_CLOSE}}, + {0x27ec, {0x27ed, 0x0000, BT_OPEN}}, + {0x27ed, {0x27ec, 0x0000, BT_CLOSE}}, + {0x27ee, {0x27ef, 0x0000, BT_OPEN}}, + {0x27ef, {0x27ee, 0x0000, BT_CLOSE}}, + {0x2983, {0x2984, 0x0000, BT_OPEN}}, + {0x2984, {0x2983, 0x0000, BT_CLOSE}}, + {0x2985, {0x2986, 0x0000, BT_OPEN}}, + {0x2986, {0x2985, 0x0000, BT_CLOSE}}, + {0x2987, {0x2988, 0x0000, BT_OPEN}}, + {0x2988, {0x2987, 0x0000, BT_CLOSE}}, + {0x2989, {0x298a, 0x0000, BT_OPEN}}, + {0x298a, {0x2989, 0x0000, BT_CLOSE}}, + {0x298b, {0x298c, 0x0000, BT_OPEN}}, + {0x298c, {0x298b, 0x0000, BT_CLOSE}}, + {0x298d, {0x2990, 0x0000, BT_OPEN}}, + {0x298e, {0x298f, 0x0000, BT_CLOSE}}, + {0x298f, {0x298e, 0x0000, BT_OPEN}}, + {0x2990, {0x298d, 0x0000, BT_CLOSE}}, + {0x2991, {0x2992, 0x0000, BT_OPEN}}, + {0x2992, {0x2991, 0x0000, BT_CLOSE}}, + {0x2993, {0x2994, 0x0000, BT_OPEN}}, + {0x2994, {0x2993, 0x0000, BT_CLOSE}}, + {0x2995, {0x2996, 0x0000, 
BT_OPEN}}, + {0x2996, {0x2995, 0x0000, BT_CLOSE}}, + {0x2997, {0x2998, 0x0000, BT_OPEN}}, + {0x2998, {0x2997, 0x0000, BT_CLOSE}}, + {0x29d8, {0x29d9, 0x0000, BT_OPEN}}, + {0x29d9, {0x29d8, 0x0000, BT_CLOSE}}, + {0x29da, {0x29db, 0x0000, BT_OPEN}}, + {0x29db, {0x29da, 0x0000, BT_CLOSE}}, + {0x29fc, {0x29fd, 0x0000, BT_OPEN}}, + {0x29fd, {0x29fc, 0x0000, BT_CLOSE}}, + {0x2e22, {0x2e23, 0x0000, BT_OPEN}}, + {0x2e23, {0x2e22, 0x0000, BT_CLOSE}}, + {0x2e24, {0x2e25, 0x0000, BT_OPEN}}, + {0x2e25, {0x2e24, 0x0000, BT_CLOSE}}, + {0x2e26, {0x2e27, 0x0000, BT_OPEN}}, + {0x2e27, {0x2e26, 0x0000, BT_CLOSE}}, + {0x2e28, {0x2e29, 0x0000, BT_OPEN}}, + {0x2e29, {0x2e28, 0x0000, BT_CLOSE}}, + {0x2e55, {0x2e56, 0x0000, BT_OPEN}}, + {0x2e56, {0x2e55, 0x0000, BT_CLOSE}}, + {0x2e57, {0x2e58, 0x0000, BT_OPEN}}, + {0x2e58, {0x2e57, 0x0000, BT_CLOSE}}, + {0x2e59, {0x2e5a, 0x0000, BT_OPEN}}, + {0x2e5a, {0x2e59, 0x0000, BT_CLOSE}}, + {0x2e5b, {0x2e5c, 0x0000, BT_OPEN}}, + {0x2e5c, {0x2e5b, 0x0000, BT_CLOSE}}, + {0x3008, {0x3009, 0x232a, BT_OPEN}}, + {0x3009, {0x3008, 0x2329, BT_CLOSE}}, + {0x300a, {0x300b, 0x0000, BT_OPEN}}, + {0x300b, {0x300a, 0x0000, BT_CLOSE}}, + {0x300c, {0x300d, 0x0000, BT_OPEN}}, + {0x300d, {0x300c, 0x0000, BT_CLOSE}}, + {0x300e, {0x300f, 0x0000, BT_OPEN}}, + {0x300f, {0x300e, 0x0000, BT_CLOSE}}, + {0x3010, {0x3011, 0x0000, BT_OPEN}}, + {0x3011, {0x3010, 0x0000, BT_CLOSE}}, + {0x3014, {0x3015, 0x0000, BT_OPEN}}, + {0x3015, {0x3014, 0x0000, BT_CLOSE}}, + {0x3016, {0x3017, 0x0000, BT_OPEN}}, + {0x3017, {0x3016, 0x0000, BT_CLOSE}}, + {0x3018, {0x3019, 0x0000, BT_OPEN}}, + {0x3019, {0x3018, 0x0000, BT_CLOSE}}, + {0x301a, {0x301b, 0x0000, BT_OPEN}}, + {0x301b, {0x301a, 0x0000, BT_CLOSE}}, + {0xfe59, {0xfe5a, 0x0000, BT_OPEN}}, + {0xfe5a, {0xfe59, 0x0000, BT_CLOSE}}, + {0xfe5b, {0xfe5c, 0x0000, BT_OPEN}}, + {0xfe5c, {0xfe5b, 0x0000, BT_CLOSE}}, + {0xfe5d, {0xfe5e, 0x0000, BT_OPEN}}, + {0xfe5e, {0xfe5d, 0x0000, BT_CLOSE}}, + {0xff08, {0xff09, 0x0000, BT_OPEN}}, + {0xff09, 
{0xff08, 0x0000, BT_CLOSE}}, + {0xff3b, {0xff3d, 0x0000, BT_OPEN}}, + {0xff3d, {0xff3b, 0x0000, BT_CLOSE}}, + {0xff5b, {0xff5d, 0x0000, BT_OPEN}}, + {0xff5d, {0xff5b, 0x0000, BT_CLOSE}}, + {0xff5f, {0xff60, 0x0000, BT_OPEN}}, + {0xff60, {0xff5f, 0x0000, BT_CLOSE}}, + {0xff62, {0xff63, 0x0000, BT_OPEN}}, + {0xff63, {0xff62, 0x0000, BT_CLOSE}}, + }; + + int i, j, k; + + i = -1; + j = lenof(bracket_pairs); + + while (j - i > 1) { + k = (i + j) / 2; + if (ch < bracket_pairs[k].src) { + j = k; + } else if (ch > bracket_pairs[k].src) { + i = k; + } else { + return bracket_pairs[k].payload; + } + } + + static const BracketTypeData null = { 0, 0, BT_NONE }; + return null; +} + /* * Function exported to front ends to allow them to identify * bidi-active characters (in case, for example, the platform's @@ -970,56 +2375,7 @@ unsigned char bidi_getType(int ch) */ bool is_rtl(int c) { - /* - * After careful reading of the Unicode bidi algorithm (URL as - * given at the top of this file) I believe that the only - * character classes which can possibly cause trouble are R, - * AL, RLE and RLO. I think that any string containing no - * character in any of those classes will be displayed - * uniformly left-to-right by the Unicode bidi algorithm. - */ - const int mask = (1< 0) { - unsigned char current = level[--from]; - - while (from >= 0 && level[from] == current) - from--; - - if (from >= 0) - return level[from]; - - return -1; - } else - return -1; + return typeIsBidiActive(bidi_getType(c)); } /* The Main shaping function, and the only one to be used @@ -1116,8 +2472,68 @@ int do_shape(bidi_char *line, bidi_char *to, int count) return 1; } +typedef enum { DO_NEUTRAL, DO_LTR, DO_RTL } DirectionalOverride; + +typedef struct DSStackEntry { + /* + * An entry in the directional status stack (rule section X). 
+ */ + unsigned char level; + bool isolate; + DirectionalOverride override; +} DSStackEntry; + +typedef struct BracketStackEntry { + /* + * An entry in the bracket-pair-tracking stack (rule BD16). + */ + unsigned ch; + size_t c; +} BracketStackEntry; + +typedef struct IsolatingRunSequence { + size_t start, end; + BidiType sos, eos, embeddingDirection; +} IsolatingRunSequence; + +#define MAX_DEPTH 125 /* specified in the standard */ + struct BidiContext { - int dummy; + /* + * Storage space preserved between runs, all allocated to the same + * length (internal_array_sizes). + */ + size_t internal_array_sizes; + BidiType *types, *origTypes; + unsigned char *levels; + size_t *irsindices, *bracketpos; + bool *irsdone; + + /* + * Separately allocated with its own size field + */ + IsolatingRunSequence *irslist; + size_t irslistsize; + + /* + * Rewritten to point to the input to the currently active run of + * the bidi algorithm + */ + bidi_char *text; + size_t textlen; + + /* + * State within a run of the algorithm + */ + BidiType paragraphOverride; + DSStackEntry dsstack[MAX_DEPTH + 2]; + size_t ds_sp; + size_t overflowIsolateCount, overflowEmbeddingCount, validIsolateCount; + unsigned char paragraphLevel; + size_t *irs; + size_t irslen; + BidiType sos, eos, embeddingDirection; + BracketStackEntry bstack[63]; /* constant size specified in rule BD16 */ }; BidiContext *bidi_new_context(void) @@ -1129,813 +2545,1056 @@ BidiContext *bidi_new_context(void) void bidi_free_context(BidiContext *ctx) { + sfree(ctx->types); + sfree(ctx->origTypes); + sfree(ctx->levels); + sfree(ctx->irsindices); + sfree(ctx->irsdone); + sfree(ctx->bracketpos); + sfree(ctx->irslist); sfree(ctx); } -/* - * The Main Bidi Function, and the only function that should - * be used by the outside world. - * - * line: a buffer of size count containing text to apply - * the Bidirectional algorithm to. 
- */ - -void do_bidi(BidiContext *ctx, bidi_char *line, size_t count) +static void ensure_arrays(BidiContext *ctx, size_t textlen) { - unsigned char* types; - unsigned char* levels; - unsigned char paragraphLevel; - unsigned char currentEmbedding; - unsigned char currentOverride; - unsigned char tempType; - int i, j; - bool yes, bover; + if (textlen <= ctx->internal_array_sizes) + return; + ctx->internal_array_sizes = textlen; + ctx->types = sresize(ctx->types, ctx->internal_array_sizes, BidiType); + ctx->origTypes = sresize(ctx->origTypes, ctx->internal_array_sizes, + BidiType); + ctx->levels = sresize(ctx->levels, ctx->internal_array_sizes, + unsigned char); + ctx->irsindices = sresize(ctx->irsindices, ctx->internal_array_sizes, + size_t); + ctx->irsdone = sresize(ctx->irsdone, ctx->internal_array_sizes, bool); + ctx->bracketpos = sresize(ctx->bracketpos, ctx->internal_array_sizes, + size_t); +} - /* Check the presence of R or AL types as optimization */ - yes = false; - for (i=0; itextlen; i++) + ctx->types[i] = ctx->origTypes[i] = bidi_getType(ctx->text[i].wc); +} + +static bool text_needs_bidi(BidiContext *ctx) +{ + /* + * Initial optimisation: check for any bidi-active character at + * all in an input line. If there aren't any, we can skip the + * whole algorithm. + * + * Also include the paragraph override in this check! + */ + for (size_t i = 0; i < ctx->textlen; i++) + if (typeIsBidiActive(ctx->types[i])) + return true; + return typeIsBidiActive(ctx->paragraphOverride); +} + +static size_t find_matching_pdi(const BidiType *types, size_t i, size_t size) +{ + /* Assuming that types[i] is an isolate initiator, find its + * matching PDI by rule BD9. 
*/ + unsigned counter = 1; + i++; + for (; i < size; i++) { + BidiType t = types[i]; + if (typeIsIsolateInitiator(t)) { + counter++; + } else if (t == PDI) { + counter--; + if (counter == 0) + return i; } } - if (!yes) - return; - /* Initialize types, levels */ - types = snewn(count, unsigned char); - levels = snewn(count, unsigned char); + /* If no PDI was found, return the length of the array. */ + return size; +} - /* Rule (P1) NOT IMPLEMENTED - * P1. Split the text into separate paragraphs. A paragraph separator is - * kept with the previous paragraph. Within each paragraph, apply all the - * other rules of this algorithm. +static unsigned char rule_p2_p3(const BidiType *types, size_t size) +{ + /* + * Rule P2. Find the first strong type (L, R or AL), ignoring + * anything inside an isolated segment. + * + * Rule P3. If that type is R or AL, choose a paragraph embedding + * level of 1, otherwise 0. */ - - /* Rule (P2), (P3) - * P2. In each paragraph, find the first character of type L, AL, or R. - * P3. If a character is found in P2 and it is of type AL or R, then set - * the paragraph embedding level to one; otherwise, set it to zero. 
- */ - paragraphLevel = 0; - for (i=0; iparagraphOverride == L) + ctx->paragraphLevel = 0; + else if (ctx->paragraphOverride == R) + ctx->paragraphLevel = 1; + else + ctx->paragraphLevel = rule_p2_p3(ctx->types, ctx->textlen); +} + +static inline unsigned char nextOddLevel(unsigned char x) { return (x+1)|1; } +static inline unsigned char nextEvenLevel(unsigned char x) { return (x|1)+1; } + +static inline void push(BidiContext *ctx, unsigned char level, + DirectionalOverride override, bool isolate) +{ + ctx->ds_sp++; + assert(ctx->ds_sp < lenof(ctx->dsstack)); + ctx->dsstack[ctx->ds_sp].level = level; + ctx->dsstack[ctx->ds_sp].override = override; + ctx->dsstack[ctx->ds_sp].isolate = isolate; +} + +static inline void pop(BidiContext *ctx) +{ + assert(ctx->ds_sp > 0); + ctx->ds_sp--; +} + +static void process_explicit_embeddings(BidiContext *ctx) +{ + /* + * Rule X1 initialisation. */ - currentEmbedding = paragraphLevel; - currentOverride = ON; + ctx->ds_sp = (size_t)-1; + push(ctx, ctx->paragraphLevel, DO_NEUTRAL, false); + ctx->overflowIsolateCount = 0; + ctx->overflowEmbeddingCount = 0; + ctx->validIsolateCount = 0; - /* Rule (X2), (X3), (X4), (X5), (X6), (X7), (X8) - * X2. With each RLE, compute the least greater odd embedding level. - * X3. With each LRE, compute the least greater even embedding level. - * X4. With each RLO, compute the least greater odd embedding level. - * X5. With each LRO, compute the least greater even embedding level. - * X6. For all types besides RLE, LRE, RLO, LRO, and PDF: - * a. Set the level of the current character to the current - * embedding level. - * b. Whenever the directional override status is not neutral, - * reset the current character type to the directional - * override status. - * X7. With each PDF, determine the matching embedding or override code. - * If there was a valid matching code, restore (pop) the last - * remembered (pushed) embedding level and directional override. - * X8. 
All explicit directional embeddings and overrides are completely - * terminated at the end of each paragraph. Paragraph separators are not - * included in the embedding. (Useless here) NOT IMPLEMENTED - */ - bover = false; - for (i=0; idsstack[ctx->ds_sp]) - case LRE: - currentEmbedding = levels[i] = leastGreaterEven(currentEmbedding); - levels[i] = setOverrideBits(levels[i], currentOverride); - currentOverride = ON; - break; + for (size_t i = 0; i < ctx->textlen; i++) { + BidiType t = ctx->types[i]; + switch (t) { + case RLE: case LRE: case RLO: case LRO: { + /* Rules X2-X5 */ + unsigned char newLevel; + DirectionalOverride override; - case RLO: - currentEmbedding = levels[i] = leastGreaterOdd(currentEmbedding); - tempType = currentOverride = R; - bover = true; - break; +#ifndef REMOVE_FORMATTING_CHARS + ctx->levels[i] = stk->level; +#endif - case LRO: - currentEmbedding = levels[i] = leastGreaterEven(currentEmbedding); - tempType = currentOverride = L; - bover = true; - break; - - case PDF: { - int prevlevel = getPreviousLevel(levels, i); - - if (prevlevel == -1) { - currentEmbedding = paragraphLevel; - currentOverride = ON; - } else { - currentOverride = currentEmbedding & OMASK; - currentEmbedding = currentEmbedding & ~OMASK; + switch (t) { + case RLE: /* rule X2 */ + newLevel = nextOddLevel(stk->level); + override = DO_NEUTRAL; + break; + case LRE: /* rule X3 */ + newLevel = nextEvenLevel(stk->level); + override = DO_NEUTRAL; + break; + case RLO: /* rule X4 */ + newLevel = nextOddLevel(stk->level); + override = DO_RTL; + break; + case LRO: /* rule X5 */ + newLevel = nextEvenLevel(stk->level); + override = DO_LTR; + break; + default: + unreachable("how did this get past the outer switch?"); + } + + if (newLevel <= MAX_DEPTH && + ctx->overflowIsolateCount == 0 && + ctx->overflowEmbeddingCount == 0) { + /* Embedding code is valid. Push a stack entry. */ + push(ctx, newLevel, override, false); + } else { + /* Embedding code is an overflow one. 
*/ + if (ctx->overflowIsolateCount == 0) + ctx->overflowEmbeddingCount++; } - levels[i] = currentEmbedding; break; } - /* Whitespace is treated as neutral for now */ - case WS: - case S: - levels[i] = currentEmbedding; - tempType = ON; - if (currentOverride != ON) - tempType = currentOverride; - break; + case RLI: case LRI: case FSI: { + /* Rules X5a, X5b, X5c */ - default: - levels[i] = currentEmbedding; - if (currentOverride != ON) - tempType = currentOverride; - break; + if (t == FSI) { + /* Rule X5c: decide whether this should be treated + * like RLI or LRI */ + size_t pdi = find_matching_pdi(ctx->types, i, ctx->textlen); + unsigned char level = rule_p2_p3(ctx->types + (i + 1), + pdi - (i + 1)); + t = (level == 1 ? RLI : LRI); + } - } - types[i] = tempType; - } - /* this clears out all overrides, so we can use levels safely... */ - /* checks bover first */ - if (bover) - for (i=0; ilevels[i] = stk->level; + if (stk->override != DO_NEUTRAL) + ctx->types[i] = (stk->override == DO_LTR ? L : + stk->override == DO_RTL ? R : t); - /* Rule (X9) - * X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN codes. - * Here, they're converted to BN. - */ - for (i=0; ilevel) : + nextEvenLevel(stk->level)); + + if (newLevel <= MAX_DEPTH && + ctx->overflowIsolateCount == 0 && + ctx->overflowEmbeddingCount == 0) { + /* Isolate code is valid. Push a stack entry. */ + push(ctx, newLevel, DO_NEUTRAL, true); + ctx->validIsolateCount++; + } else { + /* Isolate code is an overflow one. */ + ctx->overflowIsolateCount++; + } break; + } + + case PDI: { + /* Rule X6a */ + if (ctx->overflowIsolateCount > 0) { + ctx->overflowIsolateCount--; + } else if (ctx->validIsolateCount == 0) { + /* Do nothing: spurious isolate-pop */ + } else { + /* Valid isolate-pop. We expect that the stack must + * therefore contain at least one isolate==true entry, + * so pop everything up to and including it. 
*/ + ctx->overflowEmbeddingCount = 0; + while (!stk->isolate) + pop(ctx); + pop(ctx); + ctx->validIsolateCount--; + } + ctx->levels[i] = stk->level; + if (stk->override != DO_NEUTRAL) + ctx->types[i] = (stk->override == DO_LTR ? L : R); + break; + } + + case PDF: { + /* Rule X7 */ + if (ctx->overflowIsolateCount > 0) { + /* Do nothing if we've overflowed on isolates */ + } else if (ctx->overflowEmbeddingCount > 0) { + ctx->overflowEmbeddingCount--; + } else if (ctx->ds_sp > 0 && !stk->isolate) { + pop(ctx); + } else { + /* Do nothing: spurious embedding-pop */ + } + +#ifndef REMOVE_FORMATTING_CHARS + ctx->levels[i] = stk->level; +#endif + break; + } + + case B: { + /* Rule X8: if an explicit paragraph separator appears in + * this text at all then it does not participate in any of + * the above, and just gets assigned the paragraph level. + * + * PS, it had better be right at the end of the text, + * because we have not implemented rule P1 in this code. */ + assert(i == ctx->textlen - 1); + ctx->levels[i] = ctx->paragraphLevel; + break; + } + + case BN: { + /* + * The section 5.2 adjustment to rule X6 says that we + * apply it to BN just like any other class. But I think + * this can't possibly give the same results as the + * unmodified algorithm. + * + * Proof: adding RLO BN or LRO BN at the end of a + * paragraph should not change the output of the standard + * algorithm, because the override doesn't affect the BN + * in rule X6, and then rule X9 removes both. But with the + * modified rule X6, the BN is changed into R or L, and + * then rule X9 doesn't remove it, and then you've added a + * strong type that will set eos for the level run just + * before the override. And whatever the standard + * algorithm set eos to, _one_ of these override sequences + * will disagree with it. + * + * So I think we just set the BN's level, and don't change + * its type. + */ + ctx->levels[i] = stk->level; + break; + } + + default: { + /* Rule X6. 
*/ + ctx->levels[i] = stk->level; + if (stk->override != DO_NEUTRAL) + ctx->types[i] = (stk->override == DO_LTR ? L : R); + break; + } } } - /* Rule (W1) - * W1. Examine each non-spacing mark (NSM) in the level run, and change - * the type of the NSM to the type of the previous character. If the NSM - * is at the start of the level run, it will get the type of sor. - */ - if (types[0] == NSM) - types[0] = paragraphLevel; + #undef stk +} - for (i=1; itextlen; i++) { + BidiType t = ctx->types[i]; + if (typeIsRemovedDuringProcessing(t)) { + ctx->types[i] = BN; + + /* + * My own adjustment to the section 5.2 mods: a sequence + * of contiguous BN generated by this setup should never + * be at different levels from each other. + * + * An example where this goes wrong is if you open two + * LREs in sequence, then close them again: + * + * ... LRE LRE PDF PDF ... + * + * The initial level assignment gives level 0 to the outer + * LRE/PDF pair, and level 2 to the inner one. The + * standard algorithm would remove all four, so this + * doesn't matter, and you end up with no break in the + * surrounding level run. But if you just rewrite the + * types of all those characters to BN and leave the + * levels in that state, then the modified algorithm will + * leave the middle two BN at level 2, dividing what + * should have been a long level run at level 0 into two + * separate ones. + */ + if (i > 0 && ctx->types[i-1] == BN) + ctx->levels[i] = ctx->levels[i-1]; + } + } +#else + /* + * Rule X9, original version: completely remove embedding + * start/end characters and also boundary neutrals. 
+ */ + size_t outpos = 0; + for (size_t i = 0; i < ctx->textlen; i++) { + BidiType t = ctx->types[i]; + if (!typeIsRemovedDuringProcessing(t)) { + ctx->text[outpos] = ctx->text[i]; + ctx->levels[outpos] = ctx->levels[i]; + ctx->types[outpos] = ctx->types[i]; + ctx->origTypes[outpos] = ctx->origTypes[i]; + outpos++; + } + } + ctx->textlen = outpos; +#endif +} + +typedef void (*irs_fn_t)(BidiContext *ctx); + +static void find_isolating_run_sequences(BidiContext *ctx, irs_fn_t process) +{ + /* + * Rule X10 / BD13. Now that we've assigned an embedding level to + * each character in the text, we have to divide the text into + * subsequences on which to do the next stage of processing. + * + * In earlier issues of the bidi algorithm, these subsequences + * were contiguous in the original text, and each one was a 'level + * run': a maximal contiguous subsequence of characters all at the + * same embedding level. + * + * But now we have isolates, and the point of an (isolate + * initiator ... PDI) sequence is that the whole sequence should + * be treated like a single BN for the purposes of formatting + * everything outside it. As a result, we now have to recombine + * our level runs into longer sequences, on the principle that if + * a level run ends with an isolate initiator, then we bring it + * together with whatever later level run starts with the matching + * PDI. + * + * These subsequences are no longer contiguous (the whole point is + * that between the isolate initiator and the PDI is some other + * text that we've skipped over). They're called 'isolating run + * sequences'. + */ + + memset(ctx->irsdone, 0, ctx->textlen); + size_t i = 0; + size_t n_irs = 0; + size_t indexpos = 0; + while (i < ctx->textlen) { + if (ctx->irsdone[i]) { + i++; + continue; + } + + /* + * Found a character not already processed. Start a new + * sequence here. 
+ */ + sgrowarray(ctx->irslist, ctx->irslistsize, n_irs); + IsolatingRunSequence *irs = &ctx->irslist[n_irs++]; + irs->start = indexpos; + size_t j = i; + size_t irslevel = ctx->levels[i]; + while (j < ctx->textlen) { + /* + * We expect that all level runs in this sequence will be + * at the same level as each other, by construction of how + * we set up the levels from the isolates in the first + * place. + */ + assert(ctx->levels[j] == irslevel); + + do { + ctx->irsdone[j] = true; + ctx->irsindices[indexpos++] = j++; + } while (j < ctx->textlen && ctx->levels[j] == irslevel); + if (!typeIsIsolateInitiator(ctx->types[j-1])) + break; /* this IRS is ended */ + j = find_matching_pdi(ctx->types, j-1, ctx->textlen); + } + irs->end = indexpos; + + /* + * Determine the start-of-sequence and end-of-sequence types + * for this sequence. + * + * These depend on the embedding levels of surrounding text. + * But processing each run can change those levels. That's why + * we have to use a two-pass strategy here, first identifying + * all the isolating run sequences using the input level data, + * and not processing any of them until we know where they all + * are. + */ + size_t p; + unsigned char level_inside, level_outside, level_max; + + p = i; + level_inside = ctx->levels[p]; + level_outside = ctx->paragraphLevel; + while (p > 0) { + p--; + if (ctx->types[p] != BN) { + level_outside = ctx->levels[p]; + break; + } + } + level_max = max(level_inside, level_outside); + irs->sos = (level_max % 2 ? R : L); + + p = ctx->irsindices[irs->end - 1]; + level_inside = ctx->levels[p]; + level_outside = ctx->paragraphLevel; + if (typeIsIsolateInitiator(ctx->types[p])) { + /* Special case: if an isolating run sequence ends in an + * unmatched isolate initiator, then level_outside is + * taken to be the paragraph embedding level and the + * loop below is skipped. 
*/ + } else { + while (p+1 < ctx->textlen) { + p++; + if (ctx->types[p] != BN) { + level_outside = ctx->levels[p]; + break; + } + } + } + level_max = max(level_inside, level_outside); + irs->eos = (level_max % 2 ? R : L); + + irs->embeddingDirection = (irslevel % 2 ? R : L); + + /* + * Now we've listed in ctx->irsindices[] the index of every + * character that's part of this isolating run sequence, and + * recorded an entry in irslist containing the interval of + * indices relevant to this IRS, plus its assorted metadata. + * We've also marked those locations in the input text as done + * in ctx->irsdone, so that we'll skip over them when the + * outer iteration reaches them later. */ } - /* Rule (W2) - * W2. Search backwards from each instance of a European number until the - * first strong type (R, L, AL, or sor) is found. If an AL is found, - * change the type of the European number to Arabic number. - */ - for (i=0; i= 0) { - if (types[j] == AL) { - types[i] = AN; - break; - } else if (types[j] == R || types[j] == L) { - break; - } - j--; - } - } + for (size_t k = 0; k < n_irs; k++) { + IsolatingRunSequence *irs = &ctx->irslist[k]; + ctx->irs = ctx->irsindices + irs->start; + ctx->irslen = irs->end - irs->start; + ctx->sos = irs->sos; + ctx->eos = irs->eos; + ctx->embeddingDirection = irs->embeddingDirection; + process(ctx); } - /* Rule (W3) - * W3. Change all ALs to R. - * - * Optimization: on Rule Xn, we might set a flag on AL type - * to prevent this loop in L R lines only... - */ - for (i=0; i 0 && types[i-1] == EN) { - types[i] = EN; - continue; - } else if (i < count-1 && types[i+1] == EN) { - types[i] = EN; - continue; - } else if (i < count-1 && types[i+1] == ET) { - j=i; - while (j < count-1 && types[j] == ET) { - j++; - } - if (types[j] == EN) - types[i] = EN; - } - } - } - - /* Rule (W6) - * W6. 
Otherwise, separators and terminators change to Other Neutral: - */ - for (i=0; i= 0) { - if (types[j] == L) { - types[i] = L; - break; - } else if (types[j] == R || types[j] == AL) { - break; - } - j--; - } - } - } - - /* Rule (N1) - * N1. A sequence of neutrals takes the direction of the surrounding - * strong text if the text on both sides has the same direction. European - * and Arabic numbers are treated as though they were R. - */ - if (count >= 2 && types[0] == ON) { - if ((types[1] == R) || (types[1] == EN) || (types[1] == AN)) - types[0] = R; - else if (types[1] == L) - types[0] = L; - } - for (i=1; i<(count-1); i++) { - if (types[i] == ON) { - if (types[i-1] == L) { - j=i; - while (j<(count-1) && types[j] == ON) { - j++; - } - if (types[j] == L) { - while (i= 2 && types[count-1] == ON) { - if (types[count-2] == R || types[count-2] == EN || types[count-2] == AN) - types[count-1] = R; - else if (types[count-2] == L) - types[count-1] = L; - } - - /* Rule (N2) - * N2. Any remaining neutrals take the embedding direction. - */ - for (i=0; i0 && (bidi_getType(line[j].wc) == WS)) { - j--; - } - if (j < (count-1)) { - for (j++; j=i ; j--) { - levels[j] = paragraphLevel; - } - } - } else if (tempType == B || tempType == S) { - levels[i] = paragraphLevel; - } - } - - /* Rule (L4) NOT IMPLEMENTED - * L4. A character that possesses the mirrored property as specified by - * Section 4.7, Mirrored, must be depicted by a mirrored glyph if the - * resolved directionality of that character is R. - */ - /* Note: this is implemented before L2 for efficiency */ - for (i=0; i tempType) - tempType = levels[i]; - i++; - } - /* maximum level in tempType. */ - while (tempType > 0) { /* loop from highest level to the least odd, */ - /* which i assume is 1 */ - flipThisRun(line, levels, tempType, count); - tempType--; - } - - /* Rule (L3) NOT IMPLEMENTED - * L3. Combining marks applied to a right-to-left base character will at - * this point precede their base character. 
If the rendering engine - * expects them to follow the base characters in the final display - * process, then the ordering of the marks and the base character must - * be reversed. - */ - sfree(types); - sfree(levels); - return; + /* Reset irslen to 0 when we've finished. This means any other + * functions that absentmindedly try to use irslen at all will end + * up doing nothing at all, which should be easier to detect and + * debug than if they run on subtly the wrong subset of the + * text. */ + ctx->irslen = 0; } +static void remove_nsm(BidiContext *ctx) +{ + /* Rule W1: NSM gains the type of the previous character, or sos + * at the start of the run, with the exception that isolation + * boundaries turn into ON. */ + BidiType prevType = ctx->sos; + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (t == NSM) { + ctx->types[i] = prevType; + } else if (typeIsIsolateInitiatorOrPDI(t)) { + prevType = ON; +#ifndef REMOVE_FORMATTING_CHARS + } else if (t == BN) { + /* section 5.2 adjustment: these don't affect prevType */ +#endif + } else { + prevType = t; + } + } +} + +static void change_en_to_an(BidiContext *ctx) +{ + /* Rule W2: EN becomes AN if the previous strong type is AL. (The + * spec says that the 'previous strong type' is counted as sos at + * the start of the run, although it hardly matters, since sos + * can't be AL.) */ + BidiType prevStrongType = ctx->sos; + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (t == EN && prevStrongType == AL) { + ctx->types[i] = AN; + } else if (typeIsStrong(t)) { + prevStrongType = t; + } + } +} + +static void change_al_to_r(BidiContext *ctx) +{ + /* Rule W3: AL becomes R unconditionally. (The only difference + * between the two types was their effect on nearby numbers, which + * was dealt with in rule W2, so now we're done with the + * distinction.) 
*/ + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + if (ctx->types[i] == AL) + ctx->types[i] = R; + } +} + +static void eliminate_separators_between_numbers(BidiContext *ctx) +{ + /* Rule W4: a single numeric separator between two numbers of the + * same type compatible with that separator takes the type of the + * number. ES is a separator type compatible only with EN; CS is a + * separator type compatible with either EN or AN. + * + * Section 5.2 adjustment: intervening BNs do not break this, so + * instead of simply looking at types[irs[c-1]] and types[irs[c+1]], + * we must track the last three indices we saw that were not BN. */ + size_t i0 = 0, i1 = 0, i2 = 0; + BidiType t0 = ON, t1 = ON, t2 = ON; + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + +#ifndef REMOVE_FORMATTING_CHARS + if (t == BN) + continue; +#endif + + i0 = i1; i1 = i2; i2 = i; + t0 = t1; t1 = t2; t2 = t; + if (t0 == t2 && ((t1 == ES && t0 == EN) || + (t1 == CS && (t0 == EN || t0 == AN)))) { + ctx->types[i1] = t0; + } + } +} + +static void eliminate_et_next_to_en(BidiContext *ctx) +{ + /* Rule W5: a sequence of ET adjacent to an EN take the type EN. + * This is easiest to implement with one loop in each direction. + * + * Section 5.2 adjustment: include BN with ET. (We don't need to + * #ifdef that out, because in the standard algorithm, we won't + * have any BN left in any case.) 
*/ + + bool modifying = false; + + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (t == EN) { + modifying = true; + } else if (modifying && typeIsETOrBN(t)) { + ctx->types[i] = EN; + } else { + modifying = false; + } + } + + for (size_t c = ctx->irslen; c-- > 0 ;) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (t == EN) { + modifying = true; + } else if (modifying && typeIsETOrBN(t)) { + ctx->types[i] = EN; + } else { + modifying = false; + } + } +} + +static void eliminate_separators_and_terminators(BidiContext *ctx) +{ + /* Rule W6: all separators and terminators change to ON. + * + * (The spec is not quite clear on which bidi types are included + * in this; one assumes ES, ET and CS, but what about S? I _think_ + * the answer is that this is a rule in the W section, so it's + * implicitly supposed to only apply to types designated as weakly + * directional, so not S.) */ + +#ifndef REMOVE_FORMATTING_CHARS + /* + * Section 5.2 adjustment: this also applies to any BN adjacent on + * either side to one of these types, which is easiest to + * implement with a separate double-loop converting those to an + * arbitrary one of the affected types, say CS. + * + * This double loop can be completely skipped in the standard + * algorithm. 
+ */ + bool modifying = false; + + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (typeIsWeakSeparatorOrTerminator(t)) { + modifying = true; + } else if (modifying && t == BN) { + ctx->types[i] = CS; + } else { + modifying = false; + } + } + + for (size_t c = ctx->irslen; c-- > 0 ;) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (typeIsWeakSeparatorOrTerminator(t)) { + modifying = true; + } else if (modifying && t == BN) { + ctx->types[i] = CS; + } else { + modifying = false; + } + } +#endif + + /* Now the main part of rule W6 */ + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (typeIsWeakSeparatorOrTerminator(t)) + ctx->types[i] = ON; + } +} + +static void change_en_to_l(BidiContext *ctx) +{ + /* Rule W7: EN becomes L if the previous strong type (or sos) is L. */ + BidiType prevStrongType = ctx->sos; + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (t == EN && prevStrongType == L) { + ctx->types[i] = L; + } else if (typeIsStrong(t)) { + prevStrongType = t; + } + } +} + +typedef void (*bracket_pair_fn)(BidiContext *ctx, size_t copen, size_t cclose); + +static void find_bracket_pairs(BidiContext *ctx, bracket_pair_fn process) +{ + const size_t NO_BRACKET = ~(size_t)0; + + /* + * Rule BD16. + */ + size_t sp = 0; + for (size_t c = 0; c < ctx->irslen; c++) + ctx->bracketpos[c] = NO_BRACKET; + + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + unsigned wc = ctx->text[i].wc; + BracketTypeData bt = bracket_type(wc); + if (bt.type == BT_OPEN) { + if (sp >= lenof(ctx->bstack)) { + /* + * Stack overflow. The spec says we simply give up at + * this point. 
+ */ + goto found_all_pairs; + } + + ctx->bstack[sp].ch = wc; + ctx->bstack[sp].c = c; + sp++; + } else if (bt.type == BT_CLOSE) { + size_t new_sp = sp; + + /* + * Search up the stack for an entry containing a matching + * open bracket. If we find it, pop that entry and + * everything deeper, and record a matching pair. If we + * reach the bottom of the stack without finding anything, + * leave sp where it started. + */ + while (new_sp-- > 0) { + if (ctx->bstack[new_sp].ch == bt.partner || + ctx->bstack[new_sp].ch == bt.equiv_partner) { + /* Found a stack element matching this one */ + size_t cstart = ctx->bstack[new_sp].c; + ctx->bracketpos[cstart] = c; + sp = new_sp; + break; + } + } + } + } + + found_all_pairs: + for (size_t c = 0; c < ctx->irslen; c++) { + if (ctx->bracketpos[c] != NO_BRACKET) { + process(ctx, c, ctx->bracketpos[c]); + } + } +} + +static BidiType get_bracket_type(BidiContext *ctx, size_t copen, size_t cclose) +{ + /* + * Rule N0: a pair of matched brackets containing at least one + * strong type takes on the current embedding direction, unless + * all of these are true at once: + * + * (a) there are no strong types inside the brackets matching the + * current embedding direction + * (b) there _is_ at least one strong type inside the brackets + * that is _opposite_ to the current embedding direction + * (c) the strong type preceding the open bracket is also + * opposite to the current embedding direction + * + * in which case they take on the opposite direction. + * + * For these purposes, number types (EN and AN) count as R. + */ + + bool foundOppositeTypeInside = false; + for (size_t c = copen + 1; c < cclose; c++) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (typeIsStrongOrNumber(t)) { + t = t == L ? L : R; /* numbers count as R */ + if (t == ctx->embeddingDirection) { + /* Found something inside the brackets matching the + * current level, so (a) is violated. 
*/ + return ctx->embeddingDirection; + } else { + foundOppositeTypeInside = true; + } + } + } + + if (!foundOppositeTypeInside) { + /* No strong types at all inside the brackets, so return ON to + * indicate that we're not messing with their type at all. */ + return ON; + } + + /* There was an opposite strong type in the brackets. Look + * backwards to the preceding strong type, and go with that, + * whichever it is. */ + for (size_t c = copen; c-- > 0 ;) { + size_t i = ctx->irs[c]; + BidiType t = ctx->types[i]; + if (typeIsStrongOrNumber(t)) { + t = t == L ? L : R; /* numbers count as R */ + return t; + } + } + + /* Fallback: if the preceding strong type was not found, go with + * sos. */ + return ctx->sos; +} + +static void reset_bracket_type(BidiContext *ctx, size_t c, BidiType t) +{ + /* Final bullet point of rule N0: when we change the type of a + * bracket, the same change applies to any contiguous sequence of + * characters after it whose _original_ bidi type was NSM. */ + do { + ctx->types[ctx->irs[c++]] = t; + +#ifndef REMOVE_FORMATTING_CHARS + while (c < ctx->irslen && ctx->origTypes[ctx->irs[c]] == BN) { + /* Section 5.2 adjustment: skip past BN in the process. */ + c++; + } +#endif + } while (c < ctx->irslen && ctx->origTypes[ctx->irs[c]] == NSM); +} + +static void resolve_brackets(BidiContext *ctx, size_t copen, size_t cclose) +{ + if (typeIsNeutral(ctx->types[ctx->irs[copen]]) && + typeIsNeutral(ctx->types[ctx->irs[cclose]])) { + BidiType t = get_bracket_type(ctx, copen, cclose); + if (t != ON) { + reset_bracket_type(ctx, copen, t); + reset_bracket_type(ctx, cclose, t); + } + } +} + +static void remove_ni(BidiContext *ctx) +{ + /* + * Rules N1 and N2 together: neutral or isolate characters take + * the direction of the surrounding strong text if the nearest + * strong characters on each side match, and otherwise, they take + * the embedding direction. 
+ */ + const size_t NO_INDEX = ~(size_t)0; + BidiType prevStrongType = ctx->sos; + size_t c_ni_start = NO_INDEX; + for (size_t c = 0; c <= ctx->irslen; c++) { + BidiType t; + + if (c < ctx->irslen) { + size_t i = ctx->irs[c]; + t = ctx->types[i]; + } else { + /* One extra loop iteration, using eos to resolve the + * final sequence of NI if any */ + t = ctx->eos; + } + + if (typeIsStrongOrNumber(t)) { + t = t == L ? L : R; /* numbers count as R */ + if (c_ni_start != NO_INDEX) { + /* There are some NI we have to fix up */ + BidiType ni_type = (t == prevStrongType ? t : + ctx->embeddingDirection); + for (size_t c2 = c_ni_start; c2 < c; c2++) { + size_t i2 = ctx->irs[c2]; + BidiType t2 = ctx->types[i2]; + if (typeIsNeutralOrIsolate(t2)) + ctx->types[i2] = ni_type; + } + } + prevStrongType = t; + c_ni_start = NO_INDEX; + } else if (typeIsNeutralOrIsolate(t) && c_ni_start == NO_INDEX) { + c_ni_start = c; + } + } +} + +static void resolve_implicit_levels(BidiContext *ctx) +{ + /* Rules I1 and I2 */ + for (size_t c = 0; c < ctx->irslen; c++) { + size_t i = ctx->irs[c]; + unsigned char level = ctx->levels[i]; + BidiType t = ctx->types[i]; + if (level % 2 == 0) { + /* Rule I1 */ + if (t == R) + ctx->levels[i] += 1; + else if (t == AN || t == EN) + ctx->levels[i] += 2; + } else { + /* Rule I2 */ + if (t == L || t == AN || t == EN) + ctx->levels[i] += 1; + } + } +} + +static void process_isolating_run_sequence(BidiContext *ctx) +{ + /* Section W: resolve weak types */ + remove_nsm(ctx); + change_en_to_an(ctx); + change_al_to_r(ctx); + eliminate_separators_between_numbers(ctx); + eliminate_et_next_to_en(ctx); + eliminate_separators_and_terminators(ctx); + change_en_to_l(ctx); + + /* Section N: resolve neutral types (and isolates) */ + find_bracket_pairs(ctx, resolve_brackets); + remove_ni(ctx); + + /* Section I: resolve implicit levels */ + resolve_implicit_levels(ctx); +} + +static void reset_whitespace_and_separators(BidiContext *ctx) +{ + /* + * Rule L1: segment and 
paragraph separators, plus whitespace + * preceding them, all reset to the paragraph embedding level. + * This also applies to whitespace at the very end. + * + * This is done using the original types, not the versions that + * the rest of this algorithm has been merrily mutating. + */ + bool modifying = true; + for (size_t i = ctx->textlen; i-- > 0 ;) { + BidiType t = ctx->origTypes[i]; + if (typeIsSegmentOrParaSeparator(t)) { + ctx->levels[i] = ctx->paragraphLevel; + modifying = true; + } else if (modifying) { + if (typeIsWhitespaceOrIsolate(t)) { + ctx->levels[i] = ctx->paragraphLevel; + } else if (!typeIsRemovedDuringProcessing(t)) { + modifying = false; + } + } + } + +#ifndef REMOVE_FORMATTING_CHARS + /* + * Section 5.2 adjustment: types removed by rule X9 take the level + * of the character to their left. + */ + for (size_t i = 0; i < ctx->textlen; i++) { + BidiType t = ctx->origTypes[i]; + if (typeIsRemovedDuringProcessing(t)) { + /* Section 5.2 adjustment */ + ctx->levels[i] = (i > 0 ? ctx->levels[i-1] : ctx->paragraphLevel); + } + } +#endif /* ! REMOVE_FORMATTING_CHARS */ +} + +static void reverse(BidiContext *ctx, size_t start, size_t end) +{ + for (size_t i = start, j = end; i < j; i++, j--) { + bidi_char tmp = ctx->text[i]; + ctx->text[i] = ctx->text[j]; + ctx->text[j] = tmp; + } +} + +static void mirror_glyphs(BidiContext *ctx) +{ + /* + * Rule L4: any character with a mirror-image pair at an odd + * embedding level is replaced by its mirror image. + * + * This is specified in the standard as happening _after_ rule L2 + * (the actual reordering of the text). But it's much easier to + * implement it before, while our levels[] array still matches up + * to the text order. 
+     */
+    for (size_t i = 0; i < ctx->textlen; i++) {
+        if (ctx->levels[i] % 2)
+            ctx->text[i].wc = mirror_glyph(ctx->text[i].wc);
+    }
+}
+
+static void reverse_sequences(BidiContext *ctx)
+{
+    /*
+     * Rule L2: every maximal contiguous sequence of characters at a
+     * given level or higher is reversed.
+     */
+    unsigned level = 0;
+    for (size_t i = 0; i < ctx->textlen; i++)
+        level = max(level, ctx->levels[i]);
+
+    for (; level >= 1; level--) {
+        for (size_t i = 0; i < ctx->textlen; i++) {
+            if (ctx->levels[i] >= level) {
+                size_t start = i;
+                while (i+1 < ctx->textlen && ctx->levels[i+1] >= level)
+                    i++;
+                reverse(ctx, start, i);
+            }
+        }
+    }
+}
 /*
- * Bad, Horrible function
- * takes a pointer to a character that is checked for
- * having a mirror glyph.
+ * The Main Bidi Function, and the only function that should be used
+ * by the outside world.
+ *
+ * text: a buffer of size textlen containing text to apply the
+ * Bidirectional algorithm to.
  */
-static void doMirror(unsigned int *ch)
+void do_bidi_new(BidiContext *ctx, bidi_char *text, size_t textlen)
 {
-    if ((*ch & 0xFF00) == 0) {
-        switch (*ch) {
-          case 0x0028: *ch = 0x0029; break;
-          case 0x0029: *ch = 0x0028; break;
-          case 0x003C: *ch = 0x003E; break;
-          case 0x003E: *ch = 0x003C; break;
-          case 0x005B: *ch = 0x005D; break;
-          case 0x005D: *ch = 0x005B; break;
-          case 0x007B: *ch = 0x007D; break;
-          case 0x007D: *ch = 0x007B; break;
-          case 0x00AB: *ch = 0x00BB; break;
-          case 0x00BB: *ch = 0x00AB; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2000) {
-        switch (*ch) {
-          case 0x2039: *ch = 0x203A; break;
-          case 0x203A: *ch = 0x2039; break;
-          case 0x2045: *ch = 0x2046; break;
-          case 0x2046: *ch = 0x2045; break;
-          case 0x207D: *ch = 0x207E; break;
-          case 0x207E: *ch = 0x207D; break;
-          case 0x208D: *ch = 0x208E; break;
-          case 0x208E: *ch = 0x208D; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2200) {
-        switch (*ch) {
-          case 0x2208: *ch = 0x220B; break;
-          case 0x2209: *ch = 0x220C; break;
-          case 0x220A: *ch = 0x220D; break;
-          case 0x220B: *ch = 0x2208; break;
-          case 0x220C: *ch = 0x2209; break;
-          case 0x220D: *ch = 0x220A; break;
-          case 0x2215: *ch = 0x29F5; break;
-          case 0x223C: *ch = 0x223D; break;
-          case 0x223D: *ch = 0x223C; break;
-          case 0x2243: *ch = 0x22CD; break;
-          case 0x2252: *ch = 0x2253; break;
-          case 0x2253: *ch = 0x2252; break;
-          case 0x2254: *ch = 0x2255; break;
-          case 0x2255: *ch = 0x2254; break;
-          case 0x2264: *ch = 0x2265; break;
-          case 0x2265: *ch = 0x2264; break;
-          case 0x2266: *ch = 0x2267; break;
-          case 0x2267: *ch = 0x2266; break;
-          case 0x2268: *ch = 0x2269; break;
-          case 0x2269: *ch = 0x2268; break;
-          case 0x226A: *ch = 0x226B; break;
-          case 0x226B: *ch = 0x226A; break;
-          case 0x226E: *ch = 0x226F; break;
-          case 0x226F: *ch = 0x226E; break;
-          case 0x2270: *ch = 0x2271; break;
-          case 0x2271: *ch = 0x2270; break;
-          case 0x2272: *ch = 0x2273; break;
-          case 0x2273: *ch = 0x2272; break;
-          case 0x2274: *ch = 0x2275; break;
-          case 0x2275: *ch = 0x2274; break;
-          case 0x2276: *ch = 0x2277; break;
-          case 0x2277: *ch = 0x2276; break;
-          case 0x2278: *ch = 0x2279; break;
-          case 0x2279: *ch = 0x2278; break;
-          case 0x227A: *ch = 0x227B; break;
-          case 0x227B: *ch = 0x227A; break;
-          case 0x227C: *ch = 0x227D; break;
-          case 0x227D: *ch = 0x227C; break;
-          case 0x227E: *ch = 0x227F; break;
-          case 0x227F: *ch = 0x227E; break;
-          case 0x2280: *ch = 0x2281; break;
-          case 0x2281: *ch = 0x2280; break;
-          case 0x2282: *ch = 0x2283; break;
-          case 0x2283: *ch = 0x2282; break;
-          case 0x2284: *ch = 0x2285; break;
-          case 0x2285: *ch = 0x2284; break;
-          case 0x2286: *ch = 0x2287; break;
-          case 0x2287: *ch = 0x2286; break;
-          case 0x2288: *ch = 0x2289; break;
-          case 0x2289: *ch = 0x2288; break;
-          case 0x228A: *ch = 0x228B; break;
-          case 0x228B: *ch = 0x228A; break;
-          case 0x228F: *ch = 0x2290; break;
-          case 0x2290: *ch = 0x228F; break;
-          case 0x2291: *ch = 0x2292; break;
-          case 0x2292: *ch = 0x2291; break;
-          case 0x2298: *ch = 0x29B8; break;
-          case 0x22A2: *ch = 0x22A3; break;
-          case 0x22A3: *ch = 0x22A2; break;
-          case 0x22A6: *ch = 0x2ADE; break;
-          case 0x22A8: *ch = 0x2AE4; break;
-          case 0x22A9: *ch = 0x2AE3; break;
-          case 0x22AB: *ch = 0x2AE5; break;
-          case 0x22B0: *ch = 0x22B1; break;
-          case 0x22B1: *ch = 0x22B0; break;
-          case 0x22B2: *ch = 0x22B3; break;
-          case 0x22B3: *ch = 0x22B2; break;
-          case 0x22B4: *ch = 0x22B5; break;
-          case 0x22B5: *ch = 0x22B4; break;
-          case 0x22B6: *ch = 0x22B7; break;
-          case 0x22B7: *ch = 0x22B6; break;
-          case 0x22C9: *ch = 0x22CA; break;
-          case 0x22CA: *ch = 0x22C9; break;
-          case 0x22CB: *ch = 0x22CC; break;
-          case 0x22CC: *ch = 0x22CB; break;
-          case 0x22CD: *ch = 0x2243; break;
-          case 0x22D0: *ch = 0x22D1; break;
-          case 0x22D1: *ch = 0x22D0; break;
-          case 0x22D6: *ch = 0x22D7; break;
-          case 0x22D7: *ch = 0x22D6; break;
-          case 0x22D8: *ch = 0x22D9; break;
-          case 0x22D9: *ch = 0x22D8; break;
-          case 0x22DA: *ch = 0x22DB; break;
-          case 0x22DB: *ch = 0x22DA; break;
-          case 0x22DC: *ch = 0x22DD; break;
-          case 0x22DD: *ch = 0x22DC; break;
-          case 0x22DE: *ch = 0x22DF; break;
-          case 0x22DF: *ch = 0x22DE; break;
-          case 0x22E0: *ch = 0x22E1; break;
-          case 0x22E1: *ch = 0x22E0; break;
-          case 0x22E2: *ch = 0x22E3; break;
-          case 0x22E3: *ch = 0x22E2; break;
-          case 0x22E4: *ch = 0x22E5; break;
-          case 0x22E5: *ch = 0x22E4; break;
-          case 0x22E6: *ch = 0x22E7; break;
-          case 0x22E7: *ch = 0x22E6; break;
-          case 0x22E8: *ch = 0x22E9; break;
-          case 0x22E9: *ch = 0x22E8; break;
-          case 0x22EA: *ch = 0x22EB; break;
-          case 0x22EB: *ch = 0x22EA; break;
-          case 0x22EC: *ch = 0x22ED; break;
-          case 0x22ED: *ch = 0x22EC; break;
-          case 0x22F0: *ch = 0x22F1; break;
-          case 0x22F1: *ch = 0x22F0; break;
-          case 0x22F2: *ch = 0x22FA; break;
-          case 0x22F3: *ch = 0x22FB; break;
-          case 0x22F4: *ch = 0x22FC; break;
-          case 0x22F6: *ch = 0x22FD; break;
-          case 0x22F7: *ch = 0x22FE; break;
-          case 0x22FA: *ch = 0x22F2; break;
-          case 0x22FB: *ch = 0x22F3; break;
-          case 0x22FC: *ch = 0x22F4; break;
-          case 0x22FD: *ch = 0x22F6; break;
-          case 0x22FE: *ch = 0x22F7; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2300) {
-        switch (*ch) {
-          case 0x2308: *ch = 0x2309; break;
-          case 0x2309: *ch = 0x2308; break;
-          case 0x230A: *ch = 0x230B; break;
-          case 0x230B: *ch = 0x230A; break;
-          case 0x2329: *ch = 0x232A; break;
-          case 0x232A: *ch = 0x2329; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2700) {
-        switch (*ch) {
-          case 0x2768: *ch = 0x2769; break;
-          case 0x2769: *ch = 0x2768; break;
-          case 0x276A: *ch = 0x276B; break;
-          case 0x276B: *ch = 0x276A; break;
-          case 0x276C: *ch = 0x276D; break;
-          case 0x276D: *ch = 0x276C; break;
-          case 0x276E: *ch = 0x276F; break;
-          case 0x276F: *ch = 0x276E; break;
-          case 0x2770: *ch = 0x2771; break;
-          case 0x2771: *ch = 0x2770; break;
-          case 0x2772: *ch = 0x2773; break;
-          case 0x2773: *ch = 0x2772; break;
-          case 0x2774: *ch = 0x2775; break;
-          case 0x2775: *ch = 0x2774; break;
-          case 0x27D5: *ch = 0x27D6; break;
-          case 0x27D6: *ch = 0x27D5; break;
-          case 0x27DD: *ch = 0x27DE; break;
-          case 0x27DE: *ch = 0x27DD; break;
-          case 0x27E2: *ch = 0x27E3; break;
-          case 0x27E3: *ch = 0x27E2; break;
-          case 0x27E4: *ch = 0x27E5; break;
-          case 0x27E5: *ch = 0x27E4; break;
-          case 0x27E6: *ch = 0x27E7; break;
-          case 0x27E7: *ch = 0x27E6; break;
-          case 0x27E8: *ch = 0x27E9; break;
-          case 0x27E9: *ch = 0x27E8; break;
-          case 0x27EA: *ch = 0x27EB; break;
-          case 0x27EB: *ch = 0x27EA; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2900) {
-        switch (*ch) {
-          case 0x2983: *ch = 0x2984; break;
-          case 0x2984: *ch = 0x2983; break;
-          case 0x2985: *ch = 0x2986; break;
-          case 0x2986: *ch = 0x2985; break;
-          case 0x2987: *ch = 0x2988; break;
-          case 0x2988: *ch = 0x2987; break;
-          case 0x2989: *ch = 0x298A; break;
-          case 0x298A: *ch = 0x2989; break;
-          case 0x298B: *ch = 0x298C; break;
-          case 0x298C: *ch = 0x298B; break;
-          case 0x298D: *ch = 0x2990; break;
-          case 0x298E: *ch = 0x298F; break;
-          case 0x298F: *ch = 0x298E; break;
-          case 0x2990: *ch = 0x298D; break;
-          case 0x2991: *ch = 0x2992; break;
-          case 0x2992: *ch = 0x2991; break;
-          case 0x2993: *ch = 0x2994; break;
-          case 0x2994: *ch = 0x2993; break;
-          case 0x2995: *ch = 0x2996; break;
-          case 0x2996: *ch = 0x2995; break;
-          case 0x2997: *ch = 0x2998; break;
-          case 0x2998: *ch = 0x2997; break;
-          case 0x29B8: *ch = 0x2298; break;
-          case 0x29C0: *ch = 0x29C1; break;
-          case 0x29C1: *ch = 0x29C0; break;
-          case 0x29C4: *ch = 0x29C5; break;
-          case 0x29C5: *ch = 0x29C4; break;
-          case 0x29CF: *ch = 0x29D0; break;
-          case 0x29D0: *ch = 0x29CF; break;
-          case 0x29D1: *ch = 0x29D2; break;
-          case 0x29D2: *ch = 0x29D1; break;
-          case 0x29D4: *ch = 0x29D5; break;
-          case 0x29D5: *ch = 0x29D4; break;
-          case 0x29D8: *ch = 0x29D9; break;
-          case 0x29D9: *ch = 0x29D8; break;
-          case 0x29DA: *ch = 0x29DB; break;
-          case 0x29DB: *ch = 0x29DA; break;
-          case 0x29F5: *ch = 0x2215; break;
-          case 0x29F8: *ch = 0x29F9; break;
-          case 0x29F9: *ch = 0x29F8; break;
-          case 0x29FC: *ch = 0x29FD; break;
-          case 0x29FD: *ch = 0x29FC; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x2A00) {
-        switch (*ch) {
-          case 0x2A2B: *ch = 0x2A2C; break;
-          case 0x2A2C: *ch = 0x2A2B; break;
-          case 0x2A2D: *ch = 0x2A2C; break;
-          case 0x2A2E: *ch = 0x2A2D; break;
-          case 0x2A34: *ch = 0x2A35; break;
-          case 0x2A35: *ch = 0x2A34; break;
-          case 0x2A3C: *ch = 0x2A3D; break;
-          case 0x2A3D: *ch = 0x2A3C; break;
-          case 0x2A64: *ch = 0x2A65; break;
-          case 0x2A65: *ch = 0x2A64; break;
-          case 0x2A79: *ch = 0x2A7A; break;
-          case 0x2A7A: *ch = 0x2A79; break;
-          case 0x2A7D: *ch = 0x2A7E; break;
-          case 0x2A7E: *ch = 0x2A7D; break;
-          case 0x2A7F: *ch = 0x2A80; break;
-          case 0x2A80: *ch = 0x2A7F; break;
-          case 0x2A81: *ch = 0x2A82; break;
-          case 0x2A82: *ch = 0x2A81; break;
-          case 0x2A83: *ch = 0x2A84; break;
-          case 0x2A84: *ch = 0x2A83; break;
-          case 0x2A8B: *ch = 0x2A8C; break;
-          case 0x2A8C: *ch = 0x2A8B; break;
-          case 0x2A91: *ch = 0x2A92; break;
-          case 0x2A92: *ch = 0x2A91; break;
-          case 0x2A93: *ch = 0x2A94; break;
-          case 0x2A94: *ch = 0x2A93; break;
-          case 0x2A95: *ch = 0x2A96; break;
-          case 0x2A96: *ch = 0x2A95; break;
-          case 0x2A97: *ch = 0x2A98; break;
-          case 0x2A98: *ch = 0x2A97; break;
-          case 0x2A99: *ch = 0x2A9A; break;
-          case 0x2A9A: *ch = 0x2A99; break;
-          case 0x2A9B: *ch = 0x2A9C; break;
-          case 0x2A9C: *ch = 0x2A9B; break;
-          case 0x2AA1: *ch = 0x2AA2; break;
-          case 0x2AA2: *ch = 0x2AA1; break;
-          case 0x2AA6: *ch = 0x2AA7; break;
-          case 0x2AA7: *ch = 0x2AA6; break;
-          case 0x2AA8: *ch = 0x2AA9; break;
-          case 0x2AA9: *ch = 0x2AA8; break;
-          case 0x2AAA: *ch = 0x2AAB; break;
-          case 0x2AAB: *ch = 0x2AAA; break;
-          case 0x2AAC: *ch = 0x2AAD; break;
-          case 0x2AAD: *ch = 0x2AAC; break;
-          case 0x2AAF: *ch = 0x2AB0; break;
-          case 0x2AB0: *ch = 0x2AAF; break;
-          case 0x2AB3: *ch = 0x2AB4; break;
-          case 0x2AB4: *ch = 0x2AB3; break;
-          case 0x2ABB: *ch = 0x2ABC; break;
-          case 0x2ABC: *ch = 0x2ABB; break;
-          case 0x2ABD: *ch = 0x2ABE; break;
-          case 0x2ABE: *ch = 0x2ABD; break;
-          case 0x2ABF: *ch = 0x2AC0; break;
-          case 0x2AC0: *ch = 0x2ABF; break;
-          case 0x2AC1: *ch = 0x2AC2; break;
-          case 0x2AC2: *ch = 0x2AC1; break;
-          case 0x2AC3: *ch = 0x2AC4; break;
-          case 0x2AC4: *ch = 0x2AC3; break;
-          case 0x2AC5: *ch = 0x2AC6; break;
-          case 0x2AC6: *ch = 0x2AC5; break;
-          case 0x2ACD: *ch = 0x2ACE; break;
-          case 0x2ACE: *ch = 0x2ACD; break;
-          case 0x2ACF: *ch = 0x2AD0; break;
-          case 0x2AD0: *ch = 0x2ACF; break;
-          case 0x2AD1: *ch = 0x2AD2; break;
-          case 0x2AD2: *ch = 0x2AD1; break;
-          case 0x2AD3: *ch = 0x2AD4; break;
-          case 0x2AD4: *ch = 0x2AD3; break;
-          case 0x2AD5: *ch = 0x2AD6; break;
-          case 0x2AD6: *ch = 0x2AD5; break;
-          case 0x2ADE: *ch = 0x22A6; break;
-          case 0x2AE3: *ch = 0x22A9; break;
-          case 0x2AE4: *ch = 0x22A8; break;
-          case 0x2AE5: *ch = 0x22AB; break;
-          case 0x2AEC: *ch = 0x2AED; break;
-          case 0x2AED: *ch = 0x2AEC; break;
-          case 0x2AF7: *ch = 0x2AF8; break;
-          case 0x2AF8: *ch = 0x2AF7; break;
-          case 0x2AF9: *ch = 0x2AFA; break;
-          case 0x2AFA: *ch = 0x2AF9; break;
-        }
-    } else if ((*ch & 0xFF00) == 0x3000) {
-        switch (*ch) {
-          case 0x3008: *ch = 0x3009; break;
-          case 0x3009: *ch = 0x3008; break;
-          case 0x300A: *ch = 0x300B; break;
-          case 0x300B: *ch = 0x300A; break;
-          case 0x300C: *ch = 0x300D; break;
-          case 0x300D: *ch = 0x300C; break;
-          case 0x300E: *ch = 0x300F; break;
-          case 0x300F: *ch = 0x300E; break;
-          case 0x3010: *ch = 0x3011; break;
-          case 0x3011: *ch = 0x3010; break;
-          case 0x3014: *ch = 0x3015; break;
-          case 0x3015: *ch = 0x3014; break;
-          case 0x3016: *ch = 0x3017; break;
-          case 0x3017: *ch = 0x3016; break;
-          case 0x3018: *ch = 0x3019; break;
-          case 0x3019: *ch = 0x3018; break;
-          case 0x301A: *ch = 0x301B; break;
-          case 0x301B: *ch = 0x301A; break;
-        }
-    } else if ((*ch & 0xFF00) == 0xFF00) {
-        switch (*ch) {
-          case 0xFF08: *ch = 0xFF09; break;
-          case 0xFF09: *ch = 0xFF08; break;
-          case 0xFF1C: *ch = 0xFF1E; break;
-          case 0xFF1E: *ch = 0xFF1C; break;
-          case 0xFF3B: *ch = 0xFF3D; break;
-          case 0xFF3D: *ch = 0xFF3B; break;
-          case 0xFF5B: *ch = 0xFF5D; break;
-          case 0xFF5D: *ch = 0xFF5B; break;
-          case 0xFF5F: *ch = 0xFF60; break;
-          case 0xFF60: *ch = 0xFF5F; break;
-          case 0xFF62: *ch = 0xFF63; break;
-          case 0xFF63: *ch = 0xFF62; break;
-        }
-    }
+    ensure_arrays(ctx, textlen);
+    ctx->text = text;
+    ctx->textlen = textlen;
+    setup_types(ctx);
+
+    /* Quick initial test: see if we need to bother with any work at all */
+    if (!text_needs_bidi(ctx))
+        return;
+
+    set_paragraph_level(ctx);
+    process_explicit_embeddings(ctx);
+    remove_embedding_characters(ctx);
+    find_isolating_run_sequences(ctx, process_isolating_run_sequence);
+
+    /* If this implementation distinguished paragraphs from lines,
+     * then this would be the point where we repeat the remainder of
+     * the algorithm once for each line in the paragraph.
+     */
+
+    reset_whitespace_and_separators(ctx);
+    mirror_glyphs(ctx);
+    reverse_sequences(ctx);
+}
+
+void do_bidi(BidiContext *ctx, bidi_char *text, size_t textlen)
+{
+#ifdef REMOVE_FORMATTING_CHARACTERS
+    abort(); /* can't use the standard algorithm in a live terminal */
+#else
+    assert(textlen >= 0);
+    do_bidi_new(ctx, text, textlen);
+#endif
 }
diff --git a/terminal/bidi.h b/terminal/bidi.h
index 53ffbcd3..dd488e1f 100644
--- a/terminal/bidi.h
+++ b/terminal/bidi.h
@@ -30,11 +30,15 @@ unsigned char bidi_getType(int ch);
     X(L) \
     X(LRE) \
     X(LRO) \
+    X(LRI) \
     X(R) \
     X(AL) \
     X(RLE) \
     X(RLO) \
+    X(RLI) \
     X(PDF) \
+    X(PDI) \
+    X(FSI) \
     X(EN) \
     X(ES) \
     X(ET) \
@@ -62,4 +66,69 @@ typedef enum { BIDI_CHAR_TYPE_LIST(ENUM_DECL) N_BIDI_TYPES } BidiType;
 typedef enum { SHAPING_CHAR_TYPE_LIST(ENUM_DECL) N_SHAPING_TYPES } ShapingType;
 #undef ENUM_DECL
+static inline bool typeIsStrong(BidiType t)
+{
+    return ((1<