mirror of https://git.tartarus.org/simon/putty.git synced 2025-01-10 09:58:01 +00:00
putty-source/terminal/bidi.c
Simon Tatham 4cb429e3f4 Update to Unicode 15.
Now I have a script I can easily re-run, there's no reason not to do
just that! This updates all of the new generated header files for the
UCD.zip that comes with Unicode 15.0.0.

I've re-run my bidi test suite against 15.0.0's file of test cases,
and confirmed they all pass.
2022-11-11 08:44:07 +00:00


/*
* Implementation of the Unicode bidirectional and Arabic shaping
* algorithms for PuTTY.
*
* Original version written and kindly contributed to this code base
* by Ahmad Khalifa of Arabeyes. The bidi part was almost completely
* rewritten in 2021 by Simon Tatham to bring it up to date, but the
* shaping part is still the one by the original authors.
*
* Implementation notes:
*
* Algorithm version
* -----------------
*
* This algorithm is up to date with Unicode Standard Annex #9
* revision 46:
*
* https://www.unicode.org/reports/tr9/tr9-46.html
*
* and passes the full conformance test suite in Unicode 15.0.0.
*
* Paragraph and line handling
* ---------------------------
*
* The full Unicode bidi algorithm expects to receive text containing
* multiple paragraphs, together with a decision about how those
* paragraphs are broken up into lines. It calculates embedding levels
* a whole paragraph at a time without considering the line breaks,
* but then the final reordering of the text for display is done to
* each _line_ independently based on the levels computed for the text
* in that line.
*
* This algorithm omits all of that, because it's intended for use as
* a display-time transformation of a text terminal, which doesn't
* preserve enough semantic information to decide what's a paragraph
* break and what is not. So a piece of input text provided to this
* algorithm is always expected to consist of exactly one paragraph
* *and* exactly one line.
*
* Embeddings, overrides and isolates
* ----------------------------------
*
* This implementation has full support for all the Unicode special
* control characters that modify bidi behaviour, such as
*
* U+202A LEFT-TO-RIGHT EMBEDDING
* U+202B RIGHT-TO-LEFT EMBEDDING
* U+202D LEFT-TO-RIGHT OVERRIDE
* U+202E RIGHT-TO-LEFT OVERRIDE
* U+202C POP DIRECTIONAL FORMATTING
* U+2068 FIRST STRONG ISOLATE
* U+2066 LEFT-TO-RIGHT ISOLATE
* U+2067 RIGHT-TO-LEFT ISOLATE
* U+2069 POP DIRECTIONAL ISOLATE
*
* However, at present, the terminal emulator that is a client of this
* code has no way to pass those in (because they're dropped during
* escape sequence processing and don't get stored in the terminal
* state). Nonetheless, the code is all here, so if the terminal
* emulator becomes able to record those characters at some later
* point, we'll be all set to take account of them during bidi.
*
* But the _main_ purpose of supporting the full bidi algorithm is
* simply that that's the easiest way to be sure it's correct, because
* if you support the whole thing, you can run the full conformance
* test suite. (And I don't 100% believe that restricting to the
* subset of _tests_ valid with a reduced character set will test the
* full set of _functionality_ relevant to the reduced set.)
*
* Retained formatting characters
* ------------------------------
*
* The standard bidi algorithm, in step X9, deletes assorted
* formatting characters from the text: all the embedding and override
* section initiator characters, the Pop Directional Formatting
* character that closes one of those sections again, and any
* character labelled as Boundary Neutral. So the characters it
* returns are not a _full_ reordering of the input; some input
* characters vanish completely.
*
* This would be fine, if it were not for the fact that - as far as I
* can see - _exactly one_ Unicode code point in the discarded
* category has a wcwidth() of more than 0, namely U+00AD SOFT HYPHEN
* which is a printing character for terminal purposes but has a bidi
* class of BN.
*
* Therefore, we must implement a modified version of the algorithm,
* as described in section 5.2 of TR9, which retains those formatting
* characters so that a client can find out where they ended up in the
* reordering.
*
* Section 5.2 describes a set of modifications to the algorithm that
* are _intended_ to achieve this without changing the rest of the
* behaviour: that is, if you take the output of the modified
* algorithm and delete all the characters that the standard algorithm
* would have removed, you should end up with the remaining characters
* in the same order that the standard algorithm would have delivered.
* However, section 5.2 admits the possibility of error, and says "in
* case of any deviation the explicit algorithm is the normative
* statement for conformance". And indeed, in one or two places I
* found I had to make my own tweaks to the section 5.2 description in
* order to get the whole test suite to pass, because I think the 5.2
* modifications if taken literally don't quite achieve that. My
* justification is that sentence of 5.2: in case of doubt, the right
* thing is to make the code behave the same as the official
* algorithm.
*
* It's possible that there might still be some undiscovered
* discrepancies between the behaviour of the standard and modified
* algorithms. So, just in case, I've kept in this code the ability to
* implement the _standard_ algorithm too! If you compile with
* -DREMOVE_FORMATTING_CHARS, this code should go back to implementing
* the literal UAX#9 bidi algorithm - so you can run your suspect
* input through both versions, making it much easier to figure out
* why they differ, and in which of the many stages of the algorithm
* the difference was introduced.
*
* However, beware that when compiling in this mode, the do_bidi
* interface to the terminal will stop working, and just abort() when
* called! The only useful thing you can do with this mode is to run
* the companion program bidi_test.c.
*/
#include <stdlib.h> /* definition of wchar_t */
#include "putty.h"
#include "misc.h"
#include "bidi.h"
typedef struct {
char type;
wchar_t form_b;
} shape_node;
/* Kept near the actual table, for verification. */
#define SHAPE_FIRST 0x621
#define SHAPE_LAST (SHAPE_FIRST + lenof(shapetypes) - 1)
static const shape_node shapetypes[] = {
/* index, Type, Isolated form */
/* 621 */ {SU, 0xFE80},
/* 622 */ {SR, 0xFE81},
/* 623 */ {SR, 0xFE83},
/* 624 */ {SR, 0xFE85},
/* 625 */ {SR, 0xFE87},
/* 626 */ {SD, 0xFE89},
/* 627 */ {SR, 0xFE8D},
/* 628 */ {SD, 0xFE8F},
/* 629 */ {SR, 0xFE93},
/* 62A */ {SD, 0xFE95},
/* 62B */ {SD, 0xFE99},
/* 62C */ {SD, 0xFE9D},
/* 62D */ {SD, 0xFEA1},
/* 62E */ {SD, 0xFEA5},
/* 62F */ {SR, 0xFEA9},
/* 630 */ {SR, 0xFEAB},
/* 631 */ {SR, 0xFEAD},
/* 632 */ {SR, 0xFEAF},
/* 633 */ {SD, 0xFEB1},
/* 634 */ {SD, 0xFEB5},
/* 635 */ {SD, 0xFEB9},
/* 636 */ {SD, 0xFEBD},
/* 637 */ {SD, 0xFEC1},
/* 638 */ {SD, 0xFEC5},
/* 639 */ {SD, 0xFEC9},
/* 63A */ {SD, 0xFECD},
/* 63B */ {SU, 0x0},
/* 63C */ {SU, 0x0},
/* 63D */ {SU, 0x0},
/* 63E */ {SU, 0x0},
/* 63F */ {SU, 0x0},
/* 640 */ {SC, 0x0},
/* 641 */ {SD, 0xFED1},
/* 642 */ {SD, 0xFED5},
/* 643 */ {SD, 0xFED9},
/* 644 */ {SD, 0xFEDD},
/* 645 */ {SD, 0xFEE1},
/* 646 */ {SD, 0xFEE5},
/* 647 */ {SD, 0xFEE9},
/* 648 */ {SR, 0xFEED},
/* 649 */ {SR, 0xFEEF}, /* SD */
/* 64A */ {SD, 0xFEF1},
/* 64B */ {SU, 0x0},
/* 64C */ {SU, 0x0},
/* 64D */ {SU, 0x0},
/* 64E */ {SU, 0x0},
/* 64F */ {SU, 0x0},
/* 650 */ {SU, 0x0},
/* 651 */ {SU, 0x0},
/* 652 */ {SU, 0x0},
/* 653 */ {SU, 0x0},
/* 654 */ {SU, 0x0},
/* 655 */ {SU, 0x0},
/* 656 */ {SU, 0x0},
/* 657 */ {SU, 0x0},
/* 658 */ {SU, 0x0},
/* 659 */ {SU, 0x0},
/* 65A */ {SU, 0x0},
/* 65B */ {SU, 0x0},
/* 65C */ {SU, 0x0},
/* 65D */ {SU, 0x0},
/* 65E */ {SU, 0x0},
/* 65F */ {SU, 0x0},
/* 660 */ {SU, 0x0},
/* 661 */ {SU, 0x0},
/* 662 */ {SU, 0x0},
/* 663 */ {SU, 0x0},
/* 664 */ {SU, 0x0},
/* 665 */ {SU, 0x0},
/* 666 */ {SU, 0x0},
/* 667 */ {SU, 0x0},
/* 668 */ {SU, 0x0},
/* 669 */ {SU, 0x0},
/* 66A */ {SU, 0x0},
/* 66B */ {SU, 0x0},
/* 66C */ {SU, 0x0},
/* 66D */ {SU, 0x0},
/* 66E */ {SU, 0x0},
/* 66F */ {SU, 0x0},
/* 670 */ {SU, 0x0},
/* 671 */ {SR, 0xFB50},
/* 672 */ {SU, 0x0},
/* 673 */ {SU, 0x0},
/* 674 */ {SU, 0x0},
/* 675 */ {SU, 0x0},
/* 676 */ {SU, 0x0},
/* 677 */ {SU, 0x0},
/* 678 */ {SU, 0x0},
/* 679 */ {SD, 0xFB66},
/* 67A */ {SD, 0xFB5E},
/* 67B */ {SD, 0xFB52},
/* 67C */ {SU, 0x0},
/* 67D */ {SU, 0x0},
/* 67E */ {SD, 0xFB56},
/* 67F */ {SD, 0xFB62},
/* 680 */ {SD, 0xFB5A},
/* 681 */ {SU, 0x0},
/* 682 */ {SU, 0x0},
/* 683 */ {SD, 0xFB76},
/* 684 */ {SD, 0xFB72},
/* 685 */ {SU, 0x0},
/* 686 */ {SD, 0xFB7A},
/* 687 */ {SD, 0xFB7E},
/* 688 */ {SR, 0xFB88},
/* 689 */ {SU, 0x0},
/* 68A */ {SU, 0x0},
/* 68B */ {SU, 0x0},
/* 68C */ {SR, 0xFB84},
/* 68D */ {SR, 0xFB82},
/* 68E */ {SR, 0xFB86},
/* 68F */ {SU, 0x0},
/* 690 */ {SU, 0x0},
/* 691 */ {SR, 0xFB8C},
/* 692 */ {SU, 0x0},
/* 693 */ {SU, 0x0},
/* 694 */ {SU, 0x0},
/* 695 */ {SU, 0x0},
/* 696 */ {SU, 0x0},
/* 697 */ {SU, 0x0},
/* 698 */ {SR, 0xFB8A},
/* 699 */ {SU, 0x0},
/* 69A */ {SU, 0x0},
/* 69B */ {SU, 0x0},
/* 69C */ {SU, 0x0},
/* 69D */ {SU, 0x0},
/* 69E */ {SU, 0x0},
/* 69F */ {SU, 0x0},
/* 6A0 */ {SU, 0x0},
/* 6A1 */ {SU, 0x0},
/* 6A2 */ {SU, 0x0},
/* 6A3 */ {SU, 0x0},
/* 6A4 */ {SD, 0xFB6A},
/* 6A5 */ {SU, 0x0},
/* 6A6 */ {SD, 0xFB6E},
/* 6A7 */ {SU, 0x0},
/* 6A8 */ {SU, 0x0},
/* 6A9 */ {SD, 0xFB8E},
/* 6AA */ {SU, 0x0},
/* 6AB */ {SU, 0x0},
/* 6AC */ {SU, 0x0},
/* 6AD */ {SD, 0xFBD3},
/* 6AE */ {SU, 0x0},
/* 6AF */ {SD, 0xFB92},
/* 6B0 */ {SU, 0x0},
/* 6B1 */ {SD, 0xFB9A},
/* 6B2 */ {SU, 0x0},
/* 6B3 */ {SD, 0xFB96},
/* 6B4 */ {SU, 0x0},
/* 6B5 */ {SU, 0x0},
/* 6B6 */ {SU, 0x0},
/* 6B7 */ {SU, 0x0},
/* 6B8 */ {SU, 0x0},
/* 6B9 */ {SU, 0x0},
/* 6BA */ {SR, 0xFB9E},
/* 6BB */ {SD, 0xFBA0},
/* 6BC */ {SU, 0x0},
/* 6BD */ {SU, 0x0},
/* 6BE */ {SD, 0xFBAA},
/* 6BF */ {SU, 0x0},
/* 6C0 */ {SR, 0xFBA4},
/* 6C1 */ {SD, 0xFBA6},
/* 6C2 */ {SU, 0x0},
/* 6C3 */ {SU, 0x0},
/* 6C4 */ {SU, 0x0},
/* 6C5 */ {SR, 0xFBE0},
/* 6C6 */ {SR, 0xFBD9},
/* 6C7 */ {SR, 0xFBD7},
/* 6C8 */ {SR, 0xFBDB},
/* 6C9 */ {SR, 0xFBE2},
/* 6CA */ {SU, 0x0},
/* 6CB */ {SR, 0xFBDE},
/* 6CC */ {SD, 0xFBFC},
/* 6CD */ {SU, 0x0},
/* 6CE */ {SU, 0x0},
/* 6CF */ {SU, 0x0},
/* 6D0 */ {SU, 0x0},
/* 6D1 */ {SU, 0x0},
/* 6D2 */ {SR, 0xFBAE},
};
/*
* Returns the bidi character type of ch.
*/
unsigned char bidi_getType(int ch)
{
static const struct {
int first, last, type;
} lookup[] = {
#include "unicode/bidi_type.h"
};
int i, j, k;
i = -1;
j = lenof(lookup);
while (j - i > 1) {
k = (i + j) / 2;
if (ch < lookup[k].first)
j = k;
else if (ch > lookup[k].last)
i = k;
else
return lookup[k].type;
}
/*
* If we reach here, the character was not in any of the
* intervals listed in the lookup table. This means we return
* ON (`Other Neutrals'). This is the appropriate code for any
* character genuinely not listed in the Unicode table, and
* also the table above has deliberately left out any
* characters _explicitly_ listed as ON (to save space!).
*/
return ON;
}
/*
* Return the mirrored version of a glyph.
*
* FIXME: there are also glyphs which the text rendering engine is
* supposed to display left-right reflected, since no mirrored glyph
* exists in Unicode itself to indicate the reflected form. Those are
* listed in comments in BidiMirroring.txt. Many of them are
* mathematical, e.g. the square root sign, or set difference
* operator, or integral sign. No API currently exists here to
* communicate the need for that reflected display back to the client.
*/
static unsigned mirror_glyph(unsigned int ch)
{
static const struct {
unsigned src, dst;
} mirror_pairs[] = {
#include "unicode/bidi_mirror.h"
};
int i, j, k;
i = -1;
j = lenof(mirror_pairs);
while (j - i > 1) {
k = (i + j) / 2;
if (ch < mirror_pairs[k].src)
j = k;
else if (ch > mirror_pairs[k].src)
i = k;
else
return mirror_pairs[k].dst;
}
return ch;
}
/*
* Identify the bracket characters treated specially by bidi rule N0
* (the paired brackets of definitions BD14-BD16), and return their
* paired character(s).
*/
typedef enum { BT_NONE, BT_OPEN, BT_CLOSE } BracketType;
typedef struct BracketTypeData {
unsigned partner, equiv_partner;
BracketType type;
} BracketTypeData;
static BracketTypeData bracket_type(unsigned int ch)
{
static const struct {
unsigned src;
BracketTypeData payload;
} bracket_pairs[] = {
#include "unicode/bidi_brackets.h"
};
int i, j, k;
i = -1;
j = lenof(bracket_pairs);
while (j - i > 1) {
k = (i + j) / 2;
if (ch < bracket_pairs[k].src) {
j = k;
} else if (ch > bracket_pairs[k].src) {
i = k;
} else {
return bracket_pairs[k].payload;
}
}
static const BracketTypeData null = { 0, 0, BT_NONE };
return null;
}
/*
* Function exported to front ends to allow them to identify
* bidi-active characters (in case, for example, the platform's
* text display function can't conveniently be prevented from doing
* its own bidi and so special treatment is required for characters
* that would cause the bidi algorithm to activate).
*
* This function is passed a single Unicode code point, and returns
* true if the presence of this code point can possibly cause the
* bidi algorithm to do any reordering. Thus, any string composed
* entirely of characters for which is_rtl() returns false should be
* safe to pass to a bidi-active platform display function without
* fear.
*
* (is_rtl() must therefore also return true for any character
* which would be affected by Arabic shaping, but this isn't
* important because all such characters are right-to-left so it
* would have flagged them anyway.)
*/
bool is_rtl(int c)
{
return typeIsBidiActive(bidi_getType(c));
}
/* The main shaping function, and the only one to be used
* by the outside world.
*
* line: buffer to apply shaping to. This must have been passed
* through do_bidi() first.
* to: output buffer for the shaped data
* count: number of characters in line
*/
int do_shape(bidi_char *line, bidi_char *to, int count)
{
int i, tempShape;
bool ligFlag = false;
for (i=0; i<count; i++) {
to[i] = line[i];
tempShape = STYPE(line[i].wc);
switch (tempShape) {
case SC:
break;
case SU:
break;
case SR:
tempShape = (i+1 < count ? STYPE(line[i+1].wc) : SU);
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC))
to[i].wc = SFINAL((SISOLATED(line[i].wc)));
else
to[i].wc = SISOLATED(line[i].wc);
break;
case SD:
/* Make Ligatures */
tempShape = (i+1 < count ? STYPE(line[i+1].wc) : SU);
if (line[i].wc == 0x644) {
if (i > 0) switch (line[i-1].wc) {
case 0x622:
ligFlag = true;
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC))
to[i].wc = 0xFEF6;
else
to[i].wc = 0xFEF5;
break;
case 0x623:
ligFlag = true;
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC))
to[i].wc = 0xFEF8;
else
to[i].wc = 0xFEF7;
break;
case 0x625:
ligFlag = true;
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC))
to[i].wc = 0xFEFA;
else
to[i].wc = 0xFEF9;
break;
case 0x627:
ligFlag = true;
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC))
to[i].wc = 0xFEFC;
else
to[i].wc = 0xFEFB;
break;
}
if (ligFlag) {
to[i-1].wc = 0x20;
ligFlag = false;
break;
}
}
if ((tempShape == SL) || (tempShape == SD) || (tempShape == SC)) {
tempShape = (i > 0 ? STYPE(line[i-1].wc) : SU);
if ((tempShape == SR) || (tempShape == SD) || (tempShape == SC))
to[i].wc = SMEDIAL((SISOLATED(line[i].wc)));
else
to[i].wc = SFINAL((SISOLATED(line[i].wc)));
break;
}
tempShape = (i > 0 ? STYPE(line[i-1].wc) : SU);
if ((tempShape == SR) || (tempShape == SD) || (tempShape == SC))
to[i].wc = SINITIAL((SISOLATED(line[i].wc)));
else
to[i].wc = SISOLATED(line[i].wc);
break;
}
}
return 1;
}
typedef enum { DO_NEUTRAL, DO_LTR, DO_RTL } DirectionalOverride;
typedef struct DSStackEntry {
/*
* An entry in the directional status stack (rule section X).
*/
unsigned char level;
bool isolate;
DirectionalOverride override;
} DSStackEntry;
typedef struct BracketStackEntry {
/*
* An entry in the bracket-pair-tracking stack (rule BD16).
*/
unsigned ch;
size_t c;
} BracketStackEntry;
typedef struct IsolatingRunSequence {
size_t start, end;
BidiType sos, eos, embeddingDirection;
} IsolatingRunSequence;
#define MAX_DEPTH 125 /* specified in the standard */
struct BidiContext {
/*
* Storage space preserved between runs, all allocated to the same
* length (internal_array_sizes).
*/
size_t internal_array_sizes;
BidiType *types, *origTypes;
unsigned char *levels;
size_t *irsindices, *bracketpos;
bool *irsdone;
/*
* Separately allocated with its own size field
*/
IsolatingRunSequence *irslist;
size_t irslistsize;
/*
* Rewritten to point to the input to the currently active run of
* the bidi algorithm
*/
bidi_char *text;
size_t textlen;
/*
* State within a run of the algorithm
*/
BidiType paragraphOverride;
DSStackEntry dsstack[MAX_DEPTH + 2];
size_t ds_sp;
size_t overflowIsolateCount, overflowEmbeddingCount, validIsolateCount;
unsigned char paragraphLevel;
size_t *irs;
size_t irslen;
BidiType sos, eos, embeddingDirection;
BracketStackEntry bstack[63]; /* constant size specified in rule BD16 */
};
BidiContext *bidi_new_context(void)
{
BidiContext *ctx = snew(BidiContext);
memset(ctx, 0, sizeof(BidiContext));
return ctx;
}
void bidi_free_context(BidiContext *ctx)
{
sfree(ctx->types);
sfree(ctx->origTypes);
sfree(ctx->levels);
sfree(ctx->irsindices);
sfree(ctx->irsdone);
sfree(ctx->bracketpos);
sfree(ctx->irslist);
sfree(ctx);
}
static void ensure_arrays(BidiContext *ctx, size_t textlen)
{
if (textlen <= ctx->internal_array_sizes)
return;
ctx->internal_array_sizes = textlen;
ctx->types = sresize(ctx->types, ctx->internal_array_sizes, BidiType);
ctx->origTypes = sresize(ctx->origTypes, ctx->internal_array_sizes,
BidiType);
ctx->levels = sresize(ctx->levels, ctx->internal_array_sizes,
unsigned char);
ctx->irsindices = sresize(ctx->irsindices, ctx->internal_array_sizes,
size_t);
ctx->irsdone = sresize(ctx->irsdone, ctx->internal_array_sizes, bool);
ctx->bracketpos = sresize(ctx->bracketpos, ctx->internal_array_sizes,
size_t);
}
static void setup_types(BidiContext *ctx)
{
for (size_t i = 0; i < ctx->textlen; i++)
ctx->types[i] = ctx->origTypes[i] = bidi_getType(ctx->text[i].wc);
}
static bool text_needs_bidi(BidiContext *ctx)
{
/*
* Initial optimisation: check for any bidi-active character at
* all in an input line. If there aren't any, we can skip the
* whole algorithm.
*
* Also include the paragraph override in this check!
*/
for (size_t i = 0; i < ctx->textlen; i++)
if (typeIsBidiActive(ctx->types[i]))
return true;
return typeIsBidiActive(ctx->paragraphOverride);
}
static size_t find_matching_pdi(const BidiType *types, size_t i, size_t size)
{
/* Assuming that types[i] is an isolate initiator, find its
* matching PDI by rule BD9. */
unsigned counter = 1;
i++;
for (; i < size; i++) {
BidiType t = types[i];
if (typeIsIsolateInitiator(t)) {
counter++;
} else if (t == PDI) {
counter--;
if (counter == 0)
return i;
}
}
/* If no PDI was found, return the length of the array. */
return size;
}
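/*
* Worked example for find_matching_pdi() above (a sketch: the type
* sequence is made up for illustration). Given
*
*   index: 0   1 2   3 4   5
*   type:  LRI L RLI R PDI PDI
*
* a call with i=0 starts its counter at 1, raises it to 2 at the
* nested RLI, drops it to 1 at index 4 and to 0 at index 5, so it
* returns 5. A call with i=2 returns 4. If the text were cut short
* after index 3, a call with i=0 would find no PDI and return
* size (4), so the caller sees the isolate as extending to the end
* of the text.
*/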
static unsigned char rule_p2_p3(const BidiType *types, size_t size)
{
/*
* Rule P2. Find the first strong type (L, R or AL), ignoring
* anything inside an isolated segment.
*
* Rule P3. If that type is R or AL, choose a paragraph embedding
* level of 1, otherwise 0.
*/
for (size_t i = 0; i < size; i++) {
BidiType t = types[i];
if (typeIsIsolateInitiator(t))
i = find_matching_pdi(types, i, size);
else if (typeIsStrong(t))
return (t == L ? 0 : 1);
}
return 0; /* default if no strong type found */
}
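/*
* Worked example for rule_p2_p3() above (illustrative sequences,
* assuming typeIsStrong() accepts exactly L, R and AL, as the
* comment inside the function says):
*
*   ON AL L        -> 1  (first strong type is AL)
*   EN L R         -> 0  (first strong type is L)
*   LRI R PDI L    -> 0  (the R is inside an isolate, so skipped)
*   ON EN          -> 0  (no strong type at all)
*/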
static void set_paragraph_level(BidiContext *ctx)
{
if (ctx->paragraphOverride == L)
ctx->paragraphLevel = 0;
else if (ctx->paragraphOverride == R)
ctx->paragraphLevel = 1;
else
ctx->paragraphLevel = rule_p2_p3(ctx->types, ctx->textlen);
}
static inline unsigned char nextOddLevel(unsigned char x) { return (x+1)|1; }
static inline unsigned char nextEvenLevel(unsigned char x) { return (x|1)+1; }
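/*
* Worked values for the two helpers above:
*
*   x                 0  1  2  3
*   nextOddLevel(x)   1  3  3  5
*   nextEvenLevel(x)  2  2  4  4
*
* That is, each returns the smallest level of the required parity
* that is strictly greater than x, which is what rules X2-X5c ask
* for.
*/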
static inline void push(BidiContext *ctx, unsigned char level,
DirectionalOverride override, bool isolate)
{
ctx->ds_sp++;
assert(ctx->ds_sp < lenof(ctx->dsstack));
ctx->dsstack[ctx->ds_sp].level = level;
ctx->dsstack[ctx->ds_sp].override = override;
ctx->dsstack[ctx->ds_sp].isolate = isolate;
}
static inline void pop(BidiContext *ctx)
{
assert(ctx->ds_sp > 0);
ctx->ds_sp--;
}
static void process_explicit_embeddings(BidiContext *ctx)
{
/*
* Rule X1 initialisation.
*/
ctx->ds_sp = (size_t)-1;
push(ctx, ctx->paragraphLevel, DO_NEUTRAL, false);
ctx->overflowIsolateCount = 0;
ctx->overflowEmbeddingCount = 0;
ctx->validIsolateCount = 0;
#define stk (&ctx->dsstack[ctx->ds_sp])
for (size_t i = 0; i < ctx->textlen; i++) {
BidiType t = ctx->types[i];
switch (t) {
case RLE: case LRE: case RLO: case LRO: {
/* Rules X2-X5 */
unsigned char newLevel;
DirectionalOverride override;
#ifndef REMOVE_FORMATTING_CHARS
ctx->levels[i] = stk->level;
#endif
switch (t) {
case RLE: /* rule X2 */
newLevel = nextOddLevel(stk->level);
override = DO_NEUTRAL;
break;
case LRE: /* rule X3 */
newLevel = nextEvenLevel(stk->level);
override = DO_NEUTRAL;
break;
case RLO: /* rule X4 */
newLevel = nextOddLevel(stk->level);
override = DO_RTL;
break;
case LRO: /* rule X5 */
newLevel = nextEvenLevel(stk->level);
override = DO_LTR;
break;
default:
unreachable("how did this get past the outer switch?");
}
if (newLevel <= MAX_DEPTH &&
ctx->overflowIsolateCount == 0 &&
ctx->overflowEmbeddingCount == 0) {
/* Embedding code is valid. Push a stack entry. */
push(ctx, newLevel, override, false);
} else {
/* Embedding code is an overflow one. */
if (ctx->overflowIsolateCount == 0)
ctx->overflowEmbeddingCount++;
}
break;
}
case RLI: case LRI: case FSI: {
/* Rules X5a, X5b, X5c */
if (t == FSI) {
/* Rule X5c: decide whether this should be treated
* like RLI or LRI */
size_t pdi = find_matching_pdi(ctx->types, i, ctx->textlen);
unsigned char level = rule_p2_p3(ctx->types + (i + 1),
pdi - (i + 1));
t = (level == 1 ? RLI : LRI);
}
ctx->levels[i] = stk->level;
if (stk->override != DO_NEUTRAL)
ctx->types[i] = (stk->override == DO_LTR ? L :
stk->override == DO_RTL ? R : t);
unsigned char newLevel = (t == RLI ? nextOddLevel(stk->level) :
nextEvenLevel(stk->level));
if (newLevel <= MAX_DEPTH &&
ctx->overflowIsolateCount == 0 &&
ctx->overflowEmbeddingCount == 0) {
/* Isolate code is valid. Push a stack entry. */
push(ctx, newLevel, DO_NEUTRAL, true);
ctx->validIsolateCount++;
} else {
/* Isolate code is an overflow one. */
ctx->overflowIsolateCount++;
}
break;
}
case PDI: {
/* Rule X6a */
if (ctx->overflowIsolateCount > 0) {
ctx->overflowIsolateCount--;
} else if (ctx->validIsolateCount == 0) {
/* Do nothing: spurious isolate-pop */
} else {
/* Valid isolate-pop. We expect that the stack must
* therefore contain at least one isolate==true entry,
* so pop everything up to and including it. */
ctx->overflowEmbeddingCount = 0;
while (!stk->isolate)
pop(ctx);
pop(ctx);
ctx->validIsolateCount--;
}
ctx->levels[i] = stk->level;
if (stk->override != DO_NEUTRAL)
ctx->types[i] = (stk->override == DO_LTR ? L : R);
break;
}
case PDF: {
/* Rule X7 */
if (ctx->overflowIsolateCount > 0) {
/* Do nothing if we've overflowed on isolates */
} else if (ctx->overflowEmbeddingCount > 0) {
ctx->overflowEmbeddingCount--;
} else if (ctx->ds_sp > 0 && !stk->isolate) {
pop(ctx);
} else {
/* Do nothing: spurious embedding-pop */
}
#ifndef REMOVE_FORMATTING_CHARS
ctx->levels[i] = stk->level;
#endif
break;
}
case B: {
/* Rule X8: if an explicit paragraph separator appears in
* this text at all then it does not participate in any of
* the above, and just gets assigned the paragraph level.
*
* PS, it had better be right at the end of the text,
* because we have not implemented rule P1 in this code. */
assert(i == ctx->textlen - 1);
ctx->levels[i] = ctx->paragraphLevel;
break;
}
case BN: {
/*
* The section 5.2 adjustment to rule X6 says that we
* apply it to BN just like any other class. But I think
* this can't possibly give the same results as the
* unmodified algorithm.
*
* Proof: adding RLO BN or LRO BN at the end of a
* paragraph should not change the output of the standard
* algorithm, because the override doesn't affect the BN
* in rule X6, and then rule X9 removes both. But with the
* modified rule X6, the BN is changed into R or L, and
* then rule X9 doesn't remove it, and then you've added a
* strong type that will set eos for the level run just
* before the override. And whatever the standard
* algorithm set eos to, _one_ of these override sequences
* will disagree with it.
*
* So I think we just set the BN's level, and don't change
* its type.
*/
ctx->levels[i] = stk->level;
break;
}
default: {
/* Rule X6. */
ctx->levels[i] = stk->level;
if (stk->override != DO_NEUTRAL)
ctx->types[i] = (stk->override == DO_LTR ? L : R);
break;
}
}
}
#undef stk
}
static void remove_embedding_characters(BidiContext *ctx)
{
#ifndef REMOVE_FORMATTING_CHARS
/*
* Rule X9, as modified by section 5.2: turn embedding (but not
* isolate) characters into BN.
*/
for (size_t i = 0; i < ctx->textlen; i++) {
BidiType t = ctx->types[i];
if (typeIsRemovedDuringProcessing(t)) {
ctx->types[i] = BN;
/*
* My own adjustment to the section 5.2 mods: a sequence
* of contiguous BN generated by this setup should never
* be at different levels from each other.
*
* An example where this goes wrong is if you open two
* LREs in sequence, then close them again:
*
* ... LRE LRE PDF PDF ...
*
* The initial level assignment gives level 0 to the outer
* LRE/PDF pair, and level 2 to the inner one. The
* standard algorithm would remove all four, so this
* doesn't matter, and you end up with no break in the
* surrounding level run. But if you just rewrite the
* types of all those characters to BN and leave the
* levels in that state, then the modified algorithm will
* leave the middle two BN at level 2, dividing what
* should have been a long level run at level 0 into two
* separate ones.
*/
if (i > 0 && ctx->types[i-1] == BN)
ctx->levels[i] = ctx->levels[i-1];
}
}
#else
/*
* Rule X9, original version: completely remove embedding
* start/end characters and also boundary neutrals.
*/
size_t outpos = 0;
for (size_t i = 0; i < ctx->textlen; i++) {
BidiType t = ctx->types[i];
if (!typeIsRemovedDuringProcessing(t)) {
ctx->text[outpos] = ctx->text[i];
ctx->levels[outpos] = ctx->levels[i];
ctx->types[outpos] = ctx->types[i];
ctx->origTypes[outpos] = ctx->origTypes[i];
outpos++;
}
}
ctx->textlen = outpos;
#endif
}
typedef void (*irs_fn_t)(BidiContext *ctx);
static void find_isolating_run_sequences(BidiContext *ctx, irs_fn_t process)
{
/*
* Rule X10 / BD13. Now that we've assigned an embedding level to
* each character in the text, we have to divide the text into
* subsequences on which to do the next stage of processing.
*
* In earlier issues of the bidi algorithm, these subsequences
* were contiguous in the original text, and each one was a 'level
* run': a maximal contiguous subsequence of characters all at the
* same embedding level.
*
* But now we have isolates, and the point of an (isolate
* initiator ... PDI) sequence is that the whole sequence should
* be treated like a single BN for the purposes of formatting
* everything outside it. As a result, we now have to recombine
* our level runs into longer sequences, on the principle that if
* a level run ends with an isolate initiator, then we bring it
* together with whatever later level run starts with the matching
* PDI.
*
* These subsequences are no longer contiguous (the whole point is
* that between the isolate initiator and the PDI is some other
* text that we've skipped over). They're called 'isolating run
* sequences'.
*/
memset(ctx->irsdone, 0, ctx->textlen * sizeof(*ctx->irsdone));
size_t i = 0;
size_t n_irs = 0;
size_t indexpos = 0;
while (i < ctx->textlen) {
if (ctx->irsdone[i]) {
i++;
continue;
}
/*
* Found a character not already processed. Start a new
* sequence here.
*/
sgrowarray(ctx->irslist, ctx->irslistsize, n_irs);
IsolatingRunSequence *irs = &ctx->irslist[n_irs++];
irs->start = indexpos;
size_t j = i;
size_t irslevel = ctx->levels[i];
while (j < ctx->textlen) {
/*
* We expect that all level runs in this sequence will be
* at the same level as each other, by construction of how
* we set up the levels from the isolates in the first
* place.
*/
assert(ctx->levels[j] == irslevel);
do {
ctx->irsdone[j] = true;
ctx->irsindices[indexpos++] = j++;
} while (j < ctx->textlen && ctx->levels[j] == irslevel);
if (!typeIsIsolateInitiator(ctx->types[j-1]))
break; /* this IRS is ended */
j = find_matching_pdi(ctx->types, j-1, ctx->textlen);
}
irs->end = indexpos;
/*
* Determine the start-of-sequence and end-of-sequence types
* for this sequence.
*
* These depend on the embedding levels of surrounding text.
* But processing each run can change those levels. That's why
* we have to use a two-pass strategy here, first identifying
* all the isolating run sequences using the input level data,
* and not processing any of them until we know where they all
* are.
*/
size_t p;
unsigned char level_inside, level_outside, level_max;
p = i;
level_inside = ctx->levels[p];
level_outside = ctx->paragraphLevel;
while (p > 0) {
p--;
if (ctx->types[p] != BN) {
level_outside = ctx->levels[p];
break;
}
}
level_max = max(level_inside, level_outside);
irs->sos = (level_max % 2 ? R : L);
p = ctx->irsindices[irs->end - 1];
level_inside = ctx->levels[p];
level_outside = ctx->paragraphLevel;
if (typeIsIsolateInitiator(ctx->types[p])) {
/* Special case: if an isolating run sequence ends in an
* unmatched isolate initiator, then level_outside is
* taken to be the paragraph embedding level and the
* loop below is skipped. */
} else {
while (p+1 < ctx->textlen) {
p++;
if (ctx->types[p] != BN) {
level_outside = ctx->levels[p];
break;
}
}
}
level_max = max(level_inside, level_outside);
irs->eos = (level_max % 2 ? R : L);
irs->embeddingDirection = (irslevel % 2 ? R : L);
/*
* Now we've listed in ctx->irsindices[] the index of every
* character that's part of this isolating run sequence, and
* recorded an entry in irslist containing the interval of
* indices relevant to this IRS, plus its assorted metadata.
* We've also marked those locations in the input text as done
* in ctx->irsdone, so that we'll skip over them when the
* outer iteration reaches them later.
*/
}
for (size_t k = 0; k < n_irs; k++) {
IsolatingRunSequence *irs = &ctx->irslist[k];
ctx->irs = ctx->irsindices + irs->start;
ctx->irslen = irs->end - irs->start;
ctx->sos = irs->sos;
ctx->eos = irs->eos;
ctx->embeddingDirection = irs->embeddingDirection;
process(ctx);
}
/* Reset irslen to 0 when we've finished. This means any other
* function that absentmindedly tries to use irslen will end up
* doing nothing at all, which should be easier to detect and
* debug than if it ran on subtly the wrong subset of the text. */
ctx->irslen = 0;
}
static void remove_nsm(BidiContext *ctx)
{
/* Rule W1: NSM gains the type of the previous character, or sos
* at the start of the run, with the exception that isolation
* boundaries turn into ON. */
BidiType prevType = ctx->sos;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (t == NSM) {
ctx->types[i] = prevType;
} else if (typeIsIsolateInitiatorOrPDI(t)) {
prevType = ON;
#ifndef REMOVE_FORMATTING_CHARS
} else if (t == BN) {
/* section 5.2 adjustment: these don't affect prevType */
#endif
} else {
prevType = t;
}
}
}
static void change_en_to_an(BidiContext *ctx)
{
/* Rule W2: EN becomes AN if the previous strong type is AL. (The
* spec says that the 'previous strong type' is counted as sos at
* the start of the run, although it hardly matters, since sos
* can't be AL.) */
BidiType prevStrongType = ctx->sos;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (t == EN && prevStrongType == AL) {
ctx->types[i] = AN;
} else if (typeIsStrong(t)) {
prevStrongType = t;
}
}
}
static void change_al_to_r(BidiContext *ctx)
{
/* Rule W3: AL becomes R unconditionally. (The only difference
* between the two types was their effect on nearby numbers, which
* was dealt with in rule W2, so now we're done with the
* distinction.) */
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
if (ctx->types[i] == AL)
ctx->types[i] = R;
}
}
static void eliminate_separators_between_numbers(BidiContext *ctx)
{
/* Rule W4: a single numeric separator between two numbers of the
* same type compatible with that separator takes the type of the
* number. ES is a separator type compatible only with EN; CS is a
* separator type compatible with either EN or AN.
*
* Section 5.2 adjustment: intervening BNs do not break this, so
* instead of simply looking at types[irs[c-1]] and types[irs[c+1]],
* we must track the last three indices we saw that were not BN. */
size_t i1 = 0, i2 = 0;
BidiType t0 = ON, t1 = ON, t2 = ON;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
#ifndef REMOVE_FORMATTING_CHARS
if (t == BN)
continue;
#endif
i1 = i2; i2 = i;
t0 = t1; t1 = t2; t2 = t;
if (t0 == t2 && ((t1 == ES && t0 == EN) ||
(t1 == CS && (t0 == EN || t0 == AN)))) {
ctx->types[i1] = t0;
}
}
}
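/*
* Illustrative sketch, not part of PuTTY: the three-element sliding
* window used for rule W4, on a plain array. DemoType, demo_w4 and
* demo_w4_check are hypothetical names for this example only. BN is
* always skipped here, as in the non-REMOVE_FORMATTING_CHARS build.
*
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical cut-down type set, for this sketch only.
typedef enum { D_EN, D_AN, D_ES, D_CS, D_BN, D_ON } DemoType;

static void demo_w4(DemoType *types, size_t n)
{
    size_t i1 = 0, i2 = 0;      // positions of the last two non-BN entries
    DemoType t0 = D_ON, t1 = D_ON, t2 = D_ON;
    for (size_t i = 0; i < n; i++) {
        if (types[i] == D_BN)
            continue;           // BN is transparent to the window
        i1 = i2; i2 = i;
        t0 = t1; t1 = t2; t2 = types[i];
        // t1 is the candidate separator, t0 and t2 its neighbours
        if (t0 == t2 && ((t1 == D_ES && t0 == D_EN) ||
                         (t1 == D_CS && (t0 == D_EN || t0 == D_AN))))
            types[i1] = t0;
    }
}

static void demo_w4_check(void)
{
    DemoType a[] = { D_EN, D_BN, D_CS, D_EN };
    demo_w4(a, 4);
    assert(a[2] == D_EN);   // CS between two ENs, despite the BN
    assert(a[1] == D_BN);   // the BN itself is untouched

    DemoType b[] = { D_EN, D_CS, D_AN };
    demo_w4(b, 3);
    assert(b[1] == D_CS);   // mismatched number types: no change
}
```
*
* Keeping the index i1 alongside the types is what allows the
* separator to be rewritten after its right-hand neighbour is seen.
*/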
static void eliminate_et_next_to_en(BidiContext *ctx)
{
/* Rule W5: a sequence of ET adjacent to an EN takes the type EN.
* This is easiest to implement with one loop in each direction.
*
* Section 5.2 adjustment: include BN with ET. (We don't need to
* #ifdef that out, because in the standard algorithm, we won't
* have any BN left in any case.) */
bool modifying = false;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (t == EN) {
modifying = true;
} else if (modifying && typeIsETOrBN(t)) {
ctx->types[i] = EN;
} else {
modifying = false;
}
}
for (size_t c = ctx->irslen; c-- > 0 ;) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (t == EN) {
modifying = true;
} else if (modifying && typeIsETOrBN(t)) {
ctx->types[i] = EN;
} else {
modifying = false;
}
}
}
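/*
* Illustrative sketch, not part of PuTTY: the "one loop in each
* direction" idea from rule W5, on a plain array. DemoType,
* demo_w5_step, demo_w5 and demo_w5_check are hypothetical names, and
* the BN handling from section 5.2 is omitted. The forward sweep
* catches ETs that follow an EN; the backward sweep catches ETs that
* precede one.
*
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical cut-down type set, for this sketch only.
typedef enum { D_EN, D_ET, D_L, D_ON } DemoType;

// One step of either sweep; returns the new "modifying" state.
static bool demo_w5_step(DemoType *t, bool modifying)
{
    if (*t == D_EN)
        return true;
    if (modifying && *t == D_ET) {
        *t = D_EN;
        return true;
    }
    return false;
}

static void demo_w5(DemoType *types, size_t n)
{
    bool modifying = false;
    for (size_t i = 0; i < n; i++)          // ET following an EN
        modifying = demo_w5_step(&types[i], modifying);
    modifying = false;
    for (size_t i = n; i-- > 0 ;)           // ET preceding an EN
        modifying = demo_w5_step(&types[i], modifying);
}

static void demo_w5_check(void)
{
    DemoType t[] = { D_ET, D_ET, D_EN, D_ET, D_L, D_ET };
    demo_w5(t, 6);
    assert(t[0] == D_EN && t[1] == D_EN);   // found by the backward sweep
    assert(t[3] == D_EN);                   // found by the forward sweep
    assert(t[5] == D_ET);                   // cut off from any EN by the L
}
```
*
* The explicit reset between the sweeps is a choice made for this
* sketch, to keep the two sweeps visibly independent.
*/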
static void eliminate_separators_and_terminators(BidiContext *ctx)
{
/* Rule W6: all separators and terminators change to ON.
*
* (The spec is not quite clear on which bidi types are included
* in this; one assumes ES, ET and CS, but what about S? I _think_
* the answer is that this is a rule in the W section, so it's
* implicitly supposed to only apply to types designated as weakly
* directional, so not S.) */
#ifndef REMOVE_FORMATTING_CHARS
/*
* Section 5.2 adjustment: this also applies to any BN adjacent on
* either side to one of these types, which is easiest to
* implement with a separate double-loop converting those to an
* arbitrary one of the affected types, say CS.
*
* This double loop can be completely skipped in the standard
* algorithm.
*/
bool modifying = false;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (typeIsWeakSeparatorOrTerminator(t)) {
modifying = true;
} else if (modifying && t == BN) {
ctx->types[i] = CS;
} else {
modifying = false;
}
}
for (size_t c = ctx->irslen; c-- > 0 ;) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (typeIsWeakSeparatorOrTerminator(t)) {
modifying = true;
} else if (modifying && t == BN) {
ctx->types[i] = CS;
} else {
modifying = false;
}
}
#endif
/* Now the main part of rule W6 */
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (typeIsWeakSeparatorOrTerminator(t))
ctx->types[i] = ON;
}
}
static void change_en_to_l(BidiContext *ctx)
{
/* Rule W7: EN becomes L if the previous strong type (or sos) is L. */
BidiType prevStrongType = ctx->sos;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (t == EN && prevStrongType == L) {
ctx->types[i] = L;
} else if (typeIsStrong(t)) {
prevStrongType = t;
}
}
}
typedef void (*bracket_pair_fn)(BidiContext *ctx, size_t copen, size_t cclose);
static void find_bracket_pairs(BidiContext *ctx, bracket_pair_fn process)
{
const size_t NO_BRACKET = ~(size_t)0;
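/*
* Rule BD16: find the bracket pairs in this isolating run
* sequence, using a fixed-size stack of currently open brackets,
* then call process() on each pair in order of opening bracket.
*/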
size_t sp = 0;
for (size_t c = 0; c < ctx->irslen; c++)
ctx->bracketpos[c] = NO_BRACKET;
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
unsigned wc = ctx->text[i].wc;
BracketTypeData bt = bracket_type(wc);
if (bt.type == BT_OPEN) {
if (sp >= lenof(ctx->bstack)) {
/*
* Stack overflow. The spec says we simply give up at
* this point.
*/
goto found_all_pairs;
}
ctx->bstack[sp].ch = wc;
ctx->bstack[sp].c = c;
sp++;
} else if (bt.type == BT_CLOSE) {
size_t new_sp = sp;
/*
* Search down from the top of the stack for an entry
* containing a matching open bracket. If we find it, pop
* that entry and everything pushed after it, and record a
* matching pair. If we reach the bottom of the stack
* without finding anything, leave sp where it started.
*/
while (new_sp-- > 0) {
if (ctx->bstack[new_sp].ch == bt.partner ||
ctx->bstack[new_sp].ch == bt.equiv_partner) {
/* Found a stack element matching this one */
size_t cstart = ctx->bstack[new_sp].c;
ctx->bracketpos[cstart] = c;
sp = new_sp;
break;
}
}
}
}
found_all_pairs:
for (size_t c = 0; c < ctx->irslen; c++) {
if (ctx->bracketpos[c] != NO_BRACKET) {
process(ctx, c, ctx->bracketpos[c]);
}
}
}
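/*
* Illustrative sketch, not part of PuTTY: the BD16 pairing stack on
* an ASCII string. All demo_* names and DEMO_* macros are
* hypothetical. Only the three ASCII bracket pairs are recognised,
* and canonical-equivalent partners are ignored, so this is a
* simplification of what bracket_type() provides. The stack size
* follows the 63-entry limit from BD16.
*
```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STACK_MAX 63               // stack limit given in BD16
#define DEMO_NO_PAIR ((size_t)-1)

// Hypothetical ASCII-only stand-in for a bracket property lookup.
static char demo_opener_for(char c)
{
    return c == ')' ? '(' : c == ']' ? '[' : c == '}' ? '{' : 0;
}

static int demo_is_opener(char c)
{
    return c == '(' || c == '[' || c == '{';
}

// Set pair[i] to the index of the closer matching the opener at i,
// or to DEMO_NO_PAIR for every other position.
static void demo_find_pairs(const char *s, size_t n, size_t *pair)
{
    struct { char ch; size_t pos; } stack[DEMO_STACK_MAX];
    size_t sp = 0;
    for (size_t i = 0; i < n; i++)
        pair[i] = DEMO_NO_PAIR;
    for (size_t i = 0; i < n; i++) {
        if (demo_is_opener(s[i])) {
            if (sp >= DEMO_STACK_MAX)
                return;                 // overflow: stop looking
            stack[sp].ch = s[i];
            stack[sp].pos = i;
            sp++;
        } else if (demo_opener_for(s[i])) {
            for (size_t k = sp; k-- > 0 ;) {
                if (stack[k].ch == demo_opener_for(s[i])) {
                    pair[stack[k].pos] = i;
                    sp = k;             // pop the match and all above it
                    break;
                }
            }
        }
    }
}

static void demo_find_pairs_check(void)
{
    size_t pair[8];
    demo_find_pairs("a(b[c)d]", 8, pair);
    assert(pair[1] == 5);               // "(" pairs with ")"
    assert(pair[3] == DEMO_NO_PAIR);    // "[" was popped unmatched
}
```
*
* In the check, the "[" is discarded when ")" pops past it, so the
* later "]" finds nothing to match.
*/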
static BidiType get_bracket_type(BidiContext *ctx, size_t copen, size_t cclose)
{
/*
* Rule N0: a pair of matched brackets containing at least one
* strong type takes on the current embedding direction, unless
* all of these are true at once:
*
* (a) there are no strong types inside the brackets matching the
* current embedding direction
* (b) there _is_ at least one strong type inside the brackets
* that is _opposite_ to the current embedding direction
* (c) the strong type preceding the open bracket is also
* opposite to the current embedding direction
*
* in which case they take on the opposite direction.
*
* For these purposes, number types (EN and AN) count as R.
*/
bool foundOppositeTypeInside = false;
for (size_t c = copen + 1; c < cclose; c++) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (typeIsStrongOrNumber(t)) {
t = t == L ? L : R; /* numbers count as R */
if (t == ctx->embeddingDirection) {
/* Found something inside the brackets matching the
* current level, so (a) is violated. */
return ctx->embeddingDirection;
} else {
foundOppositeTypeInside = true;
}
}
}
if (!foundOppositeTypeInside) {
/* No strong types at all inside the brackets, so return ON to
* indicate that we're not messing with their type at all. */
return ON;
}
/* There was an opposite strong type in the brackets. Look
* backwards to the preceding strong type, and go with that,
* whichever it is. */
for (size_t c = copen; c-- > 0 ;) {
size_t i = ctx->irs[c];
BidiType t = ctx->types[i];
if (typeIsStrongOrNumber(t)) {
t = t == L ? L : R; /* numbers count as R */
return t;
}
}
/* Fallback: if the preceding strong type was not found, go with
* sos. */
return ctx->sos;
}
static void reset_bracket_type(BidiContext *ctx, size_t c, BidiType t)
{
/* Final bullet point of rule N0: when we change the type of a
* bracket, the same change applies to any contiguous sequence of
* characters after it whose _original_ bidi type was NSM. */
do {
ctx->types[ctx->irs[c++]] = t;
#ifndef REMOVE_FORMATTING_CHARS
while (c < ctx->irslen && ctx->origTypes[ctx->irs[c]] == BN) {
/* Section 5.2 adjustment: skip past BN in the process. */
c++;
}
#endif
} while (c < ctx->irslen && ctx->origTypes[ctx->irs[c]] == NSM);
}
static void resolve_brackets(BidiContext *ctx, size_t copen, size_t cclose)
{
if (typeIsNeutral(ctx->types[ctx->irs[copen]]) &&
typeIsNeutral(ctx->types[ctx->irs[cclose]])) {
BidiType t = get_bracket_type(ctx, copen, cclose);
if (t != ON) {
reset_bracket_type(ctx, copen, t);
reset_bracket_type(ctx, cclose, t);
}
}
}
static void remove_ni(BidiContext *ctx)
{
/*
* Rules N1 and N2 together: neutral or isolate characters take
* the direction of the surrounding strong text if the nearest
* strong characters on each side match, and otherwise, they take
* the embedding direction.
*/
const size_t NO_INDEX = ~(size_t)0;
BidiType prevStrongType = ctx->sos;
size_t c_ni_start = NO_INDEX;
for (size_t c = 0; c <= ctx->irslen; c++) {
BidiType t;
if (c < ctx->irslen) {
size_t i = ctx->irs[c];
t = ctx->types[i];
} else {
/* One extra loop iteration, using eos to resolve the
* final sequence of NI if any */
t = ctx->eos;
}
if (typeIsStrongOrNumber(t)) {
t = t == L ? L : R; /* numbers count as R */
if (c_ni_start != NO_INDEX) {
/* There are some NI we have to fix up */
BidiType ni_type = (t == prevStrongType ? t :
ctx->embeddingDirection);
for (size_t c2 = c_ni_start; c2 < c; c2++) {
size_t i2 = ctx->irs[c2];
BidiType t2 = ctx->types[i2];
if (typeIsNeutralOrIsolate(t2))
ctx->types[i2] = ni_type;
}
}
prevStrongType = t;
c_ni_start = NO_INDEX;
} else if (typeIsNeutralOrIsolate(t) && c_ni_start == NO_INDEX) {
c_ni_start = c;
}
}
}
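/*
* Illustrative sketch, not part of PuTTY: rules N1 and N2 on a plain
* array, including the extra iteration that uses eos to close a
* trailing run of neutrals. DemoType and the demo_* functions are
* hypothetical names. The sketch assumes every type that is not
* strong or a number is neutral, so whole runs can be overwritten.
*
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical cut-down type set, for this sketch only.
typedef enum { D_L, D_R, D_EN, D_AN, D_ON, D_WS } DemoType;

static bool demo_is_strong_or_number(DemoType t)
{
    return t == D_L || t == D_R || t == D_EN || t == D_AN;
}

static void demo_resolve_ni(DemoType *types, size_t n, DemoType sos,
                            DemoType eos, DemoType embedding)
{
    const size_t NONE = (size_t)-1;
    DemoType prev = sos;
    size_t run_start = NONE;
    for (size_t i = 0; i <= n; i++) {
        DemoType t = (i < n) ? types[i] : eos;  // eos closes a final run
        if (demo_is_strong_or_number(t)) {
            t = (t == D_L) ? D_L : D_R;         // numbers count as R
            if (run_start != NONE) {
                // N1 if both sides agree, otherwise N2
                DemoType fill = (t == prev) ? t : embedding;
                for (size_t j = run_start; j < i; j++)
                    types[j] = fill;
            }
            prev = t;
            run_start = NONE;
        } else if (run_start == NONE) {
            run_start = i;
        }
    }
}

static void demo_resolve_ni_check(void)
{
    DemoType t[] = { D_R, D_ON, D_WS, D_R, D_ON, D_L, D_WS };
    demo_resolve_ni(t, 7, D_L, D_L, D_L);
    assert(t[1] == D_R && t[2] == D_R);  // R on both sides (N1)
    assert(t[4] == D_L);                 // R then L: embedding (N2)
    assert(t[6] == D_L);                 // L then eos L (N1)
}
```
*
* Numbers are only treated as R for the comparison; their own
* entries in the array are left as EN or AN.
*/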
static void resolve_implicit_levels(BidiContext *ctx)
{
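/* Rules I1 and I2: at an even (left-to-right) level, R goes up one
* level and numbers go up two; at an odd (right-to-left) level, L
* and numbers go up one. */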
for (size_t c = 0; c < ctx->irslen; c++) {
size_t i = ctx->irs[c];
unsigned char level = ctx->levels[i];
BidiType t = ctx->types[i];
if (level % 2 == 0) {
/* Rule I1 */
if (t == R)
ctx->levels[i] += 1;
else if (t == AN || t == EN)
ctx->levels[i] += 2;
} else {
/* Rule I2 */
if (t == L || t == AN || t == EN)
ctx->levels[i] += 1;
}
}
}
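/*
* Illustrative sketch, not part of PuTTY: rules I1 and I2 as a pure
* function of one level and one resolved type. DemoType,
* demo_implicit_level and demo_implicit_level_check are hypothetical
* names for this example only.
*
```c
#include <assert.h>

// Hypothetical cut-down type set, for this sketch only.
typedef enum { D_L, D_R, D_EN, D_AN } DemoType;

static unsigned char demo_implicit_level(unsigned char level, DemoType t)
{
    if (level % 2 == 0) {
        // Rule I1: even (left-to-right) level
        if (t == D_R)
            return (unsigned char)(level + 1);
        if (t == D_AN || t == D_EN)
            return (unsigned char)(level + 2);
    } else {
        // Rule I2: odd (right-to-left) level
        if (t == D_L || t == D_AN || t == D_EN)
            return (unsigned char)(level + 1);
    }
    return level;
}

static void demo_implicit_level_check(void)
{
    assert(demo_implicit_level(0, D_L) == 0);
    assert(demo_implicit_level(0, D_R) == 1);
    assert(demo_implicit_level(0, D_EN) == 2);
    assert(demo_implicit_level(1, D_R) == 1);
    assert(demo_implicit_level(1, D_L) == 2);
    assert(demo_implicit_level(1, D_AN) == 2);
}
```
*
* In every case the result is even for L and odd for R, and numbers
* end up at an even level no lower than the one they started at.
*/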
static void process_isolating_run_sequence(BidiContext *ctx)
{
/* Section W: resolve weak types */
remove_nsm(ctx);
change_en_to_an(ctx);
change_al_to_r(ctx);
eliminate_separators_between_numbers(ctx);
eliminate_et_next_to_en(ctx);
eliminate_separators_and_terminators(ctx);
change_en_to_l(ctx);
/* Section N: resolve neutral types (and isolates) */
find_bracket_pairs(ctx, resolve_brackets);
remove_ni(ctx);
/* Section I: resolve implicit levels */
resolve_implicit_levels(ctx);
}
static void reset_whitespace_and_separators(BidiContext *ctx)
{
/*
* Rule L1: segment and paragraph separators, plus whitespace
* preceding them, all reset to the paragraph embedding level.
* This also applies to whitespace at the very end.
*
* This is done using the original types, not the versions that
* the rest of this algorithm has been merrily mutating.
*/
bool modifying = true;
for (size_t i = ctx->textlen; i-- > 0 ;) {
BidiType t = ctx->origTypes[i];
if (typeIsSegmentOrParaSeparator(t)) {
ctx->levels[i] = ctx->paragraphLevel;
modifying = true;
} else if (modifying) {
if (typeIsWhitespaceOrIsolate(t)) {
ctx->levels[i] = ctx->paragraphLevel;
} else if (!typeIsRemovedDuringProcessing(t)) {
modifying = false;
}
}
}
#ifndef REMOVE_FORMATTING_CHARS
/*
* Section 5.2 adjustment: types removed by rule X9 take the level
* of the character to their left.
*/
for (size_t i = 0; i < ctx->textlen; i++) {
BidiType t = ctx->origTypes[i];
if (typeIsRemovedDuringProcessing(t)) {
/* Index 0 has no character to its left, so it takes the
* paragraph level instead */
ctx->levels[i] = (i > 0 ? ctx->levels[i-1] : ctx->paragraphLevel);
}
}
#endif /* ! REMOVE_FORMATTING_CHARS */
}
static void reverse(BidiContext *ctx, size_t start, size_t end)
{
for (size_t i = start, j = end; i < j; i++, j--) {
bidi_char tmp = ctx->text[i];
ctx->text[i] = ctx->text[j];
ctx->text[j] = tmp;
}
}
static void mirror_glyphs(BidiContext *ctx)
{
/*
* Rule L3: any character with a mirror-image pair at an odd
* embedding level is replaced by its mirror image.
*
* This is specified in the standard as happening _after_ rule L2
* (the actual reordering of the text). But it's much easier to
* implement it before, while our levels[] array still matches up
* to the text order.
*/
for (size_t i = 0; i < ctx->textlen; i++) {
if (ctx->levels[i] % 2)
ctx->text[i].wc = mirror_glyph(ctx->text[i].wc);
}
}
static void reverse_sequences(BidiContext *ctx)
{
/*
* Rule L2: every maximal contiguous sequence of characters at a
* given level or higher is reversed.
*/
unsigned level = 0;
for (size_t i = 0; i < ctx->textlen; i++)
level = max(level, ctx->levels[i]);
for (; level >= 1; level--) {
for (size_t i = 0; i < ctx->textlen; i++) {
if (ctx->levels[i] >= level) {
size_t start = i;
while (i+1 < ctx->textlen && ctx->levels[i+1] >= level)
i++;
reverse(ctx, start, i);
}
}
}
}
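/*
* Illustrative sketch, not part of PuTTY: rule L2 on a char array
* with a parallel array of levels. demo_reverse, demo_reorder and
* demo_reorder_check are hypothetical names, and the levels in the
* check are written by hand rather than computed by the algorithm.
*
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Reverse s[start..end], both ends inclusive.
static void demo_reverse(char *s, size_t start, size_t end)
{
    while (start < end) {
        char tmp = s[start];
        s[start++] = s[end];
        s[end--] = tmp;
    }
}

static void demo_reorder(char *s, const unsigned char *levels, size_t n)
{
    unsigned top = 0;
    for (size_t i = 0; i < n; i++)
        if (levels[i] > top)
            top = levels[i];
    // From the highest level down to 1, reverse each maximal run of
    // characters at that level or higher.
    for (unsigned level = top; level >= 1; level--) {
        for (size_t i = 0; i < n; i++) {
            if (levels[i] >= level) {
                size_t start = i;
                while (i + 1 < n && levels[i + 1] >= level)
                    i++;
                demo_reverse(s, start, i);
            }
        }
    }
}

static void demo_reorder_check(void)
{
    char s[] = "ab CDE 12 fg";
    const unsigned char levels[] = { 0, 0, 0, 1, 1, 1, 1, 2, 2, 0, 0, 0 };
    demo_reorder(s, levels, 12);
    assert(strcmp(s, "ab 12 EDC fg") == 0);
}
```
*
* The digits are reversed twice, once at level 2 and once as part
* of the level 1 run, so they read in their original order inside
* the reversed right-to-left text.
*/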
/*
* The Main Bidi Function. The two wrappers below it present different
* external APIs for different purposes, but everything comes through
* here.
*
* text: a buffer of textlen characters to apply the Bidirectional
* algorithm to. It is reordered in place.
*/
static void do_bidi_new(BidiContext *ctx, bidi_char *text, size_t textlen)
{
ensure_arrays(ctx, textlen);
ctx->text = text;
ctx->textlen = textlen;
setup_types(ctx);
/* Quick initial test: see if we need to bother with any work at all */
if (!text_needs_bidi(ctx))
return;
set_paragraph_level(ctx);
process_explicit_embeddings(ctx);
remove_embedding_characters(ctx);
find_isolating_run_sequences(ctx, process_isolating_run_sequence);
/* If this implementation distinguished paragraphs from lines,
* then this would be the point where we repeat the remainder of
* the algorithm once for each line in the paragraph. */
reset_whitespace_and_separators(ctx);
mirror_glyphs(ctx);
reverse_sequences(ctx);
}
size_t do_bidi_test(BidiContext *ctx, bidi_char *text, size_t textlen,
int override)
{
ctx->paragraphOverride = (override > 0 ? L : override < 0 ? R : ON);
do_bidi_new(ctx, text, textlen);
return ctx->textlen;
}
void do_bidi(BidiContext *ctx, bidi_char *text, size_t textlen)
{
#ifdef REMOVE_FORMATTING_CHARS
abort(); /* can't use the standard algorithm in a live terminal */
#else
ctx->paragraphOverride = ON;
do_bidi_new(ctx, text, textlen);
#endif
}