Unicode safety
==============

SLUL already disallows Unicode control characters such as RLO (right-to-left override), but there are other character sequences that can confuse someone reading the code:

1. Look-alikes in strings
2. RTL (right-to-left) text that continues in following weak or neutral characters (such as punctuation, quoting, parentheses and digits)

Note that the only parts of SLUL source that can contain non-ASCII characters are strings, comments and (some) header values.

Look-alike prevention
---------------------

This is mostly a non-issue with comments. But look-alikes could be used to add comments with "fake" TODOs that don't appear when searching for "TODO", or to hide other things that someone doesn't want to show up in searches.

For header values, it is also mostly a non-issue. All header values are restricted to ASCII, except for things like \license.text. In \license.text, look-alikes could be used to hide, from searches and indexing software, that a module is licensed under a restrictive license.

So the impact on comments and header values is relatively small. For strings, however, look-alikes could be used to obfuscate code. Here are some ways it could be prevented:

- We could require some indicator in strings that contain a "confusing" mix of characters.
- We could require some indicators in strings that contain characters from various sets of confusable characters.

Perhaps something like this:

    \slul 0.0.0
    \name test
    ...
    check_access("\C;АБВГД\L;abcdef")

Which sets of characters are needed?

- \L = latin
- \C = cyrillic
- \G = greek (except non-confusing ones, since it's used in math formulas)
- \S = specials (for special chars that look like other special chars)
- \O = others
- \A = allow all
- allow sequences of these for combinations? e.g. "\LG;Σδx+δy"
- ...others...?

The list above is currently implemented by hand, but it should be generated automatically from some official Unicode character information text file.

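As a rough illustration of how the set indicators could be derived for a given string, here is a minimal Python sketch. It is not how the SLUL compiler works. The function names and the mapping table are hypothetical, and because Python's `unicodedata` module has no Script property, the first word of the character name is used as an approximation. ASCII letters are counted as \L here, the \S set is not modelled, and everything unrecognized falls into \O.

```python
import unicodedata

# Hypothetical mapping from the first word of a Unicode character name to the
# indicator letters proposed above. This is an approximation of the Script
# property, which the standard library does not expose.
INDICATOR_BY_NAME_PREFIX = {
    "LATIN": "L",
    "CYRILLIC": "C",
    "GREEK": "G",
}


def indicator_for(ch):
    """Return the set letter for one character, or None if none is needed."""
    if ch.isascii() and not ch.isalpha():
        return None  # ASCII digits, punctuation and spaces are left alone
    name = unicodedata.name(ch, "")
    first_word = name.split(" ", 1)[0]
    # Anything that is not recognized ends up in the "others" set.
    return INDICATOR_BY_NAME_PREFIX.get(first_word, "O")


def required_indicators(text):
    """Collect the set letters that a string would have to declare."""
    return {ind for ind in map(indicator_for, text) if ind is not None}


def is_mixed(text):
    """True if the string draws letters from more than one set."""
    return len(required_indicators(text)) > 1
```

With this sketch, `sorted(required_indicators("Σδx+δy"))` gives `['G', 'L']`, matching the "\LG;" example, and `is_mixed("АБВГДabcdef")` is true. A real implementation would more likely be generated from the Unicode data files, as suggested above.
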
(Or possibly, there could be a single toggle between ASCII and non-ASCII.)

Another issue is that it's not very obvious what e.g. \C; does. Maybe it should be \Cyr; \Gr; \Spec; etc.? Or maybe it should be \LG{Σδx+δy} (i.e. start with "{" and end with "}").

RTL continuation prevention
---------------------------

RTL mode can continue into non-letter characters in some cases. This is also affected by grouping characters such as parentheses, which include several non-ASCII characters (such as superscripted parentheses).

There are (at least) two ways to combat this issue:

1. Make the compiler aware of these somewhat complex rules, and update them with new Unicode standards (= not fully compatible changes). That is obviously a bad idea.
2. Require a certain structure of the code, to avoid the issue.

One way to do 2 is to require strings with RTL characters to be placed on separate lines (with nothing else on the line except the surrounding quotes and possibly a trailing comma), followed by a special "end_bidi" keyword:

    example_func(
        "... RTL text here ...",
        "... more RTL text ..."
        "...possibly spanning over multiple lines..."
        end_bidi
    )

The combination of a newline and the "end_bidi" keyword should reset all Unicode BiDi state to left-to-right mode.

An "\option bidi_strings" header value could be added to allow RTL characters in a less cumbersome way, for code that uses a lot of RTL text:

    \slul 0.0.0
    \name test
    \option bidi_strings

    example_func("... RTL text here ...",
                 "...more RTL text..."
                 "...possibly spanning over multiple lines...")

Combining characters
--------------------

If a combining character appears immediately after a string start character (i.e. "), then the combining character might be applied to the " when displayed. So combining characters should be forbidden as the first character of a string literal. Instead, they should be escaped.
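
The two per-string checks from the RTL and combining character sections could be detected roughly as in the following Python sketch. The function names are hypothetical and this is not taken from the SLUL compiler. It assumes that "RTL characters" means characters with the strong bidi classes R or AL, and that "combining character" means any character in a mark category (Mn, Mc or Me).

```python
import unicodedata


def starts_with_combining(literal_body):
    """True if the first character of a string body is a combining mark."""
    if not literal_body:
        return False
    # General categories Mn, Mc and Me all start with "M".
    return unicodedata.category(literal_body[0]).startswith("M")


def contains_rtl(literal_body):
    """True if any character has a strong right-to-left bidi class."""
    # "R" covers e.g. Hebrew letters, "AL" covers Arabic letters.
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL") for ch in literal_body
    )
```

A compiler could reject a literal when `starts_with_combining` is true, and require the separate-line layout with "end_bidi" (or the "\option bidi_strings" header) when `contains_rtl` is true.
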