20100609: CSR: a language made of syntactic sugar.
-- Please note that this is a very BAD idea. I keep this here because it was worth doing the mental gymnastics, but the power to define all your operators is fundamentally bad. For maximum punnage, I refer to this language as "Clean Slate Representation". --
CSR is a language designed to be completely flexible. While other languages may allow you to define your own classes and then specify how the + operator should behave, CSR lets you define new operators and specify precedence.
This language is probably not of great value. I invented it more as a curiosity than anything else, but it may end up being of use as an embedded expression evaluator. If not, it's still done its job. :)
Everything is a token. A token could be a literal number or string, or a variable, or an operator. Variables do not have to be declared, but are created on first use.
Operators are created by directives, which affect parsing from that point to the end of the file. They come in several types:
* Unary operators: the one expression to the right (or left) of the operator becomes the one parameter to the function.
* Binary operators: the expressions to the left and to the right become the parameters.
* Grouping operators: two symbols group an expression. If the expression is preceded or followed by a token (as opposed to an operator), this will become a part of the function call too.
Operators can use symbols and/or letters, but not numbers. (For instance, you cannot define 5 to be an operator.)
Examples of what can be done:
`d(): parses 2d4+3*2d8+5
`%(): parses "Hello, %5s!"%"world"
`?`:(): parses (a==1)?"foo":"bar"
`;(): parses foo(1); and forces the evaluation of foo() - precedence 2
`,(): returns a list object with multiple objects in it - precedence 4
The last ones bear some explanation. The parser treats the primary operand of ?: as "foo", with preceding operand (a==1) and following operand "bar". The function would check the preceding for truth, and evaluate either the primary or following operand, thus mimicking C's ternary operator. The semicolon operator is actually binary: it forces the complete evaluation of its first operand, then returns its second unevaluated; the comma returns both its operands, unevaluated. List objects more are like LISP pairs than traditional arrays, in that they will nest rather than having more than two elements.
Functions and expressions are evaluated lazily. An unevaluated expression can be accepted as a function argument. With the ?: operator, for instance, it can take one int and two expressions (thus forcing the evaluation of the preceding operand), and return an expression (thus NOT evaluating either of the others until the result is used).
Ambiguity can easily result from alphabetic operators. Expressions are tokenized from left to right, and if the beginning of a token matches an operator, and the next part of the token is a digit, then it is the operator; if it is a letter, it is not. For example, suppose the "d" operator is defined for dice rolling, in both unary and binary forms:
* asdf: Variable
* abcd1234: Variable
* 123defg: Literal number followed by variable, without operator. Parse error.
* 123d234: Equivalent to `d(123,234)
* 123+d234: Equivalent to 123+`d(234)
* d(123): Equivalent to `d((123))
Note the last entry. Unary operators will mask functions of the same name.
The operator-creating directive must specify precedence. This is an integer, where 0 is lowest and 100 is highest. 0 may not be used, and represents literals; 100 is reserved for parentheses. Odd-numbered precedence levels associate right to left, even-numbered levels left-to-right.
Parsing an expression is done as follows:
* Example expression:
- d20+4d5+(2*3)+mod(str)
- Grouping () at 100
- Unary d at precedence 75 (rtl associativity)
- Binary d at precedence 50
- Binary * at 40
- Binary + at 30
* Divide the expression into tokens.
- Scan from left to right, and know what type of object is being built.
- If nothing is being built:
- If double quote, build a string.
- If current character is an alpha, build a variable.
- If current character is a digit, build an int.
- If current character is neither, build a symbol, or ignore whitespace.
- If a variable is being built:
- Alpha: Append
- Digit: Check if current variable name matches any operator. If so, replace variable object with operator object, and begin building an int.
- Anything else: Complete current object, ungetch, iterate.
- If an int is being built:
- Digit: Append
- Anything else: Complete current object, ungetch, iterate.
- If a symbol is being built:
- If current object is "/" and next char is "*", is comment - scan for next */, complete current object, iterate.
- Check if there is an operator which includes the current symbol and the next character. If so, append, iterate.
- If not: Complete current object, ungetch, iterate.
- If a string is being built:
- Double quote: Complete current object, do NOT ungetch, iterate
- Backslash: Take next character, and append to current string as appropriate (handing \n --> newline, \" --> quote, etc)
- Newline: Error (but allow an escaped newline aka trailing backslash)
- Anything else: Append
- Whitespace (at any time other than string): Complete current object, iterate.
- "Complete current object" includes resetting the "building" flag to nothing.
* At this point, we have a sequence of left-to-right objects. Operator, Integer, Variable:
- Od I20 O+ I4 Od I5 O+ O( I2 O* I3 O) O+ Vmod O( Vstr O)
* Proceed through the operator table in precedence order. (This can be done using only the operators actually used in the expression, as collected during the token-division step.) Start with the highest precedence operators.
- If currently parsing an even precedence level, scan the object stream from left to right, otherwise from right to left.
- Unary operators: require that the preceding object (in scan order) be a non-operator, and the following object be an operator or end-of-search. The preceding object is the argument.
- Binary operators: two adjacent objects must be non-operators, they are the arguments.
- Grouping operators take some more checking.
- First, locate both parts of the operator, with no other parts between them. If this cannot be done (either due to absence of one part, or mismatched nesting), error. If the two parts are the same symbol, then simply locate pairs, following the associativity order, and error if there is an odd number.
- Between the two parts is an expression. Recurse to parse the object-stream and record it as an object.
- If the object before the first operator part is an operator, leave it be. Otherwise, take it unchanged as the "preceding operand".
- Ditto the object after the second operator part.
* This should result in one single "Function" object, which can be passed around unevaluated.
- F( `+, F( `+, F( `+, F( `d, I20 ), F( `d, I4, I5 ) ), F( `(`), F(`*,I2,I3) ) ), F( `(`), Vstr, Vmod ) )
- Or in tree form:
F( `+,
F( `+,
F( `+,
F( `d, I20 ),
F( `d, I4, I5 )
),
F( `(`),
F(`*,I2,I3)
)
),
F(
`(`), Vstr, Vmod
)
)
Evaluating an expression is as simple as calling the function with its given arguments and returning the result. This is a recursion-heavy algorithm. Note that a function is most welcome to return another function.
Assignment cannot be chained, as this would make the lazy evaluation quite tricky. Consequently, the = (assignment) operator simply returns 0.
/* Hello world in CSR */
#include std
main=(
write(stdout,"Hello, world!\n");
);
main();