Alarm bells, IOT, Grammars, AI and DSL

A grammar essentially captures just the symbols and their transitions/groups. In that sense, formalisms such as PEG and its peers are quite tidy and easy to maintain, and a single grammar can have implementations in multiple languages. However, the code generated from a PEG grammar is often not as performant as a parser written against a high-level API such as Chevrotain.

One question then is:

Does it make sense to keep the grammar intent in a tidy format such as PEG, and to generate from it code that uses a high-level API such as Chevrotain?

That way, we gain the following:

the original grammar stays separate and is easy to reuse in parts and to change (especially when it grows large)
the generated parser keeps the performance benefits
other languages can have implementations too (since the original grammar is abstracted out into a separate, formal grammar file)
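
To make the split concrete, here is a minimal sketch of a grammar kept as plain data and turned into a matcher at load time. It interprets the data directly rather than emitting Chevrotain calls, and the rule format (`tok`, `seq`, `alt`, `ref`) and the `compile` helper are hypothetical names invented for this illustration, not part of any PEG tool or of Chevrotain.

```javascript
// Hypothetical grammar-as-data format: each rule is a tree of
// { tok }, { seq: [...] }, { alt: [...] } or { ref: "ruleName" } nodes.
const grammar = {
  expr: {
    seq: [
      { ref: "term" },
      { alt: [{ seq: [{ tok: "+" }, { ref: "expr" }] }, { seq: [] }] },
    ],
  },
  term: { alt: [{ tok: "1" }, { tok: "2" }] },
};

// Turn the data into a matcher. Internally, match() returns the new
// position in the token list, or -1 when the node does not match.
function compile(grammar) {
  const match = (node, tokens, pos) => {
    if ("tok" in node) return tokens[pos] === node.tok ? pos + 1 : -1;
    if ("ref" in node) return match(grammar[node.ref], tokens, pos);
    if ("seq" in node) {
      for (const child of node.seq) {
        pos = match(child, tokens, pos);
        if (pos < 0) return -1;
      }
      return pos;
    }
    // "alt": ordered choice, the first alternative that matches wins
    for (const child of node.alt) {
      const next = match(child, tokens, pos);
      if (next >= 0) return next;
    }
    return -1;
  };
  // The input is accepted only when the start rule consumes every token.
  return (start, tokens) => match(grammar[start], tokens, 0) === tokens.length;
}

const accepts = compile(grammar);
console.log(accepts("expr", ["1", "+", "2"])); // true
console.log(accepts("expr", ["1", "+"]));      // false
```

The same `grammar` object could just as well be handed to a code generator or to an implementation in another language, which is the point of keeping it separate.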

I like the Chevrotain approach of JavaScript itself being the DSL - it solves multiple problems, as one might guess. The challenge, however, is having the same grammar parsed from other languages, say C/C++.

On the other hand, I think we do not really need the same grammar to be parsed by other languages, if the underlying actions can somehow be made available in the DSL irrespective of the language they are written in.

Let me explain with a simple example.

Use case 1:

Consider the below code in a custom DSL:

  read_input | update_chart | ( > 10) ? raise_alarm;

The read_input, raise_alarm and update_chart actions are provided by domain-specific implementations, for example:

implementation 1 (say, in C++)

void read_input() { 
  // read from standard input
}
void raise_alarm() {
  // write to error console
}
void update_chart() {
  // write to console output
}

implementation 2 (say, in Javascript)

function read_input() {
  // load the input via ajax every 5 seconds
}
function raise_alarm() {
  // POST data to error log on server and send eMail
}
function update_chart() {
  // update the chart on web-page UI
}

Now, with Autobahn / WAMP, DeepStream etc., it is pretty straightforward to inter-operate and invoke methods across different languages.

What is missing, rather, is the ability to create dynamic DSLs that are formed out of the available underlying methods and knowing when to invoke what (like a parser event).

For example, consider something like below:

import 'implementation2';

read_input | update_chart | ( > 10) ? raise_alarm;

After the above import statement, all functions / actions provided by implementation2 should be available as tokens in the current DSL, and the parser should be capable of invoking them (in the order governed by the DSL's operator semantics, say left-to-right or right-to-left).

In other words, the valid tokens of the DSL are not static but vary, leading to a dynamic DSL that is populated by the underlying actions (implemented in any language and invoked through RPC, such as WAMP or DeepStream) - while the language itself stays static (operator precedence etc.).
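
A minimal sketch of this idea follows. The `createDsl` helper, its `importProvider` and `run` methods, and the stub bodies of the three actions are all hypothetical. I am also assuming one particular reading of the pipeline: the value flows left to right, and `( > 10) ? action` invokes the action only when the current value exceeds 10. In a real setup the provider functions could be RPC stubs rather than local functions.

```javascript
const log = [];

// Hypothetical provider standing in for 'implementation2'.
const implementation2 = {
  read_input: () => 42,
  update_chart: (v) => { log.push(`chart:${v}`); return v; },
  raise_alarm: (v) => { log.push(`alarm:${v}`); return v; },
};

function createDsl() {
  const actions = new Map(); // the token table, empty until something is imported
  return {
    // "import" turns every function of a provider into a valid token
    importProvider(provider) {
      for (const [name, fn] of Object.entries(provider)) actions.set(name, fn);
    },
    // Run "a | b | ( > N) ? c" left to right, passing the value along.
    run(source) {
      let value;
      const stages = source.replace(/;\s*$/, "").split("|").map((s) => s.trim());
      for (const stage of stages) {
        const guarded = stage.match(/^\(\s*>\s*(\d+)\s*\)\s*\?\s*(\w+)$/);
        const name = guarded ? guarded[2] : stage;
        if (!actions.has(name)) throw new Error(`unknown token: ${name}`);
        // A guarded stage is skipped when its condition does not hold.
        if (guarded && !(value > Number(guarded[1]))) continue;
        value = actions.get(name)(value);
      }
      return value;
    },
  };
}

const dsl = createDsl();
dsl.importProvider(implementation2);
dsl.run("read_input | update_chart | ( > 10) ? raise_alarm;");
console.log(log); // [ 'chart:42', 'alarm:42' ]
```

The language part (splitting on `|`, the guard syntax) is fixed in `run`, while the set of valid tokens is whatever has been imported so far.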

Use case 2:

To give another example, consider the below DSL from the music domain:

  C C D G A

Each note (such as C, D etc.) has an underlying MIDI action, supported by a C/C++ implementation, as sketched below:

function C() { play_midi_note('c'); }
function D() { play_midi_note('d'); }
// ...

In this case the notes are valid tokens in the DSL. But the notes need not always be C D ...; they could just as well be Do Re Mi ..., in which case:

import 'solfege';
 Do Re Mi Do ...

solfege implementation:

function Do() { play_midi_note('...'); }
function Re() { play_midi_note('...'); }
// ...

Question:

How does one implement such a dynamic DSL (one that has varying tokens)?

The goal is this: consider a text editor that supports this kind of music DSL. As the user keeps entering notes (such as C D etc.), the editor should execute the underlying action for each note the user has just entered (in this case, play the actual note).

This requires the text editor to support dynamic DSLs that can be loaded on the fly, and to know when to invoke the underlying action for each token.

One concept that comes to mind here, grammar inheritance, sounds interesting - but it is limited in scope, since it is primarily static and the parser has to know all tokens beforehand.

Rather, what I am looking at is: a token list (and their associated actions) that can grow and shrink at runtime.
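
As a rough sketch of such a token list, assume the editor feeds each typed character to a small tokenizer whose vocabulary can be changed between keystrokes. The `LiveTokenizer` class and its `register`, `unregister` and `type` methods are hypothetical names, and `play_midi_note` here merely records the note instead of calling a real MIDI backend.

```javascript
// Hypothetical editor-side tokenizer: feed it characters as the user types.
class LiveTokenizer {
  constructor() {
    this.actions = new Map(); // token text -> action, mutable at runtime
    this.buffer = "";
  }
  register(name, action) { this.actions.set(name, action); }
  unregister(name) { this.actions.delete(name); }
  // Called per keystroke; whitespace ends a token and fires its action.
  type(ch) {
    if (!/\s/.test(ch)) { this.buffer += ch; return; }
    const word = this.buffer;
    this.buffer = "";
    if (word === "") return;
    const action = this.actions.get(word);
    if (!action) throw new Error(`unknown token: ${word}`);
    action();
  }
}

const played = [];
const play_midi_note = (note) => played.push(note); // stand-in for a real MIDI call

const editor = new LiveTokenizer();
editor.register("C", () => play_midi_note("c"));
editor.register("D", () => play_midi_note("d"));
for (const ch of "C C D ") editor.type(ch);

editor.unregister("D");                            // the vocabulary shrinks ...
editor.register("Do", () => play_midi_note("do")); // ... and grows
for (const ch of "Do ") editor.type(ch);

console.log(played); // [ 'c', 'c', 'd', 'do' ]
```

Each action fires as soon as its token is completed, which is the behaviour the editor scenario above asks for.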

For example, consider an editor app. When it starts running, it does not know which tokens the user might type. But once the user enters import xyz, it knows the list of valid tokens the user is expected to enter from then on. By the time the user finishes entering their logic (in the custom xyz DSL), the editor would have already figured out the AST required to process that logic (and the corresponding actions to take to execute it). This AST could then be distributed as a work definition to a cluster of machines around the world, and each machine would either execute the AST actions as they are, or convert the AST into a native binary for performance and invoke the underlying actions (each possibly provided by binaries written in a different language).

Static parsers (such as those expressed with PEG grammars) would not be able to achieve that kind of feat - and I think Chevrotain is the closest bet (given that JS is its DSL, which is dynamic by nature) for this kind of task.

One thing I can tell is that the requirement for dynamic DSLs is on the rise in the industry, and it is going to play a major role in the evolution of next-gen tech around use cases such as IOT and edge analytics (which require business-logic work definitions to be programmable by the user, usually powered by polyglot functional providers underneath).

A smart home DSL example:

Let me illustrate with an example. Consider the below smart home DSL:

import alarm_dsl;

IF living_room_temperature_sensor temperature > 25 degrees
  THEN play music_alarm {
    C C D G C G G
  }

The play music_alarm statement allows a different music grammar, comprised of music notes such as C, D, G etc. Its grammatical meaning is entirely different from that of the parent / host grammar, and it applies only within the bounds of { }.

Now, consider that music_alarm supports repeating the alarm a number of times (say 3 times, as expressed below):

THEN play music_alarm {
  C C D G C G G
} X 3

Then such a play music_alarm grammatical structure could possibly be expressed as below:

// file: alarm_dsl

import music_string_dsl;

definition: 
    music_alarm {MUSIC_STRING} [X NUMBER]

or it could be more extensive as below:

// file: alarm_dsl

import music_string_dsl;
import calculator_dsl;
import sql_dsl;


definition: 
    music_alarm {MUSIC_STRING} [X NUMBER  |  (CALCULATION)  |  {SQL_STATEMENT}]

which allows one to write something like below utilizing the Calculator grammar:

    THEN play music_alarm {
        C C D G C G G
    } X (1 + 2)

or something like below utilizing the SQL grammar:

    THEN play music_alarm {
        C C D G C G G
    } X { SELECT alarm_count FROM user_preferences }

Extension Plugins

One possibility I would like to note here: say the alarm vendor supplies only {MUSIC_STRING} [X NUMBER] as the original grammar (without calculator or SQL support). The user may then, on their end, use their own (possibly third-party) extensions such as the below to extend the vendor's grammar.

// file: number_extension_dsl
redefine:
     NUMBER: NUMBER  |  (CALCULATION)  |  {SQL_STATEMENT}

By importing the above number_extension_dsl, the user can use calculations and SQL statements in places where the alarm vendor originally supported only numbers.
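
One way to picture the redefine step is a rule table in which each rule is a list of alternatives that an extension can append to. The sketch below is only an illustration under that assumption: `rules`, `parseRule`, `parseAlarm` and `importNumberExtension` are hypothetical names, the calculator alternative handles nothing beyond `(a + b)`, and the SQL alternative is left out.

```javascript
// Hypothetical rule table: rule name -> list of alternatives.
// Each alternative takes the source text and returns a number or undefined.
const rules = {
  NUMBER: [(text) => (/^\d+$/.test(text) ? Number(text) : undefined)],
};

function parseRule(name, text) {
  for (const alternative of rules[name]) {
    const result = alternative(text.trim());
    if (result !== undefined) return result;
  }
  throw new Error(`no alternative of ${name} matches: ${text}`);
}

// Vendor grammar: music_alarm {MUSIC_STRING} [X NUMBER]
function parseAlarm(source) {
  const m = source.match(/^music_alarm\s*\{([^}]*)\}\s*(?:X\s*(.+))?$/s);
  if (!m) throw new Error("not a music_alarm statement");
  return {
    notes: m[1].trim().split(/\s+/),
    repeat: m[2] ? parseRule("NUMBER", m[2]) : 1, // no "X ..." means play once
  };
}

// number_extension_dsl: add "(a + b)" as a further NUMBER alternative.
function importNumberExtension() {
  rules.NUMBER.push((text) => {
    const sum = text.match(/^\(\s*(\d+)\s*\+\s*(\d+)\s*\)$/);
    return sum ? Number(sum[1]) + Number(sum[2]) : undefined;
  });
}

console.log(parseAlarm("music_alarm { C C D } X 3").repeat);       // 3
importNumberExtension();
console.log(parseAlarm("music_alarm { C C D } X (1 + 2)").repeat); // 3
```

The vendor's `parseAlarm` is untouched by the extension; only the NUMBER rule it refers to has gained an alternative.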

Can we limit the imports to the top of the text?

That should be fine, but I am not sure one can always know the list of all imports up front.

For example,

THEN play music_alarm {
  import music_dsl; // <<- scoped import here
  C C D G C G G
}

This is required when we are merging user-supplied pieces with a default template (as happens in admin dashboards). For example, say the above music piece is actually gathered from the user as input in a web-page text box. The user then only sees a music-entry editor / text box (without the surrounding context) and enters:

  import music_dsl; 
  C C D G C G G

which will then be merged with the outer base-template play music_alarm { % user_string_here % } loaded from DB.

In such a case, we would not know up front which DSLs are to be imported. The user may choose to use their own compatible custom music DSL (uploaded from their computer).

Since an import statement essentially results in an AST being loaded into memory (either replacing or extending an existing one, or standing alone), it would be good to support the import statement anywhere.
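
A scoped import could be modelled with a chain of token tables, one per { } block, each falling back to its enclosing block. This is a minimal sketch under that assumption; the `Scope` class and the contents of `music_dsl` are hypothetical.

```javascript
// Hypothetical scope chain: each { ... } block gets its own token table
// that falls back to the enclosing block when a token is not found locally.
class Scope {
  constructor(parent = null) {
    this.parent = parent;
    this.tokens = new Map();
  }
  importDsl(dsl) {
    for (const [name, action] of Object.entries(dsl)) this.tokens.set(name, action);
  }
  lookup(name) {
    if (this.tokens.has(name)) return this.tokens.get(name);
    return this.parent ? this.parent.lookup(name) : undefined;
  }
}

const music_dsl = { C: () => "c", D: () => "d" }; // stand-in for a loaded DSL

const top = new Scope();
const block = new Scope(top); // entering "music_alarm {"
block.importDsl(music_dsl);   // "import music_dsl;" inside the braces

console.log(typeof block.lookup("C")); // function
console.log(top.lookup("C"));          // undefined, the import did not leak out
```

When the parser leaves the block it simply drops `block`, so the user-supplied tokens never reach the outer template.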

On the need for dynamic DSL for IOT

Many hate the idea of multiple new syntaxes. For that matter, any one Turing-complete language is capable of solving all computable problems with just one syntax (by definition). Say, simple, plain-old C is enough.

However, I would like to stress that it is equally important to be able to create new syntaxes.

The fact that we already have a plethora of languages (more specifically, syntaxes) at the high level itself, such as C++, Javascript, Python, R etc., clearly indicates what the underlying problem is.

I will not go into the language wars, but hope to clarify the problem by illustrating a few pain points that we are facing in the industry right now:

Consider a CEP (complex event processing) engine that requires real-time calculations on incoming streaming data. It requires more expressiveness in terms of "data flow". It is more natural to write something like

    source | transform1 | transform2 | output

than something that requires heavily nested callbacks or promises (JS style), or heavily sequential control flow (C style), though JS, C etc. can do the job perfectly well, and perhaps more efficiently.
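
For comparison, here is roughly what the same data flow looks like when it has to be emulated inside JavaScript with a small `pipe` helper. The helper and the four stage functions are hypothetical stand-ins for source, transform1, transform2 and output.

```javascript
// pipe(a, b, c) returns a function that feeds each stage's result into the next.
const pipe = (...stages) => (input) =>
  stages.reduce((value, stage) => stage(value), input);

// Hypothetical stages standing in for source, transform1, transform2 and output.
const source = () => [3, 12, 7, 25];
const transform1 = (readings) => readings.filter((r) => r > 5);
const transform2 = (readings) => readings.map((r) => r * 2);
const output = (readings) => readings.join(",");

const flow = pipe(source, transform1, transform2, output);
console.log(flow()); // 24,14,50
```

It works, but the flow is expressed through a helper function and a call convention rather than through the syntax itself.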

Consider a cultural document preservation system, such as music notation. It requires a tablature style of expressiveness that can illustrate both sequential and parallel notes at the same time. This kind of syntax is hard to express with any existing programming language.

And the problem is that the syntax that suits CEP well does not work that well for music notation.

We always try to keep it low and stay true to one syntax paradigm, but it just doesn't work. For example,

Our CFugue library: it is a masterpiece for Carnatic Music, straight from C/C++. But it is just not enough for what we want to achieve.
The M2M gap for C/C++ is too large: REST Gap in C/C++
Asynchronous multi-core execution syntax is hard to capture on fundamentally sequential syntax designs

Now, one may argue that C/C++ is the wrong choice, and that one should stick to JS or Python or some xyz language (syntax) for all things. Well, each language has its own problems.

It is just NOT possible to create one syntax to rule them all. The very fact that each programming language / syntax keeps evolving with every version (say ES6, Python 3 etc.) is itself a clear indication that every syntax is missing something.

Now, consider this scenario:

Instead of one programming language / syntax growing vertically over the years through different versions (enhancing its syntax along the way), think of different syntaxes evolving horizontally at the same time, each tailor-built for a different purpose. That horizontal set of syntaxes is our DSLs.

This is just like the modularity concept in programming - one module doing just one thing, and doing it right. But here we are applying that modularity to a syntax / language.

What is the need for multiple syntaxes

In the earlier days, someone created a language / syntax and everyone used it. With LLVM, there are already many languages / syntaxes out there, created by every other person.

IOT brings the capability to connect multiple devices from different domains (from wrist watch to refrigerator to aeroplane and what not). It is impossible to convince all device vendors to agree on one meta language. Even if the vendors wanted to create one syntax, given the diverse domains, they would eventually end up creating a whole set of DSLs, one for each domain (just like JS for the web, C for system programming etc.).

Software creators (such as those who create IOT analytics platforms) ultimately have to support all these DSLs, and at any given point in time the software would not know which exact DSL to expect (devices may connect and go away randomly).

Hence the need for editors / interpreters / generators that can adapt their syntax and semantics on the fly.

Bottom Line is:

Parsers should be open and able to work with different sets of syntaxes; that is as important as being able to extend the semantics of one syntax.

The former takes the horizontal approach, while the latter takes the vertical approach - both of which are needed.

Also, a sideline thought:

Is it possible to consider the AST as the universal meta language and treat all the syntaxes just as aliases at the semantic layer?

When a new syntactic expression is inserted into a parent code, if we can treat it as AST subtree insertion into parent AST (mounted at that insertion point), then I think it is possible to have different syntaxes work together.
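
A minimal sketch of such a subtree insertion, assuming a very simple AST shape: every node is a plain object with a `type` and optional `children`, and a node of type "mount" marks the point where a foreign syntax was embedded. The `mount` function and the node shapes are hypothetical.

```javascript
// Replace the mount point identified by mountId with the AST produced by
// another syntax; all other nodes are copied as they are.
function mount(node, mountId, subAst) {
  if (node.type === "mount" && node.id === mountId) return subAst;
  if (!node.children) return node;
  return { ...node, children: node.children.map((c) => mount(c, mountId, subAst)) };
}

// Host AST for something like: play music_alarm { ... } X 3
const hostAst = {
  type: "play_alarm",
  children: [{ type: "mount", id: "music" }, { type: "repeat", value: 3 }],
};
// AST produced separately by the music syntax for: C C D G
const musicAst = { type: "notes", value: ["C", "C", "D", "G"] };

const combined = mount(hostAst, "music", musicAst);
console.log(JSON.stringify(combined, null, 2));
```

The host tree never needs to understand the music syntax; it only needs to know where the foreign subtree is mounted.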

I am taking this concept directly from the language workbenches but I see no reason why it cannot be extended here.

Consider one implementation where each import statement causes a parser object to be created (one that knows how to parse that particular DSL). So, essentially when one says:

import "music.dsl";
import "sql.dsl";

the above statements result in two parser objects in memory, say parser.music and parser.sql (and of course the host / parent parser object itself is already there, making a total of 3).

Then consider a grammar definition such as:

definition: 
    music_alarm {MUSIC_STRING}

where MUSIC_STRING was previously defined in music.dsl. When the host / parent parser then encounters something like the below:

music_alarm {
  C D C G
}

it knows that music_alarm by definition expects a MUSIC_STRING to follow it (within braces {}), so it invokes parser.music to parse the content { C D C G } and moves on with the rest as is.

In other words, the host / parent parser just acts as a container of different sub-parsers and keeps delegating to the right one (based on the definition supplied by the user).

Since objects in Javascript can be extended at any time, parser.music, parser.sql and others such as parser.c++ can be created on the fly (based on the import statements).
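
The delegation described above can be sketched as follows. Everything here is an assumption made for illustration: the `loaders` table stands in for actually reading and compiling a .dsl file, the sub-parsers are trivial, and the host parser only understands the single form `keyword { body }`.

```javascript
// Hypothetical loaders: map a DSL file name to a function that builds its parser.
const loaders = {
  "music.dsl": () => (text) => ({ type: "music", notes: text.trim().split(/\s+/) }),
  "sql.dsl": () => (text) => ({ type: "sql", query: text.trim() }),
};

const parser = {};             // parser.music, parser.sql ... appear on import
const definitions = new Map(); // keyword -> name of the sub-parser to delegate to

function importDsl(file) {
  parser[file.replace(/\.dsl$/, "")] = loaders[file]();
}

function define(keyword, subParserName) {
  definitions.set(keyword, subParserName);
}

// Host parser: recognises "keyword { body }" and hands the body to a sub-parser.
function parseHost(source) {
  const m = source.match(/^\s*(\w+)\s*\{([^}]*)\}\s*$/);
  if (!m || !definitions.has(m[1])) throw new Error("no definition matches");
  const subParser = parser[definitions.get(m[1])];
  return { keyword: m[1], body: subParser(m[2]) };
}

importDsl("music.dsl");
define("music_alarm", "music"); // definition: music_alarm {MUSIC_STRING}

console.log(parseHost("music_alarm {\n  C D C G\n}"));
// { keyword: 'music_alarm', body: { type: 'music', notes: [ 'C', 'D', 'C', 'G' ] } }
```

Note that `parser.sql` does not exist until `importDsl("sql.dsl")` is called, which mirrors the on-the-fly creation described above.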

The host parser supports multiple syntaxes, but the syntaxes cannot appear just anywhere in the text. They have to follow the host syntax grammar.

For example, consider a base syntax definition file as below:

// file: base.dsl
import "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

<SQL_STATEMENT> WS <MUSIC_ALARM_STATEMENT> WS <MULTIPLE_ROBOT_STATEMENTS>

Now the host parser knows to expect SQL statements first, followed by MUSIC, followed by ROBOT statements (from the above base DSL definition), so the below example:

import "base.dsl";

select name, age from customer where age > 60

music_alarm {
  C D C G
}

activate robotic helper bob and
insert three bottles of soda to the fridge.

becomes parsable. Agreed, this is rigid, but if that is what user wants let it be.

Or else, my personal favorite (LaTeX style):

// file: base.dsl

import "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

definition: 
\Sql { SQL_STATEMENT }

definition: 
\Music { MUSIC_STATEMENT }

definition: 
\Smart { SMART_HOME_STATEMENTS }

Here, I am trying to define my own delimiters, such as \Sql, that tell the parser to expect a SQL_STATEMENT wherever a \Sql entry appears. That way one can rewrite the example as:

import "base.dsl";

\Sql { select name, age from customer where age > 60 }

\Music { 
    music_alarm {
        C D C G
    }
}

\Smart {
    activate robotic helper bob and
    insert three bottles of soda to the fridge.
}

\Sql { select something from somewhere  }

The order can be changed and the blocks can start anywhere, since they are all top-level definitions that tell the host parser which sub-parser to delegate to.
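
As a sketch of how a host parser might find these blocks, the function below scans for `\Name {` headers and tracks brace depth so that nested braces stay inside the body. `splitBlocks` is a hypothetical helper; it deliberately ignores complications such as braces inside SQL string literals.

```javascript
// Split host text into top-level "\Name { body }" blocks, tracking brace depth
// so that nested braces (as in "\Music { music_alarm { ... } }") stay in the body.
function splitBlocks(source) {
  const blocks = [];
  const header = /\\(\w+)\s*\{/g;
  let m;
  while ((m = header.exec(source)) !== null) {
    let depth = 1;
    let i = header.lastIndex; // first character after the opening brace
    while (i < source.length && depth > 0) {
      if (source[i] === "{") depth++;
      else if (source[i] === "}") depth--;
      i++;
    }
    if (depth !== 0) throw new Error(`unclosed block: \\${m[1]}`);
    blocks.push({ name: m[1], body: source.slice(header.lastIndex, i - 1).trim() });
    header.lastIndex = i; // resume scanning after the closing brace
  }
  return blocks;
}

const text = "\\Sql { select name from customer }\n\\Music { music_alarm { C D C G } }";
console.log(splitBlocks(text));
// [ { name: 'Sql', body: 'select name from customer' },
//   { name: 'Music', body: 'music_alarm { C D C G }' } ]
```

Each `name` can then be looked up in a table of sub-parsers, so the order of the blocks in the text does not matter.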

Question: how to decide which parser to activate for each line/statement?

That would be determined by the host grammar supplied by the user.

How does one know what is the host grammar?

One (reasonable) way, perhaps, is to treat whatever is the first import as the host grammar (the one that tells how to parse the rest of that file).

In other words, user code is a self-describing entity that tells (with the first import) how to parse the rest of its content (like an XML schema embedded inside an XML file).

This self-describing, machine-digestible entity concept is very important as we move from the information age towards machine learning and AI, and it is in line with the Data Package standards of the OKFN (Open Knowledge Foundation).

Imagine a csv file that carries self-describing info on what each column's datatype is (in parser terms, how to parse each column and derive the actual value).
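
To illustrate the idea only, here is a sketch in which the header row itself carries `name:type` pairs. This convention, the `converters` table and `parseSelfDescribingCsv` are all hypothetical; this is not the Data Package format, and the parsing is naive (no quoted fields or embedded commas).

```javascript
// Hypothetical convention: the header row carries "name:type" pairs that tell
// the parser how to derive the actual value of each column.
const converters = new Map([
  ["string", (s) => s],
  ["number", (s) => Number(s)],
  ["boolean", (s) => s === "true"],
]);

function parseSelfDescribingCsv(text) {
  const [header, ...rows] = text.trim().split("\n");
  const columns = header.split(",").map((cell) => {
    const [name, type = "string"] = cell.trim().split(":");
    if (!converters.has(type)) throw new Error(`unknown type: ${type}`);
    return { name, convert: converters.get(type) };
  });
  return rows.map((row) => {
    const cells = row.split(",");
    return Object.fromEntries(
      columns.map((column, i) => [column.name, column.convert(cells[i].trim())])
    );
  });
}

const records = parseSelfDescribingCsv(
  "sensor:string,temperature:number,alarm:boolean\nliving_room,26.5,true"
);
console.log(records);
// [ { sensor: 'living_room', temperature: 26.5, alarm: true } ]
```

A consumer that has never seen this file before still ends up with typed values, because the file told it how to parse itself.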

Analytics pipelines currently suffer from the lack of this info (something that the data packages and other initiatives from the OKFN aim to solve in general). IOT devices need this capability to be able to "communicate" with diverse devices and invoke each other's actions.

At Cenacle we are constantly working on bringing these programmable IOT and AI tools to a wider audience, at lower cost and with greater ease of use.

GK Palem, IOT Consultant

Published: Dec 2017
