Iro

1. Introduction

Iro is a development tool designed to simplify the creation of syntax highlighters across many platforms.

Click here to load Iro, which features an embedded sample for immediate demonstration of key features.

1.1. Syntax Highlighters

Syntax highlighters typically use rules to split text into smaller and smaller subregions, assigning each subregion a style.

Figure 1. Syntax Highlighted Text

1.2. Required Knowledge

Both Iro and this document require a working knowledge of regular expressions.

For beginners, links to tutorials are provided here.

1.3. Syntax Highlighting for Beginners

Imagine a hypothetical document containing some text.
Now imagine the first line in that document.
You wish to colour in that line of text according to some rules.
But for now, forget about colours.
What you really need is to divide the text into regions.
Once you have non-overlapping regions, you can style the text, CSS style, based on region assignments.
In Iro, these regions are called styles.
Styles can be set up with default rendering colours for the purposes of debugging.
You require one style per distinct region of text.
For example, a comment would be one region / style, a keyword would be a different style, a bit of quoted text would be a different style, a literal number might be a different style, and so on.
Now, how to start zoning?
The simplest way is to run through a list of regular expressions; the first one that matches wins and consumes some text. Once it wins, the regular expression assigns one or more style ids to grouped regions within its own match.
Then run through the list of regular expressions again starting from the position of the end of the last match.
If nothing matches, go to the next line and start again.
The "main" context is the initial list.
Basic regular expressions are :pattern objects.

An Iro Syntax Highlighter (Output Shown Next)

Here is a simple highlighter definition.

name               = sample
file_extensions [] = sample;
styles [] {
   .comments : style color = "light_green";
   .tree     : style color = "orange";
}
contexts[] {
   main : context {
      : pattern {
         regex    \= (\boak\b)
         styles [] = .tree;
      }
      : pattern {
         regex    \= (//.*)
         styles [] = .comments;
      }
   }
}

Syntax Highlighted Text

Here is some text highlighted corresponding to the above definition.

Please see the algorithm section for more detailed information.
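
The zoning loop described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Iro's implementation: the MAIN list and the highlight_line function are hypothetical names, and the rule "the earliest match wins, with list order breaking ties" is this sketch's reading of the description.

```python
import re

# Hypothetical stand-in for the "main" context of the sample definition:
# each entry pairs a regular expression with one style id per capture group.
MAIN = [
    (re.compile(r"(\boak\b)"), [".tree"]),
    (re.compile(r"(//.*)"), [".comments"]),
]

def highlight_line(line, patterns=MAIN):
    """Split one line into (text, style) regions; style is None if unstyled."""
    regions, pos = [], 0
    while pos < len(line):
        best = None
        for rx, styles in patterns:
            m = rx.search(line, pos)
            # Keep the non-empty match that starts earliest (assumption);
            # entries listed first win ties because of the strict '<'.
            if m and m.end() > m.start() and (best is None or m.start() < best[0].start()):
                best = (m, styles)
        if best is None:
            regions.append((line[pos:], None))  # nothing matched the rest of the line
            break
        m, styles = best
        if m.start() > pos:
            regions.append((line[pos:m.start()], None))  # skipped text has no style
        for group, style in enumerate(styles, start=1):
            if m.group(group):
                regions.append((m.group(group), style))
        pos = m.end()  # resume from the end of the last match
    return regions

regions = highlight_line("an oak tree // oak")
```

Here the second "oak" ends up inside the comment region, because the comment pattern starts earlier on the remaining text than the tree pattern does.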

1.4. Supported Exporters

A single Iro language definition can export to the following formats.

Format	Grammar Definition Language	Platform	Adopters
Textmate	XML	OSX Based Editor	Visual Studio Code, Eclipse (via TM4E), IntelliJ, Github
Atom	Coffeescript	Github’s own code editor, desktop app that sits on JavaScript foundations	N/A
Pygments	Python	Python Syntax Highlighting Library	N/A
Rouge	Ruby	Ruby Syntax Highlighting Library	N/A
Ace Editor	JavaScript	Code editor for web	N/A
Sublime 3	YAML	Cross platform commercial text editor	N/A

1.5. Goals

To make creating and debugging syntax highlighters as simple as is possible.
To act as a lingua franca (commonly understood) format; exporting to many different popular formats without requiring deep knowledge of the target formats.

1.6. Terminology

Term	Description
Iro Highlighting Model	The model represented by the .iro file format.
Derived Grammar	Language grammar files generated by Iro (such as .tmLanguage files)
Context / Lexical State	A lexical state is a context in which to interpret text at the current cursor position. Context and lexical state are used interchangeably in Iro.

1.7. Visualizing Regular Expressions

If you find any regular expression in this document difficult to read, then copy and paste the expression into the DebugEx tool and it will produce an excellent visualization.

For example, in the JSON example, later in the document, I use the following regular expression without explanation:

((?:\-?(?:0|(?:[1-9][0-9]*)))(?:(?:\.[0-9]*)(?:(?:e|E)(?:\+|\-)[0-9]+)?)?)

Figure 2. DebugEx in action
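
As a complement to a visual tool, the expression can also be exercised directly. The following is a small sketch using Python's standard re module (an assumption of this example; Iro itself is not involved), with the hypothetical helper is_number reporting whether a whole literal is accepted.

```python
import re

# The number pattern quoted above, copied verbatim.
JSON_NUMBER = re.compile(
    r"((?:\-?(?:0|(?:[1-9][0-9]*)))(?:(?:\.[0-9]*)(?:(?:e|E)(?:\+|\-)[0-9]+)?)?)"
)

def is_number(text):
    """True when the whole of `text` is matched by the pattern."""
    return JSON_NUMBER.fullmatch(text) is not None

candidates = ["0", "-12", "3.14", "1.5e+10", "01", "1e5"]
accepted = [s for s in candidates if is_number(s)]
```

With this pattern, "01" is rejected because of the leading zero, and "1e5" is rejected because the exponent is only allowed after a fractional part and requires an explicit sign.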

1.8. Highlighting JSON With Iro

Let’s jump forward to a fully functional example, provided for those that prefer to copy paste and modify rather than read. Everything will be explained later in the document.

If your first reaction after (potentially) understanding the syntax is that this is an over simplification of JSON - then you are right, but you are also wrong. See : perfect is the enemy of good.

#############################################################
## JSON Syntax highlighter
#############################################################

name                   = json
file_extensions []     = json;

contexts [] {
   main : context {
      : inline_push {
         regex       \= (\{)
         styles []    = .punctuation;
         : pop {
            regex    \= (\})
            styles [] = .punctuation;
         }
         : include "main" ;
      }
      : inline_push {
         regex       \= (\[)
         styles []    = .punctuation;
         : pop {
            regex    \= (\])
            styles [] = .punctuation;
         }
         : include "main" ;
      }
      : pattern {
         regex       \= (,|\:)
         styles []    = .punctuation;
      }
      : inline_push {
         regex       \= (\")
         styles []    = .punctuation;
         : pop {
            regex    \= (\")
            styles [] = .punctuation;
         }
         : pattern {
            regex    \= ((?:\\u[0-9a-f]{4})|(?:\\["\\/bfnrt]))
            styles [] = .escape_char;
         }
         : pattern {
            regex    \= ([^\"\\]+)
            styles [] = .attribute_value;
         }
      }
      : pattern {
         regex       \= ((?:\-?(?:0|(?:[1-9][0-9]*)))(?:(?:\.[0-9]*)(?:(?:e|E)(?:\+|\-)[0-9]+)?)?)
         styles []    = .special;
      }
      : pattern {
         regex       \= (true|false|null)
         styles []    = .special;
      }
      : pattern {
         regex       \= ([^\s])
         styles []    = .illegal;
      }
   }
}
styles [] {
   .escape_char : style {
      textmate_scope   = constant.character.escape
      pygments_scope   = String.Escape
      color            = #0a0
      background_color = #232
   }
   .illegal : style {
      textmate_scope   = invalid.illegal
      pygments_scope   = Generic.Error
      color            = white
      background_color = red
   }
   .punctuation : style {
      textmate_scope   = keyword.operator
      pygments_scope   = Punctuation
      color            = red_2
   }
   .attribute_value : style {
      textmate_scope   = string.unquoted
      pygments_scope   = String.Char
      color            = gold
   }
   .special : style {
      textmate_scope   = storage.type
      pygments_scope   = Keyword.Type
      color            = #6bc
   }
}

2. Typical Workflow

Use of Iro to create a new language definition (.iro file) is usually split into two distinct phases:

2.1. Phase One - Define Lexing Rules

Phase One consists of setting up some styles, and at least one lexing context. Phase One involves updating the Iro language definition file and checking preview renders until the results look good.

2.2. Phase Two - Assign Scopes & Export

Phase Two consists of assigning per-exporter scopes to the styles defined in Phase One. Phase Two involves updating the Iro Highlighting Model styles until the scope summary report is greenlit.

2.3. Flowchart

The following flowchart details a typical workflow that might be followed when using Iro to create derived grammars.

3. The Iro Highlighting Model (.iro file)

Please notice that regular expressions use the \= operator. This is a feature of the Rion object notation language described here

The .iro model is hosted in the Rion object notation language. The purpose of the Iro Highlighting Model is to act as a convenient editing and debugging format, to be used to create derived grammars for a variety of popular products.

The Iro Model is split up into 3 different sections:

Header
Styles
Contexts (also known as Lexical States)

3.1. Header

A language name and the file extensions are required at a minimum.

name                    = sample
file_extensions []      = sample1, sample2;

Iro does not enforce uniqueness of names or file extensions, so do some research before naming your language.

At least one file extension is required. This may overlap with file extensions of other languages.

In this case we provide two file extensions 'sample1' and 'sample2'. The dot is not required as a prefix for a file extension.

3.2. Styles

Styles should be set up per language concept, so that they can be colorized / stylised individually.

Styles serve two purposes.

Define a default stylesheet for a language for the purposes of debugging.
Provide a lookup table to exporter-specific scopes.

Minimal Style Definition

At minimum, only a color is required. By default, Iro uses a dark background color. This color ONLY applies to the Iro debugger and has no relation to allocating textmate (or other exporter) scopes.

styles[] {
   .sl_comment : style {
      color  = green
   }
}

A More Complete Style Definition

A fully loaded style contains mappings to fixed scope ids for each exporter. It is not required (but it is advised) to have full coverage of scopes for each exporter per style.

The scope coverage report will inform the grammar author whether they have correctly mapped internal Iro styles to all exporters' scopes without ambiguity.

styles[] {
   .sl_comment : style {
      color           = green
      textmate_scope  = comment.line
      pygments_scope  = Comment
   }
}

Ace scope can be supplied separately from TextMate scope, but it inherits the TextMate scope by default, so generally speaking the Ace scope does not need to be configured directly. The Scope Summary report will let you know if you missed something.

Advice

If no color is defined, then Iro will use the default background color and a shade of white for the text color (assuming the default dark theme; black in the case of the secondary light theme).

If you define no colors or styles, there is little purpose in using a syntax highlighter at all, so it is recommended to set up every style with some type of color or formatting.

Attributes for Styles

Here is a set of attributes that apply to styles. All attributes are optional.

Id	Description	Sample Values
color	The foreground / text color if the current theme is a dark theme	See Colors section
background_color	The background color if the current theme is a dark theme	See Colors section
color_2	The foreground / text color if the current theme is a light theme	See Colors section
background_color_2	The background color if the current theme is a light theme	See Colors section
extends	Inherit from parent style	N/A
bold	Make the current item bold (applies to both light and dark themes)	true or false (defaults to false)
italic	Make the current item italic (applies to both light and dark themes)	true or false (defaults to false)
textmate_scope	The scope to use when exporting to Atom / Textmate / Sublime 3 / tm4e. Textmate is the most commonly used syntax highlighting format, and is also supported by VSCode. If nothing is specified here, and the current style does not extend another style, then the default scope to use will be 'text'.	See Textmate Scopes section
ace_scope	The scope to use when exporting to Ace Editor. If nothing is specified here, and the current scope does not extend another scope, then the default scope will be inherited from the most significant Textmate scope (which defaults to 'text' if nothing is specified).	See Textmate Scopes section
pygments_scope	The scope to use when exporting to Pygments. If nothing is specified here, and the current style does not extend another style, then the default scope to use will be 'String'.	See Pygments Rouge Scopes section
suppress_scope_conflict_warning	If set to true, then we suppress the warning when multiple Iro internal scopes map to the same exporter scope (for any of the exporters). There may be many reasons to use multiple scopes to map to the same scope as an endpoint, therefore we provide a flag to switch off warnings when this occurs (although a lower level warning will still be available on the overview screen).	true or false (defaults to false)
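
The fallback rules in the table can be summarised as a small lookup. The sketch below is illustrative only: the dictionary layout, the resolve_scope name, and the assumption that `extends` holds the parent style's name are all hypothetical, and "most significant Textmate scope" is simplified here to "the resolved Textmate scope".

```python
# Defaults quoted from the table above.
DEFAULTS = {"textmate_scope": "text", "pygments_scope": "String"}

def resolve_scope(styles, name, attribute):
    """Look up `attribute` on a style, following `extends`, then fall back."""
    seen = set()
    current = name
    while current is not None and current not in seen:
        seen.add(current)
        style = styles[current]
        if attribute in style:
            return style[attribute]
        current = style.get("extends")  # assumed to name the parent style
    if attribute == "ace_scope":
        # Per the table, Ace falls back to the Textmate scope when unset.
        return resolve_scope(styles, name, "textmate_scope")
    return DEFAULTS[attribute]

styles = {
    ".comment": {"textmate_scope": "comment.line"},
    ".doc_comment": {"extends": ".comment", "pygments_scope": "Comment"},
    ".plain": {},
}
```

For example, resolve_scope(styles, ".doc_comment", "ace_scope") walks up to ".comment" and reuses its Textmate scope, while ".plain" falls back to the documented defaults.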

3.3. Context

A context (or lexical state) contains an ordered list of match expressions, each with an associated command that performs some operation should the regular expression match.

For example, the Java language has the keyword for. If defining a syntax highlighter for Java, you may wish to provide a style/color for keywords such as for. If there were only a single lexical state (AKA context), then for would be syntax highlighted as a keyword, even if contained in double quotes.

We take for granted that a keyword in quoted text is not a keyword - but a syntax highlighter needs to understand how to transition the state from a regular code block to quoted text.

If a regular expression matches, then typically the match cursor will move to the end of the matched region. The match cursor is the place at which the engine is currently evaluating text.

When Iro is performing a render, a context item within a context may push a new context onto the context stack or pop the current context from the stack.

contexts [] {
   main : context {
   }
}

3.4. Context Items

3.4.1. Pattern

The simplest kind of context item is the pattern matcher.

: pattern {
   regex      \= (for|if|var|function)
   styles []   = .keyword;
}

The pattern matcher simply matches from the current cursor position (to the end of the line), and if matched, assigns one or more styles to the matched text. The pattern matcher does not alter the context stack in any way.

Multiple styles may be captured in a single regular expression, but groups must never overlap or omit characters from a match. See : non-overlapping groups.

Multiple styles captured in a single 'pattern'

: pattern {
   regex      \= (group)(\s*)(\{)
   styles []   = .keyword, .whitespace, .punctuation;
}

Please see the algorithm section for how these match patterns are utilized in relation to the current line cursor.

When Iro is performing a debug render (as opposed to translating into alternative grammar formats), the match that is closest to the line cursor is the match that will be selected. Upon matching, if the match is not at offset 0 from the current line and column cursor position, the non-matched characters are emitted to the renderer without a style.

The matched text alongside corresponding styles will then be sent to a renderer, the line cursor will be moved forward, and the remainder of the line (or the start of the next line if zero characters remain on the line) will be tested against the context matchers, in order, in the same context again.

If the regular expression cannot be matched, then the next command is evaluated. If none of the regular expressions match, then the remainder of the line is emitted with no associated style and the line cursor moves to the beginning of the next line.

3.4.2. Pop

A pop operation pops the current context off the stack. Iro maintains a stack of contexts, typically traversed through pushes and pops. It is important to make sure that pushes and pops line up carefully.

: pop {
  regex      \= (")
  styles []   = .punctuation;
}

A matched pop operation will consume the matched text, emit the matched style or styles, pop the current context from the stack, and start scanning in the new context (if the context changed).

3.4.3. End of line Pop

An EOL pop is a pop (remove the current context from the stack) that occurs when the end of the current line is reached. As such, there are no required attributes.

: eol_pop {}

3.4.4. Inline Push

Inline push and pop are useful for lexing languages that are stateful. Most languages enter into different lexical states. The most common type of lexical state might be quoted text.

An inline push pushes its own internal inline context on a stack of contexts maintained by Iro (or any of the supported exporters).

There are two modes in which an inline push operates:

Default Style Provided
Additional match items provided

Two examples of an inline push follow, both handling text within quotes.

Basic Inline Push

A basic inline push uses a regular expression to denote when the engine is to enter the inline context. Once within the context, all text that is matched before the : pop {} item will be assigned to the default style, which must be provided.

In this variation, there must always exist a : pop {} item within the inline push, and no other inline items. This will be validated when loading a .iro definition file. If you forget about it, the engine will remind you.

An inline push can span any number of lines, but only one line of text at a time will ever be evaluated for a match.

: inline_push {
   regex         \= (")
   styles []      = .punctuation;
   default_style  = .quoted_text;
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
}

Complex Inline Push

Sometimes a basic inline push will not satisfy requirements. A common example of this is when handling escaped text. For example, you may wish to place a literal double quote character within double quotes.

In these cases, the basic inline push would not be sufficient.

In this variation, the first item of the inline push must be a :pop {} item. Failure to place exactly one : pop {} item within the :inline_push {} item will invoke the wrath of the compiler.

The default style attribute is not required if there are additional items AFTER the : pop {}

In this particular example, a pattern is placed after the pop to deal with escaped double quotes. Even though the pop has the top priority for matching, the earliest match always wins, so \" would be matched before ", as the backslash would be closer to the line cursor. As this pattern would then consume both the backslash and the double quote, we now have support for syntax highlighting escape characters independently within an inner lexical state / context.

Inline pushes may be embedded too, or may be contained within independent contexts and referenced via includes.

See Pitfalls section for explanation on (\\(?:\\|")) regular expression.

: inline_push {
   regex         \= (")
   styles []      = .punctuation;
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
   : pattern {
      regex      \= (\\(?:\\|"))
      styles [] = .escaped_text;
   }
   : pattern {
      regex      \= ([^"\\]+)
      styles [] = .quoted_text;
   }
}

3.4.5. Push

Push is the same type of behaviour as : inline_push except instead of entering an inline context, it enters an explicit new context.

In textmate compatibility mode (the default mode of operation), only one context may be pushed onto the stack at one time. If this compatibility mode is disabled, then it is permitted to push more than one context onto the stack upon a match occurring.

See Pitfalls section for explanation on (\\(?:\\|")) regular expression.

main : context {
   : push {
      regex         \= (")
      styles []      = .punctuation;
      context []     = inside_quotes;
   }
}

inside_quotes : context {
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
   : pattern {
      regex      \= (\\(?:\\|"))
      styles [] = .escaped_text;
   }
   : pattern {
      regex      \= ([^"\\]+)
      styles [] = .quoted_text;
   }
}
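
To make the push / pop behaviour concrete, here is a rough Python model of the two contexts above. It is a sketch under assumptions, not how Iro or any exporter is implemented: the CONTEXTS table and the tokenize function are hypothetical, only one capture group per item is handled, and a single line of text is processed.

```python
import re

# Hypothetical encoding of the two contexts above: (action, regex, style, target).
CONTEXTS = {
    "main": [
        ("push", re.compile(r'(")'), ".punctuation", "inside_quotes"),
    ],
    "inside_quotes": [
        ("pop", re.compile(r'(")'), ".punctuation", None),
        ("pattern", re.compile(r'(\\(?:\\|"))'), ".escaped_text", None),
        ("pattern", re.compile(r'([^"\\]+)'), ".quoted_text", None),
    ],
}

def tokenize(line, contexts=CONTEXTS):
    """Return (text, style) tokens for one line, tracking a context stack."""
    stack, pos, tokens = ["main"], 0, []
    while pos < len(line):
        best = None
        for action, rx, style, target in contexts[stack[-1]]:
            m = rx.search(line, pos)
            # Earliest match wins; items listed first win ties.
            if m and m.end() > m.start() and (best is None or m.start() < best[0].start()):
                best = (m, action, style, target)
        if best is None:
            tokens.append((line[pos:], None))  # rest of the line is unstyled
            break
        m, action, style, target = best
        if m.start() > pos:
            tokens.append((line[pos:m.start()], None))
        tokens.append((m.group(1), style))
        if action == "push":
            stack.append(target)       # enter the named context
        elif action == "pop" and len(stack) > 1:
            stack.pop()                # return to the previous context
        pos = m.end()
    return tokens

tokens = tokenize(r'"o\\ne\""')
```

Because the escape pattern starts at the backslash, it is selected ahead of the pop even though the pop is listed first, so the escaped quote does not end the quoted region.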

3.4.6. Include

The 'include' context item allows the contents of other contexts to be referenced and imported as if they were part of the current context.

That is, if we reference a context with ten context items, those ten context items will be copied and pasted into the current context.

Recursive includes loops are flagged by the compiler.

: include "comments" ;

Sample Use Of Include

contexts[] {
    main : context {
        : pattern {
            regex         \= (if|then|something)
            styles[]       = .keyword;
        }
        : include "comments" ;
    }
    comments : context {
        : pattern {
            regex         \= (//.*)
            styles[]       = .comments;
        }
        : inline_push {
            regex         \= (/\*)
            styles []      = .comments;
            default_style  = .comments;
            : pop {
                regex      \= (.*?\*/)
                styles []   = .comments;
            }
        }
    }
}
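
The "copy and paste" behaviour of include, together with the check for include loops, might be modelled as follows. This is a hedged sketch: the list-of-items representation and the expand function are hypothetical, and includes nested inside inline pushes are deliberately left out of scope.

```python
def expand(contexts, name, _active=()):
    """Return the items of context `name` with every include spliced in."""
    if name in _active:
        chain = " -> ".join(_active + (name,))
        raise ValueError(f"recursive include: {chain}")
    items = []
    for item in contexts[name]:
        if isinstance(item, tuple) and item[0] == "include":
            # Splice the referenced context's items in place of the include.
            items.extend(expand(contexts, item[1], _active + (name,)))
        else:
            items.append(item)
    return items

contexts = {
    "main": ["keyword_pattern", ("include", "comments")],
    "comments": ["line_comment_pattern", "block_comment_push"],
}
flattened = expand(contexts, "main")
```

A context that includes itself, directly or through another context, raises an error in this sketch instead of expanding forever.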

3.5. Advanced Techniques

3.5.1. Constants

In Iro, any attribute that starts with two underscores is deemed to be a constant.

Constants are referenced using the $${} syntax. Any number of constants may be utilized as long as their names start with two underscore characters.

Constants can be useful for adhering to the DRY (don't repeat yourself) principle.

__id = [a-zA-Z][0-9a-zA-Z]*(?:_[0-9a-zA-Z]+)*

: pattern {
   regex      \= (\^$${__id})
   styles []   = .varname;
}

: pattern {
   regex      \= (\$$${__id})
   styles []   = .globalvarname;
}

: pattern {
   regex      \= (\%$${__id})
   styles []   = .functionname;
}
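
The substitution can be pictured as a plain textual replacement performed before the regular expression is compiled. The sketch below assumes exactly that (the source does not spell out the mechanism), and the names CONSTANT_REF and expand_constants are hypothetical.

```python
import re

# Matches a $${__name} reference; the name must start with two underscores.
CONSTANT_REF = re.compile(r"\$\$\{(__\w+)\}")

def expand_constants(text, constants):
    """Replace each $${__name} reference with the constant's value."""
    # A function replacement keeps any backslashes in the value literal.
    return CONSTANT_REF.sub(lambda m: constants[m.group(1)], text)

constants = {"__id": r"[a-zA-Z][0-9a-zA-Z]*(?:_[0-9a-zA-Z]+)*"}
varname = expand_constants(r"(\^$${__id})", constants)
globalvarname = expand_constants(r"(\$$${__id})", constants)
```

Note that in the second pattern the leading \$ is an escaped literal dollar sign, and only the following $${__id} is treated as the constant reference.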

3.5.2. Bracket Matching

It is possible to push the same context on the stack multiple times in a row. This technique can be used to highlight unmatched brackets.

For example, consider the following grammar:

Bracket Matching Grammar

Sample code to implement bracket matching in grammar.

name               = sample
file_extensions [] = sample;

contexts[] {
    main : context {
      : pattern {
          regex   \= ([a-z]+)
          styles[] = .keyword;
      }
      : inline_push {
         regex         \= (\{)
         styles []      = .punctuation;
         : pop {
            regex      \= (\})
            styles []   = .punctuation;
         }
         : include "main" ;
      }
      : pattern {
          regex   \= (\})
          styles[] = .illegal;
      }
    }
}
styles[] {
   .punctuation : style {
      color            = red_2
   }
   .keyword : style {
      color            = gold
   }
   .illegal : style {
      color            = white
      background_color = red
   }
}

Bracket Matching Result

The result of the above grammar.
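
The effect of the grammar can be approximated with a simple depth counter, where the depth stands in for the number of inline contexts on the stack. This is only an analogy written in Python; classify_braces is a hypothetical helper and it ignores everything except the braces.

```python
def classify_braces(text):
    """Label each brace as '.punctuation' or '.illegal' using a depth counter."""
    depth, labels = 0, []
    for ch in text:
        if ch == "{":
            depth += 1  # like the inline_push: enter a nested context
            labels.append((ch, ".punctuation"))
        elif ch == "}" and depth > 0:
            depth -= 1  # like the pop: leave the nested context
            labels.append((ch, ".punctuation"))
        elif ch == "}":
            labels.append((ch, ".illegal"))  # no open context left to pop
    return labels

labels = classify_braces("a { b } }")
```

The final closing brace has no matching opener, so it is the only one labelled .illegal.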

3.5.3. Perfect is the enemy of good

It is important to note that the prior JSON sample is not attempting to enforce the rules of the language through highlighting, beyond matched bracket highlighting (which is trivial to achieve).

By way of example, the following snippet would be syntax highlighted without error, even though it is invalid JSON:

The JSON grammar described earlier is an approximation and not intended to model the nuances of what is and is not possible in JSON. It is common practise for syntax highlighters to have undefined behaviour when faced with improper grammar.

By all means, build your model as well as you possibly can, but be aware: perfect parsing of a language may be an order of magnitude more complex, for very little benefit. It is the job of the parser to highlight errors in the source, not the job of the syntax highlighter.

3.6. Pitfalls

Iro requires that every character defined in a regular expression belongs to a single group (be between ()).

Regular expressions may have one or more groups, but never zero, and no part of a match should ever belong to no group. Inner groupings are allowed but every matched character should belong to exactly one group.

3.6.1. Non-grouped match characters

: pattern {
   regex      \= (group)\s*(\{)
   styles []   = .keyword, .punctuation;
}

The snippet above shows an invalid regular expression. The '\s*' part of the regular expression does not belong to a match group (a bracketed region). This means that, when emitting tokens, Iro (and other engines) would not emit the characters matched by this part of the expression, so the ungrouped characters would be lost.

The compiler will detect non matched characters.

3.6.2. Mismatch between regex capture groups and styles

: pattern {
   regex      \= (group)(\s*)(\{)
   styles []   = .keyword, .punctuation;
}

The snippet above defines three capture groups in the regular expression (correctly), but only two styles.

The compiler will flag this.

3.6.3. Overlapping Matches

Overlapping matches are a "blind-spot" in the compiler. Please take care.

In the snippet shown below, 4 groups are defined. 3 outer groups, and one inner group. The problem here is that 'ou' is grouped twice, therefore will be emitted twice.

Although the compiler does perform a check that the number of match groups in the regular expression corresponds to the number of supplied styles, the compiler does not detect overlapping matches in regular expressions, so please be aware of the overlapping match scenario.

See Use of non-capture groups for information on how to resolve.

: pattern {
   regex      \= (gr(ou)p)(\s*)(\{)
   styles []   = .keyword, .keyword, .whitespace, .punctuation;
}
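
The three pitfalls above (ungrouped characters, a group / style count mismatch, and overlapping groups) can all be probed with a small check against a sample string. This sketch uses Python's re module and is not the Iro compiler's check: check_pattern is a hypothetical helper, and because it inspects one concrete match it only finds problems that the chosen sample text actually exercises.

```python
import re

def check_pattern(regex, styles, sample):
    """Return a list of problems found when matching `regex` against `sample`."""
    compiled = re.compile(regex)
    problems = []
    if compiled.groups != len(styles):
        problems.append(f"{compiled.groups} groups but {len(styles)} styles")
    m = compiled.search(sample)
    if m is None:
        return problems + ["sample text did not match"]
    cursor = m.start()
    for index in range(1, compiled.groups + 1):
        start, end = m.span(index)
        if start == -1:
            continue  # this group did not take part in the match
        if start < cursor:
            problems.append(f"group {index} overlaps an earlier group")
        elif start > cursor:
            problems.append(f"ungrouped text before group {index}")
        cursor = max(cursor, end)
    if cursor < m.end():
        problems.append("ungrouped text at end of match")
    return problems

ungrouped = check_pattern(r"(group)\s*(\{)", [".keyword", ".punctuation"], "group {")
mismatch = check_pattern(r"(group)(\s*)(\{)", [".keyword", ".punctuation"], "group {")
overlap = check_pattern(
    r"(gr(ou)p)(\s*)(\{)",
    [".keyword", ".keyword", ".whitespace", ".punctuation"],
    "group {",
)
```

Each of the three calls reports exactly one problem, matching the pitfall it was taken from.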

3.6.4. Bad Implementation of Escape Characters

A common requirement is to use an inline push (or push) to deal with quoted text.

When dealing with literal text, there is invariably some kind of escape character handling that must occur. When dealing with escape characters, it is important to be aware of greedy regular expressions that may steal the escape character.

In the below example, we enter into a literal text state upon encountering a " character. We flag that upon encountering another " character we should leave this lexical state (aka context).

But we also wish to be able to place quotes inside the quoted block without it being a match that triggers a pop. In most languages, the backslash \ character is employed as a prefix to escape the quote so that the parser (and the syntax highlighter) knows to treat the \" pair as an escaped pair of characters.

Naive Escape Handling (1)

A naive way to implement this logic is to define a rule for handling the \" pair, and define a rule that handles everything else.

: inline_push {
   regex         \= (")
   styles []      = .punctuation;
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
   : pattern {
      regex      \= (\\")
      styles [] = .escaped_text;
   }
   : pattern {
      regex      \= ([^"]+)
      styles [] = .quoted_text;
   }
}

The problem here is that the ([^"]+) expression will consume trailing backslashes. Meaning that this text "one\"" will be tokenized as follows:

Input = "one\""
#	Matched Style	Chars Consumed	Details
1	.punctuation	"	Enter inline context
2	.quoted_text	one\	Stay in same context - FAIL HERE
3	.punctuation	"	Pop out of context
4	UNKNOWN	"	UNKNOWN

Failed.
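
The greedy behaviour behind this failure is easy to reproduce outside Iro. The following sketch uses Python's re module on the text that follows the opening quote; the variable names are illustrative only.

```python
import re

# The text that follows the opening quote of "one\"" (one, backslash, two quotes).
sample = r'one\""'

# The catch-all from the naive definition also swallows the backslash.
greedy = re.match(r'([^"]+)', sample).group(1)

# Excluding the backslash leaves it available for the escape pattern.
careful = re.match(r'([^"\\]+)', sample).group(1)
```

Here greedy contains the trailing backslash, which is why the following quote is then treated as the closing quote, while careful stops after "one".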

Naive Escape Handling (2)

So, let’s stop the bottom pattern from picking up the backslash character:

: inline_push {
   regex         \= (")
   styles []      = .punctuation;
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
   : pattern {
      regex      \= (\\")
      styles [] = .escaped_text;
   }
   : pattern {
      regex      \= ([^"\\]+)
      styles [] = .quoted_text;
   }
}

Input = "one\""
#	Matched Style	Chars Consumed	Details
1	.punctuation	"	Enter inline context
2	.quoted_text	one	Stay in same context
3	.escaped_text	\"	Stay in same context
4	.punctuation	"	Pop out of context (PASS)

Passed.

Seems good, but what if we want to be able to place a literal backslash in the text too? We’d need to escape the backslash with a backslash. Let’s try "o\\ne\""

Input = "o\\ne\""
#	Matched Style	Chars Consumed	Details
1	.punctuation	"	Enter inline context
2	.quoted_text	o	Stay in same context
3	.escaped_text	\\ne\""	NO MATCH

Failed.

Robust Escape Handling

So this time, let’s support 3 items:

A quote
A backslash followed by a backslash or a quote
Any text that is not a backslash or a quote

: inline_push {
   regex         \= (")
   styles []      = .punctuation;
   : pop {
      regex      \= (")
      styles []   = .punctuation;
   }
   : pattern {
      regex      \= (\\(?:\\|"))
      styles [] = .escaped_text;
   }
   : pattern {
      regex      \= ([^"\\]+)
      styles [] = .quoted_text;
   }
}

Input = "one\""
#	Matched Style	Chars Consumed	Details
1	.punctuation	"	Enter inline context
2	.quoted_text	one	Stay in same context
3	.escaped_text	\"	Stay in same context
4	.punctuation	"	Pop out of context (PASS)

Passed.

Input = "o\\ne\""
#	Matched Style	Chars Consumed	Details
1	.punctuation	"	Enter inline context
2	.quoted_text	o	Stay in same context
3	.escaped_text	\\	Stay in same context
4	.quoted_text	ne	Stay in same context
5	.escaped_text	\"	Stay in same context
6	.punctuation	"	Pop out of context (PASS)

Passed.
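
The difference between the first attempt and the robust rules can be reproduced with a small simulation. This is a minimal Python sketch, not Iro's implementation: the names scan_quoted, SIMPLE_RULES and ROBUST_RULES are invented for illustration, and it assumes a rule must match exactly at the cursor (a simplification that is consistent with the NO MATCH row in the failing trace).

```python
import re

# Rules tried in order inside the quoted-text context: (regex, style, action).
# SIMPLE_RULES mirrors the first attempt; ROBUST_RULES mirrors the robust version.
SIMPLE_RULES = [
    (re.compile(r'(")'), ".punctuation", "pop"),
    (re.compile(r'(\\")'), ".escaped_text", "stay"),
    (re.compile(r'([^"\\])'), ".quoted_text", "stay"),
]
ROBUST_RULES = [
    (re.compile(r'(")'), ".punctuation", "pop"),
    (re.compile(r'(\\(?:\\|"))'), ".escaped_text", "stay"),
    (re.compile(r'([^"\\]+)'), ".quoted_text", "stay"),
]


def scan_quoted(text, rules):
    """Scan text that starts with an opening quote.

    Returns (tokens, ok). ok is True only if the closing quote pops the
    context exactly at the end of the text.
    """
    assert text.startswith('"')
    tokens = [(".punctuation", '"')]  # the inline_push match
    pos = 1
    while pos < len(text):
        for regex, style, action in rules:
            m = regex.match(text, pos)  # assumption: must match at the cursor
            if m:
                tokens.append((style, m.group(1)))
                pos = m.end()
                break
        else:
            return tokens, False  # no rule matched at the cursor
        if action == "pop":
            return tokens, pos == len(text)
    return tokens, False  # ran out of text without a closing quote


if __name__ == "__main__":
    sample = r'"o\\ne\""'  # raw string: the text contains real backslashes
    print(scan_quoted(sample, SIMPLE_RULES)[1])
    for style, chars in scan_quoted(sample, ROBUST_RULES)[0]:
        print(style, chars)
```

With the robust rules, the six emitted tokens line up with the six rows of the passing trace above; with the simple rules, the scan stops at the first backslash.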

Use of non-capture groups

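This is a good demonstration of how to avoid overlapping (nested) capture groups by using the ?: (non-capturing group) notation in a regular expression. By default, every pair of parentheses () captures a group; to disable capturing, place ?: at the start of the group. Compare the following two expressions: the first contains a nested capturing group, the second replaces it with a non-capturing group.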

(\\(\\|"))

(\\(?:\\|"))
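
The effect of ?: can be checked in any regular expression engine. The following is a small Python sketch (Python's re module is used here only as a convenient test bench; it is not part of Iro) that counts the capture groups in both expressions.

```python
import re

nested = re.compile(r'(\\(\\|"))')    # inner group also captures
flat = re.compile(r'(\\(?:\\|"))')    # inner group is non-capturing

print(nested.groups)  # 2
print(flat.groups)    # 1

# Both expressions match the same text; only the captured groups differ.
escaped_quote = r'\"'
print(nested.match(escaped_quote).groups())
print(flat.match(escaped_quote).groups())
```

With one capture group, one style in the styles list is enough to cover the whole match.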

4. Scope Coverage Report

The scope coverage report is used to validate that every defined style has been mapped to an exporter scope name, and that each style has been mapped to a unique scope name.

Failure to map styles correctly to external scope names will not cause the export to fail; instead, an advisory is issued at the time of definition file creation.

A file named "iro_scope_coverage_report.html" will be produced alongside all of the exporter text files.
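
The two checks described above (every style has a scope, and no two styles share a scope) can be sketched in a few lines of Python. This is an illustration only: check_scope_coverage and the dictionary layout are assumptions made for this example and do not describe how Iro produces its report.

```python
from collections import defaultdict


def check_scope_coverage(styles, scope_keys=("textmate_scope", "pygments_scope")):
    """Return a list of advisory messages for a dict of style name -> attributes."""
    problems = []
    for key in scope_keys:
        seen = defaultdict(list)  # scope name -> styles that use it
        for name, attrs in styles.items():
            scope = attrs.get(key)
            if not scope:
                problems.append(f"{name}: missing {key}")
            else:
                seen[scope].append(name)
        for scope, names in seen.items():
            if len(names) > 1:
                problems.append(f"{key} '{scope}' shared by {', '.join(names)}")
    return problems


if __name__ == "__main__":
    styles = {
        ".escape_char": {"pygments_scope": "String.Escape"},  # textmate scope removed
        ".punctuation": {"textmate_scope": "keyword.operator",
                         "pygments_scope": "Punctuation"},
    }
    for message in check_scope_coverage(styles):
        print(message)
```

An empty result corresponds to the "No Problems Detected" case.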

4.1. Success Report

The following scope report corresponds to the JSON Sample shown earlier.

Look for "No Problems Detected"

4.2. Failure Reports

4.2.1. Style Without A Scope

The following scope report corresponds to the JSON Sample shown earlier except we deleted the textmate scope from the '.escape_char' style.

Here is the modified style:

.escape_char : style {
  #textmate_scope   = constant.character.escape
  pygments_scope   = String.Escape
  color            = #0a0
  background_color = #232
}

Here is the resultant report:

4.2.2. Multiple Styles Without A Scope

The following scope report corresponds to the JSON Sample shown earlier except we deleted the textmate scope from the '.escape_char' and '.punctuation' styles.

Here are the modified styles:

.escape_char : style {
  #textmate_scope   = constant.character.escape
  pygments_scope   = String.Escape
  color            = #0a0
  background_color = #232
}
.punctuation : style {
  #textmate_scope   = keyword.operator
  pygments_scope   = Punctuation
  color            = red_2
}

Here is the resultant report:

4.2.3. Two styles with the same scope

The following scope report corresponds to the JSON Sample shown earlier except we modified the '.punctuation' style to have the same textmate scope as the '.escape_char' style.

Here is the modified style:

.punctuation : style {
  textmate_scope   = constant.character.escape
  pygments_scope   = Punctuation
  color            = red_2
}

Here is the resultant report:

5. Tutorial

These tutorials require good (but not expert) knowledge of regular expressions. If you are not familiar with regular expressions and you work in development, then it would be the single best use of your time to learn them.

Consider the following code snippet of a language we wish to syntax highlight (you may recognise this snippet from the introduction section).

say "hello"; // This is a comment

5.1. Phase One - Define Lexing Rules

5.1.1. Analysis

First we should identify the initial state and the items we would like to syntax highlight in the initial state.

In this first example, there is only one state (possibly two if we view quoted text as its own state).

Here are the things we wish to syntax highlight in the initial state. This does not necessarily need to be a complete list. This process is iterative.

So, let’s start with:

Keywords (such as 'say')
Semicolon (';')
Comments

5.1.2. Boilerplate

We start with a small amount of boilerplate, which can be copied and pasted from this section of the document.

It incorporates the 3 significant sections - header, styles, contexts.

########################################
## Header
########################################

name                    = tutorial1
file_extensions []      = tutorial1;

########################################
## Styles
########################################

styles[] {

}

########################################
## Contexts
########################################

contexts[] {
   main : context {

   }
}

5.1.3. Define Styles

From the 3 categories above, plus quoted text, plan to create 4 styles. Remember, style names start with a period/dot (.) character by convention. This was a questionable design decision by yours truly, and it keeps me awake at night.

########################################
## Styles
########################################

styles[] {
   .keyword    : style {
       color = yellow
   }
   .punctuation : style {
       color = orange
   }
   .quoted_text : style {
       color = cyan
   }
   .comment    : style {
       color = grey
   }
}

Iro currently only supports the dark theme for debugging rendering (render against a dark backdrop).

5.1.4. Define Simple Regular Expression Patterns

########################################
## Contexts
########################################

contexts[] {
   main : context {
      : pattern {
         regex   \= (say|ask)
         styles[] = .keyword;
      }

      : pattern {
         regex   \= (;)
         styles[] = .punctuation;
      }
   }
}

Progress so far

5.1.5. Using Inline Push For Quoted Text

Now we create an inline push rule to handle quoted text. We enter an inline lexical state upon encountering the " character. We only pop out of the state when we encounter another " character. All text handled before the closing " character is assigned the default style of ".quoted_text".

: inline_push {
   regex \= (")
   styles[] = .punctuation;
   default_style = .quoted_text
   : pop {
       regex   \= (")
       styles[] = .punctuation;
   }
}

Progress so far

5.1.6. Handling single line comments

This is a fairly simple rule. It looks for // and then consumes everything after it. Remember, we are using a line-based matcher, so .* only takes us to the end of the line.

: pattern {
   regex \= (//.*)
   styles[] = .comment;
}
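
To see why (//.*) stops at the end of the line, the same expression can be tried outside Iro. The following Python sketch is only an illustration (the second line of the sample text is made up): in Python, as in most engines, '.' does not match a newline by default, which gives the same result as a line-based matcher.

```python
import re

comment = re.compile(r'(//.*)')

# Two-line sample; the second line is hypothetical and only shows where .* stops.
source = 'say "hello"; // This is a comment\nsay "bye";'

match = comment.search(source)
print(match.group(1))  # // This is a comment
```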

Progress so far - Phase 1 Complete

5.1.7. Debugging

The debug render of the document contains a lot of useful debugging information. Every single piece of text and whitespace has a hover which displays the current style and stack for the current location.

The diagram below displays which rules have been executed in which order, using line numbers; this is very useful in larger models.

Performing mouse hover in debug render

.iro Snippet Corresponding To Debug Snippet

Reading the stack

In the above image, hovering over the double quote on the right shows the matched style for the matched region (the " following hello). The matched style in this example is ".punctuation".

As most context items in Iro are anonymous, the stack trace uses the ANON alias to represent unidentified context items.

The stack reads "main::32 → ANON::42 → ANON::46"

All stack traces will start inside the main context. The number after main represents the line number in the corresponding .iro file where to find the current context or context item.

Breaking the stack trace down further:

main::32 → Whilst in the main context
ANON::42 → We match the regular expression (\") which puts us into the inline_push context.
ANON::46 → Whilst in the context of inline push (line 42) we encounter a matching regular expression (\") for popping back into the main context.

For the ';' character that follows the region shown in the screenshot, we can infer that the stack trace would be as follows:

main::32 → Whilst in the main context
ANON::38 → Match the (;) regular expression, without changing context.

5.1.8. Assigning ids to context items (UIDs)

In order to make stack traces easier to read, unique identifiers (UIDs) can be assigned to context items. Contexts already have ids.

UIDs must be unique across all declared UIDs in the .iro file and must start with 'uid_'. This makes it clear whether a stack trace element is a context or a context item (as context names are not permitted to start with the 'uid_' prefix).

: pattern {
   uid      = uid_single_line_comment
   regex   \= (//.*)
   styles[] = .comment;
}
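
The UID rules above amount to three simple checks. The following Python sketch spells them out; validate_uids and its error messages are hypothetical and are not part of Iro.

```python
def validate_uids(uids, context_names):
    """Return a list of rule violations for the given UIDs and context names."""
    errors = []
    seen = set()
    for uid in uids:
        if not uid.startswith("uid_"):
            errors.append(f"{uid}: UIDs must start with 'uid_'")
        if uid in seen:
            errors.append(f"{uid}: duplicate UID")
        seen.add(uid)
    for name in context_names:
        if name.startswith("uid_"):
            errors.append(f"{name}: context names may not start with 'uid_'")
    return errors


if __name__ == "__main__":
    print(validate_uids(["uid_single_line_comment"], ["main"]))  # []
```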

5.2. Phase Two - Assign Scopes & Export

Phase 2 involves assigning exporter scopes to styles so that, when derived grammars are generated, they carry enough information to be able to syntax highlight.

The first thing we do is to review the scope coverage report:

5.2.1. Add in some scopes to existing styles

Adding a textmate scope and a pygments scope is required to obtain the green light (at present). Ace inherits from textmate, and Rouge inherits from Pygments.

   .keyword    : style {
       color          = yellow
       textmate_scope = keyword
       pygments_scope = Keyword
   }
   .punctuation : style {
       color = orange
       textmate_scope = punctuation
       pygments_scope = Punctuation
   }

And here is the report:

5.2.2. Obtain the green light

We assign textmate and pygments scopes to the final two styles. Now we have full scope coverage across all supported exporters.

   .quoted_text : style {
       color          = cyan
       textmate_scope = string
       pygments_scope = String
   }
   .comment    : style {
       color = grey
       textmate_scope = comment
       pygments_scope = Comment
   }

And here is the report:

5.2.3. Generate Exporters

If running from the command line, the created syntax highlighting definition files are available in the 'iro-gen' folder (they are generated at the same time as debug builds).

If running from the web UI, select the exporter specific tab to see the appropriate exporters. The download button may be used to download all generated exporters in a zipfile.

It is not in the scope of this document to describe how to import these files into various editors/IDEs/Frameworks.

TextMate (.tmLanguage file)
Sublime 3 (YAML)
Atom (CoffeeScript)
Rouge (Ruby)
Pygments (Python)
Ace Editor (JavaScript)

6. Syntax Highlighting Algorithm

6.1. Line-Based Matching

The syntax highlighting algorithm is quite common across technologies. Iro uses line-based matching. A regular expression is not permitted to match across multiple lines. The start of a line matches

Pygments and Rouge typically offer full document based matching, which is more flexible. Compatibility between line based matching grammars and document based matching grammars is handled with logic in the exporters. All regular expressions should be written to assume line-based matching.

6.2. Walkthrough

In this visualization of the algorithm, the cursor will be represented by the '^' char. It should not be considered part of the document being evaluated. Please do not confuse it with the start of line character in a regular expression.

Consider a sample document containing the following text:

Phase 0 (prior to matching)

^my document ; my document ; x
my documen

Defined Regular expressions (relating to sample)

Imagine that we have just one context defined, and that the context has the following three regular expressions:

(my)
(document)
(;)

The above listed regular expressions are defined via the 'pattern' command within a context in the Iro file format. For the purposes of detailing the scanning algorithm, we only need to show the regular expressions themselves.

For the current context, from the current cursor position to the end of the line, evaluate a list of regular expressions, and if matched, do something, and move the cursor to the end of the match.

Scan Begins

We attempt to match the entire first line 'my document ; my document ; x' using the first regular expression (my), and good news, it matches. The match does not need to cover the whole line; it just needs to match somewhere on the line.

If the match does not start exactly at the cursor position, then the other regular expressions are evaluated and the regular expression that matches closest to the cursor is nominated the winner. The longest match does not win; the match that starts closest to the cursor wins (and consumes the matched characters).

Phase 1 (after first match)

my^ document ; my document ; x
my documen

The cursor has moved to the end of 'my' in the document. Notice that the whitespace has not been consumed after 'my'. The operation that is performed on the consumed text is irrelevant to the algorithm, but will be described later on. Assume that regular expression matches are simply consuming text then moving a virtual cursor.

Now the remainder of the current line contains ' document ; my document ; x' (without quotes). We now test against the first regular expression '(my)'. It matches! But the match is at an offset of 12 within the remainder of the line being evaluated. If the offset of the match is greater than 0, then the matching continues.

Now we compare the regular expression (document) against ' document ; my document ; x'. It matches. This time the offset is just 1. Still not zero. So we perform another regular expression test. This time we test for (;). It matches - but at offset 10.

So now we have 3 matches, but the match with the closest offset will be chosen. That is the 2nd regular expression (document). The length of the match is 8, and the offset of the match is 1. So the first character is flagged as unmatched, and the cursor is moved along 9 characters.

Phase 2 (after second match)

my document^ ; my document ; x
my documen

Now we are evaluating the string ' ; my document ; x'. None of the 3 matches start from offset 0, but the (;) regular expression is at offset 1. So we skip the space again, and consume the ';' matched character. The cursor moves along 2 characters.

Phase 3 (after third match)

my document ;^ my document ; x
my documen

Now we are evaluating the string ' my document ; x'. None of the 3 matches start from offset 0, but the (my) regular expression is at offset 1. So we skip the space again, and consume the 'my' matched characters. The cursor moves along 3 characters.

Phase 4 (after fourth match)

my document ; my^ document ; x
my documen

Now we are evaluating the string ' document ; x'. Neither of the 2 matches starts from offset 0, but the (document) regular expression is at offset 1. So we skip the space again, and consume the 'document' matched characters. The cursor moves along 9 characters.

Phase 5 (after fifth match)

my document ; my document^ ; x
my documen

Now we are evaluating the string ' ; x'. There is just one match. The (;) regular expression is at offset 1. So we skip the space again, and consume the ';' matched character. The cursor moves along 2 characters.

Phase 6 (after sixth match)

my document ; my document ;^ x
my documen

Now we are evaluating the string ' x'. Nothing matches at all. The remainder of the line is skipped, and the cursor moves down to the next line.

Phase 7 (after moving line cursor down)

my document ; my document ; x
^my documen

Now we are evaluating the string 'my documen' (on the second line). The first regular expression (my) matches at offset 0, so no more expressions need to be evaluated. The cursor moves forward two characters.

Phase 8

my document ; my document ; x
my^ documen

Now we evaluate the string ' documen'. None of the three regular expressions match, so the remaining characters are skipped to the end of the line.

Phase 9 (End of file)

my document ; my document ; x
my documen^

We have reached the end of the document.
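
The walkthrough above can be condensed into a short simulation. This is a minimal Python sketch of the "closest match to the cursor wins" rule for a single context; scan_line and PATTERNS are names invented for this example, ties are assumed to go to the expression listed first, and the sketch is not Iro's actual engine.

```python
import re

# The three regular expressions from the walkthrough, in declaration order.
PATTERNS = [re.compile(p) for p in (r'(my)', r'(document)', r'(;)')]


def scan_line(line, patterns=PATTERNS):
    """Return a list of (text, pattern) events; pattern is None for skipped text."""
    events = []
    cursor = 0
    while cursor < len(line):
        best = None
        for regex in patterns:
            m = regex.search(line, cursor)
            # Ignore empty matches; keep the match that starts closest to the cursor.
            if m and m.end() > m.start() and (best is None or m.start() < best.start()):
                best = m
                if m.start() == cursor:
                    break  # offset 0: later expressions need not be evaluated
        if best is None:
            events.append((line[cursor:], None))  # nothing matches: skip the rest
            break
        if best.start() > cursor:
            events.append((line[cursor:best.start()], None))  # skipped characters
        events.append((best.group(0), best.re.pattern))
        cursor = best.end()
    return events


if __name__ == "__main__":
    for line in ("my document ; my document ; x", "my documen"):
        print(scan_line(line))
```

Events whose pattern is None correspond to skipped text, such as the single spaces, ' x' and ' documen' in the walkthrough.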

6.2.1. Skipped characters

When the scanner skips characters (like the ' x', ' ' and ' documen' in the walkthrough), these characters are emitted with no style associated with them.

All matches emit events, and these events are handled by the highlighter.

Iro renders text for the purposes of debugging, but its algorithm simulates the textmate and Ace editor algorithm (derived from observation).

7. Appendices

7.1. Java API

The Java API is subject to change. It will only render using the default style defined in the .iro file.

For full stack trace debugging, call .setDebug(true).setConcise(false) on the HtmlRenderer object.

import java.io.File;
import java.io.IOException;

import xyz.iroiro.iro.Iro;
import xyz.iroiro.iro.engine.HtmlRenderer;
import xyz.iroiro.iro.engine.IroTokenizer;
import xyz.iroiro.iro.engine.LanguageModel;
import xyz.iroiro.iro.engine.Renderer;
// InvalidModelException must also be imported from the Iro library.

public class YourClass {

    // Renders the given text as HTML using the supplied .iro definition file.
    public static String render(final String text) throws InvalidModelException, IOException {

        final File iroDefinitionFile =  ... ;

        final HtmlRenderer renderer = new HtmlRenderer().setDebug(true).setConcise(true);

        new IroTokenizer(){
            @Override
            public String getText() { return text; }

            @Override
            public LanguageModel getLanguageModel() { return Iro.loadLanguageModel(iroDefinitionFile); }

            @Override
            public Renderer getRenderer() { return renderer; }

        }.execute();

        // Will contain escaped html <pre> block, does not require any css to render.
        return renderer.toString();
    }
}

7.2. Lookup Tables

Iro supports various tokens in the styles section. Acceptable values for those tokens are described in this section.

7.2.1. Textmate/Atom/Ace Scopes

Here is a list of common Textmate scopes. This list is advisory. Other scopes may be supplied. The Textmate website contains some additional information on how to assign these scopes.

Ace editor generally mirrors the scopes used by Textmate in order to be semi-compatible using an adapted version of the Textmate lexing algorithm.

If assigning a Textmate scope to a style, then the Ace scope inherits the same scope. If an Ace scope is explicitly set, then this overrides the inheritance.

comment
comment punctuation
comment.block.documentation
comment.block.preprocessor
comment.documentation
constant
constant.character
constant.character punctuation
constant.character.entity
constant.character.escape
constant.language
constant.language punctuation
constant.numeric
constant.numeric punctuation
constant.numeric.line-number.find-in-files
constant.other
constant.other punctuation
constant.other.color
constant.other.symbol
entity
entity.name
entity.name.class
entity.name.class punctuation
entity.name.exception
entity.name.filename
entity.name.filename.find-in-files
entity.name.function
entity.name.function punctuation
entity.name.section
entity.name.tag
entity.name.tag punctuation
entity.name.tag.css
entity.name.type
entity.name.type.class
entity.other
entity.other.attribute-name
entity.other.attribute-name punctuation
entity.other.attribute-name.id
entity.other.inherited-class
entity.other.inherited-class punctuation
invalid
invalid.deprecated
invalid.illegal
keyword
keyword punctuation
keyword.control
keyword.operator
keyword.other.special-method
keyword.other.unit
markup
markup.bold
markup.changed
markup.deleted
markup.deleted punctuation
markup.error
markup.heading
markup.heading punctuation.definition.heading
markup.inserted
markup.inserted punctuation
markup.italic
markup.list
markup.output
markup.prompt
markup.quote
markup.raw
markup.raw.inline
markup.traceback
markup.underline
message.error
meta
meta.class
meta.diff
meta.diff punctuation
meta.diff.header
meta.diff.header punctuation
meta.diff.header.from-file
meta.diff.header.to-file
meta.diff.index
meta.diff.range
meta.function-call
meta.link
meta.link punctuation
meta.property-name
meta.property-value constant
meta.property-value constant.numeric
meta.property-value keyword
meta.require
meta.selector
meta.selector entity
meta.selector entity punctuation
meta.separator
meta.structure.dictionary.json string.quoted.double.json
meta.tag string punctuation
meta.tag string.quoted
meta.tag string.quoted constant.character.entity
meta.tag.sgml.doctype
none
punctuation
punctuation.definition
punctuation.definition.array
punctuation.definition.bold
punctuation.definition.comment
punctuation.definition.entity
punctuation.definition.italic
punctuation.definition.parameters
punctuation.definition.string
punctuation.definition.tag
punctuation.definition.variable
punctuation.section.embedded
storage
storage punctuation
storage.type
storage.type punctuation
string
string punctuation
string source
string.other.link
string.regexp
support
support.class
support.class punctuation
support.constant
support.constant punctuation
support.function
support.function punctuation
support.other.variable
support.type
support.type punctuation
support.type.property-name
support.variable
text
text source
text.html.markdown punctuation
variable
variable punctuation
variable.interpolation
variable.language
variable.other
variable.other punctuation
variable.parameter
variable.parameter.function

7.2.2. Pygments / Rouge

Here are the pygments/rouge scopes that are supported by Iro. This list is advisory. Other scopes may be supplied.

Rouge / Pygments Scopes
Comment
Comment.Multiline
Comment.Preproc
Comment.Single
Comment.Special
Generic
Generic.Deleted
Generic.Emph
Generic.Error
Generic.Heading
Generic.Inserted
Generic.Output
Generic.Prompt
Generic.Strong
Generic.Subheading
Generic.Traceback
Keyword
Keyword.Constant
Keyword.Declaration
Keyword.Namespace
Keyword.Pseudo
Keyword.Reserved
Keyword.Type
Literal
Literal.Date
Name
Name.Attribute
Name.Builtin
Name.Builtin.Pseudo
Name.Class
Name.Constant
Name.Decorator
Name.Entity
Name.Exception
Name.Function
Name.Label
Name.Namespace
Name.Other
Name.Tag
Name.Variable
Name.Variable.Class
Name.Variable.Global
Name.Variable.Instance
Number
Number.Bin
Number.Float
Number.Hex
Number.Integer
Number.Integer.Long
Number.Oct
Operator
Operator.Word
Punctuation
String
String.Backtick
String.Char
String.Doc
String.Double
String.Escape
String.Heredoc
String.Interpol
String.Other
String.Regex
String.Single
String.Symbol

7.2.3. Colors

Colors within styles can use either a hex value (in the #000-#fff format) or a color alias, as described in the following table.

Do not worry about "hardcoding" colors into your definition file. These colors are used only for debugging purposes: when grammar definitions are emitted for external syntax highlighters, colors defined within a style are completely ignored.

Alias	12-bit Color
black	#000
white	#fff
blue	#00f
brown	#a52
cyan	#aff
gold	#cb6
green	#0f0
grey	#aaa
light_blue	#19f
light_green	#9c2
light_grey	#ddd
light_yellow	#ffc
navy	#008
orange	#f80
pink	#f0c
purple	#a2f
violet	#e8e
violet_red	#d29
red	#f00
red_2	#c15
yellow	#ff0
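
For readers unfamiliar with 12-bit hex colors, the following Python sketch shows the conventional expansion to a 24-bit value (each hex digit is doubled, as in CSS). The function to_24_bit and the small alias subset are assumptions made for illustration; the source does not state how Iro handles colors internally.

```python
# A subset of the alias table above, for illustration only.
COLOR_ALIASES = {"black": "#000", "white": "#fff", "orange": "#f80", "red_2": "#c15"}


def to_24_bit(color):
    """Expand an alias or a 12-bit '#rgb' value to a 24-bit '#rrggbb' value."""
    value = COLOR_ALIASES.get(color, color)
    if len(value) != 4 or not value.startswith("#"):
        raise ValueError(f"not a 12-bit hex color or known alias: {color!r}")
    int(value[1:], 16)  # raises ValueError if the digits are not hexadecimal
    return "#" + "".join(digit * 2 for digit in value[1:])


if __name__ == "__main__":
    print(to_24_bit("orange"))  # #ff8800
    print(to_24_bit("#0a0"))    # #00aa00
```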

7.3. Rion Object Notation

Iro’s configuration is hosted upon the 'Rion' object notation.

Rion supports the concepts of objects, attributes and collections. Only the features relevant to Iro will be detailed in this section.

7.3.1. Objects

A blank document will be parsed into a single root object with no contents. Every valid item parsed within a document is added to the root object. The root object, like any other object, can contain attributes, objects and collections. All items within an object must be identified.

Object with id and type

In Rion, all objects must have a type but id is optional.

.comments : style {
   // Some contents
}

In the above example, the '.' at the beginning of the style name is not required by the object notation language, but is a convention when naming Iro styles.

Objects with type (pattern) and no identifier

In this snippet, we are defining an unidentified object of type 'pattern'. Objects without an identifier are usually order sensitive (order is maintained).

: pattern {

}

7.3.2. Attributes

Basic Attribute

Basic attributes are simple keys and values.

textmate_uuid = c03f1283-7bde-4b48-bfa5-7bc40ef922ab

Leading and trailing whitespace characters are ignored.

Regular Expression Attribute

Rion supports a native Regular Expression syntax such that no escaping is performed.

match      \= (\s*\:\s*)(foo)

Multi-Value Attribute

Some attributes may involve multiple values.

styles [] = .punctuation, .quoted_text, .punctuation;
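
To illustrate how a multi-value styles attribute relates to the capture groups of a regular expression, here is a small Python sketch. The helpers parse_multi_value and style_groups, and the sample expression (")([^"]*)("), are hypothetical; the sketch assumes the styles are paired with capture groups by position, one style per group, and it is not Rion's real parser.

```python
import re


def parse_multi_value(line):
    """Split a 'name [] = a, b, c;' line into (name, [values])."""
    key, _, rest = line.partition("=")
    name = key.replace("[]", "").strip()
    values = [v.strip() for v in rest.strip().rstrip(";").split(",")]
    return name, values


def style_groups(regex, styles, text):
    """Pair each capture group of a match with the style at the same position."""
    compiled = re.compile(regex)
    if compiled.groups != len(styles):
        raise ValueError("expected one style per capture group")
    m = compiled.match(text)
    return list(zip(styles, m.groups())) if m else []


if __name__ == "__main__":
    _, styles = parse_multi_value("styles [] = .punctuation, .quoted_text, .punctuation;")
    print(style_groups(r'(")([^"]*)(")', styles, '"hello"'))
```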

7.3.3. Collections

Collections are containers for objects. If the order of objects in a sequence is important then that is a reason to use a collection.

contexts [] {
   : pattern {
      regex      \= (say)
      styles []   = .keyword;
   }
}

In the above example, the collection is 'contexts [] { … }' and it contains a single typed and unidentified object (of type pattern).

7.4. FAQ

Why does Iro complain when it encounters Lookbehind regular expressions?

Lookbehind regular expressions are not permitted because the lowest common denominator regular expression implementation (JavaScript) does not support them.

If there is interest I may add a flag to force lookbehinds to be accepted even though it is likely to break compatibility with Ace and Atom (which both use JavaScript regular expression engines).
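
If you want to check your own expressions for lookbehinds before loading them into Iro, a naive scan is usually enough. The following Python sketch is an assumption-laden illustration (uses_lookbehind is not an Iro function, and it does not account for escaped parentheses or character classes).

```python
import re

# Matches the opening of a positive '(?<=' or negative '(?<!' lookbehind.
LOOKBEHIND = re.compile(r'\(\?<[=!]')


def uses_lookbehind(pattern):
    """Return True if the pattern text appears to contain a lookbehind."""
    return LOOKBEHIND.search(pattern) is not None


if __name__ == "__main__":
    print(uses_lookbehind(r'(?<=foo)bar'))  # True
    print(uses_lookbehind(r'(//.*)'))       # False
```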

Does Iro support grammars within grammars?

Not yet. This may follow in later releases.

Will HighlightJS be supported in the future?

I hope so.

Does Iro support semantic highlighting?

No.

Semantic highlighting involves a deep knowledge of the structure of a language in order to colour language elements differently depending on one of a variety of factors. An example of this is highlighting global variables differently to local variables.

Such highlighting is not supported by Iro, nor is it supported by any of the exporters on offer, to the best of my knowledge.

7.5. Recommended Reading

Regular Expression Wikipedia Entry

Wikipedia Entry

Regular Expression Tutorials

Semantic Highlighting

Make sense of your variables at a glance with semantic highlighting by Paul Everitt

Arguments against syntax highlighting

A case against syntax highlighting by Linus Åkesson
Syntax Highlighting Off by Robert Melton

7.6. Recommended Tools

In my opinion Debuggex really is the best visualization tool for creating and debugging individual regular expressions.

https://www.debuggex.com

7.7. Limitation of Liabilities

Consoli Limited is not responsible for, and expressly disclaims all liability for, damages of any kind arising out of use, reference to, or reliance on any information contained or generated from the Iro application and from information contained within this document.

The information in this document and the Iro application itself are provided "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

7.8. Contact

Twitter: @ainslec
Email: c.b.ainsley@gmail.com

8. Sample Exporter Output

This section contains the derived grammars corresponding to the Tutorial section. They are useful for reference purposes but would interrupt the flow of the main document.

TextMate (.tmLanguage file)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>fileTypes</key>
  <array>
    <string>tutorial1</string>
  </array>
  <key>name</key>
  <string>tutorial1</string>
  <key>patterns</key>
  <array>
    <dict>
      <key>include</key>
      <string>#main</string>
    </dict>
  </array>
  <key>scopeName</key>
  <string>source.tutorial1</string>
  <key>uuid</key>
  <string></string>
  <key>repository</key>
  <dict>
    <key>main</key>
    <dict>
      <key>patterns</key>
      <array>
        <dict>
          <key>match</key>
          <string>(say|ask)</string>
          <key>captures</key>
          <dict>
            <key>1</key>
            <dict>
              <key>name</key>
              <string>keyword.tutorial1</string>
            </dict>
          </dict>
        </dict>
        <dict>
          <key>match</key>
          <string>(;)</string>
          <key>captures</key>
          <dict>
            <key>1</key>
            <dict>
              <key>name</key>
              <string>punctuation.tutorial1</string>
            </dict>
          </dict>
        </dict>
        <dict>
          <key>begin</key>
          <string>(\&quot;)</string>
          <key>beginCaptures</key>
          <dict>
            <key>1</key>
            <dict>
              <key>name</key>
              <string>punctuation.tutorial1</string>
            </dict>
          </dict>
          <key>contentName</key>
          <string>string.tutorial1</string>
          <key>end</key>
          <string>(\&quot;)</string>
          <key>endCaptures</key>
          <dict>
            <key>1</key>
            <dict>
              <key>name</key>
              <string>punctuation.tutorial1</string>
            </dict>
          </dict>
        </dict>
        <dict>
          <key>match</key>
          <string>(//.*)</string>
          <key>captures</key>
          <dict>
            <key>1</key>
            <dict>
              <key>name</key>
              <string>comment.tutorial1</string>
            </dict>
          </dict>
        </dict>
      </array>
    </dict>
    <key>main__1</key>
    <dict>
      <key>patterns</key>
      <array>
      </array>
    </dict>
  </dict>
</dict>
</plist>

Ace Editor (JavaScript)

Ace Editor supports a feature whereby you can try out the grammar online with a variety of stylesheets. See the URL contained in the generated exporter.

/*
* To try in Ace editor, copy and paste into the mode creator
* here : http://ace.c9.io/tool/mode_creator.html
*/

define(function(require, exports, module) {
"use strict";
var oop = require("../lib/oop");
var TextHighlightRules = require("./text_highlight_rules").TextHighlightRules;
/* --------------------- START ----------------------------- */
var Tutorial1HighlightRules = function() {
this.$rules = {
"start" : [
   {
      "token" : ["keyword"],
      "regex" : "(say|ask)"
   },
   {
      "token" : ["punctuation"],
      "regex" : "(;)"
   },
   {
      "token" : ["punctuation"],
      "regex" : "(\\\")",
      "push" : "main__1"
   },
   {
      "token" : ["comment"],
      "regex" : "(//.*)"
   },
   {
      defaultToken : "text",
   }
],
"main__1" : [
   {
      "token" : ["punctuation"],
      "regex" : "(\\\")",
      "next" : "pop"
   },
   {
      defaultToken : "string",
   }
]
};
this.normalizeRules();
};
/* ------------------------ END ------------------------------ */
oop.inherits(Tutorial1HighlightRules, TextHighlightRules);
exports.Tutorial1HighlightRules = Tutorial1HighlightRules;
});

Atom (CoffeeScript)

'fileTypes' : [
  'tutorial1'
]
'name' : 'tutorial1'
'patterns' : [
  {
    'include' : '#main'
  }
]
'scopeName' : 'source.tutorial1'
'uuid' : ''
'repository' : {
  'main' : {
    'patterns' : [
      {
        'match' : '(say|ask)'
        'captures' : {
          '1' : {
            'name' : 'keyword.tutorial1'
          }
        }
      }
      {
        'match' : '(;)'
        'captures' : {
          '1' : {
            'name' : 'punctuation.tutorial1'
          }
        }
      }
      {
        'begin' : '(\\")'
        'beginCaptures' : {
          '1' : {
            'name' : 'punctuation.tutorial1'
          }
        }
        'contentName' : 'string.tutorial1'
        'end' : '(\\")'
        'endCaptures' : {
          '1' : {
            'name' : 'punctuation.tutorial1'
          }
        }
      }
      {
        'match' : '(//.*)'
        'captures' : {
          '1' : {
            'name' : 'comment.tutorial1'
          }
        }
      }
    ]
  }
  'main__1' : {
    'patterns' : [
    ]
  }
}

Sublime 3 (YAML)

%YAML 1.2
---
name: tutorial1
scope: source.tutorial1
file_extensions: [ tutorial1 ]

contexts:
  main:
    - match: (say|ask)
      captures:
        0: keyword.tutorial1
    - match: (;)
      captures:
        0: punctuation.tutorial1
    - match: (\")
      captures:
        0: punctuation.tutorial1
      push:
        - match: (\")
          pop: true
          captures:
            0: punctuation.tutorial1
        - match: (.)
          captures:
            0: string.tutorial1
    - match: (//.*)
      captures:
        0: comment.tutorial1
    - match: (.)
      captures:
        0: text.tutorial1

Pygments (Python)

from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

import re

__all__=['Tutorial1Lexer']

class Tutorial1Lexer(RegexLexer):
    name = 'Tutorial1'
    aliases = ['tutorial1']
    filenames = ['*.tutorial1']
    flags = re.MULTILINE | re.UNICODE

    tokens = {
        'root' : [
            (u'(say|ask)', bygroups(Keyword)),
            (u'(;)', bygroups(Punctuation)),
            (u'(\\\")', bygroups(Punctuation), 'main__1'),
            (u'(//.*)', bygroups(Comment)),
            ('(\n|\r|\r\n)', String),
            ('.', String),
        ],
        'main__1' : [
            (u'(\\\")', bygroups(Punctuation), '#pop'),
            ('(\n|\r|\r\n)', String),
            ('.', String),
        ]
    }
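
The exporters above all encode the same two-state machine: a root state that pushes a quoted-text state on an opening quote, and a quoted-text state that pops on the closing quote. As a rough illustration only, the sketch below reproduces that behaviour with the Python standard library. It is not code that Iro generates. The names RULES, DEFAULTS and tokenize are hypothetical. The fallback token for the root state is assumed to be "text" (as in the Ace and Sublime listings), and state is assumed to reset at the start of every line.

```python
import re

# State tables mirroring the listings above. Each rule is
# (regex, token name, action); action is None, a state name to push, or "#pop".
RULES = {
    "root": [
        (r"(say|ask)", "keyword", None),
        (r"(;)", "punctuation", None),
        (r'(")', "punctuation", "main__1"),
        (r"(//.*)", "comment", None),
    ],
    "main__1": [
        (r'(")', "punctuation", "#pop"),
    ],
}

# Token assigned to a single character when no rule in the current state matches.
DEFAULTS = {"root": "text", "main__1": "string"}


def tokenize(line):
    """Return a list of (token, text) pairs for one line of input."""
    stack = ["root"]
    pos = 0
    out = []
    while pos < len(line):
        state = stack[-1]
        for pattern, token, action in RULES[state]:
            m = re.compile(pattern).match(line, pos)
            if m and m.end() > pos:
                # First matching rule wins and consumes its text.
                out.append((token, m.group(0)))
                pos = m.end()
                if action == "#pop":
                    stack.pop()
                elif action is not None:
                    stack.append(action)
                break
        else:
            # Nothing matched: consume one character with the state's default token.
            out.append((DEFAULTS[state], line[pos]))
            pos += 1
    return out


if __name__ == "__main__":
    for token, text in tokenize('say "hi"; // greet'):
        print(token, repr(text))
```

Because the keyword rule exists only in the root state, a word such as say inside quotes is emitted as string characters rather than as a keyword, which is the point of pushing a separate context.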

Rouge (Ruby)

# -*- coding: utf-8 -*- #

module Rouge
  module Lexers
    class Tutorial1 < RegexLexer
      title     "tutorial1"
      tag       'Tutorial1'
      mimetypes 'text/x-tutorial1'
      filenames '*.tutorial1'

      state :root do
        rule /(say|ask)/, Keyword
        rule /(;)/, Punctuation
        rule /(\")/, Punctuation, :main__1
        rule /(\/\/.*)/, Comment
        rule /(\n|\r|\r\n)/, String
        rule /./, String
      end

      state :main__1 do
        rule /(\")/, Punctuation, :pop!
        rule /(\n|\r|\r\n)/, String
        rule /./, String
      end

    end
  end
end
