home

Crazy Parser Idea

This one is stupid. I have an idea for parser and I even have a proof of concept working in CLIPS1. Given my search it seems it something like recursive-descend parser, but I’m not sure if it qualifies.

If you know drop me an e-mail please.

Pre-word

I will use Lisp-like syntax, but this has no implementation except for aformentioned CLIPS, and is more about data structures and logic.

When I write about rune I mean rune as per in Go language, aka a codepoint. Token is one of (I’m still considering full grammar):

  • (name value)
  • (name range-start range-end)
  • (name range-start range-end value)

Symbols used:

  • unp - unprocessed
  • start - input start
  • end - input end
  • nl - newline
  • ◊token◊ - symbol name for a token
  • (?:...) - non capturing pseudo-regex group

Basic example

Step 1

Given string, explode while maintaining Rune position, e.g.

IN: Hello ⌘
OUT: (unp (start 0) (H 1) (e 2) (l 3) (l 4) (o 5) (SPC 6) (⌘ 7) (nl 8) (end 9))

Step 2

Define the rule for parsing extraction, e.g. (in pseudoregex). Note that beginning and end are non-capturing.

#+begin_src text LINE := (?:◊start◊|◊nl◊)[^◊nl◊]*(?:◊nl◊|◊end◊) e#+end_src

Step 3

Move parsed structs in other structs:

OUT: (unp (start 0) (nl 8) (end 9))
     (line (H 1) (e 2) (l 3) (l 4) (o 5) (SPC 6) (⌘ 7))

Step 4

Either apply other rules or clean up and remove empty structures. For example (note: unp was dropped as it was empty).

DROP := (◊nl◊◊end◊)
DROP := (◊start◊)

OUT: (line (H 1) (e 2) (l 3) (l 4) (o 5) (SPC 6) (⌘ 7))
     (drop (start 0) ((nl 8) (end 9)))

String with escape example

Step 1

The same as above, explode and mark runes:


IN: "a\"b\"c"
OUT: (unp (start 0) (" 1) (a 2) (\ 3) (" 4) (b 5) (\ 6) (" 7) (c 8) (" 9) (end 10))

Step 2

Define the rules (order of rules matter) and move to the other structs

ESCAPED_DQUOTE := (◊\◊◊"◊)
DQUOTE := (◊"◊)

OUT:   (unp (start 0) (a 2) (b 5) (c 8) (end 10)
       (dquote (" 1) (" 9))
       (escaped_dquote := (((\ 3) (" 4)) ((\ 6) (" 7))))

Step 2+

Optionally you can merge result into a single line with a common name (note that searching occurs across all patterns, not only on a single line).

OUT: (phase2 (unp start 0) (dquote 1) ... (escaped_quote 6 7) (c 8) (end 10))

Step 3

Define rule for final result:

End word

As I mentioned I’ve used it to parse C header files using CLIPS. CLIPS doesn’t have good text processing capabilitiies but after some tweaking it turned out to be quite efficient at finding parenthesis and comments and ignoring escaped versions.

What is interesting is that this categorization can be done in parallel meaning that instead of having one huge parser it’s possible to search separaterly for strings, parenthesis, blocks etc. and only then combine the results.

I’m playing with Haskell today so it’s possible I’ll have some toy implementation in it, but also possible I’ll never finish it. If you ever take inspiration and implement it, please mention my name and link this article. Thanks!


  1. https://clipsrules.net - Lisp-like Forward-Chaining Rule-based language. Really awesome to work in. ↩︎

Przemysław Alexander Kamiński
vel xlii vel exlee

gh | li | rss

Powered by hugo and hugo-theme-nostyleplease.