
Regex Overview

This is an overview of Regex and advanced parsing features in the Row64 formula language.

The goal of the parsing features is to provide a comprehensive set of tools for text parsing, for nearly any scenario. We think of it as a Swiss Army Knife that can handle any parsing situation.

For this reason, the features start with basic regex operations, and extend to advanced parsing that resembles coding more than formulas.

Our hope is that this will simplify the workflow of general data work, like text cleanup, but will also have everything needed when encountering a more complex parsing situation.

Python Perl Regex

Regex is powerful for searching and manipulating text strings. At the core, it's a string of characters that defines a pattern-matching strategy.

These regex sequences contain trick moves for crawling and finding matches in strings. The trick moves revolve around special characters.

For example, in the regex:

hello.*

The trick moves are:

. which matches any single character
* which repeats the previous element zero or more times

So, if we run our regex on:

I said hello Row64

It will match the text "hello," and crawl until the end of the string, returning:

hello Row64
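This behavior can be sketched with Python's re module (the flavor Row64's "RE_" formulas follow), assuming a pattern like hello.* where the dot and star do the crawling:

```python
import re

# "hello" matches literally, then ".*" crawls to the end of the string.
match = re.search(r"hello.*", "I said hello Row64")
print(match.group())  # → hello Row64
```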

There are many outstanding resources online to learn regex.

To get started in Row64, the formulas that start with "RE_" are the most general (Python & Perl style) and will likely be your primary tool for using regex.

For example, a formula that starts with "RE_" is:

RE_REPLACE

The Python re library is very similar to the Perl Regular Expression Syntax. This is the most standard and well known regex flavor. In Perl regular expressions, all characters match themselves except for the following special characters:

.[{}()\*+?|^$

Here is a link to all the syntax details:
https://www.boost.org/doc/libs/1_89_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
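As an illustration of this Perl-style flavor, here is a sketch using Python's re.sub; the exact argument order of Row64's RE_REPLACE is not shown here, so this is Python only:

```python
import re

# Replace every run of digits with "#" using Perl-style regex syntax.
print(re.sub(r"\d+", "#", "order 123, item 456"))  # → order #, item #

# Special characters like "." must be escaped to match literally.
print(re.sub(r"\.", "!", "end."))  # → end!
```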

POSIX Regex

POSIX Regex has different tricks and behaviors than the standard Python/Perl regex.

It has many fans and use cases where it is faster and simpler.

In Row64, formulas that start with "REP_" are POSIX Regular Expressions. For example:

REP_REPLACE

Here is a link to the exact POSIX regex details:
https://www.boost.org/doc/libs/1_89_0/libs/regex/doc/html/boost_regex/syntax/basic_syntax.html

Fuzzy Search

Fuzzy search is a string matching technique to find patterns in noisy text that contains typos or misspellings.

Fuzzy search does similarity calculations using the Levenshtein Distance. It takes a string as input and doesn't use regular expression syntax.

Levenshtein Distance is the minimum number of single-character changes (insertions, deletions, or substitutions) needed to turn one word into another.
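The calculation can be sketched with the classic dynamic-programming approach (this is an illustration of the metric itself, not Row64's internal implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```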

Fuzzy Regex Search

Fuzzy regex search also searches noisy text with typos or misspellings, but it gives you greater control than basic fuzzy search.

Instead of a simple string to match, you can use regex to guide the matching pattern and then use fuzzy comparison as a second step.

This allows great precision and control when you understand the error patterns in your input text.

Fuzzy Regex Search also uses Levenshtein Distance, and is optimized for speed with a fast approximate regex matcher.

Sentiment

Sentiment analysis, also known as opinion mining, takes text as input and returns a polarity (positive/negative) score.

The score range is from -1 to 1, with -1 being very negative and 1 being very positive.

It uses a rule-based sentiment analysis engine to improve performance.
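The general shape of a rule-based engine can be sketched as a lexicon lookup plus a score in the [-1, 1] range. The word list below is a toy assumption for illustration, not Row64's actual rules:

```python
# Toy sentiment lexicon (assumed for illustration only).
LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.5, "terrible": -0.8}

def sentiment(text: str) -> float:
    """Average the lexicon scores of known words, clamped to [-1, 1]."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not scores:
        return 0.0
    return max(-1.0, min(1.0, sum(scores) / len(scores)))

print(sentiment("the movie was great"))  # → 0.8
```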

Tokens

Token parsing, also called lexing, is an advanced technique where you break a string apart at splitting patterns. The splitting elements are called delimiters.

Generally, tokenization is done by code, but Row64 has several formulas that offer some of the benefits of these techniques.

Splitting is accomplished using regular expressions. You can extract either the elements between the delimiters, or the delimiters themselves.

In a simple example, we tokenize the following string using a space as the delimiter:

sum = a + 2
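In Python, the same space-delimited tokenization looks like this:

```python
# Split on the space delimiter; each piece becomes a token.
tokens = "sum = a + 2".split(" ")
print(tokens)  # → ['sum', '=', 'a', '+', '2']
```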

In a more complex example, we tokenize using regex with the formula:

RE_JOIN_TOKENS("\d+","__012--345==678##","|")

We take the regex that matches runs of numeric characters:

\d+

And split the string:

__012--345==678##

The formula:

RE_JOIN_TOKENS

will join the extracted tokens together with the string "|".

This will return a final result of:

012|345|678

So, using only a few characters, we perform a complex crawling split operation and bring the pieces back together with a new delimiter.
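The same transformation can be sketched with Python's re module, extracting the digit runs and rejoining them:

```python
import re

# Extract the runs of digits, then rejoin them with "|" —
# mirroring RE_JOIN_TOKENS("\d+", "__012--345==678##", "|").
result = "|".join(re.findall(r"\d+", "__012--345==678##"))
print(result)  # → 012|345|678
```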

This is just a basic example. Much more powerful transformations are possible!