PlistSpec: Specifying Parameter Lists in SciDB 19.3



Specifying SciDB Operator Parameters with Regular Expressions

Who Should Read This

This article describes changes to the SciDB database engine's LogicalOperator interface. They affect both system operators that reside in the core and user-defined operators (UDOs) that reside in plugins. If you are a maintainer for a system operator or UDO, keep reading.

Overview

In SciDB release 19.3, logical operators specify their parameter lists using a regular expression-like descriptor called a PlistSpec. Using PlistSpec objects has several advantages over the old method:

  • They enable parameter sublists, a convenient new syntax.
  • They break an annoying dependency that required logical operator objects to participate directly in query parsing.
  • They introduce clear and familiar rules for specifying parameter lists.
  • They reduce coding errors, since parameter parsing is now data-driven rather than code-driven.

In this article you'll learn about PlistSpec objects, and how to modify LogicalOperator subclasses to use them.

Earlier Releases

This section recaps how parameter parsing was done in earlier releases. If you don't need to convert any older operators, you can skip it.

Prior to release 19.3, parameters were specified by using ADD_PARAM_xxx(...) macros in the logical operator constructor, and by implementing the nextVaryParamPlaceholder() method to handle any optional parameters. In a nutshell, the constructor described the non-optional parameters, and for optional parameters an ADD_PARAM_VARIES() macro call told the query language front end to ``ask me about the next set of valid parameter types''.

The help() operator provides a short example. Here are the pre-19.3 LogicalHelp constructor and the nextVaryParamPlaceholder() method:

 1: LogicalHelp::LogicalHelp(const std::string& logicalName,
 2:                          const std::string& alias)
 3:     : LogicalOperator(logicalName, alias)
 4: {
 5:     ADD_PARAM_VARIES()
 6:     _usage = "help([<operator_name_string>])";
 7: }
 8:
 9: Placeholders
10: LogicalHelp::nextVaryParamPlaceholder(
11:     const std::vector<ArrayDesc> &inputSchemas)
12: {
13:     Placeholders result;
14:     if (_parameters.size() == 0)
15:         result.push_back(PARAM_CONSTANT(TID_STRING));
16:     result.push_back(END_OF_VARIES_PARAMS());
17:     return result;
18: }

The idea is that all of the parameters are optional (line 5). The query parser has to call back to the nextVaryParamPlaceholder() object method to find out the types of the optional parameters. That method can look at

  • the number of parameters accumulated within the LogicalOperator object so far (line 14),
  • the types of parameters accumulated so far (the contents of the _parameters vector), and
  • the schemas of the input arrays (line 11). Valid parameter types might depend on, say, the number of input array dimensions.

Each successive call to the method returns a Placeholders vector containing the types acceptable at this point in the parse. The END_OF_VARIES_PARAMS() placeholder macro tells the parser that ``no more parameters'' is a valid ``parameter'' at this point.

So the help() operator can take zero or one string parameters. If that wasn't instantly clear to you the moment you looked at the code, don't feel bad.
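
The callback protocol can be mimicked in a few lines. The sketch below is a toy, not SciDB code: Kind, nextPlaceholders(), and oldStyleAccepts() are hypothetical names, and the shape of the parser loop is an assumption, since the real loop is not shown in this article. It replays the help() logic to confirm that only zero or one string constants get through.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the pre-19.3 placeholder kinds.
enum class Kind { StringConstant, EndOfVaries };

// Mirrors LogicalHelp::nextVaryParamPlaceholder(): a string constant is
// acceptable only while no parameter has been accumulated yet, and
// "no more parameters" is always acceptable.
std::vector<Kind> nextPlaceholders(std::size_t accumulated)
{
    std::vector<Kind> result;
    if (accumulated == 0)
        result.push_back(Kind::StringConstant);
    result.push_back(Kind::EndOfVaries);
    return result;
}

// Assumed shape of the parser loop: ask before consuming each argument,
// then ask once more to see whether stopping here is allowed.
bool oldStyleAccepts(std::size_t stringArgs)
{
    auto allowed = [](std::size_t accumulated, Kind k) {
        std::vector<Kind> const opts = nextPlaceholders(accumulated);
        return std::find(opts.begin(), opts.end(), k) != opts.end();
    };
    for (std::size_t n = 0; n < stringArgs; ++n)
        if (!allowed(n, Kind::StringConstant))
            return false;
    return allowed(stringArgs, Kind::EndOfVaries);
}
```

Under these assumptions, oldStyleAccepts(0) and oldStyleAccepts(1) hold while oldStyleAccepts(2) does not, which is the ``zero or one string'' rule that the original code expresses so indirectly.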

Introducing PlistSpec Maps and makePlistSpec()

Release 19.3 introduces parameter specification using PlistSpec maps. With PlistSpec maps, the nextVaryParamPlaceholder() method is gone altogether, as are any ADD_PARAM_xxx() constructor macros (and any calls to addKeywordPlaceholder(), which we haven't covered yet).

Instead, LogicalHelp implements a new static method called makePlistSpec():

 1: class LogicalHelp : public LogicalOperator {
 2: public:
 3:     static PlistSpec const* makePlistSpec();
 4:     // ...
 5: };
 6: // ...
 7: PlistSpec const* LogicalHelp::makePlistSpec()
 8: {
 9:     static PlistSpec argSpec {
10:         { "", RE(RE::QMARK, { RE(PP(PLACEHOLDER_CONSTANT, TID_STRING)) }) }
11:     };
12:     return &argSpec;
13: }

For such a small code snippet, there is a lot to unpack here. First, some general observations:

  • A PlistSpec is a map from keywords to regular expressions.
  • The makePlistSpec() method must be declared static (line 3) and have exactly this signature (line 7). Otherwise the OperatorLibrary will be unable to find the PlistSpec, and it will assume that the operator takes no parameters.
  • The method definition creates a PlistSpec map named argSpec using C++11 brace (list) initialization syntax (line 9).
  • The argSpec declaration uses the static keyword for efficiency: this is read-only data, so why re-create it on each call?
  • The "" empty string map entry (line 10) denotes the ordinary non-keyword positional parameters. As we'll see, you can specify regular expressions for each keyword parameter you wish to support by making map entries keyed by non-empty strings.
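
The function-local static and the "" key can be tried out in isolation. This is a minimal sketch under stated assumptions: ToySpec, makeToySpec(), findEntry(), and the "format" keyword are made up for illustration, and the map values are plain strings rather than real regular expression objects.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in: the real PlistSpec maps keywords to RE objects,
// here the values are just printable strings.
using ToySpec = std::map<std::string, std::string>;

ToySpec const* makeToySpec()
{
    // Built once, on the first call; later calls reuse the same object.
    static ToySpec const spec {
        { "",       "<constant:string>?" },   // positional parameters
        { "format", "<constant:string>" }     // a made-up keyword
    };
    return &spec;
}

// Look up the spec text for a keyword; "" selects the positionals.
// Returns nullptr when the keyword has no entry.
std::string const* findEntry(ToySpec const& spec, std::string const& keyword)
{
    auto it = spec.find(keyword);
    return it == spec.end() ? nullptr : &it->second;
}
```

Every call to makeToySpec() returns the same address, so callers can hold on to the pointer without worrying about lifetime.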

Now, what are RE, PP, and QMARK all about? Here are the relevant type definitions, from <query/LogicalOperator.h>.

// Convenience type aliases for operators that implement makePlistSpec().
using PlistRegex = dfa::RE<OperatorParamPlaceholder>;
using PlistSpec = std::map<std::string, PlistRegex>;
using PP = OperatorParamPlaceholder; // Shorthand to make writing these specs
using RE = PlistRegex;               // in logical ops less cumbersome.

So PP and RE are just shorthand for longer identifiers. They make PlistSpec initializations easier to read.

An RE is a dfa::RE<OperatorParamPlaceholder>. The template class dfa::RE<T> defines regular expressions on a symbol alphabet of objects of type T. The test program for this class uses dfa::RE<std::string> since strings are much easier to work with than OperatorParamPlaceholders. The C++ namespace dfa contains support for (D)eterministic (F)inite-state (A)utomatons. The OperatorLibrary compiles RE regular expressions into DFA state machines that recognize valid parameter lists.

But what about QMARK?

A Closer Look at RE

The RE class supports simple regular expression syntax, as summarized in the table below.

What          Meaning            RE::Code
ε             Empty symbol       EMPTY
x             Terminal symbol    LEAF
x | y         Either x or y      OR
x*            Zero or more x     STAR
x+            One or more x      PLUS
x?            Zero or one x      QMARK
x y z         Sequence           LIST
( x y z )     Grouped sequence   GROUP

The RE terminal symbols are objects of type OperatorParamPlaceholder, or PP. So the RE object constructed by

RE(PP(PLACEHOLDER_CONSTANT, TID_STRING))

is an RE::LEAF terminal symbol that corresponds to an AFL string constant. When we enclose it with

RE(RE::QMARK, { ... })

to get

RE(RE::QMARK, { RE(PP(PLACEHOLDER_CONSTANT, TID_STRING)) })

we have zero or one string constants, which exactly describes the valid parameter lists for the help() operator.

The RE constructor signatures are:

explicit RE(T const& t);
explicit RE(Code c);
RE(Code c, std::vector<RE> const& children);

The first constructs a terminal symbol, as in the string constant example above.

The second constructs a non-LEAF RE that has no children. There's only one of those: RE(RE::EMPTY).

The third constructs an RE with children, that is, all the other non-LEAF regular expressions.

Each constructor performs a consistency check, so that only semantically valid regular expressions can be constructed. If your foo() operator crashes SciDB while executing LogicalFoo::makePlistSpec(), chances are you've broken the RE construction rules. Check the scidb.log file for error messages containing ``DFA:''.
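
The consistency rules can be pictured with a toy class. The sketch below is not the dfa::RE source: ToyRE, its use of std::invalid_argument, and its error messages are assumptions made for illustration. Only the three constructor shapes and the rule that EMPTY is the sole childless non-LEAF code come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy regular expression node with the three constructor shapes.
// The checks are guesses at the kind of rule the real class enforces.
class ToyRE {
public:
    enum Code { EMPTY, LEAF, OR, STAR, PLUS, QMARK, LIST, GROUP };

    // Terminal symbol.
    explicit ToyRE(std::string symbol)
        : _code(LEAF), _symbol(std::move(symbol)) {}

    // Childless non-LEAF node: only EMPTY qualifies.
    explicit ToyRE(Code c) : _code(c)
    {
        if (c != EMPTY)
            throw std::invalid_argument("only EMPTY may be childless");
    }

    // Node with children: every other non-LEAF code.
    ToyRE(Code c, std::vector<ToyRE> children)
        : _code(c), _children(std::move(children))
    {
        if (c == EMPTY || c == LEAF)
            throw std::invalid_argument("EMPTY and LEAF take no children");
        if (_children.empty())
            throw std::invalid_argument("operator node needs children");
    }

    Code code() const { return _code; }
    std::size_t arity() const { return _children.size(); }

private:
    Code _code;
    std::string _symbol;
    std::vector<ToyRE> _children;
};

// True if building a childless node with code c is rejected.
bool rejectsChildless(ToyRE::Code c)
{
    try { ToyRE re(c); (void)re; return false; }
    catch (std::invalid_argument const&) { return true; }
}
```

With this toy, ToyRE(ToyRE::QMARK, { ToyRE("string") }) builds fine, while ToyRE(ToyRE::QMARK) throws. The real class reports its violations differently, as described above.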

Other Placeholder Types

For quick reference, here is the complete list of available placeholders.

PLACEHOLDER_INPUT
    Either an array name or a subquery. If present, these should appear first in the parameter list.
PLACEHOLDER_ARRAY_NAME
    An array name, not a subquery. Unversioned unless setAllowVersions(true) is called.
PLACEHOLDER_ATTRIBUTE_NAME
    The named attribute must exist in some array unless setMustExist(false) is called.
PLACEHOLDER_DIMENSION_NAME
    The named dimension must exist in some array unless setMustExist(false) is called.
PLACEHOLDER_CONSTANT
    May be restricted to a particular type (for example, with type id TID_BOOL) or wildcarded (use TID_VOID).
PLACEHOLDER_EXPRESSION
    Like PLACEHOLDER_CONSTANT, may be restricted to a particular type or wildcarded with TID_VOID.
PLACEHOLDER_AGGREGATE_CALL
    Likewise restricted or wildcarded. The specified type is the return type of the aggregate.
PLACEHOLDER_SCHEMA
    Literal schema, or an array name whose schema is substituted.
PLACEHOLDER_NS_NAME
    Namespace name. (Namespaces are authenticated access domains in SciDB Enterprise Edition.)
PLACEHOLDER_DISTRIBUTION
    Array storage distribution.

More Examples

You've now got the basics of how to use RE and makePlistSpec() to specify AFL operator parameter lists. Let's look at some further examples, to introduce some new concepts and to get a better feel for describing more elaborate parameter lists.

Parameter Sublists: apply()

Here's the LogicalApply::makePlistSpec() code:

 1: static PlistSpec const* makePlistSpec()
 2: {
 3:     // Some shorthand definitions.
 4:     PP const PP_EXPR(PLACEHOLDER_EXPRESSION, TID_VOID);
 5:     PP const PP_ATTR_OUT =
 6:         PP(PLACEHOLDER_ATTRIBUTE_NAME).setMustExist(false);
 7:
 8:     // The parameter list specification.
 9:     static PlistSpec argSpec {
10:         { "", // positionals
11:           RE(RE::LIST, {
12:              RE(PP(PLACEHOLDER_INPUT)),
13:              RE(RE::OR, {
14:                 RE(RE::PLUS, { RE(PP_ATTR_OUT), RE(PP_EXPR) }),
15:                 RE(RE::PLUS,
16:                  { RE(RE::GROUP, { RE(PP_ATTR_OUT), RE(PP_EXPR) }) })
17:               })
18:            })
19:         }
20:     };
21:     return &argSpec;
22: }

Lines 3-6 introduce some shorthand for PP objects used frequently in the PlistSpec. Who'd want to type all that out repeatedly?

On line 4, the PP_EXPR declaration uses TID_VOID to document that an expression of any type is allowed. TID_VOID is the default ``type of'' parameter, so it could have been left out (as was done for PP_ATTR_OUT).

On line 6, the PP object's setMustExist() method is called. PP (that is, OperatorParamPlaceholder) has some new setter methods for toggling flags:

setMustExist(bool)
    Default true, the reference must already exist. Calling setMustExist(false) tells the query language front end not to bother looking up the reference name in any of the input array schemas or the system catalog. The apply() operator creates new attributes, so PP_ATTR_OUT denotes an attribute that doesn't exist yet. By default, ``must exist'' is true for array, namespace, attribute and dimension references.
setAllowVersions(bool)
    Default false, version specifiers are not allowed. Calling setAllowVersions(true) allows array reference parameters to have version specifiers, such as MYARRAY@5 for version five of MYARRAY.

Now we come to the PlistSpec map's brace initialization. As with LogicalHelp, there are only positional parameters, so there is only one map entry with an empty "" keyword (line 10).

A typical parameter list will start with a few required parameters, followed by some optional ones. Use RE::LIST to enclose a sequence of parameters (line 11).

The first (required) parameter is RE(PP(PLACEHOLDER_INPUT)), that is, an input array or subquery (line 12).

The next required parameter is an RE::OR disjunction of two RE::PLUS subexpressions. The first subexpression (line 14) covers the backward compatibility case: this is how apply() has always worked, with a sequence of pairs of attribute name and expression.

Lines 15-16 express a new style of apply() usage that uses RE::GROUP to enclose each attribute/expression pair in explicit parentheses. Parenthesized groups of parameters are sometimes called sublists or nested parameter groups. These are new in release 19.3.

Lastly, lines 13-16 specify an RE::OR of RE::PLUSs rather than an RE::PLUS of RE::ORs. That's intentional: it prohibits mixing and matching old style ungrouped attr, expr pairs with new style (attr, expr) ones.
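
The difference between the two shapes is easy to check with ordinary string regular expressions. The sketch below uses std::regex rather than dfa::RE, and the one-character-per-parameter encoding (i, a, e, g) is invented purely for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// One character per parameter (an encoding invented for this sketch):
//   i = input array, a = attribute name, e = expression,
//   g = a parenthesized (attr, expr) group

// RE::OR of RE::PLUSs, the shape used by apply().
bool matchesOrOfPlus(std::string const& params)
{
    static std::regex const re("i((ae)+|g+)");
    return std::regex_match(params, re);
}

// RE::PLUS of RE::ORs, the looser alternative that apply() avoids.
bool matchesPlusOfOr(std::string const& params)
{
    static std::regex const re("i(ae|g)+");
    return std::regex_match(params, re);
}
```

Both functions accept "iaeae" and "igg". Only matchesPlusOfOr() accepts the mixed list "iaeg", which is exactly the case the apply() spec is written to reject.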

Simple Keyword Parameters: input()

Here is LogicalInput::makePlistSpec():

 1: PlistSpec const* LogicalInput::makePlistSpec()
 2: {
 3:     static PlistSpec argSpec {
 4:         { "", // positionals
 5:           RE(RE::LIST, {
 6:              RE(PP(PLACEHOLDER_SCHEMA)),
 7:              RE(PP(PLACEHOLDER_CONSTANT, TID_STRING)), // filename
 8:              RE(RE::QMARK, {
 9:                 RE(PP(PLACEHOLDER_CONSTANT, TID_INT64)), // instance_id
10:                 RE(RE::QMARK, {
11:                    RE(PP(PLACEHOLDER_CONSTANT, TID_STRING)), // format
12:                    RE(RE::QMARK, {
13:                       RE(PP(PLACEHOLDER_CONSTANT, TID_INT64)), // maxErrors
14:                       RE(RE::QMARK, {
15:                          RE(PP(PLACEHOLDER_CONSTANT, TID_BOOL)) // strict
16:                       })
17:                    })
18:                 })
19:              })
20:           })
21:         },
22:         // keywords
23:         { InputSettings::KW_INSTANCE, RE(PP(PLACEHOLDER_CONSTANT, TID_INT64)) },
24:         { InputSettings::KW_FORMAT, RE(PP(PLACEHOLDER_CONSTANT, TID_STRING)) },
25:         { InputSettings::KW_MAX_ERRORS, RE(PP(PLACEHOLDER_CONSTANT, TID_INT64)) },
26:         { InputSettings::KW_STRICT, RE(PP(PLACEHOLDER_CONSTANT, TID_BOOL)) }
27:     };
28:     return &argSpec;
29: }

Here we see the usual RE::LIST of positionals starting at line 5.

Lines 22-26 show additional PlistSpec map entries for simple keywords. The InputSettings::KW_* identifiers are constant strings used to avoid typos. Using KW_INSTANCE instead of "instance" guarantees that no "intsance" bugs go undetected. The RE values can be arbitrary, just as for positionals (but with one additional constraint, see Keyword Parameters Are Unary below).

The ``Cascading QMARK'' idiom (lines 8-19) describes a typical positional parameter list where the rightmost parameters are increasingly optional.

Other formulations are possible, and may be clearer depending on the circumstances. For example, the dfa_tests.cpp program (which uses dfa::RE<std::string>) mimics the input() operator's positionals like this:

 1: RE(RE::LIST,
 2:  { RE("schema"),
 3:    RE("filename"),
 4:    RE(RE::OR,
 5:     { RE(RE::EMPTY),
 6:       RE("instance"),
 7:       RE(RE::LIST, { RE("instance"), RE("format") }),
 8:       RE(RE::LIST, { RE("instance"), RE("format"), RE("maxErr") }),
 9:       RE(RE::LIST, { RE("instance"), RE("format"), RE("maxErr"),
10:                      RE("strict") })
11:     })
12:  })

On line 5, note the use of RE::EMPTY to make the whole RE::OR clause optional.
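
To see that the cascading and the enumerated formulations accept the same sequences, a toy matcher is enough. The sketch below is not dfa::RE: ToyRE, reach(), and accepts() are hypothetical, the matcher tracks sets of reachable input positions instead of compiling a DFA, and the parameter list is cut down to two optional parameters to stay short.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for dfa::RE<std::string>, supporting a subset of codes.
struct ToyRE {
    enum Code { EMPTY, LEAF, OR, QMARK, LIST };
    Code code;
    std::string symbol;          // LEAF only
    std::vector<ToyRE> kids;     // OR, QMARK, LIST

    explicit ToyRE(std::string s) : code(LEAF), symbol(std::move(s)) {}
    explicit ToyRE(Code c) : code(c) {}
    ToyRE(Code c, std::vector<ToyRE> k) : code(c), kids(std::move(k)) {}
};

using Symbols = std::vector<std::string>;
using Positions = std::set<std::size_t>;

Positions reach(ToyRE const& re, Symbols const& in, std::size_t pos);

// Match the children one after another, starting from position pos.
Positions sequence(std::vector<ToyRE> const& kids, Symbols const& in,
                   std::size_t pos)
{
    Positions current{pos};
    for (ToyRE const& kid : kids) {
        Positions next;
        for (std::size_t p : current) {
            Positions reached = reach(kid, in, p);
            next.insert(reached.begin(), reached.end());
        }
        current = std::move(next);
    }
    return current;
}

// All positions reachable by matching re against in, starting at pos.
Positions reach(ToyRE const& re, Symbols const& in, std::size_t pos)
{
    Positions out;
    switch (re.code) {
    case ToyRE::EMPTY:
        out.insert(pos);
        break;
    case ToyRE::LEAF:
        if (pos < in.size() && in[pos] == re.symbol)
            out.insert(pos + 1);
        break;
    case ToyRE::OR:
        for (ToyRE const& kid : re.kids) {
            Positions reached = reach(kid, in, pos);
            out.insert(reached.begin(), reached.end());
        }
        break;
    case ToyRE::QMARK:
        out = sequence(re.kids, in, pos);
        out.insert(pos);             // skipping the children is allowed
        break;
    case ToyRE::LIST:
        out = sequence(re.kids, in, pos);
        break;
    }
    return out;
}

// A sequence is accepted if some match consumes the entire input.
bool accepts(ToyRE const& re, Symbols const& in)
{
    return reach(re, in, 0).count(in.size()) > 0;
}

using R = ToyRE;

// Cascading QMARK formulation.
R cascade()
{
    return R(R::LIST, { R("schema"), R("filename"),
        R(R::QMARK, { R("instance"),
            R(R::QMARK, { R("format") }) }) });
}

// Enumerated formulation, with EMPTY making the OR clause optional.
R alternatives()
{
    return R(R::LIST, { R("schema"), R("filename"),
        R(R::OR, { R(R::EMPTY), R("instance"),
            R(R::LIST, { R("instance"), R("format") }) }) });
}
```

Both cascade() and alternatives() accept schema, filename with or without a trailing instance and format, and both reject a list that supplies format without instance.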

Incidentally, there is no requirement that a PlistSpec contain a "" entry for positionals. You can have all keyword entries, or an empty PlistSpec map.

Complex Keyword Parameters: rng_uniform()

The rng_uniform() operator produces sequences of pseudo-random numbers using the drand48 algorithm, which takes a single 64-bit seed value.

Experimental changes to LogicalRngUniform allow multiple 64-bit seed values to be specified, so that one day algorithms that use bigger seeds can be supported. Here is the keywords portion of LogicalRngUniform::makePlistSpec():

 1: // keywords
 2: { "min", RE(PP(PLACEHOLDER_EXPRESSION, TID_DOUBLE)) },
 3: { "max", RE(PP(PLACEHOLDER_EXPRESSION, TID_DOUBLE)) },
 4: { "generator", RE(PP(PLACEHOLDER_EXPRESSION, TID_STRING)) },
 5: { "seed", RE(PP(PLACEHOLDER_EXPRESSION, TID_UINT64)) },
 6: { "seeds",
 7:    RE(RE::OR, {
 8:       RE(PP(PLACEHOLDER_EXPRESSION, TID_STRING)), // JSON-encoded seeds
 9:       RE(RE::GROUP, {
10:          RE(RE::PLUS, { RE(PP(PLACEHOLDER_EXPRESSION, TID_UINT64)) })
11:       })
12:    })
13: }

The "seeds" keyword can be either a JSON-encoded string (did I mention this was experimental?) or a parenthesized RE::GROUP of one or more TID_UINT64 expressions.

Why one or more? RE::GROUP is implemented using an AFL grammar production called nested_operand. The grammar insists that a nested_operand contain at least one suboperand. So () is not a legal group.

A group with a single item requires an extra comma, to distinguish it from a parenthesized expression. So (42,) is a group of one item, but (42) is a non-group expression that evaluates to 42. (This is the same as Python tuple syntax.)

Keyword Parameters Are Unary

Keyword parameters can take only one operand. That constrains the kinds of RE expressions you can use for keyword parameters. You can specify any arbitrary RE for a keyword, but the language front end will only build syntax trees for singular keyword values. Specifically, the RE for a keyword may be either:

  • a single RE::LEAF expression, or
  • a single RE::GROUP nested sublist, or
  • an RE::OR disjunction of those.

If you specify any other top-level RE for a keyword, your operator source code will compile and build cleanly, but the RE will never be matched. A variety of runtime errors await you. (Consider: how should the language parser treat a keyword that is an RE::LIST or RE::QMARK? It cannot distinguish between these and the next unnested parameter.)

In summary, if you need to pass multiple values for a keyword parameter, use RE::GROUP to parenthesize them.

This must-be-unary constraint could be checked when the logical operator factory is loaded into OperatorLibrary, but that introduces coupling between OperatorLibrary and the language grammar. We chose not to perform the check and trust that advanced SciDB users developing their own operators would actually test with all desired parameter lists.
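
Operator authors who want a safety net in their own unit tests could encode the rule themselves. The following is only a sketch: Node is a hypothetical, stripped-down tree type rather than the real RE class, and the check follows the three-bullet rule above literally, so an OR nested inside an OR is rejected.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, stripped-down RE node: just the code and the children.
struct Node {
    enum Code { EMPTY, LEAF, OR, STAR, PLUS, QMARK, LIST, GROUP };
    Code code;
    std::vector<Node> kids;
};

// A single operand is either one terminal or one parenthesized sublist.
bool isSingleOperand(Node const& n)
{
    return n.code == Node::LEAF || n.code == Node::GROUP;
}

// True if n is usable as the top-level RE for a keyword parameter.
bool isUnaryKeywordSpec(Node const& n)
{
    if (isSingleOperand(n))
        return true;
    return n.code == Node::OR
        && !n.kids.empty()
        && std::all_of(n.kids.begin(), n.kids.end(), isSingleOperand);
}
```

Applied to the rng_uniform() example, the "seeds" entry passes because it is an OR of a LEAF and a GROUP, whereas a keyword whose top-level RE is a QMARK or a LIST would be flagged.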

Some Implications of Sublists

Previously the ``location'' of a parameter could be described by a non-negative integer (for positionals) or a string (for keywords). With RE::GROUP nested sublists, the location of a particular parameter needs a more complex description.

Similarly, visiting all parameters is no longer a matter of iterating over a vector of positionals and a map of keywords.

Visiting All Parameters

These definitions are from <query/OperatorParam.h>:

using PlistWhere = std::vector<size_t>;
using PlistVisitor = std::function<void(Parameter& param,
                                        PlistWhere const& where,
                                        std::string const& kw)>;

/** Shared code for depth-first visiting logical or physical parameters. */
void visitParameters(PlistVisitor&, Parameters&, KeywordParameters&);

And in <query/LogicalOperator.h>:

/** Visit logical parameters, positional and keyword, nested or not. */
void visitParameters(PlistVisitor& f)
{
    scidb::visitParameters(f, _parameters, _kwParameters);
}

The PlistVisitor functor is invoked for each Parameter, with the Parameter location described by two arguments:

  • a PlistWhere vector of sublist indices and
  • the enclosing keyword string kw.

For example, if you have an operator with arguments

op_foo(A, (alpha, 12), scale_by: (3.1415, 6.02e23, 2.718))

then when visiting the 12, where is [0, 1] and kw is "". The sublist (alpha, 12) is the zero-th top-level parameter, because PLACEHOLDER_INPUT subqueries and arrays like A are treated differently and not included in the _parameters vector. So the group (alpha, 12) is parameter zero, and the 12 is parameter 1 within that sublist.

When visiting the 6.02e23, where is likewise [0, 1] but kw is "scale_by". The sublist (3.1415, 6.02e23, 2.718) is the zero-th (and only, since Keyword Parameters Are Unary) parameter for the scale_by keyword, and 6.02e23 is parameter 1 within that sublist.

In general, operators know where to find their own arguments, since the operator provided the PlistSpec. Typically it is generic subsystems like the query optimizer or the language front end that need to visit parameters in blanket fashion.
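
The op_foo() walk-through can be reproduced with a toy depth-first visitor. Everything here is a stand-in: ToyParam, Where, Visitor, visitList(), visitAll(), and traceOpFoo() are hypothetical names, and the choice to visit a sublist node itself before its members is an assumption rather than a statement about scidb::visitParameters().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical parameter node: a leaf has no kids, a sublist has some.
struct ToyParam {
    std::string text;
    std::vector<ToyParam> kids;
};

using Where = std::vector<std::size_t>;
using Visitor = std::function<void(ToyParam const&, Where const&,
                                   std::string const&)>;

// Depth-first walk: visit each parameter, then descend into its sublist.
void visitList(Visitor const& f, std::vector<ToyParam> const& list,
               Where& where, std::string const& kw)
{
    for (std::size_t i = 0; i < list.size(); ++i) {
        where.push_back(i);
        f(list[i], where, kw);
        visitList(f, list[i].kids, where, kw);
        where.pop_back();
    }
}

void visitAll(Visitor const& f,
              std::vector<ToyParam> const& positionals,
              std::map<std::string, ToyParam> const& keywords)
{
    Where where;
    visitList(f, positionals, where, "");
    for (auto const& entry : keywords) {
        // A keyword's single value is parameter zero for that keyword.
        std::vector<ToyParam> const value(1, entry.second);
        visitList(f, value, where, entry.first);
    }
}

// Records "kw[indices]=text" for the op_foo() call in the text.
// The input array A is left out, as it is in _parameters.
std::vector<std::string> traceOpFoo()
{
    ToyParam const pair{ "(alpha, 12)",
        { ToyParam{"alpha", {}}, ToyParam{"12", {}} } };
    ToyParam const scale{ "(3.1415, 6.02e23, 2.718)",
        { ToyParam{"3.1415", {}}, ToyParam{"6.02e23", {}},
          ToyParam{"2.718", {}} } };
    std::vector<ToyParam> const positionals(1, pair);
    std::map<std::string, ToyParam> const keywords{ { "scale_by", scale } };

    std::vector<std::string> lines;
    visitAll([&lines](ToyParam const& p, Where const& w,
                      std::string const& kw) {
        std::string line = kw + "[";
        for (std::size_t i = 0; i < w.size(); ++i)
            line += (i ? "," : "") + std::to_string(w[i]);
        lines.push_back(line + "]=" + p.text);
    }, positionals, keywords);
    return lines;
}
```

In the resulting trace, the 12 is recorded with indices 0,1 and an empty keyword, and 6.02e23 is recorded with the same indices under scale_by, matching the description above.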

Tile Mode Considerations

If you have an operator with one or more expression parameters that must be compiled in ``tile mode'', you must override the LogicalOperator::compileParamInTileMode() method. That method's signature and default implementation are

1: virtual bool compileParamInTileMode(PlistWhere const& where,
2:                                     std::string const& keyword)
3: {
4:     return false;
5: }

As with PlistVisitor callables, you get a PlistWhere vector and a keyword string to indicate which parameter the optimizer is asking about. Recall the new syntax for the apply operator: apply(A, (attr1, expr1), (attr2, expr2)). Here is the compileParamInTileMode() method's implementation in LogicalApply:

 1: bool compileParamInTileMode(PlistWhere const& where,
 2:                             string const&) override
 3: {
 4:     assert(where.size() < 3);
 5:     if (where.size() == 2) {
 6:         // In a (foo, bar) nested list, compile 'bar' in tile mode if
 7:         // possible.
 8:         return where[1] == 1;
 9:     }
10:
11:     assert(where.size() == 1);
12:     if (getParameters()[where[0]]->getParamType() == PARAM_NESTED) {
13:         return false;
14:     }
15:
16:     // Backward compatibility: old-style parameter list without
17:     // nesting.  In a "foo, bar, baz, mumble" parameter list, compile
18:     // 'bar' and 'mumble' in tile mode if possible.
19:     return (where[0] % 2) == 1;
20: }

Lines 5-9 handle the case where new style (attr, expr) pairs are used. We know we're looking inside a nested sublist because where.size() == 2. The where indices of a pair will be [x, 0] for the attribute and [x, 1] for the expression, so we return true if the optimizer is asking about the expression parameter.

Lines 11-14 take care of the ``parent node'' of the (attr, expr) pair. If this is indeed a nested sublist, then it's not an expression and shouldn't be compiled in tile mode.

Lines 16-19 handle old style attr, expr pairs not enclosed in parentheses. The odd numbered parameter is the expr, so compile that in tile mode if possible.

Debugging RE Regular Expressions: dfa_tests.cpp

Because the dfa::RE<T> template class is equally happy when T is an OperatorParamPlaceholder or when T is an std::string, you can write a small program using dfa::RE<std::string> to test possible parameter list specs.

The tests/dfa/dfa_tests.cpp program in the SciDB Community Edition source distribution provides an excellent testbed for trying out new RE regular expressions. By adding a new entry to the testgroups vector, you can try out an arbitrary RE against a series of input symbol sequences.

During startup, operators write their PlistSpec objects to the scidb.log file:

... Logical help() parameter list spec: {"": "<constant:string>?"}

If this line is missing, OperatorLibrary had trouble recognizing your static makePlistSpec() method. Check the signature and make sure it's declared static.

Printing REs

The dfa::RE<T>::asRegex() method returns a string representation of the RE. The SciDB help() operator makes use of this to automatically generate the usage text for many operators.

Displaying DFAs

The dfa_tests.cpp file and src/query/OperatorLibrary.cpp both contain examples of how to compile an RE into a DFA. Once you have a compiled DFA, you can use the dfa::DFA<T>::asDot(std::ostream&) method to write a DOT language representation of the state machine to an output file. Then use the dot program from the graphviz package (install from your favorite repo or from https://graphviz.gitlab.io) to generate a displayable .png file. The DOT language is an easy-to-read text representation, so you may not need to bother with the pretty picture.
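
For readers who have not seen DOT before, here is roughly what such a file looks like. The sketch does not call dfa::DFA<T>::asDot() and makes no claim about its exact output: ToyEdge, ToyDfa, writeDot(), and helpDot() are hypothetical, and the two-state machine for the help() spec is written out by hand.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, hand-built state machine; SciDB's dfa::DFA<T> is not used.
struct ToyEdge { int from; int to; std::string label; };

struct ToyDfa {
    int states;                  // states are numbered 0 .. states-1
    std::set<int> accepting;
    std::vector<ToyEdge> edges;
};

// Write the machine in Graphviz DOT syntax.
void writeDot(ToyDfa const& dfa, std::ostream& os)
{
    os << "digraph dfa {\n  rankdir=LR;\n";
    for (int s = 0; s < dfa.states; ++s) {
        char const* shape =
            dfa.accepting.count(s) ? "doublecircle" : "circle";
        os << "  " << s << " [shape=" << shape << "];\n";
    }
    for (ToyEdge const& e : dfa.edges)
        os << "  " << e.from << " -> " << e.to
           << " [label=\"" << e.label << "\"];\n";
    os << "}\n";
}

// The machine for "<constant:string>?": both states accept, and one
// string constant moves from state 0 to state 1.
std::string helpDot()
{
    ToyDfa const help{ 2, {0, 1}, { {0, 1, "constant:string"} } };
    std::ostringstream out;
    writeDot(help, out);
    return out.str();
}
```

Saved to a file such as help.dot, text like this can be rendered with dot -Tpng help.dot -o help.png.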


Author: Mike Leibensperger, Paradigm4 Inc.

Created: 2019-05-01 Wed 11:45