The “Parsing Nesting” parser and Solr query transformer

User-entered queries handled

Conversion to Solr

As mentioned, a straight list of terms, such as (in the most complicated case) one -two +“three four”, is translated directly to a dismax query for those entered terms, using the qf/pf/mm/etc. you have configured for the Blacklight search_field in question. (By default the advanced search plugin uses exactly the same field configurations you already have for simple search, but you could also choose to pass in different ones for advanced search, perhaps setting mm to 100%.)
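To make that translation concrete, here is a minimal Ruby sketch of wrapping a plain term list in a nested dismax query using Solr's local-params syntax. The helper name nested_dismax and the $title_qf / $title_pf parameter references are assumptions for illustration only; they are not the plugin's actual API or your actual configuration.

```ruby
# Hypothetical helper (not part of the plugin): wraps a user-entered term
# list in a nested dismax query using Solr local params.
def nested_dismax(terms, local_params = {})
  params = local_params.map { |key, value| "#{key}=#{value}" }.join(" ")
  # The inner query sits inside a double-quoted string, so backslashes and
  # double quotes in the user's terms must be escaped.
  escaped = terms.gsub(/[\\"]/) { |char| "\\#{char}" }
  prefix = params.empty? ? "{!dismax}" : "{!dismax #{params}}"
  %(_query_:"#{prefix}#{escaped}")
end

# $title_qf and $title_pf are assumed names of request parameters that
# would hold the configured qf and pf values.
puts nested_dismax('one -two +"three four"', qf: '$title_qf', pf: '$title_pf')
```

The point of the sketch is only that the user's term list passes through untouched apart from quoting; dismax itself interprets the - and + prefixes and the phrase.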

There are a few motivations for doing things this way:

Once you start using the boolean operators AND, OR, and NOT, the query will no longer necessarily be converted to a single nested dismax query; a single user-entered string may be converted to multiple nested queries. In some common cases, multiple clauses will still be collapsed into fewer dismax queries than the 'naive' translation would produce. Examples:
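As a rough illustration of the collapsing idea, the Ruby sketch below contrasts a naive translation of an AND of simple terms (one nested query per operand) with a collapsed one (a single dismax query with every term marked required). Both helper names are hypothetical, the sketch assumes bare single-word terms with no quoting, and the exact strings the plugin emits may differ.

```ruby
# Hypothetical sketch: two ways to translate "one AND two AND three".
# Assumes each term is a bare word, so no escaping is needed.

def naive_and(terms)
  # One nested dismax query per operand, joined by AND.
  terms.map { |term| %(_query_:"{!dismax}#{term}") }.join(" AND ")
end

def collapsed_and(terms)
  # A single nested dismax query in which every term is marked required.
  required = terms.map { |term| "+#{term}" }.join(" ")
  %(_query_:"{!dismax}#{required}")
end

terms = %w[one two three]
puts naive_and(terms)
puts collapsed_and(terms)
```

Both forms require every term to match, but the collapsed form lets a single dismax query handle all of the terms together.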

However, if you use complicated crazy nesting, you can get a lot of nested queries generated:

Note on pure negative queries

In Solr 1.4.1, the dismax query parser can't handle queries with only “-” excluded terms. And while the lucene query parser can handle certain types of pure negative queries, it can't properly handle a NOT(x) as one of the operands of an “OR”. Our query generation strategy notices these cases and transforms them into a semantically equivalent query that Solr can handle properly. At least it tries; this is the least clean part of the code. But there are specs showing it works for some fairly complicated queries.

This works with very complicated queries where the bad pure negative part is just a sub-clause or sub-query. Sometimes the result is not the most concise query possible, but it should hold to its semantics.
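The kind of rewrite involved can be sketched in Ruby as follows. This shows one semantically equivalent transformation, using De Morgan's law and the match-all query *:*; the helper names are hypothetical and the plugin's real output may be shaped differently.

```ruby
# Hypothetical helpers illustrating one semantically equivalent rewrite of
# pure negative queries. Assumes bare single-word terms.

# "-one -two" means NOT one AND NOT two, which by De Morgan's law equals
# NOT (one OR two). mm=1 asks dismax to match when any one term matches.
def negate_all(excluded_terms)
  %(NOT _query_:"{!dismax mm=1}#{excluded_terms.join(' ')}")
end

# As an operand of OR, anchor the negation to the match-all query *:* so
# there is a positive set of documents to subtract from.
def negation_as_or_operand(excluded_terms)
  "(*:* AND #{negate_all(excluded_terms)})"
end

puts negate_all(%w[one two])
puts "#{negation_as_or_operand(%w[one])} OR " + %(_query_:"{!dismax}two")
```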

Why not use e-dismax?

That would be a potentially reasonable choice. Why didn't I?

One, at the time of this writing, edismax is not available in a tagged stable Solr release, and I write code for Blacklight that works with tagged stable releases.

Two, edismax doesn't necessarily entirely support the semantics I want, especially for features I would like to add in the future. I am not sure exactly what edismax does with complicated deeply nested expressions. For fielded searches, edismax supports actual individual Solr fields, but not the “fields” as dismax qf aggregates that we need. These things could be added to edismax, but with my lack of Java chops and familiarity with Solr code, it would have taken me much longer to do (and been much less enjoyable).

I think it may be a reasonable choice to separate concerns between Solr and the app layer like this: let Solr handle basic search expressions, but let the app layer handle more complicated query parsing, translating to those simple expressions.

On the other hand, there are definite downsides to this approach, including having to deal with idiosyncrasies of the built-in query parsers (“pure negative” behavior), depending upon other idiosyncrasies (dismax does not apply mm to -excluded terms), and not being able to share the code at the Solr/Java level.

In the future, a different approach that might be best of all would be to use the not-yet-finished XML query parser: do the initial parsing in Ruby at the app level, but translate to specified lucene primitives via the XML query parser, instead of having to translate to the lucene/dismax query parsers.

Future Enhancement Ideas

Just ideas.

  1. Allow expert “fielded” searches, such as title:foo, which would correspond not to an actual Solr index field “title”, but to the qf/pf of a Blacklight-configured “search field”.

  2. Insert this app-level parser even in “simple” search, so users can use boolean operators even in a single-fielded simple search.

  3. Allow a different set of qf to be used for any “phrase term”, so phrases would search only on non-stemming fields. This would be cool, but it would have somewhat weird effects on dismax mm, since it would mean all phrases would be extracted into separate nested queries.

  4. Better handling of syntax errors in query entry. In the plugin as a whole, error messages should be displayed on the input screen so the entry can be fixed. And because we use Parslet for parsing, we can potentially deliver better error messages that guess what the user got wrong and where in their entry.