QueryLanguages stable

ProLearn Query Language Definition, V.0.7

Alex Campi, Stefano Ceri, Peter Dolog, Erik Duval, Sam Guinea, Geert-Jan Houben, David Massart, Mikael Nilsson, Stefaan Ternier, Zhou Xuan

Introduction

This report provides a first definition of the ProLearn Query Language (PLQL), a “query language for learning repositories” defined in the context of the ProLearn Network of Excellence (WP4). Our background is the Simple Query Interface (SQI), a query transport standard (also defined within ProLearn) that is becoming widely used within the e-learning community. One of SQI’s distinguishing features is that it is deliberately agnostic about query languages; as such, it can be used together with any query language, but does not include any provision for expressing the query semantics. Going beyond SQI requires adding support for a specific query language, designed with the specific objective of supporting the exchange of queries between heterogeneous learning object repositories.

PLQL is primarily a “query interchange format” used by source e-learning applications (or sources) to query target learning object repositories (or targets), using the SQI protocol. It is up to the source application developers to define user interfaces, normally based upon keyword-based forms. It is up to the target repository managers to build adapters that will be compliant with PLQL.

PLQL combines two features: support of keyword-based search and support of exact matching.

  • The former is required for extracting LOs that best match informal descriptions, typically provided through search interfaces; for this feature we borrow concepts from CQL [1], a well-established language used for library search.
  • The latter is required when the application knows the metadata schema, which is typically in hierarchical format (e.g., LOM), and therefore can extract data that match specific items of the repository schema. We use the concept of path for this purpose.

The language should then be capable of performing exact search when the structure of the metadata is known, as well as ranked retrieval when the structure of the metadata is not known. Perhaps the most challenging aspect of the language's semantics is to define the query semantics when both query aspects should be taken into consideration.

In designing PLQL, we also aim to support repositories that offer only minimal query capabilities. Thus, one of the main concerns of the language design is to provide progressive layers of increasing expressive power, so that even simple repositories can support the lower layers.

PLQL Layers

PLQL is built in layers. Layer 0 is very basic and corresponds to simple keyword searches. Layer 5 is the richest layer; it is not yet defined, but we assume it will eventually comprise all desired features. This document covers layers 0, 1, and 2, which are presented next; future versions of the specification will address layers 3, 4, and 5. For each layer we indicate:

1. Description of the layer's expressive power

2. Syntax

3. Some examples

4. Discussion

Layer 0

Expressive power

This layer enables the expression of search terms. A target that receives a layer-0 query must decide itself on which metadata fields it will map the search terms. The target must take all the search terms specified in the query into account. The structure of the result returned by a query is discussed in a dedicated section below.

Syntax

The following is the Backus Naur Form (BNF) definition for Layer 0. It is based on [2].

0-1: PLQLQuery ::= clause

0-2: clause ::= keywordClause | '(' clause ')' | clause 'and' clause

0-3: keywordClause ::= term1 | term2

0-4: term1 ::= charString1

0-5: term2 ::= charString2

0-6: charString1 ::= Any sequence of characters that does not include any of the following:

  • whitespace
  • ( (open parenthesis)
  • ) (close parenthesis)
  • =
  • <
  • >
  • '"' (double quote)
  • /

If the final sequence is a reserved word, that token is returned instead. Note that '.' (period) may be included, and a sequence of digits is also permitted. Reserved words are 'and', 'or', 'not', and 'prox' (case insensitive). When a reserved word is used in a search term, case is preserved.

0-7: charString2 ::= Double quotes enclosing a sequence of any characters except double quote (unless preceded by backslash (\)). Backslash escapes the character following it. The resultant value includes all backslash characters except those releasing a double quote (this allows other systems to interpret the backslash character). The surrounding double quotes are not included.

This Layer of PLQL is identified with the following URI: http://www.prolearn-project.org/PLQL/l0

Examples

The following queries are correct PLQL layer zero expressions:

 "dog"
 "learning object" and dog
 dog and cat and jaguar
 (dog and cat) and jaguar

These examples are incorrect expressions (that cannot be submitted to a repository using PLQL level 0):

 "learning object" or "dog"
 "learning object" dog
 cql.serverChoice = "dog" 

Note: for a storage server supporting CQL, the last example is equivalent to "dog", but such a query is supported only at Level 1.
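
The layer 0 grammar can be sketched as a small recursive-descent parser. The following Python sketch is illustrative only and is not part of the specification: the names `tokenize`, `parse_layer0`, and `is_layer0` are our own, and the parser simply returns the flat list of search terms, since layer 0 only combines terms with 'and'.

```python
import re

# Reserved words from production 0-6 (case insensitive).
RESERVED = {"and", "or", "not", "prox"}

TOKEN_RE = re.compile(r'''
      (?P<ws>\s+)
    | (?P<lparen>\()
    | (?P<rparen>\))
    | "(?P<quoted>(?:\\.|[^"\\])*)"      # charString2
    | (?P<bare>[^\s()=<>"/]+)            # charString1
    | (?P<other>.)                       # anything else is illegal at layer 0
''', re.VERBOSE | re.DOTALL)


def tokenize(query):
    """Split a query into (kind, value) tokens; raise ValueError on bad input."""
    tokens = []
    for m in TOKEN_RE.finditer(query):
        kind = m.lastgroup
        if kind == "ws":
            continue
        if kind == "other":
            raise ValueError(f"unexpected character {m.group()!r}")
        if kind == "quoted":
            # Keep backslashes except those that release a double quote.
            tokens.append(("term", m.group("quoted").replace('\\"', '"')))
        elif kind == "bare" and m.group().lower() in RESERVED:
            tokens.append(("reserved", m.group().lower()))
        elif kind == "bare":
            tokens.append(("term", m.group()))
        else:
            tokens.append((kind, m.group()))
    return tokens


def parse_layer0(query):
    """Return the list of search terms, or raise ValueError if not layer 0."""
    tokens = tokenize(query)
    pos = 0
    terms = []

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        kind, value = tokens[pos]
        pos += 1
        if kind == "term":
            terms.append(value)
        elif kind == "lparen":
            clause()
            if pos >= len(tokens) or tokens[pos][0] != "rparen":
                raise ValueError("missing ')'")
            pos += 1
        else:
            raise ValueError(f"unexpected token {value!r}")

    def clause():
        nonlocal pos
        primary()
        while pos < len(tokens) and tokens[pos] == ("reserved", "and"):
            pos += 1
            primary()

    clause()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos][1]!r}")
    return terms


def is_layer0(query):
    """True if the query is accepted by this sketch of the layer 0 grammar."""
    try:
        parse_layer0(query)
        return True
    except ValueError:
        return False
```

With this sketch, the four correct examples above are accepted and the three incorrect ones are rejected: 'or' is reserved but not allowed at layer 0, two adjacent terms need an explicit 'and', and '=' cannot appear outside quotes.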

Discussion

Layer zero offers the same expressive power as VSQL, the query language that is supported by a whole range of SQI targets. As an example, a VSQL query looks like:

<simpleQuery>
 <term>learning object</term>
 <term>dog</term>
</simpleQuery>

The query above is equivalent to:

 "learning object" and "dog"
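
The translation from VSQL to PLQL layer 0 can be sketched in a few lines of Python. This is only an illustration: the function name `vsql_to_plql` is our own, and we assume the VSQL document has exactly the shape shown above (a `simpleQuery` element with `term` children and no XML namespace).

```python
import xml.etree.ElementTree as ET


def vsql_to_plql(vsql_xml):
    """Turn a VSQL <simpleQuery> into a PLQL layer 0 conjunction of quoted terms."""
    root = ET.fromstring(vsql_xml)
    terms = [t.text.strip() for t in root.findall("term")
             if t.text and t.text.strip()]
    # Quote every term (charString2) and escape embedded double quotes.
    quoted = ['"' + term.replace('"', '\\"') + '"' for term in terms]
    return " and ".join(quoted)


example = """<simpleQuery>
 <term>learning object</term>
 <term>dog</term>
</simpleQuery>"""
plql = vsql_to_plql(example)
```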

Terms are considered in conjunctive form, but this has to be interpreted in the context of ranked queries; therefore, LOs matching “many” terms will have higher rankings than LOs matching “fewer” terms.
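
One possible reading of this ranked conjunctive semantics is sketched below in Python. It is an assumption made for illustration, not a requirement of PLQL: the target counts how many query terms occur (as case-insensitive substrings) in the text it has chosen to search, and orders LOs by that count. The function name and the data layout (a mapping from URI to searchable text) are hypothetical.

```python
def rank_by_matched_terms(terms, documents):
    """Order URIs by the number of query terms found in their text.

    documents maps a URI to the text the target decided to search.
    LOs matching no term at all are left out (an assumption of this sketch).
    """
    scored = []
    for uri, text in documents.items():
        haystack = text.lower()
        hits = sum(1 for term in terms if term.lower() in haystack)
        if hits:
            scored.append((hits, uri))
    # More matched terms first; ties are broken by URI for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [uri for _, uri in scored]


docs = {
    "urn:example:a": "A learning object about a dog",
    "urn:example:b": "Dog training basics",
    "urn:example:c": "Cats",
}
ranking = rank_by_matched_terms(["learning object", "dog"], docs)
```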

Layer 1

Expressive power

In layer one, in addition to the approximate searches supported by level 0, PLQL queries can express exact searches on metadata fields. The latter are denoted by means of paths. Layer 1 supports paths as simple concatenations of elements (separated by dots), starting from the root, with no omission; expressions and parentheses are not allowed.

Layer 1 only supports the following roots: DC, LOM, and MPEG. Generic namespaces are not supported. This layer is unaware of “types”, and attribute values cannot be composed. However, encoded strings that represent diverse dataTypes are allowed, given that their meanings can be clarified by referencing URI-identified meta-schemas. Similarly, we allow for the meaning of certain expressions to be clarified by referring to these meta-schemas. For instance, a query asking for duration “> PT1H” or the meaning of VCARD becomes clear within the LOM context.

When both exact and approximate clauses are present in a single query, it is assumed that the exact search has a higher priority than the approximate search. The semantics of PLQL when both exact and approximate clauses are present is to apply the exact clauses first to build an initial result set, then to apply the approximate clauses to the initial result set. This produces a final result set. As in layer 0, the exact semantics of the approximate search is left to the target repository implementation.

However, exact queries might not parse correctly against the metadata available at the storage server. When some exact clauses cannot be parsed by the server, a return code should indicate each of them as "not executed". In particular, if no exact clauses can be parsed in the repository metadata, then the effect of the exact search is null; the repository should operate on the entire set of LOs as if no exact search had been performed.

As a variant of this semantics, which the application may request when it submits the query, the repository could be allowed to treat the constant values of exact clauses that cannot be parsed as free keywords, and thus perform an approximate search based on the terms in those clauses; a return code should then indicate to the application that this has occurred. Such a variant should be evaluated experimentally to see whether it is useful, at least in certain contexts. Note that a repository unable to process exact queries against certain metadata could always resort to this interpretation.
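
The basic evaluation order described above (exact clauses first, then approximate clauses on the initial result set, with unparsable exact clauses reported as "not executed") can be sketched as follows. Everything concrete in this Python sketch is an assumption made for illustration: the record layout, the function name `evaluate_layer1`, the use of string equality for '=', and the substring-based keyword scoring.

```python
def evaluate_layer1(records, exact_clauses, keywords):
    """Return (ranked URIs, status per exact clause path).

    records: list of dicts with 'uri', 'fields' (path -> value) and 'text'.
    exact_clauses: list of (path, value) pairs.
    keywords: list of free search terms.
    """
    # A clause is "parsable" here if its path exists somewhere in the repository.
    known_paths = set()
    for record in records:
        known_paths.update(record["fields"])

    status = {}
    executed = []
    for path, value in exact_clauses:
        if path in known_paths:
            executed.append((path, value))
            status[path] = "executed"
        else:
            status[path] = "not executed"

    # Step 1: exact clauses build the initial result set. If none could be
    # executed, all() over an empty list is True and every record is kept.
    candidates = [r for r in records
                  if all(r["fields"].get(p) == v for p, v in executed)]

    if not keywords:
        return [r["uri"] for r in candidates], status

    # Step 2: approximate clauses are applied to the initial result set only.
    scored = []
    for r in candidates:
        text = r["text"].lower()
        hits = sum(1 for k in keywords if k.lower() in text)
        if hits:
            scored.append((hits, r["uri"]))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep input order
    return [uri for _, uri in scored], status


repo = [
    {"uri": "a", "text": "design patterns in java",
     "fields": {"lom.general.title": "Design Patterns", "lom.general.language": "en"}},
    {"uri": "b", "text": "patrons de conception java",
     "fields": {"lom.general.title": "Design Patterns", "lom.general.language": "fr"}},
    {"uri": "c", "text": "sql basics",
     "fields": {"lom.general.title": "SQL", "lom.general.language": "en"}},
]
uris, status = evaluate_layer1(
    repo, [("lom.general.language", "en"), ("lom.foo.bar", "x")], ["java"])
```

In this usage example the clause on the made-up path `lom.foo.bar` is flagged "not executed", while the clause on `lom.general.language` restricts the set on which the keyword search runs.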

Syntax

Note that productions with the same number as productions at lower layers substitute for them, e.g. production 1-2 substitutes (generally extends) production 0-2.

1-2: clause ::= keywordClause | standard '.' exactClause | '(' clause ')' | clause 'and' clause

1-7: exactClause ::= path operator term1 | path operator term2

1-8: path ::= term1 | term1 '.' path

1-9: operator ::= '='

1-10: standard ::= 'dc' | 'lom' | 'mpeg'

This Layer of PLQL is identified with the following URI: http://www.prolearn-project.org/PLQL/l1
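
As an illustration of productions 1-2 and 1-7 to 1-10, the following Python sketch parses a single layer 1 exact clause into its standard, its path elements, and its value. The function name `parse_exact_clause` is hypothetical, and the sketch accepts only the '=' operator of production 1-9 and the lowercase standards of production 1-10.

```python
import re

CLAUSE_RE = re.compile(r'''
    ^\s*(?P<standard>dc|lom|mpeg)            # production 1-10
    \.(?P<path>[^\s()=<>"/]+)                # dot-separated path (1-8)
    \s*=\s*                                  # production 1-9
    (?:"(?P<quoted>(?:\\.|[^"\\])*)"         # term2
      |(?P<bare>[^\s()=<>"/]+))              # term1
    \s*$''', re.VERBOSE)


def parse_exact_clause(text):
    """Return (standard, [path elements], value) or raise ValueError."""
    m = CLAUSE_RE.match(text)
    if m is None:
        raise ValueError(f"not a layer 1 exact clause: {text!r}")
    elements = m.group("path").split(".")
    if not all(elements):
        raise ValueError("empty path element")
    value = m.group("quoted")
    if value is None:
        value = m.group("bare")
    else:
        value = value.replace('\\"', '"')
    return m.group("standard"), elements, value


parsed = parse_exact_clause('lom.general.title = "Design Patterns"')
```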

Examples

 dc.title = "SQL" and lom.general.title = "SQL"
 lom.general.title = "Design Patterns" and lom.general.language = "en"
 lom.general.title = "Design Patterns" and lom.technical.format = "video/mpeg" 
 and lom.technical.duration <= "PT1H" and lom.rights.cost="no"
 lom.general.title = "Design Patterns" and lom.educational.intendedEndUserRole 
 = "learner" and lom.educational.typicalAgeRange = "15-18"

Discussion

Note that Lucene would support PLQL level one.

Layer 2

Expressive power

In Layer 2 we add parentheses to paths, but we continue to avoid variables and incomplete path expressions skipping operators. We enable general namespaces.

Layer 2 introduces the possibility to use all the comparators (<, >, >=, <=, <>, =, etc.). It is important to distinguish between the "=" symbol and the use of the "exact" keyword. The former indicates the 'includes' operator, while "exact" is used to indicate exact string matching.

Syntax

2-8: path ::= term1 | '(' selector ')' | path '.' path

2-9: operator ::= '=' | '>' | '>=' | '<' | '<=' | 'exact'

2-10: standard ::= 'dc' | 'lom' | 'mpeg' | 'cql' | term1

2-11: selector ::= term1 operator term1 | term1 operator term2 | '(' selector 'and' selector ')'

This Layer of PLQL is identified with the following URI: http://www.prolearn-project.org/PLQL/l2

Examples

 lom.general.identifier.(catalog=isbn and entry=xxxxx)
 lom.general.(title = "Design Patterns" and language = "en")
 lom.general.title = "Design Patterns" and lom.technical.(format = "video/mpeg" 
 and duration <= "PT1H") and lom.rights.cost="free"
 lom.general.title = "Design Patterns" and lom.educational.(intendedEndUserRole
 = "learner" and typicalAgeRange = "15-18")
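
To make the structure of these examples concrete, the following Python sketch splits a single clause of the form prefix.(condition and condition ...) into its path prefix and its list of conditions. It follows the shape of the examples above rather than a strict reading of production 2-11, handles only one parenthesised selector at the end of a clause, and uses a hypothetical function name, `parse_selector_clause`.

```python
import re

COND_RE = re.compile(r'''
    \s*(?P<field>[^\s()=<>"/]+)
    \s*(?P<op><=|>=|=|<|>|\bexact\b)         # operators of production 2-9
    \s*(?:"(?P<quoted>(?:\\.|[^"\\])*)"
         |(?P<bare>[^\s()=<>"/]+))
    \s*''', re.VERBOSE)
AND_RE = re.compile(r'and\b', re.IGNORECASE)


def parse_selector_clause(text):
    """Return ([prefix elements], [(field, operator, value), ...])."""
    head, sep, rest = text.strip().partition(".(")
    if not sep or not rest.endswith(")"):
        raise ValueError("expected prefix.(selector)")
    body = rest[:-1]
    conditions = []
    pos = 0
    while True:
        m = COND_RE.match(body, pos)
        if m is None:
            raise ValueError(f"bad condition at offset {pos}")
        value = m.group("quoted")
        if value is None:
            value = m.group("bare")
        conditions.append((m.group("field"), m.group("op"), value))
        pos = m.end()
        if pos == len(body):
            break
        m_and = AND_RE.match(body, pos)
        if m_and is None:
            raise ValueError("conditions must be joined by 'and'")
        pos = m_and.end()
    return head.split("."), conditions


prefix, conds = parse_selector_clause(
    'lom.general.(title = "Design Patterns" and language = "en")')
```

Keeping the conditions grouped under their common prefix, instead of expanding them into separate full paths, preserves the information that they apply to the same metadata element, which matters for repeatable elements such as identifier.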

Discussion

We assume that PLQL layer 2 offers more or less the maximum expressive power that we can expect from the current generation of repositories. This assumption should be subject to due diligence.

Query Results

The result produced after invoking PLQL on a repository is also “layered”. The result includes in all cases the resource identifiers (URI) of learning objects, while obtaining the resource itself is a task left to the application.

At level zero, targets return just a set of URIs with its cardinality.

At level one, targets return a list of URIs, with its cardinality, ordered by how well each resource matches the query (the first resource is the best match).

At level two, targets return level one’s information and some specific metadata (such as the LO’s title, author, and date of last update).

At level three, targets return level two’s information and a numeric ranking value, optionally together with the method that the target used locally to compute the ranking.

We expect most targets to be able to support at least level one (i.e., to return URIs ordered according to their ranking).

Syntax

Using the method setResultFormat of SQI, a source can use a URI to inform a target of the expected level of the result. Optionally, the URI also indicates the expected format of metadata.

Its BNF is:

1: PLQLRES ::= 'http://www.prolearn-project.org/PLRF/' level '/' standard

2: level ::= '0'|'1'|'2'|'3'

3: standard ::= | 'dc' | 'lom' | 'mpeg'
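
The result format grammar is small enough to sketch directly. The Python helpers below (`build_result_format` and `parse_result_format`, both hypothetical names) build and validate such URIs; they assume, following production 1, that the '/' after the level is always present even when no standard is given.

```python
BASE = "http://www.prolearn-project.org/PLRF/"
LEVELS = {"0", "1", "2", "3"}
STANDARDS = {"", "dc", "lom", "mpeg"}  # the empty string means "no standard"


def build_result_format(level, standard=""):
    """Build the URI to pass to SQI's setResultFormat."""
    level = str(level)
    if level not in LEVELS:
        raise ValueError(f"unknown result level {level!r}")
    if standard not in STANDARDS:
        raise ValueError(f"unknown standard {standard!r}")
    return f"{BASE}{level}/{standard}"


def parse_result_format(uri):
    """Return (level, standard) for a result format URI, or raise ValueError."""
    if not uri.startswith(BASE):
        raise ValueError("not a PLQL result format URI")
    level, sep, standard = uri[len(BASE):].partition("/")
    if not sep or level not in LEVELS or standard not in STANDARDS:
        raise ValueError("malformed PLQL result format URI")
    return int(level), standard


uri = build_result_format(2, "lom")
```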

Result example

All the records from a query are returned in an XML format; we provide an example below. Suppose that a source wants its query to return level 2 ranked results using standard lom metadata. For such a query, the source would select the result format using the following URI:

 http://www.prolearn-project.org/PLRF/2/lom

After the query is submitted, the target could return results like the following example:

 <?xml version="1.0" encoding="UTF-8"?>
 <results 
   xmlns="http://www.prolearn-project.org/PLQLRES/1/lom"
   xmlns:lom="http://ltsc.ieee.org/xsd/LOM"> 
   <result position="1">
     <identifier>
       <catalog>URI</catalog>
       <entry>http://www.ariadne-eu.org/identifier/UPS_v_3.1_nr_180</entry>
     </identifier>
     <metadata> 
        <title> “workflow patterns” </title>
        <author> “van der alst” </author>
     </metadata>
   </result>
  …
 </results>
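
On the source side, a result document of this shape could be read as sketched below. The Python code is an illustration under assumptions: the element names are taken from the example above, the sample document is a shortened copy of it without the elided part, and the namespace is read from the root element instead of being hard-coded. The function name `parse_results` is our own.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<results xmlns="http://www.prolearn-project.org/PLQLRES/1/lom">
  <result position="1">
    <identifier>
      <catalog>URI</catalog>
      <entry>http://www.ariadne-eu.org/identifier/UPS_v_3.1_nr_180</entry>
    </identifier>
    <metadata>
      <title>workflow patterns</title>
      <author>van der alst</author>
    </metadata>
  </result>
</results>"""


def parse_results(xml_bytes):
    """Return a list of dicts with 'position', 'uri' and 'metadata'."""
    root = ET.fromstring(xml_bytes)
    # ElementTree writes namespaced tags as '{namespace}name'.
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""

    def q(name):
        return f"{{{ns}}}{name}" if ns else name

    results = []
    for res in root.findall(q("result")):
        entry = res.find(f"{q('identifier')}/{q('entry')}")
        meta = res.find(q("metadata"))
        fields = {}
        if meta is not None:
            for child in meta:
                fields[child.tag.split("}")[-1]] = (child.text or "").strip()
        results.append({
            "position": int(res.get("position")),
            "uri": entry.text.strip() if entry is not None and entry.text else None,
            "metadata": fields,
        })
    return results


records = parse_results(SAMPLE.encode("utf-8"))
```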

Overview of the full PLQL specification

We have only partially discussed the features of levels 3, 4, and 5. We anticipate the use of disjunction, negation, types, and arbitrary path expressions in level 3; joins, quantifiers, and nested queries in level 4; and recursion, proximity, and other highly expressive clauses in level 5.

In this version of the language, we concentrate on one-time queries rather than a query protocol. We anticipate, however, that future versions of the specifications may include a protocol by means of which the application can interact with the user, who will indicate preferences about the query result so that the result itself can be reused for interaction with the repositories, asking for “more results of the same kind”. Merging results from various repositories will also be part of the protocol.

During the summer, several implementations and experiments are planned (EUNLRE, ARIADNE/GLOBE, KnowledgeMarkets) in order to verify the specifications provided in this document; a revised version based on experiments’ results is planned for October, while another release of the specifications is planned for January 1st, 2007. The final deliverable is due on July 1st, 2007.

CUTS : REINSERT?

As we envision scenarios where these queries will be translated to XQuery, care must be taken with metadata field names. For instance, consider the following query.

 lom.general.Title = "metadata"

Transforming this query to XQuery would return a query like:

 for $lom in doc("data/something.xml")/lom
 where (some $title in $lom/general/Title satisfies
        $title/string/text() = "metadata")
 return $lom

Note however that in the LOM XML Schema, title is defined without a capital 'T'. This implies that this XQuery instance would never return results. Therefore, we refer to this schema for the exact definitions of the field names.
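
One way an adapter could guard against this problem is to check path elements against the schema's element names before generating XQuery. The Python sketch below is an illustration only: `LOM_ELEMENTS` is a small hand-written subset of element names chosen for the example (a real adapter would derive the names from the LOM XML Schema), and correcting the case automatically, rather than rejecting the query, is a design choice of this sketch and not something the specification prescribes.

```python
# Illustrative subset of element names; an assumption, not the full schema.
LOM_ELEMENTS = {
    "lom", "general", "title", "language", "identifier", "catalog", "entry",
    "technical", "format", "duration", "educational",
    "intendedEndUserRole", "typicalAgeRange", "rights", "cost",
}


def normalize_path(path):
    """Map each path element to the schema's spelling, ignoring case."""
    canonical = {name.lower(): name for name in LOM_ELEMENTS}
    elements = []
    for element in path.split("."):
        key = element.lower()
        if key not in canonical:
            raise ValueError(f"unknown metadata element {element!r}")
        elements.append(canonical[key])
    return ".".join(elements)


fixed = normalize_path("lom.general.Title")
```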
