[[search-percolate]]
== Percolator

.Percolating geo-queries in Elasticsearch 2.2.0 or later
[IMPORTANT]
======================================
The new geo-queries added in Elasticsearch 2.2.0 and above require that
`doc_values` are enabled in order to function. Unfortunately, the in-memory
index used by the percolator does not yet support `doc_values`, meaning that
geo-queries will not work in a percolator index created in Elasticsearch 2.2.0
or later. See <<geo-percolate>> for a workaround.
======================================

Traditionally you design documents based on your data, store them into an index,
and then define queries via the search API in order to retrieve these documents.
The percolator works in the opposite direction. First you store queries into an
index and then, via the percolate API, you define documents in order to retrieve
these queries.

The reason that queries can be stored comes from the fact that in Elasticsearch
both documents and queries are defined in JSON. This allows you to embed queries
into documents via the index API. Elasticsearch can extract the query from a
document and make it available to the percolate API. Since documents are also
defined as JSON, you can define a document in a request to the percolate API.

The percolator and most of its features work in realtime, so once a percolate
query is indexed it can immediately be used in the percolate API.

[IMPORTANT]
=====================================
Fields referred to in a percolator query must *already* exist in the mapping
associated with the index used for percolation. There are two ways to make sure
that a field mapping exists:

* Add or update a mapping via the create index or put mapping APIs, as shown in
  the sketch after this note.
* Percolate a document before registering a query. Percolating a document can
  add field mappings dynamically, in the same way as happens when indexing a
  document.
=====================================
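
For example, a mapping for a field that will be referenced by percolator queries
could be added up front with the put mapping API. This is only a minimal sketch;
the index, type and field names are illustrative:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my-index/_mapping/my-type' -d '{
    "properties" : {
        "message" : {
            "type" : "string"
        }
    }
}'
--------------------------------------------------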

[float]
=== Sample Usage

Create an index with a mapping for the field `message`:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my-index' -d '{
  "mappings": {
    "my-type": {
      "properties": {
        "message": {
          "type": "string"
        }
      }
    }
  }
}'
--------------------------------------------------

Register a query in the percolator:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
    "query" : {
        "match" : {
            "message" : "bonsai tree"
        }
    }
}'
--------------------------------------------------

Match a document to the registered percolator queries:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
    "doc" : {
        "message" : "A new bonsai tree in the office"
    }
}'
--------------------------------------------------

The above request will yield the following response:

[source,js]
--------------------------------------------------
{
    "took" : 19,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "total" : 1,
    "matches" : [ <1>
        {
            "_index" : "my-index",
            "_id" : "1"
        }
    ]
}
--------------------------------------------------
<1> The percolate query with id `1` matches our document.

[float]
=== Indexing Percolator Queries

Percolate queries are stored as documents in a specific format and in an
arbitrary index under a reserved type with the name `.percolator`. The query
itself is placed as is in a JSON object under the top level field `query`.

[source,js]
--------------------------------------------------
{
    "query" : {
        "match" : {
            "field" : "value"
        }
    }
}
--------------------------------------------------

Since this is just an ordinary document, any field can be added to it. This can
be useful later on to only percolate documents by specific queries.

[source,js]
--------------------------------------------------
{
    "query" : {
        "match" : {
            "field" : "value"
        }
    },
    "priority" : "high"
}
--------------------------------------------------

On top of this, a mapping type can also be associated with the query. This
allows you to control how certain queries, such as range queries, shape filters,
and other queries and filters that rely on mapping settings, get constructed.
This is important since the percolate queries are indexed into the `.percolator`
type, and queries and filters that rely on mapping settings would otherwise
yield unexpected behaviour. Note: by default, field names do get resolved in a
smart manner, but in certain cases with multiple types this can lead to
unexpected behavior, so being explicit about it will help.

[source,js]
--------------------------------------------------
{
    "query" : {
        "range" : {
            "created_at" : {
                "gte" : "2010-01-01T00:00:00",
                "lte" : "2011-01-01T00:00:00"
            }
        }
    },
    "type" : "tweet",
    "priority" : "high"
}
--------------------------------------------------

In the above example the range query really gets parsed into a Lucene numeric
range query, based on the settings for the field `created_at` in the type
`tweet`.

Just as with any other type, the `.percolator` type has a mapping, which you can
configure via the mappings APIs. The default percolate mapping doesn't index the
query field, it only stores it. By default the following mapping is active:

[source,js]
--------------------------------------------------
{
    ".percolator" : {
        "properties" : {
            "query" : {
                "type" : "object",
                "enabled" : false
            }
        }
    }
}
--------------------------------------------------

If needed, this mapping can be modified with the update mapping API.

In order to un-register a percolate query the delete API can be used. So if the
previously added query needs to be deleted, the following delete request needs
to be executed:

[source,js]
--------------------------------------------------
curl -XDELETE localhost:9200/my-index/.percolator/1
--------------------------------------------------

[float]
=== Percolate API

The percolate API executes in a distributed manner, meaning it executes on all
shards an index points to.

.Required options
* `index` - The index that contains the `.percolator` type. This can also be an alias.
* `type` - The type of the document to be percolated. The mapping of that type is used to parse the document.
* `doc` - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note: This isn't required when percolating an existing document.

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
    "doc" : {
        "created_at" : "2010-10-10T00:00:00",
        "message" : "some text"
    }
}'
--------------------------------------------------

.Additional supported query string options
* `routing` - In case the percolate queries are partitioned by a custom routing value, this option makes sure that the percolate request only gets executed on the shard that the routing value is partitioned to. This means that the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a comma separated string, in which case the request can be executed on more than one shard.
* `preference` - Controls which shard replicas are preferred to execute the request on. Works the same as in the search API.
* `ignore_unavailable` - Controls if missing concrete indices should silently be ignored. Same as in the search API.
* `percolate_format` - If `ids` is specified then the matches array in the percolate response will contain a string array of the matching ids instead of an array of objects. This can be useful to reduce the amount of data being sent back to the client. Obviously if there are two percolator queries with the same id from different indices there is no way to find out which percolator query belongs to which index. Any other value to `percolate_format` will be ignored.

.Additional request body options
* `filter` - Reduces the number of queries to execute during percolating. Only the percolator queries that match the filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have occurred for the filter to include the latest percolate queries.
* `query` - Same as the `filter` option, but also computes the score. The computed scores can then be used by the `track_scores` and `sort` options.
* `size` - Defines the maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
* `track_scores` - Whether the `_score` is included for each match. The `_score` is based on the query and represents how the query matched the *percolate query's metadata*, *not* how the document (that is being percolated) matched the query. The `query` option is required for this option. Defaults to `false`.
* `sort` - Define a sort specification like in the search API. Currently only sorting on `_score` in reverse (default relevancy) is supported. Other sort fields will throw an exception. The `size` and `query` options are required for this setting. Like `track_scores`, the score is based on the query and represents how the query matched the percolate query's metadata and *not* how the document being percolated matched the query.
* `aggs` - Allows aggregation definitions to be included. The aggregations are based on the matching percolator queries; look at the aggregation documentation on how to define aggregations.
* `highlight` - Allows highlight definitions to be included. The document being percolated is highlighted for each matching query. This allows you to see how each match is highlighting the document being percolated. See the highlight documentation on how to define highlights. The `size` option is required for highlighting; the performance of highlighting in the percolate API depends on how many matches are being highlighted. See the sketch after this list for an example of combining these options.
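
The following request combines several of these body options. It is only a
sketch, assuming that the registered queries carry a `priority` metadata field
as in the earlier example:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
    "doc" : {
        "message" : "A new bonsai tree in the office"
    },
    "size" : 10,
    "track_scores" : true,
    "query" : {
        "match" : {
            "priority" : "high"
        }
    },
    "highlight" : {
        "fields" : {
            "message" : {}
        }
    }
}'
--------------------------------------------------

Only queries whose metadata matches `"priority" : "high"` are executed, at most
10 matches are returned, and each match includes a score and highlighted
fragments of the `message` field.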

[[dedicated-percolator]]
[float]
=== Dedicated Percolator Index

Percolate queries can be added to any index. Instead of adding percolate queries
to the index the data resides in, these queries can also be added to a dedicated
index. The advantage of this is that the dedicated percolator index can have its
own index settings (for example the number of primary and replica shards). If
you choose to have a dedicated percolator index, you need to make sure that the
mappings from the normal index are also available on the percolator index.
Otherwise percolate queries can be parsed incorrectly.

[float]
=== Filtering Executed Queries

Filtering allows you to reduce the number of queries. Any filter that the search
API supports (except the ones mentioned in the important notes) can also be used
in the percolate API. The filter only works on the metadata fields. The `query`
field isn't indexed by default. Based on the query we indexed before, the
following filter can be defined:

[source,js]
--------------------------------------------------
curl -XGET localhost:9200/test/type1/_percolate -d '{
    "doc" : {
        "field" : "value"
    },
    "filter" : {
        "term" : {
            "priority" : "high"
        }
    }
}'
--------------------------------------------------

[float]
=== Percolator Count API

The count percolate API only keeps track of the number of matches and doesn't
keep track of the actual matches. Example:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
    "doc" : {
        "message" : "some message"
    }
}'
--------------------------------------------------

Response:

[source,js]
--------------------------------------------------
{
    ... // header
    "total" : 3
}
--------------------------------------------------

[float]
=== Percolating an Existing Document

In order to percolate a newly indexed document, the percolate existing document
API can be used. Based on the response from an index request, the `_id` and
other meta information can be used to immediately percolate the newly added
document.

.Supported options for percolating an existing document on top of existing percolator options:
* `id` - The id of the document to retrieve the source for.
* `percolate_index` - The index containing the percolate queries. Defaults to the `index` defined in the url.
* `percolate_type` - The percolate type (used for parsing the document). Defaults to the `type` defined in the url.
* `routing` - The routing value to use when retrieving the document to percolate.
* `preference` - Which shard to prefer when retrieving the existing document.
* `percolate_routing` - The routing value to use when percolating the existing document.
* `percolate_preference` - Which shard to prefer when executing the percolate request.
* `version` - Enables a version check. If the fetched document's version isn't equal to the specified version then the request fails with a version conflict and the percolation request is aborted.

Internally the percolate API will issue a GET request to fetch the `_source` of
the document to percolate. For this feature to work, the `_source` for documents
to be percolated needs to be stored.

[float]
==== Example

Index response:

[source,js]
--------------------------------------------------
{
    "_index" : "my-index",
    "_type" : "message",
    "_id" : "1",
    "_version" : 1,
    "created" : true
}
--------------------------------------------------

Percolating an existing document:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/message/1/_percolate'
--------------------------------------------------

The response is the same as with the regular percolate API.
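
If the queries live in a <<dedicated-percolator,dedicated percolator index>>,
the same request can point at that index via the `percolate_index` option. The
sketch below assumes a hypothetical queries index named `my-queries-index` and
also passes the `version` option so that percolation is aborted if the document
has changed since it was indexed:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/my-index/message/1/_percolate?percolate_index=my-queries-index&version=1'
--------------------------------------------------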

[float]
=== Multi Percolate API

The multi percolate API allows you to bundle multiple percolate requests into a
single request, similar to what the multi search API does for search requests.
The request body format is line based. Each percolate request item takes two
lines: the first line is the header and the second line is the body. The header
can contain any parameter that would normally be set via the request path or
query string parameters.

There are several percolate actions, because there are multiple types of
percolate requests.

.Supported actions:
* `percolate` - Action for defining a regular percolate request.
* `count` - Action for defining a count percolate request.

Depending on the percolate action different parameters can be specified. For
example the percolate and percolate existing document actions support different
parameters.

.The following endpoints are supported
* `GET|POST /[index]/[type]/_mpercolate`
* `GET|POST /[index]/_mpercolate`
* `GET|POST /_mpercolate`

The `index` and `type` defined in the url path are the default index and type.

[float]
==== Example

Request:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary "@requests.txt"; echo
--------------------------------------------------

The index `twitter` is the default index, and the type `tweet` is the default
type, and they will be used in case a header doesn't specify an index or type.

requests.txt:

[source,js]
--------------------------------------------------
{"percolate" : {"index" : "twitter", "type" : "tweet"}}
{"doc" : {"message" : "some text"}}
{"percolate" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
{}
{"percolate" : {"index" : "users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }}
{"size" : 10}
{"count" : {"index" : "twitter", "type" : "tweet"}}
{"doc" : {"message" : "some other text"}}
{"count" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
{}
--------------------------------------------------

For a percolate existing document item (headers with the `id` field), the
request body can be an empty JSON object. All the required options are set in
the header.

Response:

[source,js]
--------------------------------------------------
{
    "responses" : [
        {
            "took" : 24,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0
            },
            "total" : 3,
            "matches" : [
                {
                    "_index": "twitter",
                    "_id": "1"
                },
                {
                    "_index": "twitter",
                    "_id": "2"
                },
                {
                    "_index": "twitter",
                    "_id": "3"
                }
            ]
        },
        {
            "took" : 12,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0
            },
            "total" : 3,
            "matches" : [
                {
                    "_index": "twitter",
                    "_id": "4"
                },
                {
                    "_index": "twitter",
                    "_id": "5"
                },
                {
                    "_index": "twitter",
                    "_id": "6"
                }
            ]
        },
        {
            "error" : "DocumentMissingException[[_na][_na] [user][3]: document missing]"
        },
        {
            "took" : 12,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0
            },
            "total" : 3
        },
        {
            "took" : 14,
            "_shards" : {
                "total" : 5,
                "successful" : 5,
                "failed" : 0
            },
            "total" : 3
        }
    ]
}
--------------------------------------------------

Each item represents a percolate response; the order of the items maps to the
order in which the percolate requests were specified. In case a percolate
request failed, the item response is substituted with an error message.

[float]
=== How it Works Under the Hood

When a document containing a query is indexed under the `.percolator` type, the
query part of the document gets parsed into a Lucene query and is kept in memory
until that percolator document is removed or the index containing the
`.percolator` type is removed. So, all the active percolator queries are kept in
memory.

At percolate time, the document specified in the request gets parsed into a
Lucene document and is stored in an in-memory Lucene index. This in-memory index
can hold just this one document and is optimized for that. Then all the queries
registered to the index that the percolate request is targeted at are executed
on this single-document in-memory index. This happens on each shard the
percolate request needs to execute on.

By using the `routing`, `filter` or `query` features the number of queries that
need to be executed can be reduced and thus the time the percolate API needs to
run can be decreased.

[float]
=== Limitations

Because the percolator API processes one document at a time, it doesn't support
queries and filters that run against child documents such as `has_child` and
`has_parent`.

The `inner_hits` feature on the `nested` query isn't supported in the percolate
API.

The `wildcard` and `regexp` queries natively use a lot of memory, and because
the percolator keeps the queries in memory this can easily take up the available
memory in the heap space. If possible try to use a `prefix` query or ngramming
to achieve the same result (with far less memory being used).

The `delete-by-query` plugin doesn't work to unregister a query; it only deletes
the percolate documents from disk. In order to update the registered queries in
memory the index needs to be closed and opened.

There are a number of queries that fetch data via a get call during query
parsing, for example the `terms` query when using terms lookup, the `template`
query when using indexed scripts, and `geo_shape` when using pre-indexed shapes.
Each time the percolator evaluates these queries, it fetches the terms, shapes,
etc. The data gets fetched separately on the primary shard, on the replica
shards, and when a shard is being recovered (e.g. due to relocation or node
startup). This makes it likely that over time these percolator queries end up
with different terms on primary and replica shards, resulting in inconsistent
results when percolating. There is no workaround to prevent this from happening.
The only alternative is to not use these queries and to do the fetching of terms
on the client side prior to indexing.

[float]
=== Forcing Unmapped Fields to be Handled as Strings

In certain cases it is unknown what kind of percolator queries will be
registered, and if no field mapping exists for a field that is referred to by a
percolator query then adding the percolator query fails. This means the mapping
needs to be updated to have the field with the appropriate settings, and only
then can the percolator query be added. But sometimes it is sufficient if all
unmapped fields are handled as if they were default string fields. In those
cases one can set the `index.percolator.map_unmapped_fields_as_string` setting
to `true` (it defaults to `false`); then, if a field referred to in a percolator
query does not exist, it will be handled as a default string field so that
adding the percolator query doesn't fail.
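
For example, the setting could be enabled when the percolator index is created.
This is only a sketch; the index name is illustrative:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my-queries-index' -d '{
    "settings" : {
        "index.percolator.map_unmapped_fields_as_string" : true
    }
}'
--------------------------------------------------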

[[geo-percolate]]
[float]
=== Percolating geo-queries in Elasticsearch 2.2.0 and later

The new geo-queries added in Elasticsearch 2.2.0 and above require that
`doc_values` are enabled in order to function. Unfortunately, the in-memory
index used by the percolator does not yet support `doc_values`, meaning that
geo-queries will not work in a percolator index created in Elasticsearch 2.2.0
or later.

A workaround exists which allows you to both benefit from the new `geo_point`
field format when searching or aggregating, and to use geo-queries in the
percolator.

Documents with `geo_point` fields should be indexed into a new index created in
Elasticsearch 2.2.0 or later for searching or aggregations:

[source,js]
---------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "location": {
    "lat": 0,
    "lon": 0
  }
}
---------------------------

Percolator queries should be created in a separate
<<dedicated-percolator,dedicated percolator index>>, which claims to have been
created in Elasticsearch 2.1.0:

[source,js]
---------------------------
PUT my_percolator_index
{
  "settings": {
    "index.version.created": 2010299 <1>
  },
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_percolator_index/.percolator/my_geo_query
{
  "query": {
    "geo_distance": {
      "distance": "5km",
      "location": {
        "lat": 0,
        "lon": 0
      }
    }
  }
}
---------------------------
<1> Marks the index as having been created in Elasticsearch 2.1.0

With this setup, you can percolate an inline document as follows:

[source,js]
---------------------------
GET my_percolator_index/my_type/_percolate
{
  "doc": {
    "location": {
      "lat": 0,
      "lon": 0
    }
  }
}
---------------------------

or percolate a document already indexed in `my_index` as follows:

[source,js]
---------------------------
GET my_index/my_type/1/_percolate?percolate_index=my_percolator_index
---------------------------