The idea is to run searches for a large number of terms without hitting the rate limiter, while still getting reasonably accurate counts.
The solution is to use the Elasticsearch _msearch endpoint with count
operations, so each query returns only its hit count, along with any
aggregations that were requested.
Send a GET request to /batch, with no URL parameters and the request data
formatted as JSON in the body. You can see an example request in the
search method in src/search.js.
Here's an example request:
{
  "aggs": {
    "approx_distinct_hash": {
      "cardinality": {
        "field": "sha1"
      }
    }
  },
  "queries": [
    { "query_string": {"query": "one"}},
    { "query_string": {"query": "two"}},
    { "query_string": {"query": "three"}},
    { "query_string": {"query": "*"}},
    { "query_string": {"query": "!@$^!^#!@@"}}
  ],
  "collections": [
    "Code",
    "Test",
    "Enron"
  ]
}
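A client-side sketch of assembling this request body (buildBatchRequest is a hypothetical helper name; the aggregation mirrors the example above and can be omitted or replaced):

```javascript
// Build the /batch request body from a list of user-entered terms and
// the collections to search in.
function buildBatchRequest(queryStrings, collections) {
  return {
    aggs: {
      approx_distinct_hash: {
        cardinality: { field: "sha1" },
      },
    },
    // One query_string query per term, in order.
    queries: queryStrings.map((q) => ({ query_string: { query: q } })),
    collections: collections,
  };
}
```

The resulting object can be serialized with JSON.stringify and sent as the request body, as src/search.js does.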
The query_string.query values come from user input; each line the user
enters becomes a separate query.
At most 100 queries can be submitted in a single request.
Requests with more than 100 queries will fail.
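Because of this limit, a client searching for more than 100 terms has to split them into multiple requests. A minimal sketch (chunkQueries is a hypothetical helper name):

```javascript
// Split a list of terms into chunks of at most `limit` entries,
// one chunk per /batch request.
function chunkQueries(terms, limit = 100) {
  const chunks = [];
  for (let i = 0; i < terms.length; i += limit) {
    chunks.push(terms.slice(i, i + limit));
  }
  return chunks;
}
```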
The aggs field is appended next to each of the queries sent and
its result is included for each of the queries made.
The response has a responses field with the results in the
same order as the queries were given.
For each response object, the following fields are important:
- response.hits.total: the total number of hits for that query
- response.timed_out: true if the query timed out
- response._query: the query object you passed in (like {"query_string": {"query": "one"}})
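A hypothetical helper that pulls these fields out of one entry of the responses array (failed queries carry no hits object, hence the guard):

```javascript
// Summarize a single /batch response entry into the fields the UI needs.
function summarizeResponse(r) {
  return {
    query: r._query.query_string.query,
    total: r.hits ? r.hits.total : null,
    timedOut: Boolean(r.timed_out),
  };
}
```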
The response._query field is echoed back so the UI doesn't have to store
the queries until the response is actually returned. The UI should extract
the query string (such as "one" above) and use it to:
- show the result text
- link to /search?q=one
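Deriving the link from the echoed query might look like this (searchLink is a hypothetical helper; encodeURIComponent guards queries containing special characters):

```javascript
// Build the /search link for one /batch response entry.
function searchLink(response) {
  const q = response._query.query_string.query;
  return "/search?q=" + encodeURIComponent(q);
}
```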
The example above also includes an aggregation that approximates the number of distinct documents (by hash).
This value varies from query to query and can be read from response.aggregations.approx_distinct_hash.value.
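A hypothetical accessor for that value (failed queries carry no aggregations, so return null in that case):

```javascript
// Read the approximate distinct-document count from a /batch response entry.
function approxDistinct(response) {
  return response.aggregations
    ? response.aggregations.approx_distinct_hash.value
    : null;
}
```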
If one of the queries fails, none of these fields will be set on its response.
Get the error message from
response.error.root_cause[0].reason instead.
If response.error.root_cause is an empty list, fall back to
response.error.failed_shards[0].reason.reason. If response.error.failed_shards is also an empty list, the Elasticsearch setup is utterly broken and all hope is lost.
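The fallback chain above can be sketched as follows (errorMessage is a hypothetical helper name; the final message is a placeholder for the no-details case):

```javascript
// Extract a human-readable error message from a failed /batch response entry,
// trying root_cause first, then failed_shards.
function errorMessage(response) {
  const error = response.error;
  if (!error) {
    return null; // the query succeeded
  }
  if (error.root_cause && error.root_cause.length > 0) {
    return error.root_cause[0].reason;
  }
  if (error.failed_shards && error.failed_shards.length > 0) {
    return error.failed_shards[0].reason.reason;
  }
  return "Elasticsearch error (no details available)";
}
```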
A sample of the data returned by the request is below.
{
  "status": "ok",
  "responses": [
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 4051
        }
      },
      "hits": {
        "max_score": 0,
        "total": 4034,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "one"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 31
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 2350
        }
      },
      "hits": {
        "max_score": 0,
        "total": 2350,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "two"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 40
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 1224
        }
      },
      "hits": {
        "max_score": 0,
        "total": 1224,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "three"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 39
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 21912
        }
      },
      "hits": {
        "max_score": 0,
        "total": 22185,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "*"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 61
    },
    {
      "error": {
        "failed_shards": [
          {
            "reason": {
              "col": 50,
              "index": "hoover-enron-pst",
              "caused_by": {
                "type": "parse_exception",
                "caused_by": {
                  "type": "token_mgr_error",
                  "reason": "Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
                },
                "reason": "Cannot parse '!@$^!^#!@@': Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
              },
              "line": 1,
              "reason": "Failed to parse query [!@$^!^#!@@]",
              "type": "query_parsing_exception"
            },
            "index": "hoover-enron-pst",
            "shard": 0,
            "node": "0xO11SNzT_6xdw7Y2mMA4w"
          },
          {
            "reason": {
              "col": 50,
              "index": "hoover-test-data",
              "caused_by": {
                "type": "parse_exception",
                "caused_by": {
                  "type": "token_mgr_error",
                  "reason": "Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
                },
                "reason": "Cannot parse '!@$^!^#!@@': Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
              },
              "line": 1,
              "reason": "Failed to parse query [!@$^!^#!@@]",
              "type": "query_parsing_exception"
            },
            "index": "hoover-test-data",
            "shard": 0,
            "node": "0xO11SNzT_6xdw7Y2mMA4w"
          }
        ],
        "reason": "all shards failed",
        "grouped": true,
        "phase": "query",
        "root_cause": [
          {
            "col": 50,
            "index": "hoover-enron-pst",
            "line": 1,
            "type": "query_parsing_exception",
            "reason": "Failed to parse query [!@$^!^#!@@]"
          },
          {
            "col": 50,
            "index": "hoover-test-data",
            "line": 1,
            "type": "query_parsing_exception",
            "reason": "Failed to parse query [!@$^!^#!@@]"
          }
        ],
        "type": "search_phase_execution_exception"
      },
      "_query": {
        "query_string": {
          "query": "!@$^!^#!@@"
        }
      }
    }
  ]
}