
[Elasticsearch cluster pagination] from-size vs. scroll-scan

Source: reposted from the web  Date: 2022-11-18

1.from-size

Elasticsearch supports paginated queries through the from and size parameters (from-size): https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html

from: the offset of the first hit to return.

size: the number of hits to return.

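How from-size works: with size=10&from=100, every shard has to produce its top from + size = 110 hits, and the coordinating node then sorts all of those candidates, skips the first 100, and returns the next 10. The cost therefore grows with paging depth: the deeper the page, the more hits must be collected and sorted, and the slower the query.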
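
The merge step described above can be illustrated with a small in-memory simulation. This is only a sketch: it makes no calls to Elasticsearch, the shard contents are random scores, and the names shard_top_hits and search_page are made up for the illustration.

```python
import heapq
import random


def shard_top_hits(shard_docs, from_, size):
    """One shard's share of the work: its best from_ + size scores, highest first."""
    return heapq.nlargest(from_ + size, shard_docs)


def search_page(shards, from_, size):
    """Coordinating node: gather every shard's candidates, sort them, slice the page."""
    candidates = []
    for shard_docs in shards:
        candidates.extend(shard_top_hits(shard_docs, from_, size))
    candidates.sort(reverse=True)
    # Also return how many candidates had to be sorted to serve this page.
    return candidates[from_:from_ + size], len(candidates)


random.seed(0)
# Five hypothetical shards holding 1,000 scored documents each.
shards = [[random.random() for _ in range(1000)] for _ in range(5)]
page, sorted_count = search_page(shards, from_=100, size=10)
print(len(page), sorted_count)  # prints "10 550"
```

Only 10 hits come back, but 5 x 110 = 550 candidates had to be collected and sorted. The candidate count is shards x (from + size), which is why deep pages get expensive.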
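In addition, Elasticsearch limits from-size paging to a default depth of 10000 (index.max_result_window). When from + size exceeds that limit, the request fails with an error like this: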

$ curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?size=2&from=10000&pretty=true'
{
  "error" : {
    "root_cause" : [ {
      "type" : "query_phase_execution_exception",
      "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10002]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
    } ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [ {
      "shard" : 0,
      "index" : "passenger0510",
      "node" : "KQcH1OgUQDGAJX9ojz2OqQ",
      "reason" : {
        "type" : "query_phase_execution_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10002]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
      }
    } ]
  },
  "status" : 500
}

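That said, in my experience even a query with from=2000000 was not unbearably slow, taking around 5 seconds. Elasticsearch also provides a setting for raising the default paging depth: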

curl -XPUT "http://127.0.0.1:9200/test-index/_settings" -d '{
    "index": {
        "max_result_window": 10000000
    }
}'
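
As a rough Python counterpart to the limit and the settings call above, the sketch below checks a page against the window and builds (without sending) the same PUT request. The helper names are hypothetical, and the Content-Type header is an assumption that newer Elasticsearch versions need even though the curl command above omits it.

```python
import json
import urllib.request

DEFAULT_MAX_RESULT_WINDOW = 10000  # default quoted in the error message above


def fits_result_window(from_, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """True when from + size stays within index.max_result_window."""
    return from_ + size <= max_result_window


def build_window_update(base_url, index, max_result_window):
    """Build the PUT <index>/_settings request; the caller decides when to send it."""
    body = json.dumps({"index": {"max_result_window": max_result_window}})
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
    )


print(fits_result_window(10000, 2))   # the failing request above: False
print(fits_result_window(9990, 10))   # exactly at the limit: True

request = build_window_update("http://127.0.0.1:9200", "test-index", 10000000)
print(request.get_method(), request.full_url)
# To actually apply the setting: urllib.request.urlopen(request)
```

Raising the window only removes the error. Every shard still has to collect from + size hits per request, so the cost of deep pages stays the same.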

2.scroll-scan

2.1scroll

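Elasticsearch also provides a more efficient way to page through results, scroll-scan: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
scroll resembles a traditional database cursor; I think of it as the counterpart of JDBC's ResultSet. It is the tool Elasticsearch offers for retrieving large result sets, but it is not meant for real-time user queries.
Note: the results returned by a scroll request reflect the state of the index at the time the initial search was made, like a snapshot. Later changes to documents (index, update, or delete) only affect later search requests.
To use scroll, you first need to obtain a _scroll_id, and you can specify how long that _scroll_id is kept alive (scroll=1m below). Note: the length of the _scroll_id depends on the number of shards: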

$ curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?scroll=1m' -d '{"query":{"match_all":{}}}'
{"_scroll_id":"cXVlcnlBbmRGZXRjaDsxOzYzOmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MDs=","took":37,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2068128,"max_score":1.0,"hits":[{"_index":"test-index","_type":"test-index","_id":"10781773","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1947331466223","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2069595890757","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1544758231677","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2069367750853","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1947537248223","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2069233668997","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"8572333","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1672543738239","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2419961040135","_score":1.0}]}}

Then use the _scroll_id obtained above to fetch the next batch of results:

$ curl -XGET 'http://127.0.0.1:9200/_search/scroll?scroll=1m' -d 'cXVlcnlBbmRGZXRjaDsxOzYzOmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MDs='
{"_scroll_id":"cXVlcnlBbmRGZXRjaDsxOzYzOmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MDs=","took":20,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2068128,"max_score":1.0,"hits":[{"_index":"test-index","_type":"test-index","_id":"2068822106901","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2467124554218","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"14649373","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"6412093","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1493598471617","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1477453","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1949882913855","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2419962095719","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"1484893","_score":1.0},{"_index":"test-index","_type":"test-index","_id":"2065404665444","_score":1.0}]}}

2.2scroll-scan

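scroll keeps track of which results have already been returned, so it can return sorted results more efficiently than deep from-size paging. Sorting by the default order still has a cost, however.
Often you only want the matching documents and do not care about their order. Combining scroll with scan turns off scoring and sorting and returns results in the most efficient way. All you need to do is add search_type=scan to the query string (search_type=scan was deprecated in Elasticsearch 2.1 and removed in 5.0, so this applies to older clusters):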

curl -XGET 'http://127.0.0.1:9200/test-index/test-type/_search?pretty=true&search_type=scan&scroll=10m&size=10' -d '{"query":{"match_all":{}}}'
{"_scroll_id":"c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7","took":9,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2068128,"max_score":0.0,"hits":[]}}

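As before, use the returned _scroll_id to fetch the next batch of results. Note that the initial scan response above carries no hits; the documents start arriving with the first scroll request.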

curl -XGET 'http://127.0.0.1:9200/_search/scroll?scroll=10m' -d 'c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7'
{"_scroll_id":"c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7","took":12,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2068128,"max_score":0.0,"hits":[{"_index":"test-index","_type":"test-type","_id":"10781773","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"1947331466223","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"2069595890757","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"1544758231677","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"2069367750853","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"1947537248223","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"2069233668997","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"8572333","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"1672543738239","_score":0.0},{"_index":"test-index","_type":"test-type","_id":"2419961040135","_score":0.0}]}}
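
On Elasticsearch versions that no longer accept search_type=scan, the documented way to get the same unsorted, unscored behaviour is a normal scroll sorted by _doc. A minimal sketch of building that initial request follows. The helper name build_doc_order_scroll is hypothetical, and the index name and keep-alive are copied from the example above.

```python
import json


def build_doc_order_scroll(index, query, size, keep_alive="10m"):
    """Return (path, body) for an initial scroll search sorted by _doc."""
    path = f"{index}/_search?scroll={keep_alive}"
    # Sorting by _doc asks for index order, which skips scoring and sorting work.
    body = json.dumps({"size": size, "query": query, "sort": ["_doc"]})
    return path, body


path, body = build_doc_order_scroll("test-index", {"match_all": {}}, size=10)
print(path)
print(body)
```

Unlike scan, a scroll sorted by _doc already returns hits in the first response, and size counts hits per batch rather than per shard.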

Note: during the query phase, each shard keeps the matching IDs in memory until the timeout expires.
As you can see, a scroll-scan query runs in two phases:

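Step 1: run a query, which returns a _scroll_id;

Step 2: scroll through the documents. Repeat this step, each time picking up the new _scroll_id and fetching the next batch of documents.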

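Notes:

The initial search request and every subsequent scroll request return a _scroll_id. It may or may not differ from the previous one; either way, only the most recently received _scroll_id should be used;

When scrolling over a large data set, use scan; otherwise the results may contain duplicates;

With scan, size applies to each shard. With 10 shards and size=5, each batch returns up to 50 documents.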
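
The two steps and the "latest _scroll_id" rule can be put together in a short loop. The sketch below is not tied to any client library: send(path, body) is a hypothetical transport that performs the HTTP call and returns the parsed JSON response, and make_fake_cluster is an in-memory stand-in that exists only to exercise the loop. Passing the raw scroll id as the request body mirrors the curl commands in this article.

```python
def scroll_all(send, first_path, first_body, keep_alive="1m"):
    """Yield every hit of a scroll search, always reusing the newest _scroll_id."""
    response = send(first_path, first_body)
    yield from response["hits"]["hits"]  # empty when search_type=scan is used
    while True:
        # Step 2: pass the most recently received _scroll_id back to the cluster.
        response = send(f"_search/scroll?scroll={keep_alive}", response["_scroll_id"])
        hits = response["hits"]["hits"]
        if not hits:  # an empty batch from a scroll request means we are done
            break
        yield from hits


def make_fake_cluster(doc_ids, batch):
    """In-memory stand-in for Elasticsearch that rejects stale scroll ids."""
    state = {"pos": 0, "issued": 0}

    def send(path, body):
        if path.startswith("_search/scroll"):
            assert body == f"id-{state['issued']}", "stale _scroll_id"
            chunk = doc_ids[state["pos"]:state["pos"] + batch]
            state["pos"] += len(chunk)
        else:
            chunk = []  # scan-style initial response: a scroll id but no hits
        state["issued"] += 1
        return {
            "_scroll_id": f"id-{state['issued']}",
            "hits": {"total": len(doc_ids), "hits": [{"_id": d} for d in chunk]},
        }

    return send


send = make_fake_cluster([str(i) for i in range(25)], batch=10)
ids = [hit["_id"] for hit in scroll_all(
    send,
    "test-index/_search?search_type=scan&scroll=1m",
    '{"query":{"match_all":{}}}',
)]
print(len(ids))  # prints 25
```

Because the first response is yielded before the loop starts, the same function works both for scan (empty first batch) and for a plain scroll (hits in the first batch).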

2.3 Clearing scroll ids

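A scroll id is cleared automatically when its timeout expires. However, if scroll ids are kept alive for a long time, there are many shards (the more shards, the longer the scroll id), and many scrolls are open, they take up a noticeable amount of extra memory. Elasticsearch therefore also lets you clear a scroll id explicitly: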

curl -XDELETE "http://127.0.0.1:9200/_search/scroll" -d '{
    "scroll_id" : ["c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7"]
}'

curl -XDELETE "http://127.0.0.1:9200/_search/scroll" -d 'c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7'

Several scroll ids can be deleted together by passing them in an array:

curl -XDELETE "http://127.0.0.1:9200/_search/scroll" -d '{  
    "scroll_id" : ["c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7","c2NhbjsxOzY0OmRVRUN4TG5TUXYyaXRIT0Q5SUxiR0E7MTt0b3RhbF9oaXRzOjIwNjgxMjg7"]  
}'

And of course, all scroll ids can be deleted at once without having to remember any of them:

curl -XDELETE "http://127.0.0.1:9200/_search/scroll/_all"
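
The clean-up calls above can be wrapped in one helper. This is a sketch under assumptions: the function name build_clear_scroll and the clear_all flag are invented for the example, the Content-Type header is added for newer Elasticsearch versions, and the request is only built, not sent. Requiring an explicit flag for _all is a deliberate choice so that an empty id list cannot wipe every open scroll by accident.

```python
import json
import urllib.request


def build_clear_scroll(base_url, scroll_ids=(), clear_all=False):
    """Build a DELETE request that clears the given scroll ids, or all of them."""
    if clear_all:
        return urllib.request.Request(f"{base_url}/_search/scroll/_all", method="DELETE")
    if not scroll_ids:
        raise ValueError("pass at least one scroll id or clear_all=True")
    body = json.dumps({"scroll_id": list(scroll_ids)})
    return urllib.request.Request(
        f"{base_url}/_search/scroll",
        data=body.encode("utf-8"),
        method="DELETE",
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
    )


some = build_clear_scroll("http://127.0.0.1:9200", ["id-a", "id-b"])
everything = build_clear_scroll("http://127.0.0.1:9200", clear_all=True)
print(some.get_method(), some.full_url, some.data.decode("utf-8"))
print(everything.get_method(), everything.full_url)
# To actually send one of them: urllib.request.urlopen(some)
```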

