Elasticsearch bucket aggregations: the rare terms aggregation

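1. Background
2. Requirement
3. Preparation
- 3.1 Prepare the mapping
- 3.2 Prepare the data
4. Implementing the requirement
- 4.1 DSL
- 4.2 Java code
- 4.3 Result
5. max_doc_count and search.max_buckets
6. Notes
7. Complete code
8. Reference documentation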

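1. Background

When a terms aggregation has its default order changed to _count asc, the counts it returns are inaccurate. The official documentation advises against doing this and recommends the rare terms aggregation instead. Rare terms is an aggregation over terms that occur rarely, and it goes some way toward solving the ascending-order problem.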
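To make the inaccuracy concrete, here is a minimal Java sketch of how an ascending terms aggregation can go wrong across shards. It is a simplified model, not Elasticsearch code: the class `AscendingTermsSketch`, the two shard maps, and the shard size of 1 are all assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical model of a terms aggregation ordered by _count asc (not Elasticsearch code).
class AscendingTermsSketch {

    // Keep only the `limit` entries with the smallest counts, ties broken by key.
    static Map<String, Long> smallest(Map<String, Long> counts, int limit) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        Long::sum, LinkedHashMap::new));
    }

    // The coordinating node only sees the truncated per-shard lists and sums them.
    static Map<String, Long> merge(List<Map<String, Long>> partials, int size) {
        Map<String, Long> summed = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((term, count) -> summed.merge(term, count, Long::sum));
        }
        return smallest(summed, size);
    }

    public static void main(String[] args) {
        // Assumed data: "x" is rare on shard A but very common on shard B.
        Map<String, Long> shardA = Map.of("x", 1L, "y", 5L);
        Map<String, Long> shardB = Map.of("x", 50L, "z", 2L);

        Map<String, Long> reported = merge(
                List.of(smallest(shardA, 1), smallest(shardB, 1)), 1);
        System.out.println(reported); // prints {x=1}
    }
}
```

In this model the reported rarest term is x with a count of 1, although x really occurs 51 times and the rarest term overall is z with 2. Shard B never sends its count for x because x is not among its smallest terms, and the coordinating node has no way to detect that.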

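2. Requirement

Aggregate on the province field, keeping only terms that contain 上 or 湖 and that appear in at most 2 documents, and return the resulting buckets. (The DSL in 4.1 additionally matches terms containing 江 and then excludes 江苏, to demonstrate the exclude parameter.)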

3. Preparation

3.1 Prepare the mapping

PUT /index_person
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "name": {
        "type": "keyword"
      },
      "province": {
        "type": "keyword"
      },
      "sex": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "pipeline_province_sex": {
        "type": "keyword"
      },
      "address": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

3.2 Prepare the data

PUT /_bulk
{"create":{"_index":"index_person","_id":1}}
{"id":1,"name":"张三","sex":"男","age":20,"province":"湖北","address":"湖北省黄冈市罗田县匡河镇"}
{"create":{"_index":"index_person","_id":2}}
{"id":2,"name":"李四","sex":"男","age":19,"province":"江苏","address":"江苏省南京市"}
{"create":{"_index":"index_person","_id":3}}
{"id":3,"name":"王武","sex":"女","age":25,"province":"湖北","address":"湖北省武汉市江汉区"}
{"create":{"_index":"index_person","_id":4}}
{"id":4,"name":"赵六","sex":"女","age":30,"province":"北京","address":"北京市东城区"}
{"create":{"_index":"index_person","_id":5}}
{"id":5,"name":"钱七","sex":"女","age":16,"province":"北京","address":"北京市西城区"}
{"create":{"_index":"index_person","_id":6}}
{"id":6,"name":"王八","sex":"女","age":45,"province":"北京","address":"北京市朝阳区"}
{"create":{"_index":"index_person","_id":7}}
{"id":7,"name":"九哥","sex":"男","age":25,"province":"上海市","address":"上海市嘉定区"}

4. Implementing the requirement

4.1 DSL

GET /index_person/_search
{
  "size": 0,
  "aggs": {
    "agg_province": {
      "rare_terms": {
        "field": "province",
        "max_doc_count": 2,
        "precision": 0.01,
        "include": "(.*上.*|.*湖.*|.*江.*)",
        "exclude": ["江苏"],
        "missing": "default省"
      }
    }
  }
}
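As a rough check of what this request should select, the following Java sketch applies the same include, exclude, and max_doc_count rules to the seven province values from section 3.2, entirely in memory. It is an exact count under stated assumptions, whereas Elasticsearch uses an approximate CuckooFilter internally; the class `RareTermsSketch` and its method are hypothetical names for illustration. The missing parameter plays no role here because every sample document has a province.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical in-memory model of the rare_terms filters (not Elasticsearch code).
class RareTermsSketch {

    static Map<String, Long> rareTerms(List<String> values, Pattern include,
                                       Set<String> exclude, long maxDocCount) {
        Map<String, Long> counts = new TreeMap<>();
        for (String value : values) {
            // The include regex must match the whole term; exclude lists exact terms.
            if (include.matcher(value).matches() && !exclude.contains(value)) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        // Drop every term that appears in more than maxDocCount documents.
        counts.values().removeIf(count -> count > maxDocCount);
        return counts;
    }

    public static void main(String[] args) {
        // The province values of the seven documents from section 3.2.
        List<String> provinces = List.of("湖北", "江苏", "湖北", "北京", "北京", "北京", "上海市");

        Map<String, Long> buckets = rareTerms(
                provinces,
                Pattern.compile("(.*上.*|.*湖.*|.*江.*)"),
                Set.of("江苏"),
                2L);
        buckets.forEach((term, count) -> System.out.println(term + " -> " + count));
    }
}
```

Under this model two buckets remain: 上海市 with 1 document and 湖北 with 2. 北京 never matches the include pattern, and 江苏 matches it but is removed by exclude.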

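4.2 Java code

import co.elastic.clients.elasticsearch.core.SearchRequest;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.util.Collections;

// `client` is assumed to be an ElasticsearchClient field of the enclosing test class.
@Test
@DisplayName("Rare terms aggregation: similar to a terms aggregation ordered by _count asc, but that ordering is inaccurate, so rare terms should be used instead")
public void agg01() throws IOException {
    SearchRequest searchRequest = new SearchRequest.Builder()
            .size(0)
            .index("index_person")
            .aggregations("agg_province", agg ->
                    agg.rareTerms(rare ->
                            // Field whose rare terms are collected
                            rare.field("province")
                                    // Maximum number of documents a term may appear in; the upper limit is 100.
                                    // Raising search.max_buckets was tried as a way to lift this limit and had no effect.
                                    // In this example every term with a doc count <= 2 is returned.
                                    .maxDocCount(2L)
                                    // Precision of the internal cuckoo filter: a smaller value is more accurate
                                    // but uses more memory. Minimum 0.00001, default 0.001.
                                    .precision(0.01)
                                    // Terms to include; a single string is treated as a regular expression
                                    .include(include -> include.regexp("(.*上.*|.*湖.*|.*江.*)"))
                                    // Terms to exclude; a list must contain exact term values
                                    .exclude(exclude -> exclude.terms(Collections.singletonList("江苏")))
                                    // Value used for documents that have no province field
                                    .missing("default省")
                    )
            )
            .build();
    System.out.println(searchRequest);
    SearchResponse<Object> response = client.search(searchRequest, Object.class);
    System.out.println(response);
}

The points to watch are all noted in the comments.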

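4.3 Result

5. max_doc_count and search.max_buckets

6. Notes

The rare terms aggregation has no size limit on the buckets it returns; the result is bounded only by max_doc_count. For example, if 60 terms satisfy max_doc_count, all 60 buckets are returned.
The maximum value of max_doc_count is 100, and it does not appear to be configurable.
If a single node collects too many buckets, it can easily exceed search.max_buckets, in which case that setting has to be raised.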

# Temporary change (transient)
PUT /_cluster/settings
{"transient": {"search.max_buckets": 65536}}

# Permanent change (persistent)
PUT /_cluster/settings
{"persistent": {"search.max_buckets": 65536}}

7. Complete code

https://gitee.com/huan1993/spring-cloud-parent/blob/master/es/es8-api/src/main/java/com/huan/es8/aggregations/bucket/RareTermsAggs.java

8. Reference documentation

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-rare-terms-aggregation.html
