HTML Strip Character Filter
原文链接 : https://www.elastic.co/guide/en/elasticsearch/reference/5.3/analysis-htmlstrip-charfilter.html
译文链接 :http://www.apache.wiki/display/Elasticsearch/HTML+Strip+Character+Filter
HTML Strip Character Filter 会删除文本中的HTML元素,并且将HTML实体替换成对应的解码值(例如用&替换&)。
案例
POST _analyze{"tokenizer": "keyword","char_filter": [ "html_strip" ],"text": "<p>I'm so <b>happy</b>!</p>"}
keyword分词器只会返回一个词元(term)。
上面的案例将会返回如下的词元(term):
[ \nI'm so happy!\n ]
同样的案例,如果使用标准分词器(standar tokenizer)将会返回的词元(term)如下:
[ I'm, so, happy ]
配置
HTML Strip Character Filter 接收如下参数:
| 参数名称 | 说明 |
|---|---|
escaped_tags |
会原始文本中保留的一系列标签 |
配置案例:
PUT my_index{"settings": {"analysis": {"analyzer": {"my_analyzer": {"tokenizer": "keyword","char_filter": ["my_char_filter"]}},"char_filter": {"my_char_filter": {"type": "html_strip","escaped_tags": ["b"]}}}}}POST my_index/_analyze{"analyzer": "my_analyzer","text": "<p>I'm so <b>happy</b>!</p>"}
上面案例将返回如下结果:
[ \nI'm so <b>happy</b>!\n ]
