Path hierarchy tokenizer

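The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits it on the path separator, and emits a term for each component in the tree. Under the hood, the path_hierarchy tokenizer uses Lucene's PathHierarchyTokenizer.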

Example output

response = client.indices.analyze(
  body: {
    tokenizer: 'path_hierarchy',
    text: '/one/two/three'
  }
)
puts response
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}

The above text produces the following terms:

[ /one, /one/two, /one/two/three ]
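
The term list above can be reproduced without a cluster. The following is a minimal sketch of the forward behaviour, assuming a single-character delimiter; `forward_path_terms` is a hypothetical helper written for illustration and is not part of Elasticsearch, Lucene, or the Ruby client.

```ruby
# Simplified model of forward path_hierarchy tokenization.
# Illustration only; this is not the Lucene PathHierarchyTokenizer code.
def forward_path_terms(text, delimiter = '/')
  terms = []
  text.each_char.with_index do |char, index|
    # Every delimiter after the first character closes one more prefix.
    terms << text[0...index] if char == delimiter && index.positive?
  end
  # The complete input is always the final, longest term.
  terms << text unless text.empty?
  terms
end

p forward_path_terms('/one/two/three')
# => ["/one", "/one/two", "/one/two/three"]
```

Each term is a prefix of the input that ends at a component boundary, which is why a lookup for an exact directory prefix can later match every path beneath it.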

Configuration

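The path_hierarchy tokenizer accepts the following parameters:

delimiter

The character to use as the path separator. Defaults to /.

replacement

An optional replacement character to use for the delimiter. Defaults to the value of delimiter.

buffer_size

The number of characters read into the term buffer in a single pass. Defaults to 1024. The term buffer grows by this size until all of the text has been consumed. Changing this setting is not recommended.

reverse

If true, uses Lucene's ReversePathHierarchyTokenizer, which is suitable for domain-like hierarchies. Defaults to false.

skip

The number of initial terms to skip. Defaults to 0.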

Example configuration

In this example, we configure the path_hierarchy tokenizer to split on - characters and to replace them with /. The first two terms are skipped:

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'my_tokenizer'
          }
        },
        tokenizer: {
          my_tokenizer: {
            type: 'path_hierarchy',
            delimiter: '-',
            replacement: '/',
            skip: 2
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_analyzer',
    text: 'one-two-three-four-five'
  }
)
puts response
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}

The above example produces the following terms:

[ /three, /three/four, /three/four/five ]

If we set reverse to true and leave the other settings unchanged, the same text produces the following:

[ one/two/three/, two/three/, three/ ]
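
How delimiter, replacement, skip, and reverse interact can be sketched in plain Ruby. This is a simplified model that reproduces the two term lists above, not the Lucene implementation: `path_hierarchy_terms` and its keyword arguments are hypothetical names chosen to mirror the tokenizer settings, a single-character delimiter is assumed, and buffer_size is not modelled.

```ruby
# Simplified model of the delimiter, replacement, skip and reverse parameters.
# Illustration only; this is not the Lucene tokenizer code.
def path_hierarchy_terms(text, delimiter: '/', replacement: nil, skip: 0, reverse: false)
  replacement ||= delimiter

  # Cut the text into parts. Forward mode keeps each delimiter at the start of
  # the part that follows it; reverse mode keeps it at the end of the part before it.
  parts = [''.dup]
  text.each_char do |char|
    if char != delimiter
      parts.last << char
    elsif reverse
      parts.last << char
      parts << ''.dup
    else
      parts << char.dup
    end
  end
  parts.reject!(&:empty?)

  # skip drops parts from the front (forward) or from the back (reverse).
  kept = reverse ? parts.take([parts.size - skip, 0].max) : parts.drop(skip)
  kept = kept.map { |part| part.gsub(delimiter, replacement) }

  if reverse
    # Suffixes, longest first.
    (0...kept.size).map { |start| kept.drop(start).join }
  else
    # Prefixes, shortest first.
    (1..kept.size).map { |count| kept.take(count).join }
  end
end

text = 'one-two-three-four-five'
p path_hierarchy_terms(text, delimiter: '-', replacement: '/', skip: 2)
# => ["/three", "/three/four", "/three/four/five"]
p path_hierarchy_terms(text, delimiter: '-', replacement: '/', skip: 2, reverse: true)
# => ["one/two/three/", "two/three/", "three/"]
```

In this model, skip counts from the start of the text in forward mode and from the end in reverse mode, which is why the reversed terms keep one, two, and three and end with the replaced delimiter.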

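Detailed examples

A common use case for the path_hierarchy tokenizer is filtering results by file path. If file paths are indexed along with the data, analyzing them with the path_hierarchy tokenizer lets you filter results by different parts of the path string.

This example configures an index with two custom analyzers and applies them to multi-fields of the file_path text field, which stores filenames. One of the two analyzers uses reverse tokenization. Some sample documents are then indexed to represent file paths for photos in the photo folders of two different users.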

response = client.indices.create(
  index: 'file-path-test',
  body: {
    settings: {
      analysis: {
        analyzer: {
          custom_path_tree: {
            tokenizer: 'custom_hierarchy'
          },
          custom_path_tree_reversed: {
            tokenizer: 'custom_hierarchy_reversed'
          }
        },
        tokenizer: {
          custom_hierarchy: {
            type: 'path_hierarchy',
            delimiter: '/'
          },
          custom_hierarchy_reversed: {
            type: 'path_hierarchy',
            delimiter: '/',
            reverse: 'true'
          }
        }
      }
    },
    mappings: {
      properties: {
        file_path: {
          type: 'text',
          fields: {
            tree: {
              type: 'text',
              analyzer: 'custom_path_tree'
            },
            tree_reversed: {
              type: 'text',
              analyzer: 'custom_path_tree_reversed'
            }
          }
        }
      }
    }
  }
)
puts response

response = client.index(
  index: 'file-path-test',
  id: 1,
  body: {
    file_path: '/User/alice/photos/2017/05/16/my_photo1.jpg'
  }
)
puts response

response = client.index(
  index: 'file-path-test',
  id: 2,
  body: {
    file_path: '/User/alice/photos/2017/05/16/my_photo2.jpg'
  }
)
puts response

response = client.index(
  index: 'file-path-test',
  id: 3,
  body: {
    file_path: '/User/alice/photos/2017/05/16/my_photo3.jpg'
  }
)
puts response

response = client.index(
  index: 'file-path-test',
  id: 4,
  body: {
    file_path: '/User/alice/photos/2017/05/15/my_photo1.jpg'
  }
)
puts response

response = client.index(
  index: 'file-path-test',
  id: 5,
  body: {
    file_path: '/User/bob/photos/2017/05/16/my_photo1.jpg'
  }
)
puts response
PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}

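A search for a particular file path string against the text field matches all of the example documents. Bob's document ranks highest because bob is also one of the terms created by the standard analyzer, which boosts the relevance of Bob's document.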

response = client.search(
  index: 'file-path-test',
  body: {
    query: {
      match: {
        file_path: '/User/bob/photos/2017/05'
      }
    }
  }
)
puts response
GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}

The file_path.tree field makes it simple to match or filter documents whose file paths lie within a particular directory.

response = client.search(
  index: 'file-path-test',
  body: {
    query: {
      term: {
        'file_path.tree' => '/User/alice/photos/2017/05/16'
      }
    }
  }
)
puts response
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}

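With the reverse parameter for this tokenizer, it is also possible to match from the other end of the file path, such as an individual file name or a deeply nested subdirectory. The following example searches for all files named my_photo1.jpg in any directory, using the file_path.tree_reversed field, which the mapping configures to use the reverse parameter.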

response = client.search(
  index: 'file-path-test',
  body: {
    query: {
      term: {
        'file_path.tree_reversed' => {
          value: 'my_photo1.jpg'
        }
      }
    }
  }
)
puts response
GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}

Analyzing the same file path value with both the forward and the reversed analyzer shows the terms that each one creates.

response = client.indices.analyze(
  index: 'file-path-test',
  body: {
    analyzer: 'custom_path_tree',
    text: '/User/alice/photos/2017/05/16/my_photo1.jpg'
  }
)
puts response

response = client.indices.analyze(
  index: 'file-path-test',
  body: {
    analyzer: 'custom_path_tree_reversed',
    text: '/User/alice/photos/2017/05/16/my_photo1.jpg'
  }
)
puts response
POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}
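
The two term lists, and the reason the earlier term queries match, can be approximated locally. The sketch below is an assumption-laden model rather than a description of Elasticsearch internals: `forward_terms`, `reverse_terms`, `matching_ids`, and `FILE_PATHS` are hypothetical names, and a term query is modelled as an exact lookup in each document's term list.

```ruby
# Simplified stand-ins for the custom_path_tree and custom_path_tree_reversed
# analyzers (delimiter "/"). Illustration only; not the Lucene implementation.
def forward_terms(path, delimiter = '/')
  terms = []
  path.each_char.with_index do |char, index|
    # Each delimiter after the first character closes one more prefix.
    terms << path[0...index] if char == delimiter && index.positive?
  end
  terms << path
end

def reverse_terms(path, delimiter = '/')
  terms = [path]
  path.each_char.with_index do |char, index|
    # Each delimiter starts a shorter suffix that begins right after it.
    terms << path[(index + 1)..-1] if char == delimiter && index < path.length - 1
  end
  terms
end

# The five sample documents indexed above, keyed by document ID.
FILE_PATHS = {
  1 => '/User/alice/photos/2017/05/16/my_photo1.jpg',
  2 => '/User/alice/photos/2017/05/16/my_photo2.jpg',
  3 => '/User/alice/photos/2017/05/16/my_photo3.jpg',
  4 => '/User/alice/photos/2017/05/15/my_photo1.jpg',
  5 => '/User/bob/photos/2017/05/16/my_photo1.jpg'
}.freeze

# Model of a term query: a document matches when the exact term is in its term list.
def matching_ids(term, tokenizer)
  FILE_PATHS.select { |_id, path| tokenizer.call(path).include?(term) }.keys
end

puts forward_terms(FILE_PATHS[1])
puts reverse_terms(FILE_PATHS[1])
p matching_ids('/User/alice/photos/2017/05/16', method(:forward_terms)) # => [1, 2, 3]
p matching_ids('my_photo1.jpg', method(:reverse_terms))                 # => [1, 4, 5]
```

Forward terms are directory prefixes, so an exact directory path selects everything beneath it; reversed terms are suffixes, so a bare file name selects that file in any directory.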

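Filtering by file path is also useful in combination with other types of searches. For example, this query looks for any file paths that contain 16 and that must also be in Alice's photo directory.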

response = client.search(
  index: 'file-path-test',
  body: {
    query: {
      bool: {
        must: {
          match: {
            file_path: '16'
          }
        },
        filter: {
          term: {
            'file_path.tree' => '/User/alice'
          }
        }
      }
    }
  }
)
puts response
GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}