Vector storage in Elasticsearch


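Summary

Elasticsearch 7 and later can store vectors in a field and run distance-based searches over them. This post records how much disk space storing those vectors takes.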

1. Mapping vector field types in Elasticsearch

1.1 dense_vector

A dense_vector field stores dense vectors of float values. The maximum number of dimensions that can be in a vector should not exceed 2048. A dense_vector field is a single-valued field.

  • Mapping example

    PUT my-index-000001
    {
      "mappings": {
        "properties": {
          "my_vector": {
            "type": "dense_vector",
            "dims": 3  
          },
          "my_text" : {
            "type" : "keyword"
          }
        }
      }
    }
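As a small companion to the mapping above, here is a sketch that builds the same request body in Python and checks that a vector matches the declared `dims` before it is sent. The helper names `dense_vector_mapping` and `check_vector` are made up for this illustration and are not part of any Elasticsearch client; the 2048 limit is the one quoted in the text above.

```python
import json

MAX_DENSE_DIMS = 2048  # limit quoted in the text above


def dense_vector_mapping(dims, vector_field="my_vector", text_field="my_text"):
    """Build the body of the `PUT <index>` request shown above (hypothetical helper)."""
    if not 1 <= dims <= MAX_DENSE_DIMS:
        raise ValueError(f"dims must be between 1 and {MAX_DENSE_DIMS}, got {dims}")
    return {
        "mappings": {
            "properties": {
                vector_field: {"type": "dense_vector", "dims": dims},
                text_field: {"type": "keyword"},
            }
        }
    }


def check_vector(vector, dims):
    """Return True if `vector` has exactly `dims` numeric components."""
    return len(vector) == dims and all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector
    )


if __name__ == "__main__":
    # Same shape as the console example: a 3-dimensional vector plus a keyword field.
    print(json.dumps(dense_vector_mapping(3), indent=2))
    print(check_vector([0.5, 10, 6], 3))
```

Checking the length on the client side is only a convenience; it gives an earlier and clearer error than waiting for the index request to be rejected.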

1.2 sparse_vector

A sparse_vector field stores sparse vectors of float values. The maximum number of dimensions that can be in a vector should not exceed 1024. The number of dimensions can be different across documents. A sparse_vector field is a single-valued field.

  • Mapping example

    PUT my-index-000001
    {
      "mappings": {
        "properties": {
          "my_vector": {
            "type": "sparse_vector"
          },
          "my_text" : {
            "type" : "keyword"
          }
        }
      }
    }
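  • The official documentation says the vector elements are floats, but I could not find any option for choosing between float64 and float32. The only setting I found is the vector dimension (dims).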
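To put numbers on the float32 versus float64 question, the sketch below computes the raw payload of a vector under each width using only Python's `struct` module. It says nothing about which encoding Elasticsearch uses internally; it only shows what each choice would cost for the raw numbers, and what a value looks like after a round trip through 32 bits. The function name `raw_vector_bytes` is invented for this example.

```python
import struct


def raw_vector_bytes(dims, fmt_char):
    """Size in bytes of `dims` packed values; 'f' is float32, 'd' is float64."""
    # "<" selects standard sizes with no padding: 4 bytes for 'f', 8 for 'd'.
    return struct.calcsize(f"<{dims}{fmt_char}")


def roundtrip_float32(x):
    """Store a Python float (64-bit) as float32 and read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    print(raw_vector_bytes(50, "f"))  # 200 bytes for a 50-dim float32 vector
    print(raw_vector_bytes(50, "d"))  # 400 bytes for a 50-dim float64 vector
    # 0.1 is not exactly representable, and float32 keeps fewer digits of it.
    print(roundtrip_float32(0.1) == 0.1)  # False
```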

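2. Disk space used by vectors stored in Elasticsearch

Taking a 50-dimensional dense_vector as the example:

  1. Create four indices, named ict_1num_idf, ict_1num_idf_and_vec, ict_7num_idf, and ict_7num_idf_and_vec.
  2. Insert 6,000 documents from the ict corpus into each of the four indices: the first index computes one set of inverse-frequency (IDF) statistics, the second one set plus a sentence vector, the third seven sets, and the fourth seven sets plus a sentence vector. Disk usage is as follows: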

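Conclusions:

  1. Subtracting the first index's size from the second's, and the third's from the fourth's, shows that inserting the vectors takes about 6 MB of extra space. Ignoring other factors and scaling proportionally, 6 MB divided by 6,000 documents means one 50-dimensional float vector takes about 1 KB. (This may include a small one-off cost for creating the field.)

  2. Elasticsearch is written in Java, so I looked it up: a Java float takes 4 B, so a vector of 50 floats should take 200 B. I have not yet worked out how 200 B turns into 1 KB.

  3. Intuitively, a few extra MB of disk space should not matter in a product.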
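The arithmetic behind conclusions 1 and 2 can be written out as a short sketch. Two things in it are assumptions rather than measurements: "6 MB" is read as 6 MiB, and the last part tests only one possible contributor to the gap that I have not verified, namely that Elasticsearch by default also keeps the original JSON document in `_source`, where a vector is text rather than packed floats. The on-disk `_source` is compressed, so the text size below is an upper-bound illustration, not a prediction.

```python
import json
import random

DIMS = 50
N_DOCS = 6000
EXTRA_BYTES = 6 * 1024 * 1024  # "6 MB" from the text, read as 6 MiB (assumption)


def per_doc_overhead_bytes(extra_bytes, n_docs):
    """Extra disk space attributed to each document's vector."""
    return extra_bytes / n_docs


measured = per_doc_overhead_bytes(EXTRA_BYTES, N_DOCS)  # about 1049 B per document
raw = DIMS * 4                                          # 200 B: 50 floats at 4 B each
ratio = measured / raw                                  # roughly 5x the raw payload

# Hypothetical stand-in for a sentence vector: 50 random values in [-1, 1].
random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(DIMS)]

# Size of the same vector when written as JSON text, as it would appear
# in the body of an index request.
json_bytes = len(json.dumps(vector).encode("utf-8"))

if __name__ == "__main__":
    print(f"measured per document: {measured:.1f} B")
    print(f"raw float payload:     {raw} B")
    print(f"ratio:                 {ratio:.2f}")
    print(f"JSON text size:        {json_bytes} B")
```

If the JSON text size comes out close to the measured per-document figure, that points at `_source` as a candidate worth checking; it does not prove it.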

3. Appendix

How several analyzers are attached to one field to compute several sets of inverse-frequency statistics:

"mappings": {
          "properties": {
              "sentence": {
                  "type": "keyword",
                  "fields": {
                      "stop":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "stopwords_analyzer",
                          "search_analyzer": "stopwords_analyzer"
                      },
                      "ngram": {
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "ngram_analyzer",
                          "search_analyzer": "ngram_analyzer"
                      },
                      "lowercase":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "lowercase",
                          "search_analyzer": "lowercase"
                      },
                      "edge_ngram":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "edge_ngram_analyzer",
                          "search_analyzer": "edge_ngram_analyzer"
                      },
                      "edge_ngram_stop":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "edge_ngram_analyzer_stop",
                          "search_analyzer": "edge_ngram_analyzer_stop"
                      },
                      "ik":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "ik_max_word",
                          "search_analyzer": "ik_smart"
                      },
                      "ik_stop":{
                          "type": "text",
                          "similarity" : "bm25",
                          "analyzer": "ik_max_word_stop",
                          "search_analyzer": "ik_smart_stop"
                      }

                  }
              },
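Since the seven sub-fields differ only in their name and analyzers, the `sentence` property can also be generated from a table. The sketch below does that in plain Python; `sentence_property` is a made-up helper, the analyzer names are copied from the fragment above, and the similarity name `bm25` is kept exactly as written there (I am assuming it is defined in the index settings alongside the custom analyzers, which are not shown).

```python
import json

# (sub-field name, index analyzer, search analyzer), copied from the mapping above.
SUBFIELDS = [
    ("stop", "stopwords_analyzer", "stopwords_analyzer"),
    ("ngram", "ngram_analyzer", "ngram_analyzer"),
    ("lowercase", "lowercase", "lowercase"),
    ("edge_ngram", "edge_ngram_analyzer", "edge_ngram_analyzer"),
    ("edge_ngram_stop", "edge_ngram_analyzer_stop", "edge_ngram_analyzer_stop"),
    ("ik", "ik_max_word", "ik_smart"),
    ("ik_stop", "ik_max_word_stop", "ik_smart_stop"),
]


def sentence_property(subfields, similarity="bm25"):
    """Build the `sentence` keyword field with one text sub-field per analyzer."""
    return {
        "type": "keyword",
        "fields": {
            name: {
                "type": "text",
                "similarity": similarity,
                "analyzer": analyzer,
                "search_analyzer": search_analyzer,
            }
            for name, analyzer, search_analyzer in subfields
        },
    }


if __name__ == "__main__":
    mapping = {"mappings": {"properties": {"sentence": sentence_property(SUBFIELDS)}}}
    print(json.dumps(mapping, indent=2))
```

Passing a shorter list produces a mapping with fewer sub-fields, which is one way the one-set and seven-set indices could be built from the same code.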

Author: Lowin Li
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Lowin Li when reposting.