[Azure Cognitive Search] Vector Search in Cognitive Search with the Python SDK (Part 1)

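In chatbot scenarios where ChatGPT's answers need to be restricted to specific content, Azure Cognitive Search can play the most critical role of all: the "knowledge base". The convenience of a PaaS service, combined with its later integration with Azure OpenAI, arguably makes it one of the Azure services that benefited most from ChatGPT in 2023.

Following the latest vector search update to Cognitive Search in October, this post walks through an implementation based on the GitHub sample code below and shares some notes along the way. It continues from the quick-start angle of the previous post: we again use the Python SDK to create a Cognitive Search Index, the vectors for the test data are generated by the Azure OpenAI Embedding API and stored in the Index together with the data, and finally we run vector search against the Index.

The complete code is available via the GitHub link at the end of this post!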

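Prerequisites

An Azure Cognitive Search service in any region and on any tier; remember to copy its key for later use

The official Python SDK for Cognitive Search: azure-search-documents, version 11.4.0b11

If you are not sure how to set up any of the above, start with the earlier post linked below, which explains how to create the required services and environment.

Test Data

We again borrow the official sample data hosted on GitHub below. It contains the names and short descriptions of 108 Azure services, stored in JSON format.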

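Creating the Index

We plan to generate vectors for both the title and content fields of the test data, so the Index created this time adds two fields, titleVector and contentVector, to store them.

First, connect to Cognitive Search with the following code; remember to replace endpoint and credential with your own.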

from azure.search.documents.indexes import SearchIndexClient
from azure.core.credentials import AzureKeyCredential

search_index_client = SearchIndexClient(
    endpoint="https://charlie-test.search.windows.net",
    credential=AzureKeyCredential("YOUR-SEARCH-API-KEY")  # Your Cognitive Search key
)

Creating the Index works essentially the same way as in the previous post: each field is configured through a Field class:

from azure.search.documents.indexes.models import SimpleField, SearchableField, SearchFieldDataType, SearchField
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="category", type=SearchFieldDataType.String, filterable=True),
    SearchField(name="titleVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile="HnswProfile"),
    SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile="HnswProfile"),
]

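The first four fields are much the same as in the earlier full-text search setup. The two new fields, titleVector and contentVector, deserve some explanation:

The type of a vector field must be set to type=SearchFieldDataType.Collection(SearchFieldDataType.Single), i.e. a vector of single-precision floats

vector_search_dimensions sets the dimensionality of the stored vectors; vectors produced by Azure OpenAI's text-embedding-ada-002 model have 1536 dimensions

vector_search_profile names the vector search profile, and therefore the algorithm, used to build the Index for that field; the name must match the configuration provided later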
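Since every stored vector should have exactly the length declared in vector_search_dimensions, it can help to check the data before uploading it. The following is only a minimal sketch: find_dimension_mismatches is a hypothetical helper, not part of the SDK, and the sample records are made-up placeholders rather than real embeddings.

```python
EXPECTED_DIMENSIONS = 1536  # dimension of text-embedding-ada-002 vectors


def find_dimension_mismatches(documents,
                              vector_fields=("titleVector", "contentVector"),
                              expected=EXPECTED_DIMENSIONS):
    """Return (document id, field name, actual length) for each wrongly sized vector."""
    mismatches = []
    for doc in documents:
        for field in vector_fields:
            vector = doc.get(field)
            # A missing vector is reported with length 0.
            actual = len(vector) if vector is not None else 0
            if actual != expected:
                mismatches.append((doc.get("id"), field, actual))
    return mismatches


# Placeholder records for illustration only (not real embeddings).
sample_docs = [
    {"id": "1", "titleVector": [0.0] * 1536, "contentVector": [0.0] * 1536},
    {"id": "2", "titleVector": [0.0] * 1536, "contentVector": [0.0] * 10},
]
print(find_dimension_mismatches(sample_docs))
```

An empty list means every record matches the declared dimension.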

Next we need a few configuration objects for vector search. Since they are a bit more involved, they are explained in three parts:

Algorithm

from azure.search.documents.indexes.models import (
    HnswVectorSearchAlgorithmConfiguration, VectorSearchAlgorithmKind, HnswParameters, ExhaustiveKnnVectorSearchAlgorithmConfiguration, ExhaustiveKnnParameters
)
algorithms=[
    HnswVectorSearchAlgorithmConfiguration(
        name="Hnsw",
        kind=VectorSearchAlgorithmKind.HNSW,
        parameters=HnswParameters(
            m=4,
            ef_construction=400,
            ef_search=500,
            metric="cosine",
        )
    ),
    ExhaustiveKnnVectorSearchAlgorithmConfiguration(
        name="ExhaustiveKnn",
        kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,
        parameters=ExhaustiveKnnParameters(
            metric="cosine",
        )
    )
]

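algorithms consists of one or more algorithm configurations. In this version, Cognitive Search offers two algorithms, HNSW and Exhaustive KNN:

HNSW: one of the ANN (approximate nearest neighbor) algorithms. It builds an additional data structure at indexing time to speed up search, aiming for a balance between accuracy and computational efficiency, and suits most situations.

Exhaustive KNN: essentially a brute-force algorithm that finds the true nearest vectors. It suits scenarios with smaller datasets where you are willing to trade search performance for high accuracy.

Each configuration includes the hyperparameters of its algorithm. The metric parameter specifies how distance is computed, and the most common setting is cosine similarity. The remaining parameters are tied to the theory behind each algorithm, so here we simply use them as given (definitely not because I don't understand them).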
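To make the "brute force" idea and the cosine metric concrete, here is a small pure-Python sketch. It only illustrates the concept and is not how the service implements it; cosine_similarity, exhaustive_knn, and the two-dimensional toy vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exhaustive_knn(query, vectors, k):
    """Score every stored vector against the query and return the k best keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


# Toy 2-dimensional vectors; real embeddings here would have 1536 dimensions.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
print(exhaustive_knn([1.0, 0.1], toy_vectors, k=2))
```

Every query compares against every stored vector, which is why this approach is exact but scales poorly; HNSW avoids the full scan by searching the extra structure it builds at indexing time.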

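Each algorithm has its strengths and weaknesses, mostly a trade-off between speed and accuracy. Both can be placed in the configuration at once and chosen between at search time, but only one can be chosen when building the Index. Two points to keep in mind when choosing:

The algorithms themselves are free of charge, but because HNSW builds an additional data structure, it takes up more storage, which may mean higher cost; see the Cognitive Search pricing model for details

If the Index is built with HNSW, either algorithm can be used at search time; if it is built with Exhaustive KNN, HNSW cannot be used at search time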
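For a rough sense of scale, the raw size of the vectors in this walkthrough can be estimated with simple arithmetic. This sketch counts only the float data itself (108 documents, two vector fields, 1536 single-precision values each); the extra HNSW structure and any service-side overhead are not included, and raw_vector_bytes is a hypothetical helper.

```python
BYTES_PER_SINGLE = 4  # a single-precision float occupies 4 bytes


def raw_vector_bytes(num_documents, vector_fields, dimensions):
    """Size of the raw vector data only, without any index overhead."""
    return num_documents * vector_fields * dimensions * BYTES_PER_SINGLE


size = raw_vector_bytes(num_documents=108, vector_fields=2, dimensions=1536)
print(size, "bytes")                        # 1327104 bytes
print(round(size / 1024 / 1024, 2), "MiB")  # 1.27 MiB
```

For this sample dataset the figure is tiny, but it grows linearly with the number of documents and vector fields.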

Vectorizer

from azure.search.documents.indexes.models import AzureOpenAIVectorizer, AzureOpenAIParameters
vectorizers=[
    AzureOpenAIVectorizer(
        name="AzureOpenAI",
        kind="azureOpenAI",
        azure_open_ai_parameters=AzureOpenAIParameters(
            resource_uri="https://xxxxx.openai.azure.com/",  # Your Azure OpenAI endpoint
            deployment_id="text-embedding-ada-002",
            api_key="OPENAI_API_KEY"  # Your Azure OpenAI key
        )
    )  
]

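This version of Cognitive Search can convert input text into vectors automatically, which saves a separate Embedding call in your own code. For that to work, the Azure OpenAI details must be supplied in the configuration up front.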

Profile

from azure.search.documents.indexes.models import VectorSearchProfile
profiles=[
    VectorSearchProfile(
        name="HnswProfile",
        algorithm="Hnsw",
        vectorizer="AzureOpenAI"
    ),
    VectorSearchProfile(
        name="ExhaustiveKnnProfile",
        algorithm="ExhaustiveKnn",
        vectorizer="AzureOpenAI"
    )
]

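The Profile ties the settings together. Its algorithm and vectorizer parameters refer to the two configurations above, and its name corresponds to the vector_search_profile used when defining the Index fields; remember to check that the names match.

Finally, pass all the settings in the expected format to create the Index: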
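Because these names are plain strings, a typo only surfaces when the service rejects the Index definition. A local check along the following lines can catch it earlier. This is a sketch over plain dictionaries and sets that mirror the names used in this post; check_vector_config is a hypothetical helper, not an SDK function.

```python
def check_vector_config(field_profiles, profiles, algorithm_names, vectorizer_names):
    """Return a list of human-readable problems; an empty list means all names line up."""
    problems = []
    for field, profile in field_profiles.items():
        if profile not in profiles:
            problems.append(f"field {field!r} uses unknown profile {profile!r}")
    for name, (algorithm, vectorizer) in profiles.items():
        if algorithm not in algorithm_names:
            problems.append(f"profile {name!r} uses unknown algorithm {algorithm!r}")
        if vectorizer not in vectorizer_names:
            problems.append(f"profile {name!r} uses unknown vectorizer {vectorizer!r}")
    return problems


# Names copied from the configuration in this post.
field_profiles = {"titleVector": "HnswProfile", "contentVector": "HnswProfile"}
profiles = {
    "HnswProfile": ("Hnsw", "AzureOpenAI"),
    "ExhaustiveKnnProfile": ("ExhaustiveKnn", "AzureOpenAI"),
}
print(check_vector_config(field_profiles, profiles,
                          algorithm_names={"Hnsw", "ExhaustiveKnn"},
                          vectorizer_names={"AzureOpenAI"}))
```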

from azure.search.documents.indexes.models import VectorSearch, SearchIndex

vector_search = VectorSearch(profiles=profiles, algorithms=algorithms, vectorizers=vectorizers)
index = SearchIndex(name="text-sample-vertor", fields=fields, vector_search=vector_search)
result = search_index_client.create_or_update_index(index)

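The name argument of SearchIndex sets the Index name, which you can choose freely.

Once created, the Index is visible in the Portal. Switch to the "Indexes" tab on the left of the Cognitive Search page, where you can see all fields and their settings. The two vector fields look slightly different from the others, but both match the settings above.

Viewing the created Index in the Portal

Generating the Vectors

Back to the test data: each record needs its own vectors before vector search can be used, and we generate separate vectors for the title and the content. The simplest and quickest way is to use text-embedding-ada-002; see the Azure OpenAI Embedding usage guide below.

An even faster way is to use ready-made vectors 😍; see the second link below.

Writing Data to the Index

If you chose the faster way above, the JSON needed for writing to the Index is already prepared; just open the file and use it.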
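If you generate the vectors yourself, the per-record step can be sketched as follows. The embed argument stands for whatever function calls your Azure OpenAI embedding deployment and returns a list of floats; add_vectors and fake_embed are hypothetical names, and fake_embed is only a stand-in so the example runs without any network call.

```python
def add_vectors(documents, embed):
    """Return copies of the documents with titleVector and contentVector added."""
    enriched = []
    for doc in documents:
        new_doc = dict(doc)  # copy so the input records stay untouched
        new_doc["titleVector"] = embed(doc["title"])
        new_doc["contentVector"] = embed(doc["content"])
        enriched.append(new_doc)
    return enriched


def fake_embed(text):
    # Hypothetical stand-in: NOT a real embedding, just a fixed-length list of floats.
    return [float(len(text))] * 3


docs = [{"id": "1", "title": "Azure App Service", "content": "Host web apps."}]
enriched_docs = add_vectors(docs, fake_embed)
print(sorted(enriched_docs[0].keys()))
```

With a real embedding function in place of fake_embed, each vector would have 1536 values, matching the Index definition.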

import json
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

with open('docVectors.json', 'r', encoding='utf-8') as file:
    documents = json.load(file)

search_client = SearchClient(
    endpoint="https://charlie-test.search.windows.net",
    index_name="text-sample-vertor",
    credential=AzureKeyCredential("YOUR-SEARCH-API-KEY")  # Your Cognitive Search key
)
result = search_client.upload_documents(documents)

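Once the write completes, go back to the "Indexes" tab on the left of the Cognitive Search page, wait a few minutes, and refresh; you will see that 108 documents have been loaded into the Index!

108 documents successfully loaded into the Index

Summary

At this point we have created an Index that can store vectors and written 108 documents to it, vector fields included.

The next post runs actual vector search on this Index, shows how to switch quickly between the two algorithms at search time, and covers hybrid search combined with full-text search. Stay tuned!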
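Besides checking the Portal, the value returned by upload_documents can be inspected in code. The sketch below assumes each returned item exposes key and succeeded attributes, as the SDK's indexing results do in the version used here; summarize_upload is a hypothetical helper, and SimpleNamespace objects stand in for real results so the example runs offline.

```python
from types import SimpleNamespace


def summarize_upload(results):
    """Return (number of successful documents, keys of failed documents)."""
    failed = [r.key for r in results if not r.succeeded]
    return len(results) - len(failed), failed


# Stand-ins for the objects returned by search_client.upload_documents(documents).
fake_results = [
    SimpleNamespace(key="1", succeeded=True),
    SimpleNamespace(key="2", succeeded=False),
]
ok_count, failed_keys = summarize_upload(fake_results)
print(ok_count, failed_keys)
```

With the real result list, a count of 108 and an empty list of failed keys would confirm the upload before the Portal catches up.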

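Series

[Azure Cognitive Search] Full-Text Search in Cognitive Search with the Python SDK
[Azure Cognitive Search] Vector Search in Cognitive Search with the Python SDK (Part 1)
[Azure Cognitive Search] Vector Search in Cognitive Search with the Python SDK (Part 2)

Complete Code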

charliewei0716 / demo-azure-cognitive-search-python-vector-search
