Bulk Indexing Documents

The bulk index documents endpoint indexes all the documents of a custom datasource through a series of /bulkindexdocuments requests that share a common uploadId. A bulk upload fully replaces the list of documents stored in Glean: after a successful upload, all documents that were not part of the most recent upload are deleted asynchronously.

Similar bulk indexing endpoints exist for other objects, such as users, groups, employees, and teams.

Choosing /indexdocuments vs /bulkindexdocuments

When deciding between /indexdocuments and /bulkindexdocuments, it's important to understand their primary functions and use cases:

  • /bulkindexdocuments : This endpoint is designed for completely refreshing the datasource. It replaces the existing corpus with the documents provided, and any document that is not part of the upload is deleted. Use this endpoint when you need to replace the existing corpus and upload all documents anew.
  • /indexdocuments : This endpoint is intended for incremental updates. It allows you to add a batch of new documents or update existing ones without affecting the other documents in the index. Choose this option when you want to keep the existing documents intact while adding or updating specific documents.

When to use each endpoint:

  • Use /bulkindexdocuments :
    • When you need to perform a full refresh of the datasource.
    • When all existing documents need to be replaced with a new set of documents.
  • Use /indexdocuments :
    • When you need to add new documents to the existing index.
    • When you need to update specific documents while keeping the rest of the index unchanged.

By selecting the appropriate endpoint based on your needs, you can efficiently manage your document indexing process.
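
The difference between the two endpoints can be illustrated with a toy in-memory model. This is only a sketch of the semantics described above, not Glean's implementation; the `index` dictionary and both function names are hypothetical.

```python
def incremental_index(index, documents):
    """Mimic /indexdocuments: add or update documents, leave the rest untouched."""
    updated = dict(index)
    for doc in documents:
        updated[doc["id"]] = doc
    return updated


def bulk_index(index, documents):
    """Mimic /bulkindexdocuments: the upload becomes the entire corpus."""
    return {doc["id"]: doc for doc in documents}


index = {
    "doc-1": {"id": "doc-1", "title": "Old title"},
    "doc-2": {"id": "doc-2", "title": "Not in the upload"},
}
upload = [
    {"id": "doc-1", "title": "New title"},
    {"id": "doc-3", "title": "Newly added"},
]

print(sorted(incremental_index(index, upload)))  # ['doc-1', 'doc-2', 'doc-3']
print(sorted(bulk_index(index, upload)))         # ['doc-1', 'doc-3']
```

In the incremental case doc-2 survives; in the bulk case it is absent from the upload and would be deleted.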

Making your first successful request to /bulkindexdocuments

Here is a sample request to the /bulkindexdocuments endpoint.

cURL:

curl -X POST https://customer-be.glean.com/api/index/v1/bulkindexdocuments \
  -H 'Authorization: Basic <Token>' \
  -d '
{ 
  "uploadId": "test-upload-id", 
  "isFirstPage": true, 
  "isLastPage": true, 
  "forceRestartUpload": true,
  "datasource": "gleantest",
  "documents": [
    {
      "datasource": "gleantest",
      "objectType": "EngineeringDoc",
      "id": "test-doc-1",
      "title": "How to bulk index documents",
      "body": {
        "mimeType": "text/plain",
        "textContent": "This doc will help you make your first successful bulk index document request"
      },
      "permissions": {
        "allowedUsers": [
          {
            "email": "myuser@bluesky.test",
            "datasourceUserId": "myuser-datasource-id",
            "name": "My User"
          }
        ],
        "allowAllDatasourceUsersAccess": true
      },
      "viewURL": "https://www.glean.engineering.co.in/test-doc-1",
      "customProperties": [
        {
          "name": "Org",
          "value": "Infrastructure"
        }
      ]
    }
  ] 
}'

Python:

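import glean_indexing_api_client
from glean_indexing_api_client.api import documents_api
from glean_indexing_api_client.model.bulk_index_documents_request import BulkIndexDocumentsRequest
# Model import paths below assume the generated client's snake_case module naming.
from glean_indexing_api_client.model.content_definition import ContentDefinition
from glean_indexing_api_client.model.document_definition import DocumentDefinition
from glean_indexing_api_client.model.document_permissions_definition import DocumentPermissionsDefinition

# Configure the client with your instance host and API token (placeholders here).
configuration = glean_indexing_api_client.Configuration(
    host="https://customer-be.glean.com/api/index/v1", access_token="<Token>")
api_client = glean_indexing_api_client.ApiClient(configuration)
document_api = documents_api.DocumentsApi(api_client)

# A single document that anyone signed into Glean can view.
documents = [DocumentDefinition(
    datasource="gleantest",
    object_type="EngineeringDoc",
    title="How to bulk index documents",
    id="test-doc-1",
    view_url="https://www.glean.engineering.co.in/test-doc-1",
    body=ContentDefinition(
        mime_type="text/plain",
        text_content="This doc will help you make your first successful bulk index document request"),
    permissions=DocumentPermissionsDefinition(allow_anonymous_access=True),
)]

# The whole upload fits in one page, so it is both the first and the last page.
bulk_index_documents_request = BulkIndexDocumentsRequest(
    upload_id="test-upload-id",
    datasource="gleantest",
    documents=documents,
    is_first_page=True,
    is_last_page=True,
    force_restart_upload=True,
)

try:
    document_api.bulkindexdocuments_post(bulk_index_documents_request)
except glean_indexing_api_client.ApiException as e:
    print("Exception when calling DocumentsApi->bulkindexdocuments_post: %s\n" % e)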

Let's look at the different fields you need to successfully index documents to Glean. Note that this is just a sample request with the minimal fields required to index content. For an exhaustive list of fields, refer to the API reference.

Bulk upload model

The bulk upload endpoints delete all entries that are not part of the most recent upload. For example, the /bulkindexdocuments endpoint deletes all documents that are not present in the most recent upload.

Concurrent uploads are not allowed, i.e., you cannot start a new upload before the previous upload has finished.

There are some fields that are common across all bulk upload endpoints. We will be describing them here:

uploadId

  • This is the id that uniquely identifies an upload. Use the same uploadId for every paginated request that belongs to one upload, and a different uploadId for each new upload.

isFirstPage

  • This denotes whether the page being uploaded is the first page. It must be true for the first request of an upload and false for all subsequent requests. Your request will fail if you start a new upload without finishing the previous one.

isLastPage

  • This denotes whether the page being uploaded is the last page. It must be true for the last request of an upload and false for all other requests. You cannot start a subsequent upload before the last page, i.e., the page with isLastPage = true, has been uploaded. If you want to start a new upload regardless of the previous upload state, use the forceRestartUpload field.

forceRestartUpload

  • This is required if you want to start a new upload while the previous upload has not finished or has failed. If the previous upload was unsuccessful and this field is not set, the request will fail.
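
The pagination fields above can be sketched as a helper that splits a document list into request bodies. This is a minimal illustration, not part of any Glean client: `build_bulk_pages`, its `page_size` default, and the choice to set forceRestartUpload only on the first page are assumptions of this sketch.

```python
import uuid


def build_bulk_pages(datasource, documents, page_size=100,
                     upload_id=None, force_restart=False):
    """Split documents into request bodies that all share one uploadId."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    if not documents:
        # An empty upload would leave every existing document stale, so refuse it.
        raise ValueError("refusing to build an empty bulk upload")
    upload_id = upload_id or f"upload-{uuid.uuid4()}"
    chunks = [documents[i:i + page_size]
              for i in range(0, len(documents), page_size)]
    pages = []
    for n, chunk in enumerate(chunks):
        pages.append({
            "uploadId": upload_id,                 # same id on every page
            "isFirstPage": n == 0,                 # true only on the first request
            "isLastPage": n == len(chunks) - 1,    # true only on the last request
            # Assumption: restarting only makes sense when opening the upload.
            "forceRestartUpload": force_restart and n == 0,
            "datasource": datasource,
            "documents": chunk,
        })
    return pages


docs = [{"datasource": "gleantest", "id": f"test-doc-{i}"} for i in range(250)]
pages = build_bulk_pages("gleantest", docs, page_size=100, upload_id="test-upload-id")
for page in pages:
    print(page["isFirstPage"], page["isLastPage"], len(page["documents"]))
```

Each body would then be sent, in order, as one /bulkindexdocuments request; because concurrent uploads are not allowed, the next upload should start only after the last page has been accepted.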

In addition to bulk upload fields, we also have:

disableStaleDocumentDeletionCheck

  • The /bulkindexdocuments endpoint asynchronously deletes all documents that weren’t part of the most recent upload session. An erroneous bulk upload could therefore wipe too many documents by accident. To mitigate this, a deletion check pauses the deletion of stale documents for 7 days if the percentage of documents being deleted exceeds 20%. If you intentionally want to delete more than 20% of your previously uploaded documents, specify disableStaleDocumentDeletionCheck = true, which disables this check and allows the documents to be deleted. Note that documents are deleted asynchronously. If you want deletions to take effect immediately, use the /processalldocuments endpoint.
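
Before sending an upload, it can help to estimate whether it would trip the deletion check. The sketch below assumes the percentage is the number of stale documents divided by the number of previously uploaded documents; the exact formula Glean uses is not stated here, and both function names are hypothetical.

```python
def stale_fraction(previous_ids, uploaded_ids):
    """Fraction of previously uploaded documents missing from the new upload."""
    previous = set(previous_ids)
    if not previous:
        return 0.0
    stale = previous - set(uploaded_ids)
    return len(stale) / len(previous)


def would_pause_deletion(previous_ids, uploaded_ids, threshold=0.20):
    """True if the upload would exceed the 20% threshold described above."""
    return stale_fraction(previous_ids, uploaded_ids) > threshold


previous = [f"doc-{i}" for i in range(10)]

print(would_pause_deletion(previous, previous[:8]))  # 2 of 10 stale -> False
print(would_pause_deletion(previous, previous[:7]))  # 3 of 10 stale -> True
```

If the second case is intentional, that is the situation in which disableStaleDocumentDeletionCheck = true would be set on the request.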

Document model

The following is the basic document model used for indexing a new document to Glean. There are other fields too which you can use for advanced functionality. You can refer to them in the API reference docs here.

datasource

  • Represents the document datasource.

id

  • This is a unique identifier for the document. Each document should have a unique id.

body

  • This is used to specify the content which will be used to populate the document body. You also need to specify the mime type of the content.

allowedUsers

  • This represents a list of users who will be able to view this document. For representing a user, there are three fields: email, datasourceUserId, name. It is not required to populate the datasourceUserId field if you have specified isUserReferencedByEmail = true while adding the datasource config. In that case, an email is used to identify a user for permissions.
  • Please note that you must index users before uploading documents to ensure that the permissions are captured. Refer to the Permissions tutorial for more details!
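
As a sketch of the rule above, the helper below builds one allowedUsers entry and leaves out datasourceUserId only when the datasource identifies users by email. The function name and its `referenced_by_email` flag (mirroring isUserReferencedByEmail) are hypothetical, and raising an error for a missing id is a choice made for this example.

```python
def allowed_user(email, name, datasource_user_id=None, referenced_by_email=False):
    """Build one entry for the allowedUsers list of a document's permissions."""
    user = {"email": email, "name": name}
    if datasource_user_id is not None:
        user["datasourceUserId"] = datasource_user_id
    elif not referenced_by_email:
        # Without isUserReferencedByEmail, the email alone does not identify the user.
        raise ValueError("datasourceUserId is needed unless users are referenced by email")
    return user


permissions = {
    "allowedUsers": [
        allowed_user("myuser@bluesky.test", "My User",
                     datasource_user_id="myuser-datasource-id"),
    ]
}
print(permissions)
```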

viewURL

  • This represents the document view url. This is a required field and the request would fail if it is not specified for any document. This viewURL must also match the urlRegex specified while creating the datasource.
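
Since a missing or non-matching viewURL fails the request, a local check before uploading can save a round trip. The sketch below is an assumption-laden illustration: the regex is a hypothetical urlRegex modeled on the sample URL, the function name is made up, and it uses `re.match` (anchored at the start of the string), which may differ from how Glean applies the pattern.

```python
import re


def find_invalid_view_urls(documents, url_regex):
    """Return the ids of documents whose viewURL is missing or does not match."""
    pattern = re.compile(url_regex)
    invalid = []
    for doc in documents:
        view_url = doc.get("viewURL")
        if not view_url or not pattern.match(view_url):
            invalid.append(doc.get("id"))
    return invalid


# Hypothetical urlRegex; use the one configured for your datasource.
URL_REGEX = r"https://www\.glean\.engineering\.co\.in/.*"

candidate_docs = [
    {"id": "test-doc-1", "viewURL": "https://www.glean.engineering.co.in/test-doc-1"},
    {"id": "test-doc-2", "viewURL": "https://example.test/test-doc-2"},
    {"id": "test-doc-3"},
]
print(find_invalid_view_urls(candidate_docs, URL_REGEX))  # ['test-doc-2', 'test-doc-3']
```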

allowAnonymousAccess

  • This can be set to true if anyone who is signed into Glean can view search results for the document, even if they are not a user of this custom datasource.

allowAllDatasourceUsersAccess

  • This can be set to true if all users of the datasource (as uploaded using the Identity APIs) can view this document.

customProperties

  • This is a list of name-value pairs. These properties populate additional facets, which let you search in Glean with operators such as "Org":"Infrastructure". Note that the property names must be predefined when creating the datasource.
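
A small helper can turn a plain dictionary into the name-value list shown in the sample request and reject names that were not predefined. This is a sketch only: `to_custom_properties` and the `defined_names` set are hypothetical, and the set would have to be kept in sync with the datasource configuration by hand.

```python
def to_custom_properties(values, defined_names):
    """Convert {"Org": "Infrastructure"} into the customProperties list format."""
    unknown = sorted(set(values) - set(defined_names))
    if unknown:
        raise ValueError(f"properties not defined on the datasource: {unknown}")
    return [{"name": name, "value": value} for name, value in values.items()]


props = to_custom_properties({"Org": "Infrastructure"}, defined_names={"Org"})
print(props)  # [{'name': 'Org', 'value': 'Infrastructure'}]
```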

Next steps

  • You can check the status of your document using our debugging/troubleshooting APIs. Refer to their documentation for details on how to use these APIs.
  • For the indexed document to show up in the Glean UI, the datasource must be enabled for search. For now, Glean needs to enable it internally; in the future, this will be available through the Glean Admin Console. Once these steps are done, you should be able to search for the indexed document in Glean when logged in as a user who has permission to view it.
  • Note that it takes around 15-20 minutes for the documents to be indexed and appear in the Glean UI.