{ "cells": [ { "cell_type": "markdown", "id": "741af25f-4379-4376-82f3-c1731f82d8bd", "metadata": {}, "source": [ "# s3vectorm Comprehensive Tutorial\n", "\n", "This tutorial provides a complete guide to using the s3vectorm library for managing vector data in the AWS S3 Vectors service. We'll walk through creating buckets and indexes, storing and querying vectors, and performing cleanup operations.\n", "\n", "## Setup and Imports\n", "\n", "First, let's import all the necessary components and set up our AWS client:" ] }, { "cell_type": "code", "execution_count": 1, "id": "9a6ce6c3-2c10-4d1f-a876-dee6eb5f9535", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from s3vectorm.api import (\n", "    Bucket,\n", "    Index,\n", "    Vector,\n", "    BaseMetadata,\n", "    MetaKey,\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "id": "33ed55e0-062e-4e3c-b45b-f74e536b1b89", "metadata": {}, "outputs": [], "source": [ "# Bucket and index names used throughout this tutorial\n", "bucket_name = \"s3vectorm-tutorial-bucket\"\n", "index_name = \"document-embeddings\"\n", "\n", "# Create the AWS S3 Vectors client (credentials and region come from the default boto3 session)\n", "boto_ses = boto3.Session()\n", "s3_vectors_client = boto_ses.client(\"s3vectors\")" ] }, { "cell_type": "markdown", "id": "437bd361-09d3-4e3e-9598-12730807474b", "metadata": {}, "source": [ "## Bucket Management\n", "\n", "### Creating a Vector Bucket\n", "\n", "Before working with vector indexes, we need to create an S3 vector bucket. 
The bucket serves as a container for all your vector indexes:" ] }, { "cell_type": "code", "execution_count": 3, "id": "79031a05-da64-470e-ba93-256356a3d110", "metadata": {}, "outputs": [], "source": [ "# Create a bucket instance\n", "bucket = Bucket(name=bucket_name)" ] }, { "cell_type": "code", "execution_count": 4, "id": "1df53fbe-43b2-4740-8553-16468350e6b7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Bucket created successfully\n" ] } ], "source": [ "# Create the bucket in AWS (returns None if it already exists)\n", "create_result = bucket.create(s3_vectors_client)\n", "if create_result:\n", "    print(\"✅ Bucket created successfully\")\n", "else:\n", "    print(\"ℹ️ Bucket already exists\")" ] }, { "cell_type": "markdown", "id": "bd23b1e9-02ce-47ec-848d-7d10ad17e57a", "metadata": {}, "source": [ "### Listing Indexes in a Bucket\n", "\n", "You can list all indexes within a bucket using pagination. This is useful for discovering existing indexes:" ] }, { "cell_type": "code", "execution_count": 5, "id": "faba5d92-eb78-4083-aba4-667113c96bad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📋 Indexes in bucket:\n" ] } ], "source": [ "# List all indexes in the bucket\n", "print(\"📋 Indexes in bucket:\")\n", "for page in bucket.list_index(\n", "    s3_vectors_client,\n", "    prefix=\"document-\",  # Optional: filter by prefix\n", "    page_size=50\n", "):\n", "    indexes = page.indexes or []\n", "    for index_summary in indexes:\n", "        print(f\" - {index_summary.indexName} (dim: {index_summary.dimension})\")" ] }, { "cell_type": "markdown", "id": "bbec9b39-c16d-400c-8e30-0cd2f92dc5bb", "metadata": {}, "source": [ "## Index Management\n", "\n", "### Creating a Vector Index\n", "\n", "An index defines the structure and properties of your vector data, including dimension and distance metric:" ] }, { "cell_type": "code", "execution_count": 6, "id": "9f54b228-b50d-4307-a9f6-9da297fc7779", 
"metadata": {}, "outputs": [], "source": [ "# Create an index with specific configuration\n", "index = Index(\n", " bucket_name=bucket_name,\n", " index_name=index_name,\n", " data_type=\"float32\",\n", " dimension=768, # Common dimension for many LLM embeddings\n", " distance_metric=\"cosine\"\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "id": "d85173b2-d46c-451d-80a4-752aec9f5f56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "โœ… Index created successfully\n" ] } ], "source": [ "# Create the index in AWS\n", "create_result = index.create(s3_vectors_client)\n", "if create_result:\n", " print(\"โœ… Index created successfully\")\n", "else:\n", " print(\"โ„น๏ธ Index already exists\")" ] }, { "cell_type": "markdown", "id": "686e389a-fe2a-4e2f-837e-dfb6b82a8c31", "metadata": {}, "source": [ "### Retrieving an Existing Index\n", "\n", "You can retrieve an existing index configuration from AWS to work with it:" ] }, { "cell_type": "code", "execution_count": 8, "id": "be977ca1-dfc4-47b6-af01-4201b9358474", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿ“Š Retrieved index: document-embeddings\n", " Dimension: 768\n", " Distance metric: cosine\n" ] } ], "source": [ "# Retrieve an existing index by name\n", "existing_index = Index.get(\n", " s3_vectors_client,\n", " vector_bucket_name=bucket_name,\n", " index_name=index_name\n", ")\n", "\n", "if existing_index:\n", " print(f\"๐Ÿ“Š Retrieved index: {existing_index.index_name}\")\n", " print(f\" Dimension: {existing_index.dimension}\")\n", " print(f\" Distance metric: {existing_index.distance_metric}\")\n", "else:\n", " print(\"โŒ Index not found\")" ] }, { "cell_type": "markdown", "id": "2c4f8de8-9baa-41f3-8ec0-33141e6f4051", "metadata": {}, "source": [ "### Creating Index Objects for Deletion\n", "\n", "When you need to delete indexes without knowing their full configuration, you can create lightweight index objects:" ] }, { 
"cell_type": "code", "execution_count": 9, "id": "88c77fea-44c3-471f-b9d7-67f3e66354b0", "metadata": {}, "outputs": [], "source": [ "# Create index object specifically for deletion operations\n", "deletion_index = Index.new_for_delete(\n", " bucket_name=bucket_name,\n", " index_name=index_name\n", ")" ] }, { "cell_type": "markdown", "id": "7a19074f-627b-4c2c-b62b-075a81512c56", "metadata": {}, "source": [ "## Vector Data Models\n", "\n", "### Defining Custom Vector Classes\n", "\n", "Define your vector data structure by extending the base Vector class with custom metadata fields (``Vector`` is just a subclass of ``pydantic.BaseModel``):" ] }, { "cell_type": "code", "execution_count": 10, "id": "79531bbe-99db-4ab5-b540-b270efad4662", "metadata": {}, "outputs": [], "source": [ "from pydantic import Field\n", "\n", "class DocumentChunk(Vector):\n", " \"\"\"Custom vector class for document chunks with metadata\"\"\"\n", " document_id: str = Field(description=\"ID of the source document\")\n", " chunk_seq: int = Field(description=\"Sequence number of the chunk\")\n", " title: str = Field(description=\"Document title\")\n", " category: str = Field(description=\"Document category\")\n", " owner_id: str = Field(description=\"ID of the document owner\")\n", " created_at: str = Field(description=\"Creation timestamp\")" ] }, { "cell_type": "markdown", "id": "aae29136-61a3-4e08-b941-ce36720d7339", "metadata": {}, "source": [ "### Understanding Vector Conversion\n", "\n", "Vectors can be converted to different formats for AWS operations:" ] }, { "cell_type": "code", "execution_count": 11, "id": "ca5b9a6b-77f7-4021-a72c-5a54de14408f", "metadata": {}, "outputs": [], "source": [ "# Create a sample vector\n", "sample_vector = DocumentChunk(\n", " key=\"doc-1#chunk-1\",\n", " data=[0.1, 0.2, 0.3] * 256, # 768-dimensional vector\n", " document_id=\"doc-1\",\n", " chunk_seq=1,\n", " title=\"Introduction to AI\",\n", " category=\"technology\",\n", " owner_id=\"user-123\",\n", " 
created_at=\"2025-01-01T10:00:00Z\"\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "id": "4a854559-50bd-4e51-8ec8-b11351428068", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📋 Metadata only: {'document_id': 'doc-1', 'chunk_seq': 1, 'title': 'Introduction to AI', 'category': 'technology', 'owner_id': 'user-123', 'created_at': '2025-01-01T10:00:00Z'}\n" ] } ], "source": [ "# Extract just the metadata\n", "metadata = sample_vector.to_metadata_dict()\n", "print(\"📋 Metadata only:\", metadata)" ] }, { "cell_type": "markdown", "id": "84d10601-6bcc-49bf-ad5a-2250b3078e0f", "metadata": {}, "source": [ "## Storing Vector Data\n", "\n", "### Inserting Vectors\n", "\n", "Store multiple vectors in your index efficiently with batch operations:" ] }, { "cell_type": "code", "execution_count": 13, "id": "2d9fd47a-83e3-496a-80a1-f0c8d0223bd4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Successfully stored 4 vectors\n" ] } ], "source": [ "# Create a collection of document vectors\n", "vectors = [\n", "    DocumentChunk(\n", "        key=\"doc-1#chunk-1\",\n", "        data=[0.1] * 768,\n", "        document_id=\"doc-1\",\n", "        chunk_seq=1,\n", "        title=\"Introduction to Machine Learning\",\n", "        category=\"technology\",\n", "        owner_id=\"user-alice\",\n", "        created_at=\"2025-01-01T10:00:00Z\"\n", "    ),\n", "    DocumentChunk(\n", "        key=\"doc-1#chunk-2\",\n", "        data=[0.2] * 768,\n", "        document_id=\"doc-1\",\n", "        chunk_seq=2,\n", "        title=\"Introduction to Machine Learning\",\n", "        category=\"technology\",\n", "        owner_id=\"user-alice\",\n", "        created_at=\"2025-01-01T10:05:00Z\"\n", "    ),\n", "    DocumentChunk(\n", "        key=\"doc-2#chunk-1\",\n", "        data=[0.3] * 768,\n", "        document_id=\"doc-2\",\n", "        chunk_seq=1,\n", "        title=\"Business Strategy Guide\",\n", "        category=\"business\",\n", "        owner_id=\"user-bob\",\n", "        created_at=\"2025-01-01T11:00:00Z\"\n", "    ),\n", "    DocumentChunk(\n", "        key=\"doc-2#chunk-2\",\n", "        
data=[0.4] * 768,\n", "        document_id=\"doc-2\",\n", "        chunk_seq=2,\n", "        title=\"Business Strategy Guide\",\n", "        category=\"business\",\n", "        owner_id=\"user-bob\",\n", "        created_at=\"2025-01-01T11:05:00Z\"\n", "    )\n", "]\n", "\n", "# Store all vectors in the index\n", "index.put_vectors(s3_vectors_client, vectors)\n", "print(f\"✅ Successfully stored {len(vectors)} vectors\")" ] }, { "cell_type": "markdown", "id": "b181d364-ca69-4069-b5e4-0ea0a9a944f0", "metadata": {}, "source": [ "## Metadata Query System\n", "\n", "### Defining Metadata Models\n", "\n", "Create queryable metadata models using inheritance for better organization:" ] }, { "cell_type": "code", "execution_count": 14, "id": "6be2bfd3-10ad-4a01-8e88-8327270bc80c", "metadata": {}, "outputs": [], "source": [ "# Base metadata class with common fields\n", "class BaseDocumentMeta(BaseMetadata):\n", "    document_id = MetaKey()\n", "    chunk_seq = MetaKey()\n", "\n", "# Extended metadata class with additional fields\n", "class DocumentMeta(BaseDocumentMeta):\n", "    title = MetaKey()\n", "    category = MetaKey()\n", "    owner_id = MetaKey()\n", "    created_at = MetaKey()" ] }, { "cell_type": "markdown", "id": "051c3988-c998-4ebb-8d5d-0627051b47de", "metadata": {}, "source": [ "### Understanding Query Operators\n", "\n", "The metadata system supports various comparison operators for building complex queries:" ] }, { "cell_type": "code", "execution_count": 15, "id": "1d0e6bc9-67e4-4580-a4cc-19ee6e2fbbb9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🔍 All operators available for metadata filtering\n" ] } ], "source": [ "# Demonstrate all available operators\n", "meta = DocumentMeta()\n", "\n", "# Equality operators\n", "equality_filter = meta.category.eq(\"technology\")\n", "not_equal_filter = meta.owner_id.ne(\"user-deleted\")\n", "\n", "# Comparison operators\n", "sequence_filter = meta.chunk_seq.gt(1)  # Greater than\n", "recent_filter = meta.chunk_seq.gte(2)  # Greater than or 
equal\n", "early_filter = meta.chunk_seq.lt(5)  # Less than\n", "boundary_filter = meta.chunk_seq.lte(3)  # Less than or equal\n", "\n", "# List operators\n", "multi_user_filter = meta.owner_id.in_([\"user-alice\", \"user-bob\"])\n", "not_category_filter = meta.category.nin([\"draft\", \"archived\"])\n", "\n", "# Existence operators\n", "has_title_filter = meta.title.exists(True)\n", "no_title_filter = meta.title.exists(False)\n", "\n", "print(\"🔍 All operators available for metadata filtering\")" ] }, { "cell_type": "markdown", "id": "13806d23-9dd0-457f-955d-fec901a7262d", "metadata": {}, "source": [ "## Vector Querying\n", "\n", "### Basic Similarity Search\n", "\n", "Perform a similarity search to find the stored vectors closest to your query vector:" ] }, { "cell_type": "code", "execution_count": 16, "id": "ff2d1df1-98ef-4c6e-a520-2ee6e880f577", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🔍 Basic similarity search results:\n", " 1. Introduction to Machine Learning (distance: -0.000001)\n", " Key: doc-1#chunk-1, Owner: user-alice\n", " 2. Business Strategy Guide (distance: -0.000001)\n", " Key: doc-2#chunk-2, Owner: user-bob\n", " 3. Introduction to Machine Learning (distance: -0.000001)\n", " Key: doc-1#chunk-2, Owner: user-alice\n", " 4. Business Strategy Guide (distance: -0.000001)\n", " Key: doc-2#chunk-1, Owner: user-bob\n" ] } ], "source": [ "# Query vector (representing a search query embedding)\n", "query_data = [0.15] * 768\n", "\n", "# Basic similarity search\n", "results = index.query_vectors(\n", "    s3_vectors_client,\n", "    data=query_data,\n", "    top_k=5,\n", "    return_metadata=True,\n", "    return_distance=True\n", ")\n", "\n", "# Process results\n", "print(\"🔍 Basic similarity search results:\")\n", "for i, vector in enumerate(results.as_vector_objects(DocumentChunk), 1):\n", "    print(f\" {i}. 
{vector.title} (distance: {vector.distance:.6f})\")\n", "    print(f\" Key: {vector.key}, Owner: {vector.owner_id}\")" ] }, { "cell_type": "markdown", "id": "21f61bf5-4db0-4a64-afd4-46601c590c08", "metadata": {}, "source": [ "### Filtered Similarity Search\n", "\n", "Combine similarity search with metadata filtering for more precise results:" ] }, { "cell_type": "code", "execution_count": 17, "id": "21e82824-8475-4a7e-bdb4-a3e13fe3880a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🎯 Technology documents:\n", " - Introduction to Machine Learning (chunk 1)\n", " - Introduction to Machine Learning (chunk 2)\n" ] } ], "source": [ "# Search within specific category\n", "category_filter = DocumentMeta.category.eq(\"technology\")\n", "tech_results = index.query_vectors(\n", "    s3_vectors_client,\n", "    data=query_data,\n", "    top_k=3,\n", "    filter=category_filter,\n", "    return_metadata=True,\n", "    return_distance=True\n", ")\n", "\n", "print(\"🎯 Technology documents:\")\n", "for vector in tech_results.as_vector_objects(DocumentChunk):\n", "    print(f\" - {vector.title} (chunk {vector.chunk_seq})\")" ] }, { "cell_type": "markdown", "id": "067120a4-1a79-4b88-87c0-350be57bea1f", "metadata": {}, "source": [ "### Complex Query Combinations\n", "\n", "Build sophisticated queries using logical operators (AND, OR):" ] }, { "cell_type": "code", "execution_count": 18, "id": "0e74625e-54a5-4a3c-afb0-ec378bc876ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🧠 Complex filtered results:\n", " - Introduction to Machine Learning (chunk 2, owner: user-alice)\n" ] } ], "source": [ "# Complex query: Technology documents owned by Alice, chunk sequence > 1\n", "complex_filter = (\n", "    DocumentMeta.category.eq(\"technology\") &\n", "    DocumentMeta.owner_id.eq(\"user-alice\") &\n", "    DocumentMeta.chunk_seq.gt(1)\n", ")\n", "\n", "complex_results = index.query_vectors(\n", "    s3_vectors_client,\n", "    
data=query_data,\n", "    filter=complex_filter,\n", "    return_metadata=True,\n", "    return_distance=True\n", ")\n", "\n", "print(\"🧠 Complex filtered results:\")\n", "for vector in complex_results.as_vector_objects(DocumentChunk):\n", "    print(f\" - {vector.title} (chunk {vector.chunk_seq}, owner: {vector.owner_id})\")" ] }, { "cell_type": "markdown", "id": "20f47c40-039c-48c9-b023-3cb6e9473501", "metadata": {}, "source": [ "### Multi-User Query Example\n", "\n", "Search for content from multiple users using the IN operator:" ] }, { "cell_type": "code", "execution_count": 19, "id": "461273eb-3db2-41a6-ba66-9cb39985b30d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "👥 Multi-user search results:\n", " - Business Strategy Guide by user-bob\n", " - Introduction to Machine Learning by user-alice\n", " - Business Strategy Guide by user-bob\n", " - Introduction to Machine Learning by user-alice\n" ] } ], "source": [ "# Find documents from specific users\n", "multi_user_filter = DocumentMeta.owner_id.in_([\"user-alice\", \"user-bob\"])\n", "multi_user_results = index.query_vectors(\n", "    s3_vectors_client,\n", "    data=query_data,\n", "    filter=multi_user_filter,\n", "    return_metadata=True\n", ")\n", "\n", "print(\"👥 Multi-user search results:\")\n", "for vector in multi_user_results.as_vector_objects(DocumentChunk):\n", "    print(f\" - {vector.title} by {vector.owner_id}\")" ] }, { "cell_type": "markdown", "id": "7b2a7d10-7392-4dad-b344-bd73b116b386", "metadata": {}, "source": [ "## Vector Listing and Management\n", "\n", "### Listing All Vectors\n", "\n", "Retrieve all vectors in the index using pagination. 
This is useful for data auditing and bulk operations:" ] }, { "cell_type": "code", "execution_count": 20, "id": "69abd35d-e195-4561-8600-a77fa01edce1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📃 All vectors in index:\n", " - doc-2#chunk-1: Business Strategy Guide (business)\n", " - doc-2#chunk-2: Business Strategy Guide (business)\n", " - doc-1#chunk-2: Introduction to Machine Learning (technology)\n", " - doc-1#chunk-1: Introduction to Machine Learning (technology)\n", "📊 Total vectors found: 4\n" ] } ], "source": [ "# List all vectors with metadata\n", "print(\"📃 All vectors in index:\")\n", "all_keys = []\n", "\n", "for page in index.list_vectors(\n", "    s3_vectors_client,\n", "    return_metadata=True,\n", "    return_data=False,  # Don't return vector data for performance\n", "    page_size=100\n", "):\n", "    for vector in page.as_vector_objects(DocumentChunk):\n", "        all_keys.append(vector.key)\n", "        print(f\" - {vector.key}: {vector.title} ({vector.category})\")\n", "\n", "print(f\"📊 Total vectors found: {len(all_keys)}\")" ] }, { "cell_type": "markdown", "id": "5c129c66-53ba-4d14-9f0c-8bebd000cf09", "metadata": {}, "source": [ "### Listing Vectors with Data\n", "\n", "When you need the actual vector embeddings, enable data return:" ] }, { "cell_type": "code", "execution_count": 21, "id": "8157f453-bcac-447f-b64c-b24d9e22924c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🔢 Vectors with embedding data:\n", " - doc-2#chunk-1: data preview [0.30000001192092896, 0.30000001192092896, 0.30000001192092896]...\n", " - doc-2#chunk-2: data preview [0.4000000059604645, 0.4000000059604645, 0.4000000059604645]...\n", " - doc-1#chunk-2: data preview [0.20000000298023224, 0.20000000298023224, 0.20000000298023224]...\n", " - doc-1#chunk-1: data preview [0.10000000149011612, 0.10000000149011612, 0.10000000149011612]...\n" ] } ], "source": [ "# List vectors with their embedding data\n", 
"print(\"๐Ÿ”ข Vectors with embedding data:\")\n", "for page in index.list_vectors(\n", " s3_vectors_client,\n", " return_data=True,\n", " return_metadata=True,\n", " page_size=2 # Small page size for demo\n", "):\n", " for vector in page.as_vector_objects(DocumentChunk):\n", " data_preview = vector.data[:3] if vector.data else None\n", " print(f\" - {vector.key}: data preview {data_preview}...\")" ] }, { "cell_type": "markdown", "id": "163c5d21-419d-4f98-a898-05aae98d473e", "metadata": {}, "source": [ "### Segmented Vector Listing\n", "\n", "For large indexes, use segmentation to process vectors in parallel:" ] }, { "cell_type": "code", "execution_count": 22, "id": "b755079f-ad05-4fdd-97e9-258d878fd2bb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿ“ฆ Processing segment 1/2:\n", " Found 2 vectors in this segment\n", "๐Ÿ“ฆ Processing segment 2/2:\n", " Found 2 vectors in this segment\n" ] } ], "source": [ "# Process vectors in segments (useful for parallel processing)\n", "segment_count = 2\n", "for segment_index in range(segment_count):\n", " print(f\"๐Ÿ“ฆ Processing segment {segment_index + 1}/{segment_count}:\")\n", "\n", " for page in index.list_vectors(\n", " s3_vectors_client,\n", " segment_count=segment_count,\n", " segment_index=segment_index,\n", " return_metadata=True,\n", " page_size=50\n", " ):\n", " vectors_in_segment = list(page.as_vector_objects(DocumentChunk))\n", " print(f\" Found {len(vectors_in_segment)} vectors in this segment\")" ] }, { "cell_type": "markdown", "id": "fa2e3e60-68fd-48cd-b1c5-b9f8fd2e2e46", "metadata": {}, "source": [ "## Vector Deletion Operations\n", "\n", "### Deleting Specific Vectors\n", "\n", "Remove individual vectors by their keys when you need selective deletion:" ] }, { "cell_type": "code", "execution_count": 23, "id": "4ccf9ecf-6f3d-4dc9-9c0e-31f8a9afbd4c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿ—‘๏ธ Deleted 2 specific 
vectors\n", "📊 Remaining vectors: 2\n" ] } ], "source": [ "# Delete specific vectors by key\n", "keys_to_delete = [\"doc-1#chunk-1\", \"doc-2#chunk-1\"]\n", "index.delete_vectors(s3_vectors_client, keys=keys_to_delete)\n", "print(f\"🗑️ Deleted {len(keys_to_delete)} specific vectors\")\n", "\n", "# Verify deletion\n", "remaining_count = 0\n", "for page in index.list_vectors(s3_vectors_client, return_metadata=True):\n", "    for vector in page.as_vector_objects(DocumentChunk):\n", "        remaining_count += 1\n", "\n", "print(f\"📊 Remaining vectors: {remaining_count}\")" ] }, { "cell_type": "markdown", "id": "86a17eda-1463-4208-832e-5c6369c0d9d7", "metadata": {}, "source": [ "### Deleting All Vectors\n", "\n", "Clear all vectors from an index while preserving the index structure:" ] }, { "cell_type": "code", "execution_count": 24, "id": "c406fac3-abc2-4fb4-b1d7-b6e5b7b1bfe1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🧹 Deleted 2 vectors from index\n", "✅ Index now contains 0 vectors\n" ] } ], "source": [ "# Delete all vectors in the index\n", "deleted_count = index.delete_all_vectors(\n", "    s3_vectors_client,\n", "    page_size=100,\n", "    max_items=1000\n", ")\n", "print(f\"🧹 Deleted {deleted_count} vectors from index\")\n", "\n", "# Verify the index is empty\n", "verification_count = 0\n", "for page in index.list_vectors(s3_vectors_client):\n", "    for vector in page.as_vector_objects(DocumentChunk):\n", "        verification_count += 1\n", "\n", "print(f\"✅ Index now contains {verification_count} vectors\")" ] }, { "cell_type": "markdown", "id": "82e4f263-df2e-4aa0-ae62-3b896c402b29", "metadata": {}, "source": [ "## Advanced Index Operations\n", "\n", "### Deleting an Index\n", "\n", "Remove an entire index and all its vectors permanently:" ] }, { "cell_type": "code", "execution_count": 25, "id": "5cd614eb-471a-4498-ab87-b92c40f01020", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"๐Ÿ—‘๏ธ Index deleted successfully\n" ] } ], "source": [ "# Delete the index (this also deletes all vectors)\n", "index.delete(s3_vectors_client)\n", "print(\"๐Ÿ—‘๏ธ Index deleted successfully\")" ] }, { "cell_type": "markdown", "id": "f2f2455d-7a1f-4aaf-b2e3-26adf03fd1d4", "metadata": {}, "source": [ "### Bulk Index Management\n", "\n", "Manage multiple indexes efficiently using list operations:" ] }, { "cell_type": "code", "execution_count": 26, "id": "83f87fc6-3648-4ac4-80d7-bfdde621a14d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "๐Ÿ” Finding all indexes in bucket:\n", "โœ… All indexes deleted\n" ] } ], "source": [ "# List and delete all indexes in a bucket\n", "print(\"๐Ÿ” Finding all indexes in bucket:\")\n", "for page in bucket.list_index(s3_vectors_client):\n", " # Create index objects for deletion\n", " index_list = Index.new_for_delete_from_list_index_response(page)\n", "\n", " for idx in index_list:\n", " print(f\" Deleting index: {idx.index_name}\")\n", " idx.delete(s3_vectors_client)\n", "\n", "print(\"โœ… All indexes deleted\")" ] }, { "cell_type": "markdown", "id": "e66f1e9b-6c0b-4b29-a9a0-465fb816cb55", "metadata": {}, "source": [ "## Cleanup Operations\n", "\n", "### Deleting the Bucket\n", "\n", "Finally, remove the entire bucket. 
Note that all indexes must be deleted first:" ] }, { "cell_type": "code", "execution_count": 27, "id": "6c6b78a4-7223-4faa-8441-1f536002ee8c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "🗑️ Bucket deleted successfully\n", "Response: {'ResponseMetadata': {'RequestId': '65a9389e-7a52-4ac2-940c-b7b7bbb4a37c', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Sat, 27 Sep 2025 23:19:58 GMT', 'content-type': 'application/json', 'content-length': '2', 'connection': 'keep-alive', 'x-amz-request-id': '65a9389e-7a52-4ac2-940c-b7b7bbb4a37c', 'access-control-allow-origin': '*', 'vary': 'origin, access-control-request-method, access-control-request-headers', 'access-control-expose-headers': '*'}, 'RetryAttempts': 0}}\n" ] } ], "source": [ "# Delete the bucket (must be empty of indexes)\n", "try:\n", "    delete_result = bucket.delete(s3_vectors_client)\n", "    print(\"🗑️ Bucket deleted successfully\")\n", "    print(f\"Response: {delete_result}\")\n", "except Exception as e:\n", "    print(f\"❌ Failed to delete bucket: {e}\")\n", "    print(\"💡 Make sure all indexes are deleted first\")" ] }, { "cell_type": "markdown", "id": "c9968e21-347d-4ea9-8621-4fd30bcf6c79", "metadata": {}, "source": [ "## Best Practices and Tips\n", "\n", "### Performance Optimization\n", "\n", "Tips for optimal performance:" ] }, { "cell_type": "code", "execution_count": 29, "id": "b9754c5c-6ed0-49bf-8368-725919a5ab87", "metadata": {}, "outputs": [], "source": [ "# 1. Use appropriate page sizes for your use case\n", "# Small pages for interactive applications\n", "for page in index.list_vectors(s3_vectors_client, page_size=10):\n", "    pass\n", "\n", "# Large pages for batch processing\n", "for page in index.list_vectors(s3_vectors_client, page_size=1000):\n", "    pass\n", "\n", "# 2. 
Only return data when needed\n", "# Metadata only (faster)\n", "results = index.query_vectors(\n", "    s3_vectors_client,\n", "    data=query_data,\n", "    return_metadata=True,\n", "    return_distance=False\n", ")\n", "\n", "# 3. Use segmentation for parallel processing of large datasets\n", "segment_count = 4  # Adjust based on your processing capacity" ] }, { "cell_type": "markdown", "id": "963f9282-ebf2-46cd-b895-463fa5e7e901", "metadata": {}, "source": [ "### Type Safety\n", "\n", "Leverage the type system for a better development experience:" ] }, { "cell_type": "code", "execution_count": null, "id": "027bfd71-6691-4427-a090-84aef8e11e3b", "metadata": {}, "outputs": [], "source": [ "# The library preserves vector subclass types\n", "# (metadata must be returned so the custom fields can be populated)\n", "results = index.query_vectors(s3_vectors_client, data=query_data, return_metadata=True)\n", "typed_vectors = results.as_vector_objects(DocumentChunk)\n", "\n", "# Now your IDE knows these are DocumentChunk instances\n", "for vector in typed_vectors:\n", "    # Auto-completion works for custom fields\n", "    print(vector.document_id)  # ✅ Type-safe access\n", "    print(vector.title)  # ✅ IDE knows about this field" ] }, { "cell_type": "markdown", "id": "3937ece4-433a-4164-9111-f2f80a70f91b", "metadata": {}, "source": [ "This concludes the s3vectorm tutorial. You now have the tools needed to build vector search applications with the AWS S3 Vectors service." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 5 }