Fix a few URLs in notebooks and improve text of cluster use-case

blythed committed Dec 2, 2023
1 parent 3da9d53 commit 5ddb631

Showing 9 changed files with 3,893 additions and 883 deletions.
58 changes: 36 additions & 22 deletions docs/hr/content/use_cases/productionization/sandbox-example.md

# SuperDuperDB: cluster usage

SuperDuperDB allows developers, on the one hand, to experiment with and set up models quickly in scripts and notebooks, and on the other, to deploy persistent services, which are intended to "always" be on. These persistent services are:

- Vector-searcher service
- Change-data-capture (CDC) service

![](../docs/hr/static/img/light.png)

To set up `superduperdb` to use this cluster mode, it's necessary to add explicit configurations
for each of these components. The following configuration does that, as well as enabling a pre-configured

Once this configuration has been added, you're ready to use the `superduperdb` sandbox environment, which includes:
- Change-data-capture (CDC) service
- Jupyter notebook service

To set up this environment, navigate to your local copy of the `superduperdb` repository and edit the `requirements.txt` to look like this:

```
.[demo,server,apis]
```

(you can add any requirements your system needs) and build the image with:

```bash
make testenv_image
```

Then start the environment with:

```bash
make testenv_init SUPERDUPERDB_DATA=sandbox
```

This last command starts containers for each of the above services with `docker-compose`. You should see a bunch of logs for each service (mainly MongoDB).

Once you have carried out these steps, you are ready to work through the rest of this notebook, which implements a production-style vector-search deployment.


```python
from superduperdb import CFG

assert CFG.data_backend == 'mongodb://superduper:superduper@mongodb:27017/test_db'
```

We'll be using MongoDB to store the vectors and data:


```python
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection

db = superduper()
doc_collection = Collection('documents')
```

We've already prepared some data which was scraped from the `pymongo` query API. You can download it
in the next cell:


```python
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json

import json

with open('pymongo.json') as f:
    data = json.load(f)

data[0]
```

Let's insert this data:


```python
from superduperdb import Document

# Insert all but the last 100 records; the remainder is inserted
# later via plain pymongo (the exact split is illustrative)
out, G = db.execute(
    doc_collection.insert_many([Document(r) for r in data[:-100]])
)
```


We'll use a `sentence-transformers` model to calculate the embeddings. Here's how to wrap the model
so that it works with `superduperdb`:


```python
import sentence_transformers
from superduperdb import Model, vector

# Wrap the sentence-transformers model so superduperdb can call it
model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)
```

Now let's create the vector-search component:


```python
from superduperdb import Listener, VectorIndex

jobs, vi = db.add(
    VectorIndex(
        identifier='pymongo-docs',
        indexing_listener=Listener(
            select=doc_collection.find(),
            key='value',
            model=model,
        ),
    )
)
```

This command creates a job on `dask` to calculate the vectors and save them in the database. You can
follow the `stdout` of this job with this command:


```python
jobs[0].watch()
```

After a few moments, you'll be able to verify that the vectors have been saved in the documents:


```python
db.execute(doc_collection.find_one())
```

Let's test a similarity/vector search using the hybrid query API of `superduperdb`. This search
dispatches one part to the vector-search server (running on port 8001) and the other (classical) part to MongoDB;
the results are combined by `superduperdb`:


```python
from superduperdb import Document
from IPython.display import Markdown

result = db.execute(
    doc_collection
        .like(Document({'value': 'How to query the database'}), n=5, vector_index='pymongo-docs')
        .find()
)

for r in result:
    display(Markdown(r['value']))
    display(Markdown('---'))
```



One of the great things about this distributed setup is that it now allows data to be inserted into the service via other
MongoDB clients, even from other programming languages and applications.

We showcase this here by inserting the rest of the data using the official Python MongoDB driver, `pymongo`.
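
A minimal sketch of such an insertion, assuming the connection URI from the configuration above and the tail of `data` held back earlier, might look like this:

```python
import json
import pymongo

# Connect with the plain MongoDB driver -- no superduperdb client involved
client = pymongo.MongoClient('mongodb://superduper:superduper@mongodb:27017')
collection = client.test_db.documents

with open('pymongo.json') as f:
    data = json.load(f)

# Insert the held-back records; the CDC service detects the change
# and schedules vector computation for the new documents
collection.insert_many(data[-100:])
```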
You can confirm that another job has been created and executed:

```python
db.metadata.show_jobs()
```

We can now check that outputs have been populated for all of the documents, including those inserted via the `pymongo` client.


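A minimal sketch of such a check, assuming outputs are stored under a `_outputs` field on each document:

```python
# Fetch the documents that now carry model outputs
populated = list(db.execute(doc_collection.find({'_outputs': {'$exists': 1}})))

# Every document -- including those inserted via pymongo -- should be covered
print(len(populated), len(data))
```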
2 changes: 1 addition & 1 deletion examples/multimodal_image_search_clip.ipynb
"import random\n",
"\n",
"# Use glob to get a list of image file paths in the 'images_small' directory\n",
"images = glob.glob('images_small/*.jpg')\n",
"images = glob.glob('images_tiny/*.jpg')\n",
"\n",
"# Create a list of SuperDuperDB Document instances with image data\n",
"# Note: The 'uri' parameter is set to the file URI using the 'file://' scheme\n",
2,815 changes: 2,781 additions & 34 deletions examples/question_the_docs.ipynb

Large diffs are not rendered by default.
