Fix a few URLs in notebooks and improve text of cluster use-case

blythed committed Dec 2, 2023
1 parent 3da9d53 commit 5ddb631

Showing 9 changed files with 3,893 additions and 883 deletions.
58 changes: 36 additions & 22 deletions docs/hr/content/use_cases/productionization/sandbox-example.md

# SuperDuperDB: cluster usage

SuperDuperDB allows developers, on the one hand, to experiment with and set up models quickly in scripts and notebooks, and on the other, to deploy persistent services, which are intended to "always" be on. These persistent services are:

- Vector-searcher service
- Change-data-capture (CDC) service

![](../docs/hr/static/img/light.png)

To set up `superduperdb` to use this cluster mode, it's necessary to add explicit configurations
for each of these components. The following configuration does that, as well as enabling a pre-configured

Once this configuration has been added, you're ready to use the `superduperdb` sandbox environment, which includes:
- Change-data-capture (CDC) service
- Jupyter notebook service

To set up this environment, navigate to your local copy of the `superduperdb` repository and edit the `requirements.txt` to look like this:

```
.[demo,server,apis]
```

(you can add any requirements your system needs) and build the image with:

```bash
make testenv_image
```

Then start the environment with:

```bash
make testenv_init SUPERDUPERDB_DATA=sandbox
```

This last command starts containers for each of the above services with `docker-compose`. You should see a bunch of logs for each service (mainly MongoDB).

Once you have carried out these steps, you are ready to work through the rest of this notebook, which implements a production-style vector-search deployment.


```python
from superduperdb import CFG

assert CFG.data_backend == 'mongodb://superduper:superduper@mongodb:27017/test_db'
```

We'll be using MongoDB to store the vectors and data:


```python
from superduperdb import superduper
from superduperdb.backends.mongodb import Collection

db = superduper()
doc_collection = Collection('documents')
```

We've already prepared some data which was scraped from the `pymongo` query API. You can download it
in the next cell:


```python
!curl -O https://superduperdb-public.s3.eu-west-1.amazonaws.com/pymongo.json

import json

with open('pymongo.json') as f:
    data = json.load(f)

data[0]
```

Let's insert this data:


```python
from superduperdb import Document

# Insert all but the last 100 records; the remainder is inserted
# later via plain pymongo (the exact split is illustrative)
out, G = db.execute(
    doc_collection.insert_many([Document(r) for r in data[:-100]])
)
```


We'll use a `sentence-transformers` model to calculate the embeddings. Here's how to wrap the model
so that it works with `superduperdb`:


```python
import sentence_transformers
from superduperdb import Model, vector

# Wrap the sentence-transformers model so superduperdb can call it
model = Model(
    identifier='all-MiniLM-L6-v2',
    object=sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2'),
    encoder=vector(shape=(384,)),
    predict_method='encode',
    postprocess=lambda x: x.tolist(),
    batch_predict=True,
)
```

Now let's create the vector-search component:


```python
from superduperdb import Listener, VectorIndex

jobs, vi = db.add(
    VectorIndex(
        identifier='pymongo-docs',
        indexing_listener=Listener(
            select=doc_collection.find(),
            key='value',
            model=model,
        ),
    )
)
```

This command creates a job on `dask` to calculate the vectors and save them in the database. You can
follow the `stdout` of this job with this command:


```python
jobs[0].watch()
```

After a few moments, you'll be able to verify that the vectors have been saved in the documents:


```python
db.execute(doc_collection.find_one())
```

Let's test a similarity/vector search using the hybrid query API of `superduperdb`. This search
dispatches one part to the vector-search server (running on port 8001) and the other (classical) part to MongoDB;
the results are combined by `superduperdb`:


```python
from superduperdb import Document
from IPython.display import Markdown

result = db.execute(
    doc_collection
        .like(Document({'value': 'How to query the database'}), n=5, vector_index='pymongo-docs')
        .find()
)

for r in result:
    display(Markdown(r['value']))
    display(Markdown('---'))
```



One of the great things about this distributed setup is that it now allows data to be inserted into the service via other
MongoDB clients, even from other programming languages and applications.

We showcase this here by inserting the rest of the data using the official Python MongoDB driver, `pymongo`.
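
A minimal sketch of such an insertion, assuming the connection URI from the configuration above and the tail of `data` held back earlier, might look like this:

```python
import json
import pymongo

# Connect with the plain MongoDB driver -- no superduperdb client involved
client = pymongo.MongoClient('mongodb://superduper:superduper@mongodb:27017')
collection = client.test_db.documents

with open('pymongo.json') as f:
    data = json.load(f)

# Insert the held-back records; the CDC service detects the change
# and schedules vector computation for the new documents
collection.insert_many(data[-100:])
```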
You can confirm that another job has been created and executed:

```python
db.metadata.show_jobs()
```

We can now check that outputs have been populated for all of the documents, including those inserted via the `pymongo` client.


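A minimal sketch of such a check, assuming outputs are stored under a `_outputs` field on each document:

```python
# Fetch the documents that now carry model outputs
populated = list(db.execute(doc_collection.find({'_outputs': {'$exists': 1}})))

# Every document -- including those inserted via pymongo -- should be covered
print(len(populated), len(data))
```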
2 changes: 1 addition & 1 deletion examples/multimodal_image_search_clip.ipynb
"import random\n",
"\n",
"# Use glob to get a list of image file paths in the 'images_small' directory\n",
"images = glob.glob('images_small/*.jpg')\n",
"images = glob.glob('images_tiny/*.jpg')\n",
"\n",
"# Create a list of SuperDuperDB Document instances with image data\n",
"# Note: The 'uri' parameter is set to the file URI using the 'file://' scheme\n",
2,815 changes: 2,781 additions & 34 deletions examples/question_the_docs.ipynb

Large diffs are not rendered by default.
