Add a few hot fixes to the documentation
blythed committed Nov 28, 2023
1 parent b8d5e8a commit a7b32f3
Showing 16 changed files with 140 additions and 117 deletions.
4 changes: 4 additions & 0 deletions docs/hr/content/docs/data_integrations/sql.md
@@ -4,6 +4,10 @@ sidebar_position: 3

# SQL

`superduperdb` supports SQL databases via the [`ibis` project](https://ibis-project.org/).
With `superduperdb`, queries may be built which conform to the `ibis` API, with additional
support for complex data-types and vector-searches.

## Setup

The first step in working with an SQL table is to define a table and schema
33 changes: 26 additions & 7 deletions docs/hr/content/docs/walkthrough/ai_apis.md
@@ -10,15 +10,17 @@
these providers is similar to instantiating a `Model`:

## OpenAI

**Supported**

| Description | Class-name |
| --- | --- |
| Embeddings | `OpenAIEmbedding` |
| Chat models | `OpenAIChatCompletion` |
| Image generation models | `OpenAIImageCreation` |
| Image edit models | `OpenAIImageEdit` |
| Audio transcription models | `OpenAIAudioTranscription` |

**Usage**

```python
from superduperdb.ext.openai import OpenAI<ModelType> as ModelCls
db.add(ModelCls(identifier='my-model', **kwargs))
```

## Cohere

**Supported**

| Description | Class-name |
| --- | --- |
| Embeddings | `CohereEmbedding` |
| Chat models | `CohereChatCompletion` |

**Usage**

```python
from superduperdb.ext.cohere import Cohere<ModelType> as ModelCls

db.add(ModelCls(identifier='my-model', **kwargs))
```

## Anthropic

**Supported**

| Description | Class-name |
| --- | --- |
| Chat models | `AnthropicCompletions` |

**Usage**

```python
from superduperdb.ext.anthropic import Anthropic<ModelType> as ModelCls

```
6 changes: 3 additions & 3 deletions docs/hr/content/docs/walkthrough/ai_models.md
@@ -4,7 +4,7 @@ sidebar_position: 18

# AI Models via `Model` and Descendants

AI models may be wrapped and used in `superduperdb` with the `Model` class and descendants.

### Creating AI Models in a Range of Frameworks

@@ -43,7 +43,8 @@
```python
from superduperdb import superduper

db.add(Pipeline(task='sentiment-analysis'))
```

There is also support for building the pipeline in separate stages with a high degree of customization.
The following is a speech-to-text model published by [facebook research](https://arxiv.org/abs/2010.05171) and shared [on Hugging-Face](https://huggingface.co/facebook/s2t-small-librispeech-asr):

```python
from superduperdb.ext.transformers import Pipeline
```
@@ -91,4 +92,3 @@
```python
db.add(model)
```
| `postprocess` | `Callable` applied to individual rows/items or output |
| `encoder` | An `Encoder` instance applied to the model output to save that output in the database |
| `schema` | A `Schema` instance applied to a model's output, whose rows are dictionaries |
29 changes: 6 additions & 23 deletions docs/hr/content/docs/walkthrough/apply_models.md
@@ -8,18 +8,13 @@ sidebar_position: 21

## Procedural API

Applying a model to data is straightforward with `Model.predict`.

### Out-of-database prediction

As is standard in `sklearn` and other AI libraries and frameworks, such as `tensorflow.keras`,
all `superduperdb` models support `.predict`, predicting directly on datapoints.
To use this functionality, supply the datapoints directly to the `Model`:

```python
my_model = ... # code to instantiate model
my_model.predict(X=<input_datum>, one=True)
```

### In-database, one-time model prediction


It is possible to apply a model directly to the database with `Model.predict`.
In this context, the parameter `X` refers to the field/column of data which is passed to the model.
`X="_base"` passes all of the data (all columns/fields).
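As a rough illustration of this selection logic (hypothetical helper names, not `superduperdb` internals):

```python
# Hypothetical illustration of how the `X` parameter selects model
# inputs from stored records.
records = [{'txt': 'hello', 'n': 1}, {'txt': 'world', 'n': 2}]

def select_inputs(records, X):
    if X == '_base':
        return records              # pass whole records to the model
    return [r[X] for r in records]  # pass only the named field/column

assert select_inputs(records, 'txt') == ['hello', 'world']
assert select_inputs(records, '_base') == records
```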

#### MongoDB

```python
my_model = ... # code to instantiate model
my_model.predict(
)
```

#### SQL

```python
table = db.load('my-table', 'table_or_collection')
my_model.predict(
)
```


### In-database, daemonized model predictions with `listen=True`

It is also possible to apply a model to create predictions, and also
@@ -5,10 +5,10 @@ sidebar_position: 22
# Daemonizing `.predict` with listeners

In many AI applications, it's important that a catalogue of predictions is maintained for
all data in the database, updated as soon as possible after data updates and streaming inserts.

In order to allow developers to implement this functionality, `superduperdb` offers
the `Listener` abstraction.

## Creating listeners in-line with `.predict`

@@ -41,10 +41,13 @@
```python
db.add(
)
```

## Outcome

If a `Listener` has been created, whenever new data is added to `db`,
the `Predictor` instance is loaded and predictions are evaluated on the inserted data.
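The behaviour can be sketched with a toy stand-in (hypothetical, stdlib-only; in reality the `Predictor` is loaded and executed by `db`):

```python
# Toy sketch of listener behaviour: each insert triggers the
# predictor on the newly inserted data only.
class ToyListener:
    def __init__(self, predict):
        self.predict = predict
        self.outputs = {}

    def on_insert(self, rows):
        # evaluate predictions for the inserted rows immediately
        for key, x in rows:
            self.outputs[key] = self.predict(x)

listener = ToyListener(predict=lambda x: x * 2)
listener.on_insert([(1, 10), (2, 20)])
assert listener.outputs == {1: 20, 2: 40}
```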

:::info
In MongoDB, if [change-data-capture (CDC)](../production/change_data_capture.md) has been configured,
data may even be inserted from third-party clients such as `pymongo`, and is nonetheless still processed
by configured `Listeners` via the CDC service.
:::
20 changes: 14 additions & 6 deletions docs/hr/content/docs/walkthrough/data_encodings_and_schemas.md
@@ -47,24 +47,32 @@
```python
audio = Encoder('audio', encoder=encoder, decoder=decoder)
```

It's completely open to the user how exactly the `encoder` and `decoder` arguments are set.
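For instance, a minimal encoder/decoder pair for a list of 32-bit floats might look as follows (a sketch using the standard-library `struct` module; the function names are placeholders):

```python
import struct

# Hypothetical `encoder`/`decoder` pair, standing in for the
# arguments shown above: serialize a list of floats to bytes.
def encoder(x):
    # prefix with the element count, then pack each float (little-endian)
    return struct.pack(f'<I{len(x)}f', len(x), *x)

def decoder(b):
    n = struct.unpack_from('<I', b)[0]
    return list(struct.unpack_from(f'<{n}f', b, offset=4))

data = [0.5, 1.5, -2.0]
assert isinstance(encoder(data), bytes)
assert decoder(encoder(data)) == data
```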

You may include these `Encoder` instances in models, data-inserts and more. You can also directly
register the `Encoder` instances in the system, using:

```python
db.add(my_array)
db.add(audio)
```

To reload (for instance in another session) do:

```python
my_array_reloaded = db.load('encoder', 'my_array')
audio_reloaded = db.load('encoder', 'audio')
```

## Schemas for SQL

For SQL databases, one needs to define a schema to work with tables in `superduperdb`. The `superduperdb.Schema`
builds on top of `Encoder` and allows developers to combine standard data-types traditionally used in SQL databases
with bespoke data-types via `Encoder`, as defined by, for instance, `audio` above.

To register/create a `Table` with a `Schema` in `superduperdb`, one uses `superduperdb.backends.ibis.Table`:

```python
from superduperdb import Table
from superduperdb import Schema

db.add(
Table(
        ...
    )
)
```
21 changes: 12 additions & 9 deletions docs/hr/content/docs/walkthrough/encoding_special_data_types.md
@@ -4,20 +4,20 @@ sidebar_position: 14

# Inserting images, audio, video and other special data

An initial step in working with `superduperdb`
is to establish the data-types one wishes to work with, create `Encoder` instances for
those data-types, and potentially `Schema` objects for SQL tables. See [here](./data_encodings_and_schemas.md) for
this information.

If these have been created, data may be inserted which uses these data-types, including previously defined `Encoder` instances.

## MongoDB

```python
from superduperdb import Document

my_array = db.load('encoder', 'my_array')

files = ... # list of paths to audio files

db.execute(
    ...
)
```

## SQL


```python
import pandas

files = ...  # list of paths to audio files

table = db.load('table', 'my-table')

df = pandas.DataFrame([
{
        ...
    }
])

db.execute(table.insert(df))
```


30 changes: 29 additions & 1 deletion docs/hr/content/docs/walkthrough/inserting_data.md
@@ -6,7 +6,7 @@ sidebar_position: 2

After configuring and connecting, you're ready to insert some data.

In `superduperdb`, data may be inserted using the connection `db`,
or using a third-party client.

## SuperDuperDB data insertion
@@ -17,6 +17,7 @@
Here's a guide to using `db` to insert data.

```python
from superduperdb.backends.mongodb import Collection
from superduperdb import Document

db.execute(
Collection('<collection-name>')
        .insert_many([Document(record) for record in records])
)
```
The `records` may be any dictionaries supported by MongoDB, as well as dictionaries
containing items which may be converted to `bytes` strings.
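As a rough sketch of what such a conversion can involve (a hypothetical helper, not the actual `Document.encode` implementation):

```python
import pickle

# Hypothetical stand-in for the conversion step: non-primitive
# values are serialized to `bytes` before insertion.
def encode_record(record):
    primitive = (str, int, float, bool, bytes, type(None))
    return {
        k: v if isinstance(v, primitive) else pickle.dumps(v)
        for k, v in record.items()
    }

record = {'label': 'cat', 'vector': [0.1, 0.2, 0.3]}
encoded = encode_record(record)
assert encoded['label'] == 'cat'
assert isinstance(encoded['vector'], bytes)
assert pickle.loads(encoded['vector']) == [0.1, 0.2, 0.3]
```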

Other MongoDB clients may also be used for insertion. Here, one needs to explicitly
take care of conversion of data to `bytes` wherever `Encoder` instances have been used.
For instance, using `pymongo`, one may do:

```python
import pymongo

from superduperdb import Document

collection = pymongo.MongoClient('<your-database-uri>').my_database['<collection-name>']
collection.insert_many([
    Document(record).encode() for record in records
])
```

### SQL

Similarly, for SQL tables:

```python
import pandas

from superduperdb.backends.ibis import Table

db.execute(
Table('<table-name>')
.insert(pandas.DataFrame(records))
)
```

Native clients may also be used to insert data. Here, one needs to explicitly
take care of conversion of data to `bytes` wherever `Encoder` instances have been used.
For instance, in DuckDB, one may do:

```python
import duckdb
import pandas

from superduperdb import Document

my_df = pandas.DataFrame([Document(r).encode() for r in records])

duckdb.sql("INSERT INTO <table-name> SELECT * FROM my_df")
```
```python
l2 = Listener(
    ...
)
```

This implies that whenever data is inserted to `collection`, `model_1` will compute outputs on that data first,
which will subsequently be consumed by `model_2` as inputs; its outputs will then also be saved to `db`.
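This two-stage flow can be sketched in plain Python (hypothetical toy models, purely illustrative):

```python
# Sketch of the chained flow: `model_1` runs on newly inserted data,
# and `model_2` consumes `model_1`'s outputs as its inputs.
def model_1(text):
    return text.lower()   # e.g. a normalisation model

def model_2(text):
    return len(text)      # e.g. a downstream featurizer

inserted = ['Hello', 'World!']
outputs_1 = [model_1(x) for x in inserted]   # computed first
outputs_2 = [model_2(y) for y in outputs_1]  # consumed as inputs

assert outputs_1 == ['hello', 'world!']
assert outputs_2 == [5, 6]
```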
@@ -4,14 +4,17 @@ sidebar_position: 15

# Working with external data sources

:::warning
This functionality is currently supported for MongoDB only
:::

Using the MongoDB query API, `superduperdb` supports data added from external data-sources.
When doing this, `superduperdb` supports:

- web URLs
- URIs of objects in `s3` buckets

The trick is to pass the `uri` parameter to an encoder, instead of the raw-data.
Here is an example where we add a `.pdf` file directly from a location
on the public internet.

```python
import io

from pypdf import PdfReader

from superduperdb import Document, Encoder
from superduperdb.backends.mongodb import Collection

collection = Collection('pdf-files')


def load_pdf(bytes):
text = []
for page in PdfReader(io.BytesIO(bytes)).pages:
text.append(page.extract_text())
return '\n----NEW-PAGE----\n'.join(text)


# no `encoder=...` parameter required, since the text is never converted back to `.pdf` format
pdf_enc = Encoder('my-pdf-encoder', decoder=load_pdf)

PDF_URI = (
'https://papers.nips.cc/paper_files/paper/2012/file/'
'c399862d3b9d6b76c8436e924a68c45b-Paper.pdf'
)

# This command inserts a record which refers to this URI
# and also downloads the content from the URI and saves
# it in the record
db.execute(
collection.insert_one(Document({'txt': pdf_enc(uri=PDF_URI)}))
)
```