
Progress on rasdaman (Deep Learning) UDFs #2

Closed
KathiSchleidt opened this issue May 16, 2023 · 21 comments
Labels: FAIRiCUBE Hub (FAIRiCUBE Hub main interface development)

@KathiSchleidt

What's the status on creating rasdaman UDFs?
The requirements were discussed in Bremen and should be clear; if not, please ask!
Details are in the UC2 presentation from Bremen.

@robknapen changed the title from "Progress on rasdaman UDFs" to "Progress on rasdaman (Deep Learning) UDFs" on May 16, 2023
@ocampos16

@KathiSchleidt as of right now we are still working on the following:

  1. Linking Rob's Python PyTorch implementation into the UDF mechanism. The idea is to replace the existing C++ implementation so that Python can be used instead; this will definitely simplify future UDF implementations as well as reduce development time (a sketch of the kind of Python entry point involved is appended below).
  2. Saving a trained model as a collection in rasdaman for further reference from other UDFs.
  3. Designing a catalog mechanism for listing and linking which models can be used with which UDFs.

We will keep you updated with our results as they come.
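
Regarding item 1, a minimal sketch of the kind of Python entry point such a UDF bridge might call. The function name, the array-in/array-out contract, and the use of TorchScript are assumptions for illustration, not the actual rasdaman interface:

```python
# Hypothetical sketch only: the kind of Python entry point a rasdaman
# UDF wrapper could call. The name, signature, and TorchScript choice
# are assumptions; the actual rasdaman binding may differ.
import numpy as np
import torch

def udf_predict(model_path: str, tile: np.ndarray) -> np.ndarray:
    """Run a saved PyTorch model on one datacube tile."""
    model = torch.jit.load(model_path)  # TorchScript keeps the UDF self-contained
    model.eval()
    with torch.no_grad():
        x = torch.from_numpy(tile).float().unsqueeze(0)  # add batch dimension
        y = model(x)
    return y.squeeze(0).numpy()
```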

@KathiSchleidt

@ocampos16

  1. Very cool! I think being able to create Python-based UDFs will make this much easier for "normal" users! :)
  2. ah... what's a collection in rasdaman?
  3. This work should be coordinated with what @sMorrone is doing on D4.3 Processing Resource Metadata

More generally (and maybe contained in points 2 & 3), how can a user see which UDFs are available? Or can users only access their own UDFs?

@ocampos16 commented May 17, 2023

@KathiSchleidt

  1. Indeed, I believe the same; that is why we are focusing all efforts on this solution.
  2. It means storing the model inside rasdaman. A collection in rasdaman is equivalent to a table in a relational database; a sketch of how a serialized model maps onto one follows after this list.
  3. @sMorrone maybe we can have a quick concall to discuss how we relate your catalog with what rasdaman could provide.
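
As a rough illustration of item 2 (the collection name, type, and ingestion step below are invented; only the idea of storing a serialized model as a flat array comes from the discussion), a trained PyTorch model can be flattened into the 1-D byte array such a collection would hold:

```python
# Sketch: serialize a trained PyTorch model into a flat uint8 array,
# i.e. the kind of 1-D payload a rasdaman collection could store.
import io
import numpy as np
import torch

def model_to_bytes(model: torch.nn.Module) -> np.ndarray:
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)  # standard PyTorch serialization
    return np.frombuffer(buf.getvalue(), dtype=np.uint8)

# A matching 1-D collection might then be created with something like
# (rasql, names invented):  CREATE COLLECTION model_store GreySet1
```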

More generally (and maybe contained in points 2 & 3), how can a user see which UDFs are available?
-> There is a query in the rasdaman query language, rasql, that is specifically designed to list all available UDFs, regardless of the user. I believe that in a web environment WCS, WCPS, or WMS would be preferred; I need to check this part with @pebau because it involves a standard, and if it is not covered there we need to think of another solution.
Or can users only access their own UDFs?
-> So far any user can access all the UDFs via rasql and WCPS; is this acceptable to you?
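
For context, rasdaman also exposes rasql over HTTP, so such a listing could in principle be scripted. In the sketch below the servlet path, the credentials, and especially the listing statement itself are placeholders to be confirmed against the deployed version:

```python
# Sketch: issue a rasql query over HTTP. Endpoint path, credentials,
# and the listing statement are placeholders, not confirmed syntax.
import requests

resp = requests.get(
    "https://fairicube.rasdaman.com/rasdaman/rasql",  # assumed servlet path
    params={
        "username": "rasguest",      # assumed read-only demo account
        "password": "rasguest",
        "query": "LIST FUNCTIONS",   # placeholder: exact statement TBC
    },
    timeout=30,
)
print(resp.text)
```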

@KathiSchleidt

On providing a listing of available UDFs: to my view, WCPS GetCapabilities would be my first candidate, in addition to exposing them via the processing resource metadata. Please include me on the call sorting this!
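
For reference, a plain OGC WCS GetCapabilities request against the rasdaman OWS endpoint would look like this; the endpoint path follows the usual rasdaman deployment layout and is an assumption here, and whether UDFs would actually be advertised in the response is exactly the open question above:

```python
# Standard OGC WCS GetCapabilities request; the endpoint path is an
# assumption following the usual rasdaman deployment layout.
import requests

resp = requests.get(
    "https://fairicube.rasdaman.com/rasdaman/ows",
    params={"service": "WCS", "version": "2.0.1", "request": "GetCapabilities"},
    timeout=30,
)
print(resp.text[:1000])  # start of the capabilities XML document
```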

On all users being able to access existing UDFs: works for me. We should check with the UC partners just to be sure, but I'm pretty sure we won't have the same issues with sensitive models that we have with sensitive data.

@robknapen

ML models trained on sensitive data might need restricted access as well, for instance depending on the user agreement of the data (what derived products are allowed is often not clearly specified for ML models), or on whether the training of the model has sufficiently hidden the sensitive (input) data points (otherwise an ML expert might be able to extract them from the model, as a kind of reverse engineering).

@robknapen

@ocampos16 Out of curiosity (also relates to 'how to catalogue' and 'what might be restricted'): Do you intend to treat a trained model as a whole, or to split it up into the computational graph and the trained parameters?

@pebau commented May 19, 2023

@robknapen (chiming in here) dissecting a model is a rabbit hole from our perspective, and I can see no advantage - we would always treat a model as a black box.

@pebau commented May 19, 2023

@robknapen

ML models trained on sensitive data might need restricted access as well.

Accepted, at some point access control will be necessary - just not at this stage, where we have only one model anyway :)

@KathiSchleidt

@robknapen turning @pebau's statement around, do you see a situation where we provide the same model with 2 sets of trained parameters?

@robknapen

Sure; for example, the same CNN model that we used so far can be trained for other (semantic segmentation) tasks (similar ones, though, since the model architecture expects 28 features as input), or it can be trained for a different region. Both would use the same model architecture (= computational graph) but learn different weights. Splitting these two is the basis for what is known as transfer learning in ML. So for inference you can have one model architecture and load it with matching weights and biases for a number of similar prediction tasks. [For sure this is more difficult to implement than a pure black-box approach, and there might be no short-term benefits.]

Libraries such as TensorFlow, Keras, and PyTorch all have methods that support this way of working with deep learning models. The usually long training times make it a rather common approach to quickly start experimenting.
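
A minimal PyTorch sketch of that split, with the architecture as code and the learned weights as a separate artifact (the model class here is a stand-in, not the actual CNN from the use case):

```python
# Minimal sketch of separating architecture (computational graph) from
# learned weights. SegmentationCNN is a stand-in for the actual model.
import torch
import torch.nn as nn

class SegmentationCNN(nn.Module):
    def __init__(self, in_features: int = 28, n_classes: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, n_classes, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)

# After training for region/task A, persist only the parameters:
model_a = SegmentationCNN()
torch.save(model_a.state_dict(), "weights_region_a.pt")

# Same architecture, reloaded with those weights as a warm start for
# a similar task (the basis of transfer learning):
model_b = SegmentationCNN()
model_b.load_state_dict(torch.load("weights_region_a.pt"))
```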

@pebau commented Sep 5, 2023

Status: PyTorch-based UDFs work; JupyterHub is almost installed (we need Rob's help for completion -> Mohit will contact him).

@KathiSchleidt

@robknapen am I correct that if you have a model trained on 2 different datasets, you'd provide this as 2 different models (most of the info the same, but different input data, maybe different spatial validity)?

@robknapen

@KathiSchleidt Yes, the models learn to represent the different datasets. When they are 'too different', it will result in distinct models. When the datasets are different but still similar, a single, more robust, model can be trained on them. So there can be exceptions :-)

@KathiSchleidt

@robknapen any insight as to what impact these exceptions have on the a/p resource metadata? There, we have the following fields foreseen:

  • Input data: URI 1..* : Link to input data/metadata, helpful for a better understanding of context and domain.
  • Characteristics of input data: CharacterString 1 : This field contains a textual description of the main characteristics of each input dataset of the resource. It will also include, e.g., a description of sampling techniques, the version of the data (if multiple versions are available), and, in the case of ML resources, the percentages of the training, validation, and testing sets. This field may contain details on the suitability of the resource for the chosen geographic area and thematic context.

Can you use these to describe what you'd need to know?
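
To make the two fields concrete, one filled-in entry might look like this (all URIs and values are invented for illustration):

```python
# Illustrative only: one a/p resource metadata entry using the two
# fields above. All URIs and values are invented examples.
metadata_entry = {
    "input_data": [
        "https://example.org/catalog/land-cover-cube",  # URI, cardinality 1..*
    ],
    "characteristics_of_input_data": (
        "Raster cube with 28 features; data version 2023-04; "
        "70/15/15 split into training/validation/testing sets; "
        "considered suitable for the pilot region and thematic context."
    ),
}
```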

@robknapen

@KathiSchleidt I think so. In some cases I would mention an existing (trained) model (or its saved weights) as ‘input data’, and use ‘characteristics’ to explain how it was used.

(Maybe we need a better minimum length for ‘characteristics’? 1 character doesn’t seem very helpful to me. I would prefer either 0, or enforcing some longer text (200+ characters?).)

@KathiSchleidt

@robknapen

  • shouldn't we differentiate between:
    • Input data: the data the model has been trained on
    • Configuration/weights: how the model has been parameterized
  • on Characteristics of input data: this is of type CharacterString, so free text. This has worried me, as it is difficult to explain the individual inputs in such a single block, but my request to align the cardinality with Input data was not taken into account

@sMorrone

  • Should we add an entry for model configuration/weights?
  • Should we align the cardinality of the input data description with that of the input data?

@robknapen

@KathiSchleidt Yes, we can split it into configuration/initialisation data and input (training) data, to make the difference in purpose clearer.

@sMorrone commented Sep 12, 2023

@KathiSchleidt

  • we agree on adding an entry for model configuration/weights & will do so asap
  • pertaining to "align the cardinality of the input data description with that of the input data": the current solution we implemented (a couple of months ago) uses bulleted lists in which each entry is paired with its characteristics. The picture below shows the current online a/p metadata request form.

(screenshot: current online a/p metadata request form)

When the metadata is displayed in the catalog, this solution results in what can be seen in the picture below.
(screenshot: catalog display of the metadata)

@robknapen @KathiSchleidt does this work for you?

@pebau commented Dec 15, 2023

Summarizing the status of rasdaman UDFs:

  • trained models + datacube regions of interest can be passed to PyTorch for evaluation using the UDF mechanism; the corresponding UDF package nn is deployed and offers the function predict() for this purpose (a sketch of an invocation follows after this list).
  • general Python UDFs can be created through a CREATE FUNCTION statement and by copying the code into the rasdaman UDF space (those users who have worked on this already have a login; other prospective users please contact us to create a login).
  • an example model provided by WER has been deployed as a proof of concept on https://fairicube.rasdaman.com .
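
For illustration, an invocation of the deployed predict() via WCPS might look roughly as follows. Only the endpoint and the package/function names come from the status above; the coverage name, the subsetting, and the exact query syntax are assumptions:

```python
# Rough sketch of calling the deployed nn.predict() UDF through WCPS.
# Coverage name, subsetting, and exact UDF syntax are assumptions.
import requests

wcps_query = """
for $c in (demo_cube)
return encode(nn.predict($c[Lat(52:53), Lon(8:9)]), "tiff")
"""

resp = requests.post(
    "https://fairicube.rasdaman.com/rasdaman/ows",
    data={"service": "WCS", "version": "2.0.1",
          "request": "ProcessCoverages", "query": wcps_query},
    timeout=120,
)
with open("prediction.tiff", "wb") as f:
    f.write(resp.content)
```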

Let me know if you feel something is missing on PyTorch UDFs.

@jetschny

Jivitesh is now assigned to look into the Python UDF implementation (testing and verification). This will provide another UC view and can serve as validation.

@jetschny added the "FAIRiCUBE Hub" label (FAIRiCUBE Hub main interface development) on Feb 7, 2024
@jetschny

In light of the new issue #57, which formulates the requirements for more ML models in short, I will close this ticket.
