define a standardized, serializable representation for modality-agnostic "montage descriptions" #114

Open
jrevels opened this issue Jan 19, 2022 · 8 comments


jrevels commented Jan 19, 2022

We've talked about this for years internally at Beacon, but have never actually gotten around to implementing it.

At Beacon, we have libraries, applications, and distributed services that all perform certain kinds of re-referencing/montaging operations. It's thus pretty common for one actor to want to describe a montage to be fulfilled/resolved/computed by another actor, and for actors to reuse each other's montage descriptions for consistency purposes. Examples:

  • we have a service that streams multiplexed LPCM data to its client in the montage specified by the client
  • we have a standardized mechanism for computing bipolar montages for EEG signals

For the sake of interoperability between such actors (especially across language environments and service/package API boundaries), it'd be useful/appropriate for Onda to define a generic specification for a serializable representation of a "montage description". It should probably be defined in JSON (which is encodable in Arrow) to enable portability/applicability across many language environments.

It would need to support common arithmetic broadcast operations as well as reduction operations. We should clearly define what "resolvers" need to support. We probably also want to ensure the standard representation is extensible with user-defined operations to support modality-specific operations (e.g., filtering).


jaypath commented Jan 19, 2022

Agree... it is not reasonable to build a "library" of montages. Sometimes custom montages are required (a common example is when a lead switch is known to have occurred, in which case the montage needs to substitute the leads as appropriate).

Perhaps a montage description/definition in which a set of reference channels is defined (system ref, avg ref, A1, A2, etc.) and then channel defs are given as input#-ref#. For example: define the references to be ref (the recording reference), avgref (the average of all EEG channels), A2 (the right ear), and [Fp2 + F7 + f3] + 0.5 * [t3 + c3 + f8] (call it localLaplacianAverage_fp1). Then I want a montage that has Fp1-ref, Fp1-avgref, Fp1-A2, and Fp1-localLaplacianAverage_fp1.


jrevels commented Feb 5, 2022

Got a start on this while on the plane from SFO to JFK yesterday!

Intro

First, let's define a "montage" in Onda-land to be some specification of a transformation of the form:

(input_multichannel_sample_data_1, input_multichannel_sample_data_2, ...) -> (output_multichannel_sample_data_1, output_multichannel_sample_data_2, ...)

We want Onda montages to be primarily concerned with operations over collections of channels, including re-aliasing, pattern matching, grouping/stacking/splitting, and numerical broadcast/reduction operations. They are notably unconcerned with:

  • LPCM encoding information (montages operate only in the "decoded regime")
  • source information (e.g. file_path / file_format)

If we canonicalize a (de)serializable DSL representation for Onda montages, then Onda.jl (and others) can implement resolver functions like:

# executes the transformation specified by `montage` given the input data
execute(montage,
        input_multichannel_sample_data_1,
        input_multichannel_sample_data_2,
        ...)

This would basically require implementing an interpreter for the montage DSL. If we implemented an abstract interpreter (or made sure our interpreter could be run in an abstract interpretation mode), we could use the exact same DSL to do "data-less" transformation/matching, e.g.

# returns the derived output channel names
evaluate(montage,
         input_channel_names_1,
         input_channel_names_2,
         ...)

(Note that in this approach, like with regex, an empty output channel set would mean that the input didn't "match" the montage.)

It's worth calling out that Julia is a pretty nice language for implementing these kinds of interpreters - you basically just parse content into AST nodes represented via Julia types, then you can use multiple dispatch to your heart's content to actually drive the interpretation.
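
To make that concrete, here's a minimal sketch of that approach (all type/function names here are hypothetical, only two ops are implemented, and this is not an actual Onda.jl API):

using Statistics: mean

# Hypothetical AST node types; a real parser would construct these from the
# JSON representation.
abstract type Node end
struct Binding <: Node; name::String; end
struct Mean <: Node; arg::Node; end
struct Minus <: Node; args::Vector{Node}; end

# Channels are modeled as `name => samples` pairs so that channel names can
# propagate alongside the data.
const Channels = Vector{Pair{String,Vector{Float64}}}

# Interpretation is just multiple dispatch over the node types; `env` maps
# binding names (e.g. "$1", "$avg") to already-computed channel lists.
interpret(node::Binding, env)::Channels = env[node.name]

# reduction over the channel dimension: elementwise mean across all channels
function interpret(node::Mean, env)::Channels
    channels = interpret(node.arg, env)
    return ["avg" => mean(last.(channels))]
end

# broadcasted subtraction over the channel dimension (binary case only)
function interpret(node::Minus, env)::Channels
    left, right = interpret(node.args[1], env), interpret(node.args[2], env)
    ref = last(only(right))  # this sketch assumes a single-channel subtrahend
    return [name => samples .- ref for (name, samples) in left]
end

env = Dict("\$1" => ["a" => [1.0, 2.0], "b" => [3.0, 4.0]])
interpret(Minus([Binding("\$1"), Mean(Binding("\$1"))]), env)
# => ["a" => [-1.0, -1.0], "b" => [1.0, 1.0]]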

Simple JSON-based DSL

Here's an extremely rough sketch of a DSL for representing montages staged atop JSON, with AST nodes represented by single-element objects:

  • values
    • [<value>,...] (list of values)
    • {"stack":[[<value>,...],...]} (concatenate all provided lists)
    • patterns
      • "<literal>" (must be valid onda.signal@1 channels name)
      • {"regex":"<regex>"}
      • {"any":[<pattern>,...]} (note that {"any":[]} matches nothing)
      • {"all":[<pattern>,...]} (note that {"all":[]} matches everything)
      • {"not":<pattern>}
    • channels (all operations on [<channel>,...] are broadcasts/reductions over the "channel dimension")
      • {"get":[[<channel>,...], <pattern>]}
        • returns list of elements from [<channel>,...] that match <pattern>; preserves order
        • if <pattern> is <literal>, returns a scalar <channel> not a [<channel>,...]
      • {"-":[<channel>,...]}
      • {"+":[<channel>,...]}
      • {"/":[<channel>,...]}
      • {"*":[<channel>,...]}
      • {"mean":[<channel>,...]}
      • {"median":[<channel>,...]}
      • {"max":[<channel>,...]}
      • {"min":[<channel>,...]}
      • {"abs":[<channel>,...]}
      • {"alias":["<target>", <channel>]}
      • {"custom":["<name>",[<channel>,...]]} (enable custom operations)
  • statements
    • assignment: {"$<binding>":<value>}
      • <binding> must be a valid onda.signal@1 channels name
      • enables $<binding> to be used anywhere in place of <value>
      • a $<binding> must be assigned via an assignment before it can be used
    • {"return":[[<channel>,...],...]}
  • montage description is just a list of statements ending with a return (e.g. [<assignment>,...,<return>])
    • "$<i>" where i is positive nonzero integer is a reserved default binding to the [<channel>,...] for the ith "argument" to the montage

Examples

Return all channels of the input minus the average of channels a, b, and c:

[
    {"$channels":{"get":["$1",{"any":["a","b","c"]}]}},
    {"$avg":{"mean":"$channels"}},
    [{"-":["$1","$avg"]}]
]

Return the dynamic range of the first signal minus the dynamic range of the second, stacked on the magnitude of the difference of the two signals' means:

[
    {"$range1":{"-":[{"max":"$1"},{"min":"$1"}]}},
    {"$range2":{"-":[{"max":"$2"},{"min":"$2"}]}},
    {"$diff1":{"-":["$range1","$range2"]}},
    {"$diff2":{"abs":{"-":[{"mean":"$1"},{"mean":"$2"}]}}}
    {"return": [{"stack":["$diff1","$diff2"]}]}
]

Restructure three signals into two signals:

[
    {"$x": [
        {"alias":["x1", {"get":["$1", "a"]}]},
        {"alias":["x2", {"get":["$2", "a"]}]},
        {"alias":["x3", {"get":["$3", "a"]}]},
    ]},
    {"$y": [
        {"alias":["y1", {"get":["$1", "b"]}]},
        {"alias":["y2", {"get":["$2", "b"]}]},
        {"alias":["y3", {"get":["$3", "b"]}]},
    ]},
    {"return": ["$x", "$y"]}
]

Same DSL, Julia-Flavored Syntax

We can stage the same language upon (a very limited subset of) Julia syntax instead of JSON.

Pros:

  • more human readable/writeable
  • enables reuse of Julia metaprogramming facilities

Cons:

  • newcomers might be confused about which syntax/operations are allowed and which aren't
  • newcomers might be confused because the syntax is valid Julia syntax but similarly named constructs behave differently (e.g. any here is not the same thing as Julia's Base.any)
  • more annoying to parse/construct for producers/consumers outside of Julia

Below are the same examples from the JSON DSL, written in Julia-Flavored syntax instead (using IN[i] instead of $i, => for alias, etc.):

Return all channels of the input minus the average of channels a, b, and c:

channels = IN[1][any("a", "b", "c")];
avg = mean(channels);
return [IN[1] - avg];

Return the dynamic range of the first signal minus the dynamic range of the second, stacked on the magnitude of the difference of the two signals' means:

range1 = max(IN[1]) - min(IN[1]);
range2 = max(IN[2]) - min(IN[2]);
diff1 = range1 - range2;
diff2 = abs(mean(IN[1]) - mean(IN[2]));
return [stack(diff1, diff2)];

Restructure three signals into two signals:

x = ["x1" => IN[1]["a"],
     "x2" => IN[2]["a"],
     "x3" => IN[3]["a"]];
y = ["y1" => IN[1]["b"],
     "y2" => IN[2]["b"],
     "y3" => IN[3]["b"]];
return [x, y];

Channel Name Propagation/Generation

Our DSL enables authors to re-alias channel names whenever they want, but what if they don't? What are the names of the output channels? To answer this question, we should define semantics for tracing/accumulating/propagating intermediate channel names through each operation of our DSL.

Take our earlier example:

[
    {"$channels":{"get":["$1",{"any":["a","b","c"]}]}},
    {"$avg":{"mean":"$channels"}},
    [{"-":["$1","$avg"]}]
]

What should the output channel names be given input channels ["a", "x", "b", "y", "c", "z"]?

IMO it'd make sense for the semantics to follow the bindings used by the author, in which case we might need to add some notion of when bindings get inlined into channel names - or maybe those semantics can be hardcoded into our statement/broadcast/reduction operations.

You could imagine defining these semantics so that it's obvious that the answer to the above is:

["a-avg", "x-avg", "b-avg", "y-avg", "c-avg", "z-avg"]

This also implies that we might need to change Onda's validation criteria for onda.signal@1 channels to allow for much richer structure.

Another thing to consider is cross-referencing across signals, like:

[
    [{"-":["$1","$2"]}]
]

From the Onda specification:

Furthermore, to allow arbitrary cross-signal referencing, a and/or b may be channel names from other signals contained in the recording. If this is the case, such a name must be qualified in the format signal_name.channel_name. For example, an eog signal might have a channel named left-eeg.m1 (the left eye electrode referenced to the mastoid electrode from a 10-20 EEG signal).

This would require some thought to get right / avoid too much magic.
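
e.g. a resolver might handle qualified names with something like this (a hypothetical sketch, reusing the `name => samples` channel representation from the earlier sketch):

# minimal sketch of resolving a possibly-qualified channel name against a
# recording's signals; `signals` maps signal names to `name => samples` pairs
function resolve_channel(signals::Dict, current_signal::String, name::String)
    if occursin('.', name)  # qualified: "signal_name.channel_name"
        signal_name, channel_name = split(name, '.'; limit=2)
    else                    # unqualified: resolve within the current signal
        signal_name, channel_name = current_signal, name
    end
    channels = signals[String(signal_name)]
    i = findfirst(p -> first(p) == channel_name, channels)
    isnothing(i) && error("channel not found: $name")
    return channels[i]
end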


jrevels commented Feb 8, 2022

I should point out that one of my prerequisite action items for the above design brainstorm was to do a bit of review of how other systems handle this (e.g. for EEG remontaging/specification - I know lots of other systems have some form of custom montage specification); however, my plane's WiFi was busted and I couldn't search for anything, so I dove ahead 😁

Would still be good to do a bit of review here, though.


jaypath commented Feb 14, 2022

My take on montages is SUPER simple. Wrote it out here, hopefully you can read my writing. I can define a transform matrix for all the montages I use... though you have possibly suggested use cases where I could not (though I'm not sure about that).

Happy to meet... about to set up a discussion with Dave and Mike.
20220214101042522.pdf


jrevels commented Feb 14, 2022

dope! thanks, that makes sense

i think the "essence" of the DSL-based approach is to enable doing these transforms in a "basis agnostic" way from a linear algebra perspective, since in Onda the basis is data-driven (by sets of channels) not fixed by convention. hence the functional form vs. the matrix form

might be convenient to include a matrix representation as well though

i guess also you could establish some convention like "i don't know what the basis is, but I do always sort the channel names lexicographically," in which case you could represent canonical transforms as slices of the same canonical (infinite-dimensional but countable) matrix

That makes me realize we could also separate this out into two things:

  1. pattern-matching that extracts + orders a well-known channel set (i.e. explicitly establishes the basis for the montage transform)
  2. the linear transform for applying the montage

that actually might be cleaner but maybe less flexible? worth thinking about
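
a rough sketch of that two-stage factoring (all names here are hypothetical, just to make it concrete):

# hypothetical two-stage montage: (1) pattern-match + order channels to fix
# the basis, (2) apply a linear transform in that basis
struct MatrixMontage
    basis::Vector{String}        # required input channels, in order
    transform::Matrix{Float64}   # rows are output channels over that basis
    output_names::Vector{String}
end

function apply(m::MatrixMontage, names::Vector{String}, samples::Matrix{Float64})
    # stage 1: establish the basis by selecting/ordering the matched channels
    rows = map(m.basis) do channel
        i = findfirst(==(channel), names)
        isnothing(i) && error("input does not match montage basis")
        i
    end
    # stage 2: the montage itself is just a matrix multiply
    return m.output_names, m.transform * samples[rows, :]
end

# e.g. a bipolar pair plus an average reference over a 3-channel basis:
montage = MatrixMontage(["a", "b", "c"],
                        [1.0 -1.0 0.0;      # a - b
                         2/3 -1/3 -1/3],    # a - mean(a, b, c)
                        ["a-b", "a-avg"])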

@SimonDanisch

The matrix form seems easy to implement efficiently, which would be a great pro.

But it doesn't seem to cover the entire scope of what we want from our transformation system (e.g. human readability, filtering).
I have to admit, I'm also a bit biased against a system that forces you to express every transform as a matrix.
From my time with Matlab, I remember that it forces you to spend considerable time figuring out the exact structure of the matrix, while something like all_channels .- mean(all_channels) is quite easy to comprehend and come up with.
As someone who hasn't done matrix math in ages, the avg transform example definitely took me quite some time to understand.
That kind of cognitive work would be needed every time someone comes up with a new transform, or reads an existing one written by someone else.
Maybe we can extend the hybrid @jrevels already proposed and treat the matrix form mainly as an optimization for our implementation?

For the JSON vs Julia debate:
The Julia syntax does look much nicer and more tempting now that I look at it again.
I was against it at first, but Julia is quite easy to parse, and working with the AST shouldn't be harder than working with a nested JSON dict.
I'd try very hard, though, to make it behave pretty much the same when run as actual Julia code, so that confusion for people who know Julia stays minimal.

I want to bring up two more general concerns:

Performance

It wasn't easy to optimize montaging performance in our streaming service. Small slip-ups like allocating one temporary array were already noticeable in the latency (especially with multiple users), so to be usable for streaming we mostly need the performance of handwritten code.
Doing this for even more general transforms won't be easy (although we can reuse stuff like the memory pool).

So we should make sure that the specification won't force us to invest months into implementing a difficult map-reduce transformation system.
Would be interesting to see if we can use some existing packages to create optimal transformations (Taco, Dagger, some of the Einstein sum packages?).
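
To illustrate the kind of slip-up I mean (a hypothetical sketch assuming a channels × samples layout, not the actual service code), the difference between an allocating and a preallocated remontage is just this:

using Statistics: mean, mean!

# allocating version: `mean` and the subtraction each allocate a fresh array
# on every call, which shows up in streaming latency
remontage(samples::Matrix{Float64}) = samples .- mean(samples; dims=1)

# preallocated version: reuse pooled buffers and fuse the broadcast, so the
# steady state does no per-chunk allocation
function remontage!(out::Matrix{Float64}, avg::Matrix{Float64}, samples::Matrix{Float64})
    mean!(avg, samples)    # writes the per-sample average across channels into `avg`
    out .= samples .- avg  # fused broadcast, no temporaries
    return out
end

# e.g. out = similar(samples); avg = Matrix{Float64}(undef, 1, size(samples, 2))
# remontage!(out, avg, samples)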

Security

I think we do need to accept pretty arbitrary specifications from the web, although we should be able to restrict that to authorized users, which should decrease the attack surface quite a bit. But when choosing the parser and the implementation, we should definitely keep an eye out for security concerns.


jrevels commented Feb 18, 2022

Would be interesting to see if we can use some existing packages to create optimal transformations (Taco, Dagger, some of the Einstein sum packages?).

Yeah, that's a good idea. Worth calling out that "optimal" will mean different things to different users/use cases (e.g. latency vs. throughput). I think once we settle on / publish the spec, we should include a very basic montage resolver in Onda and maybe publish a more-dependency-heavy-but-more-performant resolver (backed by e.g. Dagger) in a separate package

I think we do need to accept pretty arbitrary specifications from the web, although we should be able to restrict that to authorized users, which should decrease the attack surface quite a bit. But when choosing the parser and the implementation, we should definitely keep an eye out for security concerns.

I'm not sure about the relevance of this 🤔 Do you mean a situation where an attacker (or, always more likely, a well-intentioned caller making some mistake 😁) accidentally sends some huge transformation description to a resolver function/service and overloads its capacity? i'd imagine that's pretty easily mitigated by the resolver just having some basic statement/complexity limit or something
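
e.g. a (hypothetical) resolver-side guard that bounds a parsed JSON description's size before interpreting it:

# count every nested object/array/value in the parsed JSON as a node, then
# reject descriptions over some complexity budget
count_nodes(x) = 1
count_nodes(x::AbstractDict) = 1 + sum(count_nodes, values(x); init=0)
count_nodes(x::AbstractVector) = 1 + sum(count_nodes, x; init=0)

function check_complexity(montage; max_nodes=1_000)
    count_nodes(montage) <= max_nodes || error("montage description too complex")
    return montage
end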

(keep in mind this thread is public / in OSS land, so there might not be as much context for internal Beacon projects 🙂)

@SimonDanisch

I'm not sure about the relevance of this

Yes, me neither! DSLs and parsing always ring some alarm bells for me, so I just wanted to quickly bring it up in case there are some non-obvious implications :)
