Add Variant Type #1453

Open

@SpencerTorres SpencerTorres commented Dec 20, 2024

PR text is pending update

Summary

Implement Variant column type. Partially resolves #1430.
Closes #1195.

Implementation

This implementation adds 3 major types to the module:

  • ColVariant - the column implementation for (de)serialization
  • Variant - a container to hold variant values (optional for (de)serialization)
  • VariantWithType - an extension of Variant, with the ability to provide a preferred type in cases where it is ambiguous to existing column type detection (such as Array(UInt8) vs String)

ColVariant

Serialization

// Variant(Array(Map(String, String)), Array(UInt8), Bool, Int64, String)
batch, err := conn.PrepareBatch(ctx, "INSERT INTO test_variant (c)")
require.NoError(t, err)
require.NoError(t, batch.Append(int64(42))) // Accepts primitives
require.NoError(t, batch.Append(chcol.NewVariantWithType("test", "String"))) // Accepts Variants with type preference
require.NoError(t, batch.Append(true))
require.NoError(t, batch.Append(chcol.NewVariant([]uint8{0xA, 0xB, 0xC}).WithType("Array(UInt8)"))) // Accepts a Variant with a type added via WithType
require.NoError(t, batch.Append(nil)) // Accepts nil
require.NoError(t, batch.Append([]map[string]string{{"key1": "val1"}, {"key2": "val2"}})) // Accepts complex types

When a value is appended via col.AppendRow(), the dynamic type of the input v interface{} is checked. If it is nil, a Null discriminator is appended. If it is a VariantWithType, the value is appended to the specified column type along with its matching discriminator. The underlying column's AppendRow function is reused, so its logic does not need to be reimplemented.

As a catch-all, the input value is tried against each column type in order until one accepts it. For example, Variant(Bool, Int64, String) will try to append as bool, then int64, then string. If the value does not fit any column type, AppendRow returns an error.

Sometimes types will conflict. Because the types within a Variant are sorted alphabetically, Array(UInt8) is tried before String, and since the Array column accepts string input, a Go string would be appended as Array(UInt8). I have researched different solutions to this, including a type priority system, but it would be complex to implement. For now it is easiest to let the user pass NewVariantWithType(int64(42), "Int64") or NewVariant(int64(42)).WithType("Int64") when they want a specific type within the variant. For complex types like maps, reflection is used if a type isn't specified.
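The append order described above can be sketched with a toy model. This is not the PR's ColVariant code: `toyColumn`, `toyVariant`, `preferred`, and `newExample` are hypothetical names, and the value 255 for the Null discriminator is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// nullDiscriminator marks a NULL row. The value 255 is an assumption for this sketch.
const nullDiscriminator uint8 = 255

// toyColumn is a hypothetical stand-in for one typed sub-column.
type toyColumn struct {
	name   string
	accept func(v any) bool // reports whether v fits this column type
	rows   []any
}

// preferred is a hypothetical stand-in for VariantWithType.
type preferred struct {
	value  any
	chType string
}

// toyVariant keeps its sub-columns sorted by type name, so the slice index
// doubles as the discriminator.
type toyVariant struct {
	columns        []*toyColumn
	discriminators []uint8
}

func newToyVariant(cols ...*toyColumn) *toyVariant {
	sort.Slice(cols, func(i, j int) bool { return cols[i].name < cols[j].name })
	return &toyVariant{columns: cols}
}

func (c *toyVariant) appendTo(i int, v any) {
	c.columns[i].rows = append(c.columns[i].rows, v)
	c.discriminators = append(c.discriminators, uint8(i))
}

func (c *toyVariant) appendRow(v any) error {
	switch vv := v.(type) {
	case nil:
		c.discriminators = append(c.discriminators, nullDiscriminator)
		return nil
	case preferred:
		// A preferred type skips the catch-all and targets one column.
		for i, col := range c.columns {
			if col.name == vv.chType && col.accept(vv.value) {
				c.appendTo(i, vv.value)
				return nil
			}
		}
		return fmt.Errorf("value cannot be appended as %q", vv.chType)
	}
	// Catch-all: the first column (in sorted order) that accepts the value wins.
	for i, col := range c.columns {
		if col.accept(v) {
			c.appendTo(i, v)
			return nil
		}
	}
	return errors.New("value does not fit any column type")
}

// newExample builds a toy Variant(Array(UInt8), String).
func newExample() *toyVariant {
	isString := func(v any) bool {
		_, ok := v.(string)
		return ok
	}
	isStringOrBytes := func(v any) bool {
		_, isBytes := v.([]uint8)
		return isBytes || isString(v)
	}
	return newToyVariant(
		&toyColumn{name: "String", accept: isString},
		&toyColumn{name: "Array(UInt8)", accept: isStringOrBytes},
	)
}

func main() {
	c := newExample()
	_ = c.appendRow("hello")                      // caught by Array(UInt8) first
	_ = c.appendRow(preferred{"hello", "String"}) // forced into String
	_ = c.appendRow(nil)                          // Null discriminator
	fmt.Println(c.discriminators)
}
```

In this model a plain Go string lands in the Array(UInt8) column because that column sorts first and accepts strings, which is the conflict a preferred type is meant to resolve.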

After all rows are appended, the Native format is used to serialize the data into the buffer: first the serializationVersion, then the uint8 array of discriminators, then each column's data, reusing each column's Encode function as usual (similar to Tuple).
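A rough sketch of that write order, using a plain bytes.Buffer instead of the driver's own buffer type. The `encoder` interface, `int64Column`, `boolColumn`, and `encodeVariant` are hypothetical, and encoding the version as a little-endian uint64 is an assumption of this sketch rather than something stated in the PR text.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encoder is a hypothetical stand-in for a sub-column's Encode method.
type encoder interface {
	encode(buf *bytes.Buffer)
}

type int64Column []int64

func (c int64Column) encode(buf *bytes.Buffer) {
	for _, v := range c {
		var b [8]byte
		binary.LittleEndian.PutUint64(b[:], uint64(v))
		buf.Write(b[:])
	}
}

type boolColumn []bool

func (c boolColumn) encode(buf *bytes.Buffer) {
	for _, v := range c {
		if v {
			buf.WriteByte(1)
		} else {
			buf.WriteByte(0)
		}
	}
}

// encodeVariant writes the version, then one discriminator byte per row,
// then every sub-column in order.
func encodeVariant(version uint64, discriminators []uint8, columns []encoder) []byte {
	var buf bytes.Buffer
	var v [8]byte
	binary.LittleEndian.PutUint64(v[:], version) // assumed width and byte order
	buf.Write(v[:])
	buf.Write(discriminators)
	for _, col := range columns {
		col.encode(&buf)
	}
	return buf.Bytes()
}

func main() {
	// Rows int64(42), true, int64(84) in a toy Variant(Bool, Int64):
	// Bool is discriminator 0 and Int64 is discriminator 1.
	out := encodeVariant(0, []uint8{1, 0, 1}, []encoder{boolColumn{true}, int64Column{42, 84}})
	fmt.Println(len(out), out)
}
```

Note that the sub-columns have different lengths (one Bool row, two Int64 rows); only the discriminator array has one entry per logical row.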

Deserialization

The Native format deserializes the discriminators and builds a set of offsets for each column. This allows for storing multiple columns with mixed lengths. When the user wants to read a row, we can index into the correct row of each column to get the corresponding type.
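One way to picture the offset bookkeeping is the sketch below. `rowRef` and `buildOffsets` are hypothetical names, and 255 as the Null discriminator is again an assumption; the PR's actual data structures may differ.

```go
package main

import "fmt"

// nullDiscriminator marks a NULL row. The value 255 is an assumption for this sketch.
const nullDiscriminator uint8 = 255

// rowRef locates one logical row: which sub-column holds it, and at which index.
type rowRef struct {
	discriminator uint8
	offset        int // -1 for NULL rows, which live in no sub-column
}

// buildOffsets walks the discriminators once and hands each row the next
// free index in its own sub-column.
func buildOffsets(discriminators []uint8) []rowRef {
	next := make(map[uint8]int) // rows seen so far, per sub-column
	refs := make([]rowRef, len(discriminators))
	for i, d := range discriminators {
		if d == nullDiscriminator {
			refs[i] = rowRef{discriminator: d, offset: -1}
			continue
		}
		refs[i] = rowRef{discriminator: d, offset: next[d]}
		next[d]++
	}
	return refs
}

func main() {
	for row, ref := range buildOffsets([]uint8{1, 0, 1, nullDiscriminator, 0}) {
		fmt.Printf("row %d -> column %d, offset %d\n", row, ref.discriminator, ref.offset)
	}
}
```

Reading row N then becomes a lookup of `refs[N]` followed by a read at that offset in the selected sub-column.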

In practice, scanning looks like this:

var row chcol.Variant // Scan into variant

require.True(t, rows.Next())
err = rows.Scan(&row)
require.NoError(t, err)
require.Equal(t, int64(42), row.MustInt64()) // Variant provides convenience functions for returning a primitive

Or, if you know your types ahead of time, you can scan directly into a Go value of that type:

var i int64 // Scan directly into int64
require.True(t, rows.Next())
err = rows.Scan(&i)
require.NoError(t, err)
require.Equal(t, int64(84), i)

This pattern works by simply calling the underlying column's ScanRow function. However, it is safest to scan into Variant.
If you need to switch on the type held by a Variant for your own type detection, you can use variantRow.Any() or variantRow.Interface(), which return any and interface{} respectively (both are provided so you can use whichever name you prefer).
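A minimal sketch of switching on the value returned by Any(). The local `variant` type here is a hypothetical stand-in that only mirrors the Any() method described above; it is not chcol.Variant.

```go
package main

import "fmt"

// variant is a hypothetical stand-in exposing only an Any() accessor.
type variant struct {
	value any
}

func (v variant) Any() any { return v.value }

// describe does application-side type detection with a type switch.
func describe(v variant) string {
	switch x := v.Any().(type) {
	case nil:
		return "null"
	case int64:
		return fmt.Sprintf("int64: %d", x)
	case string:
		return fmt.Sprintf("string: %q", x)
	case bool:
		return fmt.Sprintf("bool: %t", x)
	default:
		return fmt.Sprintf("other: %T", x)
	}
}

func main() {
	fmt.Println(describe(variant{int64(42)}))
	fmt.Println(describe(variant{"test"}))
	fmt.Println(describe(variant{}))
}
```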

Variant

Variant is simply a wrapper around any. It implements stdlib sql interfaces such as driver.Valuer and sql.Scanner. It also has convenience functions for primitives such as Int64. If you need to access the underlying value you can use Any(). This type can be constructed with the NewVariant(v) function.
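As an illustration of what such a wrapper can look like, here is a small sketch that satisfies the standard library's driver.Valuer and sql.Scanner interfaces, which are most likely the interfaces meant above. The `variant` type is hypothetical and much simpler than the PR's type; in particular, how the real Scan treats its input is not shown in the PR text.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"fmt"
)

// variant is a hypothetical wrapper around any.
type variant struct {
	value any
}

// Value implements driver.Valuer by handing back the wrapped value.
func (v variant) Value() (driver.Value, error) {
	return v.value, nil
}

// Scan implements sql.Scanner by storing whatever the driver provides.
func (v *variant) Scan(src any) error {
	v.value = src
	return nil
}

// Int64 returns the value as an int64 if it holds one.
func (v variant) Int64() (int64, bool) {
	i, ok := v.value.(int64)
	return i, ok
}

// Compile-time checks that both interfaces are satisfied.
var (
	_ driver.Valuer = variant{}
	_ sql.Scanner   = (*variant)(nil)
)

func main() {
	var v variant
	if err := v.Scan(int64(42)); err != nil {
		panic(err)
	}
	i, ok := v.Int64()
	fmt.Println(i, ok)
}
```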

The Variant type should be used in structs and when scanning from ColVariant. It can also be used for insertion, although VariantWithType may be required if there's overlap between types.

VariantWithType

VariantWithType is the same as Variant, but with a string included to specify the preferred type. You can use this for insertion when the Variant column has types that overlap. For example, if you had Variant(Array(UInt8), String), a Go string would be inserted as an Array(UInt8). If you wanted to force this to be a ClickHouse String, you could use NewVariantWithType(v, "String") to provide the preferred type. If the preferred type is not present in the Variant, the row will fail to append to the block. A type can be added to an existing Variant by calling exampleVariant.WithType(t string), which will return a new VariantWithType.
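The relationship between the two types, and the failure when the preferred type is missing, can be sketched as follows. All names (`variant`, `variantWithType`, `withType`, `resolve`) are hypothetical lowercase stand-ins, not the chcol API.

```go
package main

import (
	"fmt"
	"strings"
)

// variant is a hypothetical wrapper around any.
type variant struct {
	value any
}

// variantWithType adds a preferred ClickHouse type name to a variant.
type variantWithType struct {
	variant
	chType string
}

func newVariant(v any) variant { return variant{value: v} }

// withType returns a new value and leaves the receiver untouched.
func (v variant) withType(t string) variantWithType {
	return variantWithType{variant: v, chType: t}
}

// resolve returns the index of the preferred type within the column's type
// list, or an error if the column does not contain that type.
func resolve(types []string, v variantWithType) (int, error) {
	for i, t := range types {
		if t == v.chType {
			return i, nil
		}
	}
	return -1, fmt.Errorf("type %q is not part of Variant(%s)", v.chType, strings.Join(types, ", "))
}

func main() {
	types := []string{"Array(UInt8)", "String"}

	idx, err := resolve(types, newVariant("test").withType("String"))
	fmt.Println(idx, err)

	_, err = resolve(types, newVariant(int64(42)).withType("Int64"))
	fmt.Println(err)
}
```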

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@SpencerTorres SpencerTorres mentioned this pull request Dec 20, 2024
@jkaflik jkaflik left a comment

1st review

Comment on lines +45 to +48
// Interface returns the underlying value as interface{}. Same as Any.
func (v Variant) Interface() interface{} {
	return v.value
}
Contributor

What is the point? interface{} is an alias to any.

Member Author

I put both so users could choose whichever name they prefer within their application. Some apps prefer using any and others interface{}

Contributor

It's only a name. Why have both functions?

Comment on lines +50 to +84
// Int returns the value as an int if possible
func (v Variant) Int() (int, bool) {
	if i, ok := v.value.(int); ok {
		return i, true
	}

	return 0, false
}

// Int64 returns the value as an int64 if possible
func (v Variant) Int64() (int64, bool) {
	if i, ok := v.value.(int64); ok {
		return i, true
	}

	return 0, false
}

// String returns the value as a string if possible
func (v Variant) String() (string, bool) {
	if s, ok := v.value.(string); ok {
		return s, true
	}

	return "", false
}

// Bool returns the value as a bool if possible
func (v Variant) Bool() (bool, bool) {
	if b, ok := v.value.(bool); ok {
		return b, true
	}

	return false, false
}
Contributor

I just wonder if we need these functions at all. See:

package main

import "fmt"

type Variant struct {
	value any
}

func (v *Variant) Value() any {
	return v.value
}

func main() {
	v := Variant{}
	i, ok := v.Value().(int)
	fmt.Println(i, ok)
}

We can just expose value and let the user freely type assert.

Contributor

More: If we make value public, we don't need any value access functions.

Member Author

I'm considering removing Variant in favor of VariantWithType as the default. There are many cases where the ClickHouse type needs to be provided. If value were public, this type would be no different from a regular any.

For now I agree it may be best to simply make value public.


func (c *Variant) AppendRow(v any) error {
	var requestedType string
	switch v.(type) {
Contributor

Suggested change
- switch v.(type) {
+ switch vv := v.(type) {

This will give you a type asserted value ready to use inside case blocks without the need for additional type assertions.
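To make the difference concrete, the sketch below contrasts the two switch forms on a generic example. The function names are hypothetical and the code is unrelated to the PR's AppendRow body.

```go
package main

import "fmt"

// lengthWithoutBinding repeats a type assertion inside each case.
func lengthWithoutBinding(v any) int {
	switch v.(type) {
	case string:
		return len(v.(string))
	case []uint8:
		return len(v.([]uint8))
	}
	return -1
}

// lengthWithBinding binds the asserted value to vv, so each case
// already has a value of the matching concrete type.
func lengthWithBinding(v any) int {
	switch vv := v.(type) {
	case string:
		return len(vv)
	case []uint8:
		return len(vv)
	}
	return -1
}

func main() {
	inputs := []any{"abc", []uint8{1, 2}, 42}
	for _, in := range inputs {
		fmt.Println(lengthWithoutBinding(in), lengthWithBinding(in))
	}
}
```

Both functions behave the same; the bound form simply removes the second assertion from every case.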

Member Author

Ah I've seen this elsewhere in the code, good point. I can simplify this switch

Comment on lines +333 to +337
if err != nil {
	return fmt.Errorf("failed to read variant discriminator version: %w", err)
} else if variantSerializationVersion != SupportedVariantSerializationVersion {
	return fmt.Errorf("unsupported variant discriminator version: %d", variantSerializationVersion)
}
Contributor

nit

Suggested change
- if err != nil {
- 	return fmt.Errorf("failed to read variant discriminator version: %w", err)
- } else if variantSerializationVersion != SupportedVariantSerializationVersion {
- 	return fmt.Errorf("unsupported variant discriminator version: %d", variantSerializationVersion)
- }
+ if err != nil {
+ 	return fmt.Errorf("failed to read variant discriminator version: %w", err)
+ }
+ if variantSerializationVersion != SupportedVariantSerializationVersion {
+ 	return fmt.Errorf("unsupported variant discriminator version: %d", variantSerializationVersion)
+ }

IMO more readable

Comment on lines +34 to +36
if actualType != expectedType {
	t.Fatalf("case index %d Variant type index %d column type does not match: expected: \"%s\" actual: \"%s\"", i, j, expectedType, actualType)
}
Contributor

Why not use a testify assert function?

Member Author

In this case I wanted to be able to specify an index via %d. I can check the testify asserts to see if there's a function for this

	return conn
}

func TestVariant(t *testing.T) {
Contributor

Is there a need to test Variant:

  • with the stdlib driver?
  • with the HTTP protocol? Do you expect any differences?

Member Author

For both stdlib and HTTP it should be the same behavior since they all use the same functions, but I agree it would be good to add tests for these cases just to verify

Contributor

I'd say it's really important for stdlib. There is an additional abstraction over types (like Scanner) in stdlib and it might have unexpected behaviour.

@SpencerTorres SpencerTorres mentioned this pull request Dec 26, 2024