-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce return_type_from_args
for ScalarFunction.
#14094
Conversation
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
datafusion/expr/src/udf.rs
Outdated
pub fn is_nullable(&self, args: &[Expr], schema: &dyn ExprSchema) -> bool { | ||
self.inner.is_nullable(args, schema) | ||
} | ||
|
||
pub fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove Expr
dependency
datafusion/expr/src/udf.rs
Outdated
/// The data types of the arguments to the function | ||
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | ||
pub arguments: &'a [String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better name 🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to unify the argument handling so that both return type and nullability are returned the same?
I wonder if it would somehow be possible to add the input nullable information here too 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)
let name_column = &chunk[0]; | ||
let name = match name_column { | ||
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar, | ||
_ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name_column
output less readable array in this change, remove it for now.
As stated in #13717 (comment) , this new method doesn't necessarily simplify anything. Can you please fill "Rationale for this change"? What problem are we solving? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211 -- I think this is a step in the right direction, but I am worried it just makes the API more complicated (adds as many functions as it deprecates)
Challenge: Exprs / constants
It seems to me one challenge is that different information is known for computing return types at different points in the plan (e.g. sometimes we have Expr
and sometimes we don't)
What would you think about making this more explicit in ReturnTypeArgs by making it an enum:
#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
/// information known at logical planning time
/// Note you can get get type and nullability for each arg
// using the specified ExprSchema
Planning {
pub args: &'a[Expr],
pub schema: &'a dyn ExprSchema
},
/// Information known during Execution
Execution {
/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
pub arg_nullability: [bool],
}
}
Challenge: Multiple APIs (Nullability and return type)
It is somewhat akward to have two functions, one for nullability and one for return type. Also I can imagine that the nullability calculation depends on the input type of arguments too (not just the input nullability) I wonder if we can combine them into a single API:
Maybe something like
/// Information about the output of the function
/// including the data type and nullability:
struct ReturnTypeInfo {
data_type: DataType,
nullable: bool,
}
trait ScalarUDFImpl {
/// Returns the
pub fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo>
}
datafusion/expr/src/udf.rs
Outdated
/// The data types of the arguments to the function | ||
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | ||
pub arguments: &'a [String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to unify the argument handling so that both return type and nullability are returned the same?
I wonder if it would somehow be possible to add the input nullable information here too 🤔
datafusion/expr/src/udf.rs
Outdated
/// The data types of the arguments to the function | ||
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | ||
pub arguments: &'a [String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)
Great, I also want this too. |
Yes, it might be, since I assume we don't really need If the constant integer is the only concern, |
#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
/// information known at logical planning time
/// Note you can get get type and nullability for each arg
// using the specified ExprSchema
Planning {
pub args: &'a[Expr],
pub schema: &'a dyn ExprSchema
},
/// Information known during Execution
Execution {
/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
pub arg_nullability: [bool],
}
} One good thing in this PR is that we don't need Do we really need |
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
return_type_from_args
and is_nullable_from_args_nullable
return_type_from_args
for ScalarFunction.
I think the usecase is for expressions like So like |
For arrow_cast, the output type is determined by the 2nd argument which is string. arrow_cast(a, 'Int64') returns I64, arrow_cast(a, 'List(Int64)') returns List(Int64) All the functions in datafusion have constant string type, the only concern is that other users do the crazy things with |
Signed-off-by: Jay Zhan <[email protected]>
This is on my review queue, I just haven't quite had time to get to it yet. I will do so tomorrow if not later tonight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- I think this is a very nice improvement and makes the argument handling in user defined functions better
I also like how most functions will be unaffected as we still have ScalarUDFImpl::return_type
I really do think that we should change
pub arguments: &'a [String],
to this before the release:
pub arguments: &'a [Option<&'a ScalarValue>]>`,
But I believe this strongly enough that I will make a proposed follow on PR to this one if you don't want to make the change here
Thanks again
@@ -174,39 +177,3 @@ fn get_coalesce_doc() -> &'static Documentation { | |||
.build() | |||
}) | |||
} | |||
|
|||
#[cfg(test)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is covered in slt, remove duplicated
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => { | ||
name_scalar | ||
} | ||
// TODO: Implement Display for ColumnarValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good todo -- maybe we can make a follow on ticket for it
@@ -490,6 +547,40 @@ pub trait ScalarUDFImpl: Debug + Send + Sync { | |||
self.return_type(arg_types) | |||
} | |||
|
|||
/// What [`DataType`] will be returned by this function, given the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend removing the comments from return_type_from_exprs
now that it is deprecated and you have moved the content here
e.as_any() | ||
.downcast_ref::<Literal>() | ||
.map(|literal| match literal.value() { | ||
ScalarValue::Utf8(Some(s)) => s.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be really nice to avoid this clone here -- if we changed the signature to &[Option<&str>]
I think that would be easier to use and more efficient
FYI @andygrove |
I plan to change to ScalarValue after #14167, so we can parse the string easily. If anyone interested in this change can also take it. |
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
We can start |
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
I recommend we do not deprecate return_type - after this PR I think we have a nice API. Namely most simple uses can use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 -- I think this looks really nice ❤️
I left some comment suggestions, but I can also make them as a follow on PR as well
datafusion/common/src/utils/mod.rs
Outdated
let a = Arc::new(Int32Array::from(vec![3; 200])) as ArrayRef; | ||
println!( | ||
"display {}", | ||
pretty_format_columns("ColumnarValue(ArrayRef)", &[a]).unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I forgot to remove this 😆
|
||
let (return_type, nullable) = | ||
func.return_type_from_args(args)?.into_parts(); | ||
Ok((return_type, nullable)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
datafusion/expr/src/udf.rs
Outdated
/// The data types of the arguments to the function | ||
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | ||
pub arguments: &'a [Option<&'a ScalarValue>], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend naming this scalar_arguments
pub arguments: &'a [Option<&'a ScalarValue>], | |
pub scalar_arguments: &'a [Option<&'a ScalarValue>], |
datafusion/expr/src/udf.rs
Outdated
pub struct ReturnTypeArgs<'a> { | ||
/// The data types of the arguments to the function | ||
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | |
/// Is argument `i` to the function a scalar (constant) | |
/// | |
/// If argument `i` is not a scalar, it will be None | |
/// | |
/// For example, if a function is called like `my_function(column_a, 5)` | |
/// this field will be `[None, Some(ScalarValue::Int32(Some(5)))]` |
#[derive(Debug)] | ||
pub struct ReturnInfo { | ||
return_type: DataType, | ||
nullable: bool, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[derive(Debug)] | |
pub struct ReturnInfo { | |
return_type: DataType, | |
nullable: bool, | |
} | |
/// Return metadata for this function. | |
/// | |
/// See [`ScalarUDFImpl::return_type_from_args`] for more information | |
#[derive(Debug)] | |
pub struct ReturnInfo { | |
/// The type of the value return by the function | |
return_type: DataType, | |
/// Can the returned value (ever) be null? | |
nullable: bool, | |
} |
@@ -342,6 +348,56 @@ pub struct ScalarFunctionArgs<'a> { | |||
pub return_type: &'a DataType, | |||
} | |||
|
|||
#[derive(Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[derive(Debug)] | |
/// Information about arguments passed to the function | |
/// | |
/// This structure contains metadata about how the function was called | |
/// such as the type of the arguments, any scalar arguments and if the | |
/// arguments can (ever) be null | |
/// | |
/// See [`ScalarUDFImpl::return_type_from_args`] for more information | |
#[derive(Debug)] |
pub arg_types: &'a [DataType], | ||
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string | ||
pub arguments: &'a [Option<&'a ScalarValue>], | ||
pub nullables: &'a [bool], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub nullables: &'a [bool], | |
/// Can argument `i` (ever) null? | |
pub nullables: &'a [bool], |
debug_assert_eq!(args.arguments.len(), 2); | ||
|
||
args.arguments[1] | ||
.and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty fancy functional programming
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Love it. And love that this isn't an API change. Thank you so much @jayzhan211 -- this really looks great |
Which issue does this PR close?
Part of #13717
Rationale for this change
return_type_from_args
that has less dependencies onExpr
itself but the computed properties ofExpr
andSchema
includingdata_type
andnullability
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
TODO
Combine
return_type_from_args
andis_nullable_from_args_nullable