Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce return_type_from_args for ScalarFunction. #14094

Merged
merged 30 commits into from
Jan 20, 2025
Merged

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jan 12, 2025

Which issue does this PR close?

Part of #13717

Rationale for this change

return_type_from_args that has less dependencies on Expr itself but the computed properties of Expr and Schema including data_type and nullability

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

TODO

Combine return_type_from_args and is_nullable_from_args_nullable

Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions labels Jan 12, 2025
pub fn is_nullable(&self, args: &[Expr], schema: &dyn ExprSchema) -> bool {
self.inner.is_nullable(args, schema)
}

pub fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Expr dependency

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name 🤔 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to unify the argument handling so that both return type and nullability are returned the same?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

let name_column = &chunk[0];
let name = match name_column {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
_ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name_column output less readable array in this change, remove it for now.

@findepi
Copy link
Member

findepi commented Jan 12, 2025

As stated in #13717 (comment) , this new method doesn't necessarily simplify anything.

Can you please fill "Rationale for this change"? What problem are we solving?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I think this is a step in the right direction, but I am worried it just makes the API more complicated (adds as many functions as it deprecates)

Challenge: Exprs / constants

It seems to me one challenge is that different information is known for computing return types at different points in the plan (e.g. sometimes we have Expr and sometimes we don't)

What would you think about making this more explicit in ReturnTypeArgs by making it an enum:

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// information known at logical planning time
    /// Note you can get get type and nullability for each arg
   // using the specified ExprSchema
    Planning {
       pub args: &'a[Expr],
       pub schema: &'a dyn ExprSchema
    },
    /// Information known during Execution
    Execution {
    /// The data types of the arguments to the function
      pub arg_types: &'a [DataType],
      pub arg_nullability: [bool],
    }
}

Challenge: Multiple APIs (Nullability and return type)

It is somewhat akward to have two functions, one for nullability and one for return type. Also I can imagine that the nullability calculation depends on the input type of arguments too (not just the input nullability) I wonder if we can combine them into a single API:

Maybe something like

/// Information about the output of the function
/// including the data type and nullability:
struct ReturnTypeInfo {
  data_type: DataType,
  nullable: bool,
}

trait ScalarUDFImpl {
    /// Returns the 
    pub fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo> 
}

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to unify the argument handling so that both return type and nullability are returned the same?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

@jayzhan211
Copy link
Contributor Author

Multiple APIs (Nullability and return type)

Great, I also want this too.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 12, 2025

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

Yes, it might be, since I assume we don't really need Expr but String instead. However, in theory they can achieve what they need with String, but of course this is breaking change.

If the constant integer is the only concern, ScalarValue or ColumnarValue looks good for me too!

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 12, 2025

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// information known at logical planning time
    /// Note you can get get type and nullability for each arg
   // using the specified ExprSchema
    Planning {
       pub args: &'a[Expr],
       pub schema: &'a dyn ExprSchema
    },
    /// Information known during Execution
    Execution {
    /// The data types of the arguments to the function
      pub arg_types: &'a [DataType],
      pub arg_nullability: [bool],
    }
}

One good thing in this PR is that we don't need Expr anymore, we compute data type and nullable in datafusion core and they are not "public" for customization.

Do we really need Planning? My thought is that whenever we have Expr and Schema we can compute corresponding DataType and Nullability. Therefore, even for Planning stage, we can still get the information DataType + Nullability

Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
@jayzhan211 jayzhan211 changed the title Add two new methods in ScalarFunction return_type_from_args and is_nullable_from_args_nullable Introduce return_type_from_args for ScalarFunction. Jan 13, 2025
@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

Do we really need Planning? My thought is that whenever we have Expr and Schema we can compute corresponding DataType and Nullability. Therefore, even for Planning stage, we can still get the information DataType + Nullability

I think the usecase is for expressions like arrow_cast whose output type depends on the actual value of the arguments (not just the type of the argument)

So like arrow-cast(String) can return Int64 or Float64 depending on the value of the argument -- the value is passed as an Expr

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 14, 2025

So like arrow-cast(String

For arrow_cast, the output type is determined by the 2nd argument which is string.

arrow_cast(a, 'Int64') returns I64, arrow_cast(a, 'List(Int64)') returns List(Int64)

All the functions in datafusion have constant string type, the only concern is that other users do the crazy things with Expr, but I don't think it is practical since String is enough in my opionion

Signed-off-by: Jay Zhan <[email protected]>
@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

This is on my review queue, I just haven't quite had time to get to it yet. I will do so tomorrow if not later tonight

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- I think this is a very nice improvement and makes the argument handling in user defined functions better

I also like how most functions will be unaffected as we still have ScalarUDFImpl::return_type

I really do think that we should change

    pub arguments: &'a [String],

to this before the release:

    pub arguments: &'a [Option<&'a ScalarValue>]>`,

But I believe this strongly enough that I will make a proposed follow on PR to this one if you don't want to make the change here

Thanks again

datafusion/expr/src/udf.rs Show resolved Hide resolved
datafusion/expr/src/udf.rs Outdated Show resolved Hide resolved
@@ -174,39 +177,3 @@ fn get_coalesce_doc() -> &'static Documentation {
.build()
})
}

#[cfg(test)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is covered in slt, remove duplicated

ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => {
name_scalar
}
// TODO: Implement Display for ColumnarValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good todo -- maybe we can make a follow on ticket for it

@@ -490,6 +547,40 @@ pub trait ScalarUDFImpl: Debug + Send + Sync {
self.return_type(arg_types)
}

/// What [`DataType`] will be returned by this function, given the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend removing the comments from return_type_from_exprs now that it is deprecated and you have moved the content here

datafusion/expr/src/udf.rs Show resolved Hide resolved
e.as_any()
.downcast_ref::<Literal>()
.map(|literal| match literal.value() {
ScalarValue::Utf8(Some(s)) => s.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really nice to avoid this clone here -- if we changed the signature to &[Option<&str>] I think that would be easier to use and more efficient

@alamb
Copy link
Contributor

alamb commented Jan 17, 2025

FYI @andygrove

@jayzhan211
Copy link
Contributor Author

I plan to change to ScalarValue after #14167, so we can parse the string easily.

If anyone interested in this change can also take it.

Signed-off-by: Jay Zhan <[email protected]>
@alamb
Copy link
Contributor

alamb commented Jan 18, 2025

I plan to change to ScalarValue after #14167, so we can parse the string easily.

If anyone interested in this change can also take it.

I have just merged in #14167

@github-actions github-actions bot added the common Related to common crate label Jan 19, 2025
@jayzhan211
Copy link
Contributor Author

We can start return_type deprecation after this PR

@jayzhan211 jayzhan211 marked this pull request as draft January 19, 2025 03:32
@jayzhan211 jayzhan211 marked this pull request as ready for review January 19, 2025 07:16
@alamb
Copy link
Contributor

alamb commented Jan 19, 2025

We can start return_type deprecation after this PR

I recommend we do not deprecate return_type - after this PR I think we have a nice API. Namely most simple uses can use return_type and ones that need more control (like returning nullability, or getting scalar args) can use return_type_from_args

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- I think this looks really nice ❤️

I left some comment suggestions, but I can also make them as a follow on PR as well

let a = Arc::new(Int32Array::from(vec![3; 200])) as ArrayRef;
println!(
"display {}",
pretty_format_columns("ColumnarValue(ArrayRef)", &[a]).unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to remove this 😆


let (return_type, nullable) =
func.return_type_from_args(args)?.into_parts();
Ok((return_type, nullable))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [Option<&'a ScalarValue>],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend naming this scalar_arguments

Suggested change
pub arguments: &'a [Option<&'a ScalarValue>],
pub scalar_arguments: &'a [Option<&'a ScalarValue>],

pub struct ReturnTypeArgs<'a> {
/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
/// Is argument `i` to the function a scalar (constant)
///
/// If argument `i` is not a scalar, it will be None
///
/// For example, if a function is called like `my_function(column_a, 5)`
/// this field will be `[None, Some(ScalarValue::Int32(Some(5)))]`

Comment on lines +360 to +364
#[derive(Debug)]
pub struct ReturnInfo {
return_type: DataType,
nullable: bool,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug)]
pub struct ReturnInfo {
return_type: DataType,
nullable: bool,
}
/// Return metadata for this function.
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]
pub struct ReturnInfo {
/// The type of the value return by the function
return_type: DataType,
/// Can the returned value (ever) be null?
nullable: bool,
}

@@ -342,6 +348,56 @@ pub struct ScalarFunctionArgs<'a> {
pub return_type: &'a DataType,
}

#[derive(Debug)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug)]
/// Information about arguments passed to the function
///
/// This structure contains metadata about how the function was called
/// such as the type of the arguments, any scalar arguments and if the
/// arguments can (ever) be null
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]

pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [Option<&'a ScalarValue>],
pub nullables: &'a [bool],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub nullables: &'a [bool],
/// Can argument `i` (ever) null?
pub nullables: &'a [bool],

debug_assert_eq!(args.arguments.len(), 2);

args.arguments[1]
.and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty fancy functional programming :bowtie:

@github-actions github-actions bot removed the common Related to common crate label Jan 20, 2025
@alamb
Copy link
Contributor

alamb commented Jan 20, 2025

Love it. And love that this isn't an API change. Thank you so much @jayzhan211 -- this really looks great

@alamb alamb merged commit d3f1c9a into apache:main Jan 20, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants