Dataverse currently lacks a field type to store string values without text analysis. This would be appropriate for fields like IDs (e.g. ORCIDs) or enums where exact matches are required.
Adding a "string" field type would resolve issues with incorrect search results being matched due to the text analysis. Examples:
For IDs such as ORCIDs, the parts of the ID may be matched in any order, e.g. a query for authorIdentifier:0000-2345-0001-678X matches a result with authorIdentifier:0000-0001-2345-678X.
(Note: this could be prevented by querying for authorIdentifier:"0000-2345-0001-678X" to preserve order, but I still think it's a bug that reordered matching is even possible.)
Matching is partial rather than exact, e.g. a query for language:"Russian" matches a result with language:"Russian Sign Language".
This could also cause incorrect ID matches if one ID is a substring or prefix of another ID.
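To make the two cases above concrete, here is a minimal sketch in Python. It is a toy model, not Dataverse or Solr code: the `analyze` function is a simplified stand-in for a text analyzer (lowercase, split on non-alphanumeric characters), and all function names are hypothetical. It also assumes that an unquoted query requires every token to be present, which in a real search engine depends on the configured default operator.

```python
import re


def analyze(value: str) -> list[str]:
    """Toy analyzer (assumption): lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def matches_analyzed(query: str, stored: str) -> bool:
    """Unquoted query on an analyzed field: all query tokens present, any order."""
    stored_tokens = set(analyze(stored))
    return all(t in stored_tokens for t in analyze(query))


def matches_phrase(query: str, stored: str) -> bool:
    """Quoted query on an analyzed field: tokens must be contiguous and in order."""
    q, s = analyze(query), analyze(stored)
    return any(s[i:i + len(q)] == q for i in range(len(s) - len(q) + 1))


def matches_string(query: str, stored: str) -> bool:
    """Proposed "string" behaviour: the whole value must match exactly."""
    return query == stored


# Reordered ORCID parts: only the analyzed, unquoted query matches.
print(matches_analyzed("0000-2345-0001-678X", "0000-0001-2345-678X"))
print(matches_phrase("0000-2345-0001-678X", "0000-0001-2345-678X"))
print(matches_string("0000-2345-0001-678X", "0000-0001-2345-678X"))

# Partial value: even the quoted query matches on an analyzed field.
print(matches_phrase("Russian", "Russian Sign Language"))
print(matches_string("Russian", "Russian Sign Language"))
```

In this model, quoting fixes the reordering case but not the partial-value case. Only the exact comparison rejects both.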
What kind of user is the feature intended for?
(Example user roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)
All search users (e.g. API user, guest).
What inspired the request?
Inaccurate search results being returned in some cases, as described above.
What existing behavior do you want changed?
/
Any brand-new behavior you want to add to Dataverse?
Introduce a "string" field type to metadata block TSVs for fields where text analysis isn't needed.
Any open or closed issues related to this feature request?
Not aware of any.
Are you thinking about creating a pull request for this feature?
Yes, we are interested in creating a PR.
Sure, sounds good. Thanks for offering to make a PR!
For the record, we do use the string type for facets already. Otherwise they wouldn't work when there is more than one word. That is, we index both "text_en" and "string" for these values. The "text_en" version is supposed to be for better searching but as you say, in some cases like your Russian Sign Language example, perhaps the results are not so good. 😱
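The point about facets can be illustrated with a small sketch. This is a toy model under stated assumptions, not Dataverse's indexing code: `facet_counts` is a hypothetical helper, and whitespace splitting stands in for the "text_en" analysis.

```python
from collections import Counter


def facet_counts(values: list[str], tokenized: bool) -> Counter:
    """Count facet buckets over whole values, or over tokens (toy analysis)."""
    counts: Counter = Counter()
    for value in values:
        if tokenized:
            # Each distinct lowercase token in a value becomes its own bucket.
            counts.update(set(value.lower().split()))
        else:
            # The untouched value is the bucket, as with a "string" field.
            counts[value] += 1
    return counts


languages = ["Russian", "Russian Sign Language", "Russian Sign Language"]

# Tokenized: multi-word values fall apart into per-word buckets.
print(facet_counts(languages, tokenized=True))

# Whole-value: one bucket per distinct language name.
print(facet_counts(languages, tokenized=False))
```

With tokenization, "Russian Sign Language" never appears as a facet value of its own, which is why an untokenized copy is needed for faceting in this model.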