Feature Request: Add "string" type for metadata fields #11147

Open
vera opened this issue Jan 10, 2025 · 1 comment
Labels
Type: Feature a feature request

Comments

@vera
Contributor

vera commented Jan 10, 2025

Overview of the Feature Request

Dataverse currently lacks a field type to store string values without text analysis. This would be appropriate for fields like IDs (e.g. ORCIDs) or enums where exact matches are required.

Adding a "string" field type would resolve issues with incorrect search results being matched due to the text analysis. Examples:

  1. For IDs such as ORCIDs, the ID parts may be matched in any order, e.g. a query for authorIdentifier:0000-2345-0001-678X matches a result with authorIdentifier:0000-0001-2345-678X:

image

(Note: this could be prevented by querying for authorIdentifier:"0000-2345-0001-678X" to preserve order, but I still think it's a bug that reordered matching is even possible.)

  2. Matches are made based on substring matching, e.g. a query for language:"Russian" matches a result with language:"Russian Sign Language".

image

This could also result in incorrect ID matching if one ID is a substring or prefix of another ID.
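
To make the two examples above concrete, here is a rough Python sketch of the difference between matching on an analyzed text field and matching on a plain string field. It is only an illustration under simplifying assumptions: `analyze`, `text_match`, `phrase_match` and `string_match` are hypothetical helpers, the tokenizer just lowercases and splits on non-alphanumeric characters, and all query terms are assumed to be required. It is not Solr's actual analysis chain or Dataverse code.

```python
import re


def analyze(value: str) -> list[str]:
    """Simplified stand-in for text analysis: lowercase, then split on
    anything that is not a letter or digit (so hyphens and spaces split)."""
    return [term for term in re.split(r"[^0-9a-z]+", value.lower()) if term]


def text_match(query: str, stored: str) -> bool:
    """Unquoted query on an analyzed field: every query term must occur
    somewhere in the stored value, in any order (assumed AND semantics)."""
    stored_terms = set(analyze(stored))
    return all(term in stored_terms for term in analyze(query))


def phrase_match(query: str, stored: str) -> bool:
    """Quoted query on an analyzed field: the query terms must appear
    next to each other and in the same order in the stored value."""
    q, s = analyze(query), analyze(stored)
    return any(s[i:i + len(q)] == q for i in range(len(s) - len(q) + 1))


def string_match(query: str, stored: str) -> bool:
    """Plain string field: no analysis, the whole value must be identical."""
    return query == stored


orcid_query, orcid_stored = "0000-2345-0001-678X", "0000-0001-2345-678X"
lang_query, lang_stored = "Russian", "Russian Sign Language"

print(text_match(orcid_query, orcid_stored))    # reordered ID parts still match
print(phrase_match(orcid_query, orcid_stored))  # quoting preserves the order
print(phrase_match(lang_query, lang_stored))    # but a partial value still matches
print(string_match(orcid_query, orcid_stored))  # exact comparison rejects both
print(string_match(lang_query, lang_stored))
```

In this sketch, quoting the query fixes the reordered ORCID case but not the "Russian" case, while the exact string comparison rejects both, which is the behavior the proposed "string" type is meant to provide.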

What kind of user is the feature intended for?
(Example users roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)

All search users (e.g. API user, guest).

What inspired the request?

Inaccurate search results being returned in some cases, as described above.

What existing behavior do you want changed?

/

Any brand new behavior do you want to add to Dataverse?

Introduce a "string" field type to metadata block TSVs for fields where text analysis isn't needed.

Any open or closed issues related to this feature request?

Not aware of any.

Are you thinking about creating a pull request for this feature?

Yes, we are interested in creating a PR.

@vera vera added the Type: Feature a feature request label Jan 10, 2025
@pdurbin
Member

pdurbin commented Jan 10, 2025

Sure, sounds good. Thanks for offering to make a PR!

For the record, we do use the string type for facets already. Otherwise they wouldn't work when there is more than one word. That is, we index both "text_en" and "string" for these values. The "text_en" version is supposed to be for better searching but as you say, in some cases like your Russian Sign Language example, perhaps the results are not so good. 😱
