Feature Request: Add "string" type for metadata fields #11147

vera · 2025-01-10T11:09:00Z

Overview of the Feature Request

Dataverse currently lacks a field type that stores string values without text analysis. Such a type would be appropriate for fields like IDs (e.g. ORCIDs) or enums, where exact matches are required.

Adding a "string" field type would prevent incorrect search matches caused by text analysis. Examples:

For IDs such as ORCIDs, the ID parts may be matched in any order, e.g. a query for authorIdentifier:0000-2345-0001-678X matches a result with authorIdentifier:0000-0001-2345-678X.

(Note: this could be prevented by querying for authorIdentifier:"0000-2345-0001-678X" to preserve order, but I still think it's a bug that reordered matching is even possible.)

Matches are made based on substring matching, e.g. a query for language:"Russian" matches a result with language:"Russian Sign Language".

This could also result in incorrect ID matching if one ID is a substring or prefix of another ID.
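The matching behavior described above can be illustrated with a toy model. This is only a sketch: the `analyze` function is a simplified stand-in for text analysis (lowercasing and splitting on non-alphanumeric characters), not the analyzer Dataverse or Solr actually uses, and all function names here are hypothetical. It also assumes an unquoted query requires every query token to be present, in any order.

```python
import re


def analyze(value: str) -> list[str]:
    """Toy text analysis (assumption): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def text_match(query: str, stored: str) -> bool:
    """Unquoted query on an analyzed field: all query tokens present, order ignored."""
    stored_tokens = set(analyze(stored))
    return all(tok in stored_tokens for tok in analyze(query))


def phrase_match(query: str, stored: str) -> bool:
    """Quoted query on an analyzed field: query tokens appear contiguously, in order."""
    q, s = analyze(query), analyze(stored)
    return any(s[i:i + len(q)] == q for i in range(len(s) - len(q) + 1))


def string_match(query: str, stored: str) -> bool:
    """Proposed "string" behavior: the whole value must be identical."""
    return query == stored


if __name__ == "__main__":
    # Reordered ORCID parts: matches on the analyzed field, not on the exact one.
    print(text_match("0000-2345-0001-678X", "0000-0001-2345-678X"))    # True
    print(phrase_match("0000-2345-0001-678X", "0000-0001-2345-678X"))  # False
    print(string_match("0000-2345-0001-678X", "0000-0001-2345-678X"))  # False

    # Quoted "Russian" still matches the longer value on the analyzed field.
    print(phrase_match("Russian", "Russian Sign Language"))  # True
    print(string_match("Russian", "Russian Sign Language"))  # False
```

In this model, quoting the query fixes the reordering case but not the "Russian" case; only whole-value comparison rejects both.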

What kind of user is the feature intended for?
(Example user roles: API User, Curator, Depositor, Guest, Superuser, Sysadmin)

All search users (e.g. API user, guest).

What inspired the request?

Inaccurate search results being returned in some cases, as described above.

What existing behavior do you want changed?

/

Any brand new behavior you want to add to Dataverse?

Introduce a "string" field type to metadata block TSVs for fields where text analysis isn't needed.

Any open or closed issues related to this feature request?

Not aware of any.

Are you thinking about creating a pull request for this feature?

Yes, we are interested in creating a PR.

pdurbin · 2025-01-10T16:00:22Z

Sure, sounds good. Thanks for offering to make a PR!

For the record, we do use the string type for facets already. Otherwise they wouldn't work when there is more than one word. That is, we index both "text_en" and "string" for these values. The "text_en" version is supposed to be for better searching but as you say, in some cases like your Russian Sign Language example, perhaps the results are not so good. 😱
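To illustrate why facets need the untokenized value, here is a rough sketch comparing facet counts over tokenized values with counts over whole strings. The whitespace tokenizer and the `facet_counts` helper are assumptions made for illustration only; they do not reflect how Dataverse or Solr compute facets internally.

```python
from collections import Counter


def facet_counts(values: list[str], tokenized: bool) -> Counter:
    """Count facet buckets, either per token or per whole value (hypothetical helper)."""
    counts: Counter = Counter()
    for value in values:
        if tokenized:
            # Analyzed field: each distinct word becomes its own bucket.
            counts.update(set(value.lower().split()))
        else:
            # String field: the whole value is a single bucket.
            counts[value] += 1
    return counts


if __name__ == "__main__":
    languages = ["Russian", "Russian Sign Language", "Russian Sign Language"]
    print(facet_counts(languages, tokenized=True))
    print(facet_counts(languages, tokenized=False))
```

With tokenized values, "Russian Sign Language" is split into three separate buckets and its "russian" token is counted together with plain "Russian"; with whole strings, the two languages stay distinct.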

vera added the "Type: Feature" label on Jan 10, 2025
