Search is not correctly working for words containing non-common characters like Turkish "İ" #33003

Open
akolhun opened this issue Dec 5, 2024 · 2 comments

akolhun commented Dec 5, 2024

Describe the bug
When a document is stored with the field value "ÜRÜNLERİ", it cannot be found by the exact same keyword "ÜRÜNLERİ" (but it is found by "ÜRÜNLERI").

To Reproduce
Given the schema as

schema test_schema {
    document test_schema {

        field sku type string {
            indexing: summary | attribute
            match {
              word
            }
        }

    }
}

And a document indexed as:

{
    "fields": {
        "sku": "ÜRÜNLERİ"
    }
}

Then the following search query does not return the doc:

"yql": "select * from test_schema where sku contains 'ÜRÜNLERİ'",

but this one, with the "incorrect" "I", returns the doc:

"yql": "select * from test_schema where sku contains 'ÜRÜNLERI'",

Expected behavior
Search returns the doc for the search term "ÜRÜNLERİ".

Environment
docker image: vespaengine/vespa:8.452.13

Vespa version
8.452.13

Additional context
The issue can be reproduced with the attached app package:
vespa_encoding_issue.zip

Indexing request:

curl --location 'http://localhost:8080/document/v1/test/test_schema/docid/test_doc_123' \
--header 'Content-Type: application/json' \
--data '{
    "fields": {
        "sku": "ÜRÜNLERİ"
    }
}
'

Search request:

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "user": "ak",
    "yql": "select * from test_schema where sku contains '\''ÜRÜNLERİ'\''"
}'

akolhun commented Dec 5, 2024

  1. We also tried an approach with the language set explicitly during both feeding and search:
...
        field language type string {
            indexing: summary | attribute | set_language
            attribute: fast-search
            match: word
        }

Then saving the doc with language = 'tr-TR' and searching it as:

curl --location 'http://localhost:8080/search/' \
--header 'Content-Type: application/json' \
--data '{
    "user": "ak",
    "yql": "select * from test_schema where sku contains '\''ÜRÜNLERİ'\''",
    "language": "tr-TR"
}'

does not succeed either

  2. We assume the issue happens at the feeding level, while lowercasing the value.
    See the UTF-16 decimal codes of the "İ" letter and its relatives:
I: 73
ı: 305
İ: 304
i: 105

In the Turkish alphabet, the lowercase of I (73) is ı (305),
while in English the lowercase of I (73) is i (105).
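
To make the code points above concrete, here is a small Python sketch. It uses Python's built-in `str.lower()`, which applies the locale-independent Unicode case mapping; that is only a stand-in and is not necessarily what Vespa does internally. The helper name `describe` is made up for this illustration.

```python
# The four letters listed above: I, dotless ı, dotted İ, i.
LETTERS = ["I", "\u0131", "\u0130", "i"]

def describe(ch):
    """Return (code point, code points of the default lowercase form)."""
    return ord(ch), [ord(c) for c in ch.lower()]

for ch in LETTERS:
    code, lowered = describe(ch)
    print(f"{ch!r}: {code} -> lowercase code points {lowered}")
```

With the default (non-Turkish) mapping, I (73) lowercases to i (105), and İ (304) lowercases to two code points: i (105) followed by the combining dot above, U+0307 (775). A Turkish-aware mapping would instead give ı (305) and i (105) respectively.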

@akolhun akolhun changed the title Search is not correctly working for words containing special charaters like Turkish "İ" Search is not correctly working for words containing non-common charaters like Turkish "İ" Dec 5, 2024

jobergum commented Dec 6, 2024

Hey, thanks for the detailed ticket. Attribute fields are not subject to linguistic processing at indexing or query time, so this is unrelated to language settings/set_language. This issue is related to case folding, as using match:cased works well.

field sku type string {
            indexing: summary | attribute
            match:cased

 }

Tracing with tracelevel=9 shows that cased matching avoids the faulty lowercasing in the container:

vespa query 'yql=select * from msmarco where sku contains "ÜRÜNLERİ"' 'tracelevel=9'
 {
                                "message": "msmarco.num0 search to dispatch: query=[sku:ÜRÜNLERİ] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=c168cd80-a971-4c95-bb58-e058dcd61332.1733475402182.9.default grouping=0 :  restrict=[msmarco]"
                            },
                            {
                                "message": "Current state of query tree: EXACTSTRING[fromSegmented=false index=\"sku\" origin=null segmentIndex=0 stemmed=false uniqueID=1 words=true]{\n  \"ÜRÜNLERİ\"\n}\n"
                            },
  "attribute": {
                                                                "[type]": "IAttributeVector",
                                                                "name": "sku",
                                                                "type": "string",
                                                                "fast_search": false,
                                                                "filter": false
                                                            },
                                                            "query_term": "\u00DCR\u00DCNLER\u0130"
                                                        },

Without cased matching (the default), you get the following trace:

 {
                                "message": "msmarco.num0 search to dispatch: query=[sku:ürünleri̇] timeout=9998ms offset=0 hits=10 groupingSessionCache=true sessionId=c168cd80-a971-4c95-bb58-e058dcd61332.1733475567095.11.default grouping=0 :  restrict=[msmarco]"
                            },
                            {
                                "message": "Current state of query tree: EXACTSTRING[fromSegmented=false index=\"sku\" origin=null segmentIndex=0 stemmed=false uniqueID=1 words=true]{\n  \"ürünleri̇\"\n}\n"
                            },
                            {
                                "message": "YQL+ representation: select * from msmarco where sku contains ({normalizeCase: false, id: 1}\"\\u00FCr\\u00FCnleri\\u0307\") timeout 9998"
                            },

So it looks like the lowercasing in the stateless container layer is the issue here.
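
The lowercased term in the trace can be reproduced outside Vespa. The sketch below is only an illustration: it assumes Python's `str.lower()` behaves like the container's lowercasing for this input, and the helper `escape_non_ascii` is a made-up name that mimics the `\uXXXX` rendering seen in the YQL+ trace line.

```python
def escape_non_ascii(text):
    """Render non-ASCII characters as \\uXXXX escapes, similar to the trace."""
    return "".join(c if ord(c) < 128 else "\\u%04X" % ord(c) for c in text)

dotted = "\u00DCR\u00DCNLER\u0130"  # ÜRÜNLERİ, the query_term from the cased trace
dotless = "\u00DCR\u00DCNLERI"      # ÜRÜNLERI, the spelling that does find the doc

lowered_dotted = dotted.lower()
lowered_dotless = dotless.lower()

print(escape_non_ascii(lowered_dotted))
print(escape_non_ascii(lowered_dotless))
print(len(lowered_dotted), len(lowered_dotless))
```

The lowercased dotted term ends in `i\u0307` and is one code point longer than the lowercased dotless term, so the two can only match if both sides of the comparison fold "İ" the same way.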

@bjormel bjormel assigned bjormel and unassigned bjormel Dec 6, 2024
@kkraune kkraune added question and removed question labels Dec 10, 2024
@hmusum hmusum added this to the soon milestone Dec 11, 2024
@hmusum hmusum changed the title Search is not correctly working for words containing non-common charaters like Turkish "İ" Search is not correctly working for words containing non-common characters like Turkish "İ" Dec 11, 2024