Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

Open
unilynx opened this issue Jan 10, 2025 · 5 comments
Open

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx opened this issue Jan 10, 2025 · 5 comments
Labels
confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. good first issue Issues that are suitable for first-time contributors. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch.

Comments

@unilynx
Copy link

unilynx commented Jan 10, 2025

Version

v23.6.0, v22.13.0

Platform

Darwin xxx 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64

(but no problems reproducing it on Linux amd64 or arm64/)

Subsystem

No response

What steps will reproduce the bug?

const decoded = new TextDecoder("Windows-1252").decode(new Uint8Array([146])).charCodeAt(0);;
console[decoded === 8217 ? 'error' : 'log'](`Expected 8217 got ${decoded}`);

146 is now decoded as 146, not 8217. This still worked in v23.3.0 and v22.12.0 - might be related to the fix for #56219 ?

How often does it reproduce? Is there a required condition?

It always fails

What is the expected behavior? Why is that the expected behavior?

https://web.archive.org/web/20151027124421/https://msdn.microsoft.com/en-us/library/cc195054.aspx shows 0x92 should indeed be 0x2019

What do you see instead?

something else

Additional information

No response

@targos
Copy link
Member

targos commented Jan 10, 2025

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

@targos targos added confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch. labels Jan 10, 2025
@blexrob
Copy link

blexrob commented Jan 10, 2025

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:

  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

@blexrob
Copy link

blexrob commented Jan 10, 2025

Using the { stream: true } option in decode() permanently disables the latin1 fast path for a TextDecoder, so the following workaround is possible:

const decoder = new TextDecodere(encoding);
if (["cp1252", "windows-1252", "x-cp1252"].includes(encoding.toLowerCase())) 
  decoder.decode(Buffer.from([]), { stream: true });
const decoded = decoder.decode(todecode);

@domenic
Copy link
Contributor

domenic commented Jan 10, 2025

TextEncoder in general has many bugs related to not matching the relevant web standard. See this issue I opened some time ago which hasn't gotten any response: #40091

@joyeecheung joyeecheung added the good first issue Issues that are suitable for first-time contributors. label Jan 10, 2025
@joyeecheung
Copy link
Member

From a glance #40091 and this look like good first issues (or maybe more like good second issues?), marked it as such.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. good first issue Issues that are suitable for first-time contributors. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch.
Projects
None yet
Development

No branches or pull requests

5 participants