TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx · 2025-01-10T09:23:56Z

Version

v23.6.0, v22.13.0

Platform

Darwin xxx 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64

(but no problems reproducing it on Linux amd64 or arm64/)

Subsystem

No response

What steps will reproduce the bug?

const decoded = new TextDecoder("Windows-1252").decode(new Uint8Array([146])).charCodeAt(0);;
console[decoded === 8217 ? 'error' : 'log'](`Expected 8217 got ${decoded}`);

146 is now decoded as 146, not 8217. This still worked in v23.3.0 and v22.12.0 - might be related to the fix for #56219 ?

How often does it reproduce? Is there a required condition?

It always fails

What is the expected behavior? Why is that the expected behavior?

https://web.archive.org/web/20151027124421/https://msdn.microsoft.com/en-us/library/cc195054.aspx shows 0x92 should indeed be 0x2019

What do you see instead?

something else

Additional information

No response

The text was updated successfully, but these errors were encountered:

targos · 2025-01-10T09:35:56Z

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

blexrob · 2025-01-10T10:00:13Z

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:

  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

blexrob · 2025-01-10T10:13:07Z

Using the { stream: true } option in decode() permanently disables the latin1 fast path for a TextDecoder, so the following workaround is possible:

const decoder = new TextDecodere(encoding);
if (["cp1252", "windows-1252", "x-cp1252"].includes(encoding.toLowerCase())) 
  decoder.decode(Buffer.from([]), { stream: true });
const decoded = decoder.decode(todecode);

domenic · 2025-01-10T10:33:27Z

TextEncoder in general has many bugs related to not matching the relevant web standard. See this issue I opened some time ago which hasn't gotten any response: #40091

joyeecheung · 2025-01-10T17:24:26Z

From a glance #40091 and this look like good first issues (or maybe more like good second issues?), marked it as such.

joyeecheung added the good first issue Issues that are suitable for first-time contributors. label Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx commented Jan 10, 2025

targos commented Jan 10, 2025

blexrob commented Jan 10, 2025 •

edited

Loading

blexrob commented Jan 10, 2025

domenic commented Jan 10, 2025

joyeecheung commented Jan 10, 2025

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

Comments

unilynx commented Jan 10, 2025

Version

Platform

Subsystem

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

What is the expected behavior? Why is that the expected behavior?

What do you see instead?

Additional information

targos commented Jan 10, 2025

blexrob commented Jan 10, 2025 • edited Loading

blexrob commented Jan 10, 2025

domenic commented Jan 10, 2025

joyeecheung commented Jan 10, 2025

blexrob commented Jan 10, 2025 •

edited

Loading