Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Extend escape analysis to account for arrays with non-gcref elements #104906

Merged
merged 97 commits into from
Jan 22, 2025

Conversation

hez2010
Copy link
Contributor

@hez2010 hez2010 commented Jul 15, 2024

Positive case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[1] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V05 tmp4         [V05    ] (  0,  0   )   short  ->  zero-ref    "V02.[018..020)"
;
; Lcl frame size = 40

G_M25548_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M25548_IG02:  ;; offset=0x0004
       mov      ecx, 84
       call     [System.Console:WriteLine(int)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M25548_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 21, prolog size 4, PerfScore 5.00, instruction count 6, allocated bytes for code 21 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)

Negative case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[42] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;  V03 tmp2         [V03,T00] (  1,  0   )   byref  ->  rbx         must-init "dummy temp of must thrown exception"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V05 tmp4         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V06 tmp5         [V06    ] (  0,  0   )   short  ->  zero-ref    single-def "V02.[018..020)"
;
; Lcl frame size = 32

G_M25548_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 32
       xor      ebx, ebx
                                                ;; size=7 bbWeight=0 PerfScore 0.00
G_M25548_IG02:  ;; offset=0x0007
       call     CORINFO_HELP_RNGCHKFAIL
       movsx    rcx, word  ptr [rbx]
       call     [System.Console:WriteLine(int)]
       int3
                                                ;; size=16 bbWeight=0 PerfScore 0.00

; Total bytes of code 23, prolog size 5, PerfScore 0.00, instruction count 7, allocated bytes for code 23 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
; ============================================================

Benchmark on Mandelbrot:

Method Job Mean Error StdDev Code Size Allocated
MandelBrot NoStackAllocationArray 199.7 us 1.30 us 1.22 us 1,996 B 2.49 KB
MandelBrot StackAllocationArray 195.8 us 1.16 us 1.08 us 2,414 B 1.14 KB

Diff: https://www.diffchecker.com/bNP4qHdF/

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 15, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 15, 2024
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arrays (and also perhaps boxes and ref classes) we ought to have some kind of size limit... possibly similar to the one we use for stackallocs.

We need to be careful we don't allocate a lot of stack for an object that might not be heavily used, as we'll pay per-call prolog zeroing costs.

@hez2010
Copy link
Contributor Author

hez2010 commented Jan 18, 2025

@AndyAyersMS Just experimented a bit with late dead stores removal by unexposing locals in the liveness and repeating to convergence: commit 3450580

Before adding late dead stores removal: diffs
After adding late dead stores removal: diffs

Relative diffs:

linux arm64: -26.588 bytes
linux x64: -27,113 bytes
osx arm64: -14,360 bytes
windows arm64: -25,528 bytes
windows x64: -26,628 bytes

Diffs look interesting but TP impacts are too high. Now I'm reverting the experiment commit.

@hez2010 hez2010 force-pushed the value-array-stack-alloc branch from 3450580 to 1915450 Compare January 18, 2025 16:28
@AndyAyersMS
Copy link
Member

@jakobbotsch can you look this one over again?

#ifdef FEATURE_READYTORUN
if (comp->opts.IsReadyToRun() && data->IsHelperCall(comp, CORINFO_HELP_READYTORUN_NEWARR_1))
{
len = data->AsCall()->gtArgs.GetArgByIndex(0)->GetNode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed R2R support more completely

Comment on lines 2849 to 2854
GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
Statement* const mtStmt = gtNewStmt(mtStore);

fgInsertStmtBefore(block, newStmt, mtStmt);
gtSetStmtInfo(mtStmt);
fgSetStmtSeq(mtStmt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
Statement* const mtStmt = gtNewStmt(mtStore);
fgInsertStmtBefore(block, newStmt, mtStmt);
gtSetStmtInfo(mtStmt);
fgSetStmtSeq(mtStmt);
GenTree* const mtStore = gtNewStoreLclFldNode(lclNum, TYP_I_IMPL, 0, mt);
Statement* const mtStmt = fgNewStmtFromTree(mtStore);
fgInsertStmtBefore(block, newStmt, mtStmt);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto below, more things can switch to use fgNewStmtFromTree

@@ -181,6 +254,11 @@ inline bool ObjectAllocator::CanAllocateLclVarOnStack(unsigned int lclNu
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should have a limit on the aggregate stack allocated size. That's somewhat preexisting, but probably more important now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree but will do that in a future PR

Comment on lines +6249 to +6260
// If this is a local array, there are no asyncronous modifications, so we can set the
// conservative VN to the liberal VN.
//
VNFuncApp arrFn;
if (vnStore->IsVNNewLocalArr(arrVN, &arrFn))
{
loadTree->gtVNPair.SetConservative(loadValueVN);
}
else
{
loadTree->gtVNPair.SetConservative(vnStore->VNForExpr(compCurBB, loadType));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually show up as benefits?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some limited cases, yes... eg we can const propagate through a[2] = 1; y = a[2];

@AndyAyersMS
Copy link
Member

@jakobbotsch think I've addressed most of the key points. Overall size limit will come in a future PR.

@jakobbotsch
Copy link
Member

@AndyAyersMS Did you push those changes?

@AndyAyersMS
Copy link
Member

Ah, I pushed to my fork, but ... this PR is not from my fork.

@AndyAyersMS
Copy link
Member

@jakobbotsch changes are there now

@@ -4227,6 +4227,7 @@ enum GenTreeCallFlags : unsigned int
GTF_CALL_M_CAST_CAN_BE_EXPANDED = 0x04000000, // this cast (helper call) can be expanded if it's profitable. To be removed.
GTF_CALL_M_CAST_OBJ_NONNULL = 0x08000000, // if we expand this specific cast we don't need to check the input object for null
// NOTE: if needed, this flag can be removed, and we can introduce new _NONNUL cast helpers
GTF_CALL_M_STACK_ARRAY = 0x10000000, // this call is a new array helper for a stack allocated array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we weren't able to get rid of this since it's also used in VN.

Comment on lines +14375 to +14380
// modHeap = false;
}
else if (vnf == VNF_JitReadyToRunNewArr)
{
vnf = VNF_JitReadyToRunNewLclArr;
// modHeap = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: commented code (feel free to remove as part of a follow up)

@@ -14338,7 +14367,22 @@ bool Compiler::fgValueNumberHelperCall(GenTreeCall* call)
}
}

if (isAlloc && ((call->gtCallMoreFlags & GTF_CALL_M_STACK_ARRAY) != 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this could be changed to FindWellKnownArg(StackArrayLocal) != nullptr to get rid of the flag

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AndyAyersMS
Copy link
Member

Thanks. I have more changes in this area so I can handle the last few bits in a subsequent PR.

@AndyAyersMS AndyAyersMS merged commit 7d75878 into dotnet:main Jan 22, 2025
113 of 115 checks passed
@AndyAyersMS
Copy link
Member

@hez2010 thanks for all the work you did here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants