[Feat] Suggest using MappingUtils to compute coordinates automatically for different warpSize #512

Open
yiakwy-xpu-ml-framework-team opened this issue Sep 27, 2024 · 0 comments

MappingUtils has been integrated into ROCm SDK 6.2. It defines wave coordinates <waveRows, waveCols> in the form of

blockDim = (waveRows * warpSize, waveCols) // warpSize is 64 on AMD GPUs and 32 on NVIDIA GPUs

Here <waveRows, waveCols> are the warp coordinates within each thread block, and thread blocks are distributed to the SMs (NVIDIA) or CUs (AMD).
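The mapping above can be sketched on the host side as follows. This is only a minimal illustration under the blockDim convention stated above; `Dim2`, `WaveCoord`, `makeBlockDim`, `waveCoordOf`, and `laneIdOf` are hypothetical names chosen for this example and are not part of MappingUtils or any SDK.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration only.
struct Dim2 { uint32_t x; uint32_t y; };
struct WaveCoord { uint32_t row; uint32_t col; };

// Block dimensions for a waveRows x waveCols arrangement of waves:
// blockDim = (waveRows * warpSize, waveCols).
constexpr Dim2 makeBlockDim(uint32_t waveRows, uint32_t waveCols, uint32_t warpSize) {
    return Dim2{waveRows * warpSize, waveCols};
}

// Wave coordinate of the thread at (threadX, threadY) inside the block.
// The warp size is a parameter, so nothing is hard coded to 32 or 64.
constexpr WaveCoord waveCoordOf(uint32_t threadX, uint32_t threadY, uint32_t warpSize) {
    return WaveCoord{threadX / warpSize, threadY};
}

// Lane index inside the wave. This relies on blockDim.x being a multiple
// of warpSize, which the convention above guarantees.
constexpr uint32_t laneIdOf(uint32_t threadX, uint32_t warpSize) {
    return threadX % warpSize;
}
```

With a 2 x 2 wave arrangement, the same thread index (70, 1) falls into wave <1, 1> when warpSize is 64 and into wave <2, 1> when warpSize is 32, which is why deriving the coordinate from the warp size matters.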

This feature can eliminate hard-coded warp sizes and hard-coded partition hierarchy transformations, which rely on the hardware memory hierarchy, and it helps ensure the code works correctly across platforms.

Note that the partition hierarchy transformation and the hardware memory hierarchy can change with the hardware. For example, the L2 cache may have a different number of memory banks (4 banks) than LDS (64 banks), which means the best swizzling hyperparameters (if any exist) for memory level_{i} differ from those for memory level_{i+1}.
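To illustrate why the bank count has to be a parameter, here is a minimal sketch of a simple XOR swizzle. It assumes a row-major tile whose row width equals the number of banks (in words) and a power-of-two bank count; `swizzledBank` is a hypothetical helper, not an API of any SDK, and real swizzle schemes may differ.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only.
// Without swizzling, element (row, col) of a tile whose row width equals
// numBanks lands in bank `col`, so a column access hits one bank repeatedly.
// XOR-ing the column with the row index spreads a column across the banks.
// Assumes numBanks is a power of two and col < numBanks.
constexpr uint32_t swizzledBank(uint32_t row, uint32_t col, uint32_t numBanks) {
    return (col ^ (row % numBanks)) % numBanks;
}
```

With 4 banks, rows 0 to 3 of column 0 map to banks 0 to 3 and row 5 wraps back to bank 1; with 64 banks, row 5 maps to bank 5. The same access pattern therefore needs a different swizzle parameter at each memory level.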

The MappingUtils code for a thread block looks like this:

    template <uint32_t BlockHeight, uint32_t BlockWidth, typename DataT, typename DataLayout>
    struct MappingUtil {
        // Index of the current thread within its wave
        static inline uint32_t laneId();

        // Local wave coordinate relative to the workgroup, i.e. the <waveRows, waveCols>
        // above, for warp-level programming with the warp sync API
        static inline WaveCoordT waveCoord();

        // Global block (grid) coordinate of the current wave
        static inline BlockCoordT blockCoord();

        // Matrix coordinate of the current wave
        static inline MatrixCoordT matrixCoord();
    };

Moreover, the warp-level partition depends on the instruction used.

For example, the partition for the m8n8.x4 instruction (four 8x8 matrix fragments) must differ from the partition for the m16n16.x4 instruction.
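A rough sketch of this dependence is shown below. It assumes each instruction issue handles `count` fragments of shape `rows` x `cols` and that the tile is an exact multiple of the fragment shape; `FragShape`, `fragmentsPerTile`, and `issuesPerTile` are hypothetical names for this example, not part of any instruction set or SDK.

```cpp
#include <cstdint>

// Hypothetical description of a matrix-fragment instruction:
// each issue handles `count` fragments of shape rows x cols.
struct FragShape { uint32_t rows; uint32_t cols; uint32_t count; };

// Number of fragments needed to cover a tileRows x tileCols tile,
// assuming the tile is an exact multiple of the fragment shape.
constexpr uint32_t fragmentsPerTile(uint32_t tileRows, uint32_t tileCols, FragShape f) {
    return (tileRows / f.rows) * (tileCols / f.cols);
}

// Number of instruction issues one wave needs for the tile (rounded up).
constexpr uint32_t issuesPerTile(uint32_t tileRows, uint32_t tileCols, FragShape f) {
    return (fragmentsPerTile(tileRows, tileCols, f) + f.count - 1) / f.count;
}
```

For a 32x32 tile, an m8n8.x4 shape yields 16 fragments and 4 issues, while an m16n16.x4 shape yields 4 fragments and 1 issue, so the per-wave partition of the tile cannot be the same for both.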
