Render advanced 3D graphics and perform data-parallel computations on graphics processors using Metal.


Posts under Metal subtopic

Post · Replies · Boosts · Views · Activity

CMake unable to generate the Xcode file described in this tutorial
In the Creating A 3D Application With Hydra Rendering tutorial on the Apple Developer website, the last step has me execute this command:
    cmake -S ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/ -B ~/Users/macuser/CreatingA3DApplicationWithHydraRendering/
I keep getting this error:
    CMake Error at CMakeLists.txt:5 (include):
      include could not find requested file:
        /Users/macuser/USDInstall/bin/pxrConfig.cmake
I've followed the instructions in the README.md file included in the project files at least 5 times. I've also tried moving the pxrConfig.cmake file around and copying it into different folders before running the command, but I was still unsuccessful in generating the file needed to compile and run the HydraPlayer renderer. How do I get cmake to generate the Xcode file to create the HydraPlayer renderer?
Replies: 1 · Boosts: 0 · Views: 104 · Activity: May ’25
Metal and Swift PM
I have run into an issue trying to use atomic_float in a Swift package: I cannot get things to compile because it appears that the Swift Package Manager doesn't support Metal 3 (atomic_float is Metal 3 functionality). Is there any way around this? I am using // swift-tools-version: 6.1 and my Metal code includes:
    #include <metal_stdlib>
    #include <metal_geometric>
    #include <metal_math>
    #include <metal_atomic>
    using namespace metal;
    kernel void test(device atomic_float* imageBuffer [[buffer(1)]],
                     uint id [[ thread_position_in_grid ]]) { }
But I get an error on the definition of atomic_float. Any help, and more importantly a pointer to where I could have found information about this limitation, would be appreciated. -RadBobby
Replies: 0 · Boosts: 0 · Views: 60 · Activity: Apr ’25
Support for clock() shader instruction in MSL similar to VK_KHR_shader_clock instructions
Hi, it seems MSL is missing support for a clock() shader instruction that is available in other graphics APIs such as Vulkan or OpenGL. It is useful for counting the cost, in clock cycles, of some code inside a shader, with much finer granularity than launching a micro kernel with the same instructions and measuring its cost from the CPU. It would also be useful for MoltenVK to support that extension. Thanks.
Replies: 1 · Boosts: 0 · Views: 85 · Activity: Apr ’25
Physics bug in WWE 2K25 with GPTK2.1
The game physics work as expected with GPTK 2.0 using Crossover 24 or Whisky. However, with GPTK 2.1 and Crossover 25, the player and camera physics misbehave. See https://www.reddit.com/r/WWEGames/comments/1jx9mph/the_siamese_elbow/ and https://www.reddit.com/r/WWEGames/comments/1jx9ow4/camera_glitch/ A full video is also linked in the Reddit post. I have also submitted this bug via Feedback Assistant.
Replies: 2 · Boosts: 0 · Views: 132 · Activity: Apr ’25
Why slower with larger threadgroup memory?
I'm implementing an optimized matmul on Metal: https://github.com/crynux-ai/metal-matmul/blob/main/metal/1_shared_mem.metal I notice that performance differs significantly depending on the threadgroup memory length set with [computeEncoder setThreadgroupMemoryLength]. All other lines are exactly the same; the only difference is this parameter. Matmul performance is roughly 250 GFLOPS if I set 32768 (the maximum number of bytes allowed on this M1 Max), but 400 GFLOPS if I set 8192. Why does this happen? How can I optimize it?
Replies: 2 · Boosts: 0 · Views: 71 · Activity: Apr ’25
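One hypothesis worth checking for the matmul question above is occupancy: the more threadgroup memory each threadgroup reserves, the fewer threadgroups can be resident on a GPU core at once. The sketch below only does that arithmetic. It assumes the 32768-byte maximum reported in the post is also the budget shared by all resident threadgroups on one core, which the post does not confirm; `kAssumedCoreBudgetBytes` and `residentThreadgroups` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: 32768 bytes (the maximum quoted in the post for an M1 Max) is
// treated as the threadgroup-memory budget shared by every threadgroup
// resident on one GPU core. This is a hypothesis, not a documented figure.
constexpr std::uint32_t kAssumedCoreBudgetBytes = 32768;

// How many threadgroups fit in the assumed budget at the same time,
// given the length passed to setThreadgroupMemoryLength.
std::uint32_t residentThreadgroups(std::uint32_t bytesPerThreadgroup,
                                   std::uint32_t budgetBytes = kAssumedCoreBudgetBytes) {
    assert(bytesPerThreadgroup > 0);
    return budgetBytes / bytesPerThreadgroup;
}
```

Under this assumption `residentThreadgroups(32768)` is 1 and `residentThreadgroups(8192)` is 4, so the smaller setting would leave room for other threadgroups to run while one waits on memory. Whether that accounts for the 250 versus 400 GFLOPS gap would need to be confirmed with a GPU profiler.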
Metal and Swift Concurrency
Hi, introducing Swift Concurrency to my Metal app has been a bit challenging, as Swift Concurrency is limited by the cooperative thread pool. GPU work is obviously not CPU bound and can block forward progress, especially when using waitUntilCompleted on the command buffer. For concurrent render work this has the potential to underutilize the CPU and even create deadlocks. My question is: what is the Metal team's general recommendation when it comes to concurrency? It seems to me that Dispatch or OperationQueues are still the preferred way to run Metal-bound tasks for maximum performance. To integrate with Swift Concurrency, my idea is to use continuations that kick off render jobs via Dispatch or OperationQueues. Would this be the best way to bridge async tasks with Metal work? Thanks!
Replies: 5 · Boosts: 0 · Views: 998 · Activity: Apr ’25
Diagnose data access latency
The code is pretty simple:
    kernel void naive(
        constant RunParams *param [[ buffer(0) ]],
        const device float *A [[ buffer(1) ]],  // [N, K]
        device float *output [[ buffer(2) ]],
        uint2 gid [[ thread_position_in_grid ]]) {
        float val = 0.0f;  // accumulator; its declaration is not shown in the original post
        uint a_ptr = gid.x * param->K;
        for (uint i = 0; i < param->K; i++, a_ptr++) {
            val += A[a_ptr];
        }
        output[ptr] = val;  // ptr is the output index; its definition is not shown in the original post
    }
With uint a_ptr = gid.x * param->K, the code reaches 150 GFLOPS.
With uint a_ptr = gid.y * param->K, the code reaches 860 GFLOPS.
Here param->K = 256 and the threads per group are [16, 16].
I'd like to understand why the performance is so different, and how I can profile or diagnose this to help with further optimization.
Replies: 0 · Boosts: 0 · Views: 49 · Activity: Apr ’25
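A way to reason about the two variants in the post above is to count how many distinct elements of A one SIMD group touches per loop iteration. The sketch below is a CPU-side model only. It assumes 32 threads per SIMD group and that threads in the [16, 16] threadgroup are linearized x-fastest; neither is stated in the post, and `distinctReads` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts the distinct element indices of A read by one SIMD group during
// loop iteration i, for the first threadgroup (so gid equals the local id).
// ASSUMPTIONS: simdWidth threads per SIMD group, lanes ordered x-fastest.
std::size_t distinctReads(bool indexByX, std::uint32_t K, std::uint32_t i,
                          std::uint32_t groupWidth = 16,
                          std::uint32_t simdWidth = 32) {
    assert(groupWidth > 0);
    std::set<std::uint64_t> indices;
    for (std::uint32_t lane = 0; lane < simdWidth; ++lane) {
        std::uint32_t x = lane % groupWidth;  // gid.x for this lane
        std::uint32_t y = lane / groupWidth;  // gid.y for this lane
        std::uint64_t row = indexByX ? x : y; // gid.x * K versus gid.y * K
        indices.insert(row * K + i);
    }
    return indices.size();
}
```

With K = 256, `distinctReads(true, 256, 0)` is 16 and `distinctReads(false, 256, 0)` is 2. Under these assumptions the gid.x variant makes each SIMD group read 16 locations that are 1024 bytes apart on every iteration, while the gid.y variant reads only 2, which is consistent with the faster result but is not a substitute for a profiler trace.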
VRAM not freeing in Elite Dangerous
So I've been trying out GPTK with the game Elite Dangerous Horizons. From what I can tell, the VRAM usage keeps going up until it exceeds the limit, at which point the frame rate drops to 1-3 FPS and then the game crashes. In the Performance HUD I can see that when using GPTK the VRAM usage just keeps climbing, and I never saw it drop at all. I did some limited testing, and from that I think it is probably not a VRAM leak but possibly caching. The reason is that if I go back to an area I've been in before, the VRAM usage does not increase. So either there is something wrong with the part that frees VRAM, or GPTK might not be reporting the right amount of VRAM available to use. Maybe that's why the game keeps allocating VRAM until it runs out of memory and crashes. Just to test, I tried running the game with the DXVK+MoltenVK combo, and it works just fine: VRAM is freed when it's no longer used. Is this a known issue in some games?
Replies: 12 · Boosts: 3 · Views: 807 · Activity: Apr ’25
iOS Metal system delayed one Vsync period to really display the frame on the screen
View Layout
Add the following views in a view controller:
- Label
- View A, with a subview of the same size: MTKView A
- View B, with a subview of the same size: MTKView B
Refresh Rates of Each View
The label view refreshes at 60fps (driven by CADisplayLink). MTKView A and B refresh at 15fps.
MTKView Implementation Details
- The corresponding CAMetalLayer's maximumDrawableCount is set to 2, which changes it to double buffering.
- The scheduling mechanism is modified: drawing is not driven by the internal loop but is done manually. The draw call is triggered immediately upon receiving a frame.
    self.metalView.enableSetNeedsDisplay = NO;
    self.metalView.paused = YES;
- A new high-priority queue is created for drawing, instead of handling it on the main queue.
MTKView Latency Tracking
The GPU completion time T1 is observed through the addCompletedHandler callback of the command buffer. The presentation time T2 of the frame is observed through the addPresentedHandler callback of the currentDrawable in MTKView. Testing shows that T2 - T1 > 16.6ms (the Vsync period at 60Hz). This means that after the GPU rendering in MTKView is finished, the frame is not actually displayed at the next Vsync but only at the Vsync after that. I believe there is an extra 16.6ms of latency here, which I want to eliminate by adjusting the rendering mechanism.
Observation from Instruments
In Instruments, the Surface presentation aligns with the above test results. After the Metal encoder finishes, the Surface in Display switches only after the next-next Vsync. See the image in the link for details.
Questions
- As I understand it as a beginner, after MTKView's GPU rendering is finished, the frame should become visible at the next Vsync. However, this is not what is observed. Does the subview MTKView need to wait for another Vsync cycle to be drawn to the actual display buffer?
- The label updates its text at 60fps, so the entire interface should be displayed at 60fps. Is the content of MTKView not synchronized when the display happens?
Explanation of the Reasoning Behind Some MTKView Code Details
- Changing from the default triple buffering to double buffering helps reduce the latency introduced by rendering.
- The draw method is triggered manually instead of through MTKView's own scheduling mechanism because that mechanism is driven by CADisplayLink. If a frame arrives within a Vsync window, it would need to wait for the next Vsync window to trigger the draw operation, which introduces waiting latency.
Replies: 0 · Boosts: 0 · Views: 73 · Activity: Apr ’25
Threadgroup memory for fragment shader
Hello, I am trying to get threadgroup memory access in a fragment shader. In essence, I would like all the fragments in a tile to bitwise-OR some value. My idea was to use simd_or across the SIMD group, then have thread 0 of each SIMD group atomically OR the value into threadgroup memory. Finally, the very first thread of the tile would be tasked with writing the value to a texture with write access. Now, I can declare the threadgroup memory argument on the fragment function all right. MTLRenderCommandEncoder has a setThreadgroupMemoryLength call, which I am using the following way:
    [renderEncoder setThreadgroupMemoryLength: 16 offset: 0 atIndex:0]
Unfortunately, all I am getting is the following error (runtime assertion):
    -[MTLDebugRenderCommandEncoder setThreadgroupMemoryLength:offset:atIndex:]:3487: failed assertion Set Threadgroup Memory Length Validation offset + length(16) must be <= threadgroupMemoryLength(0).
What am I doing wrong? How can I get threadgroup memory in the fragment shader? I know I could use tile shading and a compute function, but here I would really like to use the fragment stage. I will be grateful for help.
Replies: 1 · Boosts: 0 · Views: 72 · Activity: Apr ’25
Slow compilation
Hi, I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected; however, the compile times have gotten completely out of hand, and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt, but I am wondering whether there is a way to reduce the compile times in such a scenario, and how to investigate the root cause of the problem. I suspect it could have to do with the fact that every material's metallib contains duplicates of all the inline functions. Any ideas on how to profile and debug this? Thanks, Rasmus
Replies: 0 · Boosts: 1 · Views: 44 · Activity: Mar ’25
Metal texture allocated size versus actual image data size
Hello. In the iOS app I'm working on we are very tight on memory budget, and I was looking at ways to reduce our texture memory usage. However, I noticed that when comparing ASTC8x8 to ASTC12x12 there is no actual difference in allocated memory for most of our textures, despite ASTC12x12 having less than half the bpp of 8x8. The difference between the two only becomes apparent for textures 1024x1024 and larger, and even in that case the actual texture data is sometimes only 60% of the allocation size. I understand there must be some alignment and padding going on, but this seems extreme. For an example scene in my app with astc12x12 for most textures, there is over a 100mb difference between the astc size on disk and the size when loaded, so I would love to be able to recover even a portion of that memory. Here is some test code with some measurements I've taken using an iPhone 11:
    for (int i = 0; i < 11; i++) {
        MTLTextureDescriptor *texDesc = [[MTLTextureDescriptor alloc] init];
        texDesc.pixelFormat = MTLPixelFormatASTC_12x12_LDR;
        int dim = 12;        // ASTC block dimension in texels
        int n = 2 << i;      // texture width and height
        int mips = i + 1;
        texDesc.width = n;
        texDesc.height = n;
        texDesc.mipmapLevelCount = mips;
        texDesc.resourceOptions = MTLResourceStorageModeShared;
        texDesc.usage = MTLTextureUsageShaderRead;
        // Calculate the equivalent astc texture size
        int blocks = 0;
        if (mips == 1) {
            blocks = n/dim + (n%dim > 0 ? 1 : 0);
            blocks *= blocks;
        } else {
            for (int j = 0; j < mips; j++) {
                int a = 2 << j;
                int cur = a/dim + (a%dim > 0 ? 1 : 0);
                blocks += cur*cur;
            }
        }
        auto tex = [objCObj newTextureWithDescriptor:texDesc];
        // 16 bytes per ASTC block
        printf("%dx%d, mips %d, Astc: %d, Metal: %d\n", n, n, mips, blocks*16, (int)tex.allocatedSize);
    }
MTLPixelFormatASTC_12x12_LDR
    128x128, mips 7, Astc: 2768, Metal: 6016
    256x256, mips 8, Astc: 10512, Metal: 32768
    512x512, mips 9, Astc: 40096, Metal: 98304
    1024x1024, mips 10, Astc: 158432, Metal: 262144
    128x128, mips 1, Astc: 1936, Metal: 4096
    256x256, mips 1, Astc: 7744, Metal: 16384
    512x512, mips 1, Astc: 29584, Metal: 65536
    1024x1024, mips 1, Astc: 118336, Metal: 147456
MTLPixelFormatASTC_8x8_LDR
    128x128, mips 7, Astc: 5488, Metal: 6016
    256x256, mips 8, Astc: 21872, Metal: 32768
    512x512, mips 9, Astc: 87408, Metal: 98304
    1024x1024, mips 10, Astc: 349552, Metal: 360448
    128x128, mips 1, Astc: 4096, Metal: 4096
    256x256, mips 1, Astc: 16384, Metal: 16384
    512x512, mips 1, Astc: 65536, Metal: 65536
    1024x1024, mips 1, Astc: 262144, Metal: 262144
I also tried using MTLHeaps (placement and automatic) hoping they might be better, but saw nearly the same numbers. Is there any way to have Metal allocate these textures in a more compact way to save memory?
Replies: 8 · Boosts: 0 · Views: 2.7k · Activity: Mar ’25
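The "Astc" column in the post above can be reproduced without a Metal device, which makes it easier to compare packed sizes against reported allocations. This is a minimal sketch of the same arithmetic as the post's test loop (square textures, 16 bytes per block, mip chain stopping at the requested level count). The function names are hypothetical, and the "Metal" values still have to come from `allocatedSize` on real hardware.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for one square ASTC mip level: ceil(size / blockDim)^2 blocks,
// 16 bytes per block.
std::uint64_t astcLevelBytes(std::uint32_t size, std::uint32_t blockDim) {
    assert(blockDim > 0);
    std::uint64_t blocksPerSide = (size + blockDim - 1) / blockDim;
    return blocksPerSide * blocksPerSide * 16;
}

// Bytes for `mips` levels starting at `size` and halving at each level.
std::uint64_t astcChainBytes(std::uint32_t size, std::uint32_t blockDim,
                             std::uint32_t mips) {
    std::uint64_t total = 0;
    for (std::uint32_t level = 0; level < mips && size > 0; ++level, size /= 2) {
        total += astcLevelBytes(size, blockDim);
    }
    return total;
}

// Fraction of an allocation that tightly packed block data would occupy.
double packedFraction(std::uint64_t packedBytes, std::uint64_t allocatedBytes) {
    assert(allocatedBytes > 0);
    return static_cast<double>(packedBytes) / static_cast<double>(allocatedBytes);
}
```

For example, `astcChainBytes(1024, 12, 10)` gives 158432, matching the post, and `packedFraction(158432, 262144)` is about 0.60, which is the "only 60% of the allocation size" case described there.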
Threadgroup configuration for tile shading
Hello! I have a question about how thread groups work with tile shading. When running "traditional" compute, I get to choose both the thread group size and the grid size. However, when using a tile shading kernel I only have the dispatchThreadsPerTile method, which controls how many threads will be run in each tile. So far so good, but what about thread groups? The examples in the video "Tile Shading on A11" seem to suggest that there will be only one thread group per tile. In the video, [[thread_index_in_threadgroup]] is called "local_id" and it is used to access the image block. I assume this is the default configuration. So when one does the following:
- Creates an MTLRenderPassDescriptor with tileWidth set to W and tileHeight set to H
- Fires up the tile shading kernel using dispatchThreadsPerTile with MTLSize size = { W, H, 1 }
I understand that the result is a 1-to-1 mapping between the tile "pixels" and kernel threads. Now, what I would like to do is to have more than one thread group there. I want this for performance reasons: I have a certain compute kernel which I know executes very well with a small thread group size. In fact, { 32, 1, 1 } seems to be the fastest. My understanding is that even if I set the tile size to 16x16, and so execute 256 threads there, there will only be one SIMD group active in a thread group, meaning that this SIMD group has to execute 8 times over the tile. Is it possible somehow? Or perhaps the limitations of the API point at limitations of the hardware itself, and if I want to execute with SIMD-group-sized thread groups I have to use a "traditional" compute encoder? I will be grateful for help. Michał
Replies: 0 · Boosts: 0 · Views: 36 · Activity: Mar ’25
Metal triangle strips uniform opacity.
I have this drawing app that I have been working on for the past few years when I have free time. I recently rebuilt the app in Metal to build out other brushes and improve performance, since I need to render tens of thousands of lines in realtime. I’m running into an issue trying to create a uniform opacity per path. I have a solution but do not love it, as this is a realtime app and the solution could have some bottlenecks. If I just generate a triangle strip from touch points and do my best to smooth, resample, and handle miters, I will always get some overlaps. To create a uniform opacity I render to an offscreen texture with blending disabled. I then premultiply the color and draw that texture to a composite texture with blending on (I do this per path). This works but gets tricky when you introduce a textured brush: the edges of the texture in the fragment shader cut out the line. Solution: I discard fragments below a threshold.
    fragment float4 fragment_line(VertexOut in [[stage_in]],
                                  texture2d<float> texture [[ texture(0) ]]) {
        constexpr sampler s(coord::normalized, address::mirrored_repeat, filter::linear);
        float2 texCoord = in.texCoord;
        float4 texColor = texture.sample(s, texCoord);
        if (texColor.a < 0.01) discard_fragment(); // may be slow (from what I read)
        return in.color * texColor;
    }
Better but still not perfect. Question: I'm looking for better ways to create a uniform opacity per path. I tried .max blending, but that prevents blending with other paths. Any tips or ideas are much appreciated. If this is too detailed of a question, just archive.
Replies: 1 · Boosts: 0 · Views: 62 · Activity: Mar ’25
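The overlap problem in the post above follows directly from the blend equation: with standard source-over blending, covering the same pixel twice at alpha a yields 1 - (1 - a)^2 rather than a. The sketch below works through that on the CPU for a single alpha channel. It is an illustration of the math only, not the app's pipeline, and `sourceOver` and `maxBlend` are hypothetical helper names.

```cpp
#include <cassert>

// Source-over blend of alpha `src` on top of existing alpha `dst`.
double sourceOver(double src, double dst) {
    assert(src >= 0.0 && src <= 1.0 && dst >= 0.0 && dst <= 1.0);
    return src + dst * (1.0 - src);
}

// Max blend: the result never exceeds the larger of the two inputs.
double maxBlend(double src, double dst) {
    return src > dst ? src : dst;
}
```

Drawing a 0.5-alpha triangle once gives `sourceOver(0.5, 0.0)` = 0.5, but where two triangles of the strip overlap the result is `sourceOver(0.5, 0.5)` = 0.75, which shows up as a darker patch. `maxBlend(0.5, 0.5)` stays at 0.5, which is why max blending hides overlaps within a path, and also why it stops separate paths from accumulating, as the post notes.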
MTLBinaryArchive Size
I'm trying to use MTLBinaryArchive. I collected a binary archive from one device and used metal-tt to translate it for all supported iPhone devices, ranging from iPhone 7 Plus to iPhone 16. However, this binary archive is quite large: around 1.5GB uncompressed, and about 500MB compressed in the IPA. I'm wondering how to address the size issue. I watched the WWDC 2022 video, which mentioned that the operating system or app installation process would handle compatibility. Does this compatibility cover different GPU chips? I tried installing an IPA with a binary archive collected only from an iPhone 12 on an iPhone 13, but the binary archive didn't take effect. I also saw that Apple supports App Thinning. However, it seems that resources in the Asset Catalog cannot be accessed via URL, and creating an MTLBinaryArchive requires a URL. Is it possible for an MTLBinaryArchive to be distributed through App Thinning? The WWDC 2022 video also mentioned using the -Os optimization flag to reduce size. Is there an estimate of how much size reduction it would achieve? Are there any methods to solve the binary archive size issue without impacting performance?
Replies: 0 · Boosts: 1 · Views: 48 · Activity: Mar ’25
How can I get pixel coordinates in the fragment tile function?
In this video, tile fragment shading is recommended for image processing. In this example, the unpack function takes two arguments, one of which is RasterizerData. As I understand it, this is the data passed to us from the previous stage (Vertex) of the graphics pipeline. However, the properties of MTLTileRenderPipelineDescriptor do not include an option for specifying a Vertex function. Therefore, in this render pass, a mix of commands is used: first, a draw command is executed to obtain UV coordinates, and then threads are dispatched. My question is: without using a draw command, only dispatch, how can I get pixel coordinates in the fragment tile function? For the kernel tile function, everything is clear.
    typedef struct {
        float4 OPTexture [[ color(0) ]];
        float4 IntermediateTex [[ color(1) ]];
    } FragmentIO;
    fragment FragmentIO Unpack(RasterizerData in [[ stage_in ]],
                               texture2d<float, access::sample> srcImageTexture [[texture(0)]]) {
        FragmentIO out;
        //...
        // Run necessary per-pixel operations
        out.OPTexture = // assign computed value;
        out.IntermediateTex = // assign computed value;
        return out;
    }
Replies: 1 · Boosts: 0 · Views: 121 · Activity: Mar ’25
why GLDContextRec::flushContextInternal() leads to abort
The flushContextInternal function at glr_sync.mm:262 called abort internally. What caused this? Was it due to high device temperature or some other reason?
    Date/Time: 2024-08-29 09:20:09.3102 +0800
    Launch Time: 2024-08-29 08:53:11.3878 +0800
    OS Version: iPhone OS 16.7.10 (20H350)
    Release Type: User
    Baseband Version: 8.50.04
    Report Version: 104
    Exception Type: EXC_CRASH (SIGABRT)
    Exception Codes: 0x0000000000000000, 0x0000000000000000
    Triggered by Thread: 0
    Thread 0 name:
    Thread 0 Crashed:
    0 libsystem_kernel.dylib 0x00000001ed053198 __pthread_kill + 8 (:-1)
    1 libsystem_pthread.dylib 0x00000001fc5e25f8 pthread_kill + 208 (pthread.c:1670)
    2 libsystem_c.dylib 0x00000001b869c4b8 abort + 124 (abort.c:118)
    3 AppleMetalGLRenderer 0x00000002349f574c GLDContextRec::flushContextInternal() + 700 (glr_sync.mm:262)
    4 DiSpecialDriver 0x000000010824b07c Di::RHI::onRenderFrameEnd() + 184 (RHIDevice.cpp:118)
    5 DiSpecialDriver 0x00000001081b85f8 Di::Client::drawFrame() + 120 (Client.cpp:155)
2024-08-27_14-44-10.8104_+0800-07d9de9207ce4c73289507e608e5de4320d02ccf.crash
Replies: 1 · Boosts: 0 · Views: 77 · Activity: Mar ’25
Xcode Playground - The LLDB RPC server has crashed.
I am trying to learn Metal development on my MacBook Pro M1 Pro (Sequoia 15.3.1) in an Xcode Playground, but when I write these two lines of code:
    import Metal
    let device = MTLCreateSystemDefaultDevice()!
I get the error "The LLDB RPC server has crashed." Any ideas as to what I can do to solve this? I have rebooted the machine and reinstalled Xcode...
Replies: 3 · Boosts: 0 · Views: 417 · Activity: Mar ’25
Metal: Non-uniform thread groups unsupported in Simulator? Is it?
My app runs compute shaders that use non-uniform thread groups. When I run the app in the debugger with a simulator target, the app crashes on encoder.dispatchThreads with this error message: "Dispatch Threads with Non-Uniform Threadgroup Size is not supported on this device." Earlier in the log output it states: "Metal Shader Validation is unsupported for Simulator." However, when I stop the debugger and just run the app in the simulator without the debugger attached, the app runs fine and does not crash. The SwiftUI Preview, which also triggers the compute shader when preparing data, runs fine without a crash as well. I can run and debug on a real device with no problem; I just don't have all sizes available. Is there anything I need to check in my lldb/simulator configuration? It obviously does work, so is it just the debugger that cannot deal with it? Any input would be welcome, as this really slows me down: I have to be extremely careful when debugging on the simulator.
Replies: 2 · Boosts: 0 · Views: 524 · Activity: Mar ’25
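Where non-uniform dispatch is unavailable, a common fallback is to dispatch whole threadgroups with dispatchThreadgroups, rounding the count up, and have the kernel return early for thread positions past the end of the data. The sketch below shows only the rounding arithmetic on the CPU side. The helper names are hypothetical, and it does not explain why the simulator behaves differently with and without the debugger attached.

```cpp
#include <cassert>
#include <cstdint>

// Number of uniform threadgroups needed to cover gridSize threads
// (ceiling division, computed in 64 bits to avoid overflow).
std::uint32_t threadgroupsNeeded(std::uint32_t gridSize,
                                 std::uint32_t threadsPerGroup) {
    assert(threadsPerGroup > 0);
    std::uint64_t groups =
        (static_cast<std::uint64_t>(gridSize) + threadsPerGroup - 1) / threadsPerGroup;
    return static_cast<std::uint32_t>(groups);
}

// Threads launched beyond gridSize; the kernel must skip these with a
// bounds check on its thread position.
std::uint64_t paddingThreads(std::uint32_t gridSize,
                             std::uint32_t threadsPerGroup) {
    std::uint64_t launched =
        static_cast<std::uint64_t>(threadgroupsNeeded(gridSize, threadsPerGroup)) *
        threadsPerGroup;
    return launched - gridSize;
}
```

For a 1000-element buffer and 32 threads per group, `threadgroupsNeeded(1000, 32)` is 32 and `paddingThreads(1000, 32)` is 24, so 24 threads in the last group would need to exit without touching memory.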