Endpoint Security Framework deadline

Hello.

When testing with the Endpoint Security Framework for the AUTH_OPEN event, I found that the deadline was 15 seconds, but the actual process termination occurred at 5 or 6 seconds. Is this intended?

Answered by DTS Engineer in 848978022

"When testing with the Endpoint Security Framework for the AUTH_OPEN event, I found that the deadline was 15 seconds, but the actual process termination occurred at 5 or 6 seconds. Is this intended?"

That depends on what you mean by "intended"...

In terms of why you're seeing that difference, the basic issue here is a disconnect between what the deadline actually means and the information available to your ES client. At a very high level, the deadline describes how "long" the system will stall that particular syscall. You can visualize that processing sequence as something like this:

  1. The kernel receives the syscall and does its processing.
  2. The in-kernel ES system sets the deadline and calls out to user space.
  3. Time passes.
  4. Your ES client receives the event and does its processing.
  5. Your ES client sends its response.
  6. Time passes.
  7. The in-kernel ES system receives your response and completes the syscall.

The disconnect here is that the deadline describes the time between "2" and "7", but the ONLY parts of that process your ES client can see/control are 4 & 5. In concrete terms, what's actually going on when this occurs:

"...actual process termination occurred at 5 or 6 seconds"

...is that ~10s had already been wasted/consumed at #3.
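As a rough illustration of that arithmetic, the sketch below computes how much of the deadline is left when an event finally reaches the handler at step 4. In a real client the inputs would come from the `deadline` field of `es_message_t`, `mach_absolute_time()`, and `mach_timebase_info()`; here they are plain parameters so the calculation stays portable. The helper name `remaining_ns` is hypothetical, not part of the framework.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nanoseconds left before `deadline_ticks`.
 * Assumption: in a real client, deadline_ticks is msg->deadline,
 * now_ticks is mach_absolute_time(), and numer/denom come from
 * mach_timebase_info(). Returns 0 once the deadline has passed. */
static uint64_t remaining_ns(uint64_t deadline_ticks, uint64_t now_ticks,
                             uint32_t numer, uint32_t denom)
{
    assert(denom != 0);
    if (now_ticks >= deadline_ticks) {
        return 0;
    }
    uint64_t ticks_left = deadline_ticks - now_ticks;
    /* Multiply before dividing to keep precision. This is fine for
     * spans of seconds; very large tick counts could overflow. */
    return ticks_left * numer / denom;
}
```

With a 1:1 timebase, a deadline set 15 seconds out and an event that arrives 10 seconds late leaves `remaining_ns(15000000000, 10000000000, 1, 1)`, which is 5 seconds, matching the termination time reported in the question.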

However, the bigger issue here is a more fundamental misunderstanding of the API you're dealing with and the risks inherent in it. More specifically:

  • As I've written about here and here, the EndpointSecurity framework is easily the most difficult and dangerous API on the system.

  • Its primary failure mode is NOT crashes/terminations, it's performance glitches and unexpected failures, generally under bizarre and/or difficult to reproduce conditions (see example here).

  • In practice, the deadline value is a WILDLY inflated guideline for how quickly an ES client actually needs to process events. This forum post has more details, but the "basic" guideline is that your ES client needs to be able to process each event in <100ms.

Pulling from the posts above, there are two key ideas I would really internalize:

"If you only focus on the specific failure, you can end up stuck fixing an endless stream of "random" failures as the system/apps find new and interesting ways to trip over your ES client." The issue here is that, assuming you're processing any of the "interesting" events (notably, "open"), ANY significant delay in individual processing creates a risk that you won't be able to process events as fast as the system can generate them. That backlog can then grow large enough that the system eventually terminates your app for failing to process events. Note that the real risk here isn't normal system activity, it's an attacker using that behavior to directly break your ES client.
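To make the backlog risk concrete, here is a deliberately simplified model, assuming a client that handles events strictly one at a time under a steady arrival rate. The function name and the sample numbers are hypothetical; real event streams are bursty and far less predictable.

```c
#include <assert.h>

/* Hypothetical model: events still waiting after `seconds` of sustained
 * load, for a client that processes events one at a time.
 * service_us is the time spent handling a single event, in microseconds. */
static double backlog_after(double arrivals_per_sec, double service_us,
                            double seconds)
{
    assert(service_us > 0.0);
    double capacity_per_sec = 1e6 / service_us; /* events handled per second */
    double growth = arrivals_per_sec - capacity_per_sec;
    /* The queue only grows when events arrive faster than they are handled. */
    return growth > 0.0 ? growth * seconds : 0.0;
}
```

In this model, 2000 opens per second at 1 ms each leaves 5000 events queued after 5 seconds, while the same load at 100 µs each never builds a backlog at all.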

Most ES issues (particularly deadline terminations) are ACTUALLY caused by design defects in how the ES client processes events, NOT whatever specific combination of factors happened to trigger any given failure. Until your focus is on making that code design work well, your ES client won't really "work".

"Be your own worst enemy, particularly when it comes to testing your product. Build testing scenarios that intentionally push your client to "destruction". Many clients have problems running alongside other ES clients, so I would both test with other products in common use AND build my own "pathologically bad" client that I could test "against"."

The problem here is that the larger system is so complex and difficult to predict that user-focused testing is far less useful than it would be in other contexts. It will uncover the stream of specific failures I referenced above, but that dynamic is exactly what you're trying to avoid. The solution here is to "get ahead" of the problem by building your own testing tools and process that are intentionally TRYING to break your client. I highlighted the second client case both because it's disappointingly common and because running with a pathological client is one of the best ways to introduce disruptions into your own client. See this post for more details.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Personally, I never dared to break the ES AUTH event's deadline, because people scared me that Apple counts your failures and will revoke your ES "capability" certificate if you exceed a certain limit.

My two cents --

Whenever my ES client was terminated for "not answering in time" -- it ALWAYS happened after 5-7 seconds, I think - regardless of the actual deadlines of the many events I receive and handle.

This stretch of time reminds me of something different with similar behavior: when you install a daemon that publishes an XPC service (by placing a .plist definition file at /Library/LaunchDaemons/com.mycompany.mydaemon.plist) and the daemon fails to launch, initialize itself, and become ready to receive XPC calls, launchd kicks/terminates it after about the same 5-7 seconds.

But maybe that's my imagination...

Messages are delivered in the order they enter the kernel, not in deadline-first order. Also, not all deadlines are equal. You could be responding to a message with a deadline 14 seconds in the future while you block a message with a deadline 5 seconds in the future.

If you want to avoid missing deadlines:

  • Respond to all messages immediately where possible.
  • If you can't respond immediately, use es_retain_message() and pass the message to another thread to respond, so you can continue dequeuing.

NEVER block the handler thread for more than a few microseconds. NEVER do any IO AT ALL from the handler thread. If you need to perform IO to make a policy decision, retain the message and do the IO from a separate thread.
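One possible shape for that hand-off is a lock-free single-producer/single-consumer ring: the handler thread only pushes a pointer and returns, and one worker thread pops and does the slow work. This is a portable sketch, not ES code. The `handoff_ring` type and its functions are hypothetical, and it assumes exactly one handler thread and one worker thread. In a real client the stored pointer would be a message retained with es_retain_message(), and the worker would respond (for AUTH_OPEN, with es_respond_flags_result()) and then call es_release_message().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 1024 /* hypothetical capacity; must be a power of two */
static_assert((RING_CAP & (RING_CAP - 1)) == 0, "RING_CAP must be a power of two");

typedef struct {
    const void *slots[RING_CAP]; /* stand-ins for retained message pointers */
    atomic_size_t head;          /* next slot to write (handler thread only) */
    atomic_size_t tail;          /* next slot to read (worker thread only) */
} handoff_ring;

/* Handler side: constant time, no locks, no IO. Returns false if full. */
static bool ring_push(handoff_ring *r, const void *msg)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP) {
        return false;
    }
    r->slots[head & (RING_CAP - 1)] = msg;
    /* Release so the worker sees the slot contents before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Worker side: returns NULL when there is nothing to process. */
static const void *ring_pop(handoff_ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head) {
        return NULL;
    }
    const void *msg = r->slots[tail & (RING_CAP - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return msg;
}
```

A zero-initialized `handoff_ring` starts out empty. A real client would also need a policy for the case where ring_push() returns false, since the message still has to be answered before its deadline.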

You can respond to messages in deadline order rather than delivery order by retaining them, and storing them in an ordered-tree-map using the deadline as the key.
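As a minimal sketch of deadline ordering, the code below uses a binary min-heap keyed by deadline rather than an ordered tree map; both give you the earliest deadline first, and the heap is simply shorter to show. All names here are hypothetical. In a real client, `deadline` would be the message's `deadline` field and `msg` a pointer retained with es_retain_message(). The heap is not thread-safe, so it would need a lock or a single owning thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_CAP 256 /* hypothetical limit on pending messages */

typedef struct {
    uint64_t deadline; /* stand-in for msg->deadline */
    const void *msg;   /* stand-in for a retained message pointer */
} pending;

typedef struct {
    pending items[HEAP_CAP];
    size_t count;
} deadline_heap;

static void heap_swap(pending *a, pending *b)
{
    pending t = *a;
    *a = *b;
    *b = t;
}

/* Insert a pending message; returns 0 if the heap is full. */
static int heap_push(deadline_heap *h, uint64_t deadline, const void *msg)
{
    if (h->count == HEAP_CAP) {
        return 0;
    }
    size_t i = h->count++;
    h->items[i].deadline = deadline;
    h->items[i].msg = msg;
    while (i > 0) { /* sift up until the parent is not later */
        size_t parent = (i - 1) / 2;
        if (h->items[parent].deadline <= h->items[i].deadline) {
            break;
        }
        heap_swap(&h->items[parent], &h->items[i]);
        i = parent;
    }
    return 1;
}

/* Remove and return the pending message with the earliest deadline. */
static pending heap_pop(deadline_heap *h)
{
    assert(h->count > 0);
    pending top = h->items[0];
    h->items[0] = h->items[--h->count];
    size_t i = 0;
    for (;;) { /* sift down toward the earlier child */
        size_t left = 2 * i + 1, right = left + 1, min = i;
        if (left < h->count && h->items[left].deadline < h->items[min].deadline) {
            min = left;
        }
        if (right < h->count && h->items[right].deadline < h->items[min].deadline) {
            min = right;
        }
        if (min == i) {
            break;
        }
        heap_swap(&h->items[i], &h->items[min]);
        i = min;
    }
    return top;
}
```

If messages with deadlines of 14 and 5 arrive in that order, heap_pop() returns the one with deadline 5 first, which is the case described above where delivery order would have left it waiting.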
