Wednesday, March 26, 2014

F5 Big-IP Load Balanced WCF Services

*update* - This follow-up may provide more detail on finding a solution.

We have been trying to configure our new F5 Big-IP load balancer for some WCF services and encountered a strange issue.
The service uses wsHttpBinding with TransportWithMessageCredential over SSL. The authentication type is Windows.
When either node on the F5 was disabled (leaving a single active node), the service worked. However, when both nodes were active, the service failed with an exception:
Exception:
Secure channel cannot be opened because security negotiation with the remote endpoint has failed. This may be due to absent or incorrectly specified EndpointIdentity in the EndpointAddress used to create the channel. Please verify the EndpointIdentity specified or implied by the EndpointAddress correctly identifies the remote endpoint.
Inner Exception:
The request for security token has invalid or malformed elements.
Numerous documentation pages and blog posts highlight that *the* way to support load balancing in WCF is to turn off security context establishment by setting EstablishSecurityContext=false in the binding configuration, or to turn on 'sticky sessions'.
http://ozkary.blogspot.com.au/2010/10/wcf-secure-channel-cannot-be-opened.html
http://msdn.microsoft.com/en-us/library/ms730128.aspx
http://msdn.microsoft.com/en-us/library/vstudio/hh273122(v=vs.100).aspx
We did not want to use sticky sessions; although they did fix the issue, the F5 logs showed that load balancing was not distributing requests the way we wanted.
Unfortunately, we already had EstablishSecurityContext set to false, so the security negotiation should have been occurring on each request, which meant it should have been working.
After hours of investigating other binding settings, creating test clients, updating the WCF Test Client configurations and generally fumbling around, we eventually went back to reconfiguring the F5. Since exactly the same configuration worked when only one node was active, unless the WCF binding documentation we found *everywhere* was a complete lie, it had to be the F5.
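For reference, the relevant part of such a binding looks roughly like this (the binding name is illustrative, not our actual configuration):

```xml
<wsHttpBinding>
  <!-- TransportWithMessageCredential over SSL, Windows message credentials -->
  <binding name="SecureWindowsBinding">
    <security mode="TransportWithMessageCredential">
      <!-- establishSecurityContext="false" forces a full security
           negotiation on every request instead of a per-node session -->
      <message clientCredentialType="Windows" establishSecurityContext="false" />
    </security>
  </binding>
</wsHttpBinding>
```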
It was finally traced to the F5 OneConnect (http://support.f5.com/kb/en-us/solutions/public/7000/200/sol7208.html) configuration. This does some jiggery-pokery to transparently pool backend connections to improve performance. It also seems to break WCF services; at least, it broke ours.
Disabling OneConnect on the F5 application profile resolved the issue immediately.
We now have our load balanced, non-persistent, WCF services behind the shiny F5 working.
As I could not find this issue reported elsewhere online, I can only assume it is related to the combination of TransportWithMessageCredential and Windows as the message credential type.
*edit* So the solution was not as straightforward as we thought, and we have currently reverted to “sticky sessions” on the F5 to get this working. Even with EstablishSecurityContext=false and OneConnect disabled, the same failure occurs if a single client makes two concurrent requests on two separate threads (our clients are web applications) and the F5 routes each connection to a different service node.

While we investigate further, the short-term solution is to use Transport security instead of TransportWithMessageCredential. Because this requires a client-side change, we had to deploy multiple bindings; each client app will upgrade in turn while we keep sticky sessions enabled on the F5. Once all clients are on the new binding, we can remove the old binding and disable sticky sessions again.
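The interim configuration looks roughly like this, exposing both bindings side by side so clients can migrate one at a time (all names here are illustrative, not from our actual config):

```xml
<bindings>
  <wsHttpBinding>
    <!-- old binding: message credentials; needs sticky sessions on the F5 -->
    <binding name="MessageCredentialBinding">
      <security mode="TransportWithMessageCredential">
        <message clientCredentialType="Windows" establishSecurityContext="false" />
      </security>
    </binding>
    <!-- new binding: transport-only security; load balances cleanly -->
    <binding name="TransportOnlyBinding">
      <security mode="Transport">
        <transport clientCredentialType="Windows" />
      </security>
    </binding>
  </wsHttpBinding>
</bindings>
<services>
  <service name="MyApp.MyService">
    <endpoint address="" binding="wsHttpBinding"
              bindingConfiguration="MessageCredentialBinding"
              contract="MyApp.IMyService" />
    <endpoint address="transport" binding="wsHttpBinding"
              bindingConfiguration="TransportOnlyBinding"
              contract="MyApp.IMyService" />
  </service>
</services>
```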

Transport security works for us, but it is not perfect. It reduces security (SSL only, no message encryption) and reduces flexibility (we can’t, for instance, switch to client-certificate or username authentication for per-request auth).
It does however keep the services stable and gives us time to perform a thorough analysis of the problem.

Tuesday, March 25, 2014

“Don’t Query From the View” - Unit of Work and AOP

“Maintains a list of objects affected by a business transaction and coordinates the writing out of changes and the resolution of concurrency problems.” — Martin Fowler’s definition of the Unit of Work pattern.


A colleague asked for my opinion on a warning from NHProfiler: http://hibernatingrhinos.com/products/NHProf/learn/alert/QueriesFromViews “Don’t Query From the View”.

The team had a session lifetime per controller action using AOP (attribute filters), but needed to extend it to remain active during the view-rendering phase after the controller completed. They did this by instantiating the session in the request filter chain instead of the controller attribute filters. When they did, they received the NHProfiler warning above.

The colleague was dismissive of the warning, partly because the implementation in the code was innocuous and partly because the warning’s stated reasons were not particularly compelling.

However, there are some serious implications of the pattern being followed, some of which I have covered on this blog in the past (here) and (here).

 
tldr;
- Sessions can be open as long as you need them to be. You do not need to arbitrarily close them if you know they will be needed again as soon as the next step in the pipeline is reached.
- There are (arguably) valid reasons why the view rendering would be able to access an active session.
- The longer a session is open, the greater the chance of unexpected behaviour from working with entities in a connected state. Care should be taken.


Pro1: Simplicity
One benefit of having a session opened and closed automatically via cross-cutting AOP is that your application code doesn’t need to care about it. It knows a session will exist when used, and that everything will be committed when the scope ends. This is often done as a controller/action filter attribute, or higher in the request pipeline. You don’t need to pass a session around, check whether it exists, and so on.
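As a minimal sketch of the attribute-filter approach (assuming ASP.NET MVC and an NHibernate ISessionFactory exposed from somewhere central; the names here are illustrative, not from the team’s code):

```csharp
// Sketch only: binds an NHibernate session to the current context for the
// duration of a controller action, committing or rolling back at the end.
public class UnitOfWorkAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(ActionExecutingContext filterContext)
    {
        var session = MvcApplication.SessionFactory.OpenSession();
        session.BeginTransaction();
        NHibernate.Context.CurrentSessionContext.Bind(session);
    }

    public override void OnActionExecuted(ActionExecutedContext filterContext)
    {
        // Note: this runs after the action but *before* view rendering --
        // exactly the lifetime the team wanted to extend by moving the
        // equivalent calls into the request filter chain instead.
        var session = NHibernate.Context.CurrentSessionContext
            .Unbind(MvcApplication.SessionFactory);
        try
        {
            if (filterContext.Exception == null)
                session.Transaction.Commit();
            else
                session.Transaction.Rollback();
        }
        finally
        {
            session.Dispose();
        }
    }
}
```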

Con1: Deferring code to View Rendering adds complexity.
I have argued that having the session active outside the controller makes code harder to maintain and debug, because the session is live during view rendering. The response was that forcing a controller action to pre-load everything the view *might* need, and pushing a dumb model, prevents the view from optimising what it actually loads. I don’t believe the view rendering should have such an impact on code execution, but it remains a valid point of contention.

Con2: Loss of control.
When using AOP to manage the session lifetime you have much less control over failure handling. Because the failure occurs outside the business-logic context, you can’t easily use business logic to handle it. If the standard policy is simply to direct the user to an error page or similar, this behaviour is completely valid. However, if you need to act on the failure (such as attempting to resolve optimistic-concurrency issues automatically), you can’t.

Con3: Loss of traceability.
When using a long-running session there is no traceability between the persistence and the application logic that made the entity changes. If an unexpected query appears when the unit of work commits, you can’t immediately identify which code caused it.

Con4: Unexpected change tracking.
Having a long-running (and potentially ‘invisible’) session exposes you to unexpected behaviour, because the session tracks all changes to the underlying entities. If you load an entity from the data access layer, then make some changes for display purposes (perhaps changing a Name property from “Jason Leach” to “LEACH, Jason”) before passing the entity to the view, the unit-of-work cleanup will persist that change to the database, because all of your changes are being tracked.
A less obvious example: suppose you have a Parent entity with a list of Children, and in a particular controller action you want the parent and only a subset of the children. You might select the parent and all children, then do something like Parent.Children = Parent.Children.Where(childFilter).ToList(). Depending on your configuration, this is likely to delete all of the children not matching the filter when the unit of work completes. Oops.
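The collection-filtering pitfall can be seen in isolation with plain collections; the classes below are made up for illustration, and the comments describe what an NHibernate-tracked version of the same code would do:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-ins: under NHibernate these would be mapped entities,
// and Parent.Children would be a change-tracked persistent collection.
class Child { public string Name; public bool Active; }
class Parent { public IList<Child> Children = new List<Child>(); }

class Demo
{
    static void Main()
    {
        var parent = new Parent
        {
            Children = new List<Child>
            {
                new Child { Name = "A", Active = true },
                new Child { Name = "B", Active = false },
            }
        };

        // Looks like harmless display-only filtering...
        parent.Children = parent.Children.Where(c => c.Active).ToList();

        // ...but with an open session and a cascade such as
        // all-delete-orphan, flushing the unit of work would now DELETE
        // child "B" from the database, because the tracked collection
        // has lost a member.
        Console.WriteLine(parent.Children.Count); // 1
    }
}
```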

Cons 3 and 4 are direct side effects of leaving a session open longer. In the Parent->Child example, you would ideally have the session active only while doing the initial load of the Parent and Child entities, then close it before manipulating the entities. Obviously you should be mapping entities to ViewModels and never manipulating the entities themselves, but it is a very easy mistake to make. Coupled with con 3, it can be near impossible to identify in a complex unit of work.

Conclusion
As long as you are careful about what you are doing, and not arbitrarily altering domain model entities while a session is open, long-running sessions are not a problem. However, there is significant potential for issues if your devs don't understand the risks.
And as long as you are happy with standard error/failure or UI-driven resolution workflows, AOP for unit-of-work management is acceptable, again provided the risks and limitations are acknowledged.

 

Thursday, March 20, 2014

Little Things

As I take on more architectural responsibility I write far less code.  I do however provide code samples and give advice on how to solve certain issues.

If a sample I provide has a flaw that is obvious in the integrated solution, if not in the code sample itself, it is a major failing on my part not to have identified it. I hate that.

When the flaw is obvious and the fix equally so, is that still my failing, or a failing of the developers for blindly following a code example without reviewing it or taking the effort to understand it?

Probably both.