Site to Site Replication only works for a few hours in the morning (each morning)


We have been fighting an odd Active Directory replication issue for over a month, and I am hoping someone can provide insight. We have 5 AD servers in the following orientation...

Site HQ
- prime running Windows 2008 R2
- ad2 running Windows 2008 R2

Site Colo
- ad3 running Windows 2008 (not R2)
- ad3new running Windows 2008 R2

Site Branch
- ad4 running Windows 2008 R2

The domain is at the Windows 2008 functional level.

There are site-to-site VPNs between the 3 sites, and IP inter-site transport site links are defined for the 3 possible connections, each with a cost of 100 and an interval of 15. Each IP site link is configured with a schedule of available all day long.

Every day the following sequence of events happens...

* Somewhere between 6:30 and 7:30am the servers start to sync with each other perfectly. We can make AD changes and they replicate across all servers without issues. During this time repadmin commands work across all servers.

* Typically somewhere in the 10:30 to 11:30am time frame we start to get errors replicating data between the HQ and colo sites. This manifests as event 1232 (a call timeout from the DC RPC client) and event 1925 from the KCC. Additionally, repadmin commands fail when attempting to connect to the branch servers.

* For the rest of the day, intra-site replication between prime and ad2 works fine, and periodically the branch AD server is updated as well. The colo servers remain unreplicated and continue to log errors for the remainder of the day. While replication is down, the ability to ping and remote desktop between the servers is fine, so if there is a network hiccup when this happens, the network is then stable for hours without the sites recovering.

* Magically, the next morning around 6:30 to 7:30am the servers are able to replicate without issue, and after 3-5 hours of immediate replication it happens again.
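To pin down exactly which links fail and when, the output of `repadmin /showrepl * /csv` can be captured on a schedule and summarized. The following is only a rough sketch: the column names are assumed to match the usual CSV header that repadmin emits, the function name `failing_links` is made up for illustration, and the sample rows (including the domain name and the status code) are invented, not taken from our environment.

```python
import csv
import io


def failing_links(csv_text):
    """Return (destination, source, failures, last_status) for each
    replication link that reports at least one failure.

    Assumes the header row produced by `repadmin /showrepl * /csv`.
    """
    failing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        failures = int(row.get("Number of Failures") or 0)
        if failures > 0:
            failing.append((
                row["Destination DSA"],
                row["Source DSA"],
                failures,
                row["Last Failure Status"],
            ))
    return failing


# Invented sample data, for illustration only.
SAMPLE = """showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,PRIME,"DC=example,DC=com",HQ,AD2,RPC,0,0,2013-01-01 11:00:00,0
showrepl_INFO,HQ,PRIME,"DC=example,DC=com",COLO,AD3,RPC,12,2013-01-01 11:45:00,2013-01-01 10:20:00,1722
"""

for dest, src, count, status in failing_links(SAMPLE):
    print(f"{src} -> {dest}: {count} failures, last status {status}")
```

Running this every few minutes and keeping the output would show whether every HQ/colo link fails at the same moment or whether they drop off one at a time.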

As stated above, there are site-to-site VPN connections between the 3 sites, and they are actively monitored by PRTG. These connections remain open all day long. The site topology has the colo servers attempting to replicate with the HQ servers, and both sites have 100Mb data connections that remain active during the entire time. Additionally, PRTG bandwidth monitoring shows these links have no spikes in traffic anywhere near the max capacity of the links during the time the outages begin, nor during the rest of the day.
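Since ping and remote desktop keep working while replication fails, a check aimed at the RPC endpoint mapper (TCP 135) is closer to what replication depends on than ICMP or bandwidth monitoring. Below is a minimal sketch of such a probe. The helper names `tcp_port_open` and `probe_once` are hypothetical, and the host names in the usage comment are just our server names used as placeholders. Note that a successful TCP connect does not prove an RPC call will complete, so this can only narrow things down.

```python
import socket
import time


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_once(hosts, port=135, timeout=3.0):
    """Probe each host once and return (timestamp, host, reachable) tuples."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return [(stamp, host, tcp_port_open(host, port, timeout)) for host in hosts]


# Example usage (placeholder host names), e.g. from a scheduled task:
#   for stamp, host, ok in probe_once(["ad3", "ad3new"]):
#       print(stamp, host, "open" if ok else "FAILED")
```

Logging the results with timestamps alongside the 1232 and 1925 events would show whether port 135 itself becomes unreachable or whether the failure happens later in the RPC conversation.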

Does anyone have insight into why these servers stop communicating with each other at the same time every day and report errors? Why do they magically start to work again each day without any changes being made to the network or AD configuration?

This has been going on for over a month now. When it first started to happen we had 1 Windows 2008 server and 2 Windows 2003 servers in HQ. We phased out the Windows 2003 servers and upgraded the functional level to Windows 2008, but that did not solve the problem. We then tried putting a new Windows 2008 R2 server out at the colo site, hoping to see whether the issue was limited to the other server and only 1 server was impacted. Both appear to be having connectivity issues at the same time.

It is as if there is 1 hung connection blocking the other syncs to the site, and each morning the bottleneck is released.

Thank you in advance for any direction you can provide.


Finally, the root cause has been found. The Juniper firewall at the colo, running the latest and greatest firmware (ScreenOS 6.3.0r18.0), was the culprit.

The firewall uses ALG (Application Layer Gateway) objects to interrogate calls for things like FTP that use dynamic ports. One of them is MS-RPC, and it has a serious bug in it that can cause it to freeze traffic for periods of time and then recover on its own. Once we knew it was the MS-RPC ALG, we found several postings that discuss similar things. Some references and articles indicated it was related to higher port numbers, while others indicated it was broken with Windows 2008 and had been fixed recently.
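For context on the "higher port numbers" remark: MS-RPC clients first contact the endpoint mapper on TCP 135 and are then handed a dynamically assigned port, and the default dynamic range moved from 1025-5000 on Windows Server 2003 to 49152-65535 on Windows Server 2008 and later. The small sketch below just classifies a port against those defaults. The function name `describe_rpc_port` is made up, and a server whose dynamic range has been customized would not match these values.

```python
# Default dynamic port ranges (a server can be configured differently).
LEGACY_RANGE = range(1025, 5001)     # Windows Server 2003 and earlier
MODERN_RANGE = range(49152, 65536)   # Windows Server 2008 and later

ENDPOINT_MAPPER_PORT = 135


def describe_rpc_port(port):
    """Classify a TCP port relative to the default MS-RPC port usage."""
    if port == ENDPOINT_MAPPER_PORT:
        return "endpoint mapper"
    if port in MODERN_RANGE:
        return "dynamic (2008+ default range)"
    if port in LEGACY_RANGE:
        return "dynamic (2003 default range)"
    return "outside default RPC ranges"


for p in (135, 3389, 49155):
    print(p, describe_rpc_port(p))
```

This is why a firewall either needs an ALG that follows the endpoint mapper conversation or a policy that permits the whole dynamic range.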

Even though the site-to-site VPN was set to allow "any" traffic, the TCP proxy mechanism of the firewall was still using these objects.

One solution would have been to disable the MS-RPC ALG (and possibly the DNS ALG), but we ended up going with a solution that updated the tunnel's firewall policy to use a custom "all ports" service object and to ignore the application setting in the policy. The UI does not allow you to select "ignore" with the service "any", but it does allow it when you use a custom service. So we created a custom service that permitted everything on the site-to-site links. This caused traffic to bypass the ALG objects entirely, and it started to flow again and has for quite some time.


