We are experiencing a small amount of packet loss per node in our network, which runs pure OSPF/BGP over dark fiber.
I believe this has to do with BGP updates affecting routed traffic. The boxes (CCR2216s, CCR1072s and CCR2004s) are only pushing a couple of gigabits each.
We have about 2M routes in the tables from various transit and peering sources, with 20 BGP speakers and 80 OSPF speakers carrying close to 1000 LSAs internally.
Ping loss is approximately 0.2-0.3% per hop, and since we have a lot of hops it adds up.
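For a rough sense of scale, assuming loss is independent per hop: at 0.25% per hop, ten hops compound to about 1 - (1 - 0.0025)^10 ≈ 2.5% end to end.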
Previously I thought this was related to the CCR2004s, but we still see it now that we have replaced some of them with CCR2216s to get rid of the issue.
Is anyone else seeing something similar?
We also planned to replace our CCR2004-1G-12+2XS units with CCR2216s because of the packet loss we are getting. We hoped it might go away with the more powerful CCR2216 and blamed the loss on the massive number of BGP sessions we run. Good (or bad) to know that the CCR2216 might not be the cure after all.
We are operating a ring of 5 CCR2004s, all connected to two internal bird route reflectors. Two are connected to one upstream provider (full IPv4 and IPv6 tables), one is connected to another upstream provider and to AMS-IX (110 BGP sessions), and another one is connected to DE-CIX (180 BGP sessions). The last one just serves our office and is only connected to the route reflectors. Our backbone is held together by OSPF.
What is really strange: there are days where everything works fine. But then the DE-CIX router in particular sometimes starts spiking in CPU load, which leads to massive packet loss. VoIP simply does not work on those days, and I can't figure out the reason for it.
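Next time it spikes, something like this should show which process is burning the CPU; a rough sketch, assuming RouterOS v7 (exact output columns vary by version):

# Show per-process CPU usage during the spike
/tool profile cpu=all

# Show the routing daemon's worker processes and their load
/routing/stats/process print interval=1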
We already tried replacing one of the CCR2004s with a CCR2216, but that led to unreachable adjacent routers. So for now we are planning further experiments with the CCR2216, full tables and adjacent routers.
Another strange thing: when our DE-CIX router reaches a certain number of peers (which varies slightly), the routing process seems to start behaving oddly. It constantly spits out route changes to our route reflectors, which are then distributed throughout our network. Our BGP traffic then goes from the normal 150 kbit/s up to a constant 2 Mbit/s just for announcements. At first we thought one of our peers was doing something malicious, but after enabling the peers one after another it turned out to be totally random which peer triggers the behaviour. It seems to be a function of the number of active peers and received routes, but I haven't figured out yet how to reliably trigger the problem. It is quite difficult to reproduce in the lab because I do not have that number of real peers there. I could feed my lab with multiple full tables, but that does not simulate the massive number of full-table subsets with duplicate routes you get at an IXP.
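A rough way to quantify the churn is to watch the BGP control traffic towards the route reflectors; a sketch, assuming RouterOS v7 (interface name and reflector address are placeholders, not our real ones):

# Watch TCP/179 traffic from this router towards one route reflector
/tool torch interface=sfp-sfpplus1 port=179 src-address=192.0.2.5/32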
How did you configure affinity on your BGP sessions? I haven't really figured out how exactly it behaves. I tried using the same number for sessions that should be handled by the same process; some of those sessions do run in the same process, but others end up in an extra process. It seems kind of random. The only setting that works as expected is "alone". Even "remote-as" groups sessions with different remote ASNs together in one process.
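For reference, this is roughly what I tried and what I expected it to do; a sketch, assuming RouterOS v7 (connection names are placeholders):

# Sessions sharing a numeric affinity should land in one routing process
/routing/bgp/connection set decix-peer-1 affinity=12
/routing/bgp/connection set decix-peer-2 affinity=12

# "alone" gives the session its own process (the only mode that behaves
# as expected for me)
/routing/bgp/connection set transit-1 affinity=alone

# Verify which process each session actually ended up in
/routing/stats/process print interval=1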