HowTo: Load Balancing multiple Internet connections

A frequent request we receive is how to use a MikroTik router to get more bandwidth by 'joining' multiple internet feeds together. There are a number of different methods, but it's worth clarifying first that the term 'line bonding' is not the same as 'load balancing'. With line bonding we actually send each packet up the multiple lines in 'round robin' fashion, and at the ISP end the packets are joined back together again into a single circuit. This is a service that can only be carried out at a data centre or ISP, and all lines must be connected to a common interface sharing the same IP address. Any public IP addresses used at the remote site must be routable over any one of the multiple lines. That is not so easy when the lines may come from completely different service providers.

Therefore some third-party 'line bonding' services actually use VPNs to carry the data from the client's remote site, terminating at one common VPN end point in their data centre. The traffic is 'joined back together' by bridging those VPNs, and as the remote site's traffic leaves the provider's router in the data centre with a single common IP address, internet access is assured. However, this relies on all the independent line providers having near-equal latency, because the biggest flaw with this approach is out-of-order packets. TCP can cope with a certain amount of packets arriving out of order, but services operating over UDP (e.g. SIP VoIP) do not work well when packets arrive out of order. So if one line is slower than another, packets can arrive out of order at the data centre, and in the opposite direction, packets sent from the service provider down the VPN lines to the client can also arrive out of order.

Load balancing uses a completely different idea. Instead of 'aggregating' all the lines together and sending the packets of any one data connection up and down the lines in equal, round-robin proportions, we break the traffic down into unique connections and push all the packets of any one connection up just one line. Each new connection goes up what appears to be a randomly chosen line. Over time, and with a sufficiently high number of connections, the traffic will very roughly balance across all the lines. It may be that one connection carries much more traffic than another, so the traffic will not be spread exactly equally across the lines. However, we are still spreading the 'connection load' across all the lines, so with enough connections from a large enough user base the traffic load is, on average, very roughly equal across the multiple lines. The big advantage of splitting the connections across the lines this way is that the lines do not all need to come from the same provider, nor do we need to be concerned about varying latency between them. So, how can we do this?

One system often used on Linux-based systems is ECMP (Equal-Cost Multi-Path). By adding a default gateway for ALL the multiple lines, all with an equal cost, the connections are evenly distributed across the lines. This seems easy, and indeed it does appear to work. However, it has been well documented that every 10 minutes the Linux kernel, which is where the routing engine resides, tears down all the connections actively going up a chosen route and may, on re-connection, push the traffic up a completely different route, thus changing the source address if NAT is in use. Many HTTPS servers do not like that and may complain to the end user that they have been logged out due to a security issue. Not good!

MikroTik therefore designed a new method called PCC (Per Connection Classifier) to get around this problem. It allows us to mangle packets and arrange for them to use different routing tables, one per line. In this way we can have, say, three lines with three default routes, and the PCC mangle rules we create will force each connection into using a different routing table. No connections get torn down, so it is much more reliable. OK, let's show you how to do this with some code…

The first thing to note is that bridge interfaces are designed for transparently bridging Layer 2 traffic, so by default the traffic passing between the bridge ports and the bridge is not processed by the Layer 3 firewall rules. To force Layer 2 traffic on a bridge to be processed by the Layer 3 firewall rules, we need to enable this explicitly.

/interface bridge settings
set use-ip-firewall=yes

If the traffic on the ports is encapsulated with PPPoE, then this will also need to be explicitly allowed to go into the Layer 3 Firewall rules as well.

/interface bridge settings
set use-ip-firewall-for-pppoe=yes

We need to ensure that any traffic going to our local and internal LAN interfaces on the router bypasses all the 'line balancing' rules. This is traffic destined for the internal network and must not be mangled by the rules that split connections up the multiple backhaul lines. Therefore we must first set a rule for all internal traffic to bypass the mangle rules that follow, using an action of 'accept'. Assuming all internal traffic is on 192.168.88.0/24 and all internal interfaces are on the default bridge 'bridge-local', we would use this:

/ip firewall mangle
add chain=prerouting dst-address=192.168.88.0/24 in-interface=bridge-local action=accept 

The above CLI command is in the 'prerouting' chain and is therefore processed before any routing decisions are made, so it will capture both traffic to the router and traffic being forwarded by the router elsewhere. The rule matches all traffic destined for the local internal network (192.168.88.0/24) which is coming INTO the local LAN interface (in-interface=bridge-local). This can only be local L2 traffic entering the router interface. As this is a bridge interface, it may be traffic destined for one of the other ports in the bridge and therefore must not be processed by the rules that follow. It therefore has an action of 'accept'. In this way the rule fires but effectively does nothing except stop any further processing of those packets, because once a mangle rule is actioned, the rules below it are not processed. The only exception is that for certain action types 'passthrough' can be enabled, which allows further rules to be processed.

For each of the public WAN interfaces, the reply traffic for any connections made directly to them from the public WAN must always go out the same way it came in. This could be traffic such as Winbox or SSH to the public WAN IPs. We do not want the reply traffic to be sent out of a completely different interface from the one it came in on just because of our clever PCC mangle rules. Traffic going into the router itself goes through the 'input' chain and traffic leaving the router itself goes through the 'output' chain, so we use two sets of rules, one in the input chain and one in the output chain, for each public interface we are load balancing. In this example we will demonstrate load balancing across two lines.

We must first mark the connections, and only then mark them with routing marks. (This is because we are not mangling individual packets, but whole connections.) For these mangle rules we need to identify the incoming public WAN traffic and, to save CPU processing, mark the connection with a 'connection mark' only if it hasn't already been marked (the connection mark 'no-mark' is a reserved name meaning that no connection mark has been applied yet):

/ip firewall mangle
add chain=input  in-interface=pppoe-out1 connection-mark=no-mark action=mark-connection new-connection-mark=WAN1 passthrough=no
add chain=input  in-interface=pppoe-out2 connection-mark=no-mark action=mark-connection new-connection-mark=WAN2 passthrough=no

Having now got the incoming new connections marked with a connection mark of WAN1 or WAN2 as appropriate, we now need to ensure that the outgoing connections will go the same way back out again. We do this by giving all those outgoing connections a routing-mark and it will be those different routing marks that will force the traffic to use the different routing tables, one per WAN interface.

/ip firewall mangle
add chain=output  out-interface=pppoe-out1 connection-mark=WAN1 action=mark-routing new-routing-mark=WAN1 passthrough=no
add chain=output  out-interface=pppoe-out2 connection-mark=WAN2 action=mark-routing new-routing-mark=WAN2 passthrough=no

We now need to perform the PCC 'magic' that will split the internal connections equally amongst all WAN lines. The following rules take the outgoing internal LAN traffic and split its connections two ways: every unique combination of source and destination address is given one of two connection marks (and will, in turn, go out of a different WAN connection).

There are a number of different PCC classifiers that determine how to split the connections. One popular choice is 'destination IP and port'. However, with that classifier a client device may open many connection threads to a remote server, and each one will go out of a different WAN connection. With connections to HTTPS/SSL-based websites, such as banking sites, this may lead to major problems. We are therefore going to use just the source and destination addresses as the classifier to split the connections between the two lines. Once again we use the 'no-mark' connection mark test: once a connection has been marked there is no need to mark it again, which saves CPU power.

We can only apply routing marks in the output and prerouting chains. We need to identify traffic that is being forwarded by the router, not traffic destined for it, and as the output chain only carries traffic generated by the router itself, the prerouting chain is the only suitable one. However, the prerouting chain actually carries two types of traffic: forwarded traffic (which we want to capture) and traffic destined for the router itself (which we do not want to capture for marking).

Therefore, to mark only outbound forwarded traffic we can use a rule that tests for 'dst-address-type is NOT local', i.e. that the traffic is not destined for an address which is on the router itself. Note also that the PCC mangle rules state either ':2/0' or ':2/1'. The classifier computes a value from the chosen fields (here, both addresses), divides it by the denominator (2) and matches on the remainder: either 0 (the '/0' part) or 1 (the '/1' part). In this way the traffic is split into two unique sets of connections by the PCC rules. If you were splitting the traffic three ways, the three rules would use '3/0', '3/1' and '3/2', matching remainders of 0, 1 and 2 after division by 3. For example, a classifier value of 3 divided by 3 leaves remainder 0 (matched by '3/0'), a value of 4 divided by 3 leaves remainder 1 ('3/1'), and a value of 5 divided by 3 leaves remainder 2 ('3/2').

/ip firewall mangle
add chain=prerouting connection-mark=no-mark dst-address-type=!local  in-interface=bridge-local per-connection-classifier=both-addresses:2/0  action=mark-connection new-connection-mark=WAN1
add chain=prerouting connection-mark=no-mark dst-address-type=!local  in-interface=bridge-local per-connection-classifier=both-addresses:2/1  action=mark-connection new-connection-mark=WAN2

By default 'passthrough=yes' is enabled on mangle rules such as those above. Now that we have marked the two-way split in connections, we need to attach a routing mark to each of the two sets of connections so that we can direct them to two different routing tables rather than the usual 'main' routing table (we have yet to create the two new routing table entries, WAN1 and WAN2, which will use these routing marks):

/ip firewall mangle
add chain=prerouting  connection-mark=WAN1 in-interface=bridge-local action=mark-routing new-routing-mark=WAN1  passthrough=no
add chain=prerouting  connection-mark=WAN2 in-interface=bridge-local action=mark-routing new-routing-mark=WAN2  passthrough=no

The traffic will now be able to be split two ways, with each alternate connection being redirected to two different routing tables, WAN1 and WAN2.  These two new routes are created as follows:

/ip route
add dst-address=0.0.0.0/0  gateway=pppoe-out1 routing-mark=WAN1 
add dst-address=0.0.0.0/0  gateway=pppoe-out2 routing-mark=WAN2

Traffic generated by the router itself, however, will not have any routing marks, and 'next hop' calculations are only performed by the 'main' routing table, so we still need 'normal' routing table entries for the default routes out of the multiple WAN connections. We need two in this case. We give them different distances so that if one WAN line fails, the traffic will go out of the second line instead.

/ip route
add dst-address=0.0.0.0/0  gateway=pppoe-out1 distance=1
add dst-address=0.0.0.0/0  gateway=pppoe-out2 distance=2

If the routing table gateways are remote IP addresses, rather than dynamic PPPoE interfaces as in the example above, ensure you also add 'check-gateway=ping' to each of those routes so that the main routing table can detect the state of the upstream gateway and correctly switch to routing out of the other interface if a gateway goes down!
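As a sketch, with static upstream gateways the default routes would become the following (the addresses 198.51.100.1 and 203.0.113.1 are placeholders for your own providers' gateway IPs):

/ip route
add dst-address=0.0.0.0/0 gateway=198.51.100.1 distance=1 check-gateway=ping
add dst-address=0.0.0.0/0 gateway=203.0.113.1 distance=2 check-gateway=ping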

If the router will be applying a NAT Masquerade function and you do not have a static IP on the public WAN interfaces, use an action of ‘masquerade’.  If however you have static IPs assigned to each WAN interface, it is better to use an action of ‘src-nat’ and to input the correct specific IP address assigned for that interface.

Example for dynamic IPs:

/ip firewall nat
add chain=srcnat  out-interface=pppoe-out1 action=masquerade
add chain=srcnat  out-interface=pppoe-out2 action=masquerade

Example for static IPs of 1.2.3.4 and 4.3.2.1:

/ip firewall nat
add chain=srcnat  out-interface=pppoe-out1 action=src-nat to-addresses=1.2.3.4
add chain=srcnat  out-interface=pppoe-out2 action=src-nat to-addresses=4.3.2.1

Once all the above rules are applied, you should begin to see traffic being very roughly distributed across the multiple lines. Remember that as the balancing is based on connections, rather than raw packets, the traffic will never be completely even. Any one client's connection may have a high traffic demand, and that will push just the one physical WAN connection to a greater throughput figure than the other. This is normal and to be expected. The more client connections made, and the more WAN lines added into the 'load balancing' set, the more these 'irregularities' even out over time. Also, if any one line is slower than another, any connections made up that slower WAN link will give the client a slower experience. Remember, this is not 'line bonding' but 'load balancing'! If possible, try to join together similar types and speeds of WAN lines, but luckily with PCC it's not essential. The example below shows the internal LAN connected to ether1 with the upstream WAN connections on pppoe-out1 and pppoe-out2. The internal network is uploading approximately 838k from various client connections and the load is roughly balanced across the two lines, at 401k and 426k each:

Load balancing multiple WAN lines

If you need any assistance to line balance more than 2 lines together or just need some help with your RouterOS config, we also offer remote and on-site consultancy. See here for more details.

23 Comments

  1. Nick Alcock says:

    This ‘per connection classifier’ sounds similar to the way nexthop routes already work in Linux, where each flow is sent to a single one of the applicable multipath routes (weighted by the multipath route’s weight), modulo of course things like MTU changes or routes going down. Anything else does indeed lead to catastrophic levels of packet reordering. However, distinct flows can go out of different endpoints, which leads to precisely the problem you describe, if you’re unlucky enough to be using an ISP which doesn’t allow those endpoints to share the same IP address, and a website stupid enough that it doesn’t realise that multihomed hosts exist and that IP addresses are bound to network interfaces, not computers.

    As for the ‘every ten minutes’ stuff: sounds like you’re talking about the routing cache’s GC interval. The routing cache was removed in v3.6, released more than two years ago… which means the change can now happen between two adjacent flows initiated milliseconds apart!

  2. Jack Gough says:

    Great article, thanks!

    Can you explain why it is “better” to use src-nat over masquerade ?

    • LinITX Trainer says:

The 'masquerade' and 'src-nat' actions perform the same NATting function, in that both change the source address as the packet exits the interface on the router. The difference is in how each chooses what IP address to change it to. Masquerade will change it to whatever IP address the exit interface has. Not a problem if it only has one IP.
However, if the interface has multiple IPs, the src-nat action type allows you to specify the exact IP you wish all packets to be NATted to.

  3. Tony says:

    Thank you for information that is correct and works.
    This works perfectly!!!!!!!!!!!!!!!!!!!!!!!!!

  4. RogerCWB says:

    Hi!

    There is something wrong with my config, because I have 2 PPPOE connections with my provider, but when I tested with speedtest.net I don’t have the double of speed and just one link is used.

  5. RogerCWB says:

    Ok, I solved my problem to have link agregation sum the bandwidth, but I had to change:

    per-connection-classifier=both-addresses

    to

    per-connection-classifier=both-addresses-and-ports

It works but I don’t know what it means.

    • LinITX Trainer says:

      As explained in my article above, using “per-connection-classifier=both-addresses” ensures that SSL connections work. Using “per-connection-classifier=both-addresses-and-ports” will mean that traffic is now shared across all lines and I agree speedtests will show a greater aggregate speed, however SSL connections to websites such as banks will fail as they will see multiple SSL connections coming from multiple IPs. I am aware that banks are especially sensitive to this. Therefore using “per-connection-classifier=both-addresses” is the most reliable method.

  6. Shyam Thapa says:

    Nice Article Thank you .

  7. Nawir says:

    If I have Dual WAN load balance Active/Active.
    I believe for Internet Banking, I need to make sure the session coming from user source ip > ibank > back to user source ip.
    But how to proof that the session follow that behavior.
    If I am using traceroute from user pc. It only proof one way session.
    If I am using traceroute.org. That only proof another new session from external to internal.
    tq

    • LinITX Trainer says:

The mangle rules supplied in the article send each connection out of one single WAN port based on matching source and destination addresses.
Therefore, if the source and destination IPs remain the same, each new connection will go out of the same WAN port every time.

  8. DimitrisV says:

    Hello There

    i have one question

    in this rule
    /ip firewall mangle
    add chain=prerouting dst-address=192.168.88.0/24 in-interface=bridge-local action=accept

    how is it possible the traffic that has destination ip address from the network 192.168.88.0/24 to be incoming or inbound traffic to the bridge-local interface?
    From my understanding of networks Packets that have dest ip address from the 192.168.88.0/24 are coming to the local network and thus they are outbound or outgoing from the bridge-local interface.

    Let me know if i understand something the wrong way

    Thank you for your great article

    • LinITX Trainer says:

      If a client device on the internal network 192.168.88.0/24 sends any packet onto the LAN (unicast, multicast or broadcast) the packet may arrive on the bridge interface and if the bridge is configured for “use IP firewall”, it will require forwarding to another port via the Layer3 IP firewall. This ensures the packet is not processed by the PCC rules and thus forwarded to one of the WAN ports in error.

  9. Osa Hady says:

    Great! You’ve perfectly explained the idea could you paste a YouTube video.

  10. Osa Hady says:

    What is the purpose of pppoeout1 and pppoeout2 and can we do the job without them.

    • LinITX Trainer says:

      pppoe-out1 and pppoe-out2 are examples. If your WAN connections are a different type of interface (e.g. ethernet), use them instead.

  11. Bill says:

    what if the connection was marked and the gateway is offlined? e.g. the connection is marked “WAN1_Out” but the WAN1_Out gateway is offline, will the traffic goes to the other route?

    • LinITX Trainer says:

      you need to add backup routes in each of the WAN routing tables with a higher distance. Then when the primary route fails for say, WAN1 all WAN1 traffic will instead go over WAN2
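As a sketch, the backup routes for the two marked routing tables might look like this (interface names mirror the article's example):

/ip route
add dst-address=0.0.0.0/0 gateway=pppoe-out2 routing-mark=WAN1 distance=2
add dst-address=0.0.0.0/0 gateway=pppoe-out1 routing-mark=WAN2 distance=2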

  12. Great article, Thank you for sharing it. thanks!

  13. fcforensic says:

    Hi, I saw your article and it seems to me well done. I wanted to put your attention to my problem, I realized a load balancing as described by you, using as one gateway my gateway to my CCR1016 router. The management system I use to share the band with my associates, Splynx, uses the router port on which I have addressed the load balance as gateway, the address pool is created within Splynx. Generating a PPPoE connection between the radius and another port of the CCR1016. In practice everything works perfectly except that traffic from individual PPPoE users is routed only to Wan1 and never from the other 3 Wan. I think the reason for this is because the router does not identify the different sources as the traffic is encapsulated within the PPPoE.
    Do you have tips on how to solve this problem many thanks

    • LinITX Trainer says:

      perhaps split the operation of the load balancing onto a new router? Although MikroTik RouterOS can be programmed to just about do everything bar make the coffee, this does not always mean you should make a single router perform every network task. Try splitting the roles to two routers. 🙂

  14. Andreas Zwinzscher says:

    Thank you for this article. Got I working without problems. I have got just one question:

    How exactly can I define, that a special connection (for example every connection from a special client in LAN) should always use pppoe-out1?

    I tried to experiment with prerouting and use the LAN-IP as scr.address but it did not work.

    • LinITX Trainer says:

      add a new mangle rule in pre-routing chain above the pair of PCC rules, that tests for your device’s src-address and set the action to mark the packet with a connection mark of WAN1 (leave passthrough enabled). As the packet will have a pre-existing connection mark (WAN1) the pair of PCC rules immediately below will not do anything. However, the routing mark rules will still function and thus force the single client to go out of WAN1
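Such a rule might look like the following sketch (192.168.88.50 is a hypothetical client address; the rule must sit above the two PCC rules, and passthrough=yes is the default for mark-connection):

/ip firewall mangle
add chain=prerouting src-address=192.168.88.50 in-interface=bridge-local connection-mark=no-mark action=mark-connection new-connection-mark=WAN1 passthrough=yes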

  15. Michal says:

    As LinITX Trainer mentioned using “per-connection-classifier=both-addresses” I also faced problem with reply loging in on one forwarded port. For most of cases it was not working. After choosing both it’s working well. However you mentioned that it can cause SSL problem now. Is there any other solution or possibility why this port was blocked? I forwarded my port with NAT rule dstnat specifying incoming interface to local IP address.

    Thank you for support.

    Perfect article btw. Thank you !
