October 19, 2017

My CCIE Training Guide

Domain 1 Security and Risk Management - Part 1

The first domain of the CISSP holds 12 sections and discusses aspects of risk management concepts, tools, laws, and standards around People, Process, and Technology. Here are some short highlights from my notes:

Understand and apply concepts of confidentiality, integrity, and availability 

CIA (Confidentiality / Integrity / Availability) - if I were to put these in my own words, I would say that Confidentiality is the assurance that an asset is kept secret from any unauthorized system and/or person.
  • How To Protect: the most common way is the use of encryption; data can be encrypted using multiple different techniques.
Integrity is the assurance that the asset you have was not altered in any way, shape, or form by an unauthorized system and/or person.
  • How To Protect: that is more complex, but can be done by introducing multiple mechanisms, together referred to as AAA (Triple-A from the networking world, or the 5 A's of the ISC2 world): Identification, Authentication, Authorization, Audit, Accounting.
Availability is making sure an asset is obtainable (I had to look for another word :-)) when needed.
  • How To Protect: at a high level, by assuring service/asset health and stability.
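To make the integrity point concrete, here is a minimal sketch (my own illustration, not from the ISC2 material) of detecting tampering with an HMAC, assuming a secret key shared between the parties:

```python
import hmac
import hashlib

SECRET_KEY = b"shared-secret"  # hypothetical key, agreed on out of band

def sign(data: bytes) -> str:
    """Produce an integrity tag for the data."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()

def verify(data: bytes, tag: str) -> bool:
    """Return True only if the data was not modified since signing."""
    return hmac.compare_digest(sign(data), tag)

tag = sign(b"quarterly report")
print(verify(b"quarterly report", tag))   # True: unmodified data passes
print(verify(b"quarterly REPORT", tag))   # False: any tampering fails
```

Note that this protects integrity only; the data itself is still readable, so confidentiality would need encryption on top.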

Now, the CIA is often referred to as the CIA triad.

Note: the word asset was mentioned multiple times so that we get used to the terminology.
Asset: can be "data / person / company / resource / service..." or anything you can put a value on and that is worth protecting.
Google Definition: a useful or valuable thing, person, or quality.

What is AAA in the CISSP world?

Identification - The process of providing an identity to the next stage, authentication. In the world I come from, Identification and Authentication are part of the same process, as without one the other can't exist; however, for the sake of the CISSP, let's keep an open mind.
Authentication - Once the identifier is received, we need to be able to authenticate and make sure that this is indeed the account. There are different authentication methods like a password, PIN code, or biometrics (fingerprint)...
Authorization - After we have passed authentication, we need to provide limited access to resources according to our job requirements; providing too much may impact confidentiality and integrity, and providing too little may impact availability.
Audit - Auditing is a very important function; again, in my networking world it was part of accounting. The audit function provides monitoring and the ability to go back and look at who did what and when - a very important part of troubleshooting and a fundamental part of being able to prove non-repudiation.
Accounting - The ability to prove a subject's identity and track their activities, if needed, to later be presented in a court of law.
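The five A's can be tied together in a toy access-control flow. This is purely my own illustration (the user store, names, and actions are made up), but it shows where each stage fits:

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical user store: identifier -> (password hash, allowed actions).
# A real system would use a salted KDF, not a bare SHA-256.
USERS = {
    "alice": (hashlib.sha256(b"s3cret").hexdigest(), {"read"}),
}
AUDIT_LOG = []  # accounting relies on this trail of who did what, and when

def access(user_id: str, password: str, action: str) -> bool:
    record = USERS.get(user_id)                          # identification
    ok = (record is not None
          and hashlib.sha256(password.encode()).hexdigest() == record[0]  # authentication
          and action in record[1])                       # authorization
    AUDIT_LOG.append((datetime.now(timezone.utc), user_id, action, ok))   # audit
    return ok

print(access("alice", "s3cret", "read"))    # True
print(access("alice", "s3cret", "delete"))  # False: not authorized
print(access("mallory", "guess", "read"))   # False: unknown identity
```

Every attempt, successful or not, lands in the audit log, which is what later makes accounting (and non-repudiation arguments) possible.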

Alignment of security function to business strategy, goals, mission, and objectives  

First, maybe let's define what Governance is - according to the Google dictionary, it is the action of governing. Meaning?! If you own a company, or if you hold one of the C-level functions in a company, it would be expected of you to govern and lead the company on the path to success, and part of that is taking responsibility for providing company policies, goals, and mission statements.

Elements to remember with related to Governance:
  1. Corporate executives must be committed to the Security Plan - Due Care!
  2. Corporate executives are to define the mission statement and company policy.
  3. The CISO / CSO should not be subject to company politics and should avoid any possible conflict of interest.
  4. Company executives hold the highest responsibility for the company's security, and if they are careless, they may also be subject to personal legal action against them.
  5. The Security Plan is subject to Due Diligence; always be responsive to needed changes.

Organizational processes 

As in life, when you come to a crossroad there is higher risk: a crossroad increases complexity and involves cars moving on the same road in different directions, and introducing proper mechanisms like rules, signs, and lights will reduce the risk. The same goes for organizational changes: when purchasing a new company or systems, or, god forbid, when laying off personnel, the organization needs to be ready to face the implications:

  • make sure there is a well-elaborated and sorted plan
  • make sure all personnel and/or systems are informed and ready for the change
  • prepare a backup / restoration / rollback (you name it) plan
  • make sure you have a way to monitor and measure the change and identify any negative impact

Organizational roles and responsibilities 

Roles and responsibilities are highly important; to do your job well, especially in a large organization, you need to know what your duties are, what is expected from you, and how you can contribute to the goals of your organization.

Key Roles To Know and remember:

Data Owner - as the name suggests, the highest authority for making sure data security is in order; this will normally be a senior manager. The Data Owner is also responsible for classifying the data's security level.

Data Custodian - the person who is given the task of practically making sure data security is addressed as classified and according to the guidelines; normally this would be IT / IS.

Auditor - responsible for monitoring and making sure security policies are being followed and implemented, and for issuing periodic reports to be reviewed by senior management. In case the auditor discovers and reports issues, senior management must address them.

Senior Manager - has the top responsibility and liability for the organization's security; however, the implementation of security is a function that is delegated to security professionals.

User - every user in the organization has a role in keeping the corporate security policy by following the provided policies and procedures.

Due Care/Due Diligence

Due Care

It is the act of "caring" about the possible harm that a system or person might do to an asset!
  • The Data Owner (normally an organization executive) is obligated to Due Care.
Law: the conduct that a reasonable man or woman will exercise in a particular situation, in looking out for the safety of others. If one uses due care then an injured party cannot prove negligence. This is one of those nebulous standards by which negligence is tested. Each juror has to determine what a "reasonable" man or woman would do.

Due Diligence

It is an action performed in an iterative and repeatable manner, with steps taken to verify, monitor, and apply actions in order to preserve company policy and standards.
  • The Data Owner is obligated to make sure due diligence is conducted on a regular basis.
  • Data Custodians normally perform the due diligence in practice.
Google Definition: reasonable steps taken by a person in order to satisfy a legal requirement, especially in buying or selling something.

To be continued...

by shiran guez (noreply@blogger.com) at October 19, 2017 03:37 PM

ipSpace.net Blog (Ivan Pepelnjak)

Another DMVPN Routing Question

One of my readers sent me an interesting DMVPN routing question. He has a design with a single DMVPN tunnel with two hubs (a primary and a backup hub), running BGP between hubs and spokes and IBGP session between hubs over a dedicated inter-hub link (he doesn’t want the hub-to-hub traffic to go over DMVPN).

Here's (approximately) what he's trying to do:

Read more ...

by Ivan Pepelnjak (noreply@blogger.com) at October 19, 2017 06:00 AM

Networking Now (Juniper Blog)

Secure Data in a Device Driven World

IoT security has become one of those harrowing buzzwords over the past few years, as connected devices have gone from a seemingly innocent addition that increases convenience in your life to a potential avenue for attackers to steal or control your data. IoT shouldn’t be scary; it should propel us to take a fundamentally different approach to cybersecurity to ensure this new form of data collection isn’t exposing us to risk or causing us harm.


Recent research has shown that there are 8.4 billion IoT devices in use today – that number is expected to surpass 20 billion by 2020. This sheer magnitude and scale of new devices is one of the key issues leading to an increase in risk across our ecosystems. If we adopt a few best practices, we can help ensure that the data collected by these devices remains safe. Here are three ways to make that happen.

by Kevin Walker at October 19, 2017 01:37 AM

October 18, 2017

Networking Now (Juniper Blog)

Mobile Malware and Sky ATP

There has been a dramatic increase in attacks aimed at smartphones, tablets, and even "smart TVs", mostly targeting the Android ecosystem. Unlike Apple's iOS, Android allows users to use alternate app stores and to "sideload" arbitrary apps onto a device. There are entire marketplaces of "cracked" apps -- unauthorized versions of paid apps distributed for free -- and many thousands more apps that offer malicious payloads in addition to their advertised functionality. 


In this post, we'll look at a recent example of a "locker", an application that takes control of a device and demands a ransom payment.

by AsherLangton at October 18, 2017 08:56 PM

My Etherealmind
The Networking Nerd

Scotty Isn’t DevOps

I was listening to the most recent episode of our Gestalt IT On-Premise IT Roundtable, where Stephen Foskett mentioned one of our first episodes, in which we discussed whether or not DevOps was a disaster, or as I put it, a “dumpster fire”. Take a listen here:

<iframe allowfullscreen="true" class="youtube-player" height="359" src="https://www.youtube.com/embed/W7SKOWiVQm0?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="584"></iframe>

Around 13 minutes in, I have an exchange with Nigel Poulton where I mention that the ultimate operations guy is Chief Engineer Montgomery Scott of the USS Enterprise. Nigel countered that Scotty was the epitome of the DevOps mentality because his crazy ideas are what kept the Enterprise going. In this post, I hope to show that not only was Scott not a DevOps person, he should be considered the antithesis of DevOps.

Engineering As Operations

In the fictional biography of Mr. Scott, all he ever wanted to do was be an engineer. He begrudgingly took promotions but found ways to get back to the engine room on the Enterprise. He liked working on starships. He hated building them. His time working on the transwarp drive of the USS Excelsior proved that in the third Star Trek film.

Scotty wasn’t developing new ideas to implement on the Enterprise. He didn’t spend his time figuring out how to make the warp engines run at increased efficiency. He didn’t experiment with the shields or the phasers. Most of his “miraculous” moments didn’t come from deploying new features to the Enterprise. Instead, they were the fruits of his ability to streamline operations to combat unforeseen circumstances.

In The Apple, Scott was forced to figure out a way to get the antimatter system back online after it was drained by an unseen force. Everything he did in the episode was focused on restoring functions to the Enterprise. This wasn’t the result of a failed upgrade or a continuous deployment scenario. The operation of his ship was impacted. In Is There No Truth In Beauty, Mr. Scott even challenges the designer of the Enterprise’s engines that he can’t handle them as well as Scotty. Mr. Scott was boasting that he was better at operations than a developer. Plain and simple.

In the first Star Trek movie, Admiral Kirk is pushing Scotty to get the Enterprise ready to depart in hours after an eighteen-month refit. Scotty keeps pushing back that they need more time to work out the new systems and go on a shakedown cruise. Does that sound like a person that wants to do CI/CD to a starship? Or does it sound more like the caution of an operations person wanting to make sure patches are deployed in a controlled way? Every time someone in the series or movies suggested doing major upgrades or redesigns to the Enterprise, Scotty always warned against doing it in the field unless absolutely necessary.

Montgomery Scott isn’t the King of DevOps. He’s a poster child for simple operations. Keep the systems running. Deal with problems as they arise. Make changes only if necessary. And don’t monkey with the systems! These are the tried-and-true refrains of a person that knows that his expertise isn’t in building things but in making them run.

Engineering as DevOps

That’s not to say that Star Trek doesn’t have DevOps engineers. The Enterprise-D had two of the best examples of DevOps that I’ve ever seen – Geordi LaForge and Data. These two operations officers spent most of their time trying new things with the Enterprise. And more than a few crises arose because of their development aspirations.

LaForge and Data were constantly experimenting on the Enterprise in an attempt to make it run better. Given that the mission of the Enterprise-D did not have the same five-year limit as the original, they were expected to keep the technology on the Enterprise more current in space. However, their experiments often led to problems. Destabilizing the warp core, causing shield harmonics failures, and even infecting the Enterprise’s computer with viruses were somewhat commonplace during Geordi’s tenure as Chief Engineer.

Commander Data was also rather fond of finding out about new technology that was being developed and trying to integrate it into the Enterprise’s systems. Many times, he mentioned finding out about something being developed at the Daystrom Institute and wanting to see if it would work for them. Which leads me to think that the Daystrom Institute is the Star Trek version of Stack Overflow - copy some things you think will make everything better and hope it doesn’t blow up because you didn’t understand it.

Even if it was a plot convenience device, it felt like the Enterprise was often caught in the middle of applying a patch or an upgrade right when the action started. An exploding star or an enemy vessel always waited until just the right moment to put the Enterprise in harm’s way. Even Starfleet seemed to decide the Enterprise was the only vessel that could help after the DevOps team took the warp core offline to make it run 0.1% faster.

Perhaps instead of pushing forward with an aggressive DevOps mentality for the flagship of the Federation, Geordi and Data would have done better to take lessons from Mr. Scott: wait for appropriate windows to make changes and upgrades, and quit tinkering with their ship so often that it felt like it was being held together by duct tape and hope.

Tom’s Take

Despite being fictional characters, Scotty, Geordi, and Data all represent different aspects of the technology we look at today. Scotty is the tried-and-true operations person. Geordi and Data are leading the charge to keep the technology fresh. Each of them has their strong points, but it’s hard to overlook Scotty as being a bastion of simple operations mentalities. Even when they all met together in Relics, Scotty was thinking more about making things work and less on making them fast or pretty or efficient. I think the push to the DevOps mentality would do well to take a seat and listen to the venerable chief engineer of the original Enterprise.

by networkingnerd at October 18, 2017 06:11 PM

Dyn Research (Was Renesys Blog)

What Does “Internet Availability” Really Mean?

The Oracle Dyn team behind this blog have frequently covered ‘network availability’ in our blog posts and Twitter updates, and it has become a common topic of discussion after natural disasters (like hurricanes), man-made problems (including fiber cuts), and political instability (such as the Arab Spring protests). But what does it really mean for the Internet to be “available”? Since the Internet is defined as a network of networks, there are various levels of availability that need to be considered. How does the (un)availability of various networks impact an end user’s experience, and their ability to access the content or applications that they are interested in? How can this availability be measured and monitored?

Deriving Insight From BGP Data

Many Tweets from @DynResearch feature graphs similar to this one, which was included in a September 20 post that noted “Internet connectivity in #PuertoRico hangs by a thread due to effects of #HurricaneMaria.”

There are two graphs shown — “Unstable Networks” and “Number of Available Networks”, and the underlying source of information for those graphs is noted to be BGP Data. The Internet analysis team at Oracle Dyn collects routing information in over 700 locations around the world, giving us an extensive picture of how the networks that make up the Internet are interconnected with one another. Using a mix of commercial tools and proprietary enhancements, we are also able to geolocate the IP address (network) blocks that are part of these routing announcements — that is, we know with a high degree of certainty whether that network block is associated with Puerto Rico, Portugal, or Pakistan. With that insight, we can then determine the number of networks that are generally associated with that geography. The lower “Number of Available Networks” graph shows the number of networks (IP address blocks, also known as “prefixes”) that we have geolocated to that particular geography. This number declines when paths to those networks are no longer present in routing announcements (are “withdrawn”), and increases when paths to those networks become available again. The upper “Unstable Networks” graph represents the number of networks that have recently exhibited route instability — when we see a flurry of messages about a network, we consider it to be unstable.
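The counting behind the lower “Number of Available Networks” graph can be sketched roughly as follows. This is my own simplification - the real pipeline uses full BGP feeds from 700+ collection points and commercial geolocation, and the prefixes here are hypothetical documentation ranges:

```python
# Toy model: count "available networks" for a geography from a stream of
# BGP-style announce/withdraw messages about geolocated prefixes.
PUERTO_RICO_PREFIXES = {"203.0.113.0/24", "198.51.100.0/24"}  # hypothetical

def available_count(messages, geo_prefixes):
    """Replay (action, prefix) messages; return how many geolocated
    prefixes still have a path present in the routing table."""
    visible = set()
    for action, prefix in messages:
        if prefix not in geo_prefixes:
            continue  # not geolocated to the region we are tracking
        if action == "announce":
            visible.add(prefix)
        elif action == "withdraw":
            visible.discard(prefix)
    return len(visible)

msgs = [
    ("announce", "203.0.113.0/24"),
    ("announce", "198.51.100.0/24"),
    ("withdraw", "203.0.113.0/24"),   # path lost, e.g. storm damage
]
print(available_count(msgs, PUERTO_RICO_PREFIXES))  # 1
```

The “Unstable Networks” graph would then be a windowed count of prefixes that flapped (appeared in a flurry of such messages) rather than the final visible set.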

Necessary But Not Sufficient

However, as we mentioned in a previous blog post, “It is worth keeping in mind that core network availability is a necessary, but not sufficient, condition for Internet access. Just because a core network is up does not mean that users have Internet access—but if it is not up, then users definitely do not have access.” In other words, if a network (prefix) is being announced, that announcement may be coming from a router in a hardened data center, likely on an uninterruptible power supply (and maybe a generator). Just because the routes (paths to the network prefixes) are seen as being available, it does not necessarily mean that those routes are usable, since the last mile network infrastructure behind them may still be damaged and unavailable.

These “last mile” network connections to your house, your cell phone, or your local coffee shop, library, or place of business are critical links for end user access. When these networks are unavailable, then it becomes hard, if not impossible, for end users to access the Internet. More specifically, the components of the local networks in your house or coffee shop/library/business need to be functional — the routers/modems need to have power, and be connected to the last mile networks. Because of the power issues and physical damage (downed or broken power/phone/cable lines, impaired cell towers) that often accompany natural disasters, these local and last mile networks are arguably the most vulnerable critical links for Internet access.

Determining Last Mile Network Availability

While network availability can be measured at least in part by monitoring updates to routing announcements, last mile network availability can be determined both through reachability testing as well as observing traffic originating in those networks. On the latter point, our best perspective is currently provided by requests to Oracle Dyn’s Internet Guide – an open recursive DNS resolution service. With this service, end user systems are configured to make DNS requests directly to the Internet Guide DNS resolvers, rather than the recursive resolvers run by their Internet Service Provider. (Users often do this for performance or privacy reasons, though some ISPs will simply have their users default to using a third-party resolver instead of running their own.) Using the same IP address geolocation tools described above, we can determine where the users appear to be connecting from. Looking at the graph below, we can see a roughly diurnal pattern in DNS traffic in the days before Hurricane Maria makes landfall in Puerto Rico. (It is interesting to note that the peaks increase significantly as the hurricane approaches.) However, the rate of queries drops sharply, reaching a near-zero level, at 11:30 UTC on September 20, about an hour and a half after Maria initially made landfall, due to damage caused to local power and Internet infrastructure.

On the former point, regarding reachability testing, this insight can be gathered from the millions of daily traceroutes done to endpoints around the globe. Because the Oracle Dyn team has been actively gathering these traceroutes for nearly a decade, they have been able to identify endpoints across network providers that are reliably reachable, and can serve as a proxy for that network’s availability. The graph below illustrates the results of regular traceroutes to an endpoint in Liberty Puerto Rico, a local telecom provider. It shows that traceroutes to IP addresses announced by Liberty PR generally traverse networks including San Juan Cable, AT&T, and AT&T Mobility Puerto Rico. These networks are some of Liberty PR’s “upstream providers”, connecting it to the rest of the Internet. It is clear that the number of responding targets (of these traceroutes) drops sharply just before mid-day (UTC) on September 20, and further degrades over the next 15 hours or so, reaching zero just after midnight. These endpoints presumably became unreachable as power was lost around the island, copper and fiber lines were damaged, etc.

International Borders

Above, we have examined the various ways that Oracle monitors and measures network availability in the face of disaster-caused damage. However, there is another common cause of Internet outages — government-ordered shutdowns. In the past several years, we have seen Iraq shut down Internet access to prevent cheating on exams, and Syria has taken similar steps as well, as shown in the graph below. We have also seen countries such as Egypt shut down access to the global Internet in response to widespread protests against the government. In countries where such actions occur, the core networks often connect to the global Internet through a state-owned/controlled telecommunications provider and/or through a limited number of network providers at their international border. This situation was examined in more detail in a blog post published nearly five years ago by former Dyn Chief Scientist Jim Cowie. The post, entitled “Could It Happen In Your Country?”, examines the diversity of Internet infrastructure at national borders, classifying the risk potential for Internet disconnection.

In these cases, our measurements will see the number of available networks decline, often to zero, because all routes to the country’s networks have been withdrawn. In other words, the networks within the country may still be up and functional, but other Internet network providers elsewhere in the world have no way of reaching these in-country networks because paths to them are no longer present within the global routing tables.


In order for Internet access to be “available” to end users, international connectivity, core network infrastructure, and last mile networks must all be up, available, and interconnected. Availability of these networks can be measured and monitored through the analysis of several different data sets, including BGP routing announcements, recursive DNS traffic, and traceroute paths, and further refined through the analysis of Web traffic and EDNS Client Subnet information in authoritative DNS requests.

And as always, we will continue to measure and monitor Internet availability around the world, providing evidence of brief and ongoing/repeated disruptions, whatever the underlying cause.

by David Belson at October 18, 2017 05:49 PM

My Etherealmind
ipSpace.net Blog (Ivan Pepelnjak)

Must Read: Network Engineer Persona

David Gee (whom I finally met in person during recent ipSpace.net Summit) published a fantastic series of articles on what someone bringing together networking, development and automation should know and do.

Read more ...

by Ivan Pepelnjak (noreply@blogger.com) at October 18, 2017 05:25 AM

XKCD Comics

October 17, 2017

Networking Now (Juniper Blog)

Emotet Spambot activity surges!

For several weeks, Cyphort Labs (now part of Juniper Networks Sky ATP) has been observing the renewed activity of the Emotet malspam campaign. As can be seen from the chart below, based on our telemetry, this variant of Emotet started to show activity in the first week of August and picked up dramatically toward the end of September through early October.


Emotet is primarily a trojan downloader which downloads additional modules that would perform the following malicious activity:

  • Steal Information (email accounts and browsers credentials).
  • Spread itself via Email Spam.
  • Participate in DDoS attacks.
  • Steal bank credentials (only in prior versions).


Its recent uptick in activity may be attributed to its spam module, and could mean that it has added a significant number of infected hosts to its botnet, causing this spike in activity.


Infection Chain


Emotet has historically been distributed through spam emails containing phishing links that lead to malicious Office documents, which then download the Emotet payload. We have also seen downloaders that arrive as PDFs containing a malicious link leading to a malicious macro (credit to: @JAMESWT_MHT).



Emotet Spam Email.



 PDF with link to malicious Emotet trojan.


The malicious macro will execute a PowerShell command to download Emotet.




In our case, the Emotet sample was downloaded from http://austxport[.]com[.]au/redbeandesign/zaW/ as a plain executable.



The downloaded Emotet is saved to and executed from the %temp% folder as {random_number}.exe. It copies itself into the %system% folder as searchlog.exe. For persistence, it installs itself as a service named searchlog, as seen below.




Command and Control

Emotet contacts its CnC server via HTTP, but over port 443. Communication to its CnC is encrypted using a custom protocol. It will first send information about the infected system, such as username, OS version, etc.


As with previous Emotet versions, the server responds with HTTP/1.1 404 Not Found. This is a clever way for the actors to hide the communication, as many security devices like web gateways will not process 404 response pages. As you can see from the pcap below, it still sends encrypted data.



Emotet HTTP Request/Response showing the 404 response code.
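One hedged way a defender might hunt for this trick (my own idea, not a described Sky ATP feature) is to flag 404 responses that carry large, high-entropy bodies, since a genuine “Not Found” page is normally short plain HTML while encrypted payloads look close to random:

```python
import math
import os

def entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; encrypted/random data approaches 8."""
    if not data:
        return 0.0
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    return -sum(c / len(data) * math.log2(c / len(data)) for c in counts.values())

def suspicious_404(status: int, body: bytes,
                   min_len: int = 512, min_entropy: float = 6.0) -> bool:
    """Flag a 404 whose body looks more like ciphertext than an error page.
    The thresholds are illustrative guesses, not tuned values."""
    return status == 404 and len(body) >= min_len and entropy(body) >= min_entropy

print(suspicious_404(404, b"<html>Not Found</html>"))  # False: normal error page
print(suspicious_404(404, os.urandom(2048)))           # True: random bytes mimic ciphertext
```

This is only a heuristic; compressed legitimate content would also score high, so it is a triage signal rather than a verdict.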


At the time of our analysis, the CnC server was no longer returning any malicious samples.




SPAM Email Hashes:




MS Word Attachment Document:



Emotet Payload:



This analysis is courtesy of Paul Kimayong.

by mhahad at October 17, 2017 08:30 PM

Mobile malware and Sky ATP

There has been a dramatic increase in attacks aimed at smartphones, tablets, and even "smart TVs", mostly targeting the Android ecosystem. Unlike Apple's iOS, Android allows users to use alternate app stores and to "sideload" arbitrary apps onto a device. There are entire marketplaces of "cracked" apps -- unauthorized versions of paid apps distributed for free -- and many thousands more apps offering malicious payloads in addition to their advertised functionality. 


We'll look at a recent example of a "locker", an application that takes control of a device and demands a ransom payment. Unlike typical PC ransomware, lockers don't encrypt the device's storage, but simply take over the display in a way that is nearly impossible to exit, rendering the device unusable. This particular sample purports to be an app for a popular pornographic site.





Launching the app shows a brief installation screen.




This is followed by an official-looking demand saying that "suspicious" files have been found, and that the device is locked until a $500 penalty is paid.





A typical user will find it nearly impossible to exit from this malicious app. To see how the malware authors accomplish this, we first note that the app requests a wide range of permissions.




The highlighted permission, SYSTEM_ALERT_WINDOW, allows the app to display a notification that covers the entire screen and cannot be dismissed. In addition, the app runs a simple service in the background to restart itself on reboot and in case of crash or termination.


The app gathers information about the user and attempts to take a picture of the victim using the device's front-facing camera. This information is displayed, followed by a sequence of graphic and disturbing pornographic images purportedly discovered on the user's device.




Despite this allegation, which is accompanied by the text of various laws concerning illegal pornography, these images are actually part of the malware itself. Here, in the decompiled app's resources, we find these pornographic images among assorted icons and logos:




The app solicits a ransom payment via a OneVanilla prepaid debit card.




In the app's decompiled code, we can see that the application verifies that the credit card entered by the victim has the appropriate prefix for a OneVanilla-issued card:




The app is written in Java, which can often be decompiled back to something similar to the original source code. However, the malware authors appear to have used an automated tool to obfuscate the code and make it more difficult to analyze. Here is the snippet of code that uploads the credit card information to a server controlled by the malware distributor:




Removing the base64 encoding, we start to see hints of the operation in the form of ASCII strings:
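The first deobfuscation step can be illustrated with a made-up blob (the real sample's strings differ): decoding the base64 layer and pulling out printable ASCII runs, strings-style, surfaces the hints.

```python
import base64
import re

# Hypothetical obfuscated blob, similar in spirit to what appears in the
# decompiled app; the URL and surrounding bytes are invented for illustration.
blob = base64.b64encode(
    b"\x00\x01http://example.invalid/gate.php?card=\x02pay\x03"
).decode()

decoded = base64.b64decode(blob)
# Extract printable ASCII runs of 4+ characters, as a `strings` pass would
hints = re.findall(rb"[ -~]{4,}", decoded)
print(hints)  # [b'http://example.invalid/gate.php?card=']
```

The short non-printable runs around the string are why the raw decode still looks like noise until the ASCII fragments are isolated.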




With additional manual deobfuscation, we find the code that uploads the credit card information as a parameter in an HTTP GET request:




This GET request failed in our research environment, possibly because the server had already been discovered and taken offline, but we can see the full URL with credit card number in the app's cache:




Despite the failure, we are told that our "request will be processed in 24 hours":







Sky ATP supports both static and dynamic analysis of Android apps, and applies the same machine learning deep-analysis pipeline as for Windows executables, documents, and media files:





Note that in the analysis above, no information about the device itself is uploaded with the ransom payment. Aside from the credit card number itself, the malware authors have no way to associate the payment with a particular victim, and there does not appear to be any mechanism for remotely disabling the locker. However, the locker can be safely stopped and removed by booting into the device's safe mode and manually uninstalling the app:






by AsherLangton at October 17, 2017 04:49 PM

Moving Packets

Decoding LACP Port State

It’s frustrating when the output of a show command gives exactly the information needed, but in a format which is unintelligible. So it is with the Partner Port State field in the NXOS show lacp neighbor interface command, which reports the partner port state as a hexadecimal value. To help with LACP troubleshooting, here’s a quick breakdown of the port states reported on by LACP, and how they might be seen in Junos OS and NXOS.

LACP Port State

The LACP port state (also known as the actor state) field is a single byte, each bit of which is a flag indicating a particular status. In this table, mux (i.e. a multiplexer) refers to the logical unit which aggregates the links into a single logical transmitter/receiver.

The meaning of each bit is as follows:

Bit Name Meaning
0 LACP_Activity Device intends to transmit periodically in order to find potential members for the aggregate. This is toggled by mode active in the channel-group configuration on the member interfaces.
1 = Active, 0 = Passive.
1 LACP_Timeout Length of the LACP timeout.
1 = Short Timeout, 0 = Long Timeout
2 Aggregation Will allow the link to be aggregated.
1 = Yes, 0 = No (individual link)
3 Synchronization Indicates that the mux on the transmitting machine is in sync with what’s being advertised in the LACP frames.
1 = In sync, 0 = Not in sync
4 Collecting Mux is accepting traffic received on this port
1 = Yes, 0 = No
5 Distributing Mux is sending traffic using this port
1 = Yes, 0 = No
6 Defaulted Whether the receiving mux is using default (administratively configured) parameters, or whether the information was received in an LACP PDU.
1 = default settings, 0 = via LACP PDU
7 Expired In an expired state
1 = Yes, 0 = No
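The table above maps directly onto a single byte, so the hex value can be decoded mechanically. A minimal Python sketch of my own (bit 0 is the least-significant bit of the state byte):

```python
# Flag names in bit order, bit 0 (LSB) first, per the table above.
LACP_FLAGS = [
    "LACP_Activity", "LACP_Timeout", "Aggregation", "Synchronization",
    "Collecting", "Distributing", "Defaulted", "Expired",
]

def decode_lacp_state(state: int) -> dict:
    """Map each bit of an LACP actor/partner state byte to its flag name."""
    return {name: bool(state & (1 << bit)) for bit, name in enumerate(LACP_FLAGS)}

# The partner state from the NXOS output later in this post:
for flag, value in decode_lacp_state(0x3F).items():
    print(f"{flag}: {value}")
```

Decoding 0x3F this way shows bits 0 through 5 set (active, short timeout, aggregating, in sync, collecting, distributing) and bits 6 and 7 clear, with no bit-flipping gymnastics required.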

Junos OS and NXOS

Junos OS users are probably smiling right now, as this should look very familiar:

john@switch&gt; show lacp interfaces ae1
Aggregated interface: ae1
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      xe-1/0/0       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      xe-1/0/0     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Passive
      xe-2/0/0       Actor    No    No    No   No   No   Yes     Fast    Active
      xe-2/0/0     Partner    No    No    No  Yes  Yes   Yes     Fast    Passive

Cisco users on the other hand may be weeping quietly when viewing a port-channel summary:

us-atl01-z1fa07a# show lacp neighbor interface port-channel 101
Flags:  S - Device is sending Slow LACPDUs F - Device is sending Fast LACPDUs
        A - Device is in Active mode       P - Device is in Passive mode
port-channel101 neighbors
Partner's information
            Partner                Partner                     Partner
Port        System ID              Port Number     Age         Flags
Eth1/6      127,39-0d-12-c2-2b-40  0x3             427434      SA

            LACP Partner           Partner                     Partner
            Port Priority          Oper Key                    Port State
            127                    0x2                         0x3f

Partner's information
            Partner                Partner                     Partner
Port        System ID              Port Number     Age         Flags
Eth2/6      127,39-0d-12-c2-2b-40  0x1             112         SA

            LACP Partner           Partner                     Partner
            Port Priority          Oper Key                    Port State
            127                    0x2                         0x3f

The partner port state is 0x3f, which is not very helpful. The good news is that looking at individual members does reveal the information in a more human-friendly format:

us-atl01-z1fa07a# show lacp interface eth 1/6
Interface Ethernet1/6 is up
Local Port: Eth1/6   MAC Address= 0-de-fb-11-32-a6
  System Identifier=0x8000,  Port Identifier=0x8000,0x106
  Operational key=100
  LACP_Timeout=Long Timeout (30s)
  Partner information refresh timeout=Short Timeout (3s)
Actor Admin State=(Ac-1:To-1:Ag-1:Sy-0:Co-0:Di-0:De-0:Ex-0)
Actor Oper State=(Ac-1:To-0:Ag-1:Sy-1:Co-1:Di-1:De-0:Ex-0)
Neighbor: 0x3
  MAC Address= 39-0d-12-c2-2b-40
  System Identifier=0x7f,  Port Identifier=0x7f,0x3
  Operational key=2
  LACP_Timeout=short Timeout (1s)
Partner Admin State=(Ac-0:To-1:Ag-0:Sy-0:Co-0:Di-0:De-0:Ex-0)
Partner Oper State=(Ac-1:To-1:Ag-1:Sy-1:Co-1:Di-1:De-0:Ex-0)
Aggregate or Individual(True=1)= 1

However, for the sake of anybody who has been sent output from show lacp neighbor interface port-channel X and wants to understand the hex value that’s displayed (0x3F in this case), it’s pretty simple.


  • Convert hexadecimal to binary. Hexadecimal 0x3F is 00111111 in binary.
  • Flip the bits around. 00111111 becomes 11111100
  • Map the bits in this order to the table above:

1 -> ACTIVE mode
1 -> SHORT timeout
1 -> WILL aggregate
1 -> In SYNC
1 -> Mux is Collecting
1 -> Mux is Distributing
0 -> NOT running administratively configured settings
0 -> NOT expired

Alternatively, I suppose, flip the table so that the entries run from 7 to 0 instead of 0 to 7, then you don’t have to flip the bits; either way works. In this case 0x3F indicates a link which is an active part of the aggregated interface.

Clearly what we want from a link is that bit 7 is 0 (not expired) and bits 2-5 are 1 (will aggregate, in sync, collecting, distributing).

A recent port I had trouble with was reported as partner port state 0xC7, which in binary is 11000111, which when flipped to 11100011 means:

1 -> ACTIVE mode
1 -> SHORT timeout
1 -> WILL aggregate
0 -> NOT In Sync
0 -> Mux is NOT Collecting
0 -> Mux is NOT Distributing
1 -> Running administratively configured settings
1 -> EXPIRED

Clearly this link was not happy, but thankfully a shut / no shut sequence was enough to revive the patient.

Happy aggregating!

If you liked this post, please do click through to the source at Decoding LACP Port State and give me a share/like. Thank you!

by John Herbert at October 17, 2017 02:49 PM


Cloud Native: Upgrading a Workflow Engine or Orchestrator

On a train this morning, while reading Ivan Pepelnjak’s Twitter stream (because what else is there to do whilst relaxing with a coffee?), I came across this blog post on upgrading virtual appliances.

Couldn’t agree more with the approach, but what about upgrading a workflow engine or orchestrator? I’ll call this entity a ‘wfeo’ just to make typing this article easier.

The perceived turmoil in undertaking this kind of upgrade task is enough to make newborn babies cry. Fear not. Any half-decent wfeo contains its gubbins (workflows, drivers, logic, data) in a portable and logical data structure.

Taking StackStorm as an example, each integration (official parlance: ‘pack’) is arranged into a set of directories.
Within each directory are more directories with special names, plus files like READMEs, configuration schemas and pack information. These top-level directories that contain the pack are portable between StackStorm installations, giving us the power to easily clone installations, repair logic after a troubled upgrade, and install logic freshly on new installations.

As with any platform, some syntax might change so always read the release notes for the platform and packs.

Ivan’s point is that you do not treat virtual appliances as special creatures because they are not.

So when it comes to our wfeos, what do we do? We could upgrade or we could just install the new version and port our logic across.

Going for the upgrade route (if you cannot easily just re-deploy a new virtual-machine), our portable logic provides us a safety net to re-install if it goes wrong.
If you can deploy a new virtual-machine, then this is the cleanest route. Simply deploy, copy your logic across (contained in a pack), install the packs that your logic depends on and configure them.

There is one advantage to upgrading wfeos that you do not have with virtual network appliances: they deal with the control-plane, so you can test them easily before making them live.
The hardest bit is bringing the new wfeo into production. That could be as simple as changing an IP address or DNS record when you’re ready, then keeping the old one on rapidly available storage, in case it all goes wrong, until you are happy that things are behaving correctly.

Workflow engines and orchestrators are no different to any other virtual network appliance or virtual network function (VNF), so don’t treat them any differently.

Using Ivan’s closing statement, go through the process once, then automate it!

The post Cloud Native: Upgrading a Workflow Engine or Orchestrator appeared first on ipengineer.net.

by David Gee at October 17, 2017 01:02 PM

ipSpace.net Blog (Ivan Pepelnjak)

Upgrading Virtual Appliances

In every SDDC workshop I tried to persuade the audience that virtual appliances (particularly per-application instances of virtual appliances) are the way to go. I usually got questions along the lines of “who will manage and audit all these instances?” but once someone asked “and how will we upgrade them?”

Short answer: you won’t.

Read more ...

by Ivan Pepelnjak (noreply@blogger.com) at October 17, 2017 05:17 AM

October 16, 2017

Moving Packets

KRACK WPA2 Vulnerability Announced – Upgrade Now

If you haven’t already heard about the KRACK (Key Reinstallation Attack) vulnerability announced today, head over to the information page at https://www.krackattacks.com/ as quick as your fingers will take you because Mathy Vanhoef of imec-DistriNet has found a vulnerability in the WPA2 protocol which has a very wide impact.


The challenge here is that this isn’t a bug in any particular implementation or commonly-used library; rather, it’s a vulnerability in the protocol itself, which means that any correct implementation of the protocol is vulnerable. This also does not just apply to wireless access points; remember that most cell phones can also act as wireless APs for purposes of wireless tethering, so they may be vulnerable too.

Impressively, a number of vendors have released code which has been patched for the vulnerability today, and a number of vendors included fixes before today’s public announcement. However, those are useless if people don’t install the upgrades. I strongly advise going now and finding what your wireless vendor has done, and installing any available patched code.

Ubiquiti Update

Since I know you’re all following my Ubiquiti experiences, I’ll note that UBNT released code this morning for my Unifi AC-AP-PRO access points, and I upgraded them before breakfast this morning. The only minor annoyance is that this code release has not been pushed to the current stable 5.5.24 controller yet, so until that happens it’s necessary to trigger a manual upgrade for each device. Also, if you have enabled automatic updates, turn them off before you upgrade or you may find the 3.8.x release undoing the manual upgrade to 3.9.3 (yes, the 5.5.24 controller believes that it should upgrade the APs from 3.9.3 to 3.8.x). The push will hopefully occur shortly, but Ubiquiti usually waits about a week while early adopters install the code so they can be confident that it did not introduce any other issues (i.e. regression testing).

My 2 Bits

It sucks when a vulnerability like this hits the wire, but I give respect to Mathy Vanhoef for following a responsible disclosure process and allowing vendors some time to prepare patches before the vulnerability was shared publicly.

If you liked this post, please do click through to the source at KRACK WPA2 Vulnerability Announced – Upgrade Now and give me a share/like. Thank you!

by John Herbert at October 16, 2017 02:46 PM


Network Automation: Leaky Abstractions

I hear people talk about leaky abstractions all the time. I’m not sure that all of the people who use the term have actually researched it.

As network-automation blurs the line between software and networking, terms like this are used more commonly than you might expect.

When you hear someone say ‘leaky abstraction’, what does it really mean? This question drove me to a little research effort.

The term ‘leaky abstraction‘ was popularised in 2002 by Joel Spolsky. I totally misunderstood this statement when I first heard it, so naturally the researcher in me went off trawling the web to get a more correct view.

My original and misinformed understanding is explained in the example below.

The Example

Taking the example of a car, the abstraction interface or vehicle controls allows a user to manoeuvre the vehicle between a start and end point whilst keeping the passenger as comfortable as possible.

A car has air modification capability, human body heaters, and it can even project audio to your ears. Most vehicles have an on switch (engine start or power switch) and directional and velocity controls: a steering wheel, a gear shifter, and a set of pedals including an accelerator (gas), a brake and, on manual vehicles, a clutch.

My original interpretation of a ‘leaky abstraction’ was having to change the engine’s air-to-fuel ratio by hand. The engine should take care of this; it cannot be assumed that drivers have the mechanical or chemistry skills to understand or manipulate the ratio correctly. The inner workings of the engine are exposed through an otherwise simple abstraction layer. My interpretation was that this fuel-mixture example is a leaky abstraction, or ‘something polluted the abstraction’.

On an electric vehicle, this air-to-fuel ratio control does not exist. The abstraction layer, although nearly identical, is different.

So, what could be a leaky abstraction here? How about driving your vehicle over different terrains? For this, let’s assume your abstraction layer is: accelerator, brake, steering wheel.

Under normal conditions, a vehicle goes where it is pointed towards. A leaky abstraction could be the fact the vehicle misbehaves when on ice or sand, providing feedback through the controls. Some vehicles cope with it better. For instance, a car with great traction control software will cope with different terrains differently to one that just has a mechanical differential. Your experience between two vehicles, despite having the very same abstraction layer will differ massively. This is my interpretation of ‘leaky abstraction’ using vehicles and our very basic abstraction layer consisting of accelerator, brake and steering wheel.

So why do I need to worry about it?

Don’t panic! Leaky abstractions can cause problems for automation workflows and it’s your duty to figure out how behaviour changes when something out of your control through the abstraction layer changes. Do you want to feel the effects of driving on snow? Or do you just want to try and move forwards and be told if that’s not possible?

Do your homework when using software libraries and automation tooling to ensure that the abstractions you use will behave in a manageable way. I like to use NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support) as something that does a reasonable job of hiding the underlying complexity. The vehicle view of this would be: consumers do not need to worry about the settings in the engine, but a mechanic can tweak them from time to time. NAPALM takes a similar approach. To consume it is simple; to fine-tune it requires a bit more skill. What happens in terms of leaks? Well, latency or jitter of packets to and from devices can leak through as broken calls, and it’s possible to receive malformed data because of vendor code changes.

Joel Spolsky uses the example of TCP consuming the IP technology to describe leaky abstraction. If you haven’t read the article, it’s a must read!


Tony Hsieh, to my amusement, also used the concept of a vehicle’s accelerator in one of his blog posts. This is worth a read.

Leaky abstractions in automation can cause problems higher up the stack (think metadata and passing it around) so always make sure you handle your logic with error handling and timeouts to ensure your result is deterministic.
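That advice about error handling and timeouts can be sketched in code. The retry wrapper below is a generic illustration of my own, not from any particular library: it absorbs a transient leak (a timeout, a broken call) so the overall result stays deterministic.

```python
import time

def with_retries(fn, attempts=3, delay=0.0, exceptions=(OSError, TimeoutError)):
    """Call fn(), retrying on the named transient exceptions.

    A leak through the abstraction (latency spike, dropped session)
    becomes a bounded retry rather than a non-deterministic failure.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except exceptions as exc:
            last_exc = exc
            time.sleep(delay)
    raise last_exc  # give up: surface the leak explicitly
```

Anything that talks to a device through an abstraction layer (a hypothetical `fn` here) can be wrapped this way, making the failure mode an explicit, logged decision instead of a surprise higher up the stack.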

As always, these posts represent my own learnings and thoughts. Feel free to comment and open up the debate.

The post Network Automation: Leaky Abstractions appeared first on ipengineer.net.

by David Gee at October 16, 2017 02:39 PM

Network Design and Architecture

CCDE October Online Class is starting, why CCDE from Orhan Ergun ?

CCDE October Online Instructor-Led Class will start today. My online CCDE classes run for 10 days, around 4 hours every day. But really, let’s be honest, can you understand everything in 10 days? So, can you pass the CCDE Practical exam by studying just this 10-day course? No. No. Even if you are […]

The post CCDE October Online Class is starting, why CCDE from Orhan Ergun ? appeared first on Cisco Network Design and Architecture | CCDE Bootcamp | orhanergun.net.

by Orhan Ergun at October 16, 2017 09:06 AM

ipSpace.net Blog (Ivan Pepelnjak)

New Webinar: QoS Fundamentals (and Other Events)

I listened to Ethan Banks’ presentation on lessons learned running active-active data centers years ago at Interop, and liked it so much that I asked him to talk about the same topic during the Building Next-Generation Data Center course.

Not surprisingly, Ethan did a stellar job, and when I heard he was working on QoS part of an upcoming book asked him whether he’d be willing to do a webinar on QoS.

Read more ...

by Ivan Pepelnjak (noreply@blogger.com) at October 16, 2017 05:12 AM

XKCD Comics

October 14, 2017

My CCIE Training Guide

What is about to change in CISSP from Apr 2018

Change has arrived. As with other professional certifications, there is an almost standard interval before a certification gets its update, typically between 3 and 4 years. CISSP is no different, and since the last update was in 2015, the change is arriving here as well.

For the people that wish to see the official existing and new outline

I have decided to write this post because the new outline is just a list of domains and sections, with no hint or indication of what was actually modified, and I could not find anyone else who had done the comparison. So I took on the task myself. Please be advised that I did it for my own "pleasure", so apologies if I missed something :-)

Let's start with the obvious change:

Domain                                    Before Apr 2018   From Apr 2018
1. Security and Risk Management           16%               15%
2. Asset Security                         10%               10%
3. Security Engineering                   12%               13%
4. Communications and Network Security    12%               14%
5. Identity and Access Management         13%               13%
6. Security Assessment and Testing        11%               12%
7. Security Operations                    16%               13%
8. Software Development Security          10%               10%

So, as you can see from the table above, these are not mind-blowing, ground-up changes. We are still in the 8-domain format, with only small variations in the ratio between domains. Since the exam still has 250 questions, each percentage point carries the same weight: 1% equals 2.5 questions. Looked at that way, Domain 1 lost 2-3 questions in favor of Domain 3, whose ratio increased by 1%. I would call that a very minor difference.

Now, looking into each domain in more detail:

Domain 1 Security and Risk Management - originally with 12 sections and still with 12 sections, however:
  • Section 1.2 was reduced from 6 sub-areas to 5 by merging Due Care and Due Diligence into one section. Does that mean we need to know less about them? I think not.
  • Section 1.4: similarly, Computer Crime (the legal term) was changed to Cyber Crime and merged with Data Breaches.
  • Section 1.9: again, 12 sub-areas were trimmed to 11 by merging content.
Domain 2 Asset Security - seems to be unchanged for the most part. One small change to Section 2.5.4: instead of cryptography it now reads Data protection methods. I take that as a more global look at what is available for data protection, rather than a focus only on crypto.

Domain 3 Security Engineering 
  • Section 3.5 was appended with IoT; a fairly expected change with all the buzz around it (no offense intended).
  • Section 3.11.7 Water Issues was modified to Environmental Issues; also a fairly obvious change, since focusing only on water hazards was too narrow.
Domain 4 Communication and Network Security 
  • Section 4.1.7 Cryptography used to maintain communication security - removed
  • Section 4.2.6 Physical devices - removed
  • Section 4.4 Prevent and Mitigate network Attack was removed
Domain 5 IAM 
  • Section 5.3 as it was has been removed; the new 5.3 is equivalent to the old Section 5.4, and in addition it is segmented into 3 sub-areas: Cloud, On-Premise and Federated.
  • Section 5.6 Prevent and Mitigate access control - removed
  • Section 5.7 Manage the Identity - removed
Domain 6 Security Assessment and Testing
  • Section 6.1 was extended with 3 sub-areas: Internal, External, Third-Party
  • Section 6.5 received the same workout as Section 6.1
Domain 7 Security Operations
  • Section 7.16 Address personnel safety and security concerns was extended with 4 sub-areas: Travel, Security training and awareness, Emergency management, Duress

Domain 8 Software Development Security 
  • Section 8.2 was trimmed from 5 sub areas to 3
    • Security weaknesses and vulnerabilities at the source-code level - removed
    • Security of APIs - removed
  • Section 8.3 Acceptance testing - removed
  • New Section 8.5 Define and apply secure coding guidelines and standards with 3 sub areas
    • Security weaknesses and vulnerabilities at the source-code level
    • Security of application programming interfaces
    • Secure coding practices

So overall, the changes are not fundamental, but I think they are necessary ones given where the industry is heading. Good luck to me and whoever else is going to take the challenge :-)

by shiran guez (noreply@blogger.com) at October 14, 2017 07:50 PM

ipSpace.net Blog (Ivan Pepelnjak)

Worth Reading: Things Network Engineers Hate

Some of the things Ethan Banks writes are epic. The latest one I stumbled upon: Things Network Engineers Hate. I particularly loved the rant against long-distance vMotion (no surprise there ;).

by Ivan Pepelnjak (noreply@blogger.com) at October 14, 2017 07:40 AM

October 13, 2017

Networking Now (Juniper Blog)

How One Company is Preparing for GDPR

You’d be forgiven for thinking that GDPR (General Data Protection Regulation) is centered around just one thing: the potential to be fined up to four percent of your organization’s revenue for non-compliance.

by lfisher at October 13, 2017 05:42 PM

Moving Packets

Pre-Provisioning Your FEXen For Fun and Profit

In this post, I’ll discuss how to protect your income by using the FEX pre-provisioning capability of NXOS. I discovered the hard way that not pre-provisioning your FEX can have catastrophic side effects. What better story to post on Friday the 13th?

Pre-Provisioning your Cisco FEX

FEXy Time

Attaching a FEX to a Nexus switch is relatively simple; a few commands on each of the two switches the FEX connects to and it’s up and running. It’s also possible to pre-provision the FEX modules in the configuration. The documentation doesn’t make it entirely clear why this would be desirable, beyond the rather cryptic:

In some Virtual Port Channel (vPC) topologies, pre-provisioning is required for the configuration synchronization feature. Pre-provisioning allows you to synchronize the configuration for an interface that is online with one peer but offline with another peer.

Got that? In other words, pre-provisioning makes it possible to configure a FEX module that isn’t there yet, or that is powered down, or is only connected to one side of a VPC pair for some inexplicable reason. Maybe I’ve ordered some FEXen (plural of FEX) and want to configure the ports ahead of time? Whatever the rationale for doing so, I’ve never previously needed pre-provisioning for FEX modules, and working this way has never bitten me. Or, I should say, had never bitten me.

Replacing a Nexus Switch

I wrote a post earlier this year called No Hassle Hardware Replacement with DCNM. I stand by that post, but there is one really important issue which I did not take into account.

A few months ago I had to RMA a Nexus switch which had FEXen attached. I followed the same process I described in my DCNM post; I configured a serial number, identified the NXOS version to install and uploaded the configuration from the switch which was being replaced. I powered up the switch and DCNM performed its magic, upgrading the code, and sending the configuration to the switch. I checked the uplinks to the fabric spine, the peering with neighboring switches, and the connectivity to the attached compute stacks and all was fine. Five minutes later, the red alert klaxon was sounding and it was obvious that something had gone very wrong.

Here’s a simplified version of what happens when DCNM performs Power On Auto Provisioning (POAP):

  • Loads the desired software image to the Nexus switch
  • Sets the boot parameters to load the new software on next reload
  • Installs the switch config as a ‘scheduled configuration’ to read after reload

The scheduled configuration is smarter than it might sound. Imagine that the Nexus switch is currently running 4.x, and the desired version of code is 5.x, and the configuration contains commands that are only available in 5.x. If the configuration were applied while the switch was still running 4.x, the 5.x-only commands would fail. Thus a scheduled configuration is only loaded after the switch has reloaded and booted from the new software version (5.x) and the commands will be valid. Clever stuff.

And The Problem Is?

The scheduled configuration on the Nexus switch gets parsed and installed before the attached FEXen have completed loading and are online. As a result, all configurations referring to ports on FEX modules are rejected because they refer to invalid port numbers. That’s not good, but let’s not worry because the other switch in the Nexus VPC pair is still up and running with the full configuration, right?

VPC consistency is an interesting beast. FEX ports have to be configured identically on both switches in order for them to work; if they aren’t configured identically, they, uh, get suspended. The scheduled configuration–which has loaded with all the FEX port configuration rejected–now means my two switches are out of sync, so all of the FEX ports on the second switch go into suspended mode. This is not good, as is probably obvious, because now all my FEX ports on all attached FEXen have gone down, which means so did all the servers connected to the FEX.

The Solution: Pre-Provisioning

Pre-provision your FEXen! For example:

slot 140
 provision model N2K-C2248T
slot 141
 provision model N2K-C2248T

When the scheduled configuration loads with pre-provisioning commands in it, the Nexus now knows how many ports (and what kind) to pre-allocate on what virtual slot, so the configuration doesn’t get rejected. Problem solved!

It Is Known

I should note that this is not a bug; this is expected behavior and Cisco notes this in the VPC Operations Guide which I’m sure we’ve all read carefully. The documentation provides a good set of steps to follow, but they are impractical where DCNM is doing POAP. The guide also discusses needing configuration sync enabled. For a variety of reasons I don’t use configuration sync, but the rest of the steps are still relevant.

The Cisco Nexus engineering team says that this behavior with a scheduled configuration is expected and is by design, so there’s nothing to fix in NXOS. I would argue that it wouldn’t be rocket science for the switch to look at the config and say Oh, wait a moment, these fex-fabric ports suggest that maybe we have a FEX coming online and I should wait to apply anything on this slot until I see something. Maybe that would make things worse.

The DCNM engineering team also sees nothing to fix on the DCNM side; it is successfully delivering the configuration as requested. Consequently I’ve made a feature request. In DCNM when a configuration is uploaded for POAP, DCNM looks at the configuration and extracts the hostname and the management IP so that those data can be displayed in the POAP status tables. I’ve asked that DCNM goes one step further, and looks for the commands switchport mode fex-fabric in the configuration while not also seeing slot XXX\provision ... in the same. If that’s the case, it’s evident that a FEX is, or is intended to be, connected to the switch but pre-provisioning has not been configured. While DCNM would not be able to automatically insert pre-provisioning commands for you, would it hurt to pop up an alert which says You have uploaded a configuration containing FEX interfaces but no pre-provisioning configuration exists. This may take down all FEX ports when deployed! Maybe it will happen, though I’m sure it’s a low priority feature request.
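The check I’m asking DCNM for is simple enough to sketch. This Python fragment is my own illustration of the idea, not DCNM code: flag any configuration that uses FEX fabric ports but never pre-provisions a slot.

```python
import re

def missing_fex_preprovision(config: str) -> bool:
    """Return True when a config uses FEX fabric ports but never
    pre-provisions a FEX slot -- the situation that suspends all FEX ports."""
    uses_fex = "switchport mode fex-fabric" in config
    provisioned = re.search(r"^\s*provision model \S+", config, re.MULTILINE)
    return uses_fex and provisioned is None

risky = "interface Eth1/1\n switchport mode fex-fabric\n"
safe = "slot 140\n provision model N2K-C2248T\n" + risky
print(missing_fex_preprovision(risky))  # True  -> warn before POAP
print(missing_fex_preprovision(safe))   # False -> pre-provisioning present
```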

In conclusion: pre-provision your FEXen if you’re going to do POAP or otherwise activate a configuration prior to the FEX modules coming online. Read the manuals, perhaps? Either way, lesson learned.

If you liked this post, please do click through to the source at Pre-Provisioning Your FEXen For Fun and Profit and give me a share/like. Thank you!

by John Herbert at October 13, 2017 02:58 PM


Network Automation Engineer Persona: Part Four

Part three introduced the first three key skills. This part presents the introduction to the last three core skills and a call to action.

Key Skill Four

I’m trying very hard to refrain from using the term DevOps, but the fundamentals of the DevOps movement are super important. The DevOps pillars are improving the flow of work, improving quality through feedback loops, and sharing. A huge array of books has been created on the topic of DevOps, in addition to blog posts and podcasts. If we view the persona of the Network Automation Engineer through the lens of the DevOps persona, the two are very similar. If we are to increase the flow of tasks and improve their quality using automation, then we need to be able to fix issues close to the source of the problems and share knowledge. We do that with logging and an attitude change; both are critical to successful automation projects.

Knowing how to transmit logs, how to capture them, how to sort through them and how to derive events from them is an entire skill. There are software stacks dedicated to this mission, like ELK (Elasticsearch, Logstash and Kibana) and the paid-for Splunk. The components required are data ingestion, data transformation, data storage, and the ability to view and query the data flexibly and on your own terms. Thanks to open source, you can practice and hone this skill for free.
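As a small illustration of the transmit side, here is a minimal structured-logging sketch in Python (my own example; the logger name and message are made up). One JSON object per line is exactly the kind of record an ELK-style pipeline can ingest without fragile regex parsing:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line, easy for a
    shipper (e.g. Logstash) to ingest and for Kibana to query."""
    def format(self, record):
        return json.dumps({
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("network.automation")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("BGP neighbor 192.0.2.1 went down")  # hypothetical event
```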

Key Skill Five

Treat success and failure as indifferent but important guests in your organization. Embrace both, learn from both, and try not to put more meaning on one than the other. Emotionally they feel different, but in this role, try to separate the feelings from the meaning. Getting things right should be normal, and failures and errors are also normal. Things change; always. A transistor might burn out on a CPU, a RAM cell may die, a datacenter might get hit by an asteroid, and you may enjoy the great stability of a religious festival like Christmas. Weirdly, engineers might try to fix a phantom problem over a quiet period like the last week of December, yet not react at all to a CPU causing a process to segfault in more normal times.

Key Skill Six

Knowing what to automate, when to automate and how to track the success or failure of your automation is all great, but what about the third pillar of DevOps? Some of the worst problems I’ve seen in the network automation space arise when a network-focused development team and an engineering team work in isolation. Development teams rarely know how to approach talking to a network device and, even worse, network engineers rarely know the programmatic interfaces available to them.

Key skill six is to understand what programmatic interfaces you have at your disposal, know what they can do for you, and understand the impact of using them. Just because they are there doesn’t mean you can use them for the purpose you had in mind. For instance, I would not turn on verbose monitoring over a NETCONF TCP session, because I know CPU consumption would ramp up, rendering my control-plane unavailable and probably causing an outage. Instead I might increase my SNMP poll rate to get higher data granularity. That will still affect the CPU, so I have to think about the actual cost of my requirement. Should I really be increasing my SNMP poll rate? What is it that I want to achieve here? If it’s faster detection of a threshold, why not set up an SNMP trap with a threshold on the network operating system? Does that buy me what I need?

Also, learning about these programmatic interfaces means that any development team you encounter can benefit from your knowledge instead of re-inventing the wheel again and again. This is more common than you would like to think. Key skill four talks about sharing and attitude. Sharing with a development team your knowledge of how to do something in your domain will improve the attitude of the organisation, and you might learn something from them. Hell, you might need a vendor module integrated against your automation platform. You might have just earned enough respect to receive help from them!

Call To Action

If you are a network engineer and are wondering where you go next given all of the noise about automation, the “network automation engineer” persona is a natural progression.

Learning new skills can be fun and meeting the requirements laid out by your organisation with those new skills can be both mentally and financially rewarding for all involved. Ensure you translate the value of learning new skills properly to those that might stand in the way of progress.

All organisations are on a journey technologically and there is no silver bullet or perfect solution. All targets are on the move and that means we all have to keep up, that includes you.

Becoming the network automation engineer does not mean you have to become a developer, but it does mean you have to think about data, data transformation and how to react to unsolicited data generated from the network.

Every computing problem comes down to data and transforming it based on a decision. We can do this at the automation, programming or machine-learning level, and they’re not exclusive. There is no silver bullet or rainbow-pooping unicorn. Remember our cute carrot-wearing donkey from part three?

Just because you have the hunger to embrace automation and the culture that goes with it does not mean your organisation does. Do not be afraid to change the organisation you spend your life energy on in order to progress.

Go looking for organisations that present problems with an appetite to solve them. When industry celebrities disclose their infamous stories, they didn’t find a job title that let them gather the experience. They took risks when something needed to be solved. Some people try this and get fired. The higher the appetite, the less likely you are to get fired. In most DevOps books, despite lots of organisations transforming under the weight of business demands, some changes do not remain for long. Entire teams are disbanded when new management takes over. Roll with the punches.

Be brave, network engineer. Make your life easier, get those mental and financial rewards, react to business needs quicker and, most importantly, have fun doing it. You are human, and humans need a sense of achievement and meaning. Go forth and evolve!

The post Network Automation Engineer Persona: Part Four appeared first on ipengineer.net.

by David Gee at October 13, 2017 01:56 PM

ipSpace.net Blog (Ivan Pepelnjak)

[Video] Building a Pure Layer-3 Data Center with Cumulus Linux

One of the design scenarios we covered in Leaf-and-Spine Fabric Architectures webinar is a pure layer-3 data center, and in the “how do I do this” part of that section Dinesh Dutt talked about the details you need to know to get this idea implemented on Cumulus Linux.

We covered a half-dozen design scenarios in that webinar; for an even wider picture check out the new Designing and Building Data Center Fabrics online course.

by Ivan Pepelnjak (noreply@blogger.com) at October 13, 2017 07:48 AM


October 12, 2017


Network Automation Engineer Persona: Part Three

Part three! Let’s get straight to business and carry on where we left off from part two.

Key Skill One

Thinking about automation in an agnostic way is your first footstep. Automation is about data flowing through building blocks that do things, and decision points that determine when to do things.

Removing the CLI and replacing it with an abstraction layer isn’t much of a win on its own. For instance, I regularly talk about the process of creating a VLAN and applying it to an Ethernet switch-port on a tagged interface. This somewhat simple ‘workflow’ creates more conversational friction than imaginable. Let’s work through it.

Task: Create a VLAN
This task requires domain-specific parameters to a VLAN. These are: ‘VLAN_Number’ and ‘VLAN_Description’.

Task: Apply VLAN to Switchport
This task requires domain-specific parameters to a switchport. These are: ‘Port_Name’ and ‘VLAN_Number’.
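The two tasks above can be sketched as plain data passed between workflow steps. A minimal sketch only: the function names are illustrative, not any specific platform's API.

```python
# Illustrative sketch: the two workflow tasks as functions passing
# declarative data along the success path. Names are assumptions.

def create_vlan(vlan_number, vlan_description):
    # Task 1: returns the domain-specific parameters for the VLAN.
    return {"VLAN_Number": vlan_number, "VLAN_Description": vlan_description}

def apply_vlan_to_switchport(port_name, vlan_number):
    # Task 2: consumes the VLAN number produced by task 1.
    return {"Port_Name": port_name, "VLAN_Number": vlan_number}

# Success path: the output of one task feeds the next.
vlan = create_vlan(100, "finance")
result = apply_vlan_to_switchport("Ethernet1/1", vlan["VLAN_Number"])
print(result)  # {'Port_Name': 'Ethernet1/1', 'VLAN_Number': 100}
```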

Note how the inputs flow through the actions within the workflow?

The green arrows descending illustrate the ‘success transition path’ for each action component.

So, what about these questions?
1. Is the VLAN in use?

We can be more specific here, but it adds complications to the answer. Version two is: “Is the VLAN in use in the network zone that the device resides in?” To get a reliable answer, we now need to model the network and test for true/false values against a rule. It might be to ensure that the VLAN doesn’t appear in a device radius of +1 around the device in question. See how complex this just got?
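To make the "is the VLAN in use in this zone" rule concrete, here is a toy model check. The zone and device data are invented for illustration; a real model would be backed by your source of truth.

```python
# Toy network model answering "is this VLAN in use in the zone the
# device resides in?" Topology data is invented for the example.

ZONES = {
    "dc1-zone-a": {"leaf1": {10, 20}, "leaf2": {20, 30}},
    "dc1-zone-b": {"leaf3": {10}},
}

def vlan_in_use(zone, vlan):
    # True if any device in the zone already carries the VLAN.
    return any(vlan in vlans for vlans in ZONES[zone].values())

print(vlan_in_use("dc1-zone-a", 30))  # True
print(vlan_in_use("dc1-zone-b", 30))  # False
```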

2. When I convert this flowchart to a workflow, for the task “Apply VLAN to Switchport”, how do I make sure the VLAN is applied to the correct device?

The answer to this is “This task will accept a hostname, driver kind and credentials to reach the device”. More complexity. What is a driver in this scenario? It is something that converts our declarative information and through the process of an imperative implementation (could be an application or a script), it delivers the information transformed in device specific parlance. In other words? The input is sent to something that talks the same languages as the device. Different automation platforms have different ways of dealing with these constructs, but the theory still applies.
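The driver idea can be sketched in a few lines: the same declarative input goes in, and device-specific commands come out. The command syntax below is illustrative only, not a vendor reference.

```python
# Sketch of the "driver" concept: declarative data in, imperative
# device-specific commands out. Command syntax is illustrative.

def ios_style_driver(task):
    return [f"vlan {task['VLAN_Number']}",
            f"name {task['VLAN_Description']}"]

def junos_style_driver(task):
    return [f"set vlans {task['VLAN_Description']} "
            f"vlan-id {task['VLAN_Number']}"]

DRIVERS = {"ios": ios_style_driver, "junos": junos_style_driver}

def render(driver_kind, task):
    # The workflow stays agnostic; the driver speaks the dialect.
    return DRIVERS[driver_kind](task)

task = {"VLAN_Number": 100, "VLAN_Description": "finance"}
print(render("ios", task))
```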

3. Is it good practice to record the changes I make?

Yes! The reason we have so many broken ‘sources of truth’ in our trade is that we do not record enough. We are used to making a change, then updating the ‘source of truth’. Why not change the ‘source of truth’ automatically as part of the automation? Totally possible and considered normal. What, how and where are conversations to have with your organization.

4. When the data has been applied, do I validate?

Yes, validate, but decide how deep you go before you commit to creating the workflow. For our simple example, which is already growing in complexity, there are layers of validation we can achieve. That list looks like:
– “Check that the configuration appears in the device configuration”
– “Check that the configuration appears in operational output (show commands in CLI parlance)”
– “Can I see hosts on the VLAN?”. This requires gaining Layer2 and Layer3 visibility on to the VLAN using operational API calls.
– “I’m going to ensure it’s passing traffic”. This approach is a separate workflow in its own right. Now we might plumb a traffic generator between a set of ports over this VLAN to ensure it’s passing traffic.
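The first two validation layers can be encoded as ordered checks. A sketch under stated assumptions: the config and show-output strings are stand-ins for real device responses.

```python
# The layered validation above as named checks. Inputs are stand-ins
# for a real running-config and operational show output.

def validate_vlan(vlan, running_config, show_output):
    return {
        "in configuration": f"vlan {vlan}" in running_config,
        "in operational output": str(vlan) in show_output,
    }

checks = validate_vlan(100,
                       running_config="vlan 100\n name finance",
                       show_output="100  finance  active")
print(all(checks.values()))  # True
```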

Our workflow now looks different and can take many forms. We have decision making happening within the flowchart and sub-requirements for deeper integration with our target devices as well as supporting infrastructure like traffic generation capabilities.

To summarise, an “agnostic view” is a key skill to develop. Network operating systems have a weird way of getting under your skin, and you learn to love some and loathe others. As hard as it might be, it’s time to move on.

Key Skill Two

Figure out what your automation requirements actually are before you side with a technology.

This might not sound like a skill, but our habit of siding with a network vendor and then trying to make the technology fit is an old one. Call it culture or bad habit; same point.

Just because one platform is written in Python and another in Ruby should not lead you to a decision on platform choice. Support for third-party integration is one decision point, and ease of creating workflows is another. By modern expectations, if the community is dead then the platform’s longevity is in question.

The right tool for the right job is the logical answer, but it’s a case of understanding your environment requirements end-to-end, the potential future and attitude of your organisation. Your analysis might result in listing multiple platforms. This is normal and just because one platform doesn’t offer everything on your list, it does not mean the things it can do are to be discounted.

This skill will allow you to be aware of Unicorns. More often than not, Unicorns are just Donkeys wearing a carrot with sparkly hooves. There is no all-encompassing wonder platform. Even more fun, the offerings constantly change!

Key Skill Three

Understand how, why and when to consume vendor automation extensions. Vendor automation extensions or libraries simply provide one interface for you to consume whilst talking something else underneath, like CLI, NETCONF or even perhaps SNMP. There are no hard-and-fast rules on the binding methods.

Simple translation: This could be to consume an Ansible module or StackStorm pack. These modules and packs consume arbitrary and agnostic data and deal with the imperative how. No longer do you worry about memorizing a command, but the focus is on feeding arbitrary data parameters into a black software box and something happening on the target device.

Network vendors also offer abstraction libraries. Juniper, for example, offers the PyEZ library. These libraries are designed to be used by script writers and application developers. Unless someone has written an integration to an automation platform, deciding to consume the library directly requires more technical expertise to keep your environment loosely coupled.

You might wonder what this means if you’re a network engineer, and a simple answer is: “Imagine automation integrations being like a child’s play set. Both play sets allow you to push shapes through shaped holes. Both play sets have a circle shape and a circle-shaped hole. One circle shape might not fit through the hole on the play set it doesn’t belong to, despite the shapes being the same.”

Taking a real example, creating a VLAN in one vendor library might be a direct call to a createVLAN() function. In another library it might be a runCommand(commands) function. Both achieve the same thing, but now you have a different coupling method, which technically is an abstraction problem. This is referred to in general as being “tightly coupled”: it means you can’t swap one thing for another with ease.

If your automation layer strategy is to arrange declarative/imperative blocks into an engineering flowchart and have data flow through it like a stream, then your stream will be choppy if it has to move around different objects and blocks. Imagine a multi-layer stream with a focus on making the top stream as smooth as possible. It’s not always possible, but you can reduce some of the pain.

As a good example, NAPALM tries to do just that. For getting information, each driver aims to support the same set of ‘getters’, and setting information comes down to pushing specific configuration chunks over whatever mechanism the driver supports. The tool offers a support matrix that shows what ‘could’ be supported; actual feature support against vendor platforms comes down to the implementation of each driver. This approach means you get one ‘getter’ for showing BGP neighborships, for instance, across all the vendor platforms the tool supports, so your flowchart calls the same function or module regardless of the vendor. To boil this point down to the basics: your automation implementation doesn’t change when you change the vendor. Going back to the stream analogy, this gifts us smooth sailing.
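The vendor-agnostic getter idea can be sketched with toy driver classes. These are stand-ins to illustrate the pattern, not real NAPALM code or data.

```python
# Sketch of the NAPALM-style idea: every driver exposes the same
# getter, so the workflow never cares which vendor it is talking to.
# Both classes and their return values are invented for illustration.

class EosStyleDriver:
    def get_bgp_neighbors(self):
        return {"10.0.0.1": {"is_up": True}}

class JunosStyleDriver:
    def get_bgp_neighbors(self):
        return {"10.0.0.2": {"is_up": False}}

def down_neighbors(driver):
    # One function, any driver: the coupling stays loose.
    return [peer for peer, state in driver.get_bgp_neighbors().items()
            if not state["is_up"]]

print(down_neighbors(EosStyleDriver()))    # []
print(down_neighbors(JunosStyleDriver()))  # ['10.0.0.2']
```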

Good, agnostic tools try to avoid introducing more entropy as their list of supported systems grows. Usage friction is reduced, and declarative modules become familiar regardless of the systems you are interfacing against.


The next article will contain three more key skills.

Notice that these skills are not product based, or vendor parlance based. These are generic skills that can be applied widely. Are these not worth investing in?

Some of these skills have been approached at a high level. I’ll look at creating more articles to describe these key skills in more depth! Please make comments or email to show interest!

This is part three of a four part series. Here is the last part if you want to read on.

The post Network Automation Engineer Persona: Part Three appeared first on ipengineer.net.

by David Gee at October 12, 2017 05:53 PM

My Etherealmind

Musing: Network Fabrics Of History

Once upon a time, there were many fabrics. There can be only one.

by Greg Ferro at October 12, 2017 05:45 PM

The Networking Nerd

More Accurate IT Acronyms

IT is flooded with acronyms. It takes a third of our working life to figure out what they all mean. Protocols aren’t any easier to figure out if it’s just a string of three or four letters that look vaguely like a word. Which, by the way, you should never pronounce.

But what if the acronyms of our favorite protocols didn’t describe what the designers wanted but instead described what they actually do?

  • Sporadic Network Mangling Protocol

  • Obscurity Sends Packets Flying

  • Expensive Invention Gets Routers Puzzled

  • Vexing Router Firmware

  • Really Intensive Protocol

  • Someone Doesn’t Worry About Networking

  • Somewhat Quixotic Language

  • Blame It oN DNS

  • Cisco’s Universal Call Misdirector

  • Some Mail’s Thrown Places

  • Mangles Packets, Looks Silly

  • Amazingly Convoluted Lists

  • ImProperly SECured

  • May Push Lingering Sanity To Expire

Are there any other ones you can think of? Leave it in the comments.

by networkingnerd at October 12, 2017 05:02 PM

Networking Now (Juniper Blog)

Global AWS Transit VPC solution with Integrated security at a competitive price point


The Transit-VPC solution with Juniper’s virtual SRX allows enterprises to seamlessly add NGFW services and connectivity to large multi-VPC AWS deployments. This solution utilizes a hub-and-spoke topology where every VPC connects to a special “transit VPC” which serves as a central hub for internal traffic, as well as external traffic sent to the corporate on-premises data center or the internet.

by praviraj at October 12, 2017 04:40 PM

Cyber-Threat: Is your team up to the challenge?

Our people are our greatest asset; this is the universal mantra amongst organizations who value their staff and reputation, and never has this been truer than in the fight against cyber-crime. If any team has ever been asked to up its game and be one step ahead of an ingenious and cunning enemy, it’s the network and IT security team tasked with overcoming cyber-criminals.

by lpitt at October 12, 2017 04:39 PM

“Alexa, ask SkyATP…”


Technology isn’t the only thing that has advanced in recent years.  User behavior has been evolving as well. Take something as simple as getting from Point A to Point B: from printed maps to GPS-enabled smartphone maps to, in the near future, self-driving cars, we have all adopted new behaviors and adapted to new technologies that make our lives much easier and more efficient.


by praviraj at October 12, 2017 04:33 PM

Security and the de-commoditisation of data


Data is bucking the trend that suggests all things eventually become a simple commodity. We’ve all spent years just seeing data as ‘there’; whether it’s a spreadsheet, email or information on a website or social media, data just exists. However, with the recent and massive growth in stored data, its value throughout its lifetime has now changed.


What do I mean by this?

by lpitt at October 12, 2017 04:30 PM

Securing the Distributed Enterprise


Today’s enterprise is complex, with digital operations and data located in the cloud, at headquarters and on remote sites. Couple that with the business need for always-on working and application reliability, and it’s not surprising that teams often struggle under the weight of their work.

by lpitt at October 12, 2017 04:24 PM

Moving Packets

My Lexicon: Fexen

Fexen (noun, pl.; pronounced Fex-uhn)


Do we have any copper FEXen on those switches?


Fexen is the plural of FEX (the Cisco Nexus Fabric Extender modules). Oh, I know, “FEXes” is just as easy to say, but somehow FEXen seems to work better. Try and use this word in conversation today and see how it feels.

We have about 20 FEXen distributed around the data center.

I think you’ll like it.

If you liked this post, please do click through to the source at My Lexicon: Fexen and give me a share/like. Thank you!

by John Herbert at October 12, 2017 02:51 PM

My Etherealmind

Response: Network Management: Why the CLI?

People get angry when I talk about the death of the CLI for network operations. I’m not the only one: I believe that in today’s network device landscape, the CLI is primarily a tool to promote vendor lock-in by training network engineers to rely on and value it so highly. Yes, devices can be configured […]

by Greg Ferro at October 12, 2017 02:11 PM

ipSpace.net Blog (Ivan Pepelnjak)

Turn Your Ansible Playbook into a Bash Command

In one of the previous blog posts I described the playbook I use to collect SSH keys from network devices. As I use it quite often, it became tedious to write ansible-playbook path-to-playbook every time I wanted to run the collection process.

Ansible playbooks are YAML documents, and YAML documents use # to start comments, so I thought: “what if I used a YAML comment to add a shebang and turn my YAML document into a script?”

TL&DR: It works. Now for the longer story…
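The trick looks roughly like this. A sketch only: the play body is a placeholder, not Ivan's actual key-collection playbook.

```yaml
#!/usr/bin/env ansible-playbook
# The shebang above is also a valid YAML comment, so the file remains a
# legal playbook. chmod +x the file and run it like any other command.
---
- name: placeholder play (the real playbook collects SSH keys)
  hosts: all
  gather_facts: false
  tasks:
    - debug:
        msg: "your tasks here"
```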

Read more ...

by Ivan Pepelnjak (noreply@blogger.com) at October 12, 2017 08:38 AM

October 11, 2017

My Etherealmind

Zodiac WX – Northbound Networks

A WiFi Base station using OpenFlow for $250. The Zodiac WX is the world’s first fully integrated OpenFlow® Wireless Access Point. It is a high powered ceiling / wall mountable Dual-Band AC1200 AP that includes 2 Gigabit Ethernet ports and support for PoE. We have integrated our Zodiac OpenFlow® engine directly into the wireless drivers so […]

by Greg Ferro at October 11, 2017 05:56 PM

ipSpace.net Blog (Ivan Pepelnjak)

Update: Brocade Data Center Switches

Second vendor in this year’s series of data center switching updates: Brocade.

Not much has happened on this front since last year’s update. There was a maintenance release of Brocade NOS, they launched SLX series of switches, but those are so new that the software documentation didn’t have time to make it to the usual place (document library for individual switch models), it's here.

In any case, the updated videos (including edited 2016 content which describes IP Fabric in great details) are online. You can access them if you bought the webinar recording in the past or if you have an active ipSpace.net subscription.

by Ivan Pepelnjak (noreply@blogger.com) at October 11, 2017 04:06 PM

Moving Packets

iTerm2 Tip: Repeating Commands Using a Coprocess

iTerm2 is a great terminal for MacOS; far better than Apple’s built-in Terminal app, and it’s my #1 recommendation for Mac-based network engineers. One of the many reasons I like it is that it has a feature that solves a really annoying problem.

Iterm Repeat Title

It’s tedious having to issue a command repeatedly so that you can see when and if the output changes. I’ve had to do this in the past, repeating commands like show ip arp so that I can spot when an entry times out and when it refreshes. The repeated sequence of up arrow, Enter, up arrow, Enter, up arrow, Enter drives me mad.

Some vendors offer assistance; A10 Networks for example has a repeat command in the CLI specifically to help with show commands:

a10-vMaster[2/2]#repeat 5 show arp
Total arp entries: 25       Age time: 300 secs
IP Address         MAC Address          Type         Age   Interface    Vlan
---------------------------------------------------------------------------
                   0000.5e00.01a1       Dynamic      17    Management   1
                   ac4b.c821.57d1       Dynamic      255   Management   1
                   001f.a0f8.d901       Dynamic      22    Management   1
Refreshing command every 5 seconds. (press ^C to quit) Elapsed Time: 00:00:00
Total arp entries: 25       Age time: 300 secs
IP Address         MAC Address          Type         Age   Interface    Vlan
---------------------------------------------------------------------------
                   0000.5e00.01a1       Dynamic      22    Management   1
                   ac4b.c821.57d1       Dynamic      260   Management   1
                   001f.a0f8.d901       Dynamic      27    Management   1
Refreshing command every 5 seconds. (press ^C to quit) Elapsed Time: 00:00:05

I have used this feature quite a lot, and I particularly like that there’s a built-in elapsed time marker between each command. But how to do the same in NXOS (for example) which does not have the repeat command?

iTerm2 Run Coprocess

Like many feature-rich applications, I suspect many users don’t get to try out all of iTerm2’s available features, because the base functionality is so good there’s often no need to dig much further into it than how to select a font and color. The Run Coprocess command is one that I simply don’t hear people talk about, so let’s take a look!

Iterm Run Coprocess Menu Option

The Run Coprocess command allows a shell script or other executable to generate output which is sent as input to the terminal window. It’s easy to test by creating a shell script which uses echo command to write text to STDOUT:

echo ls -al

Make the file executable (chmod +x <filename>), then in a window which is at a shell prompt, run the file using Run Coprocess:

Iterm Run Coprocess

It’s necessary to include a full path to the script, in this case my home directory, or ~/. The script runs and ls -al is echoed to my active terminal window as if I had typed it at the command prompt. Voila, a file listing!

Repeating Commands

If I extend the logic above, it should be easy to create a script which will issue the ‘show ip arp’ command at a known interval and, to get a timestamp, how about I add a ‘show clock’ command first too?:

while true; do
    echo show clock
    echo show ip arp
    sleep 5
done

Since this script will loop forever, it is probably a good time to mention how to stop a Coprocess:

Iterm - Stop Coprocess

Does it work? You bet:

nxos-sw1# show clock
15:59:06.009 UTC Sat Oct 07 2017
nxos-sw1# show ip arp 

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       D - Static Adjacencies attached to down interface

IP ARP Table for context default
Total number of entries: 2
Address         Age       MAC Address     Interface
                00:10:11  00a7.420b.0570  Vlan1984
                00:10:51  00c1.647f.9ac0  Vlan1983

nxos-sw1# show clock
15:59:11.013 UTC Sat Oct 07 2017
nxos-sw1# show ip arp

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       D - Static Adjacencies attached to down interface

IP ARP Table for context default
Total number of entries: 2
Address         Age       MAC Address     Interface
                00:10:16  00a7.420b.0570  Vlan1984
                00:10:56  00c1.647f.9ac0  Vlan1983

Bearing in mind that Cisco IOS/NXOS/XR all ignore commands beginning with a !, the timestamp generation could even be moved into the script:

while true; do
    echo ! `date`
    echo show ip arp
    sleep 5
done

And the result is as you’d hope:

nxos-sw1# ! Sat Oct 7 12:03:26 EDT 2017
nxos-sw1# show ip arp vrf prod

This is helpful because now I can create a repeating command tool for Cisco devices which will issue a timestamp, issue any command I want, and repeat with whatever interval I choose as well:


#!/bin/bash
# Call as repeatcmd.sh <command> [interval]

# Get text to output from args
CMD=$1

# Check that a command was supplied
if [[ $CMD == "" ]]; then
    echo "Usage: repeatcmd.sh <command> [interval]"
    exit 1
fi

# Default timeout is 3s
REPEAT=3

# ...but if you set a 2nd arg which is >0, that's the new repeat timer
if [ "$2" -gt 0 ] 2>/dev/null; then
    REPEAT=$2
fi

# Send the command text every $REPEAT seconds
while true; do
    echo ! `date`
    echo $CMD
    sleep $REPEAT
done
As long as the command argument ($1) is issued as a single string (use quotes!) I can now trigger any command:

Iterm - Generic Coprocess Command

And the result:

us-atl01-z1fa07a# ! Sat Oct 7 12:09:01 EDT 2017
us-atl01-z1fa07a# show ip arp vrf prod

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       D - Static Adjacencies attached to down interface

IP ARP Table for context default
Total number of entries: 2
Address         Age       MAC Address     Interface
                00:01:24  00a7.420b.0570  Vlan1984
                00:02:04  00c1.647f.9ac0  Vlan1983
us-atl01-z1fa07a# ! Sat Oct 7 12:09:03 EDT 2017
us-atl01-z1fa07a# show ip arp

Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       D - Static Adjacencies attached to down interface

IP ARP Table for context default
Total number of entries: 2
Address         Age       MAC Address     Interface
                00:01:26  00a7.420b.0570  Vlan1984
                00:02:06  00c1.647f.9ac0  Vlan1983

Note that the timestamps are 2 seconds apart. The unix sleep command is pretty vague about how precise it will be (here it appears to have slept for around two seconds), but with a timestamp to confirm, at least I’ll know what the delay actually was.

Handy, right? I think so. What other uses do you have for a terminal Coprocess?

If you liked this post, please do click through to the source at iTerm2 Tip: Repeating Commands Using a Coprocess and give me a share/like. Thank you!

by John Herbert at October 11, 2017 03:36 PM

My Etherealmind

GE’s Irish wind farm to supply total output to Microsoft for 15 years

Getting difficult to build a cloud DC when you have to buy your own power source

by Greg Ferro at October 11, 2017 02:03 PM

Dyn Research (Was Renesys Blog)

Performance Implications of CNAME Chains vs Oracle ALIAS Record

The CNAME resource record was defined in RFC 1035 as “the canonical name for an alias.” It plays the role of a pointer, for example, the CNAME informs the requestor that www.containercult.com is really this other name, instance001.couldbalancer.example.com.

The CNAME record provides a “configure once” point of integration for third party platforms and services. A CNAME is often used as opposed to an A/AAAA record for the same reason developers often use variables in their code as opposed to hard coded values. The CNAME can easily be redefined by the third party or service provider without requiring the end user to make any changes.

The stipulation that prevents use of a CNAME at the apex is that no other records can exist at or alongside a CNAME. Since records such as the Start of Authority (SOA) must be defined at the apex, an end user cannot place a CNAME at the apex of their zone.

ALIAS / ANAME – The way of the future 

The Oracle ALIAS record allows for CNAME-like functionality at the apex of a zone. The Oracle implementation of the ALIAS record at the apex uses private internal recursive resolvers to “unwind the CNAME chain.”

Consider, for example, a web application firewall (WAF) implementation which uses a CNAME to direct users to the WAF endpoint. The consumer of the service simply creates a CNAME to the endpoint provided. The initial mapping is the only thing over which the consumer has control. After implementing the service, we can dig deeper into the way the service is implemented in the DNS. Below we see the full CNAME chain.

www.containercult.com.                          60      IN      CNAME   www-containercult-com.wafservice.com.
www-containercult-com.wafservice.com            300     IN      CNAME   control.wafservice.com.
control.wafservice.com.                         120     IN      CNAME   endpoint-cloud-vip.wafservice.com.
endpoint-cloud-vip.wafservice.com.              3600    IN      CNAME   loadbalancer1337.lb.cloudprovider.example.com.
loadbalancer1337.lb.cloudprovider.example.com.  60      IN      A

In the example above, the WAF service is implemented via a CNAME record mapping www.containercult.com to www-containercult-com.wafservice.com. The service operator maps the vanity CNAME to a service name, control.wafservice.com. This is a CNAME to another record in the wafservice.com zone, which is ultimately a CNAME to a load balancer endpoint at a cloud provider.

The Oracle ALIAS record is implemented in a way in which our internal resolver will constantly keep all of these records in cache. When a recursive resolver requests www.containercult.com, we can hand back the A record for the cloud load balancer. This reduces variability from cache misses, network latency, packet loss, etc. Saying it reduces variability is one thing, quantifying it is another. 
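Unwinding the chain can be illustrated with a toy resolver cache. The records mirror the example chain above; the final A record value is a placeholder from the documentation range, since no address is given in the text.

```python
# Toy illustration of "unwinding the CNAME chain": follow CNAMEs in a
# cache until an A record is reached. The 192.0.2.10 address is a
# placeholder (RFC 5737 documentation range), not a real endpoint.

CACHE = {
    "www.containercult.com.": ("CNAME", "www-containercult-com.wafservice.com."),
    "www-containercult-com.wafservice.com.": ("CNAME", "control.wafservice.com."),
    "control.wafservice.com.": ("CNAME", "endpoint-cloud-vip.wafservice.com."),
    "endpoint-cloud-vip.wafservice.com.": (
        "CNAME", "loadbalancer1337.lb.cloudprovider.example.com."),
    "loadbalancer1337.lb.cloudprovider.example.com.": ("A", "192.0.2.10"),
}

def unwind(name, hops=0):
    # Follow CNAME pointers; return the final address and hop count.
    rtype, value = CACHE[name]
    if rtype == "CNAME":
        return unwind(value, hops + 1)
    return value, hops

print(unwind("www.containercult.com."))  # ('192.0.2.10', 4)
```

With the whole chain held hot in cache, the authoritative server can answer with the final A record directly, which is the variability reduction the ALIAS record is after.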

ALIAS Testing 

To quantify the reduction in variability and potential performance gains from ALIAS/ANAME record implementation, we performed a number of tests using the RIPE Atlas network. The RIPE Atlas platform provides access to the internal resolvers used by a number of ISPs that are only accessible from their networks. It also allows us to run tests from the perspective of end users, providing insight into the last mile of a number of global networks. To select which networks would be included in testing, we took a one month sample of production traffic to our authoritative DNS platform and selected networks from the top twenty which also had appropriate RIPE Atlas probe density.  

Variables being considered: 

  • End User / Client – Testing from the perspective of end users is critical to understanding the nuance of internet performance.  
  • Recursive Resolver – Recursive resolver implementations have varying configurations. Some modify the TTL of records; some are operated as clusters with a large shared cache while others have many individual caches; some perform prefetching of popular records; etc. 
  • Authoritative Resolvers – In the example above, there are three different namespaces being referenced. Each might be served by a different authoritative provider which might have varying proximity to the end user’s recursive resolver. 
  • Networks – The networks facilitating communication between all these components have different performance profiles from the last mile to well-connected internet exchanges 

Test 1: WAF Service Implementation 

A set of RIPE Atlas probes acting as clients configured their default local resolver to request two records. One record being the first in a CNAME chain for a WAF, the other being an ALIAS record for the same WAF service. As expected, the raw results contain a number of outliers in both test scenarios created by packet loss and last mile performance issues. 

For example: In the time series below, you can see some pretty serious outliers. 

A time series isn’t ideal for communicating what happened. As you can see above, it looks like “most” response times were less than 1000 ms. To better quantify, we look at a histogram of the results. 

The median response time for the WAF ALIAS record was 44.96 ms, whereas the median response time for the WAF CNAME chain was 63.18 ms, a difference of 18.22 ms. The boxplots below indicate that the median response time for the ALIAS record is aligned with the beginning of the 2nd quartile of response times for the CNAME chain.

Test 2: Cloud Load Balancer  

Test 1 focused on a CNAME chain with 5 links, whereas many implementations might have only a single CNAME. To test this scenario, the same population of probes requested one record, which was a CNAME, to a cloud load balancer and another record, which is an ALIAS, pointing to the same load balancer. 

Test 3: Counter Point 

The first two tests showed clear performance gains for the ALIAS/ANAME implementation. We thought it was important to create an example of the opposite, an instance where the ALIAS record is slower, to highlight some nuance. To accomplish this, we set up some tests in South Korea. South Korea is known for having well-provisioned, high-speed networks deployed within the country, but paths out of the country to the wider internet can be slower. 

For this test, the CNAME chain example can be resolved within South Korea. The clients, recursive resolvers, and authoritative providers all have a presence within the country. Resolving the ALIAS record requires the in-country resolver to issue queries to either Hong Kong or Tokyo, which takes much longer than resolving the CNAME chain in country. South Korea is well connected internally, but the paths to Tokyo and Hong Kong require traversing undersea cables. This is why it is important to understand your customers’ use cases and monitor performance.

The ANAME provides an option for infrastructure operators that are looking for CNAME-at-the-apex-of-the-zone functionality. The ANAME helps reduce variability in response times to recursive resolvers and clients by actively maintaining the CNAME chain in a local recursive cache. As Evan Hunt pointed out at the DNS OARC meeting in San Jose, as the ANAME standard is adopted, recursive resolvers may start to implement ANAME verification, potentially reducing some of the performance gains of the new record type. That being said, following Lord Kelvin’s advice “to measure is to know” … we will keep on measuring. 
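The chain-flattening idea behind ANAME can be sketched in a few lines: the authoritative side resolves the CNAME target itself and serves plain A records at the zone apex, so clients never chase the chain. The names and the stand-in lookup function below are invented for illustration:

```python
def flatten_aname(apex, target, resolve_a):
    """ANAME-style flattening: resolve the CNAME target on the authoritative
    side and publish the resulting addresses as A records at the apex."""
    return {apex: [("A", addr) for addr in resolve_a(target)]}

# Hypothetical lookup standing in for the provider's local recursive cache.
def fake_resolve_a(name):
    return ["192.0.2.10", "192.0.2.11"]

print(flatten_aname("example.com.", "waf.vendor.example.", fake_resolve_a))
# {'example.com.': [('A', '192.0.2.10'), ('A', '192.0.2.11')]}
```

The real work in an ANAME implementation is keeping those published addresses fresh as the target’s answers and TTLs change, which is the “actively maintaining” part described above.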

For more detail check out our webinar.

by Chris Baker at October 11, 2017 01:39 PM

Network Design and Architecture

Microwave or Fiber which one is faster ?

Microwave or Fiber which one is faster ? In this post I will explain the faster connectivity option, some of the use cases for each, and a few deployment considerations. Why is latency important for some specialised businesses? Have you heard about HFT (High Frequency Trading)? If you like the discussion points, after […]

The post Microwave or Fiber which one is faster ? appeared first on Cisco Network Design and Architecture | CCDE Bootcamp | orhanergun.net.

by Orhan Ergun at October 11, 2017 01:00 PM

October 10, 2017

Networking Now (Juniper Blog)

Simple Steps to Increase Your Online Safety

The internet has revolutionized the way we live our lives, providing greater convenience and access to information, entertainment and services. But it seems that every week we hear about a new virus, cyberattack, or data breach. Cybercriminals are increasing the frequency and sophistication of their attacks on governments, businesses and individuals. They are after our personal information in order to use it against us or for profit.


As the author Bodie Thoene said, “What is right is often forgotten by what is convenient,” and this is unfortunately often the case online. The National Cyber Security Alliance’s website, Stay Safe Online, is a thorough resource, covering everything from online safety basics to securing yourself and your business, and it even delves into how to report cybercrime. Ultimately, our online safety is our own responsibility, and here are a few tips to stay safe online:

by JBlatnik at October 10, 2017 10:34 PM

Creating a Culture of Cybersecurity

For organizations implementing or enhancing cyber-security policies, the type of culture and technology changes required to prevent attacks can be a sensitive issue. The ideal scenario is one of partnership – where employees understand the rationale for policy and act as additional eyes and ears for the company – creating a unified defense against would-be attackers. Here are some key considerations when it comes to creating a culture of cybersecurity.

by bworrall at October 10, 2017 10:31 PM

My Etherealmind

Troubleshooting: A journey into the unknown

Enjoyed this troubleshooting 'war' story from Booking.com

by Greg Ferro at October 10, 2017 06:12 PM


Network Automation Engineer Persona: Part Two

This article is number two in a series. The first part can be found here.

There has been a thought trend in the last few years leading network engineers to think they need to become developers. This is nonsense. When we want to learn a new skill, there is a precursor: “I want to do X, therefore I need to learn Y.” If you’re thinking “I should be learning Python”, I ask: to what end? What is making you ask this question? Maybe the question should be: for a network automation engineer role, what skills do I need to learn? Stop guessing!

The Network Automation Engineer role combines deep network knowledge with the ability to describe, collect and transmit domain-specific data through one or more abstraction-layer components. It requires knowledge of how to collect data from databases and data stores of various types. Where does a list of IP addresses get stored? How are they stored? How are they retrieved? The role requires an awareness of the reasons for making a change and the implications of making it. Gaining the skills to become this persona isn’t a full career change, but a set of additional learnings if you are already a network engineer. Put another way, it’s one version of an upgrade pack for the trusty network engineer persona.
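As a concrete, hypothetical answer to the “where does a list of IP addresses get stored and how is it retrieved” questions: a JSON inventory file is one common choice. The sketch below pulls a group of addresses from such a file and validates each entry; the group name and contents are invented for illustration:

```python
import ipaddress
import json

# Hypothetical inventory contents; in practice this would be read from a
# file or a source-of-truth system.
inventory_json = '{"edge-routers": ["192.0.2.1", "192.0.2.2", "not-an-ip"]}'

def load_addresses(raw, group):
    """Parse a JSON inventory and keep only entries that are valid IPs."""
    valid = []
    for entry in json.loads(raw).get(group, []):
        try:
            valid.append(str(ipaddress.ip_address(entry)))
        except ValueError:
            pass  # skip malformed entries rather than crash a change run
    return valid

print(load_addresses(inventory_json, "edge-routers"))
# ['192.0.2.1', '192.0.2.2']
```

Validating at the point of retrieval is one small example of knowing the implications of a change before you make it: bad data is caught before it reaches a device.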

What if you are a software developer wanting different challenges? Does the same still apply? If you have the appetite to learn networking up to a professional level, then you can take your developer brain and experience with you on this journey. You will also require empathy for the issues and mechanical sympathy for the problems, as wiping the slate clean and starting again isn’t an option for production infrastructure (normally). It’s a game of progression!

The Problem With Human Beans

Solving the ‘volume of changes’ problem requires a different approach to just adding humans and expecting things to scale up. Homo sapiens (or “human beans” as my old physics teacher used to say) have limited throughput and a limited amount of RAM. Our brain waves operate in the 0–100 Hz range (absolute minimum and maximum values for the five types) and physically we’re not exactly capable of catching bullets or breaking the sound barrier with our typing speed. Our volatile memory isn’t as good as RAM: just because our eyes are open does not mean we can retain the contents of a book by scanning over the pages until we go to sleep and power down. We have limits. Despite our slow brain waves, we can achieve an enormous amount. Science today is trying to unlock the magic of the brain, yet, given this tremendous organ we call our mind, we can only do and contemplate so much in a given time domain.

If our brain and physical form can only handle so much, simple and humanistic automation is the first answer to handling demand. As humans, we can automate the things humans would previously have done. This doesn’t involve replacing the network with OpenFlow-capable devices or programmable pipelines that only network-skilled software developers can handle.

Whilst “Knowledge Defined Networking” or “Machine Learning Driven Intent Based Networking” may eventually displace this automation panacea (meaning: automating tasks a human would do, in a human way), we’re not quite at the point where we can throw network devices together and expect a boring “it works” outcome.

Skills To Focus On

Good automation tooling gives us the ability to generate codified flowcharts that take inputs and deal with the ‘how’ in a layered way. It’s about data flowing through our mental flowcharts and what we do with it. I recently wrote another article describing this declarative and imperative thinking.
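Declarative thinking can be sketched as a tiny diff: state what should exist, then compute the imperative steps from what does exist. The VLAN IDs below are invented for illustration:

```python
def plan(desired, actual):
    """Return the imperative steps needed to reach the desired state."""
    return {
        "add": sorted(set(desired) - set(actual)),
        "remove": sorted(set(actual) - set(desired)),
    }

# Hypothetical VLAN state: what we want vs what is on the device.
print(plan(desired=[10, 20, 30], actual=[20, 40]))
# {'add': [10, 30], 'remove': [40]}
```

The caller only declares the end state; the ‘how’ (which adds and removes, in what order) lives in the layer below, which is the layered flowchart idea in miniature.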

On solid foundations, great things can be constructed. As a network engineer who has mastered IP, Ethernet and routing protocols, society has not only made your job harder, it has gifted you empty slots, which can be filled with automation skills.

The next article focuses on skills that you might want to learn to start or progress your journey towards becoming the “network automation engineer” in your organisation. These skills are generic, agnostic and flexible enough for an approach based on logic and common sense.

If you want to read on, here is part three.

The post Network Automation Engineer Persona: Part Two appeared first on ipengineer.net.

by David Gee at October 10, 2017 05:47 PM

My Etherealmind