What do you get when you cross iSCSI with lossless Ethernet? A lot of confusion.
I do quite a lot of presentations regarding converged networks, including Fibre Channel, whether native or over Ethernet (i.e., FCoE), iSCSI, NAS, etc. The hardest part about these presentations are combating some of the expectations that audiences have, considering they may come from server backgrounds, network backgrounds, or storage backgrounds.
Why is this important? Quite frankly, because like fish who don’t know they live in the water, they have come to grow unaware of their own environmental backgrounds. They tend to forget the assumptions with which they make their decisions. Ethernet and Fibre Channel networking people have very different fundamental philosophies about the way their networks run.
In the world of converged networks, this can cause some, er, unintended consequences.
Recently I had a friend -- we’ll call him Pete (because that’s the first fictional name that came to mind) stop by while I was having lunch and ask if he could “ask a quick question.”
(Why is it that the questions are quick, but the answers are so, so long? Just one of those ponderables of life, I suppose.)
“Sure, what’s up?” I said. It’s not like I was hungry anyway.
“I’ve been trying to figure out this whole converged networking thing,” Pete said, and I could see he wasn’t sure what the right words to use were.
“You mean FCoE?” I asked. Usually most of the questions I get surround FCoE. Go figure. Quelle Surprise!
“No, actually, iSCSI.”
“Okay, sure,” I said. “What’s the question?” Technically speaking, iSCSI is a convergence technology. You can run block-level storage on a VLAN running over the same equipment that handles traditional Ethernet. We’ve been doing this for years.
Pete took a deep breath, and then blurted out, “I’ve been told that you can do converged networks with iSCSI, but in order to do converged networks we need to use DCB, so does that mean that we have to run iSCSI over DCB?”
Now it was my turn to be confused. “I’m confused,” I said. Coy about my feelings, I am. “By DCB, do you mean lossless Ethernet?”
“Well,” I said, “It may be that the problem you’re having is that DCB is not the same thing as lossless.”
Pete looked like he was about to cry. “What?!” he said. “But everyone keeps saying that you need DCB for FCoE because it’s lossless and that’s how you get converged networks…” His voice trailed off.
I motioned for him to take a seat. He did. I reached over to grab a napkin and pulled out a pen. “Let me try to clarify a couple of things,” I said. Indicating the napkin I said, “I hope you don’t mind, I tend to think visually.”
He nodded. I swear, he was pouting. “Cheer up,” I said. “We’ll get you on the right page.”
Clearing up the Terminology
Let’s try to break it down a little. “DCB” is actually a category of standard documents. It’s not just one particular thing,” I said, starting to write everything down.
“You see,” I said, tapping at each of the different lines, “there are many parts of DCB, not just lossless. The important thing to keep in mind is that just because something is included as part of the standard, doesn’t mean it’s required for every single case.”
Pete nodded. I took a bite of my food, and instantly regretted it. I hate talking with my mouth full.
“Take thith, for exthample,” I said, trying not to spit food across the table, and swallowed hurriedly. I pointed to the last line on the napkin. “While this is part of the DCB, it doesn’t apply in all convergence cases. You don’t need this for FCoE, for example.”
“Why not?” Pete asked.
I waved my hand. “It’s a bit long to get into it, but I’ll send you the link to that later. For now, let’s just say that these documents are designed to help you understand a way to solve a problem. It’s not necessarily the only way, just a way that the industry had standardized. Make sense?”
“So far,” he admitted.
“So,” I continued. “When someone says that you need DCB, they may be talking about one of the solutions in one of these documents, or two, but not necessarily any specific one of them. You certainly don’t need to do all of them for every possible problem.”
“Okay,” he said. “So what are they talking about when they say they need DCB for iSCSI?”
iSCSI and DCB
“That’s a good question,” I admitted. “I’m not sure what they mean. We’ve been doing converged networks with iSCSI and LAN for a while now, and we certainly don’t need to do DCB in order for that to work.”
“But I keep hearing about ‘Lossless iSCSI,’” he protested. “Don’t you need DCB to do that?”
“Well, yeah,” I said. “That’s kind of written into the name, isn’t it? But do you need all of DCB to make it work? Or just the lossless part?”
He thought about that for a minute. “A fair point,” he said.
“It seems to me that if you’re going to be making a DCB argument for iSCSI -- assuming we’re not talking lossless here -- then you really could use the ETS part of the standard for guaranteeing minimum bandwidth.”
“What do you mean?”
“When you start looking at running iSCSI on 10GbE links,” I said, drawing, “If you want to set aside a minimum amount of bandwidth, well, ETS can help with that.”
“What does it do?” Peter asked.
“Effectively it allows you to place bandwidth into groups and assign bandwidth percentages to those groups,” I said. “For instance, if you want to give iSCSI 30% of the bandwidth of this pipe, you could do that. With Cisco switches we accomplish this through our QoS settings.”
Peter stroked his chin thoughtfully. “And you don’t need DCB -- I mean, lossless -- for this?”
“Not unless you want to run lossless iSCSI,” I said. “But you have to be very clear on why you would want to do that.”
Peter’s eyes grew wide. “Yes, exactly!” he said. “That’s what I need to know!”
I took this as an opportunity to grab another bite of my lunch. “Can you be more specific?” I asked, grimacing at the food. It was cold. I should have ordered a salad.
“So if we can have lossless traffic for Fibre Channel and FCoE, won’t it help with iSCSI too? Why don’t we run all iSCSI traffic over lossless Ethernet?”
“Because we have different design principles for lossy traffic like iSCSI than we do for lossless traffic like Fibre Channel and FCoE,” I replied.
“But it’s all block storage, right?” he asked.
“Yes, that’s true,” I said. “But how a server accesses data on a storage array is a separate issue than how we network that access to the storage.”
I could see by the look on his face that I lost him. Out came napkin number 3.
“Look at it this way,” I said. “One of the things that makes iSCSI so popular is that it doesn’t require all the know-how that Fibre Channel does. It’s not as specialized, right?”
“One of the beautiful things about iSCSI is that, not only is it cheap, but the network itself does a lot of the heavy lifting for you. All you need are the IP addresses of the host and target and then the network takes care of the rest.”
He frowned. “There’s more to it than that,” he said.
“Yes, I know,” I said, a little more dismissively than I meant. “But in essence setting up iSCSI is easy because there are a lot of things you don’t have to do that you would in a Fibre Channel environment.”
The Tyranny of SCSI
“Okay,” I continued. “Let’s start off with SCSI. SCSI makes some assumptions about the way these two devices talk to each other. For instance, SCSI assumes that there is a 1-to-1 relationship between host and target, and that the exchanges between the two devices are always in sequence.”
“And SCSI can’t do error correction,” Peter interjected.
“Well,” I said, hesitantly, “technically it can, but it’s so difficult and can be done in so few cases that for all intents and purposes it’s useless for what we’re trying to do here.”
“Thanks,” Peter said sarcastically. “Way to be pedantic.”
I looked at him for a moment. “Anywaaaaaay…”
“When we moved the disk outside of the host and inserted a network in between them,” I continued drawing on the napkin until I had a respectable mockup of a network (sometimes my artwork amazes even myself). “We still had to maintain that 1:1 relationship, right?”
“So, this is one of the reasons why FC requires a lossless network in the first place: it is very difficult to fix the data if it gets out of sequence or if frames get dropped. But that’s not what happens in iSCSI.”
“With iSCSI we use TCP to create sessions between the host and the target, so that when Ethernet packets get sent all crazy-like over the network -” I drew squiggles across the napkin -- “TCP will keep track of the sequence and reassemble them in order and present it to the SCSI layer.”
“Voilá!” I grinned. “In-order SCSI -- over a lossy Ethernet network! I’m exaggerating a bit, of course, but the point here is that no matter which way the packets take to get to the storage, SCSI still needs to be happy.”
“Very nice,” Peter said. “So why do people want lossless iSCSI?”
I shrugged. “Traditionally, iSCSI has suffered from performance concerns when compared to Fibre Channel. It made sense a while ago, back when Ethernet was running at 1Gig speeds while FC went from 2 to 4 to 8 Gb. The extra serialization in a TCP/IP header also means that there were some impacts on latency.”
“But we’re using 10Gb Ethernet now. Even 40Gb,” he countered.
“Yes, that’s true. But bandwidth is not the same thing as latency,” I said. “Having the extra bandwidth is very useful for all kinds of traffic, it’s true. A lossless iSCSI connection would mean, in theory, that you could simply use the lower latency capabilities of lossless Ethernet to make iSCSI really sing.”
“But you can’t?” Peter asked.
I shook my head. “No, of course you can. But whether you’re talking about Fibre Channel or FCoE or any other type of lossless environment, you design for the lossless environment, not the protocol. Otherwise you can wind up getting really bad performance.”
I pointed to the latest napkin. “See this squiggle here? I know this isn’t drawn to scale, but if you were to take this to a Fibre Channel or FCoE storage architect he would freak. We simply don’t allow traffic to do this in a lossless world.
“But what about iSCSI?” I asked, rhetorically. “Effectively any kind of traffic pattern that could happen in a TCP/IP world could affect iSCSI, right?”
“Which means that right from the start we have very different architectures to take into consideration,” I said. “You’ve seen some of the Ethernet networks. How many switches might you have between a host and a storage target using Ethernet?”
He shrugged. “I have no idea.”
“Exactly,” I said, nodding. “Could be one, could be one hundred. Who knows?”
“Well, maybe not a hundred,” he said, drawing his words out, skeptically.
“Now who’s being pedantic?” I retorted, smiling. “You get the idea. We just don’t know. In lossless environments, however, we do know.”
I started drawing on a new napkin.
“With lossless, whether we’re talking about FC or FCoE,” I said, struggling to talk and draw at the same time. Thank God I wasn’t trying to chew gum. “We have a very clear set of topologies that we play with. We want to make sure our storage is close to the host, because that gives us the best performance. As a result we usually only have one or two switches in between.”
“That’s the ‘edge-core-edge’ topology,” Peter offered.
“Right,” I agreed. “Really large environments can have three, but it’s very, very rare to have any more than that. We do this so that we can keep our bandwidth under tight control with low oversubscription and fan-in ratios.”
I paused, and realized I’d need to explain a little more. I drew a few more lines on the napkin for emphasis. “Different applications have different types of traffic. Some times they’re very bursty, or irregular. Sometimes they’re sustained,” I said. “Since we want to make sure that all our traffic is processed, we need to make sure we strike a balance between the number of servers we have and the amount of bandwidth we have going to the storage.”
“How do you know the right numbers to use?” Peter asked.
“Storage vendors generally provide guidance in this process,” I admitted. “It depends heavily on what kind of applications we’re talking about. Usually the ratios are between 4-to-1 to around 20-to-1.”
I tapped the napkin. “The key thing to remember is that every time we design a Fibre Channel or FCoE SAN we have to take into account the oversubscription ratio. Take this example. Here we have 60 servers running at 4G speeds connecting to 3 8GFC ISLs connected to six 4Gb FC links to storage. The fan-in ratio -- that is, the number of bandwidth from servers to bandwidth to storage -- is at the ratio of 10:1.”
Peter looked troubled. “I thought that it wasn’t true 8G, though.”
“You’re right,” I said. “But the important part here is the ratio, which is correct.”
“Okay,” he said. “But can’t we do this with iSCSI?
“Well, if we’re going to do lossless iSCSI, we pretty much have to take these same issues into consideration. But the fan-in ratio can be an order of magnitude greater -- or more!”
“So what would happen if you did this with iSCSI?” Peter asked.
“There’s always a chance in a lossless network for Head of Line Blocking,” I said as I pulled another napkin, realizing I had started quite a collection. “Let me give you an example.”
“Let’s suppose that you have 1G connections from your servers into a 10G array,” I tried to sketch something out quickly so as not to lose momentum. “And let’s say, for the sake of simplicity, that you have steady traffic from all of these servers going up to the target.”
“So far, everything looks copacetic, right?
“Yup,” Peter confirmed.
“What happens when you start getting a bursty traffic from a source, like this guy here on the left, that saturates the link?” I said, with melodramatic suspense.
“We’ve got congestion,” Peter offered.
“You got it,” I confirmed. “Normally what happens is that the TCP session will renegotiate with the host in order to reduce the transmission window sizes. But since that’s layer 4, and lossless traffic happens at layer 2, this will happen before TCP gets a chance to kick in.”
I pointed to the top switch. “This guy is getting hit pretty hard. He’s got 12G coming in, but can only process 10G at a time. So, the lossless “no-drop” traffic is queued and the buffers begin to fill up.”
I raised my hands to demonstrate a rising water level. “When we get to that high water mark, PFC kicks in, and sends PAUSE frames down the links,” I said, drawing on the napkin again.
“So,” I said, looking up from the napkin. ‘Congratulations. Everything is flow controlled.”
Peter looked confused. “But everything is paused!”
“Yup. It’s controlled.” I couldn’t suppress a sardonic smirk.
“So this happens in Fibre Channel?” he asked, dubious.
“It can,” I admitted, “which is why we control for these types of behaviors with predetermined traffic engineering for our SANs. We carefully calculate the oversubscription ratios, the fan-in ratios, and make sure that we’re operating within tight parameters.”
“Besides,” I said, collecting the napkins, “in Fibre Channel this Head-of-Line-Blocking is a transitory event. Once some BB Credits are returned to the blocked port, the port can send again. In a PAUSE situation, though, there are no BB_Credits. With PAUSE, it’s on a timer.”
“But what about TCP?” he asked.
I shrugged. “TCP isn’t invoked immediately, because frames are PAUSED, not dropped. At this point, TCP doesn’t enter into it. But you raise an excellent point: what about TCP?”
What About TCP?
“See, TCP is session-based,” I said, tapping the pen between the host and the target. “This means a couple of things.
“First, in a normal iSCSI environment TCP would handle all the appropriate window sizing between the two entities here. That means that in this lossless environment you can wind up locking out all of the hosts as the PAUSE timers work themselves out, instead of letting TCP do its job.
“Second, and most importantly,” I said, “If you’re not careful, there is the potential for really messing up the session altogether and closing it, even possibly losing data.”
Peter’s eyes widened. “What? How?”
“Simple, son.” I said in my best Strother Martin impression. “What we have here, is a failure to communicate.”
Peter looked at me blankly.
“Cool Hand Luke?” I asked. “No?” I sighed. No one appreciates the classics any more.
“If you’re interested, we can probably go through this another time,” I said, glancing at my food. Longingly. “For now, let’s keep it simple. TCP adjusts its transmission window based upon how much data it can process at any given time.”
“Because of this, the window can shrink just as easily as it can grow. But, in order to do this, it has to send information back to the host. But what if that connection is PAUSEd?” I hinted.
“The client might not know that it can’t send as much data,” Peter was nodding as he was speaking. He was starting to get it.
“RIght,” I said. “So, if the client has already sent out information, but is paused somewhere in the network, the target can’t take the additional information and anything over and beyond that will need to be dropped. Not only does this really defeat the purpose of a no-drop transmission, but will force the client to retransmit the data and there goes the efficiency.”
“Can’t you prevent the window from shrinking?” Peter asked.
I nodded. “Yes, in fact TCP has a simple knob that does just that: you can force a device to not be allowed allowed to shrink the window. In order to do that, though, you have to make the target more patient. That is, it has to wait longer to reduce the buffer size, which of course prolongs the congestion time.”
“Not only that, but sometimes TCP will reduce the window size all the way down to zero. Since the same window is negotiated on both the host and the target, you are effectively shutting down the storage connection altogether. This may be fine for file-based storage, but with block-based storage -- especially booting-from-SAN -- you’re going to find yourself in some pretty sticky situations if that happens.”
Peter gulped. “Can’t you just reopen the window?”
I nodded. “Of course, but the problem is that the host is dependent upon receipt of the “window opening” segment. If that traffic is PAUSEd…”
Peter nodded and finished the thought for me. “The host terminates the connection.”
I tapped my nose and pointed at him. “Spot on,” I said. “There are ways to prevent this from happening -- or try to -- like sending special probes out, but even this isn’t perfect. When the target decides to reopen the window, it might reopen it at too small a value, which in turn creates many small segments, which in turn reduces the overall efficiency. And that’s just at the TCP level, which assumes that the traffic is flowing at Layer 2.”
“And this type of environment,” Peter said, thinking it through, “is exactly the kind of situation to be impacted by traffic being PAUSEd at Layer 2.”
“Yup,” I agreed. “Or to put it another way, in theory the exact same congestion event which would create this TCP window problem is the one which would initiate PAUSE frames. This ‘perfect storm’ of problems could — theoretically at least — happen any time there is congestion, especially in large sprawling iSCSI networks.”
“Like the ones that often exist today,” he mused. “This reminds me of the ‘buffer bloat’ problem with TCP.”
I wasn’t as familiar with this, and it must have shown on my face. “When congestion occurs,” he continued, “TCP traffic can get stashed in buffers for so long that the sender assumes the recipient never got it, and retransmits.”
“Ah,” I said. “And once again, there goes your efficiency.”
“Right,” he said. “But buffer bloat is generally found in the scenario of slow WAN links and a bazillion hops, with wait times as high as thousands of milliseconds. You’re talking about what sounds like a relatively brief moment of congestion.”
“From the TCP side, sure,” I said. “But when we’re using PAUSE we’re talking about a fixed timer. And remember, we’re talking about flow controlling in a somewhat cascaded fashion, where all of these hosts-target sessions are being PAUSED simultaneously. Will the congestion and resulting PAUSE be quick enough -- for all the nodes that have been PAUSEd and then unPAUSEd -- that TCP is not disrupted?”
“So,” he said thoughtfully, “the question is whether the extended congestion lasts long enough to cause TCP to react in an oversubscribed scenario.”
To Use or Not to Use Lossless iSCSI
“So you shouldn’t use iSCSI in a lossless environment?” Peter asked.
I shook my head. “No, I’m not saying that,” I said. “Obviously, the technology works. But what I think you have to understand is that the key design principle here is lossless, not the iSCSI part.”
I took a sip of my drink, which gave me a second to gather my thoughts. “Look, I would hope that it’s a really rare event. But generally in lossless environments we historically actively avoid those events, rather than hope it doesn’t happen.
“It seems to me that what people want is the convenience of setting up iSCSI but the better performance they have seen with lossless environments like FC or FCoE. That’s fine, but I think that they are forgetting the reason why they liked iSCSI in the first place.
“It doesn’t matter what protocol you decide to use -- you still need to do the math for lossless environments, and you still need to be pre-deterministic with the traffic. You can’t have dozens of switches between your hosts and your targets, especially if you’ve got fan-in ratios in the triple digits.”
I shrugged again. “Lossless iSCSI works -- with the appropriate caveats. A single-switch scenario? Sure! Go for it!” I held my hands up in a conceding gesture. “You’re likely to be well under the subscription ratios in that type of environment. Of course, it means a little extra configuration on the switch, but that’s no big deal if you want to squeeze every nanosecond you can out of your latency.
“But I haven’t seen anyone really push lossless iSCSI environments to the same scale that we do in traditional deployments. For me, it’s just conjecture at this point. I’ve seen testing with one switch, or two switches, but not the extended sprawl that we have come to see in typical situations.”
“But at least I know it’s not required,” Peter said, grinning. “So, iSCSI shops need to think about what they are really trying to achieve with their specific topology by implementing DCB elements to enforce lossless Ethernet.”
“Yes, absolutely,” I agreed.
“Fantastic,” he said. “I really appreciate this. I really owe you one.”
You owe me a lunch! “No problem, glad to help out.”
He indicated the napkins strewn about the table, and I motioned for him to take them. In retrospect, it’s a shame I didn’t keep them myself. They would have made for a great visual in a blog article.
[Author’s note: Special thanks to Ethan Banks and Andy Sholomon for their extremely valuable insights and thought-provoking comments/questions.]