Hardware, software, or going crazy? Troubleshooting the network

Over the past few weeks, I’ve noticed some hiccups in my LAN.

It started with scp’ing files over to arrakis. Larger files would just time out after a short while. But we’re not talking that large. We’re talking on the order of 2-3 MB. I was able to work around this by either using rsync or appending the -C flag to enable compression. It was a minor annoyance, but I figured I must have missed a setting when I upgraded openssh sometime in the recent past. At that point, I didn’t notice anything else amiss.

Within the past week, the family noticed issues playing back movies stored on arrakis. My son had been watching a movie on one of the PVRs and came running to me saying that it had cut off. I put it back on for him and tried to get him to tell me when/where it had cut off, but he couldn’t put into words where it had happened. So, I chalked it up to either him or his sister accidentally hitting ‘stop’ on the remote. A day or two later, the kids were watching Coraline (wonderful movie, BTW) when it cut off after a few minutes. I was able to replicate it multiple times on that one PVR using both MythTV and mplayer. I pulled up the NFS share on my desktop and confirmed that it was dropping at the same place.

I still had our Blu-ray copy on the bookshelf, so I popped that in for them while I investigated.

At first, I thought it had something to do with recent mdadm issues under Gentoo, so I made sure I had downgraded to 3.2.3 and rebooted.  Same thing.  I unmounted the array and ran a fsck, but that came up empty.   I do have one drive that has a few bad spots due to issues with the previous case cooking my drives, but they were marked as bad, so nothing should be writing there.  Smartmon hasn’t reported anything else.

At this point, I figured it had to be a bad file. I snagged another copy of Coraline from the same group. [1] Before I did anything though, I created a par2 set against the new copy and used that to check the original. Par didn’t come up with any issues. I then did a diff between the two and didn’t come up with a difference there either.
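For anyone who wants to repeat that "are these two copies really identical?" check without par2 or diff, here is a rough sketch in Python. It is not what I actually ran; it just hashes both files in chunks and compares the digests. The function names (file_digest, same_content) are made up for illustration.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so a multi-gigabyte video never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling same_content() on the original and the fresh download should come back True if the file on the array is intact, which matches what par2 and diff told me.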

Weird.

So, I went ahead and just deleted the original and copied over the new.

Same problem, different spot.  WTF.

As a test, I mounted the array on my desktop under two spots: one via NFS and the other via sshfs. I started two instances of mplayer so that both mounts would play back the same file at the same time. If it was an error with the array, both would fail simultaneously, pointing to either mdadm or a drive failure. If both worked, then I wasn’t sure what that would tell me. I expected both to fail. They did not. The video playback over NFS failed while the sshfs copy continued on without issue. I replicated this several times to make sure it wasn’t a fluke. With this, I was thinking maybe it was an issue with NFS. Arrakis was still running a 3.3 series kernel while everything else was on 3.4.2, so I recompiled the kernel as well as the nfs-utils package and restarted.
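The same side-by-side test can be roughed out without mplayer. The sketch below is an assumption-laden stand-in, not what I ran: it reads the same file through each mount in its own thread and records how many bytes it got and whether the read raised an error. The names read_through and compare_mounts are hypothetical, and unlike real playback it reads as fast as it can rather than at video speed. A mount that hangs instead of erroring will simply block its thread.

```python
import threading


def read_through(path, chunk_size=1024 * 1024):
    """Read a file start to finish.

    Returns (bytes_read, error), where error is None on a clean read
    or the OSError that interrupted it.
    """
    bytes_read = 0
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:  # end of file reached without trouble
                    return bytes_read, None
                bytes_read += len(chunk)
    except OSError as exc:
        return bytes_read, exc


def compare_mounts(paths):
    """Read every path concurrently and return {path: (bytes_read, error)}."""
    results = {}

    def worker(path):
        results[path] = read_through(path)

    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Pointing compare_mounts() at the same movie under the NFS mount and the sshfs mount (whatever those paths happen to be on your box) gives a byte count per mount. If one stops short while the other reads the whole file, the disks are probably fine and the transport is the suspect.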

For a little while, it seemed to work.   Then yesterday, Jurassic Park started to have problems.   So this morning, I downgraded the firmware of my router back to a version that I knew was perfectly fine to eliminate that.  I didn’t wipe my settings, though, so that may be an issue.  I threw on another movie for the kids and watched it cut out 3-4 times on me.  To keep them happy, I mounted the array over sshfs on that PVR and they haven’t had a complaint yet.

I only have a few things left to try. I’ve noticed that arrakis is served off of a second, daisy-chained switch. My network goes modem -> router -> switch 1 -> switch 2. I had toyed with the idea of running modem -> router -> (switch 1) (switch 2), but was told I wouldn’t see any benefit or detriment by doing it either way. And to be fair, I haven’t in the past 3 years. But, to rule out a switch flaking out, I’m going to move the connection from switch 2 to the router directly and see what happens. If it STILL fails, I’m going to replace that Cat 5e cable and switch to an add-on NIC. If it STILL fails, I’m thinking about replacing the router.

Otherwise, I’m not sure what else it could be. I don’t think I’ve overlooked anything at this point. Since it works over sshfs but not NFS, AND given the problems I’ve had with scp, I’m pretty positive it’s a network issue.

<edit>I should note that I’ve tried scp’ing files to other machines without issue. It only happens when I’m going TO arrakis. If I manage to get it there, I can transfer it FROM arrakis to my desktop without a problem.</edit>

I’m open to suggestions.

[1] Honestly, it’s easier to buy a copy and download a rip someone else has made than to rip it myself. I don’t have issues ripping DVDs. Now that I’ve migrated to Blu-ray, ripping takes a lot more work: 20-40 GB for the uncompressed copy and several hours to re-encode and compress that down, only to find out that there are severe artifacts during playback. Ugh. It’s frustrating. I don’t feel bad about it since I do buy the physical disc. I’ve been looking at Vudu/Walmart’s idea of trading in your disc in exchange for a digital copy that you can stream wherever/whenever you want, but that’s a post for another day.

~ by praetor on June 21, 2012.
