Pavel,
thanks for your suggestions. Something holding a lock is my suspicion as well. As part of my investigation I'm running a script that dumps the system state. Part of this dump is lsof. Quite reliably I get a "soft lockup" bug like this:
Aug 30 07:36:51 localhost kernel: BUG: soft lockup - CPU#0 stuck for 11s! [lsof:30009]
Aug 30 07:36:51 localhost kernel:
Aug 30 07:36:51 localhost kernel: Pid: 30009, comm: lsof
Aug 30 07:36:51 localhost kernel: EIP: 0060:[pit_read+96/123] CPU: 0
Aug 30 07:36:51 localhost kernel: EIP is at pit_read+0x60/0x7b
Aug 30 07:36:51 localhost kernel: EFLAGS: 00000282 Tainted: P (2.6.23.12-himonn-4329 #1)
Aug 30 07:36:51 localhost kernel: EAX: 00000001 EBX: 00000001 ECX: 00c16421 EDX: 00000282
Aug 30 07:36:51 localhost kernel: ESI: f718eb40 EDI: f718eb40 EBP: 0185123e DS: 007b ES: 007b FS: 0000
Aug 30 07:36:51 localhost kernel: CR0: 8005003b CR2: 080afee0 CR3: 37c7d000 CR4: 000006d0
Aug 30 07:36:51 localhost kernel: DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Aug 30 07:36:51 localhost kernel: DR6: ffff0ff0 DR7: 00000400
Aug 30 07:36:51 localhost kernel: [getnstimeofday+43/172] getnstimeofday+0x2b/0xac
Aug 30 07:36:51 localhost kernel: [ktime_get_real+13/33] ktime_get_real+0xd/0x21
Aug 30 07:36:51 localhost kernel: [dev_hard_start_xmit+52/618] dev_hard_start_xmit+0x34/0x26a
Aug 30 07:36:51 localhost kernel: [__qdisc_run+77/284] __qdisc_run+0x4d/0x11c
Aug 30 07:36:51 localhost kernel: [net_tx_action+156/168] net_tx_action+0x9c/0xa8
Aug 30 07:36:51 localhost kernel: [__do_softirq+53/117] __do_softirq+0x35/0x75
Aug 30 07:36:51 localhost kernel: [do_softirq+34/38] do_softirq+0x22/0x26
Aug 30 07:36:51 localhost kernel: [local_bh_enable+107/121] local_bh_enable+0x6b/0x79
Aug 30 07:36:51 localhost kernel: [established_get_first+124/136] established_get_first+0x7c/0x88
Aug 30 07:36:51 localhost kernel: [established_get_first+111/136] established_get_first+0x6f/0x88
Aug 30 07:36:51 localhost kernel: [tcp_get_idx+119/152] tcp_get_idx+0x77/0x98
Aug 30 07:36:51 localhost kernel: [seq_read+208/625] seq_read+0xd0/0x271
Aug 30 07:36:51 localhost kernel: [seq_read+0/625] seq_read+0x0/0x271
Aug 30 07:36:51 localhost kernel: [proc_reg_read+53/70] proc_reg_read+0x35/0x46
Aug 30 07:36:51 localhost kernel: [proc_reg_read+0/70] proc_reg_read+0x0/0x46
Aug 30 07:36:51 localhost kernel: [vfs_read+166/296] vfs_read+0xa6/0x128
Aug 30 07:36:51 localhost kernel: [sys_read+65/103] sys_read+0x41/0x67
Aug 30 07:36:51 localhost kernel: [sysenter_past_esp+95/133] sysenter_past_esp+0x5f/0x85
Aug 30 07:36:51 localhost kernel: =======================
Which to me looks like as if lsof tries to read or stat a proc file for an open socket and then gets stuck on a lock held by something else.
I have another question that you might be able to answer (I very much hope so):
The problem eventually rights itself, so I assumed that a hardware reset kicks the WLAN card into a working state again. To verify this, I switched on reset debugging and indeed got traces, but not for ath_reset, which I expected, but for a channel change:
Sep 5 17:37:16 localhost kernel: change_channel: called by scan_next
Sep 5 17:37:16 localhost kernel: ath_set_channel: called by change_channel
Sep 5 17:37:16 localhost kernel: ath_chan_set: 44 (5220 MHz) -> 52 (5260 MHz)
Because change_channel is called by scan_next I assumed initially that this is unrelated to my problem and part of the normal operations. However, I discovered that this only happens at the network nodes that are affected by the problem (the affected nodes are identified by their position in the network topology, not the actual hardware, e.g. I changed nodes around and nodes that work perfectly well on one spot behave erroneously on another). I'm now somewhat at a loss, because I can't interpret this observation. My question: Do you know which error condition would trigger scan_next?
Thanks in advance and kind regards
Jörg
-----Ursprüngliche Nachricht-----
Gesendet: Donnerstag, 4. September 2014 17:49
An: Pommnitz Jörg
Betreff: Re: Seeing strange queueing problems (??) with Madwifi on old
2.6.23 kernel
Hi Jörg,
It has been a while since I looked at the MadWifi code for goals other
than porting it to the new kernels, but let me just give you some ideas.
Post by Pommnitz JörgHello all,
for reasons I won't discuss here, we are stuck with Madwifi on an old 2.6.23 kernel.
That should be OK. MadWifi targeted kernels as old as 2.4.22 until
recently, so I could not really use the features offered by the newer
kernels. I removed support for older kernels because I could not test
it, but even now, kernels as old as 2.6.13 are supported.
Post by Pommnitz JörgI see a very strange problem where an adhoc network node (running
madwifi, of course) suddenly seems to stop sending anything at all
(including IBSS beacons) and after minutes (!!) seems to push out
all the missing packets in a short burst.
First of all, check if other hardware would do it in the Ad-Hoc mode
and if the hardware you are using would have that problem in other
modes (station and AP).
Post by Pommnitz JörgTo illustrate this, have a look at the attached packet capture. This
is from a MANET that runs OLSRv1. The misbehaving node has the
address 10.11.0.2. The first packet is number 746 (113 seconds in
the trace). A lot of packets that were sent seconds apart arrive
almost at the same time. Other neighbouring nodes do not show this
behaviour.
Any idea what might cause this?
A lock (mutex perhaps) in the kernel code that blocks all
transmissions. Or a lock in MadWifi itself. Print timestamps in the
beginning and the end of ieee80211_hardstart() and ath_hardstart().
That would give you an idea.
It's also possible that the particular card detects interference
incorrectly and blocks all transmissions. That would be a hardware
problem.
--
Regards,
Pavel Roskin
________________________________
IABG mbH
Sitz der Gesellschaft: Ottobrunn, Registergericht: Amtsgericht München, HRB 5499
Geschäftsführung: Prof. Dr.-Ing. Rudolf F. Schwarz Vorsitz,
Dipl.-Ing. Matthias Spott
Vorsitzender des Aufsichtsrats: RA Engelbert Kupka MdL a.