[etherlab-dev] [PATCH] Default patchset 20160804

Discussion:

Gavin Lambert

2016-08-04 03:53:49 UTC

Another default-branch patchset update is attached. It is still based on
5a70ffc4644b.

For convenience, I've made an online patchset repository:
<https://sourceforge.net/u/uecasm/etherlab-patches/ci/default/tree/#readme>
It contains a README that describes the individual patches (from prior
posts, where I could find them; let me know if something is unclear) and
gives the commands required to clone your own copy of the upstream
repository and apply the patches to it.

Modified patches:

* 0051-fsm_change-external-datagram.patch:
0052-fsm_slave_config-external-datagram.patch:
0053-fsm_slave_scan-external-datagram.patch:
Fixed some spots where it was accessing the wrong datagram.

New patches:

* 0043-ethercat-diag.patch:
Recently posted to the users list by Ralf Roesch, adding a "diag" command to
the ethercat command line tool, to aid in locating lost links and other
comms errors.

Other than rebasing within the patchset, I've made the following tweaks to
the code (otherwise it is largely unchanged):

o Various whitespace fixes.

o Moved a few lines in main.cpp for consistent grouping.

o Changed llc_reset to a bool, since it was being used as a bool.

o Tweaked a few lines from assignment to compound assignment to simplify.

o Removed the data type lookup in EscRegRead and EscRegWrite.

* Since this is all internal and only used to get the size, which is well
known to the caller, the lookup seemed unnecessary.

o Made EscRegRead and EscRegWrite output errors to stderr instead of
stdout.

o Fixed a printf format issue that generated a compiler warning.

o Made EscRegRead and EscRegWrite treat errors as non-fatal.

* Some slaves do not implement all of these registers, and so trying to
read them will produce an "I/O error" exception. In this case it makes more
sense to continue reading the other registers than to abort.

* For example, Beckhoff EL3062 does not implement register 0x030C.

* 0x0308-0x030B can similarly be absent on some older slaves.

I'm a little hesitant about the command name being "diag" - while it's not a
bad name for network diagnostics or error stats it might be confused with
the "Diagnosis History" object as specified in ETG1020, which is an entirely
different thing. (And something that might be useful to add to the tool in
the future.) I'm open to alternative suggestions.

* 0044-diag-readwrite.patch:
This is a further modification on top of the previous patch which replaces
several separate read and write requests with a single read or read+write
request (plus one additional read) per slave. (So naturally it depends on
patch 0026.)

In theory this is more efficient, but most importantly since the reset
occurs using the same datagram as the read, it's now atomic and there's no
risk of losing counts (which could previously happen if the slave
incremented its counter after the read but before the write).
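To illustrate the lost-count race described for 0044, here is a minimal model in C. It is only a sketch and does not use the master's code: slave_model_t, read_then_clear and read_and_clear are hypothetical names, and the "slave" is just a struct holding one counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one slave's error counter register. */
typedef struct {
    uint8_t counter;
} slave_model_t;

/* Two separate requests: a read, then a write of zero.
 * late_errors models errors the slave counts between the two requests. */
unsigned read_then_clear(slave_model_t *s, uint8_t late_errors)
{
    unsigned seen;

    assert(s != NULL);
    seen = s->counter;                               /* request 1: read */
    s->counter = (uint8_t)(s->counter + late_errors); /* slave keeps counting */
    s->counter = 0;                                  /* request 2: reset; late errors are lost */
    return seen;
}

/* One combined request: the value is read and cleared in the same step,
 * so anything counted afterwards is still there for the next poll. */
unsigned read_and_clear(slave_model_t *s)
{
    unsigned seen;

    assert(s != NULL);
    seen = s->counter;
    s->counter = 0;
    return seen;
}
```

With a counter of 5 and 2 late errors, read_then_clear reports 5 and the next poll reports 0, so 2 counts vanish. With read_and_clear, errors that arrive after the poll stay in the counter and show up in the following poll.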

* 0045-slave-config-position.patch:
Adds a "position" field to the structure returned by
ecrt_slave_config_state. This allows you to quickly get the ring position
of a slave from its relative alias:offset address, which in turn allows you
to call other APIs that require this (eg. ecrt_master_get_slave).

Note that the position is only valid if "online" is true, and that it is
possible for the value to be stale (ie. the slave has moved to a different
position) if the network changes and is consequently rescanned after this
call. So use it defensively. (You're reasonably safe in the period between
requesting the master and activating it, as rescans are inhibited during
this time. OTOH, only the application can request the master; an external
tool can't.)

I'm considering whether it would be useful to make a general function
available for this conversion, to avoid duplicating the alias:offset
conversion logic in too many places (eg. the tool requires it as well, but
can't use the slave_config-based conversion since it can't request the
master).
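For reference, the alias:offset conversion mentioned above can be sketched roughly as follows. This is a simplified model, not the master's implementation: resolve_ring_position and the plain aliases[] array are hypothetical, and it assumes the usual convention that an alias of zero means the offset is already an absolute ring position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* aliases[i] is the alias of the slave at ring position i (0 = no alias).
 * Returns the ring position for alias:offset, or -1 if it does not exist. */
int resolve_ring_position(const uint16_t *aliases, size_t count,
                          uint16_t alias, uint16_t offset)
{
    size_t base = 0;

    assert(aliases != NULL || count == 0);

    if (alias != 0) {
        /* Alias addressing: find the first slave carrying this alias. */
        while (base < count && aliases[base] != alias)
            base++;
        if (base == count)
            return -1; /* no slave has this alias */
    }

    /* The offset is counted from the alias slave (or from position 0). */
    if (offset >= count - base)
        return -1; /* would run past the end of the ring */

    return (int)(base + offset);
}
```

For a ring with aliases {0, 100, 0, 200, 0}, the address 100:1 resolves to position 2 and 0:2 also resolves to position 2. As noted above, any such result can go stale if the network is rescanned afterwards.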

* 0046-e1000e-link-detection.patch:
Fixed link detection in e1000e driver for 3.10-3.16.

These are Christoph Permes'
<http://lists.etherlab.org/pipermail/etherlab-dev/2016/000554.html>
3.16 patch and 3.14-v2 patch, with the latter backported to 3.12 and 3.10. I
haven't tested these personally.

* 0057-fsm_foe-simplify.patch:
Removes some redundant fields from the FoE FSM; some were unused copy/paste
holdovers from the CoE FSM while others were duplicated unnecessarily
between the read and write operations, which can't be concurrent for a given
slave anyway.

Also fixes the case where the incoming data exceeds the provided buffer to
properly terminate the state machine instead of leaving things dangling.
Although note that this still leaves the FoE conversation itself dangling,
so you'll likely get an error on the next request if this occurs.

* 0058-foe-password.patch:
Adds support for sending an FoE password along with read or write requests.

Also implements the -o option for the foe_read command (which was documented
but not implemented).

Also makes the ioctl behind foe_read actually use the buffer size requested
by the caller (instead of a hard-coded value); though note that foe_read
itself still uses a hard-coded value of its own (but it's larger, so bigger
files should be readable now). It's possible that users on limited-memory
embedded systems might need to reduce this value, but it's still fairly
conservative as RAM sizes go.

* 0059-foe-requests.patch:
Makes FoE transfer requests into public ecrt_* API, similar to SDO requests.

Primarily (following my goal of "parallel all the things"), this allows FoE
transfers to be non-blocking so that transfers to multiple slaves can occur
concurrently from the same requesting thread (previously this was only
possible by using separate threads, since the only API was blocking). Note
that due to patch 0018 you can call ecrt_master_deactivate() to delete these
requests when you're done with them, even if you haven't called
ecrt_master_activate() yet.

It has a possible side benefit that FoE transfers can now be started and
monitored from realtime context, although as FoE is mostly used for firmware
updates this is unlikely to be all that useful in practice.

I considered a few alternative approaches to this (the next leading
contender was to make async versions of the existing FoE ioctls), but this
seemed more consistent with existing APIs. I'm open to suggestions here too
though, since it does feel like a slightly odd fit. (But works quite
nicely.)

* 0060-foe-request-progress.patch:
Adds a way to get a "current progress" value (actually the byte offset) for
async FoE transfers.
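As a rough picture of what 0059 and 0060 enable, the following C sketch polls several transfers from a single thread and exposes a byte offset as progress. It is a model only: xfer_t, xfer_poll and poll_all are hypothetical names and are not the API added by the patches; see the README for the real functions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { XFER_BUSY, XFER_DONE } xfer_state_t;

/* Hypothetical model of one asynchronous transfer. */
typedef struct {
    size_t total;  /* bytes to transfer */
    size_t offset; /* bytes transferred so far: the "progress" value */
} xfer_t;

/* Advance one transfer by at most chunk bytes without blocking. */
xfer_state_t xfer_poll(xfer_t *x, size_t chunk)
{
    size_t left;

    assert(x != NULL && x->offset <= x->total);
    left = x->total - x->offset;
    x->offset += (left < chunk) ? left : chunk;
    return (x->offset == x->total) ? XFER_DONE : XFER_BUSY;
}

/* Poll every transfer once; returns how many are still busy. */
size_t poll_all(xfer_t *xfers, size_t n, size_t chunk)
{
    size_t busy = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (xfer_poll(&xfers[i], chunk) == XFER_BUSY)
            busy++;
    }
    return busy;
}
```

The caller repeats poll_all() once per cycle until it returns zero, so transfers to several slaves make progress side by side without one thread per slave.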

Christoph Schröder

2016-09-28 14:07:06 UTC

Hi all,

I am currently testing the EtherCAT master with Xenomai 2.6.5 on a
Debian wheezy (I patched kernel 3.2 for this). I have a few questions
regarding problems I encountered.

#1.)
I started with the tarball release 1.5.2 and encountered a problem with
ecrt_master_reference_clock_time which led to a segmentation fault. My
DC config here is basically the same as in the rtai_rtdm_dc example with
minor fixes since I am not using RTAI. The rest is based on the xenomai
example. The problem seems to be fixed in the mercurial repo (tested
5a70ffc4644b for later tests of the patch queue) and I would like to
know which commit fixed this issue. Unfortunately I can't find the point
where the release 1.5.2 was taken from since the changelog messages do
not correspond to the commit messages and there is no Label for the release.

This is my debugging output:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fd8700 (LWP 4389)]
0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) backtrace
#0 0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff68daa00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff68d553e in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff68e0188 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007ffff7bd8944 in ecrt_master_reference_clock_time (
master=<optimized out>, time=<optimized out>) at master.c:717
#5 0x0000000000402ea5 in sync_DCs2 () at main.c:754
#6 0x0000000000401d35 in cyclic_task_proc () at main.c:183
#7 0x00007ffff73c2a99 in rt_task_trampoline ()
from /usr/xenomai/lib/libnative.so.3
#8 0x00007ffff667cb50 in start_thread ()
from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007ffff696ffbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()

#2.)
I did some minor tests with the patch queue and got some bad system
freezes with the xenomai example. I could locate the patch that seems to
cause the system freezes:
0011-Master-locks-to-avoid-corrupted-datagram-queue.patch
The only notable thing I could see in the kernel log is that the slaves
went back to PREOP. The Xenomai task was still running and hanging at
some point of the cycle (I placed an rt_printf in the cycle which should
have printed the cycle_counter value every other second).
The patch series seems to work if I apply the patches up to
0010-Sdo-directory-now-only-fetched-on-request.patch. Is this
reproducible for you?

#3.)
In both versions (1.5.2 and repository 5a70ffc4644b) I get a lost frame
at startup. Is this anything to worry about?
[Wed Sep 28 15:24:51 2016] EtherCAT 0: Master thread exited.
[Wed Sep 28 15:24:51 2016] EtherCAT 0: Starting EtherCAT-OP thread.
[Wed Sep 28 15:24:51 2016] EtherCAT WARNING 0: 1 datagram UNMATCHED!
[Wed Sep 28 15:24:52 2016] EtherCAT 0: Domain 0: Working counter changed
to 2/3.
[Wed Sep 28 15:24:52 2016] EtherCAT 0: Slave states on main device: OP.

#4.)
Will there be a new release aka a new version of the EtherCAT master in
the near future based on the patches?

Thanks and best regards,
Christoph

________________________________

Helmholtz-Zentrum Berlin für Materialien und Energie GmbH

Mitglied der Hermann von Helmholtz-Gemeinschaft Deutscher Forschungszentren e.V.

Aufsichtsrat: Vorsitzender Dr. Karl Eugen Huthmacher, stv. Vorsitzende Dr. Jutta Koch-Unterseher
Geschäftsführung: Prof. Dr. Anke Rita Kaysser-Pyzalla, Thomas Frederking

Sitz Berlin, AG Charlottenburg, 89 HRB 5583

Postadresse:
Hahn-Meitner-Platz 1
D-14109 Berlin

http://www.helmholtz-berlin.de

Gavin Lambert

2016-09-28 23:21:22 UTC

On 29 September 2016 03:07 quoth Christoph Schröder,

Post by Christoph Schröder
#1.)
I started with the tarball release 1.5.2 and encountered a problem with
ecrt_master_reference_clock_time which led to a segmentation fault. My
DC config here is basically the same as in the rtai_rtdm_dc example with
minor fixes since I am not using RTAI. The rest is based on the xenomai
example. The problem seems to be fixed in the mercurial repo (tested
5a70ffc4644b for later tests of the patch queue) and I would like to know
which commit fixed this issue. Unfortunately I can't find the point where the
release 1.5.2 was taken from since the changelog messages do not
correspond to the commit messages and there is no Label for the release.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fd8700 (LWP 4389)]
0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) backtrace
#0 0x00007ffff68d53ca in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff68daa00 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff68d553e in vfprintf () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff68e0188 in fprintf () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007ffff7bd8944 in ecrt_master_reference_clock_time (
master=<optimized out>, time=<optimized out>) at master.c:717

Given that stack trace, and that it works on default but not 1.5.2, then
most likely the commit that worked around the issue for you was
https://sourceforge.net/p/etherlabmaster/code/ci/3affe9cd0b66fe55ef8e8060778ef9461a8204a0.

Having said that, given that the only reason I can think of that this would
segfault is if strerror returned NULL or an invalid pointer, it suggests
that you might have a broken or badly configured libc. If you're building
the libc yourself, make sure that you're using an up-to-date version and
haven't excluded the strerror text.

Another possibility is that if you were concurrently calling strerror() on
another thread (and your libc doesn't implement strerror in a thread-local
manner) then it could have corrupted the buffer. Most likely another patch
would be required to resolve this "properly", although one workaround for
this is to avoid calling ecrt_* APIs from more than one thread.

Although I suppose since you're linking to RTDM it's possible that
strerror() is coming from there rather than the libc; I'm not exactly sure
how RTAI/Xenomai work. Or possibly, in that context, the fprintf(stderr)
itself is failing -- but this isn't new code, so I would have thought the
problem would have come up earlier if that were the case.
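On the strerror() thread-safety point above, a minimal sketch of the usual alternative is to format the message into a caller-supplied buffer with the POSIX (XSI) strerror_r(). This is not taken from the master's code or from any patch; describe_errno is a hypothetical helper name, and the feature-test macros assume a POSIX libc such as glibc.

```c
/* Ask for the XSI strerror_r (returns int) rather than the GNU variant. */
#undef _GNU_SOURCE
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a description of errnum into buf and return buf.
 * No shared static buffer is involved, so concurrent callers that
 * each pass their own buffer cannot corrupt one another's text. */
const char *describe_errno(int errnum, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    if (strerror_r(errnum, buf, buflen) != 0) {
        /* Unknown error number or buffer too small: fall back to the number. */
        snprintf(buf, buflen, "error %d", errnum);
    }
    return buf;
}
```

A caller would then do something like fprintf(stderr, "... %s\n", describe_errno(EIO, buf, sizeof buf)) with a buffer on its own stack.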

I'm not sure exactly which commit 1.5.2 is based on, but it will be one of
the ones in the "stable-1.5" branch. Everything on "default" is newer than
that.

Post by Christoph Schröder
#2.)
I did some minor tests with the patch queue and got some bad system
freezes with the xenomai example. I could locate the patch that seems to
cause the system freezes:
0011-Master-locks-to-avoid-corrupted-datagram-queue.patch
The only notable thing I could see in the kernel log is that the slaves went
back to PREOP. The Xenomai task was still running and hanging at some point
of the cycle (I placed an rt_printf in the cycle which should have printed the
cycle_counter value every other second).
The patch series seems to work if I apply the patches up to
0010-Sdo-directory-now-only-fetched-on-request.patch. Is this
reproducible for you?

I'm not sure about this as I don't use Xenomai myself. That particular
patch was authored by Knud Baastrup, so I've added him to the email chain
directly just in case. If I recall correctly I think he, like myself, was
using PREEMPT_RT so it's possible that this has not been tested with
Xenomai.

Do you have locking on the Xenomai side as well? Do you call ecrt APIs from
multiple Xenomai tasks? I believe the patch assumes that there is no
external locking between tasks, so you might be running into deadlocks
depending on the order in which things happen.

Using Linux locks between Xenomai tasks is probably not ideal, but I would
have expected that it ought to work as this occurs in other places as well.

Post by Christoph Schröder
#3.)
In both versions (1.5.2 and repository 5a70ffc4644b) I get a lost frame at
startup. Is this anything to worry about?
[Wed Sep 28 15:24:51 2016] EtherCAT 0: Master thread exited.
[Wed Sep 28 15:24:51 2016] EtherCAT 0: Starting EtherCAT-OP thread.
[Wed Sep 28 15:24:51 2016] EtherCAT WARNING 0: 1 datagram UNMATCHED!
[Wed Sep 28 15:24:52 2016] EtherCAT 0: Domain 0: Working counter changed
to 2/3.
[Wed Sep 28 15:24:52 2016] EtherCAT 0: Slave states on main device: OP.

I don't think this is anything to worry about; it's probably just that the
idle thread sent a request and then exited before the reply came back; the
reply then sat in the buffers until the OP thread started but it had either
timed out or reset the state machines in the meantime so it was no longer
expected.

Post by Christoph Schröder
#4.)
Will there be a new release aka a new version of the EtherCAT master in the
near future based on the patches?

I'm hoping so, but it's not up to me. :) More feedback and sorting out
things like these Xenomai issues you've encountered may help to move towards
that though.

Christoph Schröder

2016-09-30 12:33:08 UTC

Hi Gavin,

thanks for the answer.

Post by Gavin Lambert
Given that stack trace, and that it works on default but not 1.5.2, then
most likely the commit that worked around the issue for you was
https://sourceforge.net/p/etherlabmaster/code/ci/3affe9cd0b66fe55ef8e8060778ef9461a8204a0.
Having said that, given that the only reason I can think of that this would
segfault is if strerror returned NULL or an invalid pointer, it suggests
that you might have a broken or badly configured libc. If you're building
the libc yourself, make sure that you're using an up-to-date version and
haven't excluded the strerror text.
Another possibility is that if you were concurrently calling strerror() on
another thread (and your libc doesn't implement strerror in a thread-local
manner) then it could have corrupted the buffer. Most likely another patch
would be required to resolve this "properly", although one workaround for
this is to avoid calling ecrt_* APIs from more than one thread.
Although I suppose since you're linking to RTDM it's possible that
strerror() is coming from there rather than the libc; I'm not exactly sure
how RTAI/Xenomai work. Or possibly, in that context, the fprintf(stderr)
itself is failing -- but this isn't new code, so I would have thought the
problem would have come up earlier if that were the case.
I'm not sure exactly which commit 1.5.2 is based on, but it will be one of
the ones in the "stable-1.5" branch. Everything on "default" is newer than
that.

My test application has only one Xenomai-task (thread) like the xenomai
example, so I don't think this is a concurrency problem unless there is
a thread of the master itself involved. My libc is rather old though
(2.13). Unfortunately there is no newer version backported for Debian
wheezy and I don't want to install it from source since it won't be
available for working environments anyway. I will leave this matter for
now since it seems to be fixed or at least avoided already. I needed to
know when this was fixed so that I can add the fix as a patch to our
Debian package of the EtherCAT master.

Post by Gavin Lambert
#2.)
I did some minor tests with the patch queue and got some bad system
freezes with the xenomai example. I could locate the patch that seems to
cause the system freezes:
0011-Master-locks-to-avoid-corrupted-datagram-queue.patch
The only notable thing I could see in the kernel log is that the slaves went
back to PREOP. The Xenomai task was still running and hanging at some point
of the cycle (I placed an rt_printf in the cycle which should have printed
the cycle_counter value every other second).
The patch series seems to work if I apply the patches up to
0010-Sdo-directory-now-only-fetched-on-request.patch. Is this
reproducible for you?
I'm not sure about this as I don't use Xenomai myself. That particular
patch was authored by Knud Baastrup, so I've added him to the email chain
directly just in case. If I recall correctly I think he, like myself, was
using PREEMPT_RT so it's possible that this has not been tested with
Xenomai.
Do you have locking on the Xenomai side as well? Do you call ecrt APIs from
multiple Xenomai tasks? I believe the patch assumes that there is no
external locking between tasks, so you might be running into deadlocks
depending on the order in which things happen.
Using Linux locks between Xenomai tasks is probably not ideal, but I would
have expected that it ought to work as this occurs in other places as well.

This problem occurred with the xenomai example (./examples/xenomai in the
master's source code) as well. There is only one Xenomai task and no
explicit locking on the application side. I am new to Xenomai but as far
as I understand it, Xenomai uses a 'dual kernel' configuration called
'cobalt core' which has higher priority than the normal kernel and does
all the scheduling of realtime tasks (see
https://xenomai.org/start-here/#How_does_Xenomai_deliver_real-time). A
Xenomai task should therefore block every task executed in normal kernel
space until it has finished executing. My guess is that the task waits
indefinitely for a master component to be unlocked by another thread in
kernel space, which never happens because that thread is not executed due
to the higher priority of the Xenomai task.

Best regards,
Christoph
