[squid-dev] [PATCH] assertion failed: Write.cc:41: "!ccb->active()"

Alex Rousskov rousskov at measurement-factory.com
Thu Mar 10 21:35:51 UTC 2016


On 03/10/2016 12:14 PM, Christos Tsantilas wrote:

>  I am attaching two patches for this bug. 

I will re-summarize the problem we are dealing with using higher-level
concepts so that it is easier to grok what Christos is talking about:

1. Ftp::Client cannot deal with more than one FTP command at a time.

2. Ftp::Server must _delay_ reading the next FTP command due to #1.

3. Delaying is complicated because Ftp::Server receives responses from
   "Squid" (ICAP, errors, caching, etc.), and not Ftp::Client itself:
   Ftp:Server may receive a response for the current FTP command
   long _before_ Ftp::Client is done dealing with that same command!
   Christos has highlighted a few specific cases where that happens.

   [ In HTTP, this problem is much smaller because the two Squid HTTP
     sides do not share much state: New Http::Client objects are created
     as needed for the same ConnStateData object while the old ones
     are finishing their Http::Client jobs. ]

4. The current Squid code (v3.5 and trunk) contains a mechanism for #2,
   but that mechanism does not handle a now-known race condition. The
   bug leads to the Write.cc assertion. Our take1 patch fixes that bug.

5. While working on #4, we realized that the existing mechanism for #2
   relies on Ftp::Client existence and other conditions that may not
   materialize. If things go wrong, Ftp::Server may get stuck waiting
   for Ftp::Client that either did not exist or did not end the wait.
   Our take2 patch revises the mechanism for #2 to be much more robust.
   We do not know of any specific way to hit the bugs fixed by take2,
   but that does not mean it is (and will remain) impossible.

Take2 also relays FTP origin errors to the FTP user in more cases.


> One simple for squid-3.5 (t1
> patch) and one more complex (t2 patch). The simple patch solve the bug
> for now, but may leave other similar bugs in squid.

Amos, do you want us to port take2 to v3.5? The take1 patch for v3.5 is
enough to fix the known assertion. Take2 fixes that assertion as well,
but it is bigger because it also fixes design problems that may lead to
other bugs in v3.5. Which one do you want in v3.5?

For trunk/v4, there is no doubt in my mind that we should use take2 (or
its polished variant).


The rest is my take2 review.


> +    bool clientSideWaitingForUs; ///< whether the client is waiting for us

Let's avoid confusing "client side" and "client" terms in new code,
especially when it is Ftp::Server that is waiting:

  /// whether we are between Ftp::Server::startWaitingForOrigin() and
  /// Ftp::Server::stopWaitingForOrigin() calls
  bool originWaitInProgress;


> +void
> +Ftp::Relay::swanSong()
> +{
> +    if (clientSideWaitingForUs) {
> +        CbcPointer<ConnStateData> &mgr = fwd->request->clientConnectionManager;
> +        if (mgr.valid()) {
> +            if (Ftp::Server *srv = dynamic_cast<Ftp::Server*>(mgr.get())) {
> +                typedef UnaryMemFunT<Ftp::Server, int> CbDialer;
> +                AsyncCall::Pointer call = asyncCall(11, 3, "Ftp::Server::stopWaitingForOrigin",
> +                                                    CbDialer(srv, &Ftp::Server::stopWaitingForOrigin, 0));
> +                ScheduleCallHere(call);

and

>  void
>  Ftp::Relay::serverComplete()
>  {
>      CbcPointer<ConnStateData> &mgr = fwd->request->clientConnectionManager;
>      if (mgr.valid()) {
> +        if (clientSideWaitingForUs) {
> +            if (Ftp::Server *srv = dynamic_cast<Ftp::Server*>(mgr.get())) {
> +               typedef UnaryMemFunT<Ftp::Server, int> CbDialer;
> +               AsyncCall::Pointer call = asyncCall(11, 3, "Ftp::Server::stopWaitingForOrigin",
> +                                                   CbDialer(srv, &Ftp::Server::stopWaitingForOrigin, ctrl.replycode));
> +               ScheduleCallHere(call);
> +               clientSideWaitingForUs = false;
> +            }
> +        }


Let's move this nearly duplicated code into a new
Ftp::Relay::stopOriginWait(int code) method _and_ always clear
clientSideWaitingForUs/originWaitInProgress there.


>  void
>  Ftp::Server::writeForwardedReply(const HttpReply *reply)
>  {
>      Must(reply);
>  
> +    if (master->waitingForOrigin) {
> +        delayedReply = reply;
> +        return;
> +    }
> +

If this method is called twice while waitingForOrigin is true, should we
throw an exception, honor the first reply, or honor the last one? Please
adjust the code as need. *If* honoring the last reply is the best
strategy (which is what take2 does), add a comment:

    delayedReply = reply; // and forget any previous ones


> +    debugs(33, 5, "Transfering data in progress, waitting server side to finish");

I suggest using "waiting for Ftp::Client data transfer to end" for
brevity sake. This will also fix the two spelling errors in the above text.


> +    // Currently only data upload and data download requests are 
> +    // waiting for the origin:
> +    // Must(master->serverState  == fssHandleUploadRequest ||
> +    //      master->serverState == fssHandleDataRequest);

True, but our code does not _rely_ on that, right? I would remove that
comment to avoid the implication that it does rely on that and we are
just afraid to check.

If you decide to keep this comment or enable these Must()s, please move
all that _after_ waitingForOrigin is cleared.


>     if (master->serverState == fssHandleDataRequest) {
> +        if (!master->userDataDone) {
...
> +            originDataDownloadAbortedOnError = (originStatus > 400);
> +            return;
> +        }
> +
> +        completeDataExchange();
> +    } else {
> +        if (delayedReply != NULL) {
> +            writeForwardedReply(delayedReply.getRaw());
> +            delayedReply = NULL;
> +        }
> +    }


The above logic looks correct to me, but I feel like I am reading an
inside-out code. Please consider this instead:

// HttpReply in fssHandleDataRequest lacks status code
if (master->serverState == fssHandleDataRequest)
    originDataDownloadAbortedOnError = (originStatus > 400);

if (!master->userDataDone) {
    ... your big comment here ...
    // HttpReply in fssHandleDataRequest lacks status code
    if (master->serverState == fssHandleDataRequest)
        originDataDownloadAbortedOnError = (originStatus > 400);

    debugs(33, 5 "too early to write the response");
    return;
}

if (delayedReply) {
    writeForwardedReply(delayedReply.getRaw());
    delayedReply = nullptr;
} else {
    completeDataExchange();
}


Please note that my "HttpReply in fssHandleDataRequest lacks status
code" is based on your IRC comments, but it does not mesh well with your
"If the server side aborted with an error before we are done, we need to
record it" source code comment which seems to imply that we should use
originStatus in all "server aborted first" cases and not just those
where serverState is fssHandleDataRequest. Please adjust the final
comment as needed.


> +            originDataDownloadAbortedOnError = (originStatus > 400);

To minimize questions/doubts, please include 400 unless it is a
successful status.


> +    //Adjust our reply if the server responded with an error:
> +    if (master->userDataDone == 226 && originDataDownloadAbortedOnError)
> +        master->userDataDone = 451;

Please add a debugs() statement here so that we know why the code was
changed during triage.


> -    MasterState(): serverState(fssBegin), clientReadGreeting(false), userDataDone(0), waitForOriginData(false) {}
> +MasterState(): serverState(fssBegin), clientReadGreeting(false), userDataDone(false), waitingForOrigin(false) {}

Unnecessary and wrong initial value change for userDataDone.


> +    /// whether we have to wait the transfer on the Squid-origin data connection
> +    /// to be finished

Let's shorten that to "whether we wait for the origin data transfer to end".

> +    /// whether the transfer aborted with an error on server side
> +    bool originDataDownloadAbortedOnError;

Avoid "server side" with "whether the origin data transfer aborted".


> +    /// Set when the HttpReply can not be forwarded right now to the client 
> +    /// and must be delayed unless the server side is done.

Simplify and describe the member meaning, not what the code does with
it: "a response which writing was postponed until stopWaitingForOrigin()".


> -    /// whether the transfer on the Squid-origin data connection is not over yet
> -    bool waitForOriginData;
> +    /// whether we have to wait the transfer on the Squid-origin data connection
> +    /// to be finished
> +    bool waitingForOrigin;

To reduce "global" or "shared" state, if you can now move
waitingForOrigin to Ftp::Server, you should do that IMO.


All of these are pretty much cosmetic changes that do not alter the
proposed patch functionality AFAICT. IMO, if there are no objections,
the polished patch should go into trunk (see above regarding v3.5).


Thank you,

Alex.



More information about the squid-dev mailing list