[squid-dev] [PATCH] assertion failed: Write.cc:41: "!ccb->active()"

Mon Mar 14 01:30:30 UTC 2016

I would like to respond on some of the things in the post related to V4.

First 3.5 is very stable for a very long time now.
It have couple bugs in it and it will be good to somehow make them gone 
but I must highlight couple things which might not be clear.

Atomic patches are great and they solve things in a very good way.
But every bug has it's own "life" and we can understand that it would 
not be possible to write a perfect product without couple bugs\errors\typos.
I like many believe in perfection but from the last test(RAM ONLY) I 
ran, it seems that the hardware is not really ready to specific heat. 
The DDR2 ECC cards started pooping out of their socket in the middle of 
a simple test when it's pretty cold in the room...

+1 for trunk while I still have couple doubts about what I am 
understanding about take1 and take2 steps.
I will try to see how the service works after the patches will be 
applied to V4.

I have couple more build nodes and the automation learning keeps me busy.

Eliezer

On 10/03/2016 23:35, Alex Rousskov wrote:
> On 03/10/2016 12:14 PM, Christos Tsantilas wrote:
>
>>   I am attaching two patches for this bug.
>
> I will re-summarize the problem we are dealing with using higher-level
> concepts so that it is easier to grok what Christos is talking about:
>
> 1. Ftp::Client cannot deal with more than one FTP command at a time.
>
> 2. Ftp::Server must _delay_ reading the next FTP command due to #1.
>
> 3. Delaying is complicated because Ftp::Server receives responses from
>     "Squid" (ICAP, errors, caching, etc.), and not Ftp::Client itself:
>     Ftp:Server may receive a response for the current FTP command
>     long _before_ Ftp::Client is done dealing with that same command!
>     Christos has highlighted a few specific cases where that happens.
>
>     [ In HTTP, this problem is much smaller because the two Squid HTTP
>       sides do not share much state: New Http::Client objects are created
>       as needed for the same ConnStateData object while the old ones
>       are finishing their Http::Client jobs. ]
>
> 4. The current Squid code (v3.5 and trunk) contains a mechanism for #2,
>     but that mechanism does not handle a now-known race condition. The
>     bug leads to the Write.cc assertion. Our take1 patch fixes that bug.
>
> 5. While working on #4, we realized that the existing mechanism for #2
>     relies on Ftp::Client existence and other conditions that may not
>     materialize. If things go wrong, Ftp::Server may get stuck waiting
>     for Ftp::Client that either did not exist or did not end the wait.
>     Our take2 patch revises the mechanism for #2 to be much more robust.
>     We do not know of any specific way to hit the bugs fixed by take2,
>     but that does not mean it is (and will remain) impossible.
>
> Take2 also relays FTP origin errors to the FTP user in more cases.
>
>
>> One simple for squid-3.5 (t1
>> patch) and one more complex (t2 patch). The simple patch solve the bug
>> for now, but may leave other similar bugs in squid.
>
> Amos, do you want us to port take2 to v3.5? The take1 patch for v3.5 is
> enough to fix the known assertion. Take2 fixes that assertion as well,
> but it is bigger because it also fixes design problems that may lead to
> other bugs in v3.5. Which one do you want in v3.5?
>
> For trunk/v4, there is no doubt in my mind that we should use take2 (or
> its polished variant).
>
>
> The rest is my take2 review.
>
>
>> +    bool clientSideWaitingForUs; ///< whether the client is waiting for us
>
> Let's avoid confusing "client side" and "client" terms in new code,
> especially when it is Ftp::Server that is waiting:
>
>    /// whether we are between Ftp::Server::startWaitingForOrigin() and
>    /// Ftp::Server::stopWaitingForOrigin() calls
>    bool originWaitInProgress;
>
>
>> +void
>> +Ftp::Relay::swanSong()
>> +{
>> +    if (clientSideWaitingForUs) {
>> +        CbcPointer<ConnStateData> &mgr = fwd->request->clientConnectionManager;
>> +        if (mgr.valid()) {
>> +            if (Ftp::Server *srv = dynamic_cast<Ftp::Server*>(mgr.get())) {
>> +                typedef UnaryMemFunT<Ftp::Server, int> CbDialer;
>> +                AsyncCall::Pointer call = asyncCall(11, 3, "Ftp::Server::stopWaitingForOrigin",
>> +                                                    CbDialer(srv, &Ftp::Server::stopWaitingForOrigin, 0));
>> +                ScheduleCallHere(call);
>
> and
>
>>   void
>>   Ftp::Relay::serverComplete()
>>   {
>>       CbcPointer<ConnStateData> &mgr = fwd->request->clientConnectionManager;
>>       if (mgr.valid()) {
>> +        if (clientSideWaitingForUs) {
>> +            if (Ftp::Server *srv = dynamic_cast<Ftp::Server*>(mgr.get())) {
>> +               typedef UnaryMemFunT<Ftp::Server, int> CbDialer;
>> +               AsyncCall::Pointer call = asyncCall(11, 3, "Ftp::Server::stopWaitingForOrigin",
>> +                                                   CbDialer(srv, &Ftp::Server::stopWaitingForOrigin, ctrl.replycode));
>> +               ScheduleCallHere(call);
>> +               clientSideWaitingForUs = false;
>> +            }
>> +        }
>
>
> Let's move this nearly duplicated code into a new
> Ftp::Relay::stopOriginWait(int code) method _and_ always clear
> clientSideWaitingForUs/originWaitInProgress there.
>
>
>>   void
>>   Ftp::Server::writeForwardedReply(const HttpReply *reply)
>>   {
>>       Must(reply);
>>
>> +    if (master->waitingForOrigin) {
>> +        delayedReply = reply;
>> +        return;
>> +    }
>> +
>
> If this method is called twice while waitingForOrigin is true, should we
> throw an exception, honor the first reply, or honor the last one? Please
> adjust the code as need. *If* honoring the last reply is the best
> strategy (which is what take2 does), add a comment:
>
>      delayedReply = reply; // and forget any previous ones
>
>
>> +    debugs(33, 5, "Transfering data in progress, waitting server side to finish");
>
> I suggest using "waiting for Ftp::Client data transfer to end" for
> brevity sake. This will also fix the two spelling errors in the above text.
>
>
>> +    // Currently only data upload and data download requests are
>> +    // waiting for the origin:
>> +    // Must(master->serverState  == fssHandleUploadRequest ||
>> +    //      master->serverState == fssHandleDataRequest);
>
> True, but our code does not _rely_ on that, right? I would remove that
> comment to avoid the implication that it does rely on that and we are
> just afraid to check.
>
> If you decide to keep this comment or enable these Must()s, please move
> all that _after_ waitingForOrigin is cleared.
>
>
>>      if (master->serverState == fssHandleDataRequest) {
>> +        if (!master->userDataDone) {
> ...
>> +            originDataDownloadAbortedOnError = (originStatus > 400);
>> +            return;
>> +        }
>> +
>> +        completeDataExchange();
>> +    } else {
>> +        if (delayedReply != NULL) {
>> +            writeForwardedReply(delayedReply.getRaw());
>> +            delayedReply = NULL;
>> +        }
>> +    }
>
>
> The above logic looks correct to me, but I feel like I am reading an
> inside-out code. Please consider this instead:
>
> // HttpReply in fssHandleDataRequest lacks status code
> if (master->serverState == fssHandleDataRequest)
>      originDataDownloadAbortedOnError = (originStatus > 400);
>
> if (!master->userDataDone) {
>      ... your big comment here ...
>      // HttpReply in fssHandleDataRequest lacks status code
>      if (master->serverState == fssHandleDataRequest)
>          originDataDownloadAbortedOnError = (originStatus > 400);
>
>      debugs(33, 5 "too early to write the response");
>      return;
> }
>
> if (delayedReply) {
>      writeForwardedReply(delayedReply.getRaw());
>      delayedReply = nullptr;
> } else {
>      completeDataExchange();
> }
>
>
> Please note that my "HttpReply in fssHandleDataRequest lacks status
> code" is based on your IRC comments, but it does not mesh well with your
> "If the server side aborted with an error before we are done, we need to
> record it" source code comment which seems to imply that we should use
> originStatus in all "server aborted first" cases and not just those
> where serverState is fssHandleDataRequest. Please adjust the final
> comment as needed.
>
>
>> +            originDataDownloadAbortedOnError = (originStatus > 400);
>
> To minimize questions/doubts, please include 400 unless it is a
> successful status.
>
>
>> +    //Adjust our reply if the server responded with an error:
>> +    if (master->userDataDone == 226 && originDataDownloadAbortedOnError)
>> +        master->userDataDone = 451;
>
> Please add a debugs() statement here so that we know why the code was
> changed during triage.
>
>
>> -    MasterState(): serverState(fssBegin), clientReadGreeting(false), userDataDone(0), waitForOriginData(false) {}
>> +MasterState(): serverState(fssBegin), clientReadGreeting(false), userDataDone(false), waitingForOrigin(false) {}
>
> Unnecessary and wrong initial value change for userDataDone.
>
>
>> +    /// whether we have to wait the transfer on the Squid-origin data connection
>> +    /// to be finished
>
> Let's shorten that to "whether we wait for the origin data transfer to end".
>
>> +    /// whether the transfer aborted with an error on server side
>> +    bool originDataDownloadAbortedOnError;
>
> Avoid "server side" with "whether the origin data transfer aborted".
>
>
>> +    /// Set when the HttpReply can not be forwarded right now to the client
>> +    /// and must be delayed unless the server side is done.
>
> Simplify and describe the member meaning, not what the code does with
> it: "a response which writing was postponed until stopWaitingForOrigin()".
>
>
>> -    /// whether the transfer on the Squid-origin data connection is not over yet
>> -    bool waitForOriginData;
>> +    /// whether we have to wait the transfer on the Squid-origin data connection
>> +    /// to be finished
>> +    bool waitingForOrigin;
>
> To reduce "global" or "shared" state, if you can now move
> waitingForOrigin to Ftp::Server, you should do that IMO.
>
>
> All of these are pretty much cosmetic changes that do not alter the
> proposed patch functionality AFAICT. IMO, if there are no objections,
> the polished patch should go into trunk (see above regarding v3.5).
>
>
> Thank you,
>
> Alex.
>
> _______________________________________________
> squid-dev mailing list
> squid-dev at lists.squid-cache.org
> http://lists.squid-cache.org/listinfo/squid-dev
>