Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

N°4281 - Add RFC-2045-Compliant Way of Handling Filenames in Attachments #16

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mraenker
Copy link
Contributor

@mraenker mraenker commented Nov 2, 2022

Reason for the bugfix

  • Customer complained, in emails, where the filenames of attachments are specified using Content-Type: application/octet-stream; charset=UTF-8; name=example_attachment_mail.csv or Content-Disposition: attachment; filename=example_attachment_mail.csv (meaning: a filename or name attribute, where the filename is not quoted), are not considered as valid filesnames and thus, the filetype with an added integer value is used as the filename of the attachment.
  • RFC 2045 defines, not using quotes is allowed, if the filename obliges to the rules described in the comment for the regex in my commit.

Explanation of the approach to fix it

  • First of all, please note, this PR will change the behavior of this function slightly - but, I think, only for good: filenames from now on will never contain quotes or doublequotes.
  • The approach is: use a regex with capture groups (and a proper explanation for the regex) to handle the rules of the RFC in this order: (1) the default case for double quoted filename= or name= attributes, (2) the same with single quotes (the RFC doesn't specify the method of quoting, so we take care of that aswell), (3) the case, where the filename isn't quoted.

Implementationdetails and Caveats

  • Instead of creating a complex regex with lookbehinds and stuff like that, I've opted to simply whitelist all chars the RFC specifies to be compatible to be filename, when no quotes are used.
  • Since the existing code does index into $aMatches[1] to get the result of the match, the content of $aMatches[1] can now be different compared to before, since we are now using capture groups and thus another level of possible results is introduced. But what we can be sure about is: $aMatches[1] will always contain a) the filename with quotes, or b) the filename without quotes. For that reason, we simply use str_replace() later on to remove all types of quotes from the filenames, since nobody wants filenames containing any kind of quotes, anyway. I think, no harm will be done by this slight change in behavior.
  • Here are three examples for the changed contents of $aMatches:
  1. Default case, where the filename is doublequoted, before the patch was
Array
(
    [0] => filename="moz-screenshot.png"
    [1] => moz-screenshot.png
)

and now will be

Array
(
    [0] => filename="moz-screenshot.png"
    [1] => "moz-screenshot.png"
    [2] => moz-screenshot.png
)

(Note the difference in [1] and [1] from before and after.)

  1. Case with single quotes, after the patch (wasn't handled correctly until now) will be
Array
(
    [0] => name='example_attachment_mail.csv'
    [1] => 'example_attachment_mail.csv'
    [2] => 
    [3] => example_attachment_mail.csv
)

(Note the quotes in [1].)

  1. Case without any quotes, after the patch (wasn't handled correctly until now) will be
    [0] => name=example_attachment_mail.csv
    [1] => example_attachment_mail.csv
    [2] => 
    [3] => 
    [4] => example_attachment_mail.csv
    [5] => example_attachment_mail.csv

(Note: No quotes in [1])

Testresults

  • I've created a new testmail test/emailsSample/email_with_and_without_quotes_around_attachment_name.eml containing all cases of a filename and a name attribute a) with double quotes, b) with single quotes, c) without any quotes. It is contained in this PR aswell.
  • I've ran utils/test.php for all testmails before the patch and after and created a diff of the output: Only my testcases and attachments of three other testmails, which don't comply with the standard occur. The explanation of the other testcases occuring are:
  • 1544c1544: In this case, no quotes are used for the filename, but the filename contains a space, which is not permitted as of the RFC.
  • 1618c1618 and 1622c1622: In this cases, the filename until now was wrongfully built, since no quotes, but containing a space resulting in a filename containing the attributes size and creation_date by accident.

All other attachment filenames stay the same. Please see the following screenshot for proof.
image

A word about the caveat

Yes, this PR will slightly change the behavior of the function (no filenames containing quotes are possible anymore), but I don't think, this will cause any harm (quite the opposite, I think). Furthermore, recreating the exact behavior with quoted filenames as before, will result in a way more complicated function, where we'd need to first check, which rule is applicable and then - based on the rule chosen - apply the right indexing into $aMatches. I think, this approach would overcomplicate things in an unnecessary way. - But of course we are free to discuss this.

/cc @larhip

Cheers,
Martin

@Molkobain Molkobain added the bug Something isn't working label Nov 2, 2022
@Molkobain Molkobain self-assigned this Nov 2, 2022
@Molkobain Molkobain removed their assignment Nov 2, 2022
@mraenker
Copy link
Contributor Author

mraenker commented Nov 4, 2022

Hi Combodo,

short update:

  • The three different parts of the regex are now correctly paranthesized
  • Squashed the four commit into one, so you have a better overview

Tests were ran successfully again. Same diffs for against results before the patch and same results for against my "first-after-the-patch"-tests.

Cheers,
Martin

@mraenker
Copy link
Contributor Author

Rebased and Commits squashed and pushed again. @Molkobain : I hope, this is now ready for a review (when you got the time to do so).

Martin

classes/rawemailmessage.class.inc.php Outdated Show resolved Hide resolved
@Molkobain Molkobain self-assigned this Dec 14, 2022
Copy link
Contributor

@Molkobain Molkobain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the code itself, it seems ok to me. Just ran some test with more "complex" filenames having single quotes (happens a lot in french), spaces and all. It all worked.

Only change requested is vocabulary. PR will be review during the next technical review. Thanks for adding the test sample!

classes/rawemailmessage.class.inc.php Outdated Show resolved Hide resolved
@Molkobain
Copy link
Contributor

Discussed during technical review, code itself is accepted.
We just would like you to add a unit test for the regexp itself, so the test documents the test cases which will:

  • Improve robustness for futures changes
  • Ease the understanding of the RegExp itself through the documented examples

If you need some help in adding the unit test (extract the regexp in a dedicated const/method, add the test file, ...), let us know :)

@mraenker
Copy link
Contributor Author

mraenker commented Jan 4, 2023

Thank you for the follow up.

If you need some help in adding the unit test (extract the regexp in a dedicated const/method, add the test file, ...), let us know :)

If you can pinpoint me to an example, where this is done already, I think I can figure it out on my own. But I don't want to prepare something, which isn't helpful the way I did it (e. g. usage of PHPUnit or alike).

@Molkobain
Copy link
Contributor

Sure, you can:

  • Duplicate the combodo-email-synchro/test/TestEmlFiles.php test file (for namespace, parent class, ...)
  • Delete all methods (even the setUp() won't be necessary there)
  • Duplicate testToAcronym() and ToAcronymProvider() methods from the iTop test/application/UtilsTest.php file as an example.
    • ToAcronymProvider() will show you how to provide / document test cases
    • testToAcronym() will show you how to make the test itself

Mind that you will have to extract the RegExp in a class constant or method so that you can use it in both the business logic and the unit test.

@Molkobain
Copy link
Contributor

Moved to functional review to avoid loosing a month

@piRGoif
Copy link
Contributor

piRGoif commented Jan 10, 2023

Accepted during functional review

@mraenker
Copy link
Contributor Author

mraenker commented Jan 25, 2023

But you are right, we could/should discuss them more internally before giving any feedback... The only issue is that we don't really have an official dedicated time for PRs, it's just on good will. :/

Well :( I didn't knew that. This explains things. Thank you for all the extra work you are putting in.

It's not a strict process but we discuss/review the technical implementation of each PR together (R&D team) at least once a month (on the first Tuesday). In the meantime we often check them on our end to see if there are things that could be improved before that review so it doesn't take another month before it can move forward to the functional review. Then on the second Tuesday of each month we review PRs on their functional side with both R&D and Product owners.

Yeah, don't get me wrong. I wasn't trying to critizise your processes, whatever they may look like (especially with the info above). And I think you are completely right, requesting "simple" improvements upfront to avoid loosing another month on each PR is a good practice. - But maybe it's worth to shortly discuss those internally upfront, aswell. I think, it could reduce the amount of artificial stress on all sides, coming around with a (for the moment) complete list of requested changes, as the amount of necessary iterations of coding-review-communicate might be reduced on all sides.

Thank you for explaining. I'll work on the requested improvements in the next few days.

@Molkobain
Copy link
Contributor

Exactly we should try to discuss things internally before asking for changes multiple times which can be a pain in the ash for the PR initiator. Especially as you can get the feeling that it's a never ending situation 😅

We'll try to improve this in the future :)

@piRGoif
Copy link
Contributor

piRGoif commented Jan 25, 2023

Hello,
All my apologies, I missed Guillaume saying it is fine with having only a regexp... and I can't find it anymore :/ (but this PR generated a lot of discussion, and there are lot of them in resolved conversation now...)

The usual process is only 1 Combodo dev is set as assignee to the PR. He reads it, check for Combodo requirements, and then present it to the rest of the team during reviews (technical, functional, as Guillaume said). We then give one consolidated answer.

Your PR already passed both reviews, but we asked for more changes afterwards.
Guillaume asked for my opinion, and I missed your PR reviews.., So actually we are in a grey area :)
Next time I'll have a chat IRL with the PR assignee before answering !
Sorry for this confusion :/

@Molkobain tell me when available so we can agree on this PR status O:)

@Molkobain
Copy link
Contributor

@Molkobain tell me when available so we can agree on this PR status O:)

I'm ok with changing the static property to a constant you are right. We just need to ensure that static:: is used instead of self::

@piRGoif
Copy link
Contributor

piRGoif commented Jan 26, 2023

We discussed IRL with Guillaume to check every aspect of my review and the past remarks we made.

We are proposing to :

  • set the regexp as a const instead of an attribute
  • keep the existing regexp test as is
  • factorize the 2 current regexp use in a new method (containing : if != '' + preg_match + end)
  • no test to add on that new method

As you already done lot of work, I can make the modifications if you wish ? (and also if you checked the "allow maintainers" PR checkbox)

@mraenker
Copy link
Contributor Author

mraenker commented Jan 27, 2023

As you already done lot of work, I can make the modifications if you wish ? (and also if you checked the "allow maintainers" PR checkbox)

I'd be thankful, if you did the modifications, yes. Thank you. Unfortunately, I seem to be blind. Where can I find the "Allow Maintainers"-Checkbox on the PR? I even searched the page with Ctrl+F and can't find it. 😞

/edit:

All my apologies, I missed Guillaume saying it is fine with having only a regexp... and I can't find it anymore :/ (but this PR generated a lot of discussion, and there are lot of them in resolved conversation now...)

No problem at all. Thank you for the explanation!

@mraenker
Copy link
Contributor Author

@piRGoif : I've added you to our list of maintainers for this repo. I hope you got an invitation and you can then apply your changes.

@piRGoif
Copy link
Contributor

piRGoif commented Feb 3, 2023

@mraenker Just pushed the modifications in f0521eb
Ok for you ?

@piRGoif
Copy link
Contributor

piRGoif commented Feb 3, 2023

By the way, the "|allow edits from maintainers](https://docs.github.com/en/github-ae@latest/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork)" option is described in our contributing file

Someone told rencently this option is only available during PR creation, and that is what I understand reading the doc :

When a user creates a pull request from a fork that they own, the user generally has the authority to decide if other users can commit to the pull request's compare branch

@mraenker
Copy link
Contributor Author

mraenker commented Feb 6, 2023

@piRGoif : Looks totally fine for me. Thank you very much! - Thank you aswell for the info on when I need to set the checkbox to allow you to edit PRs.

I just re-synched our branch on our fork to this master. Do I need to do something else before you can continue?

@piRGoif
Copy link
Contributor

piRGoif commented Feb 6, 2023

Hello,

You're welcome Martin, we are very glad you did this change, and again all our apologies for the confusion on our side (mine in particular O:) )

I see you made a merge commit from master to your fork branch... Not sure this will go well when merging the pr :/ We'll see.

@Molkobain
Copy link
Contributor

Accepted during technical review.
Guillaume will dig into the "coma thing in the unit test" before the functional review.

@piRGoif
Copy link
Contributor

piRGoif commented Apr 11, 2023

Accepted during functional review.
@Molkobain waiting for your review on the broken test use case

@Molkobain
Copy link
Contributor

I spent some time on this issue (the comma thing in test cases) but I couldn't find why it's behaving like this...
The issue is the same on my setup as for Martin, comma behave differently than other chars. I have no clue why.

The only notable thing is that in the base REGEXP, the char \x2D (the one after comme -\x2C-) is highlighted differently by PHPStorm. Don't know if it means something related or not.
image

@Molkobain
Copy link
Contributor

Actually maybe it has something to do with 2A (*) and 2B (+) being interpreted as regexp tokens, which would explain why 2D (-) is highlighted differently.
If I move both 2A / 2B at the end of the regexp, 2D is no longer highlighted and all test cases pass.

image

image

So maybe this is just a matter of how to add the special regexp tokens as actual chars, not tokens.
I would advise to find a way to escape these tokens or to use regular chars to ease escaping.

Moving 2A / 2B at the end of the regexp is not a solution in itself, it's just moving the issue somewhere else 😁

What do you all think?

@Hipska
Copy link
Collaborator

Hipska commented May 6, 2023

I would also use regular characters.

I think maybe they all need to be double escaped? What do you get when you print the var FILENAME_REGEX?

@mraenker
Copy link
Contributor Author

mraenker commented Jun 6, 2023

Hi all.

To move on with this PR, I'd like to respond to the latest comments and summarize.

  1. Afaik the only open issue with the PR is the screwed up behavior with 2A and/or 2B
  2. We know, just putting them at the end won't solve the issue. So, we still would violate the RFC (which is of importance, if we state, it's and RFC-compliant handling of the case in question.
  3. Hipska suggests to use regular characters.
  4. We don't know, what happens, when we double-escape the two Hex-Representations (could you test this sometime, @Molkobain ?)

Regarding point 3) from Hipska: What I like especially is, that it's a pragmatic approach. I think, I even already tried it in the past, but noticed a big (at least for me) downside: The Regex contains "valid chars", which are easily mistaken for other chars or being not even recognized as chars, since they are somewhat special. But to be even more pracmatic: Does anything prevent us from mixing the representation? The char in question, which creates the problems, is the comma. A comma is easy to not to be mistaken (like space vs. tab or something alike). -> Let's just use the regular "comma"-Char and only use the Hex-Repr for chars, which can be easily mistaken.

What do you think about that?

@Hipska
Copy link
Collaborator

Hipska commented Jun 6, 2023

Did you already try a simple print (to log or with var_dump) of the FILENAME_REGEX var? I think that will explain a lot..

I just tried in 3v4l and it shows they definitely need to be double escaped!!

@Molkobain Molkobain removed their assignment Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Pending technical review
Development

Successfully merging this pull request may close these issues.

5 participants