Patch: Unicode email support (RFC 6531, 6532, 6533)
Arnt Gulbrandsen
2014-05-15 14:18:32 UTC
Hi,

at http://arnt.gulbrandsen.priv.no/tmp/postfix-eai-patch you will find a
patch to add unicode email support to Postfix. The patch is relative to
postfix-2.12-20140316.

I tried to append it to a list posting, but the result was too large for
the list, hence the URL.

A short overview of the RFCs: You can use naked UTF8 in localparts and
domains, and you can usually forget about quoted-printable. There's an
interlock to make sure that UTF8 messages are only ever sent to servers
that understand UTF8 addresses. There is no fallback to ASCII addresses
(there was an experiment with fallbacks and that was doubleplusunfun).

An overview of the patch: There is one new named attribute to the queue
file, smtputf8. If that is present, then the message can only be passed on
to remote servers if they support the SMTPUTF8 extension. Local delivery
always works. Mail from local senders (sendmail -bm) gets the attribute if
it uses any unicode address. Postfix tries to use ASCII only as long as
that's possible, but if either the sender or any recipient needs UTF8, then
UTF8 it is.
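
To make the detection rule above concrete, here is a minimal sketch of the decision for locally submitted mail. It is not taken from the patch: the helper names has_non_ascii() and needs_smtputf8() are made up for illustration, and the check is simply "any byte with the high bit set".

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the string contains any byte outside 7-bit ASCII. */
int has_non_ascii(const char *s)
{
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *) s; *p != '\0'; p++)
        if (*p >= 0x80)
            return 1;
    return 0;
}

/*
 * Hypothetical helper: decide whether a message needs the smtputf8
 * queue-file attribute. ASCII is kept as long as that is possible.
 */
int needs_smtputf8(const char *sender, const char *const *rcpts, size_t nrcpt)
{
    if (has_non_ascii(sender))
        return 1;
    for (size_t i = 0; i < nrcpt; i++)
        if (has_non_ascii(rcpts[i]))
            return 1;
    return 0;
}
```

A single non-ASCII recipient is enough to flag the whole message, which matches the "then UTF8 it is" rule described above.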

Postfix implements most of the RFCs' rules. There are a few exceptions:
Postfix doesn't have EXPN, so I didn't implement unicode EXPN either. VRFY
was much more flexible than in the RFC, so I made the extension
correspondingly flexible. Postfix' smtpd will use 8-bit in error messages
sometimes, which RFC6531 says it should only do when the client declares
support for it. I think Postfix behaves correctly and left it.

RFC 6533 contains some combinations of features that I don't think Postfix
can reach and therefore did not implement.

Postfix accepts UTF8 in myorigin and in mydestination, but not in
myhostname. Myhostname is sent in the EHLO argument, and I'm not man enough
to send either UTF8 or xn--barf-barf in the EHLO argument.

I'm sure this will need minor improvements, and there might be bugs. Please
let me know.

Arnt
Arnt Gulbrandsen
2014-05-15 14:24:27 UTC
One more comment. I saw some comments about backward compatibility in the
source.

If you upgrade to this, then add some messages to UTF8 addresses, then
downgrade, then... well, it won't quite work, but it won't quite break
either. The UTF8 mail won't be delivered (the old version doesn't know how
to), but postfix also won't crash. At the end of the queue lifetime
there'll be a DSN.

Arnt
Wietse Venema
2014-05-17 20:18:55 UTC
Post by Arnt Gulbrandsen
One more comment. I saw some comments about backward compatibility in the
source.
If you upgrade to this, then add some messages to UTF8 addresses, then
downgrade, then... well, it won't quite work, but it won't quite break
either. The UTF8 mail won't be delivered (the old version doesn't know how
to), but postfix also won't crash. At the end of the queue lifetime
there'll be a DSN.
I'm currently on vacation, and expect to evaluate this in the coming month
or so. This requires careful reading of the relevant RFCs.

Wietse
Arnt Gulbrandsen
2014-05-19 07:43:40 UTC
Post by Wietse Venema
I'm currently on vacation, and expect to evaluate this in the coming month
or so. This requires careful reading of the relevant RFCs.
OK.

I suggest reading 6530, -31 and -32 normally, then the patch carefully,
then 6531, -32 and -33 carefully. Let me know if you have any questions,
and thank you for your efforts.

Arnt
Arnt Gulbrandsen
2014-06-04 09:29:52 UTC
Post by Arnt Gulbrandsen
at http://arnt.gulbrandsen.priv.no/tmp/postfix-eai-patch you
will find a patch to add unicode email support to Postfix. The
patch is relative to postfix-2.12-20140316.
I see about ten people have downloaded the patch, but no one has sent mail
to the autoresponder. I take it that some of you have glanced at the patch,
or perhaps reviewed it properly, but no one has compiled or tested it.
Right?

I should be grateful for any comments, if any of you have looked at the
code.

Arnt
Wietse Venema
2014-06-04 10:55:18 UTC
Post by Arnt Gulbrandsen
Post by Arnt Gulbrandsen
at http://arnt.gulbrandsen.priv.no/tmp/postfix-eai-patch you
will find a patch to add unicode email support to Postfix. The
patch is relative to postfix-2.12-20140316.
I see about ten people have downloaded the patch, but no one has sent mail
to the autoresponder. I take it that some of you have glanced at the patch,
or perhaps reviewed it properly, but no one has compiled or tested it.
Right?
Last week I finished the port of Debian-style shared libraries and
dynamicmaps.cf, much of which I did during my vacation in Europe.
Post by Arnt Gulbrandsen
I should be grateful for any comments, if any of you have looked at the
code.
I have looked at parts of the patch in my copious time.

First, Postfix behavior must not change unless mail is flagged as
EAI, regardless of whether it contains 8-bit headers or envelopes.

Thus, the SMTP client, cleanup daemon, and other daemon programs
MUST NOT engage into any EAI-related stuff unless a message is
flagged as EAI-enabled. I will add a guard around that code.

As for EAI auto-detection in the Postfix sendmail command, I will
make that configurable if it isn't already.

Have you given any thought of what happens when a company installs
Postfix-EAI on the perimeter, and wants to forward the mail to their
internal systems that may or may not have EAI support?

Postfix has already passed 8-bit headers and envelopes for 15 years
and we can't suddenly stop doing that when a down-stream system
doesn't announce SMTPUTF8.

I haven't looked yet at the interface with database systems.
At this interface we can expect characterset issues.

Wietse
Arnt Gulbrandsen
2014-06-04 11:46:03 UTC
Post by Wietse Venema
I have looked at parts of the patch in my copious time.
I hoped someone else would ;) I do feel a little guilty about imposing on
you alone.
Post by Wietse Venema
First, Postfix behavior must not change unless mail is flagged as
EAI, regardless of whether it contains 8-bit headers or envelopes.
Are you sure?

I quite agree about 8-bit use in most of the messages, but we now have an
RFC that lays exclusive claim to how to interpret 8-bit in localparts and
domains. Anyone who uses 8-bit localparts/addresses incompatibly will have
a bit of a problem in the future.
Post by Wietse Venema
Thus, the SMTP client, cleanup daemon, and other daemon programs
MUST NOT engage into any EAI-related stuff unless a message is
flagged as EAI-enabled. I will add a guard around that code.
The smtputf8 flag in the queue file acts as such a guard. I noticed that
postfix tends to trust the queue, so I followed along with that: If the
queue file doesn't say smtputf8, then e.g. the SMTP client will happily
send 8-bit localparts.

The SMTP server is the biggest exception. It actually refuses 8-bit
localparts/domains unless the SMTPUTF8 flag is set. The RFC says it must.
If you'd rather not, just skip the two if() clauses in smtpd.c that mention
error code 5.6.7.

The other exceptions are in the DSN generation, where some error codes may
change, and in printable(), which now considers UTF8 to be printable where
it formerly did not.
Post by Wietse Venema
As for EAI auto-detection in the Postfix sendmail command, I will
make that configurable if it isn't already.
Very good point. Like this, perhaps:

diff --git a/src/sendmail/sendmail.c b/src/sendmail/sendmail.c
index f23dc9f..e3019bb 100644
--- a/src/sendmail/sendmail.c
+++ b/src/sendmail/sendmail.c
@@ -778,6 +778,7 @@ static void enqueue(const int flags, const char *encoding,
         }
     }
 
+#if !defined(NO_EAI)
     /*
      * If either the sender or any recipients contain non-ascii
      * characters, then this message has to be sent with the SMTPUTF8
@@ -788,6 +789,7 @@ static void enqueue(const int flags, const char *encoding,
         rec_fprintf(dst, REC_TYPE_ATTR, "%s=%d",
                     MAIL_ATTR_SMTPUTF8, 1);
     }
+#endif
 
     /*
      * Append the message contents to the queue file. Write chunks of at most
Post by Wietse Venema
Have you given any thought of what happens when a company installs
Postfix-EAI on the perimeter, and wants to forward the mail to their
internal systems that may or may not have EAI support?
Yes.

Mail between that company and other ASCII addresses works as before, in
both directions.

Outgoing mail from that company to unicode addresses may begin to work,
depending on whether the internal origin server supports EAI.

Incoming mail to that company from unicode addresses still doesn't work.
Until now Postfix would accept 8-bit localparts but frown on 8-bit domains
in the MAIL FROM command. Now it will accept 8-bit domains too, but if the
internal systems don't, then the perimeter server will generate a DSN when
it tries to forward the mail.
Post by Wietse Venema
Postfix has already passed 8-bit headers and envelopes for 15 years
and we can't suddenly stop doing that when a down-stream system
doesn't announce SMTPUTF8.
In that case you'll want to remove the two smtpd.c blocks that mention
error code 5.6.7.
Post by Wietse Venema
I haven't looked yet at the interface with database systems.
At this interface we can expect characterset issues.
I changed printable(), so e.g. any log systems that only accepted ASCII may
get problems. Things like recipient table lookups always had to accept
8-bit localparts, now they have to accept 8-bit domain side too.

Other issues should be unlikely, since the eightbittery is passed along
without any actual changes. No upcasing, downcasing, charset conversions or
other complications. The only conversion the code does is just before it
does an MX lookup.
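
For readers who have not seen that conversion, the following is a rough sketch of turning one UTF-8 label into its xn-- form with the Punycode algorithm from RFC 3492. It is not the code from the patch. The function names are invented, the input is assumed to be well-formed UTF-8, and the IDNA mapping and normalization steps (case folding, NFC) and the 63-octet output limit are left out. A full domain would be processed one label at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700 };

static char digit(uint32_t d)
{
    return (char) (d < 26 ? 'a' + d : '0' + (d - 26));
}

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t npoints, int first)
{
    uint32_t k = 0;

    delta = first ? delta / DAMP : delta / 2;
    delta += delta / npoints;
    for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE)
        delta /= BASE - TMIN;
    return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Encode one label of n code points. Overflow checks are omitted. */
static int punycode(const uint32_t *cp, uint32_t n, char *out, size_t len)
{
    uint32_t code = 128, delta = 0, bias = 72, h, b, i, m, q, k, t;
    size_t o = 0;

#define PUT(c) do { if (o + 1 >= len) return -1; out[o++] = (c); } while (0)
    for (i = 0; i < n; i++)             /* basic code points come first */
        if (cp[i] < 128)
            PUT((char) cp[i]);
    h = b = (uint32_t) o;
    if (b > 0)
        PUT('-');
    while (h < n) {
        for (m = UINT32_MAX, i = 0; i < n; i++)
            if (cp[i] >= code && cp[i] < m)
                m = cp[i];              /* next larger code point */
        delta += (m - code) * (h + 1);
        code = m;
        for (i = 0; i < n; i++) {
            if (cp[i] < code)
                delta++;
            if (cp[i] != code)
                continue;
            for (q = delta, k = BASE;; k += BASE) {
                t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
                if (q < t)
                    break;
                PUT(digit(t + (q - t) % (BASE - t)));
                q = (q - t) / (BASE - t);
            }
            PUT(digit(q));
            bias = adapt(delta, h + 1, h == b);
            delta = 0;
            h++;
        }
        delta++;
        code++;
    }
    out[o] = '\0';
    return 0;
#undef PUT
}

/* Convert one UTF-8 label to its ASCII form; returns 0 on success. */
int label_to_ascii(const char *label, char *out, size_t outlen)
{
    const unsigned char *p = (const unsigned char *) label;
    uint32_t cp[63], n = 0, c = 0;
    int ascii = 1, extra, r;

    assert(label != NULL && out != NULL);
    while (*p != '\0') {
        if (n == 63)
            return -1;
        c = *p++;
        extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
        if (extra > 0) {
            ascii = 0;
            c &= 0x3F >> extra;         /* payload bits of the lead byte */
        }
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            c = (c << 6) | (*p++ & 0x3F);
        cp[n++] = c;
    }
    if (ascii) {
        r = snprintf(out, outlen, "%s", label);
        return r >= 0 && (size_t) r < outlen ? 0 : -1;
    }
    if (outlen < 5)
        return -1;
    memcpy(out, "xn--", 4);
    return punycode(cp, n, out + 4, outlen - 4);
}
```

With this sketch the label "bär" becomes "xn--br-via", and an all-ASCII label is copied unchanged.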

Arnt
Wietse Venema
2014-06-04 13:23:24 UTC
Post by Arnt Gulbrandsen
Post by Wietse Venema
I have looked at parts of the patch in my copious time.
I hoped someone else would ;) I do feel a little guilty about imposing on
you alone.
Post by Wietse Venema
First, Postfix behavior must not change unless mail is flagged as
EAI, regardless of whether it contains 8-bit headers or envelopes.
Are you sure?
Yes. We must maintain compatibility with existing practice. Postfix
has always passed 8-bit headers and envelopes (localparts) for the
past 15 years. It would be an unacceptable compatibility break if,
for example, a corporate perimeter MTA were to start bouncing inbound
mail just because 1) some up-stream client is changed to flag that
email as SMTPUTF8, but 2) some down-stream internal server doesn't
announce SMTPUTF8.
Post by Arnt Gulbrandsen
Post by Wietse Venema
Thus, the SMTP client, cleanup daemon, and other daemon programs
MUST NOT engage into any EAI-related stuff unless a message is
flagged as EAI-enabled. I will add a guard around that code.
The smtputf8 flag in the queue file acts as such a guard.
No it doesn't. Example: ORCPT handling in the cleanup Milter client
and in the SMTP client is unconditional on the smtputf8 flag.
However, given that UTF8 addresses use a special encoding, I suspect
that it is better to decode them properly (the alternative would
be to not decode them at all and just pass them on, but that requires
some extra code to handle existing queue files that contain decoded
attributes).

[configurable EAI detection in the Postfix sendmail command]
Post by Arnt Gulbrandsen
+#if !defined(NO_EAI)
...
Post by Arnt Gulbrandsen
+#endif
That is not what I call configurable. That is what I call compiled-in
hard-coded behavior.
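
To illustrate the difference, a run-time switch might look roughly like the sketch below. The parameter name "eai_autodetect" and both helper functions are hypothetical and are not existing Postfix settings or APIs. The point is only that the decision is read from configuration when the program runs instead of being fixed by the preprocessor.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical parameter name; not an actual Postfix setting. */
#define PARAM_EAI_AUTODETECT "eai_autodetect"

/* Parse a yes/no configuration value; anything else yields the fallback. */
int parse_yes_no(const char *value, int fallback)
{
    assert(value != NULL);
    if (strcmp(value, "yes") == 0)
        return 1;
    if (strcmp(value, "no") == 0)
        return 0;
    return fallback;
}

/* Flag the message only when the administrator enabled auto-detection. */
int should_flag_smtputf8(int autodetect_enabled, int addresses_need_utf8)
{
    return autodetect_enabled && addresses_need_utf8;
}
```

With such a switch turned off, a message with 8-bit addresses would be queued exactly as before.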
Post by Arnt Gulbrandsen
Post by Wietse Venema
Have you given any thought of what happens when a company installs
Postfix-EAI on the perimeter, and WANTS TO FORWARD THE MAIL TO THEIR
INTERNAL SYSTEMS that may or may not have EAI support?
Yes.
...
Post by Arnt Gulbrandsen
Outgoing mail from that company to unicode addresses may begin to work,
depending on whether the internal origin server supports EAI.
Incorrect. This does not require any EAI support in the SMTP client.
The SMTP client simply hands the mail to the gateway without any
transformation of the recipient domain.
Post by Arnt Gulbrandsen
Incoming mail to that company from unicode addresses still doesn't work.
This has worked for 15 years, at least with UTF8 localparts. We
must maintain compatibility with existing practice. It would be an
unacceptable compatibility break if Postfix were to suddenly start
rejecting such mail.
Post by Arnt Gulbrandsen
Post by Wietse Venema
I haven't looked yet at the interface with database systems.
At this interface we can expect characterset issues.
I changed printable(), so e.g. any log systems that only accepted ASCII may
get problems. Things like recipient table lookups always had to accept
8-bit localparts, now they have to accept 8-bit domain side too.
Is there a possibility that the same domain name may exist as a UTF8
string in some contexts and as xn--mumble elsewhere? If this is a
problem then it will affect many database lookups.

How do UTF8 domain names interact with DNS RHSBL lists? Do they
expect the UTF8 form or the xn--mumble form?

How do UTF8 domain names interact with reject_unknown_sender_domain,
reject_unknown_recipient_domain, etc.? It looks like you are passing
the UTF8 domain name in DNS queries.
Post by Arnt Gulbrandsen
Other issues should be unlikely, since the eightbittery is passed along
without any actual changes. No upcasing, downcasing, charset conversions or
other complications. The only conversion the code does is just before it
does an MX lookup.
First, all Postfix table lookups are case-insensitive by default.
You may have missed that.

Second, not all lookup tables may support UTF8. What does the POSIX
standard have to say about this for regular expressions? This
affects the regexp: table.

Third, in database queries, strings that contain UTF8 may require
special treatment when the default locale is not unicode-based.
We must maintain compatibility with existing practice: Postfix
currently passes 8bit strings as if they are in the default locale.
It would be an unacceptable compatibility break if Postfix suddenly
starts to fail those queries just because they aren't well-formed
UTF8.
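
As an illustration of the "well-formed UTF8" distinction, a lookup layer could test a key first and fall back to treating it as opaque 8-bit data when the test fails. The sketch below is only one possible check and is not code from Postfix or from the patch; the function name is invented. It rejects overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if s is well-formed UTF-8, 0 otherwise. */
int utf8_well_formed(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    assert(s != NULL);
    while (*p != '\0') {
        uint32_t cp, min;
        int extra;

        if (*p < 0x80) {                    /* plain ASCII */
            p++;
            continue;
        }
        if ((*p & 0xE0) == 0xC0) {
            cp = *p & 0x1F; extra = 1; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) {
            cp = *p & 0x0F; extra = 2; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) {
            cp = *p & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;                       /* stray continuation or bad lead */
        }
        p++;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return 0;                   /* truncated sequence */
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                       /* overlong, too large, surrogate */
    }
    return 1;
}
```

A Latin-1 localpart such as "j\xf8ran" fails this test, so it could keep following the legacy 8-bit code path.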

So it looks like all the work on the database interface still
needs to be done.

Finally, you appear to have broken the valid_hostname(3) abstraction.
This module enforces RFC rules for hostnames (and domain names) in
calls of infrastructure functions such as getaddrinfo(), getnameinfo()
and functions at lower levels in the stack.

Unless the EAI RFCs say otherwise, the hostname in HELO commands
cannot be an UTF8 string, therefore it cannot be treated as if it
is a recipient domain.

Recipient domains require a validator that is specific for recipient
domains, and that validator does not belong in the valid_hostname(3)
module. I think this also requires a different version of the
host_port() function that is specific for recipient addresses and
that has a flag for whether or not UTF8 functionality is enabled.

More later, after I have reviewed the rest of the code, and after
I have checked it against the RFCs for compliance and completeness.

Wietse
Arnt Gulbrandsen
2014-06-04 14:41:31 UTC
Post by Wietse Venema
...
Yes. We must maintain compatibility with existing practice. Postfix
has always passed 8-bit headers and envelopes (localparts) for the
past 15 years. It would be an unacceptable compatibility break if,
for example, a corporate perimeter MTA were to start bouncing inbound
mail just because 1) some up-stream client is changed to flag that
email as SMTPUTF8, but 2) some down-stream internal server doesn't
announce SMTPUTF8.
I think you're right. The two code blocks that return 5.6.7 should perhaps
be included later, but definitely not included now.
Post by Wietse Venema
Post by Arnt Gulbrandsen
Post by Wietse Venema
Thus, the SMTP client, cleanup daemon, and other daemon programs
MUST NOT engage into any EAI-related stuff unless a message is
flagged as EAI-enabled. I will add a guard around that code.
The smtputf8 flag in the queue file acts as such a guard.
No it doesn't.
OK: It's meant to act as such a guard.
Post by Wietse Venema
Example: ORCPT handling in the cleanup Milter client
and in the SMTP client is unconditional on the smtputf8 flag.
However, given that UTF8 addresses use a special encoding, I suspect
that it is better to decode them properly (the alternative would
be to not decode them at all and just pass them on, but that requires
some extra code to handle existing queue files that contain decoded
attributes).
You'll see some other code like that in the DSN generation, when it chooses
quoting format. I didn't find an alternative I really liked.

It's not clear to me that UTF8 addresses always use that special encoding.
They probably should, but I found 6533 rather confusing. The niceties of
UTF8 addresses in SMTPUTF8 messages vs. UTF8 addresses in other settings
aren't as simple as I wish they were.

The ORCPT code in Milter/SMTP expects that all 8-bit addresses are SMTPUTF8
addresses that have somehow escaped into ASCIIland, so they should be
encoded as RFC6533 says in ORCPT. That's based on my reading of RFC6533. I
don't entirely like it, but I don't see any real alternative either. If you
see localpart "jøran" and don't know whether it's just-send-8 or escaped
EAI, should you follow EAI's quoting rules or extrapolate from RFC1984?

And what should you do if you receive an ORCPT using EAI-style quoting even
though the MAIL FROM did not declare SMTPUTF8? Should that ORCPT be
reencoded using 1984 encoding or keep its EAI encoding? Icky.
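
For reference, the EAI-style quoting in question looks roughly like the sketch below, based on my reading of the utf-8-addr-xtext form in RFC 6533 section 3. Please check the grammar before relying on it. The function names are invented, the input is assumed to be well-formed UTF-8, and the "utf-8;" address-type prefix that goes in front of the ORCPT value is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* QCHAR as I read RFC 6533: printable ASCII except SP, "+", "\" and "=". */
static int is_qchar(uint32_t c)
{
    return c >= 0x21 && c <= 0x7E && c != '+' && c != '\\' && c != '=';
}

/* Encode a UTF-8 address as 7-bit text, using \x{HEX} for everything else. */
int orcpt_utf8_xtext(const char *addr, char *out, size_t outlen)
{
    const unsigned char *p = (const unsigned char *) addr;
    size_t o = 0;

    assert(addr != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    while (*p != '\0') {
        uint32_t c = *p++;
        int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
        int r;

        if (extra > 0)
            c &= 0x3F >> extra;             /* payload bits of the lead byte */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            c = (c << 6) | (*p++ & 0x3F);
        if (is_qchar(c))
            r = snprintf(out + o, outlen - o, "%c", (int) c);
        else
            r = snprintf(out + o, outlen - o, "\\x{%02X}", (unsigned) c);
        if (r < 0 || (size_t) r >= outlen - o)
            return -1;                      /* output buffer too small */
        o += (size_t) r;
    }
    return 0;
}
```

Under these assumptions the localpart "jøran" comes out as j\x{F8}ran, which is the form that would follow "utf-8;" in an ORCPT parameter.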
Post by Wietse Venema
Post by Arnt Gulbrandsen
Post by Wietse Venema
Have you given any thought of what happens when a company installs
Postfix-EAI on the perimeter, and WANTS TO FORWARD THE MAIL TO THEIR
INTERNAL SYSTEMS that may or may not have EAI support?
Yes.
...
Post by Arnt Gulbrandsen
Outgoing mail from that company to unicode addresses may begin to work,
depending on whether the internal origin server supports EAI.
Incorrect. This does not require any EAI support in the SMTP client.
The SMTP client simply hands the mail to the gateway without any
transformation of the recipient domain.
If the best MX for the unicode recipient obeys RFC6531 section 3.4, then
the SMTP client on the gateway has to use the SMTPUTF8 MAIL FROM parameter,
ie. support EAI. By extension the origin server has to do the same.
Post by Wietse Venema
Post by Arnt Gulbrandsen
Incoming mail to that company from unicode addresses still doesn't work.
This has worked for 15 years, at least with UTF8 localparts.
Sorry about the sloppy writing. I meant unicode domains. You're right, it
has worked with 8-bit localparts in ASCII domains.
Post by Wietse Venema
We
must maintain compatibility with existing practice. It would be an
unacceptable compatibility break if Postfix were to suddenly start
rejecting such mail.
OK.
Post by Wietse Venema
Is there a possibility that the same domain name may exist as a UTF8
string in some contexts and as xn--mumble elsewhere? If this is a
problem then it will affect many database lookups.
As far as I can tell the xn-- mumble is never used outside the DNS lookups,
neither in the RFCs nor in practice. The EAI RFCs say to use the xn-- form
for MX lookups, to use an ASCII domain name for the EHLO argument, and
otherwise don't discuss xn--.

In particular they don't say that the email address ***@xn--bar is
equivalent to ***@bär. They also don't say it's different.

I chose to make them essentially different. If a site admin chooses to add
xn--bar to mydestinations, that user has to configure the rest so it works.
I chose that mostly because I think xn-- is a phisher's dream. People won't
recognize their own domains. But the choice also makes life simpler for
table/database lookups.
Post by Wietse Venema
How do UTF8 domain names interact with DNS RHSBL lists? Do they
expect the UTF8 form or the xn--mumble form?
Unknown as yet. I expect it'll have to be xn-- mumble, but that's really
just my guesswork. As far as I could tell none of the RHSBL operators have
considered that matter yet.
Post by Wietse Venema
How do UTF8 domain names interact with reject_unknown_sender_domain,
reject_unknown_recipient_domain, etc.? It looks like you are passing
the UTF8 domain name in DNS queries.
I added a new function, valid_mail_domain(), which is essentially like the
old valid_hostname() except that it takes UTF8 and converts to xn--mumble,
then I inspected each caller to decide whether it should call
valid_hostname() or valid_mail_domain(). If you want I'll list each caller
and my rationale for the decision.
Post by Wietse Venema
First, all Postfix table lookups are case-insensitive by default.
You may have missed that.
Indeed I did. Mydestinations will need more work, at least. I'll look at
it.
Post by Wietse Venema
Second, not all lookup tables may support UTF8. What does the POSIX
standard have to say about this for regular expressions? This
affects the regexp: table.
Third, in database queries, strings that contain UTF8 may require
special treatment when the default locale is not unicode-based.
We must maintain compatibility with existing practice: Postfix
currently passes 8bit strings as if they are in the default locale.
It would be an unacceptable compatibility break if Postfix suddenly
starts to fail those queries just because they aren't well-formed
UTF8.
Are you saying that at present, Postfix treats other people's 8bit as
though it were case-insensitive in the server's locale? And that Postfix
requires tables to be case-insensitive and silently expects them to use
the right locale?

The pgsql table, at least, appears to use the locale that was chosen while
creating the database, not the system locale on the Postfix server.

Being compatible with that will require a bit of luck.
Post by Wietse Venema
So it looks like all the work on the database interface still
needs to be done.
Finally, you appear to have broken the valid_hostname(3) abstraction.
This module enforces RFC rules for hostnames (and domain names) in
calls of infrastructure functions such as getaddrinfo(), getnameinfo()
and functions at lower levels in the stack.
Unless the EAI RFCs say otherwise, the hostname in HELO commands
cannot be an UTF8 string, therefore it cannot be treated as if it
is a recipient domain.
I agree (and I think I said as much in the README). That's why I call
valid_hostname() in many cases and valid_mail_domain() in others.
Post by Wietse Venema
Recipient domains require a validator that is specific for recipient
domains, and that validator does not belong in the valid_hostname(3)
module. I think this also requires a different version of the
host_port() function that is specific for recipient addresses and
that has a flag for whether or not UTF8 functionality is enabled.
Moving valid_mail_domain() into its own file is fine. The purpose of
valid_mail_domain() is precisely to validate recipient (and sender)
domains.

I think host_port() had better not be split. It's used for mail hosts, and
those are like EHLO arguments, they have to be ASCII even when
sender/recipient domains can be unicode. So in /etc/postfix/transport the
LHS can use unicode but the RHS cannot for the foreseeable future.
Post by Wietse Venema
More later, after I have reviewed the rest of the code, and after
I have checked it against the RFCs for compliance and completeness.
You've already found at least two holes, perhaps four. You told me earlier
I needn't bother about minor improvements. Are these big enough that you'd
prefer me to submit a new patch?

Thanks for your responses; I hope I haven't disturbed your vacation too
much.

Arnt
Wietse Venema
2014-06-04 16:45:40 UTC
Post by Arnt Gulbrandsen
Post by Wietse Venema
Post by Arnt Gulbrandsen
Post by Wietse Venema
Have you given any thought of what happens when a company installs
Postfix-EAI on the perimeter, and WANTS TO FORWARD THE MAIL TO THEIR
INTERNAL SYSTEMS that may or may not have EAI support?
Yes.
...
Post by Arnt Gulbrandsen
Outgoing mail from that company to unicode addresses may begin to work,
depending on whether the internal origin server supports EAI.
Incorrect. This does not require any EAI support in the SMTP client.
The SMTP client simply hands the mail to the gateway without any
transformation of the recipient domain.
If the best MX for the unicode recipient obeys RFC6531 section 3.4, then
the SMTP client on the gateway has to use the SMTPUTF8 MAIL FROM parameter,
ie. support EAI. By extension the origin server has to do the same.
You are missing the point. The internal SMTP client does not
look up the recipient MX host. It just gives the mail to the
perimeter gateway.

Therefore, a non-EAI internal SMTP client can send an email reply
to an EAI sender.
Post by Arnt Gulbrandsen
Post by Wietse Venema
Is there a possibility that the same domain name may exist as a UTF8
string in some contexts and as xn--mumble elsewhere? If this is a
problem then it will affect many database lookups.
As far as I can tell the xn-- mumble is never used outside the DNS lookups,
neither in the RFCs nor in practice. The EAI RFCs say to use the xn-- form
for MX lookups, to use an ASCII domain name for the EHLO argument, and
otherwise don't discuss xn--.
Thus an EAI domain name may show up as xn--mumble in HELO commands.
Post by Arnt Gulbrandsen
Post by Wietse Venema
First, all Postfix table lookups are case-insensitive by default.
You may have missed that.
Indeed I did. Mydestinations will need more work, at least. I'll look at
it.
Post by Wietse Venema
Second, not all lookup tables may support UTF8. What does the POSIX
standard have to say about this for regular expressions? This
affects the regexp: table.
Third, in database queries, strings that contain UTF8 may require
special treatment when the default locale is not unicode-based.
We must maintain compatibility with existing practice: Postfix
currently passes 8bit strings as if they are in the default locale.
It would be an unacceptable compatibility break if Postfix suddenly
starts to fail those queries just because they aren't well-formed
UTF8.
Are you saying that at present, Postfix treats other people's 8bit as
though it were case-insensitive in the server's locale? And that Postfix
requires tables to be case-insensitive and silently expects them to use
the right locale?
I make several statements, and they are to be read separately. All table
lookups are case-insensitive by default. Apart from that, Postfix treats
8bit strings as in the current locale.
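
One way to keep lookup keys case-insensitive without depending on the locale is to fold only the ASCII letters and pass every 8-bit byte through untouched. The sketch below shows that idea. It is an illustration and not a description of what Postfix actually does, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lowercase A-Z in place; bytes >= 0x80 are left exactly as they are. */
void fold_ascii_case(char *key)
{
    assert(key != NULL);
    for (unsigned char *p = (unsigned char *) key; *p != '\0'; p++)
        if (*p >= 'A' && *p <= 'Z')
            *p = (unsigned char) (*p + ('a' - 'A'));
}
```

This keeps multi-byte UTF-8 sequences intact, at the price of treating "Ø" and "ø" as different keys.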
Post by Arnt Gulbrandsen
Post by Wietse Venema
Finally, you appear to have broken the valid_hostname(3) abstraction.
This module enforces RFC rules for hostnames (and domain names) in
calls of infrastructure functions such as getaddrinfo(), getnameinfo()
and functions at lower levels in the stack.
Unless the EAI RFCs say otherwise, the hostname in HELO commands
cannot be an UTF8 string, therefore it cannot be treated as if it
is a recipient domain.
I agree (and I think I said as much in the README). That's why I call
valid_hostname() in many cases and valid_mail_domain() in others.
It uses valid_mail_domain() in reject_invalid_hostname and
reject_non_fqdn_hostname, but in reject_unknown_hostname() it passes
the UTF8 string to dns_lookup_l().
Post by Arnt Gulbrandsen
You've already found at least two holes, perhaps four. You told me earlier
I needn't bother about minor improvements. Are these big enough that you'd
prefer me to submit a new patch?
There will be more. I'll just document them and fix them, so I
don't have to spend a lot of time reviewing another version.

Wietse
Viktor Dukhovni
2014-06-04 16:58:43 UTC
Post by Arnt Gulbrandsen
As far as I can tell the xn-- mumble is never used outside the DNS lookups,
neither in the RFCs nor in practice. The EAI RFCs say to use the xn-- form
for MX lookups, to use an ASCII domain name for the EHLO argument, and
otherwise don't discuss xn--.
Lack of discussion simply means that the relevant discussion is in
other documents. For example, in X.509 subjectAltName DNS, the
domain name needs to be in ASCII form.

My impression is that UTF-8 domain names are an MUA display
format issue. All domain names "on the wire", including in email
headers should be in ASCII form, to be displayed by MUAs as UTF-8
when appropriate. I've not checked whose responsibility it is to
perform the conversion from what the user types to the A-label form
of the domain. Ideally, this is done by the MUA. Potentially this
could also be done by an MSA (before applying DKIM signing, ...).

Perhaps UTF-8 domains are allowed in headers (a bad idea IMHO, even
if blessed by the RFC), but they should be converted to A-labels as
quickly as possible. UTF-8 text may then appear in the address
localpart, and in "phrases" (Full Name, ...). One might also expect
UTF-8 in some MIME headers (obviating RFC 2231 encoding of MIME
attribute values), however when the payload is a domain it should
I think be in A-label (wire) form.

Thus for example, in DKIM the "d=" attribute should be ASCII, ..

Finally, I still view EAI RFCs with a healthy dose of skepticism.
Where good judgement runs contrary to the RFCs, I'll go with good
judgement.
--
Viktor.
Arnt Gulbrandsen
2014-06-04 18:34:07 UTC
Post by Viktor Dukhovni
My impression is that UTF-8 domain names are an MUA display
format issue.
There was tremendously tedious discussion of the approach you suggest, and
of many others. There was even a set of experimental RFCs issued. In the
end the experimental RFCs were discarded. Here's how RFC 6532 sums up the
final message format changes:

The preceding changes mean that the following constructs now allow
UTF-8:

1. Unstructured text, used in header fields like "Subject:" or
"Content-description:".

2. Any construct that uses atoms, including but not limited to the
local parts of addresses and Message-IDs. This includes
addresses in the "for" clauses of "Received:" header fields.

3. Quoted strings.

4. Domains.

6531 references 6532, and the MAIL FROM/RCPT TO syntax allows UTF8. 6855
makes corresponding changes to IMAP (no more mUTF7, hurray), 6856 to POP,
etc.

Arnt
Arnt Gulbrandsen
2014-06-04 17:48:08 UTC
Post by Wietse Venema
You are missing the point. The internal SMTP client does not
look up the recipient MX host. It just gives the mail to the
perimeter gateway.
Therefore, a non-EAI internal SMTP client can send an email reply
to an EAI sender.
I am not missing the point.

Compliant SMTP servers only accept mail to/from EAI addresses if the SMTP
client uses the SMTPUTF8 form of the MAIL FROM command. The SMTP client, in
turn, only uses that form if the origin too used it.

The purpose of this feature is to guarantee that EAI messages don't land in
the mailboxes of incompatible recipients. The relevant effect of this
feature is that in order to send mail to a unicode address, the _sender_
must declare that the message uses EAI. Having 8-bit clean relays on the
way is not enough.
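
The interlock can be sketched as a small decision function plus a scan of the EHLO reply. This is an illustration under my own assumptions, not code from the patch: the names ehlo_offers(), choose_relay_action() and the enum are invented, and the parser assumes each reply line starts with a three-digit code followed by "-" or a space.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum relay_action { RELAY_PLAIN, RELAY_SMTPUTF8, RELAY_BOUNCE };

/* Return 1 if a multi-line EHLO reply lists the given keyword. */
int ehlo_offers(const char *reply, const char *keyword)
{
    size_t klen, i;

    assert(reply != NULL && keyword != NULL);
    klen = strlen(keyword);
    while (*reply != '\0') {
        size_t len = strcspn(reply, "\r\n");

        /* Skip "250-" or "250 ", then compare without regard to case. */
        for (i = 0; len >= 4 + klen && i < klen; i++)
            if (toupper((unsigned char) reply[4 + i])
                != toupper((unsigned char) keyword[i]))
                break;
        if (len >= 4 + klen && i == klen
            && (len == 4 + klen || reply[4 + klen] == ' '))
            return 1;
        reply += len;
        reply += strspn(reply, "\r\n");
    }
    return 0;
}

/* Unflagged mail goes out as before; flagged mail needs a capable server. */
enum relay_action choose_relay_action(int msg_smtputf8, int server_smtputf8)
{
    if (!msg_smtputf8)
        return RELAY_PLAIN;
    return server_smtputf8 ? RELAY_SMTPUTF8 : RELAY_BOUNCE;
}
```

Only messages that carry the flag can end up in the bounce branch, so mail that was never declared as EAI is not affected by this check.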
Post by Wietse Venema
Thus an EAI domain name may show up as xn--mumble in HELO commands.
Yes. I think it's a bad idea to do that. The chance that some SMTP server's
gethostbyname() will return the UTF8 form and the SMTP server then complain
about EHLO/PTR mismatch is too great. But it can happen.
Post by Wietse Venema
There will be more. I'll just document them and fix them, so I
don't have to spend a lot of time reviewing another version.
Great.

Arnt
Wietse Venema
2014-06-04 18:38:49 UTC
Post by Arnt Gulbrandsen
Post by Wietse Venema
You are missing the point. The internal SMTP client does not
look up the recipient MX host. It just gives the mail to the
perimeter gateway.
Therefore, a non-EAI internal SMTP client can send an email reply
to an EAI sender.
I am not missing the point.
Compliant SMTP servers only accept mail to/from EAI addresses if the SMTP
client uses the SMTPUTF8 form of the MAIL FROM command. The SMTP client, in
turn, only uses that form if the origin too used it.
Postfix has accepted 8-bit headers and localparts forever, and that
will not change. The mission of Postfix is to deliver mail, not to force
everyone else into compliance with some newfangled RFC.
Post by Arnt Gulbrandsen
Post by Wietse Venema
Thus an EAI domain name may show up as xn--mumble in HELO commands.
Yes. I think it's a bad idea to do that. The chance that some SMTP server's
gethostbyname() will return the UTF8 form and the SMTP server then complain
about EHLO/PTR mismatch is too great. But it can happen.
I'll read the RFCs carefully and see where it allows UTF8 in SMTP
command parameters and replies.

However even without reading those RFCs it is clear that UTF8 cannot
be used in 220 server greetings or in EHLO commands or replies,
because at that time the server/client have not agreed to use UTF8.

Thus, myhostname (or equivalent) must be ASCII, as it always must
have been. There is no need to use valid_mail_domain() in
reject_non_fqdn_hostname etc.

Wietse
Arnt Gulbrandsen
2014-06-04 19:00:32 UTC
Permalink
Post by Wietse Venema
I'll read the RFCs carefully and see where it allows UTF8 in SMTP
command parameters and replies.
You'll do that, but I'll tell you anyway: The client may use it once the
server has issued an EHLO response containing SMTPUTF8, and the server may
use it once the client has issued a MAIL FROM, VRFY or EXPN command with
the SMTPUTF8 parameter.

Postfix (both with and without my patch) violates that. If a client tells
Postfix:

MAIL FROM:<æ@æ.æ>

then Postfix may conceivably answer that æ@æ.æ is not a legal sender
address, since æ.æ isn't a valid domain. 6531 says that that response
should be ASCII-only, since the client hasn't given permission to use UTF8
in responses. My viewpoint is that no matter what RFC6531 says, the client
must accept hearing its own arguments in the SMTP reply. Postfix is right
and 6531 is wrongish, so I followed Postfix' reply style rather than comply
with 6531.
Post by Wietse Venema
However even without reading those RFCs it is clear that UTF8 cannot
be used in 220 server greetings or in EHLO commands or replies,
because at that time the server/client have not agreed to use UTF8.
Right.
Post by Wietse Venema
Thus, myhostname (or equivalent) must be ASCII, as it always must
have been. There is no need to use valid_mail_domain() in
reject_non_fqdn_hostname etc.
Right. I made some mistakes. I wish I were perfect, but know I am not.

Arnt
Matthias Andree
2014-06-04 19:03:02 UTC
Permalink
Post by Arnt Gulbrandsen
Compliant SMTP servers only accept mail to/from EAI addresses if the
SMTP client uses the SMTPUTF8 form of the MAIL FROM command. The SMTP
client, in turn, only uses that form if the origin too used it.
The purpose of this feature is to guarantee that EAI messages don't land
in the mailboxes of incompatible recipients. The relevant effect of this
feature is that in order to send mail to a unicode address, the _sender_
must declare that the message uses EAI. Having 8-bit clean relays on the
way is not enough.
Post by Wietse Venema
Thus an EAI domain name may show up as xn--mumble in HELO commands.
Yes. I think it's a bad idea to do that. The chance that some SMTP
server's gethostbyname() will return the UTF8 form and the SMTP server
then complain about EHLO/PTR mismatch is too great. But it can happen.
Post by Wietse Venema
There will be more. I'll just document them and fix them, so I
don't have to spend a lot of time reviewing another version.
I'm late to the game, haven't checked the relevant RFCs or Arnt's patch,
but a few thoughts on this -- perhaps you can answer "all dealt with" --
but here we go:

* It reminds me a bit of the 8BITMIME feature that was in discussion in
the late 1990's/early 2000's. I think The World™ never agreed on how
to deal with all that, depending on how radically a given piece of
software implemented its policies. Meaning: do we need this? Is Microsoft
going to implement it? IBM's Lotus Domino/Notes suites on the client end?


* My bigger concern is that UNICODE opens up ambiguities at various
levels, for instance when doing table lookups (especially for policies,
such as access control):

+ IDN punycode (xn--blech-rassel), as mentioned above.

+ Unicode normalization forms, are these handled consistently?
<http://www.unicode.org/reports/tr15/>
I searched the patch for the word fragment "normal", no hits.
I find that worrisome.

+ Characters that are different but use similar-looking glyphs,
(homoglyphs), for instance, between Greek/Cyrillic/Latin scripts.
Latin A, Cyrillic A, Greek A are three code points for an
indistinguishable character. A А Α <- in what order are these?
Hint:
0000000: 4120 d090 20ce 910a A .. ...
or U+0041 U+0020 U+0410 U+0020 U+0391

Is there a consistent policy for treating them that does not open up
loop- and ratholes and pitfalls and barndoors and all other sorts of
unfortunate openings for unaware/malicious parties?

+ How does the patch make Postfix deal with table lookups for tables
that don't go through postmap and cannot be normalized?
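
The homoglyph point can be checked with Python's standard unicodedata module. This is only an illustration of the three code points from the hexdump above; it also shows that normalization does not merge look-alike letters from different scripts.

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha: three code points, one visual shape.
lookalikes = "A\u0410\u0391"

for ch in lookalikes:
    # Code point, official Unicode name, UTF-8 bytes in hex.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

# Normalization (here NFKC) keeps the three letters distinct, so it is
# not a defence against homoglyphs.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalized))
```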

I don't want to create artificial adoption obstacles here, but I think
there is some room for nasty surprises, and that space needs exploration
and solutions. That's not just security discussion, but also reliability.

(Perhaps Unicode requires - or I missed - homoglyph tables, and case
mapping tables...)

I think Wietse's expectation on how not to change established behaviour
of release versions is clear, and I've always known I can rely on
Postfix's compatibility. (Not to say that Postfix's compatibility is
exemplary, as in "good example", but I digress.)
Arnt Gulbrandsen
2014-06-04 19:33:15 UTC
Permalink
Post by Matthias Andree
Is Microsoft going
to implement it?
Microsoft has implemented it. They asked for interoperation testing earlier
this week.
Post by Matthias Andree
IBM's Lotus Domino/Notes suites on the client end?
No idea.

Except that IBM has offices in Beijing and sells to the Chinese government,
and the Chinese government really likes EAI.
Post by Matthias Andree
+ Unicode normalization forms, are these handled consistently?
<http://www.unicode.org/reports/tr15/>
I searched the patch for the word fragment "normal", no hits.
I find that worrisome.
That's in ICU, which the patch calls.
Post by Matthias Andree
+ Characters that are different but use similar-looking glyphs,
(homoglyphs), for instance, between Greek/Cyrillic/Latin scripts.
Latin A, Cyrillic A, Greek A are three code points for an
indistinguishable character. A А Α <- in what order are these?
0000000: 4120 d090 20ce 910a A .. ...
or U+0041 U+0020 U+0410 U+0020 U+0391
Is there a consistent policy for treating them that does not open up
loop- and ratholes and pitfalls and barndoors and all other sorts of
unfortunate openings for unaware/malicious parties?
That is, blessedly, not a problem for Postfix. It's mostly a TLD registry
issue. Each registry has rules, mostly similar but far from identical.
Post by Matthias Andree
+ How does the patch make Postfix deal with table lookups for tables
that don't go through postmap and cannot be normalized?
No changes done. Some are needed, yes.
Post by Matthias Andree
I don't want to create artificial adoption obstacles here, but I think
there is some room for nasty surprises, and that space needs exploration
and solutions. That's not just security discussion, but also
reliability.
Post by Matthias Andree
(Perhaps Unicode requires - or I missed - homoglyph tables, and case
mapping tables...)
ICU contains the tables required. (Before you ask, I don't know how ı/I/i/İ
is handled. I'm curious myself.)

I'm somewhat unhappy that the patch links ICU into more postfix executables
than the one that really needs it.
Post by Matthias Andree
I think Wietse's expectation on how not to change established behaviour
of release versions is clear, and I've always known I can rely on
Postfix's compatibility. (Not to say that Postfix's compatibility is
exemplary, as in "good example", but I digress.)
Wietse is right. It makes me sad, but he is right.

Arnt
Arnt Gulbrandsen
2014-06-04 20:07:24 UTC
Permalink
I want to digress about one aspect here: SMTP/EAI and unicode
normalization.

The general EAI approach to that is to avoid having the problem, ie. to
define the SMTP/email extensions such that the problems become other
people's problems.

Homoglyphs aren't an SMTP problem. Two codepoints may look the same, but an
SMTP server doesn't have to think about which of the two domains is
legitimate and which is the impostor. All that is the registry's headache.

De/composition are pushed to the DNS. The SMTP part just says: Convert to
IDNA a-labels in order to do the MX lookup, and otherwise don't mess with
the bytes you received. (My patch uses ICU to convert to a-labels.)
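
As a rough illustration of "convert for the lookup, leave the address alone", here is a Python sketch. It uses the standard library's idna codec (IDNA 2003) as a stand-in for ICU, and mx_lookup_name is a hypothetical helper, not a function from the patch.

```python
def mx_lookup_name(address: str) -> str:
    """Return the a-label (ASCII) domain to hand to the DNS resolver.

    The address itself is not modified; only the name used for the MX
    lookup is converted.
    """
    _localpart, _, domain = address.rpartition("@")
    return domain.encode("idna").decode("ascii")


address = "info@blåbærsyltetøy.example"
print(mx_lookup_name(address))  # name for the MX query
print(address)                  # still the bytes that were received
```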

That does leave a little trouble, mostly dealing with localparts, but also
with local domains. Some of it will be tricky, e.g. doing unicode-based
pcre on a system that doesn't use a unicode locale.

Arnt
Wietse Venema
2014-06-04 21:16:51 UTC
Permalink
Post by Arnt Gulbrandsen
De/composition are pushed to the DNS. The SMTP part just says: Convert to
IDNA a-labels in order to do the MX lookup, and otherwise don't mess with
the bytes you received. (My patch uses ICU to convert to a-labels.)
That is a mis-conception.

DNS is not the only interface that requires xn--mumble names. Like
a cancer, EAI has the potential to infect many aspects of address
handling and policy lookup. This is why I estimated that SMTPUTF8
would be a major project.

* The form xn--mumble will also be required in server greetings and
EHLO commands, when an MTA host- or domain name contains non-ASCII
characters. This means that Postfix must convert myhostname into
xn--mumble form in those contexts that require ASCII text.

* With multiple forms for the same domain name, xn--mumble in
HELO/EHLO (and perhaps other SMTP commands) and UTF8 in
MAIL/RCPT/ETRN/VRFY, Postfix lookup tables must either contain
multiple lookup keys for the same domain name, or Postfix must
convert all domain/email-address lookup keys into one canonical
form. That is, either convert all UTF8 domain names into xn--mumble,
or convert all xn--mumble domain names into UTF8. Having only
one lookup key per domain in Postfix lookup tables will be more
secure but it will be a royal pain to implement (and there is no
way to do that with header/body_checks).

* I am not sure that we can rely on the postmap "table query" or
"create map" commands to "normalize" domain names in lookup keys.
Also, LDAP/*SQL*/etc. databases aren't "created" with postmap
commands. All this could be another argument to use only xn--mumble
or to use only UTF8 forms in databases. Again, more secure but a
royal pain to implement, because postmap doesn't really know if
a lookup key is a user, a domain, or something else.

* If xn--mumble were to become the canonical form for table lookup,
then Postfix parent-domain matching will not be broken: where
bücher.com becomes xn--bcher-kva.com, foo.bücher.com becomes
foo.xn--bcher-kva.com.
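
The parent-domain point holds because the conversion works label by label, so the dots stay where they were. A small Python sketch, using bücher.com (whose a-label form is xn--bcher-kva.com) and a hypothetical parent_domains helper; the stdlib idna codec stands in for whatever converter Postfix would use.

```python
def parent_domains(domain: str) -> list:
    """Return the a-label form of a domain followed by its parent domains.

    The bare top-level label is left out, as a policy table would
    rarely key on it.
    """
    labels = domain.encode("idna").decode("ascii").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


print(parent_domains("foo.bücher.com"))
```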

Other things:

* Postfix table queries are case-insensitive. I don't see any attempt
to implement that for UTF8 addresses. This leaves an ambiguity.

Wietse
Arnt Gulbrandsen
2014-06-05 08:36:49 UTC
Permalink
Post by Wietse Venema
* Postfix table queries are case-insensitive. I don't see any attempt
to implement that for UTF8 addresses. This leaves an ambiguity.
I looked at this now.

As I read the code, tables mostly map to lower case and then do a binary
comparison. The mysql and pgsql tables may additionally use the database
server's ilike operation. Finally, lowercase() maps U to u, but leaves 0xC0
as 0xC0, even if the Postfix server runs in a locale where the lowercase
form of that is 0xE0.

Is that correct?

I can provide a supplementary patch that provides case insensitivity for
unicode. It's easy, but there are several ways to do it, and I don't know
which you prefer.

1. Toupper/tolower in Postfix, with the usual table. This adds the bulk of
a table and is language-independent but imperfect. The well-known problem
is i/ı. (The lowercase("I") equivalent is "ı" in Turkish and a handful of
other locales.)

2a. Toupper/tolower that call out to ICU if EAI is enabled and there's any
non-ASCII in the argument. This slows down toupper()/tolower() but
Postfix escapes having the table and ICU devotes considerable effort to
correctness. It's easy to compose the string, too (composition means to use
å instead of "a"+"ring above").

2b. Ditto, but calling a language-sensitive function in ICU, so that i is
equal to İ if the Postfix server runs in one of those locales. I'm unhappy
about this alternative: a Swiss service provider may well service both
Kazakh and Korean users and how should the service provider's Postfix be
configured?

3. Switching to titlecase. A bigger change. Titlecase is a form in which
case differences are erased and in principle it's neither equal to
uppercase nor to lowercase. It's only usable for implementing
case-insensitive comparison/lookup using fast binary comparison.

In my opinion the change to titlecase isn't worth it. There aren't enough
problems with lowercase() to justify such a sweeping change. Also keeping
lower case allows compiled tables to survive upgrades/downgrades.

I'm neutral regarding 1 and 2a. If you'll tell me what you prefer I'll
write a patch and test that it matches another implementation.
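
To make the difference between the language-independent options (1 and 2a) and the i/ı problem concrete, here is a sketch using Python's built-in str.lower(), which is language-independent. It stands in for a Postfix table or an ICU call; lowercase_key is a hypothetical name.

```python
def lowercase_key(key: str) -> str:
    """Language-independent lowercase, in the spirit of options 1 and 2a."""
    return key.lower()


# Non-ASCII letters are folded as well as ASCII ones.
print(lowercase_key("BLÅBÆR@EXAMPLE.NO"))

# The well-known problem: capital I maps to dotted i everywhere, so the
# Turkish dotless ı is never produced and stays distinct from i.
print(lowercase_key("I") == "i")
print(lowercase_key("ı") == lowercase_key("i"))
```

A locale-sensitive variant (option 2b) would map I to ı on a Turkish-locale server, which is exactly the configuration question raised above.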

Arnt
Arnt Gulbrandsen
2014-06-05 09:06:31 UTC
Permalink
Post by Arnt Gulbrandsen
In my opinion the change to titlecase isn't worth it. There
aren't enough problems with lowercase() to justify such a
sweeping change. Also keeping lower case allows compiled tables
to survive upgrades/downgrades.
Worse: There are likely user-supplied tables that depend on lowercase
input. Both the mysql and pgsql tables make it easy to configure
case-sensitive queries, which switching to titlecase would break. I think
titlecase is definitely out.

(Btw, I wrote tolower() instead of lowercase() once or twice in the
previous message. Sorry.)

Arnt
Wietse Venema
2014-06-05 11:17:48 UTC
Permalink
Post by Arnt Gulbrandsen
Post by Wietse Venema
* Postfix table queries are case-insensitive. I don't see any attempt
to implement that for UTF8 addresses. This leaves an ambiguity.
I looked at this now.
As I read the code, tables mostly map to lower case and then do a binary
comparison. The mysql and pgsql tables may additionally use the database
server's ilike operation. Finally, lowercase() maps U to u, but leaves 0xC0
as 0xC0, even if the Postfix server runs in a locale where the lowercase
form of that is 0xE0.
Is that correct?
That question is not applicable. Postfix locale is "C", and
lowercase() only translates ASCII characters.
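
For illustration only, an ASCII-only lowercase in Python that behaves as described here. It is not the Postfix source, and ascii_lowercase is a made-up name.

```python
def ascii_lowercase(data: bytes) -> bytes:
    """Lowercase the bytes 'A'..'Z' and leave every other byte untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


print(ascii_lowercase(b"U"))      # ASCII letter is folded
print(ascii_lowercase(b"\xc0"))   # 0xC0 stays 0xC0, whatever the locale
```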
Post by Arnt Gulbrandsen
I can provide a supplementary patch that provides case insensitivity for
unicode. It's easy, but there are several ways to do it, and I don't know
which you prefer.
1. Toupper/tolower in Postfix, with the usual table. This adds the bulk of
a table and is language-independent but imperfect. The well-known problem
is i/ı. (The lowercase("I") equivalent is "ı" in Turkish and a handful of
other locales.)
2a. Toupper/tolower that call out to ICU if EAI is enabled and there's any
non-ASCII in the argument. This slows down toupper()/tolower() but
Postfix escapes having the table and ICU devotes considerable effort to
correctness. It's easy to compose the string, too (composition means to use
å instead of "a"+"ring above").
2b. Ditto, but calling a language-sensitive function in ICU, so that i is
equal to İ if the Postfix server runs in one of those locales. I'm unhappy
about this alternative: a Swiss service provider may well service both
Kazakh and Korean users and how should the service provider's Postfix be
configured?
3. Switching to titlecase. A bigger change. Titlecase is a form in which
case differences are erased and in principle it's neither equal to
uppercase nor to lowercase. It's only usable for implementing
case-insensitive comparison/lookup using fast binary comparison.
In my opinion the change to titlecase isn't worth it. There aren't enough
problems with lowercase() to justify such a sweeping change. Also keeping
lower case allows compiled tables to survive upgrades/downgrades.
I'm neutral regarding 1 and 2a. If you'll tell me what you prefer I'll
write a patch and test that it matches another implementation.
This will require further research. If case canonicalization is as
complex as you describe then the "correct" result is likely to
differ from what real people expect. That is a security hole.

Wietse
Arnt Gulbrandsen
2014-06-05 12:24:38 UTC
Permalink
Post by Wietse Venema
This will require further research. If case canonicalization is as
complex as you describe then the "correct" result is likely to
differ from what real people expect. That is a security hole.
That was the case in the nineties, but by now the case folding algorithms
in unicode have won. They've been used so much that people have come to
expect that they're right. There are problems, but lowercase() escapes all
but ı/i.

But ı is nasty. I have even found two domains that differ only in ı/i, so
Postfix cannot treat them as equal.

Composition (the other part of canonicalization) is a worse matter. You're
right, that might lead to security problems. It can lead to table lookup
misses, and I'm sure that table misses can lead to several kinds of
security problems. For example forgetting mandatory TLS.

The safest alternative is to fully compose table lookup keys. (Or fully
decompose, but fully compose is usually faster.) I'll provide a patch to do
the 2a alternative. It'll take a few days.
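
A small Python sketch of how an uncomposed key can miss a table entry, and how composing both sides avoids it. The table, its contents and the lookup_key helper are hypothetical; unicodedata.normalize("NFC", ...) stands in for the ICU call.

```python
import unicodedata


def lookup_key(key: str) -> str:
    """Fully compose a key (NFC) before it is stored or looked up."""
    return unicodedata.normalize("NFC", key)


# Hypothetical TLS policy table, keyed by the composed form (å = U+00E5).
tls_policy = {lookup_key("bl\u00e5b\u00e6r.example"): "encrypt"}

# The same domain as it might arrive: "a" followed by COMBINING RING ABOVE.
incoming = "bla\u030ab\u00e6r.example"

print(tls_policy.get(incoming))              # raw lookup misses the entry
print(tls_policy.get(lookup_key(incoming)))  # composed lookup finds it
```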

Arnt
Viktor Dukhovni
2014-06-05 14:32:52 UTC
Permalink
But ı is nasty. I have even found two domains that differ only in ı/i, so
Postfix cannot treat them as equal.
Domains passed to lookup tables and match lists need to be in
a-label form. The remaining surprises with domains and case-insensitive
comparisons vs. unicode will be with header/body checks, likely OK.
--
Viktor.
Arnt Gulbrandsen
2014-06-05 15:18:48 UTC
Permalink
Post by Viktor Dukhovni
Domains passed to lookup tables and match lists need to be in
a-label form.
That would make pcre almost impossible and mysql and pgsql lookups rather
inconvenient.

The a-label form of blåbærsyltetøy is xn--blbrsyltety-y8ao3x. Matching
the PCRE /.*syltetøy.*/ in a-label form would be inconvenient, perhaps
impossible.

Postgres and Mysql have builtin support for UTF8 strings so mysql/pgsql
tables can use e.g. the ilike operator, but they do not support strings
composed from a-labels. Here's a pgsql concoction to match usernames,
optionally with subaddresses:

select id from addresses where localpart='%u' or localpart ilike '%u-%'

I cannot imagine any way to implement that if %u is in a-label form.

Arnt
Viktor Dukhovni
2014-06-05 15:36:19 UTC
Permalink
Post by Arnt Gulbrandsen
Post by Viktor Dukhovni
Domains passed to lookup tables and match lists need to be in
a-label form.
That would make pcre almost impossible and mysql and pgsql lookups rather
inconvenient.
What's the problem with the canonical representation of the domain exactly
as it appears on the wire in DNS, in certificate DNS altnames, ...
Post by Arnt Gulbrandsen
The a-label form of blåbærsyltetøy is xn--blbrsyltety-y8ao3x. Matching
the PCRE /.*syltetøy.*/ in a-label form would be inconvenient, perhaps
impossible.
Regular expressions on partial DNS labels are not that useful anyway.
Generally one just wants all the sub-domains of a particular domain.
Sometimes one wants to filter cable-modem/DSL PTR records, otherwise
I'm not losing sleep over partial DNS label regexps.
Post by Arnt Gulbrandsen
Postgres and Mysql have builtin support for UTF8 strings so mysql/pgsql
tables can use e.g. the ilike operator, but they do not support strings
composed from a-labels. Here's a pgsql concoction to match usernames,
Nothing lost when the domain name is a-label form. The localpart
remains unicode, and one still needs some sort of UTF-8 -> utf-8
lower-case operator that operates correctly on ASCII. Frankly
applying lowercase() to just the ASCII octets works fine in this
situation, provided the domain is in a-label form already. Unicode
email address localparts would be case-sensitive in their non-ASCII
octets, not the end of the world.
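
A sketch of the lookup key this proposal implies, in Python. table_key is a hypothetical helper, and the stdlib idna codec (IDNA 2003) is only a stand-in for the converter Postfix would actually use.

```python
def table_key(address: str) -> str:
    """Build a lookup key: ASCII-lowercased localpart, a-label domain."""
    localpart, _, domain = address.rpartition("@")
    # Fold only ASCII letters; non-ASCII characters stay case-sensitive.
    folded = "".join(c.lower() if c.isascii() else c for c in localpart)
    # Convert the domain to a-labels; lowercase any plain ASCII labels too.
    alabel = domain.encode("idna").decode("ascii").lower()
    return f"{folded}@{alabel}"


print(table_key("Info@Bücher.com"))
print(table_key("ÅSE@bücher.com"))   # Å is left alone, SE is folded
```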
--
Viktor.
Arnt Gulbrandsen
2014-06-05 15:59:30 UTC
Permalink
Post by Viktor Dukhovni
What's the problem with the canonical representation of the domain exactly
as it appears on the wire in DNS, in certificate DNS altnames, ...
What's the problem with the canonical representation exactly as it appears
on the wire in SMTP? What's the problem with using the same representation
for domains as for localparts?

Maybe SMTP and DNS should have used the same wire representation. It's too
late to change that now, though.

Arnt
Viktor Dukhovni
2014-06-05 16:19:03 UTC
Permalink
Post by Arnt Gulbrandsen
What's the problem with the canonical representation exactly as it appears
on the wire in SMTP? What's the problem with using the same representation
for domains as for localparts?
I can't read or write Chinese, Japanese, Korean, Tamil, ... but
need to be able to set policy for such domain names (e.g. white-list
them, ...). I can read/write a-labels.
Post by Arnt Gulbrandsen
Maybe SMTP and DNS should have used the same wire representation. It's too
late to change that now, though.
Indeed domain names in EAI SMTP (and in message headers) should
have been mandated to be a-labels with display conversion to UTF-8
left to MUAs. Since the EAI RFCs are in error, if Postfix implements
EAI, it needs to do its best to correct the errors.

DKIM signature domains need to be a-labels. When Postfix generates
header addresses, those should be a-labels. Obviously if header
addresses already contain unicode, and Postfix is not rewriting
the header, it should leave unicode domains alone, but otherwise
Postfix should never introduce them, and should pass only a-label
domain forms to table drivers (when passing addresses or domains
rather than free-form header text as with header_checks).
--
Viktor.
Arnt Gulbrandsen
2014-06-05 20:36:18 UTC
Permalink
Post by Viktor Dukhovni
I can't read or write Chinese, Japanese, Korean, Tamil, ... but
need to be able to set policy for such domain names (e.g. white-list
them, ...).
Yes. Now, if check_sender_access contains UTF8, then what you paste to
whitelist an address is:

उदाहरण@उदाहरण.in ACCEPT

whereas if the domain uses a-labels, you need to use

उदाहरण@xn--p1b6ci4b4b3a.in ACCEPT

The first line is readable to people who understand Hindi and has to be cut
and pasted by the rest of us. The second is halfway readable to people who
understand Hindi, and has to be cut and pasted by the rest of us.
Post by Viktor Dukhovni
I can read/write a-labels.
Personally I've never had much luck reading or typing things like
xn--p1b6ci4b4b3a, and if I'm going to cut and paste I might as well paste
the human-readable form. At least that's readable if I understand the
writing system.
Post by Viktor Dukhovni
Post by Arnt Gulbrandsen
Maybe SMTP and DNS should have used the same wire representation. It's too
late to change that now, though.
Indeed domain names in EAI SMTP (and in message headers) should
have been mandated to be a-labels with display conversion to UTF-8
left to MUAs.
If you don't mind, I'd rather not rehash that discussion. It was long and
tedious the first time, and I do not want to repeat it. I'll follow the RFC
as it finally ended up, not do what I thought was best. Perhaps e.g. 6532
section 3.2 contains the wrong decision about how to generate Received
fields, but the decision is clear and I really, really do not want to
repeat the discussion. Sorry.

Arnt
Viktor Dukhovni
2014-06-05 21:08:08 UTC
Permalink
Post by Arnt Gulbrandsen
Yes. Now, if check_sender_access contains UTF8, then what you paste to
whereas if the domain uses a-labels, you need to use
Much more likely the table lookup will be for just the domain, and
in scripts I can read I can tell whether two entries are different,
quickly find the one I am looking for, or match an entry in a log
file to an entry in the table...

Forcing Postfix access tables to contain tower-of-babel writing is
not an answer.
Post by Arnt Gulbrandsen
The first line is readable to people who understand Hindi and has to be cut
and pasted by the rest of us. The second is halfway readable to people who
understand Hindi, and has to be cut and pasted by the rest of us.
Post by Viktor Dukhovni
I can read/write a-labels.
Personally I've never had much luck reading or typing things like
xn--p1b6ci4b4b3a, and if I'm going to cut and paste I might as well paste
I've just typed one by looking at the line above:

xn--p1b6ci4b4b3a

no cut/paste, scout's honour. Not all operations on domains are
write-only. I have to be able to later read the table files,
process them with various tools, ...
Post by Arnt Gulbrandsen
Post by Viktor Dukhovni
Indeed domain names in EAI SMTP (and in message headers) should
have been mandated to be a-labels with display conversion to UTF-8
left to MUAs.
If you don't mind, I'd rather not rehash that discussion. It was long and
tedious the first time, and I do not want to repeat it. I'll follow the RFC
as it finally ended up, not do what I thought was best. Perhaps e.g. 6532
section 3.2 contains the wrong decision about how to generate Received
fields, but the decision is clear and I really, really do not want to repeat
the discussion. Sorry.
That's fine, I don't want to rehash it either, but Postfix interfaces
need to be usable by Postfix users. So Postfix will have to make up
for deficits in the RFCs.
--
Viktor.
Dāvis Mosāns
2014-06-05 21:36:21 UTC
Permalink
Post by Viktor Dukhovni
That's fine, I don't want to rehash it either, but Postfix interfaces
need to be usable by Postfix users. So Postfix will have to make up
for deficits in the RFCs.
Exactly, but you look at it from an English speaker's point of view. The
Latin alphabet is native for you, but others would rather use
their own alphabets. I know I would rather have names/domains in Unicode
for files/database rather than having to deal with Punycode. Basically
using Punycode forces everyone to use converters, because I can't convert
that in my head (to and back).

I guess the only way to please everyone would be to make it
configurable which format is stored, Unicode or Punycode.
Viktor Dukhovni
2014-06-05 22:32:18 UTC
Permalink
Post by Dāvis Mosāns
Post by Viktor Dukhovni
That's fine, I don't want to rehash it either, but Postfix interfaces
need to be usable by Postfix users. So Postfix will have to make up
for deficits in the RFCs.
Exactly, but you look at it from an English speaker's point of view. The
Latin alphabet is native for you, but others would rather use
their own alphabets. I know I would rather have names/domains in Unicode
for files/database rather than having to deal with Punycode. Basically
using Punycode forces everyone to use converters, because I can't convert
that in my head (to and back).
Not too many people in Russia read Hebrew (right to left) or can
even cut and paste it reliably into a left to right context.
Post by Dāvis Mosāns
I guess the only way to please everyone would be to make it
configurable which format is stored, Unicode or Punycode.
Tower of Babel is all fine and good user<->user and user<->MUA,
but it is a terrible interface for postmaster<->MTA. The wire
formats in EAI are in error, and I want as little to do with them
as possible, in particular I think that postmaster<->MTA must be
an a-label interface.
--
Viktor.
Wietse Venema
2014-06-05 23:55:37 UTC
Permalink
Post by Viktor Dukhovni
Not too many people in Russia read Hebrew (right to left) or can
even cut and paste it reliably into a left to right context.
Postfix is meant to be used by human operators anywhere on the
Internet. Therefore, the postqueue/postmap/etc. tools will have
to accept non-ASCII domain names from a human operator in either
UTF-8 form or xn--mumble form, and they will have to convert those
forms into their stored form. Those tools will also have to render
non-ASCII domain names in their stored form, or convert them into
UTF-8 or xn--mumble form on request by the human operator.

This way, human operators can manage domain names that are in the
operator's native script, but they can fall back to ASCII when the
domain is in some alien script.

So it does not matter what the stored form is (and in the case of
the mail queue, the stored form is controlled by the sender anyway).
What matters is that Postfix management tools allow humans to use
the stored form effectively. In other words, the tools must allow
the human operator to choose how to enter a non-ASCII domain name,
and how to render it.

See also my longer, previous, post in this thread.

Wietse
Arnt Gulbrandsen
2014-06-06 13:13:08 UTC
Permalink
Post by Wietse Venema
Postfix is meant to be used by human operators anywhere on the
Internet. Therefore, the postqueue/postmap/etc. tools will have
to accept non-ASCII domain names from a human operator in either
UTF-8 form or xn--mumble form, and they will have to convert those
forms into their stored form. Those tools will also have to render
non-ASCII domain names in their stored form, or convert them into
UTF-8 or xn--mumble form on request by the human operator.
Makes sense; patch coming. Will take a few days.

Arnt
Arnt Gulbrandsen
2014-06-11 12:28:47 UTC
Permalink
Post by Wietse Venema
Postfix is meant to be used by human operators anywhere on the
Internet. Therefore, the postqueue/postmap/etc. tools will have
to accept non-ASCII domain names from a human operator in either
UTF-8 form or xn--mumble form, and they will have to convert those
forms into their stored form. Those tools will also have to render
non-ASCII domain names in their stored form, or convert them into
UTF-8 or xn--mumble form on request by the human operator.
I looked at this now and would like to defer this work until the first
patch has been accepted and I can work against a new canonical source tree.
OK?

Arnt
Wietse Venema
2014-06-11 12:59:12 UTC
Permalink
Post by Arnt Gulbrandsen
Post by Wietse Venema
Postfix is meant to be used by human operators anywhere on the
Internet. Therefore, the postqueue/postmap/etc. tools will have
to accept non-ASCII domain names from a human operator in either
UTF-8 form or xn--mumble form, and they will have to convert those
forms into their stored form. Those tools will also have to render
non-ASCII domain names in their stored form, or convert them into
UTF-8 or xn--mumble form on request by the human operator.
I looked at this now and would like to defer this work until the first
patch has been accepted and I can work against a new canonical source tree.
OK?
In that case, don't bother writing code. Instead, share/discuss
your design decisions/recommendations. Does it make sense to fold
case with table lookups? How do we deal with table lookups when the
same domain can show up in different forms at different stages of
email handling (client or server name, mail from/rcpt to domain,
mail headers)? How would a Russian system admin effectively configure
and manage tables, logging and the queue with domains in Chinese or
Hebrew script?

I will be working bits and pieces of the patch into Postfix over
the remainder of 2014. This is invasive stuff and it needs to be
done right.

Wietse
Arnt Gulbrandsen
2014-06-11 14:27:46 UTC
Permalink
Post by Wietse Venema
In that case, don't bother writing code. Instead, share/discuss
your design decisions/recommendations. Does it make sense to fold
case with table lookups?
Of course. Email addresses are case insensitive in dozens of languages. It
was slack of me not to catch that mistake.
Post by Wietse Venema
How do we deal with table lookups when the
same domain can show up in different forms at different stages of
email handling (client or server name, mail from/rcpt to domain,
mail headers).
Autodetection is needed. Happily it is also possible.

My preferred approach:

Store UTF8 in the tables and use UTF8 in table lookups. I say this because
making pgsql_table work well with utf8 on the localparts and xn--mumble on
the domains is bothersome. It seems to me that the reasons for the bother
are general, not specific to Postgres. Unicode is very widely used.

Add two new files/functions in util, one to convert from utf8 to xn--mumble
and one to convert the other way. Refactor the code in smtp/smtp*.c to call
that (that refactoring is the main reason why I want to wait.)

Next, make many locations call the toutf8 function, so that postqueue, ETRN
etc. accept both formats on input. This autodetection only breaks if
someone has used "xn--" for some other purpose in an internal subdomain,
but that's a risk I am prepared to accept. The web browsers also
autodetect, and AFAICT it hasn't caused any problems.
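
The two conversion functions and the "xn--" autodetection described above can be sketched as follows. This is an illustration only, not the code in the patch: it assumes Python's built-in "idna" codec (which implements the older IDNA 2003 rules) as a stand-in for the proposed util functions, and the names to_utf8 and to_ascii are hypothetical.

```python
def to_utf8(domain: str) -> str:
    """Return the Unicode form of a domain given in either form."""
    labels = []
    for label in domain.split("."):
        # Autodetection: a label starting with "xn--" is treated as an
        # ASCII-compatible encoding and decoded; anything else is kept.
        if label.lower().startswith("xn--"):
            label = label.encode("ascii").decode("idna")
        labels.append(label)
    return ".".join(labels)


def to_ascii(domain: str) -> str:
    """Return the xn--mumble form; ASCII-only domains pass through."""
    return domain.encode("idna").decode("ascii")
```

A tool that accepts operator input would call to_utf8 on every domain argument, so that both spellings end up as the same lookup key.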

Finally, make somewhat fewer locations call one of the conversion
functions to generate the appropriate format for e.g. postqueue output and
the EHLO argument. (I still think using a unicode myhostname is a trouble
magnet. IIRC my patch disallows it, and I would at least warn against it on
startup.)

Once the right callers are there, the table lookups should just work, at
any stage.
Post by Wietse Venema
How would a Russian system admin effectively configure
and manage tables, logging and the queue with domains in Chinese or
Hebrew script?
This is several questions. Two or three, I think.

One answer is that if an ISP wants to sell service to Chinese, the staff
who talk to the Chinese realistically have to know Chinese. Having a
Russian monoglot answer support requests from Chinese will not be
effective. EAI just adds one more communication problem.

The fashion these days is to add self service. An ISP may employ a Russian
postmaster, but also Chinese sales staff and have web forms written in
polite Chinese. In that case, the core of the problem is to make the forms,
database and batch processes UTF8-clean.

The postmaster may end up with a support request written in a language he
does not understand. EAI means that the support request may include a
domain name the postmaster cannot understand too, which IMO is not a
significant extra problem.

The other part is: what about queues/mail/tables involving the domains of
strangers on the net? For example, you may have to add a separate queue for
a particular domain because mail to it disturbs others.

Since autodetecting on input is possible, I think that's how it has to be.
That will cater to postmaster preferences, to some degree.

I have no very strong opinion on the default format used for output. I know
my preference as user (look at LANG and use UTF8 if the locale uses it),
but I also know that as maintainer, I'd go with whatever causes less
support mail. Since you plan to use -DNO_EAI by default for one release
cycle you'll have enough time to decide.
Post by Wietse Venema
I will be working bits and pieces of the patch into Postfix over
the remainder of 2014. This is invasive stuff and it needs to be
done right.
Yes. In a sense that's why I wanted to defer the autoconfiguration/format
choice; that might well involve merge conflicts and that would be even
worse in this change than usually.

Arnt
Wietse Venema
2014-06-11 16:54:19 UTC
Post by Arnt Gulbrandsen
Post by Wietse Venema
How would a Russian system admin effectively configure
and manage tables, logging and the queue with domains in Chinese or
Hebrew script?
This is several questions. Two or three, I think.
One answer is that if an ISP wants to sell service to Chinese,
That is not the question. It's not about CUSTOMERS with an alien
script. It is about REMOTE SENDERS/RECEIVERS with domains in Chinese
script. How does the Russian administrator view/manage the mail
queue, and how does he/she set/view/use rules in smtpd_mumble_restrictions?

I think the Russian admin needs to choose how a command will render
domain names in access maps, address rewriting, mailq output. UTF8
would be best for domain names in Russian script, and ASCII would
be best for domains in Chinese script. That will involve one
command-line option for postmap, postqueue, etc.

This is regardless of the representation that will be used internally
in Postfix tables (UTF8 or xn--mumble). With the mail queue we have
no choice - the representation is chosen by the sender.
Post by Arnt Gulbrandsen
The other part is: what about queues/mail/tables involving the domains of
strangers on the net? For example, you may have to add a separate queue for
a particular domain because mail to it disturbs others.
Since autodetecting on input is possible, I think that's how it has to be.
That will cater to postmaster preferences, to some degree.
I have no very strong opinion on the default format used for output. I know
my preference as user (look at LANG and use UTF8 if the locale uses it),
but I also know that as maintainer, I'd go with whatever causes less
support mail. Since you plan to use -DNO_EAI by default for one release
cycle you'll have enough time to decide.
Post by Wietse Venema
I will be working bits and pieces of the patch into Postfix over
the remainder of 2014. This is invasive stuff and it needs to be
done right.
Yes. In a sense that's why I wanted to defer the autoconfiguration/format
choice; that might well involve merge conflicts and that would be even
worse in this change than usually.
EAI support will be added in small steps. It will definitely not be
complete and it will be disabled by default (due to the external
dependency, with #ifdefs, in addition to main.cf configuration).

Wietse
Arnt Gulbrandsen
2014-06-12 08:33:12 UTC
Post by Wietse Venema
That is not the question. It's not about CUSTOMERS with an alien
script. It is about REMOTE SENDERS/RECEIVERS with domains in Chinese
script. How does the Russian administrator view/manage the mail
queue, and how does he/she set/view/use rules in smtpd_mumble_restrictions?
The last question, in other words.
Post by Wietse Venema
I think the Russian admin needs to choose how a command will render
domain names in access maps, address rewriting, mailq output. UTF8
would be best for domain names in Russian script, and ASCII would
be best for domains in Chinese script. That will involve one
command-line option for postmap, postqueue, etc.
Hm... doable, I think.

Unicode is organized as blocks. (Most blocks aren't quite full, and new
characters are added occasionally, e.g. the euro symbol to the block
Currency Symbols.) What you're suggesting is that if a string uses only the
user-specified unicode blocks, use UTF8, otherwise, xn--mumble. For Russian
that would be Basic Latin, Basic Cyrillic and Cyrillic Supplement. For
German it would be Basic Latin, Latin-1 Supplement, Latin Extended
Additional and Currency Symbols.

I suppose the only way is to define aliases. "Cyrillic" for Basic Latin and
the various cyrillic blocks, etc. Perhaps 50-100 aliases in all. Doable,
and the utility function to check whether a string matches a supplied alias
is easy.
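
For illustration, the utility function mentioned above could look roughly like this. It is a sketch under assumptions: only three blocks and one alias are defined, the alias name "cyrillic" and the function name matches_alias are hypothetical, and the block ranges are the standard Unicode ranges for Basic Latin, Cyrillic and Cyrillic Supplement.

```python
# Unicode block ranges (inclusive code points) for the blocks used here.
BLOCKS = {
    "basic-latin": (0x0000, 0x007F),
    "cyrillic": (0x0400, 0x04FF),
    "cyrillic-supplement": (0x0500, 0x052F),
}

# Hypothetical alias table: one alias names several blocks.
ALIASES = {
    "cyrillic": ("basic-latin", "cyrillic", "cyrillic-supplement"),
}


def matches_alias(text: str, alias: str) -> bool:
    """True if every character of text lies in a block covered by alias."""
    ranges = [BLOCKS[name] for name in ALIASES[alias]]
    return all(
        any(low <= ord(ch) <= high for low, high in ranges)
        for ch in text
    )
```

A domain for which matches_alias returns True would be rendered as UTF8, and any other domain as xn--mumble.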

I don't see the point, though.

Arnt
Wietse Venema
2014-06-12 11:07:32 UTC
Post by Arnt Gulbrandsen
Post by Wietse Venema
That is not the question. It's not about CUSTOMERS with an alien
script. It is about REMOTE SENDERS/RECEIVERS with domains in Chinese
script. How does the Russian administrator view/manage the mail
queue, and how does he/she set/view/use rules in smtpd_mumble_restrictions?
The last question, in other words.
Post by Wietse Venema
I think the Russian admin needs to choose how a command will render
domain names in access maps, address rewriting, mailq output. UTF8
would be best for domain names in Russian script, and ASCII would
be best for domains in Chinese script. That will involve one
command-line option for postmap, postqueue, etc.
Hm... doable, I think.
I am suggesting a BINARY switch.

- Render all names in UTF8.

- Render all names in ASCII (xn--mumble).

Don't try to figure out which UTF range is native.
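
A rough sketch of such a binary switch, not Postfix code. It assumes Python's built-in "idna" codec (IDNA 2003) for the conversions; render_domain and its ascii_only parameter are hypothetical names standing in for whatever command-line option the tools would get.

```python
def render_domain(domain: str, ascii_only: bool) -> str:
    """Render a domain in exactly one of two forms, whatever form it came in."""
    # Normalize to the ASCII (xn--mumble) form first; ASCII labels,
    # including ones that are already xn-- encoded, pass through unchanged.
    ace = domain.encode("idna")
    if ascii_only:
        return ace.decode("ascii")
    # Otherwise decode every xn-- label back to Unicode.
    return ace.decode("idna")
```

The same call works for queue listings and table dumps, because the input form does not matter: render_domain("xn--mnchen-3ya.de", ascii_only=False) and render_domain("münchen.de", ascii_only=False) give the same result.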

Wietse
Arnt Gulbrandsen
2014-06-26 11:31:53 UTC
Post by Wietse Venema
I am suggesting a BINARY switch.
I'm happy to hear that, because that's what I think makes most sense, too.

Arnt

Wietse Venema
2014-06-05 22:02:24 UTC
By now it will be clear to everyone that SMTPUTF8 involves more
than changes in the syntax of SMTP commands and bounce message
attributes. That is not the most difficult part. The most difficult
part is how humans will manage Postfix.

Pretty much all Postfix lookup table interfaces will be affected
in some way or another. The same domain name can be in UTF-8 or
xn--mumble form depending on whether it is the client hostname, the
EHLO command parameter, or whether it appears in an envelope email
address. Logfile analysis will be affected, too.

Multiple forms for the same domain name complicate logfile analysis
and lookup table management. No-one wants to specify multiple forms
of the same domain name in an access table, policy table, or address
rewriting/routing table. Tables should use one form if possible.

Regardless of what form Postfix lookup tables and logfiles use
internally, I expect that many Postfix tools will need an option
to accept or display domain names as UTF-8 or xn--mumble just so
that human operators can effectively manage Postfix lookup tables,
mailq output, logging, and so on, with domain names in scripts other
than Latin.

Considering the complexity of the human interface aspects, I expect
that SMTPUTF8 will be "experimental" for more than one development
cycle, because it may take several incompatible changes to get
things right. Thus, I stick with my original estimate that SMTPUTF8
will take a few years to implement.

Wietse