Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure only ASCII character set used #12

Merged
merged 1 commit into from
Oct 28, 2016
Merged

Ensure only ASCII character set used #12

merged 1 commit into from
Oct 28, 2016

Conversation

davegill
Copy link
Contributor

No description provided.

ELSE
PRINT *,' '
PRINT *,'Troubles, with ',problem_line_count,' lines.'
PRINT *,'File uses only ISO-8859 character codes, outside the standard ASCII range of ',FIRST_VALID,' to ',LAST_VALID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "File uses only ISO-8859 character codes, outside the standard ASCII range of", perhaps something like "File uses character codes outside the standard ASCII range of"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mike,
The new print statement:
PRINT *,'File uses character codes outside the standard ASCII range of ',FIRST_VALID,' to ',LAST_VALID

EXIT big_read_loop
END IF

DO ind = 1 , MAX_LENGTH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \tab character is fairly prominent in our code, and has ASCII code 9 (which is outside of the 32-127 range). Should this loop include that exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coincidentally, the 9 files that I tested, none of them do have a tab character. I have now fixed the logic in the code to ignore all tab character (ASCII code #9).

! usage:
! a.out < file.F

PROGRAM finder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we name this something more self-descriptive? Like "nonasciifinder"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mike,
The source code now lists the main program as "non_ascii_finder"

line_count = 1
problem_line_count = 0

! Loop over eah line of the input file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"eah" -> "each"

Copy link
Contributor Author

@davegill davegill Oct 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mike,
"each" now has a "c", also added "ubiquitously" to get right back up on that spelling horse.

! Output: GA --- â(x)
! Purpose: Compute the gamma function Ahat(x)
! Input : x --- Argument of Ahat(x)
! ( x is not equal to 0,-1,-2,WHAT GOES HERE )
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravanah
What should be in that original u'u'u' string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ravan,
It was pointed out, maybe this in supposed to be "..."?
Dave

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this was probably the indication for etcetera,

@davegill davegill changed the title Insure only ASCII character set used Ensure only ASCII character set used Oct 11, 2016
@jamiebresch
Copy link
Contributor

Aren't you going to commit updated var/convertor/wave2grid_kma/pvchkdv.F as well?

! line number and column count (for subsequent editing).

! usage:
! a.out file.F
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding the following (from your commit message) in the program itself.

build the finder program: gfortran -ffree-form finder.F
a.out some-file-name.F

…excluded

TYPE: bug fix

KEYWORDS: ISO, ASCII, sed, byte

SOURCE: internal

DESCRIPTION OF CHANGES:
Authors of a few physics schemes likely used a "cut-and-paste" technique
for including references and for units.

The offending references used quite a few different characters for an
intended dash (minus sign).

The offending units all used a superscript numeral 2 to mean "squared",
as in W/m^2.  I changed some to m^2 and some to m2, as both are used
in the modified schemes.

There were a few other single modifications (an "a" with a carat hat, etc).

All of the changes were to commented lines.  The change are necessary to
allow the use of sed to process the source code.

Outside of the physics directory, a number of files also had characters
outside of the Fortran character set (32-127).  These were all in comments,
but are still being removed.

LIST OF MODIFIED FILES:

chem/module_cam_mam_newnuc.F
chem/module_gocart_dmsemis.F
chem/module_gocart_seasalt.F
chem/module_mozcart_wetscav.F
chem/module_sea_salt_emis.F

dyn_em/module_sfs_driver.F
dyn_em/module_sfs_nba.F

frame/module_cpl.F

hydro/Routing/module_gw_gw2d.F

phys/module_bl_mfshconvpbl.F
phys/module_gocart_seasalt.F
phys/module_ltng_cpmpr92z.F
phys/module_ltng_crmpr92.F
phys/module_ltng_iccg.F
phys/module_mp_nssl_2mom.F
phys/module_mp_wdm6.F
phys/module_sf_bem.F
phys/module_sf_bep.F
phys/module_sf_bep_bem.F

var/convertor/wave2grid_kma/pvchkdv.F (Thanks Jamie!)

TESTS CONDUCTED:
The sed program works on the modified files, and does not work on the
original files.
@davegill
Copy link
Contributor Author

@jamiebresch and @mkavulich
I have added the file that Jamie fixed up for the var directory (it was in Japanese). I added a short blurb at the beginning of the new main program on how to build it, I swapped the default behavior to work with the "find" command. I also swapped out the "WHAT IS THIS SUPPOSED TO BE" with the ellipsis "..." in one of the chem files.

Can you guys review again to see if my pull request may now proceed?

Thanks
Dave

@jamiebresch
Copy link
Contributor

@davegill We would like to have tools/find.F renamed to tools/non_ascii_finder.F

@mgduda
Copy link
Collaborator

mgduda commented Oct 17, 2016

@davegill Would it be hard to allow the verbosity level to be specified on the command-line, and to let the user give a list of files to be scanned as command-line arguments? I'm imagining something like this:

verbose = 0
if (command_argument_count() == 0) then
    write(0,*) 'Usage: non_ascii_finder [verbosity option] <filenames>'
    stop 1
end if
do iarg = 1, command_argument_count()
    call get_command_argument(iarg, argval)
    if (trim(argval) == '-v') then
        verbose = 1
    else if (trim(argval) == '-V') then
        verbose = 2
    else
        ! Assume this is a filename... do the normal scanning code
    end if
end do

@davegill
Copy link
Contributor Author

Michael,
Good suggestion. I implemented this, and here are the quick test results. I added a superscript 2 to the test program, so that it can be run on its own source code to detect a problem. I run the executable on other Fortran codes, and got the "nothing wrong here" behavior. I ran on a non existing file, and that was detected also.

a.out
Usage:
./a.out
where is -v when using this program with "find", and
is -V when processing a single file
where is a WRF Fortran source file
STOP 1

a.out -v
Usage:
./a.out
where is -v when using this program with "find", and
is -V when processing a single file
where is a WRF Fortran source file
STOP 1

a.out -v non_ascii_finder.F
non_ascii_finder.F

a.out -V non_ascii_finder.F
non_ascii_finder.F
Found something on line # 25
! --> this line has a problem with the superscript numeral 2: [W/m²]
Character # 69 is a ?, which is character code 194
non_ascii_finder.F
Found something on line # 25
! --> this line has a problem with the superscript numeral 2: [W/m²]
Character # 70 is a ?, which is character code 178
Troubles, with 2 lines.
File uses character codes outside the standard ASCII range of 32 to 127

a.out -VV fortran_2003_fflush_test.G
Hmmm, troubles trying to open fortran_2003_fflush_test.G for READ.
STOP 4

a.out -v fortran_2003_fflush_test.F

a.out -V fortran_2003_fflush_test.F

a.out -VV fortran_2003_fflush_test.F
fortran_2003_fflush_test.F, End of file after attempting to read line # 15
OK, File uses only ASCII character codes from 32 through 127

Dave

PRINT *,'where <verbose level> is -v when using this program with "find", and'
PRINT *,' <verbose level> is -V when processing a single file'
! PRINT *,' <verbose level> is -VV is for developers and debugging'
PRINT *,'where <filename> is a WRF Fortran source file'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davegill I suppose I may be nit-picking at this point, but why does the input have to be a WRF Fortran source file? Taking a more general view, this utility tells whether there are any characters outside the set of printable ASCII characters (or characters not acceptable to cpp, or whatever). Also, stating that -v is used when using the program with find doesn't really help anyone to understand what the effect of using -v actually is. Also, stating that -V is used when processing a single file can suggest that the program can process multiple files.

Generally, we could reconsider the printout produced by this program with a broader view of what the program could potentially be used for.

@mgduda
Copy link
Collaborator

mgduda commented Oct 18, 2016

@davegill It's purely academic at this point, but it might be interesting to try to detect UTF-8 multi-byte encodings. For example, in the module_sf_bep_bem.F file, the non_ascii_finder reports:

 ./WRF/phys/module_sf_bep_bem.F
 Found something on line #        4204
       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²] 
 Character #           68  is a ?, which is character code          194
 ./WRF/phys/module_sf_bep_bem.F
 Found something on line #        4204
       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²]  
 Character #           69  is a ?, which is character code          178

Really, though, the 68th and 69th characters together form a UTF-8 character; their binary encoding is 1100 0010 1011 0010, which from this table suggests the two-byte character U+00B2: https://en.wikipedia.org/wiki/UTF-8#Description .

This explains why the line

       real sfr(ndm,nz_um)      ! Sensible heat flux from roofs [W/m²]

correctly shows the superscript 2 as a single character, but the lines

 Character #           68  is a ?, which is character code          194
 Character #           69  is a ?, which is character code          178

can't show any character.

@jamiebresch
Copy link
Contributor

@mgduda There are tons of languages and encoding methods. For the source code purpose, I think ASCII-only is a good rule.

@mgduda
Copy link
Collaborator

mgduda commented Oct 18, 2016

@jamiebresch Agreed. To be clear, I was definitely not suggesting that we allow anything but printable ASCII characters in the source code (I think this may even be part of the Fortran standard); rather, I was only saying that, because UTF-8 can be used to encode more or less every language, and it is by far the largest encoding used on e.g., the web, that the checker program could be more clever and recognize multi-byte UTF-8 encodings for what they are, rather than printing two, three, or four error messages for the same multi-byte character.

My previous comment only came about because I noticed that some of the messages from the checker referenced two characters, when I could only find one in the source code (e.g., the superscript 2), and I started looking further and thought the UTF-8 encoding bit was pretty cool.

@cmattocks
Copy link

Below is a script version of a UTF-8 -> ASCII character converter.

Craig


#!/bin/sh

@(#) utf2ascii Convert UTF-8 to ASCII text

################################################

Convert file encoded in UTF-8 to ASCII text.

Usage: utf2ascii filename

################################################

###############################

Set trap to abort on signal

###############################
trap 'echo “\nOuch. Dude! Ya fragged me." 1>&2; exit' 1 2 3 15

#####################################

Process command-line argument(s):

#####################################
case $# in
0) # User forgot to enter a filename:
[ -t 0 ] && echo "USAGE: basename $0 filename" >&2 && exit 1 ;;
) # Process input file:
for file in $

do
if [ ! -r $file ]; then
echo "Cannot read input file "$file"" >&2 && exit 2
else
echo "Converting file "$file" ..."
\rm -f "$file.orig"
mv -f "$file" "$file.orig"
iconv -sc -f UTF8 -t ASCII//TRANSLIT "$file.orig" > "$file"
fi
done
esac

exit


On Oct 18, 2016, at 1:31 PM, Michael Duda [email protected] wrote:

@jamiebresch Agreed. To be clear, I was definitely not suggesting that we allow anything but printable ASCII characters in the source code (I think this may even be part of the Fortran standard); rather, I was only saying that, because UTF-8 can be used to encode more or less every language, and it is by far the largest encoding used on e.g., the web, that the checker program could be more clever and recognize multi-byte UTF-8 encodings for what they are, rather than printing two, three, or four error messages for the same multi-byte character.

My previous comment only came about because I noticed that some of the messages from the checker referenced two characters, when I could only find one in the source code (e.g., the superscript 2), and I started looking further and thought the UTF-8 encoding bit was pretty cool.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@davegill davegill merged commit 4529d9a into wrf-model:master Oct 28, 2016
@davegill davegill deleted the ASCII_CODES_ONLY branch March 27, 2017 16:25
twjuliano pushed a commit to twjuliano/WRF that referenced this pull request Jun 15, 2021
…rt_registry

minor inconsequential removal of extra quote on memetum preturbations…
janmandel added a commit to wrf-sfire/WRF-SFIRE that referenced this pull request Feb 5, 2022
twjuliano pushed a commit to twjuliano/WRF that referenced this pull request Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants