Code for checking validity of a DOS filename



mbbrutman
April 12th, 2009, 03:32 PM
I needed a function to tell me if a filename would be valid for DOS. It took a little more effort to write than I thought it would. Does anybody know of a better way to do this?


#include <stdio.h>
#include <string.h>

// I use char for return codes because only a single byte integer is needed
// and that is more efficient on 8088 machines. If you see char as a return
// code just think of a single byte integer.

char DosChars2[] = "!@#$%^&()-_{}`'~";

char isValidDosChar( char c ) {

  // Compare as unsigned: where plain char is signed, (c > 127) is never true
  unsigned char uc = (unsigned char)c;

  // A-Z, a-z, 0-9 and any char above 127 are legal
  if ( (uc >='A') && (uc<='Z') || (uc >='a') && (uc<='z') ||
       (uc >='0') && (uc<='9') || (uc > 127)
     ) return 1;

  // Is it one of the acceptable punctuation chars?
  for ( char i=0; i<16; i++ ) {
    if ( c == DosChars2[i] ) return 1;
  }

  return 0;
}


char isValidDosFilename( const char *filename ) {

  // Protect against incoming NULL strings
  if (filename==NULL) return 0;

  // Use int for the length: a char can overflow on a long input string
  int len = strlen(filename);

  // Reject zero length strings
  if ( len == 0 ) return 0;

  // Check the first character - it has to be valid and it can't be a '.'
  // because a filename is not optional (but extensions are).
  if ( !isValidDosChar( filename[0] ) ) return 0;

  // Check the remainder of the filename part; stop when a '.' is found
  int i;
  for ( i=1; (i<8) && (i<len) ; i++ ) {
    if ( filename[i] == '.' ) break;
    if ( !isValidDosChar( filename[i] ) ) return 0;
  }

  // End of string - Ok, there is no extension
  if ( i == len ) return 1;

  // We stopped at a '.' or reached the 9th char - either way this better be a '.'
  if ( filename[i] != '.' ) return 0;

  i++;

  // j is declared outside the loop so it is still in scope for the length check
  int j;
  for ( j=0; (j+i) < len; j++ ) {
    if ( !isValidDosChar( filename[j+i] ) ) return 0;
  }

  // Is the extension too long?
  if ( j > 3 ) return 0;

  return 1;
}

int main( int argc, char *argv[] ) {

  // With no argument argv[1] is NULL, which isValidDosFilename rejects
  printf( "%s\n", (isValidDosFilename(argv[1])) ? "Yes" : "No" );

  return 0;
}

Performance could be better with a SCASB instruction, but don't worry about that now ...

Just to review, the rules are:


Up to 8 characters for the filename
Up to 3 characters for an extension
Filenames are required - extensions are not
Only certain characters are allowed/safe
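
For comparison, here is a minimal sketch of the same four rules that splits the string at the first '.' and checks the two halves by length, which avoids most of the index bookkeeping. The names isValid83 and allLegal are hypothetical, not part of the code above, and the character set is copied from the post. It is written in plain C89 so it should be acceptable to an old compiler, but it has not been tried on Turbo C++ 3.0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if each of the n bytes at s is a legal
   filename character under the rules listed above. */
static int allLegal( const char *s, size_t n ) {
    static const char punct[] = "!@#$%^&()-_{}`'~";
    size_t i;
    for ( i = 0; i < n; i++ ) {
        unsigned char c = (unsigned char)s[i];
        if ( c > 127 ) continue;
        if ( (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
             (c >= '0' && c <= '9') ) continue;
        if ( c != 0 && strchr( punct, c ) != NULL ) continue;
        return 0;
    }
    return 1;
}

/* Hypothetical name; intended to accept and reject the same strings as
   isValidDosFilename in the post. */
int isValid83( const char *filename ) {
    const char *dot;
    size_t nameLen, extLen;

    if ( filename == NULL ) return 0;

    /* Split at the first '.'; a second '.' ends up in the extension,
       where it is rejected as an illegal character. */
    dot = strchr( filename, '.' );
    nameLen = dot ? (size_t)(dot - filename) : strlen( filename );
    extLen  = dot ? strlen( dot + 1 ) : 0;

    if ( nameLen < 1 || nameLen > 8 ) return 0;   /* filename is required */
    if ( extLen > 3 ) return 0;                   /* extension is optional */

    return allLegal( filename, nameLen ) &&
           ( dot == NULL || allLegal( dot + 1, extLen ) );
}
```

Like the original, this accepts a trailing dot with an empty extension ("FOO."); add a check on extLen if that should be rejected.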



Mike

sqpat
April 12th, 2009, 04:14 PM
I would just do the "isValidChar" thing you did, but with a big long switch statement, which ought to get compiled into a kind of quick hash/pointer lookup. It'll look like a lot of code, but you'll have a fast, constant-time check. That is, replace your function with something like:



char isValidDosChar( char c ) {
switch ( (unsigned char)c ) { // cast so that cases 128-255 can match where plain char is signed
case 33:
case 35:
case 36:
case 37:
case 38:
case 39:
case 40:
case 41:
case 45:
case 48:
case 49:
case 50:
case 51:
case 52:
case 53:
case 54:
case 55:
case 56:
case 57:
case 64:
case 65:
case 66:
case 67:
case 68:
case 69:
case 70:
case 71:
case 72:
case 73:
case 74:
case 75:
case 76:
case 77:
case 78:
case 79:
case 80:
case 81:
case 82:
case 83:
case 84:
case 85:
case 86:
case 87:
case 88:
case 89:
case 90:
case 94:
case 95:
case 96:
case 97:
case 98:
case 99:
case 100:
case 101:
case 102:
case 103:
case 104:
case 105:
case 106:
case 107:
case 108:
case 109:
case 110:
case 111:
case 112:
case 113:
case 114:
case 115:
case 116:
case 117:
case 118:
case 119:
case 120:
case 121:
case 122:
case 123:
case 125:
case 126:
case 128:
case 129:
case 130:
case 131:
case 132:
case 133:
case 134:
case 135:
case 136:
case 137:
case 138:
case 139:
case 140:
case 141:
case 142:
case 143:
case 144:
case 145:
case 146:
case 147:
case 148:
case 149:
case 150:
case 151:
case 152:
case 153:
case 154:
case 155:
case 156:
case 157:
case 158:
case 159:
case 160:
case 161:
case 162:
case 163:
case 164:
case 165:
case 166:
case 167:
case 168:
case 169:
case 170:
case 171:
case 172:
case 173:
case 174:
case 175:
case 176:
case 177:
case 178:
case 179:
case 180:
case 181:
case 182:
case 183:
case 184:
case 185:
case 186:
case 187:
case 188:
case 189:
case 190:
case 191:
case 192:
case 193:
case 194:
case 195:
case 196:
case 197:
case 198:
case 199:
case 200:
case 201:
case 202:
case 203:
case 204:
case 205:
case 206:
case 207:
case 208:
case 209:
case 210:
case 211:
case 212:
case 213:
case 214:
case 215:
case 216:
case 217:
case 218:
case 219:
case 220:
case 221:
case 222:
case 223:
case 224:
case 225:
case 226:
case 227:
case 228:
case 229:
case 230:
case 231:
case 232:
case 233:
case 234:
case 235:
case 236:
case 237:
case 238:
case 239:
case 240:
case 241:
case 242:
case 243:
case 244:
case 245:
case 246:
case 247:
case 248:
case 249:
case 250:
case 251:
case 252:
case 253:
case 254:
case 255:
return 1;
default:
return 0;
}
}


I'm not positive that I caught every case there so you might want to test it, but I think that's right, anyway.

mbbrutman
April 12th, 2009, 05:08 PM
I'm not too worried about the performance of the routine. When I get that far I'll use the SCASB instruction to find the current char in a target string of legal characters. That will be far faster.

I think the problem with your approach is that it will result in a tremendous amount of code. And if the compiler doesn't do something like a computed goto, it's not going to run in constant time. We're talking about Turbo C++ 3.0 on a DOS PC, not the latest and greatest GNU gcc.

Chuck(G)
April 12th, 2009, 05:40 PM
SCASB? You mean strchr?

if ( strchr( "...list of permitted characters....", c))

Instead of checking explicitly for characters (A...Z and 0...9), use isalnum()--it's table-driven and faster.
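
A minimal sketch of what that suggestion could look like, assuming the default "C" locale and the punctuation list from the first post. One detail worth guarding against: strchr also matches the terminating NUL of the search string, so a zero byte has to be rejected explicitly.

```c
#include <ctype.h>
#include <string.h>

/* Sketch only: same character rules as the first post, expressed with
   isalnum() and strchr(). */
int isValidDosChar( char c ) {
    unsigned char uc = (unsigned char)c;

    if ( uc == 0 ) return 0;        /* strchr would match the terminator */
    if ( uc > 127 ) return 1;       /* high-bit characters are accepted */
    if ( isalnum( uc ) ) return 1;  /* A-Z, a-z, 0-9 in the "C" locale */

    return strchr( "!@#$%^&()-_{}`'~", uc ) != NULL;
}
```

The high-bit test comes before isalnum() so the result does not depend on how the library classifies codes above 127.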

One thing few programs address is the issue of "special" names, such as CLOCK$, CON, AUX, PRN... (Note that CLOCK$.WOW functions exactly the same as CLOCK$).
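
As a rough sketch of such a check: the function below compares the part before the first '.' against a short list of the well-known device names, ignoring case and extension. The function name and the exact list are assumptions for illustration; names registered by installable device drivers are not covered.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the name part of filename (everything
   before the first '.') is one of the well-known DOS device names. */
int isReservedDosName( const char *filename ) {
    static const char *reserved[] = {
        "CON", "PRN", "AUX", "NUL", "CLOCK$",
        "COM1", "COM2", "COM3", "COM4",
        "LPT1", "LPT2", "LPT3"
    };
    char base[9];
    size_t i, n = 0;

    if ( filename == NULL ) return 0;

    /* Copy the name part, folded to uppercase. */
    while ( filename[n] != '\0' && filename[n] != '.' ) {
        if ( n == 8 ) return 0;   /* longer than any 8.3 name part */
        base[n] = (char)toupper( (unsigned char)filename[n] );
        n++;
    }
    base[n] = '\0';

    for ( i = 0; i < sizeof reserved / sizeof reserved[0]; i++ ) {
        if ( strcmp( base, reserved[i] ) == 0 ) return 1;
    }
    return 0;
}
```

Because the extension is dropped before the comparison, CLOCK$.WOW is flagged the same way as CLOCK$.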

Do you want to restrict the filenames to 8.3 if there is longname support enabled?

How about network share names?

Just some things to think about...

mbbrutman
April 12th, 2009, 05:56 PM
I really meant SCASB - it's an x86 instruction. If strchr is lucky it is implemented with SCASB.

I thought about the device names - the problem with them is that besides the well known ones, any device driver might have a name too. I'm willing to protect people against some mistakes, but that's beyond where I want to go.

I am going to restrict filenames to 8.3. In any DOS that I am targeting (6.3 and below at least) there are no long filenames. There are long filenames in the command lines for the Windows variants, but my FTP code probably isn't running there because it needs a packet driver. (I should have noted that this is for the FTP client - for general use long filename support might be desirable.)

Mike

sqpat
April 12th, 2009, 06:16 PM
mbbrutman wrote: "I think the problem with your approach is that it will result in a tremendous amount of code. And if the compiler doesn't do something like a computed goto, it's not going to run in constant time. We're talking about Turbo C++ 3.0 on a DOS PC, not the latest and greatest GNU gcc."

Um... the reason switch statements exist is because they are efficient, using a different compilation technique than your typical if/else routine. That one also really ought to compile into a function that is less than 400 bytes or so.

If space is such a concern, you could write it in assembly and do a similar routine in only about 50 bytes, since a single-bit lookup for each of the 256 characters needs just a 32-byte table... and a lookup table is a really simple thing to implement.

Let's see... if it must be C/C++ then perhaps something along these lines instead.




bool legalChar[256] = {
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, true, false, true, true, true, true, true, true, true, false, false, false, true, false, false,
true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, false, false, false, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, false, true, true, false,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true};



Keep in mind my C/C++ syntax is a little rusty. Is that kind of an array declaration fine?

Anyway, now simply replace isValidDosChar(c) with legalChar[(unsigned char)c]. The cast matters where plain char is signed, because codes above 127 would otherwise turn into negative indexes.

Same idea as the switch statement. In effect, a constant time lookup table at the processor level... just doing some optimizing in space usage.

mbbrutman
April 12th, 2009, 06:37 PM
I'll have to try the switch statement on my compiler to see what it does, but I'm not expecting a lot.

The routine is going to be used in my FTP client. The runtime performance of this isn't terribly important, as the time spent transferring the files will be far longer than the time spent checking to see if a filename is safe. I was more concerned with how ugly the main part of the code was as opposed to the character lookup.

I thought about using SCASB, but now that you bring up the bitmap that isn't a bad idea. The bitmap would consume 32 bytes of storage. It's not easy to generate the bitmap with this compiler, but I don't expect it to change so that's not an issue.
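
One way around generating the bitmap by hand is to fill it once at startup from the same rules. A minimal sketch, assuming the character set from the first post; legalBits, buildLegalBits and isLegalBit are hypothetical names:

```c
#include <string.h>

static unsigned char legalBits[32];   /* 256 bits, one per character code */

/* Hypothetical helper: build the bitmap once, before the first lookup. */
void buildLegalBits( void ) {
    static const char punct[] = "!@#$%^&()-_{}`'~";
    unsigned int c;
    const char *p;

    memset( legalBits, 0, sizeof legalBits );

    /* Letters, digits and everything above 127. */
    for ( c = 0; c < 256; c++ ) {
        if ( (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
             (c >= '0' && c <= '9') || c > 127 ) {
            legalBits[c >> 3] |= (unsigned char)(1u << (c & 7));
        }
    }

    /* The acceptable punctuation characters. */
    for ( p = punct; *p != '\0'; p++ ) {
        unsigned int u = (unsigned char)*p;
        legalBits[u >> 3] |= (unsigned char)(1u << (u & 7));
    }
}

/* Constant-time test: byte index is c / 8, bit index is c % 8. */
int isLegalBit( char ch ) {
    unsigned int c = (unsigned char)ch;
    return ( legalBits[c >> 3] >> (c & 7) ) & 1;
}
```

This trades a few dozen bytes of one-time setup code for not having to maintain 32 magic constants in the source.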

Chuck(G)
April 12th, 2009, 06:47 PM
mbbrutman wrote: "I really meant SCASB - it's an x86 instruction. If strchr is lucky it is implemented with SCASB."

Um, Mike, I know that. I also know that at least Microsoft C uses SCAS for its strchr() function. I'd be surprised if other Cs didn't.

There are two questions to ask about the code, however. Does a task that scans the characters of a 12-character name need the dickens optimized out of it? Is it worth making the next guy wonder what the heck you were doing when you wrote that? Or are you trying to gild the lily? (I can't answer that for you.) But I'll venture that the time spent in filename validity checking is insignificant.

If you need pure speed, then the method that uses a 256-byte lookup table is the way to go. You could even use an XLAT instruction if you wanted to inline (although that's only on the lower x86 family; indexed addressing is faster than XLAT on the upper x86).

I've been guilty of writing a lot of inline assembly in my code, but I try to keep it localized and meaningful (e.g. a 32-bit CRC routine that checks for 32-bit register availability, or an MD5 hash that works over gigabytes of data).

I don't use gcc, so I can't comment on its efficiency.

mbbrutman
April 12th, 2009, 08:15 PM
Not trying to gild the lily here... I'm just really surprised at how ugly the code turned out, and I'm wondering if I missed anything. Like a C library routine or a DOS function call that does this already. Or a more clever algorithm than what I came up with, which works but is pretty simplistic.

Chuck(G)
April 12th, 2009, 09:15 PM
No matter how you write it, it's ugly. No library functions in MS C to do the job and not even any DOS functions--though INT 21H Function 60H comes pretty close.

Another question to ask is whether you want to perform automatic truncation of either the name or the extension. Sometimes, this comes in handy on systems that support long filenames. So, FOOF.HTML is still accessible as FOOF.HTM.

On the other hand, you're doing something that's not being explicitly requested by the user.
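
If truncation is wanted, a sketch along these lines would do it. The name truncateTo83 is hypothetical, and splitting at the last '.' (rather than the first) is an assumption of this example. It only shortens the two parts; the result should still be passed through the validity check for illegal characters.

```c
#include <string.h>

/* Hypothetical helper: copy src into dst as an 8.3 name, cutting the name
   part to 8 characters and the extension to 3. dst must hold at least
   13 bytes (8 + '.' + 3 + terminator). */
void truncateTo83( const char *src, char *dst ) {
    const char *dot = strrchr( src, '.' );
    size_t nameLen = dot ? (size_t)(dot - src) : strlen( src );
    size_t extLen  = dot ? strlen( dot + 1 ) : 0;

    if ( nameLen > 8 ) nameLen = 8;
    if ( extLen > 3 ) extLen = 3;

    memcpy( dst, src, nameLen );
    dst += nameLen;

    /* Only emit the dot when there is an extension to follow it. */
    if ( dot != NULL && extLen > 0 ) {
        *dst++ = '.';
        memcpy( dst, dot + 1, extLen );
        dst += extLen;
    }
    *dst = '\0';
}
```

With this, FOOF.HTML becomes FOOF.HTM. Telling the user about the change addresses the concern that nobody asked for it.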

snq
April 13th, 2009, 04:49 PM
This is what WINE uses.. Contains a lot of crap you don't need but I figured you would be interested to see how they are doing it anyway.
Otherwise you could disassemble ntdll.dll and check the asm source to see how MS does it :)


/******************************************************************
 * RtlIsNameLegalDOS8Dot3 (NTDLL.@)
 *
 * Returns TRUE iff unicode is a valid DOS (8+3) name.
 * If the name is valid, oem gets filled with the corresponding OEM string
 * spaces is set to TRUE if unicode contains spaces
 */
BOOLEAN WINAPI RtlIsNameLegalDOS8Dot3( const UNICODE_STRING *unicode,
                                       OEM_STRING *oem, BOOLEAN *spaces )
{
    static const char illegal[] = "*?<>|\"+=,;[]:/\\\345";
    int dot = -1;
    int i;
    char buffer[12];
    OEM_STRING oem_str;
    BOOLEAN got_space = FALSE;

    if (!oem)
    {
        oem_str.Length = sizeof(buffer);
        oem_str.MaximumLength = sizeof(buffer);
        oem_str.Buffer = buffer;
        oem = &oem_str;
    }
    if (RtlUpcaseUnicodeStringToCountedOemString( oem, unicode, FALSE ) != STATUS_SUCCESS)
        return FALSE;

    if (oem->Length > 12) return FALSE;

    /* a starting . is invalid, except for . and .. */
    if (oem->Buffer[0] == '.')
    {
        if (oem->Length != 1 && (oem->Length != 2 || oem->Buffer[1] != '.')) return FALSE;
        if (spaces) *spaces = FALSE;
        return TRUE;
    }

    for (i = 0; i < oem->Length; i++)
    {
        switch (oem->Buffer[i])
        {
        case ' ':
            /* leading/trailing spaces not allowed */
            if (!i || i == oem->Length-1 || oem->Buffer[i+1] == '.') return FALSE;
            got_space = TRUE;
            break;
        case '.':
            if (dot != -1) return FALSE;
            dot = i;
            break;
        default:
            if (strchr(illegal, oem->Buffer[i])) return FALSE;
            break;
        }
    }
    /* check file part is shorter than 8, extension shorter than 3
     * dot cannot be last in string
     */
    if (dot == -1)
    {
        if (oem->Length > 8) return FALSE;
    }
    else
    {
        if (dot > 8 || (oem->Length - dot > 4) || dot == oem->Length - 1) return FALSE;
    }
    if (spaces) *spaces = got_space;
    return TRUE;
}

Chuck(G)
April 13th, 2009, 08:45 PM
Well, if you're using DOS 3 or better and all you want to do is check for illegal characters, the following code should work:



mov ax,6000h
lea si,WhatToCheck ; ds:si -> name to check
lea di,FullBuffer ; where to put full name
int 21h ; DOS TRUENAME function
jb ItStinks ; if DOS rejects it

; The fully-qualified filename is in the buffer pointed to
; by ES:DI


(FWIW, this code works in every version of DOS 3.0 and later and in all versions of Windoze).

The function corresponds to the internal DOS command (undocumented) TRUENAME.

There are three gotchas. The first is that overly long names or extensions are truncated without comment. The second is that the backslash is permitted (assumed to be a path separator). The third is that the returned name is folded to uppercase.
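
The first two gotchas can be caught after the call by comparing what came back with what was asked for. A minimal sketch in portable C, assuming the buffer from function 60h has already been filled in; survivedTruename is a hypothetical name, and the case folding here uses plain toupper, so characters above 127 that DOS folds through its country tables may be reported as a mismatch.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: 'requested' is the bare filename the user typed and
   'canonical' is the fully-qualified name returned by INT 21h function 60h.
   Returns 1 if the last path component of 'canonical' equals 'requested'
   ignoring case, i.e. nothing was truncated and no path separator was
   hiding in the input. */
int survivedTruename( const char *requested, const char *canonical ) {
    const char *last = strrchr( canonical, '\\' );
    last = last ? last + 1 : canonical;

    while ( *requested != '\0' && *last != '\0' ) {
        if ( toupper( (unsigned char)*requested ) !=
             toupper( (unsigned char)*last ) ) return 0;
        requested++;
        last++;
    }

    /* Both strings must end together. */
    return *requested == '\0' && *last == '\0';
}
```

A name such as foof.html would come back as FOOF.HTM and fail the comparison, so the silent truncation no longer goes unnoticed.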

snq
April 14th, 2009, 04:24 AM
Here's a cleaned up version of the function I pasted 2 posts up.
I'm not sure what this "\345" is supposed to be, so I added \x7F instead ;)
This function allows spaces though, so you might want to just return 0 in case of a space if you don't want them.

char isValidDosFilename(const char *filename)
{
    static const char illegal[] = "*?<>|\"+=,;[]:/\\\x7F";
    int dot = -1;
    int len;

    // strlen on a NULL pointer is undefined, so reject it up front
    if(filename==NULL)
        return 0;

    len = strlen(filename);

    if(len==0 || len>12)
        return 0;

    // starting . is invalid, except for "." and ".."
    if(filename[0]=='.')
    {
        if(len!=1 && (len!=2 || filename[1]!='.'))
            return 0;
        // return here as the WINE original does; otherwise the
        // checks below would reject "." and ".." after all
        return 1;
    }

    for(int i=0; i<len; i++)
    {
        char c = filename[i];
        switch(c)
        {
            // Leading/trailing spaces not allowed
            case ' ':
                if(i==0 || i==len-1 || filename[i+1]=='.')
                    return 0;
                break;
            // only one dot allowed
            case '.':
                if(dot!=-1)
                    return 0;
                dot = i;
                break;
            // check for illegal chars
            default:
                if(strchr(illegal, c))
                    return 0;
                break;
        }
    }

    if(dot==-1)
    {
        // if no dot, max length is 8
        if(len>8)
            return 0;
    }
    else
    {
        // otherwise check file part <=8, extension<=3, does not end with dot
        if(dot>8 || len-dot>4 || dot==len-1)
            return 0;
    }

    return 1;
}

JohnElliott
April 14th, 2009, 08:05 AM
snq wrote: "I'm not sure what this "\345" is supposed to be, so I added \x7F instead ;)"


Octal 345 = Hex E5. Which, in the first byte of an on-disk filename, indicates 'file deleted'.

Chuck(G)
April 14th, 2009, 09:13 AM
JohnElliott wrote: "Octal 345 = Hex E5. Which, in the first byte of an on-disk filename, indicates 'file deleted'."

Yes, but it's a legal filename character; translated by DOS to 05 in the directory. Try it yourself--create a file name of E5's and issue a DOS function 3C on it. Do dir--you'll see a file name of lowercase sigmas. Look in the on-disk directory and they'll be hex 05 E5 E5's. This was an old trick to make it difficult for a user to access the file from the command line, as COMMAND.COM seems to check for E5 and call foul. But DOS doesn't care.

If you look at the MS-DOS 6.0 source, you can see this code in MACRO2.ASM

mbbrutman
April 14th, 2009, 07:42 PM
The 0xE5 value is fascinating - I forgot about that! And it looks like somebody figured out it is a bad thing to do, so if a user actually wants to start a filename with 0xE5 a 0x05 is put in its place. Only DOS can put a real 0xE5 byte in the first character of a filename.
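
The substitution described above amounts to a one-byte translation in each direction. The helpers below are hypothetical and only illustrate the rule; they are not taken from the DOS source.

```c
/* Byte to store in the first position of an on-disk directory entry:
   a leading 0xE5 in the filename is written as 0x05. */
unsigned char toDirFirstByte( unsigned char c ) {
    return ( c == 0xE5 ) ? 0x05 : c;
}

/* Byte to show when reading an in-use entry back: 0x05 stands for 0xE5. */
unsigned char fromDirFirstByte( unsigned char c ) {
    return ( c == 0x05 ) ? 0xE5 : c;
}

/* A raw 0xE5 in the first byte marks the entry as deleted. */
int isDeletedEntry( const unsigned char *entryName ) {
    return entryName[0] == 0xE5;
}
```

Only the first byte is special; 0xE5 anywhere else in the name is stored unchanged, which matches the 05 E5 E5 pattern described for a name made of E5's.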

Krille
May 5th, 2016, 04:33 AM
mbbrutman wrote: "I'm just really surprised at how ugly the code turned out"

Chuck(G) wrote: "No matter how you write it, it's ugly."

Unless you do it in QuickBASIC. :D

This is something I wrote a long time ago:


FUNCTION InvalidDOSFileName (FileName$)
    X = LEN(FileName$)
    IF X = 0 OR X > 12 THEN
        FileNameInvalid = -1
    ELSE
        FOR X = 1 TO X
            SELECT CASE ASC(MID$(FileName$, X, 1))
                CASE 46: IF DotFound THEN FileNameInvalid = -1: EXIT FOR ELSE DotFound = X
                'CASE 33, 35 TO 41, 45, 48 TO 57, 64 TO 90, 94 TO 123, 125 TO 126, 128 TO 255 ' Valid chars
                CASE 0 TO 32, 34, 42 TO 44, 47, 58 TO 63, 91 TO 93, 124, 127 ' Invalid chars
                    FileNameInvalid = -1: EXIT FOR
            END SELECT
        NEXT
        ' After a complete pass X is LEN(FileName$) + 1
        IF DotFound THEN
            ' Name part must be 1 to 8 chars and the extension at most 3
            IF DotFound = 1 OR DotFound > 9 OR X - 1 - DotFound > 3 THEN FileNameInvalid = -1
        ELSEIF X - 1 > 8 THEN
            FileNameInvalid = -1 ' No dot, so the whole name must fit in 8 chars
        END IF
    END IF
    InvalidDOSFileName = FileNameInvalid
END FUNCTION


It doesn't get prettier than this. ;)

mbbrutman
May 5th, 2016, 06:34 AM
Krille,

"Necroposting" ? This was a seven year old thread! :D

Krille
May 5th, 2016, 06:55 AM
What can I say, it's been a slow day and there are a lot of old but interesting threads in the Vintage Computer Programming section. :thumbsup: :)