Geekpedia Programming Tutorials






Searching for a string in a File

Lets you find the first occurrence of a string within a file.

On Sunday, April 25th 2004 at 12:46 PM
By Sean Eshbaugh

I started working on a project to replace the often buggy and slow (and, in my opinion, just plain bad) Find Files/Folders function that comes with Windows (Windows key + 'F'). In Windows XP the OS's search utility seems to be severely lacking in functionality. In the previous versions of Windows I used, I didn't find too many problems with it.

The most important part of my project was searching for text within files, something the Find Files/Folders function claims it can do, but it never seems to return results even when I know there should be some. This led me to look for a nice way to search for text inside a file in much the same way strstr searches for a string inside a larger string. I did find a solution somewhere out there on the web, but for reasons I still can't figure out (the code was very messy) it would stop actually looking once you had searched through about 16MB worth of files in one run.
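For comparison, this is roughly the in-memory behavior being imitated: strstr returns a pointer to the first match (or a null pointer when there is none), and subtracting the start of the buffer turns that pointer into an offset. The helper name FirstOffset is made up for this illustration and is not part of the project.

```cpp
#include <stddef.h>
#include <string.h>

// Hypothetical helper: offset of the first occurrence of needle in haystack,
// or (size_t)-1 if it does not occur. Both must be null-terminated strings.
size_t FirstOffset(const char* haystack, const char* needle)
{
    const char* hit = strstr(haystack, needle);
    if (!hit)
    {
        return (size_t)-1;
    }
    // Pointer difference = number of characters before the match.
    return (size_t)(hit - haystack);
}
```

A file search along the same lines returns a byte offset from the start of the file instead of a pointer.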

Since I could not find anything out there that would let me do very large amounts of file searching, I had to write it myself. What I created is designed to be generally platform independent. Normally I do not write code this way, because 99.9% of the time I develop exclusively for Windows.

I'll be using plain old C-style file functions because I happen to like them better, for several reasons:

1. In my experience they're MUCH faster than C++ filestreams.
2. The compiled code tends to be MUCH smaller than code using C++ filestreams.
3. They're compatible with old C code.
4. The code looks nicer (to me at least).
5. The project I was working with was using them.
6. They're fun and you should learn to use them.

I'm also going to be using malloc() and free() instead of new and delete, for no real reason other than to make this code more C compliant, even though it is meant to be C++ code. And of course I'll be using C-style strings and C-style string functions. I always do this with code I plan on recycling, because if I ever put it in a DLL I can rest assured that programs written in another language will be able to use the function. A program written in VB won't be able to make use of a function inside a DLL that returns a std::string, but it can make use of a function that returns a pointer to a C-style string.
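As a rough sketch of that last point, here is what a C-style interface might look like. The function names GetGreeting and CopyGreeting are invented for this example; a real Windows DLL would additionally need an export declaration (such as __declspec(dllexport) or a .def file) and a calling convention the calling language expects, which are left out here.

```cpp
#include <string.h>

// Hypothetical exported function. extern "C" turns off C++ name mangling, so
// the symbol can be looked up by its plain name from other languages.
extern "C" const char* GetGreeting()
{
    // A string literal has static storage, so the pointer stays valid
    // after the function returns.
    return "hello from C";
}

// Hypothetical variant that fills a caller-supplied buffer instead. It copies
// at most ulBufferSize-1 characters, always null-terminates, and returns the
// number of characters copied.
extern "C" unsigned long CopyGreeting(char* lpszBuffer, unsigned long ulBufferSize)
{
    if (!lpszBuffer || ulBufferSize == 0)
    {
        return 0;
    }

    const char* lpszText = GetGreeting();
    unsigned long ulLength = (unsigned long)strlen(lpszText);

    if (ulLength > ulBufferSize - 1)
    {
        ulLength = ulBufferSize - 1;
    }

    memcpy(lpszBuffer, lpszText, ulLength);
    lpszBuffer[ulLength] = '\0';

    return ulLength;
}
```

Only plain pointers and integers cross the boundary, so the caller needs no knowledge of how a particular C++ library lays out std::string.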

Enough talk, here is the actual code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned long FileSearch(FILE* pFile, const char* lpszSearchString)
{
    //make sure we were passed a valid file and search string, if not return -1
    if ((!pFile)||(!lpszSearchString))
    {
        return -1;
    }

    unsigned long ulFileSize=0;

    //get the size of the file
    if (fseek(pFile,0,SEEK_END)!=0)
    {
        return -1;
    }

    long lFileSize=ftell(pFile);

    fseek(pFile,0,SEEK_SET);

    //if ftell() failed or the file is empty return -1
    if (lFileSize<=0)
    {
        return -1;
    }

    ulFileSize=(unsigned long)lFileSize;

    //get the length of the string we're looking for, this is
    //the size the buffer will need to be
    unsigned long ulBufferSize=strlen(lpszSearchString);

    //an empty string, or one longer than the file, can't be found
    if ((!ulBufferSize)||(ulBufferSize>ulFileSize))
    {
        return -1;
    }

    //allocate the memory for the buffer
    char* lpBuffer=(char*)malloc(ulBufferSize);

    //if malloc() returned a null pointer (which probably means
    //there is not enough memory) then return -1
    if (!lpBuffer)
    {
        return -1;
    }

    unsigned long ulCurrentPosition=0;

    //this is where the actual searching will happen, what happens
    //here is we set the file pointer to the current position, which
    //is incremented by one each pass, then we read the size of
    //the buffer into the buffer and compare it with the string
    //we're searching for, if the string is found we return the
    //position at which it is found; the last position worth testing
    //is ulFileSize-ulBufferSize, so the comparison has to be <=
    while (ulCurrentPosition<=ulFileSize-ulBufferSize)
    {
        //set the pointer to the current position
        fseek(pFile,ulCurrentPosition,SEEK_SET);

        //read ulBufferSize bytes from the file, stop if the read comes up short
        if (fread(lpBuffer,1,ulBufferSize,pFile)!=ulBufferSize)
        {
            break;
        }

        //if the data read matches the string we're looking for
        if (!memcmp(lpBuffer,lpszSearchString,ulBufferSize))
        {
            //free the buffer
            free(lpBuffer);

            //return the position the string was found at
            return ulCurrentPosition;
        }
        
        //increment the current position by one
        ulCurrentPosition++;
    }

    //if we made it this far the string was not found in the file
    //so we free the buffer
    free(lpBuffer);

    //and return -1
    return -1;
}


Just a note: I know the return value is unsigned and that I return -1 in all the error cases. Remember that -1 converted to an unsigned type is that type's largest value, which is 0xFFFFFFFF for a 32-bit unsigned long. Since I sincerely doubt you will ever come across a single file that is close to 4GB, this should never be a problem. Note that fseek() and ftell() work with a signed long, so with a 32-bit long they only cope with files up to 2GB anyway. If you do need to search a file that is over 4GB, I suggest replacing "unsigned long" with "unsigned __int64" if your compiler supports it, along with 64-bit seek and tell functions. In that case I doubt even more that your hard drive can hold a file that is 2^64 bytes in size, so returning -1 (a REALLY big number for 64-bit numbers) will do nicely.
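To make the sentinel explicit at the call site, one option is to give it a name. This is only a sketch; FILESEARCH_NOT_FOUND and IsFound are names made up here and do not appear in the code above.

```cpp
#include <limits.h>

// Hypothetical named constant for the "not found / error" result.
// Converting -1 to an unsigned type always yields that type's maximum value,
// whatever the width of unsigned long on the platform.
const unsigned long FILESEARCH_NOT_FOUND = (unsigned long)-1;

// True when a FileSearch-style result is a real position rather than the sentinel.
inline bool IsFound(unsigned long ulResult)
{
    return ulResult != FILESEARCH_NOT_FOUND;
}
```

A caller would then write something like if (IsFound(FileSearch(pFile, "text"))) instead of comparing against a bare -1.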

The above code is probably not the most efficient way of doing this, but it works, and it works fast. If I get the time I might try to make it as fast as possible, but unless this becomes the bottleneck of a program I'm working on, that might not be for a while.
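One way such a speed-up could look, as a sketch rather than a tested replacement: read the file in large blocks and carry the last (length - 1) bytes of each block over to the next, so a match that straddles a block boundary is still seen. The name FileSearchChunked and the 64KB block size are assumptions made for this example, and the file is assumed to be opened in binary mode ("rb").

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical alternative to FileSearch(): one fread() per block instead of
// one fseek() and fread() per byte. Returns the offset of the first match,
// or (unsigned long)-1 if there is none or an error occurs.
unsigned long FileSearchChunked(FILE* pFile, const char* lpszSearchString)
{
    const unsigned long ulNotFound = (unsigned long)-1;
    if (!pFile || !lpszSearchString) return ulNotFound;

    const size_t needleLen = strlen(lpszSearchString);
    if (needleLen == 0) return ulNotFound;

    const size_t chunkSize = 64 * 1024;               // arbitrary block size
    char* lpBuffer = (char*)malloc(chunkSize + needleLen - 1);
    if (!lpBuffer) return ulNotFound;

    if (fseek(pFile, 0, SEEK_SET) != 0) { free(lpBuffer); return ulNotFound; }

    size_t carried = 0;                // bytes kept from the previous block
    unsigned long ulBufferStart = 0;   // file offset of lpBuffer[0]

    for (;;)
    {
        // Append the next block after the carried-over bytes.
        const size_t got = fread(lpBuffer + carried, 1, chunkSize, pFile);
        if (got == 0) break;           // end of file or read error
        const size_t have = carried + got;

        // Test every start position that has needleLen bytes available.
        for (size_t i = 0; i + needleLen <= have; ++i)
        {
            if (memcmp(lpBuffer + i, lpszSearchString, needleLen) == 0)
            {
                free(lpBuffer);
                return ulBufferStart + (unsigned long)i;
            }
        }

        // Keep the tail that could still be the beginning of a match.
        const size_t keep = (have < needleLen - 1) ? have : needleLen - 1;
        memmove(lpBuffer, lpBuffer + (have - keep), keep);
        ulBufferStart += (unsigned long)(have - keep);
        carried = keep;
    }

    free(lpBuffer);
    return ulNotFound;
}
```

Because only needleLen - 1 bytes are carried over, no start position is tested twice, and the file size never has to be queried up front.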
Current Comments
by suman on Monday, April 10th 2006 at 10:23 PM

Hi sean,
I am quite interested in your work. You have done a very good job. I would like to have the full code of this program, can u send it to my mail-id please. bsuman256@rediffmail.com is my id.
I would be very thankful to you for sharing your C code for searching a string in a file.

Thank You very much in advance

Suman Bharath.

by Krishna on Thursday, November 23rd 2006 at 06:46 AM

It is a good piece of work. Could u send me the complete copy of the C++ code u have written? My email id is kittu24@gmail.com
Regards
KS

by Muthukumar on Monday, February 5th 2007 at 02:23 AM

Thanks for ur code. If u hadn't posted it I too would have gone for implementing it myself ... this code is really nice, thank u.

by vijay on Monday, April 30th 2007 at 10:02 AM

hi, It is really good. Could u send me the full code, to my id, satya.vijai@gmail.com.

Thanks in advance

by Beeteh on Thursday, May 17th 2007 at 06:09 AM

Hi Sean,
Thanks for sharing your code with us. It's pretty neat and I found it easy to understand.
If it's not too much trouble, could you please send me the full code to skemii@hotmail.com?....you don't have to if you don't want to though.
Happy programming! & Thanks again!

by Sami on Thursday, August 9th 2007 at 02:14 PM

Hi Sean,
This is excellent. I was looking for something like this for a while. We have an issue where I need to find a string within files but I couldn't find anything. Thanks for the great work.

I have one question and it might be obvious to all but not me :( how do I run this code if I am searching for the string "RT5004" within a file and I have over 10,000 files?

Thanks,
Sami
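One possible shape for this, sketched under assumptions: loop over a list of paths, open each file in binary mode and hand it to the search function. CountFilesContaining and FileSearchFn are names invented for this example; FileSearch() from the article can be passed as the last argument. Building the list of paths is platform specific (for example FindFirstFile/FindNextFile on Windows or opendir/readdir on POSIX systems) and is not shown.

```cpp
#include <stdio.h>

// Signature shared by FileSearch() from the article and similar functions.
typedef unsigned long (*FileSearchFn)(FILE*, const char*);

// Hypothetical driver: opens each path in binary mode, runs pfnSearch on it,
// prints the paths where the result is not the (unsigned long)-1 sentinel,
// and returns how many such files there were.
unsigned long CountFilesContaining(const char* const* ppszPaths,
                                   unsigned long ulPathCount,
                                   const char* lpszSearchString,
                                   FileSearchFn pfnSearch)
{
    unsigned long ulHits = 0;

    for (unsigned long i = 0; i < ulPathCount; ++i)
    {
        FILE* pFile = fopen(ppszPaths[i], "rb");
        if (!pFile)
        {
            continue;   // missing or unreadable files are skipped
        }

        if (pfnSearch(pFile, lpszSearchString) != (unsigned long)-1)
        {
            printf("%s\n", ppszPaths[i]);
            ++ulHits;
        }

        fclose(pFile);
    }

    return ulHits;
}
```

With an array of paths in hand, the call would look like CountFilesContaining(paths, count, "RT5004", FileSearch).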

by vineeta on Thursday, December 27th 2007 at 08:29 AM

Hi sean

I want to work with file handling.so can u please send me full code ??

waiting for your reply.

Thanks & Regards
Vineeta

by leo on Friday, January 18th 2008 at 08:55 AM

hey man... It is too good. Can u send me the full code to my id !!! :) leoviveke@yahoo.com

Thanks in advance

by Vivek on Friday, February 1st 2008 at 04:10 AM

Your program is working absolutely fine.

But there are cases where the results are not as required.

For example: I need to find the string "if" in some file.
Words that contain "if" are also considered valid matches, which they should not be.
i.e. "theif": this word contains "if", so it is also taken into consideration.

Can u suggest a better way to avoid this?
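A common approach is to accept a match only when the characters on either side of it are not part of a word. The sketch below shows the idea on an in-memory string; IsWordChar, IsWholeWord and FindWholeWord are hypothetical names, and treating letters, digits and underscore as word characters is an assumption. For a file search, the same test would be applied to the bytes just before and just after the matched position.

```cpp
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Assumption: letters, digits and '_' count as part of a word.
static bool IsWordChar(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

// True if the match at [pos, pos+len) in text is not glued to a word
// character on either side.
bool IsWholeWord(const char* text, size_t textLen, size_t pos, size_t len)
{
    const bool leftOk = (pos == 0) || !IsWordChar(text[pos - 1]);
    const bool rightOk = (pos + len >= textLen) || !IsWordChar(text[pos + len]);
    return leftOk && rightOk;
}

// Offset of the first whole-word occurrence of word in text, or (size_t)-1.
size_t FindWholeWord(const char* text, const char* word)
{
    const size_t textLen = strlen(text);
    const size_t wordLen = strlen(word);
    if (wordLen == 0)
    {
        return (size_t)-1;
    }

    const char* p = text;
    while ((p = strstr(p, word)) != NULL)
    {
        const size_t pos = (size_t)(p - text);
        if (IsWholeWord(text, textLen, pos, wordLen))
        {
            return pos;
        }
        ++p;    // rejected: keep looking one character further on
    }

    return (size_t)-1;
}
```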

by sarma on Friday, March 7th 2008 at 04:37 AM

i need to search for a string in a pdf file using C#.Net and Asp.Net , could u help me please

by Muneeswaran on Monday, March 10th 2008 at 03:37 AM

Hi, I tried to run your code in an MFC application, but there I got an error like: cannot convert parameter 1 from 'struct _iobuf *' to 'char *'. So please send the full source code to me.

by anand on Friday, March 14th 2008 at 12:51 AM

Hi very fantastic job done yar... Can i get a full source code...?

by pamplemoose on Wednesday, March 19th 2008 at 03:50 PM

Fantastic bit of code, helped me enormously in a project I'm working on.

I would say change:
while (ulCurrentPosition<ulFileSize-ulBufferSize)
to:
while (ulCurrentPosition<=ulFileSize-ulBufferSize)

Without it I was missing the last character of the file, so if the required string was at the very end it wasn't found.
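The difference can be checked by counting the start positions each loop condition visits. PositionsTested is a throwaway helper written for this illustration only.

```cpp
// Number of start positions the loop tests for a file of ulFileSize bytes and
// a search string of ulBufferSize bytes (assumes ulBufferSize <= ulFileSize).
unsigned long PositionsTested(unsigned long ulFileSize,
                              unsigned long ulBufferSize,
                              bool bInclusive)
{
    const unsigned long ulLast = ulFileSize - ulBufferSize;
    unsigned long ulCount = 0;

    for (unsigned long ulPos = 0;
         bInclusive ? (ulPos <= ulLast) : (ulPos < ulLast);
         ++ulPos)
    {
        ++ulCount;
    }

    return ulCount;
}
```

For a 5-byte file and a 2-byte string the valid start positions are 0, 1, 2 and 3. The "<" version tests only the first three, so a match in the last two bytes is never compared.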

by Merlin on Monday, April 21st 2008 at 09:48 AM

For purposes of speed here's a version that only reads each character from file once (and thus doesn't need the extra seeks). I'm using it for a true/false on whether the string occurs but the filePos variable has the right value to be returned instead of true to use this as a search for first occurrence method.

The key is to use the same buffer and keep shifting the bytes to make room for more. I use a 2n-1 buffer so that each byte is only moved in memory once.

//Use this signature with obvious changes to find position in file of first occurrence
//static unsigned long findStrInFile(FILE* pFile, const char* str)
inline bool fileContainsStr(FILE* pFile, const char* const str)
{
    //check the pointers before calling strlen() on str
    if( !str || !pFile ) { return false; }
    const unsigned long strLen = strlen(str), strLenM1 = strLen-1;
    if( !strLen ) { return false; }

    if( fseek(pFile, 0, SEEK_END) != 0 ) { return false; }
    const long endPos = ftell(pFile); fseek(pFile, 0, SEEK_SET);
    if( endPos <= 0 ) { return false; }
    const unsigned long fileLen = (unsigned long)endPos;

    if( strLen > fileLen ) { return false; }

    char* const searchBuf = (char*)malloc( 2*strLen - 1 ); if( !searchBuf ) { return false; }
    char *pSearch = searchBuf, *pWrite = searchBuf, * const pMid = searchBuf + strLenM1;

    unsigned long filePos = 0;
    //prime the buffer with the first strLen-1 bytes
    fread(searchBuf, strLenM1, 1, pFile); pWrite += strLenM1;
    while( 1 )
    {
        //append one more byte so pSearch points at strLen valid bytes
        if( fread(pWrite, 1, 1, pFile) != 1 ) { break; }
        ++pWrite;

        if( !memcmp( pSearch, str, strLen ) ) { free(searchBuf); return true; }

        if( ++filePos > fileLen - strLen ) { break; }

        //buffer full: move the last strLen-1 bytes back to the front
        if( ++pSearch > pMid ) { memcpy(searchBuf, pSearch, strLenM1); pSearch = searchBuf; pWrite = pMid; }
    }
    free(searchBuf); return false;
}

by Merlin on Monday, April 21st 2008 at 09:52 AM

pMid should be initialized to searchBuf PLUS strLenM1; the "+" seems to have been dropped on copy or paste somewhere. Also very important: the filePos > fileLen - strLen test should be preceded by the prefix increment operator "++" (plus plus). Not sure why my plus signs vanished. pSearch is also pre-incremented before the comparison to pMid.

by naresh on Tuesday, June 3rd 2008 at 05:26 PM

I need your help, sir. Have you created a search program in C? Please send the program to my mail id.

Thanking you,

by Mac on Sunday, June 8th 2008 at 03:38 PM

I want to parse a log file for text,
like, say, strings starting with ABC and ending at the first occurrence of ; after that.
I want all such strings in a separate file.
Can you please help?
Thanks
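A minimal sketch of one way to do this, assuming the log fits in memory as a single null-terminated string: find each occurrence of the prefix with strstr, then the first terminator after it with strchr. ExtractRecords is a hypothetical name, and each extracted piece here includes the closing terminator.

```cpp
#include <string.h>
#include <string>
#include <vector>

// Hypothetical helper: collects every substring of text that starts with
// prefix and runs up to and including the first terminator after it.
std::vector<std::string> ExtractRecords(const char* text,
                                        const char* prefix,
                                        char terminator)
{
    std::vector<std::string> records;
    const size_t prefixLen = strlen(prefix);
    if (prefixLen == 0)
    {
        return records;
    }

    const char* p = text;
    while ((p = strstr(p, prefix)) != NULL)
    {
        const char* end = strchr(p + prefixLen, terminator);
        if (!end)
        {
            break;      // no terminator left, so no complete record either
        }

        records.push_back(std::string(p, end + 1));
        p = end + 1;    // continue after this record
    }

    return records;
}
```

Each element of the result could then be written to the output file with fputs() followed by a newline.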

by Saurabh on Monday, June 30th 2008 at 01:12 AM

It's a really good illustration of string searching.
Can u send me the full source code at saurabh_717@yahoo.co.in
so I can work on it further?

