Monday, January 21, 2013

In which I again prove I know nothing about programming

Ok, so let us say, hypothetically of course but in suspiciously specific ways, that I have a text file that looks like this:
[spaces] [string of letters] [tab] [spaces] [string of letters] [line break]
[spaces] [string of letters] [tab] [spaces] [string of letters] [line break]
[spaces] [string of letters] [tab] [spaces] [string of letters] [line break]
[spaces] [string of letters] [tab] [spaces] [string of letters] [line break]

And so forth for, oh, say, 115 pages.

And oh fuck is it hard to explain what I want to do in words. The thoughts are clear; explaining them, much less so.

Thing 1:

Create text file [original file name]-output01.txt
For lines of original text file running from 1 to [the last line]{
Check second string of letters. Does a text file named [original file name]-[second string of letters].txt exist?
---If no create said file and have line one read:
-------[first string of letters] 1
---If yes read said file.  Does the file contain [first string of letters]?
-------If yes change that line to [first string of letters] [current number+1]
-------If no add a new line to the end of the file that reads: [first string of letters] 1

Add "[space][second string of letters][line number from previous step]" to the end of the text file [original file name]-output01.txt
}
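One way Thing 1 might look in Python, kept as a rough sketch rather than a finished tool. It assumes the input has already been split into (first string, second string) pairs, and it keeps the per-string "files" in memory as dictionaries instead of writing them to disk. The names `thing_one`, `tallies`, and `tokens`, and the sample pairs, are invented for illustration.

```python
def thing_one(pairs):
    """Tally first strings under each second string and build output01.

    pairs: list of (first, second) tuples, one per line of the input.
    Returns (tallies, tokens). tallies[second] maps each first string to
    its count, in first-seen order (dicts keep insertion order in
    Python 3.7+). tokens are the pieces of output01.
    """
    tallies = {}
    tokens = []
    for first, second in pairs:
        # "Does the file exist? If no, create it."
        counts = tallies.setdefault(second, {})
        # "If yes, add 1; if no, add a new line that reads: first 1."
        counts[first] = counts.get(first, 0) + 1
        # The position of `first` in the tally plays the role of its
        # line number in the per-string file.
        line_number = list(counts).index(first) + 1
        tokens.append(f"{second}{line_number}")
    return tallies, tokens


# Hypothetical sample data, not taken from the real file.
pairs = [("foo", "X"), ("bar", "X"), ("foo", "X"), ("baz", "Y")]
tallies, tokens = thing_one(pairs)
print(" ".join(tokens))  # prints: X1 X2 X1 Y1
```

Looking up the position with `list(counts).index(first)` is slow for very large tallies, but it keeps the sketch close to the description above.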

I'm not sure if there was supposed to be additional, oh wait, there were.

Thing 2:

You know these are in no particular order because this is easier.


Create text file [original file name]-output02.txt
For lines of original text file running from 1 to [the last line]{
Check second string of letters.
Add "[space][second string of letters]" to the end of the text file [original file name]-output02.txt
}
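Thing 2 is short enough that a sketch fits in a few lines of Python. As before, it assumes the input is already a list of (first string, second string) pairs; `thing_two` and the sample pairs are made-up names and data.

```python
def thing_two(pairs):
    """Build the text of output02: a space, then each second string."""
    # Each step adds "[space][second string]", so the result starts
    # with a space, exactly as the description above says.
    return "".join(f" {second}" for _first, second in pairs)


# Hypothetical sample data.
pairs = [("foo", "X"), ("bar", "X"), ("baz", "Y")]
output02 = thing_two(pairs)
print(repr(output02))  # prints: ' X X Y'
```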

So there was a reason for numbering the output files, but I very much doubt having two digits is necessary; for a while I could only remember one of them.

Thing 3:
The files (other than [original file name]-output01) will be text files of the form:
[string of letters][space][number]
Over and over again for line after line.

I want to make a new file that rearranges those lines so that the [number]s run from highest to lowest and [space][line number from original file] is added to the end of each line.

ie:
[string of letters][space][highest number][space][line number from original file]
[string of letters][space][next highest number][space][line number from original file]
[string of letters][space][third highest number][space][line number from original file]
[...]
[string of letters][space][second lowest number][space][line number from original file]
[string of letters][space][lowest number][space][line number from original file]

These new files will be named [original file name]-[string of letters]-sorted.txt and there will be one for each [original file name]-[string of letters].txt file.
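A possible Python sketch of Thing 3 for a single per-string file. It reads "line number from original file" as the line number in the unsorted [original file name]-[string of letters].txt file, which is an interpretation, and it represents that file as a dictionary of counts in first-seen order. The function name and sample counts are invented.

```python
def thing_three(counts):
    """Sort one tally from highest count to lowest.

    counts: dict mapping string of letters -> number, in the order the
    lines appear in the unsorted file.
    Returns the lines of the -sorted file as
    "[string of letters] [number] [line number in unsorted file]".
    """
    rows = [
        (letters, number, line_number)
        for line_number, (letters, number) in enumerate(counts.items(), start=1)
    ]
    # Python's sort is stable, so ties stay in first-seen order.
    rows.sort(key=lambda row: row[1], reverse=True)
    return [f"{letters} {number} {line_number}" for letters, number, line_number in rows]


# Hypothetical tally: foo seen twice, bar five times, baz twice.
sorted_lines = thing_three({"foo": 2, "bar": 5, "baz": 2})
print("\n".join(sorted_lines))
```

How ties are broken is not specified above; keeping first-seen order is just the choice this sketch makes.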

Thing 4:

Create file [original file name]-output03 (I'm at three now, right?) by copying [original file name]-output01.

Read through file, for every [string of letters][number] replace it with uh, how do I say this?

Well, I have to try it again because I got it totally wrong the first time; good thing I caught it before I posted this.

[original file name]-output01 will be entirely of the form [string of letters][number][space][string of letters][number][space][string of letters][number][space] and so on, with nary a line break in sight.

Assuming that I take the things between the spaces ([string of letters][number]) one at a time (so I don't have to worry about changing something I already changed), what I want to do is this: look at [original file name]-[string of letters]-sorted.txt and find the line where [number] is the last number in that line. Replace [number] with that line's line number in the sorted file.

So essentially I'm replacing the numbering of output 1, where things are numbered by when they first appear, with a numbering that has [string of letters]1 be the most frequent and [string of letters]2 the second most frequent, and so on.

In [original file name]-[string of letters].txt the line numbers correspond to when [first string of letters] was first mentioned. In [original file name]-[string of letters]-sorted.txt the line numbers correspond to how frequently [first string of letters] was mentioned (with 1 being the most frequent).

Output01 corresponds to the unsorted, output03 would correspond to the sorted.
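The renumbering in Thing 4 could be sketched in Python along these lines. It assumes the tallies are held in memory as dictionaries (second string -> first string -> count, in first-seen order) and that output01 has been split on spaces into tokens such as "X2". `TOKEN_RE`, `rank_table`, `thing_four`, and the sample data are all hypothetical.

```python
import re

# A token is a string of letters followed by a number, e.g. "X2".
TOKEN_RE = re.compile(r"([A-Za-z]+)(\d+)")


def rank_table(counts):
    """Map first-seen line number -> frequency rank (1 = most frequent)."""
    numbered = list(enumerate(counts.values(), start=1))  # (line number, count)
    numbered.sort(key=lambda item: item[1], reverse=True)  # stable on ties
    return {line: rank for rank, (line, _count) in enumerate(numbered, start=1)}


def thing_four(tokens, tallies):
    """Swap each token's first-seen number for its frequency rank."""
    tables = {second: rank_table(counts) for second, counts in tallies.items()}
    renumbered = []
    for token in tokens:
        second, number = TOKEN_RE.fullmatch(token).groups()
        renumbered.append(f"{second}{tables[second][int(number)]}")
    return renumbered


# Hypothetical data: under X, "foo" came first but "bar" is more frequent.
tallies = {"X": {"foo": 1, "bar": 3}}
tokens = ["X1", "X2", "X2", "X2"]
output03 = thing_four(tokens, tallies)
print(" ".join(output03))  # prints: X2 X1 X1 X1
```

Building the rank table once per second string avoids rereading a sorted file for every token.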

Thing 5:

It's safe to assume at this point that I've completely forgotten what I was doing or where I was going with this [and that was written before I corrected Thing 4], so let's just say: modify Thing 4 so that instead of replacing the number in [string of letters][number], we replace the entire "[string of letters][number]" with the string of letters on line number [number]. (Remember that each line contains a string of letters as its first component, followed by a space.)
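A Python sketch of one reading of Thing 5: each "[string of letters][number]" token from output01 is replaced by the first string sitting on line [number] of the matching unsorted file. That choice of file is an assumption, though the line in the sorted file whose last number is [number] holds the same string. The names and sample data below are invented, and the files are again dictionaries in memory.

```python
import re

# A token is a string of letters followed by a number, e.g. "X2".
TOKEN_RE = re.compile(r"([A-Za-z]+)(\d+)")


def thing_five(tokens, tallies):
    """Replace each token with the string of letters it stands for.

    tallies[second] maps first strings to counts in first-seen order,
    so the Nth key is what line N of the per-string file starts with.
    """
    words = []
    for token in tokens:
        second, number = TOKEN_RE.fullmatch(token).groups()
        firsts = list(tallies[second])  # line 1, line 2, ... of the file
        words.append(firsts[int(number) - 1])
    return words


# Hypothetical data.
tallies = {"X": {"foo": 2, "bar": 1}, "Y": {"baz": 1}}
tokens = ["X1", "X2", "X1", "Y1"]
restored = thing_five(tokens, tallies)
print(" ".join(restored))  # prints: foo bar foo baz
```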

-

Uh, now then, remember the title of the post?  How the fuck do I do any of that?

Seriously, I know nothing of programming.

3 comments:

  1. Once you've got the basics, the hard part of programming is, in fact, expressing what you want done in algorithmic terms.

    None of what you've described sounds terribly difficult to me. My tool of choice tends to be Perl anyway, but it's particularly good at handling text.

    So for example you might start by matching each line of the input file against the regular expression (qv):

    /^\s*([A-Z]+)\s+([A-Z]+)/i

    (beginning of line, zero or more spaces/tabs, one or more letters, one or more spaces/tabs, one or more letters)

    which would give you one variable with the first lot of letters and another with the second. Then you could build filenames out of those captured variables, and start looking for files, and so on. (Though actually I'd be inclined to try to do the whole thing in memory, and only write out the output files you actually wanted at the end. That shouldn't be any sort of challenge for a modern machine.)
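For readers who would rather follow along in Python than Perl, the same match might look like the sketch below. This swaps in Python's `re` module for the Perl regular expression above; `LINE_RE`, `parse_lines`, and the sample lines are made up for illustration, and everything stays in memory as suggested.

```python
import re

# Same pattern as above: optional whitespace, letters, whitespace
# (the tab and spaces), letters. re.IGNORECASE plays the role of /i.
LINE_RE = re.compile(r"^\s*([A-Z]+)\s+([A-Z]+)", re.IGNORECASE)


def parse_lines(lines):
    """Return a (first, second) pair for every line that matches."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # blank or malformed lines are skipped
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Hypothetical lines in the [spaces][letters][tab][spaces][letters] shape.
sample = ["   foo\t  X\n", "  bar\t X\n", "\n", " foo\t Y\n"]
pairs = parse_lines(sample)
print(pairs)  # prints: [('foo', 'X'), ('bar', 'X'), ('foo', 'Y')]
```

For a real file, `parse_lines(open(path))` would work the same way, since a file object yields one line at a time.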

    I'd be happy to cut you some actual code if you like, which I could then comment copiously so that you could see what was going on. I think you have my email address... feel free to send me a link to your source text.

    Replies
    1. I don't doubt that it's simple, I just don't know how to do it. I shall look to see if I can locate your email.

    2. Perl and Python are on my list of languages I really should learn.

      TRiG.
