Discussion:
From JoyceUlysses.txt -- words occurring exactly once
HenHanna
2024-05-30 20:09:39 UTC
i'd not use Gauche for this, but maybe someone can change my mind.


_______________________
From JoyceUlysses.txt -- words occurring exactly once


Given a text file of a novel (JoyceUlysses.txt) ...

could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?

-- Also, a list of words occurring once, twice or 3 times



re: hyphenated words (you can treat it anyway you like)

ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Jeff Barnett
2024-05-30 22:33:30 UTC
Post by HenHanna
i'd not use Gauche for this, but maybe someone can change my mind.
_______________________
From JoyceUlysses.txt -- words occurring exactly once
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
              -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
       ideally, i'd treat  [editor-in-chief]
                           [go-ahead]  [pen-knife]
                           [know-how]  [far-fetched] ...
       as one unit.
Make a list (or array) of the individual words (as strings or symbols in
a special package) of the original document then sort the list using the
Lisp-supplied sort function. You then write a loop using your favorite
tools and look for interior sequences of the required length. This gives
you a program that is asymptotically efficient as the theoretical
run-time will look something like (* c N (log N)), where N is the length
of the list produced by the first step and c is some constant.
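
The sort-then-scan idea described above can be sketched in shell, the language of the pipelines elsewhere in this thread, rather than in Lisp. This is only a rough illustration under stated assumptions: the function name words_with_count_between is hypothetical, a "word" is taken to be a run of ASCII letters and hyphens, case is folded to lower case, and leading or trailing hyphens are stripped. Doubled hyphens inside a token (as in "way--never") are left alone.

```shell
# words_with_count_between MIN MAX  (hypothetical helper name)
# Reads text on stdin and prints every word whose number of occurrences
# lies between MIN and MAX inclusive.
words_with_count_between() {
    tr -cs 'A-Za-z-' '\n' |            # one token per line (letters and hyphens)
    tr 'A-Z' 'a-z' |                   # fold case so "The" and "the" match
    sed 's/^-*//; s/-*$//' |           # drop leading/trailing hyphens
    LC_ALL=C sort |                    # equal words become adjacent
    awk -v min="$1" -v max="$2" '
        # Print the word of the run that just ended if its length is in range.
        function emit() { if (n >= min && n <= max) print prev }
        NF == 0    { next }                      # skip empty tokens
        $0 != prev { emit(); prev = $0; n = 0 }  # a new run starts here
                   { n++ }                       # extend the current run
        END        { emit() }                    # do not forget the last run
    '
}
```

With this definition, `words_with_count_between 1 1 < JoyceUlysses.txt` would list the words occurring exactly once, and `words_with_count_between 1 3 < JoyceUlysses.txt` those occurring one to three times. The sort dominates the cost; the scan afterwards is a single linear pass over the sorted list.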

Note, any solution resembling this one is not really what you want. For
example it would think "Snark" and "Snarks" are different words. Some
differences such as capitalization can be suppressed by choosing a sort
predicate that is case insensitive. You can, of course, write your own
sort predicate. The thing to note is that the predicate (the <= operator
used by sort) will not access the words or maintain state between
invocations; otherwise, the complexity can become arbitra
Stefan Monnier
2024-05-30 22:45:00 UTC
Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
tr ' .;:,?!' '\n' | sort | uniq -u

?


- Stefan
Kaz Kylheku
2024-05-30 23:20:08 UTC
Post by Stefan Monnier
Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me
a list of all words occurring exactly once?
tr ' .;:,?!' '\n' | sort | uniq -u
Yep, that's pretty much how Doug McIlroy famously shut down Knuth.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Paul Rubin
2024-05-31 07:40:59 UTC
Post by HenHanna
could someone give me a pretty fast (and simple) program that'd give
me a list of all words occurring exactly once?
To first approximation, this works for me (bash command):

tr -c "[a-zA-Z-]" "\n" < ulysses.txt |sort|uniq -c|sort -n
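
The pipeline above prints every word with its count. A possible refinement that keeps only the rare words (once, twice or three times, as the original question asks) is sketched below. The function name rare_words and its optional MAX argument are assumptions made for illustration. Unlike the sort-and-scan description earlier in the thread, this variant stays case sensitive, as the pipeline above is.

```shell
# rare_words [MAX]  (hypothetical helper name; MAX defaults to 3)
# Reads text on stdin and prints "COUNT WORD" for every word that occurs
# at most MAX times, ordered by count.
rare_words() {
    tr -cs 'A-Za-z-' '\n' |            # one token per line (letters and hyphens)
    LC_ALL=C sort |                    # group identical tokens together
    uniq -c |                          # prefix each distinct token with its count
    awk -v max="${1:-3}" 'NF == 2 && $1 <= max { print $1, $2 }' |
    LC_ALL=C sort -n                   # order by count: 1s first, then 2s, 3s
}
```

For example, `rare_words < ulysses.txt` would list the words seen one to three times, and `rare_words 1 < ulysses.txt` only those seen exactly once. The `NF == 2` test discards the empty token that appears when the text begins with a non-letter.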
B. Pym
2024-05-31 10:13:50 UTC
Post by HenHanna
i'd not use Gauche for this, but maybe someone can change my mind.
_______________________
From JoyceUlysses.txt -- words occurring exactly once
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) program that'd give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.
Gauche Scheme

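(use file.util)  ;; file->string
(use srfi-13)    ;; string-tokenize
(use srfi-14)    ;; character sets (char-set:letter, char-set-adjoin)

;; Maps each upcased word to its number of occurrences.
(define h (make-hash-table 'string=?))

(dolist
    (s
     (string-tokenize (file->string "Alice.txt")
                      (char-set-adjoin char-set:letter #\-)))
  ;; Strip leading and trailing hyphens, then bump the word's count.
  (hash-table-update! h
    (regexp-replace* (string-upcase s) #/^-+/ "" #/-+$/ "")
    (pa$ + 1) 0))

;; Keep the words that occur once or twice.
(filter (lambda (kv) (< (cdr kv) 3))
        (hash-table->alist h))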

===>

(("LASTED" . 2) ("WAY--NEVER" . 1) ("VISIT" . 1) ("CHANCED" . 1)
("WILDLY" . 2) ("BEHEAD" . 1) ("PROMISE" . 1) ("MEANWHILE" . 1)
("ENGAGED" . 1) ("KNIFE" . 2) ("ROARED" . 1) ("RETIRE" . 1)
("BLACKING" . 1) ("HATED" . 1) ("BRIGHT-EYED" . 1)
("SHEEP-BELLS" . 1) ("PROTECTION" . 1) ("CRIES" . 1) ("ADA" . 1)
("ENJOY" . 1) ("WRITHING" . 1) ("RAW" . 1) ("APPEALED" . 1)
("RELIEVED" . 1) ("CHILDHOOD" . 1) ("WEPT" . 1) ("RACE-COURSE" . 1)
("THEIRS" . 1) ("MAD--AT" . 1) ("SPOKEN" . 1) ("PENCILS" . 1)
("CLEAR" . 2) ("TREADING" . 2) ("RETURNED" . 2) ("CHERRY-TART" . 1)
("UNEASY" . 1) ("LOW-SPIRITED" . 1) ("BONE" . 1) ("PROMISED" . 1)
("HAPPENING" . 1) ("OYSTER" . 1) ("PATIENTLY" . 2) ("NEEDS" . 1)
("LESSON-BOOK" . 1) ("PITIED" . 1) ("UNCOMFORTABLY" . 1)
("ANTIPATHIES" . 1) ("PICTURED" . 1) ("DESPERATE" . 1)
("ENGRAVED" . 1)
...
)
