Text Analysis

© Bob Armstrong

Subject:	[svfig] Text analysis
Date:	Mon, 25 Apr 2016 11:48:02 -0600
From:	Bob Armstrong <bob@cosy.com>
To:	Silicon Valley Forth Interest Group <svfig@zork.net>

[ corrected ]

As is often the case , Brad Nelson's talk , Ruminations on Forth Text Processing , resonates with me .

CoSy's highest value is maintaining a parsable log in some sense written in itself . The current interface is defined in http://cosy.com/4thCoSy/CoSy/Tui.f and interfaces Reva Forth to a very stale version of http://www.tecgraf.puc-rio.br/iup/ . Closing that loop meant being able to type the line and execute

       text> R ` text v!

saving the text I'm writing to the R Root list , then save R with my current keystrokes reliably to a flat file . And then put the whole save process on ctrl_s . That appears to have occurred by 2006 8 7 3 29 34 .

I remember Perlis from my first APL conference , Rochester NY , 1979 . The conference probably had at least 400 attendees . They were big in those days . Perlis was old and skinny and bald and in a wheelchair and clearly among the demi-gods . Guy Steele and Alan Kay were also there .

One of 3 quotes at the top of my CoSy.com/CoSy/ page is Perlis's about it being best to have a rich vocabulary on one data structure .

To me there's no alternative to just biting the bullet and moving from the single address space ( n iota ) within which Forths build themselves to objects in their own dynamically allocated and released address spaces .
In 4th.CoSy there's just 1 fundamental type of object : reference counted lists of lists with leafs of arbitrary types , currently char , integer , float , and symbol . In fact , because of modulo indexing the topology is actually rings of rings of rings . Each object has a 4 cell header .
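As an illustration of the "rings of rings" remark, here is a minimal Python sketch of modulo indexing on nested lists. It is not CoSy code: the names `at` and `at_path` are hypothetical, and how CoSy itself treats negative or out-of-range indices is not stated above, so the wrap-around shown is Python's `%` behaviour and only an assumption about the idea.

```python
def at(lst, i):
    """Index with wrap-around, so every list behaves like a ring."""
    if not lst:
        raise IndexError("cannot index an empty list")
    return lst[i % len(lst)]


def at_path(obj, path):
    """Follow a path of indices down nested lists, wrapping at each level."""
    for i in path:
        obj = at(obj, i)
    return obj


# A list of lists with leaves of different types (hypothetical sample data).
nested = [[1, 2, 3], ["a", "b"], [4.0]]

print(at(nested, 4))             # index 4 wraps to 1 -> ['a', 'b']
print(at_path(nested, [3, 5]))   # 3 wraps to 0, then 5 wraps to 2 -> 3
```

Because every index is reduced modulo the length, no index on a non-empty list is ever out of range, which is what turns each list into a ring.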

Given that remembering what you've done is such a focus of CoSy , text parsing and searching vocabularies are a priority focus . There's still a bunch to be fleshed out , but here's some relevant vocabulary :

       s" toksplt " CShelp
(
 s" /4thCoSy/CoSy/CoSy.f"
 s" : tokcut ( str tok -- CV ) | cuts string at occurrences of string `tok but includes segment before first token
  2p LR@ css i0 swap ,I L@ swap cut 2P> ;
 : toksplt ( str tok -- CV ) | like ' tokcut but deletes the tokens from the cut pieces
  | cr ." toksplt " ( 2p> tokcut i0 R@ rho )
  2p LR@ swap cL >aux+> dup R@ css cut aux- R@ rho ['] cut* eachleft 2P> ;"
 s" : vm "lf toksplt ; | kludged
 : VMbl : nlfy "bl toksplt ; | using Rick Trice's fn name .
 : VMnl ( str -- list_of_strings_split_on_cr ) "nl toksplt ;| name from APL Vector to Matrix on "cr 'lf .
 : VMlf ( str -- list_of_strings_split_on_cr ) "lf toksplt ;| Vector to Matrix on "newlines" ."
 s" : ssr ( str s0 s1 ,L -- str ) | replaces occurrences in str of s0 with s1
  2p L@ R@ 0 i@ toksplt R@ 1 i@ ['] cL eachleft ,/ R@ 1 i@ rho -1*i cut* 2P> ; | Quick ( to think ) and dirty . Could be highly optimized ."
)
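For readers who don't read Forth, the behaviour described in those comments can be approximated in Python. This is a sketch based only on the stack comments and word names above, not a translation checked against CoSy. In particular, the handling of edge cases such as a string that begins with the token is an assumption.

```python
def tokcut(s, tok):
    """Cut s at each occurrence of tok, keeping the segment before the
    first token; every later piece still begins with tok."""
    pieces = s.split(tok)
    return [pieces[0]] + [tok + p for p in pieces[1:]]


def toksplt(s, tok):
    """Like tokcut, but with the tokens deleted from the cut pieces."""
    return s.split(tok)


def ssr(s, s0, s1):
    """Replace occurrences of s0 in s with s1 the 'quick and dirty' way:
    split on s0, append s1 to every piece, ravel, drop the trailing s1."""
    joined = "".join(piece + s1 for piece in toksplt(s, s0))
    return joined[: len(joined) - len(s1)]


print(tokcut("a,b,c", ","))             # ['a', ',b', ',c']
print(toksplt("a,b,c", ","))            # ['a', 'b', 'c']
print(ssr("one two two", "two", "2"))   # one 2 2
```

In everyday Python `s.replace(s0, s1)` does the job of `ssr`; the longer form is shown only because it mirrors the split, append and ravel steps of the definition.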

For instance , to count the lines in the current text :

       text> vm >T0> rho
551

And to count the number of words in each :

       T0 ' VMbl 'm ' rho 'm ,/
4 24 5 36 19 2 12 22 5 17 2 15 2 5 ...
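The same two counts can be sketched in Python as a rough analogue of the two lines above. The sample `text` is made up, and splitting on a single blank (so that an empty line counts as one empty "word") is an assumption about how `VMbl` behaves.

```python
# Hypothetical sample text; the real one is the CoSy text being edited.
text = "As often the case\nBrad's talk resonates\n\nwith me"

lines = text.split("\n")                        # like  text> vm
line_count = len(lines)                         # like  rho
words_per_line = [len(line.split(" ")) for line in lines]

print(line_count)        # 4
print(words_per_line)    # [4, 3, 1, 2]
```

Every `split` allocates a fresh list of fresh strings, which is the creating and freeing of many small objects that the next paragraph refers to.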

Consider how many individual strings just got created and freed . But the cycles are for me to use to make things simple for me . And it must be pretty similar in any dynamic object language .

Brad's interest in counting occurrences of words immediately made me think of one of K's most sophisticated functions , which I've not yet implemented . It would be great if someone else got interested in fleshing it out . It's very useful in a number of accounting tasks . Here's the description from the Kref.pdf :

Group
= x
Arguments
Any list x.
Definition
Group produces a list of nonnegative integer vectors whose count is the number of
distinct items in the argument, and:
• in which each item of !#x appears once and only once in the result, and;
• i and j are in the same item of the result if x[i] matches x[j] (see Match);
For example:
= 2 1 2 2 1 1
(0 2 3         the indices of 2
1 4 5)         the indices of 1
= "weekend"
(,0             the indices of "w" in "weekend"
1 2 4         the indices of "e"
,3              the indices of "k"
,5              the indices of "n"
,6)             the indices of "d"
Each item of the result corresponds to a distinct item in the argument, and within
each item the indices are in increasing order. Those distinct items are ?x, i.e. the
ith item of =x corresponds to the ith item of ?x (see Range). For instance, in the
previous example, the second item of the result, 1 2 4, holds the indices of "e"
in "weekend", and "e" is the second distinct item in "weekend".

The argument to Group can be any list, not just vectors. For example:
=(9 2 3
  4 5
  9 2 3
  6 7 8
  4 5
  9 2 3)

(0 2 5         items 0, 2, and 5 equal 9 2 3
1 4             items 1 and 4 equal 4 5
,3)             item 3 is 6 7 8

Facts about Range
?x is identical to x@*:'=x (see Range).
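For anyone who would like to try fleshing this out, the definition can be sketched in a few lines of Python. This is only an illustration of the semantics quoted above, not K's implementation. The linear search over the distinct items keeps it working for unhashable items such as nested lists, at the cost of speed, and Python's `==` is looser than K's Match (for example `1 == 1.0`).

```python
def group(x):
    """K-style group: one list of indices per distinct item of x,
    in order of first occurrence, indices increasing within each list."""
    keys = []      # distinct items, in order of first occurrence
    result = []    # result[k] holds the indices i where x[i] == keys[k]
    for i, item in enumerate(x):
        for k, key in enumerate(keys):
            if key == item:
                result[k].append(i)
                break
        else:
            keys.append(item)
            result.append([i])
    return result


def range_from_group(x):
    """The fact quoted above: ?x is x indexed by the first index of each group."""
    return [x[g[0]] for g in group(x)]


print(group([2, 1, 2, 2, 1, 1]))    # [[0, 2, 3], [1, 4, 5]]
print(group("weekend"))             # [[0], [1, 2, 4], [3], [5], [6]]
print(range_from_group("weekend"))  # ['w', 'e', 'k', 'n', 'd']
```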

For example , if I execute

Q : = VMs @ ,/ text / Ravel text ( which is a list of lines ) , split on spaces , and group .

on the root K.CoSy text , it ( instantly ) returns a 3870 item very ragged array of the indices of all the unique "words" . I can't think of any simple way to show this array , but the count of each of the first few items looks like this :

       #:' Q
1 391 2 3186 1 1 1 ...

So this little bit of code returns essentially a histogram of each unique word which sounds rather like what Brad was wanting .
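A rough Python analogue of that one-liner, for words only, is sketched below. The sample `lines` are made up and the function name `group_words` is hypothetical. It uses a dict for speed, which works because words are hashable and Python dicts keep insertion order.

```python
def group_words(lines):
    """Ravel a list of lines into words (split on spaces), then group the
    word indices by word, in order of first occurrence."""
    words = [w for line in lines for w in line.split(" ")]
    index_lists = {}
    for i, w in enumerate(words):
        index_lists.setdefault(w, []).append(i)
    return words, list(index_lists.values())


lines = ["the cat sat", "on the mat", "the end"]   # hypothetical sample
words, q = group_words(lines)

counts = [len(g) for g in q]           # like  #:' Q
unique = [words[g[0]] for g in q]      # the distinct words, first index of each group

print(list(zip(unique, counts)))
# [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('end', 1)]
```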

Sorry to go off on my own tangent . But as I say , Brad's talks tend to strike major chords in my own interests .

Peace thru Freedom ,

Bob A

-- 4th.CoSy Document your life in open Forth APL --

Here's a followup [ corrected ] :

-------- Forwarded Message --------

Subject:	Re: [svfig] Text analysis
Date:	Tue, 26 Apr 2016 14:00:48 -0600
From:	Bob Armstrong <bob@cosy.com>
To:	Silicon Valley Forth Interest Group <svfig@zork.net>

Actually K dictionaries , to use the term Dennis Shasha of NYU pushes , are 3 item lists of correlated ( symbols ; values ; attributes ) lists , essentially "associative arrays" . I initially copied Arthur's structure but found the navigating between correlated entries in the 3 lists clunky so added a meta cell to CoSy headers to hold attributes . Thus my dictionaries , like the R root dictionary I mentioned are just 2 "column" ( names ; values ) lists . I have yet to feel the need to assign them a separate type . But they are the top of the hierarchy of "types" for the most common triumvirate of functions building on Forth's ( addr ; @ ; ! ) where the address is generally just implicitly understood .

In 4th.CoSy for lists , the verbs are at and at! indexed by integer . For dictionaries , they are vx v@ and v! indexed by name .
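To make the 2 "column" idea concrete, here is a minimal Python sketch of a dictionary held as a ( names ; values ) pair of lists with fetch and store by name. The function names `v_fetch` and `v_store` are stand-ins for v@ and v! , and appending a new name when it is not yet present is an assumption, not something stated above.

```python
def v_fetch(d, name):
    """Fetch by name: find the name in column 0, return the matching value."""
    names, values = d
    return values[names.index(name)]


def v_store(d, name, value):
    """Store by name: overwrite if the name exists, else append a new pair
    (the append behaviour is assumed for this sketch)."""
    names, values = d
    if name in names:
        values[names.index(name)] = value
    else:
        names.append(name)
        values.append(value)


R = [[], []]                          # an empty ( names ; values ) dictionary
v_store(R, "text", "first draft")
v_store(R, "text", "second draft")    # same name, so the value is replaced
v_store(R, "log", ["line 1"])

print(v_fetch(R, "text"))   # second draft
print(R)                    # [['text', 'log'], ['second draft', ['line 1']]]
```

The dictionary is just an ordinary 2 item list, so everything that works on lists still works on it, which is one reason not to give it a separate type.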

All current APLs have generalized , non-homogeneous , lists , "enclosed arrays" which allow elements to be arbitrary types including generalized lists . ( A few old APLers like Bob Bernecky insist that lists of lists languages aren't real APLs , but -- who cares . I thought it the ( nearly ) obvious path since first reading Backus's 1977 Turing lecture which probably had a lot to do with Iverson getting it two years later . ( I think everyone reading this will agree that ACM's failure so far to award the Turing to Chuck Moore reflects poorly on the ACM . ) ) Objects in 4th.CoSy are generally either homogeneous lists ( treated as leafs ) , or general lists -- essentially lists of pointers . General lists are of type 0 , i.e. , the first cell of the header is 0 .

Sam , some of your comments on J reflect why I eschewed it for Arthur's K . And in fact why I split from other traditional APLs such as John Scholes's ( et al ) Dyalog or Jim Brown's IBM APL2 . K is conceptually simpler and its structure transparent .

I've not looked into the details of JSON , but , like XML , it seemed to be rather straightforwardly transformable to a nested list structure . ( BTW : How do you know Steve Apter ? He's an old ( redundant ) NY K friend . )
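A minimal sketch of such a transformation, in Python with its standard `json` module: JSON objects become 2 item ( names ; values ) lists and JSON arrays become plain lists. The function name `to_nested` and the sample document are made up. Note that the result alone no longer tells an object apart from a 2 item array, so a real mapping would need a type tag or attribute.

```python
import json


def to_nested(obj):
    """Map parsed JSON onto nested lists: objects become [names, values],
    arrays become lists, scalars are left as they are."""
    if isinstance(obj, dict):
        return [list(obj.keys()), [to_nested(v) for v in obj.values()]]
    if isinstance(obj, list):
        return [to_nested(v) for v in obj]
    return obj


doc = json.loads('{"name": "CoSy", "tags": ["forth", "apl"], "meta": {"year": 2016}}')
nested = to_nested(doc)

print(nested)
# [['name', 'tags', 'meta'], ['CoSy', ['forth', 'apl'], [['year'], [2016]]]]
```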

I'm rather curious about Lua . The IUP crowd seem to be big on it .

Finally , it occurred to me that the range function mentioned in the K doc of group is worth noting . In most APLs it's a very common toolbox idiom , unique .

Range
? x
Argument
Any list.

Definition
The result is a list of the unique items of x, in the order of their first occurrence
(i.e., the occurrence with the smallest index). For example:
  ? 9 6 8 6 9 7 8 9 6
9 6 8 7
  ? "strange"
"strange"
  ? "raccoon"
"racon"
? (9 2 3;4 5;9 2 3;6 7 8;4 5;9 2 3)
(9 2 3
4 5
6 7 8)
See the primitive function Group for the relationship between it and this primitive.
Range is an identity function for empty lists.

I believe Iverson called this "nub" . group evolved from several related notions .
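For comparison, range ( or nub , or unique ) takes only a few lines of Python. This is a sketch of the definition quoted above rather than of any K or APL implementation. Like the definition, it keeps items in order of first occurrence, and it uses a linear `in` test so that nested lists work too.

```python
def nub(x):
    """Unique items of x, in the order of their first occurrence."""
    result = []
    for item in x:
        if item not in result:
            result.append(item)
    return result


print(nub([9, 6, 8, 6, 9, 7, 8, 9, 6]))   # [9, 6, 8, 7]
print("".join(nub("raccoon")))            # racon
print(nub([[9, 2, 3], [4, 5], [9, 2, 3], [6, 7, 8], [4, 5], [9, 2, 3]]))
# [[9, 2, 3], [4, 5], [6, 7, 8]]
```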

Bob A

-- 4th.CoSy Document your life in open Forth APL --

--
On 2016-04-26 09:22, Samuel Falvo II wrote:

On Tue, Apr 26, 2016 at 12:39 AM, Brad Nelson <bnels123@gmail.com> wrote:

I've long harbored the suspicion that the "JSON" types are a sort of strange
subset of what would better be encompassed by the types in an array language

I may be pandering to semantics here, but I actually disagree with
this; rather than arrays, JSON types are actually associative arrays.
This is a key distinction which languages like K, J, and APL do not
natively make (associative arrays are constructs built on top of
primitive arrays).  Lua is actually a better language for this, since
their "table" (an associative array type) is the one and only
structured data type that is offered.

To be clear, by "array" (without any other qualifiers), I'm referring
to the classical definition, which is an aggregation of like-typed
entities, arranged contiguously in memory.

An associative array, then, is a data structure which associates an
arbitrary key with an arbitrary datum, each of which can be of
arbitrary type.  Thus, an "array" is a proper subset of "associative
array", where keys are strictly integers (since you're indexing by
relative record number).  The data the array stores can be of any
type, of course, provided all elements are of the same type.

K differs from most array-based languages in that its "arrays" are
actually "lists" (in the abstract, not necessarily in implementation),
since elements can be of any type whatsoever (e.g., not necessarily
uniform).  K, then, is closer to Scheme than to APL in this respect.

While K may exhibit some incredibly expressive mechanisms for
processing JSON, I'm sure that you'll find greater difficulty in
working with JSON in a more rigorously typed language like J.  To make
anything like an associative array in J, you need, at a minimum, a
pair of vectors (one being the vector of keys, one being the data
vector), itself expressed as an array of boxed references.  The data
vector, then, would need to be a vector of *boxed* data (otherwise J
enforces the constraint that all data remain the same type).  In
summary, a boxed vector of length two, containing a vector of keys,
the other a vector of boxed data items.

To gain access to any item in this associative array, you must first
locate the index at which you'll find the data, and then use that
index to dereference into the data vector.  It's easy enough to bundle
this into a library of course; but, just be aware, it is not intrinsic
in the language design itself.

Or, in Lua, you'd just use a table and be done with it.  ;)
