Discussion:
GC implementation?
(too old to reply)
Mike
2004-10-01 12:37:38 UTC
Permalink
I have been researching lisp/scheme internals and last night
found what seems to be an interesting discussion about a
GC implementation. This implementation seems to be from the
perspective of a C application not knowing/caring about memory
allocation. With a lisp system, the lisp does not care what
objects are in or not in use, but the underlying internals
(C) have a better idea of what's going on, yes?

What's a typical implementation? Anyone have some links?

http://www.hpl.hp.com/personal/Hans_Boehm/gc

Mike
Nic Ferrier
2004-10-01 13:05:03 UTC
Permalink
Post by Mike
I have been researching lisp/scheme internals and last night
found what seems to be an interesting discussion about a
GC implementation. This implementation seems to be from the
perspective of a C application not knowing/caring about memory
allocation. With a lisp system, the lisp does not care what
objects are in or not in use, but the underlying internals
(C) have a better idea of what's going on, yes?
What's a typical implementation? Anyone have some links?
http://www.hpl.hp.com/personal/Hans_Boehm/gc
I am in the middle of building a scheme implementation that uses the
Boehm collector for GC.

The idea is that the C code that implements the scheme just uses
Boehm's malloc to allocate memory for the objects.

My environment model then ensures that objects are in scope in C when
they are in scope in scheme. This means having a simple chained
environment structure and an evaluator that is a single C function.

It works quite well.
--
Nic Ferrier
http://www.tapsellferrier.co.uk
Mikael Brockman
2004-10-01 13:20:25 UTC
Permalink
Post by Nic Ferrier
Post by Mike
I have been researching lisp/scheme internals and last night
found what seems to be an interesting discussion about a
GC implementation. This implementation seems to be from the
perspective of a C application not knowing/caring about memory
allocation. With a lisp system, the lisp does not care what
objects are in or not in use, but the underlying internals
(C) have a better idea of what's going on, yes?
What's a typical implementation? Anyone have some links?
http://www.hpl.hp.com/personal/Hans_Boehm/gc
I am in the middle of building a scheme implementation that uses the
Boehm collector for GC.
The idea is that the C code that implements the scheme just uses
Boehm's malloc to allocate memory for the objects.
My environment model then ensures that objects are in scope in C when
they are in scope in scheme. This means having a simple chained
environment structure and an evaluator that is a single C function.
It works quite well.
Do full continuations work?
Nic Ferrier
2004-10-01 13:48:52 UTC
Permalink
Post by Mikael Brockman
Do full continuations work?
I haven't implemented them yet... but I think they will.

When a continuation is captured the environment is captured too, as
long as the continuation object is in scope the environment is in
scope so objects are not cleared.
--
Nic Ferrier
http://www.tapsellferrier.co.uk
Ray Dillinger
2004-10-01 16:31:06 UTC
Permalink
Post by Nic Ferrier
I am in the middle of building a scheme implementation that uses the
Boehm collector for GC.
The idea is that the C code that implements the scheme just uses
Boehm's malloc to allocate memory for the objects.
I made an incremental, precise, non-relocating, three-color collector,
in about 400 lines of profusely commented C code. I'm pleased with it;
it handles soft pointers gracefully and it's reasonably unintrusive.

Actually, I guess technically it's a five-color collector. Remember
the basic three-color algorithm? You start by putting your root items
into a queue and calling it the "tovisit" queue. Everything that
survived the previous round of GC is live data, and goes in the
"unvisited" queue for the current round. Then one at a time, you
take items off the tovisit queue, look at their pointers, and if
they point at anything in the unvisited queue, you move the thing
they point at to the "tovisit" queue. Then you move the things
whose pointers you've been iterating over to the "visited" queue
to get them out of the way. When you come to the end of the
tovisit queue, anything that is still in the "unvisited" queue
is garbage so you reap it and start over by moving root items
into a new "tovisit" queue. So, classically speaking, there are
three colors at any moment; visited, unvisited, and tovisit.

I added two extra colors. The first is "root," and I put root
elements in it instead of back in the visited queue after visiting
them. This means when I'm ready for the new generation, the
rootlist becomes the tovisit list and I don't have to worry
about separating them out. The second extra color is the "stale"
list, where garbage is held for one generation before being
reaped. When I'm taking things off the "tovisit" list, I can
check its soft pointers for references into the "stale" list
and set them to NULL. This way I can guarantee that soft pointers
in live data will be set to NULL before the things they point at
get reaped.

I think soft pointers are important. The problem I came up with
that I couldn't get around without them was symbols. If you don't
have soft pointers, you can't reap any symbol. Ever. Or at least
I couldn't figure out how.

The problem is that you have to have some way of finding pointers
to the symbols so you can guarantee eq? behavior and resolve
variable references; but if you store ordinary pointers to them
in a structure the GC knows about, then none of them can ever be
reaped. The solution is a root-element table of soft pointers
to symbols. If there's nothing that refers to a symbol, it goes
stale, then soft pointers to it get set NULL, then it gets
reaped.

Bear
Nic Ferrier
2004-10-01 18:56:27 UTC
Permalink
Post by Ray Dillinger
I think soft pointers are important. The problem I came up with
that I couldn't get around without them was symbols. If you don't
have soft pointers, you can't reap any symbol. Ever. Or at least
I couldn't figure out how.
I think you're right.

But my strategy is to ignore that problem: all symbols are interned
forever in my implementation.

This is probably not that bad a strategy, but my scheme is intended
for highly available web applications so I might have to rethink
it. I'll wait and see what symbol usage is like in real applications.


FWIW I do agree that soft pointers (or weak references as certain
fashionable languages call them) are a really useful thing. I've been
wondering whether there was any research into their use in the LISP
area because they seem to be problematic (this symbol issue being one
problem).
--
Nic Ferrier
http://www.tapsellferrier.co.uk
Ray Dillinger
2004-10-02 21:34:49 UTC
Permalink
... my strategy is to ignore that problem: all symbols are interned
forever in my implementation.
This is probably not that bad a strategy, but my scheme is intended
for highly available web applications so I might have to rethink
it. I'll wait and see what symbol usage is like in real applications.
I'm doing a lot of natural-language, and I read words as symbols
in free text. So it becomes very important to me.
FWIW I do agree that soft pointers (or weak references as certain
fashionable languages call them) are a really useful thing. I've been
wondering whether there was any research into their use in the LISP
area because they seem to be problematic (this symbol issue being one
problem).
I was surprised how simple it was to extend my original 3-color
collector to include soft pointers. I think it added only about
12 lines of code, although it took half of a day to figure out
exactly what to put in those lines and debug ...

Since I implemented them for symbols, I used them to achieve proper
eq? behavior on bignums and bigchars too; in my lisp a character
can be a boxed value if it's several codepoints (a bigchar), and
the standard requires eq? to work on characters. So an index to
boxed characters allows a pointer to just point at an existing
boxed character if one already exists, and eq? gets the right
answer on characters without having to do a typecheck, pointer
follow, and data comparison. The same technique (in fact, the
same code with a different typetag argument) gets me eq? on
bignums "for free".

Bear
Marcin 'Qrczak' Kowalczyk
2004-10-02 21:52:41 UTC
Permalink
Post by Ray Dillinger
the standard requires eq? to work on characters.
It doesn't.
--
__("< Marcin Kowalczyk
\__/ ***@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/
Nic Ferrier
2004-10-02 23:35:51 UTC
Permalink
Post by Ray Dillinger
... my strategy is to ignore that problem: all symbols are interned
forever in my implementation.
This is probably not that bad a strategy, but my scheme is intended
for highly available web applications so I might have to rethink
it. I'll wait and see what symbol usage is like in real applications.
I'm doing a lot of natural-language, and I read words as symbols
in free text. So it becomes very important to me.
What size constraints are you hitting? English only has something like
a million words (including scientific ones). And English has a *big*
lexicon relative to other languages. Are you coping with low memory
systems?

You've got me thinking... Maybe strategies from spell checkers would
work for symbol storage; put some indexes around the symbol hashtable
for example.

Hmmm... thanks. That's useful food for thought.
Post by Ray Dillinger
Since I implemented them for symbols, I used them to achieve proper
eq? behavior on bignums and bigchars too; in my lisp a character
can be a boxed value if it's several codepoints (a bigchar), and
the standard requires eq? to work on characters. So an index to
boxed characters allows a pointer to just point at an existing
boxed character if one already exists, and eq? gets the right
answer on characters without having to do a typecheck, pointer
follow, and data comparison. The same technique (in fact, the
same code with a different typetag argument) gets me eq? on
bignums "for free".
And then you allocate a new box on assignment?
--
Nic Ferrier
http://www.tapsellferrier.co.uk
Ray Dillinger
2004-10-03 06:08:11 UTC
Permalink
Post by Nic Ferrier
Post by Ray Dillinger
So an index to
boxed characters allows a pointer to just point at an existing
boxed character if one already exists, and eq? gets the right
answer on characters without having to do a typecheck, pointer
follow, and data comparison. The same technique (in fact, the
same code with a different typetag argument) gets me eq? on
bignums "for free".
And then you allocate a new box on assignment?
Yup. Numbers and characters are immutable; to give them a
new value is to give them a new location. Once nothing is
still referring to the datum at its original location, it
gets reaped.

Bear
Siegfried Gonzi
2004-10-02 09:08:38 UTC
Permalink
Post by Mike
What's a typical implementation? Anyone have some links?
http://www.hpl.hp.com/personal/Hans_Boehm/gc
Mike
As far as I know: Bigloo Scheme compiler uses also Hans Boehm's garbage
collector. I am not sure what you are aiming at, but the Boehm gc has
the big advantage that it runs on a lot of platforms due to basic C
layout. I use Bigloo on: Linux, Sun OS and Mac OSX.

Fensterbrett
Anton van Straaten
2004-10-02 17:48:24 UTC
Permalink
Post by Mike
I have been researching lisp/scheme internals and last night
found what seems to be an interesting discussion about a
GC implementation. This implementation seems to be from the
perspective of a C application not knowing/caring about memory
allocation. With a lisp system, the lisp does not care what
objects are in or not in use, but the underlying internals
(C) have a better idea of what's going on, yes?
Other than the garbage collection system itself, there's not really a strong
reason for the "underlying internals [to] have a better idea of what's going
on". When objects are put under the control of the garbage collector,
there's no reason for even a low-level implementation to know or care what's
going on with those objects, and it would be a complex business if it were
to try to interfere and second-guess the garbage collector. The only code
that definitely "knows what's going on" with object memory is the garbage
collector itself, and the only sense in which it knows what's going on is
with respect to the algorithm it follows to trace active objects, classify
them according to its algorithm, and collect the dead ones.

BTW, remember that the underlying internals aren't always C. There are
Scheme implementations which compile to native code, and which are
implemented in Scheme, for example. Even those which compile Scheme to C
don't end up with C code that behaves in the way you might expect from an
implementation hand-written in C.
Post by Mike
What's a typical implementation? Anyone have some links?
A typical implementation of Scheme, or garbage collection in Scheme? I'm
not sure there is such a thing, either way. :) You could classify Schemes
by their approach to garbage collection, though. Here's a list from memory,
not comprehensive, and the categories are not all disjoint:

* Multiple GC support: some Schemes support more than one garbage
collector - e.g. PLT and Larceny
* Use the Boehm-Demers-Weiser conservative collector (which you referenced):
e.g. PLT and Bigloo
* Java-based Schemes: rely on JVM garbage collection - e.g. SISC, Kawa,
Bigloo (Java version)
* Cheney on the MTA [1]: e.g. Chicken, Larceny (?)
* Garbage collector code implemented in Scheme: e.g. Scheme48, Larceny, MIT
* "Ordinary" precise garbage collectors implemented in C: e.g. Guile (?),
PLT (optional), others...

Other than Cheney on the MTA, the above list doesn't cover the various GC
strategies, like mark and sweep, stop and copy, and generational. You can
find all of those and more in the various Scheme implementations.

Anton

[1] http://home.pipeline.com/~hbaker1/CheneyMTA.html
William D Clinger
2004-10-02 23:19:54 UTC
Permalink
Post by Anton van Straaten
You could classify Schemes
by their approach to garbage collection, though. Here's a list from memory,
* Multiple GC support: some Schemes support more than one garbage
collector - e.g. PLT and Larceny
e.g. PLT and Bigloo
* Java-based Schemes: rely on JVM garbage collection - e.g. SISC, Kawa,
Bigloo (Java version)
* Cheney on the MTA [1]: e.g. Chicken, Larceny (?)
* Garbage collector code implemented in Scheme: e.g. Scheme48, Larceny, MIT
* "Ordinary" precise garbage collectors implemented in C: e.g. Guile (?),
PLT (optional), others...
Other than Cheney on the MTA, the above list doesn't cover the various GC
strategies, like mark and sweep, stop and copy, and generational. You can
find all of those and more in the various Scheme implementations.
Anton
Some corrections: Larceny currently supports four and a half
garbage collectors (I'm not sure whether anyone has succeeded
in a recent build of Larceny with the BDW collector, but it
used to work). Larceny does not support Cheney on the MTA,
and its collectors are implemented in C, not Scheme.

Some additions: Chez Scheme and Larceny offer precise generational
collectors. Gambit may also, and has some kind of soft real-time
collector as well.

For general information on garbage collection, see
http://www.cs.kent.ac.uk/people/staff/rej/gc.html

Will
Anton van Straaten
2004-10-03 00:48:58 UTC
Permalink
Post by William D Clinger
Some corrections: Larceny currently supports four and a half
garbage collectors (I'm not sure whether anyone has succeeded
in a recent build of Larceny with the BDW collector, but it
used to work). Larceny does not support Cheney on the MTA,
and its collectors are implemented in C, not Scheme.
Some additions: Chez Scheme and Larceny offer precise generational
collectors. Gambit may also, and has some kind of soft real-time
collector as well.
For general information on garbage collection, see
http://www.cs.kent.ac.uk/people/staff/rej/gc.html
Thanks.

Regarding Larceny & Cheney on the MTA, I jumped to a (tentative) conclusion
based on the appearance of a file named cheney.c in Larceny 1.01a (which
also happened to be the largest C file in the Rts/Sys directory). I also
assumed that at least some of Larceny's GC had to be implemented in Scheme
because all of those C files seemed so small. I forgot to factor in the
expertise of the author(s).

I also neglected to mention that of course, using ordinary GC'd Scheme to
implement its own GC is not really feasible. Scheme 48 is worth mentioning
here, since it addresses this by implementing its GC in its restricted
dialect PreScheme, which compiles to C, and is not itself GC'd.

Anton
Taylor Campbell
2004-10-03 19:14:02 UTC
Permalink
Post by Anton van Straaten
Regarding Larceny & Cheney on the MTA, I jumped to a (tentative) conclusion
based on the appearance of a file named cheney.c in Larceny 1.01a (which
also happened to be the largest C file in the Rts/Sys directory). I also
assumed that at least some of Larceny's GC had to be implemented in Scheme
because all of those C files seemed so small. I forgot to factor in the
expertise of the author(s).
I'm guessing Larceny's cheney.c is an implementation of Cheney's
breadth-first copying collector algorithm. (I sha'n't investigate the
cheney.c file further, for I'm loathe to read C.) Cheney on the MTA is
really a compilation technique, not a garbage collection technique; the
only relevance it has to garbage collection is that it uses a copying
stack garbage collector. The 'Cheney' probably refers to Cheney's
garbage collector algorithm. (Another copying collector algorithm, for
completeness, is Clark's, which uses a depth-first pointer traversal.
T used this algorithm before it was switched to Cheney's.)
Post by Anton van Straaten
I also neglected to mention that of course, using ordinary GC'd Scheme to
implement its own GC is not really feasible. Scheme 48 is worth mentioning
here, since it addresses this by implementing its GC in its restricted
dialect PreScheme, which compiles to C, and is not itself GC'd.
Some more corrections: it _is_ possible for a Scheme to implement its
own GC in itself. This was done in, for example, T3 (for both the
Clark & Cheney algorithms). All it required was sufficiently carefully
written code and an adequately optimizing compiler (Orbit, in the case
of T3). Even Scheme48's GC -- and the rest of the VM -- was written in
Scheme before Kelsey even wrote the Pre-Scheme compiler; it was just
written in a very careful style. (I believe it was originally run with
T and Pseudoscheme, though Richard Kelsey & Jonathan Rees could give
undoubtedly more accurate information here.)
Anton van Straaten
2004-10-04 02:46:50 UTC
Permalink
Post by Taylor Campbell
Post by Anton van Straaten
Regarding Larceny & Cheney on the MTA, I jumped to a
(tentative) conclusion based on the appearance of a file
named cheney.c in Larceny 1.01a (which also happened
to be the largest C file in the Rts/Sys directory). I also
assumed that at least some of Larceny's GC had to be
implemented in Scheme because all of those C files
seemed so small. I forgot to factor in the expertise of
the author(s).
I'm guessing Larceny's cheney.c is an implementation of
Cheney's breadth-first copying collector algorithm.
That makes sense.
Post by Taylor Campbell
Cheney on the MTA is really a compilation technique,
not a garbage collection technique; the only relevance it
has to garbage collection is that it uses a copying stack
garbage collector. The 'Cheney' probably refers to Cheney's
garbage collector algorithm.
Well, "Cheney" in "Cheney on the MTA" refers to "a Cheney-style copying
garbage collection scheme". I agree that Cheney on the MTA is not primarily
a GC technique, although it seems to me to be rather inextricable from the
GC technique it uses. In theory, you could use the approach with any
CPS-compiled code, even if the target language were other than C; but the
garbage collection technique would probably remain the same.
Post by Taylor Campbell
Some more corrections: it _is_ possible for a Scheme to implement its
own GC in itself. This was done in, for example, T3 (for both the
Clark & Cheney algorithms). All it required was sufficiently carefully
written code and an adequately optimizing compiler (Orbit, in the case
of T3). Even Scheme48's GC -- and the rest of the VM -- was written in
Scheme before Kelsey even wrote the Pre-Scheme compiler; it was just
written in a very careful style.
I wrote "not feasible" as opposed to "not possible", thinking that even if
it were possible, it would probably be impractical, e.g. difficult to reason
about and maintain. I wasn't aware of the T3 and early Scheme 48 examples.
However, the fact that no currently active implementation does this makes me
think I might have a point. :)

Anton

Continue reading on narkive:
Loading...