Perl 5 Internals
This series contains material adopted from the Netizen Perl Training Fork, by kind permission of Kirrily Robert.
Preliminaries
Welcome to NetThink's Perl 5 Internals training course. This is a three-hour course which provides a hands-on introduction to how the perl interpreter works internally, how to go about testing and fixing bugs in the interpreter, and what the internals are likely to look like in the future of Perl, Perl 6.
Course Outline
- Development Structure
- Parts of the Interpreter
- Internal Variables
- The Lexer and the Parser
- Fundamental operations
- The Runtime Environment
- The Perl Compiler
- Hacking on perl
- Perl 6 Internals
Assumed Knowledge
On this course, it is assumed that you will:
- be able to program Perl to at least an "intermediate" level; completing NetThink's "Intermediate Perl" course is regarded as an adequate standard.
- have some familiarity with the C programming language.
- be able to use a compiler and, if necessary, symbolic debugger, without prompting.
Note
Knowledge of XS is not required, but is beneficial.
Objectives
The aim of this course is to give you not just an understanding of the workings of the perl interpreter, but also the means to investigate more about it, to analyze and solve bugs in the Perl core, and to take part in the Perl development process.
The course notes
These course notes contain material which will guide you through the topics listed above, as well as appendices containing other useful information.
The following typographic conventions are used in these notes:
System commands appear in this typeface
Literal text which you should type in to the command line or editor appears as monospaced font.
Keystrokes which you should type appear like this: ENTER. Combinations of keys appear like this: CTRL-D
Program listings and other literal listings of what appears on the screen appear in a monospaced font like this.
Parts of commands or other literal text which should be replaced by your own specific values appear like this.
<advanced>
Notes which are marked "Advanced" are for those who are racing ahead or who already have some knowledge of the topic at hand. The information contained in these notes is not essential to your understanding of the topic, but may be of interest to those who want to extend their knowledge.
</advanced>
<readme>
Notes marked with "Readme" are pointers to more information which can be found in your textbook or in online documentation such as manual pages or websites.
</readme>
Perl Development Structure
The aim of this section is to familiarize you with the process by which the Perl interpreter is developed and maintained. Most internals hacking is carried out on the "bleeding edge" of the Perl sources, and so you need to understand what these are and how to get them.
It's also important to understand the structure of the Perl development community; how it's organized, and how it works.
Perl Versioning
Perl has used two version numbering schemes. Versions before 5.6.0 used a number of the form x.yyyy_zz: x was the major version number (Perl 4, Perl 5), y was the minor release number, and z was the patchlevel. Major releases represented, for instance, either a complete rewrite or a major upheaval of the internals; minor releases sometimes added non-essential functionality; and releases changing the patchlevel were primarily to fix bugs. Releases where z was 50 or more were unstable developers' releases working towards the next minor release.
Since 5.6.0, Perl has used the more standard open source version numbering system: version numbers are of the form x.y.z; releases where y is even are stable releases, and releases where it is odd are part of the development track.
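As a rough illustration of the x.y.z rule, the following C sketch parses a version string and reports whether the minor number is odd. The function name perl_release_is_development is hypothetical and is not part of the Perl sources; it rejects anything that is not of the x.y.z form, including the older x.yyyy_zz numbers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the Perl sources.
 * Parses a version string of the form "x.y.z" and returns 1 for a
 * development release (odd y), 0 for a stable release (even y), or -1
 * if the string is not of that form. */
int perl_release_is_development(const char *version)
{
    int major, minor, patch;
    char trailing;

    assert(version != NULL);
    /* The trailing %c must NOT match: it catches junk after the patchlevel. */
    if (sscanf(version, "%d.%d.%d%c", &major, &minor, &patch, &trailing) != 3)
        return -1;
    if (major < 0 || minor < 0 || patch < 0)
        return -1;
    return minor % 2;
}
```

Under this sketch, "5.8.0" is classed as stable and "5.9.1" as development, while "5.005_03" is rejected because it belongs to the older scheme.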
The Development Tracks
Perl development has four major aims: extending portability, fixing bugs, optimizations, and adding language features. Patches to Perl are usually made against the latest copy of the development release; the very latest copy, stored in the Perl repository (see the section called “The Perl Repository” below), is usually called `bleadperl'.
The bleadperl eventually becomes the new minor release, but patches are also picked up by the maintainer of the stable release for inclusion. While there are no hard and fast rules, and everything is left to the discretion of the maintainer, in general, patches which are bug fixes or address portability concerns (which include taking advantage of new features in some platforms, such as large file support or 64 bit integers) are merged into the stable release as well, whereas new language features tend to be left until the next minor release. Optimizations may or may not be included, depending on their impact on the source.
Perl 5 Porters
In February 2001, there were nearly 200 individuals involved in the development of Perl; these developers, or `porters', communicate through the use of the perl5-porters mailing list; if you are planning to get involved in helping to develop or maintain Perl, a subscription to this list is essential.
You can subscribe by sending an email to perl5-porters-subscribe@perl.org; you'll be asked to send an email to confirm, and then you should start receiving mail from the list. To send mail to the list, address it to perl5-porters@perl.org; you don't have to be subscribed to post, and the list is not moderated. If, for whatever reason, you decide to unsubscribe, simply mail perl5-porters-unsubscribe@perl.org.
The list usually receives between 200 and 400 mails a week. If this is too much for you, you can subscribe instead to a daily digest service by mailing perl5-porters-digest-subscribe@perl.org. Alternatively, I write a weekly summary of the list, published on the Perl home page.
There is also a perl5-porters FAQ which explains a lot of this, plus more about how to behave on P5P and how to submit patches to Perl.
Pumpkins and Pumpkings
Development is very loosely organised around the release managers of the stable and the development tracks; these are the two ``pumpkings''.
Perl development can also be divided up into several smaller sub-systems: the regular expression engine, the configuration process, the documentation, and so on. Responsibility for each of these areas is known as a ``pumpkin'', and hence those who semi-officially take responsibility for them are called ``pumpkings''.
At the time of writing, the Pumpking for 5.8.x is Nicholas Clark, and the Pumpking for 5.9.x is Rafael Garcia-Suarez.
You're probably wondering why the silly titles. It stems from the days before Perl was kept under version control, and people had to manually `check out' a chunk of the Perl source to avoid conflicts by announcing their intentions to the mailing list; while they were discussing what this should be called, one of Chip Salzenburg's co-workers told him about a system they had used for preventing two people using a tape drive at once: there was a stuffed pumpkin in the office, and nobody could use the drive unless they had the pumpkin.
The Perl Repository
Now Perl is kept in a version control system called Perforce, which is hosted by ActiveState, Inc. There is no public access to the system itself, but various methods have been devised to allow developers near-realtime access.
Firstly, there is the Archive of Perl Changes. This web site contains both the current state of all the maintained Perl versions, and also a directory of changes made to the repository.
Since it's a little inconvenient to keep up to date using HTTP, the directories are also available via the software synchronisation protocol rsync. If you have rsync installed, you can synchronise your working directory with the bleeding-edge Perl tree (usually called `bleadperl') in the repository by issuing the command
% rsync -avz rsync://public.activestate.com/perl-current/ .
There are also periodic snapshots of bleadperl released by the development pumpking, particularly when some important change happens. These are usually available from a variety of URLs, and always from ftp://ftp.funet.fi/pub/languages/perl/snap/.
Finally, there is a repository browser available at http://public.activestate.com/cgi-bin/perlbrowse which can tell you the current status of individual files, as well as provide an annotated `blame log' cross-referencing each line in a file to the latest patch to affect it.
Summary
- Perl versions are numbers of the form x.y.z, where y is odd for development and even for stable versions.
- Perl development takes place on the perl5-porters mailing list
Exercises
- Obtain a copy of the development sources to Perl from CPAN. Unpack the archive, and familiarize yourself with the layout of its contents.
- Use rsync to update the copy to bleadperl. How many bytes changed?
- Subscribe to perl5-porters, if you haven't already done so. Spend a few moments reading through the FAQ. If you have already subscribed, read through back issues of the summaries.
Parts of the Interpreter
This chapter will take you through the various parts of the perl interpreter, giving you an overview of its operation and the stages that a Perl program goes through when executed. By the end of this chapter you should be comfortable with the structure of the perl source and be able to locate functions and routines in the source tree based on a brief description of their operation.
Top Level Overview
perl is not exactly an interpreter and it's not exactly a compiler: it's a bytecode compiler. It first compiles the input source code to an internal representation, or bytecode, and then executes the operations that the bytecode specifies on a virtual machine.
<advanced>
How does this differ from, say, Java? Java's virtual machine is designed to represent an idealised version of a computer's processor. In Perl's case, however, the individual operations that can be performed are considerably higher-level. For instance, a regular expression match is a single "instruction" in Perl's virtual machine.
Perl's virtual machine also differs from a real hardware processor in where it keeps intermediate results: a hardware processor stores its calculations in registers, whereas Perl uses a stack to co-ordinate and communicate results between operations.
</advanced>
The name we give to the first stage is "parsing", although, as we'll see, parsing refers to a specific operation. The input to this stage is your Perl source code; the output is a tree data structure which represents what that code "means".
One of the nodes in this tree is designated the "start" node; every node will have an operation to perform, and a pointer to the node that the interpreter must execute next.
Hence, the second phase of the operation is to execute the start node and follow the chain of pointers around the tree, executing each operation in the correct order. In later parts of this course, we'll examine exactly how the operations are executed and what they mean.
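The execution phase described above can be sketched in a few lines of C. This is a toy model only: the type toy_op, its fields and the functions below are hypothetical names chosen for illustration, and the real op structures in the Perl sources carry far more information. Each node holds an operation to perform and a pointer to the node to execute next, and the main loop simply follows that chain from the start node.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an op tree's execution order; all names are hypothetical
 * and much simpler than the real structures in the Perl sources. */
typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*run)(toy_op *self, int *acc); /* the operation to perform */
    toy_op *next;                           /* node to execute afterwards */
    int     arg;                            /* operand for this toy example */
};

/* Each operation does its work and returns the next op to execute. */
static toy_op *op_add(toy_op *self, int *acc) { *acc += self->arg; return self->next; }
static toy_op *op_mul(toy_op *self, int *acc) { *acc *= self->arg; return self->next; }

/* The main loop: begin at the start node and follow the chain of
 * pointers until an operation returns NULL. */
int toy_runops(toy_op *start)
{
    int acc = 0;
    toy_op *op = start;

    assert(start != NULL);
    while (op != NULL)
        op = op->run(op, &acc);
    return acc;
}

/* Builds the chain (0 + 2) * 5 + 1 and runs it. */
int toy_demo(void)
{
    toy_op last   = { op_add, NULL,    1 };
    toy_op middle = { op_mul, &last,   5 };
    toy_op first  = { op_add, &middle, 2 };
    return toy_runops(&first);
}
```

Letting each operation return its successor, rather than having the loop read the next pointer itself, means an operation could in principle choose a different successor, which is one way to picture how branching fits into such a loop.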
First, however, we will examine the various distinct areas of the Perl source tree.
The Perl Library
The most approachable part of the source code, for Perl programmers, is the Perl library. This lives in lib/, and comprises all the standard, pure Perl modules and pragmata that ship with perl.
There are both Perl 5 modules and unmaintained Perl 4 libraries, shipped for backwards compatibility. In Perl 5.6.0 and above, the Unicode tables are placed in lib/unicode.
The XS Library
In ext/, we find the XS modules which ship with Perl. For instance, the Perl compiler B (see Chapter 7, The Perl Compiler) can be found here, as can the DBM interfaces. The most important XS module here is DynaLoader, the dynamic loading interface which allows the runtime loading of every other XS module.
As a special exception, the XS code for the methods in the UNIVERSAL, Tie::Hash::NamedCapture and Internals classes, along with some methods of the utf8, re and version classes, can be found in universal.c.
The IO Subsystem
Recent versions of Perl come with a completely new standard IO implementation, PerlIO. This allows for several "layers" to be defined through which all IO is filtered, similar to the line disciplines mechanism in sfio. These layers interact with modules such as PerlIO::Scalar, also in the ext/ directory.
The IO subsystem is implemented in perlio.c and perlio.h. Declarations for defining the layers are in perliol.h, and documentation on how to create layers is in pod/perliol.pod.
Perl may be compiled without PerlIO support, in which case there are a number of abstraction layers to present a unified IO interface to the Perl core. perlsdio.h aliases ordinary standard IO functions to their PerlIO names, and perlsfio.h does the same thing for the alternate IO library sfio.
The other abstraction layer is the "Perl host" scheme in iperlsys.h. This is confusing. The idea is to reduce process overhead on Win32 systems by having multiple Perl interpreters access all system calls through a shared "Perl host" abstraction object. There is an explanation of it in perl.h, but it is best avoided.
The Regexp Engine
Another area of the Perl source best avoided is the regular expression engine. This lives in re*.*. The regular expression matching engine is, roughly speaking, a state machine generator. Your match pattern is turned into a state machine made up of various match nodes - you can see these nodes in regcomp.sym. The compilation phase is handled by regcomp.c, and the state machine's execution is performed in regexec.c.
<advanced><title>Did You Know?</title>
The regular expression compiler and interpreter are actually switchable; it's possible to remove Perl's default regular expression engine and insert one's own custom engine. (This is done by changing the value of the global variables PL_regcompp and PL_regexecp to be function pointers to the required routines.) In fact, that's exactly what the re module does.
</advanced>
The Parser and Tokeniser
As mentioned above, the first stage in Perl's operation is to "understand" your program. This is done by a joint effort of the tokeniser and the parser. The tokeniser is found in toke.c, and the parser in perly.c (although you're far, far better off looking at the YACC source in perly.y).
The job of the tokeniser is to split up the input into meaningful chunks, or tokens, and also to determine what type of thing they represent - a Perl keyword, a variable, a subroutine name, and so on. The job of the parser is to take these tokens and turn them into "sentences", understanding their relative meaning in context. We'll examine their operation in more detail in Chapter 5, The Lexer and the Parser.
Variable Handling
Perl's data types - scalars, arrays, hashes, and so on - are far more flexible than C's, and hence have to be implemented quite carefully in terms of C equivalents. The code for handling arrays is in av.*, hashes are in hv.* and scalars are in sv.*. See also Chapter 4, Internal Variables.
Run-time Execution
What about the code to Perl's built-ins - print, foreach and the like? These live in pp.*, and will be examined in much more detail in the section called “PP Code”. Some of the functionality is shelled out to doio.c.
The actual main loop of the interpreter is in run.c.
Support Functions
There are a number of routines which help out to make the Perl internals easier to program. For instance, scope.[ch] contains functions which allow you to save away and restore values on a stack. locale.c handles locale functions, malloc.c is a Perl-specific memory allocation library, utf8.c handles all the Unicode manipulation, numeric.c contains many handy numeric functions and util.c has various other useful things.
Testing
Every aspect of Perl's operation has a related test, and these test files live in the t/ directory. Tests for individual library and XS modules are slowly being relocated to lib/ and ext/ respectively, but at time of writing, there are over 23,000 separate tests in over 400 test files.
On a related note, functions for debugging Perl itself are to be found in deb.c and dump.c. The distinction is that functions in deb.c are typically accessible from the -D flag on the Perl command line, whereas things in dump.c may need to be used from a source-level debugger.
Other Utilities
Perl ships with a host of utilities: from the sed-, awk- and find-to-Perl translators in x2p/, to the various utilities such as h2xs and perldoc in utils/.
Documentation
The POD documentation that ships with Perl lives in pod/, along with some of the utilities for manipulating POD documents.
Summary
We've examined the layout of the Perl source as well as an overview of the Perl interpreter. Perl runs programs in two stages: firstly reading in the source and using the tokeniser and parser to "understand" it, and then running over a series of operations to execute the program.
Exercises
- What and where is the function that implements the tr/// operator? Be as precise as you can.
- How does the way Perl executes a program differ from the way the Unix shell executes one? Contrast shell, Perl, Java and C.
- Without looking, where do you think the Perl_keyword function would be? Find it, and explain what it does.
- Several files in the Perl source tree are generated from other files. Look at all the *.pl files in the root of the Perl source tree, and find out what each file is responsible for generating, and from what sources. Be extremely careful when looking at embed.pl.
Internal Variables
Perl's variables are a lot more flexible than C's - C is a strongly-typed language, whereas Perl is weakly typed. This means that Perl's variables may be used as strings, as integers, as floating point values, at will.
Hence, when we're representing values inside Perl, we need to implement some special types. This chapter will examine how scalars, arrays and hashes are represented inside the interpreter.
Basic SVs
SV stands for Scalar Value, and it's the basic form of representing a scalar. There are several different types of SV, but all of them have certain features in common.
Basics of an SV
Let's take a look at the definition of the SV type, in sv.h in the Perl core:
struct STRUCT_SV {
    void*   sv_any;     /* pointer to something */
    U32     sv_refcnt;  /* how many references to us */
    U32     sv_flags;   /* what we are */
};
Every scalar, array and hash that Perl knows about has these three fields: "something", a reference count, and a set of flags. Let's examine these separately:
sv_any
This field allows us to connect another structure to the SV. This is the mechanism by which we can change between representing an integer, a string, and so on. The function inside the Perl core which does the change is called sv_upgrade.
As its name implies, this changing is a one-way process; there is no corresponding sv_downgrade. This is for efficiency: we don't want to be switching types every time an SV is used in a different context, first as a number, then a string, then a number again and so on.
Hence the structures we will meet get progressively more complex, building on each other: we will see an integer type, a string type, and then a type which can hold both a string and an integer, and so on.
Reference Counts
Perl uses reference counts to determine when values are no longer used. For instance, consider the following two pieces of code:
{
    my $a;
    $a = 3;
}
Here, the integer value 3, an SV, is assigned to a variable. Remember that variables are simply names for values: if we look up $a, we find the value 3. Hence, $a refers to the value. At this point, the value has a reference count of 1.
At the closing brace, the variable $a goes out of scope; that is to say, the name is destroyed, and the reference to the value 3 is broken. The value's reference count therefore decreases, becoming zero.
Once an SV has a reference count of zero, it is no longer in use and its memory can be freed.
Now our second piece of code:
my $b;
{
    my $a;
    $a = 3;
    $b = \$a;
}
In this case, once we assign a reference to the value into $b, the reference count of our value (the integer 3) increases to 2, as now two variables are able to reach the value.
When the scope ends, the value's reference count decreases as before because $a no longer refers to it. However, even though one name is destroyed, another name, $b, still refers to the value - hence, the resulting reference count is now 1.
Once the variable $b goes out of scope, or a different value is assigned to it, the reference count will fall to zero and the SV will be freed.
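The counting described above can be modelled with a small, self-contained C sketch. All the names here (toy_sv, toy_new, toy_ref, toy_unref) are hypothetical and are not Perl's API; the sketch also simplifies matters by treating both the name $a and the reference held in $b as plain C pointers to the same value.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy value with a reference count; hypothetical names, not Perl's API. */
typedef struct {
    int      value;
    unsigned refcnt;
} toy_sv;

static int toy_live = 0;   /* number of toy_sv values currently allocated */

/* A new value starts with one reference: the name it is assigned to. */
toy_sv *toy_new(int value)
{
    toy_sv *sv = malloc(sizeof *sv);
    assert(sv != NULL);
    sv->value = value;
    sv->refcnt = 1;
    toy_live++;
    return sv;
}

/* Taking a reference (as in $b = \$a) adds one to the count. */
toy_sv *toy_ref(toy_sv *sv) { sv->refcnt++; return sv; }

/* Dropping a name or reference subtracts one; at zero the memory is freed. */
void toy_unref(toy_sv *sv)
{
    assert(sv->refcnt > 0);
    if (--sv->refcnt == 0) {
        free(sv);
        toy_live--;
    }
}

/* Replays the second Perl example; returns the number of live values
 * observed after the inner scope ends. */
int toy_scope_demo(void)
{
    toy_sv *a = toy_new(3);     /* my $a; $a = 3;   refcnt is 1 */
    toy_sv *b = toy_ref(a);     /* $b = \$a;        refcnt is 2 */
    int live_after_scope;

    toy_unref(a);               /* $a goes out of scope: refcnt is 1 */
    live_after_scope = toy_live;
    toy_unref(b);               /* $b goes out of scope: refcnt is 0, freed */
    return live_after_scope;
}
```

The value is still alive after the inner scope ends, because one reference remains, and it is released only when that last reference is dropped.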
Flags
The final field in the SV structure is a flag field. The most important flags are stored in the bottom eight bits, which are used to hold the SV's type - that is, the type of structure which is being attached to the sv_any field.
The second most important flags are those which tell us how much of the information in the structure is relevant. For instance, we previously mentioned that one of the structures can hold both an integer and a string. We could also say that it has an integer "slot" and a string "slot". However, if we alter the value in the integer slot, Perl does not change the value in the string slot; it simply unsets the flag which says that we may use the contents of that slot:
$a = 3;            # Type: Integer            | Flags: Can use integer
... if $a eq "3";  # Type: Integer and String | Flags: Can use integer,
                   #                          |        can use string
$a++;              # Type: Integer and String | Flags: Can use integer
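The same sequence can be mimicked in C. This is a sketch under stated assumptions: the type toy_pviv, the flag names TOY_IOK and TOY_POK and their values are all hypothetical stand-ins, not the definitions in sv.h. It shows only the idea that changing one slot unsets the other slot's flag rather than clearing its contents.

```c
#include <assert.h>
#include <stdio.h>

/* Toy flag bits modelled on the idea of "can use integer" and "can use
 * string"; the names and values are hypothetical, not those in sv.h. */
enum { TOY_IOK = 1 << 0, TOY_POK = 1 << 1 };

typedef struct {
    unsigned flags;     /* which slots currently hold a usable value */
    long     iv;        /* integer slot */
    char     pv[32];    /* string slot */
} toy_pviv;

void toy_set_iv(toy_pviv *sv, long n)
{
    sv->iv = n;
    sv->flags = TOY_IOK;            /* only the integer slot is valid */
}

/* Using the value as a string fills the string slot on demand. */
const char *toy_get_pv(toy_pviv *sv)
{
    if (!(sv->flags & TOY_POK)) {
        assert(sv->flags & TOY_IOK);
        snprintf(sv->pv, sizeof sv->pv, "%ld", sv->iv);
        sv->flags |= TOY_POK;
    }
    return sv->pv;
}

/* Incrementing changes the integer slot and merely unsets the string
 * flag; the stale text stays in the string slot. */
void toy_inc(toy_pviv *sv)
{
    assert(sv->flags & TOY_IOK);
    sv->iv++;
    sv->flags &= ~(unsigned)TOY_POK;
}
```

After toy_set_iv(&sv, 3), a call to toy_get_pv(&sv) and then toy_inc(&sv), the integer slot holds 4 and the string slot still contains the text "3", but only the integer flag remains set.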
We'll see more detailed examples of this later on. First, though, let's examine the various types that can be stored in an SV.
References
A reference, or RV, is simply a C pointer to another SV, as its definition shows:
struct xrv {
    SV *    xrv_rv;     /* pointer to another SV */
};
<advanced>
Hence, the Perl statement $a = \$b is equivalent to the C statements:
sv_upgrade(a, SVt_RV); /* Make sure a is an RV */
((struct xrv *)a->sv_any)->xrv_rv = b; /* sv_any is a void*, so cast it first */
However, the SV fields are hidden behind macros, so an XS programmer or porter would write the above as:
sv_upgrade(a, SVt_RV); /* Make sure a is an RV */
SvRV(a) = b;
</advanced>
Integers
Perl's integer type is not necessarily a C int; it's called an IV, or Integer Value. The difference is that an IV is guaranteed to be large enough to hold a pointer.
<advanced>
Perl uses the macros PTR2IV and INT2PTR to convert between pointers and IVs. The size guarantee means that, for instance, the following code will produce an IV:
$a = \1; $a--; # Reference (pointer) converted to an integer
</advanced>
Let's now have a look at an SV structure containing an IV: the SvIV structure. The core module Devel::Peek allows us to examine a value from the C perspective:
% perl -MDevel::Peek -le '$a=10; Dump($a)'
SV = IV(0x81559b0) at 0x81584f0
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 10
The first line tells us that this SV is of type SvIV. The SV has a memory location of
The value has only one reference to it at the moment, the fact that it is stored in
<advanced> What about </advanced>
This shows the IV slot with its value, the "10" which we assigned to
<advanced>
There's also a sub-type of IVs called UVs which Perl uses where possible; these are the unsigned counterparts of IVs. The flag IsUV is used to signal that a value in an IV slot is actually an unsigned value.
</advanced>
Strings
The next class we'll look at is strings. We can't call them "String Values", because the SV abbreviation is already taken; instead, remembering that a string is a pointer to an array of characters, and that the entry in the string slot is going to be that pointer, we call strings "PVs": Pointer Values.
It's here that we start to see combination types: as well as the SvPV type, we have an SvPVIV which has string and integer slots.
Before we get into that, though, let us examine the SvPV structure, again from sv.h:
struct xpv {
    char *  xpv_pv;     /* pointer to malloced string */
    STRLEN  xpv_cur;    /* length of xpv_pv as a C string */
    STRLEN  xpv_len;    /* allocated size */
};
C's strings have a fixed size, but Perl must dynamically resize its strings whenever the data going into the string exceeds the currently allocated size. Hence, Perl holds both the length of the current contents and the maximum length available before a resize must occur. As with SVs, allocated memory for a string only increases, as the following example shows:
% perl -MDevel::Peek -le '$a="abc"; Dump($a);print; $a="abcde"; Dump($a);print; $a="a"; Dump($a)'
SV = PV(0x814ee44) at 0x8158520
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x815c548 "abc"\0
  CUR = 3
  LEN = 4

SV = PV(0x814ee44) at 0x8158520
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x815c548 "abcde"\0
  CUR = 5
  LEN = 6

SV = PV(0x814ee44) at 0x8158520
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x815c548 "a"\0
  CUR = 1
  LEN = 6
This time, we have a SV whose
The actual pointer, the string, lives at address
However, it is counted for the purposes of allocation: we have allocated 4 bytes to store the string, as reflected by
So what happens if we lengthen the string? As the new length is more than the available space, we need to extend the string.
<advanced>
The macro
#define SvGROW(sv,len) (SvLEN(sv) < (len) ? sv_grow(sv,len) : SvPVX(sv))
</advanced>
After growing the string to accommodate the new value, the value is assigned and the
And what if we shrink the string? Perl does not give up any memory: you can see that
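The grow-but-never-shrink behaviour seen in the dump can be sketched in standalone C. The type toy_pv and the functions toy_setpv and toy_freepv are hypothetical names; the three fields mirror those of struct xpv. The sketch assumes that exactly the required number of bytes is allocated on growth, which matches the figures in the dump above but need not match every Perl build.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy string body modelled on the three fields of struct xpv. The
 * function names are hypothetical, not taken from the Perl sources. */
typedef struct {
    char  *pv;    /* pointer to the allocated buffer */
    size_t cur;   /* length of the contents, excluding the trailing \0 */
    size_t len;   /* allocated size of the buffer */
} toy_pv;

/* Assign a new string, growing the buffer only when it is too small. */
void toy_setpv(toy_pv *s, const char *src)
{
    size_t need = strlen(src) + 1;          /* room for the trailing \0 */

    if (s->len < need) {                    /* same comparison as SvGROW */
        char *grown = realloc(s->pv, need);
        assert(grown != NULL);
        s->pv = grown;
        s->len = need;
    }
    memcpy(s->pv, src, need);
    s->cur = need - 1;                      /* a shorter string leaves len alone */
}

/* Release the buffer and reset the fields. */
void toy_freepv(toy_pv *s)
{
    free(s->pv);
    s->pv = NULL;
    s->cur = s->len = 0;
}
```

Starting from an empty toy_pv and assigning "abc", "abcde" and "a" in turn reproduces the CUR and LEN pairs from the dump: 3 and 4, then 5 and 6, then 1 and 6.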
Now let's see what happens if we use a value as both a number and a string, taking the example in the section called “Flags”:
% perl -Ilib -MDevel::Peek -le '$a=3; Dump($a);print; $a eq "3"; Dump($a);print; $a++; Dump($a)'
SV = IV(0x81559d8) at 0x8158518
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 3

SV = PVIV(0x814f278) at 0x8158518
  REFCNT = 1
  FLAGS = (IOK,POK,pIOK,pPOK)
  IV = 3
  PV = 0x8160350 "3"\0
  CUR = 1
  LEN = 2

SV = PVIV(0x814f278) at 0x8158518
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 4
  PV = 0x8160350 "3"\0
  CUR = 1
  LEN = 2
In order to perform the string comparison, Perl needs to get a string value. It calls
When we change the integer value of the SV by incrementing it by one, Perl updates the value in the IV slot. Since the value in the PV slot is invalidated, the POK flag is turned off. Perl does not remove the value from the PV slot, nor does it downgrade to an SvIV because we may use the SV as a string again at a later time.
There's one slight twist here: if you ask Perl to remove some characters from the beginning of the string, it performs a (rather ugly) optimization called "The Offset Hack". It stores the number of characters to remove (the offset) in the IV slot, and turns on the OOK (offset OK) flag. The pointer of the PV is advanced by the offset, and the CUR and LEN fields are decreased by that many. As far as C is concerned the string starts at the new position; it's only when the memory is being released that the real start of the string is important.
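A minimal C sketch of the offset idea follows. The type toy_ook_pv and both functions are hypothetical; the offset and ook fields merely stand in for the IV slot and the OOK flag described above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of "The Offset Hack"; hypothetical names throughout. */
typedef struct {
    char  *pv;      /* where the string appears to start */
    size_t cur;     /* apparent length */
    size_t len;     /* apparent allocated size */
    size_t offset;  /* stands in for the IV slot holding the offset */
    int    ook;     /* stands in for the OOK flag */
} toy_ook_pv;

/* Remove n characters from the front without moving any bytes. */
void toy_chop_front(toy_ook_pv *s, size_t n)
{
    assert(n <= s->cur);
    s->pv     += n;
    s->cur    -= n;
    s->len    -= n;
    s->offset += n;
    s->ook     = 1;
}

/* The address that must be handed back to the allocator. */
char *toy_real_start(const toy_ook_pv *s)
{
    return s->pv - s->offset;
}
```

No bytes are copied when characters are dropped from the front; the only extra cost is remembering the offset so that the original address can be recovered when the buffer is released.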
Floating point numbers
Finally, we have floating point types, or NVs: Numeric Values. Like IVs, NVs are guaranteed to be able to hold a pointer. The SvNV structure is very like the corresponding SvIV:
% perl -MDevel::Peek -le '$a=0.5; Dump($a);'
SV = NV(0x815d058) at 0x81584e8
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  NV = 0.5
However, the combined structure, SvPVNV, has slots for floats, integers and strings:
% perl -MDevel::Peek -le '$a="1"; $a+=0.5; Dump($a);'
SV = PVNV(0x814f9c0) at 0x81584f0
  REFCNT = 1
  FLAGS = (NOK,pNOK)
  IV = 0
  NV = 1.5
  PV = 0x815b5c0 "1"\0
  CUR = 1
  LEN = 2
Arrays and Hashes
Now that we've looked at the most common types of scalar (there are a few complications, which we'll cover in the section called “More Complex Types”), let's examine array and hash structures. These, too, are built on top of the basic SV structure, with reference counts and flags, and structures hung off sv_any.
Arrays
Arrays are known in the core as AVs. Their structure can be found in av.h:
struct xpvav {
    char*   xav_array;  /* pointer to first array element */
    SSize_t xav_fill;   /* Index of last element present */
    SSize_t xav_max;    /* max index for which array has space */
    IV      xof_off;    /* ptr is incremented by offset */
    NV      xnv_nv;     /* numeric value, if any */
    MAGIC*  xmg_magic;  /* magic for scalar array */
    HV*     xmg_stash;  /* class package */
    SV**    xav_alloc;  /* pointer to malloced string */
    SV*     xav_arylen;
    U8      xav_flags;
};
We're going to skip over xmg_magic and xmg_stash for now, and come back to them in the section called “More Complex Types”.
Let's use Devel::Peek as before to examine the AV, but we must remember that we can only give one argument to Devel::Peek::Dump; hence, we must pass it a reference to the AV:
% perl -MDevel::Peek -e '@a=(1,2,3); Dump(\@a)'
SV = RV(0x8106ce8) at 0x80fb380
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8105824
  SV = PVAV(0x8106cb4) at 0x8105824
    REFCNT = 2
    FLAGS = ()
    IV = 0
    NV = 0
    ARRAY = 0x80f7de8
    FILL = 2
    MAX = 3
    ARYLEN = 0x0
    FLAGS = (REAL)
    Elt No. 0
    SV = IV(0x80fc1f4) at 0x80f1460
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 1
    Elt No. 1
    SV = IV(0x80fc1f8) at 0x80f1574
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 2
    Elt No. 2
    SV = IV(0x80fc1fc) at 0x80f1370
      REFCNT = 1
      FLAGS = (IOK,pIOK,IsUV)
      UV = 3
We're dumping the reference to the array, which is, as you would expect, an RV.
The RV contains a pointer to another SV: this is our array; the
The AV contains a pointer to a C array of SVs. Just like a string, this array must be able to change its size; in fact, the expansion and contraction of AVs is just the same as that of strings.
To facilitate this,
We said that
The
Something similar to the offset hack is performed on AVs to enable efficient shifting and splicing off the beginning of the array; while AvARRAY (xav_array in the structure) points to the first element in the array that is visible from Perl, AvALLOC (xav_alloc) points to the real start of the C array. These are usually the same, but a shift operation can be carried out by increasing AvARRAY by one and decreasing AvFILL and AvMAX. Again, the location of the real start of the C array only comes into play when freeing the array. See av_shift in av.c.
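The pointer arithmetic behind such a shift can be sketched as follows. The type toy_av and the function toy_av_shift are hypothetical; the fields echo xav_array, xav_alloc, xav_fill and xav_max, and plain int pointers stand in for SV pointers. This is an illustration of the idea, not a copy of av_shift.

```c
#include <assert.h>
#include <stddef.h>

/* Toy array body with the two pointers described above; the type is
 * hypothetical and uses int* elements in place of SV* elements. */
typedef struct {
    int      **array;  /* first element visible from Perl */
    int      **alloc;  /* real start of the C array */
    ptrdiff_t  fill;   /* index of the last element present, -1 if empty */
    ptrdiff_t  max;    /* highest index for which there is space */
} toy_av;

/* Shift by advancing the visible start instead of moving elements. */
int *toy_av_shift(toy_av *av)
{
    int *first;

    assert(av != NULL);
    if (av->fill < 0)
        return NULL;            /* nothing to shift */
    first = av->array[0];
    av->array++;                /* alloc still marks the real start */
    av->fill--;
    av->max--;
    return first;
}
```

Because alloc is never moved, the block originally obtained from the allocator can still be freed correctly however many elements have been shifted off.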
Hashes
Hashes are represented in the core as, you guessed it, HVs. Before we look at how this is implemented, we'll first see what a hash actually is...
What is a "hash" anyway?
A hash is actually quite a clever data structure: it's a combination of an array and a linked list. Here's how it works:
- The hash key undergoes a transformation to turn it into a number called, confusingly, the hash value. For Perl, the C code that does the transformation looks like this (from hv.h):
register const char *s_PeRlHaSh = str;
register I32 i_PeRlHaSh = len;
register U32 hash_PeRlHaSh = 0;
while (i_PeRlHaSh--) {
    hash_PeRlHaSh += *s_PeRlHaSh++;
    hash_PeRlHaSh += (hash_PeRlHaSh << 10);
    hash_PeRlHaSh ^= (hash_PeRlHaSh >> 6);
}
hash_PeRlHaSh += (hash_PeRlHaSh << 3);
hash_PeRlHaSh ^= (hash_PeRlHaSh >> 11);
(hash) = (hash_PeRlHaSh += (hash_PeRlHaSh << 15));
Converting that to Perl and tidying it up:
sub hash {
    my $string = shift;
    my $hash = 0;
    for (map { ord $_ } split //, $string) {
        $hash = ($hash + $_) & 0xFFFFFFFF;              # stay within 32 bits, as a U32 does
        $hash = ($hash + ($hash << 10)) & 0xFFFFFFFF;
        $hash ^= $hash >> 6;
    }
    $hash = ($hash + ($hash << 3)) & 0xFFFFFFFF;
    $hash ^= $hash >> 11;
    return ($hash + ($hash << 15)) & 0xFFFFFFFF;        # parentheses needed: + binds tighter than <<
}
- This hash is distributed across an array using the modulo operator. For instance, if our array has 8 elements ("hash buckets"), we'll use $hash_array[$hash % 8].
- Each bucket contains a linked list; adding a new entry to the hash appends an element to the linked list. So, for instance, $hash{"red"}="rouge" is implemented similarly to
push @{$hash->[hash("red") % 8]},
    { key   => "red",
      value => "rouge",
      hash  => hash("red")
    };
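The three steps above can be put together in a small standalone C sketch. Everything here is hypothetical naming (toy_he, toy_hv, toy_hash, toy_fetch, toy_store, toy_clear) rather than Perl's own API. Two assumptions are worth flagging: key bytes are treated as unsigned, and new entries are linked at the head of a bucket's list rather than appended, purely because that is shorter to write.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 8

/* One entry in a bucket's linked list; hypothetical names. */
typedef struct toy_he {
    struct toy_he *next;
    uint32_t       hash;   /* stored so most mismatches skip the strcmp */
    const char    *key;    /* caller-owned string in this sketch */
    const char    *value;
} toy_he;

typedef struct { toy_he *bucket[TOY_BUCKETS]; } toy_hv;

/* The hashing loop from the text, in explicit 32-bit arithmetic. */
uint32_t toy_hash(const char *s, size_t len)
{
    uint32_t h = 0;
    while (len--) {
        h += (unsigned char)*s++;   /* assumption: bytes taken as unsigned */
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Find the entry for key, or NULL; compares the hash first, then the key. */
toy_he *toy_fetch(toy_hv *hv, const char *key)
{
    uint32_t h = toy_hash(key, strlen(key));
    toy_he *he;
    for (he = hv->bucket[h % TOY_BUCKETS]; he != NULL; he = he->next)
        if (he->hash == h && strcmp(he->key, key) == 0)
            return he;
    return NULL;
}

/* Store a key/value pair, replacing the value if the key already exists. */
void toy_store(toy_hv *hv, const char *key, const char *value)
{
    toy_he *he = toy_fetch(hv, key);
    if (he == NULL) {
        uint32_t h = toy_hash(key, strlen(key));
        he = malloc(sizeof *he);
        assert(he != NULL);
        he->hash = h;
        he->key = key;
        he->next = hv->bucket[h % TOY_BUCKETS];   /* link at the head */
        hv->bucket[h % TOY_BUCKETS] = he;
    }
    he->value = value;
}

/* Free every entry in every bucket. */
void toy_clear(toy_hv *hv)
{
    size_t i;
    for (i = 0; i < TOY_BUCKETS; i++) {
        while (hv->bucket[i] != NULL) {
            toy_he *dead = hv->bucket[i];
            hv->bucket[i] = dead->next;
            free(dead);
        }
    }
}
```

Storing the hash value alongside the key lets a lookup reject most non-matching entries with a single integer comparison, falling back to a full string comparison only when the hash values agree.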
<advanced>
Why do we store the key as well as the hash value in the linked list? The hashing function may not be perfect - that is to say, it may generate the same value for "red" as it would for, say, "blue". This is called a hash collision, and, while it is rare in practice, it explains why we can't depend on the hash value alone.
</advanced>
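To make the role of the stored key concrete, here is a minimal standalone C sketch of a bucket array with chained entries. Everything in it is hypothetical (the toy_ names, the fixed eight buckets, and a deliberately weak hash function that returns the key length so that collisions are easy to provoke); it is not how hv.c is written. New entries are pushed onto the head of the chain for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUCKETS 8

/* One link in a bucket's chain: hash value, key and value together. */
typedef struct toy_entry {
    struct toy_entry *next;
    unsigned long hash;
    const char *key;      /* borrowed pointer; the caller keeps the string alive */
    const char *value;
} toy_entry;

typedef struct {
    toy_entry *bucket[TOY_BUCKETS];
} toy_hash;

/* Deliberately poor hash (the key length) so different keys collide often. */
unsigned long toy_hash_value(const char *key) {
    return (unsigned long)strlen(key);
}

/* Store a value, replacing any existing entry with the same key. */
void toy_store(toy_hash *h, const char *key, const char *value) {
    unsigned long hv = toy_hash_value(key);
    toy_entry **slot = &h->bucket[hv % TOY_BUCKETS];
    for (toy_entry *e = *slot; e; e = e->next) {
        /* The hash value alone is not enough: compare the key as well. */
        if (e->hash == hv && strcmp(e->key, key) == 0) {
            e->value = value;
            return;
        }
    }
    toy_entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->next = *slot;
    e->hash = hv;
    e->key = key;
    e->value = value;
    *slot = e;
}

/* Return the stored value, or NULL if the key is absent. */
const char *toy_fetch(const toy_hash *h, const char *key) {
    unsigned long hv = toy_hash_value(key);
    for (const toy_entry *e = h->bucket[hv % TOY_BUCKETS]; e; e = e->next)
        if (e->hash == hv && strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}
```

With this hash function, "red" and "tan" produce the same hash value and land in the same bucket; only the key comparison tells them apart.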
[edit] Hash Entries
Hashes come in two parts: the HV is the actual array containing the linked lists, and is very similar to an AV; the things that make up the linked lists are hash entry structures, or HEs. From hv.h:
/* entry in hash value chain */
struct he {
HE *hent_next; /* next entry in chain */
HEK *hent_hek; /* hash key */
SV *hent_val; /* scalar value that was hashed */
};
/* hash key -- defined separately for use as shared pointer */
struct hek {
U32 hek_hash; /* hash of key */
I32 hek_len; /* length of hash key */
char hek_key[1]; /* variable-length hash key */
};
As you can see from the above, we simplified slightly when we put the hash key in the buckets above: the key and the hash value are stored in a separate structure, a HEK.
The HEK stored inside a hash entry represents the key: it contains the hash value and the key itself. It's stored separately so that Perl can share identical keys between different hashes - this saves memory and also saves time calculating the hash value. You can use the macros HeHASH(he) and HeKEY(he) to retrieve the hash value and the key from a HE.
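The "variable-length hash key" comment on hek_key hints at a common C idiom: the structure is allocated with extra room so that the key bytes follow the fixed fields in one block. The sketch below illustrates the idiom in standalone C under assumptions of our own: toy_hek and toy_hek_new are hypothetical names, and it uses a C99 flexible array member where the structure quoted above uses the older char hek_key[1] spelling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the hek structure quoted above, under a hypothetical name. */
typedef struct {
    unsigned long hash;  /* hash of key */
    int len;             /* length of key, not counting the trailing NUL */
    char key[];          /* key bytes live directly after the fixed fields */
} toy_hek;

/* Allocate the header and the key in a single block. */
toy_hek *toy_hek_new(const char *key, int len, unsigned long hash) {
    assert(key != NULL && len >= 0);
    /* Room for the fixed fields, the key bytes and a trailing NUL. */
    toy_hek *hek = malloc(offsetof(toy_hek, key) + (size_t)len + 1);
    assert(hek != NULL);
    hek->hash = hash;
    hek->len = len;
    memcpy(hek->key, key, (size_t)len);
    hek->key[len] = '\0';
    return hek;
}
```

Because the key and its precomputed hash travel together in one allocation, a structure like this can be handed to several hash entries at once, which is the kind of sharing the text describes.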
[edit] Hash arrays
Now to turn to the HVs themselves, the arrays which hold the linked lists of HEs. As we mentioned, these are not too dissimilar from AVs.
% perl -MDevel::Peek -e '%a = (red => "rouge", blue => "bleu"); Dump(\%a);'
SV = RV(0x8106c80) at 0x80f1370
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x81057a0
  SV = PVHV(0x8108328) at 0x81057a0
    REFCNT = 2
    FLAGS = (SHAREKEYS)
    IV = 2
    NV = 0
    ARRAY = 0x80f7748 (0:6, 1:2)
    hash quality = 150.0%
    KEYS = 2
    FILL = 2
    MAX = 7
    RITER = -1
    EITER = 0x0
    Elt "blue" HASH = 0x8a5573ea
    SV = PV(0x80f17b0) at 0x80f1574
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80f5288 "bleu"\0
      CUR = 4
      LEN = 5
    Elt "red" HASH = 0x201ed
    SV = PV(0x80f172c) at 0x80f1460
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x80ff370 "rouge"\0
      CUR = 5
      LEN = 6
- As before, we dump a reference to the HV rather than the hash itself.
- ARRAY: As we mentioned before, there are eight buckets in our hash initially - the hash gets restructured as needed. The numbers in brackets after the ARRAY address describe how the buckets are populated: here, six buckets hold no entries and two buckets hold one entry each.
- hash quality: The "quality" of a hash is related to how long it takes to find an element, and this is in turn related to the average length of the hash chains, the linked lists attached to the buckets: if there is only one element in each bucket, you can find the key simply by performing the hash function. If, on the other hand, all the elements are in the same hash bucket, the hash is particularly inefficient.
- RITER and EITER: These two values refer to the hash iterator: when you iterate over the hash with, for instance, keys or each, Perl uses them to keep track of its position.
- Elt: As with an array, the elements are dumped after the hash's own fields.
[edit] More Complex Types
Sometimes the information provided in an ordinary SV, HV or AV isn't enough for what Perl needs to do. For instance, how does one represent objects? What about tied variables? In this section, we'll look at some of the complications of the basic SV types.
<advanced>
The entirety of this section should be considered advanced material; it will not be covered in the course. Readers following the course should skip to the next section, the section called "Inheritance", and study this in their own time.
</advanced>
[edit] Objects
Objects are represented relatively simply. As we know from ordinary Perl programming, an object is a reference to some data which happens to know which package it's in. In the definitions of AVs and HVs above, we saw the line
HV* xmg_stash; /* class package */
As we'll see in the section called "Globs and Stashes", packages are known as "stashes" internally and are represented by hashes. The xmg_stash field in AVs and HVs is used to store a pointer to the stash which "owns" the value.
Hence, in the case of an object which is an array reference, the dump looks like this:
% perl -MDevel::Peek -e '$a=bless [1,2]; Dump($a)'
SV = RV(0x81586d4) at 0x815b7a0
  REFCNT = 1
  FLAGS = (ROK)
  RV = 0x8151b0c
  SV = PVAV(0x8153074) at 0x8151b0c
    REFCNT = 1
    FLAGS = (OBJECT)
    IV = 0
    NV = 0
    STASH = 0x8151a34 "main"
    ARRAY = 0x815fcf8
    FILL = 1
    MAX = 1
    ARYLEN = 0x0
    FLAGS = (REAL)
    Elt No. 0
    SV = IV(0x815833c) at 0x8151bc0
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 1
    Elt No. 1
    SV = IV(0x8158340) at 0x8151c44
      REFCNT = 1
      FLAGS = (IOK,pIOK)
      IV = 2
[edit] Magic
This works for AVs and HVs which have a STASH field, but what about ordinary scalars? There is an additional, more complex type of scalar, which can hold both stash information and also permits us to hang additional, miscellaneous information onto the SV. This miscellaneous information is called "magic", (partially because it allows for clever things to happen, and partially because nobody really knows how it works) and the complex SV structure is a PVMG. We can create a PVMG by blessing a scalar reference:
% perl -MDevel::Peek -le '$b="hi";$a=bless \$b, main; print Dump($a)'
SV = RV(0x8106ca4) at 0x810586c
  REFCNT = 1
  FLAGS = (ROK)
  RV = 0x81058c0
  SV = PVMG(0x810e628) at 0x81058c0
    REFCNT = 2
    FLAGS = (OBJECT,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x80ff698 "hi"\0
    CUR = 2
    LEN = 3
    STASH = 0x80f1388 "main"
As you can see, this is similar to the PVNV structure we saw in the section called "Floating point numbers", with the addition of the STASH field. There's also another field, which we can see if we look at the definition of xpvmg:
struct xpvmg {
char * xpv_pv; /* pointer to malloced string */
STRLEN xpv_cur; /* length of xpv_pv as a C string */
STRLEN xpv_len; /* allocated size */
IV xiv_iv; /* integer value or pv offset */
NV xnv_nv; /* numeric value, if any */
MAGIC* xmg_magic; /* linked list of magicalness */
HV* xmg_stash; /* class package */
};
The xmg_magic field provides us with somewhere to put a magic structure. What's a magic structure, then? For this, we need to look in mg.h:
struct magic {
MAGIC* mg_moremagic;
MGVTBL* mg_virtual; /* pointer to magic functions */
U16 mg_private;
char mg_type;
U8 mg_flags;
SV* mg_obj;
char* mg_ptr;
I32 mg_len;
};
- mg_moremagic: First, we have a link to another magic structure: this creates a linked list, allowing us to hang multiple pieces of magic off a single SV.
- mg_virtual: The magic virtual table is a list of functions which should be called to perform particular operations on behalf of the SV. For instance, a tied variable will automagically call the C function magic_getpack when its value is retrieved. The magic virtual tables are provided by Perl - they're in perl.h.
<advanced>
In theory, you can create your own virtual tables by providing functions to fill the mgvtbl structure.
</advanced>
- mg_private: This is a storage area for data private to this piece of magic. The Perl core doesn't use this, but you can if you're building your own magic types. For instance, you can use it as a "signature" to ensure that this magic was created by your extension, not by some other module.
- mg_type: Magic comes in a number of varieties: as well as providing for tied variables, magic propagates taintedness and makes special variables such as $! work. Each of these different types of magic has a different "code letter" - the letters in use are listed in perlguts.
- mg_flags: There are only four flags in use for magic.
- mg_obj: This is another storage area; it's normally used to point to the object of a tied variable, so that tied functions can be located.
- mg_ptr: The pointer field is set when you add magic to an SV with the sv_magic function; it often holds the name of the variable.
- mg_len: This is the length of the string in mg_ptr.
What happens when the value of an SV with magic is retrieved? Firstly, a function should call SvGETMAGIC(sv) to cause any magic to be performed. This in turn calls mg_get which walks over the linked list of magic. For each piece of magic, it looks in the magic virtual table, and calls the magical "get" function if there is one.
Let's assume that we're dealing with one of Perl's special variables, which has only one piece of magic, "\0" magic. The appropriate magic virtual table for "\0" magic is PL_vtbl_sv, which is defined as follows: (in perl.h)
EXT MGVTBL PL_vtbl_sv = {MEMBER_TO_FPTR(Perl_magic_get),
MEMBER_TO_FPTR(Perl_magic_set),
MEMBER_TO_FPTR(Perl_magic_len),
0, 0};
Magic virtual tables have five elements, as seen in mg.h:
struct mgvtbl {
int (CPERLscope(*svt_get)) (pTHX_ SV *sv, MAGIC* mg);
int (CPERLscope(*svt_set)) (pTHX_ SV *sv, MAGIC* mg);
U32 (CPERLscope(*svt_len)) (pTHX_ SV *sv, MAGIC* mg);
int (CPERLscope(*svt_clear))(pTHX_ SV *sv, MAGIC* mg);
int (CPERLscope(*svt_free)) (pTHX_ SV *sv, MAGIC* mg);
};
So the above virtual table means "call Perl_magic_get when we want to get the value of this SV; call Perl_magic_set when we want to set it; call Perl_magic_len when we want to find its length; do nothing if we want to clear it or when it is freed from memory."
In this case, we are getting the value, so magic_get is called. [1] This function looks at the value of mg_ptr, which, as noted above, is often the name of the variable. Depending on the name of the variable, it determines what to do: for instance, if mg_ptr is "!", then the current value of the C variable errno is retrieved.
A similar process is performed by SvSETMAGIC(sv) to call functions that need to be called when the value of an SV changes.
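The walk over the magic chain can be pictured with a small standalone C sketch. It is a simplification under stated assumptions: the toy_ types and functions are hypothetical, the virtual table has only two slots instead of five, and an ordinary global variable stands in for errno. The real code is mg_get in mg.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_sv toy_sv;
typedef struct toy_magic toy_magic;

/* Cut-down virtual table: only the "get" and "set" hooks. */
typedef struct {
    int (*get)(toy_sv *sv, toy_magic *mg);
    int (*set)(toy_sv *sv, toy_magic *mg);
} toy_vtbl;

struct toy_magic {
    toy_magic *moremagic;          /* next piece of magic on the same value */
    const toy_vtbl *virtual_table; /* hooks for this piece of magic */
    const char *ptr;               /* e.g. the variable's name */
};

struct toy_sv {
    long value;
    toy_magic *magic;              /* NULL for a value with no magic */
};

/* Walk the chain and call every "get" hook that exists. */
int toy_mg_get(toy_sv *sv) {
    assert(sv != NULL);
    for (toy_magic *mg = sv->magic; mg; mg = mg->moremagic) {
        const toy_vtbl *vt = mg->virtual_table;
        if (vt && vt->get)
            vt->get(sv, mg);
    }
    return 0;
}

/* Example hook: refresh the value from an external variable, much as
   "get" magic on $! reads errno. */
long toy_external = 0;

int toy_get_external(toy_sv *sv, toy_magic *mg) {
    (void)mg;                      /* a fuller hook would inspect mg->ptr */
    sv->value = toy_external;
    return 0;
}
```

A caller would invoke toy_mg_get before reading value, which is the role SvGETMAGIC plays for real SVs; a value with an empty chain is left untouched.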
[edit] Tied Variables
Tied arrays and hashes are implemented by adding type "P" magic to their AVs and HVs; individual elements of the arrays and hashes have "p" magic. Tied scalars and filehandles have type "q" magic. The virtual table for "p" magic scalars, for instance, looks like this:
EXT MGVTBL PL_vtbl_packelem = {MEMBER_TO_FPTR(Perl_magic_getpack),
MEMBER_TO_FPTR(Perl_magic_setpack),
0,
MEMBER_TO_FPTR(Perl_magic_clearpack),
0};
That's to say, the function magic_getpack is called when the value of an element of a tied array or hash is retrieved. This function in turn performs a FETCH method call on the object stored in mg_obj.
We can invent our own "pseudo-tied" variables, using the user-defined "U" magic. "U" magic only works on scalars, and allows us to call a function when the value of the scalar is got or set. The virtual table for "U" magic scalars is as follows:
EXT MGVTBL PL_vtbl_uvar = {MEMBER_TO_FPTR(Perl_magic_getuvar),
MEMBER_TO_FPTR(Perl_magic_setuvar),
0, 0, 0};
As you should by now expect, these functions are called when the value of the scalar is accessed. They in turn call our user-defined functions. But how do we tell them what our functions are? In this case, we pass a pointer to a special structure in the mg_ptr field; the structure is defined in perl.h, and looks like this:
struct ufuncs {
I32 (*uf_val)(IV, SV*);
I32 (*uf_set)(IV, SV*);
IV uf_index;
};
Here are our two function pointers: uf_val is called with the value of uf_index and the scalar when the value is sought, and uf_set is called with the same parameters when it is set.
Hence, the following code allows us to emulate $!:
I32 get_errno(IV index, SV* sv) {
    sv_setiv(sv, errno);
    return 0;
}
I32 set_errno(IV index, SV* sv) {
    errno = SvIV(sv); /* Some Cs don't like us setting errno, but hey */
    return 0;
}
struct ufuncs uf;
/* This is XS code */
void
magicify(sv)
    SV *sv;
  CODE:
    uf.uf_val = &get_errno;
    uf.uf_set = &set_errno;
    uf.uf_index = 0;
    sv_magic(sv, 0, 'U', (char*)&uf, sizeof(uf));
If you need any more flexibility than that, it's time to look into "~" magic.
[edit] Globs and Stashes
SVs that represent variables are kept in the symbol table; as you'll know from your Perl programming, the symbol table starts at %main:: and is an ordinary Perl hash, with the package and variable names as hash keys. But what are the hash values? Let's have a look:
% perl -le '$a=5; print ${main::}{a}'
*main::a
Well, that doesn't tell us very much - at first sight it just looks like an ordinary string. But if we use Devel::Peek on it, we find it's actually something else - a glob, or GV:
% perl -MDevel::Peek -e '$a=5; Dump ${main::}{a}'
SV = PVGV(0x80fe3e0) at 0x80fb3ec
  REFCNT = 2
  FLAGS = (GMG,SMG)
  IV = 0
  NV = 0
  MAGIC = 0x80fea50
    MG_VIRTUAL = &PL_vtbl_glob
    MG_TYPE = '*'
    MG_OBJ = 0x80fb3ec
    MG_LEN = 1
    MG_PTR = 0x81081d8 "a"
  NAME = "a"
  NAMELEN = 1
  GvSTASH = 0x80f1388 "main"
  GP = 0x80ff2b0
    SV = 0x810592c
    REFCNT = 1
    IO = 0x0
    FORM = 0x0
    AV = 0x0
    HV = 0x0
    CV = 0x0
    CVGEN = 0x0
    GPFLAGS = 0x0
    LINE = 1
    FILE = "-e"
    FLAGS = 0x0
    EGV = 0x80fb3ec "a"
- MAGIC: Globs have get and set magic to handle glob aliasing as well as the conversion to strings we saw above.
- MG_OBJ: The glob's magic object points back to the GV itself, so that the magic functions can easily access it.
- NAME: The "name" is simply the variable's unqualified name; this is combined with the "stash" below to make up the "full name".
- GvSTASH: The stash itself is a pointer to the hash in which this glob is contained.
- GP: This structure, a GP structure, actually holds the symbol table entry. It's separated out so that, in the case of aliased globs, multiple GVs can point to the same GP.
- SV: As we know, globs have several different "slots", for scalars, arrays, hashes and so on. This is the scalar slot, which is a pointer to an SV.
- REFCNT: The GP is refcounted because we need to know how many GVs point to it, so it can be safely destroyed when no longer needed.
- IO, FORM, AV, HV, CV: The other slots are a filehandle, a form, an array, a hash and a code value (see the section called "Code Values").
- CVGEN: This stores the "age" of the code value. Every time a subroutine is defined, Perl increments a global counter.
- GPFLAGS: The GP's flags are currently unused.
Symbol tables are considered some of the hairiest voodoo in the Perl internals.
<advanced>
From C, the variable PL_defstash is the HV representing the main:: stash; PL_curstash contains the current package's stash.
</advanced>
[edit] Code Values
The final data type we will examine is the CV, a code value used for storing subroutines. Both Perl and XSUB subroutines are stored in CVs, and blocks are also stored in CVs. The CV structure can be found in cv.h:
struct xpvcv {
char * xpv_pv; /* pointer to malloced string */
STRLEN xpv_cur; /* length of xp_pv as a C string */
STRLEN xpv_len; /* allocated size */
IV xof_off; /* integer value */
NV xnv_nv; /* numeric value, if any */
MAGIC* xmg_magic; /* magic for scalar array */
HV* xmg_stash; /* class package */
HV * xcv_stash;
OP * xcv_start;
OP * xcv_root;
void (*xcv_xsub) (pTHXo_ CV*);
ANY xcv_xsubany;
GV * xcv_gv;
char * xcv_file;
long xcv_depth; /* >= 2 indicates recursive call */
AV * xcv_padlist;
CV * xcv_outside;
#ifdef USE_THREADS
perl_mutex *xcv_mutexp;
struct perl_thread *xcv_owner; /* current owner thread */
#endif /* USE_THREADS */
cv_flags_t xcv_flags;
};
- xcv_stash: Although it might look like this provides the CV's stash, it is important to note that this is a pointer to the stash in which the CV was compiled; for instance, given
    package First;
    sub Second::mysub { ...}
  then xcv_stash points to the First stash, not to Second. Likewise,
    package One;
    $x = "In One";
    package Two;
    $x = "In Two";
    sub One::test { print $x }
    package main;
    One::test();
  will print "In Two".
- xcv_start and xcv_root: For a subroutine defined in Perl, these two pointers hold the start and the root of the compiled op tree; this will be explored further in Chapter 6, Fundamental Operations.
- xcv_xsub: For an XSUB, on the other hand, this field contains a function pointer pointing to the C function implementing the subroutine.
- xcv_xsubany: This is how constant subroutines are implemented: Perl can arrange for the SV representing the constant to be returned by a constant XS routine, which is hung here.
- xcv_gv: This simply holds a pointer to the glob by which the subroutine was defined.
- xcv_file: This stores the name of the file in which the subroutine was defined. For an XSUB, this will be the name of the generated .c file.
- xcv_depth: This is a counter which is incremented each time the subroutine is entered and decremented when it is left; this allows Perl to keep track of recursive calls to a subroutine.
- xcv_padlist: Explained below; this holds a pointer to an AV, the padlist.
- xcv_outside: Consider the following code:
    {
        my $x = 0;
        sub counter { return ++$x; }
    }
  When inside counter, $x is not one of the subroutine's own lexicals; it belongs to the enclosing block. xcv_outside points to the CV which lexically encloses this one, so that such variables can be found.
[edit] Lexical Variables
Global variables live, as we've seen, in symbol tables or "stashes". Lexical variables, on the other hand, are tied to blocks rather than packages, and so are stored inside the CV representing their enclosing block.
As mentioned briefly above, the xcv_padlist element holds a pointer to an AV. This array, the padlist, contains the names and values of lexicals in the current code block. Again, a diagram is the best way to demonstrate this:
The first element of the padlist - called the "padname" - is an array containing the names of the variables, and the other elements are lists of the current values of those variables. Why do we have several lists of current values? Because a CV may be entered several times - for instance, when a subroutine recurses. Having, essentially, a stack of frames ensures that we can restore the previous values when a recursive call ends. Hence, the current values of lexical variables are stored in the last element on the padlist.
<advanced>
From inside perl, you can get at the current pad as PL_curpad. Note that this is the pad itself, not the padlist. To get the padlist, you need to perform some awkwardness:
I32 cxix = dopoptosub(cxstack_ix); /* cxstack_ix is a macro */
AV* padlist = cxix ? CvPADLIST(cxstack[cxix].blk_sub.cv) : PL_comppadlist;
We'll visit pads again when we look at operator targets in the section called "Scratchpads and Targets".
</advanced>
[edit] Inheritance
As we have seen, some types of SV deliberately build on and extend the structure of others. The SV code is written to attempt to provide an object-oriented style of programming inside C, and it is fair to say that some SV "classes" inherit from others. In the compiler module B, we see these inheritance relationships spelt out:
@B::PV::ISA = 'B::SV';
@B::IV::ISA = 'B::SV';
@B::NV::ISA = 'B::IV';
@B::RV::ISA = 'B::SV';
@B::PVIV::ISA = qw(B::PV B::IV);
@B::PVNV::ISA = qw(B::PV B::NV);
@B::PVMG::ISA = 'B::PVNV';
@B::PVLV::ISA = 'B::PVMG';
@B::BM::ISA = 'B::PVMG';
@B::AV::ISA = 'B::PVMG';
@B::GV::ISA = 'B::PVMG';
@B::HV::ISA = 'B::PVMG';
@B::CV::ISA = 'B::PVMG';
@B::IO::ISA = 'B::PVMG';
@B::FM::ISA = 'B::CV';
[edit] Summary
Perl uses several variable types in its internal representation to achieve the flexibility that is needed for its external types: scalars (SVs), arrays (AVs), hashes (HVs) and code blocks (CVs).
The module Devel::Peek allows us to examine how Perl types are represented internally. The field names produced by Devel::Peek can be easily turned into macros which allow us to get and set the values of the fields from C.
The key files from the Perl source tree which deal with Perl's internal variables are sv.c, av.c and hv.c; the documentation in the associated header files (sv.h, av.h and hv.h) is extremely helpful for understanding how to deal with Perl's internal variables.
[edit] Exercises
- One thing we didn't do in this chapter was run Devel::Peek on a subroutine. Try it on a named subroutine reference, an anonymous subref and a subref to an XS routine.
- See if you can work out what 'FM', 'IO', 'BM' and 'PVLV' are in the above; try creating them in Perl and dumping them out with Devel::Peek. Use sv.h to explain the new fields.
[1] We'll see later that Perl uses the Perl_ prefix internally for function names, but that prefix can be omitted inside the Perl core. Hence, we'll call Perl_magic_get "magic_get".
[edit] The Lexer and the Parser
In this chapter, we're going to examine how Perl goes about turning a piece of Perl code into an internal representation ready to be executed. The nature of the internal representation, a tree of structures representing operations, will be
