Python: module spug.util.tok

spug.util.tok

Simple tokenizer framework. At this time, all of the functionality of the @CTokenizer is implemented in @Tokenizer. Eventually, the *CTokenizer* personality should be broken out into the correct class.

Synopsis:

{{
    import sys
    from spug.util.tok import CTokenizer, Token

    # create a new tokenizer to parse standard input
    toker = CTokenizer(sys.stdin)

    while True:
        # get the next token
        tok = toker.nextToken()

        # check for end-of-stream
        if tok.isType(Token.end):
            break

        # print out the token type and the value
        print(toker.tokTypes[tok.type - 1].name + ': ' + tok.val)
}}

Run "#python tok.py#" to see the action of the above code.

Modules

re

Classes



exceptions.Exception(exceptions.BaseException)

util.tok.TokenizerError

util.tok.Token
util.tok.TokenInfo
util.tok.TokenStream
util.tok.Tokenizer

util.tok.CTokenizer

class CTokenizer(Tokenizer)

    Tokenizer for C-like languages.

Methods inherited from Tokenizer:

__init__(self, src)
Constructor.  Creates a new Tokenizer given a source file and (optionally) a set of flags.

breakOff(self, toksrc)
"Breaks off" the regular expression match specified by /toksrc/ from the buffer and returns it.

fillBuffer(self)
Make sure that the buffer has data in it.

nextPreparsed(self)
Returns the next token in the preparsed list (the list of tokens that have already been parsed and were then put back).

nextToken(self)
Returns the next token, either from the preparsed cache, or from the stream.

parseNextToken(self)
Parses the next token directly off of the stream.  Clients should generally avoid using this.  Use nextToken() instead, since that will use the preparsed queue if tokens have been put back.

putBack(self, tok)
Puts the given token back on the list.  The token is placed first on the list, so immediately after the call, /tok/ will be the next token returned from @nextToken().

Data and other attributes inherited from Tokenizer:

character = <util.tok.TokenInfo instance at 0x831ba4c>

chr = 4

cmt = 5

comment = <util.tok.TokenInfo instance at 0x831b96c>

id = 1

identifier = <util.tok.TokenInfo instance at 0x831becc>

int = 2

integer = <util.tok.TokenInfo instance at 0x831bfec>

longComment = <util.tok.TokenInfo instance at 0x831b94c>

str = 3

string = <util.tok.TokenInfo instance at 0x831bf2c>

sym = 7

symbol = <util.tok.TokenInfo instance at 0x831ba0c>

tokTypes = [<util.tok.TokenInfo instance at 0x831becc>, <util.tok.TokenInfo instance at 0x831bfec>, <util.tok.TokenInfo instance at 0x831bf2c>, <util.tok.TokenInfo instance at 0x831b96c>, <util.tok.TokenInfo instance at 0x831b94c>, <util.tok.TokenInfo instance at 0x831beac>, <util.tok.TokenInfo instance at 0x831ba0c>]

whitespace = <util.tok.TokenInfo instance at 0x831beac>

ws = 6

class Token

    Tokens represent pieces of text.  Each token has:

    /val/::
        Source text of the token.  Its "value".
    /type/::
        A numeric value indicating the token's type.
    /srcName/::
        Name of the source stream that it came from.
    /lineNum/::
        Line number from which it came.

    #Token.end# is a class variable set to zero.  It is used to indicate that the end of the stream has been read.  *Do not use 0 as a token Id.*

Methods defined here:

__init__(self, type, val, srcInfo)
Constructor for Token.  /type/ is one of the numeric token types, /val/ is the value of the token (its text), and /srcInfo/ is a tuple indicating the source file name and the line number.  These correspond to the /srcName/ and /lineNum/ attributes described above.

equals(self, type, val)
Returns true if the token is of the indicated type and has the indicated value.

isType(self, type)
Returns true if the token is of the indicated type.

Data and other attributes defined here:

end = 0
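
The Token interface described above can be illustrated with a small stand-in class. This is a sketch only, not spug's implementation: `SketchToken` is a hypothetical name, and unpacking /srcInfo/ as `(name, line)` is an assumption based on the constructor description.

```python
class SketchToken(object):
    """Stand-in for Token, written from the docstrings above."""

    end = 0  # reserved type id marking end-of-stream

    def __init__(self, type, val, srcInfo):
        self.type = type
        self.val = val
        # Assumed tuple order: (source name, line number).
        self.srcName, self.lineNum = srcInfo

    def isType(self, type):
        # True if the token has the indicated type.
        return self.type == type

    def equals(self, type, val):
        # True only if both the type and the text match.
        return self.type == type and self.val == val


ID = 1  # mirrors the "id = 1" attribute listed for Tokenizer
tok = SketchToken(ID, "count", ("example.c", 12))
eof = SketchToken(SketchToken.end, "", ("example.c", 40))
```

Because `end` is 0, a loop can test `tok.isType(SketchToken.end)` to stop, which is why 0 must not be reused as an ordinary token id.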

class TokenInfo

    TokenInfo holds information about token types.  Each has a /name/, a /regex/ (regular expression describing how the token is represented) and an /id/.

Methods defined here:

__init__(self, name, regex, create, continued)
Make one.  Public variables:

/name/::
    The name of the token type.
/regex/::
    The regular expression that describes the token's source form.
/create/::
    A function that should expect a token source string.
/continued/::
    An optional regular expression.  If it is present, it indicates that the token may be continued over multiple lines (if /regex/ matches to the end of the current line), and it represents the kind of expression which will terminate the multi-line token beginning with /regex/.  All lines between the line that begins the token and the portion of a line which ends the token are considered to be part of the token.
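
The /regex/ and /continued/ rule can be sketched as follows. This is a minimal illustration under stated assumptions, not spug's code: the `TYPES` table, the function name `match_token`, and both patterns are hypothetical, and returning the partial text for an unterminated token is a choice made here for brevity.

```python
import re

# (name, regex, continued) triples, tried in order.  The patterns are
# illustrative guesses, not the ones spug defines.
TYPES = [
    ("comment", r"/\*.*?\*/", None),        # closes on the same line
    ("longComment", r"/\*.*", r".*?\*/"),   # runs to end of line, closed later
]


def match_token(lines):
    """Match one token at the start of lines[0].

    Returns (name, text, lines_consumed), or None if nothing matches.
    """
    first = lines[0]
    for name, regex, continued in TYPES:
        m = re.match(regex, first)
        if m is None:
            continue
        at_eol = m.end() >= len(first.rstrip("\n"))
        if continued is None or not at_eol:
            return name, m.group(0), 1
        # The opening regex ran to the end of the line: absorb whole lines
        # until one starts with a match for the terminating expression.
        text = first
        for count, line in enumerate(lines[1:], 2):
            end = re.match(continued, line)
            if end is not None:
                return name, text + end.group(0), count
            text += line
        return name, text, len(lines)  # unterminated; a real tokenizer might raise
    return None


source = ["/* one\n", " * two\n", " end */ int x;\n"]
result = match_token(source)
# result == ("longComment", "/* one\n * two\n end */", 3)
```

Everything from the opening line up to the portion of the last line matched by the terminating expression becomes part of the token; the remainder of that line (`" int x;"` here) is left for the next match.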

class TokenStream

    This class can be used as a wrapper for any object that provides the file readline() method.  It delegates readline() to the wrapped object and also provides a "name" variable, which the tokenizer requires.

Methods defined here:

__init__(self, src, name)

readline(self)
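
A wrapper of the kind described above might look like the sketch below. `SketchTokenStream` is a hypothetical stand-in written from the class description, and the attribute name `src` is an assumption; only `readline()` and `name` are documented.

```python
import io


class SketchTokenStream(object):
    """Gives any readline()-capable object the 'name' a tokenizer expects."""

    def __init__(self, src, name):
        self.src = src    # wrapped object; must provide readline()
        self.name = name  # reported as the source name

    def readline(self):
        # Delegate straight to the wrapped object.
        return self.src.readline()


# An in-memory buffer does not carry a file name of its own, so wrap it.
stream = SketchTokenStream(io.StringIO(u"int x;\nreturn x;\n"), "<memory>")
lines = [stream.readline(), stream.readline(), stream.readline()]
```

As with ordinary file objects, `readline()` returns an empty string once the wrapped source is exhausted.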

class Tokenizer

    A Tokenizer is used to extract tokens from a source stream.  Tokens are of the form normally accepted by C-like languages.

    XXX This class really needs to become generic, with its C personality moved to CTokenizer.

Methods defined here:

__init__(self, src)
Constructor.  Creates a new Tokenizer given a source file and (optionally) a set of flags.

breakOff(self, toksrc)
"Breaks off" the regular expression match specified by /toksrc/ from the buffer and returns it.

fillBuffer(self)
Make sure that the buffer has data in it.

nextPreparsed(self)
Returns the next token in the preparsed list (the list of tokens that have already been parsed and were then put back).

nextToken(self)
Returns the next token, either from the preparsed cache, or from the stream.

parseNextToken(self)
Parses the next token directly off of the stream.  Clients should generally avoid using this.  Use nextToken() instead, since that will use the preparsed queue if tokens have been put back.

putBack(self, tok)
Puts the given token back on the list.  The token is placed first on the list, so immediately after the call, /tok/ will be the next token returned from @nextToken().
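
How nextToken(), parseNextToken(), nextPreparsed() and putBack() fit together can be sketched with a stand-in class. This is an illustration of the documented contract, not spug's implementation: `SketchTokenizer` and its private attributes are hypothetical, and plain strings stand in for Token objects.

```python
class SketchTokenizer(object):
    """Stand-in showing the put-back contract described above."""

    def __init__(self, tokens):
        self._stream = iter(tokens)  # stands in for the source stream
        self._preparsed = []         # tokens that were put back

    def parseNextToken(self):
        # Reads directly from the stream, ignoring put-back tokens.
        return next(self._stream, None)

    def nextPreparsed(self):
        # Take the token at the front of the put-back list.
        return self._preparsed.pop(0)

    def nextToken(self):
        # Prefer put-back tokens; fall back to the stream.
        if self._preparsed:
            return self.nextPreparsed()
        return self.parseNextToken()

    def putBack(self, tok):
        # Insert at the front so tok is the next token returned.
        self._preparsed.insert(0, tok)


toker = SketchTokenizer(["a", "+", "b"])
first = toker.nextToken()   # "a"
toker.putBack(first)        # one-token lookahead: peek, then undo
again = toker.nextToken()   # "a" again, from the put-back list
rest = [toker.nextToken(), toker.nextToken()]
```

Calling parseNextToken() while the put-back list is non-empty would return tokens out of order, which is why clients are steered toward nextToken().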

Data and other attributes defined here:

character = <util.tok.TokenInfo instance at 0x831ba4c>

chr = 4

cmt = 5

comment = <util.tok.TokenInfo instance at 0x831b96c>

id = 1

identifier = <util.tok.TokenInfo instance at 0x831becc>

int = 2

integer = <util.tok.TokenInfo instance at 0x831bfec>

longComment = <util.tok.TokenInfo instance at 0x831b94c>

str = 3

string = <util.tok.TokenInfo instance at 0x831bf2c>

sym = 7

symbol = <util.tok.TokenInfo instance at 0x831ba0c>

tokTypes = [<util.tok.TokenInfo instance at 0x831becc>, <util.tok.TokenInfo instance at 0x831bfec>, <util.tok.TokenInfo instance at 0x831bf2c>, <util.tok.TokenInfo instance at 0x831b96c>, <util.tok.TokenInfo instance at 0x831b94c>, <util.tok.TokenInfo instance at 0x831beac>, <util.tok.TokenInfo instance at 0x831ba0c>]

whitespace = <util.tok.TokenInfo instance at 0x831beac>

ws = 6

class TokenizerError(exceptions.Exception)

    A Tokenizer error

Method resolution order:

TokenizerError

exceptions.Exception

exceptions.BaseException

__builtin__.object

Methods defined here:

__init__(self, toker, text)

Data descriptors defined here:

__weakref__

list of weak references to the object (if defined)

Data and other attributes inherited from exceptions.Exception:

__new__ = <built-in method __new__ of type object at 0x8140ce0>
T.__new__(S, ...) -> a new object with type S, a subtype of T

Methods inherited from exceptions.BaseException:

__delattr__(...)
x.__delattr__('name') <==> del x.name

__getattribute__(...)
x.__getattribute__('name') <==> x.name

__getitem__(...)
x.__getitem__(y) <==> x[y]

__getslice__(...)
x.__getslice__(i, j) <==> x[i:j] Use of negative indices is not supported.

__reduce__(...)

__repr__(...)
x.__repr__() <==> repr(x)

__setattr__(...)
x.__setattr__('name', value) <==> x.name = value

__setstate__(...)

__str__(...)
x.__str__() <==> str(x)

Data descriptors inherited from exceptions.BaseException:

__dict__

args

message

exception message