User:I8086/Draft


NOTE: Since the name of the language isn't given, it will be called cola for this paper's purpose.

For the past 7 months, I've been working on a new programming language. It started out as a fork of another project of mine, a language called Indented-C. Indented-C was exactly like C, except that C is a free-form language, while Indented-C followed the off-side rule. After a while, I got sick of recreating C in a different form and decided to create a whole new language inspired heavily by Python. So far, I've had great success developing the compiler. When the compiler becomes more mature, I plan to release it under the "Simplified BSD" or "Revised BSD" license. Currently, the language is being developed under the codename cola.

The purpose of this "paper" is to create a "rough draft" of the language, including its keywords and grammar. Mainly, it helps me keep my thoughts in order by logically writing them down on paper.

Current Status
Currently, the project is undergoing rapid changes and updates to stay compatible with the language's specification. Now that basic data types are supported, the project has turned toward making the language Turing complete. A basic implementation of if and else is supported, and loop statements are in development.

Goals
The main goals for this language are:


 * To be used for system programming and development on a higher level than assembly, C, and C++.
 * To be a language that's more clear and readable than C-based languages.
 * To help provide a limited amount of static and run-time safety, along with error handling.

Originally, the language was primarily planned for developing low-level applications, like boot loaders, operating systems, drivers, and things of that sort. Now, as a general-purpose language, it can be more flexible and used for user-space applications, like GUIs.

It should be clarified right now that Cola isn't intended to replace C or C++; it merely gives developers a higher-level choice that compiles to native code. Cola will not solve all of the problems faced in the field of computer science, but it should make them easier to solve.

Compiling
Currently, the compiler is written in Python. This allows for rapid development and platform independence. Like many other compilers, it converts the source code into assembly language, with NASM as the de facto assembler. The assembly source code is then passed on to NASM, where it is assembled into a platform-specific object file. From here, a linker of choice is used to create the executable.
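As a rough sketch of the assemble-and-link step described above, the driver could build its command lines like this. The function name, the default elf32 format, and the use of ld as the linker are assumptions for illustration; only NASM's -f and -o options are taken as given.

```python
from pathlib import Path


def build_commands(asm_file, obj_format="elf32", linker="ld"):
    """Return the assembler and linker command lines for one assembly file."""
    asm_path = Path(asm_file)
    obj_path = asm_path.with_suffix(".o")
    exe_path = asm_path.with_suffix("")
    # NASM: -f selects the object file format, -o names the output file.
    assemble = ["nasm", "-f", obj_format, str(asm_path), "-o", str(obj_path)]
    # The linker is "of choice"; ld is only a placeholder assumption here.
    link = [linker, "-o", str(exe_path), str(obj_path)]
    return assemble, link


assemble_cmd, link_cmd = build_commands("hello.asm")
print(" ".join(assemble_cmd))
print(" ".join(link_cmd))
```

Each list could then be handed to subprocess.run(command, check=True) on a system where the tools are installed.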

The Compiler
Currently, the compiler is made up of six Python scripts. In total, it consists of a little more than 2300 lines of code, with the parser accounting for over half of them. The compiler is likely to be unstable and under constant development until the language matures; then a new compiler will be written in the language itself. Basically, the compiler's intended goal is to help bootstrap the language, which will then be used to create a new compiler. I estimate that the compiler will grow no larger than ~20000 lines of code, although this measurement is irrelevant to the status and strength of the compiler.

The Assembler
As of right now, the assembler of choice is NASM. The compiler requires a standalone version of NASM to be present on the system, and it needs to support the object file format that the compiler wants to produce. Down the road, NASM may be built into a dynamic/shared library that can be linked to the compiler. YASM may be a better alternative to NASM for this purpose. Both are licensed under a BSD-style license, which keeps them compatible with the compiler's license.

Operating Systems
The goal is to support the three major operating systems: Microsoft Windows, Apple OS X, and Linux. Although these are the main targets, other operating systems, like MS-DOS and Unix, will also be supported. Along with that, the compiler will make an effort to produce OS-less code for low-level work.

Data Types
The language has several planned data types. These include integers, floats, Booleans, constants, pointers (both smart pointers and, well, not-so-smart pointers), strings (basically a less verbose way of defining a certain type of constant), characters, lists (the Pythonic name for dynamic arrays), dictionaries (also known as associative arrays or maps), and structures. Currently, only integers, Booleans, strings, and characters are supported by the compiler.

Dynamic Types
The goal is to support a mixture of static and dynamic types. For example, you can have an 8-bit integer, 16-bit integer, 32-bit integer, 64-bit integer, or a dynamic integer that seems infinite. This will allow a developer to pick which type of integer is best for the job.
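To illustrate the difference, here is a small Python sketch. Python's own integers behave like the "dynamic integer that seems infinite", while the hypothetical wrap_unsigned helper imitates a fixed-width integer. It assumes fixed-width types wrap around on overflow, which this draft does not actually specify.

```python
def wrap_unsigned(value, bits):
    """Reduce value to an unsigned integer of the given bit width."""
    return value & ((1 << bits) - 1)


dynamic = 2 ** 64 + 5                  # Python ints grow as needed
as_byte = wrap_unsigned(255 + 1, 8)    # an 8-bit integer wraps back to 0
as_qword = wrap_unsigned(dynamic, 64)  # only the low 64 bits survive

print(dynamic, as_byte, as_qword)
```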

Booleans
Booleans are defined with the keywords for true and false. Since Booleans are a subtype of integers, a Boolean can be assigned a pointer or a number, which is then converted to its Boolean value. For example, zero is false and nonzero numbers are true.

Although Booleans are a subtype of integers, you cannot use Booleans in integer math. Arithmetic on Boolean values would create undefined results, so it produces a compile-time error.

Characters
Character literals are defined by using single quotation marks as delimiters. Characters are not constants, which means each character literal gets turned into an integer based on its ASCII code. Characters also support escape characters, which are discussed in the next subsection. Since characters are really integers, they have no keywords of their own.
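A minimal sketch of that conversion in Python follows. The function name is hypothetical, and escape sequences are left out on purpose.

```python
def char_literal_to_int(literal):
    """Convert a single-quoted one-character literal such as 'A' to its ASCII code."""
    if len(literal) != 3 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("expected a literal of the form 'x'")
    code = ord(literal[1])
    if code > 127:
        raise ValueError("not an ASCII character")
    return code


print(char_literal_to_int("'A'"))
print(char_literal_to_int("'0'"))
```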

Strings
String literals are defined by using double quotation marks as delimiters. Escape characters always begin with a backslash followed by the escape sequence, which is usually just one character. Escape characters are converted into the proper ASCII character at compile time, so there's no need for separate "formatted" and "unformatted" string functions, which can reduce errors and the size of the executable. ASCII strings and wide strings are defined with separate keywords.
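Here is a sketch of how the compiler could translate escape sequences while reading a literal. The escape table is an assumption borrowed from C; the draft does not list which sequences are supported.

```python
# Assumed C-like escape table; the real set is not specified in this draft.
ESCAPES = {"n": "\n", "t": "\t", "0": "\0", "\\": "\\", '"': '"', "'": "'"}


def decode_escapes(body):
    """Replace backslash escapes in the text between the quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            nxt = body[i + 1]
            if nxt not in ESCAPES:
                raise ValueError(f"unknown escape sequence \\{nxt}")
            out.append(ESCAPES[nxt])
            i += 2  # skip the backslash and the escape character
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because the translation happens once in the compiler, the runtime string functions only ever see the finished characters.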

By default, strings are dynamic. This means that they can be manipulated via certain operators. While this makes string manipulation easy, it does cost memory and time. Dynamic strings start out as static strings, which are then copied into a dynamic buffer. Static strings can be created via a dedicated keyword. The use of this keyword will help prevent static strings from being modified, which could lead to a crash or, worse, buffer overflows and memory corruption.

To sum it all up, static strings are read-only, which means they cannot be manipulated, while dynamic strings can be. Because no copy is needed, static strings initialize faster than dynamic strings.
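Python's bytes and bytearray types make a loose analogy for the two kinds of strings. This is only an illustration of the copy-into-a-buffer idea, not how the compiler implements it.

```python
STATIC_GREETING = b"Hello"  # bytes objects are read-only, like a static string


def make_dynamic(static_text):
    """Copy a read-only string into a fresh mutable buffer."""
    return bytearray(static_text)


greeting = make_dynamic(STATIC_GREETING)
greeting += b", world!"          # allowed: the buffer is mutable
# STATIC_GREETING[0] = 0x4A      # would raise TypeError: bytes are immutable
```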

Incrementing and Decrementing
Unlike a lot of B-based languages, this language doesn't intend to include increment and decrement operators. These operators cause a lot of trouble for programmers, especially if they don't understand prefixing and postfixing. Python chooses to leave these operators out, most likely for simplicity and readability.

Indexing
Indexing in this language will closely follow Python. Pointers and constants will only be able to access their data via indexing and slicing, since the asterisk is used for multiplication and isn't overloaded (possibly, for now). If any variable wants to access or store items in memory, it will need to index. Indexing is zero-based, with the first item at index zero.

Since this language was meant for systems development, I came up with the idea of indexing numbers bitwise. For example, indexing a number with 1 returns its second least significant bit. Why do this? It is simply an easier way of typing certain bitwise operations done in operating system development, especially when working with ports of any kind. Take these two examples into consideration.

Even though Python is a great language, I admit that it has some shortfalls. For example, the C code is almost identical to this Python code.
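As a rough sketch of the conventional mask-based style in Python, assuming the same flag layout as the check function in this section, the code could look like this. The mask names are made up for illustration.

```python
ERR_MASK = 0x01           # bit 0
DIR_MASK = 0x02           # bit 1
FILE_MASK = 0x04          # bit 2
VOL_MASK = 0x08           # bit 3
UPPER_NIBBLE_MASK = 0xF0  # bits 4 to 7


def check(ident):
    # Every flag needs its own mask and an explicit "and".
    err = bool(ident & ERR_MASK)
    is_dir = bool(ident & DIR_MASK)
    is_file = bool(ident & FILE_MASK)
    is_vol = bool(ident & VOL_MASK)

    # If the upper nibble of ident isn't zero, then there's a problem.
    if ident & UPPER_NIBBLE_MASK:
        print("The id is invalid or corrupt!")
        return -1

    # Some important code would use err, is_dir, is_file and is_vol here.
    return 0
```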

The language I intend to implement will use less typing to achieve the same goal, as shown below.

    func byte check(byte id):
        # NOTE: Booleans can be assigned number indexes...possible for now...
        bool err = id[0] # err = id's least significant bit.
        bool dir = id[1] # dir = id's second least significant bit.
        bool file = id[2] # file = id's third least significant bit.
        bool vol = id[3] # vol = id's fourth least significant bit.

        # If the upper nibble of id isn't zero, then there's a problem.
        # NOTE: This involves slicing and not indexing!
        if id[4:]:
            print("The id is invalid or corrupt!")
            return -1

        # Some important code here.

        return 0;

Not only is this easier to type out, it's a lot easier to remember, too! This can relieve the burden of knowing what number you need to "and" the variable with.
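Under the hood, bitwise indexing and slicing can be lowered to plain shifts and masks. The Python sketch below shows one possible lowering. The helper names are hypothetical, and it assumes an exclusive stop index as in Python slices.

```python
def bit_index(value, index):
    """Return bit number index of value (0 is the least significant bit)."""
    return (value >> index) & 1


def bit_slice(value, start, stop=None):
    """Return bits start..stop-1 of value as a number; an open stop keeps all upper bits."""
    shifted = value >> start
    if stop is None:
        return shifted
    return shifted & ((1 << (stop - start)) - 1)


ident = 0x3A  # binary 0011 1010
print(bit_index(ident, 1))  # second least significant bit
print(bit_slice(ident, 4))  # everything from bit 4 upward
```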

Code Structure
Since this language is very Pythonic, its structure is very similar to Python's. For example, here is a working "Hello, world!" example that demonstrates the structure of the language.

    include stdlib.h

    func int main:
        puts("Hello, world!")

As stated in the summary, this language follows the off-side rule, which means spaces matter. Consecutive lines of code that begin with the same amount of white space are part of the same code block. This replaces the pesky brackets used in C-based languages. Here's a standard C "Hello, world!" example for comparison.
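One common way to implement the off-side rule is to have the lexer keep a stack of indentation widths and emit INDENT and DEDENT markers, much as Python does. The sketch below is an assumption about how this could work, not a copy of the actual lexer. It counts spaces only and ignores blank lines.

```python
def indent_tokens(lines):
    """Turn source lines into a flat list with INDENT/DEDENT markers around blocks."""
    stack = [0]  # indentation widths of the currently open blocks
    tokens = []
    for raw in lines:
        if not raw.strip():
            continue  # blank lines never open or close a block
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        if width != stack[-1]:
            raise ValueError("inconsistent indentation")
        tokens.append(raw.strip())
    while len(stack) > 1:  # close any blocks still open at end of file
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

The parser can then treat INDENT and DEDENT exactly like the opening and closing brackets of a C-based language.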

Keywords
This language, like almost every programming language, has a set of reserved keywords that have a special meaning.

Comments
Like in Python, comments begin with one number/pound sign. When the lexer finds a number sign, it immediately skips to the next line. Comments can be located anywhere so long as they aren't embedded within strings or constants.

    func int main:
        # This is a comment!
        int a = 102 # This is another comment!
        int b = 122 ##1112 Comments can be used to block out obsolete code...
        string str = "# Embedding a comment within a string will result in no comment..."
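A small Python sketch of that rule follows: scan the line, track whether we are inside a quoted literal, and cut at the first number sign outside of one. The function name is hypothetical and the handling of backslash escapes inside literals is an assumption.

```python
def strip_comment(line):
    """Remove a trailing # comment unless the # sits inside a quoted literal."""
    in_quote = None  # the quote character that opened the current literal, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == "\\":
                i += 2  # skip the escaped character so \" does not close the literal
                continue
            if ch == in_quote:
                in_quote = None
        elif ch in ('"', "'"):
            in_quote = ch
        elif ch == "#":
            return line[:i].rstrip()
        i += 1
    return line
```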

Externc
Like extern, externc allows you to declare external variables and functions, except that externc is used for C-style namespaces. This makes it easy to integrate C, assembly, and other languages into Cola.

Nop
This keyword creates an assembly nop, which stands for no operation. Essentially, this instruction doesn't do anything, although it does get executed, therefore burning up time. Nop is usually used in loops or for padding memory chunks, but it can serve other purposes. This keyword takes no arguments.

Pass
The pass keyword is purely syntactic sugar. Since at least one line of code must be in a code block, pass is able to fill that space without adding bloat to the executable file. pass acts much like nop, except it doesn't compile into any code.

Calling Conventions
Currently, this language follows the cdecl calling convention. As such, the caller needs to save ?ax, ?cx, and ?dx, while the callee saves the rest. The caller also needs to clean up the stack, or else stack space will be wasted or leaked.
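To make the caller's duties concrete, here is a sketch of how a code generator might emit a cdecl call as NASM text. It assumes a 32-bit target where every argument occupies 4 bytes, and the function name is made up for illustration. Arguments are pushed right to left, and the caller removes them again after the call.

```python
def emit_cdecl_call(function, args):
    """Return NASM lines for a 32-bit cdecl call with 4-byte arguments."""
    lines = [f"push {arg}" for arg in reversed(args)]  # rightmost argument first
    lines.append(f"call {function}")
    if args:
        # Caller cleanup: drop the pushed arguments from the stack.
        lines.append(f"add esp, {4 * len(args)}")
    return lines


for line in emit_cdecl_call("puts", ["message"]):
    print(line)
```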

Stack Management
Currently, anything passed to the stack is aligned to 4 bytes (a dword). On a 32-bit system, everything will fit within this alignment except for longs, which will take up 8 bytes (a qword). Later, I plan to allow for packed alignment, which will save a lot of stack space.
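The alignment rule comes down to rounding each value's size up to the next multiple of 4. A minimal Python sketch, with a hypothetical helper name:

```python
def stack_slot_size(size, alignment=4):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


print(stack_slot_size(1))      # a byte still takes a whole dword
print(stack_slot_size(8))      # a long takes a qword
print(stack_slot_size(1, 1))   # packed alignment would keep it at one byte
```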