Python Concepts/Strings

Objective

 * Understand Python strings.
 * Learn basic string manipulation.
 * Learn about escape characters and their role in strings.
 * Learn about linefeed techniques for large strings.
 * Learn the string formatting basics.
 * Learn about string indexing.
 * Learn about string slicing.
 * Learn about string encoding, ASCII, and Unicode.
 * Work with common built-in string methods.

Python Strings
The string is one of the simplest data types in Python. Strings can be created by putting either single quotations (') or double quotations (") at the beginning and end of a sequence of textual characters.

A simple string with single quotations:

A string can also use double quotations, which do not affect the string in any way.

You can concatenate (join together in sequence) strings by using the plus sign.

String literals (strings not held in variables) are also concatenated automatically when they are written next to each other, without a plus sign.

Now, let's say you need to type a very long string that repeats itself. You can repeat words by using the multiplication operator.
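The operations described so far can be tried together. This is a minimal sketch; the variable names and sample text are chosen here for illustration only.

```python
# Single and double quotations create the same kind of string.
single = 'Hello, world!'
double = "Hello, world!"
print(single == double)      # True

# The plus sign joins strings, whether held in variables or written as literals.
greeting = 'Hello, ' + 'world!'

# Literals written next to each other are joined without a plus sign.
adjacent = 'Hello, ' 'world!'

# The multiplication operator repeats a string.
repeated = 'spam ' * 3

print(greeting)              # Hello, world!
print(adjacent)              # Hello, world!
print(repeated)
```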

Examples of strings with ' and " mixed:
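A short sketch of mixing the two kinds of quotation marks; the sentences are illustrative only.

```python
# Double quotations outside allow a single quotation inside.
a = "It's a nice day"

# Single quotations outside allow double quotations inside.
b = 'She said "hello"'

# A backslash lets a string contain the same mark that delimits it.
c = 'It\'s a nice day'

print(a)
print(b)
print(a == c)    # True
```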

Escape Characters
There are some characters that cannot be easily expressed within a string. These characters, called escape characters, can be written within a string as a sequence of two or more characters. In Python, we denote escape characters with a backslash at the beginning. For example, to put a new line in the string we could add a linefeed (\n).

That's not really impressive, is it? To actually see that new line in action, use the built-in print() function.

Here is a table of other escape characters (no need to memorize them; the most important one you'll use is \n).

Now you might start to see a problem with using a backslash (\) in your string. Let's print a Windows directory name.

See how \n was interpreted as a linefeed? To correct this, escape the backslash with a second backslash (\\). Be careful when using backslashes; remember that two of them will only output one backslash.

It could get tiresome to do that with very long directory strings, so let's use a simpler way than doubling every backslash: just use the prefix r or R. By putting this prefix immediately before the string's opening quotation mark, we tell Python that this is a raw string ('r' stands for raw). That essentially tells Python to leave backslashes as they are instead of interpreting them as escape characters.
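The three ways of writing a backslash discussed above can be compared side by side. This is a sketch; the directory name is an assumed example.

```python
# The "\n" inside this path is read as a linefeed, which is not what we want.
broken = 'C:\new_folder'

# Two backslashes produce one literal backslash.
escaped = 'C:\\new_folder'

# The r prefix makes a raw string: backslashes are kept exactly as written.
raw = r'C:\new_folder'

print(broken)             # printed on two lines
print(escaped)            # C:\new_folder
print(escaped == raw)     # True
print(len('\\'))          # 1
```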

You can easily assign strings to variables.

The difference between displaying a string and printing a string:
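Outside the interactive prompt, the displayed form of a string can be obtained with the built-in repr() function. The sketch below, with an assumed sample string, contrasts it with print().

```python
text = 'Line one\nLine two'

# Displaying: quotation marks and escape characters are shown as typed.
print(repr(text))    # 'Line one\nLine two'

# Printing: the characters themselves are written, so the linefeed takes effect.
print(text)
```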

Newlines
Now, let's say you want to print some multi-line text. You could do it like this.

A string like that could grow really long, but we can use an easy trick which will allow text to span multiple lines without cramming it all onto one line. To do this we use three quotation marks (''' or """) to start and end a string:

That made things a lot easier. But we can still do better. By adding a backslash right after the opening quotation marks we can remove the first linefeed.

Some of you may have noticed that print() automatically ends with an extra linefeed. There is a way to bypass this.
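A sketch of the triple-quotation techniques above. It assumes Python 3, where the end parameter of print() is one way to suppress the final linefeed; the menu text is illustrative.

```python
# Three quotation marks let the text span several physical lines.
menu = """
Eggs
Spam
"""

# A backslash right after the opening marks drops the first linefeed.
menu_trimmed = """\
Eggs
Spam
"""

print(repr(menu))            # '\nEggs\nSpam\n'
print(repr(menu_trimmed))    # 'Eggs\nSpam\n'

# print() normally appends a linefeed; end='' replaces it with nothing.
print('no linefeed after this', end='')
print(' ... so this continues on the same line')
```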

A useful way to span a string over multiple physical lines without inserting automatic line-feeds is to use parentheses (and escaping new lines as necessary):

Use parentheses and concatenation for long strings:
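Both parenthesized forms can be sketched as follows; the sentences are placeholders chosen for illustration.

```python
# Adjacent literals inside parentheses become one string.
# No linefeed is inserted unless "\n" is written explicitly.
long_text = ('This sentence is written on several physical lines '
             'but is stored as a single line.\n'
             'This part starts after an explicit linefeed.')

# The same idea with explicit concatenation.
joined = ('first part, ' +
          'second part')

print(long_text)
print(joined)        # first part, second part
```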

Formatting
Strings in Python can be subjected to special formatting, much like strings in C. Formatting makes it easier to produce well-formatted output. You can format a string using a percent sign (%) or you could use the newer curly braces ({}) formatting. A simple example is given below.

The above simple code uses the special format characters %d, which are replaced with a decimal integer. The percent sign after the format string indicates that whatever follows the % is the data to be printed according to the format string. That can be a lot to take in. Let's demonstrate this a couple more times.

This time, we used a different type of format, %s, that inserts a string. You'll need to do some extra work if the string needs to be formatted more than once.

Notice the need for parentheses and the comma when there are two or more items in the list to be printed. If we don't add the parentheses around the format arguments, then we'll get an error.

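If you wish to print % literally, write it as %% in any string that is followed by the formatting operator; a string that is not being formatted can contain a plain %.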
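The formatting rules above can be sketched in a few lines. The names and values (count, name) are hypothetical and exist only for this illustration.

```python
count = 3
name = 'Alice'

# %d inserts a decimal integer, %s inserts a string.
one = 'There are %d apples.' % count

# Two or more values must be wrapped in parentheses and separated by commas.
two = '%s has %d apples.' % (name, count)

# %% produces one literal percent sign in a formatted string.
pct = '%d%% done' % 50

# The newer curly braces style uses the format() method.
new = '{} has {} apples.'.format(name, count)

print(one)    # There are 3 apples.
print(two)    # Alice has 3 apples.
print(pct)    # 50% done
print(new)    # Alice has 3 apples.

# Without the parentheses, only name is passed to the format string.
try:
    '%s has %d apples.' % name, count
except TypeError as err:
    print('error:', err)
```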

To keep you from guessing what is what, here is a table of all possible formats with a little information about them.

Indexing
Strings in Python support indexing, which allows you to retrieve part of the string. It would be better to show you some indexing before we actually tell you how it's done, since you'll grasp the concept more easily.

By putting the index number inside brackets, you can extract a character from a string. But what magic numbers correspond to the characters? Indexing in Python starts at 0, so the maximum index of a string is one less than its length. Let's try and index a string beyond its limits.

Here's a little chart of the string's character positions.

Hopefully that chart above helped to visually clarify some things about indexing. Now that we know the formula for the last character in a string, we should be able to get that character.

In the above code, we used the formula, string length minus one, to get the last character of a string. By using the built-in len() function, we can get the length of a string. In this instance, len() returns 13, which we reduce by 1, resulting in 12. This can be a bit exhausting and repetitive when you need to repeat this over and over again. Luckily, Python has a special indexing method that allows you to get the last character of a string without needing to know the string's length. By using negative numbers, we can index from right to left instead of left to right.
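A sketch of positive and negative indexing. The string below is an assumed example that happens to be 13 characters long, matching the length mentioned above.

```python
s = 'Hello, world!'

print(s[0])              # H
print(len(s))            # 13
print(s[len(s) - 1])     # !   (string length minus one)
print(s[-1])             # !   (same character, counted from the right)
print(s[-13])            # H   (the first character, counted from the right)

# Indexing beyond the limits raises an IndexError.
try:
    s[13]
except IndexError as err:
    print('error:', err)
```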

There is a table below showing the indexing number corresponding to the character. Take some time to study the table.

It is important that you understand that strings are immutable, which means that their content cannot be manipulated. Immutable data types have a fixed value that cannot change. The only way to change their value is to completely re-assign the variable.

From the above example, the variable is re-assigned to a different value. So what does this have to do with indexing? Well, the same rules apply to indexing, so no index can be assigned a new value or otherwise manipulated. The example below will help clarify this concept.

Re-assigning a string variable while replacing part of its content takes a little extra work with slicing. If you aren't familiar with slicing, it is taught in the next section. You'll probably want to come back to this after you have read that section.
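Immutability and the slicing workaround can be sketched like this; the string is an assumed example.

```python
s = 'Hello, world!'

# Assigning to an index is not allowed because strings are immutable.
try:
    s[0] = 'J'
except TypeError as err:
    print('error:', err)

# Build a new string from a new first character plus a slice of the old one,
# then re-assign the variable.
s = 'J' + s[1:]
print(s)    # Jello, world!
```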

Slicing
Slicing is an important concept that you'll be using in Python. Slicing allows you to extract a substring that is in the string. A substring is part of a string or a string within a string, so "I", "love", and "Python" are all substrings of "I love Python.". When you slice in Python, you'll need to remember that the colon is important. It would be better to show you than to tell you right away how to slice strings.

As you can see, slicing builds onto Python's indexing concepts, which were taught in the previous section. A slice such as [0:1] gets the substring starting with the character at 0 and ending with the character immediately before 1. So really the first number is where you start your slice and the number after the colon is where you end your slice. (The character at the number after the colon is not included.)

Now slicing like this can be helpful in situations, but what if you'd like to get the first 4 characters of a string, or everything after them? We could use the len() function to help us, but there is an easier way. By omitting one of the numbers in the slice, it will slice from the beginning or to the end, depending on which number was omitted.

By slicing like this, we can remove or get part of a string without needing to know its length. The slice up to a given position plus the slice from that same position is equal to the original string. This helps ensure that we don't get the same character into both strings.
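The slicing rules above can be sketched with the sentence used earlier in this section; the variable name s is chosen here for illustration.

```python
s = 'I love Python.'

print(s[0:1])     # I
print(s[2:6])     # love
print(s[7:13])    # Python

# Omitting a number slices from the beginning or to the end.
head = s[:4]
tail = s[4:]
print(repr(head))          # 'I lo'
print(repr(tail))          # 've Python.'
print(head + tail == s)    # True
```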

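Out-of-range positions are handled differently when slicing than when indexing. An attempt to index a string with a number larger than (or equal to) its length will produce an error (IndexError).

While slicing, this kind of error is suppressed, since the slice simply returns whatever characters exist in the requested range, or an empty string if there are none.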
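A sketch of the difference between indexing and slicing beyond the end of a string, using an assumed 14-character example.

```python
s = 'I love Python.'
print(len(s))            # 14

# Indexing at or past the length raises an IndexError.
try:
    s[14]
except IndexError as err:
    print('error:', err)

# Slicing past the end returns what exists, possibly nothing.
print(repr(s[14:]))      # ''
print(s[7:100])          # Python.
print(s[-1:])            # .
```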

The table below shows the indexing number corresponding to each character in the string "I love Wikiversity!"

The following examples illustrate the use of slicing to extract a substring from the string or to compare substrings within the string:

The slice expression used for s5 returns an empty string without error. To capture the '!' for s5:

Suppose you have a string and you want to format it to make it more readable. The following code illustrates the use of slicing to do the job. Execute the Python code above and the result is:

Encoding
So we know what a string is and how it works, but what really is a string? Underneath, a string is a sequence of characters, and an encoding decides which number, and ultimately which bytes, stand for each character. The most prominent string encodings are ASCII and Unicode. The first is a simple encoding for some, but not all, Latin characters and other things like numbers, signs, and monetary units. The second, Unicode, is much larger: it has room for more than a million code points, of which well over a hundred thousand are assigned. The purpose of Unicode is to create one encoding that can contain all of the world's alphabets, characters, and scripts. In Python 3, strings are Unicode by default. So this means we can put almost any character into a string and have it print correctly, provided the terminal and font support it. This is great news for non-English countries, because the ASCII encoding doesn't permit many types of characters. In fact, ASCII allows only 128 characters (generally the characters and symbols which you see on an American-English keyboard, plus control (non-printing) characters). Here are some examples using different languages, some with non-Latin characters.

A brief review of ASCII
Each ASCII character fits into one byte, specifically the least significant 7 bits of the byte. Therefore each ASCII character has the value 0x00 .. 0x7F. The ord() and chr() built-in functions do the conversion. The printable characters have values 0x20 .. 0x7E. The remaining characters are control or non-printing characters.

The numbers '0' .. '9' have values 0x30 .. 0x39.

The letters 'A' .. 'Z' have values 0x41 .. 0x5A.

The letters 'a' .. 'z' have values 0x61 .. 0x7A.

Control characters have values 0x00 .. 0x1F and 0x7F.

Control character 0x01 is called ^A, read as 'control A'. Control character 0x02 is called ^B, control character 0x03 is ^C, etc. Control character with value 0x00 is ^@ or NULL. To be specific:
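The ranges and the control-character naming can be checked with ord() and chr(). In this sketch, the rule that ^X is the character 0x40 below the letter X is standard ASCII; the variable names are illustrative.

```python
print(ord('A'))                         # 65
print(chr(0x41))                        # A
print(hex(ord('0')), hex(ord('9')))     # 0x30 0x39
print(hex(ord('a')), hex(ord('z')))     # 0x61 0x7a

# ^A is the character whose value is 0x40 less than that of 'A'.
ctrl_a = chr(ord('A') - 0x40)
print(repr(ctrl_a))                     # '\x01'

# ^J is the linefeed and ^@ is NULL.
ctrl_j = chr(ord('J') - 0x40)
null = chr(ord('@') - 0x40)
print(ctrl_j == '\n')                   # True
print(null == '\x00')                   # True
```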

Named control characters are:

See the table of Escape Sequences under "Escape Characters" above. To make sense of the named control characters, think of the 1970s and a 'modern' typewriter for the period. A proficient typist who did not look at the keyboard needed an audible alarm when the carriage approached end of line, hence "Bell". A Line Feed advanced the paper one line. A Form Feed advanced the paper to the next page. Perhaps the only control characters relevant today are Tab, Backspace and Return (which is interpreted as '^J').

The following Python code prints the characters of the ASCII character set.
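One possible way to list the character set is sketched below. It is not necessarily the lesson's original listing; repr() is used so that control characters appear as escapes instead of acting on the terminal.

```python
rows = []
for code in range(128):
    ch = chr(code)
    # Decimal value, hexadecimal value, then the character as Python displays it.
    rows.append('%3d  0x%02X  %s' % (code, code, repr(ch)))

print('\n'.join(rows))
```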

For example

If you send the output to a file and open the file with emacs, you will see that a control character is one byte. If a character is not recognized, emacs prints it using octal notation:

Modern character sets
In times past when hardware was expensive and the English speaking world dominated computing, one character occupied seven bits (0-6) of one byte. Then bit 7 was used to provide 128 extra characters.

For example, the character with value 0xA3, '£' (English pound), in the Latin-1 character set;

Today, with cheap computers a world-wide phenomenon, one character may occupy more than two bytes, with space for expansion up to and including 32 bits. Hence:

where each placeholder is a hexadecimal digit. Examples of modern characters in action:
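In Python, a character can be written by its numeric value with \x (2 hexadecimal digits), \u (4 digits) or \U (8 digits). The sketch below prints code points and lengths, not the characters themselves, so that it runs on any terminal; the chosen characters are illustrative.

```python
pound = '\u00a3'                  # the pound sign
print(pound == '\xa3')            # True
print(pound == chr(0xA3))         # True

greek = '\u03b1\u03b2\u03b3'      # the Greek letters alpha, beta, gamma
print(len(greek))                 # 3

# Values above 0xFFFF need the 8-digit form.
big = '\U00010022'
print(len(big))                   # 1
print(hex(ord(big)))              # 0x10022
```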

String as object
The method str.encode() returns an encoded version of the string as a bytes object. The method bytes.decode() returns a string decoded from the given bytes.
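A sketch of a round trip through both methods, assuming the UTF-8 encoding and an illustrative word containing one non-ASCII character.

```python
text = 'caf\u00e9'                        # 4 characters, the last is e with an acute accent

data = text.encode('utf-8')               # str to bytes
print(data)                               # b'caf\xc3\xa9'
print(len(text), len(data))               # 4 5

print(data.decode('utf-8') == text)       # True

# ASCII cannot represent the accented character.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print('error:', err)
```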

Common String methods
In the lesson above we've seen some methods in action. Many more methods are available for processing strings. Some are shown below:
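A small selection of built-in string methods, sketched with an assumed sample string. Each method returns a new value and leaves the original string unchanged.

```python
s = 'I love Python.'

print(s.upper())                      # I LOVE PYTHON.
print(s.lower())                      # i love python.
print(s.replace('love', 'like'))      # I like Python.
print(s.find('love'))                 # 2
print(s.count('o'))                   # 2
print(s.startswith('I'))              # True
print(s.split())                      # ['I', 'love', 'Python.']
print('-'.join(['a', 'b', 'c']))      # a-b-c
print('  padded  '.strip())           # padded
print(s)                              # I love Python.
```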

More operations on strings
At this point we're familiar with strings as perhaps a single line. But strings can be much more than a single line. The whole of "War and Peace" could be a single string. In this part of the lesson we'll look at "paragraphs" where a paragraph contains one or more non-blank lines.

Consider a string that contains a paragraph surrounded by messy white space. We'll improve the appearance of the string by removing insignificant white space. First, remove insignificant white space around each line:
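One way to tidy such a paragraph is sketched below. The sample text is invented for illustration, and splitlines(), strip() and join() are one possible choice of methods, not necessarily the lesson's original approach.

```python
paragraph = ('\n\n'
             '     The quick brown fox   \n'
             '  jumps over the     \n'
             '         lazy dog.   \n'
             '\n')

# Remove white space around each line.
lines = [line.strip() for line in paragraph.splitlines()]

# Drop the lines that are now empty and join the rest.
cleaned = '\n'.join(line for line in lines if line)

print(cleaned)
```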

The next string is a "page" where a page contains two or more paragraphs. With this page we'll do the same as above, that is, remove insignificant white space: first remove white space around each line (including blank lines), then remove extraneous lines between paragraphs:
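The two steps can be sketched as follows. The page text is invented for illustration, and keeping exactly one blank line between paragraphs is an assumption about what counts as extraneous.

```python
page = ('\n  First paragraph, line one.  \n'
        '     First paragraph, line two.\n'
        '   \n'
        '\n'
        '\t\n'
        '  Second paragraph.   \n'
        '\n\n')

# Step 1: remove white space around the page and around each line.
lines = [line.strip() for line in page.strip().splitlines()]

# Step 2: keep a blank line only when the previously kept line was not blank.
result = []
for line in lines:
    if line or (result and result[-1]):
        result.append(line)

cleaned = '\n'.join(result)
print(cleaned)
```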

Complicated strings simplified
If you are working with strings containing complicated sequences of escaped characters, or if the whole concept of escaped characters is difficult, you might try giving awkward characters, such as the backslash, names of their own:

Then you can build long strings. You can put the back_slash at the end of a string. If you have a long string, splitting it might help to reveal significant parts.
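A sketch of this idea, assuming that back_slash is a variable holding a single backslash character. The names new_line and pieces, and the path itself, are hypothetical.

```python
back_slash = '\\'       # one backslash character
new_line = '\n'         # hypothetical companion name for the linefeed

print(len(back_slash))                 # 1

# Build a longer string from named pieces; a backslash at the end is no problem.
path = 'C:' + back_slash + 'new_folder' + back_slash
print(path)                            # C:\new_folder\

# Splitting a long string into parts can make its structure easier to see.
pieces = ['C:', 'Users', 'me', 'notes.txt']
full = back_slash.join(pieces)
print(full)                            # C:\Users\me\notes.txt
```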

Assignments
It seems that modern international characters with numeric values greater than 0xFFFF are not displayed consistently from one program to another (the characters themselves are defined by Unicode, but font support varies). Consider character '\U00010022'. This displays on the interactive Python command line as a little elephant with three legs, within emacs and on the Unix command line as a Greek delta (almost), and in Wikiversity as, well, that depends. Within interactive Python, when you move the cursor over the displayed character, it steps over one character. When you move the cursor over '\U00010022', it steps over ten characters. '\U00010022' is '𐀢' as copied from emacs.

Experiment with characters with numeric values greater than 0xFFFF and note the apparently inconsistent results. If you are producing text with modern international characters, ensure that the character/s displayed are what you want.

Further Reading or Review

 * Previous Lesson: Numbers
 * This lesson: Strings
 * Next Lesson: Lists
 * Course Home Page