Working with Text Data

In this tutorial, we are going to learn how to use Python's basic string methods on pandas Series and on the columns of a DataFrame. Almost all of the built-in string methods have a vectorized equivalent that works on a whole Series at once. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute of a Series (or Index) and generally have names matching the equivalent (scalar) built-in string methods.

Let’s see some text data in action.

lower()

As with ordinary strings, lower() converts all the uppercase letters in each string to lowercase. It leaves the lowercase letters as they are.
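A minimal sketch of lower() on a Series. The sample names are made up for illustration, and the NaN entry stands in for a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN entry represents a missing value.
names = pd.Series(["Tom", "William Rick", np.nan, "ALBER@T"])

# .str.lower() lowercases every string and leaves the missing entry missing.
lowered = names.str.lower()
print(lowered)
```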

upper()

As with ordinary strings, upper() converts all the lowercase letters in each string to uppercase. It leaves the uppercase letters as they are.
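The same idea for upper(), again on made-up sample data with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN entry represents a missing value.
names = pd.Series(["Tom", "William Rick", np.nan, "ALBER@T"])

# .str.upper() uppercases every string and leaves the missing entry missing.
uppered = names.str.upper()
print(uppered)
```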

Both methods skip the NaN entries: a missing value stays missing in the result instead of raising an error.

len()

The len() method returns the length of each string in the Series.
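A short sketch of len() on illustrative data. Each string is measured separately, and the missing entry stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN entry represents a missing value.
names = pd.Series(["Tom", "William Rick", np.nan, "ALBER@T"])

# One length per element; the missing entry has no length.
lengths = names.str.len()
print(lengths)
```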

Note: If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series. Please note that a Series of type category with string .categories has some limitations compared with a Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.
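The note above can be sketched as follows, assuming a made-up Series with only three distinct values repeated many times. The example only checks that both routes give the same values; it does not measure speed.

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated 1000 times each.
s = pd.Series(["apple", "banana", "cherry"] * 1000)

# Convert to category; only the 3 unique values are stored as categories.
s_cat = s.astype("category")
print(len(s_cat.cat.categories))

# The same .str method works on both the plain and the category Series.
upper_plain = s.str.upper()
upper_cat = s_cat.str.upper()

# Element by element, the two results hold the same values.
print(upper_plain.tolist() == upper_cat.tolist())
```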

Concatenation

Just like ordinary string concatenation, the cat method concatenates strings, but here all the string entries of the Series are joined into a single string. cat has a parameter sep, which takes the string to place between the entries.

All the string entries are concatenated into one string, separated by the value passed as sep. sep accepts any string, not only a single character.
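A minimal sketch of cat with sep. The sample data and the "_" separator are illustrative choices, not taken from the original example. When no other data is passed, missing values are left out of the joined string unless na_rep supplies a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN entry represents a missing value.
names = pd.Series(["Tom", "William Rick", np.nan, "ALBER@T"])

# Join all entries into one string; the missing entry is omitted.
joined = names.str.cat(sep="_")
print(joined)  # Tom_William Rick_ALBER@T

# na_rep gives missing entries a placeholder instead of dropping them.
joined_marked = names.str.cat(sep="_", na_rep="?")
print(joined_marked)  # Tom_William Rick_?_ALBER@T
```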

We can also concatenate a Series or a list-like to a Series.

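Remember that nothing can be concatenated to a NaN value: a row that is missing on either side stays NaN in the result unless na_rep is given. The list must also contain exactly one item per row of the Series, with the NaN rows counted. If the list has more or fewer items, a ValueError is raised.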
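A sketch of element-wise concatenation with a list, on made-up data. The list has one entry per row of the Series, including the row that holds NaN. The names s, suffixes and raised are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in the third row.
s = pd.Series(["a", "b", np.nan, "d"])
suffixes = ["A", "B", "C", "D"]  # one item per row, NaN row included

# Row-by-row concatenation; the row with NaN stays missing.
combined = s.str.cat(suffixes, sep="-")
print(combined)

# na_rep replaces the missing value before concatenating.
filled = s.str.cat(suffixes, sep="-", na_rep="?")
print(filled)

# A list of the wrong length is rejected.
raised = False
try:
    s.str.cat(["A", "B", "C"])
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```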

In the same way, any string method can be applied through Series.str. A DataFrame itself has no str accessor, so with a DataFrame we select a column (which is a Series) and use its str attribute.
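Since the str accessor lives on a Series, one way to use it with a DataFrame is to go column by column. The sketch below uses a made-up DataFrame; the column names and the text_cols list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with untidy text and one missing name.
df = pd.DataFrame(
    {
        "name": ["  Tom ", "ANNA", np.nan],
        "city": ["paris", " Rome", "OSLO"],
    }
)

# A single column is a Series, so .str is available and calls can be chained.
df["name_lower"] = df["name"].str.strip().str.lower()

# To treat several columns alike, apply the same steps to each column.
text_cols = ["name", "city"]
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip().str.title())

print(df)
```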

The rest of the methods are listed below and are used in the same way.

      1. isnumeric(): Checks whether all characters in each string are numeric.
      2. isupper(): Checks whether all letters in each string are uppercase.
      3. islower(): Checks whether all letters in each string are lowercase.
      4. swapcase(): Converts lowercase letters to uppercase and uppercase letters to lowercase.
      5. findall(pattern): Returns a list of all occurrences of the pattern in each string.
      6. find(pattern): Returns the position of the first occurrence of the pattern in each string, or -1 if it is not found.
      7. strip(): Strips whitespace (including newlines) from both sides of each string in the Series/Index.
      8. replace(a,b): Replaces occurrences of a with b.
      9. repeat(value): Repeats each element the specified number of times.
      10. count(pattern): Returns the number of times the pattern appears in each element.
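The methods listed above can be tried on a small made-up Series. The sample values are assumptions chosen so that each method has something to show, and no missing values are included to keep the results simple.

```python
import pandas as pd

# Hypothetical sample data without missing values.
s = pd.Series(["Tom", "ANNA", "123", "a-b-a"])

numeric = s.str.isnumeric()      # True only for "123"
upper_flags = s.str.isupper()    # True only for "ANNA"
lower_flags = s.str.islower()    # True only for "a-b-a"
swapped = s.str.swapcase()       # "Tom" -> "tOM"
found = s.str.findall("a")       # list of matches per string
first_pos = s.str.find("a")      # -1 where "a" does not occur
replaced = s.str.replace("a", "x", regex=False)  # literal replacement
repeated = s.str.repeat(2)       # "Tom" -> "TomTom"
counts = s.str.count("a")        # number of "a" per string

# strip() removes spaces, tabs and newlines on both sides.
padded = pd.Series(["  hi\n", "\tthere "])
stripped = padded.str.strip()

print(pd.DataFrame({"value": s, "swapcase": swapped, "count_a": counts}))
```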