Where real data may not be available, or user privacy is a concern, use Python's Faker to generate dummy data for application testing.
It can be expensive to collect and clean real-world data. In some cases, such as bank fraud or cyberattacks, the data may be unavailable. That’s why synthetic data is extremely useful. Companies use it to test new applications; machine learning experts rely on it to improve model performance. And if you care about user privacy, you’ll use synthetic data to replace the names, emails, and home addresses of real users.
In this article, I introduce you to the Python Faker library, with examples of how you can use it to generate various data for your testing needs.
What is Python's Faker?
Faker is an open-source Python package that generates synthetic data for you. The library is inspired by PHP Faker, Ruby Faker, and Perl Faker. You can use it to create fake data of all kinds for developing and testing applications and machine learning models.
Let’s look at how we can use several Faker providers to generate dummy data.
First, you’ll have to install the faker library. To do that, type pip install faker into the Command Prompt (or any terminal) and press Enter. Faker will then be installed and ready to use.
Once you have the library installed, it’s pretty easy to start using it to generate data.
Let's explore some examples!
Example 1
# Import Faker from the faker library
from faker import Faker
# Create an instance of the class Faker
fake = Faker()
# Use it to create other objects such as names
print(fake.name())
#Output
'Kristin Watkins'
You can use first_name(), last_name(), first_name_male(), and last_name_female() to generate the corresponding parts of a name.
See more example providers below:
Example 2:
# emails
print(fake.email())
# addresses
print(fake.address())
print(fake.street_address())
# dates of birth
print(fake.date_of_birth())
# dates of birth with ranges
print(fake.date_this_century())
print(fake.date_this_decade(before_today=True))
# social security numbers
print(fake.ssn())
# passwords
print(fake.password())
print(fake.password(special_chars=True, upper_case=True))
print(fake.password(special_chars=False, lower_case=True))
# texts
print(fake.text()) # generates a random (often meaningless) text; fake.texts() returns a list of such texts
Faker also has providers for phone_number(), credit_card_full(), license_plate(), url(), domain_name(), etc.
To generate data in batches, the simplest approach is to use a for loop. For example, let's generate three fake user profiles containing names, dates of birth, and street addresses.
Example 3:
from faker import Faker
fake = Faker()
# Print three fake user profiles
for _ in range(3):
    print(f"Name : {fake.name()}")
    print(f"DOB : {fake.date_of_birth()}")
    print(f"Address : {fake.street_address()}")
#Output:
Name : Andrew Gibson
DOB : 1909-11-06
Address : 399 Karen Lodge
Name : Matthew Hood
DOB : 1947-07-03
Address : 58996 Collins Forge
Name : Daniel Potts
DOB : 1961-10-14
Address : 2427 Debra Locks Apt. 065
You can also create a full profile without needing to type every profile element one by one.
Example 4
print(fake.profile()) # A detailed profile
print(fake.simple_profile()) # A less detailed profile
The first print call will produce a detailed profile including info such as current location, website, and even blood group. The second will generate a much simpler profile with only the most essential info such as name, date of birth, and sex.
Faker is also highly customizable. You can set the Faker generator to any supported locale (a language and region code such as es_ES), and it will generate data reflecting that region and language.
Example 5:
from faker import Faker
import pprint
fake = Faker('es_ES') # sets Faker to Spanish
pprint.pprint(fake.simple_profile())
#Output
{'address': 'Cañada Gracia Iborra 39\nValladolid, 35795',
'birthdate': datetime.date(1973, 3, 25),
'mail': 'nsanchez@yahoo.com',
'name': 'Pili Carballo Alemán',
'sex': 'F',
'username': 'renefrancisco'}
Notice how the data generated reflects the language and region Faker is set to.
As with fake.text(), you can also generate words(), sentences(), and paragraphs(). You can even specify a list of words from which Faker should generate the data.
Example 6:
from faker import Faker
fake = Faker()
# generate three sentences from a given list of words
word_list = ['your', 'sentences', 'should', 'have', 'only', 'these']
for _ in range(3):
    print(fake.sentence(ext_word_list=word_list))
#Output
These sentences only only your only your these.
Should sentences sentences your these.
Should only have only should sentences your.
Since Faker is all about randomness, the data you generate will be different every time. But in real-life tests, we sometimes want to reuse a certain data set. The seed() method lets you generate the same data over and over again.
Example 7:
>>> from faker import Faker
>>> fake = Faker()
>>> Faker.seed(2) # Any number works as a seed
>>> for _ in range(3):
...     print(fake.name())
#Output
Theresa Brown
Russell Fitzgerald
Elizabeth Obrien
Now, every time you generate names after setting the same seed number, Faker will produce the same three names.
Let's experiment.
Example 8:
# Create three names with a new Faker object, without setting the seed number.
>>> from faker import Faker
>>> fake = Faker()
>>> for _ in range(3):
...     print(fake.name())
#Output
Jeffrey Simpson
David Robinson
Dylan Smith
Of course, the output is different, because the names are randomly generated. Now let's generate some more names, but this time with the same seed number we used in Example 7.
Example 9:
>>> fake = Faker()
>>> Faker.seed(2)
>>> for _ in range(3):
...     print(fake.name())
#Output
Theresa Brown
Russell Fitzgerald
Elizabeth Obrien
Compare these names with the ones we generated in Example 7. They're the same set of names!
Faker also has providers for generating random file names and file paths. Let's try some examples.
# random file name
print(fake.file_name())
#Output
direction.txt
# file name with specified category
print(fake.file_name(category='video'))
#Output
rate.mp4
# file name with specified category and extension
print(fake.file_name(category='audio', extension='wav'))
#Output
news.wav
If you don't specify a category, Faker picks one at random for you. The valid categories are audio, video, text, image, and office.
The method for generating a file path is essentially the same. In this method, the keyword depth sets the depth of the directory path, i.e., how many folders precede the file name. The generated file path begins with a forward slash (/).
>>> fake = Faker()
>>> for _ in range(3):
...     print(fake.file_path(depth=3))
#Output:
# A random category
/policy/office/down/experience.wav
/together/have/draw/natural.bmp
/laugh/born/follow/poor.js
You can see that, though the file paths generated all have depths of 3, their categories are random. You can specify the category and extension you want, and Faker will generate them accordingly.
>>> for _ in range(3):
...     print(fake.file_path(depth=3, category='image', extension='jpeg'))
#Output
/college/experience/lot/follow.jpeg
/season/note/figure/cut.jpeg
/act/that/appear/one.jpeg
Now all the files have the specified category (image) and extension (jpeg).
The beauty of Faker is that you can integrate it seamlessly with other Python libraries and tools. For instance, in just a few lines of code, we could generate fake profiles and organize them into a pandas DataFrame.
Check this out:
>>> import pandas as pd
>>> from faker import Faker
>>> fake = Faker()
>>> data = pd.DataFrame([fake.simple_profile() for _ in range(5)])
>>> print(data)
#Output
  username      name           mail                   birthdate
- perezmisty    Britt Bailey   chery@hotmail.com      1975-04-02
- scott75       Vicky Warren   nroberts@hotmail.com   1972-05-23
- cassandra     Christy Lewis  ariel50@gmail.com      1944-01-06
- willjason     Toni Garcia    mikewarren@yahoo.com   1947-07-07
- smithmichael  Linda Ross     katherine@gmail.com    1992-07-02
The data generated is a pandas DataFrame with 5 rows, one per profile (the output above is trimmed to 4 columns to fit in the code block).
This article covers only some basic methods in the library. For more advanced (and certainly more exciting) methods, skim through the official documentation.
That's it. Thanks for reading! Please feel free to share your thoughts about this awesome library in the comments.