Cryptography, Database Security And Anonymization

1 Cryptography

1.1 General information

Cryptography is the study of secure communication techniques that allow only the sender and the intended recipient to see the contents of a message.

The term is derived from the Greek word kryptos, which means hidden, and is closely related to encryption, the act of scrambling ordinary text (plaintext) into what is known as ciphertext and back again upon arrival. The ancient Egyptians are known to have used such methods in complex hieroglyphics, and the Roman emperor Julius Caesar is credited with using one of the first ciphers.

When electronic data are transmitted, the most common use of cryptography is encrypting and decrypting e-mail and other plain-text messages. The simplest method, the symmetric or secret-key system, works as follows: data is encoded with a secret key, and then both the encoded message and the secret key are sent to the recipient for decryption. If the message is intercepted, a third party has everything it needs to decrypt and read it. To address this problem, cryptographers devised the asymmetric or "public key" system. In this case, each user has two keys: one public and one private. Senders request the public key of the intended recipient, encrypt the message with it and send it on; when the message arrives, only the recipient's private key can decode it, so theft of the message is useless without the corresponding private key.

Cryptography, also called cryptology, is the practice and study of techniques for secure communication in the presence of adversarial behaviour and, more generally, the construction and analysis of protocols that prevent third parties or the public from reading private messages. Various aspects of information security, such as data confidentiality, data integrity, authentication and non-repudiation, are fundamental concerns of modern cryptography.

Cryptography is a branch of applied mathematics that is used to secure information and maintain its privacy. In practical terms, this involves converting a plain text (a file, a string of characters or bits) into a cryptic text called ciphertext. This process of converting or encoding plain text is called encryption; the reverse process, converting ciphertext back into plain text, is called decryption.
Both processes use, in one form or another, an encryption procedure called an encryption algorithm. Most of these algorithms are in the public domain, i.e. they are publicly known; the secret (private) character of the communication is provided by the use of an encryption/decryption key, ideally known only by the entities entitled to know it, at both ends of the communication channel.
Cryptology is the branch of mathematics that describes the mathematical foundations of cryptographic methods as well as the principles of authentication and restriction of access to information. The term cryptology encompasses both cryptography (encrypting information) and cryptanalysis (analysing encrypted information), i.e. the art of breaking encryption systems.

1.2 Goals and purposes

Modern cryptography has four main goals to protect data assets, messages and/or transmission channels:
1. Confidentiality/access protection: Only authorized persons should be able to read the data or the message or obtain information about its content.
2. Integrity/change protection: The data must be demonstrably complete and unchanged.
3. Authenticity/anti-counterfeiting: The originator of the data or the sender of the message should be clearly identifiable, and their authorship should be verifiable.
4. Liability/non-repudiation: The originator of the data or the sender of a message should not be able to dispute their authorship, i.e. it should be possible to prove it to third parties.

1.3 Methods

Cryptographic methods are divided into classic and modern methods.

Methods of classical cryptography: as long as no electronic computers were used for cryptography, encryption (the only application of cryptography at the time) always replaced complete letters or groups of letters. Such methods are now outdated and insecure.

  • Transposition: The letters of the message are simply rearranged. Examples: the rail (picket) fence method or the scytale.
  • Substitution: Each letter of the message is replaced by a different letter or symbol; see monoalphabetic and polyalphabetic substitution. Examples are the Caesar cipher and the Vigenère cipher (a minimal Caesar sketch follows this list).
  • Code books, also a classic method.
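
To make the classical substitution idea concrete, here is a minimal Python sketch of the Caesar cipher with the traditional shift of 3; the function names are illustrative only, and the code is a teaching sketch rather than anything usable for real security.

```python
import string

ALPHABET = string.ascii_uppercase  # classical ciphers worked on letters only

def caesar_encrypt(plaintext: str, shift: int = 3) -> str:
    """Monoalphabetic substitution: replace each letter by the one `shift` places later."""
    table = str.maketrans(ALPHABET, ALPHABET[shift:] + ALPHABET[:shift])
    return plaintext.upper().translate(table)

def caesar_decrypt(ciphertext: str, shift: int = 3) -> str:
    """Decryption is simply the inverse shift."""
    return caesar_encrypt(ciphertext, -shift % 26)

print(caesar_encrypt("ATTACK AT DAWN"))   # DWWDFN DW GDZQ
print(caesar_decrypt("DWWDFN DW GDZQ"))   # ATTACK AT DAWN
```

The tiny key space (only 25 possible shifts) is exactly why such classical methods are now considered unsafe.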
Methods of modern cryptography: In line with how computers work, modern cryptographic methods no longer operate on whole letters but on the individual bits of the data. This greatly increases the number of possible transformations and also allows non-text data to be processed. Modern crypto methods can be divided into two classes: like classical methods, symmetric methods use one secret key per communication relationship for all operations of the method (e.g. encryption and decryption), while asymmetric methods give each participant a private (i.e. secret) key and a public key. Almost all asymmetric cryptographic methods are based on operations in discrete mathematical structures, such as finite fields, rings, elliptic curves or lattices, and their security rests on the difficulty of certain computational problems in these structures. In contrast, many symmetric methods and (cryptological) hash functions are rather ad hoc constructions based on bit operations (e.g. XOR) and substitution tables for bit sequences. Some symmetric methods, such as the Advanced Encryption Standard, secret sharing, or stream ciphers based on linear feedback shift registers, also use mathematical structures or can be easily described in terms of them.
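
By contrast with the letter-based Caesar sketch, the following minimal Python sketch operates on individual bytes, XOR-ing the data with a secret key. It illustrates only the XOR building block mentioned above, not any standardised algorithm, and the helper name is invented for illustration.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy symmetric operation: XOR each data byte with a key byte (key repeated as needed).
    Applying it twice with the same key restores the original data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = secrets.token_bytes(16)                 # shared secret key
ciphertext = xor_bytes(b"any data, not just text", key)
plaintext = xor_bytes(ciphertext, key)        # the same operation decrypts
assert plaintext == b"any data, not just text"
```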


1.4 Cryptographic protocols and standards

  • DNSSEC (Domain Name System Security Extensions): a protocol for securing distributed naming services.
  • SSL (Secure Sockets Layer): the main protocol for secure WWW connections, of increasing importance due to the growing traffic of sensitive information. The latest version of the protocol is called TLS (Transport Layer Security). It was originally developed by Netscape as an open protocol standard.
  • S-HTTP: a newer, more flexible protocol than SSL/TLS, specified in RFC 2660.
  • PKCS (Public-Key Cryptography Standards): developed by RSA Data Security, these define safe ways to use RSA.

1.5 Vulnerability of cryptographic algorithms

Good cryptographic systems must always be designed to be as difficult to break as possible.

It is possible to build systems that cannot be broken in practice (although this cannot usually be proven). Doing so does not significantly increase the effort of implementing the system; however, care and expertise are required. There is no excuse for a system designer to make the system vulnerable from the start. Any mechanisms that can be used to circumvent security must be explained, documented and brought to the attention of end users.
In theory, any cryptographic method with a key can be broken by trying all keys in succession. If brute-forcing all the keys is the only option, the necessary computing power increases exponentially with the key length. A 32-bit key requires 2^32 (about 10^9) steps, something anyone can do on a home computer. A 56-bit key system, such as DES, requires substantial effort, but with massively distributed systems only a few hours of computation are needed. In 1999, a brute-force search using a specially designed supercomputer and a worldwide network of almost 100,000 PCs on the Internet found a DES key in 22 hours and 15 minutes. Keys of at least 128 bits (as in AES, for example) are believed to be sufficient against brute-force attacks for the near future.
However, key length is not the only relevant issue. Many ciphers can be broken without trying all possible keys. In general, it is very difficult to design encryption methods that cannot be broken more efficiently by other means.
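
As a quick sanity check of these figures, a few lines of Python compute the key-space sizes and the time an exhaustive search would take at an assumed, purely illustrative rate of 10^9 key trials per second:

```python
# Rough brute-force cost: number of keys to try versus key length.
# The rate of 10**9 trials per second is an arbitrary illustrative figure.
RATE = 10**9  # keys per second (assumed)

for bits in (32, 56, 128):
    keys = 2**bits
    seconds = keys / RATE
    years = seconds / (3600 * 24 * 365)
    print(f"{bits:>3}-bit key: {keys:.3e} keys, ~{years:.3e} years at {RATE:.0e} keys/s")
```

The output shows a 32-bit search finishing in seconds, a 56-bit search taking on the order of years on a single machine (hence the distributed effort against DES), and a 128-bit search remaining far beyond reach.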

1.6 Cryptographic analysis and cracking of encryption systems

Cryptanalysis is the art of decrypting encrypted communications without knowing the correct keys. There are many cryptanalytic techniques; some of the most important ones for a system implementer are described below.

  • Ciphertext-only attack: the attacker knows nothing about the message content and has to work with the ciphertext alone. In practice, it is often possible to make assumptions about the original text, since many types of messages have fixed-format headers, and even ordinary letters and documents begin in very predictable ways. For example, many classical attacks use frequency analysis of the ciphertext, but this does not work well against modern algorithms. Modern encryption systems are not vulnerable to ciphertext-only attacks, although they are sometimes analysed under the additional assumption that the message contains some known statistical peculiarities.
  • Known plaintext attack: the attacker knows or can guess the original text of certain parts of the ciphertext. The task is to decrypt the rest of the ciphertext blocks using this information, either by determining the key used to encrypt the data or via some shortcut. One of the best known methods exploiting partial knowledge of the original text is linear cryptanalysis against block ciphers.
  • Chosen plaintext attack: the attacker can have arbitrary text of their choosing encrypted with the unknown key. The task is to determine the key used for encryption. A good example of this attack is differential cryptanalysis, which can be applied against block ciphers (and, in some cases, also against hash functions).
  • Man-in-the-middle attack: this attack is relevant for cryptographic communication and key distribution protocols. The idea is that while two parties, A and B, are exchanging keys for secure communication (for example, using the Diffie-Hellman protocol), an adversary positioned between them can intercept and substitute the exchanged keys. The usual way to prevent a man-in-the-middle attack is to use public-key encryption capable of providing digital signatures. For this setup, the parties must know each other's public keys in advance. Once the shared secret has been generated, the parties exchange digital signatures over it; the man in the middle fails because he cannot create these signatures without knowing the private keys used for signing. This solution is sufficient if there is a way to distribute public keys securely; one such mechanism is a certification hierarchy such as X.509, used for example in IPsec.
  • Correlation between the secret key and the outputs of the encryption system is the main source of information for the cryptanalyst. In the simplest case, information about the secret key leaks directly from the encryption system. More complicated cases require studying the correlation (essentially, any relationship that would not be expected by chance alone) between observed or measured information about the encryption system and the guessed key information. For example, in linear (or differential) attacks against block ciphers, the cryptanalyst studies the known (respectively chosen) plaintexts and the observed ciphertexts; by guessing a few bits of the encryption key, the analyst determines from the correlation between plaintext and ciphertext whether the guess was correct. This can be repeated and has many variations.
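
To make the frequency-analysis idea concrete, here is a minimal Python sketch of a ciphertext-only attack on a Caesar cipher, assuming the most frequent ciphertext letter corresponds to the plaintext letter E. As noted above, this kind of attack does not work against modern algorithms; the function name is invented for illustration.

```python
from collections import Counter

def guess_caesar_shift(ciphertext: str) -> int:
    """Naive ciphertext-only attack: assume the most common ciphertext letter stands for 'E'."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord("E")) % 26

ct = "GHIHQG WKH HDVW ZDOO RI WKH FDVWOH"   # Caesar encryption with a shift of 3
print(guess_caesar_shift(ct))                # 3
```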

1.7 Hash function

A cryptographic hash function is a transformation that takes an input and returns a fixed-size string, called the hash value, checksum or digest.

Hash functions with this property are used for a variety of computational purposes, including in cryptography. The hash value is a concise representation of the longer message or document from which it was calculated; it can be seen as a kind of "fingerprint" of the larger document. Cryptographic hash functions are used to perform message integrity checks and to generate digital signatures, as well as in other information security applications such as message authentication.
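
For instance, Python's standard hashlib module exposes cryptographic hash functions such as SHA-256; the digest below is the fixed-size "fingerprint" described above (the message text is arbitrary).

```python
import hashlib

message = b"The quick brown fox jumps over the lazy dog"
digest = hashlib.sha256(message).hexdigest()
print(len(digest) * 4, "bits:", digest)   # 256 bits, regardless of message length

# Any change to the input, however small, yields a completely different hash value.
assert hashlib.sha256(b"The quick brown fox jumps over the lazy dog.").hexdigest() != digest
```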

1.8 History of cryptography

Classic cryptography
The earliest use of cryptography can be found in the third millennium BC, in the Ancient Egyptian cryptography of the Old Kingdom. Medieval Hebrew scholars used simple character-substitution algorithms (such as the Atbash cipher). In the Middle Ages, a variety of secret scripts were used throughout Europe to protect diplomatic correspondence, such as the Alphabetum Kaldeorum. Secret writings were also used for medical texts, for example to write down prescriptions against syphilis, which began spreading from 1495.[5]

At the end of the 19th century, new considerations in cryptography arose due to the widespread use of the telegraph (which could easily be tapped and eavesdropped on). Auguste Kerckhoffs von Nieuwenhof formulated a principle of cryptography with Kerckhoffs' principle, according to which the security of a cryptographic process should only depend on the secrecy of the key and not on that of the process. Rather, the procedure itself can be published and examined by experts for its suitability.
Cryptography in the Second World War
During the Second World War, mechanical and electromechanical cipher machines such as the T52 or the SZ 42 were widely used, although manual ciphers such as the double box cipher continued to be used in areas where machines were impractical. Great advances in mathematical cryptography were made during this period, although, of necessity, only in secret. The German military made extensive use of a machine known as ENIGMA, which was broken by Polish code breakers from 1932 and by British code breakers from 1939.

1.9 Modern cryptography

The age of modern cryptography began with Claude Shannon, arguably the father of mathematical cryptography. In 1949 he published the article "Communication Theory of Secrecy Systems". This article, together with his other work on information and communication theory, established a strong mathematical basis for cryptography. It also ended a phase of cryptography that relied on keeping the procedure itself secret in order to prevent or hinder decryption by third parties. Instead of this tactic, also known as security by obscurity, cryptographic methods now have to stand up to open scientific scrutiny.

1.10 Data Encryption Standard (DES)

In 1976 there were two important advances. The first was the DES (Data Encryption Standard) algorithm, developed by IBM and the National Security Agency (NSA) to create a secure, uniform standard for inter-agency encryption; DES was published in 1977 as FIPS 46 (Federal Information Processing Standard). DES and more secure variants of it (3DES) are still used today, e.g. for banking services. DES was replaced as a standard by AES (FIPS 197) in 2001.

1.11 Asymmetric cryptography (public-key cryptography)

Public key cryptography is a cryptographic method in which, in contrast to a symmetric cryptosystem, the communicating parties do not need to know a shared secret key. Each user generates their own key pair, which consists of a secret part (private key) and a non-secret part (public key). The public key allows anyone to encrypt data for the owner of the private key, verify their digital signatures, or authenticate them. The private key enables its owner to decrypt data encrypted with the public key, to create digital signatures or to authenticate themselves.
Principle
The private key must be kept secret and it must be practically impossible to calculate it from the public key. The public key must be accessible to anyone who wants to send an encrypted message to the owner of the private key. It must be ensured that the public key is actually assigned to the recipient.

(Figures: generation of a key pair, in which blue elements are secret and orange are public; encryption with the public key and decryption with the private key; signing with the private key and verification with the public key.)

The theoretical basis of asymmetric cryptosystems is trapdoor functions, i.e. functions that are easy to compute but practically impossible to invert without a secret (the "trapdoor"). The public key is then a description of the function, and the private key is the trapdoor. A prerequisite is, of course, that the private key cannot be calculated from the public one. For the cryptosystem to be usable, the communication partner must know the public key.
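
A deliberately tiny, insecure RSA-style sketch in Python illustrates the trapdoor idea: knowing the factorisation of n is the trapdoor that makes the private exponent easy to compute. Real keys use primes hundreds of digits long; the numbers below are textbook toy values, chosen only for demonstration.

```python
# Toy RSA with tiny primes -- insecure, for illustration only (requires Python 3.8+ for pow with -1).
p, q = 61, 53
n = p * q                 # 3233: part of the public key
phi = (p - 1) * (q - 1)   # 3120: derivable only with the factorisation (the "trapdoor")
e = 17                    # public exponent
d = pow(e, -1, phi)       # 2753: private exponent, the modular inverse of e

message = 65
ciphertext = pow(message, e, n)      # anyone can encrypt with the public key (n, e)
recovered = pow(ciphertext, d, n)    # only the holder of d can decrypt
assert recovered == message
```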
The decisive advantage of asymmetric methods is that they reduce the key distribution problem. With symmetric methods, a key must be exchanged via a secure, i.e. tap-proof and tamper-proof, channel. Since the public key is not secret, with asymmetric methods the channel does not need to be tap-proof; the only important thing is that the public key can be unequivocally assigned to the owner of the associated private key. For this purpose, a trustworthy certification authority can, for example, issue a digital certificate that binds the public key to the owner of the private key. As an alternative, a web of trust can be set up without a central authority, by participants mutually certifying each other's keys.
How secure is it?
For the security of asymmetric methods, it is necessary that the one-way functions underlying the various methods are practically irreversible, since otherwise the private key could be calculated from the public key. The security of all asymmetric cryptosystems currently rests on unproven assumptions, in particular the assumption that P is not equal to NP. The non-reversibility of the used trapdoor functions is not proven. As a rule, however, these assumptions are strongly presumed to be correct. The information-theoretical security that can be achieved with the symmetric one-time pad cannot be achieved with an asymmetric method because a correspondingly powerful attacker can always solve the underlying mathematical problem.

History of asymmetric cryptography
Up until the 1970s, there were only symmetric cryptosystems, in which sender and receiver must have the same key. This raises the problem of key exchange and key management. Ralph Merkle took the first step towards developing asymmetric methods in 1974 with Merkle's Puzzles, named after him but not published until 1978. The first public-key encryption method was the Merkle-Hellman cryptosystem developed by Ralph Merkle and Martin Hellman; the MH method was broken by Adi Shamir in 1983. In the summer of 1975, Whitfield Diffie and Martin Hellman published the idea of asymmetric encryption, though without knowing an exact procedure. Influenced by this work, Diffie and Hellman developed the Diffie-Hellman key exchange in 1976.

The first asymmetric encryption method was developed in 1977 by Ronald L. Rivest, Adi Shamir and Leonard M. Adleman at MIT and was named the RSA method after their initials. In current terminology, it is a trapdoor permutation that can be used to construct encryption methods as well as signature methods.
Independently of the developments in academic cryptology, in the early 1970s three employees of the British Government Communications Headquarters, James H. Ellis, Clifford Cocks and Malcolm Williamson, developed both a key exchange similar to Diffie-Hellman and an asymmetric scheme similar to the RSA cryptosystem; for reasons of secrecy this work was not published, nor was a patent applied for.

2 Database security

Database (DB) is a way of storing information and data on an external medium (a storage device), with the possibility of easy expansion and quick retrieval.

A database (abbreviated DB) makes it possible to store information on an external medium with easy expansion and quick retrieval. At first glance the task may seem trivial. However, good solutions are not simple when working with millions of items, each of which may consist of large quantities of data, accessed simultaneously over the Internet by thousands of users spread across the globe, and when the availability of the application and data must be permanent (e.g. to avoid losing business).
Databases have the ability to serve dynamic content and are essential components of web applications.
Secret or confidential information is often stored in a database and because of this it is necessary to protect databases.

A database connection is required to receive or send information. The most commonly used query language is Structured Query Language (SQL).

Importance of databases:

  • Provides us with an extremely efficient method to easily handle large amounts of different types of data;
  • It allows the systematic storage of large amounts of data, and this data can be easily retrieved, filtered, sorted and updated efficiently and accurately;
  • Provides accuracy: in most cases, the information available in a database can be relied on to be correct;
  • Easy updating: in a database it is easy to update data using the various Data Manipulation Languages (DML) available; one of these languages is SQL (Structured Query Language);
  • Data security: databases have various methods to ensure data security. User logins are required before a database can be accessed, and access specifiers allow only authorised users to access it;
  • Easy access: it is very easy to access and search data in a database. This is done using Data Query Languages (DQL) that allow searching any data in the database and performing calculations on it.

2.1 Database design

The first step is always to create the database, unless you want to use a database created by someone else. When a database is created, it is assigned to a specific user, who executed the creation command. In general, only the owner (or superuser) can do anything with the objects in that database, and to let other users use it, they must have privileges.
Applications should never connect to a database with administrator or superuser privileges, because these users can execute any kind of query, for example, schema modification (deleting tables) or deleting the entire contents.
Different database users can be created for each aspect of the application, with strictly delimited rights to database objects. Grant only the privileges that are strictly necessary, and avoid having the same user interact with several databases: if an intruder gains access to the database on behalf of your application, he will only be able to perform the operations that your application itself can perform.
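
As a sketch of this least-privilege setup (assuming a PostgreSQL database accessed through the psycopg2 driver; the database, table and account names are invented, and the tables are assumed to exist), an administrator creates a dedicated application account and grants it only what the application needs:

```python
import psycopg2  # assumed PostgreSQL driver; other DB-API drivers work similarly

# Administrative connection, used once, only to set up the restricted account.
admin = psycopg2.connect(dbname="shop", user="admin", password="***", host="localhost")
admin.autocommit = True
with admin.cursor() as cur:
    # Hypothetical application account with the minimum privileges the application needs.
    cur.execute("CREATE USER app_user WITH PASSWORD 'app-secret'")  # illustrative password
    cur.execute("GRANT SELECT, INSERT, UPDATE ON orders, customers TO app_user")
admin.close()

# The application itself then connects as app_user, never as admin or superuser.
app = psycopg2.connect(dbname="shop", user="app_user", password="app-secret", host="localhost")
```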

2.2 Connecting to the database

You can connect to the database using SSL (Secure Sockets Layer, a protocol in which data between the user and the server is encrypted and decrypted so that an attacker cannot break into the connection to steal data) to increase data security, or you can use SSH (Secure Shell, a cryptographic network protocol that transfers data over a secure channel between network devices, encrypting the traffic between network clients and the database server).
If one of these methods is used, then intercepting traffic and gaining access to sensitive database information would be very difficult for an attacker.
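
For example, with a PostgreSQL client library the application can insist on an encrypted connection; the connection parameters below, including sslmode, are driver-specific, and the host and credentials are placeholders.

```python
import psycopg2  # assumed PostgreSQL driver

# sslmode="require" refuses to connect unless the traffic is TLS-encrypted.
conn = psycopg2.connect(
    host="db.example.com",
    dbname="shop",
    user="app_user",
    password="app-secret",
    sslmode="require",
)
```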

2.3 Data security

Data security is the practice of protecting digital information against unauthorised access, corruption or theft throughout its lifecycle. It is a concept that covers all aspects of information security, from the physical security of hardware and storage devices to administrative and access controls and the logical security of software applications. It also includes organisational strategies and procedures.
Sound data security strategies, when properly implemented, protect an organisation's information assets against malicious activity, as well as against insider threats and human error, which remain among the main causes of data breaches today. Data security involves deploying tools and technologies that improve the organisation's visibility into where its critical data resides and how it is used. Ideally, these tools should be able to enforce protective measures such as encryption, data masking and redaction of sensitive files, and should provide reporting to simplify audits and compliance with regulatory requirements.
SSL/SSH protects data traversal from client to server, but SSL/SSH does not protect data stored in the database. SSL is a transit protocol.
Once the attacker gains access to the database directly (bypassing the web server), the stored information can be exposed or abused if it is not protected by the database itself. Therefore data encryption is a good measure to mitigate this risk, but too few databases offer this type of encryption.
The easiest way to work around this is to create your own encryption package and then use it from your PHP scripts: PHP can help here through several extensions, such as Mcrypt and Mhash, which cover a wide variety of encryption algorithms. The script encrypts the data before it is inserted into the database and decrypts it when it is retrieved.
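
The same approach can be sketched in Python, standing in for the PHP extensions mentioned above; the use of the third-party cryptography package's Fernet recipe is an assumption for illustration. The value is encrypted before the INSERT and decrypted after the SELECT.

```python
import sqlite3
from cryptography.fernet import Fernet  # assumed third-party package: pip install cryptography

key = Fernet.generate_key()   # must be stored outside the database, e.g. in a key vault
f = Fernet(key)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, ssn BLOB)")

# Encrypt before inserting ...
db.execute("INSERT INTO users (ssn) VALUES (?)", (f.encrypt(b"123-45-6789"),))

# ... and decrypt after retrieving.
ciphertext = db.execute("SELECT ssn FROM users").fetchone()[0]
print(f.decrypt(ciphertext))   # b'123-45-6789'
```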
For data that must remain confidential and whose exposure in readable form is not necessary in any context, hashing can be considered.
The best-known example is storing the MD5 hash of a password in a database rather than the password itself; the crypt() and md5() functions can also be considered.
Moreover, we can also mention some online encryption solutions that use military-grade encryption algorithms to protect data on their servers. These include applications such as iDrive (which transmits and stores data using 256-bit AES encryption) or CloudSafe (which transfers data using the highest standard of SSL encryption, mostly EV SSL, with AES-256).

2.4 SQL injection

Many web developers are unaware of how SQL queries can be tampered with and place complete trust in them. SQL queries can bypass access controls, thereby circumventing authentication and authorisation checks, and sometimes even provide access to operating-system commands. Direct SQL injection is a technique in which the attacker creates or alters SQL commands to expose hidden data, overwrite values, or even execute dangerous commands at the system level. This happens when the application takes user input and combines it with static parameters to build an SQL query.
These examples are based on real cases: the lack of input validation, combined with connecting to the database with superuser rights or with a user that can even create other users, makes it possible for an attacker to create a superuser in the database. SQL injection is a web security vulnerability that allows an attacker to inject malicious SQL queries that interfere with the application's database operations. In general, it lets an attacker reach information that the access controls implemented by the application would not ordinarily let them see. This includes data belonging to other users or any other data the application itself holds. In many cases, the attacker can also modify or delete this data, causing persistent changes in the content or behaviour of the application.
In some circumstances, an attacker performs SQL injection attacks to compromise the underlying server or other back-end infrastructure, or to perform denial-of-service attacks (attacks that attempt to make network resources unavailable to their users).
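
A minimal Python/sqlite3 sketch of the problem and of the usual remedy, parameterised queries; the table, column and input values are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 's3cret')")

name = "' OR '1'='1"   # attacker-controlled input

# VULNERABLE: user input is concatenated directly into the SQL text,
# so the injected quote changes the meaning of the query.
rows = db.execute("SELECT * FROM users WHERE name = '" + name + "'").fetchall()
print(rows)   # returns every user, even though no user has this name

# SAFER: a parameterised query keeps the input as data, never as SQL code.
rows = db.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
print(rows)   # []
```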

2.5 Different types of database

The database model is the fundamental structure of a database and determines how data can be stored, organised and manipulated. There are four main models used to store data; depending on the specific needs, one of these models may be used:

1. Hierarchical databases
This is one of the oldest database models, developed by IBM for its Information Management System; in a hierarchical database model the data is organised into a tree structure.

Advantages:

  • the model enables us to easily add and delete new information;
  • data at the top of the hierarchy is very fast to reach;
  • it works well with sequential storage media such as tapes;
  • it fits naturally with anything organised as a one-to-many system; for example, in a hospital a single medical director manages 8 senior doctors who, in turn, manage 80 doctors.

Disadvantages:

  • data often has to be stored repeatedly in a number of different entities;
  • linear storage media such as tapes are no longer in everyday use;
  • searching for data requires the DBMS to traverse the entire model until the requested information is found, which makes queries very slow;
  • this model only supports one-to-many relationships.

2. Network databases
Network databases are similar to hierarchical databases; they are modified versions of the tree-based model. The network database model organises data as a graph and is designed to be flexible, able to represent objects and their relationships. A network database structure can be described using the concepts of nodes and sets: a node is a collection of records, while a set establishes and represents a relationship in the network database. Such a construction relates two nodes by using one as the owner and the other as a member.

3. Relational databases
This is the world's most widely used database model, based on the development of a conceptual model. The relational model made it possible to bring databases down to the level of the personal computer, without the expensive hardware required by minicomputers or mainframes. The relational model has a very strong theoretical base because it rests on mathematical set theory, which means that operations are well defined and their results are predictable.
The main advantage of the relational model is that it is not necessary to store pointers alongside the data within tables; instead, relationships are used to access corresponding values in multiple tables (a relationship connects records in two tables that share the same attribute values).
Since relational tables do not contain pointers, the data in such tables is independent of the methods the data management system uses when working with the tables.
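
A short sqlite3 sketch of this point: the two tables below are related only through matching attribute values (a customer id), not through stored pointers, and the relationship is expressed at query time. The table and column names are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bob');
    INSERT INTO orders    VALUES (10, 1, 'book'), (11, 2, 'lamp'), (12, 1, 'pen');
""")

# The relationship between the tables exists only through matching attribute values.
for row in db.execute("""
        SELECT c.name, o.item
        FROM customers c JOIN orders o ON o.customer_id = c.id
        ORDER BY c.name"""):
    print(row)   # ('Ana', 'book'), ('Ana', 'pen'), ('Bob', 'lamp')
```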

Relationship Properties:

  • the relationship consists of rows and columns;
  • the order of appearance of the line or column of the relationship is not important;
  • there are no explicit links between the tables (the connections are not visible as pointers);
  • each record is separately identifiable;
  • each row in the table has the same columns;
  • each column has a single data type (values of other types are not accepted).

4. Object-oriented databases
They describe data at the conceptual and external levels with high flexibility. Such models can explicitly specify the constraints applied to the data, and are based on the concepts of:

  • entity, a real-world object or concept that has its own identity;
  • attribute, a set of properties used to describe entities;
  • relations, an association between two or more entities.

Advantages:
An object database can manage different types of data, while a relational database manages only one kind. Object-oriented databases differ from traditional databases in that they can handle diverse data types such as images, video and text.

Disadvantages:
The system is more complex than traditional DBMSs.

2.6 Database history

In 1961, Charles Bachman developed the Integrated Data Store (IDS), the predecessor of the network model.
In the late 1960s, IBM developed the Information Management System (IMS), based on a hierarchical model.
Also in the late 1960s, the CODASYL group (the Conference on Data Systems Languages) defined and standardised the network model.
In 1969, Edgar Codd, an IBM researcher, created the relational model.
In 1976, Peter Chen established the entity-relationship model. In the 1980s, various database management systems (DBMSs) became available, including DB2, Oracle, Sybase, Informix, dBase and Paradox. In 1986, the first SQL standard was defined. Between 1990 and 2000, various concepts such as OODB, ORDB (Postgres), data warehouses, OLAP, data mining, GIS, mobile DB, multimedia DB, web DB and XML DB appeared.

3 Anonymised and pseudonymised data

What is personal data?

Personal data is any information relating to an identified or identifiable natural person. Different pieces of information which, taken together, can lead to the identification of a particular person are also personal data.
Personal data that has been anonymised, encrypted or pseudonymised but can still be used to re-identify a person remains personal data and is covered by the GDPR.
Personal data that has been anonymised in such a way that the person is no longer identified or identifiable is no longer considered personal data. For data to be truly anonymised, the anonymisation must be irreversible.
The GDPR protects personal data regardless of the technology used to process them: it is "technologically neutral" and applies to both automated and manual processing, provided that the data are organised according to predefined criteria (e.g. alphabetical order). It is also irrelevant how the data are stored, whether in an ICT system, through video surveillance or on paper; in all these cases personal data are subject to the protection requirements of the GDPR.
Data should not be stored longer than necessary: once the storage period is over, the data may be kept only if they have been completely anonymised.
The GDPR encourages the use of pseudonymisation as one of the most important protective measures (e.g. by replacing an identifying attribute with a code).

3.1 Anonymised data

Data anonymisation is the practice of protecting personal or confidential information by deleting or encrypting the identifiers that connect the stored data to an individual. It is done to protect private or corporate activity and to maintain the trustworthiness of the data collected and exchanged. Data are anonymised when every identifier has been removed.
If the data are completely anonymised, they fall outside the scope of regulatory protection.

Advantages of anonymizing data:

1. Protection against possible loss of market share and trust
Data anonymization is a method to ensure that the company understands and fulfils its duty to secure sensitive, personal and confidential data in a world with highly complex data protection mandates that may vary depending on the location of the company and its customers. It protects companies from potential losses of market shares and trust;
2. Protection against data abuse and insider trading risks
Data anonymization is a protection against data abuse and insider trading risks that lead to regulatory compliance consequences;
3. Increases governance and consistency of results
Data anonymisation also increases governance and the consistency of results. Accurate data allow you to use apps and services and to support big-data analytics while preserving privacy. It fuels digital transformation by providing protected data for use in generating new market value.

Disadvantages of anonymising data:

1. Regulatory compliance requires that websites obtain users' permission to collect personal information such as cookies, IP addresses and device identifiers; collecting only anonymous data and removing identifying data from the database limits the ability to extract relevant information from the results;
2. Anonymous information cannot, for example, be used for targeting or for personalising the user's experience.

3.2 Pseudonymised data

The GDPR promotes the pseudonymisation of personal data as one of the most effective data protection measures. Applying pseudonymisation to personal data can reduce the risks to data subjects and help controllers and processors meet their data protection obligations.
According to the GDPR, "pseudonymisation" means processing personal data in such a way that they can no longer be attributed to a specific data subject without the use of additional information.
Pseudonymisation can be achieved by, for example, replacing the ID with a code.
This does not mean that the data will be permanently and irrevocably separated from the name of the person, but that the association between data and identifiers should not be directly possible. For people who do not have the decoding key, re-identification is difficult to achieve. However, for people who have the decryption key, re-identification can be done easily.
Example: The sentence "Michelangelo Buonarroti, born 6th March 1475 sculpted the David" can be pseudonymised as follows:
"M.B. 1475 sculpted the David" or
“312 sculpted the David” or
"SmmZ!Ft4uT2 sculpted the David"
Persons with access to this pseudonymised data cannot identify Michelangelo Buonarroti, born 6th March 1475, from "312" or "SmmZ!Ft4uT2". Pseudonymisation therefore offers better protection against misuse: it is very useful when data controllers need to keep working with the same data subjects but not everyone involved (such as employees) needs to know the data subjects' real identities. Pseudonymisation techniques should be built into advanced data processing systems and new technologies at an early stage of their development.
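
A hedged Python sketch of the same idea: direct identifiers are replaced by random codes, and the mapping (the "decoding key") is kept separately so that only authorised staff can re-identify the data subjects. The record fields and helper names are illustrative.

```python
import secrets

mapping = {}   # the "decoding key": must be stored separately and protected

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a random code; remember the link in `mapping`."""
    code = secrets.token_hex(4)
    mapping[code] = identifier
    return code

record = {"name": "Michelangelo Buonarroti", "born": "1475-03-06", "fact": "sculpted the David"}
pseudonymised = {**record, "name": pseudonymise(record["name"]), "born": "####"}
print(pseudonymised)    # e.g. {'name': '9f2c11ab', 'born': '####', 'fact': 'sculpted the David'}

# Re-identification is only possible with access to the separately stored mapping.
print(mapping[pseudonymised["name"]])   # Michelangelo Buonarroti
```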

3.3 How is the data anonymised?

There are a number of data anonymisation techniques available. While many of these methods are designed to mask data sufficiently on their own, some can be combined with others to anonymise both direct and indirect identifiers. Some of the most common data anonymisation methods include:

  • Character masking: in character masking (or simply "masking") the format of the data is retained, but selected characters are replaced by a mask character such as "X" or "#" (e.g. changing the date of birth 11.03.2001 into ##.##.20##; see the sketch after this list);
  • Data shuffling is a technique that involves rearranging data so that its attributes remain present but do not match their original records. Shuffling data is often referred to as shuffling a deck of cards: this method is effective when it is not necessary to evaluate data based on the relationships between the information contained in each registry;
  • Data replacement: the data in a column are completely replaced with random values taken from a list of false but similar data. For example, surnames can be changed to unrelated surnames, or credit card numbers can be replaced with a random string of 16 digits. To anonymise data correctly with this method, users must have lists at least as large as the amount of data they are trying to anonymise;
  • Generalisation works by removing the specificity of the data and replacing it with more general but still useful information. Instead of saying that a person is 33 years old, generalised data may say that they are between 30 and 40 years old, and for addresses only the street names may be included;
  • Data and numbers variation: algorithms can be used to change the value of numerical data by random percentages. This small step can make a big difference if implemented properly;
  • Scrambling: with a suitable encoding, the letters are mixed and rearranged so that the original data cannot be determined (e.g. the name Federico becomes Odrief or Diferci);
  • Numerical shading is a way of obscuring data values so that they cannot be used to identify an individual. It can be achieved in several ways, including reporting rounded values or group averages. In some cases, certain columns or records contain identifying information that is useless to the data evaluator but still valuable; it is important that such data be removed entirely from the data table rather than simply hidden;
  • Synthetic data: unlike the other anonymisation techniques, synthetic data sets are imitated versions of real data rather than modified real data. These synthetic sets have much in common with the real data, such as their format and the relationships between data characteristics, and they are used when a large amount of data is needed to test systems and real data cannot be used.
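
Two of these techniques, character masking and generalisation, can be sketched in a few lines of Python; the masking format follows the date-of-birth example above, and the age bands are an arbitrary choice for illustration.

```python
def mask_date_of_birth(dob: str) -> str:
    """Character masking: keep the format, hide most characters (11.03.2001 -> ##.##.20##)."""
    return "##.##." + dob[-4:-2] + "##"

def generalise_age(age: int, width: int = 10) -> str:
    """Generalisation: report an age band instead of the exact value (33 -> '30-39')."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print(mask_date_of_birth("11.03.2001"))   # ##.##.20##
print(generalise_age(33))                 # 30-39
```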

3.4 Risks associated with data anonymisation

Although data anonymisation can bring significant progress for companies in all sectors, it is not without limitations and risks. If it is applied incorrectly or a weak algorithm is used, poor anonymisation can lead to:
- Identity disclosure, which describes circumstances in which all or some individuals can be identified within a data set;
- Attribute disclosure, which means that it is possible to determine that certain attributes in the data set belong to a particular individual (e.g. an anonymised data set may show that all employees of the sales department of a particular office arrive after 10 a.m.; anyone who knows that a given employee works in that office's sales department then knows that they arrive after 10 a.m., even though their specific identity is concealed in the data pool);
- Linkability, which refers to the possibility of linking multiple data points, either within the data set itself or across separate data sets, to build a more complete picture of a given individual;
- Inference disclosure occurs when it is possible to infer the value of an attribute from other attributes.
