The syntax of the C programming language is the set of rules governing the writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.
C syntax makes use of the maximal munch principle.
The C programming language represents numbers in three forms: integral, real and complex. This distinction reflects similar distinctions in the instruction set architecture of most central processing units. Integral data types store numbers in the set of integers, while real and complex numbers represent numbers (or pairs of numbers) in the set of real numbers in floating-point form.
All C integer types have signed and unsigned variants. If signed or unsigned is not specified explicitly, in most circumstances signed is assumed. However, for historic reasons, plain char is a type distinct from both signed char and unsigned char. It may be a signed type or an unsigned type, depending on the compiler and the character set (C guarantees that members of the C basic character set have positive values). Also, bit-field types specified as plain int may be signed or unsigned, depending on the compiler.
C's integer types come in different fixed sizes, capable of representing various ranges of numbers. The type char occupies exactly one byte (the smallest addressable storage unit), which is typically 8 bits wide. (Although char can represent any of C's "basic" characters, a wider type may be required for international character sets.) Most integer types have both signed and unsigned varieties, designated by the signed and unsigned keywords. Signed integer types may use a two's complement, ones' complement, or sign-and-magnitude representation. In many cases, there are multiple equivalent ways to designate the type; for example, signed short int and short are synonymous.
The representation of some types may include unused "padding" bits, which occupy storage but are not included in the width. The standard integer types and their minimum allowed widths (including any sign bit) are: _Bool, 1 bit; char, signed char and unsigned char, 8 bits; short and int (signed or unsigned), 16 bits; long, 32 bits; and long long, 64 bits.
The char type is distinct from both signed char and unsigned char, but is guaranteed to have the same representation as one of them. The _Bool and long long types have been standardized since 1999 and may not be supported by older C compilers. Type _Bool is usually accessed through the name bool, which the standard header stdbool.h defines as a macro (bool became a keyword in C23).
In general, the widths and representation scheme implemented for any given platform are chosen based on the machine architecture, with some consideration given to the ease of importing source code developed for other platforms. The width of the int type varies especially widely among C implementations; it often corresponds to the most "natural" word size for the specific platform. The standard header limits.h defines macros for the minimum and maximum representable values of the standard integer types as implemented on any specific platform.
In addition to the standard integer types, there may be other "extended" integer types, which can be used for typedefs in standard headers. For more precise specification of width, programmers can and should use typedefs from the standard header stdint.h.
Integer constants may be specified in source code in several ways. Numeric values can be specified as decimal (example: 1022), octal with zero (0) as a prefix (01776), or hexadecimal with 0x (zero x) as a prefix (0x3FE). A character in single quotes (example: 'R'), called a "character constant", represents the value of that character in the execution character set, with type int. Except for character constants, the type of an integer constant is determined by the width required to represent the specified value, but is always at least as wide as int. This can be overridden by appending an explicit length and/or signedness modifier; for example, 12lu has type unsigned long. There are no negative integer constants, but the same effect can often be obtained by using a unary negation operator "-".
The enumerated type in C, specified with the enum keyword, and often just called an "enum" (usually pronounced /ˈiːnʌm/ EE-num or /ˈiːnuːm/ EE-noom), is a type designed to represent values across a series of named constants. Each of the enumerated constants has type int. Each enum type itself is compatible with char or a signed or unsigned integer type, but each implementation defines its own rules for choosing a type.
Some compilers warn if an object with enumerated type is assigned a value that is not one of its constants. However, such an object can be assigned any value in the range of its compatible type, and enum constants can be used anywhere an integer is expected. For this reason, enum values are often used in place of preprocessor #define directives to create named constants. Such constants are generally safer to use than macros, since they reside within a specific identifier namespace.
An enumerated type is declared with the enum specifier and an optional name (or tag) for the enum, followed by a list of one or more constants contained within curly braces and separated by commas, and an optional list of variable names. Subsequent references to a specific enumerated type use the enum keyword and the name of the enum. By default, the first constant in an enumeration is assigned the value zero, and each subsequent value is incremented by one over the previous constant. Specific values may also be assigned to constants in the declaration, and any subsequent constants without specific values will be given incremented values from that point onward. For example, consider the following declaration:
enum colors { RED, GREEN, BLUE = 5, YELLOW } paint_color;
This declares the enum colors type; the int constants RED (whose value is 0), GREEN (whose value is one greater than RED, 1), BLUE (whose value is the given value, 5), and YELLOW (whose value is one greater than BLUE, 6); and the enum colors variable paint_color. The constants may be used outside of the context of the enum (where any integer value is allowed), and values other than the constants may be assigned to paint_color, or any other variable of type enum colors.
The floating-point form is used to represent numbers with a fractional component. They do not, however, represent most rational numbers exactly; they are instead a close approximation. There are three standard types of real values, denoted by their specifiers: single precision (float), double precision (double), and double extended precision (long double). Each of these may represent values in a different form, often one of the IEEE floating-point formats.
Floating-point constants may be written in decimal notation, e.g. 1.23. Decimal scientific notation may be used by adding e or E followed by a decimal exponent, also known as E notation, e.g. 1.23e2 (which has the value 1.23 × 10^2 = 123.0). Either a decimal point or an exponent is required (otherwise, the number is parsed as an integer constant). Hexadecimal floating-point constants follow similar rules, except that they must be prefixed by 0x and use p or P to specify a binary exponent, e.g. 0xAp-2 (which has the value 2.5, since A (hexadecimal) × 2^−2 = 10 × 2^−2 = 10 ÷ 4). Both decimal and hexadecimal floating-point constants may be suffixed by f or F to indicate a constant of type float, by l (letter l) or L to indicate type long double, or left unsuffixed for a double constant.
The standard header file float.h defines the minimum and maximum values of the implementation's floating-point types float, double, and long double. It also defines other limits that are relevant to the processing of floating-point numbers.
Every object has a storage class. This specifies most basically the storage duration, which may be static (default for global), automatic (default for local), or dynamic (allocated), together with other features (linkage and register hint).
Variables declared within a block by default have automatic storage, as do those explicitly declared with the auto[note 2] or register storage class specifiers. The auto and register specifiers may only be used within functions and function argument declarations; as such, the auto specifier is always redundant. Objects declared outside of all blocks and those explicitly declared with the static storage class specifier have static storage duration. Static variables are initialized to zero by default by the compiler.
Objects with automatic storage are local to the block in which they were declared and are discarded when the block is exited. Additionally, objects declared with the register storage class may be given higher priority by the compiler for access to registers, although the compiler may choose not to actually store any of them in a register. Objects with this storage class may not be used with the address-of (&) unary operator. Objects with static storage persist for the program's entire duration. In this way, the same object can be accessed by a function across multiple calls. Objects with allocated storage duration are created and destroyed explicitly with malloc, free, and related functions.
The extern storage class specifier indicates that the storage for an object has been defined elsewhere. When used inside a block, it indicates that the storage has been defined by a declaration outside of that block. When used outside of all blocks, it indicates that the storage has been defined outside of the compilation unit. The extern storage class specifier is redundant when used on a function declaration. It indicates that the declared function has been defined outside of the compilation unit.
The _Thread_local storage class specifier (spelled thread_local in C++, in C since C23, and in earlier versions of C if the header <threads.h> is included), introduced in C11, is used to declare a thread-local variable. It can be combined with static or extern to determine linkage.
Note that storage specifiers apply only to functions and objects; other things such as type and enum declarations are private to the compilation unit in which they appear. Types, on the other hand, have qualifiers (see below).
Types can be qualified to indicate special properties of their data. The type qualifier const indicates that a value does not change once it has been initialized. Attempting to modify a const-qualified value yields undefined behavior, so some C compilers store them in rodata or (for embedded systems) in read-only memory (ROM). The type qualifier volatile indicates to an optimizing compiler that it may not remove apparently redundant reads or writes, as the value may change even if it was not modified by any expression or statement, or multiple writes may be necessary, such as for memory-mapped I/O.
An incomplete type is a structure or union type whose members have not yet been specified, an array type whose dimension has not yet been specified, or the void type (the void type cannot be completed). Such a type may not be instantiated (its size is not known), nor may its members be accessed (they, too, are unknown); however, the derived pointer type may be used (but not dereferenced).
They are often used with pointers, either as forward or external declarations. For instance, code could declare an incomplete type like this:
struct thing *pt;
This declares pt as a pointer to struct thing, as well as the incomplete type struct thing. All pointers to structure types have the same representation regardless of which structure they point to, so this declaration is valid by itself (as long as pt is not dereferenced). The incomplete type can be completed later in the same scope by redeclaring it:
struct thing { int num; }; /* thing struct type is now completed */
Incomplete types are used to implement recursive structures; the body of the type declaration may be deferred to later in the translation unit:
typedef struct Bert Bert;
typedef struct Wilma Wilma;
struct Bert { Wilma *wilma; };
struct Wilma { Bert *bert; };
Incomplete types are also used for data hiding; the incomplete type is defined in a header file, and the body only within the relevant source file.
In declarations the asterisk modifier (*) specifies a pointer type. For example, where the specifier int would refer to the integer type, the specifier int* refers to the type "pointer to integer". Pointer values associate two pieces of information: a memory address and a data type. The following line of code declares a pointer-to-integer variable called ptr:
int *ptr;
When a non-static pointer is declared without an initializer, its value is indeterminate. The address associated with such a pointer must be set by assignment before the pointer is used. In the following example, ptr is set so that it points to the data associated with the variable a:
int a = 0;
int *ptr = &a;
To accomplish this, the "address-of" operator (unary &) is used. It produces the memory location of the data object that follows.
The pointed-to data can be accessed through a pointer value. In the following example, the integer variable b is set to the value of the integer variable a, which is 10:
int a = 10;
int *p;
p = &a;
int b = *p;
To accomplish that task, the unary dereference operator, denoted by an asterisk (*), is used. It returns the data to which its operand, which must be of pointer type, points. Thus, the expression *p denotes the same value as a. Dereferencing a null pointer results in undefined behavior.
Arrays are used in C to represent structures of consecutive elements of the same type. The definition of a (fixed-size) array has the following syntax:
int array[100];
which defines an array named array to hold 100 values of the primitive type int. If declared within a function, the array dimension may also be a non-constant expression, in which case memory for the specified number of elements will be allocated. In most contexts in later use, a mention of the variable array is converted to a pointer to the first item in the array. The sizeof operator is an exception: sizeof array yields the size of the entire array (that is, 100 times the size of an int, and sizeof(array) / sizeof(int) will return 100). Another exception is the & (address-of) operator, which yields a pointer to the entire array, for example
int (*ptr_to_array)[100] = &array;
The primary facility for accessing the values of the elements of an array is the array subscript operator. To access the i-indexed element of array, the syntax would be array[i], which refers to the value stored in that array element.
Array subscript numbering begins at 0 (see Zero-based indexing). The largest allowed array subscript is therefore equal to the number of elements in the array minus 1. To illustrate this, consider an array a declared as having 10 elements; the first element would be a[0] and the last element would be a[9].
C provides no facility for automatic bounds checking for array usage. Though logically the last subscript in an array of 10 elements would be 9, subscripts 10, 11, and so forth could accidentally be specified, with undefined results.
Because an array expression is converted to a pointer to its first element in most contexts, the addresses of each of the array elements can be expressed in equivalent pointer arithmetic. For example, array[0] can also be written as *array, array[1] as *(array + 1), and in general array[n - 1] as *(array + n - 1).
Since the expression a[i] is semantically equivalent to *(a+i), which in turn is equivalent to *(i+a), the expression can also be written as i[a], although this form is rarely used.
C99 standardised variable-length arrays (VLAs) within block scope. Such array variables are allocated based on the value of an integer value at runtime upon entry to a block, and are deallocated at the end of the block. [1] As of C11 this feature is no longer required to be implemented by the compiler.
int n = ...;
int a[n];
a[3] = 10;
This syntax produces an array whose size is fixed until the end of the block.
Arrays that can be resized dynamically can be produced with the help of the C standard library. The malloc function provides a simple method for allocating memory. It takes one parameter: the amount of memory to allocate in bytes. Upon successful allocation, malloc returns a generic (void) pointer value, pointing to the beginning of the allocated space. The pointer value returned is converted to an appropriate type implicitly by assignment. If the allocation could not be completed, malloc returns a null pointer. The following segment is therefore similar in function to the above desired declaration:
#include <stdlib.h> /* declares malloc */
...
int *a = malloc(n * sizeof *a);
a[3] = 10;
The result is a "pointer to int" variable (a) that points to the first of n contiguous int objects; because an array name is converted to a pointer in most contexts, this pointer can be used in place of an actual array name, as shown in the last line. The advantage of using this dynamic allocation is that the amount of memory that is allocated to it can be limited to what is actually needed at run time, and this can be changed as needed (using the standard library function realloc).
When the dynamically allocated memory is no longer needed, it should be released back to the run-time system. This is done with a call to the free function. It takes a single parameter: a pointer to previously allocated memory. This is the value that was returned by a previous call to malloc.
As a security measure, some programmers then set the pointer variable to NULL:
free(a);
a = NULL;
This ensures that further attempts to dereference the pointer will, on most systems, crash the program. If this is not done, the variable becomes a dangling pointer, which can lead to a use-after-free bug. However, if the pointer is a local variable, setting it to NULL does not prevent the program from using other copies of the pointer. Local use-after-free bugs are usually easy for static analyzers to recognize. Therefore, this approach is less useful for local pointers and is more often used with pointers stored in long-lived structures. In general, though, setting pointers to NULL is good practice, as it allows a programmer to check pointers against NULL before dereferencing them, which helps prevent crashes.
Recalling the array example, one could also create a fixed-size array through dynamic allocation:
int (*a)[100] = malloc(sizeof *a);
...which yields a pointer-to-array.
Accessing the pointer-to-array can be done in two ways:
(*a)[index];
index[*a];
Iterating can also be done in two ways:
for (int i = 0; i < 100; i++) (*a)[i];
for (int *i = a[0]; i < a[1]; i++) *i;
The benefit of using the second example is that the numeric limit of the first example is not required, which means that the pointer-to-array could be of any size and the second example can execute without any modifications.
In addition, C supports arrays of multiple dimensions, which are stored in row-major order. Technically, C multidimensional arrays are just one-dimensional arrays whose elements are arrays. The syntax for declaring multidimensional arrays is as follows:
int array2d[ROWS][COLUMNS];
where ROWS and COLUMNS are constants. This defines a two-dimensional array. Reading the subscripts from left to right, array2d is an array of length ROWS, each element of which is an array of COLUMNS integers.
To access an integer element in this multidimensional array, one would use
array2d[4][3]
Again, reading from left to right, this accesses the 5th row, and the 4th element in that row. The expression array2d[4] is an array, which we are then subscripting with [3] to access the fourth integer.
Higher-dimensional arrays can be declared in a similar manner.
A multidimensional array should not be confused with an array of pointers to arrays (also known as an Iliffe vector or sometimes an array of arrays). The former is always rectangular (all subarrays must be the same size), and occupies a contiguous region of memory. The latter is a one-dimensional array of pointers, each of which may point to the first element of a subarray in a different place in memory, and the sub-arrays do not have to be the same size. The latter can be created by multiple uses of malloc.
In C, string literals are surrounded by double quotes (") (e.g., "Hello world!") and are compiled to an array of the specified char values with an additional null terminating character (0-valued) code to mark the end of the string.
String literals may not contain embedded newlines; this proscription somewhat simplifies parsing of the language. To include a newline in a string, the backslash escape \n may be used, as below.
There are several standard library functions for operating with string data (not necessarily constant) organized as an array of char using this null-terminated format; see below.
C's string-literal syntax has been very influential, and has made its way into many other languages, such as C++, Objective-C, Perl, Python, PHP, Java, JavaScript, C#, and Ruby. Nowadays, almost all new languages adopt or build upon C-style string syntax. Languages that lack this syntax tend to precede C.
Because certain characters cannot be part of a literal string expression directly, they are instead identified by an escape sequence starting with a backslash (\). For example, the backslashes in "This string contains \"double quotes\"." indicate (to the compiler) that the inner pair of quotes are intended as an actual part of the string, rather than the default reading as a delimiter (endpoint) of the string itself.
Backslashes may be used to enter various control characters, etc., into a string:
The use of other backslash escapes is not defined by the C standard, although compiler vendors often provide additional escape codes as language extensions. One of these is the escape sequence \e for the escape character with ASCII hex value 1B, which was not added to the C standard because it lacks a representation in other character sets (such as EBCDIC). It is available in GCC, clang and tcc.
C has string literal concatenation, meaning that adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time:
printf(__FILE__ ": %d: Hello " "world\n", __LINE__);
will expand to
printf("helloworld.c" ": %d: Hello " "world\n", 10);
which is syntactically equivalent to
printf("helloworld.c: %d: Hello world\n", 10);
Individual character constants are single-quoted, e.g. 'A', and have type int (in C++, char). The difference is that "A" represents a null-terminated array of two characters, 'A' and '\0', whereas 'A' directly represents the character value (65 if ASCII is used). The same backslash-escapes are supported as for strings, except that (of course) " can validly be used as a character without being escaped, whereas ' must now be escaped.
A character constant cannot be empty (i.e. '' is invalid syntax), although a string may be (it still has the null terminating character). Multi-character constants (e.g. 'xy') are valid, although rarely useful; they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into an int is not specified (left to the implementation to define), portable use of multi-character constants is difficult.
Nevertheless, in situations limited to a specific platform and compiler implementation, multi-character constants do find use in specifying signatures. One common use case is the OSType, where the combination of Classic Mac OS compilers and the platform's inherent big-endianness means that bytes in the integer appear in the exact order of characters defined in the literal. The definitions used by popular implementations are in fact consistent: in GCC, Clang, and Visual C++, '1234' yields 0x31323334 under ASCII.[3][4]
Like string literals, character constants can also be modified by prefixes, for example L'A' has type wchar_t and represents the character value of "A" in the wide character encoding.
Since type char is 1 byte wide, a single char value typically can represent at most 256 distinct character codes, not nearly enough for all the characters in use worldwide. To provide better support for international characters, the first C standard (C89) introduced wide characters (encoded in type wchar_t) and wide character strings, which are written as L"Hello world!".
Wide characters are most commonly either 2 bytes (using a 2-byte encoding such as UTF-16) or 4 bytes (usually UTF-32), but Standard C does not specify the width for wchar_t, leaving the choice to the implementor. Microsoft Windows generally uses UTF-16, thus the above string would be 26 bytes long for a Microsoft compiler; the Unix world prefers UTF-32, thus compilers such as GCC would generate a 52-byte string. A 2-byte wide wchar_t suffers the same limitation as char, in that certain characters (those outside the BMP) cannot be represented in a single wchar_t, but must be represented using surrogate pairs.
The original C standard specified only minimal functions for operating with wide character strings; in 1995 the standard was modified to include much more extensive support, comparable to that for char strings. The relevant functions are mostly named after their char equivalents, with the addition of a "w" or the replacement of "str" with "wcs"; they are specified in <wchar.h>, with <wctype.h> containing wide-character classification and mapping functions.
The now generally recommended method[note 3] of supporting international characters is through UTF-8, which is stored in char arrays, and can be written directly in the source code if using a UTF-8 editor, because UTF-8 is a direct ASCII extension.
A common alternative to wchar_t is to use a variable-width encoding, whereby a logical character may extend over multiple positions of the string. Variable-width strings may be encoded into literals verbatim, at the risk of confusing the compiler, or using numerical backslash escapes (e.g. "\xc3\xa9" for "é" in UTF-8). The UTF-8 encoding was specifically designed (under Plan 9) for compatibility with the standard library string functions; supporting features of the encoding include a lack of embedded nulls, no valid interpretations for subsequences, and trivial resynchronisation. Encodings lacking these features are likely to prove incompatible with the standard library functions; encoding-aware string functions are often used in such cases.
Strings, both constant and variable, can be manipulated without using the standard library. However, the library contains many useful functions for working with null-terminated strings.
Structures and unions in C are defined as data containers consisting of a sequence of named members of various types. They are similar to records in other programming languages. The members of a structure are stored in consecutive locations in memory, although the compiler is allowed to insert padding between or after members (but not before the first member) for efficiency or as padding required for proper alignment by the target architecture. The size of a structure is equal to the sum of the sizes of its members, plus the size of the padding.
Unions in C are related to structures and are defined as objects that may hold (at different times) objects of different types and sizes. They are analogous to variant records in other programming languages. Unlike structures, the components of a union all refer to the same location in memory. In this way, a union can be used at various times to hold different types of objects, without the need to create a separate object for each new type. The size of a union is at least the size of its largest component type; the compiler may add trailing padding for alignment.
Structures are declared with the struct keyword and unions are declared with the union keyword. The specifier keyword is followed by an optional identifier name, which is used to identify the form of the structure or union. The identifier is followed by the declaration of the structure or union's body: a list of member declarations, contained within curly braces, with each declaration terminated by a semicolon. Finally, the declaration concludes with an optional list of identifier names, which are declared as instances of the structure or union.
For example, the following statement declares a structure named s that contains three members; it will also declare an instance of the structure known as tee:
struct s { int x; float y; char *z;} tee;
And the following statement will declare a similar union named u and an instance of it named n:
union u { int x; float y; char *z;} n;
Members of structures and unions cannot have an incomplete or function type. Thus members cannot be an instance of the structure or union being declared (because it is incomplete at that point) but can be pointers to the type being declared.
Once a structure or union body has been declared and given a name, it can be considered a new data type using the specifier struct or union, as appropriate, and the name. For example, the following statement, given the above structure declaration, declares a new instance of the structure s named r:
struct s r;
It is also common to use the typedef specifier to eliminate the need for the struct or union keyword in later references to the structure. The first identifier after the body of the structure is taken as the new name for the structure type (structure instances may not be declared in this context). For example, the following statement will declare a new type known as s_type that will contain some structure:
typedef struct {...} s_type;
Future statements can then use the specifier s_type (instead of the expanded struct
... specifier) to refer to the structure.
Members are accessed using the name of the instance of a structure or union, a period (.
), and the name of the member. For example, given the declaration of tee from above, the member known as y (of type float
) can be accessed using the following syntax:
tee.y
Structures are commonly accessed through pointers. Consider the following example that defines a pointer to tee, known as ptr_to_tee:
struct s *ptr_to_tee = &tee;
Member y of tee can then be accessed by dereferencing ptr_to_tee and using the result as the left operand:
(*ptr_to_tee).y
This is identical to the simpler tee.y
above as long as ptr_to_tee points to tee. Due to operator precedence ("." being higher than "*"), the shorter *ptr_to_tee.y
is incorrect for this purpose, instead being parsed as *(ptr_to_tee.y)
and thus the parentheses are necessary. Because this operation is common, C provides an abbreviated syntax for accessing a member directly from a pointer. With this syntax, the name of the instance is replaced with the name of the pointer and the period is replaced with the character sequence ->
. Thus, the following method of accessing y is identical to the previous two:
ptr_to_tee->y
Members of unions are accessed in the same way.
This can be chained; for example, in a linked list, one may refer to n->next->next
for the second following node (assuming that n->next
is not null).
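A minimal sketch of such a chained access, assuming a hypothetical singly linked node type. The structure contains a pointer to its own type, which is permitted even though the structure cannot contain an instance of itself. The helper value_two_ahead is an invented name.

```c
#include <stddef.h>

/* A node may contain a pointer to its own type, but not an instance of it. */
struct node {
    int value;
    struct node *next;
};

/* Returns the value stored two nodes after n, or fallback if the list
   is too short for n->next->next to be a valid access. */
int value_two_ahead(const struct node *n, int fallback)
{
    if (n == NULL || n->next == NULL || n->next->next == NULL)
        return fallback;
    return n->next->next->value;
}
```

Each link is checked before it is followed, because dereferencing a null pointer has undefined behavior.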
Assigning values to individual members of structures and unions is syntactically identical to assigning values to any other object. The only difference is that the lvalue of the assignment is the name of the member, as accessed by the syntax mentioned above.
A structure can also be assigned as a unit to another structure of the same type. Structures (and pointers to structures) may also be used as function parameter and return types.
For example, the following statement assigns the value of 74 (the ASCII code point for the letter 'J') to the member named x in the structure tee, from above:
tee.x = 74;
And the same assignment, using ptr_to_tee in place of tee, would look like:
ptr_to_tee->x = 74;
Assignment with members of unions is identical.
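The following sketch illustrates assigning a structure as a unit and using structures as parameter and return types. The type point and the functions shifted and shift_in_place are hypothetical names chosen for this example.

```c
struct point {
    int x;
    int y;
};

/* Takes a structure by value and returns a modified copy;
   the caller's structure is left untouched. */
struct point shifted(struct point p, int dx, int dy)
{
    p.x += dx;   /* modifies the local copy only */
    p.y += dy;
    return p;
}

/* Modifies the caller's structure through a pointer. */
void shift_in_place(struct point *p, int dx, int dy)
{
    p->x += dx;
    p->y += dy;
}
```

A caller could write b = a; to copy every member at once, then b = shifted(b, 10, 20); to replace b with the returned structure.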
According to the C standard, the only legal operations that can be performed on a structure are copying it, assigning to it as a unit (or initializing it), taking its address with the address-of (&
) unary operator, and accessing its members. Unions have the same restrictions. One of the operations implicitly forbidden is comparison: structures and unions cannot be compared using C's standard comparison facilities (==
, >
, <
, etc.).
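Because == cannot be applied to structures, equality is usually tested member by member. The sketch below assumes a hypothetical employee type. A byte-wise comparison of the whole object is avoided because padding bytes between members may hold unspecified values.

```c
#include <string.h>

struct employee {
    int  id;
    char name[16];
};

/* Returns 1 if both structures hold the same member values, 0 otherwise. */
int employee_equal(const struct employee *a, const struct employee *b)
{
    return a->id == b->id && strcmp(a->name, b->name) == 0;
}
```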
C also provides a special type of member known as a bit field, which is an integer with an explicitly specified number of bits. A bit field is declared as a structure (or union) member of type int
, signed int
, unsigned int
, or _Bool
,[note 4] following the member name by a colon (:
) and the number of bits it should occupy. The total number of bits in a single bit field must not exceed the total number of bits in its declared type (this is allowed in C++ however, where the extra bits are used for padding).
As a special exception to the usual C syntax rules, it is implementation-defined whether a bit field declared as type int
, without specifying signed
or unsigned
, is signed or unsigned. Thus, it is recommended to explicitly specify signed
or unsigned
on all bit-field members for portability.
Unnamed fields consisting of just a colon followed by a number of bits are also allowed; these indicate padding. Specifying a width of zero for an unnamed field is used to force alignment to a new word.[5] Since all members of a union occupy the same memory, unnamed bit-fields of width zero do nothing in unions; however, unnamed bit-fields of nonzero width can change the size of the union, since they have to fit in it.
The members of bit fields do not have addresses, and as such cannot be used with the address-of (&
) unary operator. The sizeof
operator may not be applied to bit fields.
The following declaration declares a new structure type known as f
and an instance of it known as g
. Comments provide a description of each of the members:
struct f {
    unsigned int flag : 1;  /* a bit flag: can either be on (1) or off (0) */
    signed int   num  : 4;  /* a signed 4-bit field; range -7...7 or -8...7 */
    signed int        : 3;  /* 3 bits of padding to round out to 8 bits */
} g;
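As a small sketch of how the declared width limits what a bit field can hold, the following hypothetical structure stores a value in a 3-bit unsigned field. For unsigned bit fields, a value that does not fit is reduced modulo 2 to the power of the width. The names flags and store_level are assumptions made for this example.

```c
struct flags {
    unsigned int ready : 1;  /* holds 0 or 1 */
    unsigned int level : 3;  /* holds 0 through 7 */
};

/* Stores value in the 3-bit field and returns what the field actually holds. */
unsigned int store_level(unsigned int value)
{
    struct flags f = { 0, 0 };

    f.level = value;   /* values above 7 are reduced modulo 8 */
    return f.level;
}
```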
Default initialization depends on the storage class specifier, described above.
Because of the language's grammar, a scalar initializer may be enclosed in any number of curly brace pairs. Most compilers issue a warning if there is more than one such pair, though.
int x = 12;
int y = { 23 };     // Legal, no warning
int z = { { 34 } }; // Legal, expect a warning
Structures, unions and arrays can be initialized in their declarations using an initializer list. Unless designators are used, the components of an initializer correspond with the elements in the order they are defined and stored, thus all preceding values must be provided before any particular element's value. Any unspecified elements are set to zero (except for unions). Mentioning too many initialization values yields an error.
The following statement will initialize a new instance of the structure s known as pi:
struct s {
    int   x;
    float y;
    char  *z;
};

struct s pi = { 3, 3.1415, "Pi" };
Designated initializers allow members to be initialized by name, in any order, and without explicitly providing the preceding values. The following initialization is equivalent to the previous one:
struct s pi = { .z = "Pi", .x = 3, .y = 3.1415 };
Using a designator in an initializer moves the initialization "cursor". In the example below, if MAX
is greater than 10, there will be some zero-valued elements in the middle of a
; if it is less than 10, some of the values provided by the first five initializers will be overridden by the second five (if MAX
is less than 5, there will be a compilation error):
int a[MAX] = { 1, 3, 5, 7, 9, [MAX-5] = 8, 6, 4, 2, 0 };
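To make the first case concrete, the sketch below fixes MAX at 12, a value chosen only for this illustration. The designator [MAX-5] then moves the cursor to index 7, so indices 5 and 6 receive no initializer and are set to zero. The accessor element is a hypothetical helper.

```c
#define MAX 12   /* assumed value, chosen so that a gap appears in the middle */

/* Resulting contents: 1 3 5 7 9 0 0 8 6 4 2 0 */
static const int a[MAX] = { 1, 3, 5, 7, 9, [MAX-5] = 8, 6, 4, 2, 0 };

/* Returns element i of the example array (i must be in 0..MAX-1). */
int element(int i)
{
    return a[i];
}
```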
In C89, a union was initialized with a single value applied to its first member. That is, the union u defined above could only have its int x member initialized:
union u value = { 3 };
Using a designated initializer, the member to be initialized does not have to be the first member:
union u value = { .y = 3.1415 };
If an array has unknown size (i.e. the array was an incomplete type), the number of initializers determines the size of the array and its type becomes complete:
int x[] = { 0, 1, 2 } ;
Compound designators can be used to provide explicit initialization when unadorned initializer lists
might be misunderstood. In the example below, w
is declared as an array of structures, each structure consisting of a member a
(an array of 3 int
) and a member b
(an int
). The initializer sets the size of w
to 2 and sets the values of the first element of each a
:
struct { int a[3], b; } w[] = { [0].a = {1}, [1].a[0] = 2 };
This is equivalent to:
struct { int a[3], b; } w[] =
{
    { { 1, 0, 0 }, 0 },
    { { 2, 0, 0 }, 0 }
};
There is no way to specify repetition of an initializer in standard C.
It is possible to borrow the initialization methodology to generate compound structure and array literals:
// pointer created from array literal.
int *ptr = (int[]){ 10, 20, 30, 40 };

// pointer to array.
float (*foo)[3] = &(float[]){ 0.5f, 1.f, -0.5f };

struct s pi = (struct s){ 3, 3.1415, "Pi" };
Compound literals are often combined with designated initializers to make the declaration more readable:[1]
pi = (struct s){ .z = "Pi", .x = 3, .y = 3.1415 };
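Compound literals can also be passed directly as function arguments, which avoids naming a temporary variable. The following is a minimal sketch; the helper functions sum, x_of and their wrappers are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

struct s { int x; float y; char *z; };

/* Adds up count elements starting at values. */
int sum(const int *values, size_t count)
{
    int total = 0;

    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

/* Passes an unnamed array, created by a compound literal, to sum(). */
int sum_of_literal(void)
{
    return sum((int[]){ 10, 20, 30, 40 }, 4);
}

/* Returns the x member of a structure passed by value. */
int x_of(struct s value)
{
    return value.x;
}

/* Passes an unnamed structure built with designated initializers. */
int x_of_literal(void)
{
    return x_of((struct s){ .z = "Pi", .x = 3, .y = 3.1415f });
}
```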
C is a free-form language.
Bracing style varies from programmer to programmer and can be the subject of debate. See Indentation style for more details.
In the items in this section, any <statement> can be replaced with a compound statement. Compound statements have the form:
{
    <optional-declaration-list>
    <optional-statement-list>
}
and are used as the body of a function or anywhere that a single statement is expected. The declaration-list declares variables to be used in that scope, and the statement-list contains the actions to be performed. Braces define their own scope, and automatic variables declared inside them cease to exist at the closing brace. Since C99, declarations and statements can be freely intermixed within a compound statement (as in C++).
C has two types of selection statements: the if
statement and the switch
statement.
The if
statement is in the form:
if (<expression>)
    <statement1>
else
    <statement2>
In the if
statement, if the <expression>
in parentheses is nonzero (true), control passes to <statement1>
. If the else
clause is present and the <expression>
is zero (false), control will pass to <statement2>
. The else <statement2>
part is optional and, if absent, a false <expression>
will simply result in skipping over the <statement1>
. An else
always matches the nearest previous unmatched if
; braces may be used to override this when necessary, or for clarity.
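The pairing rule can be sketched with two hypothetical functions. The first spells out, with braces, how a brace-less "if (a) if (b) ...; else ...;" is parsed: the else belongs to the inner if. The second uses braces to attach the else to the outer if instead.

```c
/* Equivalent to the brace-less form: the else pairs with if (b). */
int else_binds_to_inner(int a, int b)
{
    int result = 0;

    if (a) {
        if (b)
            result = 1;
        else
            result = 2;   /* reached when a is nonzero and b is zero */
    }
    return result;
}

/* Braces override the default pairing: the else pairs with if (a). */
int else_binds_to_outer(int a, int b)
{
    int result = 0;

    if (a) {
        if (b)
            result = 1;
    } else {
        result = 2;       /* reached when a is zero */
    }
    return result;
}
```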
The switch
statement causes control to be transferred to one of several statements depending on the value of an expression, which must have integral type. The substatement controlled by a switch is typically compound. Any statement within the substatement may be labeled with one or more case
labels, which consist of the keyword case
followed by a constant expression and then a colon (:). The syntax is as follows:
switch (<expression>)
{
    case <label1> :
        <statements 1>
    case <label2> :
        <statements 2>
        break;
    default :
        <statements 3>
}
No two of the case constants associated with the same switch may have the same value. There may be at most one default
label associated with a switch. If none of the case labels are equal to the expression in the parentheses following switch
, control passes to the default
label or, if there is no default
label, execution resumes just beyond the entire construct.
Switches may be nested; a case
or default
label is associated with the innermost switch
that contains it. Switch statements can "fall through", that is, when one case section has completed its execution, statements will continue to be executed downward until a break;
statement is encountered. Fall-through is useful in some circumstances, but is usually not desired.
In the preceding example, if <label2>
is reached, the statements <statements 2>
are executed and nothing more inside the braces. However, if <label1>
is reached, both <statements 1>
and <statements 2>
are executed since there is no break
to separate the two case statements.
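The same control flow can be traced with a concrete, hypothetical function whose cases mirror the layout of the preceding syntax example. Each group of statements adds a distinct amount, so the return value shows which groups were executed.

```c
/* Returns a sum that reveals which statement groups ran for a given label. */
int trace(int label)
{
    int executed = 0;

    switch (label) {
    case 1:
        executed += 1;     /* <statements 1>: no break follows */
        /* fall through */
    case 2:
        executed += 10;    /* <statements 2> */
        break;
    default:
        executed += 100;   /* <statements 3> */
    }
    return executed;
}
```

With this definition, trace(1) yields 11 because both the first and second groups run, while trace(2) yields 10 and any other argument yields 100.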
It is possible, although unusual, to insert the switch
labels into the sub-blocks of other control structures. Examples of this include Duff's device and Simon Tatham's implementation of coroutines in PuTTY.[6]
C has three forms of iteration statement:
do
    <statement>
while ( <expression> ) ;

while ( <expression> )
    <statement>

for ( <expression> ; <expression> ; <expression> )
    <statement>
In the while
and do
statements, the sub-statement is executed repeatedly so long as the value of the expression
remains non-zero (equivalent to true). With while
, the test, including all side effects from <expression>
, occurs before each iteration (execution of <statement>
); with do
, the test occurs after each iteration. Thus, a do
statement always executes its sub-statement at least once, whereas while
may not execute the sub-statement at all.
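The difference shows up when the condition is false from the start. In the sketch below, two hypothetical functions count how many times the loop body runs for the same starting value.

```c
/* Counts body executions of a while loop that runs while n > 0. */
int while_iterations(int n)
{
    int count = 0;

    while (n > 0) {     /* tested before the first iteration */
        count++;
        n--;
    }
    return count;
}

/* Counts body executions of a do loop with the same condition. */
int do_iterations(int n)
{
    int count = 0;

    do {
        count++;
        n--;
    } while (n > 0);    /* tested after each iteration */
    return count;
}
```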
The statement:
for (e1; e2; e3) s;
is equivalent to:
e1;
while (e2)
{
    s;
cont:
    e3;
}
except for the behaviour of a continue;
statement (which in the for
loop jumps to e3
instead of e2
). If e2
is blank, it would have to be replaced with a 1
.
Any of the three expressions in the for
loop may be omitted. A missing second expression makes the while
test always non-zero, creating a potentially infinite loop.
Since C99, the first expression may take the form of a declaration, typically including an initializer, such as:
for (int i = 0; i < limit; ++i) {
    // ...
}
The declaration's scope is limited to the extent of the for
loop.
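The relationship between for and while, including the special handling of continue, can be sketched with two hypothetical functions that add up the odd numbers below a limit. In the while version, a goto to a label placed before the increment stands in for continue.

```c
/* Sums the odd numbers in [0, limit) using a for loop. */
int sum_odd_for(int limit)
{
    int sum = 0;

    for (int i = 0; i < limit; ++i) {
        if (i % 2 == 0)
            continue;     /* jumps to ++i, then re-tests i < limit */
        sum += i;
    }
    return sum;
}

/* The same computation written as the equivalent while loop. */
int sum_odd_while(int limit)
{
    int sum = 0;
    int i = 0;            /* e1 */

    while (i < limit) {   /* e2 */
        if (i % 2 == 0)
            goto cont;    /* a plain continue would skip ++i and never end */
        sum += i;
cont:
        ++i;              /* e3 */
    }
    return sum;
}
```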
Jump statements transfer control unconditionally. There are four types of jump statements in C: goto, continue
, break
, and return
.
The goto
statement looks like this:
goto <identifier> ;
The identifier must be a label (followed by a colon) located in the current function. Control transfers to the labeled statement.
A continue
statement may appear only within an iteration statement and causes control to pass to the loop-continuation portion of the innermost enclosing iteration statement. That is, within each of the statements
while (expression)
{
    /* ... */
    cont: ;
}

do
{
    /* ... */
    cont: ;
} while (expression);

for (expr1; expr2; expr3) {
    /* ... */
    cont: ;
}
a continue
not contained within a nested iteration statement is the same as goto cont
.
The break
statement is used to end a for
loop, while
loop, do
loop, or switch
statement. Control passes to the statement following the terminated statement.
A function returns to its caller by the return
statement. When return
is followed by an expression, the value is returned to the caller as the value of the function. Encountering the end of the function is equivalent to a return
with no expression. In that case, if the function is declared as returning a value and the caller tries to use the returned value, the result is undefined.
GCC extends the C language with a unary &&
operator that returns the address of a label. This address can be stored in a void*
variable and may be used later in a goto
instruction. For example, the following prints "hi "
in an infinite loop:
    void *ptr = &&J1;

J1: printf("hi ");
    goto *ptr;
This feature can be used to implement a jump table.
A C function definition consists of a return type (void
if no value is returned), a unique name, a list of parameters in parentheses, and various statements:
<return-type> functionName( <parameter-list> )
{
    <statements>
    return <expression of type return-type>;
}
A function with non-void
return type should include at least one return
statement. The parameters are given by the <parameter-list>
, a comma-separated list of parameter declarations, each item in the list being a data type followed by an identifier: <data-type> <variable-identifier>, <data-type> <variable-identifier>, ...
.
The return type cannot be an array type or function type.
int f()[3];    // Error: function returning an array
int (*g())[3]; // OK: function returning a pointer to an array.

void h()();    // Error: function returning a function
void (*k())(); // OK: function returning a function pointer
If there are no parameters, the <parameter-list>
may be left empty or optionally be specified with the single word void
.
It is possible to define a function as taking a variable number of parameters by providing the ...
ellipsis as the last parameter instead of a data type and variable identifier. A commonly used function that does this is the standard library function printf
, which has the declaration:
int printf (const char*, ...);
Manipulation of these parameters can be done by using the routines in the standard library header <stdarg.h>
.
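A minimal sketch of a variadic function written with the <stdarg.h> macros follows. The function sum_ints is a hypothetical example. It relies on the caller passing exactly count further arguments of type int, since the language provides no way to check this at run time.

```c
#include <stdarg.h>

/* Hypothetical helper: adds up the count int arguments that follow count. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);            /* start after the last named parameter */
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);   /* fetch the next argument as an int */
    va_end(args);                     /* required before returning */

    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) passes the count first, followed by the values to be added.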
A pointer to a function can be declared as follows:
<return-type> (*<function-name>)(<parameter-list>);
The following program shows use of a function pointer for selecting between addition and subtraction:
#include <stdio.h>

int (*operation)(int x, int y);

int add(int x, int y)
{
    return x + y;
}

int subtract(int x, int y)
{
    return x - y;
}

int main(int argc, char* args[])
{
    int foo = 1, bar = 1;

    operation = add;
    printf("%d + %d = %d\n", foo, bar, operation(foo, bar));
    operation = subtract;
    printf("%d - %d = %d\n", foo, bar, operation(foo, bar));
    return 0;
}
After preprocessing, at the highest level a C program consists of a sequence of declarations at file scope. These may be partitioned into several separate source files, which may be compiled separately; the resulting object modules are then linked along with implementation-provided run-time support modules to produce an executable image.
The declarations introduce functions, variables and types. C functions are akin to the subroutines of Fortran or the procedures of Pascal.
A definition is a special type of declaration. A variable definition sets aside storage and possibly initializes it, a function definition provides its body.
An implementation of C providing all of the standard library functions is called a hosted implementation. Programs written for hosted implementations are required to define a special function called main, which is the first function called when a program begins executing.
Hosted implementations start program execution by invoking the main
function, which must be defined following one of these prototypes (using different parameter names or spelling the types differently is allowed):
int main() {...}
int main(void) {...}
int main(int argc, char *argv[]) {...}
int main(int argc, char **argv) {...} // char *argv[] and char **argv have the same type as function parameters
The first two definitions are equivalent (and both are compatible with C++). It is probably up to individual preference which one is used (the current C standard contains two examples of main()
and two of main(void)
, but the draft C++ standard uses main()
). The return value of main
(which should be int
) serves as termination status returned to the host environment.
The C standard defines return values 0
and EXIT_SUCCESS
as indicating success and EXIT_FAILURE
as indicating failure. (EXIT_SUCCESS
and EXIT_FAILURE
are defined in <stdlib.h>
). Other return values have implementation-defined meanings; for example, common Linux shells report a program killed by a signal with an exit status of 128 plus the numerical value of the signal.
A minimal correct C program consists of an empty main
routine, taking no arguments and doing nothing:
int main(void){}
Because no return
statement is present, main
returns 0 on exit.[1] (This is a special-case feature introduced in C99 that applies only to main
.)
The main
function will usually call other functions to help it perform its job.
Some implementations are not hosted, usually because they are not intended to be used with an operating system. Such implementations are called free-standing in the C standard. A free-standing implementation is free to specify how it handles program startup; in particular it need not require a program to define a main
function.
Functions may be written by the programmer or provided by existing libraries. Interfaces for the latter are usually declared by including header files—with the #include
preprocessing directive—and the library objects are linked into the final executable image. Certain library functions, such as printf
, are defined by the C standard; these are referred to as the standard library functions.
A function may return a value to caller (usually another C function, or the hosting environment for the function main
). The printf
function mentioned above returns how many characters were printed, but this value is often ignored.
In C, arguments are passed to functions by value while other languages may pass variables by reference. This means that the receiving function gets copies of the values and has no direct way of altering the original variables. For a function to alter a variable passed from another function, the caller must pass its address (a pointer to it), which can then be dereferenced in the receiving function. See Pointers for more information.
void incInt(int *y)
{
    (*y)++;  // Increase the value of 'x', in 'main' below, by one
}

int main(void)
{
    int x = 0;
    incInt(&x);  // pass a reference to the var 'x'
    return 0;
}
The function scanf works the same way:
int x;
scanf("%d", &x);
In order to pass an editable pointer to a function (such as for the purpose of returning an allocated array to the calling code) you have to pass a pointer to that pointer: its address.
#include <stdio.h>
#include <stdlib.h>

void allocate_array(int ** const a_p, const int A) {
    /* allocate array of A ints;
       assigning to *a_p alters the 'a' in main() */
    *a_p = malloc(sizeof(int) * A);
}

int main(void) {
    int * a; /* create a pointer to one or more ints, this will be the array */

    /* pass the address of 'a' */
    allocate_array(&a, 42);

    /* 'a' is now an array of length 42 and can be manipulated and freed here */

    free(a);
    return 0;
}
The parameter int **a_p
is a pointer to a pointer to an int
, which is the address of the pointer a
defined in the main function in this case.
Function parameters of array type may at first glance appear to be an exception to C's pass-by-value rule. The following program will print 2, not 1:
#include <stdio.h>

void setArray(int array[], int index, int value)
{
    array[index] = value;
}

int main(void)
{
    int a[1] = {1};
    setArray(a, 0, 2);
    printf ("a[0]=%d\n", a[0]);
    return 0;
}
However, there is a different reason for this behavior. In fact, a function parameter declared with an array type is treated like one declared to be a pointer. That is, the preceding declaration of setArray
is equivalent to the following:
void setArray(int *array, int index, int value)
At the same time, C rules for the use of arrays in expressions cause the value of a
in the call to setArray
to be converted to a pointer to the first element of array a
. Thus, in fact this is still an example of pass-by-value, with the caveat that it is the address of the first element of the array being passed by value, not the contents of the array.
Since C99, the programmer can specify that a function takes an array of a certain size by using the keyword static
. In void setArray(int array[static 4], int index, int value)
the first parameter must be a pointer to the first element of an array of length at least 4. It is also possible to add qualifiers (const
, volatile
and restrict
) to the pointer type that the array is converted to by putting them between the brackets.
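As a sketch of this syntax, the hypothetical function below declares that its argument must point to at least four elements and additionally qualifies the pointed-to elements as const. A compiler may use the static size for diagnostics or optimization, but it is not required to check it.

```c
/* The caller promises that values points to at least 4 readable ints. */
int sum_first_four(const int values[static 4])
{
    int total = 0;

    for (int i = 0; i < 4; ++i)
        total += values[i];
    return total;
}
```

Passing a longer array is allowed; only the first four elements are read here.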
The following words are reserved, and may not be used as identifiers:
Implementations may reserve other keywords, such as asm
, although implementations typically provide non-standard keywords that begin with one or two underscores.
C identifiers are case sensitive (e.g., foo
, FOO
, and Foo
are the names of different objects). Some linkers may map external identifiers to a single case, although this is uncommon in most modern linkers.
Text starting with the token /*
is treated as a comment and ignored. The comment ends at the next */
; it can occur within expressions, and can span multiple lines. Accidental omission of the comment terminator is problematic in that the next comment's properly constructed comment terminator will be used to terminate the initial comment, and all code in between the comments will be considered as a comment. C-style comments do not nest; that is, accidentally placing a comment within a comment has unintended results:
/*
This line will be ignored.
/*
A compiler warning may be produced here. These lines will also be ignored.
The comment opening token above did not start a new comment,
and the comment closing token below will close the comment begun on line 1.
*/
This line and the line below it will not be ignored. Both will likely produce compile errors.
*/
C++ style line comments start with //
and extend to the end of the line. This style of comment originated in BCPL and became valid C syntax in C99; it is not available in the original K&R C nor in ANSI C:
// this line will be ignored by the compiler

/* these lines
   will be ignored
   by the compiler */

x = *p/*q;  /* this comment starts after the 'p' */
The parameters given on a command line are passed to a C program through two parameters of main: the count of the command-line arguments in argc
and the individual arguments as character strings in the pointer array argv
. So the command:
myFilt p1 p2 p3
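results in argc being 4 and argv pointing to the strings "myFilt", "p1", "p2" and "p3", followed by a null pointer in argv[4].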
While individual strings are arrays of contiguous characters, there is no guarantee that the strings are stored as a contiguous group.
The name of the program, argv[0]
, may be useful when printing diagnostic messages or for making one binary serve multiple purposes. The individual values of the parameters may be accessed with argv[1]
, argv[2]
, and argv[3]
, as shown in the following program:
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("argc\t= %d\n", argc);
    for (int i = 0; i < argc; i++)
        printf("argv[%i]\t= %s\n", i, argv[i]);
}
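Argument handling is often moved into a helper that receives argc and argv from main. The sketch below assumes a hypothetical convention in which arguments beginning with '-' are options; the function count_options is invented for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: counts the arguments after argv[0] that start with '-'. */
int count_options(int argc, char *argv[])
{
    int count = 0;

    for (int i = 1; i < argc; i++) {   /* argv[0] is the program name */
        if (argv[i] != NULL && argv[i][0] == '-')
            count++;
    }
    return count;
}
```

A main function would typically call it as count_options(argc, argv).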
In any reasonably complex expression, there arises a choice as to the order in which to evaluate the parts of the expression: (1+1)+(3+3)
may be evaluated in the order (1+1)+(3+3)
, (2)+(3+3)
, (2)+(6)
, (8)
, or in the order (1+1)+(3+3)
, (1+1)+(6)
, (2)+(6)
, (8)
. Formally, a conforming C compiler may evaluate expressions in any order between sequence points (this allows the compiler to do some optimization). Sequence points are defined by:
The short-circuit operators: logical and (&&, which can be read and then) and logical or (||, which can be read or else).
The ternary operator (?:): This operator evaluates its first sub-expression first, and then its second or third (never both of them) based on the value of the first.
Expressions before a sequence point are always evaluated before those after a sequence point. In the case of short-circuit evaluation, the second expression may not be evaluated depending on the result of the first expression. For example, in the expression (a() || b())
, if the first argument evaluates to nonzero (true), the result of the entire expression cannot be anything else than true, so b()
is not evaluated. Similarly, in the expression (a() && b())
, if the first argument evaluates to zero (false), the result of the entire expression cannot be anything else than false, so b()
is not evaluated.
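Short-circuit evaluation can be observed by giving b() a visible side effect. In the sketch below, a counter records how often b() is called; the counter and the helper functions are hypothetical names used only for this demonstration.

```c
int b_calls = 0;   /* counts evaluations of b() */

/* Returns its argument, standing in for an arbitrary first operand. */
int a(int value)
{
    return value;
}

/* Records that it was called and returns true. */
int b(void)
{
    b_calls++;
    return 1;
}

/* Returns how many times b() ran while evaluating a(x) || b(). */
int or_evaluations(int x)
{
    b_calls = 0;
    (void)(a(x) || b());
    return b_calls;
}

/* Returns how many times b() ran while evaluating a(x) && b(). */
int and_evaluations(int x)
{
    b_calls = 0;
    (void)(a(x) && b());
    return b_calls;
}
```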
The arguments to a function call may be evaluated in any order, as long as they are all evaluated by the time the function is entered. The following expression, for example, has undefined behavior:
printf("%s %s\n", argv[i = 0], argv[++i]);
An aspect of the C standard (not unique to C) is that the behavior of certain code is said to be "undefined". In practice, this means that the program produced from this code can do anything, from working as the programmer intended, to crashing every time it is run.
For example, the following code produces undefined behavior, because the variable b is modified more than once with no intervening sequence point:
#include <stdio.h>

int main(void)
{
    int b = 1;
    int a = b++ + b++;
    printf("%d\n", a);
}
Because there is no sequence point between the modifications of b in "b++ + b++", it is possible to perform the evaluation steps in more than one order, resulting in an ambiguous statement. This can be fixed by rewriting the code to insert a sequence point in order to enforce an unambiguous behavior, for example:
a = b++;
a += b++;
The long long
modifier was introduced in the C99 standard.